* [PATCH] doc: add a explanation of Git's data model
@ 2025-10-03 17:34 Julia Evans via GitGitGadget
2025-10-03 21:46 ` Kristoffer Haugsbakk
` (4 more replies)
0 siblings, 5 replies; 89+ messages in thread
From: Julia Evans via GitGitGadget @ 2025-10-03 17:34 UTC (permalink / raw)
To: git; +Cc: Julia Evans, Julia Evans
From: Julia Evans <julia@jvns.ca>
Git very often uses the terms "object", "reference", or "index" in its
documentation.
However, it's hard to find a clear explanation of these terms and how
they relate to each other in the documentation. The closest candidates
currently are:
1. `gitglossary`. This makes a good effort, but it's an alphabetically
ordered dictionary and a dictionary is not a good way to learn
concepts. You have to jump around too much and it's not possible to
present the concepts in the order that they should be explained.
2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
This is a nice document to have, but it's not necessary to learn how
`update-index` works to understand Git's data model, and we should
not be requiring users to learn how to use the "plumbing" commands
if they want to learn what the term "index" or "object" means.
3. `gitrepository-layout`. This is a great resource, but it includes a
lot of information about configuration and internal implementation
details which are not related to the data model. It also does
not explain how commits work.
The result of this is that Git users (even users who have been using
Git for 15+ years) struggle to read the documentation because they don't
know what the core terms mean, and it's not possible to add links
to help them learn more.
Add an explanation of Git's data model. Some choices I've made in
deciding what "core data model" means:
1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
if those are intended to be user facing or if they're more like
internal implementation details.
2. Don't talk about submodules other than by mentioning how they
relate to trees. This is because Git has a lot of special features,
and explaining how they all work exhaustively could quickly go
down a rabbit hole which would make this document less useful for
understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
(first line, trailers, GPG signatures, etc).
Perhaps this should change.
Some other choices I've made:
1. Mention packed refs only in a note.
2. Don't mention that the full name of the branch `main` is
technically `refs/heads/main`. This should likely change but I
haven't worked out how to do it in a clear way yet.
3. Mostly avoid referring to the `.git` directory, because the exact
details of how things are stored change over time.
This should perhaps change from "mostly" to "entirely"
but I haven't worked out how to do that in a clear way yet.
Signed-off-by: Julia Evans <julia@jvns.ca>
---
doc: Add a explanation of Git's data model
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1981/jvns/gitdatamodel-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1981
Documentation/Makefile | 1 +
Documentation/gitdatamodel.adoc | 226 ++++++++++++++++++++++++++++++++
2 files changed, 227 insertions(+)
create mode 100644 Documentation/gitdatamodel.adoc
diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fb83d0c6e..5f4acfacbd 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc
MAN7_TXT += gitcore-tutorial.adoc
MAN7_TXT += gitcredentials.adoc
MAN7_TXT += gitcvs-migration.adoc
+MAN7_TXT += gitdatamodel.adoc
MAN7_TXT += gitdiffcore.adoc
MAN7_TXT += giteveryday.adoc
MAN7_TXT += gitfaq.adoc
diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
new file mode 100644
index 0000000000..4b2cb167dc
--- /dev/null
+++ b/Documentation/gitdatamodel.adoc
@@ -0,0 +1,226 @@
+gitdatamodel(7)
+===============
+
+NAME
+----
+gitdatamodel - Git's core data model
+
+DESCRIPTION
+-----------
+
+It's not necessary to understand Git's data model to use Git, but it's
+very helpful when reading Git's documentation so that you know what it
+means when the documentation says "object", "reference", or "index".
+
+Git's core operations use 4 kinds of data:
+
+1. <<objects,Objects>>: commits, trees, blobs, and tag objects
+2. <<references,References>>: branches, tags,
+ remote-tracking branches, etc
+3. <<index,The index>>, also known as the staging area
+4. <<reflogs,Reflogs>>
+
+[[objects]]
+OBJECTS
+-------
+
+Commits, trees, blobs, and tag objects are all stored in Git's object database.
+Every object has:
+
+1. an *ID*, which is the SHA-1 hash of its contents.
+ It's fast to look up a Git object using its ID.
+ The ID is usually represented in hexadecimal, like
+ `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
+2. a *type*. There are 4 types of objects:
+ <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
+ and <<tag-object,tag objects>>.
+3. *contents*. The structure of the contents depends on the type.
+
+Once an object is created, it can never be changed.
+Here are the 4 types of objects:
+
+[[commit]]
+commits::
+ A commit contains:
++
+1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
+ regular commits have 1 parent, merge commits have 2+ parents
+2. A *commit message*
+3. All the *files* in the commit, stored as a *<<tree,tree>>*
+4. An *author* and the time the commit was authored
+5. A *committer* and the time the commit was committed
++
+Here's how an example commit is stored:
++
+----
+tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
+parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
+author Maya <maya@example.com> 1759173425 -0400
+committer Maya <maya@example.com> 1759173425 -0400
+
+Add README
+----
++
+Like all other objects, commits can never be changed after they're created.
+For example, "amending" a commit with `git commit --amend` creates a new commit.
+The old commit will eventually be deleted by `git gc`.
+
+[[tree]]
+trees::
+ A tree is how Git represents a directory. It lists, for each item in
+ the tree:
++
+1. The *permissions*, for example `100644`
+2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
+ or <<commit,`commit`>> (a Git submodule)
+3. The *object ID*
+4. The *filename*
++
+For example, this is how a tree containing one directory (`src`) and one file
+(`README.md`) is stored:
++
+----
+100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
+040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
+----
++
+*NOTE:* The permissions are in the same format as UNIX permissions, but
+the only allowed permissions for files (blobs) are 644 and 755.
+
+[[blob]]
+blobs::
+ A blob is how Git represents a file. A blob object contains the
+ file's contents.
++
+Storing a new blob for every new version of a file can get big, so
+`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
+
+[[tag-object]]
+tag objects::
+ Tag objects (also known as "annotated tags") contain:
++
+1. The *tagger* and tag date
+2. A *tag message*, similar to a commit message
+3. The *ID* of the object (often a commit) that they reference
+
+[[references]]
+REFERENCES
+----------
+
+References are a way to give a name to a commit.
+It's easier to remember "the changes I'm working on are on the `turtle`
+branch" than "the changes are in commit bb69721404348e".
+Git often uses "ref" as shorthand for "reference".
+
+References that you create are stored in the `.git/refs` directory,
+and Git has a few special internal references like `HEAD` that are stored
+in the base `.git` directory.
+
+References can either be:
+
+1. References to an object ID, usually a <<commit,commit>> ID
+2. References to another reference. This is called a "symbolic reference".
+
+Git handles references differently based on which subdirectory of
+`.git/refs` they're stored in.
+Here are the main types:
+
+[[branch]]
+branches: `.git/refs/heads/<name>`::
+ A branch is a name for a commit ID.
+ That commit is the latest commit on the branch.
+ Branches are stored in the `.git/refs/heads/` directory.
++
+To get the history of commits on a branch, Git will start at the commit
+ID the branch references, and then look at the commit's parent(s),
+the parent's parent, etc.
+
+[[tag]]
+tags: `.git/refs/tags/<name>`::
+ A tag is a name for a commit ID, tag object ID, or other object ID.
+ Tags are stored in the `.git/refs/tags/` directory.
++
+Even though branches and tags are both "a name for a commit ID", Git
+treats them very differently.
+Branches are expected to be regularly updated as you work on the branch,
+but it's expected that a tag will never change after you create it.
+
+[[HEAD]]
+HEAD: `.git/HEAD`::
+ `HEAD` is where Git stores your current <<branch,branch>>.
+ `HEAD` is normally a symbolic reference to your current branch, for
+ example `ref: refs/heads/main` if your current branch is `main`.
+ `HEAD` can also be a direct reference to a commit ID,
+ that's called "detached HEAD state".
+
+[[remote-tracking-branch]]
+remote-tracking branches: `.git/refs/remotes/<remote>/<branch>`::
+ A remote-tracking branch is a name for a commit ID.
+ It's how Git stores the last-known state of a branch in a remote
+ repository. `git fetch` updates remote-tracking branches. When
+ `git status` says "you're up to date with origin/main", it's looking at
+ this.
+
+[[other-refs]]
+Other references::
+ Git tools may create references in any subdirectory of `.git/refs`.
+ For example, linkgit:git-stash[1], linkgit:git-bisect[1],
+ and linkgit:git-notes[1] all create their own references
+ in `.git/refs/stash`, `.git/refs/bisect`, etc.
+ Third-party Git tools may also create their own references.
++
+Git may also create references in the base `.git` directory
+other than `HEAD`, like `ORIG_HEAD`.
+
+*NOTE:* As an optimization, references may be stored as packed
+refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
+
+[[index]]
+THE INDEX
+---------
+
+The index, also known as the "staging area", contains the current staged
+version of every file in your Git repository. When you commit, the files
+in the index are used as the files in the next commit.
+
+Unlike a tree, the index is a flat list of files.
+Each index entry has 4 fields:
+
+1. The *permissions*
+2. The *<<blob,blob>> ID* of the file
+3. The *filename*
+4. The *number*. This is normally 0, but if there's a merge conflict
+ there can be multiple versions (with numbers 1, 2, and 3)
+ of the same filename in the index.
+
+It's extremely uncommon to look at the index directly: normally you'd
+run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
+But you can use `git ls-files --stage` to see the index.
+Here's the output of `git ls-files --stage` in a repository with 2 files:
+
+----
+100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
+100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
+----
+
+[[reflogs]]
+REFLOGS
+-------
+
+Git stores the history of branch, tag, and HEAD refs in a reflog
+(you should read "reflog" as "ref log"). Not every ref is logged by
+default, but any ref can be logged.
+
+Each reflog entry has:
+
+1. *Before/after commit IDs*
+2. *User* who made the change, for example `Maya <maya@example.com>`
+3. *Timestamp*
+4. *Log message*, for example `pull: Fast-forward`
+
+Reflogs only log changes made in your local repository.
+They are not shared with remotes.
+
+GIT
+---
+Part of the linkgit:git[1] suite
base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8
--
gitgitgadget
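
To make the patch's statement that an object's ID is "the SHA-1 hash of its contents" concrete, here is a minimal Python sketch. It assumes a SHA-1 repository and relies on the detail that Git hashes a short header (the object type, the content length, and a NUL byte) together with the contents; the function name `git_object_id` is hypothetical and only for illustration.

```python
import hashlib


def git_object_id(obj_type: str, contents: bytes) -> str:
    """Return the hex object ID Git would assign in a SHA-1 repository.

    Git hashes "<type> <length>\\0" followed by the raw contents,
    not the contents alone.
    """
    header = f"{obj_type} {len(contents)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


# The empty blob has a well-known ID in SHA-1 repositories.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the ID is derived from the type and contents, changing either one yields a different ID, which is why an object can never be changed after it is created.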
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-03 17:34 [PATCH] doc: add a explanation of Git's data model Julia Evans via GitGitGadget
@ 2025-10-03 21:46 ` Kristoffer Haugsbakk
2025-10-06 19:36 ` Julia Evans
2025-10-06 3:32 ` Junio C Hamano
` (3 subsequent siblings)
4 siblings, 1 reply; 89+ messages in thread
From: Kristoffer Haugsbakk @ 2025-10-03 21:46 UTC (permalink / raw)
To: Josh Soref, git; +Cc: Julia Evans

On Fri, Oct 3, 2025, at 19:34, Julia Evans via GitGitGadget wrote:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
>
> However, it's hard to find a clear explanation of these terms and how
> they relate to each other in the documentation. The closest candidates
> currently are:
>
> 1. `gitglossary`. This makes a good effort, but it's an alphabetically
> ordered dictionary and a dictionary is not a good way to learn
> concepts. You have to jump around too much and it's not possible to
> present the concepts in the order that they should be explained.
> 2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
> This is a nice document to have, but it's not necessary to learn how
> `update-index` works to understand Git's data model, and we should
> not be requiring users to learn how to use the "plumbing" commands
> if they want to learn what the term "index" or "object" means.
> 3. `gitrepository-layout`. This is a great resource, but it includes a
> lot of information about configuration and internal implementation
> details which are not related to the data model. It also does
> not explain how commits work.
>
> The result of this is that Git users (even users who have been using
> Git for 15+ years) struggle to read the documentation because they don't
> know what the core terms mean, and it's not possible to add links
> to help them learn more.
>
> Add an explanation of Git's data model. Some choices I've made in
> deciding what "core data model" means:
>
> 1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
> if those are intended to be user facing or if they're more like
> internal implementation details.
> 2. Don't talk about submodules other than by mentioning how they
> relate to trees. This is because Git has a lot of special features,
> and explaining how they all work exhaustively could quickly go
> down a rabbit hole which would make this document less useful for
> understanding Git's core behaviour.
> 3. Don't discuss the structure of a commit message
> (first line, trailers, GPG signatures, etc).
> Perhaps this should change.
>
> Some other choices I've made:
>
> 1. Mention packed refs only in a note.

I don’t think it’s worth mentioning this at all. More on that later.

> 2. Don't mention that the full name of the branch `main` is
> technically `refs/heads/main`. This should likely change but I
> haven't worked out how to do it in a clear way yet.

I think this is worth getting into. This is a pretty
user-facing concept.

> 3. Mostly avoid referring to the `.git` directory, because the exact
> details of how things are stored change over time.
> This should perhaps change from "mostly" to "entirely"
> but I haven't worked out how to do that in a clear way yet.

I think that’s good. I mean, I think us users don’t need that level
of detail and shouldn’t be “inspired” to muck with the internals. If
that makes sense. (See later)

>
> Signed-off-by: Julia Evans <julia@jvns.ca>
> ---
> doc: Add a explanation of Git's data model
>[snip]
> diff --git a/Documentation/Makefile b/Documentation/Makefile
>[snip]
> diff --git a/Documentation/gitdatamodel.adoc
> b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..4b2cb167dc
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,226 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object" "reference" or "index".

I haven’t gone hunting through the docs to see if this is covered
elsewhere. But the thrust of all the things here definitely feel to me
like something that should be presented and documented in such a way.

> +
> +Git's core operations use 4 kinds of data:

Maybe small numerals should be spelled as words in running text?

> +
> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
> +2. <<references,References>>: branches, tags,
> + remote-tracking branches, etc
> +3. <<index,The index>>, also known as the staging area
> +4. <<reflogs,Reflogs>>

Reflogs is certainly auxiliary ref data. What makes it qualify as
one-of-the-four? I am open to it being both, to be clear.

> +
> +[[objects]]
> +OBJECTS
> +-------
> +
> +Commits, trees, blobs, and tag objects are all stored in Git's object
> database.
> +Every object has:
> +
> +1. an *ID*, which is the SHA-1 hash of its contents.
> + It's fast to look up a Git object using its ID.
> + The ID is usually represented in hexadecimal, like
> + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
> +2. a *type*. There are 4 types of objects:
> + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
> + and <<tag-object,tag objects>>.
> +3. *contents*. The structure of the contents depends on the type.
> +
> +Once an object is created, it can never be changed.
> +Here are the 4 types of objects:

As a curious Git user this seems correct.

> +
> +[[commit]]
> +commits::
> + A commit contains:
> ++
> +1. Its *parent commit ID(s)*. The first commit in a repository has 0
> parents,

Maybe this is a subjective style thing but is it necessary to use
“(s)” when the context makes clear that it could be zero to many?

    Its *parent commit IDs. ...

> + regular commits have 1 parent, merge commits have 2+ parents

s/2+/two or more/ ? Same point as the “numeral” one above.

> +2. A *commit message*
> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
> +4. An *author* and the time the commit was authored
> +5. A *committer* and the time the commit was committed
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----
> ++
> +Like all other objects, commits can never be changed after they're
> created.
> +For example, "amending" a commit with `git commit --amend` creates a
> new commit.
> +The old commit will eventually be deleted by `git gc`.

Maybe this could be moved to a part about what happens (eventually) to
unreachable objects?

Mentioning `git gc` and how things will get deleted raises
questions naturally. Like why would they be deleted? Okay
that’s clear: the previous commit will be replaced by the
amended one. Then when it is not reachable by anything
(even the reflog) it will get garbage collected.

It all follows. But is the reader necessarily mature enough
in their understanding to make the inference?

This is a long-winded way of saying: if you’re gonna discuss
`git gc` you might need to go into all of these concepts.

> +
> +[[tree]]
> +trees::
> + A tree is how Git represents a directory. It lists, for each item
> in
> + the tree:
> ++
> +1. The *permissions*, for example `100644`
> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> + or <<commit,`commit`>> (a Git submodule)
> +3. The *object ID*
> +4. The *filename*
> ++
> +For example, this is how a tree containing one directory (`src`) and
> one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> ++
> +*NOTE:* The permissions are in the same format as UNIX permissions, but
> +the only allowed permissions for files (blobs) are 644 and 755.
> +

Makes sense.

> +[[blob]]
> +blobs::
> + A blob is how Git represents a file. A blob object contains the
> + file's contents.
> ++
> +Storing a new blob for every new version of a file can get big, so
> +`git gc` periodically compresses objects for efficiency in
> `.git/objects/pack`.

This gets into mentioning implementation files(?) like you mentioned in
the commit message.

1. That it’s a packfile and where it is might be too much detail for
   this doc
2. I vaguely recall documents discussing what happens to “storing every
   version” discussing deltas instead of packs? Again, I am not a Git
   developer though.

> +
> +[[tag-object]]
> +tag objects::
> + Tag objects (also known as "annotated tags") contain:
> ++
> +1. The *tagger* and tag date
> +2. A *tag message*, similar to a commit message
> +3. The *ID* of the object (often a commit) that they reference

s/often/typically/ ?

I know it can get tedious to caveat the 99% cases with things that are
technically possible. Maybe if it gets “bad enough” there could be a
part that explains/distinguishes the high-level/porcelain Git use and
what is technically possible: you make a `git tag -a`, which is on a
commit... except if you accidentally run it on top of an existing tag.
Then even the porcelain won’t protect you from making a tag-on-tag.
(But it will issue a warning I guess.)

Hmm. Now I don’t know.

> +
> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".

Good.

> +
> +References that you create are stored in the `.git/refs` directory,
> +and Git has a few special internal references like `HEAD` that are
> stored
> +in the base `.git` directory.

Implementation file details. You also mention `.git/refs/heads/<name>`
below. But refs aren’t stored as files if you are using the *reftable*
backend. And that backend will become the default for new repositories
in Git 3.0, I think.

How does reftable work? I don’t know. But I don’t think we need to know
after reading this doc. :) To be clear: how files are stored might not
matter here.

> +
> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic
> reference".

You seem to have used `**` when introducing terms:

    This is a *symbolic reference*

>[snip ref stuff]
> +
> +[[HEAD]]
> +HEAD: `.git/HEAD`::
> + `HEAD` is where Git stores your current <<branch,branch>>.
> + `HEAD` is normally a symbolic reference to your current branch, for
> + example `ref: refs/heads/main` if your current branch is `main`.
> + `HEAD` can also be a direct reference to a commit ID,
> + that's called "detached HEAD state".
> +
> +[[remote-tracking-branch]]
> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> + A remote-tracking branch is a name for a commit ID.
> + It's how Git stores the last-known state of a branch in a remote
> + repository. `git fetch` updates remote-tracking branches. When
> + `git status` says "you're up to date with origin/main", it's looking at
> + this.

Looks good.

> +
> +[[other-refs]]
> +Other references::
> + Git tools may create references in any subdirectory of `.git/refs`.
> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> + and linkgit:git-notes[1] all create their own references
> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
> + Third-party Git tools may also create their own references.
> ++
> +Git may also create references in the base `.git` directory
> +other than `HEAD`, like `ORIG_HEAD`.
> +
> +*NOTE:* As an optimization, references may be stored as packed
> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].

I don’t know if this is relevant for both ref backends. And does it
matter?

> +
> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains the current
> staged
> +version of every file in your Git repository. When you commit, the
> files
> +in the index are used as the files in the next commit.
> +
> +Unlike a tree, the index is a flat list of files.
> +Each index entry has 4 fields:
> +
> +1. The *permissions*
> +2. The *<<blob,blob>> ID* of the file
> +3. The *filename*
> +4. The *number*. This is normally 0, but if there's a merge conflict
> + there can be multiple versions (with numbers 0, 1, 2, ..)
> + of the same filename in the index.
> +
> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and
> <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2
> files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores the history of branch, tag, and HEAD refs in a reflog
> +(you should read "reflog" as "ref log"). Not every ref is logged by

You’ve heard of the re-flog too?

> +default, but any ref can be logged.
> +
> +Each reflog entry has:
> +
> +1. *Before/after *commit IDs*
> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp*
> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.

Makes sense.

> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite

I appreciate that this is the first version and you might have plans
after this one. But I wonder if this doc could use a fair number of
`gitlink` to branch out to all the other parts. Like git-reflog(1),
gitglossary(7).

Thanks for starting on a whole new doc. That must take quite
some effort.
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-03 21:46 ` Kristoffer Haugsbakk
@ 2025-10-06 19:36 ` Julia Evans
2025-10-06 21:44 ` D. Ben Knoble
2025-10-08 9:59 ` Kristoffer Haugsbakk
0 siblings, 2 replies; 89+ messages in thread
From: Julia Evans @ 2025-10-06 19:36 UTC (permalink / raw)
To: Kristoffer Haugsbakk, Julia Evans, git

Thanks for the review!

>> 2. Don't mention that the full name of the branch `main` is
>> technically `refs/heads/main`. This should likely change but I
>> haven't worked out how to do it in a clear way yet.
>
> I think this is worth getting into. This is a pretty
> user-facing concept.

I think I'll see if I can figure out a way to mention this and at the
same time remove most of the rest of the references to the `.git`
directory when explaining references (which you talked about
further down), including packed refs.

>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> + remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> Reflogs is certainly auxiliary ref data. What makes it qualify as
> one-of-the-four? I am open to it being both, to be clear.

The reason I like to talk about reflogs is that it gives you a way to
"undo" Git operations that can be really useful. And any Git command
that updates refs can update that ref's reflog.

Understanding how reflogs work helps to understand what the
limitations of using reflogs to undo mistakes are: for example the
index is not a ref, so you can't use the reflog to undo changes to
the index.

>> +2. A *commit message*
>> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
>> +4. An *author* and the time the commit was authored
>> +5. A *committer* and the time the commit was committed
>> ++
>> +Here's how an example commit is stored:
>> ++
>> +----
>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>> ++
>> +Like all other objects, commits can never be changed after they're
>> created.
>> +For example, "amending" a commit with `git commit --amend` creates a
>> new commit.
>
>> +The old commit will eventually be deleted by `git gc`.
>
> Maybe this could be moved to a part about what happens (eventually) to
> unreachable objects?
>
> Mentioning `git gc` and how things will get deleted raises
> questions naturally. Like why would they be deleted? Okay
> that’s clear: the previous commit will be replaced by the
> amended one. Then when it is not reachable by anything
> (even the reflog) it will get garbage collected.
>
> It all follows. But is the reader necessarily mature enough
> in their understanding to make the inference?
>
> This is a long-winded way of saying: if you’re gonna discuss
> `git gc` you might need to go into all of these concepts.

If folks here think this is a reasonable document to add to Git I'll
try to get some beta readers to read this, see which parts folks find
confusing, and address those, keeping the `git gc` stuff in mind.
Similarly for the style comments.

>> +blobs::
>> + A blob is how Git represents a file. A blob object contains the
>> + file's contents.
>> ++
>> +Storing a new blob for every new version of a file can get big, so
>> +`git gc` periodically compresses objects for efficiency in
>> `.git/objects/pack`.
>
> This gets into mentioning implementation files(?) like you mentioned in
> the commit message.

That's true! The reason I think this is important to mention is that
I find that people often "reject" information that they find
implausible, even if it comes from a credible source. ("that can't be
true! I must be not understanding correctly. Oh well, I'll just
ignore that!")

I sometimes hear from users that "commits can't be snapshots", because
it would take up too much disk space to store every version of every
commit. So I find that sometimes explaining a little bit about the
implementation can make the information more memorable.

Certainly I'm not able to remember details that don't make sense with
my mental model of how computers work and I don't expect other people
to either, so I think it's important to give an explanation that
handles the biggest "objections".

> 1. That it’s a packfile and where it is might be too much detail for
> this doc
> 2. I vaguely recall documents discussing what happens to “storing every
> version” discussing deltas instead of packs? Again, I am not a Git
> developer though.

I could be wrong about the details here, I'm not a Git developer
either. From https://git-scm.com/book/en/v2/Git-Internals-Packfiles
it looks like packfiles are implemented using deltas.

>> +
>> +References can either be:
>> +
>> +1. References to an object ID, usually a <<commit,commit>> ID
>> +2. References to another reference. This is called a "symbolic
>> reference".
>
> You seem to have used `**` when introducing terms:
>
> This is a *symbolic reference*

Thanks, will take a look at that.

>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores the history of branch, tag, and HEAD refs in a reflog
>> +(you should read "reflog" as "ref log"). Not every ref is logged by
>
> You’ve heard of the re-flog too?

haha exactly, I just want folks to understand why it's called that :)

> I appreciate that this is the first version and you might have plans
> after this one. But I wonder if this doc could use a fair number of
> `gitlink` to branch out to all the other parts. Like git-reflog(1),
> gitglossary(7).

That's reasonable. Do you often use the "See also" section of man
pages? I've never looked at them so I'm always curious about how
people are actually using them in practice.

I also need to think about what else could link *to* this, because
without attention to discoverability probably nobody will find it.
My main idea so far is actually to add it to
https://git-scm.com/learn but I wanted to send it here instead of
adding it to the website directly because I thought it could benefit
from a more detailed review.

> Thanks for starting on a whole new doc. That must take quite
> some effort.

All the work on documentation takes a lot of effort, in some ways
it's easier to write something new than to edit something existing :)
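
The four reflog fields listed in the patch (before/after IDs, user, timestamp, log message) can be sketched as a small Python parser. This assumes the line layout used by the files-based ref backend, `<old-id> <new-id> <user> <timestamp> <tz><TAB><message>`, which is an implementation detail and may not hold for other backends such as reftable; the names `ReflogEntry` and `parse_reflog_line` are hypothetical.

```python
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old_id: str      # ID the ref pointed at before the change
    new_id: str      # ID the ref points at after the change
    user: str        # for example "Maya <maya@example.com>"
    timestamp: int   # seconds since the Unix epoch
    tz: str          # UTC offset, for example "-0400"
    message: str     # for example "pull: Fast-forward"


def parse_reflog_line(line: str) -> ReflogEntry:
    """Parse one reflog line in the assumed files-backend layout."""
    # The log message is separated from the metadata by a tab.
    meta, _, message = line.rstrip("\n").partition("\t")
    old_id, new_id, rest = meta.split(" ", 2)
    # The user may contain spaces, so split the last two fields from the right.
    user, timestamp, tz = rest.rsplit(" ", 2)
    return ReflogEntry(old_id, new_id, user, int(timestamp), tz, message)
```

Since each entry records both the old and the new ID, the old ID is what makes it possible to go back to where a ref pointed before an operation.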
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-06 19:36 ` Julia Evans
@ 2025-10-06 21:44 ` D. Ben Knoble
2025-10-06 21:46 ` Julia Evans
2025-10-08 9:59 ` Kristoffer Haugsbakk
1 sibling, 1 reply; 89+ messages in thread
From: D. Ben Knoble @ 2025-10-06 21:44 UTC (permalink / raw)
To: Julia Evans; +Cc: Kristoffer Haugsbakk, Julia Evans, git

On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
>
> Thanks for the review!
>
> >> 2. Don't mention that the full name of the branch `main` is
> >> technically `refs/heads/main`. This should likely change but I
> >> haven't worked out how to do it in a clear way yet.
> >
> > I think this is worth getting into. This is a pretty
> > user-facing concept.
>
> I think I'll see if I can figure out a way to mention this and at the
> same time remove most of the rest of the references to the `.git`
> directory when explaining references (which you talked about
> further down), including packed refs.

A colleague will be explaining reflog for an audience tomorrow, and
decided to briefly explain refs, too—which tells me this is
much-needed.

For refs themselves, perhaps "git for-each-ref" is a reasonable place
to start? Since it tells you the refs you have and how to spell them
explicitly regardless of how they are stored?

--
D. Ben Knoble

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-06 21:44 ` D. Ben Knoble
@ 2025-10-06 21:46 ` Julia Evans
2025-10-06 21:55 ` D. Ben Knoble
0 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-10-06 21:46 UTC (permalink / raw)
To: D. Ben Knoble; +Cc: Kristoffer Haugsbakk, Julia Evans, git

On Mon, Oct 6, 2025, at 5:44 PM, D. Ben Knoble wrote:
> On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
>>
>> Thanks for the review!
>>
>> >> 2. Don't mention that the full name of the branch `main` is
>> >> technically `refs/heads/main`. This should likely change but I
>> >> haven't worked out how to do it in a clear way yet.
>> >
>> > I think this is worth getting into. This is a pretty
>> > user-facing concept.
>>
>> I think I'll see if I can figure out a way to mention this and at the
>> same time remove most of the rest of the references to the `.git`
>> directory when explaining references (which you talked about
>> further down), including packed refs.
>
> A colleague will be explaining reflog for an audience tomorrow, and
> decided to briefly explain refs, too—which tells me this is
> much-needed.
>
> For refs themselves, perhaps "git for-each-ref" is a reasonable place
> to start? Since it tells you the refs you have and how to spell them
> explicitly regardless of how they are stored?

Interesting, do you use git for-each-ref?
What do you use it for?

> --
> D. Ben Knoble

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-06 21:46 ` Julia Evans
@ 2025-10-06 21:55 ` D. Ben Knoble
2025-10-09 13:20 ` Julia Evans
0 siblings, 1 reply; 89+ messages in thread
From: D. Ben Knoble @ 2025-10-06 21:55 UTC (permalink / raw)
To: Julia Evans; +Cc: Kristoffer Haugsbakk, Julia Evans, git

On Mon, Oct 6, 2025 at 5:47 PM Julia Evans <julia@jvns.ca> wrote:
>
>
>
> On Mon, Oct 6, 2025, at 5:44 PM, D. Ben Knoble wrote:
> > On Mon, Oct 6, 2025 at 3:37 PM Julia Evans <julia@jvns.ca> wrote:
> >>
> >> Thanks for the review!
> >>
> >> >> 2. Don't mention that the full name of the branch `main` is
> >> >> technically `refs/heads/main`. This should likely change but I
> >> >> haven't worked out how to do it in a clear way yet.
> >> >
> >> > I think this is worth getting into. This is a pretty
> >> > user-facing concept.
> >>
> >> I think I'll see if I can figure out a way to mention this and at the
> >> same time remove most of the rest of the references to the `.git`
> >> directory when explaining references (which you talked about
> >> further down), including packed refs.
> >
> > A colleague will be explaining reflog for an audience tomorrow, and
> > decided to briefly explain refs, too—which tells me this is
> > much-needed.
> >
> > For refs themselves, perhaps "git for-each-ref" is a reasonable place
> > to start? Since it tells you the refs you have and how to spell them
> > explicitly regardless of how they are stored?
>
> Interesting, do you use git for-each-ref?
> What do you use it for?

Ah, yes, but primarily for scripting.

What I should have clarified is that "the tool (I know of) to
interrogate the refs you currently have is git-for-each-ref" (like how
git-ls-remote is the tool to interrogate a remote's refs). It avoids
the issues with assuming "tree .git/refs" or similar will capture the
actual data.

--
D. Ben Knoble

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-06 21:55 ` D. Ben Knoble
@ 2025-10-09 13:20 ` Julia Evans
0 siblings, 0 replies; 89+ messages in thread
From: Julia Evans @ 2025-10-09 13:20 UTC (permalink / raw)
To: D. Ben Knoble; +Cc: Kristoffer Haugsbakk, Julia Evans, git

>> >> I think I'll see if I can figure out a way to mention this and at the
>> >> same time remove most of the rest of the references to the `.git`
>> >> directory when explaining references (which you talked about
>> >> further down), including packed refs.
>> >
>> > A colleague will be explaining reflog for an audience tomorrow, and
>> > decided to briefly explain refs, too—which tells me this is
>> > much-needed.
>> >
>> > For refs themselves, perhaps "git for-each-ref" is a reasonable place
>> > to start? Since it tells you the refs you have and how to spell them
>> > explicitly regardless of how they are stored?
>>
>> Interesting, do you use git for-each-ref?
>> What do you use it for?
>
> Ah, yes, but primarily for scripting.
>
> What I should have clarified is that "the tool (I know of) to
> interrogate the refs you currently have is git-for-each-ref" (like how
> git-ls-remote is the tool to interrogate a remote's refs). It avoids
> the issues with assuming "tree .git/refs" or similar will capture the
> actual data.

Ah, that makes sense! I spent a little while trying to come up with
something that would give a "similar result" to running
`cat .git/<refname>` and I came up with this:

git for-each-ref <ref-name> --include-root-refs --format="%(refname) %(if)%(symref)%(then)%(symref)%(else)%(objectname:short)%(end)"

I hoped to find a simple equivalent to that `cat` command (kind of the
equivalent of `git cat-file -p`) that would work with other ref backends
but couldn't find one.

^ permalink raw reply [flat|nested] 89+ messages in thread
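The format string in the message above prints a ref's target when the ref is symbolic and a short object ID otherwise. As a rough illustration of that distinction, here is a toy in-memory model in Python. It is only a sketch of the data model, not how any ref backend stores data: the `refs` dictionary, the `ref: ` prefix (borrowed from the `ref: refs/heads/main` example in the patch), and the helper names `describe` and `resolve` are all assumptions made for this example.

```python
# Toy model of references: each name maps either to an object ID
# (a direct ref) or to "ref: <other name>" (a symbolic ref).
# The dict and the helper names are hypothetical, for illustration only.

SYMREF_PREFIX = "ref: "

refs = {
    "HEAD": "ref: refs/heads/main",
    "refs/heads/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
    "refs/remotes/origin/HEAD": "ref: refs/remotes/origin/main",
    "refs/remotes/origin/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
}


def describe(name):
    """Return the symref target for a symbolic ref, else a short object ID."""
    value = refs[name]
    if value.startswith(SYMREF_PREFIX):
        return value[len(SYMREF_PREFIX):]
    return value[:7]


def resolve(name, max_depth=5):
    """Follow symbolic refs until a direct ref (an object ID) is reached."""
    for _ in range(max_depth):
        value = refs[name]
        if not value.startswith(SYMREF_PREFIX):
            return value
        name = value[len(SYMREF_PREFIX):]
    raise ValueError("too many levels of symbolic refs")


for name in sorted(refs):
    print(name, describe(name))
```

In this model `describe` mirrors the `%(if)%(symref)...%(end)` part of the format string, while `resolve` shows why a symbolic ref such as `HEAD` still ends up naming a commit.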
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-06 19:36 ` Julia Evans
2025-10-06 21:44 ` D. Ben Knoble
@ 2025-10-08 9:59 ` Kristoffer Haugsbakk
1 sibling, 0 replies; 89+ messages in thread
From: Kristoffer Haugsbakk @ 2025-10-08 9:59 UTC (permalink / raw)
To: Julia Evans, Josh Soref, git

On Mon, Oct 6, 2025, at 21:36, Julia Evans wrote:
>[snip]
>>> +blobs::
>>> + A blob is how Git represents a file. A blob object contains the
>>> + file's contents.
>>> ++
>>> +Storing a new blob for every new version of a file can get big, so
>>> +`git gc` periodically compresses objects for efficiency in
>>> `.git/objects/pack`.
>>
>> This gets into mentioning implementation files(?) like you mentioned in
>> the commit message.
>
> That's true! The reason I think this is important to mention is that I find
> that people often "reject" information that they find implausible, even
> if it comes from a credible source. ("that can't be true! I must be
> not understanding correctly. Oh well, I'll just ignore that!")
>
> I sometimes hear from users that "commits can't be snapshots", because
> it would take up too much disk space to store every version of
> every commit. So I find that sometimes explaining a little bit about the
> implementation can make the information more memorable.
>
> Certainly I'm not able to remember details that don't make sense
> with my mental model of how computers work and I don't expect other
> people to either, so I think it's important to give an explanation that
> handles the biggest "objections".

That’s very interesting. Yes, maybe people need to be told/taught to a
level which might be considered “just implementation details”, or else
neither will their curiosity be satisfied *nor* will their own sense of
error-correction for the seemingly implausible.

>[snip]
>> I appreciate that this is the first version and you might have plans
>> after this one. But I wonder if this doc could use a fair number of
>> `gitlink` to branch out to all the other parts. Like git-reflog(1),
>> gitglossary(7).
>
> That's reasonable. Do you often use the "See also" section of
> man pages? I've never looked at them so I'm always curious about
> how people are actually using them in practice.

I don’t really use See Also when looking things up. But I notice all
the mentions of other docs in running text.

>[snip]

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-03 17:34 [PATCH] doc: add a explanation of Git's data model Julia Evans via GitGitGadget
2025-10-03 21:46 ` Kristoffer Haugsbakk
@ 2025-10-06 3:32 ` Junio C Hamano
2025-10-06 19:03 ` Julia Evans
2025-10-07 12:37 ` Kristoffer Haugsbakk
2025-10-07 14:32 ` Patrick Steinhardt
` (2 subsequent siblings)
4 siblings, 2 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-10-06 3:32 UTC (permalink / raw)
To: Julia Evans via GitGitGadget; +Cc: git, Julia Evans

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +MAN7_TXT += gitdatamodel.adoc
> MAN7_TXT += gitdiffcore.adoc
> ...
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------

The above causes doc-lint to barf.

https://github.com/git/git/actions/runs/18265502271/job/51999236907#step:4:655

gitdatamodel.adoc:226: has no required 'SYNOPSIS' section!
LINT MAN SEC giteveryday.adoc
make[1]: *** [Makefile:498: .build/lint-docs/man-section-order/gitdatamodel.ok] Error 1

You can check locally with "make check-docs" without waiting for my
integration cycle to push to GitHub CI.

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-06 3:32 ` Junio C Hamano
@ 2025-10-06 19:03 ` Julia Evans
2025-10-07 12:37 ` Kristoffer Haugsbakk
1 sibling, 0 replies; 89+ messages in thread
From: Julia Evans @ 2025-10-06 19:03 UTC (permalink / raw)
To: Junio C Hamano, Julia Evans; +Cc: git

> The above causes doc-lint to barf.
>
> https://github.com/git/git/actions/runs/18265502271/job/51999236907#step:4:655
>
> gitdatamodel.adoc:226: has no required 'SYNOPSIS' section!
> LINT MAN SEC giteveryday.adoc
> make[1]: *** [Makefile:498:
> .build/lint-docs/man-section-order/gitdatamodel.ok] Error 1
>
>
> You can check locally with "make check-docs" without waiting for my
> integration cycle to push to GitHub CI.

Thanks, will fix.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-06 3:32 ` Junio C Hamano
2025-10-06 19:03 ` Julia Evans
@ 2025-10-07 12:37 ` Kristoffer Haugsbakk
2025-10-07 16:38 ` Junio C Hamano
1 sibling, 1 reply; 89+ messages in thread
From: Kristoffer Haugsbakk @ 2025-10-07 12:37 UTC (permalink / raw)
To: Junio C Hamano, Josh Soref; +Cc: git, Julia Evans

On Mon, Oct 6, 2025, at 05:32, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +MAN7_TXT += gitdatamodel.adoc
>> MAN7_TXT += gitdiffcore.adoc
>> ...
>> +gitdatamodel(7)
>> +===============
>> +
>> +NAME
>> +----
>> +gitdatamodel - Git's core data model
>> +
>> +DESCRIPTION
>> +-----------
>
> The above causes doc-lint to barf.
>[snip]
> You can check locally with "make check-docs" without waiting for my
> integration cycle to push to GitHub CI.

I think you meant `make lint-docs` for both of these.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-07 12:37 ` Kristoffer Haugsbakk
@ 2025-10-07 16:38 ` Junio C Hamano
0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-10-07 16:38 UTC (permalink / raw)
To: Kristoffer Haugsbakk; +Cc: Josh Soref, git, Julia Evans

"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> writes:

> On Mon, Oct 6, 2025, at 05:32, Junio C Hamano wrote:
>> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> +MAN7_TXT += gitdatamodel.adoc
>>> MAN7_TXT += gitdiffcore.adoc
>>> ...
>>> +gitdatamodel(7)
>>> +===============
>>> +
>>> +NAME
>>> +----
>>> +gitdatamodel - Git's core data model
>>> +
>>> +DESCRIPTION
>>> +-----------
>>
>> The above causes doc-lint to barf.
>>[snip]
>> You can check locally with "make check-docs" without waiting for my
>> integration cycle to push to GitHub CI.
>
> I think you meant `make lint-docs` for both of these.

The former is a typo for "causes lint-docs to barf", but I did mean
"make check-docs" as the recipe for local checking. You could also
do "make -C Documentation lint-docs", but that is a lot more to type
;-).

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-03 17:34 [PATCH] doc: add a explanation of Git's data model Julia Evans via GitGitGadget
2025-10-03 21:46 ` Kristoffer Haugsbakk
2025-10-06 3:32 ` Junio C Hamano
@ 2025-10-07 14:32 ` Patrick Steinhardt
2025-10-07 17:02 ` Junio C Hamano
` (2 more replies)
2025-10-08 13:53 ` [PATCH v2] " Julia Evans via GitGitGadget
2025-10-09 14:20 ` [PATCH] doc: add a " Julia Evans
4 siblings, 3 replies; 89+ messages in thread
From: Patrick Steinhardt @ 2025-10-07 14:32 UTC (permalink / raw)
To: Julia Evans via GitGitGadget; +Cc: git, Julia Evans

On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..4b2cb167dc
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,226 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object" "reference" or "index".

There's a missing comma after "object".

> +
> +Git's core operations use 4 kinds of data:
> +
> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
> +2. <<references,References>>: branches, tags,
> + remote-tracking branches, etc
> +3. <<index,The index>>, also known as the staging area
> +4. <<reflogs,Reflogs>>

This list makes sense to me. There's of course more data structures in
Git, but all the other data structures shouldn't really matter to users
at all as they are mostly caches or internal details of the on-disk
format.

There's potentially one exception though, namely the Git configuration.
I'd claim that Git "uses" the Git configuration similarly to how it uses
the others, but I get why it's not explicitly mentioned here.

> +[[objects]]
> +OBJECTS
> +-------
> +
> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
> +Every object has:
> +
> +1. an *ID*, which is the SHA-1 hash of its contents.

I think this needs to be adapted to not single out SHA-1 as the only
hashing algorithm. We already support SHA-256, so we should definitely
say that the algorithm can be swapped. Maybe something like:

    An *object ID*, which is the cryptographic hash of its contents. By
    default, Git uses SHA-1 as object hash, but alternative hashes like
    SHA-256 are supported.

> + It's fast to look up a Git object using its ID.
> + The ID is usually represented in hexadecimal, like
> + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
> +2. a *type*. There are 4 types of objects:
> + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
> + and <<tag-object,tag objects>>.
> +3. *contents*. The structure of the contents depends on the type.

Nit: every object also has an object size. Not sure though whether it's
fine to imply that with "contents".

> +Once an object is created, it can never be changed.
> +Here are the 4 types of objects:
> +
> +[[commit]]
> +commits::
> + A commit contains:
> ++
> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
> + regular commits have 1 parent, merge commits have 2+ parents

I'd say "at least two parents" instead of "2+ parents".

> +2. A *commit message*
> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
> +4. An *author* and the time the commit was authored
> +5. A *committer* and the time the commit was committed
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----

In practice, commits can have other headers that are ignored by Git.
But that's certainly not part of Git's core data model, so I don't think
we should mention that here.

> +Like all other objects, commits can never be changed after they're created.
> +For example, "amending" a commit with `git commit --amend` creates a new commit.
> +The old commit will eventually be deleted by `git gc`.

If we mention git-gc(1) I think it would make sense to use
`linkgit:git-gc[1]` instead to provide a link to its man page.

> +[[tree]]
> +trees::
> + A tree is how Git represents a directory. It lists, for each item in
> + the tree:
> ++
> +1. The *permissions*, for example `100644`

I think we should rather call these "mode bits". These bits are
permissions indeed when you have a blob, but for subtrees, symlinks and
submodules they aren't.

> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> + or <<commit,`commit`>> (a Git submodule)

There's also symlinks.

> +3. The *object ID*
> +4. The *filename*
> ++
> +For example, this is how a tree containing one directory (`src`) and one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> ++
> +*NOTE:* The permissions are in the same format as UNIX permissions, but
> +the only allowed permissions for files (blobs) are 644 and 755.
> +
> +[[blob]]
> +blobs::
> + A blob is how Git represents a file. A blob object contains the
> + file's contents.
> ++
> +Storing a new blob for every new version of a file can get big, so
> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.

I would claim that it's not necessary to mention object compression.
This should be a low-level detail that users don't ever have to worry
about. Furthermore, packing objects isn't only relevant in the context
of blobs: trees for example also tend to compress very well as there
typically are only small incremental updates to trees.

> +[[tag-object]]
> +tag objects::
> + Tag objects (also known as "annotated tags") contain:
> ++
> +1. The *tagger* and tag date
> +2. A *tag message*, similar to a commit message
> +3. The *ID* of the object (often a commit) that they reference

They can also be signed, if we want to mention that.

> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".
> +
> +References that you create are stored in the `.git/refs` directory,
> +and Git has a few special internal references like `HEAD` that are stored
> +in the base `.git` directory.

This isn't true anymore with the introduction of the reftable backend,
which is slated to become the default backend. I'd argue that this is
another implementation detail that the user shouldn't have to worry
about.

> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic reference".
> +
> +Git handles references differently based on which subdirectory of
> +`.git/refs` they're stored in.

So instead of saying "subdirectory", I'd rather say "reference
hierarchy". In general, I think we should explain that references are
laid out in a hierarchy. This is somewhat obvious with the "files"
backend, as we use directories there. But as we move on to the
"reftable" backend this may become less obvious over time.

> +Here are the main types:
> +
> +[[branch]]
> +branches: `.git/refs/heads/<name>`::

Here and in the other cases we should then strip the `.git/` prefix.

> + A branch is a name for a commit ID.
> + That commit is the latest commit on the branch.
> + Branches are stored in the `.git/refs/heads/` directory.
> ++
> +To get the history of commits on a branch, Git will start at the commit
> +ID the branch references, and then look at the commit's parent(s),
> +the parent's parent, etc.
> +
> +[[tag]]
> +tags: `.git/refs/tags/<name>`::
> + A tag is a name for a commit ID, tag object ID, or other object ID.
> + Tags are stored in the `refs/tags/` directory.
> ++
> +Even though branches and commits are both "a name for a commit ID", Git
> +treats them very differently.
> +Branches are expected to be regularly updated as you work on the branch,
> +but it's expected that a tag will never change after you create it.

This sounds a bit like the user itself needs to update the branch. How
about this instead:

    Even though branches and commits are both "a name for a commit ID", Git
    treats them very differently:

      - Branches can be checked out directly. If so, creating a new
        commit will automatically update the checked-out branch to
        point to the new commit.

      - Tags cannot be checked out directly and don't move when
        creating a new commit. Instead, one can only check out the
        commit that a branch points to. This is called "detached
        HEAD", and the effect is that a new commit will not update

> +[[HEAD]]
> +HEAD: `.git/HEAD`::
> + `HEAD` is where Git stores your current <<branch,branch>>.
> + `HEAD` is normally a symbolic reference to your current branch, for
> + example `ref: refs/heads/main` if your current branch is `main`.
> + `HEAD` can also be a direct reference to a commit ID,
> + that's called "detached HEAD state".
> +
> +[[remote-tracking-branch]]
> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> + A remote-tracking branch is a name for a commit ID.
> + It's how Git stores the last-known state of a branch in a remote
> + repository. `git fetch` updates remote-tracking branches. When
> + `git status` says "you're up to date with origin/main", it's looking at
> + this.

This misses "refs/remotes/<remote>/HEAD".
This reference is a symbolic reference that indicates the default branch
on the remote side.

> +[[other-refs]]
> +Other references::
> + Git tools may create references in any subdirectory of `.git/refs`.
> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> + and linkgit:git-notes[1] all create their own references
> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
> + Third-party Git tools may also create their own references.
> ++
> +Git may also create references in the base `.git` directory
> +other than `HEAD`, like `ORIG_HEAD`.

Let's mention that such references are typically spelt all-uppercase
with underscores between. You shouldn't ever create a reference that is
for example called ".git/foo". We only enforce this restriction
inconsistently, but I don't think that should keep us from spelling out
the common rule.

> +*NOTE:* As an optimization, references may be stored as packed
> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].

I'd drop this note. It's an internal implementation detail and only true
for the "files" backend. The "reftable" backend stores references quite
differently and doesn't really "pack" references.

> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains the current staged

Honestly, I always forget which of these two nouns we are supposed to
use nowadays. I think consensus was to use "index" and avoid using
"staging area"? Not sure though, but I think we should only mention one
of these.

> +version of every file in your Git repository. When you commit, the files
> +in the index are used as the files in the next commit.
> +
> +Unlike a tree, the index is a flat list of files.
> +Each index entry has 4 fields:
> +
> +1. The *permissions*
> +2. The *<<blob,blob>> ID* of the file
> +3. The *filename*
> +4. The *number*. This is normally 0, but if there's a merge conflict

I think we don't call this "number", but "stage".

> + there can be multiple versions (with numbers 0, 1, 2, ..)
> + of the same filename in the index.
> +
> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2 files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores the history of branch, tag, and HEAD refs in a reflog
> +(you should read "reflog" as "ref log"). Not every ref is logged by
> +default, but any ref can be logged.

If we mention this here, do we maybe want to mention how the user can
decide which references are logged?

> +Each reflog entry has:
> +
> +1. *Before/after *commit IDs*

This will probably misformat as we have three asterisks here, not two.

> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp*

Suggestion: "*Timestamp* when that change has been made".

> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.

We may want to mention that you can reference reflog entries via
`refs/heads/<branch>@{<reflog-nr>}`.

In general, one thing that I think would be important to highlight in
this document is revisions. Most of the commands tend to not accept
references, but revisions instead, which are a lot more flexible. They
use our do-what-I-mean mechanism to resolve, but also allow the user to
specify commits relative to one another.

It's probably sufficient though to mention them briefly and then
redirect to gitrevisions(7).

Thanks for working on this!

Patrick

^ permalink raw reply [flat|nested] 89+ messages in thread
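The index fields discussed in this review (mode bits, blob ID, stage, filename) can be read back out of `git ls-files --stage` output. The following Python sketch parses the two sample lines quoted from the patch. The names `IndexEntry`, `parse_stage_line`, and `MODE_KINDS` are hypothetical, and the whitespace fallback is an assumption added for text where the tab that Git prints before the path has been lost.

```python
# Sketch: split `git ls-files --stage` output into index entries and
# classify the mode bits. The sample lines are the ones quoted from the
# patch; all names below are hypothetical, for illustration only.
from typing import NamedTuple

MODE_KINDS = {
    "100644": "regular file",
    "100755": "executable file",
    "120000": "symbolic link",
    "160000": "submodule (gitlink)",
}


class IndexEntry(NamedTuple):
    mode: str
    object_id: str
    stage: int
    path: str


def parse_stage_line(line):
    # Git separates the path from the other fields with a tab; fall back
    # to whitespace splitting for text where the tab was lost.
    if "\t" in line:
        meta, path = line.split("\t", 1)
        mode, object_id, stage = meta.split()
    else:
        mode, object_id, stage, path = line.split(None, 3)
    return IndexEntry(mode, object_id, int(stage), path)


sample = (
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0\tREADME.md\n"
    "100644 665c637a360874ce43bf74018768a96d2d4d219a 0\tsrc/hello.py\n"
)

entries = [parse_stage_line(line) for line in sample.splitlines()]
for entry in entries:
    kind = MODE_KINDS.get(entry.mode, "unknown")
    print(entry.path, kind, "stage", entry.stage)
```

The mode table also illustrates the point about "mode bits" versus "permissions": only the first two entries describe file permissions, while the other two mark symlinks and submodules.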
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-07 14:32 ` Patrick Steinhardt
@ 2025-10-07 17:02 ` Junio C Hamano
2025-10-07 19:30 ` Julia Evans
2025-10-07 18:39 ` D. Ben Knoble
2025-10-07 18:55 ` Julia Evans
2 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-10-07 17:02 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Julia Evans via GitGitGadget, git, Julia Evans

Patrick Steinhardt <ps@pks.im> writes:

>> +Git's core operations use 4 kinds of data:
>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> + remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> This list makes sense to me. There's of course more data structures in
> Git, but all the other data structures shouldn't really matter to users
> at all as they are mostly caches or internal details of the on-disk
> format.
>
> There's potentially one exception though, namely the Git configuration.
> I'd claim that Git "uses" the Git configuration similarly to how it uses
> the others, but I get why it's not explicitly mentioned here.

The core operations do not use Git configuration any more than they
use what is specified by the command line arguments.

>> +[[objects]]
>> +OBJECTS
>> +-------
>> +
>> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
>> +Every object has:
>> +
>> +1. an *ID*, which is the SHA-1 hash of its contents.
>
> I think this needs to be adapted to not single out SHA-1 as the only
> hashing algorithm. We already support SHA-256, so we should definitely
> say that the algorithm can be swapped. Maybe something like:

Good point. Also officially they are called "object name".

> An *object ID*, which is the cryptographic hash of its contents. By
> default, Git uses SHA-1 as object hash, but alternative hashes like
> SHA-256 are supported.

I'd avoid "object name is the result of hashing X" which historically
was a source of question: "why does 'sha1sum README.md' give different
hash from 'git add README.md && git ls-files -s README.md'?" It is
an irrelevant implementation detail (and you'd eventually end up
having to say "X is <type> SP <length> NUL <contents>").

    An object name, which is derived cryptographically from its
    type, size and contents. All versions of Git can use SHA-1 hash
    function, but more recent versions of Git can also use SHA-256
    hash function.

>> +commits::
>> + A commit contains:
>> ++
>> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> + regular commits have 1 parent, merge commits have 2+ parents
>
> I'd say "at least two parents" instead of "2+ parents".

Yup, that reads much better.

>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>
> In practice, commits can have other headers that are ignored by Git. But
> that's certainly not part of Git's core data model, so I don't think we
> should mention that here.

Third-party software can add truly garbage ones that do not have any
meaning, and Git tolerates by ignoring them. But there are others
that Git does pay attention to, like encoding, gpgsig, etc., which
may be worth mentioning (in the form that "these four are what you
typically see, but there may be others" without even naming any).

^ permalink raw reply [flat|nested] 89+ messages in thread
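The `sha1sum` question mentioned above comes down to the `<type> SP <length> NUL <contents>` framing: the object name covers a small header as well as the file bytes. A minimal Python sketch of that difference follows. The helper name `object_id` and its `algo` parameter are assumptions made for illustration, not a Git API.

```python
# Sketch: why hashing a file directly differs from Git's object name.
# object_id() is a hypothetical helper that hashes
# "<type> SP <length> NUL <contents>", the framing described above.
import hashlib


def object_id(obj_type, contents, algo="sha1"):
    header = f"{obj_type} {len(contents)}".encode() + b"\0"
    return hashlib.new(algo, header + contents).hexdigest()


data = b"hello\n"

plain = hashlib.sha1(data).hexdigest()  # what a sha1sum-style tool computes
named = object_id("blob", data)         # hash over header plus contents

print("plain hash: ", plain)
print("object name:", named)
print("same?", plain == named)
```

Passing `algo="sha256"` gives a 64-character name using the same framing, which is one way to picture the "hash function can be swapped" point without tying the explanation to SHA-1.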
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-07 17:02 ` Junio C Hamano
@ 2025-10-07 19:30 ` Julia Evans
2025-10-07 20:01 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-10-07 19:30 UTC (permalink / raw)
To: Junio C Hamano, Patrick Steinhardt; +Cc: Julia Evans, git

>> I think this needs to be adapted to not single out SHA-1 as the only
>> hashing algorithm. We already support SHA-256, so we should definitely
>> say that the algorithm can be swapped. Maybe something like:
>
> Good point. Also officially they are called "object name".

I hadn't realized that "object name" was the official name, it does
seem to be used a lot in the docs. I'm going to try something like this:

1. an *ID* (aka "object name"), which is a cryptographic hash of
   its type and contents.

I think it's useful to refer to this as an "ID", because usually we call
it a "commit ID" or "tag ID" and not a "commit name" or "tag name"
and it makes it more clear that "object name" and "commit ID"
refer to the same identifier.

>>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>>> +author Maya <maya@example.com> 1759173425 -0400
>>> +committer Maya <maya@example.com> 1759173425 -0400
>>> +
>>> +Add README
>>> +----
>>
>> In practice, commits can have other headers that are ignored by Git. But
>> that's certainly not part of Git's core data model, so I don't think we
>> should mention that here.
>
> Third-party software can add truly garbage ones that do not have any
> meaning, and Git tolerates by ignoring them. But there are others
> that Git does pay attention to, like encoding, gpgsig, etc., which
> may worth mention (in the form that "these four are what you typically
> see, but there may be others" without even naming any).

I didn't realize that there were other optional fields, will try to
communicate this somehow.

^ permalink raw reply [flat|nested] 89+ messages in thread
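To make the "four typical headers, but there may be others" point concrete, here is a small Python sketch that reads the example commit text into headers and a message while keeping headers it does not recognize. The function name `parse_commit` is hypothetical, and the extra `encoding` line is added to the example purely for illustration. Continuation lines (used by multi-line headers such as signatures) start with a space.

```python
# Sketch: parse the text of a commit object into headers and message.
# Unknown headers are kept rather than rejected. parse_commit is a
# hypothetical helper, not part of Git.

def parse_commit(text):
    head, _, message = text.partition("\n\n")
    headers = []  # list of (key, value); "parent" may appear several times
    for line in head.split("\n"):
        if line.startswith(" ") and headers:
            # Continuation of the previous (multi-line) header value.
            key, value = headers[-1]
            headers[-1] = (key, value + "\n" + line[1:])
        else:
            key, _, value = line.partition(" ")
            headers.append((key, value))
    return headers, message


example = (
    "tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a\n"
    "parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647\n"
    "author Maya <maya@example.com> 1759173425 -0400\n"
    "committer Maya <maya@example.com> 1759173425 -0400\n"
    "encoding ISO-8859-1\n"
    "\n"
    "Add README\n"
)

headers, message = parse_commit(example)
parents = [value for key, value in headers if key == "parent"]
print(len(parents), "parent(s); message:", message.strip())
```

Keeping headers as an ordered list of pairs, rather than a dictionary, is a deliberate choice here: a merge commit repeats the `parent` key, so a plain mapping would lose all but one parent.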
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-07 19:30 ` Julia Evans
@ 2025-10-07 20:01 ` Junio C Hamano
0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-10-07 20:01 UTC (permalink / raw)
To: Julia Evans; +Cc: Patrick Steinhardt, Julia Evans, git

"Julia Evans" <julia@jvns.ca> writes:

> I think it's useful to refer to this as an "ID", because usually we call it a
> "commit ID" or "tag ID" and not a "commit name" or "tag name"
> and it makes it more clear that "object name" and "commit ID"
> refer to the same identifier.

It is a bit funny that they do not exactly align.

  "object name" aka "object ID"
  "$type object name" aka "$type ID" for type in (commit, blob, tree, tag)

In any case, we should add "object ID" and other "$type ID" to the
glossary, if you are going to use it very often. We have entries
for spelled out "identifier" but I do not think "ID" is there yet.

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-07 14:32 ` Patrick Steinhardt
2025-10-07 17:02 ` Junio C Hamano
@ 2025-10-07 18:39 ` D. Ben Knoble
2025-10-07 18:55 ` Julia Evans
2 siblings, 0 replies; 89+ messages in thread
From: D. Ben Knoble @ 2025-10-07 18:39 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Julia Evans via GitGitGadget, git, Julia Evans

On Tue, Oct 7, 2025 at 11:51 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:

[snip]

> > + A branch is a name for a commit ID.
> > + That commit is the latest commit on the branch.
> > + Branches are stored in the `.git/refs/heads/` directory.
> > ++
> > +To get the history of commits on a branch, Git will start at the commit
> > +ID the branch references, and then look at the commit's parent(s),
> > +the parent's parent, etc.
> > +
> > +[[tag]]
> > +tags: `.git/refs/tags/<name>`::
> > + A tag is a name for a commit ID, tag object ID, or other object ID.
> > + Tags are stored in the `refs/tags/` directory.
> > ++
> > +Even though branches and commits are both "a name for a commit ID", Git
> > +treats them very differently.
> > +Branches are expected to be regularly updated as you work on the branch,
> > +but it's expected that a tag will never change after you create it.
>
> This sounds a bit like the user itself needs to update the branch. How
> about this instead:
>
>   Even though branches and commits are both "a name for a commit ID", Git
>   treats them very differently:
>
>   - Branches can be checked out directly. If so, creating a new
>     commit will automatically update the checked-out branch to
>     point to the new commit.
>
>   - Tags cannot be checked out directly and don't move when
>     creating a new commit. Instead, one can only check out the
>     commit that a branch points to. This is called "detached
>     HEAD", and the effect is that a new commit will not update

missing "the tag." ?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-07 14:32 ` Patrick Steinhardt
2025-10-07 17:02 ` Junio C Hamano
2025-10-07 18:39 ` D. Ben Knoble
@ 2025-10-07 18:55 ` Julia Evans
2025-10-08 4:18 ` Patrick Steinhardt
2 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-10-07 18:55 UTC (permalink / raw)
To: Patrick Steinhardt, Julia Evans; +Cc: git

On Tue, Oct 7, 2025, at 10:32 AM, Patrick Steinhardt wrote:
> On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
>> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
>> new file mode 100644
>> index 0000000000..4b2cb167dc
>> --- /dev/null
>> +++ b/Documentation/gitdatamodel.adoc
>> @@ -0,0 +1,226 @@
>> +gitdatamodel(7)
>> +===============
>> +
>> +NAME
>> +----
>> +gitdatamodel - Git's core data model
>> +
>> +DESCRIPTION
>> +-----------
>> +
>> +It's not necessary to understand Git's data model to use Git, but it's
>> +very helpful when reading Git's documentation so that you know what it
>> +means when the documentation says "object" "reference" or "index".
>
> There's a missing comma after "object".

Will fix.

>> +
>> +Git's core operations use 4 kinds of data:
>> +
>> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
>> +2. <<references,References>>: branches, tags,
>> + remote-tracking branches, etc
>> +3. <<index,The index>>, also known as the staging area
>> +4. <<reflogs,Reflogs>>
>
> This list makes sense to me. There's of course more data structures in
> Git, but all the other data structures shouldn't really matter to users
> at all as they are mostly caches or internal details of the on-disk
> format.
>
> There's potentially one exception though, namely the Git configuration.
> I'd claim that Git "uses" the Git configuration similarly to how it uses
> the others, but I get why it's not explicitly mentioned here.
>
>> +[[objects]]
>> +OBJECTS
>> +-------
>> +
>> +Commits, trees, blobs, and tag objects are all stored in Git's object database.
>> +Every object has:
>> +
>> +1. an *ID*, which is the SHA-1 hash of its contents.
>
> I think this needs to be adapted to not single out SHA-1 as the only
> hashing algorithm. We already support SHA-256, so we should definitely
> say that the algorithm can be swapped. Maybe something like:
>
>   An *object ID*, which is the cryptographic hash of its contents. By
>   default, Git uses SHA-1 as object hash, but alternative hashes like
>   SHA-256 are supported.

Makes sense. I might just say "cryptographic hash of its type and contents"
and leave it at that. I'm not sure it's worth getting into details of the
exact hash function.

>> + It's fast to look up a Git object using its ID.
>> + The ID is usually represented in hexadecimal, like
>> + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
>> +2. a *type*. There are 4 types of objects:
>> + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
>> + and <<tag-object,tag objects>>.
>> +3. *contents*. The structure of the contents depends on the type.
>
> Nit: every object also has an object size. Not sure though whether it's
> fine to imply that with "contents".

I think it is.

>> +Once an object is created, it can never be changed.
>> +Here are the 4 types of objects:
>> +
>> +[[commit]]
>> +commits::
>> + A commit contains:
>> ++
>> +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> + regular commits have 1 parent, merge commits have 2+ parents
>
> I'd say "at least two parents" instead of "2+ parents".
>
>> +2. A *commit message*
>> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
>> +4. An *author* and the time the commit was authored
>> +5. A *committer* and the time the commit was committed
>> ++
>> +Here's how an example commit is stored:
>> ++
>> +----
>> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
>> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
>> +author Maya <maya@example.com> 1759173425 -0400
>> +committer Maya <maya@example.com> 1759173425 -0400
>> +
>> +Add README
>> +----
>
> In practice, commits can have other headers that are ignored by Git. But
> that's certainly not part of Git's core data model, so I don't think we
> should mention that here.
>
>> +Like all other objects, commits can never be changed after they're created.
>> +For example, "amending" a commit with `git commit --amend` creates a new commit.
>> +The old commit will eventually be deleted by `git gc`.
>
> If we mention git-gc(1) I think it would make sense to use
> `linkgit:git-gc[1]` instead to provide a link to its man page.

Agreed.

>> +[[tree]]
>> +trees::
>> + A tree is how Git represents a directory. It lists, for each item in
>> + the tree:
>> ++
>> +1. The *permissions*, for example `100644`
>
> I think we should rather call these "mode bits". These bits are
> permissions indeed when you have a blob, but for subtrees, symlinks and
> submodules they aren't.

I think it's a bit strange to call them mode bits since I thought they were stored
as ASCII strings and it's basically an enum of 5 options, but I see your point.
I think "file mode" will work and that's used elsewhere.

I wonder if it would make sense to list all of the possible file modes if
this isn't documented anywhere else, my impression is that it's a short
list and that it's unlikely to change much in the future.

And listing them all might make it more clear that Git's file modes don't
have much in common with Unix file modes.
I looked for where this is documented and it looks like the only place is
in `man git-fast-import`. That man page says that there are just 5 options
(040000, 160000, 100644, 100755, 120000)

>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> + or <<commit,`commit`>> (a Git submodule)
>
> There's also symlinks.

I created a test symlink and it looks like symlinks are stored as type "blob".
I might say which type corresponds to which file mode,
though I'm not sure what type corresponds to the "gitlink" mode (commit?).

I think these are the 5 modes and what they mean / what type they
should have. Not sure about the gitlink mode though.

- `100644`: regular file (with type `blob`)
- `100755`: executable file (with type `blob`)
- `120000`: symbolic link (with type `blob`)
- `040000`: directory (with type `tree`)
- `160000`: gitlink, for use with submodules (with type `commit`)

>> +3. The *object ID*
>> +4. The *filename*
>> ++
>> +For example, this is how a tree containing one directory (`src`) and one file
>> +(`README.md`) is stored:
>> ++
>> +----
>> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
>> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
>> +----
>> ++
>> +*NOTE:* The permissions are in the same format as UNIX permissions, but
>> +the only allowed permissions for files (blobs) are 644 and 755.
>> +
>> +[[blob]]
>> +blobs::
>> + A blob is how Git represents a file. A blob object contains the
>> + file's contents.
>> ++
>> +Storing a new blob for every new version of a file can get big, so
>> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
>
> I would claim that it's not necessary to mention object compression.
> This should be a low-level detail that users don't ever have to worry
> about. Furthermore, packing objects isn't only relevant in the context
> of blobs: trees for example also tend to compress very well as there
> typically are only small incremental updates to trees.

I discussed why I think this is important in another reply,
https://lore.kernel.org/all/51e0a55c-1f1d-4cae-9459-8c2b9220e52d@app.fastmail.com/,
will paste what I said here.
I'll think about this more though.

paste follows:

That's true! The reason I think this is important to mention is that I find
that people often "reject" information that they find implausible, even
if it comes from a credible source. ("that can't be true! I must be
not understanding correctly. Oh well, I'll just ignore that!")

I sometimes hear from users that "commits can't be snapshots", because
it would take up too much disk space to store every version of
every commit. So I find that sometimes explaining a little bit about the
implementation can make the information more memorable.

Certainly I'm not able to remember details that don't make sense
with my mental model of how computers work and I don't expect other
people to either, so I think it's important to give an explanation that
handles the biggest "objections".

>> +[[tag-object]]
>> +tag objects::
>> + Tag objects (also known as "annotated tags") contain:
>> ++
>> +1. The *tagger* and tag date
>> +2. A *tag message*, similar to a commit message
>> +3. The *ID* of the object (often a commit) that they reference
>
> They can also be signed, if we want to mention that.

I guess that's true for commit objects too. Not sure whether to mention it
either, can add it if others think it's important.

>> +[[references]]
>> +REFERENCES
>> +----------
>> +
>> +References are a way to give a name to a commit.
>> +It's easier to remember "the changes I'm working on are on the `turtle`
>> +branch" than "the changes are in commit bb69721404348e".
>> +Git often uses "ref" as shorthand for "reference".
>> +
>> +References that you create are stored in the `.git/refs` directory,
>> +and Git has a few special internal references like `HEAD` that are stored
>> +in the base `.git` directory.
>
> This isn't true anymore with the introduction of the reftable backend,
> which is slated to become the default backend. I'd argue that this is
> another implementation detail that the user shouldn't have to worry
> about.
Makes sense, will fix (as well as other references to the .git prefix
and "subdirectories").

>> +References can either be:
>> +
>> +1. References to an object ID, usually a <<commit,commit>> ID
>> +2. References to another reference. This is called a "symbolic reference".
>> +
>> +Git handles references differently based on which subdirectory of
>> +`.git/refs` they're stored in.
>
> So instead of saying "subdirectory", I'd rather say "reference
> hierarchy".
>
> In general, I think we should explain that references are laid out
> in a hierarchy. This is somewhat obvious with the "files" backend, as we
> use directories there. But as we move on to the "reftable" backend this
> may become less obvious over time.

That makes sense.

>> +[[tag]]
>> +tags: `.git/refs/tags/<name>`::
>> + A tag is a name for a commit ID, tag object ID, or other object ID.
>> + Tags are stored in the `refs/tags/` directory.
>> ++
>> +Even though branches and commits are both "a name for a commit ID", Git
>> +treats them very differently.
>> +Branches are expected to be regularly updated as you work on the branch,
>> +but it's expected that a tag will never change after you create it.
>
> This sounds a bit like the user itself needs to update the branch. How
> about this instead:
>
>   Even though branches and commits are both "a name for a commit ID", Git
>   treats them very differently:
>
>   - Branches can be checked out directly. If so, creating a new
>     commit will automatically update the checked-out branch to
>     point to the new commit.
>
>   - Tags cannot be checked out directly and don't move when
>     creating a new commit. Instead, one can only check out the
>     commit that a branch points to. This is called "detached
>     HEAD", and the effect is that a new commit will not update

I think mentioning that branches can be checked out and that tags can't
is a good idea.

>> +[[HEAD]]
>> +HEAD: `.git/HEAD`::
>> + `HEAD` is where Git stores your current <<branch,branch>>.
>> + `HEAD` is normally a symbolic reference to your current branch, for
>> + example `ref: refs/heads/main` if your current branch is `main`.
>> + `HEAD` can also be a direct reference to a commit ID,
>> + that's called "detached HEAD state".
>> +
>> +[[remote-tracking-branch]]
>> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
>> + A remote-tracking branch is a name for a commit ID.
>> + It's how Git stores the last-known state of a branch in a remote
>> + repository. `git fetch` updates remote-tracking branches. When
>> + `git status` says "you're up to date with origin/main", it's looking at
>> + this.
>
> This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
> reference that indicates the default branch on the remote side.

Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
I've never thought about that reference and I'm not sure what to call it.

>> +[[other-refs]]
>> +Other references::
>> + Git tools may create references in any subdirectory of `.git/refs`.
>> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
>> + and linkgit:git-notes[1] all create their own references
>> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
>> + Third-party Git tools may also create their own references.
>> ++
>> +Git may also create references in the base `.git` directory
>> +other than `HEAD`, like `ORIG_HEAD`.
>
> Let's mention that such references are typically spelt all-uppercase
> with underscores between. You shouldn't ever create a reference that is
> for example called ".git/foo".
>
> We enforce this restriction inconsistently, only, but I don't think that
> should keep us from spelling out the common rule.

That makes sense. I'm also not sure whether third-party
Git tools are "supposed" to create references outside of "refs/",
or whether that's common.

>> +*NOTE:* As an optimization, references may be stored as packed
>> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
>
> I'd drop this note. It's an internal implementation detail and only true
> for the "files" backend. The "reftable" backend stores references quite
> differently and doesn't really "pack" references.
>
>> +[[index]]
>> +THE INDEX
>> +---------
>> +
>> +The index, also known as the "staging area", contains the current staged
>
> Honestly, I always forget which of these two nouns we are supposed to
> use nowadays. I think consensus was to use "index" and avoid using
> "staging area"? Not sure though, but I think we should only mention
> one of these.
>
>> +version of every file in your Git repository. When you commit, the files
>> +in the index are used as the files in the next commit.
>> +
>> +Unlike a tree, the index is a flat list of files.
>> +Each index entry has 4 fields:
>> +
>> +1. The *permissions*
>> +2. The *<<blob,blob>> ID* of the file
>> +3. The *filename*
>> +4. The *number*. This is normally 0, but if there's a merge conflict
>
> I think we don't call this "number", but "stage".

Thanks, I see that it's sometimes called "stage number" which is a little
easier to search for so I'll call it that.

>> + there can be multiple versions (with numbers 0, 1, 2, ..)
>> + of the same filename in the index.
>> +
>> +It's extremely uncommon to look at the index directly: normally you'd
>> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
>> +But you can use `git ls-files --stage` to see the index.
>> +Here's the output of `git ls-files --stage` in a repository with 2 files:
>> +
>> +----
>> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
>> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
>> +----
>> +
>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores the history of branch, tag, and HEAD refs in a reflog
>> +(you should read "reflog" as "ref log"). Not every ref is logged by
>> +default, but any ref can be logged.
>
> If we mention this here, do we maybe want to mention how the user can
> decide which references are logged?

Do you mean by using the setting `core.logAllRefUpdates`?

>> +Each reflog entry has:
>> +
>> +1. *Before/after *commit IDs*
>
> This will probably misformat as we have three asterisks here, not two.
>
>> +2. *User* who made the change, for example `Maya <maya@example.com>`
>> +3. *Timestamp*
>
> Suggestion: "*Timestamp* when that change has been made".

Makes sense.

>> +4. *Log message*, for example `pull: Fast-forward`
>> +
>> +Reflogs only log changes made in your local repository.
>> +They are not shared with remotes.
>
> We may want to mention that you can reference reflog entries via
> `refs/heads/<branch>@{<reflog-nr>}`.
>
> In general, one thing that I think would be important to highlight in
> this document is revisions. Most of the commands tend to not accept
> references, but revisions instead, which are a lot more flexible. They
> use our do-what-I-mean mechanism to resolve, but also allow the user to
> specify commits relative to one another. It's probably sufficient though
> to mention them briefly and then redirect to gitrevisions(7).

Will think about this, I'm not sure how to best incorporate that.
Maybe under the commits section.

> Thanks for working on this!

Thanks for the review!

- Julia

^ permalink raw reply [flat|nested] 89+ messages in thread
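The four index fields discussed in this message (file mode, blob ID, stage number, filename) can be read back out of `git ls-files --stage` output with a small Python sketch. `IndexEntry` and `parse_stage_line` are hypothetical names, and the sample text reuses the two entries from the patch; the parser splits on any whitespace so it works whether the filename is separated by a tab or by a space.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str      # file mode, for example "100644"
    blob_id: str   # ID of the staged blob
    stage: int     # stage number, normally 0
    path: str      # filename, relative to the repository root


def parse_stage_line(line: str) -> IndexEntry:
    """Parse one "<mode> <object> <stage> <file>" line."""
    mode, blob_id, rest = line.rstrip("\n").split(None, 2)
    stage, path = rest.split(None, 1)
    return IndexEntry(mode, blob_id, int(stage), path)


output = (
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0\tREADME.md\n"
    "100644 665c637a360874ce43bf74018768a96d2d4d219a 0\tsrc/hello.py\n"
)
entries = [parse_stage_line(line) for line in output.splitlines()]

# Paths with a non-zero stage number are the ones in a merge conflict.
conflicted = sorted({entry.path for entry in entries if entry.stage != 0})
print(entries[1].path, conflicted)
```

Note that `src/hello.py` appears as a single entry with its full path, which matches the "flat list of files" description of the index.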
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-07 18:55 ` Julia Evans
@ 2025-10-08 4:18 ` Patrick Steinhardt
2025-10-08 15:53 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Patrick Steinhardt @ 2025-10-08 4:18 UTC (permalink / raw)
To: Julia Evans; +Cc: Julia Evans, git

On Tue, Oct 07, 2025 at 02:55:37PM -0400, Julia Evans wrote:
> On Tue, Oct 7, 2025, at 10:32 AM, Patrick Steinhardt wrote:
> > On Fri, Oct 03, 2025 at 05:34:36PM +0000, Julia Evans via GitGitGadget wrote:
> >> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
> >> new file mode 100644
> >> index 0000000000..4b2cb167dc
> >> --- /dev/null
> >> +++ b/Documentation/gitdatamodel.adoc

[snip]

> >> +[[tree]]
> >> +trees::
> >> + A tree is how Git represents a directory. It lists, for each item in
> >> + the tree:
> >> ++
> >> +1. The *permissions*, for example `100644`
> >
> > I think we should rather call these "mode bits". These bits are
> > permissions indeed when you have a blob, but for subtrees, symlinks and
> > submodules they aren't.
>
> I think it's a bit strange to call them mode bits since I thought they were stored
> as ASCII strings and it's basically an enum of 5 options, but I see your point.
> I think "file mode" will work and that's used elsewhere.
>
> I wonder if it would make sense to list all of the possible file modes if
> this isn't documented anywhere else, my impression is that it's a short
> list and that it's unlikely to change much in the future.

Agreed, that seems reasonable to me.

> And listing them all might make it more clear that Git's file modes don't
> have much in common with Unix file modes.
> I looked for where this is documented and it looks like the only place is
> in `man git-fast-import`. That man page says that there are just 5 options
> (040000, 160000, 100644, 100755, 120000)
>
> >> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> >> + or <<commit,`commit`>> (a Git submodule)
> >
> > There's also symlinks.
>
> I created a test symlink and it looks like symlinks are stored as type "blob".
> I might say which type corresponds to which file mode,
> though I'm not sure what type corresponds to the "gitlink" mode (commit?).

Yeah, gitlinks are used for submodules. They point to an object ID that
refers to a commit in the submodule itself.

> I think these are the 5 modes and what they mean / what type they
> should have. Not sure about the gitlink mode though.
>
> - `100644`: regular file (with type `blob`)
> - `100755`: executable file (with type `blob`)
> - `120000`: symbolic link (with type `blob`)
> - `040000`: directory (with type `tree`)
> - `160000`: gitlink, for use with submodules (with type `commit`)

This list looks good to me. gitlinks are somewhat special given that
they refer to a commit stored in the submodule repository, not in the
repository that has the gitlink. But the expectation is that the object
name should always resolve to a commit indeed.

[snip]

> >> +[[blob]]
> >> +blobs::
> >> + A blob is how Git represents a file. A blob object contains the
> >> + file's contents.
> >> ++
> >> +Storing a new blob for every new version of a file can get big, so
> >> +`git gc` periodically compresses objects for efficiency in `.git/objects/pack`.
> >
> > I would claim that it's not necessary to mention object compression.
> > This should be a low-level detail that users don't ever have to worry
> > about. Furthermore, packing objects isn't only relevant in the context
> > of blobs: trees for example also tend to compress very well as there
> > typically are only small incremental updates to trees.
>
> I discussed why I think this is important in another reply,
> https://lore.kernel.org/all/51e0a55c-1f1d-4cae-9459-8c2b9220e52d@app.fastmail.com/,
> will paste what I said here. I'll think about this more though.
>
> paste follows:
>
> That's true! The reason I think this is important to mention is that I find
> that people often "reject" information that they find implausible, even
> if it comes from a credible source. ("that can't be true! I must be
> not understanding correctly. Oh well, I'll just ignore that!")
>
> I sometimes hear from users that "commits can't be snapshots", because
> it would take up too much disk space to store every version of
> every commit. So I find that sometimes explaining a little bit about the
> implementation can make the information more memorable.
>
> Certainly I'm not able to remember details that don't make sense
> with my mental model of how computers work and I don't expect other
> people to either, so I think it's important to give an explanation that
> handles the biggest "objections".

Hm, fair I guess. In any case, if we want to mention this I'd leave out
the details of how exactly Git achieves this. E.g. we could say something
like:

  Storing a new blob for every new version of a file can result in a
  lot of duplication. Git regularly runs repository maintenance to
  counteract this. Part of the maintenance involves
  compression of objects, where incremental changes to the same object
  are optimized to be stored as deltas, only.

We skip over the details, but this should give enough pointers to an
interested reader to go dig deeper. We could also generalize this to
objects in general, not only blobs.

[snip]

> >> +[[HEAD]]
> >> +HEAD: `.git/HEAD`::
> >> + `HEAD` is where Git stores your current <<branch,branch>>.
> >> + `HEAD` is normally a symbolic reference to your current branch, for
> >> + example `ref: refs/heads/main` if your current branch is `main`.
> >> + `HEAD` can also be a direct reference to a commit ID,
> >> + that's called "detached HEAD state".
> >> +
> >> +[[remote-tracking-branch]]
> >> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> >> + A remote-tracking branch is a name for a commit ID.
> >> + It's how Git stores the last-known state of a branch in a remote
> >> + repository. `git fetch` updates remote-tracking branches. When
> >> + `git status` says "you're up to date with origin/main", it's looking at
> >> + this.
> >
> > This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
> > reference that indicates the default branch on the remote side.
>
> Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
> I've never thought about that reference and I'm not sure what to call it.

No, it's not. I think the term we use is "remote reference".

> >> +[[other-refs]]
> >> +Other references::
> >> + Git tools may create references in any subdirectory of `.git/refs`.
> >> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> >> + and linkgit:git-notes[1] all create their own references
> >> + in `.git/refs/stash`, `.git/refs/bisect`, etc.
> >> + Third-party Git tools may also create their own references.
> >> ++
> >> +Git may also create references in the base `.git` directory
> >> +other than `HEAD`, like `ORIG_HEAD`.
> >
> > Let's mention that such references are typically spelt all-uppercase
> > with underscores between. You shouldn't ever create a reference that is
> > for example called ".git/foo".
> >
> > We enforce this restriction inconsistently, only, but I don't think that
> > should keep us from spelling out the common rule.
>
> That makes sense. I'm also not sure whether third-party
> Git tools are "supposed" to create references outside of "refs/",
> or whether that's common.

They really shouldn't, and to the best of my knowledge they don't. There
is only a rather limited number of root references with very specific
use cases. And nowadays we have also tightened the meaning of pseudo
refs, of which there are only two ("FETCH_HEAD" and "MERGE_HEAD").
[snip]

> >> +[[reflogs]]
> >> +REFLOGS
> >> +-------
> >> +
> >> +Git stores the history of branch, tag, and HEAD refs in a reflog
> >> +(you should read "reflog" as "ref log"). Not every ref is logged by
> >> +default, but any ref can be logged.
> >
> > If we mention this here, do we maybe want to mention how the user can
> > decide which references are logged?
>
> Do you mean by using the setting `core.logAllRefUpdates`?

Yeah. Otherwise the reader won't have any pointers to figure out _how_
they can change this. I don't think we have a man page that provides a
better overview than this configuration.

Thanks!

Patrick

^ permalink raw reply [flat|nested] 89+ messages in thread
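The five-mode list agreed on in this message reads naturally as a lookup table. The Python sketch below is only an illustration of that list; `TREE_ENTRY_MODES` and `entry_type` are hypothetical names, and padding the mode to six digits is an assumption made so that a directory mode written without its leading zero is still recognized.

```python
# The five tree-entry modes from the thread: mode -> (meaning, object type).
TREE_ENTRY_MODES = {
    "100644": ("regular file", "blob"),
    "100755": ("executable file", "blob"),
    "120000": ("symbolic link", "blob"),
    "040000": ("directory", "tree"),
    "160000": ("gitlink, for use with submodules", "commit"),
}


def entry_type(mode: str) -> str:
    """Return the object type expected for a tree entry with this mode."""
    normalized = mode.zfill(6)  # assumption: tolerate "40000" for "040000"
    try:
        return TREE_ENTRY_MODES[normalized][1]
    except KeyError:
        raise ValueError(f"unexpected tree entry mode: {mode!r}") from None


for mode in ("100644", "120000", "040000", "160000"):
    print(mode, entry_type(mode))
```

Treating the mode as a key in a closed table, rather than as permission bits to be decoded, mirrors the "basically an enum of 5 options" observation quoted above.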
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-08 4:18 ` Patrick Steinhardt
@ 2025-10-08 15:53 ` Junio C Hamano
2025-10-08 19:06 ` Julia Evans
0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-10-08 15:53 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Julia Evans, Julia Evans, git

Patrick Steinhardt <ps@pks.im> writes:

>> I sometimes hear from users that "commits can't be snapshots", because
>> it would take up too much disk space to store every version of
>> every commit. So I find that sometimes explaining a little bit about the
>> implementation can make the information more memorable.
>>
>> Certainly I'm not able to remember details that don't make sense
>> with my mental model of how computers work and I don't expect other
>> people to either, so I think it's important to give an explanation that
>> handles the biggest "objections".
>
> Hm, fair I guess. In any case, if we want to mention this I'd leave out
> the details of how exactly Git achieves this. E.g. we could say something
> like:
>
>   Storing a new blob for every new version of a file can result in a
>   lot of duplication. Git regularly runs repository maintenance to
>   counteract this. Part of the maintenance involves
>   compression of objects, where incremental changes to the same object
>   are optimized to be stored as deltas, only.
>
> We skip over the details, but this should give enough pointers to an
> interested reader to go dig deeper. We could also generalize this to
> objects in general, not only blobs.

Interesting. It is of course not wrong at all, but it was not what
I would have expected for the first explanation to help confused
folks who say "commits cannot be snapshots as they take too much
space".

To me, it was a realization that even in a project whose tree (think
of "du -s .") is huge, each of its commits touches only a handful
of paths, hence a large portion of that huge tree would be shared
with the previous snapshot.

>> > This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
>> > reference that indicates the default branch on the remote side.
>>
>> Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
>> I've never thought about that reference and I'm not sure what to call it.
>
> No, it's not. I think the term we use is "remote reference".

Honestly I didn't know/think we have any special terminology for the
refs/remotes/*/HEAD symref.

Historically HEAD did not "track" the remote state, and we did take
advantage of that fact to use it as a place to record the preference
with respect to which remote-tracking branch we would want to
primarily interact with.

But these days because the protocol is capable of expressing where
the symrefs point at, the users can make it track just like all
other refs inside the refs/remotes/*/ hierarchy. So I personally think
it is OK to call it a remote-tracking branch.

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-08 15:53 ` Junio C Hamano
@ 2025-10-08 19:06 ` Julia Evans
0 siblings, 0 replies; 89+ messages in thread
From: Julia Evans @ 2025-10-08 19:06 UTC (permalink / raw)
To: Junio C Hamano, Patrick Steinhardt; +Cc: Julia Evans, git

On Wed, Oct 8, 2025, at 11:53 AM, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
>>> I sometimes hear from users that "commits can't be snapshots", because
>>> it would take up too much disk space to store every version of
>>> every commit. So I find that sometimes explaining a little bit about the
>>> implementation can make the information more memorable.
>>>
>>> Certainly I'm not able to remember details that don't make sense
>>> with my mental model of how computers work and I don't expect other
>>> people to either, so I think it's important to give an explanation that
>>> handles the biggest "objections".
>>
>> Hm, fair I guess. In any case, if we want to mention this I'd leave away
>> the details how exactly Git achieves this. E.g. we could say something
>> like:
>>
>>     Storing a new blob for every new version of a file can result to a
>>     lot of duplication. Git regularly runs repository maintenance to
>>     optimize to counteract this. Part of the maintenance involves
>>     compression of objects, where incremental changes to the same object
>>     are optimized to be stored as deltas, only.
>>
>> We skip over the details, but this should give enough pointers to an
>> interested reader to go dig deeper. We could also generalize this to
>> objects in general, not only blobs.
>
> Interesting. It is of course not wrong at all, but it was not what
> I would have expected for the first explanation to help confused
> folks who say "commits cannot be snapshots as they take too much
> space".
>
> To me, it was a realization that even in a project whose tree (think
> of "du -s .") is huge, each of its commits touches only a handful
> of paths, hence a large portion of that huge tree would be shared
> with the previous snapshot.

That's a good point, I forgot that I've explained it that way too.
I might change it to that instead.

>>> > This misses "refs/remotes/<remote>/HEAD". This reference is a symbolic
>>> > reference that indicates the default branch on the remote side.
>>>
>>> Is "refs/remotes/<remote>/HEAD" a remote-tracking branch?
>>> I've never thought about that reference and I'm not sure what to call it.
>>
>> No, it's not. I think the term we use is "remote reference".
>
> Honestly I didn't know/think we have any special terminology for the
> refs/remotes/*/HEAD symref.
>
> Historically HEAD did not "track" the remote state, and we did take
> advantage of that fact to use it as a place to record the preference
> with respect to which remote-tracking branch we would want to
> primarily interact with.
>
> But these days because the protocol is capable of expressing where
> the symrefs point at, the users can make it track just like all
> other refs inside refs/remotes/*/ hierarchy. So I personally think
> it is OK to call it a remote-tracking branch.

I may just add this to the remote-tracking branch sentence then,
which is hopefully correct:

    `refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's
    default branch. This is the branch that `git clone` checks out by
    default.
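To make the symbolic-reference idea from this exchange concrete, here is
a minimal Python sketch. It is a toy model under stated assumptions, not
how Git stores or resolves refs: the `refs` dictionary, the `resolve`
helper, the `max_depth` guard, and the remote name `origin` are all made
up for illustration. The "ref: <name>" spelling and the commit ID are
borrowed from the examples in the patch.

```python
# Toy ref store (an assumption for illustration, not Git's storage format).
# A value is either an object ID or "ref: <other ref name>", the spelling
# the patch uses when it describes a symbolic HEAD.
refs = {
    "refs/remotes/origin/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
    "refs/remotes/origin/HEAD": "ref: refs/remotes/origin/main",
}


def resolve(name, store, max_depth=5):
    """Follow symbolic references until a direct reference is reached.

    Returns (final_ref_name, object_id). max_depth is an arbitrary guard
    against reference loops in this sketch.
    """
    for _ in range(max_depth):
        value = store[name]
        if not value.startswith("ref: "):
            return name, value          # direct reference to an object ID
        name = value[len("ref: "):]     # symbolic reference: follow it
    raise ValueError("too many levels of symbolic references")


final_ref, commit_id = resolve("refs/remotes/origin/HEAD", refs)
print(final_ref, commit_id)
```

In this model, `refs/remotes/origin/HEAD` resolves to the same commit ID
as `refs/remotes/origin/main`, which is the sense in which it names the
remote's default branch.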
* [PATCH v2] doc: add a explanation of Git's data model
2025-10-03 17:34 [PATCH] doc: add a explanation of Git's data model Julia Evans via GitGitGadget
` (2 preceding siblings ...)
2025-10-07 14:32 ` Patrick Steinhardt
@ 2025-10-08 13:53 ` Julia Evans via GitGitGadget
2025-10-10 11:51 ` Patrick Steinhardt
2025-10-14 21:12 ` [PATCH v3] " Julia Evans via GitGitGadget
2025-10-09 14:20 ` [PATCH] doc: add a " Julia Evans
4 siblings, 2 replies; 89+ messages in thread
From: Julia Evans via GitGitGadget @ 2025-10-08 13:53 UTC (permalink / raw)
To: git
Cc: Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt,
Julia Evans, Julia Evans

From: Julia Evans <julia@jvns.ca>

Git very often uses the terms "object", "reference", or "index" in its
documentation.

However, it's hard to find a clear explanation of these terms and how
they relate to each other in the documentation. The closest candidates
currently are:

1. `gitglossary`. This makes a good effort, but it's an alphabetically
   ordered dictionary and a dictionary is not a good way to learn
   concepts. You have to jump around too much and it's not possible to
   present the concepts in the order that they should be explained.
2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
   This is a nice document to have, but it's not necessary to learn how
   `update-index` works to understand Git's data model, and we should
   not be requiring users to learn how to use the "plumbing" commands
   if they want to learn what the term "index" or "object" means.
3. `gitrepository-layout`. This is a great resource, but it includes a
   lot of information about configuration and internal implementation
   details which are not related to the data model. It also does
   not explain how commits work.

The result of this is that Git users (even users who have been using
Git for 15+ years) struggle to read the documentation because they don't
know what the core terms mean, and it's not possible to add links
to help them learn more.

Add an explanation of Git's data model. Some choices I've made in
deciding what "core data model" means:

1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
   if those are intended to be user facing or if they're more like
   internal implementation details.
2. Don't talk about submodules other than by mentioning how they
   relate to trees. This is because Git has a lot of special features,
   and explaining how they all work exhaustively could quickly go
   down a rabbit hole which would make this document less useful for
   understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
   (first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into
   implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
---
doc: Add a explanation of Git's data model

Changes in v2:

The biggest change is to remove all mentions of the .git directory, and
explain references in a way that doesn't refer to "directories" at all,
and instead talks about the "hierarchy" (from Kristoffer and Patrick's
reviews).

Also:

* objects: Mention that an object ID is called an "object name", and
  update the glossary to include the term "object ID" (from Junio's
  review)
* objects: Replace "SHA-1 hash" with "cryptographic hash" which is more
  accurate (from Patrick's review)
* blobs: Made the explanation of git gc a little higher level and took
  some ideas from Patrick's suggested wording (from Patrick's and
  Kristoffer's reviews)
* commits: Mention that tag objects and commits can optionally have
  other fields. I didn't mention the GPG signature specifically, but
  don't have any objections to adding it. (from Patrick and Junio's
  reviews)
* commits: Remove one of the mentions of git gc, since it perhaps opens
  up too much of a rabbit hole: "how does git gc decide which commits to
  clean up?". (from Kristoffer's review)
* tag objects: Add an example of how a tag object is represented (from
  user feedback on the draft)
* index: Use the term "file mode" instead of "permissions", and list all
  allowed file modes (from Patrick's review)
* index: Use "stage number" instead of "number" for index entries (from
  Patrick's review)
* reflogs: Remove "any ref can be logged", it raises some questions of
  "how do you tell Git to log a ref that it isn't normally logging?" and
  my guess is that it's uncommon to ask Git to log more refs. I don't
  think it's a "lie" to omit this but I can bring it back if folks
  disagree. (from Patrick's review)
* reflogs: Fix an error I noticed in the explanation of reflogs: tags
  aren't logged by default and remote-tracking branches are, according
  to man git-config
* branches and tags: Be clearer about how branches are usually updated
  (by committing), and make it a little more obvious that only branches
  can be checked out. This is a bit tricky because using the word "check
  out" introduces a rabbit hole that I want to avoid (what does "check
  out" mean?). I've dealt with this by just talking about the "current
  branch" (HEAD) since that is defined here, and making it more explicit
  that HEAD must either be a branch or a commit, there's no "HEAD is a
  tag" option. (from Patrick's review)
* tags: Explain the differences between annotated and lightweight tags
  (this is the main piece of user feedback I've gotten on the draft so
  far)
* Various style/typo changes ("2 or more", linkgit:git-gc[1], removed
  extra asterisks, added empty SYNOPSIS, "commits -> tags" typo fix, add
  to meson build)

non-changes:

* I still haven't mentioned things that aren't part of the "data model",
  like revision params and configuration. I think there could be a place
  for them but I haven't found it yet.
* tag objects: I noticed that there's a "tag" header field in tag objects (like tag v1.0.0) but I didn't mention it yet because I couldn't figure out what the purpose of that field is (I thought the tag name was stored in the reference, why is it duplicated in the tag object?) Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1981/jvns/gitdatamodel-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/1981 Range-diff vs v1: 1: fcbd21b6da ! 1: 3b38a88dc7 doc: add a explanation of Git's data model @@ Commit message down a rabbit hole which would make this document less useful for understanding Git's core behaviour. 3. Don't discuss the structure of a commit message - (first line, trailers, GPG signatures, etc). - Perhaps this should change. - - Some other choices I've made: - - 1. Mention packed refs only in a note. - 2. Don't mention that the full name of the branch `main` is - technically `refs/heads/main`. This should likely change but I - haven't worked out how to do it in a clear way yet. - 3. Mostly avoid referring to the `.git` directory, because the exact - details of how things are stored change over time. - This should perhaps change from "mostly" to "entirely" - but I haven't worked out how to do that in a clear way yet. + (first line, trailers etc). + 4. Don't mention configuration. + 5. Don't mention the `.git` directory, to avoid getting too much into + implementation details Signed-off-by: Julia Evans <julia@jvns.ca> @@ Documentation/gitdatamodel.adoc (new) +---- +gitdatamodel - Git's core data model + ++SYNOPSIS ++-------- ++gitdatamodel ++ +DESCRIPTION +----------- + +It's not necessary to understand Git's data model to use Git, but it's +very helpful when reading Git's documentation so that you know what it -+means when the documentation says "object" "reference" or "index". 
++means when the documentation says "object", "reference" or "index". + +Git's core operations use 4 kinds of data: + @@ Documentation/gitdatamodel.adoc (new) +Commits, trees, blobs, and tag objects are all stored in Git's object database. +Every object has: + -+1. an *ID*, which is the SHA-1 hash of its contents. ++1. an *ID* (aka "object name"), which is a cryptographic hash of its ++ type and contents. + It's fast to look up a Git object using its ID. -+ The ID is usually represented in hexadecimal, like ++ This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, @@ Documentation/gitdatamodel.adoc (new) + +[[commit]] +commits:: -+ A commit contains: ++ A commit contains these required fields ++ (though there are other optional fields): ++ +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, -+ regular commits have 1 parent, merge commits have 2+ parents ++ regular commits have 1 parent, merge commits have 2 or more parents +2. A *commit message* +3. All the *files* in the commit, stored as a *<<tree,tree>>* +4. An *author* and the time the commit was authored @@ Documentation/gitdatamodel.adoc (new) ++ +Like all other objects, commits can never be changed after they're created. +For example, "amending" a commit with `git commit --amend` creates a new commit. -+The old commit will eventually be deleted by `git gc`. + +[[tree]] +trees:: + A tree is how Git represents a directory. It lists, for each item in + the tree: ++ -+1. The *permissions*, for example `100644` ++1. The *file mode*, for example `100644` +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), + or <<commit,`commit`>> (a Git submodule) +3. 
The *object ID* @@ Documentation/gitdatamodel.adoc (new) +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- ++ -+*NOTE:* The permissions are in the same format as UNIX permissions, but -+the only allowed permissions for files (blobs) are 644 and 755. ++Git only supports these file modes: +++ ++ - `100644`: regular file (with type `blob`) ++ - `100755`: executable file (with type `blob`) ++ - `120000`: symbolic link (with type `blob`) ++ - `040000`: directory (with type `tree`) ++ - `160000`: gitlink, for use with submodules (with type `commit`) + +[[blob]] +blobs:: + A blob is how Git represents a file. A blob object contains the + file's contents. ++ -+Storing a new blob for every new version of a file can get big, so -+`git gc` periodically compresses objects for efficiency in `.git/objects/pack`. ++ ++NOTE: Storing a new blob for every new version of a file can use a ++lot of disk space. To handle this, Git periodically runs repository ++maintenance with linkgit:git-gc[1]. Part of this maintenance is ++compressing objects so that if a small part of a file was changed, only ++the change is stored instead of the whole file. + +[[tag-object]] +tag objects:: -+ Tag objects (also known as "annotated tags") contain: ++ Tag objects (also known as "annotated tags") contain these required fields ++ (though there are other optional fields): ++ +1. The *tagger* and tag date +2. A *tag message*, similar to a commit message -+3. The *ID* of the object (often a commit) that they reference ++3. The *ID* and *type* of the object (often a commit) that they reference ++ ++Here's how an example tag object is stored: ++ ++---- ++object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 ++type commit ++tag v1.0.0 ++tagger Maya <maya@example.com> 1759927359 -0400 ++ ++Release version 1.0.0 ++---- + +[[references]] +REFERENCES @@ Documentation/gitdatamodel.adoc (new) +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". 
+ -+References that you create are stored in the `.git/refs` directory, -+and Git has a few special internal references like `HEAD` that are stored -+in the base `.git` directory. -+ +References can either be: + +1. References to an object ID, usually a <<commit,commit>> ID +2. References to another reference. This is called a "symbolic reference". + -+Git handles references differently based on which subdirectory of -+`.git/refs` they're stored in. -+Here are the main types: ++References are stored in a hierarchy, and Git handles references ++differently based on where they are in the hierarchy. ++Most references are under `refs/`. Here are the main types: + +[[branch]] -+branches: `.git/refs/heads/<name>`:: ++branches: `refs/heads/<name>`:: + A branch is a name for a commit ID. + That commit is the latest commit on the branch. -+ Branches are stored in the `.git/refs/heads/` directory. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] -+tags: `.git/refs/tags/<name>`:: ++tags: `refs/tags/<name>`:: + A tag is a name for a commit ID, tag object ID, or other object ID. -+ Tags are stored in the `refs/tags/` directory. ++ Tags that reference a tag object ID are called "annotated tags", ++ because the tag object contains a tag message. ++ Tags that reference a commit ID, blob ID, or tree ID are ++ called "lightweight tags". ++ -+Even though branches and commits are both "a name for a commit ID", Git ++Even though branches and tags are both "a name for a commit ID", Git +treats them very differently. -+Branches are expected to be regularly updated as you work on the branch, -+but it's expected that a tag will never change after you create it. ++Branches are expected to change over time: when you make a commit, Git ++will update your <<HEAD,current branch>> to reference the new changes. 
++It's expected that a tag will never change after you create it. + +[[HEAD]] -+HEAD: `.git/HEAD`:: ++HEAD: `HEAD`:: + `HEAD` is where Git stores your current <<branch,branch>>. -+ `HEAD` is normally a symbolic reference to your current branch, for -+ example `ref: refs/heads/main` if your current branch is `main`. -+ `HEAD` can also be a direct reference to a commit ID, -+ that's called "detached HEAD state". ++ `HEAD` can either be: ++ 1. A symbolic reference to your current branch, for example `ref: ++ refs/heads/main` if your current branch is `main`. ++ 2. A direct reference to a commit ID. ++ This is called "detached HEAD state". + +[[remote-tracking-branch]] -+remote tracking branches: `.git/refs/remotes/<remote>/<branch>`:: ++remote tracking branches: `refs/remotes/<remote>/<branch>`:: + A remote-tracking branch is a name for a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When @@ Documentation/gitdatamodel.adoc (new) + +[[other-refs]] +Other references:: -+ Git tools may create references in any subdirectory of `.git/refs`. ++ Git tools may create references anywhere under `refs/`. + For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references -+ in `.git/refs/stash`, `.git/refs/bisect`, etc. ++ in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ -+Git may also create references in the base `.git` directory -+other than `HEAD`, like `ORIG_HEAD`. -+ -+*NOTE:* As an optimization, references may be stored as packed -+refs instead of in `.git/refs`. See linkgit:git-pack-refs[1]. ++Git may also create references other than `HEAD` at the base of the ++hierarchy, like `ORIG_HEAD`. + +[[index]] +THE INDEX @@ Documentation/gitdatamodel.adoc (new) +1. The *permissions* +2. The *<<blob,blob>> ID* of the file +3. The *filename* -+4. The *number*. 
This is normally 0, but if there's a merge conflict ++4. The *stage number*. This is normally 0, but if there's a merge conflict + there can be multiple versions (with numbers 0, 1, 2, ..) + of the same filename in the index. + @@ Documentation/gitdatamodel.adoc (new) +REFLOGS +------- + -+Git stores the history of branch, tag, and HEAD refs in a reflog -+(you should read "reflog" as "ref log"). Not every ref is logged by -+default, but any ref can be logged. ++Git stores the history of your branch, remote-tracking branch, and HEAD refs ++in a reflog (you should read "reflog" as "ref log"). + +Each reflog entry has: + -+1. *Before/after *commit IDs* ++1. Before/after *commit IDs* +2. *User* who made the change, for example `Maya <maya@example.com>` -+3. *Timestamp* ++3. *Timestamp* when the change was made +4. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. @@ Documentation/gitdatamodel.adoc (new) +GIT +--- +Part of the linkgit:git[1] suite + + ## Documentation/glossary-content.adoc ## +@@ Documentation/glossary-content.adoc: This commit is referred to as a "merge commit", or sometimes just a + identified by its <<def_object_name,object name>>. The objects usually + live in `$GIT_DIR/objects/`. + +-[[def_object_identifier]]object identifier (oid):: +- Synonym for <<def_object_name,object name>>. ++[[def_object_identifier]]object identifier, object ID, oid:: ++ Synonyms for <<def_object_name,object name>>. + + [[def_object_name]]object name:: + The unique identifier of an <<def_object,object>>. 
The + + ## Documentation/meson.build ## +@@ Documentation/meson.build: manpages = { + 'gitcore-tutorial.adoc' : 7, + 'gitcredentials.adoc' : 7, + 'gitcvs-migration.adoc' : 7, ++ 'gitdatamodel.adoc' : 7, + 'gitdiffcore.adoc' : 7, + 'giteveryday.adoc' : 7, + 'gitfaq.adoc' : 7, Documentation/Makefile | 1 + Documentation/gitdatamodel.adoc | 248 ++++++++++++++++++++++++++++ Documentation/glossary-content.adoc | 4 +- Documentation/meson.build | 1 + 4 files changed, 252 insertions(+), 2 deletions(-) create mode 100644 Documentation/gitdatamodel.adoc diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fb83d0c6e..5f4acfacbd 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc MAN7_TXT += gitcore-tutorial.adoc MAN7_TXT += gitcredentials.adoc MAN7_TXT += gitcvs-migration.adoc +MAN7_TXT += gitdatamodel.adoc MAN7_TXT += gitdiffcore.adoc MAN7_TXT += giteveryday.adoc MAN7_TXT += gitfaq.adoc diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc new file mode 100644 index 0000000000..c3a25ea8d2 --- /dev/null +++ b/Documentation/gitdatamodel.adoc @@ -0,0 +1,248 @@ +gitdatamodel(7) +=============== + +NAME +---- +gitdatamodel - Git's core data model + +SYNOPSIS +-------- +gitdatamodel + +DESCRIPTION +----------- + +It's not necessary to understand Git's data model to use Git, but it's +very helpful when reading Git's documentation so that you know what it +means when the documentation says "object", "reference" or "index". + +Git's core operations use 4 kinds of data: + +1. <<objects,Objects>>: commits, trees, blobs, and tag objects +2. <<references,References>>: branches, tags, + remote-tracking branches, etc +3. <<index,The index>>, also known as the staging area +4. <<reflogs,Reflogs>> + +[[objects]] +OBJECTS +------- + +Commits, trees, blobs, and tag objects are all stored in Git's object database. +Every object has: + +1. 
an *ID* (aka "object name"), which is a cryptographic hash of its + type and contents. + It's fast to look up a Git object using its ID. + This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, + and <<tag-object,tag objects>>. +3. *contents*. The structure of the contents depends on the type. + +Once an object is created, it can never be changed. +Here are the 4 types of objects: + +[[commit]] +commits:: + A commit contains these required fields + (though there are other optional fields): ++ +1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +2. A *commit message* +3. All the *files* in the commit, stored as a *<<tree,tree>>* +4. An *author* and the time the commit was authored +5. A *committer* and the time the commit was committed ++ +Here's how an example commit is stored: ++ +---- +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 +author Maya <maya@example.com> 1759173425 -0400 +committer Maya <maya@example.com> 1759173425 -0400 + +Add README +---- ++ +Like all other objects, commits can never be changed after they're created. +For example, "amending" a commit with `git commit --amend` creates a new commit. + +[[tree]] +trees:: + A tree is how Git represents a directory. It lists, for each item in + the tree: ++ +1. The *file mode*, for example `100644` +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), + or <<commit,`commit`>> (a Git submodule) +3. The *object ID* +4. 
The *filename* ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: ++ +---- +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- ++ +Git only supports these file modes: ++ + - `100644`: regular file (with type `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `040000`: directory (with type `tree`) + - `160000`: gitlink, for use with submodules (with type `commit`) + +[[blob]] +blobs:: + A blob is how Git represents a file. A blob object contains the + file's contents. ++ + +NOTE: Storing a new blob for every new version of a file can use a +lot of disk space. To handle this, Git periodically runs repository +maintenance with linkgit:git-gc[1]. Part of this maintenance is +compressing objects so that if a small part of a file was changed, only +the change is stored instead of the whole file. + +[[tag-object]] +tag objects:: + Tag objects (also known as "annotated tags") contain these required fields + (though there are other optional fields): ++ +1. The *tagger* and tag date +2. A *tag message*, similar to a commit message +3. The *ID* and *type* of the object (often a commit) that they reference + +Here's how an example tag object is stored: + +---- +object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 +type commit +tag v1.0.0 +tagger Maya <maya@example.com> 1759927359 -0400 + +Release version 1.0.0 +---- + +[[references]] +REFERENCES +---------- + +References are a way to give a name to a commit. +It's easier to remember "the changes I'm working on are on the `turtle` +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + +References can either be: + +1. References to an object ID, usually a <<commit,commit>> ID +2. References to another reference. This is called a "symbolic reference". 
+ +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. +Most references are under `refs/`. Here are the main types: + +[[branch]] +branches: `refs/heads/<name>`:: + A branch is a name for a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] +tags: `refs/tags/<name>`:: + A tag is a name for a commit ID, tag object ID, or other object ID. + Tags that reference a tag object ID are called "annotated tags", + because the tag object contains a tag message. + Tags that reference a commit ID, blob ID, or tree ID are + called "lightweight tags". ++ +Even though branches and tags are both "a name for a commit ID", Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git +will update your <<HEAD,current branch>> to reference the new changes. +It's expected that a tag will never change after you create it. + +[[HEAD]] +HEAD: `HEAD`:: + `HEAD` is where Git stores your current <<branch,branch>>. + `HEAD` can either be: + 1. A symbolic reference to your current branch, for example `ref: + refs/heads/main` if your current branch is `main`. + 2. A direct reference to a commit ID. + This is called "detached HEAD state". + +[[remote-tracking-branch]] +remote tracking branches: `refs/remotes/<remote>/<branch>`:: + A remote-tracking branch is a name for a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at + this. + +[[other-refs]] +Other references:: + Git tools may create references anywhere under `refs/`. 
+ For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references + in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + +[[index]] +THE INDEX +--------- + +The index, also known as the "staging area", contains the current staged +version of every file in your Git repository. When you commit, the files +in the index are used as the files in the next commit. + +Unlike a tree, the index is a flat list of files. +Each index entry has 4 fields: + +1. The *permissions* +2. The *<<blob,blob>> ID* of the file +3. The *filename* +4. The *stage number*. This is normally 0, but if there's a merge conflict + there can be multiple versions (with numbers 0, 1, 2, ..) + of the same filename in the index. + +It's extremely uncommon to look at the index directly: normally you'd +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>. +But you can use `git ls-files --stage` to see the index. +Here's the output of `git ls-files --stage` in a repository with 2 files: + +---- +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py +---- + +[[reflogs]] +REFLOGS +------- + +Git stores the history of your branch, remote-tracking branch, and HEAD refs +in a reflog (you should read "reflog" as "ref log"). + +Each reflog entry has: + +1. Before/after *commit IDs* +2. *User* who made the change, for example `Maya <maya@example.com>` +3. *Timestamp* when the change was made +4. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. +They are not shared with remotes. 
+ +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc index e423e4765b..20ba121314 100644 --- a/Documentation/glossary-content.adoc +++ b/Documentation/glossary-content.adoc @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a identified by its <<def_object_name,object name>>. The objects usually live in `$GIT_DIR/objects/`. -[[def_object_identifier]]object identifier (oid):: - Synonym for <<def_object_name,object name>>. +[[def_object_identifier]]object identifier, object ID, oid:: + Synonyms for <<def_object_name,object name>>. [[def_object_name]]object name:: The unique identifier of an <<def_object,object>>. The diff --git a/Documentation/meson.build b/Documentation/meson.build index e34965c5b0..ace0573e82 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -192,6 +192,7 @@ manpages = { 'gitcore-tutorial.adoc' : 7, 'gitcredentials.adoc' : 7, 'gitcvs-migration.adoc' : 7, + 'gitdatamodel.adoc' : 7, 'gitdiffcore.adoc' : 7, 'giteveryday.adoc' : 7, 'gitfaq.adoc' : 7, base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8 -- gitgitgadget ^ permalink raw reply related [flat|nested] 89+ messages in thread
* Re: [PATCH v2] doc: add a explanation of Git's data model
2025-10-08 13:53 ` [PATCH v2] " Julia Evans via GitGitGadget
@ 2025-10-10 11:51 ` Patrick Steinhardt
2025-10-13 14:48 ` Junio C Hamano
2025-10-14 21:12 ` [PATCH v3] " Julia Evans via GitGitGadget
1 sibling, 1 reply; 89+ messages in thread
From: Patrick Steinhardt @ 2025-10-10 11:51 UTC (permalink / raw)
To: Julia Evans via GitGitGadget
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans

On Wed, Oct 08, 2025 at 01:53:41PM +0000, Julia Evans via GitGitGadget wrote:
[snip]
> +[[blob]]
> +blobs::
> + A blob is how Git represents a file. A blob object contains the
> + file's contents.
> ++
> +
> +NOTE: Storing a new blob for every new version of a file can use a
> +lot of disk space. To handle this, Git periodically runs repository
> +maintenance with linkgit:git-gc[1]. Part of this maintenance is

By the way, this isn't true nowadays: Git does not use `git gc --auto`
anymore, but instead `git maintenance run --auto`. So we really should
be linking to "linkgit:git-maintenance[1]".

This tool _by default_ executes git-gc(1). But it can be configured to
use alternative strategies, and when using scalar(1) we actually use a
different strategy.

[snip]
> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".
> +
> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic reference".
> +
> +References are stored in a hierarchy, and Git handles references
> +differently based on where they are in the hierarchy.
> +Most references are under `refs/`. Here are the main types:

Not quite true. Pseudo refs are outside the hierarchy and are in fact
treated differently. But root refs are treated the same as any other
reference.

    References are stored in a hierarchy. While most references are
    stored in the "refs/" hierarchy, some references with special
    meaning like for example "HEAD" are stored directly in the root of
    the hierarchy.

I don't really think we should get into root refs vs pseudo refs here,
so maybe this is sufficient?

[snip]
> +[[other-refs]]
> +Other references::
> + Git tools may create references anywhere under `refs/`.
> + For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> + and linkgit:git-notes[1] all create their own references
> + in `refs/stash`, `refs/bisect`, etc.
> + Third-party Git tools may also create their own references.
> ++
> +Git may also create references other than `HEAD` at the base of the
> +hierarchy, like `ORIG_HEAD`.

Maybe append: "These references are called root refs (see
linkgit:gitglossary[7])."

Patrick
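As a rough picture of the hierarchy being discussed, the Python sketch
below sorts reference names into the categories the patch lists, purely
by where the name sits in the hierarchy. The `classify` helper and its
category labels are made up for illustration. It deliberately lumps
every name outside `refs/` into one bucket, so it does not capture the
root ref versus pseudo ref distinction raised in this review.

```python
def classify(ref: str) -> str:
    """Classify a reference name by its position in the hierarchy.

    Hypothetical helper for illustration only: the labels follow the
    section titles in the patch, and anything outside "refs/" is reported
    as sitting at the root, without separating root refs from pseudo refs.
    """
    if ref.startswith("refs/heads/"):
        return "branch"
    if ref.startswith("refs/tags/"):
        return "tag"
    if ref.startswith("refs/remotes/"):
        return "remote-tracking branch"
    if ref.startswith("refs/"):
        return "other reference"
    return "root of the hierarchy"


for name in [
    "refs/heads/main",
    "refs/tags/v1.0.0",
    "refs/remotes/origin/main",
    "refs/stash",
    "HEAD",
    "ORIG_HEAD",
]:
    print(f"{name} -> {classify(name)}")
```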
* Re: [PATCH v2] doc: add a explanation of Git's data model
2025-10-10 11:51 ` Patrick Steinhardt
@ 2025-10-13 14:48 ` Junio C Hamano
2025-10-14 5:45 ` Patrick Steinhardt
0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-10-13 14:48 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Julia Evans via GitGitGadget, git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans

Patrick Steinhardt <ps@pks.im> writes:

> On Wed, Oct 08, 2025 at 01:53:41PM +0000, Julia Evans via GitGitGadget wrote:
> [snip]
>> +[[blob]]
>> +blobs::
>> + A blob is how Git represents a file. A blob object contains the
>> + file's contents.
>> ++
>> +
>> +NOTE: Storing a new blob for every new version of a file can use a
>> +lot of disk space. To handle this, Git periodically runs repository
>> +maintenance with linkgit:git-gc[1]. Part of this maintenance is
>
> By the way, this isn't true nowadays: Git does not use `git gc --auto`
> anymore, but instead `git maintenance run --auto`. So we really should
> be linking to "linkgit:git-maintenance[1]".
>
> This tool _by default_ executes git-gc(1). But it can be configured to
> use alternative strategies, and when using scalar(1) we actually use a
> different strategy.

For the curious, this happened around a95ce124 (maintenance: replace
run_auto_gc(), 2020-09-17).

> Not quite true. Pseudo refs are outside the hierarchy and are in fact
> treated differently. But root refs are treated the same as any other
> reference.
>
>     References are stored in a hierarchy. While most references are
>     stored in the "refs/" hierarchy, some references with special
>     meaning like for example "HEAD" are stored directly in the root of
>     the hierarchy.
>
> I don't really think we should get into root refs vs pseudo refs here,
> so maybe this is sufficient?

I do not think "root ref" (or pseudo for that matter) is a concept
that has no use in this context. If this is really about data
model, where you find refs (or what the "pathname looking" thing
exactly looks like that names your refs) should be immaterial. It
does help to know that HEAD is just a ref. It also would help to
know there are symbolic refs that point at other refs, which is much
more relevant to the data model.

Thanks.
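As a rough illustration of the point above that HEAD is just a ref, and that symbolic refs point at other refs, here is a minimal Python sketch of a toy model. It is not how Git implements ref resolution: the `refs` dictionary, the `resolve_ref` helper, and the depth limit are assumptions made up for this example. The `ref: ` prefix and the commit ID are borrowed from the examples in the patch.

```python
# Toy model only: a dict stands in for the ref store. Real Git keeps refs
# in files or other backends and applies more rules than shown here.
SYMREF_PREFIX = "ref: "


def resolve_ref(refs, name, max_depth=5):
    """Follow symbolic refs until a direct ref (an object ID) is reached.

    `max_depth` is an arbitrary guard against symbolic refs that loop.
    """
    for _ in range(max_depth):
        value = refs[name]
        if not value.startswith(SYMREF_PREFIX):
            return value  # direct ref: the value is an object ID
        name = value[len(SYMREF_PREFIX):]  # symbolic ref: follow it
    raise ValueError("symbolic ref chain too deep")


refs = {
    # HEAD is a symbolic ref to the current branch ...
    "HEAD": "ref: refs/heads/main",
    # ... and the branch is a direct ref to a commit ID.
    "refs/heads/main": "750b4ead9c87ceb3ddb7a390e6c7074521797fb3",
}

print(resolve_ref(refs, "HEAD"))
```

In this model nothing about `HEAD` is special except its value: it resolves through the same lookup as any other name.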
* Re: [PATCH v2] doc: add a explanation of Git's data model
2025-10-13 14:48 ` Junio C Hamano
@ 2025-10-14 5:45 ` Patrick Steinhardt
2025-10-14 9:18 ` Julia Evans
0 siblings, 1 reply; 89+ messages in thread
From: Patrick Steinhardt @ 2025-10-14 5:45 UTC (permalink / raw)
To: Junio C Hamano
Cc: Julia Evans via GitGitGadget, git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans

On Mon, Oct 13, 2025 at 07:48:15AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > On Wed, Oct 08, 2025 at 01:53:41PM +0000, Julia Evans via GitGitGadget wrote:
> > [snip]
> > Not quite true. Pseudo refs are outside the hierarchy and are in fact
> > treated differently. But root refs are treated the same as any other
> > reference.
> >
> >     References are stored in a hierarchy. While most references are
> >     stored in the "refs/" hierarchy, some references with special
> >     meaning like for example "HEAD" are stored directly in the root of
> >     the hierarchy.
> >
> > I don't really think we should get into root refs vs pseudo refs here,
> > so maybe this is sufficient?
>
> I do not think "root ref" (or pseudo for that matter) is a concept
> that has no use in this context. If this is really about data
> model, where you find refs (or what the "pathname looking" thing
> exactly looks like that names your refs) should be immaterial. It
> does help to know that HEAD is just a ref. It also would help to
> know there are symbolic refs that point at other refs, which is much
> more relevant to the data model.

Yeah, I don't necessarily think that we need to mention root refs here.
But what I think we need to avoid is the following sentence, as it is
misleading:

    References are stored in a hierarchy, and Git handles references
    differently based on where they are in the hierarchy.

Pseudo refs are stored outside of the hierarchy and are indeed handled
differently. But root refs are stored outside of the hierarchy and are
treated the same as any other ref, even though they of course have
special meaning to some commands.

So maybe something like this would be preferable:

    References are stored in a hierarchy. References that sit at the
    root of the hierarchy often have special meaning to Git commands,
    like for example "HEAD" or "REBASE_HEAD".

It hints at the fact that these references are special, but not in how
they are handled but rather in what they mean. It doesn't go into our
two pseudo refs at all, but given that there's only FETCH_HEAD and
MERGE_HEAD I don't think we should explain them. The water is getting
somewhat murky around pseudorefs anyway, so it probably only causes more
confusion.

Patrick
* Re: [PATCH v2] doc: add a explanation of Git's data model
2025-10-14 5:45 ` Patrick Steinhardt
@ 2025-10-14 9:18 ` Julia Evans
2025-10-14 11:45 ` Patrick Steinhardt
2025-10-14 13:39 ` Junio C Hamano
0 siblings, 2 replies; 89+ messages in thread
From: Julia Evans @ 2025-10-14 9:18 UTC (permalink / raw)
To: Patrick Steinhardt, Junio C Hamano
Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble

On Tue, Oct 14, 2025, at 1:45 AM, Patrick Steinhardt wrote:
> On Mon, Oct 13, 2025 at 07:48:15AM -0700, Junio C Hamano wrote:
>> Patrick Steinhardt <ps@pks.im> writes:
>> > On Wed, Oct 08, 2025 at 01:53:41PM +0000, Julia Evans via GitGitGadget wrote:
>> > [snip]
>> > Not quite true. Pseudo refs are outside the hierarchy and are in fact
>> > treated differently. But root refs are treated the same as any other
>> > reference.
>> >
>> >     References are stored in a hierarchy. While most references are
>> >     stored in the "refs/" hierarchy, some references with special
>> >     meaning like for example "HEAD" are stored directly in the root of
>> >     the hierarchy.
>> >
>> > I don't really think we should get into root refs vs pseudo refs here,
>> > so maybe this is sufficient?
>>
>> I do not think "root ref" (or pseudo for that matter) is a concept
>> that has no use in this context. If this is really about data
>> model, where you find refs (or what the "pathname looking" thing
>> exactly looks like that names your refs) should be immaterial. It
>> does help to know that HEAD is just a ref. It also would help to
>> know there are symbolic refs that point at other refs, which is much
>> more relevant to the data model.
>
> Yeah, I don't necessarily think that we need to mention root refs here.
> But what I think we need to avoid is the following sentence, as it is
> misleading:
>
>     References are stored in a hierarchy, and Git handles references
>     differently based on where they are in the hierarchy.
>

Why do you say that it’s misleading? (what do you think it’s implying
that is not true?)

What I’m trying to communicate is that branches, tags, etc are treated
differently from each other and that Git knows how to handle them
based on where they are in the hierarchy.

> Pseudo refs are stored outside of the hierarchy and are indeed handled
> differently. But root refs are stored outside of the hierarchy and are
> treated the same as any other ref, even though they of course have
> special meaning to some commands.
>
> So maybe something like this would be preferable:
>
>     References are stored in a hierarchy. References that sit at the
>     root of the hierarchy often have special meaning to Git commands,
>     like for example "HEAD" or "REBASE_HEAD".
>
> It hints at the fact that these references are special, but not in how
> they are handled but rather in what they mean. It doesn't go into our
> two pseudo refs at all, but given that there's only FETCH_HEAD and
> MERGE_HEAD I don't think we should explain them. The water is getting
> somewhat murky around pseudorefs anyway, so it probably only causes more
> confusion.
> Patrick
* Re: [PATCH v2] doc: add a explanation of Git's data model
2025-10-14 9:18 ` Julia Evans
@ 2025-10-14 11:45 ` Patrick Steinhardt
2025-10-14 13:39 ` Junio C Hamano
1 sibling, 0 replies; 89+ messages in thread
From: Patrick Steinhardt @ 2025-10-14 11:45 UTC (permalink / raw)
To: Julia Evans
Cc: Junio C Hamano, Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble

On Tue, Oct 14, 2025 at 05:18:58AM -0400, Julia Evans wrote:
>
>
> On Tue, Oct 14, 2025, at 1:45 AM, Patrick Steinhardt wrote:
> > On Mon, Oct 13, 2025 at 07:48:15AM -0700, Junio C Hamano wrote:
> >> Patrick Steinhardt <ps@pks.im> writes:
> >> > On Wed, Oct 08, 2025 at 01:53:41PM +0000, Julia Evans via GitGitGadget wrote:
> >> > [snip]
> >> > Not quite true. Pseudo refs are outside the hierarchy and are in fact
> >> > treated differently. But root refs are treated the same as any other
> >> > reference.
> >> >
> >> >     References are stored in a hierarchy. While most references are
> >> >     stored in the "refs/" hierarchy, some references with special
> >> >     meaning like for example "HEAD" are stored directly in the root of
> >> >     the hierarchy.
> >> >
> >> > I don't really think we should get into root refs vs pseudo refs here,
> >> > so maybe this is sufficient?
> >>
> >> I do not think "root ref" (or pseudo for that matter) is a concept
> >> that has no use in this context. If this is really about data
> >> model, where you find refs (or what the "pathname looking" thing
> >> exactly looks like that names your refs) should be immaterial. It
> >> does help to know that HEAD is just a ref. It also would help to
> >> know there are symbolic refs that point at other refs, which is much
> >> more relevant to the data model.
> >
> > Yeah, I don't necessarily think that we need to mention root refs here.
> > But what I think we need to avoid is the following sentence, as it is
> > misleading:
> >
> >     References are stored in a hierarchy, and Git handles references
> >     differently based on where they are in the hierarchy.
> >
>
> Why do you say that it’s misleading? (what do you think it’s implying
> that is not true?)
>
> What I’m trying to communicate is that branches, tags, etc are treated
> differently from each other and that Git knows how to handle them
> based on where they are in the hierarchy.

Oh, I think I managed to repeatedly misread this sentence! I was
basically s/where/whether/ and thought that this was saying that refs
are handled differently depending on whether they are stored _in_ that
hierarchy or _outside_ of it. And that would have been misleading
indeed.

But that's not what this sentence says at all. So please ignore this
tangent, sorry :)

Patrick
* Re: [PATCH v2] doc: add a explanation of Git's data model
2025-10-14 9:18 ` Julia Evans
2025-10-14 11:45 ` Patrick Steinhardt
@ 2025-10-14 13:39 ` Junio C Hamano
1 sibling, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-10-14 13:39 UTC (permalink / raw)
To: Julia Evans
Cc: Patrick Steinhardt, Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble

"Julia Evans" <julia@jvns.ca> writes:

>> Yeah, I don't necessarily think that we need to mention root refs here.
>> But what I think we need to avoid is the following sentence, as it is
>> misleading:
>>
>>     References are stored in a hierarchy, and Git handles references
>>     differently based on where they are in the hierarchy.
>>
>
> Why do you say that it’s misleading? (what do you think it’s
> implying that is not true?)
>
> What I’m trying to communicate is that branches, tags, etc are
> treated differently from each other and that Git knows how to
> handle them based on where they are in the hierarchy.

FWIW, I had the same reaction to what the message you are responding
to said.

I think Patrick assumes that our target audiences would assume the
"hierarchy" begins at "refs/" and "root" things are outside the
hierarchy, but my mental model saw that the hierarchy began at the
root level, most of things are in "refs/", but one level above it
lives things like HEAD and ORIG_HEAD.

I think both can be valid, but I do not know which views are more
common.

Thanks.
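Whichever mental model of the hierarchy a reader holds, the idea that a ref's kind follows from where its name sits can be sketched in a few lines of Python. This is a toy illustration, not Git's own logic: `classify_ref` is a hypothetical helper, and the prefix table covers only the locations named in the patch.

```python
# Toy classifier: maps a ref name to a kind based on its position in the
# hierarchy. Git itself has more cases than this table covers.
KNOWN_PREFIXES = [
    ("refs/heads/", "branch"),
    ("refs/tags/", "tag"),
    ("refs/remotes/", "remote-tracking branch"),
]


def classify_ref(name):
    """Return a human-readable kind for a ref name."""
    for prefix, kind in KNOWN_PREFIXES:
        if name.startswith(prefix):
            return kind
    if name.startswith("refs/"):
        # For example refs/stash or refs/bisect, created by Git tools.
        return "other ref under refs/"
    # For example HEAD or ORIG_HEAD.
    return "ref at the root of the hierarchy"


for name in ["refs/heads/main", "refs/tags/v1.0.0", "refs/stash", "HEAD"]:
    print(name, "->", classify_ref(name))
```

Under the first mental model the last case would be described as "outside the hierarchy", under the second as "at its root"; the names being classified are the same either way.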
* [PATCH v3] doc: add a explanation of Git's data model
2025-10-08 13:53 ` [PATCH v2] " Julia Evans via GitGitGadget
2025-10-10 11:51 ` Patrick Steinhardt
@ 2025-10-14 21:12 ` Julia Evans via GitGitGadget
2025-10-15 6:24 ` Patrick Steinhardt
` (4 more replies)
1 sibling, 5 replies; 89+ messages in thread
From: Julia Evans via GitGitGadget @ 2025-10-14 21:12 UTC (permalink / raw)
To: git
Cc: Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans, Julia Evans

From: Julia Evans <julia@jvns.ca>

Git very often uses the terms "object", "reference", or "index" in its
documentation.

However, it's hard to find a clear explanation of these terms and how
they relate to each other in the documentation. The closest candidates
currently are:

1. `gitglossary`. This makes a good effort, but it's an alphabetically
   ordered dictionary and a dictionary is not a good way to learn
   concepts. You have to jump around too much and it's not possible to
   present the concepts in the order that they should be explained.
2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
   This is a nice document to have, but it's not necessary to learn how
   `update-index` works to understand Git's data model, and we should
   not be requiring users to learn how to use the "plumbing" commands
   if they want to learn what the term "index" or "object" means.
3. `gitrepository-layout`. This is a great resource, but it includes a
   lot of information about configuration and internal implementation
   details which are not related to the data model. It also does
   not explain how commits work.

The result of this is that Git users (even users who have been using
Git for 15+ years) struggle to read the documentation because they don't
know what the core terms mean, and it's not possible to add links
to help them learn more.

Add an explanation of Git's data model. Some choices I've made in
deciding what "core data model" means:

1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
   if those are intended to be user facing or if they're more like
   internal implementation details.
2. Don't talk about submodules other than by mentioning how they
   relate to trees. This is because Git has a lot of special features,
   and explaining how they all work exhaustively could quickly go down a
   rabbit hole which would make this document less useful for
   understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
   (first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into
   implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
---
doc: Add a explanation of Git's data model

Changes in v2:

The biggest change is to remove all mentions of the .git directory, and
explain references in a way that doesn't refer to "directories" at all,
and instead talks about the "hierarchy" (from Kristoffer and Patrick's
reviews).

Also:

* objects: Mention that an object ID is called an "object name", and
  update the glossary to include the term "object ID" (from Junio's
  review)
* objects: Replace "SHA-1 hash" with "cryptographic hash" which is more
  accurate (from Patrick's review)
* blobs: Made the explanation of git gc a little higher level and took
  some ideas from Patrick's suggested wording (from Patrick's and
  Kristoffer's reviews)
* commits: Mention that tag objects and commits can optionally have
  other fields. I didn't mention the GPG signature specifically, but
  don't have any objections to adding it. (from Patrick and Junio's
  reviews)
* commits: Remove one of the mentions of git gc, since it perhaps opens
  up too much of a rabbit hole: "how does git gc decide which commits to
  clean up?".
  (from Kristoffer's review)
* tag objects: Add an example of how a tag object is represented (from
  user feedback on the draft)
* index: Use the term "file mode" instead of "permissions", and list all
  allowed file modes (from Patrick's review)
* index: Use "stage number" instead of "number" for index entries (from
  Patrick's review)
* reflogs: Remove "any ref can be logged", it raises some questions of
  "how do you tell Git to log a ref that it isn't normally logging?" and
  my guess is that it's uncommon to ask Git to log more refs. I don't
  think it's a "lie" to omit this but I can bring it back if folks
  disagree. (from Patrick's review)
* reflogs: Fix an error I noticed in the explanation of reflogs: tags
  aren't logged by default and remote-tracking branches are, according
  to man git-config
* branches and tags: Be clearer about how branches are usually updated
  (by committing), and make it a little more obvious that only branches
  can be checked out. This is a bit tricky because using the word "check
  out" introduces a rabbit hole that I want to avoid (what does "check
  out" mean?). I've dealt with this by just talking about the "current
  branch" (HEAD) since that is defined here, and making it more explicit
  that HEAD must either be a branch or a commit, there's no "HEAD is a
  tag" option. (from Patrick's review)
* tags: Explain the differences between annotated and lightweight tags
  (this is the main piece of user feedback I've gotten on the draft so
  far)
* Various style/typo changes ("2 or more", linkgit:git-gc[1], removed
  extra asterisks, added empty SYNOPSIS, "commits -> tags" typo fix, add
  to meson build)

non-changes:

* I still haven't mentioned things that aren't part of the "data model",
  like revision params and configuration. I think there could be a place
  for them but I haven't found it yet.
* tag objects: I noticed that there's a "tag" header field in tag
  objects (like tag v1.0.0) but I didn't mention it yet because I
  couldn't figure out what the purpose of that field is (I thought the
  tag name was stored in the reference, why is it duplicated in the tag
  object?)

Changes in v3:

I asked for feedback from Git users on Mastodon and got 220 pieces of
feedback from 48 different users. People seemed very excited to read
about Git's data model. Usually I judge explanations by what folks
report learning from them. Here people reported learning:

* how branches are stored (that a branch is "a name for a commit")
* how objects work
* that Git has separate "author" and "committer" fields
* that amending a commit does not change it
* that a tree is "just a directory" (not something more complicated),
  and how trees are stored
* that Git repos can contain symlinks
* that Git saves modes separately from the OS.
* how the stage number works
* that when you git add a file, Git will create an object
* that third-party tools can create their own refs.
* that the reflog stores the history of branches (not just HEAD), and
  what reflogs are for

Also (of course) there were quite a few points of confusion! The main 4
pieces of feedback were

1. The index section doesn't explain what the word "staged" means, and
   one person says that it makes it sound like only files that you "git
   add"ed are in the index. Rewrite the explanation to avoid using the
   word "staged" to define the index and instead define the word
   "staging".
2. Explain the difference between "annotated tags" and "lightweight
   tags" (done)
3. Add examples for tag objects and reflogs (done)
4. Mention a little more about where things are stored in the .git
   directory, which I'd removed in v2. This seems most important for
   .git/refs, so I added a hopefully accurate note about how refs are
   stored by default, with a comment about one of the major
   implications.
   I did not discuss where objects or the index are stored, because I
   don't think the implementation details of how objects are stored are
   as important, and there are better tools for viewing the "raw" state
   of objects and the index (with git cat-file -p or git ls-files
   --stage).

Here's every other change I made in response to the feedback, as well as
a few comments that I did not address.

intro:

* Give a 1-sentence intro to "reflog"

objects:

* people really like having git ls-files --stage as a way to view the
  index, so add git cat-file -p as well in a note

commits:

* 2 people asked "Are commits stored as a diff?". Say that diffs are
  calculated at runtime, this is very important.
* The order the fields are given in doesn't match the order in the
  example. Make them match.
* "All the files in the commit, stored as a tree" is throwing a few
  people off. Be clearer that it's the tree ID of the base directory.
* Several people asked "What's the difference between an author and
  committer?" I added an example using git cherry-pick that I'm not 100%
  happy with (what if the reader doesn't know what cherry-pick does?).
  There might be a better example to give here.
* In the note about commits being amended: one person suggested saying
  "creates a new commit with the same parent" to make it clearer what
  the relationship between the new and old commit is. I liked that idea
  so I did it.

trees:

* file modes. 2 people want to know more about "The file mode, for
  example 100644". Also 2 people are curious about what relationship
  these have to Unix permissions. Say that they're inspired by Unix
  permissions, and move the list of possible file modes up to make the
  relationship clearer
* On "so git-gc(1) periodically compresses objects to save disk space",
  there are a few follow up comments wondering about more, which makes
  me think the comment about compression is actually a distraction.
  Say something simpler instead, ("Git only needs to store new versions
  of files which were changed in that commit"), from Junio's suggestion
* Re "commit (a Git submodule)": 2 people say it's not clear how trees
  relate to submodules. Say that it refers to a commit in a different
  repository.
* One person says they're not sure if the "object ID" is a hash. Link it
  to the definition of "object ID".

tag objects:

* Requests for an example, added one.
* Requests to explain the difference between "lightweight" and
  "annotated" tags, added it.

tags:

* one person thinks "It’s expected that a tag will never change after
  you create it." is too strong (since of course you can change it with
  git tag -f). Say instead that tags are "usually" not changed.

HEAD:

* Several people are asking for more detail about detached HEAD state.
  There's actually quite a lot to talk about here (what it means, how it
  happens, what it implies, and how you might adjust your workflow to
  avoid it by using git switch). I don't think we can get into all of
  that here, so refer to the DETACHED HEAD section of git-checkout
  instead. I'm not totally happy with the current version of that
  section but that seems like the most practical solution right now.

remote-tracking branches:

* discuss refs/remotes/<remote>/HEAD.

the index:

* "permissions" should be "file mode" (like with trees). Changed.
* "filename" should be "file path". Changed.
* the stage number can only be 0, 1, 2, or 3, since it's 2 bits. Also
  maybe say that the numbers have specific meanings. Said it can only be
  0/1/2/3 but did not give the specific meanings.

reflogs:

* Request for an example. Added one.
* It's not clear if there's one reflog per branch/tag/HEAD, or if
  there's one universal reflog. Make this clearer.
* Mention the role of the reflog in retrieving "lost" commits or undoing
  bad rebases.

Not fixed:

* intro: A couple of people say that it's confusing that tags are both
  "an object" and "a reference".
  Handled this by just explaining the difference between an annotated
  and a lightweight tag further down. I'd like to make this clearer in
  the intro but not sure if there's a way to do it.
* commits and tag objects: one person asks if there's a reference for
  the other "optional fields", like "encoding" and "gpgsig". I couldn't
  find one, so left this as is.
* HEAD: A couple of people ask if there are any other symbolic
  references other than HEAD, or if they can make their own symbolic
  references. I don't know the answer to this.
* HEAD: the HEAD: HEAD thing looks weird, it made more sense when it was
  HEAD: .git/HEAD. Will think about this.
* reflogs: One person asks: if reflogs only store local changes, why
  does it track the user who made the change? Is that for remote
  operations like fetches and pulls? Or for cases where more than one
  user is using the same repo on a system? I don't know the answer to
  this.
* reflogs: How can you see the full data in the reflog? git reflog show
  doesn't list the user who made the change.
  git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso
  seems to work but it's really a mouthful, not sure it's useful to
  include all that.
* index: Is it worth mentioning that the index can be locked? I don't
  have an opinion about this.
* other: One person asks what a "working tree" is. It made me wonder if
  "the current working directory" has a place in Git's data model. My
  feeling is "no" but I could be convinced otherwise.
* overall: "How can Git be so fast? If I switch branches, how does it
  figure out what to add, remove or replace?".
  I don't think this is the right place for that discussion but it would
* there are some docs CI errors I haven't figured out yet (IDREF
  attribute linkend references an unknown ID "tree")

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1981/jvns/gitdatamodel-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1981

Range-diff vs v2:

 1: 3b38a88dc7 ! 1: 39da4e04cf doc: add a explanation of Git's data model
    @@ Documentation/gitdatamodel.adoc (new)
     +2. <<references,References>>: branches, tags,
     + remote-tracking branches, etc
     +3. <<index,The index>>, also known as the staging area
    -+4. <<reflogs,Reflogs>>
    ++4. <<reflogs,Reflogs>>: logs of changes to references ("ref log")
     +
     +[[objects]]
     +OBJECTS
    @@ Documentation/gitdatamodel.adoc (new)
     +Commits, trees, blobs, and tag objects are all stored in Git's object database.
     +Every object has:
     +
    ++[[object-id]]
     +1. an *ID* (aka "object name"), which is a cryptographic hash of its
     + type and contents.
     + It's fast to look up a Git object using its ID.
    @@ Documentation/gitdatamodel.adoc (new)
     + A commit contains these required fields
     + (though there are other optional fields):
     ++
    -+1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
    ++1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of
    ++ the commit's base directory.
    ++2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
     + regular commits have 1 parent, merge commits have 2 or more parents
    -+2. A *commit message*
    -+3. All the *files* in the commit, stored as a *<<tree,tree>>*
    -+4. An *author* and the time the commit was authored
    -+5. A *committer* and the time the commit was committed
    ++3. An *author* and the time the commit was authored
    ++4. A *committer* and the time the commit was committed.
    ++ If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit,
    ++ then they will be the author and you'll be the committer.
    ++5. A *commit message*
     ++
     +Here's how an example commit is stored:
     ++
    @@ Documentation/gitdatamodel.adoc (new)
     +----
     ++
     +Like all other objects, commits can never be changed after they're created.
    -+For example, "amending" a commit with `git commit --amend` creates a new commit.
    ++For example, "amending" a commit with `git commit --amend` creates a new
    ++commit with the same parent.
    +++
    ++Git does not store the diff for a commit: when you ask Git for a
    ++diff it calculates it on the fly.
     +
     +[[tree]]
     +trees::
     + A tree is how Git represents a directory. It lists, for each item in
     + the tree:
     ++
    -+1. The *file mode*, for example `100644`
    ++[[file-mode]]
    ++1. The *file mode*, for example `100644`. The format is inspired by Unix
    ++ permissions, but Git's modes are much more limited. Git only supports these file modes:
    +++
    ++ - `100644`: regular file (with type `blob`)
    ++ - `100755`: executable file (with type `blob`)
    ++ - `120000`: symbolic link (with type `blob`)
    ++ - `040000`: directory (with type `tree`)
    ++ - `160000`: gitlink, for use with submodules (with type `commit`)
    ++
     +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
    -+ or <<commit,`commit`>> (a Git submodule)
    -+3. The *object ID*
    ++ or <<commit,`commit`>> (a Git submodule, which is a
    ++ commit from a different Git repository)
    ++3. The <<object-id,*object ID*>>
     +4. The *filename*
     ++
     +For example, this is how a tree containing one directory (`src`) and one file
    @@ Documentation/gitdatamodel.adoc (new)
     +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
     +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
     +----
    -++
    -+Git only supports these file modes:
    -++
    -+ - `100644`: regular file (with type `blob`)
    -+ - `100755`: executable file (with type `blob`)
    -+ - `120000`: symbolic link (with type `blob`)
    -+ - `040000`: directory (with type `tree`)
    -+ - `160000`: gitlink, for use with submodules (with type `commit`)
    ++
     +
     +[[blob]]
     +blobs::
     + A blob is how Git represents a file. A blob object contains the
     + file's contents.
     ++
    -+
    -+NOTE: Storing a new blob for every new version of a file can use a
    -+lot of disk space. To handle this, Git periodically runs repository
    -+maintenance with linkgit:git-gc[1]. Part of this maintenance is
    -+compressing objects so that if a small part of a file was changed, only
    -+the change is stored instead of the whole file.
    ++When you make a new commit, Git only needs to store new versions of
    ++files which were changed in that commit. This means that commits
    ++can use relatively little disk space even in a very large repository.
     +
     +[[tag-object]]
     +tag objects::
    -+ Tag objects (also known as "annotated tags") contain these required fields
    ++ Tag objects contain these required fields
     + (though there are other optional fields):
     ++
    -+1. The *tagger* and tag date
    -+2. A *tag message*, similar to a commit message
    -+3. The *ID* and *type* of the object (often a commit) that they reference
    ++1. The *ID* and *type* of the object (often a commit) that they reference
    ++2. The *tagger* and tag date
    ++3. A *tag message*, similar to a commit message
     +
     +Here's how an example tag object is stored:
     +
    @@ Documentation/gitdatamodel.adoc (new)
     +Release version 1.0.0
     +----
     +
    ++NOTE: All of the examples in this section were generated with
    ++`git cat-file -p <object-id>`, which shows the contents of a Git object.
    ++
     +[[references]]
     +REFERENCES
     +----------
    @@ Documentation/gitdatamodel.adoc (new)
     + A tag is a name for a commit ID, tag object ID, or other object ID.
     + Tags that reference a tag object ID are called "annotated tags",
     + because the tag object contains a tag message.
    -+ Tags that reference a commit ID, blob ID, or tree ID are
    ++ Tags that reference a commit, blob, or tree ID are
     + called "lightweight tags".
     ++
     +Even though branches and tags are both "a name for a commit ID", Git
     +treats them very differently.
     +Branches are expected to change over time: when you make a commit, Git
     +will update your <<HEAD,current branch>> to reference the new changes.
    -+It's expected that a tag will never change after you create it.
    ++Tags are usually not changed after they're created.
     +
     +[[HEAD]]
     +HEAD: `HEAD`::
    @@ Documentation/gitdatamodel.adoc (new)
     + `HEAD` can either be:
     + 1. A symbolic reference to your current branch, for example `ref:
     + refs/heads/main` if your current branch is `main`.
    -+ 2. A direct reference to a commit ID.
    -+ This is called "detached HEAD state".
    ++ 2. A direct reference to a commit ID. This is called "detached HEAD
    ++ state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more.
     +
     +[[remote-tracking-branch]]
     +remote tracking branches: `refs/remotes/<remote>/<branch>`::
    @@ Documentation/gitdatamodel.adoc (new)
     + repository. `git fetch` updates remote-tracking branches. When
     + `git status` says "you're up to date with origin/main", it's looking at
     + this.
    +++
    ++`refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's
    ++default branch. This is the branch that `git clone` checks out by default.
     +
     +[[other-refs]]
     +Other references::
    @@ Documentation/gitdatamodel.adoc (new)
     ++
     +Git may also create references other than `HEAD` at the base of the
     +hierarchy, like `ORIG_HEAD`.
    +++
    ++NOTE: By default, Git references are stored as files in the `.git` directory.
    ++For example, the branch `main` is stored in `.git/refs/heads/main`.
    ++This means that you can't have branches named both `maya` and `maya/some-task`,
    ++because there can't be a file and a directory with the same name.
     +
     +[[index]]
     +THE INDEX
     +---------
     +
    -+The index, also known as the "staging area", contains the current staged
    -+version of every file in your Git repository. When you commit, the files
    -+in the index are used as the files in the next commit.
    ++The index, also known as the "staging area", contains a list of every
    ++file in the repository and its contents. When you commit, the files in
    ++the index are used as the files in the next commit.
    ++
    ++You can add files to the index or update the version in the index with
    ++linkgit:git-add[1]. Adding a file to the index or updating its version
    ++is called "staging" the file for commit.
     +
    -+Unlike a tree, the index is a flat list of files.
    ++Unlike a <<tree,tree>>, the index is a flat list of files.
     +Each index entry has 4 fields:
     +
    -+1. The *permissions*
    ++1. The *<<file-mode,file mode>>*
     +2. The *<<blob,blob>> ID* of the file
    -+3. The *filename*
    -+4. The *stage number*. This is normally 0, but if there's a merge conflict
    -+ there can be multiple versions (with numbers 0, 1, 2, ..)
    -+ of the same filename in the index.
    ++3. The *file path*, for example `src/hello.py`
    ++4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
    ++ there's a merge conflict there can be multiple versions of the same
    ++ filename in the index.
     +
     +It's extremely uncommon to look at the index directly: normally you'd
     +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
    @@ Documentation/gitdatamodel.adoc (new)
     +REFLOGS
     +-------
     +
    -+Git stores the history of your branch, remote-tracking branch, and HEAD refs
    -+in a reflog (you should read "reflog" as "ref log").
    ++Git stores a history called a "reflog" for every branch, remote-tracking
    ++branch, and HEAD. This means that if you make a mistake and "lose" a
    ++commit, you can generally recover the commit ID by running
    ++`git reflog <reference>`.
     +
     +Each reflog entry has:
     +
    @@ Documentation/gitdatamodel.adoc (new)
     +Reflogs only log changes made in your local repository.
     +They are not shared with remotes.
     +
    ++For example, here's how the reflog for `HEAD` in a repository with 2
    ++commits is stored:
    ++
    ++----
    ++0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400 commit (initial): Initial commit
    ++4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400 commit: Add README
    ++----
    ++
     +GIT
     +---
     +Part of the linkgit:git[1] suite

 Documentation/Makefile              |   1 +
 Documentation/gitdatamodel.adoc     | 281 ++++++++++++++++++++++++++++
 Documentation/glossary-content.adoc |   4 +-
 Documentation/meson.build           |   1 +
 4 files changed, 285 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/gitdatamodel.adoc

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fb83d0c6e..5f4acfacbd 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc
 MAN7_TXT += gitcore-tutorial.adoc
 MAN7_TXT += gitcredentials.adoc
 MAN7_TXT += gitcvs-migration.adoc
+MAN7_TXT += gitdatamodel.adoc
 MAN7_TXT += gitdiffcore.adoc
 MAN7_TXT += giteveryday.adoc
 MAN7_TXT += gitfaq.adoc
diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
new file mode 100644
index 0000000000..f49574dfae
--- /dev/null
+++ b/Documentation/gitdatamodel.adoc
@@ -0,0 +1,281 @@
+gitdatamodel(7)
+===============
+
+NAME
+----
+gitdatamodel - Git's core data model
+
+SYNOPSIS
+--------
+gitdatamodel
+
+DESCRIPTION
+-----------
+
+It's not necessary to understand Git's data model to use Git, but it's
+very helpful when reading Git's documentation so that you know what it
+means when the documentation says "object", "reference" or
"index". + +Git's core operations use 4 kinds of data: + +1. <<objects,Objects>>: commits, trees, blobs, and tag objects +2. <<references,References>>: branches, tags, + remote-tracking branches, etc +3. <<index,The index>>, also known as the staging area +4. <<reflogs,Reflogs>>: logs of changes to references ("ref log") + +[[objects]] +OBJECTS +------- + +Commits, trees, blobs, and tag objects are all stored in Git's object database. +Every object has: + +[[object-id]] +1. an *ID* (aka "object name"), which is a cryptographic hash of its + type and contents. + It's fast to look up a Git object using its ID. + This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, + and <<tag-object,tag objects>>. +3. *contents*. The structure of the contents depends on the type. + +Once an object is created, it can never be changed. +Here are the 4 types of objects: + +[[commit]] +commits:: + A commit contains these required fields + (though there are other optional fields): ++ +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of + the commit's base directory. +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored +4. A *committer* and the time the commit was committed. + If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit, + then they will be the author and you'll be the committer. +5. A *commit message* ++ +Here's how an example commit is stored: ++ +---- +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 +author Maya <maya@example.com> 1759173425 -0400 +committer Maya <maya@example.com> 1759173425 -0400 + +Add README +---- ++ +Like all other objects, commits can never be changed after they're created. 
+For example, "amending" a commit with `git commit --amend` creates a new +commit with the same parent. ++ +Git does not store the diff for a commit: when you ask Git for a +diff it calculates it on the fly. + +[[tree]] +trees:: + A tree is how Git represents a directory. It lists, for each item in + the tree: ++ +[[file-mode]] +1. The *file mode*, for example `100644`. The format is inspired by Unix + permissions, but Git's modes are much more limited. Git only supports these file modes: ++ + - `100644`: regular file (with type `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `040000`: directory (with type `tree`) + - `160000`: gitlink, for use with submodules (with type `commit`) + +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), + or <<commit,`commit`>> (a Git submodule, which is a + commit from a different Git repository) +3. The <<object-id,*object ID*>> +4. The *filename* ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: ++ +---- +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + + +[[blob]] +blobs:: + A blob is how Git represents a file. A blob object contains the + file's contents. ++ +When you make a new commit, Git only needs to store new versions of +files which were changed in that commit. This means that commits +can use relatively little disk space even in a very large repository. + +[[tag-object]] +tag objects:: + Tag objects contain these required fields + (though there are other optional fields): ++ +1. The *ID* and *type* of the object (often a commit) that they reference +2. The *tagger* and tag date +3. 
A *tag message*, similar to a commit message + +Here's how an example tag object is stored: + +---- +object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 +type commit +tag v1.0.0 +tagger Maya <maya@example.com> 1759927359 -0400 + +Release version 1.0.0 +---- + +NOTE: All of the examples in this section were generated with +`git cat-file -p <object-id>`, which shows the contents of a Git object. + +[[references]] +REFERENCES +---------- + +References are a way to give a name to a commit. +It's easier to remember "the changes I'm working on are on the `turtle` +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + +References can either be: + +1. References to an object ID, usually a <<commit,commit>> ID +2. References to another reference. This is called a "symbolic reference". + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. +Most references are under `refs/`. Here are the main types: + +[[branch]] +branches: `refs/heads/<name>`:: + A branch is a name for a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] +tags: `refs/tags/<name>`:: + A tag is a name for a commit ID, tag object ID, or other object ID. + Tags that reference a tag object ID are called "annotated tags", + because the tag object contains a tag message. + Tags that reference a commit, blob, or tree ID are + called "lightweight tags". ++ +Even though branches and tags are both "a name for a commit ID", Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git +will update your <<HEAD,current branch>> to reference the new changes. +Tags are usually not changed after they're created. 
+ +[[HEAD]] +HEAD: `HEAD`:: + `HEAD` is where Git stores your current <<branch,branch>>. + `HEAD` can either be: + 1. A symbolic reference to your current branch, for example `ref: + refs/heads/main` if your current branch is `main`. + 2. A direct reference to a commit ID. This is called "detached HEAD + state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more. + +[[remote-tracking-branch]] +remote tracking branches: `refs/remotes/<remote>/<branch>`:: + A remote-tracking branch is a name for a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at + this. ++ +`refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's +default branch. This is the branch that `git clone` checks out by default. + +[[other-refs]] +Other references:: + Git tools may create references anywhere under `refs/`. + For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references + in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. ++ +NOTE: By default, Git references are stored as files in the `.git` directory. +For example, the branch `main` is stored in `.git/refs/heads/main`. +This means that you can't have branches named both `maya` and `maya/some-task`, +because there can't be a file and a directory with the same name. + +[[index]] +THE INDEX +--------- + +The index, also known as the "staging area", contains a list of every +file in the repository and its contents. When you commit, the files in +the index are used as the files in the next commit. + +You can add files to the index or update the version in the index with +linkgit:git-add[1]. 
Adding a file to the index or updating its version +is called "staging" the file for commit. + +Unlike a <<tree,tree>>, the index is a flat list of files. +Each index entry has 4 fields: + +1. The *<<file-mode,file mode>>* +2. The *<<blob,blob>> ID* of the file +3. The *file path*, for example `src/hello.py` +4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if + there's a merge conflict there can be multiple versions of the same + filename in the index. + +It's extremely uncommon to look at the index directly: normally you'd +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>. +But you can use `git ls-files --stage` to see the index. +Here's the output of `git ls-files --stage` in a repository with 2 files: + +---- +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py +---- + +[[reflogs]] +REFLOGS +------- + +Git stores a history called a "reflog" for every branch, remote-tracking +branch, and HEAD. This means that if you make a mistake and "lose" a +commit, you can generally recover the commit ID by running +`git reflog <reference>`. + +Each reflog entry has: + +1. Before/after *commit IDs* +2. *User* who made the change, for example `Maya <maya@example.com>` +3. *Timestamp* when the change was made +4. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. +They are not shared with remotes. 
+ +For example, here's how the reflog for `HEAD` in a repository with 2 +commits is stored: + +---- +0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400 commit (initial): Initial commit +4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400 commit: Add README +---- + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc index e423e4765b..20ba121314 100644 --- a/Documentation/glossary-content.adoc +++ b/Documentation/glossary-content.adoc @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a identified by its <<def_object_name,object name>>. The objects usually live in `$GIT_DIR/objects/`. -[[def_object_identifier]]object identifier (oid):: - Synonym for <<def_object_name,object name>>. +[[def_object_identifier]]object identifier, object ID, oid:: + Synonyms for <<def_object_name,object name>>. [[def_object_name]]object name:: The unique identifier of an <<def_object,object>>. The diff --git a/Documentation/meson.build b/Documentation/meson.build index e34965c5b0..ace0573e82 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -192,6 +192,7 @@ manpages = { 'gitcore-tutorial.adoc' : 7, 'gitcredentials.adoc' : 7, 'gitcvs-migration.adoc' : 7, + 'gitdatamodel.adoc' : 7, 'gitdiffcore.adoc' : 7, 'giteveryday.adoc' : 7, 'gitfaq.adoc' : 7, base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8 -- gitgitgadget ^ permalink raw reply related [flat|nested] 89+ messages in thread
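Editorial illustration of the OBJECTS section's statement that an object ID is "a cryptographic hash of its type and contents". This is a minimal Python sketch, not part of the patch. It assumes the default SHA-1 object format, where the hashed bytes are a header of the form `<type> <size>\0` followed by the contents. The helper name `git_object_id` is hypothetical.

```python
import hashlib


def git_object_id(obj_type: str, contents: bytes) -> str:
    """Return the hex object ID for an object of the given type.

    Assumes the SHA-1 object format: the hash covers a header
    "<type> <size in bytes>\\0" followed by the raw contents.
    """
    header = f"{obj_type} {len(contents)}\0".encode("ascii")
    return hashlib.sha1(header + contents).hexdigest()


# The empty blob and the empty tree have well-known IDs.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("tree", b""))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because the type is part of the hashed header, a blob and a tree with identical bytes still get different IDs.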
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-14 21:12 ` [PATCH v3] " Julia Evans via GitGitGadget @ 2025-10-15 6:24 ` Patrick Steinhardt 2025-10-15 15:34 ` Junio C Hamano 2025-10-15 19:58 ` Junio C Hamano ` (3 subsequent siblings) 4 siblings, 1 reply; 89+ messages in thread From: Patrick Steinhardt @ 2025-10-15 6:24 UTC (permalink / raw) To: Julia Evans via GitGitGadget Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans On Tue, Oct 14, 2025 at 09:12:26PM +0000, Julia Evans via GitGitGadget wrote: [snip] > +[[commit]] > +commits:: > + A commit contains these required fields > + (though there are other optional fields): > ++ > +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of > + the commit's base directory. > +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, > + regular commits have 1 parent, merge commits have 2 or more parents > +3. An *author* and the time the commit was authored > +4. A *committer* and the time the commit was committed. > + If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit, > + then they will be the author and you'll be the committer. > +5. A *commit message* > ++ > +Here's how an example commit is stored: > ++ > +---- > +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a > +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 > +author Maya <maya@example.com> 1759173425 -0400 > +committer Maya <maya@example.com> 1759173425 -0400 > + > +Add README > +---- > ++ > +Like all other objects, commits can never be changed after they're created. > +For example, "amending" a commit with `git commit --amend` creates a new > +commit with the same parent. Let's say "parents" instead of "parent" here so that it also works for root and merge commits. [snip] > +[[other-refs]] > +Other references:: > + Git tools may create references anywhere under `refs/`. 
> + For example, linkgit:git-stash[1], linkgit:git-bisect[1], > + and linkgit:git-notes[1] all create their own references > + in `refs/stash`, `refs/bisect`, etc. > + Third-party Git tools may also create their own references. > ++ > +Git may also create references other than `HEAD` at the base of the > +hierarchy, like `ORIG_HEAD`. > ++ > +NOTE: By default, Git references are stored as files in the `.git` directory. > +For example, the branch `main` is stored in `.git/refs/heads/main`. > +This means that you can't have branches named both `maya` and `maya/some-task`, > +because there can't be a file and a directory with the same name. Hm. I think mentioning this can help, but it may also create questions when someone has a "main" branch but is unable to find it in ".git/refs/heads/main" because it has either been packed, or because the repository uses reftables. I don't really know what to do about this. I think the most sensible thing would be to introduce two man pages gitformat-reffiles(5) and gitformat-reftables(5) that we can reference here for further reading. [snip] > +[[reflogs]] > +REFLOGS > +------- > + > +Git stores a history called a "reflog" for every branch, remote-tracking I think it's a bit unclear what "history" means here. Maybe: Git stores a "reflog" for every branch, remote-tracking branch and "HEAD" that contains the annotated history of all updates for a particular reference. This means... Other than those handful of comments I'm happy with the current version, thanks! Patrick ^ permalink raw reply [flat|nested] 89+ messages in thread
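Editorial illustration of the "parent" versus "parents" point raised above: a commit can carry zero, one, or several `parent` lines. This is a minimal Python sketch, assuming the textual layout shown in the patch's example commit (header lines, a blank line, then the message). The names `parse_commit`, `EXAMPLE_COMMIT` and `ROOT_COMMIT` are hypothetical; the root commit is the patch's example with its `parent` line removed.

```python
def parse_commit(raw: str) -> dict:
    """Split a commit object's text into its headers and message.

    "parents" is always a list, so root commits (0 parents), regular
    commits (1 parent) and merge commits (2 or more) are handled alike.
    """
    headers, _, message = raw.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in headers.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
        # Other (optional) headers are ignored in this sketch.
    return commit


# The example commit from the patch.
EXAMPLE_COMMIT = (
    "tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a\n"
    "parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647\n"
    "author Maya <maya@example.com> 1759173425 -0400\n"
    "committer Maya <maya@example.com> 1759173425 -0400\n"
    "\n"
    "Add README\n"
)

# The same text without a parent line, standing in for a root commit.
ROOT_COMMIT = EXAMPLE_COMMIT.replace(
    "parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647\n", "")

print(len(parse_commit(EXAMPLE_COMMIT)["parents"]))  # 1
print(len(parse_commit(ROOT_COMMIT)["parents"]))     # 0
```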
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-15 6:24 ` Patrick Steinhardt @ 2025-10-15 15:34 ` Junio C Hamano 2025-10-15 17:20 ` Julia Evans 0 siblings, 1 reply; 89+ messages in thread From: Junio C Hamano @ 2025-10-15 15:34 UTC (permalink / raw) To: Patrick Steinhardt Cc: Julia Evans via GitGitGadget, git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans Patrick Steinhardt <ps@pks.im> writes: >> +Like all other objects, commits can never be changed after they're created. >> +For example, "amending" a commit with `git commit --amend` creates a new >> +commit with the same parent. > > Let's say "parents" instead of "parent" here so that it also works for > root and merge commits. I just found it amusing that parents can be 0 ;-) >> +NOTE: By default, Git references are stored as files in the `.git` directory. >> +For example, the branch `main` is stored in `.git/refs/heads/main`. >> +This means that you can't have branches named both `maya` and `maya/some-task`, >> +because there can't be a file and a directory with the same name. > > Hm. I think mentioning this can help, but it may also create questions > when someone has a "main" branch but is unable to find it in > ".git/refs/heads/main" because it has either been packed, or because the > repository uses reftables. I had the same thought. The only thing we want to stress here is that the names of refs _behave_ like filesystem entities. So how about saying just Note: when you have a branch with <name>, you cannot have any branch whose name begins with "<name>/". and stop at it? It may look like an arbitrary limitation, and once in a distant future ref-files gets retired, it will become one (as there is no inherent reason why reftable backend must retain it; it only enforces the same limitation to ensure that the names it stores interoperate with another clone that uses ref-files backend).
At the data-model level (which is the theme of this document), it is just as immaterial as refnames may be case insensitive on some systems. Mentioning the limitation may be good, but the data model document is not the right place to explain where this limitation comes from (i.e. to be compatible with and expressible in ref-files backend). We do not say "you may not be able to have 'maya' branch and 'mAYa' branch at the same time on some systems", either ;-). >> +Git stores a history called a "reflog" for every branch, remote-tracking > > I think it's a bit unclear what "history" means here. Maybe: "records of updates", perhaps? ^ permalink raw reply [flat|nested] 89+ messages in thread
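Editorial illustration of the rule proposed above ("when you have a branch with <name>, you cannot have any branch whose name begins with "<name>/""). This is a minimal Python sketch of the naming rule only. It does not model how any ref backend enforces it, and the function name `branch_name_conflicts` is hypothetical.

```python
from typing import Iterable


def branch_name_conflicts(existing: Iterable[str], new: str) -> bool:
    """Return True if `new` cannot coexist with the existing branch names.

    Two names conflict when one of them is the other followed by "/"
    and more path components, e.g. "maya" and "maya/some-task".
    """
    for name in existing:
        if new.startswith(name + "/") or name.startswith(new + "/"):
            return True
    return False


print(branch_name_conflicts(["maya"], "maya/some-task"))  # True
print(branch_name_conflicts(["maya/some-task"], "maya"))  # True
print(branch_name_conflicts(["maya"], "maya-2"))          # False
```

The check is symmetric: it does not matter which of the two names was created first.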
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-15 15:34 ` Junio C Hamano @ 2025-10-15 17:20 ` Julia Evans 2025-10-15 20:42 ` Junio C Hamano 0 siblings, 1 reply; 89+ messages in thread From: Julia Evans @ 2025-10-15 17:20 UTC (permalink / raw) To: Junio C Hamano, Patrick Steinhardt Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble On Wed, Oct 15, 2025, at 11:34 AM, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > >>> +Like all other objects, commits can never be changed after they're created. >>> +For example, "amending" a commit with `git commit --amend` creates a new >>> +commit with the same parent. >> >> Let's say "parents" instead of "parent" here so that it also works for >> root and merge commits. > > I just found it amusing that parents can be 0 ;-) Will change to parent(s). >>> +NOTE: By default, Git references are stored as files in the `.git` directory. >>> +For example, the branch `main` is stored in `.git/refs/heads/main`. >>> +This means that you can't have branches named both `maya` and `maya/some-task`, >>> +because there can't be a file and a directory with the same name. >> >> Hm. I think mentioning this can help, but it may also create questions >> when someone has a "main" branch but is unable to find it in >> ".git/refs/heads/main" because it has either been packed, or because the >> repository uses reftables. > > I had the same thought. The only thing we want to stress here is > that the names of refs _behave_ like filesystem entities. So how > about saying just > > Note: when you have a branch with <name>, you cannot have any > branch whose name begins with "<name>/". > > and stop at it?
It may look like an arbitrary limitation, and once > in a distant future ref-files gets retired, it will become one (as > there is no inherent reason why reftable backend must retain it; it > only enforces the same limitation to ensure that the names it stores > interoperate with another clone that uses ref-files backend). At > the data-model level (which is the theme of this document), it is > just as immaterial as refnames may be case insensitive on some > systems. > > Mentioning the limitation may be good, but the data model document > is not the right place to explain where this limitation comes from > (i.e. to be compatible with and expressible in ref-files backend). I'm still not clear on why you think we shouldn't mention that how references behave depends on which filesystem you're using. Is it because the fact that how references behave depends on which FS you're using is considered a "bug", Git is working on eventually fixing that bug via the reftable backend, and we don't want to document "bugs" as an expected part of the data model? I do think it's important to tell users where the data model has "weak points" where the abstraction leaks through to the implementation, pretending that abstractions are stronger than they are leads to unnecessary confusion. > We do not say "you may not be able to have 'maya' branch and 'mAYa' > branch at the same time on some systems", either ;-). Speaking of case-insensitive filesystems, I wonder if we should add a short note about the rules for filenames in Git. I ran into an issue recently where I had a filename with a colon in it, and my collaborator (who was using Windows) could not check out the branch because of that, and I saw another similar issue recently where one collaborator was using a case-insensitive filesystem and the other wasn't. 
My guess is that Git does not enforce any rules about filenames (?), and it's up to the user to make sure that the filenames in the repository will work well for everyone collaborating on the repository. >>> +Git stores a history called a "reflog" for every branch, remote-tracking >> >> I think it's a bit unclear what "history" means here. Maybe: > > "records of updates", perhaps? Agreed. Perhaps this instead: Every time a branch, remote-tracking branch, or HEAD is updated, Git updates a log called a "reflog" for that <<reference,reference>>. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-15 17:20 ` Julia Evans @ 2025-10-15 20:42 ` Junio C Hamano 2025-10-16 14:21 ` Julia Evans 0 siblings, 1 reply; 89+ messages in thread From: Junio C Hamano @ 2025-10-15 20:42 UTC (permalink / raw) To: Julia Evans Cc: Patrick Steinhardt, Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble "Julia Evans" <julia@jvns.ca> writes: > I'm still not clear on why you think we shouldn't mention that how > references behave depends on which filesystem you're using. Simply because the main purpose of this document is to give a data-model. A case insensitive filesystem limiting the set of names you can use depending on what other names are in use is a quality of implementation issue, which I view as a mere distraction when we are giving overview at the conceptual level. Thanks. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-15 20:42 ` Junio C Hamano @ 2025-10-16 14:21 ` Julia Evans 0 siblings, 0 replies; 89+ messages in thread From: Julia Evans @ 2025-10-16 14:21 UTC (permalink / raw) To: Junio C Hamano Cc: Patrick Steinhardt, Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble On Wed, Oct 15, 2025, at 4:42 PM, Junio C Hamano wrote: > "Julia Evans" <julia@jvns.ca> writes: > >> I'm still not clear on why you think we shouldn't mention that how >> references behave depends on which filesystem you're using. > > Simply because the main purpose of this document is to give a > data-model. A case insensitive filesystem limiting the set of names > you can use depending on what other names are in use is a quality of > implementation issue, which I view as a mere distraction when we are > giving overview at the conceptual level. Okay, I'll delete the note. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-14 21:12 ` [PATCH v3] " Julia Evans via GitGitGadget 2025-10-15 6:24 ` Patrick Steinhardt @ 2025-10-15 19:58 ` Junio C Hamano 2025-10-16 15:19 ` Julia Evans 2025-10-16 15:24 ` Kristoffer Haugsbakk ` (2 subsequent siblings) 4 siblings, 1 reply; 89+ messages in thread From: Junio C Hamano @ 2025-10-15 19:58 UTC (permalink / raw) To: Julia Evans via GitGitGadget Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes: > +[[commit]] > +commits:: > + A commit contains these required fields > + (though there are other optional fields): > ++ > +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of > + the commit's base directory. "all the files' exact contents at the time of the commit" is what we mean here, and once readers know what a tree is, the above sentence would be understood as such, but "All the files" felt somewhat fuzzy. I wonder if presenting objects in bottom-up fashion makes it easier to see? Learn that a blob records exact content of a file, then learn that a tree records the set of paths with exact contents stored at these paths, and after that, learn that a commit records a tree, hence a snapshot of the whole set of contents. I dunno... > +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, > + regular commits have 1 parent, merge commits have 2 or more parents > +3. An *author* and the time the commit was authored > +4. A *committer* and the time the commit was committed. > + If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit, > + then they will be the author and you'll be the committer. It felt a bit odd to single-out cherry-pick here. 
I think the important thing to become aware of for the readers at this point is that the author and committer can be different people, and it does not matter how one commits somebody else's patch at the mechanical level. Perhaps replace "If you cherry-pick..." with something like "note: a change authored by a person at some point in time can be committed by another person at a different time, and these fields are to record both persons' contributions separately", perhaps, if we really want to say more. > +Git does not store the diff for a commit: when you ask Git for a > +diff it calculates it on the fly. I think this is an attempt to demystify "are we really storing snapshot for each commit?" thing, but then "when you ask Git to show the commit, it calculates the diff from its parent on the fly" might achieve that better, perhaps? > +[[tree]] > +trees:: > + A tree is how Git represents a directory. It lists, for each item in > + the tree: > ++ > +[[file-mode]] > +1. The *file mode*, for example `100644`. The format is inspired by Unix > + permissions, but Git's modes are much more limited. Git only supports these file modes: > ++ > + - `100644`: regular file (with type `blob`) > + - `100755`: executable file (with type `blob`) > + - `120000`: symbolic link (with type `blob`) > + - `040000`: directory (with type `tree`) > + - `160000`: gitlink, for use with submodules (with type `commit`) It is not really "supporting" file modes. Rather, Git only records 5 kinds of entities associated with each path in a tree object, and uses numbers that remotely resemble POSIX file modes to represent these 5 kinds. Perhaps "supports" -> "uses"? > +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), > + or <<commit,`commit`>> (a Git submodule, which is a > + commit from a different Git repository) > +3. The <<object-id,*object ID*>> > +4.
The *filename* Here it may be worth noting that this "filename" is a single pathname component (roughly, what you would see in non-recursive "ls"). In other words, it may be a directory name. I wonder if we need to say "<blob> (a file, or a symbolic link)"? > +[[blob]] > +blobs:: > + A blob is how Git represents a file. A blob object contains the > + file's contents. "represents a file" hints as if the thing may know its name, but that is not the case (its name is given only by surrounding tree). "A blob is how Git represents uninterpreted series of bytes, and most commonly used to store file's contents." or something, perhaps? > +When you make a new commit, Git only needs to store new versions of > +files which were changed in that commit. This means that commits > +can use relatively little disk space even in a very large repository. That invites the "aren't we storing a delta after all, then?" confusion. "Git only needs to newly store new versions of files and directories. Files and directories that were not modified by the commit are shared with its parent commit". > +NOTE: All of the examples in this section were generated with > +`git cat-file -p <object-id>`, which shows the contents of a Git object. Was this necessary to say this? Blobs, Commits, and Tags are textual, so "-p" does very minimum thing, but Trees are binary garbage, so "-p" output is heavily massaged version of the contents. > +[[branch]] > +branches: `refs/heads/<name>`:: > + A branch is a name for a commit ID. Well a commit ID is an alternative way to refer to a commit object *name*, so it is a bit strange to say "a name for a commit ID". Perhaps "A branch ref stores a commit ID." is better? > +[[tag]] > +tags: `refs/tags/<name>`:: > + A tag is a name for a commit ID, tag object ID, or other object ID. Likewise. 
"A tag ref stores any kind of object ID, but commonly they are commit objects or tag objects" > + Tags that reference a tag object ID are called "annotated tags", > + because the tag object contains a tag message. > + Tags that reference a commit, blob, or tree ID are > + called "lightweight tags". > ++ > +Even though branches and tags are both "a name for a commit ID", Git > +treats them very differently. > +Branches are expected to change over time: when you make a commit, Git > +will update your <<HEAD,current branch>> to reference the new changes. This sentence talks about branch moving because it advances with more commits. Did we want to say "HEAD" here before we explain what it is? "HEAD" can move for another reason (i.e. branch switching) and using "HEAD" in the context of talking about growing history might invite confusion. I dunno. > +Tags are usually not changed after they're created. > +[[HEAD]] > +HEAD: `HEAD`:: > + `HEAD` is where Git stores your current <<branch,branch>>. Hmm... > + `HEAD` can either be: > + 1. A symbolic reference to your current branch, for example `ref: > + refs/heads/main` if your current branch is `main`. > + 2. A direct reference to a commit ID. This is called "detached HEAD > + state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more. These two are very reasonable. But "your current <<branch>>" refers only to #1. `HEAD` refers to the commit your current work is based on, and it is the commit that will become the first parent of the commit once your current work is concluded. It can either be ... perhaps. > +[[remote-tracking-branch]] > +remote tracking branches: `refs/remotes/<remote>/<branch>`:: Please always write "remote-tracking" with a hyphen (see glossary). > + A remote-tracking branch is a name for a commit ID. Either "A remote-tracking branch stores a commit object name" or "A remote-tracking branch points at a commit object", followed by "in order to keep track of the last-known state of ..."
in a single sentence. > +[[index]] > +THE INDEX > +--------- > + > +The index, also known as the "staging area", contains a list of every > +file in the repository and its contents. When you commit, the files in > +the index are used as the files in the next commit. It is hard to define what "every file in the repository" really is. Files that you removed last week do not count. Files added in your wip branch elsewhere are obviously not yet in the index when you are working on your primary branch. > +You can add files to the index or update the version in the index with > +linkgit:git-add[1]. Adding a file to the index or updating its version > +is called "staging" the file for commit. It may be worth to clarify by saying "staging the contents of the file" (you can edit the file further after you "git add") that you are taking a snapshot at the time you ran "git add", instead of giving a general instruction to "keey an eye on this file" to Git (if it were, then the next "git commit" would behave more like "git add -u && git commit"). > +[[reflogs]] > +REFLOGS > +------- > + > +Git stores a history called a "reflog" for every branch, remote-tracking > +branch, and HEAD. This means that if you make a mistake and "lose" a > +commit, you can generally recover the commit ID by running > +`git reflog <reference>`. > + > +Each reflog entry has: > + > +1. Before/after *commit IDs* > +2. *User* who made the change, for example `Maya <maya@example.com>` > +3. *Timestamp* when the change was made > +4. *Log message*, for example `pull: Fast-forward` > + > +Reflogs only log changes made in your local repository. > +They are not shared with remotes. Technically it is correct that before/after are recorded, but there is no way for the end-user to interact with them. "git reflog" walking these entries will only give you a single commit object. The username is also recorded, but I do not think of a way to view the information, let alone using it for querying. 
Especially when the reftable backend is in use, you cannot even read
the raw representation like you can do with files backend (where
something like "cat .git/logs/HEAD" would let you peek into the
details). I am not sure if we want to go into this detail.

Perhaps drop everything after "Each reflog entry has:"?

> +For example, here's how the reflog for `HEAD` in a repository with 2
> +commits is stored:
> +
> +----
> +0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400 commit (initial): Initial commit
> +4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400 commit: Add README
> +----
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
> diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc
> index e423e4765b..20ba121314 100644
> --- a/Documentation/glossary-content.adoc
> +++ b/Documentation/glossary-content.adoc
> @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a
>   identified by its <<def_object_name,object name>>. The objects usually
>   live in `$GIT_DIR/objects/`.
>
> -[[def_object_identifier]]object identifier (oid)::
> - Synonym for <<def_object_name,object name>>.
> +[[def_object_identifier]]object identifier, object ID, oid::
> + Synonyms for <<def_object_name,object name>>.
>
>  [[def_object_name]]object name::
>   The unique identifier of an <<def_object,object>>.
>   The
> diff --git a/Documentation/meson.build b/Documentation/meson.build
> index e34965c5b0..ace0573e82 100644
> --- a/Documentation/meson.build
> +++ b/Documentation/meson.build
> @@ -192,6 +192,7 @@ manpages = {
>   'gitcore-tutorial.adoc' : 7,
>   'gitcredentials.adoc' : 7,
>   'gitcvs-migration.adoc' : 7,
> + 'gitdatamodel.adoc' : 7,
>   'gitdiffcore.adoc' : 7,
>   'giteveryday.adoc' : 7,
>   'gitfaq.adoc' : 7,
>
> base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8

^ permalink raw reply [flat|nested] 89+ messages in thread
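To make the remark in the review above concrete (a raw tree object is
binary, and "git cat-file -p" prints a massaged rendering of it), here
is a minimal Python sketch. It assumes the SHA-1 object format, where
each raw entry is "<mode> <name>", a NUL byte, and then a 20-byte binary
object ID. `parse_tree` and `format_entry` are hypothetical helper
names, not part of Git, and the sample entries are made up for
illustration.

```python
def parse_tree(raw: bytes):
    """Parse the body of a raw (uncompressed, header-less) SHA-1 tree object."""
    entries = []
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)       # end of the mode digits
        nul = raw.index(b"\0", space)      # end of the entry name
        mode = raw[pos:space].decode("ascii")
        name = raw[space + 1:nul].decode("utf-8")
        oid = raw[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex digits
        # The object type is not stored; it follows from the mode.
        if mode == "40000":
            kind = "tree"
        elif mode == "160000":
            kind = "commit"
        else:
            kind = "blob"
        # The raw form has no leading zero on directory modes; pad to
        # six digits for display.
        entries.append((mode.zfill(6), kind, oid, name))
        pos = nul + 21
    return entries


def format_entry(entry):
    """Render one parsed entry as '<mode> <type> <oid><TAB><name>'."""
    mode, kind, oid, name = entry
    return f"{mode} {kind} {oid}\t{name}"


# Made-up sample: one file entry and one directory entry.
SAMPLE_TREE = (
    b"100644 README\0"
    + bytes.fromhex("ce013625030ba8dba906f756967f9e9ca394464a")
    + b"40000 src\0"
    + bytes.fromhex("4b825dc642cb6eb9a060e54bf8d69288fbee4904")
)

for entry in parse_tree(SAMPLE_TREE):
    print(format_entry(entry))
```

Note that each entry name is a single path component: nothing in an
entry records a full path, which only emerges by walking nested trees.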
* Re: [PATCH v3] doc: add a explanation of Git's data model
2025-10-15 19:58 ` Junio C Hamano
@ 2025-10-16 15:19 ` Julia Evans
2025-10-16 16:54 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-10-16 15:19 UTC (permalink / raw)
To: Junio C Hamano, Julia Evans
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

On Wed, Oct 15, 2025, at 3:58 PM, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +[[commit]]
>> +commits::
>> + A commit contains these required fields
>> + (though there are other optional fields):
>> ++
>> +1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of
>> + the commit's base directory.
>
> "all the files' exact contents at the time of the commit" is what we
> mean here, and once readers know what a tree is, the above sentence
> would be understood as such, but "All the files" felt somewhat
> fuzzy. I wonder if presenting objects in bottom-up fashion makes it
> easier to see? Learn that a blob records exact content of a file,
> then learn that a tree records the set of paths with exact contents
> stored at these paths, and after that, learn that a commit records a
> tree, hence a snapshot of the whole set of contents. I dunno...

Will try "The contents of all the *files* in the commit..." to make it
a little more explicit that it's a snapshot.

>> +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
>> + regular commits have 1 parent, merge commits have 2 or more parents
>> +3. An *author* and the time the commit was authored
>> +4. A *committer* and the time the commit was committed.
>> + If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit,
>> + then they will be the author and you'll be the committer.
>
> It felt a bit odd to single-out cherry-pick here.
>
> I think the important thing to become aware of for the readers at
> this point is that the author and committer can be different people,
> and it does not matter how one commits somebody else's patch at the
> mechanical level.
>
> Perhaps replace "If you cherry-pick..." with something like "note: a
> change authored by a person at some point in time can be committed
> by another person at a different time, and these fields are to
> record both persons' contributions separately", perhaps, if we
> really want to say more.

I'll just delete the comment about cherry-pick. I think it's already
obvious (from the fact that there are two different fields) that the
author and committer can be different (and happen at different times),
and if we don't want to explain why that might happen there's no need
to say more.

>> +Git does not store the diff for a commit: when you ask Git for a
>> +diff it calculates it on the fly.
>
> I think this is an attempt to demystify "are we really storing
> snapshot for each commit?" thing, but then "when you ask Git to show
> the commit, it calculates the diff from its parent on the fly" might
> achieve that better, perhaps?

Sure, can change it to that.

>> +[[tree]]
>> +trees::
>> + A tree is how Git represents a directory. It lists, for each item in
>> + the tree:
>> ++
>> +[[file-mode]]
>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>> + permissions, but Git's modes are much more limited. Git only supports these file modes:
>> ++
>> + - `100644`: regular file (with type `blob`)
>> + - `100755`: executable file (with type `blob`)
>> + - `120000`: symbolic link (with type `blob`)
>> + - `040000`: directory (with type `tree`)
>> + - `160000`: gitlink, for use with submodules (with type `commit`)
>
> It is not really "supporting" file modes. Rather, Git only records
> 5 kinds of entities associated with each path in a tree object, and
> uses numbers that remotely resemble POSIX file modes to represent
> these 5 kinds.
>
> Perhaps "supports" -> "uses"?

"Uses" sounds good to me.

>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> + or <<commit,`commit`>> (a Git submodule, which is a
>> + commit from a different Git repository)
>> +3. The <<object-id,*object ID*>>
>> +4. The *filename*
>
> Here it may be worth noting that this "filename" is a single
> pathname component (roughly, what you would see in non-recursive
> "ls"). In other words, it may be a directory name.
>
> I wonder if we need to say "<blob> (a file, or a symbolic link)"?

I'm inclined to leave this alone because arguably a symbolic link is a
file but I don't feel strongly about this.

>> +[[blob]]
>> +blobs::
>> + A blob is how Git represents a file. A blob object contains the
>> + file's contents.
>
> "represents a file" hints as if the thing may know its name, but
> that is not the case (its name is given only by surrounding tree).
>
> "A blob is how Git represents uninterpreted series of bytes, and
> most commonly used to store file's contents." or something, perhaps?

I'll say "A blob is how Git represents a file's contents", unless Git has
another use for blobs that I don't know about (I think it's not
that much of a stretch to say that a symbolic link is a special kind
of file where the "contents" are the link destination).

I think it's always clearer to be more specific when possible, if there's only
one purpose for blobs it's unnecessary (and IMO a bit misleading, because
it makes the reader wonder if there are other purposes that they should
know about) to say that blobs can be used to store any arbitrary bytes for
any purpose. If there is another purpose I think we should give an
example.

>> +When you make a new commit, Git only needs to store new versions of
>> +files which were changed in that commit. This means that commits
>> +can use relatively little disk space even in a very large repository.
>
> That invites the "aren't we storing a delta after all, then?"
> confusion.
>
> "Git only needs to newly store new versions of files and
> directories. Files and directories that were not modified by the
> commit are shared with its parent commit".

I agree it makes it sound a little bit like we're storing a delta.
Will think about how to phrase this differently.

>> +NOTE: All of the examples in this section were generated with
>> +`git cat-file -p <object-id>`, which shows the contents of a Git object.
>
> Was this necessary to say this? Blobs, Commits, and Tags are
> textual, so "-p" does very minimum thing, but Trees are binary
> garbage, so "-p" output is heavily massaged version of the contents.

Ah, I didn't know how trees were stored, thanks. I can remove "which
shows the contents of a Git object", people can read the man page for
`git cat-file` if they want details.

>> +[[branch]]
>> +branches: `refs/heads/<name>`::
>> + A branch is a name for a commit ID.
>
> Well a commit ID is an alternative way to refer to a commit object
> *name*, so it is a bit strange to say "a name for a commit ID".
>
> Perhaps "A branch ref stores a commit ID." is better?

I think I'll leave this alone, none of the many test readers reported
being confused by it.

>> +[[tag]]
>> +tags: `refs/tags/<name>`::
>> + A tag is a name for a commit ID, tag object ID, or other object ID.
>
> Likewise. "A tag ref stores any kind of object ID, but commonly
> they are commit objects or tag objects"
>
>> + Tags that reference a tag object ID are called "annotated tags",
>> + because the tag object contains a tag message.
>> + Tags that reference a commit, blob, or tree ID are
>> + called "lightweight tags".
>> ++
>> +Even though branches and tags are both "a name for a commit ID", Git
>> +treats them very differently.
>> +Branches are expected to change over time: when you make a commit, Git
>> +will update your <<HEAD,current branch>> to reference the new changes.
>
> This sentence talks about branch moving because it advances with
> more commits. Did we want to say "HEAD" here before we explain what
> it is? "HEAD" can move for another reason (i.e. branch switching)
> and using "HEAD" in the context of talking about growing history
> might invite confusion. I dunno.

The text says "current branch", it just cross-references the "HEAD"
section in the HTML version if someone wants to read about what is
meant by "current branch".

>> +Tags are usually not changed after they're created.
>
>> +[[HEAD]]
>> +HEAD: `HEAD`::
>> + `HEAD` is where Git stores your current <<branch,branch>>.
>
> Hmm...
>
>> + `HEAD` can either be:
>> + 1. A symbolic reference to your current branch, for example `ref:
>> + refs/heads/main` if your current branch is `main`.
>> + 2. A direct reference to a commit ID. This is called "detached HEAD
>> + state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more.
>
> These two are very reasonable. But "your current <<branch>>" refers
> only to #1.
>
>     `HEAD` refers to the commit your current work is based on, and
>     it is the commit that will become the first parent of the commit
>     once your current work is concluded. It can either be ...
>
> perhaps.

I like the idea of mentioning that HEAD will be the parent commit of
any commit that you make. Will think about how to incorporate that, and
about how to resolve "`HEAD` is where Git stores your current
<<branch,branch>>." being not exactly true.

>> +[[remote-tracking-branch]]
>> +remote tracking branches: `refs/remotes/<remote>/<branch>`::
>
> Please always write "remote-tracking" with a hyphen (see glossary).

Will fix.

>> + A remote-tracking branch is a name for a commit ID.
>
> Either "A remote-tracking branch stores a commit object name" or "A
> remote-tracking branch points at a commit object", followed by "in
> order to keep track of the last-known state of ..." in a single
> sentence.

I see that you don't like the "name for a commit ID" phrasing :)
Maybe there's another way to say it, though again none of the test
readers said they were confused by this or disagreed with the phrasing.

>> +[[index]]
>> +THE INDEX
>> +---------
>> +
>> +The index, also known as the "staging area", contains a list of every
>> +file in the repository and its contents. When you commit, the files in
>> +the index are used as the files in the next commit.
>
> It is hard to define what "every file in the repository" really is.
> Files that you removed last week do not count. Files added in your
> wip branch elsewhere are obviously not yet in the index when you are
> working on your primary branch.

Agreed, I'm not so happy with "every file in the repository" either.
My intent was to make it clear that it's not "just the files you
`git add`ed". I'll think about a different phrasing that communicates
the same thing. Perhaps mentioning how it relates to the HEAD commit
would help.

>> +You can add files to the index or update the version in the index with
>> +linkgit:git-add[1]. Adding a file to the index or updating its version
>> +is called "staging" the file for commit.
>
> It may be worth to clarify by saying "staging the contents of the
> file" (you can edit the file further after you "git add") that you
> are taking a snapshot at the time you ran "git add", instead of
> giving a general instruction to "keep an eye on this file" to Git
> (if it were, then the next "git commit" would behave more like "git
> add -u && git commit").

Maybe, will think about this too.

>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Git stores a history called a "reflog" for every branch, remote-tracking
>> +branch, and HEAD. This means that if you make a mistake and "lose" a
>> +commit, you can generally recover the commit ID by running
>> +`git reflog <reference>`.
>> +
>> +Each reflog entry has:
>> +
>> +1. Before/after *commit IDs*
>> +2. *User* who made the change, for example `Maya <maya@example.com>`
>> +3. *Timestamp* when the change was made
>> +4. *Log message*, for example `pull: Fast-forward`
>> +
>> +Reflogs only log changes made in your local repository.
>> +They are not shared with remotes.
>
> Technically it is correct that before/after are recorded, but there
> is no way for the end-user to interact with them. "git reflog"
> walking these entries will only give you a single commit object.
> The username is also recorded, but I do not think of a way to view
> the information, let alone using it for querying.

You can view the username with git reflog --format="%gn <%ge>"
(according to `man git-log`). I don't see a way to view the old commit
ID. Perhaps we should include the username but not the old commit ID
then. I'm not sure.

> Especially when the reftable backend is in use, you cannot even read
> the raw representation like you can do with files backend (where
> something like "cat .git/logs/HEAD" would let you peek into the
> details). I am not sure if we want to go into this detail.
>
> Perhaps drop everything after "Each reflog entry has:"?

Perhaps we could give a stripped down list, like

1. The new *commit ID* the reference points to
2. *Timestamp* when the change was made
3. *Log message*, for example `pull: Fast-forward`

And then instead of giving the contents of `.git/logs/HEAD` (which as
you say includes some fields that there's no way for the user to
interact with), we could just show the output of `git reflog main`,
like this:

    You can view the reflog with `git reflog`, for example here's the
    reflog for a `main` branch which has changed twice:

    $ git reflog main --date=iso --no-decorate
    750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README
    4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit

I added `--no-decorate` there because the decorations are a distraction
when talking about the data model.

This version omits the username which is a little weird (it is possible
to access the username) but mentioning the username is a little weird
too because it raises some questions that are hard to answer about what
that field is for, and you have to pass an obscure format string to
view it. Not sure what's best here.

>> +For example, here's how the reflog for `HEAD` in a repository with 2
>> +commits is stored:
>> +
>> +----
>> +0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400 commit (initial): Initial commit
>> +4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400 commit: Add README
>> +----

Thanks for the review.

- Julia

^ permalink raw reply [flat|nested] 89+ messages in thread
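As a rough illustration of the fields discussed above, here is a small
Python sketch that splits one raw reflog line of the kind quoted from
`.git/logs/HEAD` into its parts. It is only a sketch under assumptions:
it applies to the files backend (not reftable), it assumes 40-digit
SHA-1 IDs, and it assumes the log message is separated from the
timezone by a tab. `ReflogEntry` and `parse_reflog_line` are
hypothetical names, not part of Git.

```python
import re
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old_oid: str    # ID the ref pointed at before the change
    new_oid: str    # ID the ref points at after the change
    name: str
    email: str
    timestamp: int  # seconds since the Unix epoch
    tz: str         # UTC offset such as "-0400"
    message: str


# Assumed layout: "<old> <new> <name> <email> <timestamp> <tz>TAB<message>"
_LINE = re.compile(
    r"(?P<old>[0-9a-f]{40}) (?P<new>[0-9a-f]{40}) "
    r"(?P<name>.*) <(?P<email>[^<>]*)> "
    r"(?P<ts>\d+) (?P<tz>[+-]\d{4})"
    r"(?:\t(?P<msg>.*))?"
)


def parse_reflog_line(line: str) -> ReflogEntry:
    """Split one raw reflog line into its fields."""
    m = _LINE.fullmatch(line.rstrip("\n"))
    if m is None:
        raise ValueError(f"unrecognized reflog line: {line!r}")
    return ReflogEntry(
        m["old"], m["new"], m["name"], m["email"],
        int(m["ts"]), m["tz"], m["msg"] or "",
    )


# The two sample lines from the patch, with a tab before the message.
SAMPLE = (
    "0000000000000000000000000000000000000000 "
    "4ccb6d7b8869a86aae2e84c56523f8705b50c647 "
    "Maya <maya@example.com> 1759173408 -0400\tcommit (initial): Initial commit\n"
    "4ccb6d7b8869a86aae2e84c56523f8705b50c647 "
    "750b4ead9c87ceb3ddb7a390e6c7074521797fb3 "
    "Maya <maya@example.com> 1759173425 -0400\tcommit: Add README\n"
)

ENTRIES = [parse_reflog_line(line) for line in SAMPLE.splitlines()]
for entry in ENTRIES:
    print(entry.new_oid[:7], entry.message)
```

In the sample, the new ID of each entry is the old ID of the next one,
which is why a stripped-down list with only the new commit ID loses
little information.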
* Re: [PATCH v3] doc: add a explanation of Git's data model
2025-10-16 15:19 ` Julia Evans
@ 2025-10-16 16:54 ` Junio C Hamano
2025-10-16 18:59 ` Julia Evans
0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-10-16 16:54 UTC (permalink / raw)
To: Julia Evans
Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

>>> +[[tree]]
>>> +trees::
>>> + A tree is how Git represents a directory. It lists, for each item in
>>> + the tree:
>>> ++
>>> +[[file-mode]]
>>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>>> + permissions, but Git's modes are much more limited. Git only supports these file modes:
>>> ++
>>> + - `100644`: regular file (with type `blob`)
>>> + - `100755`: executable file (with type `blob`)
>>> + - `120000`: symbolic link (with type `blob`)
>>> + - `040000`: directory (with type `tree`)
>>> + - `160000`: gitlink, for use with submodules (with type `commit`)
>>
>> It is not really "supporting" file modes. Rather, Git only records
>> 5 kinds of entities associated with each path in a tree object, and
>> uses numbers that remotely resemble POSIX file modes to represent
>> these 5 kinds.
>>
>> Perhaps "supports" -> "uses"?
>
> "Uses" sounds good to me.

Also "much more limited" is misleading. We only represent 5 kinds
of things, so we use only 5 mode-bits-looking numbers.

>>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>>> + or <<commit,`commit`>> (a Git submodule, which is a
>>> + commit from a different Git repository)
>>> +3. The <<object-id,*object ID*>>
>>> +4. The *filename*
>>
>> Here it may be worth noting that this "filename" is a single
>> pathname component (roughly, what you would see in non-recursive
>> "ls"). In other words, it may be a directory name.

Comments?

>>> +[[blob]]
>>> +blobs::
>>> + A blob is how Git represents a file. A blob object contains the
>>> + file's contents.
>>
>> "represents a file" hints as if the thing may know its name, but
>> that is not the case (its name is given only by surrounding tree).
>>
>> "A blob is how Git represents uninterpreted series of bytes, and
>> most commonly used to store file's contents." or something, perhaps?
>
> I'll say "A blob is how Git represents a file's contents", unless Git has
> another use for blobs that I don't know about (I think it's not
> that much of a stretch to say that a symbolic link is a special kind
> of file where the "contents" are the link destination).

A few configuration variables like mailmap.blob name a blob object,
for which _only_ its contents, i.e., the sequence of bytes, matter
and where they originally were stored does not matter.

But we are falling into the area of tautology, as any sequence of
bytes can be stored in a file so they can be called "contents of a
file". But the point is that these bytes do not have to be stored
to become a blob (think: "git hash-object -t blob -w --stdin").

> I think it's always clearer to be more specific when possible, if there's only
> one purpose for blobs it's unnecessary (and IMO a bit misleading, because
> it makes the reader wonder if there are other purposes that they should
> know about) to say that blobs can be used to store any arbitrary bytes for
> any purpose.

I do not think describing other use cases is unnecessary. Even if
we limit ourselves to discuss a single purpose for blob, i.e. to
represent the contents of a file, we should stress that blob is to
store _only_ contents, and not other aspects of the file (e.g., in
what paths with what mode), and that is where my reaction to "how
Git represents a file" comes from.

>>> +[[branch]]
>>> +branches: `refs/heads/<name>`::
>>> + A branch is a name for a commit ID.
>>
>> Well a commit ID is an alternative way to refer to a commit object
>> *name*, so it is a bit strange to say "a name for a commit ID".
>>
>> Perhaps "A branch ref stores a commit ID." is better?
>
> I think I'll leave this alone, none of the many test readers reported
> being confused by it.

Would a confused person report that they are confused? ;-)

> I see that you don't like the "name for a commit ID" phrasing :)
> Maybe there's another way to say it, though again none of the test
> readers said they were confused by this or disagreed with the phrasing.

Yes, I get that given "refs/heads/main", you want to say "main" is
one of the ways to have repo_get_oid() to yield the commit object,
and you are using "name" in that sense, but it is more like a ref
can be used to name an object. It is *not* the name of the object,
because the object can have other names, and more importantly, it
(i.e., to give a name for an object) is not the only thing that a
ref can do. And that is why I do not like that phrasing, combined
with the target of giving that name is spelled "a commit ID". The
commit ID is already another way to name the thing the refname can
be also used to name: a commit object. A commit object and a commit
object name are different things. The latter is a name that can
refer to the former. And a ref can be used just like the latter to
refer to the former (i.e. "commit object").

By the way, I do like the way many of your responses are "will think
about it more", not "I'll take your version".

Very much appreciated.

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3] doc: add a explanation of Git's data model
2025-10-16 16:54 ` Junio C Hamano
@ 2025-10-16 18:59 ` Julia Evans
2025-10-16 20:48 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-10-16 18:59 UTC (permalink / raw)
To: Junio C Hamano
Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

On Thu, Oct 16, 2025, at 12:54 PM, Junio C Hamano wrote:
> "Julia Evans" <julia@jvns.ca> writes:
>
>>>> +[[tree]]
>>>> +trees::
>>>> + A tree is how Git represents a directory. It lists, for each item in
>>>> + the tree:
>>>> ++
>>>> +[[file-mode]]
>>>> +1. The *file mode*, for example `100644`. The format is inspired by Unix
>>>> + permissions, but Git's modes are much more limited. Git only supports these file modes:
>>>> ++
>>>> + - `100644`: regular file (with type `blob`)
>>>> + - `100755`: executable file (with type `blob`)
>>>> + - `120000`: symbolic link (with type `blob`)
>>>> + - `040000`: directory (with type `tree`)
>>>> + - `160000`: gitlink, for use with submodules (with type `commit`)
>>>
>>> It is not really "supporting" file modes. Rather, Git only records
>>> 5 kinds of entities associated with each path in a tree object, and
>>> uses numbers that remotely resemble POSIX file modes to represent
>>> these 5 kinds.
>>>
>>> Perhaps "supports" -> "uses"?
>>
>> "Uses" sounds good to me.
>
> Also "much more limited" is misleading. We only represent 5 kinds
> of things, so we use only 5 mode-bits-looking numbers.

What does it mislead the reader to think? My goal is to communicate that
if you want to tell Git to remember that a file's Unix permissions were
700, that's not possible.

>>>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>>>> + or <<commit,`commit`>> (a Git submodule, which is a
>>>> + commit from a different Git repository)
>>>> +3. The <<object-id,*object ID*>>
>>>> +4. The *filename*
>>>
>>> Here it may be worth noting that this "filename" is a single
>>> pathname component (roughly, what you would see in non-recursive
>>> "ls"). In other words, it may be a directory name.
>
> Comments?

Oops, missed this in my first pass.

I looked at the man pages for a couple of commands ("mv", "cp")
and it looks like it's normal to refer to files and directories jointly
as "files", or refer to them as having a "file name". So I think it's okay
to call it a "file name" even if the "file" may be a directory.

>>>> +[[blob]]
>>>> +blobs::
>>>> + A blob is how Git represents a file. A blob object contains the
>>>> + file's contents.
>>>
>>> "represents a file" hints as if the thing may know its name, but
>>> that is not the case (its name is given only by surrounding tree).
>>>
>>> "A blob is how Git represents uninterpreted series of bytes, and
>>> most commonly used to store file's contents." or something, perhaps?
>>
>> I'll say "A blob is how Git represents a file's contents", unless Git has
>> another use for blobs that I don't know about (I think it's not
>> that much of a stretch to say that a symbolic link is a special kind
>> of file where the "contents" are the link destination).
>
> A few configuration variables like mailmap.blob name a blob object,
> for which _only_ its contents, i.e., the sequence of bytes, matter
> and where they originally were stored does not matter.
>
> But we are falling into the area of tautology, as any sequence of
> bytes can be stored in a file so they can be called "contents of a
> file". But the point is that these bytes do not have to be stored
> to become a blob (think: "git hash-object -t blob -w --stdin").

I'm trying to think through what the goal of explaining the nature of
a "blob" is.

To me describing blobs primarily as "bytes" makes it sound a bit like
"Git will treat this as opaque binary data, Git will not attempt to
interpret the contents of a blob in any way" (which is certainly true
for many blob storage systems!).

But it's not true that Git treats blobs as opaque binary data, unlike
other blob storage systems, Git has diff and merge algorithms to
interpret the contents of the file to some extent and try to do useful
things with them.

Another goal we could have is to be clear that there are no limits to
what kind of files you can store in Git: you can equally well store text
files and binary files.

>> I think it's always clearer to be more specific when possible, if there's only
>> one purpose for blobs it's unnecessary (and IMO a bit misleading, because
>> it makes the reader wonder if there are other purposes that they should
>> know about) to say that blobs can be used to store any arbitrary bytes for
>> any purpose.
>
> I do not think describing other use cases is unnecessary. Even if
> we limit ourselves to discuss a single purpose for blob, i.e. to
> represent the contents of a file, we should stress that blob is to
> store _only_ contents, and not other aspects of the file (e.g., in
> what paths with what mode), and that is where my reaction to "how
> Git represents a file" comes from.

I think it does make sense to say the blob stores only the contents,
though IMO that's fairly clear already since we've already explained
where the other parts of the file are stored by the time we get to
explaining "blob".

>>>> +[[branch]]
>>>> +branches: `refs/heads/<name>`::
>>>> + A branch is a name for a commit ID.
>>>
>>> Well a commit ID is an alternative way to refer to a commit object
>>> *name*, so it is a bit strange to say "a name for a commit ID".
>>>
>>> Perhaps "A branch ref stores a commit ID." is better?
>>
>> I think I'll leave this alone, none of the many test readers reported
>> being confused by it.
>
> Would a confused person report that they are confused? ;-)

Everyone leaving feedback gets a prompt something like this asking them
to categorize their feedback, and "I'm confused" is one of the options.

https://jvns.ca/images/feedback-categories.png

I definitely got many "I'm confused" and "I have a question" comments
about other things that were confusing to readers.

>> I see that you don't like the "name for a commit ID" phrasing :)
>> Maybe there's another way to say it, though again none of the test
>> readers said they were confused by this or disagreed with the phrasing.
>
> Yes, I get that given "refs/heads/main", you want to say "main" is
> one of the ways to have repo_get_oid() to yield the commit object,
> and you are using "name" in that sense, but it is more like a ref
> can be used to name an object. It is *not* the name of the object,
> because the object can have other names, and more importantly, it
> (i.e., to give a name for an object) is not the only thing that a
> ref can do.

That's interesting, what else can a ref do other than to give a name to
an object?

> And that is why I do not like that phrasing, combined
> with the target of giving that name is spelled "a commit ID". The
> commit ID is already another way to name the thing the refname can
> be also used to name: a commit object. A commit object and a commit
> object name are different things. The latter is a name that can
> refer to the former.

I'm curious about why it's important to you to make this distinction
between a commit ID and a commit object. To me the commit ID and the
commit object come as a package, since the commit ID is calculated from
the commit object.

> And a ref can be used just like the latter to
> refer to the former (i.e. "commit object").

> By the way, I do like the way many of your responses are "will think
> about it more", not "I'll take your version".
>
> Very much appreciated.

I'm glad to hear that! It's a fun puzzle to figure out how to express
things clearly and accurately and concisely.

- Julia

^ permalink raw reply [flat|nested] 89+ messages in thread
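The point debated above, that a blob holds only contents and that an
object ID is calculated from the object, can be sketched in a few lines
of Python. This assumes the SHA-1 object format, where a blob's ID is
the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the
content. `blob_id` and the file names below are hypothetical, chosen
only for illustration.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git would give a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two made-up paths holding the same bytes. Neither the path nor the
# mode is an input to the hash, so both map to the same blob.
FILES = {
    "README": b"hello\n",
    "docs/copy.txt": b"hello\n",
}

IDS = {path: blob_id(data) for path, data in FILES.items()}
for path, oid in IDS.items():
    print(oid, path)
```

The bytes never need to exist as a file on disk: they only need to be
handed to the hashing step, which is what feeding content on standard
input to `git hash-object --stdin` does.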
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-16 18:59 ` Julia Evans @ 2025-10-16 20:48 ` Junio C Hamano 0 siblings, 0 replies; 89+ messages in thread From: Junio C Hamano @ 2025-10-16 20:48 UTC (permalink / raw) To: Julia Evans Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt "Julia Evans" <julia@jvns.ca> writes: >>>> It is not really "supporting" file modes. Rather, Git only records >>>> 5 kinds of entities associated with each path in a tree object, and >>>> uses numbers taht remotely resemble POSIX file modes to represent >>>> these 5 kinds. >>>> >>>> Perhaps "supports" -> "uses"? >>> >>> "Uses" sounds good to me. >> >> Also "much more limited" is misleading. We only represent 5 kinds >> of things, so we use only 5 mode-bits-looking numbers. > > What does it mislead the reader to think? My goal is to communicate that > if you want to tell Git to remember that a file's Unix permissions were > 700, that's not possible. Yes, rewording "support" to "use" is one good way to do so. But "limited" implies that lifting the limitation would allow you to store more. That is the misguided thinking I want to avoid here. There is no limitations to lift. We only differentiate 5 kinds hence we only use 5 permission-bit-looking numbers. We do not differenciate a file with permission 0600 from aother with 0644. >>>> Here it may be worth noting that this "filename" is a single >>>> pathname component (roughly, what you would see in non-recursive >>>> "ls"). In other words, it may be a directory name. >> >> Comments? > > Oops, missed this in my first pass. > > I looked at them man pages for a couple of commands ("mv", "cp") > and it looks like it's normal to refer to files and directories jointly > as "files", or refer to them as having a "file name". So I think it's okay > to call it a "file name" even if the "file" may be a directory. Ah, not that part. 
I was more interested in seeing how we express "in these
names, there won't be any slashes".

>>>>> +[[blob]]
>>>>> +blobs::

By the way, I kept forgetting to mention, but why are all of these
listed terms plural (not just object types but also "branches" and
"tags")?

> But it's not true that Git treats blobs as opaque binary data, unlike
> other blob storage systems, Git has diff and merge algorithms to
> interpret the contents of the file to some extent and try to do useful
> things with them.

Yes, but diff and merge happen way above the object layer, where
the question "what is blob" has a meaning. And this "blobs are
recorded in a tree together with other blobs and trees recursively,
and the single top-level tree describes a snapshot of a single state,
which is recorded in a commit" data model description is exactly
about the lower-level object layer.

> Another goal we could have is to be clear that there are no limits to
> what kind of files you can store in Git: you can equally well store text
> files and binary files.

That is a natural consequence of blobs being nothing more than an
uninterpreted sequence of bytes.

>>> I see that you don't like the "name for a commit ID" phrasing :)
>>> Maybe there's another way to say it, though again none of the test
>>> readers said they were confused by this or disagreed with the phrasing.
>>
>> Yes, I get that given "refs/heads/main", you want to say "main" is
>> one of the ways to have repo_get_oid() to yield the commit object,
>> and you are using "name" in that sense, but it is more like a ref
>> can be used to name an object. It is *not* the name of the object,
>> because the object can have other names, and more importantly, it
>> (i.e., to give a name for an object) is not the only thing that a
>> ref can do.
>
> That's interesting, what else can a ref do other than to give a name to
> an object?

For example, a ref is a key to reflog, so obviously it is more than
just a single commit.
If you say "git checkout main" and "git checkout main^{commit}",
they refer to the same commit, but the former is a sign that you
want the next commit you make from that state to grow that branch
(and not any other branch you may have that happens to be pointing
at the same commit), while the other one is not.

>> And that is why I do not like that phrasing, combined
>> with the target of giving that name is spelled "a commit ID". The
>> commit ID is already another way to name the thing the refname can
>> be also used to name: a commit object. A commit object and a commit
>> object name are different things. The latter is a name that can
>> refer to the former.
>
> I'm curious about why it's important to you to make this distinction
> between a commit ID and a commit object. To me the commit ID and the
> commit object come as a package, since the commit ID is calculated from
> the commit object.

It may be the most natural name for the commit object, but that does
not mean the name is the object. Let's not go philosophical.

^ permalink raw reply [flat|nested] 89+ messages in thread
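A minimal Python sketch of the name-versus-object distinction discussed
above. It assumes a repository that uses SHA-1 object names, and
`object_id` is a hypothetical helper name, not a Git API. The object is
the bytes; the ID is computed from a short header plus those bytes, so
the ID can name the object without being the object.

```python
import hashlib


def object_id(obj_type: str, contents: bytes) -> str:
    """Return the SHA-1 object name for an object of the given type.

    Hypothetical helper for illustration; assumes a SHA-1 repository.
    """
    # Git hashes a header "<type> <size>\0" followed by the raw contents.
    header = f"{obj_type} {len(contents)}\0".encode("ascii")
    return hashlib.sha1(header + contents).hexdigest()


# The blob is an uninterpreted sequence of bytes; the ID is derived from it.
empty_blob = b""
print(object_id("blob", empty_blob))
# → e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Changing a single byte of the contents gives a different ID, which is
one way to see why an existing object cannot be edited in place.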
* Re: [PATCH v3] doc: add a explanation of Git's data model 2025-10-14 21:12 ` [PATCH v3] " Julia Evans via GitGitGadget 2025-10-15 6:24 ` Patrick Steinhardt 2025-10-15 19:58 ` Junio C Hamano @ 2025-10-16 15:24 ` Kristoffer Haugsbakk 2025-10-20 16:37 ` Kristoffer Haugsbakk 2025-10-27 19:32 ` [PATCH v4] doc: add an " Julia Evans via GitGitGadget 4 siblings, 0 replies; 89+ messages in thread From: Kristoffer Haugsbakk @ 2025-10-16 15:24 UTC (permalink / raw) To: Josh Soref, git; +Cc: D. Ben Knoble, Patrick Steinhardt, Julia Evans > [PATCH v3] doc: add a explanation of Git's data model s/a explanation/an explanation/ On Tue, Oct 14, 2025, at 23:12, Julia Evans via GitGitGadget wrote: > From: Julia Evans <julia@jvns.ca> > > Git very often uses the terms "object", "reference", or "index" in its > documentation. >[snip] ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3] doc: add a explanation of Git's data model
2025-10-14 21:12 ` [PATCH v3] " Julia Evans via GitGitGadget
` (2 preceding siblings ...)
2025-10-16 15:24 ` Kristoffer Haugsbakk
@ 2025-10-20 16:37 ` Kristoffer Haugsbakk
2025-10-20 18:01 ` Junio C Hamano
2025-10-27 19:32 ` [PATCH v4] doc: add an " Julia Evans via GitGitGadget
4 siblings, 1 reply; 89+ messages in thread
From: Kristoffer Haugsbakk @ 2025-10-20 16:37 UTC (permalink / raw)
To: Josh Soref, git; +Cc: D. Ben Knoble, Patrick Steinhardt, Julia Evans

On Tue, Oct 14, 2025, at 23:12, Julia Evans via GitGitGadget wrote:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
>
> However, it's hard to find a clear explanation of these terms and how
> they relate to each other in the documentation. The closest candidates
> currently are:
>[snip]

For some reason I get an error with `Documentation/doc-diff` when run
against 446c8a72 (Merge branch 'je/doc-data-model' into seen,
2025-10-16). Here I’m comparing with `master`.

    $ ./doc-diff 4253630c6f07a4bdcc9aa62a50e26a4d466219d1 446c8a72be6cf1b6121e643590a9acacfc21c5fb
    Previous HEAD position was b20e48e0232 doc: add a explanation of Git's data model
    HEAD is now at 446c8a72be6 Merge branch 'je/doc-data-model' into seen
    make: Entering directory '<git repo>/Documentation/tmp-doc-diff/worktree'
    install -d -m 755 '<git repo>/Documentation/tmp-doc-diff/installed/446c8a72be6cf1b6121e643590a9acacfc21c5fb+/home/kristoffer/share/man/man3'
    (cd perl/build/man/man3 && tar cf - .) | \
    (cd '<git repo>/Documentation/tmp-doc-diff/installed/446c8a72be6cf1b6121e643590a9acacfc21c5fb+/home/kristoffer/share/man/man3' && umask 022 && tar xof -)
    make -C Documentation install-man
    make[1]: Entering directory '<git repo>/Documentation/tmp-doc-diff/worktree/Documentation'
    GEN cmd-list.made
    GEN doc.dep
    GEN asciidoc.conf
    ASCIIDOC git-add.xml
    ASCIIDOC git-config.xml
    ASCIIDOC git-diff-tree.xml
    ASCIIDOC git-fast-import.xml
    ASCIIDOC git-fetch.xml
    ASCIIDOC git-fsck.xml
    ASCIIDOC git-log.xml
    ASCIIDOC git-merge-tree.xml
    ASCIIDOC git-patch-id.xml
    ASCIIDOC git-pull.xml
    ASCIIDOC git-push.xml
    ASCIIDOC git-replay.xml
    ASCIIDOC git-repo.xml
    ASCIIDOC git-rev-list.xml
    ASCIIDOC git-rev-parse.xml
    ASCIIDOC git-shortlog.xml
    ASCIIDOC git-show.xml
    ASCIIDOC git-sparse-checkout.xml
    ASCIIDOC git-stash.xml
    ASCIIDOC git-tag.xml
    ASCIIDOC git-worktree.xml
    ASCIIDOC git.xml
    ASCIIDOC gitformat-loose.xml
    ASCIIDOC gitformat-pack.xml
    ASCIIDOC gitcli.xml
    ASCIIDOC gitcredentials.xml
    XMLTO gitdatamodel.7
    XMLTO git-add.1
    XMLTO git-diff-tree.1
    XMLTO git-fast-import.1
    XMLTO git-fetch.1
    XMLTO git-fsck.1
    xmlto: <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate (status 3)
    xmlto: Fix document syntax or use --skip-validation option
    <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:71: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
    <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:96: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
    <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:397: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
    Document <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate
    make[1]: *** [Makefile:380: gitdatamodel.7] Error 13
    make[1]: *** Waiting for unfinished jobs....
    make[1]: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree/Documentation'
    make: *** [Makefile:3676: install-man] Error 2
    make: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree'

The syntax looks correct. So I don’t know what is wrong. `make html`
works *and* makes the link.

At first look it might be to do with the anchor on a definition list but
I tried removing the anchors and expected to get an error for `blob`
next. But that didn’t happen.

In short I don’t see what is special about `tree`.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v3] doc: add a explanation of Git's data model
2025-10-20 16:37 ` Kristoffer Haugsbakk
@ 2025-10-20 18:01 ` Junio C Hamano
0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-10-20 18:01 UTC (permalink / raw)
To: Kristoffer Haugsbakk
Cc: Josh Soref, git, D. Ben Knoble, Patrick Steinhardt, Julia Evans

"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> writes:

> xmlto: <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate (status 3)
> xmlto: Fix document syntax or use --skip-validation option
> <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:71: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
> <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:96: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
> <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml:397: element link: validity error : IDREF attribute linkend references an unknown ID "tree"
> Document <git repo>/Documentation/tmp-doc-diff/worktree/Documentation/gitdatamodel.xml does not validate
> make[1]: *** [Makefile:380: gitdatamodel.7] Error 13
> make[1]: *** Waiting for unfinished jobs....
> make[1]: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree/Documentation'
> make: *** [Makefile:3676: install-man] Error 2
> make: Leaving directory '<git repo>/Documentation/tmp-doc-diff/worktree'
>
> The syntax looks correct. So I don’t know what is wrong. `make html`
> works *and* makes the link.
>
> At first look it might be to do with the anchor on a definition list but
> I tried removing the anchors and expected to get an error for `blob`
> next. But that didn’t happen.
>
> In short I don’t see what is special about `tree`.

This seems to work around it without breaking .html generation too
badly for AsciiDoc and without breaking .7/.html generation for
Asciidoctor. Generation of .7 was broken with AsciiDoc so we cannot
complain even if the result is suboptimal, but the generated manpage
with this patch using AsciiDoc did not look too bad, either.

I do not know AsciiDoc internals (and I am not particularly
interested to learn it now), but I am guessing that the bug is that
when it sees [[tree]], it tries to find an element to put id="tree",
but before it finds any appropriate one, it sees [[file-mode]] and
uses the element it finds to hold id="file-mode", losing sight of
the need to add id="tree" somewhere.

 Documentation/gitdatamodel.adoc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
index f49574dfae..7232fe3861 100644
--- a/Documentation/gitdatamodel.adoc
+++ b/Documentation/gitdatamodel.adoc
@@ -83,8 +83,10 @@ trees::
 A tree is how Git represents a directory. It lists, for each item in
 the tree:
 +
+1. The *file mode*, for example `100644`.
++
 [[file-mode]]
-1. The *file mode*, for example `100644`. The format is inspired by Unix
+The format is inspired by Unix
 permissions, but Git's modes are much more limited. Git only supports these file modes:
 +
 - `100644`: regular file (with type `blob`)
--
2.51.1-556-g06b2a500e9

^ permalink raw reply related [flat|nested] 89+ messages in thread
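The mode list touched by the hunk above, together with the earlier
point in this thread that Git only distinguishes 5 kinds of tree
entries, can be sketched as a small lookup table. This is an
illustrative Python sketch, not Git's implementation:
`TREE_ENTRY_KINDS` and `entry_kind` are hypothetical names, and the
`120000` symbolic-link entry comes from Git's documented tree modes
rather than from the lines quoted in this message.

```python
# The 5 mode numbers a tree entry can carry, mapped to the object type
# each one implies and a short description.
TREE_ENTRY_KINDS = {
    "100644": ("blob", "regular file"),
    "100755": ("blob", "executable file"),
    "120000": ("blob", "symbolic link"),
    "040000": ("tree", "directory"),
    "160000": ("commit", "gitlink (submodule)"),
}


def entry_kind(mode: str) -> tuple[str, str]:
    """Return (object type, description) for a tree-entry mode string."""
    try:
        return TREE_ENTRY_KINDS[mode]
    except KeyError:
        # Anything else, such as a Unix-style "100600", is not one of
        # the 5 kinds, so there is nothing for it to map to.
        raise ValueError(f"not a tree-entry mode: {mode}") from None


print(entry_kind("100755"))  # ('blob', 'executable file')
```

Modelling the modes as a closed set of keys, rather than as permission
bits, matches the point that there is no limitation to lift: `0600`
and `0644` are simply the same kind of entry.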
* [PATCH v4] doc: add an explanation of Git's data model 2025-10-14 21:12 ` [PATCH v3] " Julia Evans via GitGitGadget ` (3 preceding siblings ...) 2025-10-20 16:37 ` Kristoffer Haugsbakk @ 2025-10-27 19:32 ` Julia Evans via GitGitGadget 2025-10-27 21:54 ` Junio C Hamano 2025-10-30 20:32 ` [PATCH v5] " Julia Evans via GitGitGadget 4 siblings, 2 replies; 89+ messages in thread From: Julia Evans via GitGitGadget @ 2025-10-27 19:32 UTC (permalink / raw) To: git Cc: Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans, Julia Evans From: Julia Evans <julia@jvns.ca> Git very often uses the terms "object", "reference", or "index" in its documentation. However, it's hard to find a clear explanation of these terms and how they relate to each other in the documentation. The closest candidates currently are: 1. `gitglossary`. This makes a good effort, but it's an alphabetically ordered dictionary and a dictionary is not a good way to learn concepts. You have to jump around too much and it's not possible to present the concepts in the order that they should be explained. 2. `gitcore-tutorial`. This explains how to use the "core" Git commands. This is a nice document to have, but it's not necessary to learn how `update-index` works to understand Git's data model, and we should not be requiring users to learn how to use the "plumbing" commands if they want to learn what the term "index" or "object" means. 3. `gitrepository-layout`. This is a great resource, but it includes a lot of information about configuration and internal implementation details which are not related to the data model. It also does not explain how commits work. The result of this is that Git users (even users who have been using Git for 15+ years) struggle to read the documentation because they don't know what the core terms mean, and it's not possible to add links to help them learn more. Add an explanation of Git's data model. 
Some choices I've made in
deciding what "core data model" means:

1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
   if those are intended to be user facing or if they're more like
   internal implementation details.
2. Don't talk about submodules other than by mentioning how they
   relate to trees. This is because Git has a lot of special features,
   and explaining how they all work exhaustively could quickly go
   down a rabbit hole which would make this document less useful for
   understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
   (first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into
   implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
---
doc: Add a explanation of Git's data model

Changes in v2:

The biggest change is to remove all mentions of the .git directory, and
explain references in a way that doesn't refer to "directories" at all,
and instead talks about the "hierarchy" (from Kristoffer and Patrick's
reviews).

Also:

 * objects: Mention that an object ID is called an "object name", and
   update the glossary to include the term "object ID" (from Junio's
   review)
 * objects: Replace "SHA-1 hash" with "cryptographic hash" which is
   more accurate (from Patrick's review)
 * blobs: Made the explanation of git gc a little higher level and took
   some ideas from Patrick's suggested wording (from Patrick's and
   Kristoffer's reviews)
 * commits: Mention that tag objects and commits can optionally have
   other fields. I didn't mention the GPG signature specifically, but
   don't have any objections to adding it. (from Patrick and Junio's
   reviews)
 * commits: Remove one of the mentions of git gc, since it perhaps
   opens up too much of a rabbit hole: "how does git gc decide which
   commits to clean up?".
   (from Kristoffer's review)
 * tag objects: Add an example of how a tag object is represented (from
   user feedback on the draft)
 * index: Use the term "file mode" instead of "permissions", and list
   all allowed file modes (from Patrick's review)
 * index: Use "stage number" instead of "number" for index entries
   (from Patrick's review)
 * reflogs: Remove "any ref can be logged", it raises some questions of
   "how do you tell Git to log a ref that it isn't normally logging?"
   and my guess is that it's uncommon to ask Git to log more refs. I
   don't think it's a "lie" to omit this but I can bring it back if
   folks disagree. (from Patrick's review)
 * reflogs: Fix an error I noticed in the explanation of reflogs: tags
   aren't logged by default and remote-tracking branches are, according
   to man git-config
 * branches and tags: Be clearer about how branches are usually updated
   (by committing), and make it a little more obvious that only
   branches can be checked out. This is a bit tricky because using the
   word "check out" introduces a rabbit hole that I want to avoid (what
   does "check out" mean?). I've dealt with this by just talking about
   the "current branch" (HEAD) since that is defined here, and making
   it more explicit that HEAD must either be a branch or a commit,
   there's no "HEAD is a tag" option. (from Patrick's review)
 * tags: Explain the differences between annotated and lightweight tags
   (this is the main piece of user feedback I've gotten on the draft so
   far)
 * Various style/typo changes ("2 or more", linkgit:git-gc[1], removed
   extra asterisks, added empty SYNOPSIS, "commits -> tags" typo fix,
   add to meson build)

non-changes:

 * I still haven't mentioned things that aren't part of the "data
   model", like revision params and configuration. I think there could
   be a place for them but I haven't found it yet.
* tag objects: I noticed that there's a "tag" header field in tag objects (like tag v1.0.0) but I didn't mention it yet because I couldn't figure out what the purpose of that field is (I thought the tag name was stored in the reference, why is it duplicated in the tag object?) Changes in v3: I asked for feedback from Git users on Mastodon and got 220 pieces of feedback from 48 different users. People seemed very excited to read about Git's data model. Usually I judge explanations by what folks report learning from them. Here people reported learning: * how branches are stored (that a branch is "a name for a commit") * how objects work * that Git has separate "author" and "committer" fields * that amending a commit does not change it * that a tree is "just a directory" (not something more complicated), and how trees are stored * that Git repos can contain symlinks * that Git saves modes separately from the OS. * how the stage number works * that when you git add a file, Git will create an object * that third-party tools can create their own refs. * that the reflog stores the history of branches (not just HEAD), and what reflogs are for Also (of course) there were quite a few points of confusion! The main 4 pieces of feedback were 1. The index section doesn't explain what the word "staged" means, and one person says that it makes it sounds like only files that you "git add"ed are in the index. Rewrite the explanation to avoid using the word "staged" to define the index and instead define the word "staging". 2. Explain the difference between "annotated tags" and "lightweight tags" (done) 3. Add examples for tag objects and reflogs (done) 4. Mention a little more about where things are stored in the .git directory, which I'd removed in v2. This seems most important for .git/refs, so I added a hopefully accurate note about how refs are stored by default, with a comment about one of the major implications. 
I did not discuss where objects or the index are stored, because I don't think the implementation details of how objects are stored are as important, and there are better tools for viewing the "raw" state of objects and the index (with git cat-file -p or git ls-files --staged). Here's every other change I made in response to the feedback, as well as a few comments that I did not address. intro: * Give a 1-sentence intro to "reflog" objects: * people really like having git ls-files --stage as a way to view the index, so add git cat-file -p as well in a note commits: * 2 people asked "Are commits stored as a diff?". Say that diffs are calculated at runtime, this is very important. * The order the fields are given in don't match the order in the example. Make them match. * "All the files in the commit, stored as a tree" is throwing a few people off. Be clearer that it's the tree ID of the base directory. * Several people asked "What's the difference between an author and committer? I added an example using git cherry-pick that I'm not 100% happy with (what if the reader doesn't know what cherry-pick does?). There might be a better example to give here. * In the note about commits being amended: one person suggested saying "creates a new commit with the same parent" to make it clearer what the relationship between the new and old commit are. I liked that idea so I did it. trees: * file modes. 2 people want to know more about "The file mode, for example 100644". Also 2 people are curious about what relationship these have to Unix permissions. Say that they're inspired by Unix permissions, and move the list of possible file modes up to make the relationship clearer * On "so git-gc(1) periodically compresses objects to save disk space", there are a few follow up comments wondering about more, which makes me think the comment about compression is actually a distraction. 
Say something simpler instead, ("Git only needs to store new versions of files which were changed in that commit"), from Junio's suggestion * Re "commit (a Git submodule)": 2 people say it's not clear how trees relate to submodules. Say that it refers to a commit in a different repository. * One person says they're not sure if the "object ID" is a hash. Link it to the definition of "object ID". tag objects: * Requests for an example, added one. * Requests to explain the difference between "lightweight" and "annotated" tags, added it. tags: * one person thinks "It’s expected that a tag will never change after you create it." is too strong (since of course you can change it with git tag -f). Say instead that tags are "usually" not changed. HEAD: * Several people are asking for more detail about detached HEAD state. There's actually quite a lot to talk about here (what it means, how it happens, what it implies, and how you might adjust your workflow to avoid it by using git switch). I don't think we can get into all of that here, so refer to the DETACHED HEAD section of git-checkout instead. I'm not totally happy with the current version of that section but that seems like the most practical solution right now. remote-tracking branches: * discuss refs/remotes/<remote>/HEAD. the index: * "permissions" should be "file mode" (like with trees). Changed. * "filename" should be "file path". Changed. * the stage number can only be 0, 1, 2, or 3, since it's 2 bits. Also maybe say that the numbers have specific meanings. Said it can only be 0/1/2/3 but did not give the specific meanings. reflogs * Request for an example. Added one. * It's not clear if there's one reflog per branch/tag/HEAD, or if there's one universal reflog. Make this clearer. * Mention the role of the reflog in retrieving "lost" commits or undoing bad rebases. Not fixed: * intro: A couple of people say that it's confusing that tags are both "an object" and "a reference". 
Handled this by just explaining the difference between an annotated and a lightweight tag further down. I'd like to make this clearer in the intro but not sure if there's a way to do it. * commits and tag objects: one person asks if there's a reference for the other "optional fields", like "encoding" and "gpgsig". I couldn't find one, so left this as is. * HEAD: A couple of people ask if there are any other symbolic references other than HEAD, or if they can make their own symbolic references. I don't know the answer to this. * HEAD: the HEAD: HEAD thing looks weird, it made more sense when it was HEAD: .git/HEAD. Will think about this. * reflogs: One person asks: if reflogs only store local changes, why does it track the user who made the change? Is that for remote operations like fetches and pulls? Or for cases where more than one user is using the same repo on a system? I don't know the answer to this. * reflogs: How can you see the full data in the reflog? git reflog show doesn't list the user who made the change. git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso seems to work but it's really a mouthful, not sure it's useful to include all that. * index: Is it worth mentioning that the index can be locked? I don't have an opinion about this. * other: One person asks what a "working tree" is. It made me wonder if "the current working directory" has a place in Git's data model. My feeling is "no" but I could be convinced otherwise. * overall: "How can Git be so fast? If I switch branches, how does it figure out what to add, remove or replace?". 
I don't think this is the right place for that discussion but it would * there are some docs CI errors I haven't figured out yet (IDREF attribute linkend references an unknown ID "tree") changes in v4: This is a combination of trying to make some of the intro text a little more "friendly" for someone new to Git's data model, avoiding implying things that are false, and removing information that isn't relevant to the data model. intro: * Add a 1-line description of what a "reflog" is (from user feedback) objects: * Start with a "friendly" description of what an object is, similar to what we do for references and the reflog * Rename "commits" to "commit" and similarly for trees etc (from Junio's review) * Remove the explanation of what git cat-file -p does, since it might be misleading and if people want to know they can read the man page (from Junio's review) commits: * Start by saying that the commit contains the full directory structure of all the files (from Junio's comment about how it may not be clear that the commit contains all the files' exact contents at the time of the commit) * Remove the comment about cherry-pick (from Junio's review) * Replace "ask Git for a diff" with "ask Git to show the commit with git show" (from Junio's review) trees: * Make the description a little more friendly * Reorder so that "type" is defined before we refer to the "type" * Say that file modes are "only spiritually related" to Unix permissions instead of talking about what Git "supports" (from Junio's review) blobs: * Try to make it clearer how "commits use relatively little disk space" is true while not implying that commits are diffs, by using an example (from Junio's review) branches: * Replace "a branch is a name for a commit ID" with "a branch refers to a commit ID" (except in the intro sentence for the "references" section). Similarly for tags etc. 
(from Junio's review) * Remove the note about how branches are stored in .git (from Junio's review) HEAD: * Be clearer that HEAD is not always the current branch, because there may not be a current branch (from Junio's review) index: * Be a little more specific about how exactly the index is converted into a commit. (from Junio's comment about how it's not clear what "every file in the repository" means) reflog: * Be clearer that there are many reflogs (one for each reference with a log), not just one reflog (from Junio and Patrick's reviews) * Omit the user and "Before" commit IDs from the list of fields, because you usually don't see them (from Junio's review) * Show the output of git reflog main in the example instead of the contents of the reflog file, to avoid showing the user and before commit ID Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1981/jvns/gitdatamodel-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/1981 Range-diff vs v3: 1: 39da4e04cf ! 1: 92249b5b08 doc: add a explanation of Git's data model @@ Metadata Author: Julia Evans <julia@jvns.ca> ## Commit message ## - doc: add a explanation of Git's data model + doc: add an explanation of Git's data model Git very often uses the terms "object", "reference", or "index" in its documentation. @@ Documentation/gitdatamodel.adoc (new) +OBJECTS +------- + -+Commits, trees, blobs, and tag objects are all stored in Git's object database. ++All of the commits and files in a Git repository are stored as "Git objects". ++Git objects never change after they're created, and every object has an ID, ++like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. ++ ++This means that if you have an object's ID, you can always recover its ++exact contents as long as the object hasn't been deleted. ++ +Every object has: + +[[object-id]] @@ Documentation/gitdatamodel.adoc (new) + and <<tag-object,tag objects>>. 
+3. *contents*. The structure of the contents depends on the type. + -+Once an object is created, it can never be changed. -+Here are the 4 types of objects: ++Here's how each type of object is structured: + +[[commit]] -+commits:: -+ A commit contains these required fields ++commit:: ++ A commit contains the full directory structure of every file ++ in that version of the repository and each file's contents. ++ It has these these required fields + (though there are other optional fields): ++ -+1. All the *files* in the commit, stored as the *<<tree,tree>>* ID of -+ the commit's base directory. ++1. The *files* in the commit, stored as the *<<tree,tree>>* ID ++ of the commit's base directory. +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored +4. A *committer* and the time the commit was committed. -+ If you cherry-pick (linkgit:git-cherry-pick[1]) someone else's commit, -+ then they will be the author and you'll be the committer. +5. A *commit message* ++ +Here's how an example commit is stored: @@ Documentation/gitdatamodel.adoc (new) +For example, "amending" a commit with `git commit --amend` creates a new +commit with the same parent. ++ -+Git does not store the diff for a commit: when you ask Git for a -+diff it calculates it on the fly. ++Git does not store the diff for a commit: when you ask Git to show ++the commit with linkgit:git-show[1], it calculates the diff from its ++parent on the fly. + +[[tree]] -+trees:: -+ A tree is how Git represents a directory. It lists, for each item in -+ the tree: ++tree:: ++ A tree is how Git represents a directory. ++ It can contain files or other trees (which are subdirectories). ++ It lists, for each item in the tree: ++ -+[[file-mode]] -+1. The *file mode*, for example `100644`. The format is inspired by Unix -+ permissions, but Git's modes are much more limited. 
Git only supports these file modes: ++1. The *filename*, for example `hello.py` ++2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), ++ or <<commit,`commit`>> (a Git submodule, which is a ++ commit from a different Git repository) ++3. The *file mode*. Git has these file modes. which are only ++ spiritually related to Unix permissions: ++ + - `100644`: regular file (with type `blob`) + - `100755`: executable file (with type `blob`) @@ Documentation/gitdatamodel.adoc (new) + - `040000`: directory (with type `tree`) + - `160000`: gitlink, for use with submodules (with type `commit`) + -+2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), -+ or <<commit,`commit`>> (a Git submodule, which is a -+ commit from a different Git repository) -+3. The <<object-id,*object ID*>> -+4. The *filename* ++4. The <<object-id,*object ID*>> with the contents of the file or directory ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: @@ Documentation/gitdatamodel.adoc (new) +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + -+ +[[blob]] -+blobs:: -+ A blob is how Git represents a file. A blob object contains the -+ file's contents. ++blob:: ++ A blob object contains a file's contents. ++ -+When you make a new commit, Git only needs to store new versions of -+files which were changed in that commit. This means that commits -+can use relatively little disk space even in a very large repository. ++When you make a commit, Git stores the full contents of each file that ++you changed as a blob. ++For example, if you have a commit that changes 2 files in a repository ++with 1000 files, that commit will create 2 new blobs, and use the ++previous blob ID for the other 998 files. ++This means that commits can use relatively little disk space even in a ++very large repository. 
+ +[[tag-object]] -+tag objects:: ++tag object:: + Tag objects contain these required fields + (though there are other optional fields): ++ @@ Documentation/gitdatamodel.adoc (new) +---- + +NOTE: All of the examples in this section were generated with -+`git cat-file -p <object-id>`, which shows the contents of a Git object. ++`git cat-file -p <object-id>`. + +[[references]] +REFERENCES @@ Documentation/gitdatamodel.adoc (new) +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + -+References can either be: ++References can either refer to: + -+1. References to an object ID, usually a <<commit,commit>> ID -+2. References to another reference. This is called a "symbolic reference". ++1. An object ID, usually a <<commit,commit>> ID ++2. Another reference. This is called a "symbolic reference". + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. @@ Documentation/gitdatamodel.adoc (new) + +[[branch]] +branches: `refs/heads/<name>`:: -+ A branch is a name for a commit ID. ++ A branch refers to a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit @@ Documentation/gitdatamodel.adoc (new) + +[[tag]] +tags: `refs/tags/<name>`:: -+ A tag is a name for a commit ID, tag object ID, or other object ID. -+ Tags that reference a tag object ID are called "annotated tags", -+ because the tag object contains a tag message. -+ Tags that reference a commit, blob, or tree ID are -+ called "lightweight tags". ++ A tag refers to a commit ID, tag object ID, or other object ID. ++ There are two types of tags: ++ 1. "Annotated tags", which reference a <<tag-object,tag object>> ID ++ which contains a tag message ++ 2. 
"Lightweight tags", which reference a commit, blob, or tree ID ++ directly ++ -+Even though branches and tags are both "a name for a commit ID", Git ++Even though branches and tags both refer to a commit ID, Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git -+will update your <<HEAD,current branch>> to reference the new changes. ++will update your <<HEAD,current branch>> to point to the new commit. +Tags are usually not changed after they're created. + +[[HEAD]] +HEAD: `HEAD`:: -+ `HEAD` is where Git stores your current <<branch,branch>>. -+ `HEAD` can either be: -+ 1. A symbolic reference to your current branch, for example `ref: -+ refs/heads/main` if your current branch is `main`. -+ 2. A direct reference to a commit ID. This is called "detached HEAD -+ state", see the DETACHED HEAD section of linkgit:git-checkout[1] for more. ++ `HEAD` is where Git stores your current <<branch,branch>>, ++ if there is a current branch. `HEAD` can either be: +++ ++1. A symbolic reference to your current branch, for example `ref: ++ refs/heads/main` if your current branch is `main`. ++2. A direct reference to a commit ID. In this case there is no current branch. ++ This is called "detached HEAD state", see the DETACHED HEAD section ++ of linkgit:git-checkout[1] for more. + +[[remote-tracking-branch]] -+remote tracking branches: `refs/remotes/<remote>/<branch>`:: -+ A remote-tracking branch is a name for a commit ID. ++remote-tracking branches: `refs/remotes/<remote>/<branch>`:: ++ A remote-tracking branch refers to a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at @@ Documentation/gitdatamodel.adoc (new) ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. 
-++ -+NOTE: By default, Git references are stored as files in the `.git` directory. -+For example, the branch `main` is stored in `.git/refs/heads/main`. -+This means that you can't have branches named both `maya` and `maya/some-task`, -+because there can't be a file and a directory with the same name. + +[[index]] +THE INDEX +--------- -+ -+The index, also known as the "staging area", contains a list of every -+file in the repository and its contents. When you commit, the files in -+the index are used as the files in the next commit. -+ -+You can add files to the index or update the version in the index with -+linkgit:git-add[1]. Adding a file to the index or updating its version -+is called "staging" the file for commit. ++The index, also known as the "staging area", is a list of files and ++the contents of each file, stored as a <<blob,blob>>. ++You can add files to the index or update the contents of a file in the ++index with linkgit:git-add[1]. This is called "staging" the file for commit. + +Unlike a <<tree,tree>>, the index is a flat list of files. ++When you commit, Git converts the list of files in the index to a ++directory <<tree,tree>> and uses that tree in the new <<commit,commit>>. ++ +Each index entry has 4 fields: + -+1. The *<<file-mode,file mode>>* ++1. The *<<tree,file mode>>* +2. The *<<blob,blob>> ID* of the file +3. The *file path*, for example `src/hello.py` +4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if @@ Documentation/gitdatamodel.adoc (new) +REFLOGS +------- + -+Git stores a history called a "reflog" for every branch, remote-tracking -+branch, and HEAD. This means that if you make a mistake and "lose" a -+commit, you can generally recover the commit ID by running -+`git reflog <reference>`. ++Every time a branch, remote-tracking branch, or HEAD is updated, Git ++updates a log called a "reflog" for that <<references,reference>>. 
++This means that if you make a mistake and "lose" a commit, you can ++generally recover the commit ID by running `git reflog <reference>`. + -+Each reflog entry has: ++A reflog is a list of log entries. Each entry has: + -+1. Before/after *commit IDs* -+2. *User* who made the change, for example `Maya <maya@example.com>` -+3. *Timestamp* when the change was made -+4. *Log message*, for example `pull: Fast-forward` ++1. The *commit ID* ++2. *Timestamp* when the change was made ++3. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. +They are not shared with remotes. + -+For example, here's how the reflog for `HEAD` in a repository with 2 -+commits is stored: ++You can view a reflog with `git reflog <reference>`. ++For example, here's the reflog for a `main` branch which has changed twice: + +---- -+0000000000000000000000000000000000000000 4ccb6d7b8869a86aae2e84c56523f8705b50c647 Maya <maya@example.com> 1759173408 -0400 commit (initial): Initial commit -+4ccb6d7b8869a86aae2e84c56523f8705b50c647 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 Maya <maya@example.com> 1759173425 -0400 commit: Add README ++$ git reflog main --date=iso --no-decorate ++750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README ++4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit +---- + +GIT Documentation/Makefile | 1 + Documentation/gitdatamodel.adoc | 286 ++++++++++++++++++++++++++++ Documentation/glossary-content.adoc | 4 +- Documentation/meson.build | 1 + 4 files changed, 290 insertions(+), 2 deletions(-) create mode 100644 Documentation/gitdatamodel.adoc diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fb83d0c6e..5f4acfacbd 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc MAN7_TXT += gitcore-tutorial.adoc MAN7_TXT += gitcredentials.adoc MAN7_TXT += gitcvs-migration.adoc +MAN7_TXT += gitdatamodel.adoc MAN7_TXT += gitdiffcore.adoc 
MAN7_TXT += giteveryday.adoc MAN7_TXT += gitfaq.adoc diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc new file mode 100644 index 0000000000..e36e833f66 --- /dev/null +++ b/Documentation/gitdatamodel.adoc @@ -0,0 +1,286 @@ +gitdatamodel(7) +=============== + +NAME +---- +gitdatamodel - Git's core data model + +SYNOPSIS +-------- +gitdatamodel + +DESCRIPTION +----------- + +It's not necessary to understand Git's data model to use Git, but it's +very helpful when reading Git's documentation so that you know what it +means when the documentation says "object", "reference" or "index". + +Git's core operations use 4 kinds of data: + +1. <<objects,Objects>>: commits, trees, blobs, and tag objects +2. <<references,References>>: branches, tags, + remote-tracking branches, etc +3. <<index,The index>>, also known as the staging area +4. <<reflogs,Reflogs>>: logs of changes to references ("ref log") + +[[objects]] +OBJECTS +------- + +All of the commits and files in a Git repository are stored as "Git objects". +Git objects never change after they're created, and every object has an ID, +like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. + +This means that if you have an object's ID, you can always recover its +exact contents as long as the object hasn't been deleted. + +Every object has: + +[[object-id]] +1. an *ID* (aka "object name"), which is a cryptographic hash of its + type and contents. + It's fast to look up a Git object using its ID. + This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, + and <<tag-object,tag objects>>. +3. *contents*. The structure of the contents depends on the type. + +Here's how each type of object is structured: + +[[commit]] +commit:: + A commit contains the full directory structure of every file + in that version of the repository and each file's contents. 
+ It has these these required fields + (though there are other optional fields): ++ +1. The *files* in the commit, stored as the *<<tree,tree>>* ID + of the commit's base directory. +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored +4. A *committer* and the time the commit was committed. +5. A *commit message* ++ +Here's how an example commit is stored: ++ +---- +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 +author Maya <maya@example.com> 1759173425 -0400 +committer Maya <maya@example.com> 1759173425 -0400 + +Add README +---- ++ +Like all other objects, commits can never be changed after they're created. +For example, "amending" a commit with `git commit --amend` creates a new +commit with the same parent. ++ +Git does not store the diff for a commit: when you ask Git to show +the commit with linkgit:git-show[1], it calculates the diff from its +parent on the fly. + +[[tree]] +tree:: + A tree is how Git represents a directory. + It can contain files or other trees (which are subdirectories). + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), + or <<commit,`commit`>> (a Git submodule, which is a + commit from a different Git repository) +3. The *file mode*. Git has these file modes. which are only + spiritually related to Unix permissions: ++ + - `100644`: regular file (with type `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `040000`: directory (with type `tree`) + - `160000`: gitlink, for use with submodules (with type `commit`) + +4. 
The <<object-id,*object ID*>> with the contents of the file or directory ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: ++ +---- +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + +[[blob]] +blob:: + A blob object contains a file's contents. ++ +When you make a commit, Git stores the full contents of each file that +you changed as a blob. +For example, if you have a commit that changes 2 files in a repository +with 1000 files, that commit will create 2 new blobs, and use the +previous blob ID for the other 998 files. +This means that commits can use relatively little disk space even in a +very large repository. + +[[tag-object]] +tag object:: + Tag objects contain these required fields + (though there are other optional fields): ++ +1. The *ID* and *type* of the object (often a commit) that they reference +2. The *tagger* and tag date +3. A *tag message*, similar to a commit message + +Here's how an example tag object is stored: + +---- +object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 +type commit +tag v1.0.0 +tagger Maya <maya@example.com> 1759927359 -0400 + +Release version 1.0.0 +---- + +NOTE: All of the examples in this section were generated with +`git cat-file -p <object-id>`. + +[[references]] +REFERENCES +---------- + +References are a way to give a name to a commit. +It's easier to remember "the changes I'm working on are on the `turtle` +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + +References can either refer to: + +1. An object ID, usually a <<commit,commit>> ID +2. Another reference. This is called a "symbolic reference". + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. +Most references are under `refs/`. 
Here are the main types: + +[[branch]] +branches: `refs/heads/<name>`:: + A branch refers to a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] +tags: `refs/tags/<name>`:: + A tag refers to a commit ID, tag object ID, or other object ID. + There are two types of tags: + 1. "Annotated tags", which reference a <<tag-object,tag object>> ID + which contains a tag message + 2. "Lightweight tags", which reference a commit, blob, or tree ID + directly ++ +Even though branches and tags both refer to a commit ID, Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git +will update your <<HEAD,current branch>> to point to the new commit. +Tags are usually not changed after they're created. + +[[HEAD]] +HEAD: `HEAD`:: + `HEAD` is where Git stores your current <<branch,branch>>, + if there is a current branch. `HEAD` can either be: ++ +1. A symbolic reference to your current branch, for example `ref: + refs/heads/main` if your current branch is `main`. +2. A direct reference to a commit ID. In this case there is no current branch. + This is called "detached HEAD state", see the DETACHED HEAD section + of linkgit:git-checkout[1] for more. + +[[remote-tracking-branch]] +remote-tracking branches: `refs/remotes/<remote>/<branch>`:: + A remote-tracking branch refers to a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at + this. ++ +`refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's +default branch. This is the branch that `git clone` checks out by default. + +[[other-refs]] +Other references:: + Git tools may create references anywhere under `refs/`. 
+ For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references + in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + +[[index]] +THE INDEX +--------- +The index, also known as the "staging area", is a list of files and +the contents of each file, stored as a <<blob,blob>>. +You can add files to the index or update the contents of a file in the +index with linkgit:git-add[1]. This is called "staging" the file for commit. + +Unlike a <<tree,tree>>, the index is a flat list of files. +When you commit, Git converts the list of files in the index to a +directory <<tree,tree>> and uses that tree in the new <<commit,commit>>. + +Each index entry has 4 fields: + +1. The *<<tree,file mode>>* +2. The *<<blob,blob>> ID* of the file +3. The *file path*, for example `src/hello.py` +4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if + there's a merge conflict there can be multiple versions of the same + filename in the index. + +It's extremely uncommon to look at the index directly: normally you'd +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>. +But you can use `git ls-files --stage` to see the index. +Here's the output of `git ls-files --stage` in a repository with 2 files: + +---- +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py +---- + +[[reflogs]] +REFLOGS +------- + +Every time a branch, remote-tracking branch, or HEAD is updated, Git +updates a log called a "reflog" for that <<references,reference>>. +This means that if you make a mistake and "lose" a commit, you can +generally recover the commit ID by running `git reflog <reference>`. + +A reflog is a list of log entries. Each entry has: + +1. The *commit ID* +2. 
*Timestamp* when the change was made +3. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. +They are not shared with remotes. + +You can view a reflog with `git reflog <reference>`. +For example, here's the reflog for a `main` branch which has changed twice: + +---- +$ git reflog main --date=iso --no-decorate +750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README +4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit +---- + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc index e423e4765b..20ba121314 100644 --- a/Documentation/glossary-content.adoc +++ b/Documentation/glossary-content.adoc @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a identified by its <<def_object_name,object name>>. The objects usually live in `$GIT_DIR/objects/`. -[[def_object_identifier]]object identifier (oid):: - Synonym for <<def_object_name,object name>>. +[[def_object_identifier]]object identifier, object ID, oid:: + Synonyms for <<def_object_name,object name>>. [[def_object_name]]object name:: The unique identifier of an <<def_object,object>>. The diff --git a/Documentation/meson.build b/Documentation/meson.build index e34965c5b0..ace0573e82 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -192,6 +192,7 @@ manpages = { 'gitcore-tutorial.adoc' : 7, 'gitcredentials.adoc' : 7, 'gitcvs-migration.adoc' : 7, + 'gitdatamodel.adoc' : 7, 'gitdiffcore.adoc' : 7, 'giteveryday.adoc' : 7, 'gitfaq.adoc' : 7, base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8 -- gitgitgadget ^ permalink raw reply related [flat|nested] 89+ messages in thread
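
The patch above says an object ID is "a cryptographic hash of its type and contents". As a minimal sketch of what that means, assuming a repository that uses SHA-1 object names: the hashed input is a small header (type, a space, the content length in bytes, a NUL byte) followed by the contents. `git_object_id` is a hypothetical helper name used only for illustration; it is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, contents: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type.

    Assumes a SHA-1 repository. The hashed bytes are
    "<type> <length>\\0" followed by the raw contents.
    """
    header = f"{obj_type} {len(contents)}\0".encode("ascii")
    return hashlib.sha1(header + contents).hexdigest()


# The empty blob has a well-known ID in SHA-1 repositories.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the ID depends only on the type and the bytes, two files with identical contents share one blob, which is why an unchanged file costs nothing extra in a new commit.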
* Re: [PATCH v4] doc: add an explanation of Git's data model
2025-10-27 19:32 ` [PATCH v4] doc: add an " Julia Evans via GitGitGadget
@ 2025-10-27 21:54 ` Junio C Hamano
2025-10-28 20:10 ` Julia Evans
2025-10-30 20:32 ` [PATCH v5] " Julia Evans via GitGitGadget
1 sibling, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-10-27 21:54 UTC (permalink / raw)
To: Julia Evans via GitGitGadget
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..e36e833f66
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,286 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +SYNOPSIS
> +--------
> +gitdatamodel
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object", "reference" or "index".

"While it is not necessary ..., it is helpful ..." may flow better
than "It is not necessary ..., but it is very helpful".

> +This means that if you have an object's ID, you can always recover its
> +exact contents as long as the object hasn't been deleted.

Somewhere in a distant footnote, we may want to mention that objects
that are in use are never deleted, and when they get removed (i.e.,
garbage collection). As part of the data model, "everything is
retained by default, until we can prove it is no longer reachable"
probably belongs somewhere.

> +Here's how each type of object is structured:
> +
> +[[commit]]
> +commit::
> + A commit contains the full directory structure of every file
> + in that version of the repository and each file's contents.

What you are describing here is more of the property of a tree; a
commit is a bit richer.

    A commit records a snapshot of every file in the project at
    one point in time, records who contributed to create such a
    snapshot and why, and how that particular snapshot relates to
    other snapshots in the history.

> + It has these these required fields

"these these".

> +Like all other objects, commits can never be changed after they're created.
> +For example, "amending" a commit with `git commit --amend` creates a new
> +commit with the same parent.

"same parent." -> "same parent, without modifying the original
commit object at all"? Maybe redundant? I dunno.

> +[[tree]]
> +tree::
> + A tree is how Git represents a directory.

"a directory" -> "contents in a directory"? I dunno.

> + It can contain files or other trees (which are subdirectories).
> + It lists, for each item in the tree:
> ++
> +1. The *filename*, for example `hello.py`
> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> + or <<commit,`commit`>> (a Git submodule, which is a
> + commit from a different Git repository)

This is a bit of a white lie. A tree object entry never stores the
type of the object. It records <mode, object name, path component>.

The second field you see in git ls-tree output is computed from the
object name (when the object is available) or inferred from the mode
bits.

> +3. The *file mode*. Git has these file modes. which are only
> + spiritually related to Unix permissions:

In the cover letter part of the message I am responding to, I saw
repeated mention that "permissions" should be "file mode"; let's be
consistent.

"Git has these file modes, which are ..." ->

    Git uses the following file modes to represent what each tree
    entry is (because an object of the same type, e.g. "blob", is
    used to represent more than one kind of thing). The file modes
    are assigned to resemble Unix file modes.

    Note that Git does not _store_ permissions, and there are only
    two kinds of regular files; non-executable (100644) or
    executable (100755).
    To Git, there are no files that are
    "readable only by the owner" etc., so file mode bits like
    100600, 100400, etc., are never used.

> +[[tag-object]]
> +tag object::
> + Tag objects contain these required fields
> + (though there are other optional fields):
> ++
> +1. The *ID* and *type* of the object (often a commit) that they reference

Not wrong per-se, but it is a bit curious to lump these two into a
single enumerated item here, unlike "author" and "committer", which
were enumerated separately for commit objects. If you are going to
show "cat-file -p" output for illustration, it may help readers
understand them if you had them separately listed here.

> +2. The *tagger* and tag date
> +3. A *tag message*, similar to a commit message

> +[[index]]
> +THE INDEX
> +---------
> +The index, also known as the "staging area", is a list of files and
> +the contents of each file, stored as a <<blob,blob>>.
> +You can add files to the index or update the contents of a file in the
> +index with linkgit:git-add[1]. This is called "staging" the file for commit.
> +
> +Unlike a <<tree,tree>>, the index is a flat list of files.

This is a bit of a white lie, as modern versions of Git could be
collapsing uninteresting parts of the directory structure as a
single tree in an index entry (this is called "sparse index"), and
can expand such collapsed "tree" in the index on-demand into its
constituent files and directories. But I do not mind presenting the
traditional world model for conceptual simplicity.

> +When you commit, Git converts the list of files in the index to a
> +directory <<tree,tree>> and uses that tree in the new <<commit,commit>>.
> +
> +Each index entry has 4 fields:
> +
> +1. The *<<tree,file mode>>*
> +2. The *<<blob,blob>> ID* of the file

If you were to collapse descriptions like you did for tag objects
where ID and TYPE were treated as a unit, here is the place to do
so.
With the mode bits and object ID, we can represent regular
files that are non-executable, regular files that are executable,
symbolic links, and submodules (if a sparse-index is in use, an
index entry could be a subdirectory, but I suggested above that we
can ignore them for simplicity).

But <<blob,blob>> is highly misleading. Even if we ignore
sparse-index, we may see a commit object there.

    Each index entry records

    1. The object that occupies the path, as (file mode, object
       name) tuple. Most often, it is a regular file whose contents
       are stored in a blob object, that is either non-executable
       (100644), executable (100755), or a symbolic link (120000),
       but the object can be a commit in another repository if it
       represents a submodule.

    2. The stage number, which is normally 0, but entries with
       higher stages for the same path are used during a conflicted
       merge.

    3. The path name for the index entry.

> +3. The *file path*, for example `src/hello.py`
> +4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
> + there's a merge conflict there can be multiple versions of the same
> + filename in the index.

If you are going by "ls-files -s" output, it may be better to swap 3
and 4 above for ease of understanding.

> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2 files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Every time a branch, remote-tracking branch, or HEAD is updated, Git
> +updates a log called a "reflog" for that <<references,reference>>.

If we want to avoid using the word X while explaining X, then we can
rephrase it as "Git updates a record in the reflog for that
reference".
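
The point in the review above, that a tree entry records only <mode, object name, path component>, can be illustrated with a small parser. This is a sketch under stated assumptions: a SHA-1 repository (20-byte raw object names) and the raw, decompressed tree body with the object header already removed. `parse_tree` and `type_from_mode` are hypothetical helper names, not Git APIs; the sample bytes are built from the two IDs in the patch's example tree.

```python
def parse_tree(body: bytes):
    """Parse a raw tree body into (mode, name, object ID) tuples.

    Each entry is: ASCII octal mode, a space, the path component,
    a NUL byte, then 20 raw bytes of SHA-1 object name.
    """
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        oid = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, oid))
        pos = nul + 21
    return entries


def type_from_mode(mode: str) -> str:
    """Infer the object type from the mode bits alone."""
    if mode == "40000":  # stored without the leading zero in raw trees
        return "tree"
    if mode == "160000":  # gitlink, used for submodules
        return "commit"
    return "blob"  # 100644, 100755 and 120000 all point at blobs


# Hand-built sample body using the IDs from the example in the patch.
sample = (
    b"100644 README.md\0"
    + bytes.fromhex("8728a858d9d21a8c78488c8b4e70e531b659141f")
    + b"40000 src\0"
    + bytes.fromhex("89b1d2e0495f66d6929f4ff76ff1bb07fc41947d")
)

for mode, name, oid in parse_tree(sample):
    print(mode.zfill(6), type_from_mode(mode), oid, name)
```

No type field is read anywhere in `parse_tree`; the "blob" or "tree" column in the printed lines is derived from the mode.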
* Re: [PATCH v4] doc: add an explanation of Git's data model
2025-10-27 21:54 ` Junio C Hamano
@ 2025-10-28 20:10 ` Julia Evans
2025-10-28 20:31 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-10-28 20:10 UTC (permalink / raw)
To: Junio C Hamano, Julia Evans
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

>> +
>> +It's not necessary to understand Git's data model to use Git, but it's
>> +very helpful when reading Git's documentation so that you know what it
>> +means when the documentation says "object", "reference" or "index".
>
> "While it is not necessary ..., it is helpful ..." may flow better
> than "It is not necessary ..., but it is very helpful".
>
>> +This means that if you have an object's ID, you can always recover its
>> +exact contents as long as the object hasn't been deleted.
>
> Somewhere in distant footnote, we may want to mention that objects
> that are in use are never deleted, and when they get removed (i.e.,
> garbage collection). As part of the data model, "everything is
> retained by default, until we can prove it is no longer reachable"
> probably belongs somewhere.

Agreed, I really like this idea. Came up with the following, which
I'll put at the bottom of the "References" section if I don't come up
with a better idea. (I don't feel strongly about where exactly it
should go):

    NOTE: Objects will only be deleted if they aren't "reachable" from
    any reference. An object is "reachable" if we can find it by
    following tags to whatever they tag, commits to their parents or
    trees, and trees to the trees or blobs that they contain.
    For example, if you amend a commit with `git commit --amend`,
    the old commit will usually not be reachable, so it may be
    deleted eventually.

>> +Here's how each type of object is structured:
>> +
>> +[[commit]]
>> +commit::
>> + A commit contains the full directory structure of every file
>> + in that version of the repository and each file's contents.
>
> What you are describing here is more of the property of a tree; a
> commit is a bit richer.
>
>     A commit records a snapshot of every file in the project at
>     one point in time, records who contributed to create such a
>     snapshot and why, and how that particular snapshot relates to
>     other snapshots in the history.

I don't understand the goal of explaining a commit in detail in
paragraph form when we already explain everything in a commit right
below this. My goal with this intro sentence is just to emphasize what
I think is the least obvious point in that list, which is that commits
contain every file.

Happy to change it to something shorter like "A commit records a
snapshot of every file in the project" if you prefer that wording.

>> + It has these these required fields
>
> "these these".

Oops, will fix

>> +Like all other objects, commits can never be changed after they're created.
>> +For example, "amending" a commit with `git commit --amend` creates a new
>> +commit with the same parent.
>
> "same parent." -> "same parent, without modifying the original
> commit object at all"? Maybe redundant? I dunno.
>
>> +[[tree]]
>> +tree::
>> + A tree is how Git represents a directory.
>
> "a directory" -> "contents in a directory"? I dunno.
>
>> + It can contain files or other trees (which are subdirectories).
>> + It lists, for each item in the tree:
>> ++
>> +1. The *filename*, for example `hello.py`
>> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
>> + or <<commit,`commit`>> (a Git submodule, which is a
>> + commit from a different Git repository)
>
> This is a bit of a white lie. A tree object entry never stores the
> type of the object. It records <mode, object name, path component>.
>
> The second field you see in git ls-tree output is computed from the
> object name (when the object is available) or inferred from the mode
> bits.

Thanks, I didn't realize how tree object entries were stored.
Will remove "type".

>> +3. The *file mode*. Git has these file modes. which are only
>> + spiritually related to Unix permissions:
>
> In the cover letter part of the message I am responding to, I saw
> repeated mention that "permissions" should be "file mode"; let's be
> consistent.
>
> "Git has these file modes, which are ..." ->

Makes sense. Will change to "Unix file modes" from "Unix permissions".
I don't think this needs a more dramatic rewrite though.

>     Git uses the following file modes to represent what each tree
>     entry is (because an object of the same type, e.g. "blob", is
>     used to represent more than one kind of thing). The file modes
>     are assigned to resemble Unix file modes.
>
>     Note that Git does not _store_ permissions, and there are only
>     two kinds of regular files; non-executable (100644) or
>     executable (100755). To Git, there are no files that are
>     "readable only by the owner" etc., so file mode bits like
>     100600, 100400, etc., are never used.
>
>> +[[tag-object]]
>> +tag object::
>> + Tag objects contain these required fields
>> + (though there are other optional fields):
>> ++
>> +1. The *ID* and *type* of the object (often a commit) that they reference
>
> Not wrong per-se, but it is a bit curious to lump these two into a
> single enumerated item here, unlike "author" and "committer", which
> were enumerated separately for commit objects. If you are going to
> show "cat-file -p" output for illustration, it may help readers
> understand them if you had them separately listed here.

Agreed, I'll split them into two items.

>> +2. The *tagger* and tag date
>> +3. A *tag message*, similar to a commit message
>
>> +[[index]]
>> +THE INDEX
>> +---------
>> +The index, also known as the "staging area", is a list of files and
>> +the contents of each file, stored as a <<blob,blob>>.
>> +You can add files to the index or update the contents of a file in the
>> +index with linkgit:git-add[1]. This is called "staging" the file for commit.
>> +
>> +Unlike a <<tree,tree>>, the index is a flat list of files.
>
> This is a bit of a white lie, as modern versions of Git could be
> collapsing uninteresting parts of the directory structure as a
> single tree in an index entry (this is called "sparse index"), and
> can expand such collapsed "tree" in the index on-demand into its
> constituent files and directories. But I do not mind presenting the
> traditional world model for conceptual simplicity.

I didn't know that, thanks. I guess I'll leave it the way it is for now.
It could be good to add a footnote, but I don't actually know how to
add footnotes in this document format.
>> + >> +Unlike a <<tree,tree>>, the index is a flat list of files. > > This is a bit of white lie, as modern versions of Git could be > collapsing uninteresting parts of the directory structure as a > single tree in an index entry (this is called "sparse index"), and > can expand such collapsed "tree" in the index on-demand into its > constituent files and directories. But I do not mind presenting the > traditional world model for conceptual simplicity. I didn't know that, thanks. I guess I'll leave it the way it is for now. It could be good to add a footnote, but I don't actually know how to add footnotes in this document format. >> +When you commit, Git converts the list of files in the index to a >> +directory <<tree,tree>> and uses that tree in the new <<commit,commit>>. >> + >> +Each index entry has 4 fields: >> + >> +1. The *<<tree,file mode>>* >> +2. The *<<blob,blob>> ID* of the file > > If you were to collapse descriptions like you did for tag objects > where ID and TYPE were treated as a unit, here is the place to do > so. With the mode bits and object ID, we can represent regular > files that are non-executable, regular files that are executable, > symbolic links, and submodules (if a sparse-index is in use, an > index entry could be a subdirectory, but I suggested above that we > can ignore them for simplicity). > > But <<blob,blob>> is highly misleading. Even if we ignore > sparse-index, we may see a commit object there. Thanks, I didn't realize that. Will change to say that it can be a blob or commit ID. I don't think that collapsing will help, IMO it's important to keep a consistent format. > Each index entry records > > 1. The object that occupies the path, as (file mode, object > name) tuple. Most often, it is a regular file whose contents > are stored in a blob object, that is either non-executable > (100644), executable (100755), or a symbolic link (120000), > but the object can be a commit in another repository if it > represents a submodule. 
>
> 2. The stage number, which is normally 0, but entries with
>    higher stages for the same path are used during a conflicted
>    merge.
>
> 3. The path name for the index entry.
>
>> +3. The *file path*, for example `src/hello.py`
>> +4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
>> +   there's a merge conflict there can be multiple versions of the same
>> +   filename in the index.
>
> If you are going by "ls-files -s" output, it may be better to swap 3
> and 4 above for ease of understanding.

Good point, will do.

>> +It's extremely uncommon to look at the index directly: normally you'd
>> +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
>> +But you can use `git ls-files --stage` to see the index.
>> +Here's the output of `git ls-files --stage` in a repository with 2 files:
>> +
>> +----
>> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
>> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
>> +----
>> +
>> +[[reflogs]]
>> +REFLOGS
>> +-------
>> +
>> +Every time a branch, remote-tracking branch, or HEAD is updated, Git
>> +updates a log called a "reflog" for that <<references,reference>>.
>
> If we want to avoid using word X while explaining X, then we can
> rephrase it as "Git updates a record in the reflog for that
> reference".

I think the current phrasing is okay. I also didn't respond to some of the
phrasing suggestions above if I didn't understand the goal of them.
Hope that's okay.

^ permalink raw reply [flat|nested] 89+ messages in thread
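The `git ls-files --stage` lines quoted in this message carry the four index-entry fields in the order mode, object ID, stage number, path. A minimal Python sketch of reading such lines follows. The `IndexEntry` type and the `parse_stage_line` helper are hypothetical names used only for illustration, and the sketch assumes paths are printed unquoted; real output separates the path from the stage number with a tab.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One line of `git ls-files --stage` output (hypothetical helper type)."""
    mode: str       # e.g. "100644"
    object_id: str  # blob ID, or a commit ID for a submodule
    stage: int      # 0 normally; 1, 2 or 3 during a conflicted merge
    path: str       # e.g. "src/hello.py"


def parse_stage_line(line: str) -> IndexEntry:
    # Split on whitespace at most three times so that spaces inside the
    # path (the last field) are kept intact.
    mode, object_id, stage, path = line.rstrip("\n").split(None, 3)
    return IndexEntry(mode, object_id, int(stage), path)


# The two sample lines from the patch, with the tab that real output uses.
sample = [
    "100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0\tREADME.md",
    "100644 665c637a360874ce43bf74018768a96d2d4d219a 0\tsrc/hello.py",
]
entries = [parse_stage_line(line) for line in sample]
for entry in entries:
    print(entry.stage, entry.mode, entry.path)
```

Keeping the stage number as an integer makes it easy to spot conflicted paths: any entry whose `stage` is not 0 belongs to an unresolved merge.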
* Re: [PATCH v4] doc: add an explanation of Git's data model
2025-10-28 20:10 ` Julia Evans
@ 2025-10-28 20:31 ` Junio C Hamano
0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-10-28 20:31 UTC (permalink / raw)
To: Julia Evans
Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

> Agreed, I really like this idea. Came up with the following, which I'll put at
> the bottom of the "References" section if I don't come up with a better idea.
> (I don't feel strongly about where exactly it should go):
>
> NOTE: Objects will only be deleted if they aren't "reachable" from any reference.
> An object is "reachable" if we can find it by following tags to whatever
> they tag, commits to their parents or trees, and trees to the trees or
> blobs that they contain.
> For example, if you amend a commit, with `git commit --amend`,
> the old commit will usually not be reachable, so it may be deleted eventually.

Other reachability anchors exist, like the index and reflog entries,
but we have to stop somewhere. I am fine if we do not mention them
explicitly for the sake of simplicity.

>>> +Here's how each type of object is structured:
>>> +
>>> +[[commit]]
>>> +commit::
>>> +    A commit contains the full directory structure of every file
>>> +    in that version of the repository and each file's contents.
>>
>> What you are describing here is more of the property of a tree; a
>> commit is a bit richer.
>>
>> A commit records a snapshot of the every file in the project at
>> one point in time, records who contributed to create such a
>> snapshot and why, and how that particular snapshot relates to
>> other snapshots in the history.
>
> I don't understand the goal of explaining a commit in detail in
> paragraph form when we already explain everything in a commit right
> below this.
>
> My goal of this intro sentence is just to emphasize what I think is the
> least obvious point in that list, which is that commits contain every file.
>
> Happy to change it to something shorter like
> "A commit records a snapshot of the every file in the project" if you
> prefer that wording.

Not really. Somebody who is skimming, who reads just the headline
without reading enumeration, would not be able to tell differences
between a tree and a commit. Your enumeration lists _what_ is
recorded, the headline I gave you above explains _what_ they are
recorded _for_.

> I think the current phrasing is okay. I also didn't respond to some of the
> phrasing suggestions above if I didn't understand the goal of them.
> Hope that's okay.

If you do not understand, please ask ;-)

^ permalink raw reply [flat|nested] 89+ messages in thread
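The reachability rule in the proposed NOTE (follow tags to what they tag, commits to their parents and trees, trees to their trees and blobs) amounts to a graph walk starting from the references. A minimal Python sketch over a toy object graph is below. The object names, the `objects` and `refs` dictionaries, and the `reachable` function are all made up for illustration; this is not how Git itself implements the walk, and it deliberately ignores the other anchors mentioned in this message (the index and reflog entries).

```python
# Toy object graph: each (made-up) object name maps to the objects it points to.
objects = {
    "tag-v1": ["commit-b"],                   # tag object -> what it tags
    "commit-b": ["commit-a", "tree-b"],       # commit -> parent(s) and tree
    "commit-a": ["tree-a"],                   # first commit: no parent
    "commit-old": ["commit-a", "tree-old"],   # commit replaced by --amend
    "tree-a": ["blob-readme"],                # tree -> trees and blobs
    "tree-b": ["blob-readme", "blob-hello"],
    "tree-old": ["blob-readme", "blob-typo"],
    "blob-readme": [],
    "blob-hello": [],
    "blob-typo": [],
}

# References are the starting points of the walk.
refs = {
    "refs/heads/main": "commit-b",
    "refs/tags/v1.0.0": "tag-v1",
}


def reachable(objects, roots):
    """Return every object that can be found by following links from roots."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


live = reachable(objects, refs.values())
unreachable = sorted(set(objects) - live)
print(unreachable)
```

In this toy graph the amended-away commit, its tree, and the one blob only that tree used are the objects that no reference leads to, while `blob-readme` stays reachable because the current trees still point at it.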
* [PATCH v5] doc: add an explanation of Git's data model 2025-10-27 19:32 ` [PATCH v4] doc: add an " Julia Evans via GitGitGadget 2025-10-27 21:54 ` Junio C Hamano @ 2025-10-30 20:32 ` Julia Evans via GitGitGadget 2025-10-31 14:44 ` Junio C Hamano ` (3 more replies) 1 sibling, 4 replies; 89+ messages in thread From: Julia Evans via GitGitGadget @ 2025-10-30 20:32 UTC (permalink / raw) To: git Cc: Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans, Julia Evans From: Julia Evans <julia@jvns.ca> Git very often uses the terms "object", "reference", or "index" in its documentation. However, it's hard to find a clear explanation of these terms and how they relate to each other in the documentation. The closest candidates currently are: 1. `gitglossary`. This makes a good effort, but it's an alphabetically ordered dictionary and a dictionary is not a good way to learn concepts. You have to jump around too much and it's not possible to present the concepts in the order that they should be explained. 2. `gitcore-tutorial`. This explains how to use the "core" Git commands. This is a nice document to have, but it's not necessary to learn how `update-index` works to understand Git's data model, and we should not be requiring users to learn how to use the "plumbing" commands if they want to learn what the term "index" or "object" means. 3. `gitrepository-layout`. This is a great resource, but it includes a lot of information about configuration and internal implementation details which are not related to the data model. It also does not explain how commits work. The result of this is that Git users (even users who have been using Git for 15+ years) struggle to read the documentation because they don't know what the core terms mean, and it's not possible to add links to help them learn more. Add an explanation of Git's data model. Some choices I've made in deciding what "core data model" means: 1. 
Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
if those are intended to be user facing or if they're more like
internal implementation details.
2. Don't talk about submodules other than by mentioning how they
relate to trees. This is because Git has a lot of special features,
and explaining how they all work exhaustively could quickly go down a
rabbit hole which would make this document less useful for
understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
(first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into
implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
---
doc: Add an explanation of Git's data model

Changes in v2:

The biggest change is to remove all mentions of the .git directory, and
explain references in a way that doesn't refer to "directories" at all,
and instead talks about the "hierarchy" (from Kristoffer and Patrick's
reviews).

Also:

* objects: Mention that an object ID is called an "object name", and
  update the glossary to include the term "object ID" (from Junio's
  review)
* objects: Replace "SHA-1 hash" with "cryptographic hash" which is more
  accurate (from Patrick's review)
* blobs: Made the explanation of git gc a little higher level and took
  some ideas from Patrick's suggested wording (from Patrick's and
  Kristoffer's reviews)
* commits: Mention that tag objects and commits can optionally have
  other fields. I didn't mention the GPG signature specifically, but
  don't have any objections to adding it. (from Patrick and Junio's
  reviews)
* commits: Remove one of the mentions of git gc, since it perhaps opens
  up too much of a rabbit hole: "how does git gc decide which commits to
  clean up?".
  (from Kristoffer's review)
* tag objects: Add an example of how a tag object is represented (from
  user feedback on the draft)
* index: Use the term "file mode" instead of "permissions", and list all
  allowed file modes (from Patrick's review)
* index: Use "stage number" instead of "number" for index entries (from
  Patrick's review)
* reflogs: Remove "any ref can be logged", it raises some questions of
  "how do you tell Git to log a ref that it isn't normally logging?" and
  my guess is that it's uncommon to ask Git to log more refs. I don't
  think it's a "lie" to omit this but I can bring it back if folks
  disagree. (from Patrick's review)
* reflogs: Fix an error I noticed in the explanation of reflogs: tags
  aren't logged by default and remote-tracking branches are, according
  to man git-config
* branches and tags: Be clearer about how branches are usually updated
  (by committing), and make it a little more obvious that only branches
  can be checked out. This is a bit tricky because using the word "check
  out" introduces a rabbit hole that I want to avoid (what does "check
  out" mean?). I've dealt with this by just talking about the "current
  branch" (HEAD) since that is defined here, and making it more explicit
  that HEAD must either be a branch or a commit, there's no "HEAD is a
  tag" option. (from Patrick's review)
* tags: Explain the differences between annotated and lightweight tags
  (this is the main piece of user feedback I've gotten on the draft so
  far)
* Various style/typo changes ("2 or more", linkgit:git-gc[1], removed
  extra asterisks, added empty SYNOPSIS, "commits -> tags" typo fix, add
  to meson build)

non-changes:

* I still haven't mentioned things that aren't part of the "data model",
  like revision params and configuration. I think there could be a place
  for them but I haven't found it yet.
* tag objects: I noticed that there's a "tag" header field in tag
  objects (like tag v1.0.0) but I didn't mention it yet because I
  couldn't figure out what the purpose of that field is (I thought the
  tag name was stored in the reference, why is it duplicated in the tag
  object?)

Changes in v3:

I asked for feedback from Git users on Mastodon and got 220 pieces of
feedback from 48 different users. People seemed very excited to read
about Git's data model. Usually I judge explanations by what folks
report learning from them. Here people reported learning:

* how branches are stored (that a branch is "a name for a commit")
* how objects work
* that Git has separate "author" and "committer" fields
* that amending a commit does not change it
* that a tree is "just a directory" (not something more complicated),
  and how trees are stored
* that Git repos can contain symlinks
* that Git saves modes separately from the OS.
* how the stage number works
* that when you git add a file, Git will create an object
* that third-party tools can create their own refs.
* that the reflog stores the history of branches (not just HEAD), and
  what reflogs are for

Also (of course) there were quite a few points of confusion! The main 4
pieces of feedback were

1. The index section doesn't explain what the word "staged" means, and
   one person says that it makes it sound like only files that you "git
   add"ed are in the index. Rewrite the explanation to avoid using the
   word "staged" to define the index and instead define the word
   "staging".
2. Explain the difference between "annotated tags" and "lightweight
   tags" (done)
3. Add examples for tag objects and reflogs (done)
4. Mention a little more about where things are stored in the .git
   directory, which I'd removed in v2. This seems most important for
   .git/refs, so I added a hopefully accurate note about how refs are
   stored by default, with a comment about one of the major
   implications.
   I did not discuss where objects or the index are stored, because I
   don't think the implementation details of how objects are stored are
   as important, and there are better tools for viewing the "raw" state
   of objects and the index (with git cat-file -p or git ls-files
   --stage).

Here's every other change I made in response to the feedback, as well as
a few comments that I did not address.

intro:

* Give a 1-sentence intro to "reflog"

objects:

* people really like having git ls-files --stage as a way to view the
  index, so add git cat-file -p as well in a note

commits:

* 2 people asked "Are commits stored as a diff?". Say that diffs are
  calculated at runtime, this is very important.
* The order the fields are given in doesn't match the order in the
  example. Make them match.
* "All the files in the commit, stored as a tree" is throwing a few
  people off. Be clearer that it's the tree ID of the base directory.
* Several people asked "What's the difference between an author and
  committer?" I added an example using git cherry-pick that I'm not 100%
  happy with (what if the reader doesn't know what cherry-pick does?).
  There might be a better example to give here.
* In the note about commits being amended: one person suggested saying
  "creates a new commit with the same parent" to make it clearer what
  the relationship between the new and old commit is. I liked that idea
  so I did it.

trees:

* file modes. 2 people want to know more about "The file mode, for
  example 100644". Also 2 people are curious about what relationship
  these have to Unix permissions. Say that they're inspired by Unix
  permissions, and move the list of possible file modes up to make the
  relationship clearer
* On "so git-gc(1) periodically compresses objects to save disk space",
  there are a few follow up comments wondering about more, which makes
  me think the comment about compression is actually a distraction.
Say something simpler instead, ("Git only needs to store new versions of files which were changed in that commit"), from Junio's suggestion * Re "commit (a Git submodule)": 2 people say it's not clear how trees relate to submodules. Say that it refers to a commit in a different repository. * One person says they're not sure if the "object ID" is a hash. Link it to the definition of "object ID". tag objects: * Requests for an example, added one. * Requests to explain the difference between "lightweight" and "annotated" tags, added it. tags: * one person thinks "It’s expected that a tag will never change after you create it." is too strong (since of course you can change it with git tag -f). Say instead that tags are "usually" not changed. HEAD: * Several people are asking for more detail about detached HEAD state. There's actually quite a lot to talk about here (what it means, how it happens, what it implies, and how you might adjust your workflow to avoid it by using git switch). I don't think we can get into all of that here, so refer to the DETACHED HEAD section of git-checkout instead. I'm not totally happy with the current version of that section but that seems like the most practical solution right now. remote-tracking branches: * discuss refs/remotes/<remote>/HEAD. the index: * "permissions" should be "file mode" (like with trees). Changed. * "filename" should be "file path". Changed. * the stage number can only be 0, 1, 2, or 3, since it's 2 bits. Also maybe say that the numbers have specific meanings. Said it can only be 0/1/2/3 but did not give the specific meanings. reflogs * Request for an example. Added one. * It's not clear if there's one reflog per branch/tag/HEAD, or if there's one universal reflog. Make this clearer. * Mention the role of the reflog in retrieving "lost" commits or undoing bad rebases. Not fixed: * intro: A couple of people say that it's confusing that tags are both "an object" and "a reference". 
Handled this by just explaining the difference between an annotated and a lightweight tag further down. I'd like to make this clearer in the intro but not sure if there's a way to do it. * commits and tag objects: one person asks if there's a reference for the other "optional fields", like "encoding" and "gpgsig". I couldn't find one, so left this as is. * HEAD: A couple of people ask if there are any other symbolic references other than HEAD, or if they can make their own symbolic references. I don't know the answer to this. * HEAD: the HEAD: HEAD thing looks weird, it made more sense when it was HEAD: .git/HEAD. Will think about this. * reflogs: One person asks: if reflogs only store local changes, why does it track the user who made the change? Is that for remote operations like fetches and pulls? Or for cases where more than one user is using the same repo on a system? I don't know the answer to this. * reflogs: How can you see the full data in the reflog? git reflog show doesn't list the user who made the change. git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso seems to work but it's really a mouthful, not sure it's useful to include all that. * index: Is it worth mentioning that the index can be locked? I don't have an opinion about this. * other: One person asks what a "working tree" is. It made me wonder if "the current working directory" has a place in Git's data model. My feeling is "no" but I could be convinced otherwise. * overall: "How can Git be so fast? If I switch branches, how does it figure out what to add, remove or replace?". 
I don't think this is the right place for that discussion but it would * there are some docs CI errors I haven't figured out yet (IDREF attribute linkend references an unknown ID "tree") changes in v4: This is a combination of trying to make some of the intro text a little more "friendly" for someone new to Git's data model, avoiding implying things that are false, and removing information that isn't relevant to the data model. intro: * Add a 1-line description of what a "reflog" is (from user feedback) objects: * Start with a "friendly" description of what an object is, similar to what we do for references and the reflog * Rename "commits" to "commit" and similarly for trees etc (from Junio's review) * Remove the explanation of what git cat-file -p does, since it might be misleading and if people want to know they can read the man page (from Junio's review) commits: * Start by saying that the commit contains the full directory structure of all the files (from Junio's comment about how it may not be clear that the commit contains all the files' exact contents at the time of the commit) * Remove the comment about cherry-pick (from Junio's review) * Replace "ask Git for a diff" with "ask Git to show the commit with git show" (from Junio's review) trees: * Make the description a little more friendly * Reorder so that "type" is defined before we refer to the "type" * Say that file modes are "only spiritually related" to Unix permissions instead of talking about what Git "supports" (from Junio's review) blobs: * Try to make it clearer how "commits use relatively little disk space" is true while not implying that commits are diffs, by using an example (from Junio's review) branches: * Replace "a branch is a name for a commit ID" with "a branch refers to a commit ID" (except in the intro sentence for the "references" section). Similarly for tags etc. 
(from Junio's review) * Remove the note about how branches are stored in .git (from Junio's review) HEAD: * Be clearer that HEAD is not always the current branch, because there may not be a current branch (from Junio's review) index: * Be a little more specific about how exactly the index is converted into a commit. (from Junio's comment about how it's not clear what "every file in the repository" means) reflog: * Be clearer that there are many reflogs (one for each reference with a log), not just one reflog (from Junio and Patrick's reviews) * Omit the user and "Before" commit IDs from the list of fields, because you usually don't see them (from Junio's review) * Show the output of git reflog main in the example instead of the contents of the reflog file, to avoid showing the user and before commit ID changes in v5: Mostly smaller tweaks this time. The only major addition is to add a note about how unreachable objects may be deleted. From Junio's review: * Remove "type" in the description of what's in a tree (since I have learned that is not a separate field, it's part of the file mode) * Fix a typo ("these these") * Remove the intro sentence about what a "commit" is and instead only describe its contents in the list of fields, to avoid implying that a commit is the same as a tree * Say "Unix file modes" instead of "Unix permissions" * In the tag objects contents: make "ID" and "type" separate list items since they're separate fields * in the index section: * list all of the possible file modes (since from my understanding there are fewer allowed file modes here than in a tree) * mention that the object can be either a commit or blob * make the order match the order in git ls-files Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v5 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1981/jvns/gitdatamodel-v5 Pull-Request: https://github.com/gitgitgadget/git/pull/1981 Range-diff vs v4: 1: 92249b5b08 ! 
1: d342255dad doc: add an explanation of Git's data model @@ Documentation/gitdatamodel.adoc (new) + +[[commit]] +commit:: -+ A commit contains the full directory structure of every file -+ in that version of the repository and each file's contents. -+ It has these these required fields ++ A commit contains these required fields + (though there are other optional fields): ++ -+1. The *files* in the commit, stored as the *<<tree,tree>>* ID ++1. The full directory structure of all the files in that version of the ++ repository and each file's contents, stored as the *<<tree,tree>>* ID + of the commit's base directory. +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents @@ Documentation/gitdatamodel.adoc (new) + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` -+2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory), -+ or <<commit,`commit`>> (a Git submodule, which is a -+ commit from a different Git repository) -+3. The *file mode*. Git has these file modes. which are only -+ spiritually related to Unix permissions: ++2. The *file mode*. Git has these file modes. which are only ++ spiritually related to Unix file modes: ++ -+ - `100644`: regular file (with type `blob`) ++ - `100644`: regular file (with <<object,object type>> `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `040000`: directory (with type `tree`) + - `160000`: gitlink, for use with submodules (with type `commit`) + -+4. The <<object-id,*object ID*>> with the contents of the file or directory ++3. 
The <<object-id,*object ID*>> with the contents of the file or directory ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: @@ Documentation/gitdatamodel.adoc (new) + Tag objects contain these required fields + (though there are other optional fields): ++ -+1. The *ID* and *type* of the object (often a commit) that they reference -+2. The *tagger* and tag date -+3. A *tag message*, similar to a commit message ++1. The object *ID* it references ++2. The object *type* ++3. The *tagger* and tag date ++4. A *tag message*, similar to a commit message + +Here's how an example tag object is stored: + @@ Documentation/gitdatamodel.adoc (new) +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + ++NOTE: Git may delete objects that aren't "reachable" from any reference. ++An object is "reachable" if we can find it by following tags to whatever ++they tag, commits to their parents or trees, and trees to the trees or ++blobs that they contain. ++For example, if you amend a commit, with `git commit --amend`, ++the old commit will usually not be reachable, so it may be deleted eventually. ++Reachable objects will never be deleted. ++ +[[index]] +THE INDEX +--------- @@ Documentation/gitdatamodel.adoc (new) + +Each index entry has 4 fields: + -+1. The *<<tree,file mode>>* -+2. The *<<blob,blob>> ID* of the file -+3. The *file path*, for example `src/hello.py` -+4. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if ++1. The *file mode*, which must be one of: ++ - `100644`: regular file (with <<object,object type>> `blob`) ++ - `100755`: executable file (with type `blob`) ++ - `120000`: symbolic link (with type `blob`) ++ - `160000`: gitlink, for use with submodules (with type `commit`) ++2. The *<<blob,blob>>* ID of the file, ++ or (rarely) the *<<commit,commit>>* ID of the submodule ++3. The *stage number*, either 0, 1, 2, or 3. 
This is normally 0, but if + there's a merge conflict there can be multiple versions of the same + filename in the index. ++4. The *file path*, for example `src/hello.py` + +It's extremely uncommon to look at the index directly: normally you'd +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>. Documentation/Makefile | 1 + Documentation/gitdatamodel.adoc | 296 ++++++++++++++++++++++++++++ Documentation/glossary-content.adoc | 4 +- Documentation/meson.build | 1 + 4 files changed, 300 insertions(+), 2 deletions(-) create mode 100644 Documentation/gitdatamodel.adoc diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fb83d0c6e..5f4acfacbd 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc MAN7_TXT += gitcore-tutorial.adoc MAN7_TXT += gitcredentials.adoc MAN7_TXT += gitcvs-migration.adoc +MAN7_TXT += gitdatamodel.adoc MAN7_TXT += gitdiffcore.adoc MAN7_TXT += giteveryday.adoc MAN7_TXT += gitfaq.adoc diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc new file mode 100644 index 0000000000..1cefbb4833 --- /dev/null +++ b/Documentation/gitdatamodel.adoc @@ -0,0 +1,296 @@ +gitdatamodel(7) +=============== + +NAME +---- +gitdatamodel - Git's core data model + +SYNOPSIS +-------- +gitdatamodel + +DESCRIPTION +----------- + +It's not necessary to understand Git's data model to use Git, but it's +very helpful when reading Git's documentation so that you know what it +means when the documentation says "object", "reference" or "index". + +Git's core operations use 4 kinds of data: + +1. <<objects,Objects>>: commits, trees, blobs, and tag objects +2. <<references,References>>: branches, tags, + remote-tracking branches, etc +3. <<index,The index>>, also known as the staging area +4. <<reflogs,Reflogs>>: logs of changes to references ("ref log") + +[[objects]] +OBJECTS +------- + +All of the commits and files in a Git repository are stored as "Git objects". 
+Git objects never change after they're created, and every object has an ID, +like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. + +This means that if you have an object's ID, you can always recover its +exact contents as long as the object hasn't been deleted. + +Every object has: + +[[object-id]] +1. an *ID* (aka "object name"), which is a cryptographic hash of its + type and contents. + It's fast to look up a Git object using its ID. + This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, + and <<tag-object,tag objects>>. +3. *contents*. The structure of the contents depends on the type. + +Here's how each type of object is structured: + +[[commit]] +commit:: + A commit contains these required fields + (though there are other optional fields): ++ +1. The full directory structure of all the files in that version of the + repository and each file's contents, stored as the *<<tree,tree>>* ID + of the commit's base directory. +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored +4. A *committer* and the time the commit was committed. +5. A *commit message* ++ +Here's how an example commit is stored: ++ +---- +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 +author Maya <maya@example.com> 1759173425 -0400 +committer Maya <maya@example.com> 1759173425 -0400 + +Add README +---- ++ +Like all other objects, commits can never be changed after they're created. +For example, "amending" a commit with `git commit --amend` creates a new +commit with the same parent. ++ +Git does not store the diff for a commit: when you ask Git to show +the commit with linkgit:git-show[1], it calculates the diff from its +parent on the fly. 
+ +[[tree]] +tree:: + A tree is how Git represents a directory. + It can contain files or other trees (which are subdirectories). + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` +2. The *file mode*. Git has these file modes. which are only + spiritually related to Unix file modes: ++ + - `100644`: regular file (with <<object,object type>> `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `040000`: directory (with type `tree`) + - `160000`: gitlink, for use with submodules (with type `commit`) + +3. The <<object-id,*object ID*>> with the contents of the file or directory ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: ++ +---- +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + +[[blob]] +blob:: + A blob object contains a file's contents. ++ +When you make a commit, Git stores the full contents of each file that +you changed as a blob. +For example, if you have a commit that changes 2 files in a repository +with 1000 files, that commit will create 2 new blobs, and use the +previous blob ID for the other 998 files. +This means that commits can use relatively little disk space even in a +very large repository. + +[[tag-object]] +tag object:: + Tag objects contain these required fields + (though there are other optional fields): ++ +1. The object *ID* it references +2. The object *type* +3. The *tagger* and tag date +4. A *tag message*, similar to a commit message + +Here's how an example tag object is stored: + +---- +object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 +type commit +tag v1.0.0 +tagger Maya <maya@example.com> 1759927359 -0400 + +Release version 1.0.0 +---- + +NOTE: All of the examples in this section were generated with +`git cat-file -p <object-id>`. 
+ +[[references]] +REFERENCES +---------- + +References are a way to give a name to a commit. +It's easier to remember "the changes I'm working on are on the `turtle` +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + +References can either refer to: + +1. An object ID, usually a <<commit,commit>> ID +2. Another reference. This is called a "symbolic reference". + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. +Most references are under `refs/`. Here are the main types: + +[[branch]] +branches: `refs/heads/<name>`:: + A branch refers to a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] +tags: `refs/tags/<name>`:: + A tag refers to a commit ID, tag object ID, or other object ID. + There are two types of tags: + 1. "Annotated tags", which reference a <<tag-object,tag object>> ID + which contains a tag message + 2. "Lightweight tags", which reference a commit, blob, or tree ID + directly ++ +Even though branches and tags both refer to a commit ID, Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git +will update your <<HEAD,current branch>> to point to the new commit. +Tags are usually not changed after they're created. + +[[HEAD]] +HEAD: `HEAD`:: + `HEAD` is where Git stores your current <<branch,branch>>, + if there is a current branch. `HEAD` can either be: ++ +1. A symbolic reference to your current branch, for example `ref: + refs/heads/main` if your current branch is `main`. +2. A direct reference to a commit ID. In this case there is no current branch. + This is called "detached HEAD state", see the DETACHED HEAD section + of linkgit:git-checkout[1] for more. 
+ +[[remote-tracking-branch]] +remote-tracking branches: `refs/remotes/<remote>/<branch>`:: + A remote-tracking branch refers to a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at + this. ++ +`refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's +default branch. This is the branch that `git clone` checks out by default. + +[[other-refs]] +Other references:: + Git tools may create references anywhere under `refs/`. + For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references + in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + +NOTE: Git may delete objects that aren't "reachable" from any reference. +An object is "reachable" if we can find it by following tags to whatever +they tag, commits to their parents or trees, and trees to the trees or +blobs that they contain. +For example, if you amend a commit, with `git commit --amend`, +the old commit will usually not be reachable, so it may be deleted eventually. +Reachable objects will never be deleted. + +[[index]] +THE INDEX +--------- +The index, also known as the "staging area", is a list of files and +the contents of each file, stored as a <<blob,blob>>. +You can add files to the index or update the contents of a file in the +index with linkgit:git-add[1]. This is called "staging" the file for commit. + +Unlike a <<tree,tree>>, the index is a flat list of files. +When you commit, Git converts the list of files in the index to a +directory <<tree,tree>> and uses that tree in the new <<commit,commit>>. + +Each index entry has 4 fields: + +1. 
The *file mode*, which must be one of: + - `100644`: regular file (with <<object,object type>> `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `160000`: gitlink, for use with submodules (with type `commit`) +2. The *<<blob,blob>>* ID of the file, + or (rarely) the *<<commit,commit>>* ID of the submodule +3. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if + there's a merge conflict there can be multiple versions of the same + filename in the index. +4. The *file path*, for example `src/hello.py` + +It's extremely uncommon to look at the index directly: normally you'd +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>. +But you can use `git ls-files --stage` to see the index. +Here's the output of `git ls-files --stage` in a repository with 2 files: + +---- +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py +---- + +[[reflogs]] +REFLOGS +------- + +Every time a branch, remote-tracking branch, or HEAD is updated, Git +updates a log called a "reflog" for that <<references,reference>>. +This means that if you make a mistake and "lose" a commit, you can +generally recover the commit ID by running `git reflog <reference>`. + +A reflog is a list of log entries. Each entry has: + +1. The *commit ID* +2. *Timestamp* when the change was made +3. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. +They are not shared with remotes. + +You can view a reflog with `git reflog <reference>`. 
+For example, here's the reflog for a `main` branch which has changed twice: + +---- +$ git reflog main --date=iso --no-decorate +750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README +4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit +---- + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc index e423e4765b..20ba121314 100644 --- a/Documentation/glossary-content.adoc +++ b/Documentation/glossary-content.adoc @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a identified by its <<def_object_name,object name>>. The objects usually live in `$GIT_DIR/objects/`. -[[def_object_identifier]]object identifier (oid):: - Synonym for <<def_object_name,object name>>. +[[def_object_identifier]]object identifier, object ID, oid:: + Synonyms for <<def_object_name,object name>>. [[def_object_name]]object name:: The unique identifier of an <<def_object,object>>. The diff --git a/Documentation/meson.build b/Documentation/meson.build index e34965c5b0..ace0573e82 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -192,6 +192,7 @@ manpages = { 'gitcore-tutorial.adoc' : 7, 'gitcredentials.adoc' : 7, 'gitcvs-migration.adoc' : 7, + 'gitdatamodel.adoc' : 7, 'gitdiffcore.adoc' : 7, 'giteveryday.adoc' : 7, 'gitfaq.adoc' : 7, base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8 -- gitgitgadget ^ permalink raw reply related [flat|nested] 89+ messages in thread
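The patch above describes an object ID as "a cryptographic hash of its type and contents". As a minimal sketch of what that means in practice, assuming a repository that uses the default SHA-1 object format (SHA-256 repositories use a different hash), the ID is the hash of a short header followed by the raw contents. The function name `git_object_id` is made up for this illustration and is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, contents: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given type (illustrative helper)."""
    # The hashed data is a header "<type> <size in bytes>\0" followed by the raw contents.
    header = f"{obj_type} {len(contents)}\0".encode("ascii")
    return hashlib.sha1(header + contents).hexdigest()


# A blob containing "hello world\n"
print(git_object_id("blob", b"hello world\n"))
# -> 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# An empty blob
print(git_object_id("blob", b""))
# -> e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the ID depends only on the type and contents, two files with identical contents share one blob, which is why an unchanged file costs no extra storage in a new commit.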
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-10-30 20:32 ` [PATCH v5] " Julia Evans via GitGitGadget @ 2025-10-31 14:44 ` Junio C Hamano 2025-11-03 7:40 ` Patrick Steinhardt 2025-11-03 19:43 ` Julia Evans 2025-10-31 21:49 ` Junio C Hamano ` (2 subsequent siblings) 3 siblings, 2 replies; 89+ messages in thread From: Junio C Hamano @ 2025-10-31 14:44 UTC (permalink / raw) To: Julia Evans via GitGitGadget Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes: > diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc > new file mode 100644 > index 0000000000..1cefbb4833 > --- /dev/null > +++ b/Documentation/gitdatamodel.adoc > @@ -0,0 +1,296 @@ > +gitdatamodel(7) > +=============== > + > +NAME > +---- > +gitdatamodel - Git's core data model > + > +SYNOPSIS > +-------- > +gitdatamodel > + > +DESCRIPTION > +----------- > + > +It's not necessary to understand Git's data model to use Git, but it's > +very helpful when reading Git's documentation so that you know what it > +means when the documentation says "object", "reference" or "index". > + > +Git's core operations use 4 kinds of data: > + > +1. <<objects,Objects>>: commits, trees, blobs, and tag objects > +2. <<references,References>>: branches, tags, > + remote-tracking branches, etc > +3. <<index,The index>>, also known as the staging area > +4. <<reflogs,Reflogs>>: logs of changes to references ("ref log") > + > +[[objects]] > +OBJECTS > +------- > + > +All of the commits and files in a Git repository are stored as "Git objects". > +Git objects never change after they're created, and every object has an ID, > +like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. > + > +This means that if you have an object's ID, you can always recover its > +exact contents as long as the object hasn't been deleted. > + > +Every object has: > + > +[[object-id]] > +1. 
an *ID* (aka "object name"), which is a cryptographic hash of its > + type and contents. > + It's fast to look up a Git object using its ID. > + This is usually represented in hexadecimal, like > + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. > +2. a *type*. There are 4 types of objects: > + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, > + and <<tag-object,tag objects>>. > +3. *contents*. The structure of the contents depends on the type. > + > +Here's how each type of object is structured: > + > +[[commit]] > +commit:: > + A commit contains these required fields > + (though there are other optional fields): > ++ > +1. The full directory structure of all the files in that version of the > + repository and each file's contents, stored as the *<<tree,tree>>* ID > + of the commit's base directory. "base directory" is a new term; I think we most often use "top-level" directory (in various spellings). $ git grep -e 'base directory' -e 'level directory' Documentation/ > +[[tree]] > +tree:: > + A tree is how Git represents a directory. > + It can contain files or other trees (which are subdirectories). > + It lists, for each item in the tree: > ++ > +1. The *filename*, for example `hello.py` > +2. The *file mode*. Git has these file modes. which are only "has these" -> "uses only these" to clarify that this is an exhaustive enumeration and users cannot invent 100664 and others, which is a mistake Git itself used to make/allow. > +[[tag-object]] > +tag object:: > + Tag objects contain these required fields > + (though there are other optional fields): > ++ > +1. The object *ID* it references > +2. The object *type* I would rephrase these to 1. The *ID* of the object it references 2. The *type* of the object it references because (1) a tag object references another object, not ID. To name the object it reference, it uses the object name of it, but just like your name is not you, object name is not the object (it merely is *one* way to refer to it). 
(2) unless it is very clear to readers that "The object" in 1. and 2.
refer to the same object, 2. invites a question "type of which object?".

> +[[branch]]
> +branches: `refs/heads/<name>`::
> +	A branch refers to a commit ID.

    A branch refers to a commit object (by its ID).

Ditto for tags.

> +NOTE: Git may delete objects that aren't "reachable" from any reference.
> +An object is "reachable" if we can find it by following tags to whatever
> +they tag, commits to their parents or trees, and trees to the trees or
> +blobs that they contain.
> +For example, if you amend a commit, with `git commit --amend`,
> +the old commit will usually not be reachable, so it may be deleted eventually.
> +Reachable objects will never be deleted.

Very good write-up. As we would touch upon reflog later in the same
document, we may want to extend the "amend" example a bit, perhaps
like

    Note: Git never deletes objects that are "reachable". An object
    is "reachable" if .... An unreachable object may be deleted.

    For example, ... a newly created commit will replace the old
    commit and the current branch ref points at the new commit. The
    old commit is recorded in the <<reflogs,reflog>> of the current
    branch, so it is still "reachable", but once sufficiently old
    reflog entries have expired, the old commit may become
    unreachable, and at that point it may be deleted.

Other than the above, I found everything very nicely written.

Thanks.
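The reachability rule discussed in this review can be sketched as a graph walk. This is a toy model under stated assumptions, not how Git implements garbage collection: the object IDs (`c1`, `t2_old`, ...), the `objects` dictionary, and the `reachable` helper are all hypothetical, and each object is reduced to the list of IDs it points at.

```python
from collections import deque

# Hypothetical object store: object ID -> IDs that object points at.
objects = {
    "tag1": ["c2"],              # tag object -> the commit it tags
    "c2": ["c1", "t2"],          # amended commit -> parent, tree
    "c2_old": ["c1", "t2_old"],  # commit as it was before the amend
    "c1": ["t1"],                # first commit -> tree (no parents)
    "t1": ["b1"],
    "t2": ["b1", "b2"],
    "t2_old": ["b1"],
    "b1": [],
    "b2": [],
}


def reachable(starts, objects):
    """Return every object ID that can be found by following links from `starts`."""
    seen = set()
    queue = deque(starts)
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue
        seen.add(oid)
        queue.extend(objects[oid])
    return seen


refs = {"refs/heads/main": "c2", "refs/tags/v1": "tag1"}
reflog = {"refs/heads/main": ["c2", "c2_old", "c1"]}

# While the reflog entry exists, the pre-amend commit is still reachable.
with_reflog = reachable(list(refs.values()) + reflog["refs/heads/main"], objects)

# Once the reflog entry has expired, only the refs count as starting points.
without_reflog = reachable(refs.values(), objects)
candidates_for_deletion = set(objects) - without_reflog
```

In this model the old commit and its tree become candidates for deletion only after the reflog entry is gone, which matches the extended "amend" example suggested above.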
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-10-31 14:44 ` Junio C Hamano @ 2025-11-03 7:40 ` Patrick Steinhardt 2025-11-03 15:38 ` Junio C Hamano 2025-11-03 19:43 ` Julia Evans 1 sibling, 1 reply; 89+ messages in thread From: Patrick Steinhardt @ 2025-11-03 7:40 UTC (permalink / raw) To: Junio C Hamano Cc: Julia Evans via GitGitGadget, git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans On Fri, Oct 31, 2025 at 07:44:43AM -0700, Junio C Hamano wrote: > "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes: [snip] > > +[[commit]] > > +commit:: > > + A commit contains these required fields > > + (though there are other optional fields): > > ++ > > +1. The full directory structure of all the files in that version of the > > + repository and each file's contents, stored as the *<<tree,tree>>* ID > > + of the commit's base directory. > > "base directory" is a new term; I think we most often use > "top-level" directory (in various spellings). > > $ git grep -e 'base directory' -e 'level directory' Documentation/ We'd refer to the top-level directory when talking about the worktree. But what's referenced here is not referring to the worktree, but to the commit's tree. And here I think we rather consistently use "root tree", don't we? Our docs already mention "root tree" in several contexts. Patrick ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model
2025-11-03 7:40 ` Patrick Steinhardt
@ 2025-11-03 15:38 ` Junio C Hamano
0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-11-03 15:38 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Julia Evans via GitGitGadget, git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans

Patrick Steinhardt <ps@pks.im> writes:

> We'd refer to the top-level directory when talking about the worktree.
> But what's referenced here is not referring to the worktree, but to the
> commit's tree. And here I think we rather consistently use "root tree",
> don't we? Our docs already mention "root tree" in several contexts.

Ah, thanks. I wasn't aware that we use the phrase "root tree"; I
recall that I've always said something awkward like "the tree that
corresponds to the top-level of your working tree", due to lack of
that exact word.

It would be nice to add it to Documentation/glossary-content.adoc,
perhaps? Here is my attempt (I am not committing this, and I won't
be polishing it myself, but recording it as #leftoverbit material
for somebody else to polish and make it a part of our documentation
set).

 Documentation/glossary-content.adoc | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git c/Documentation/glossary-content.adoc w/Documentation/glossary-content.adoc
index e423e4765b..bdf469f137 100644
--- c/Documentation/glossary-content.adoc
+++ w/Documentation/glossary-content.adoc
@@ -627,6 +627,12 @@ the `refs/tags/` hierarchy is used to represent local tags..
 	To throw away part of the development, i.e. to assign the
 	<<def_head,head>> to an earlier <<def_revision,revision>>.
 
+[[def_root_tree]]root tree::
+	The tree object that corresponds to the top-level directory
+	of a checkout of the project. A <<def_commit,commit>> object
+	holds a snapshot of the project state by recording the object
+	name of its root tree.
+
 [[def_SCM]]SCM::
 	Source code management (tool).
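To make the "root tree" idea concrete, here is a small sketch of how a flat list of paths, like the two index entries shown in the patch, nests into one structure per directory, with the outermost one playing the role of the root tree. It is only an illustration: `build_tree` is a hypothetical helper, and real Git would additionally serialize each directory as a tree object and record the root tree's ID in the commit.

```python
def build_tree(paths):
    """Nest a flat {path: blob ID} mapping into dicts, one dict per directory."""
    root = {}
    for path, blob_id in paths.items():
        *dirs, filename = path.split("/")
        node = root
        for name in dirs:
            # Create the subdirectory's dict on first use, then descend into it.
            node = node.setdefault(name, {})
        node[filename] = blob_id
    return root


# The two index entries from the `git ls-files --stage` example in the patch.
index = {
    "README.md": "8728a858d9d21a8c78488c8b4e70e531b659141f",
    "src/hello.py": "665c637a360874ce43bf74018768a96d2d4d219a",
}

root_tree = build_tree(index)
# root_tree now has a "README.md" entry and a nested "src" dict containing "hello.py".
```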
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-10-31 14:44 ` Junio C Hamano 2025-11-03 7:40 ` Patrick Steinhardt @ 2025-11-03 19:43 ` Julia Evans 2025-11-04 1:34 ` Junio C Hamano 1 sibling, 1 reply; 89+ messages in thread From: Julia Evans @ 2025-11-03 19:43 UTC (permalink / raw) To: Junio C Hamano, Julia Evans Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt On Fri, Oct 31, 2025, at 10:44 AM, Junio C Hamano wrote: > "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc >> new file mode 100644 >> index 0000000000..1cefbb4833 >> --- /dev/null >> +++ b/Documentation/gitdatamodel.adoc >> @@ -0,0 +1,296 @@ >> +gitdatamodel(7) >> +=============== >> + >> +NAME >> +---- >> +gitdatamodel - Git's core data model >> + >> +SYNOPSIS >> +-------- >> +gitdatamodel >> + >> +DESCRIPTION >> +----------- >> + >> +It's not necessary to understand Git's data model to use Git, but it's >> +very helpful when reading Git's documentation so that you know what it >> +means when the documentation says "object", "reference" or "index". >> + >> +Git's core operations use 4 kinds of data: >> + >> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects >> +2. <<references,References>>: branches, tags, >> + remote-tracking branches, etc >> +3. <<index,The index>>, also known as the staging area >> +4. <<reflogs,Reflogs>>: logs of changes to references ("ref log") >> + >> +[[objects]] >> +OBJECTS >> +------- >> + >> +All of the commits and files in a Git repository are stored as "Git objects". >> +Git objects never change after they're created, and every object has an ID, >> +like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. >> + >> +This means that if you have an object's ID, you can always recover its >> +exact contents as long as the object hasn't been deleted. >> + >> +Every object has: >> + >> +[[object-id]] >> +1. 
an *ID* (aka "object name"), which is a cryptographic hash of its >> + type and contents. >> + It's fast to look up a Git object using its ID. >> + This is usually represented in hexadecimal, like >> + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. >> +2. a *type*. There are 4 types of objects: >> + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, >> + and <<tag-object,tag objects>>. >> +3. *contents*. The structure of the contents depends on the type. >> + >> +Here's how each type of object is structured: >> + >> +[[commit]] >> +commit:: >> + A commit contains these required fields >> + (though there are other optional fields): >> ++ >> +1. The full directory structure of all the files in that version of the >> + repository and each file's contents, stored as the *<<tree,tree>>* ID >> + of the commit's base directory. > > "base directory" is a new term; I think we most often use > "top-level" directory (in various spellings). > > $ git grep -e 'base directory' -e 'level directory' Documentation/ > >> +[[tree]] >> +tree:: >> + A tree is how Git represents a directory. >> + It can contain files or other trees (which are subdirectories). >> + It lists, for each item in the tree: >> ++ >> +1. The *filename*, for example `hello.py` >> +2. The *file mode*. Git has these file modes. which are only > > "has these" -> "uses only these" to clarify that this is an > exhaustive enumeration and users cannot invent 100664 and others, > which is a mistake Git itself used to make/allow. I like the idea to make it more explicit that this is an exhaustive enumeration. I'll try changing it to this instead: "These are all of the file modes in Git (which are only spiritually related to Unix file modes):" >> +[[tag-object]] >> +tag object:: >> + Tag objects contain these required fields >> + (though there are other optional fields): >> ++ >> +1. The object *ID* it references >> +2. The object *type* > > I would rephrase these to > > 1. The *ID* of the object it references > 2. 
The *type* of the object it references > > because (1) a tag object references another object, not ID. To name > the object it reference, it uses the object name of it, but just > like your name is not you, object name is not the object (it merely > is *one* way to refer to it). (2) unless it is very clear to readers > that "The object" in 1. and 2. refer to the same object, 2. invites > a question "type of which object?". That makes sense to me, will change it to that. >> +[[branch]] >> +branches: `refs/heads/<name>`:: >> + A branch refers to a commit ID. > > A branch refers to a commit object (by its ID). Ditto for tags. What's the goal of this? I can't tell what misconception you're trying to avoid here. >> +NOTE: Git may delete objects that aren't "reachable" from any reference. >> +An object is "reachable" if we can find it by following tags to whatever >> +they tag, commits to their parents or trees, and trees to the trees or >> +blobs that they contain. >> +For example, if you amend a commit, with `git commit --amend`, >> +the old commit will usually not be reachable, so it may be deleted eventually. >> +Reachable objects will never be deleted. > > Very good write-up. As we would touch upon reflog later in the same > document, we may want to extend the "amend" example a bit, perhaps > like > > Note: Git never deletes objects that are "reachable". An object > is "reachable" if .... An unreachable object may be deleted. > > For example, ... a newly created commit will replace the old > commit and the current branch ref points at the new commit. The > old commit is recorded in the <<reflogs,reflog>> of the current > branch, so it is still "reachable", but sufficiently old reflog > entries are expired away, the old commit may become unreachable > at that point, and would get deleted. I like that, will include something similar, lightly reworded. > Other than the above, I found everything very nicely written. > > Thanks. 
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-03 19:43 ` Julia Evans @ 2025-11-04 1:34 ` Junio C Hamano 2025-11-04 15:45 ` Julia Evans 0 siblings, 1 reply; 89+ messages in thread From: Junio C Hamano @ 2025-11-04 1:34 UTC (permalink / raw) To: Julia Evans Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt "Julia Evans" <julia@jvns.ca> writes: >>> +tree:: >>> + A tree is how Git represents a directory. >>> + It can contain files or other trees (which are subdirectories). >>> + It lists, for each item in the tree: >>> ++ >>> +1. The *filename*, for example `hello.py` >>> +2. The *file mode*. Git has these file modes. which are only >> >> "has these" -> "uses only these" to clarify that this is an >> exhaustive enumeration and users cannot invent 100664 and others, >> which is a mistake Git itself used to make/allow. > > I like the idea to make it more explicit that this is an exhaustive > enumeration. I'll try changing it to this instead: "These are all of the file > modes in Git (which are only spiritually related to Unix file modes):" The primary reason why I suggested "uses only these" was because I thought it would strongly hint that random additions beyond the set is unwelcome. As long as that implication is not lost, I do not have strong preference between "we only use these and nothing else" and your "these are all that we use". >>> +[[tag-object]] >>> +tag object:: >>> + Tag objects contain these required fields >>> + (though there are other optional fields): >>> ++ >>> +1. The object *ID* it references >>> +2. The object *type* >> >> I would rephrase these to >> >> 1. The *ID* of the object it references >> 2. The *type* of the object it references >> >> because (1) a tag object references another object, not ID. To name >> the object it reference, it uses the object name of it, but just >> like your name is not you, object name is not the object (it merely >> is *one* way to refer to it). 
(2) unless it is very clear to readers >> that "The object" in 1. and 2. refer to the same object, 2. invites >> a question "type of which object?". > > That makes sense to me, will change it to that. > >>> +[[branch]] >>> +branches: `refs/heads/<name>`:: >>> + A branch refers to a commit ID. >> >> A branch refers to a commit object (by its ID). Ditto for tags. > > What's the goal of this? I can't tell what misconception you're > trying to avoid here. This comes from the same place as the suggestion for the tag object above, i.e. "a tag object references another object, not ID.". Exactly the same reasoning applies here. A branch refers to a commit, and to name the object it references, it uses the object name of it, but just like your name is not you, object name is not the object itself. Thanks. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-04 1:34 ` Junio C Hamano @ 2025-11-04 15:45 ` Julia Evans 2025-11-04 20:53 ` Junio C Hamano 0 siblings, 1 reply; 89+ messages in thread From: Julia Evans @ 2025-11-04 15:45 UTC (permalink / raw) To: Junio C Hamano Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt On Mon, Nov 3, 2025, at 8:34 PM, Junio C Hamano wrote: > "Julia Evans" <julia@jvns.ca> writes: > >>>> +tree:: >>>> + A tree is how Git represents a directory. >>>> + It can contain files or other trees (which are subdirectories). >>>> + It lists, for each item in the tree: >>>> ++ >>>> +1. The *filename*, for example `hello.py` >>>> +2. The *file mode*. Git has these file modes. which are only >>> >>> "has these" -> "uses only these" to clarify that this is an >>> exhaustive enumeration and users cannot invent 100664 and others, >>> which is a mistake Git itself used to make/allow. >> >> I like the idea to make it more explicit that this is an exhaustive >> enumeration. I'll try changing it to this instead: "These are all of the file >> modes in Git (which are only spiritually related to Unix file modes):" > > The primary reason why I suggested "uses only these" was because I > thought it would strongly hint that random additions beyond the set > is unwelcome. As long as that implication is not lost, I do not > have strong preference between "we only use these and nothing else" > and your "these are all that we use". > >>>> +[[tag-object]] >>>> +tag object:: >>>> + Tag objects contain these required fields >>>> + (though there are other optional fields): >>>> ++ >>>> +1. The object *ID* it references >>>> +2. The object *type* >>> >>> I would rephrase these to >>> >>> 1. The *ID* of the object it references >>> 2. The *type* of the object it references >>> >>> because (1) a tag object references another object, not ID. 
To name >>> the object it reference, it uses the object name of it, but just >>> like your name is not you, object name is not the object (it merely >>> is *one* way to refer to it). (2) unless it is very clear to readers >>> that "The object" in 1. and 2. refer to the same object, 2. invites >>> a question "type of which object?". >> >> That makes sense to me, will change it to that. >> >>>> +[[branch]] >>>> +branches: `refs/heads/<name>`:: >>>> + A branch refers to a commit ID. >>> >>> A branch refers to a commit object (by its ID). Ditto for tags. >> >> What's the goal of this? I can't tell what misconception you're >> trying to avoid here. > > This comes from the same place as the suggestion for the tag object > above, i.e. "a tag object references another object, not ID.". > > Exactly the same reasoning applies here. A branch refers to a > commit, and to name the object it references, it uses the object > name of it, but just like your name is not you, object name is not > the object itself. I agree the ID of a commit is not the same as the commit itself. The reason I said "refers to a commit ID" is that it's a very concise explanation and I don't see any risk that the reader will be confused by it. Unlike with my name, commit IDs uniquely identify commits, so I think it will be clear to the reader that the commit ID is going to be used to retrieve the commit object. The problem with "A branch refers to a commit object (by its ID)." is that it introduces some more potential for confusion: it makes it sound like there might be other ways to refer to a commit object than by its ID. Maybe there's another option? To me this introduces the potential for more confusion and does not solve any specific problem. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model
2025-11-04 15:45 ` Julia Evans
@ 2025-11-04 20:53 ` Junio C Hamano
2025-11-04 21:24 ` Julia Evans
0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-11-04 20:53 UTC (permalink / raw)
To: Julia Evans
Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

> The problem with "A branch refers to a commit object (by its ID)." is

Ah, I didn't mean to say "you must use exactly that phrase".

But a branch refers to a commit object; it does not refer to the name
of a commit object.

Perhaps "a branch ref records the object name of a commit object"
would be better? The unstated implication of the phrasing is that
anybody who reads what is recorded by that ref can then use the
result to refer to (find) the commit object.

> it introduces some more potential for confusion: it makes it
> sound like there might be other ways to refer to a commit object
> than by its ID.

Yes, there is an unbounded number of ways to refer to a commit object.

    $ git show-ref refs/heads/maint
    bb5c624209fcaebd60b9572b2cc8c61086e39b57 refs/heads/maint

The branch ref lets you refer to a commit object by recording its
commit object name bb5c6242, but for humans, it is much easier to
refer to the same commit as "v2.51.2^{commit}", which is far more
memorable. Of course I can use master~32^2 to call the same commit
object, which is less memorable but gives us a hint that the tip of
master fully contains that maintenance release. What's more useful
depends on how the name will be used, and the hexadecimal object
names happen to be how refs record the objects they refer to.
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-04 20:53 ` Junio C Hamano @ 2025-11-04 21:24 ` Julia Evans 2025-11-04 23:45 ` Junio C Hamano 0 siblings, 1 reply; 89+ messages in thread From: Julia Evans @ 2025-11-04 21:24 UTC (permalink / raw) To: Junio C Hamano Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt On Tue, Nov 4, 2025, at 3:53 PM, Junio C Hamano wrote: > "Julia Evans" <julia@jvns.ca> writes: > >> The problem with "A branch refers to a commit object (by its ID)." is > > Ah, I didn't mean to say "you must use exactly that phrase". > > But branch refers to a commit object, it does not refer to the name > of a commit object. > > Perhaps "a branch ref records the object name of a commit object", > would be better? The untold implication of the phrasing is that > anybody who reads what is recorded by that ref can then use the > result to refer to (find) the commit object. > >> it introduces some more potential for confusion: it makes it >> sound like there might be other ways to refer to a commit object >> than by its ID. > > Yes, there are unbound number of ways to refer to a commit object. > > $ git show-ref refs/heads/maint > bb5c624209fcaebd60b9572b2cc8c61086e39b57 refs/heads/maint > > The branch ref let you refer to a commit object by recording its > commit object name bb5c6242, but for humans, it is much easier to > refer to the same commit as "v2.51.2^{commit}", which is far more > memorable. Of course I can use master~32^2 to call the same commit > object, which is less memorable gives us a hint that the tip of > master fully contains that maintenance release. What's more useful > depends on how the name will be used, and the hexadecimal object > names happen to be how refs record the objects they refer to. 
I'm aware that there are ways to refer to a commit other than its ID,
but as far as I know literally every other way to refer to a commit
eventually ends up going through the commit ID to retrieve the commit.

For example, you could use `master~32`, but presumably what that does
is find `master`, look up the commit ID for `master`, go back through
32 parents until it finds the appropriate commit ID, and then look up
the object corresponding to that ID.

I do not see the point of implying that the commit ID is not "special",
or that it's only one of many ways to find a commit, because to me it
seems very special: there is no way I know of to retrieve a commit that
doesn't ultimately end up using the commit ID at some point (though
that ID might not be encoded in hexadecimal).
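The lookup described in this message can be sketched as follows. This is a simplified model rather than Git's actual revision parser: it assumes only branch names followed by `~N` and `^N` suffixes, and the commit IDs (`c1`, `m1`, ...), the `refs` and `parents` tables, and the `resolve` function are hypothetical.

```python
import re

# Hypothetical data: one branch ref, and each commit's list of parent IDs.
refs = {"refs/heads/master": "c3"}
parents = {
    "c3": ["c2", "m1"],  # merge commit: first parent c2, second parent m1
    "c2": ["c1"],
    "m1": ["c1"],
    "c1": [],            # first commit in the repository
}


def resolve(rev, refs, parents):
    """Resolve a revision like 'master~2' or 'master^2' to a commit ID."""
    name, steps = re.fullmatch(r"([^~^]+)((?:[~^]\d*)*)", rev).groups()
    # Every lookup starts from the commit ID recorded by the branch ref.
    commit_id = refs[f"refs/heads/{name}"]
    for op, count in re.findall(r"([~^])(\d*)", steps):
        n = int(count) if count else 1
        if op == "~":
            # "~N": follow the first parent N times.
            for _ in range(n):
                commit_id = parents[commit_id][0]
        elif n > 0:
            # "^N": pick the Nth parent ("^0" means the commit itself).
            commit_id = parents[commit_id][n - 1]
    return commit_id


print(resolve("master", refs, parents))     # -> c3
print(resolve("master~2", refs, parents))   # -> c1
print(resolve("master^2", refs, parents))   # -> m1
```

In this model every revision expression ends in a commit ID, while the ref remains just one of several starting points for finding it.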
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-04 21:24 ` Julia Evans @ 2025-11-04 23:45 ` Junio C Hamano 2025-11-05 0:02 ` Julia Evans 0 siblings, 1 reply; 89+ messages in thread From: Junio C Hamano @ 2025-11-04 23:45 UTC (permalink / raw) To: Julia Evans Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt "Julia Evans" <julia@jvns.ca> writes: > I do not see the point of implying that the commit ID is not "special", or that > it's only one of many ways to find a commit because to me it seems very special, > since there is no way I know of to retrieve a commit that doesn't ultimately > end up using the commit ID at some point. (though that ID might not be encoded > in hexadecimal) That is not what I am trying to say. The hexadecimal name is the most neutral way to refer to a commit object, and in that sense it is special. It is the way ref subsystem uses to record the name of objects, and that makes it special enough. But that does not mean that the name _is_ the object. The hexadecimal name is a way you use to name the object, but is not the object itself, and the special-ness of that name does not change it. ^ permalink raw reply [flat|nested] 89+ messages in thread
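The distinction between an object and its name can be made concrete with a small sketch. It uses a blob instead of a commit because the format is simpler, and it assumes the SHA-1 object format, where a blob's name is the hash of the header `blob <size>` followed by a NUL byte and then the content. `blob_id` is a hypothetical helper name.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Compute the object name of a blob, assuming the SHA-1 object format."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

content = b""            # the object's content: here, an empty file
name = blob_id(content)  # the object's name: a 40-character hex string
print(name)              # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The name is derived from the content and is enough to find it, but the two are different values, which matches the point that the hexadecimal name names the object without being the object.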
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-04 23:45 ` Junio C Hamano @ 2025-11-05 0:02 ` Julia Evans 2025-11-05 3:21 ` Ben Knoble 0 siblings, 1 reply; 89+ messages in thread From: Julia Evans @ 2025-11-05 0:02 UTC (permalink / raw) To: Junio C Hamano Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote: > "Julia Evans" <julia@jvns.ca> writes: > >> I do not see the point of implying that the commit ID is not "special", or that >> it's only one of many ways to find a commit because to me it seems very special, >> since there is no way I know of to retrieve a commit that doesn't ultimately >> end up using the commit ID at some point. (though that ID might not be encoded >> in hexadecimal) > > That is not what I am trying to say. The hexadecimal name is the > most neutral way to refer to a commit object, and in that sense it > is special. It is the way ref subsystem uses to record the name of > objects, and that makes it special enough. > > But that does not mean that the name _is_ the object. The > hexadecimal name is a way you use to name the object, but is not the > object itself, and the special-ness of that name does not change it. Okay. I still do not understand at all why this is so important to you (for the reasons I mentioned before) but I'll see if there's anything I can do. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-05 0:02 ` Julia Evans @ 2025-11-05 3:21 ` Ben Knoble 2025-11-05 16:26 ` Julia Evans 0 siblings, 1 reply; 89+ messages in thread From: Ben Knoble @ 2025-11-05 3:21 UTC (permalink / raw) To: Julia Evans Cc: Junio C Hamano, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt > Le 4 nov. 2025 à 19:02, Julia Evans <julia@jvns.ca> a écrit : > > > >> On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote: >> "Julia Evans" <julia@jvns.ca> writes: >>> I do not see the point of implying that the commit ID is not "special", or that >>> it's only one of many ways to find a commit because to me it seems very special, >>> since there is no way I know of to retrieve a commit that doesn't ultimately >>> end up using the commit ID at some point. (though that ID might not be encoded >>> in hexadecimal) >> That is not what I am trying to say. The hexadecimal name is the >> most neutral way to refer to a commit object, and in that sense it >> is special. It is the way ref subsystem uses to record the name of >> objects, and that makes it special enough. >> But that does not mean that the name _is_ the object. The >> hexadecimal name is a way you use to name the object, but is not the >> object itself, and the special-ness of that name does not change it. > > Okay. I still do not understand at all why this is so important to you > (for the reasons I mentioned before) but I'll see if there's anything I can do. Perhaps one way to look at is, what diagram would I draw given different textual explanations? The diagram we _want_ folks to draw (?) is the one where a branch points at a commit [a circle, perhaps], which points to a tree [triangle] and recursively blobs [squares], like I’ve seen Stolee draw for GitHub blogs. We might also want folks to label the arrows with names, or not. 
One way to interpret the “branch refers to a commit ID” might be to draw a diagram where the branch points to an ID label, and to find the circle you have to separately consult a different part of the diagram. Both seem useful to me, though the former has fewer moving pieces, so it might be better for the model this document describes? I dunno. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-05 3:21 ` Ben Knoble @ 2025-11-05 16:26 ` Julia Evans 2025-11-06 3:07 ` Ben Knoble 0 siblings, 1 reply; 89+ messages in thread From: Julia Evans @ 2025-11-05 16:26 UTC (permalink / raw) To: D. Ben Knoble Cc: Junio C Hamano, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt On Tue, Nov 4, 2025, at 10:21 PM, Ben Knoble wrote: >> Le 4 nov. 2025 à 19:02, Julia Evans <julia@jvns.ca> a écrit : >> >> >> >>> On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote: >>> "Julia Evans" <julia@jvns.ca> writes: >>>> I do not see the point of implying that the commit ID is not "special", or that >>>> it's only one of many ways to find a commit because to me it seems very special, >>>> since there is no way I know of to retrieve a commit that doesn't ultimately >>>> end up using the commit ID at some point. (though that ID might not be encoded >>>> in hexadecimal) >>> That is not what I am trying to say. The hexadecimal name is the >>> most neutral way to refer to a commit object, and in that sense it >>> is special. It is the way ref subsystem uses to record the name of >>> objects, and that makes it special enough. >>> But that does not mean that the name _is_ the object. The >>> hexadecimal name is a way you use to name the object, but is not the >>> object itself, and the special-ness of that name does not change it. >> >> Okay. I still do not understand at all why this is so important to you >> (for the reasons I mentioned before) but I'll see if there's anything I can do. > > Perhaps one way to look at is, what diagram would I draw given > different textual explanations? > > The diagram we _want_ folks to draw (?) is the one where a branch > points at a commit [a circle, perhaps], which points to a tree > [triangle] and recursively blobs [squares], like I’ve seen Stolee draw > for GitHub blogs. > > We might also want folks to label the arrows with names, or not. 
> > One way to interpret the “branch refers to a commit ID” might be to > draw a diagram where the branch points to an ID label, and to find the > circle you have to separately consult a different part of the diagram. Yes, the most common type of Git diagram I see is something like this: https://git-scm.com/book/en/v2/images/head-to-master.png which only includes references, commits, and HEAD. That's the diagram I have in mind when writing this text, and I think it's a useful and accurate diagram to keep in mind, and it's one that you see very often when using Git tools, including in `git log --graph`. (it's not a _complete_ diagram of every type of object, but diagrams do not need to be complete to be accurate) I personally would not use a graph diagram to explain how commits relate to trees and blobs (normally I use `git cat-file -p` instead, like I did in this `gitdatamodel` document. You can see this comic for a "visual" example of how I've approached discussing trees and blobs in the past with `git cat-file -p` https://wizardzines.com/comics/explore-a-commit/). > Both seem useful to me, though as the former has fewer moving pieces > might be better for the model this document describes? I dunno. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-05 16:26 ` Julia Evans @ 2025-11-06 3:07 ` Ben Knoble 0 siblings, 0 replies; 89+ messages in thread From: Ben Knoble @ 2025-11-06 3:07 UTC (permalink / raw) To: Julia Evans Cc: Junio C Hamano, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt > Le 5 nov. 2025 à 11:27, Julia Evans <julia@jvns.ca> a écrit : > > > > On Tue, Nov 4, 2025, at 10:21 PM, Ben Knoble wrote: >>>> Le 4 nov. 2025 à 19:02, Julia Evans <julia@jvns.ca> a écrit : >>> >>> >>> >>>> On Tue, Nov 4, 2025, at 6:45 PM, Junio C Hamano wrote: >>>> "Julia Evans" <julia@jvns.ca> writes: >>>>> I do not see the point of implying that the commit ID is not "special", or that >>>>> it's only one of many ways to find a commit because to me it seems very special, >>>>> since there is no way I know of to retrieve a commit that doesn't ultimately >>>>> end up using the commit ID at some point. (though that ID might not be encoded >>>>> in hexadecimal) >>>> That is not what I am trying to say. The hexadecimal name is the >>>> most neutral way to refer to a commit object, and in that sense it >>>> is special. It is the way ref subsystem uses to record the name of >>>> objects, and that makes it special enough. >>>> But that does not mean that the name _is_ the object. The >>>> hexadecimal name is a way you use to name the object, but is not the >>>> object itself, and the special-ness of that name does not change it. >>> >>> Okay. I still do not understand at all why this is so important to you >>> (for the reasons I mentioned before) but I'll see if there's anything I can do. >> >> Perhaps one way to look at is, what diagram would I draw given >> different textual explanations? >> >> The diagram we _want_ folks to draw (?) is the one where a branch >> points at a commit [a circle, perhaps], which points to a tree >> [triangle] and recursively blobs [squares], like I’ve seen Stolee draw >> for GitHub blogs. 
>> >> We might also want folks to label the arrows with names, or not. >> >> One way to interpret the “branch refers to a commit ID” might be to >> draw a diagram where the branch points to an ID label, and to find the >> circle you have to separately consult a different part of the diagram. > > Yes, the most common type of Git diagram I see is something like this: > https://git-scm.com/book/en/v2/images/head-to-master.png > which only includes references, commits, and HEAD. > > That's the diagram I have in mind when writing this text, and I think it's > a useful and accurate diagram to keep in mind, and it's one that you see > very often when using Git tools, including in `git log --graph`. (it's not > a _complete_ diagram of every type of object, but diagrams do not need to be > complete to be accurate) > > I personally would not use a graph diagram to explain how commits relate to > trees and blobs (normally I use `git cat-file -p` instead, like I did in this > `gitdatamodel` document. You can see this comic for a "visual" example of how > I've approached discussing trees and blobs in the past with `git cat-file -p` > https://wizardzines.com/comics/explore-a-commit/). Fair enough. Here’s a post I “stole” ;) the shapes from, for posterity: https://github.blog/open-source/git/commits-are-snapshots-not-diffs My larger point was: since these are the diagrams I’m imagining we want to convey to a reader, perhaps ID can be omitted for brevity? IOW, the relationship between objects is the thing to highlight. OTOH, when exploring the data, especially at the plumbing level it seems we have to do the “pointer-chasing” ourselves (see cat-file). So idk. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model
2025-10-30 20:32 ` [PATCH v5] " Julia Evans via GitGitGadget
2025-10-31 14:44 ` Junio C Hamano
@ 2025-10-31 21:49 ` Junio C Hamano
2025-11-03 7:40 ` Patrick Steinhardt
2025-11-07 19:52 ` [PATCH v6] " Julia Evans via GitGitGadget
3 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-10-31 21:49 UTC (permalink / raw)
To: Julia Evans via GitGitGadget
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans

The document refers to <<object,object type>>, but the ID used to refer to the description of objects is defined as [[objects]]; we need a band-aid like this one to pass GitHub Actions CI.

As the descriptions of individual object types are titled in the singular, like [[commit]], [[blob]], etc., this band-aid drops the plural 's' from the tail of [[objects]], but as long as we are consistent, of course, we could go the other direction.

diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
index 1cefbb4833..eaab3f800b 100644
--- a/Documentation/gitdatamodel.adoc
+++ b/Documentation/gitdatamodel.adoc
@@ -18,13 +18,13 @@ means when the documentation says "object", "reference" or "index".

 Git's core operations use 4 kinds of data:

-1. <<objects,Objects>>: commits, trees, blobs, and tag objects
+1. <<object,Objects>>: commits, trees, blobs, and tag objects
 2. <<references,References>>: branches, tags, remote-tracking branches, etc
 3. <<index,The index>>, also known as the staging area
 4. <<reflogs,Reflogs>>: logs of changes to references ("ref log")

-[[objects]]
+[[object]]
 OBJECTS
 -------

--
2.51.2-719-gbbf487eab4

^ permalink raw reply related [flat|nested] 89+ messages in thread
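The kind of mismatch this fix addresses (a cross-reference whose target ID is never defined) can be caught locally with a rough check. The sketch below is not the project's CI check: it is a regex-based approximation that only knows the `<<id,text>>` and `[[id]]` forms, and `dangling_xrefs` and `sample` are hypothetical names.

```python
import re

# [[id]] or [[id,label]] defines an anchor; <<id>> or <<id,text>> refers to one.
ANCHOR = re.compile(r"\[\[([^\],]+)(?:,[^\]]*)?\]\]")
XREF = re.compile(r"<<([^,>]+)(?:,[^>]*)?>>")

def dangling_xrefs(text):
    """Return the cross-reference targets that have no matching anchor."""
    anchors = set(ANCHOR.findall(text))
    targets = set(XREF.findall(text))
    return sorted(targets - anchors)

sample = """\
1. <<objects,Objects>>: commits, trees, blobs, and tag objects
2. <<references,References>>: branches, tags, remote-tracking branches, etc

[[objects]]
OBJECTS
-------
See <<object,object type>>.

[[references]]
REFERENCES
----------
"""

print(dangling_xrefs(sample))
```

On this sample the only unresolved target is `object`, the same inconsistency between `<<object,...>>` and `[[objects]]` that the message describes.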
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-10-30 20:32 ` [PATCH v5] " Julia Evans via GitGitGadget 2025-10-31 14:44 ` Junio C Hamano 2025-10-31 21:49 ` Junio C Hamano @ 2025-11-03 7:40 ` Patrick Steinhardt 2025-11-03 19:52 ` Julia Evans 2025-11-07 19:52 ` [PATCH v6] " Julia Evans via GitGitGadget 3 siblings, 1 reply; 89+ messages in thread From: Patrick Steinhardt @ 2025-11-03 7:40 UTC (permalink / raw) To: Julia Evans via GitGitGadget Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans On Thu, Oct 30, 2025 at 08:32:16PM +0000, Julia Evans via GitGitGadget wrote: > diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc > new file mode 100644 > index 0000000000..1cefbb4833 > --- /dev/null > +++ b/Documentation/gitdatamodel.adoc [snip] > +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, > + regular commits have 1 parent, merge commits have 2 or more parents > +3. An *author* and the time the commit was authored > +4. A *committer* and the time the commit was committed. > +5. A *commit message* Nit: The punctuation is a bit inconsistent here, as some list items have a trailing dot while others don't. > +[[references]] > +REFERENCES > +---------- > + > +References are a way to give a name to a commit. > +It's easier to remember "the changes I'm working on are on the `turtle` > +branch" than "the changes are in commit bb69721404348e". > +Git often uses "ref" as shorthand for "reference". > + > +References can either refer to: > + > +1. An object ID, usually a <<commit,commit>> ID > +2. Another reference. This is called a "symbolic reference". Same here. Other than these two nits and Junio's comments I think this is in a good enough shape. Thanks for working on this! Patrick ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v5] doc: add an explanation of Git's data model 2025-11-03 7:40 ` Patrick Steinhardt @ 2025-11-03 19:52 ` Julia Evans 0 siblings, 0 replies; 89+ messages in thread From: Julia Evans @ 2025-11-03 19:52 UTC (permalink / raw) To: Patrick Steinhardt, Julia Evans; +Cc: git, Kristoffer Haugsbakk, D. Ben Knoble On Mon, Nov 3, 2025, at 2:40 AM, Patrick Steinhardt wrote: > On Thu, Oct 30, 2025 at 08:32:16PM +0000, Julia Evans via GitGitGadget wrote: >> diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc >> new file mode 100644 >> index 0000000000..1cefbb4833 >> --- /dev/null >> +++ b/Documentation/gitdatamodel.adoc > [snip] >> +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, >> + regular commits have 1 parent, merge commits have 2 or more parents >> +3. An *author* and the time the commit was authored >> +4. A *committer* and the time the commit was committed. >> +5. A *commit message* > > Nit: The punctuation is a bit inconsistent here, as some list items have > a trailing dot while others don't. Thanks, will fix. >> +[[references]] >> +REFERENCES >> +---------- >> + >> +References are a way to give a name to a commit. >> +It's easier to remember "the changes I'm working on are on the `turtle` >> +branch" than "the changes are in commit bb69721404348e". >> +Git often uses "ref" as shorthand for "reference". >> + >> +References can either refer to: >> + >> +1. An object ID, usually a <<commit,commit>> ID >> +2. Another reference. This is called a "symbolic reference". > > Same here. > > Other than these two nits and Junio's comments I think this is in a good > enough shape. Thanks for working on this! > > Patrick ^ permalink raw reply [flat|nested] 89+ messages in thread
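The two kinds of reference quoted above (one that records an object ID, and a symbolic reference that records another reference) can be sketched as a small lookup loop. This is a toy model, not Git's ref backend: the `ref: ` prefix is this sketch's marker for a symbolic reference, `resolve_ref` and `max_depth` are hypothetical, and the ID is the one shown for `refs/heads/maint` earlier in the thread.

```python
def resolve_ref(refs, name, max_depth=5):
    """Follow symbolic references until an object ID is reached."""
    for _ in range(max_depth):
        value = refs[name]
        if value.startswith("ref: "):
            name = value[len("ref: "):]  # symbolic: points at another reference
        else:
            return value                 # direct: records an object ID
    raise ValueError("too many levels of symbolic references")

refs = {
    "HEAD": "ref: refs/heads/maint",
    "refs/heads/maint": "bb5c624209fcaebd60b9572b2cc8c61086e39b57",
}

print(resolve_ref(refs, "HEAD"))
```

The depth limit is there only so that a cycle of symbolic references in the toy data ends with an error instead of looping forever.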
* [PATCH v6] doc: add an explanation of Git's data model 2025-10-30 20:32 ` [PATCH v5] " Julia Evans via GitGitGadget ` (2 preceding siblings ...) 2025-11-03 7:40 ` Patrick Steinhardt @ 2025-11-07 19:52 ` Julia Evans via GitGitGadget 2025-11-07 21:03 ` Junio C Hamano ` (2 more replies) 3 siblings, 3 replies; 89+ messages in thread From: Julia Evans via GitGitGadget @ 2025-11-07 19:52 UTC (permalink / raw) To: git Cc: Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans, Julia Evans From: Julia Evans <julia@jvns.ca> Git very often uses the terms "object", "reference", or "index" in its documentation. However, it's hard to find a clear explanation of these terms and how they relate to each other in the documentation. The closest candidates currently are: 1. `gitglossary`. This makes a good effort, but it's an alphabetically ordered dictionary and a dictionary is not a good way to learn concepts. You have to jump around too much and it's not possible to present the concepts in the order that they should be explained. 2. `gitcore-tutorial`. This explains how to use the "core" Git commands. This is a nice document to have, but it's not necessary to learn how `update-index` works to understand Git's data model, and we should not be requiring users to learn how to use the "plumbing" commands if they want to learn what the term "index" or "object" means. 3. `gitrepository-layout`. This is a great resource, but it includes a lot of information about configuration and internal implementation details which are not related to the data model. It also does not explain how commits work. The result of this is that Git users (even users who have been using Git for 15+ years) struggle to read the documentation because they don't know what the core terms mean, and it's not possible to add links to help them learn more. Add an explanation of Git's data model. Some choices I've made in deciding what "core data model" means: 1. 
Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me if those are intended to be user facing or if they're more like internal implementation details.
2. Don't talk about submodules other than by mentioning how they relate to trees. This is because Git has a lot of special features, and explaining how they all work exhaustively could quickly go down a rabbit hole which would make this document less useful for understanding Git's core behaviour.
3. Don't discuss the structure of a commit message (first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
---
doc: Add a explanation of Git's data model

Changes in v2:

The biggest change is to remove all mentions of the .git directory, and explain references in a way that doesn't refer to "directories" at all, and instead talks about the "hierarchy" (from Kristoffer and Patrick's reviews).

Also:

* objects: Mention that an object ID is called an "object name", and update the glossary to include the term "object ID" (from Junio's review)
* objects: Replace "SHA-1 hash" with "cryptographic hash" which is more accurate (from Patrick's review)
* blobs: Made the explanation of git gc a little higher level and took some ideas from Patrick's suggested wording (from Patrick's and Kristoffer's reviews)
* commits: Mention that tag objects and commits can optionally have other fields. I didn't mention the GPG signature specifically, but don't have any objections to adding it. (from Patrick and Junio's reviews)
* commits: Remove one of the mentions of git gc, since it perhaps opens up too much of a rabbit hole: "how does git gc decide which commits to clean up?".
  (from Kristoffer's review)
* tag objects: Add an example of how a tag object is represented (from user feedback on the draft)
* index: Use the term "file mode" instead of "permissions", and list all allowed file modes (from Patrick's review)
* index: Use "stage number" instead of "number" for index entries (from Patrick's review)
* reflogs: Remove "any ref can be logged"; it raises some questions of "how do you tell Git to log a ref that it isn't normally logging?" and my guess is that it's uncommon to ask Git to log more refs. I don't think it's a "lie" to omit this but I can bring it back if folks disagree. (from Patrick's review)
* reflogs: Fix an error I noticed in the explanation of reflogs: tags aren't logged by default and remote-tracking branches are, according to man git-config
* branches and tags: Be clearer about how branches are usually updated (by committing), and make it a little more obvious that only branches can be checked out. This is a bit tricky because using the word "check out" introduces a rabbit hole that I want to avoid (what does "check out" mean?). I've dealt with this by just talking about the "current branch" (HEAD) since that is defined here, and making it more explicit that HEAD must either be a branch or a commit; there's no "HEAD is a tag" option. (from Patrick's review)
* tags: Explain the differences between annotated and lightweight tags (this is the main piece of user feedback I've gotten on the draft so far)
* Various style/typo changes ("2 or more", linkgit:git-gc[1], removed extra asterisks, added empty SYNOPSIS, "commits -> tags" typo fix, add to meson build)

non-changes:

* I still haven't mentioned things that aren't part of the "data model", like revision params and configuration. I think there could be a place for them but I haven't found it yet.
* tag objects: I noticed that there's a "tag" header field in tag objects (like tag v1.0.0) but I didn't mention it yet because I couldn't figure out what the purpose of that field is (I thought the tag name was stored in the reference, why is it duplicated in the tag object?)

Changes in v3:

I asked for feedback from Git users on Mastodon and got 220 pieces of feedback from 48 different users. People seemed very excited to read about Git's data model. Usually I judge explanations by what folks report learning from them. Here people reported learning:

* how branches are stored (that a branch is "a name for a commit")
* how objects work
* that Git has separate "author" and "committer" fields
* that amending a commit does not change it
* that a tree is "just a directory" (not something more complicated), and how trees are stored
* that Git repos can contain symlinks
* that Git saves modes separately from the OS
* how the stage number works
* that when you git add a file, Git will create an object
* that third-party tools can create their own refs
* that the reflog stores the history of branches (not just HEAD), and what reflogs are for

Also (of course) there were quite a few points of confusion! The main 4 pieces of feedback were:

1. The index section doesn't explain what the word "staged" means, and one person says that it makes it sound like only files that you "git add"ed are in the index. Rewrite the explanation to avoid using the word "staged" to define the index and instead define the word "staging".
2. Explain the difference between "annotated tags" and "lightweight tags" (done)
3. Add examples for tag objects and reflogs (done)
4. Mention a little more about where things are stored in the .git directory, which I'd removed in v2. This seems most important for .git/refs, so I added a hopefully accurate note about how refs are stored by default, with a comment about one of the major implications.
   I did not discuss where objects or the index are stored, because I don't think the implementation details of how objects are stored are as important, and there are better tools for viewing the "raw" state of objects and the index (with git cat-file -p or git ls-files --stage).

Here's every other change I made in response to the feedback, as well as a few comments that I did not address.

intro:

* Give a 1-sentence intro to "reflog"

objects:

* people really like having git ls-files --stage as a way to view the index, so add git cat-file -p as well in a note

commits:

* 2 people asked "Are commits stored as a diff?". Say that diffs are calculated at runtime; this is very important.
* The order the fields are given in doesn't match the order in the example. Make them match.
* "All the files in the commit, stored as a tree" is throwing a few people off. Be clearer that it's the tree ID of the base directory.
* Several people asked "What's the difference between an author and committer?" I added an example using git cherry-pick that I'm not 100% happy with (what if the reader doesn't know what cherry-pick does?). There might be a better example to give here.
* In the note about commits being amended: one person suggested saying "creates a new commit with the same parent" to make it clearer what the relationship between the new and old commit is. I liked that idea so I did it.

trees:

* file modes. 2 people want to know more about "The file mode, for example 100644". Also 2 people are curious about what relationship these have to Unix permissions. Say that they're inspired by Unix permissions, and move the list of possible file modes up to make the relationship clearer
* On "so git-gc(1) periodically compresses objects to save disk space", there are a few follow up comments wondering about more, which makes me think the comment about compression is actually a distraction.
Say something simpler instead, ("Git only needs to store new versions of files which were changed in that commit"), from Junio's suggestion * Re "commit (a Git submodule)": 2 people say it's not clear how trees relate to submodules. Say that it refers to a commit in a different repository. * One person says they're not sure if the "object ID" is a hash. Link it to the definition of "object ID". tag objects: * Requests for an example, added one. * Requests to explain the difference between "lightweight" and "annotated" tags, added it. tags: * one person thinks "It’s expected that a tag will never change after you create it." is too strong (since of course you can change it with git tag -f). Say instead that tags are "usually" not changed. HEAD: * Several people are asking for more detail about detached HEAD state. There's actually quite a lot to talk about here (what it means, how it happens, what it implies, and how you might adjust your workflow to avoid it by using git switch). I don't think we can get into all of that here, so refer to the DETACHED HEAD section of git-checkout instead. I'm not totally happy with the current version of that section but that seems like the most practical solution right now. remote-tracking branches: * discuss refs/remotes/<remote>/HEAD. the index: * "permissions" should be "file mode" (like with trees). Changed. * "filename" should be "file path". Changed. * the stage number can only be 0, 1, 2, or 3, since it's 2 bits. Also maybe say that the numbers have specific meanings. Said it can only be 0/1/2/3 but did not give the specific meanings. reflogs * Request for an example. Added one. * It's not clear if there's one reflog per branch/tag/HEAD, or if there's one universal reflog. Make this clearer. * Mention the role of the reflog in retrieving "lost" commits or undoing bad rebases. Not fixed: * intro: A couple of people say that it's confusing that tags are both "an object" and "a reference". 
Handled this by just explaining the difference between an annotated and a lightweight tag further down. I'd like to make this clearer in the intro but not sure if there's a way to do it. * commits and tag objects: one person asks if there's a reference for the other "optional fields", like "encoding" and "gpgsig". I couldn't find one, so left this as is. * HEAD: A couple of people ask if there are any other symbolic references other than HEAD, or if they can make their own symbolic references. I don't know the answer to this. * HEAD: the HEAD: HEAD thing looks weird, it made more sense when it was HEAD: .git/HEAD. Will think about this. * reflogs: One person asks: if reflogs only store local changes, why does it track the user who made the change? Is that for remote operations like fetches and pulls? Or for cases where more than one user is using the same repo on a system? I don't know the answer to this. * reflogs: How can you see the full data in the reflog? git reflog show doesn't list the user who made the change. git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso seems to work but it's really a mouthful, not sure it's useful to include all that. * index: Is it worth mentioning that the index can be locked? I don't have an opinion about this. * other: One person asks what a "working tree" is. It made me wonder if "the current working directory" has a place in Git's data model. My feeling is "no" but I could be convinced otherwise. * overall: "How can Git be so fast? If I switch branches, how does it figure out what to add, remove or replace?". 
I don't think this is the right place for that discussion but it would * there are some docs CI errors I haven't figured out yet (IDREF attribute linkend references an unknown ID "tree") changes in v4: This is a combination of trying to make some of the intro text a little more "friendly" for someone new to Git's data model, avoiding implying things that are false, and removing information that isn't relevant to the data model. intro: * Add a 1-line description of what a "reflog" is (from user feedback) objects: * Start with a "friendly" description of what an object is, similar to what we do for references and the reflog * Rename "commits" to "commit" and similarly for trees etc (from Junio's review) * Remove the explanation of what git cat-file -p does, since it might be misleading and if people want to know they can read the man page (from Junio's review) commits: * Start by saying that the commit contains the full directory structure of all the files (from Junio's comment about how it may not be clear that the commit contains all the files' exact contents at the time of the commit) * Remove the comment about cherry-pick (from Junio's review) * Replace "ask Git for a diff" with "ask Git to show the commit with git show" (from Junio's review) trees: * Make the description a little more friendly * Reorder so that "type" is defined before we refer to the "type" * Say that file modes are "only spiritually related" to Unix permissions instead of talking about what Git "supports" (from Junio's review) blobs: * Try to make it clearer how "commits use relatively little disk space" is true while not implying that commits are diffs, by using an example (from Junio's review) branches: * Replace "a branch is a name for a commit ID" with "a branch refers to a commit ID" (except in the intro sentence for the "references" section). Similarly for tags etc. 
(from Junio's review) * Remove the note about how branches are stored in .git (from Junio's review) HEAD: * Be clearer that HEAD is not always the current branch, because there may not be a current branch (from Junio's review) index: * Be a little more specific about how exactly the index is converted into a commit. (from Junio's comment about how it's not clear what "every file in the repository" means) reflog: * Be clearer that there are many reflogs (one for each reference with a log), not just one reflog (from Junio and Patrick's reviews) * Omit the user and "Before" commit IDs from the list of fields, because you usually don't see them (from Junio's review) * Show the output of git reflog main in the example instead of the contents of the reflog file, to avoid showing the user and before commit ID changes in v5: Mostly smaller tweaks this time. The only major addition is to add a note about how unreachable objects may be deleted. From Junio's review: * Remove "type" in the description of what's in a tree (since I have learned that is not a separate field, it's part of the file mode) * Fix a typo ("these these") * Remove the intro sentence about what a "commit" is and instead only describe its contents in the list of fields, to avoid implying that a commit is the same as a tree * Say "Unix file modes" instead of "Unix permissions" * In the tag objects contents: make "ID" and "type" separate list items since they're separate fields * in the index section: * list all of the possible file modes (since from my understanding there are fewer allowed file modes here than in a tree) * mention that the object can be either a commit or blob * make the order match the order in git ls-files changes in v6: * Make punctuation more consistent (from Patrick's review) * Explain more about when exactly amended commits will get deleted (when their reflog entry expires), from Junio's review * Be more explicit that there are only 5 file modes in Git (from Junio's review) * Make tag 
object description clearer (from Junio's review) * We had a long discussion about the phrasing of "A branch refers to a commit ID" but I didn't come up with any ideas for how to improve the phrasing so I left it as is. Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v6 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1981/jvns/gitdatamodel-v6 Pull-Request: https://github.com/gitgitgadget/git/pull/1981 Range-diff vs v5: 1: d342255dad ! 1: 6e2a7bbe6b doc: add an explanation of Git's data model @@ Documentation/gitdatamodel.adoc (new) ++ +1. The full directory structure of all the files in that version of the + repository and each file's contents, stored as the *<<tree,tree>>* ID -+ of the commit's base directory. ++ of the commit's base directory +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored -+4. A *committer* and the time the commit was committed. ++4. A *committer* and the time the commit was committed +5. A *commit message* ++ +Here's how an example commit is stored: @@ Documentation/gitdatamodel.adoc (new) + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` -+2. The *file mode*. Git has these file modes. which are only -+ spiritually related to Unix file modes: ++2. The *file mode*. These are all of the file modes in Git. ++ They're only spiritually related to Unix file modes. ++ + - `100644`: regular file (with <<object,object type>> `blob`) + - `100755`: executable file (with type `blob`) @@ Documentation/gitdatamodel.adoc (new) + Tag objects contain these required fields + (though there are other optional fields): ++ -+1. The object *ID* it references -+2. The object *type* ++1. The *ID* of the object it references ++2. The *type* of the object it references +3. The *tagger* and tag date +4. 
A *tag message*, similar to a commit message + @@ Documentation/gitdatamodel.adoc (new) +References can either refer to: + +1. An object ID, usually a <<commit,commit>> ID -+2. Another reference. This is called a "symbolic reference". ++2. Another reference. This is called a "symbolic reference" + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. @@ Documentation/gitdatamodel.adoc (new) +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + -+NOTE: Git may delete objects that aren't "reachable" from any reference. ++NOTE: Git may delete objects that aren't "reachable" from any reference ++or <<reflogs,reflog>>. +An object is "reachable" if we can find it by following tags to whatever +they tag, commits to their parents or trees, and trees to the trees or +blobs that they contain. -+For example, if you amend a commit, with `git commit --amend`, ++For example, if you amend a commit with `git commit --amend`, ++there will no longer be a branch that points at the old commit. ++The old commit is recorded in the current branch's <<reflogs,reflog>>, ++so it is still "reachable", but when the reflog entry expires it may ++become unreachable and get deleted. ++ +the old commit will usually not be reachable, so it may be deleted eventually. +Reachable objects will never be deleted. 
+ Documentation/Makefile | 1 + Documentation/gitdatamodel.adoc | 302 ++++++++++++++++++++++++++++ Documentation/glossary-content.adoc | 4 +- Documentation/meson.build | 1 + 4 files changed, 306 insertions(+), 2 deletions(-) create mode 100644 Documentation/gitdatamodel.adoc diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fb83d0c6e..5f4acfacbd 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc MAN7_TXT += gitcore-tutorial.adoc MAN7_TXT += gitcredentials.adoc MAN7_TXT += gitcvs-migration.adoc +MAN7_TXT += gitdatamodel.adoc MAN7_TXT += gitdiffcore.adoc MAN7_TXT += giteveryday.adoc MAN7_TXT += gitfaq.adoc diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc new file mode 100644 index 0000000000..b54ff0e52b --- /dev/null +++ b/Documentation/gitdatamodel.adoc @@ -0,0 +1,302 @@ +gitdatamodel(7) +=============== + +NAME +---- +gitdatamodel - Git's core data model + +SYNOPSIS +-------- +gitdatamodel + +DESCRIPTION +----------- + +It's not necessary to understand Git's data model to use Git, but it's +very helpful when reading Git's documentation so that you know what it +means when the documentation says "object", "reference" or "index". + +Git's core operations use 4 kinds of data: + +1. <<objects,Objects>>: commits, trees, blobs, and tag objects +2. <<references,References>>: branches, tags, + remote-tracking branches, etc +3. <<index,The index>>, also known as the staging area +4. <<reflogs,Reflogs>>: logs of changes to references ("ref log") + +[[objects]] +OBJECTS +------- + +All of the commits and files in a Git repository are stored as "Git objects". +Git objects never change after they're created, and every object has an ID, +like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. + +This means that if you have an object's ID, you can always recover its +exact contents as long as the object hasn't been deleted. + +Every object has: + +[[object-id]] +1. 
an *ID* (aka "object name"), which is a cryptographic hash of its + type and contents. + It's fast to look up a Git object using its ID. + This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, + and <<tag-object,tag objects>>. +3. *contents*. The structure of the contents depends on the type. + +Here's how each type of object is structured: + +[[commit]] +commit:: + A commit contains these required fields + (though there are other optional fields): ++ +1. The full directory structure of all the files in that version of the + repository and each file's contents, stored as the *<<tree,tree>>* ID + of the commit's base directory +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored +4. A *committer* and the time the commit was committed +5. A *commit message* ++ +Here's how an example commit is stored: ++ +---- +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 +author Maya <maya@example.com> 1759173425 -0400 +committer Maya <maya@example.com> 1759173425 -0400 + +Add README +---- ++ +Like all other objects, commits can never be changed after they're created. +For example, "amending" a commit with `git commit --amend` creates a new +commit with the same parent. ++ +Git does not store the diff for a commit: when you ask Git to show +the commit with linkgit:git-show[1], it calculates the diff from its +parent on the fly. + +[[tree]] +tree:: + A tree is how Git represents a directory. + It can contain files or other trees (which are subdirectories). + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` +2. The *file mode*. These are all of the file modes in Git. + They're only spiritually related to Unix file modes. 
++ + - `100644`: regular file (with <<object,object type>> `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `040000`: directory (with type `tree`) + - `160000`: gitlink, for use with submodules (with type `commit`) + +3. The <<object-id,*object ID*>> with the contents of the file or directory ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: ++ +---- +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + +[[blob]] +blob:: + A blob object contains a file's contents. ++ +When you make a commit, Git stores the full contents of each file that +you changed as a blob. +For example, if you have a commit that changes 2 files in a repository +with 1000 files, that commit will create 2 new blobs, and use the +previous blob ID for the other 998 files. +This means that commits can use relatively little disk space even in a +very large repository. + +[[tag-object]] +tag object:: + Tag objects contain these required fields + (though there are other optional fields): ++ +1. The *ID* of the object it references +2. The *type* of the object it references +3. The *tagger* and tag date +4. A *tag message*, similar to a commit message + +Here's how an example tag object is stored: + +---- +object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 +type commit +tag v1.0.0 +tagger Maya <maya@example.com> 1759927359 -0400 + +Release version 1.0.0 +---- + +NOTE: All of the examples in this section were generated with +`git cat-file -p <object-id>`. + +[[references]] +REFERENCES +---------- + +References are a way to give a name to a commit. +It's easier to remember "the changes I'm working on are on the `turtle` +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + +References can either refer to: + +1. An object ID, usually a <<commit,commit>> ID +2. 
Another reference. This is called a "symbolic reference" + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. +Most references are under `refs/`. Here are the main types: + +[[branch]] +branches: `refs/heads/<name>`:: + A branch refers to a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] +tags: `refs/tags/<name>`:: + A tag refers to a commit ID, tag object ID, or other object ID. + There are two types of tags: + 1. "Annotated tags", which reference a <<tag-object,tag object>> ID + which contains a tag message + 2. "Lightweight tags", which reference a commit, blob, or tree ID + directly ++ +Even though branches and tags both refer to a commit ID, Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git +will update your <<HEAD,current branch>> to point to the new commit. +Tags are usually not changed after they're created. + +[[HEAD]] +HEAD: `HEAD`:: + `HEAD` is where Git stores your current <<branch,branch>>, + if there is a current branch. `HEAD` can either be: ++ +1. A symbolic reference to your current branch, for example `ref: + refs/heads/main` if your current branch is `main`. +2. A direct reference to a commit ID. In this case there is no current branch. + This is called "detached HEAD state", see the DETACHED HEAD section + of linkgit:git-checkout[1] for more. + +[[remote-tracking-branch]] +remote-tracking branches: `refs/remotes/<remote>/<branch>`:: + A remote-tracking branch refers to a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at + this. 
++ +`refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's +default branch. This is the branch that `git clone` checks out by default. + +[[other-refs]] +Other references:: + Git tools may create references anywhere under `refs/`. + For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references + in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + +NOTE: Git may delete objects that aren't "reachable" from any reference +or <<reflogs,reflog>>. +An object is "reachable" if we can find it by following tags to whatever +they tag, commits to their parents or trees, and trees to the trees or +blobs that they contain. +For example, if you amend a commit with `git commit --amend`, +there will no longer be a branch that points at the old commit. +The old commit is recorded in the current branch's <<reflogs,reflog>>, +so it is still "reachable", but when the reflog entry expires it may +become unreachable and get deleted. + +the old commit will usually not be reachable, so it may be deleted eventually. +Reachable objects will never be deleted. + +[[index]] +THE INDEX +--------- +The index, also known as the "staging area", is a list of files and +the contents of each file, stored as a <<blob,blob>>. +You can add files to the index or update the contents of a file in the +index with linkgit:git-add[1]. This is called "staging" the file for commit. + +Unlike a <<tree,tree>>, the index is a flat list of files. +When you commit, Git converts the list of files in the index to a +directory <<tree,tree>> and uses that tree in the new <<commit,commit>>. + +Each index entry has 4 fields: + +1. 
The *file mode*, which must be one of: + - `100644`: regular file (with <<object,object type>> `blob`) + - `100755`: executable file (with type `blob`) + - `120000`: symbolic link (with type `blob`) + - `160000`: gitlink, for use with submodules (with type `commit`) +2. The *<<blob,blob>>* ID of the file, + or (rarely) the *<<commit,commit>>* ID of the submodule +3. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if + there's a merge conflict there can be multiple versions of the same + filename in the index. +4. The *file path*, for example `src/hello.py` + +It's extremely uncommon to look at the index directly: normally you'd +run `git status` to see a list of changes between the index and <<HEAD,HEAD>>. +But you can use `git ls-files --stage` to see the index. +Here's the output of `git ls-files --stage` in a repository with 2 files: + +---- +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py +---- + +[[reflogs]] +REFLOGS +------- + +Every time a branch, remote-tracking branch, or HEAD is updated, Git +updates a log called a "reflog" for that <<references,reference>>. +This means that if you make a mistake and "lose" a commit, you can +generally recover the commit ID by running `git reflog <reference>`. + +A reflog is a list of log entries. Each entry has: + +1. The *commit ID* +2. *Timestamp* when the change was made +3. *Log message*, for example `pull: Fast-forward` + +Reflogs only log changes made in your local repository. +They are not shared with remotes. + +You can view a reflog with `git reflog <reference>`. 
+For example, here's the reflog for a `main` branch which has changed twice: + +---- +$ git reflog main --date=iso --no-decorate +750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README +4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit +---- + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc index e423e4765b..20ba121314 100644 --- a/Documentation/glossary-content.adoc +++ b/Documentation/glossary-content.adoc @@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a identified by its <<def_object_name,object name>>. The objects usually live in `$GIT_DIR/objects/`. -[[def_object_identifier]]object identifier (oid):: - Synonym for <<def_object_name,object name>>. +[[def_object_identifier]]object identifier, object ID, oid:: + Synonyms for <<def_object_name,object name>>. [[def_object_name]]object name:: The unique identifier of an <<def_object,object>>. The diff --git a/Documentation/meson.build b/Documentation/meson.build index e34965c5b0..ace0573e82 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -192,6 +192,7 @@ manpages = { 'gitcore-tutorial.adoc' : 7, 'gitcredentials.adoc' : 7, 'gitcvs-migration.adoc' : 7, + 'gitdatamodel.adoc' : 7, 'gitdiffcore.adoc' : 7, 'giteveryday.adoc' : 7, 'gitfaq.adoc' : 7, base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8 -- gitgitgadget ^ permalink raw reply related [flat|nested] 89+ messages in thread
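The reachability rule in the NOTE of the patch above (follow tags to what they tag, commits to their parents and trees, trees to their trees and blobs) can be sketched as a graph walk. This is only a toy model: the object store, the short IDs, and the `reachable` helper are all made up for illustration and say nothing about how Git implements garbage collection.

```python
# Toy model of the reachability rule. All IDs and the OBJECTS table are
# hypothetical; real Git objects are identified by hashes, not short names.

# Each object maps to the IDs it points at:
#   tag -> tagged object, commit -> tree + parent(s), tree -> trees/blobs.
OBJECTS = {
    "tag1": ["c2"],
    "c2": ["t2", "c1"],          # the amended commit
    "c2-old": ["t2-old", "c1"],  # the commit as it was before the amend
    "c1": ["t1"],
    "t2": ["b1", "b2"],
    "t2-old": ["b1", "b3"],
    "t1": ["b1"],
    "b1": [],
    "b2": [],
    "b3": [],
}


def reachable(starts):
    """Return every object ID that can be found by walking from `starts`."""
    seen = set()
    stack = list(starts)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(OBJECTS[oid])
    return seen


refs = {"refs/heads/main": "c2", "refs/tags/v1": "tag1"}
reflog_main = ["c2", "c2-old", "c1"]  # the old commit is still in the reflog

# While the reflog entry exists, everything is still reachable.
kept_with_reflog = reachable(list(refs.values()) + reflog_main)

# Once the reflog entry expires, only the references count.
kept_after_expiry = reachable(refs.values())
unreachable = sorted(set(OBJECTS) - kept_after_expiry)
print(unreachable)
```

In this toy graph the pre-amend commit, its tree, and the one blob only it used become candidates for deletion after the reflog entry expires; the blob shared with the new commit stays reachable.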
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-07 19:52 ` [PATCH v6] " Julia Evans via GitGitGadget
@ 2025-11-07 21:03 ` Junio C Hamano
2025-11-07 21:23 ` Junio C Hamano
2025-11-12 19:53 ` [PATCH v7] " Julia Evans via GitGitGadget
2 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-11-07 21:03 UTC (permalink / raw)
To: Julia Evans via GitGitGadget
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.

Not about the updated text (which I haven't carefully read yet), but
we'd need this squashed in to avoid xml that does not validate when
using AsciiDoc (not Asciidoctor) to format gitdatamodel.7
documentation.

    XMLTO gitdatamodel.7
    xmlto: /home/gitster/w/git.git/Documentation/gitdatamodel.xml does not validate (status 3)
    xmlto: Fix document syntax or use --skip-validation option
    Document /home/gitster/w/git.git/Documentation/gitdatamodel.xml does not validate

Perhaps I forgot to send this after queuing the previous round, even
though it was queued on top of the previous round in 'seen'. The
patch still applies cleanly to this version, and seems to fix the
breakage for me.

... goes and looks ...

Ah, no, I did not forget. The same patch is in the review thread of
the previous round:

https://lore.kernel.org/git/xmqqcy62213a.fsf@gitster.g/

 Documentation/gitdatamodel.adoc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc
index 1cefbb4833..eaab3f800b 100644
--- a/Documentation/gitdatamodel.adoc
+++ b/Documentation/gitdatamodel.adoc
@@ -18,13 +18,13 @@ means when the documentation says "object", "reference" or "index".

 Git's core operations use 4 kinds of data:

-1. <<objects,Objects>>: commits, trees, blobs, and tag objects
+1. <<object,Objects>>: commits, trees, blobs, and tag objects
 2. <<references,References>>: branches, tags,
    remote-tracking branches, etc
 3. <<index,The index>>, also known as the staging area
 4. <<reflogs,Reflogs>>: logs of changes to references ("ref log")

-[[objects]]
+[[object]]
 OBJECTS
 -------

--
2.52.0-rc1-455-g30608eb744
^ permalink raw reply related [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-07 19:52 ` [PATCH v6] " Julia Evans via GitGitGadget
2025-11-07 21:03 ` Junio C Hamano
@ 2025-11-07 21:23 ` Junio C Hamano
2025-11-07 21:40 ` Julia Evans
2025-11-12 19:53 ` [PATCH v7] " Julia Evans via GitGitGadget
2 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-11-07 21:23 UTC (permalink / raw)
To: Julia Evans via GitGitGadget
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> changes in v6:
>
> * Make punctuation more consistent (from Patrick's review)

Good.

> * Explain more about when exactly amended commits will get deleted
> (when their reflog entry expires), from Junio's review

Looked good.

> * Be more explicit that there are only 5 file modes in Git (from
> Junio's review)

I find "These are all of the file modes in Git" hard to read and
understand, and more importantly, does not imply that we won't be
adding any others strongly enough, than something like "Git uses
only the following modes to represent the objects it stores".

> * Make tag object description clearer (from Junio's review)

OK.

> * We had a long discussion about the phrasing of "A branch refers to a
> commit ID" but I didn't come up with any ideas for how to improve the
> phrasing so I left it as is.

I gave you something that is clearly an improvement there, though.
Just like a tag object records "the ID of the object it references",
a branch records "the ID of the commit it references".

Another thing we discussed and a better alternative offered during
the last round was "base directory", to which Patrick mentioned
"we rather consistently use 'root tree'"

cf. https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/

Other than a few minor points I pointed out above, and the broken
xml id/idref that does not validate, this round looks good to me.

Thanks.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-07 21:23 ` Junio C Hamano
@ 2025-11-07 21:40 ` Julia Evans
2025-11-07 23:07 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-11-07 21:40 UTC (permalink / raw)
To: Junio C Hamano, Julia Evans
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

On Fri, Nov 7, 2025, at 4:23 PM, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> changes in v6:
>>
>> * Make punctuation more consistent (from Patrick's review)
>
> Good.
>
>> * Explain more about when exactly amended commits will get deleted
>> (when their reflog entry expires), from Junio's review
>
> Looked good.
>
>> * Be more explicit that there are only 5 file modes in Git (from
>> Junio's review)
>
> I find "These are all of the file modes in Git" hard to read and
> understand, and more importantly, does not imply that we won't be
> adding any others strongly enough, than something like "Git uses
> only the following modes to represent the objects it stores".
>
>> * Make tag object description clearer (from Junio's review)

I wonder if it would help to de-emphasize the octal representation
of the file modes, and instead give them names since (from a
data model section Git's file modes are really more like an enum with
5 values than )

Something like this:

Git has 5 file modes:

- *regular file* (with <<object,object type>> `blob`)
- *executable file* (with type `blob`)
- *symbolic link* (with type `blob`)
- *directory* (with type `tree`)
- *gitlink*, for use with submodules (with type `commit`)

NOTE: Git normally displays file modes in the same format as Unix file modes
(100644, 100755, 120000, 040000, and 160000 respectively), but file modes are
only spiritually related to Unix file modes.

> OK.
>
>> * We had a long discussion about the phrasing of "A branch refers to a
>> commit ID" but I didn't come up with any ideas for how to improve the
>> phrasing so I left it as is.
>
> I gave you something that is clearly an improvement there, though.
> Just like a tag object records "the ID of the object it references",
> a branch records "the ID of the commit it references".

To me an "improvement" is something that helps the reader understand
Git's data model, and I do not understand in what way this rephrasing helps
the reader, or how you think the current phrasing might cause confusion for
the reader.

From my point of view "a branch refers to a commit ID" clearly means
the exact same thing as "a branch records the ID of the commit it references"
and "a branch records the ID of the commit it references" is just a less
clear and more indirect way to communicate that.

> Another thing we discussed and a better alternative offered during
> the last round was "base directory", to which Patrick mentioned
> "we rather consistently use 'root tree'"
>
> cf. https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/

I think it would be better to stick with "directory" here, because I've gotten
several reader comments saying that they do not understand the
term "tree" when it is used as a synonym for "directory".

Maybe "root directory"?

> Other than a few minor points I pointed out above, and the broken
> xml id/idref that does not validate, this round looks good to me.

Will fix the broken XML.
^ permalink raw reply [flat|nested] 89+ messages in thread
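The "enum with 5 values" framing proposed above could be sketched roughly like this. The `FileMode` enum, the `EXPECTED_TYPE` table, and the `parse_tree_entry` helper are hypothetical names invented for this illustration; the mode strings, object types, and example tree lines are the ones listed in the patch.

```python
from enum import Enum


class FileMode(Enum):
    """The five modes listed in the patch, treated as named values."""
    REGULAR = "100644"
    EXECUTABLE = "100755"
    SYMLINK = "120000"
    DIRECTORY = "040000"
    GITLINK = "160000"


# Object type that goes with each mode, as described in the patch.
EXPECTED_TYPE = {
    FileMode.REGULAR: "blob",
    FileMode.EXECUTABLE: "blob",
    FileMode.SYMLINK: "blob",
    FileMode.DIRECTORY: "tree",
    FileMode.GITLINK: "commit",
}


def parse_tree_entry(line):
    """Parse one '<mode> <type> <id> <name>' line of a printed tree."""
    mode_text, obj_type, oid, name = line.split(None, 3)
    mode = FileMode(mode_text)  # raises ValueError for any other mode
    if EXPECTED_TYPE[mode] != obj_type:
        raise ValueError(f"mode {mode_text} does not go with type {obj_type}")
    return mode, obj_type, oid, name


# The example tree from the patch.
EXAMPLE_TREE = """\
100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
"""

entries = [parse_tree_entry(line) for line in EXAMPLE_TREE.splitlines()]
for mode, obj_type, oid, name in entries:
    print(mode.name, obj_type, name)
```

Looking a mode up by value (`FileMode("100600")`) fails for anything outside the five listed values, which is the property the enum framing is meant to convey.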
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-07 21:40 ` Julia Evans
@ 2025-11-07 23:07 ` Junio C Hamano
2025-11-08 19:43 ` Junio C Hamano
2025-11-09 0:48 ` Ben Knoble
0 siblings, 2 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-11-07 23:07 UTC (permalink / raw)
To: Julia Evans
Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

> I wonder if it would help to de-emphasize the octal representation
> of the file modes, and instead give them names since (from a
> data model section Git's file modes are really more like an enum with
> 5 values than )
>
> Something like this:
>
> Git has 5 file modes:
>
> - *regular file* (with <<object,object type>> `blob`)
> - *executable file* (with type `blob`)
> - *symbolic link* (with type `blob`)
> - *directory* (with type `tree`)
> - *gitlink*, for use with submodules (with type `commit`)
>
> NOTE: Git normally displays file modes in the same format as Unix file modes
> (100644, 100755, 120000, 040000, and 160000 respectively), but file modes are
> only spiritually related to Unix file modes.

Then, I would suggest further deemphasize the "file modes" even
more.

 * Git stores/tracks 5 different file types, which are
   non-executable files, executable files, symbolic links,
   directories, and gitlinks.

 * Git uses one bitpattern each to mark these 5 different kinds
   of things in tree objects. These bitpatterns were loosely
   modelled after UNIX file mode bits.

The first half entirely avoids saying "mode" and that is very
deliberate.

> ... I do not understand in what way this rephrasing helps the
> reader, or how you think the current phrasing might cause confusion for the
> reader.

A branch (or any ref) does *not* *REFERENCE* an ID. They refer to
objects by *recording* an ID. The distinction is not clear with
your wording.

>> Another thing we discussed and a better alternative offered during
>> the last round was "base directory", to which Patrick mentioned
>> "we rather consistently use 'root tree'"
>>
>> cf. https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/
>
> I think it would be better to stick with "directory" here, because I've gotten
> several reader comments saying that they do not understand the
> term "tree" when it is used as a synonym for "directory".
>
> Maybe "root directory"?

I am OK with "root" but that is conditional; only if it is not used
together with the word "directory". We are not talking about "root
directory" where common directories like /usr, /etc, /dev and /tmp
hang immediately below. If we use the word "directory", I'd
strongly prefer to see it with adjective like "top-level" that
implies that it is something different from "root directory" but is
relative to the project in question.

Thanks.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model 2025-11-07 23:07 ` Junio C Hamano @ 2025-11-08 19:43 ` Junio C Hamano 2025-11-09 0:48 ` Ben Knoble 1 sibling, 0 replies; 89+ messages in thread From: Junio C Hamano @ 2025-11-08 19:43 UTC (permalink / raw) To: Julia Evans Cc: Julia Evans, git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt Junio C Hamano <gitster@pobox.com> writes: > "Julia Evans" <julia@jvns.ca> writes: > >> I wonder if it would help to de-emphasize the octal representation >> of the file modes, and instead give them names since (from a >> data model section Git's file modes are really more like an enum with >> 5 values than ) >> >> Something like this: >> >> Git has 5 file modes: >> >> - *regular file* (with <<object,object type>> `blob`) >> - *executable file* (with type `blob`) >> - *symbolic link* (with type `blob`) >> - *directory* (with type `tree`) >> - *gitlink*, for use with submodules (with type `commit`) >> >> NOTE: Git normally displays file modes in the same format as Unix file modes >> (100644, 100755, 120000, 040000, and 160000 respectively), but file modes are >> only spiritually related to Unix file modes. > > Then, I would suggest further deemphasize the "file modes" even > more. > > * Git stores/tracks 5 different file types, which are > non-executable files, executable files, symbolic links, > directories, and gitlinks. > > * Git uses one bitpattern each to mark these 5 different kinds > of things in tree objects. These bitpatterns were loosely > modelled after UNIX file mode bits. > > The first half entirely avoids saying "mode" and that is very > deliberate. > >>> Another thing we discussed and a better alternative offered during >>> the last round was "base directory", to which Patrick mentioned >>> "we rather consistently use 'root tree'" >>> >>> cf. 
https://lore.kernel.org/git/aQhcbHJjiI5GtV6Y@pks.im/ >> >> I think it would be better to stick with "directory" here, because I've gotten >> several reader comments saying that they do not understand the >> term "tree" when it is used as a synonym for "directory". >> >> Maybe "root directory"? > > I am OK with "root" but that is conditional; only if it is not used > together with the word "directory". We are not talking about "root > directory" where common directories like /usr, /etc, /dev and /tmp > hang immediately below. If we use the word "directory", I'd > strongly prefer to see it with adjective like "top-level" that > implies that it is something different from "root directory" but is > relative to the project in question. The above two points should probably be trivial to address. I've already squashed in the xml validation fixes to [v6], so let's finish the rest quickly. I have no more words to offer somebody, who says she does not know why saying "branch records ID of the commit it refers to" is an improvement over "branch refers to ID of the commit", when she already accepts that "The *ID* of the object it references" is a better way than "The object *ID* it references" to describe one of the fields in an annotated tag object. So I wouldn't mind if v7 still said "branch refers to commit id". We can update it with follow-up series as needed, and it is not worth blocking the rest of the document. Refs (including branches), refer to objects exactly the same way an annotated tag refers to another object, or a tree entry in a tree object refers to a blob, tree, or a commit object. Recording the hexadecimal hash is an implementation detail of the way how they reference the object, and the phrasing used for the tag field in an annotated tag reflects that by clearly distinguishing - recording the ID - referring to the object as two separate things. The former is merely a means to the end which is the latter, i.e. 
the purpose of refs, tree-entry in a tree, tag field in a tag object, and all other things that refer to an object by recording its ID. ^ permalink raw reply [flat|nested] 89+ messages in thread
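As a small illustration of the "enum with 5 values" framing quoted earlier in this message, the five entry kinds and the object type each one refers to could be sketched in Python like this. The class name, attribute names and lookup table are made up for the example; only the five mode strings and the object types come from the thread.

```python
from enum import Enum


class EntryKind(Enum):
    """Hypothetical enum: (mode as Git displays it, type of the object referred to)."""
    REGULAR_FILE = ("100644", "blob")
    EXECUTABLE_FILE = ("100755", "blob")
    SYMBOLIC_LINK = ("120000", "blob")
    DIRECTORY = ("040000", "tree")
    GITLINK = ("160000", "commit")

    def __init__(self, mode, object_type):
        # Unpack the tuple value into named attributes for readability.
        self.mode = mode
        self.object_type = object_type


# Lookup table from the displayed mode string to the entry kind.
BY_MODE = {kind.mode: kind for kind in EntryKind}

kind = BY_MODE["120000"]
print(kind.name, kind.object_type)
```

The point of the sketch is that the mode behaves as a tag selecting one of five cases, not as a set of permission bits that can be combined freely.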
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-07 23:07 ` Junio C Hamano
2025-11-08 19:43 ` Junio C Hamano
@ 2025-11-09 0:48 ` Ben Knoble
2025-11-09 4:59 ` Junio C Hamano
1 sibling, 1 reply; 89+ messages in thread
From: Ben Knoble @ 2025-11-09 0:48 UTC (permalink / raw)
To: Junio C Hamano
Cc: Julia Evans, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

> On Nov 7, 2025, at 18:08, Junio C Hamano <gitster@pobox.com> wrote:
>
> "Julia Evans" <julia@jvns.ca> writes:
>
>> ... I do not understand in what way this rephrasing helps the
>> reader, or how you think the current phrasing might cause confusion for the
>> reader.
>
> A branch (or any ref) does *not* *REFERENCE* an ID. They refer to
> objects by *recording* an ID. The distinction is not clear with
> your wording.

I concur with your later email that this is not worth delaying the
rest of the document for.

My only other opinion on the matter is: what does making this
distinction clear do to benefit readers of this document? I cannot
come up with one, and I suspect Julia cannot either.

Clearly you feel strongly about it, though, given the shouty caps and
“I have no more words” phrasing, which I find convey a tone that is…
less than welcoming.

Perhaps it’s simply time to move on? And someone motivated can
propose improvements.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-09 0:48 ` Ben Knoble
@ 2025-11-09 4:59 ` Junio C Hamano
2025-11-10 15:56 ` Julia Evans
0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-11-09 4:59 UTC (permalink / raw)
To: Ben Knoble
Cc: Julia Evans, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

Ben Knoble <ben.knoble@gmail.com> writes:

> My only other opinion on the matter is: what does making this
> distinction clear do to benefit readers of this document?

I care about teaching people not just _what_ but _why_, because with
vague distinction, many tend to memorize _what_ without
understanding the reasoning behind it. "Our object names are
computed as a hash of the contents in it formatted in a canonical
way" is "what we do to compute an object name", but the reason
behind the design is because we want to be able to dedup the same
thing cheaply, detect two objects that are different cheaply, which
is "why" in this example and it is equally, if not more, important.

The refs and objects record object names, and that is "what"; the
reason why they do so is to refer to these objects. If somebody
comes up with other ways to uniquely refer to these objects, their
implementation of git-compatible system does not have to make their
refs record object names---they can draw a line from a circle to a
rectangle instead of writing the object name of that rectangle in
the circle---and their system is still compatible with the Git data
model at the higher/conceptual level. IOW, what exactly is done at
the byte level (like file format) is the lower part of the "data
model", but what these byte level details want to achieve is the
other, higher half of the "data model". A data model documentation
should teach both levels.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-09 4:59 ` Junio C Hamano
@ 2025-11-10 15:56 ` Julia Evans
2025-11-11 10:13 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-11-10 15:56 UTC (permalink / raw)
To: Junio C Hamano, D. Ben Knoble
Cc: Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

On Sat, Nov 8, 2025, at 11:59 PM, Junio C Hamano wrote:
> Ben Knoble <ben.knoble@gmail.com> writes:
>
>> My only other opinion on the matter is: what does making this
>> distinction clear do to benefit readers of this document?
>
> I care about teaching people not just _what_ but _why_, because with
> vague distinction, many tend to memorize _what_ without
> understanding the reasoning behind it. "Our object names are
> computed as a hash of the contents in it formatted in a canonical
> way" is "what we do to compute an object name", but the reason
> behind the design is because we want to be able to dedup the same
> thing cheaply, detect two objects that are different cheaply, which
> is "why" in this example and it is equally, if not more, important.
>
> The refs and objects record object names, and that is "what"; the
> reason why they do so is to refer to these objects. If somebody
> comes up with other ways to uniquely refer to these objects, their
> implementation of git-compatible system does not have to make their
> refs record object names---they can draw a line from a circle to a
> rectangle instead of writing the object name of that rectangle in
> the circle---and their system is still compatible with the Git data
> model at the higher/conceptual level. IOW, what exactly is done at
> the byte level (like file format) is lower part of the "data model",
> but what these byte level details wants to achieve is the other,
> higher half of the "data model". A data model documentation should
> teach both levels.

Thanks, this is exactly what I was looking for when I asked in what
way this rephrasing helps the reader. I agree that explaining the
"why" is very important.

It sounds like there are 2 "whats" and "whys" here:

#1:

what: object IDs are hashes of the contents
why: this makes it very fast to avoid storing duplicate information,
and it's extremely fast to check if 2 objects are the same or not

I love the idea of explaining this. I think we could incorporate it
very easily by adding this paragraph in the "Objects" introduction,
right before "Here's how each type of object is structured":

    The reason the ID is a cryptographic hash is that it makes it
    extremely fast for Git to tell if 2 objects have the same contents
    or not (if they have the same ID, they have the same contents!),
    and it means Git will never store duplicate objects.

Will add that unless there are any objections.

#2:

what: The refs contain object IDs
why: to refer to the object

I think this is so obvious that going out of our way to explain it
risks confusing the reader. Spending too much time explaining
something obvious can make the reader feel like they're missing
something. I can't imagine any purpose for the refs containing object
IDs other than to refer to the object?

Like you noticed in the tag object section, I think saying that the tag
object "refers to an object" works well in that context, but in the context
of explaining what a branch is it makes the text more confusing.

^ permalink raw reply [flat|nested] 89+ messages in thread
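To make the "what" and "why" under #1 concrete, here is a minimal Python sketch. It assumes a repository using the SHA-1 object format and looks only at blobs; the helper name `blob_id` is made up for illustration. The header layout (`blob <size>` followed by a NUL byte, then the contents) is what `git hash-object` hashes for a blob.

```python
import hashlib


def blob_id(contents: bytes) -> str:
    """Return the ID a blob with these contents gets in a SHA-1 repository."""
    # Git hashes a small header plus the raw contents, not the contents alone.
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


# Same contents give the same ID, so a duplicate never needs to be stored
# twice; different contents give a different ID, so comparing two objects
# is just comparing two short strings.
first = blob_id(b"hello\n")
second = blob_id(b"hello\n")
different = blob_id(b"hello!\n")
print(first == second, first == different)
```

If Git is installed, the result of `blob_id(b"hello\n")` can be compared against `echo hello | git hash-object --stdin`.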
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-10 15:56 ` Julia Evans
@ 2025-11-11 10:13 ` Junio C Hamano
2025-11-11 13:07 ` Ben Knoble
2025-11-11 15:24 ` Julia Evans
0 siblings, 2 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-11-11 10:13 UTC (permalink / raw)
To: Julia Evans
Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

> Like you noticed in the tag object section, I think saying that the tag
> object "refers to an object" works well in that context, but in the context
> of explaining what a branch is it makes the text more confusing.

Sorry, but I do not understand your objection, as I cannot see what
confusion it would bring in in saying "a ref refers to an object"
(or "a branch refers to a commit object"). A ref refers to an
object, just like a tag field in a tag object or a tree-entry in a
tree object refer to another object. They do so by recording the
name of the object they refer to. So what's so confusing if we said
that straight?

Are you saying that the noun "reference" (or "ref") is a sufficient
clue to readers that their objective is to "refer to" something, so
"refers to" is a redundant thing to say?

Maybe it's just me, but I find it a quite roundabout thing to say
that a ref refers to an object name (or "ID" if you like), simply
because name or ID *is* a way to refer to the thing that is assigned
that name, so you are making a ref to refer to something ("name")
that refers to what it ("ref") originally wanted to refer to
("object").

That is what I find the most strange in the construction "A branch
refers to ID" at the conceptual level. I am much less unhappy with
"A branch records an ID", but stopping at that may make readers ask
the obvious question "what goal does that design aim to achieve?"
(whose answer is of course "to refer to the object that is assigned
that ID").

"A branch refers to a commit object by recording its object name",
"A branch records the ID of a commit it refers to", "A branch
records the ID of the commit at the tip of its history". Any of the
phrasing that does not make "ID" the object/target of the verb
"refer to" would work to avoid that strange construction.

By the way, Ben used a word "unwelcome", but the words that are more
appropriate to describe my reaction were "frustrated" (for not being
able to explain what I know to be true clearly to make others
understand) and "disappointed".

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-11 10:13 ` Junio C Hamano
@ 2025-11-11 13:07 ` Ben Knoble
2025-11-11 15:24 ` Julia Evans
1 sibling, 0 replies; 89+ messages in thread
From: Ben Knoble @ 2025-11-11 13:07 UTC (permalink / raw)
To: Junio C Hamano
Cc: Julia Evans, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

> On Nov 11, 2025, at 05:13, Junio C Hamano <gitster@pobox.com> wrote:
>
> By the way, Ben used a word "unwelcome", but the words that are more
> appropriate to describe my reaction were "frustrated" (for not being
> able to explain what I know to be true clearly to make others
> understand) and "disappointed".

While that isn’t precisely my phrasing, I certainly had a similar
impact. Either way, thanks for clarifying: my intent (not always the
same as impact; see prior) was to point out the way such things may be
perceived.

I deeply appreciate, by way of having been in similar situations, that
you were frustrated/disappointed with the conversation or
yourself—language is hard! I’m glad that frustration was not directed
at any contributor, and I am also hopeful that future contributors
will see a maintainer who cares about contributors and clear
communication and decide they should spend time on this project. That
is the source of my use of the word “welcome.” :)

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model 2025-11-11 10:13 ` Junio C Hamano 2025-11-11 13:07 ` Ben Knoble @ 2025-11-11 15:24 ` Julia Evans 2025-11-12 19:16 ` Junio C Hamano 1 sibling, 1 reply; 89+ messages in thread From: Julia Evans @ 2025-11-11 15:24 UTC (permalink / raw) To: Junio C Hamano Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt (this message got a bit long but the tl;dr is: maybe "a branch is a label for a commit ID" would work?) On Tue, Nov 11, 2025, at 5:13 AM, Junio C Hamano wrote: > "Julia Evans" <julia@jvns.ca> writes: > >> Like you noticed in the tag object section, I think saying that the tag >> object "refers to an object" works well in that context, but in the context >> of explaining what a branch is it makes the text more confusing. > > Sorry, but I do not understand your objection, as I cannot see what > confusion it would bring in in saying "a ref refers to an object" > (or "a branch refers to a commit object"). > A ref refers to an > object, just like a tag field in a tag object or a tree-entry in a > tree object refer to another object. They do so by recording the > name of the object they refer to. So what's so confusing if we said > that straight? My main strategy for figuring out if something is confusing or not is to talk to a few different users of the software and to ask them what they think. I have a pretty empirical approach to figuring out if an explanation is clear or not, if people think it's clear, then it's clear. (the question of "accuracy" is separate of course) From experience talking to people about references in Git I know that this particular thing is extremely easy to get wrong, I used to often try to explain branches by saying something like "A branch to a commit" and I would get kind of a blank stare, which is why I'm so cautious about the phrasing here. 
The reason I started with "a branch is a name for a commit ID" initially is that I've found that people respond well to that phrasing in the past, and I don't think it gives a misleading impression about what a branch is. But I thought your point that (in the context of this document) the term "name" could perhaps be confused with "object name" was reasonable, so I've been trying to come up with an alternative. It's always a little tricky to explain from first principles _why_ something is confusing, when I started working on explaining Git a couple of years ago I would have thought that many of your suggested phrasings would be an effective way to explain how Git branches work to people and I was very surprised to see how careful I had to be around the phrasing to get folks to understand how branches work. > Are you saying that the noun "reference" (or "ref") is a sufficient > clue to readers that their objective is to "refer to" something, so > "refers to" is a redundant thing to say? > > Maybe its just me, but I find it a quite roundabout thing to say > that a ref refers to an object name (or "ID" if you like), simply > because name or ID *is* a way to refer to the thing that is assigned > that name, so you are making a ref to refer to something ("name") > that refers to what it ("ref") originally wanted to refer to > ("object"). My thought process is sort of like this: I have two descriptions of "a Git branch" that people have responded well to in the past in practice: 1. a branch is a name for a commit ID (you said that the use of "name" could be confused with "object name", which I thought was fair) 2. a branch is a file that contains a commit ID (people often respond very well to how concrete this is, but it refers to Git's implementation which we're trying to avoid in this context) So I'm trying to find a different wording that's similar to one of these two phrasings that I know are effective, but that doesn't have those problems. 
Some of the options we've discussed are: - "a branch refers to a commit ID" (which as you've said has kind of a "type" issue since technically the branch refers to a commit, though when I've discussed it people they don't seem to think it's a problem in practice) - "a branch refers to a commit, using its ID" (we had a long discussion about how "using its ID" can lead the reader to think "wait, how else could you refer to a commit", which in the context of trying to learn what a branch is an unproductive distraction) - "a branch records a commit ID" (from my discussions I'm pretty sure the word "records" does not work, I think it's because introducing a new verb like "records" is always a bit dangerous) One idea I just had is "a branch is a label for a commit ID", which I think avoids the issue with "name" from earlier. > That is what I find the most strange in the construction "A branch > refers to ID" at the conceptual level. I am much less unhappy with > "A branch records an ID", but stopping at that may make readers ask > the obvious question "what goal does that design aim to achieve?" > (whose answer is of course "to refer to the object that is assigned > that ID"). > > "A branch refers to a commit object by recording its object name", > "A branch records the ID of a commit it refers to", "A branch > records the ID of the commit at the tip of its history". Any of the > phrasing that does not make "ID" the object/target of the verb > "refer to" would work to avoid that strange construction. > > By the way, Ben used a word "unwelcome", but the words that are more > appropriate to describe my reaction were "frustrated" (for not being > able to explain what I know to be true clearly to make others > understand) and "disappointed". ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-11 15:24 ` Julia Evans
@ 2025-11-12 19:16 ` Junio C Hamano
2025-11-12 22:49 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-11-12 19:16 UTC (permalink / raw)
To: Julia Evans
Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

>> Maybe its just me, but I find it a quite roundabout thing to say
>> that a ref refers to an object name (or "ID" if you like), simply
>> because name or ID *is* a way to refer to the thing that is assigned
>> that name, so you are making a ref to refer to something ("name")
>> that refers to what it ("ref") originally wanted to refer to
>> ("object").
> ...
> One idea I just had is "a branch is a label for a commit ID", which
> I think avoids the issue with "name" from earlier.
>
>> That is what I find the most strange in the construction "A branch
>> refers to ID" at the conceptual level. I am much less unhappy with
>> "A branch records an ID", but stopping at that may make readers ask
>> the obvious question "what goal does that design aim to achieve?"
>> (whose answer is of course "to refer to the object that is assigned
>> that ID").
>>
>> "A branch refers to a commit object by recording its object name",
>> "A branch records the ID of a commit it refers to", "A branch
>> records the ID of the commit at the tip of its history". Any of the
>> phrasing that does not make "ID" the object/target of the verb
>> "refer to" would work to avoid that strange construction.

Sorry, but I am having a hard time coming up with something that I
can give to help somebody who rejects "record", saying that it is a
new verb, and in the same message introduces "label" as a better
alternative, as we haven't seen "label" used in this context,
either.

Besides, a label, a name, or an ID are all things that are used to
refer to something (in this context, "a commit object"), so I find
the newly proposed one just as roundabout as "a branch refers to ID"
in the same way. The use of *ID* is a low-level implementation
detail to make the ref work as a label for, or make the ref refer
to, an object, so "is a label for ID" is just as bad as "refers to
ID".

If we do not hesitate using a new word and introduce "label", "a
branch works as a label for a commit object" may probably work,
probably.

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-12 19:16 ` Junio C Hamano
@ 2025-11-12 22:49 ` Junio C Hamano
2025-11-13 19:50 ` Julia Evans
0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-11-12 22:49 UTC (permalink / raw)
To: Julia Evans
Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

Junio C Hamano <gitster@pobox.com> writes:

> If we do not hesitate using a new word and introduce "label", "a
> branch works as a label for a commit object" may probably work,
> probably.

Another thing.

Do we want to limit the definition of "branch" very narrowly, i.e.,
"subset of refs whose refname begins with refs/heads/"?

Or do we want to give a description at a bit higher conceptual
level, something like:

    A branch is a mechanism to help you grow one line of history (in
    the sea/cloud of commits) by (1) keeping track of the commit it
    currently is at (by recording its ID in the ref used to implement
    the branch), (2) allowing you to easily record a new commit you
    create while you are on it as a child of the current commit (by
    allowing the symbolic ref "HEAD" to point at the ref used to
    implement the branch), (3) keeping the description of the theme of
    the particular line of history being developed there (by using
    "branch.<name>.description" configuration variable for the branch)
    which is incorporated when the branch gets merged to an
    integration branch, and (4) keeping track of how the branch has
    grown over time (in the reflog for the ref used to implement the
    branch).

We can limit ourselves to view a "branch" as a narrow subset of a
ref that can point at a single commit in the dag of commits, and it
can be updated at any time to point at another, different commit
that has no relation to the previous commit.

Once we stop limiting ourselves and explain the purpose of using a
"branch", "it can be updated to point at any random commit" stops
being entirely true. While the "git branch -f" command can be used
to do so, doing so all the time would go against what makes a branch
a branch, i.e. to keep track of the process of growing the history,
and it is expected that it would be a lot more common for the commit
pointed at by the branch ref to move by growing the history with
"git commit", refining the history with "git rebase", etc. But that
can only follow if readers understand the branch as more than "just
a ref whose name begins with refs/heads/".

I am not sure what level the data model description you are writing
should be at. The current description seems to concentrate too
narrowly on "a branch is a specialization of a ref" aspect, and
while it is not incorrect as a description of a building block of a
tool set to implement a workflow, it might be too limiting to form
a proper mental model. I dunno.

^ permalink raw reply [flat|nested] 89+ messages in thread
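Points (1) and (2) of the description in this message can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: the names `objects`, `refs`, `head` and `commit` are hypothetical and not Git APIs, every commit has at most one parent, and the ID is a SHA-1 over a made-up payload, not Git's real commit format.

```python
import hashlib

objects = {}               # commit ID -> {"parent": ID or None, "message": str}
refs = {}                  # full ref name -> commit ID recorded in that ref
head = "refs/heads/main"   # stand-in for the symbolic ref HEAD: it names a branch ref


def commit(message):
    """Create a commit as a child of the current tip, then advance the branch."""
    parent = refs.get(head)  # None when the branch has no commit yet
    payload = f"parent {parent}\n\n{message}".encode()
    commit_id = hashlib.sha1(payload).hexdigest()  # toy ID, not Git's format
    objects[commit_id] = {"parent": parent, "message": message}
    refs[head] = commit_id   # the branch ref now records the new commit's ID
    return commit_id


first = commit("initial")
second = commit("add feature")
print(refs[head] == second)
print(objects[second]["parent"] == first)
```

Nothing in this model stops `refs[head]` from being overwritten with an unrelated commit ID, which is the narrow "just a ref" view; the history-growing behaviour lives entirely in how `commit` updates it.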
* Re: [PATCH v6] doc: add an explanation of Git's data model 2025-11-12 22:49 ` Junio C Hamano @ 2025-11-13 19:50 ` Julia Evans 2025-11-13 20:07 ` Junio C Hamano 2025-11-13 20:18 ` Julia Evans 0 siblings, 2 replies; 89+ messages in thread From: Julia Evans @ 2025-11-13 19:50 UTC (permalink / raw) To: Junio C Hamano Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt On Wed, Nov 12, 2025, at 5:49 PM, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > >> If we do not hesitate using a new word and introduce "label", "a >> branch works as a label for a commit object" may probably work, >> probably. > > Another thing. > > Do we want to limit the definition of "branch" very narrowly, i.e., > "subset of refs whose refname begins with refs/heads/"? > > Or do we want to give a description at a bit higher conceptual > level, something like: > > A branch is a mechanism to help you grow one line of history (in > the sea/cloud of commits) by (1) keeping track of the commit it > currently is at (by recording its ID in the ref used to implement > the branch), (2) allowing you easily record a new commit you > create while you are on it as a child of the current commit (by > allowing the symbolic ref "HEAD" to point the ref used to > implement the branch), (3) keeping the description of the theme of > the particular line of history being developed there (by using > "branch.<name>.description" configuration variable for the branch) > which is incorporated when the branch gets merged to an > integration branch, and (4) keeping track of how the branch has > grown over time (in the reflog for the ref used to implement the > branch). > > We can limit ourselves to view a "branch" as a narrow subset of a > ref that can point at a single commit in the dag of commits, and it > can be updated at any time to point another different commit that > has no relation to the previous commit. 
I think talking too much about the intentions behind branches runs the risk of getting into a discussion from Git workflows which IMO is definitely out of scope for this document. For example "which is incorporated when the branch gets merged to an integration branch" is talking about a specific Git workflow. From my point of view as a Git user one of Git's biggest strengths is its flexibility; because branches _can_ be moved to point at a different commit at any time in various ways (via `git reset --hard`, `git rebase`, or `git commit --amend`), there's a lot of flexibility in how someone can choose to use Git, including never using branches at all. (the flexibility is also one of the things that makes Git hard of course :) ) So I'd prefer to keep editorializing about what a branch "means" to a minimum. Right now we have this, which tries to explain a very small amount about how branches are used that should apply to almost all Git workflows: "Even though branches and tags both refer to a commit ID, Git treats them very differently. Branches are expected to change over time: when you make a commit, Git will update your current branch to point to the new commit. " > Once we stop limiting ourselves and explain the purpose of using a > "branch", "it can be updated to point any random commit" stops being > entirely true. While the "git branch -f" command can be used to do > so, doing so all the time would go against what makes a branch a > branch, i.e. to keep track of the process of growing the history, > and it is expected that it would be a lot more common for the commit > pointed at by the branch ref to move by growing the history with > "git commit", refining the history with "git rebase", etc. But that > can only follow if readers understand the branch as more than "just > a ref whose name begins with refs/heads/". > > I am not sure what level the data model description you are writing > should be at. 
The current description seems to concentrate too > narrowly on "a branch is a specialization of a ref" aspect, and > while it is not incorrect as a description of a building block of a > tool set to implement a workflow, it might be too limiting to form > a proper mental model. I dunno. ^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-13 19:50 ` Julia Evans
@ 2025-11-13 20:07 ` Junio C Hamano
2025-11-13 20:18 ` Julia Evans
1 sibling, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-11-13 20:07 UTC (permalink / raw)
To: Julia Evans
Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

> From my point of view as a Git user one of Git's biggest strengths is its
> flexibility; because branches _can_ be moved to point at a different
> commit at any time in various ways (via `git reset --hard`, `git rebase`, or
> `git commit --amend`), there's a lot of flexibility in how someone can
> choose to use Git, including never using branches at all.
> (the flexibility is also one of the things that makes Git hard of course :) )
>
> So I'd prefer to keep editorializing about what a branch "means"
> to a minimum.

OK. If you go to such an extreme and make readers oblivious to what
a branch means, some of the sanity measures we have (e.g., a branch
ref will never point at anything but a commit object, not even a
commit-ish tag is allowed) would become "unnecessary nuisance" to
them.

In other words, as I hinted, it takes a delicate balancing act.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model 2025-11-13 19:50 ` Julia Evans 2025-11-13 20:07 ` Junio C Hamano @ 2025-11-13 20:18 ` Julia Evans 2025-11-13 20:34 ` Chris Torek 2025-11-13 23:11 ` Junio C Hamano 1 sibling, 2 replies; 89+ messages in thread From: Julia Evans @ 2025-11-13 20:18 UTC (permalink / raw) To: Junio C Hamano Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt On Thu, Nov 13, 2025, at 2:50 PM, Julia Evans wrote: > On Wed, Nov 12, 2025, at 5:49 PM, Junio C Hamano wrote: >> Junio C Hamano <gitster@pobox.com> writes: >> >>> If we do not hesitate using a new word and introduce "label", "a >>> branch works as a label for a commit object" may probably work, >>> probably. >> >> Another thing. >> >> Do we want to limit the definition of "branch" very narrowly, i.e., >> "subset of refs whose refname begins with refs/heads/"? >> >> Or do we want to give a description at a bit higher conceptual >> level, something like: >> >> A branch is a mechanism to help you grow one line of history (in >> the sea/cloud of commits) by (1) keeping track of the commit it >> currently is at (by recording its ID in the ref used to implement >> the branch), (2) allowing you easily record a new commit you >> create while you are on it as a child of the current commit (by >> allowing the symbolic ref "HEAD" to point the ref used to >> implement the branch), (3) keeping the description of the theme of >> the particular line of history being developed there (by using >> "branch.<name>.description" configuration variable for the branch) >> which is incorporated when the branch gets merged to an >> integration branch, and (4) keeping track of how the branch has >> grown over time (in the reflog for the ref used to implement the >> branch). 
>>
>> We can limit ourselves to view a "branch" as a narrow subset of a
>> ref that can point at a single commit in the dag of commits, and it
>> can be updated at any time to point another different commit that
>> has no relation to the previous commit.
>
> I think talking too much about the intentions behind branches runs
> the risk of getting into a discussion from Git workflows which IMO
> is definitely out of scope for this document. For example "which is
> incorporated when the branch gets merged to an integration branch" is
> talking about a specific Git workflow.
>
> From my point of view as a Git user one of Git's biggest strengths is its
> flexibility; because branches _can_ be moved to point at a different
> commit at any time in various ways (via `git reset --hard`, `git rebase`, or
> `git commit --amend`), there's a lot of flexibility in how someone can
> choose to use Git, including never using branches at all.
> (the flexibility is also one of the things that makes Git hard of course :) )
>
> So I'd prefer to keep editorializing about what a branch "means"
> to a minimum.

To immediately contradict myself a bit: after sending this I thought to
look through Mark Dominus's great blog posts about Git to see if
he has anything to say about this, and I came across this article:
https://blog.plover.com/prog/git/branches.html, called "I wish people
would stop insisting that Git branches are nothing but refs".

It reminded me that of course in Git the word "branch" often is used
to mean "a sequence of commits", for example if I make a branch called `topic`
and add 2 commits to it I might say that that "branch" is that sequence of
two commits.

I think the way Dominus talks about this is very interesting:

    The reason people say this, the disconnection is that the Git software
    doesn't have any formal representation of branches. Conceptually, the
    branch is there; the git commands just don't understand it. This is the
    most important mismatch between the conceptual model and what the Git
    software actually does.

To me the sticky point is that "the branch is these two commits" is an
important and useful concept in Git, but it doesn't really _exist_ in Git's
data model, because Git only stores a branch as a reference to a commit.

One way I've resolved this in the past is to say something like "you can
think about a branch in 3 different ways!"
https://wizardzines.com/comics/whats-a-branch/

The idea there is to talk about how a branch might be _conceptually_
"a line of development", but that Git doesn't have anything in its data
model to track what the "base" of the line of development is, so any
time you want Git to think of a branch as "these 2 commits" you need to
give it a way to determine the base.

> Right now we have this, which tries to explain a very small amount
> about how branches are used that should apply to almost
> all Git workflows:
>
> "Even though branches and tags both refer to a commit ID, Git treats
> them very differently. Branches are expected to change over time: when
> you make a commit, Git will update your current branch to point to the
> new commit. "
>
>> Once we stop limiting ourselves and explain the purpose of using a
>> "branch", "it can be updated to point any random commit" stops being
>> entirely true. While the "git branch -f" command can be used to do
>> so, doing so all the time would go against what makes a branch a
>> branch, i.e. to keep track of the process of growing the history,
>> and it is expected that it would be a lot more common for the commit
>> pointed at by the branch ref to move by growing the history with
>> "git commit", refining the history with "git rebase", etc. But that
>> can only follow if readers understand the branch as more than "just
>> a ref whose name begins with refs/heads/".
>>
>> I am not sure what level the data model description you are writing
>> should be at. The current description seems to concentrate too
>> narrowly on "a branch is a specialization of a ref" aspect, and
>> while it is not incorrect as a description of a building block of a
>> tool set to implement a workflow, it might be too limiting to form
>> a proper mental model. I dunno.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-13 20:18 ` Julia Evans
@ 2025-11-13 20:34 ` Chris Torek
2025-11-13 23:11 ` Junio C Hamano
1 sibling, 0 replies; 89+ messages in thread
From: Chris Torek @ 2025-11-13 20:34 UTC (permalink / raw)
To: Julia Evans
Cc: Junio C Hamano, D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

On Thu, Nov 13, 2025 at 12:19 PM Julia Evans <julia@jvns.ca> wrote:
> To immediately contradict myself a bit: after sending this I thought to
> look through Mark Dominus's great blog posts about Git to see if
> he has anything to say about this, and I came across this article:
> https://blog.plover.com/prog/git/branches.html, called "I wish people
> would stop insisting that Git branches are nothing but refs".
>
> It reminded me that of course in Git the word "branch" often is used
> to mean "a sequence of commits" ...

Yes, this is the crux of the issue: The word "branch" is ambiguous.
In Git, the *branch name* is the `refs/heads/whatever` name, and we
also have remote-tracking branch names under `refs/remotes/`. The
*branch*, however, is some ill-defined set of commits starting from
the specific commit identified by a branch name *or* any other unique
identifier, and then working backwards for some unspecified number of
steps with unspecified constraints.

Sometimes the bare term "branch" means one or another of these
various things, and sometimes it's meant to encompass all of them...

Chris

^ permalink raw reply [flat|nested] 89+ messages in thread
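A rough illustration of the distinction drawn in the message above, as a minimal Python sketch and not anything Git actually runs: the commit graph is modelled as a plain dict from commit name to parent list, and the names `commits`, `refs`, `history`, and `commits_since` are all hypothetical. It shows that a branch name only records a tip, and that "the commits on the branch" only becomes well defined once a base is supplied.

```python
# Toy model, not Git's implementation: each commit maps to its parent commits.
commits = {
    "A": [],
    "B": ["A"],
    "C": ["B"],    # tip of main
    "T1": ["B"],   # first commit made on topic
    "T2": ["T1"],  # tip of topic
}

# A branch name is just a reference to one commit (the tip).
refs = {
    "refs/heads/main": "C",
    "refs/heads/topic": "T2",
}


def history(tip):
    """Return every commit reachable from `tip` by following parents."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(commits[commit])
    return seen


def commits_since(base, tip):
    """Return commits reachable from `tip` but not from `base`."""
    return history(tip) - history(base)


main_tip = refs["refs/heads/main"]
topic_tip = refs["refs/heads/topic"]

# Walking back from the tip with no stopping point yields the whole history.
print(sorted(history(topic_tip)))
# Only once a base is chosen do we get "these 2 commits".
print(sorted(commits_since(main_tip, topic_tip)))
```

The second call plays roughly the role of a range such as `main..topic` passed to `git log`: the caller, not the stored data, decides where the branch "starts".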
* Re: [PATCH v6] doc: add an explanation of Git's data model
2025-11-13 20:18 ` Julia Evans
2025-11-13 20:34 ` Chris Torek
@ 2025-11-13 23:11 ` Junio C Hamano
1 sibling, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-11-13 23:11 UTC (permalink / raw)
To: Julia Evans
Cc: D. Ben Knoble, Julia Evans, git, Kristoffer Haugsbakk, Patrick Steinhardt

"Julia Evans" <julia@jvns.ca> writes:

>> So I'd prefer to keep editorializing about what a branch "means"
>> to a minimum.
>
> To immediately contradict myself a bit: after sending this I thought to
> look through Mark Dominus's great blog posts about Git to see if
> he has anything to say about this ...
> The idea there is to talk about how a branch might be _conceptually_
> "a line of development", but that Git doesn't have anything in its data
> model to track what the "base" of the line of development is, so any
> time you want Git to think of a branch as "these 2 commits" you need to
> give it a way to determine the base.

Yup, that is why you need to walk a fine line between what is hard
and mechanical "bits in the system" data model, and the conceptual
goal human users build using the bits as building blocks.

Aside from that "branch" description, the rest of the document has
been polished well enough that we are quickly approaching the point
of diminishing returns, I would think. Should we declare victory
and mark the topic for 'next' by now?

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* [PATCH v7] doc: add an explanation of Git's data model
2025-11-07 19:52 ` [PATCH v6] " Julia Evans via GitGitGadget
2025-11-07 21:03 ` Junio C Hamano
2025-11-07 21:23 ` Junio C Hamano
@ 2025-11-12 19:53 ` Julia Evans via GitGitGadget
2025-11-12 20:26 ` Junio C Hamano
2025-11-23 2:37 ` Junio C Hamano
2 siblings, 2 replies; 89+ messages in thread
From: Julia Evans via GitGitGadget @ 2025-11-12 19:53 UTC (permalink / raw)
To: git
Cc: Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans, Julia Evans

From: Julia Evans <julia@jvns.ca>

Git very often uses the terms "object", "reference", or "index" in its
documentation.

However, it's hard to find a clear explanation of these terms and how
they relate to each other in the documentation. The closest candidates
currently are:

1. `gitglossary`. This makes a good effort, but it's an alphabetically
   ordered dictionary and a dictionary is not a good way to learn
   concepts. You have to jump around too much and it's not possible to
   present the concepts in the order that they should be explained.
2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
   This is a nice document to have, but it's not necessary to learn how
   `update-index` works to understand Git's data model, and we should
   not be requiring users to learn how to use the "plumbing" commands
   if they want to learn what the term "index" or "object" means.
3. `gitrepository-layout`. This is a great resource, but it includes a
   lot of information about configuration and internal implementation
   details which are not related to the data model. It also does
   not explain how commits work.

The result of this is that Git users (even users who have been using
Git for 15+ years) struggle to read the documentation because they don't
know what the core terms mean, and it's not possible to add links
to help them learn more.

Add an explanation of Git's data model. Some choices I've made in
deciding what "core data model" means:

1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
   if those are intended to be user facing or if they're more like
   internal implementation details.
2. Don't talk about submodules other than by mentioning how they
   relate to trees. This is because Git has a lot of special features,
   and explaining how they all work exhaustively could quickly go
   down a rabbit hole which would make this document less useful for
   understanding Git's core behaviour.
3. Don't discuss the structure of a commit message
   (first line, trailers etc).
4. Don't mention configuration.
5. Don't mention the `.git` directory, to avoid getting too much into
   implementation details

Signed-off-by: Julia Evans <julia@jvns.ca>
---
doc: Add a explanation of Git's data model

Changes in v2:

The biggest change is to remove all mentions of the .git directory, and
explain references in a way that doesn't refer to "directories" at all,
and instead talks about the "hierarchy" (from Kristoffer and Patrick's
reviews).

Also:

* objects: Mention that an object ID is called an "object name", and
  update the glossary to include the term "object ID" (from Junio's
  review)
* objects: Replace "SHA-1 hash" with "cryptographic hash" which is more
  accurate (from Patrick's review)
* blobs: Made the explanation of git gc a little higher level and took
  some ideas from Patrick's suggested wording (from Patrick's and
  Kristoffer's reviews)
* commits: Mention that tag objects and commits can optionally have
  other fields. I didn't mention the GPG signature specifically, but
  don't have any objections to adding it. (from Patrick and Junio's
  reviews)
* commits: Remove one of the mentions of git gc, since it perhaps opens
  up too much of a rabbit hole: "how does git gc decide which commits
  to clean up?". (from Kristoffer's review)
* tag objects: Add an example of how a tag object is represented (from
  user feedback on the draft)
* index: Use the term "file mode" instead of "permissions", and list all
  allowed file modes (from Patrick's review)
* index: Use "stage number" instead of "number" for index entries (from
  Patrick's review)
* reflogs: Remove "any ref can be logged", it raises some questions of
  "how do you tell Git to log a ref that it isn't normally logging?"
  and my guess is that it's uncommon to ask Git to log more refs. I
  don't think it's a "lie" to omit this but I can bring it back if
  folks disagree. (from Patrick's review)
* reflogs: Fix an error I noticed in the explanation of reflogs: tags
  aren't logged by default and remote-tracking branches are, according
  to man git-config
* branches and tags: Be clearer about how branches are usually updated
  (by committing), and make it a little more obvious that only branches
  can be checked out. This is a bit tricky because using the word
  "check out" introduces a rabbit hole that I want to avoid (what does
  "check out" mean?). I've dealt with this by just talking about the
  "current branch" (HEAD) since that is defined here, and making it
  more explicit that HEAD must either be a branch or a commit, there's
  no "HEAD is a tag" option. (from Patrick's review)
* tags: Explain the differences between annotated and lightweight tags
  (this is the main piece of user feedback I've gotten on the draft so
  far)
* Various style/typo changes ("2 or more", linkgit:git-gc[1], removed
  extra asterisks, added empty SYNOPSIS, "commits -> tags" typo fix, add
  to meson build)

non-changes:

* I still haven't mentioned things that aren't part of the "data
  model", like revision params and configuration. I think there could
  be a place for them but I haven't found it yet.
* tag objects: I noticed that there's a "tag" header field in tag objects (like tag v1.0.0) but I didn't mention it yet because I couldn't figure out what the purpose of that field is (I thought the tag name was stored in the reference, why is it duplicated in the tag object?) Changes in v3: I asked for feedback from Git users on Mastodon and got 220 pieces of feedback from 48 different users. People seemed very excited to read about Git's data model. Usually I judge explanations by what folks report learning from them. Here people reported learning: * how branches are stored (that a branch is "a name for a commit") * how objects work * that Git has separate "author" and "committer" fields * that amending a commit does not change it * that a tree is "just a directory" (not something more complicated), and how trees are stored * that Git repos can contain symlinks * that Git saves modes separately from the OS. * how the stage number works * that when you git add a file, Git will create an object * that third-party tools can create their own refs. * that the reflog stores the history of branches (not just HEAD), and what reflogs are for Also (of course) there were quite a few points of confusion! The main 4 pieces of feedback were 1. The index section doesn't explain what the word "staged" means, and one person says that it makes it sounds like only files that you "git add"ed are in the index. Rewrite the explanation to avoid using the word "staged" to define the index and instead define the word "staging". 2. Explain the difference between "annotated tags" and "lightweight tags" (done) 3. Add examples for tag objects and reflogs (done) 4. Mention a little more about where things are stored in the .git directory, which I'd removed in v2. This seems most important for .git/refs, so I added a hopefully accurate note about how refs are stored by default, with a comment about one of the major implications. 
I did not discuss where objects or the index are stored, because I don't think the implementation details of how objects are stored are as important, and there are better tools for viewing the "raw" state of objects and the index (with git cat-file -p or git ls-files --staged). Here's every other change I made in response to the feedback, as well as a few comments that I did not address. intro: * Give a 1-sentence intro to "reflog" objects: * people really like having git ls-files --stage as a way to view the index, so add git cat-file -p as well in a note commits: * 2 people asked "Are commits stored as a diff?". Say that diffs are calculated at runtime, this is very important. * The order the fields are given in don't match the order in the example. Make them match. * "All the files in the commit, stored as a tree" is throwing a few people off. Be clearer that it's the tree ID of the base directory. * Several people asked "What's the difference between an author and committer? I added an example using git cherry-pick that I'm not 100% happy with (what if the reader doesn't know what cherry-pick does?). There might be a better example to give here. * In the note about commits being amended: one person suggested saying "creates a new commit with the same parent" to make it clearer what the relationship between the new and old commit are. I liked that idea so I did it. trees: * file modes. 2 people want to know more about "The file mode, for example 100644". Also 2 people are curious about what relationship these have to Unix permissions. Say that they're inspired by Unix permissions, and move the list of possible file modes up to make the relationship clearer * On "so git-gc(1) periodically compresses objects to save disk space", there are a few follow up comments wondering about more, which makes me think the comment about compression is actually a distraction. 
Say something simpler instead, ("Git only needs to store new versions of files which were changed in that commit"), from Junio's suggestion * Re "commit (a Git submodule)": 2 people say it's not clear how trees relate to submodules. Say that it refers to a commit in a different repository. * One person says they're not sure if the "object ID" is a hash. Link it to the definition of "object ID". tag objects: * Requests for an example, added one. * Requests to explain the difference between "lightweight" and "annotated" tags, added it. tags: * one person thinks "It’s expected that a tag will never change after you create it." is too strong (since of course you can change it with git tag -f). Say instead that tags are "usually" not changed. HEAD: * Several people are asking for more detail about detached HEAD state. There's actually quite a lot to talk about here (what it means, how it happens, what it implies, and how you might adjust your workflow to avoid it by using git switch). I don't think we can get into all of that here, so refer to the DETACHED HEAD section of git-checkout instead. I'm not totally happy with the current version of that section but that seems like the most practical solution right now. remote-tracking branches: * discuss refs/remotes/<remote>/HEAD. the index: * "permissions" should be "file mode" (like with trees). Changed. * "filename" should be "file path". Changed. * the stage number can only be 0, 1, 2, or 3, since it's 2 bits. Also maybe say that the numbers have specific meanings. Said it can only be 0/1/2/3 but did not give the specific meanings. reflogs * Request for an example. Added one. * It's not clear if there's one reflog per branch/tag/HEAD, or if there's one universal reflog. Make this clearer. * Mention the role of the reflog in retrieving "lost" commits or undoing bad rebases. Not fixed: * intro: A couple of people say that it's confusing that tags are both "an object" and "a reference". 
Handled this by just explaining the difference between an annotated and a lightweight tag further down. I'd like to make this clearer in the intro but not sure if there's a way to do it. * commits and tag objects: one person asks if there's a reference for the other "optional fields", like "encoding" and "gpgsig". I couldn't find one, so left this as is. * HEAD: A couple of people ask if there are any other symbolic references other than HEAD, or if they can make their own symbolic references. I don't know the answer to this. * HEAD: the HEAD: HEAD thing looks weird, it made more sense when it was HEAD: .git/HEAD. Will think about this. * reflogs: One person asks: if reflogs only store local changes, why does it track the user who made the change? Is that for remote operations like fetches and pulls? Or for cases where more than one user is using the same repo on a system? I don't know the answer to this. * reflogs: How can you see the full data in the reflog? git reflog show doesn't list the user who made the change. git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso seems to work but it's really a mouthful, not sure it's useful to include all that. * index: Is it worth mentioning that the index can be locked? I don't have an opinion about this. * other: One person asks what a "working tree" is. It made me wonder if "the current working directory" has a place in Git's data model. My feeling is "no" but I could be convinced otherwise. * overall: "How can Git be so fast? If I switch branches, how does it figure out what to add, remove or replace?". 
I don't think this is the right place for that discussion but it would * there are some docs CI errors I haven't figured out yet (IDREF attribute linkend references an unknown ID "tree") changes in v4: This is a combination of trying to make some of the intro text a little more "friendly" for someone new to Git's data model, avoiding implying things that are false, and removing information that isn't relevant to the data model. intro: * Add a 1-line description of what a "reflog" is (from user feedback) objects: * Start with a "friendly" description of what an object is, similar to what we do for references and the reflog * Rename "commits" to "commit" and similarly for trees etc (from Junio's review) * Remove the explanation of what git cat-file -p does, since it might be misleading and if people want to know they can read the man page (from Junio's review) commits: * Start by saying that the commit contains the full directory structure of all the files (from Junio's comment about how it may not be clear that the commit contains all the files' exact contents at the time of the commit) * Remove the comment about cherry-pick (from Junio's review) * Replace "ask Git for a diff" with "ask Git to show the commit with git show" (from Junio's review) trees: * Make the description a little more friendly * Reorder so that "type" is defined before we refer to the "type" * Say that file modes are "only spiritually related" to Unix permissions instead of talking about what Git "supports" (from Junio's review) blobs: * Try to make it clearer how "commits use relatively little disk space" is true while not implying that commits are diffs, by using an example (from Junio's review) branches: * Replace "a branch is a name for a commit ID" with "a branch refers to a commit ID" (except in the intro sentence for the "references" section). Similarly for tags etc. 
(from Junio's review) * Remove the note about how branches are stored in .git (from Junio's review) HEAD: * Be clearer that HEAD is not always the current branch, because there may not be a current branch (from Junio's review) index: * Be a little more specific about how exactly the index is converted into a commit. (from Junio's comment about how it's not clear what "every file in the repository" means) reflog: * Be clearer that there are many reflogs (one for each reference with a log), not just one reflog (from Junio and Patrick's reviews) * Omit the user and "Before" commit IDs from the list of fields, because you usually don't see them (from Junio's review) * Show the output of git reflog main in the example instead of the contents of the reflog file, to avoid showing the user and before commit ID changes in v5: Mostly smaller tweaks this time. The only major addition is to add a note about how unreachable objects may be deleted. From Junio's review: * Remove "type" in the description of what's in a tree (since I have learned that is not a separate field, it's part of the file mode) * Fix a typo ("these these") * Remove the intro sentence about what a "commit" is and instead only describe its contents in the list of fields, to avoid implying that a commit is the same as a tree * Say "Unix file modes" instead of "Unix permissions" * In the tag objects contents: make "ID" and "type" separate list items since they're separate fields * in the index section: * list all of the possible file modes (since from my understanding there are fewer allowed file modes here than in a tree) * mention that the object can be either a commit or blob * make the order match the order in git ls-files changes in v6: * Make punctuation more consistent (from Patrick's review) * Explain more about when exactly amended commits will get deleted (when their reflog entry expires), from Junio's review * Be more explicit that there are only 5 file modes in Git (from Junio's review) * Make tag 
object description clearer (from Junio's review) * We had a long discussion about the phrasing of "A branch refers to a commit ID" but I didn't come up with any ideas for how to improve the phrasing so I left it as is. changes in v7: * Replace "file mode" with "file type", to make it more obvious that Git does not support general Unix file modes. Remove a broken XML link as a side effect. * Use "top-level directory" instead of "base directory" * Like last time, I still don't have any better ideas for "A branch refers to a commit ID" Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v7 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1981/jvns/gitdatamodel-v7 Pull-Request: https://github.com/gitgitgadget/git/pull/1981 Range-diff vs v6: 1: 6e2a7bbe6b ! 1: 22a1b32017 doc: add an explanation of Git's data model @@ Documentation/gitdatamodel.adoc (new) ++ +1. The full directory structure of all the files in that version of the + repository and each file's contents, stored as the *<<tree,tree>>* ID -+ of the commit's base directory ++ of the commit's top-level directory +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored @@ Documentation/gitdatamodel.adoc (new) + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` -+2. The *file mode*. These are all of the file modes in Git. -+ They're only spiritually related to Unix file modes. -++ -+ - `100644`: regular file (with <<object,object type>> `blob`) -+ - `100755`: executable file (with type `blob`) -+ - `120000`: symbolic link (with type `blob`) -+ - `040000`: directory (with type `tree`) -+ - `160000`: gitlink, for use with submodules (with type `commit`) -+ -+3. The <<object-id,*object ID*>> with the contents of the file or directory ++2. 
The *file type*, which must be one of these five types: ++ - *regular file* ++ - *executable file* ++ - *symbolic link* ++ - *directory* ++ - *gitlink* (for use with submodules) ++3. The <<object-id,*object ID*>> with the contents of the file, directory, ++ or gitlink. ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: @@ Documentation/gitdatamodel.adoc (new) +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + ++NOTE: In the output above, Git displays the file type of each tree entry ++using a format that's loosely modelled on Unix file modes (`100644` is ++"regular file", `100755` is "executable file", `120000` is "symbolic ++link", `040000` is "directory", and `160000` is "gitlink"). It also ++displays the object's type: `blob` for files and symlinks, `tree` for ++directories, and `commit` for gitlinks. ++ +[[blob]] +blob:: + A blob object contains a file's contents. @@ Documentation/gitdatamodel.adoc (new) + +Each index entry has 4 fields: + -+1. The *file mode*, which must be one of: -+ - `100644`: regular file (with <<object,object type>> `blob`) -+ - `100755`: executable file (with type `blob`) -+ - `120000`: symbolic link (with type `blob`) -+ - `160000`: gitlink, for use with submodules (with type `commit`) ++1. The *file type*, which must be one of: ++ - *regular file* ++ - *executable file* ++ - *symbolic link* ++ - *gitlink* (for use with submodules) +2. The *<<blob,blob>>* ID of the file, + or (rarely) the *<<commit,commit>>* ID of the submodule +3. The *stage number*, either 0, 1, 2, or 3. 
This is normally 0, but if Documentation/Makefile | 1 + Documentation/gitdatamodel.adoc | 307 ++++++++++++++++++++++++++++ Documentation/glossary-content.adoc | 4 +- Documentation/meson.build | 1 + 4 files changed, 311 insertions(+), 2 deletions(-) create mode 100644 Documentation/gitdatamodel.adoc diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fb83d0c6e..5f4acfacbd 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc MAN7_TXT += gitcore-tutorial.adoc MAN7_TXT += gitcredentials.adoc MAN7_TXT += gitcvs-migration.adoc +MAN7_TXT += gitdatamodel.adoc MAN7_TXT += gitdiffcore.adoc MAN7_TXT += giteveryday.adoc MAN7_TXT += gitfaq.adoc diff --git a/Documentation/gitdatamodel.adoc b/Documentation/gitdatamodel.adoc new file mode 100644 index 0000000000..3614f5960e --- /dev/null +++ b/Documentation/gitdatamodel.adoc @@ -0,0 +1,307 @@ +gitdatamodel(7) +=============== + +NAME +---- +gitdatamodel - Git's core data model + +SYNOPSIS +-------- +gitdatamodel + +DESCRIPTION +----------- + +It's not necessary to understand Git's data model to use Git, but it's +very helpful when reading Git's documentation so that you know what it +means when the documentation says "object", "reference" or "index". + +Git's core operations use 4 kinds of data: + +1. <<objects,Objects>>: commits, trees, blobs, and tag objects +2. <<references,References>>: branches, tags, + remote-tracking branches, etc +3. <<index,The index>>, also known as the staging area +4. <<reflogs,Reflogs>>: logs of changes to references ("ref log") + +[[objects]] +OBJECTS +------- + +All of the commits and files in a Git repository are stored as "Git objects". +Git objects never change after they're created, and every object has an ID, +like `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. + +This means that if you have an object's ID, you can always recover its +exact contents as long as the object hasn't been deleted. 
+ +Every object has: + +[[object-id]] +1. an *ID* (aka "object name"), which is a cryptographic hash of its + type and contents. + It's fast to look up a Git object using its ID. + This is usually represented in hexadecimal, like + `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`. +2. a *type*. There are 4 types of objects: + <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>, + and <<tag-object,tag objects>>. +3. *contents*. The structure of the contents depends on the type. + +Here's how each type of object is structured: + +[[commit]] +commit:: + A commit contains these required fields + (though there are other optional fields): ++ +1. The full directory structure of all the files in that version of the + repository and each file's contents, stored as the *<<tree,tree>>* ID + of the commit's top-level directory +2. Its *parent commit ID(s)*. The first commit in a repository has 0 parents, + regular commits have 1 parent, merge commits have 2 or more parents +3. An *author* and the time the commit was authored +4. A *committer* and the time the commit was committed +5. A *commit message* ++ +Here's how an example commit is stored: ++ +---- +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647 +author Maya <maya@example.com> 1759173425 -0400 +committer Maya <maya@example.com> 1759173425 -0400 + +Add README +---- ++ +Like all other objects, commits can never be changed after they're created. +For example, "amending" a commit with `git commit --amend` creates a new +commit with the same parent. ++ +Git does not store the diff for a commit: when you ask Git to show +the commit with linkgit:git-show[1], it calculates the diff from its +parent on the fly. + +[[tree]] +tree:: + A tree is how Git represents a directory. + It can contain files or other trees (which are subdirectories). + It lists, for each item in the tree: ++ +1. The *filename*, for example `hello.py` +2. 
The *file type*, which must be one of these five types: + - *regular file* + - *executable file* + - *symbolic link* + - *directory* + - *gitlink* (for use with submodules) +3. The <<object-id,*object ID*>> with the contents of the file, directory, + or gitlink. ++ +For example, this is how a tree containing one directory (`src`) and one file +(`README.md`) is stored: ++ +---- +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src +---- + +NOTE: In the output above, Git displays the file type of each tree entry +using a format that's loosely modelled on Unix file modes (`100644` is +"regular file", `100755` is "executable file", `120000` is "symbolic +link", `040000` is "directory", and `160000` is "gitlink"). It also +displays the object's type: `blob` for files and symlinks, `tree` for +directories, and `commit` for gitlinks. + +[[blob]] +blob:: + A blob object contains a file's contents. ++ +When you make a commit, Git stores the full contents of each file that +you changed as a blob. +For example, if you have a commit that changes 2 files in a repository +with 1000 files, that commit will create 2 new blobs, and use the +previous blob ID for the other 998 files. +This means that commits can use relatively little disk space even in a +very large repository. + +[[tag-object]] +tag object:: + Tag objects contain these required fields + (though there are other optional fields): ++ +1. The *ID* of the object it references +2. The *type* of the object it references +3. The *tagger* and tag date +4. A *tag message*, similar to a commit message + +Here's how an example tag object is stored: + +---- +object 750b4ead9c87ceb3ddb7a390e6c7074521797fb3 +type commit +tag v1.0.0 +tagger Maya <maya@example.com> 1759927359 -0400 + +Release version 1.0.0 +---- + +NOTE: All of the examples in this section were generated with +`git cat-file -p <object-id>`. 
+ +[[references]] +REFERENCES +---------- + +References are a way to give a name to a commit. +It's easier to remember "the changes I'm working on are on the `turtle` +branch" than "the changes are in commit bb69721404348e". +Git often uses "ref" as shorthand for "reference". + +References can either refer to: + +1. An object ID, usually a <<commit,commit>> ID +2. Another reference. This is called a "symbolic reference" + +References are stored in a hierarchy, and Git handles references +differently based on where they are in the hierarchy. +Most references are under `refs/`. Here are the main types: + +[[branch]] +branches: `refs/heads/<name>`:: + A branch refers to a commit ID. + That commit is the latest commit on the branch. ++ +To get the history of commits on a branch, Git will start at the commit +ID the branch references, and then look at the commit's parent(s), +the parent's parent, etc. + +[[tag]] +tags: `refs/tags/<name>`:: + A tag refers to a commit ID, tag object ID, or other object ID. + There are two types of tags: + 1. "Annotated tags", which reference a <<tag-object,tag object>> ID + which contains a tag message + 2. "Lightweight tags", which reference a commit, blob, or tree ID + directly ++ +Even though branches and tags both refer to a commit ID, Git +treats them very differently. +Branches are expected to change over time: when you make a commit, Git +will update your <<HEAD,current branch>> to point to the new commit. +Tags are usually not changed after they're created. + +[[HEAD]] +HEAD: `HEAD`:: + `HEAD` is where Git stores your current <<branch,branch>>, + if there is a current branch. `HEAD` can either be: ++ +1. A symbolic reference to your current branch, for example `ref: + refs/heads/main` if your current branch is `main`. +2. A direct reference to a commit ID. In this case there is no current branch. + This is called "detached HEAD state", see the DETACHED HEAD section + of linkgit:git-checkout[1] for more. 
+ +[[remote-tracking-branch]] +remote-tracking branches: `refs/remotes/<remote>/<branch>`:: + A remote-tracking branch refers to a commit ID. + It's how Git stores the last-known state of a branch in a remote + repository. `git fetch` updates remote-tracking branches. When + `git status` says "you're up to date with origin/main", it's looking at + this. ++ +`refs/remotes/<remote>/HEAD` is a symbolic reference to the remote's +default branch. This is the branch that `git clone` checks out by default. + +[[other-refs]] +Other references:: + Git tools may create references anywhere under `refs/`. + For example, linkgit:git-stash[1], linkgit:git-bisect[1], + and linkgit:git-notes[1] all create their own references + in `refs/stash`, `refs/bisect`, etc. + Third-party Git tools may also create their own references. ++ +Git may also create references other than `HEAD` at the base of the +hierarchy, like `ORIG_HEAD`. + +NOTE: Git may delete objects that aren't "reachable" from any reference +or <<reflogs,reflog>>. +An object is "reachable" if we can find it by following tags to whatever +they tag, commits to their parents or trees, and trees to the trees or +blobs that they contain. +For example, if you amend a commit with `git commit --amend`, +there will no longer be a branch that points at the old commit. +The old commit is recorded in the current branch's <<reflogs,reflog>>, +so it is still "reachable", but when the reflog entry expires it may +become unreachable and get deleted. + +the old commit will usually not be reachable, so it may be deleted eventually. +Reachable objects will never be deleted. + +[[index]] +THE INDEX +--------- +The index, also known as the "staging area", is a list of files and +the contents of each file, stored as a <<blob,blob>>. +You can add files to the index or update the contents of a file in the +index with linkgit:git-add[1]. This is called "staging" the file for commit. + +Unlike a <<tree,tree>>, the index is a flat list of files. 
+When you commit, Git converts the list of files in the index to a
+directory <<tree,tree>> and uses that tree in the new <<commit,commit>>.
+
+Each index entry has 4 fields:
+
+1. The *file type*, which must be one of:
+    - *regular file*
+    - *executable file*
+    - *symbolic link*
+    - *gitlink* (for use with submodules)
+2. The *<<blob,blob>>* ID of the file,
+   or (rarely) the *<<commit,commit>>* ID of the submodule
+3. The *stage number*, either 0, 1, 2, or 3. This is normally 0, but if
+   there's a merge conflict there can be multiple versions of the same
+   filename in the index.
+4. The *file path*, for example `src/hello.py`
+
+It's extremely uncommon to look at the index directly: normally you'd
+run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
+But you can use `git ls-files --stage` to see the index.
+Here's the output of `git ls-files --stage` in a repository with 2 files:
+
+----
+100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
+100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
+----
+
+[[reflogs]]
+REFLOGS
+-------
+
+Every time a branch, remote-tracking branch, or HEAD is updated, Git
+updates a log called a "reflog" for that <<references,reference>>.
+This means that if you make a mistake and "lose" a commit, you can
+generally recover the commit ID by running `git reflog <reference>`.
+
+A reflog is a list of log entries. Each entry has:
+
+1. The *commit ID*
+2. *Timestamp* when the change was made
+3. *Log message*, for example `pull: Fast-forward`
+
+Reflogs only log changes made in your local repository.
+They are not shared with remotes.
+
+You can view a reflog with `git reflog <reference>`.
+For example, here's the reflog for a `main` branch which has changed twice:
+
+----
+$ git reflog main --date=iso --no-decorate
+750b4ea main@{2025-09-29 15:17:05 -0400}: commit: Add README
+4ccb6d7 main@{2025-09-29 15:16:48 -0400}: commit (initial): Initial commit
+----
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/glossary-content.adoc b/Documentation/glossary-content.adoc
index e423e4765b..20ba121314 100644
--- a/Documentation/glossary-content.adoc
+++ b/Documentation/glossary-content.adoc
@@ -297,8 +297,8 @@ This commit is referred to as a "merge commit", or sometimes just a
     identified by its <<def_object_name,object name>>. The objects usually
     live in `$GIT_DIR/objects/`.

-[[def_object_identifier]]object identifier (oid)::
-    Synonym for <<def_object_name,object name>>.
+[[def_object_identifier]]object identifier, object ID, oid::
+    Synonyms for <<def_object_name,object name>>.

 [[def_object_name]]object name::
     The unique identifier of an <<def_object,object>>. The
diff --git a/Documentation/meson.build b/Documentation/meson.build
index e34965c5b0..ace0573e82 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -192,6 +192,7 @@ manpages = {
   'gitcore-tutorial.adoc' : 7,
   'gitcredentials.adoc' : 7,
   'gitcvs-migration.adoc' : 7,
+  'gitdatamodel.adoc' : 7,
   'gitdiffcore.adoc' : 7,
   'giteveryday.adoc' : 7,
   'gitfaq.adoc' : 7,

base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8
--
gitgitgadget

^ permalink raw reply related [flat|nested] 89+ messages in thread
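The reachability rule in the NOTE of the REFERENCES section above (follow tags to what they tag, commits to their parents and trees, trees to their entries) can be sketched as a simple graph walk. This is a toy model in Python, not Git's implementation: the object IDs, the layout of the `objects` dictionary, and the `reachable` helper are all made up for illustration.

```python
# Toy model of reachability; not Git's real storage or algorithm.
# Each hypothetical object ID maps to the IDs it points at directly.
objects = {
    "tag1": ["commit2"],                   # tag object -> the object it tags
    "commit2": ["commit1", "tree2"],       # commit -> parent(s) and tree
    "commit1": ["tree1"],
    "amended-away": ["commit1", "tree3"],  # old commit replaced by --amend
    "tree1": ["blob1"],
    "tree2": ["blob1", "blob2"],
    "tree3": ["blob3"],
    "blob1": [],
    "blob2": [],
    "blob3": [],
}


def reachable(starts):
    """Return every object ID that can be found by walking from `starts`."""
    seen = set()
    stack = list(starts)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


# Starting points: what the references point at, plus reflog entries.
refs = {"refs/heads/main": "commit2", "refs/tags/v1": "tag1"}
reflog_entries = ["amended-away"]

from_refs_only = reachable(refs.values())
with_reflog = reachable(list(refs.values()) + reflog_entries)

# Objects only the reflog keeps alive (deletable once that entry expires).
print(sorted(set(objects) - from_refs_only))
# Objects that nothing at all keeps alive while the reflog entry exists.
print(sorted(set(objects) - with_reflog))
```

In this model the amended-away commit and the tree and blob that only it uses are kept alive solely by the reflog entry, which matches the example given in the NOTE.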
* Re: [PATCH v7] doc: add an explanation of Git's data model
2025-11-12 19:53 ` [PATCH v7] " Julia Evans via GitGitGadget
@ 2025-11-12 20:26 ` Junio C Hamano
2025-11-23 2:37 ` Junio C Hamano
1 sibling, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-11-12 20:26 UTC (permalink / raw)
To: Julia Evans via GitGitGadget
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +2. The *file type*, which must be one of these five types:
> +    - *regular file*
> +    - *executable file*
> +    - *symbolic link*
> +    - *directory*
> +    - *gitlink* (for use with submodules)
> +3. The <<object-id,*object ID*>> with the contents of the file, directory,
> +   or gitlink.
> ++
> +For example, this is how a tree containing one directory (`src`) and one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> +
> +NOTE: In the output above, Git displays the file type of each tree entry
> +using a format that's loosely modelled on Unix file modes (`100644` is
> +"regular file", `100755` is "executable file", `120000` is "symbolic
> +link", `040000` is "directory", and `160000` is "gitlink"). It also
> +displays the object's type: `blob` for files and symlinks, `tree` for
> +directories, and `commit` for gitlinks.

As a description of the data model, moving the exact bit assignment
to a side note like the above hunk (relative to the previous
iteration) does make the body text less cluttered, which I think is
a welcome change.

^ permalink raw reply [flat|nested] 89+ messages in thread
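The mapping described in the NOTE quoted in the message above can be written down as a small lookup table. The Python sketch below is only an illustration: `FILE_TYPES` and `describe_entry` are hypothetical names, and the parsing assumes the four-column `mode type id name` layout shown in the quoted example.

```python
# Mode string -> (file type, object type), as listed in the quoted NOTE.
FILE_TYPES = {
    "100644": ("regular file", "blob"),
    "100755": ("executable file", "blob"),
    "120000": ("symbolic link", "blob"),
    "040000": ("directory", "tree"),
    "160000": ("gitlink", "commit"),
}


def describe_entry(line):
    """Split one displayed tree entry and look up its file type."""
    mode, obj_type, object_id, name = line.split(maxsplit=3)
    file_type, expected_type = FILE_TYPES[mode]
    if obj_type != expected_type:
        raise ValueError(f"{name}: mode {mode} does not go with type {obj_type}")
    return {
        "name": name,
        "file_type": file_type,
        "object_type": obj_type,
        "object_id": object_id,
    }


# The two entries from the example tree in the quoted hunk.
listing = """\
100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
"""

for line in listing.splitlines():
    entry = describe_entry(line)
    print(entry["name"], "->", entry["file_type"])
```

Keeping the five modes in one table mirrors the point of the side note: the body text only needs the five file types, and the numeric spelling is a display detail.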
* Re: [PATCH v7] doc: add an explanation of Git's data model
2025-11-12 19:53 ` [PATCH v7] " Julia Evans via GitGitGadget
2025-11-12 20:26 ` Junio C Hamano
@ 2025-11-23 2:37 ` Junio C Hamano
2025-12-01 8:14 ` Patrick Steinhardt
1 sibling, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2025-11-23 2:37 UTC (permalink / raw)
To: Julia Evans via GitGitGadget
Cc: git, Kristoffer Haugsbakk, D. Ben Knoble, Patrick Steinhardt, Julia Evans

"Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:

> changes in v7:
>
> * Replace "file mode" with "file type", to make it more obvious that
>   Git does not support general Unix file modes. Remove a broken XML
>   link as a side effect.
> * Use "top-level directory" instead of "base directory"
> * Like last time, I still don't have any better ideas for "A branch
>   refers to a commit ID"

We haven't seen much comment on this iteration, and hopefully that
is not showing the lack of interest ;-) Shall we mark the topic for
'next' now?

Thanks.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v7] doc: add an explanation of Git's data model
2025-11-23 2:37 ` Junio C Hamano
@ 2025-12-01 8:14 ` Patrick Steinhardt
2025-12-02 12:25 ` Junio C Hamano
0 siblings, 1 reply; 89+ messages in thread
From: Patrick Steinhardt @ 2025-12-01 8:14 UTC (permalink / raw)
To: Junio C Hamano
Cc: Julia Evans via GitGitGadget, git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans

On Sat, Nov 22, 2025 at 06:37:39PM -0800, Junio C Hamano wrote:
> "Julia Evans via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > changes in v7:
> >
> > * Replace "file mode" with "file type", to make it more obvious that
> >   Git does not support general Unix file modes. Remove a broken XML
> >   link as a side effect.
> > * Use "top-level directory" instead of "base directory"
> > * Like last time, I still don't have any better ideas for "A branch
> >   refers to a commit ID"
>
> We haven't seen much comment on this iteration, and hopefully that
> is not showing the lack of interest ;-) Shall we mark the topic for
> 'next' now?

I've been out of office, but I certainly think that this version is more
than "good enough", and I have a lot of interest in these topics. I've
seen you already merged it to 'master' -- yay!

By the way, thanks a ton Julia for all these improvements to our docs. I
highly appreciate them and think that this is sorely needed. Our users
will certainly appreciate your work!

Patrick

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH v7] doc: add an explanation of Git's data model
2025-12-01 8:14 ` Patrick Steinhardt
@ 2025-12-02 12:25 ` Junio C Hamano
0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2025-12-02 12:25 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Julia Evans via GitGitGadget, git, Kristoffer Haugsbakk, D. Ben Knoble, Julia Evans

Patrick Steinhardt <ps@pks.im> writes:

> By the way, thanks a ton Julia for all these improvements to our docs. I
> highly appreciate them and think that this is sorely needed. Our users
> will certainly appreciate your work!

Same here.

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-03 17:34 [PATCH] doc: add a explanation of Git's data model Julia Evans via GitGitGadget
` (3 preceding siblings ...)
2025-10-08 13:53 ` [PATCH v2] " Julia Evans via GitGitGadget
@ 2025-10-09 14:20 ` Julia Evans
2025-10-10 0:42 ` Ben Knoble
4 siblings, 1 reply; 89+ messages in thread
From: Julia Evans @ 2025-10-09 14:20 UTC (permalink / raw)
To: Julia Evans, git

I collected some feedback from Git users on this v2 document. I'm expecting more
feedback, but here's an initial brain dump of my notes. I mostly wrote this for
my own use but I thought it might be interesting to other folks too.

intro:

- Say that we're going to explain what "objects", "references" etc are.
  (so that readers know they're not expected to know what those words
  mean yet)
- It's confusing that tags are both "an object" and "a reference".
  Need to think about whether there's a way to address this, I was
  hoping that using the terms "tag object" and "tag" would be enough
  but maybe not.
- Give a 1-sentence intro to "reflog" (easy)

commits:

- The order the fields are given in doesn't match the order in the
  example, maybe it should.
- "All the files in the commit, stored as a tree" is throwing a few
  people off. I think we should communicate something like "a tree hash
  that describes the root of the project and then by extension the
  whole project", but phrased more clearly. Will figure that out.
- "Are commits stored as a diff?" (2 people asked where diffs come
  from, I think we need to add a note saying that diffs are calculated
  at runtime, it's a very common misconception and I think it should be
  easy to clear up)
- "What's the difference between an author and committer?" (I actually
  don't know either, will try to find out and see if it's
  straightforward to add a short note explaining it)
- In the note about commits being amended: one person suggested saying
  "creates a new commit with the same parent" which I think might be
  clearer.
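
One way to see why amending "creates a new commit with the same parent": an object's ID is a hash of its contents, so changing the message gives a different ID even when the `parent` line is untouched. The sketch below assumes a SHA-1 repository, where the ID is computed over a `commit <size>` header, a NUL byte, and then the commit text. `commit_id` and `make_commit` are hypothetical helpers, and the commit text is adapted from the example in the patch.

```python
import hashlib


def commit_id(body: str) -> str:
    """Hash commit text the way a SHA-1 repository does (header + body)."""
    data = body.encode("utf-8")
    header = b"commit " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def make_commit(message: str) -> str:
    """Build commit text with a fixed tree, parent, author and committer."""
    return (
        "tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a\n"
        "parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647\n"
        "author Maya <maya@example.com> 1759173425 -0400\n"
        "committer Maya <maya@example.com> 1759173425 -0400\n"
        "\n"
        f"{message}\n"
    )


original = make_commit("Add README")
amended = make_commit("Add README and license")

print(commit_id(original))
print(commit_id(amended))
# Same parent line, different contents, so a different object ID.
print(commit_id(original) != commit_id(amended))
```

A real `git commit --amend` would normally also change the committer timestamp; the sketch keeps it fixed so that the message is the only difference.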

trees:

- One person asks what a "working tree" is. I don't think this is a
  good place for that, but it made me wonder if "the current working
  directory" has a place in this document. I feel like no but not 100%
  sure.
- 2 people want to know more about "The file mode, for example 100644".
  Moving "Git only supports these file modes..." further up so that
  folks can immediately see what the options are here should help with
  this.
- On "so git-gc(1) periodically compresses objects to save disk space",
  there are a few follow up comments wondering about more, which makes
  me think the comment about compression is actually a distraction.
  I'll say something simpler instead, from Junio's suggestion.

tag objects:

- Requests for an example, will add one.
- Requests to explain the difference between "lightweight" and
  "annotated" tags, will add.

references:

- Two people pointed out that because references are often stored as files,
  you can't have two references named `julia/ticket-number` and
  `julia/ticket-number/task-name`.
  I'm not sure if this is a fundamental limit of the refs data model
  (does the reftable backend have the same limitation?), but it could be
  a good reason to mention that refs are often stored as files, because
  it makes it obvious that you can't have a file and a directory with
  the same name.
  Obviously this is an issue that is affecting people relatively often
  in practice though so I think it's worth mentioning in some way.

branches:

HEAD:

- One person asks if there are any other symbolic references other than
  HEAD, or if they can make their own symbolic references. I don't know
  and I don't know if this is worth mentioning.
- `HEAD: HEAD` looks weird, it made sense when it was `HEAD: .git/HEAD`.
  Will think about how to fix this.
- Several people are asking for more detail about detached HEAD state.
  My current idea here is to just give an example of a way you can end
  up in detached HEAD state ("by checking out a tag"), but in an ideal
  world it would be easy to find out what it means, how it happens,
  what it implies, and how you might adjust your workflow to avoid it
  (by using `git switch`). But we can't get into all of that here. I'd
  love to just link to a more detailed explanation of detached HEAD
  state but I'm not totally satisfied with the one that's currently in
  `man git-checkout`. It may be best to just leave this in a slightly
  suboptimal state, write a really clear explanation of detached HEAD
  state somewhere else, and then link to it.

the index:

- "permissions" should be "file mode" (like with trees)
- "filename" should be "file path"
- The index can also be locked. Might be worth mentioning.
- This doesn't explain what "staged" means, perhaps mention the
  relationship to `git add`

reflogs

- Mention the role of the reflog in retrieving "lost" commits or
  undoing bad rebases.
- How can you see the full data in the reflog? `git reflog show`
  doesn't list the user who made the change

      git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso

  works but it's really a mouthful, not sure I want to include all that

Overall: several people suggested mentioning more about where things
are stored in the `.git` directory, which I just removed.

I think I want to avoid this (not sure yet), but I'm going to think
about the underlying motivation for this suggestion and see if it can be
addressed in a different way.

Some ideas for what functions discussing the `.git` directory has:

1. Like I mentioned above with branches, sometimes the implementation causes
   some extra constraints like "you can't have branches `julia/ticket`
   and `julia/ticket/task`". So often people like to know a little
   about the implementation because it can help predict some of the
   holes in the abstractions you're using.
2. It lets you view the "raw" data, so you can be totally sure about
   what Git is storing.
   This is nice because Git's UI can be very
   inconsistent sometimes, so looking at the raw data gives a sense of
   certainty about what's actually there.

I tried to put together a list of ways to look at the "raw" data without
looking in the `.git` directory. The ways for objects and the index are great,
but for references and the reflog they involve these pretty complex format
strings, I'm not confident I've gotten the format strings right and IMO
they don't inspire a lot of confidence.

View an object with:
----
git cat-file -p <object-id>
----

View a reference with:

----
git for-each-ref <ref-name> --include-root-refs --format="%(refname) %(if)%(symref)%(then)%(symref)%(else)%(objectname:short)%(end)"
----

View the index with:

----
git ls-files --stage
----

View the reflog for a reference with:

----
git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso
----

On Fri, Oct 3, 2025, at 1:34 PM, Julia Evans via GitGitGadget wrote:
> From: Julia Evans <julia@jvns.ca>
>
> Git very often uses the terms "object", "reference", or "index" in its
> documentation.
>
> However, it's hard to find a clear explanation of these terms and how
> they relate to each other in the documentation. The closest candidates
> currently are:
>
> 1. `gitglossary`. This makes a good effort, but it's an alphabetically
>    ordered dictionary and a dictionary is not a good way to learn
>    concepts. You have to jump around too much and it's not possible to
>    present the concepts in the order that they should be explained.
> 2. `gitcore-tutorial`. This explains how to use the "core" Git commands.
>    This is a nice document to have, but it's not necessary to learn how
>    `update-index` works to understand Git's data model, and we should
>    not be requiring users to learn how to use the "plumbing" commands
>    if they want to learn what the term "index" or "object" means.
> 3. `gitrepository-layout`. This is a great resource, but it includes a
>    lot of information about configuration and internal implementation
>    details which are not related to the data model. It also does
>    not explain how commits work.
>
> The result of this is that Git users (even users who have been using
> Git for 15+ years) struggle to read the documentation because they don't
> know what the core terms mean, and it's not possible to add links
> to help them learn more.
>
> Add an explanation of Git's data model. Some choices I've made in
> deciding what "core data model" means:
>
> 1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me
>    if those are intended to be user facing or if they're more like
>    internal implementation details.
> 2. Don't talk about submodules other than by mentioning how they
>    relate to trees. This is because Git has a lot of special features,
>    and explaining how they all work exhaustively could quickly go
>    down a rabbit hole which would make this document less useful for
>    understanding Git's core behaviour.
> 3. Don't discuss the structure of a commit message
>    (first line, trailers, GPG signatures, etc).
>    Perhaps this should change.
>
> Some other choices I've made:
>
> 1. Mention packed refs only in a note.
> 2. Don't mention that the full name of the branch `main` is
>    technically `refs/heads/main`. This should likely change but I
>    haven't worked out how to do it in a clear way yet.
> 3. Mostly avoid referring to the `.git` directory, because the exact
>    details of how things are stored change over time.
>    This should perhaps change from "mostly" to "entirely"
>    but I haven't worked out how to do that in a clear way yet.
>
> Signed-off-by: Julia Evans <julia@jvns.ca>
> ---
>  doc: Add a explanation of Git's data model
>
> Published-As:
> https://github.com/gitgitgadget/git/releases/tag/pr-1981%2Fjvns%2Fgitdatamodel-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
> pr-1981/jvns/gitdatamodel-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1981
>
>  Documentation/Makefile          |   1 +
>  Documentation/gitdatamodel.adoc | 226 ++++++++++++++++++++++++++++++++
>  2 files changed, 227 insertions(+)
>  create mode 100644 Documentation/gitdatamodel.adoc
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fb83d0c6e..5f4acfacbd 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc
>  MAN7_TXT += gitcore-tutorial.adoc
>  MAN7_TXT += gitcredentials.adoc
>  MAN7_TXT += gitcvs-migration.adoc
> +MAN7_TXT += gitdatamodel.adoc
>  MAN7_TXT += gitdiffcore.adoc
>  MAN7_TXT += giteveryday.adoc
>  MAN7_TXT += gitfaq.adoc
> diff --git a/Documentation/gitdatamodel.adoc
> b/Documentation/gitdatamodel.adoc
> new file mode 100644
> index 0000000000..4b2cb167dc
> --- /dev/null
> +++ b/Documentation/gitdatamodel.adoc
> @@ -0,0 +1,226 @@
> +gitdatamodel(7)
> +===============
> +
> +NAME
> +----
> +gitdatamodel - Git's core data model
> +
> +DESCRIPTION
> +-----------
> +
> +It's not necessary to understand Git's data model to use Git, but it's
> +very helpful when reading Git's documentation so that you know what it
> +means when the documentation says "object" "reference" or "index".
> +
> +Git's core operations use 4 kinds of data:
> +
> +1. <<objects,Objects>>: commits, trees, blobs, and tag objects
> +2. <<references,References>>: branches, tags,
> +   remote-tracking branches, etc
> +3. <<index,The index>>, also known as the staging area
> +4. <<reflogs,Reflogs>>
> +
> +[[objects]]
> +OBJECTS
> +-------
> +
> +Commits, trees, blobs, and tag objects are all stored in Git's object
> database.
> +Every object has:
> +
> +1. an *ID*, which is the SHA-1 hash of its contents.
> +  It's fast to look up a Git object using its ID.
> +  The ID is usually represented in hexadecimal, like
> +  `1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
> +2. a *type*. There are 4 types of objects:
> +   <<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
> +   and <<tag-object,tag objects>>.
> +3. *contents*. The structure of the contents depends on the type.
> +
> +Once an object is created, it can never be changed.
> +Here are the 4 types of objects:
> +
> +[[commit]]
> +commits::
> +    A commit contains:
> ++
> +1. Its *parent commit ID(s)*. The first commit in a repository has 0
> parents,
> +  regular commits have 1 parent, merge commits have 2+ parents
> +2. A *commit message*
> +3. All the *files* in the commit, stored as a *<<tree,tree>>*
> +4. An *author* and the time the commit was authored
> +5. A *committer* and the time the commit was committed
> ++
> +Here's how an example commit is stored:
> ++
> +----
> +tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
> +parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
> +author Maya <maya@example.com> 1759173425 -0400
> +committer Maya <maya@example.com> 1759173425 -0400
> +
> +Add README
> +----
> ++
> +Like all other objects, commits can never be changed after they're
> created.
> +For example, "amending" a commit with `git commit --amend` creates a
> new commit.
> +The old commit will eventually be deleted by `git gc`.
> +
> +[[tree]]
> +trees::
> +    A tree is how Git represents a directory. It lists, for each item
> in
> +    the tree:
> ++
> +1. The *permissions*, for example `100644`
> +2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
> +  or <<commit,`commit`>> (a Git submodule)
> +3. The *object ID*
> +4. The *filename*
> ++
> +For example, this is how a tree containing one directory (`src`) and
> one file
> +(`README.md`) is stored:
> ++
> +----
> +100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
> +040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
> +----
> ++
> +*NOTE:* The permissions are in the same format as UNIX permissions, but
> +the only allowed permissions for files (blobs) are 644 and 755.
> +
> +[[blob]]
> +blobs::
> +    A blob is how Git represents a file. A blob object contains the
> +    file's contents.
> ++
> +Storing a new blob for every new version of a file can get big, so
> +`git gc` periodically compresses objects for efficiency in
> `.git/objects/pack`.
> +
> +[[tag-object]]
> +tag objects::
> +    Tag objects (also known as "annotated tags") contain:
> ++
> +1. The *tagger* and tag date
> +2. A *tag message*, similar to a commit message
> +3. The *ID* of the object (often a commit) that they reference
> +
> +[[references]]
> +REFERENCES
> +----------
> +
> +References are a way to give a name to a commit.
> +It's easier to remember "the changes I'm working on are on the `turtle`
> +branch" than "the changes are in commit bb69721404348e".
> +Git often uses "ref" as shorthand for "reference".
> +
> +References that you create are stored in the `.git/refs` directory,
> +and Git has a few special internal references like `HEAD` that are
> stored
> +in the base `.git` directory.
> +
> +References can either be:
> +
> +1. References to an object ID, usually a <<commit,commit>> ID
> +2. References to another reference. This is called a "symbolic
> reference".
> +
> +Git handles references differently based on which subdirectory of
> +`.git/refs` they're stored in.
> +Here are the main types:
> +
> +[[branch]]
> +branches: `.git/refs/heads/<name>`::
> +    A branch is a name for a commit ID.
> +    That commit is the latest commit on the branch.
> +    Branches are stored in the `.git/refs/heads/` directory.
> ++
> +To get the history of commits on a branch, Git will start at the commit
> +ID the branch references, and then look at the commit's parent(s),
> +the parent's parent, etc.
> +
> +[[tag]]
> +tags: `.git/refs/tags/<name>`::
> +    A tag is a name for a commit ID, tag object ID, or other object ID.
> +    Tags are stored in the `refs/tags/` directory.
> ++
> +Even though branches and commits are both "a name for a commit ID", Git
> +treats them very differently.
> +Branches are expected to be regularly updated as you work on the
> branch,
> +but it's expected that a tag will never change after you create it.
> +
> +[[HEAD]]
> +HEAD: `.git/HEAD`::
> +    `HEAD` is where Git stores your current <<branch,branch>>.
> +    `HEAD` is normally a symbolic reference to your current branch, for
> +    example `ref: refs/heads/main` if your current branch is `main`.
> +    `HEAD` can also be a direct reference to a commit ID,
> +    that's called "detached HEAD state".
> +
> +[[remote-tracking-branch]]
> +remote tracking branches: `.git/refs/remotes/<remote>/<branch>`::
> +    A remote-tracking branch is a name for a commit ID.
> +    It's how Git stores the last-known state of a branch in a remote
> +    repository. `git fetch` updates remote-tracking branches. When
> +    `git status` says "you're up to date with origin/main", it's
> looking at
> +    this.
> +
> +[[other-refs]]
> +Other references::
> +    Git tools may create references in any subdirectory of `.git/refs`.
> +    For example, linkgit:git-stash[1], linkgit:git-bisect[1],
> +    and linkgit:git-notes[1] all create their own references
> +    in `.git/refs/stash`, `.git/refs/bisect`, etc.
> +    Third-party Git tools may also create their own references.
> ++
> +Git may also create references in the base `.git` directory
> +other than `HEAD`, like `ORIG_HEAD`.
> +
> +*NOTE:* As an optimization, references may be stored as packed
> +refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
> +
> +[[index]]
> +THE INDEX
> +---------
> +
> +The index, also known as the "staging area", contains the current
> staged
> +version of every file in your Git repository. When you commit, the
> files
> +in the index are used as the files in the next commit.
> +
> +Unlike a tree, the index is a flat list of files.
> +Each index entry has 4 fields:
> +
> +1. The *permissions*
> +2. The *<<blob,blob>> ID* of the file
> +3. The *filename*
> +4. The *number*. This is normally 0, but if there's a merge conflict
> +   there can be multiple versions (with numbers 0, 1, 2, ..)
> +   of the same filename in the index.
> +
> +It's extremely uncommon to look at the index directly: normally you'd
> +run `git status` to see a list of changes between the index and
> <<HEAD,HEAD>>.
> +But you can use `git ls-files --stage` to see the index.
> +Here's the output of `git ls-files --stage` in a repository with 2
> files:
> +
> +----
> +100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
> +100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
> +----
> +
> +[[reflogs]]
> +REFLOGS
> +-------
> +
> +Git stores the history of branch, tag, and HEAD refs in a reflog
> +(you should read "reflog" as "ref log"). Not every ref is logged by
> +default, but any ref can be logged.
> +
> +Each reflog entry has:
> +
> +1. *Before/after *commit IDs*
> +2. *User* who made the change, for example `Maya <maya@example.com>`
> +3. *Timestamp*
> +4. *Log message*, for example `pull: Fast-forward`
> +
> +Reflogs only log changes made in your local repository.
> +They are not shared with remotes.
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
>
> base-commit: bb69721404348ea2db0a081c41ab6ebfe75bdec8
> --
> gitgitgadget

^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: [PATCH] doc: add a explanation of Git's data model
2025-10-09 14:20 ` [PATCH] doc: add a " Julia Evans
@ 2025-10-10 0:42 ` Ben Knoble
0 siblings, 0 replies; 89+ messages in thread
From: Ben Knoble @ 2025-10-10 0:42 UTC (permalink / raw)
To: Julia Evans; +Cc: Julia Evans, git

> On 9 Oct 2025, at 10:21, Julia Evans <julia@jvns.ca> wrote:
>
> I collected some feedback from Git users on this v2 document. I'm expecting more
> feedback, but here's an initial brain dump of my notes. I mostly wrote this for
> my own use but I thought it might be interesting to other folks too.
>

[snip]

> references:
>
> - Two people pointed out that because references are often stored as files,
>   you can't have two references named `julia/ticket-number` and
>   `julia/ticket-number/task-name`.
>   I'm not sure if this is a fundamental limit of the refs data model
>   (does the reftable backend have the same limitation?), but it could be
>   a good reason to mention that refs are often stored as files, because
>   it makes it obvious that you can't have a file and a directory with
>   the same name.
>   Obviously this is an issue that is affecting people relatively often
>   in practice though so I think it's worth mentioning in some way.

I don’t think the reftable backend has this limitation (?), but it
reminded me of another important one: on case-insensitive filesystems
you cannot have both « julia » and « JULIA » branches! This occasionally
creates problems where someone cannot fetch/clone what has been pushed.

Anyway: it’s worth mentioning the files for that purpose.

It would be nice to improve the UI as you describe below to continue to
be able to naturally interrogate Git without needing to know about all
the storage formats (recall that cat-file works just fine with packs and
MIDXs!).

> Overall: several people suggested mentioning more about where things
> are stored in the `.git` directory, which I just removed.
>
> I think I want to avoid this (not sure yet), but I'm going to think
> about the underlying motivation for this suggestion and see if it can be
> addressed in a different way.
>
> Some ideas for what functions discussing the `.git` directory has:
>
> 1. Like I mentioned above with branches, sometimes the implementation causes
>    some extra constraints like "you can't have branches `julia/ticket`
>    and `julia/ticket/task`". So often people like to know a little
>    about the implementation because it can help predict some of the
>    holes in the abstractions you're using.
> 2. It lets you view the "raw" data, so you can be totally sure about
>    what Git is storing. This is nice because Git's UI can be very
>    inconsistent sometimes, so looking at the raw data gives a sense of
>    certainty about what's actually there.
>
> I tried to put together a list of ways to look at the "raw" data without
> looking in the `.git` directory. The ways for objects and the index are great,
> but for references and the reflog they involve these pretty complex format
> strings, I'm not confident I've gotten the format strings right and IMO
> they don't inspire a lot of confidence.
>
> View an object with:
> ----
> git cat-file -p <object-id>
> ----
>
> View a reference with:
>
> ----
> git for-each-ref <ref-name> --include-root-refs --format="%(refname) %(if)%(symref)%(then)%(symref)%(else)%(objectname:short)%(end)"
> ----
>
> View the index with:
>
> ----
> git ls-files --stage
> ----
>
> View the reflog for a reference with:
>
> ----
> git reflog show <refname> --format="%h | %gd | %gn <%ge> | %gs" --date=iso
> ----

[kept for context]

^ permalink raw reply [flat|nested] 89+ messages in thread