From: Junio C Hamano <gitster@pobox.com>
To: Duy Nguyen <pclouds@gmail.com>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
Michael Haggerty <mhagger@alum.mit.edu>,
"Kyle J. McKay" <mackyle@gmail.com>,
Git Mailing List <git@vger.kernel.org>,
Andreas Schwab <schwab@linux-m68k.org>
Subject: Re: [PATCH v2 01/10] Define a structure for object IDs.
Date: Thu, 12 Mar 2015 23:03:05 -0700 [thread overview]
Message-ID: <xmqqwq2lzb6u.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <CACsJy8CMZbwyBayX-bbWmGwV=AWC000Yx6LfzOcB2irq2X6qHQ@mail.gmail.com> (Duy Nguyen's message of "Fri, 13 Mar 2015 07:58:20 +0700")
Duy Nguyen <pclouds@gmail.com> writes:
>> You mean "if it came in <pack, offset> format, convert it down to
>> <sha1> until the last second that it is needed (e.g. need to put
>> that in a tree object in order to compute the object name of the
>> containing tree object)"?
>
> I picked my words poorly. It should be <pack, the index in pack>
> instead of the _byte_ offset.
Thanks for a clarification, but I do not think it affects the main
point of the discussion very much. If we use "union in struct",
where we can store either an SHA-1 hash or some other identifying
information for the object, but not both, then at some point you
would need to convert <pack, nth> to <sha-1> in a codepath that
needs the sha-1 hash value (e.g. if the object A, that is known to
you as <pack, nth>, is placed in a tree object B, you need the
object name of A in <sha-1> representation in order to compute the
object name of the tree object B. You can choose to keep it in
<pack, nth> form in "struct object_id { union }" and look up the
<sha-1> from the pack index every time, or you can choose to give
up the <pack, nth> form and upgrade the "struct object_id" to store
<sha-1> at that point.
If you keep both <pack, nth> *and* <sha-1> in "struct object_id" at
the same time, you can choose whichever is convenient, but it would
bloat everything in core. Not just it bloats "struct object" and
its subclasses, the in-core index entries, which is what I meant
by ...
>> Unless you fix that "union in struct" assumption, that is.
... this.
>> To me, <pack, offset> information smells to belong more to a "struct
>> object" (or its subclass) as an optional annotation---when a caller
>> is asked to parse_object(), you would bypass the sha1_read_file()
>> that goes and looks the object name up from the list of pack .idx
>> and instead go there straight using that annotation.
>
> For pack v4, commits and trees can be encoded this way.
Even if your in-pack representation of a commit object allowed to
store the tree pointer in <pack, nth> format, its object name must
be computed as if you have the commit object in the traditional
format and computed the hash of that (together with the standard
"<type> <size>\0" header), and at that point, you need the contained
object's name in <sha-1> format (imagine how you would implement the
commit-tree command). Hence, I do not think the v4 encoding changes
the discussion very much. I see the primary value of v4 encoding is
to shorten the length of various fields take on-disk and in-pack.
If it were <pack, offset>, it could be argued that it would also be
faster to get to the packed data in the packfile, and going from
<pack, nth> to the .idx file and then going to the location in the
data in the packfile would be faster than going from <sha-1> to a
particular pack and its in-pack offset, with the difference of cost
around log(n)*m where n is the number of objects in a pack and m is
the total number of packs in the repository.
It is true that <nth> format (force that the referred-to object
lives in the same pack as the referrer) can help speed up
interpretation of extended SHA-1 expression, e.g. "v1.0^0:t", which
can read v1.0 tag in v4 format, find the <nth> info for the commit
pointed by the tag and get to that data in the pack, find the <nth>
info for the top-tree recorded in that commit and directly get to
the data of that tree, and then find the entry for "t", which will
give the object name for that subtree again in <nth> format, and at
that point you can find the <sha-1> of that final object, without
having to know any object names of the intermediate objects
(i.e. you must start from <sha-1> of the tag you obtain from the
refs API, but you didn't use the object name of the commit and its
top-level tree). So for such a codepath, I would say it would be
sufficient to use a "union in struct" people have been envisioning
and convert <pack, nth> to <sha-1> when the latter form becomes
necessary for the first time for the object.
Anyway, wouldn't this be all academic? I do not see how you would
keep the object name in the <pack, nth> format in-core, as the
obj_hash[] is a hashtable keyed by <sha-1>, and even when we switch
to a different hash, I cannot see how such a table to ensure the
singleton-ness of in-core objects can be keyed sometimes by <hash>
and by <pack, nth> in some other time.
next prev parent reply other threads:[~2015-03-13 6:03 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-03-07 23:23 [PATCH v2 00/10] Use a structure for object IDs brian m. carlson
2015-03-07 23:23 ` [PATCH v2 01/10] Define " brian m. carlson
[not found] ` <CEA07500-9F47-4B24-AD5D-1423A601A4DD@gmail.com>
2015-03-11 22:08 ` brian m. carlson
2015-03-12 0:26 ` Junio C Hamano
2015-03-12 9:34 ` brian m. carlson
2015-03-12 10:28 ` Michael Haggerty
2015-03-12 10:46 ` brian m. carlson
2015-03-12 11:16 ` Duy Nguyen
2015-03-12 18:24 ` Junio C Hamano
2015-03-12 18:44 ` Junio C Hamano
2015-03-13 0:58 ` Duy Nguyen
2015-03-13 6:03 ` Junio C Hamano [this message]
2015-03-14 11:49 ` Duy Nguyen
2015-03-14 22:19 ` Junio C Hamano
2015-03-15 0:17 ` Duy Nguyen
2015-03-15 2:32 ` Junio C Hamano
2015-03-07 23:23 ` [PATCH v2 02/10] Define utility functions " brian m. carlson
2015-03-08 9:57 ` Duy Nguyen
2015-03-08 14:48 ` brian m. carlson
2015-03-11 12:44 ` Michael Haggerty
2015-03-07 23:23 ` [PATCH v2 03/10] bisect.c: convert leaf functions to use struct object_id brian m. carlson
2015-03-07 23:23 ` [PATCH v2 04/10] archive.c: convert " brian m. carlson
2015-03-11 14:20 ` Michael Haggerty
2015-03-11 22:12 ` brian m. carlson
2015-03-07 23:24 ` [PATCH v2 05/10] zip: use GIT_SHA1_HEXSZ for trailers brian m. carlson
2015-03-07 23:24 ` [PATCH v2 06/10] bulk-checkin.c: convert to use struct object_id brian m. carlson
2015-03-07 23:24 ` [PATCH v2 07/10] diff: convert struct combine_diff_path to object_id brian m. carlson
2015-03-07 23:24 ` [PATCH v2 08/10] commit: convert parts to struct object_id brian m. carlson
2015-03-11 14:46 ` Michael Haggerty
2015-03-07 23:24 ` [PATCH v2 09/10] patch-id: convert to use " brian m. carlson
2015-03-07 23:24 ` [PATCH v2 10/10] apply: convert threeway_stage to object_id brian m. carlson
2015-03-08 7:43 ` [PATCH v2 00/10] Use a structure for object IDs Junio C Hamano
2015-03-11 2:38 ` Kyle J. McKay
2015-03-11 16:08 ` Michael Haggerty
2015-03-11 20:35 ` Junio C Hamano
2015-03-13 22:45 ` brian m. carlson
-- strict thread matches above, loose matches on Subject: below --
2015-03-13 23:39 brian m. carlson
2015-03-13 23:39 ` [PATCH v2 01/10] Define " brian m. carlson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqwq2lzb6u.fsf@gitster.dls.corp.google.com \
--to=gitster@pobox.com \
--cc=git@vger.kernel.org \
--cc=mackyle@gmail.com \
--cc=mhagger@alum.mit.edu \
--cc=pclouds@gmail.com \
--cc=sandals@crustytoothpaste.net \
--cc=schwab@linux-m68k.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.