Re: [PATCH v2] bulk-checkin: only support blobs in index_bulk_checkin

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: "Eric W. Biederman" <ebiederm@gmail.com>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>, <git@vger.kernel.org>
Subject: Re: [PATCH v2] bulk-checkin: only support blobs in index_bulk_checkin
Date: Tue, 19 Sep 2023 23:59:57 -0700	[thread overview]
Message-ID: <xmqqr0mtcosy.fsf@gitster.g> (raw)
In-Reply-To: <878r918ps3.fsf@gmail.froward.int.ebiederm.org> (Eric W. Biederman's message of "Tue, 19 Sep 2023 22:52:28 -0500")

"Eric W. Biederman" <ebiederm@gmail.com> writes:

> As the code is written today index_bulk_checkin only accepts blobs.
> Remove the enum object_type parameter and rename index_bulk_checkin to
> index_blob_bulk_checkin, index_stream to index_blob_stream,
> deflate_to_pack to deflate_blob_to_pack, stream_to_pack to
> stream_blobk_to_pack, to make this explicit.

> Not supporting commits, tags, or trees has no downside as it is not
> currently supported now, and commits, tags, and trees being smaller by
> design do not have the problem that the problem that index_bulk_checkin
> was built to solve.

Exactly.  The streaming was primarily to help dealing with huge
blobs that cannot be held in-core.  Of course other parts (like
comparing them) of the system would require to hold them in-core
so some things may not work for them, but at least it is a start
to be able to _hash_ them to store them in the object store and to
give them names.

> What is more this is very desiable from the context of the hash function
> transition.

A bit hard to parse; perhaps want a comma before "this"?

> For blob objects it is straight forward to compute multiple hash
> functions during index_bulk_checkin as the object header and content of
> a blob is the same no matter which hash function is being used to
> compute the oid of a blob.

OK.

> For commits, tress, and tags the object header and content that need to
> be hashed ard different for different hashes.  Even worse the object
> header can not be known until the size of the content that needs to be
> hashed is known.  The size of the content that needs to be hashed can
> not be known until a complete pass is made through all of the variable
> length entries of the original object.

"tress" -> "trees".  Also a comma after "worse".

> As far as I can tell this extra pass defeats most of the purpose of
> streaming, and it is much easier to implement with in memory buffers.

The purpose of streaming being the ability to hash and compute the
object name without having to hold the entirety of the object, I am
not sure the above is a good argument.  You can run multiple passes
by streaming the same data twice if you needed to, and how much
easier the implementation may become if you can assume that you can
hold everything in-core, what you cannot fit in-core would not fit
in-core, so ...

> So if it is needed to write commits, trees, and tags directly to pack
> files writing a separate function to do the would be needed.

But I am OK with this conclusion.  As the way to compute the
fallback hashes for different types of objects are very different,
compared to a single-hash world where as long as you come up with a
serialization you have only a single way to hash and name the
object.  We would end up having separate helper functions per target
type anyway, even if we kept a single entry point function like
index_stream().  The single entry point function will only be used
to just dispatch to type specific ones, so renaming what we have today
and making it clear they are for "blobs" does make sense.

next prev parent reply	other threads:[~2023-09-20  7:00 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-20  3:52 [PATCH v2] bulk-checkin: only support blobs in index_bulk_checkin Eric W. Biederman
2023-09-20  6:59 ` Junio C Hamano [this message]
2023-09-20 12:24   ` Eric W. Biederman
2023-09-26 15:58     ` [PATCH v3] " Eric W. Biederman
2023-09-26 21:48       ` Junio C Hamano
2023-09-27  1:38         ` Taylor Blau
2023-09-27  4:08           ` Junio C Hamano
2023-09-27 14:34             ` Taylor Blau
2023-09-27 16:26               ` Junio C Hamano
2023-09-27 20:06                 ` Eric W. Biederman
2023-09-27 20:13           ` Eric W. Biederman
2023-09-28  9:39       ` Oswald Buddenhagen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqr0mtcosy.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=ebiederm@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).