Re: [PATCH] csum-file: flush less often

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, peff@peff.net, me@ttaylorr.com,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH] csum-file: flush less often
Date: Thu, 25 Mar 2021 11:46:20 -0700	[thread overview]
Message-ID: <xmqq8s6bvzpf.fsf@gitster.g> (raw)
In-Reply-To: <84ccabca-0bd3-d0cb-6b38-f96d75c0bbd6@gmail.com> (Derrick Stolee's message of "Thu, 25 Mar 2021 07:55:55 -0400")

Derrick Stolee <stolee@gmail.com> writes:

> But...is there something we could still do here?
>
> My confusion about flushing is mostly due to my error, but upon
> reflection the loop is doing a lot of different things, but most of
> the time we know which behavior we need at the start, in the middle,
> and at the end:
>
>      1. Fill the existing buffer with the beginning of 'buf'. If the
>         hashfile's buffer is full, then flush.

"But do not do this if f->buffer is empty, and we are writing out
more than sizeof(f->buffer)." is missing, isn't it?

>      2. Flush sizeof(f->buffer) chunks directly out of 'buf' as long as
>         possible.
>     
>      3. Copy the remaining bytes out of 'buf' into the hashfile's buffer.

It is debatable if the existing loop, which came mostly from Nico's
a8032d12 (sha1write: don't copy full sized buffers, 2008-09-02), is
too clever; I personally find it concise and readable enough, but my
reading is tainted.

If you use a couple of helpers to reduce the repeated "crc and hash"
pattern in your variant, it may become easier to follow than the
original, but I dunno.

> Here is a rewrite that more explicitly follows this flow:
>
> void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
> {
> 	const int full_buffer = sizeof(f->buffer);
> 	unsigned left = full_buffer - f->offset;
> 	unsigned nr = count > left ? left : count;
>
> 	/*
> 	 * Initially fill the buffer in a batch until it
> 	 * is full, then flush.
> 	 */
> 	if (f->do_crc)
> 		f->crc32 = crc32(f->crc32, buf, nr);
>
> 	memcpy(f->buffer + f->offset, buf, nr);

Here, if the f->buffer was empty, we end up memcpy a full bufferful
unconditionally.  Nico's original cleverly takes advantage of the
fact that 'nr' would be the full buffer size only when the f->buffer
was empty upon entry to the function and we have more byte than the
size of the buffer to copy out directly from 'buf'.

> 	f->offset += nr;
> 	count -= nr;
> 	buf = (char *) buf + nr;
>
> 	if (left == nr)
> 		hashflush(f);
>
> 	/*
> 	 * After filling the hashfile's buffer and flushing, take
> 	 * batches of full_buffer bytes directly from the input
> 	 * buffer.
> 	 */
> 	while (count >= full_buffer) {
> 		if (f->do_crc)
> 			f->crc32 = crc32(f->crc32, buf, full_buffer);
>
> 		the_hash_algo->update_fn(&f->ctx, buf, full_buffer);
> 		flush(f, buf, full_buffer);
>
> 		count -= full_buffer;
> 		buf = (char *) buf + full_buffer;
> 	}
>
> 	/*
> 	 * Capture any remaining bytes at the end of the input buffer
> 	 * into the hashfile's buffer. We do not need to flush because
> 	 * count is strictly less than full_buffer here.
> 	 */
> 	if (count) {
> 		if (f->do_crc)
> 			f->crc32 = crc32(f->crc32, buf, count);
>
> 		memcpy(f->buffer + f->offset, buf, count);
> 		f->offset = count;
> 	}
> 	
> 	if (f->base)
> 		hashwrite(f->base, buf, count);
> }
> ...
> So, I'm of two minds here:
>
>  1. This is embarassing. I wasted everyone's time for nothing. I can retract
>     this patch.
>
>  2. This is embarassing. I overstated the problem here. But we might be able
>     to eke out a tiny performance boost here.
>
> I'm open to either. I think we should default to dropping this patch unless
> someone thinks the rewrite above is a better organization of the logic. (I
> can then send a v2 including that version and an updated commit message.)

3. The current code around "if (nr == sizeof(f->buffer))" might be a
   bit too clever for readers who try to understand what is going
   on, and the whole "while" loop may deserve a comment based on
   what you wrote before your replacement implementation.

next prev parent reply	other threads:[~2021-03-25 18:47 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-24 17:50 [PATCH] csum-file: flush less often Derrick Stolee via GitGitGadget
2021-03-25 11:55 ` Derrick Stolee
2021-03-25 18:46   ` Junio C Hamano [this message]
2021-03-25 18:52     ` Junio C Hamano
2021-03-26  3:16       ` Jeff King
2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
2021-03-26 21:38   ` Junio C Hamano
2021-03-28  8:38   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqq8s6bvzpf.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.