From: Jeff King <peff@peff.net>
To: Ben Hoskings <ben@hoskings.net>
Cc: git@vger.kernel.org
Subject: Re: [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object`
Date: Tue, 17 Feb 2009 12:33:03 -0500 [thread overview]
Message-ID: <20090217173302.GC31297@sigill.intra.peff.net> (raw)
In-Reply-To: <0984029E-57D0-4EFA-A060-E0B6FFA77D58@hoskings.net>
On Wed, Feb 18, 2009 at 12:31:35AM +1000, Ben Hoskings wrote:
> This patch adds a proof-of-concept implementation of streaming SHA1
> calculation in sha1_file.c, as demoed with `git hash-object <large input
> file>`. Instead of the command's memory footprint being equal to the input
> file's size, this caps it at SHA1_CHUNK_SIZE (currently 64MB).
This might be nice to reduce the memory footprint for _some_ operations
that only need to look at the data once, but...
> Capping memory use this easily seems like a win, but then all this code
> does is stream-calculate a SHA1 and print it to stdout. There seem to be a
> lot of disparate places throughout the codebase where objects have their
> SHA1 calculated.
...as you noticed, there are a lot of other places where we need the
whole object. So you are not suddenly making git not suck with a 700MB
file on a 512MB system.
But more importantly, hash-object (via index_fd) already mmaps the input
when it is possible to do so. So on memory-tight systems, won't the OS
already release memory when appropriate (i.e., your virtual size is
large, but the OS is free to optimize what is actually paged in as
appropriate).
I suppose this is an extra hint to the OS that we don't care about past
chunks. On systems which support it, perhaps using madvise with
MADV_SEQUENTIAL would be another useful hint.
> Then again, I presume most of these are working with blobs and not entire
> files, and hence wouldn't require streaming anyway. (I'm assuming blobs
> don't grow large enough to warrant it - is that necessarily true?)
Either I don't understand what you are saying here, or you are missing
something fundamental about hwo git works: blobs _are_ entire files.
> On my machine, the original implementation hashed a 700MB file in 5.8sec.
> My patch does it in 6.2sec with SHA1_CHUNK_SIZE set to 64MB.
Hmph. So it's slower? I'm not sure what the advantage is, then. On a
low-memory machine, the OS's paging strategy should be reasonable,
especially with MADV_SEQUENTIAL (though I haven't tested it). And on a
machine with enough memory, it's slower.
I guess you are evicting fewer pages from other processes on the system
in the meantime.
> +inline void write_sha1_fd_process_chunk(int fd, unsigned long len,
> + unsigned long offset,
> git_SHA_CTX *c,
> + void *buf)
> +{
> + buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
> + git_SHA1_Update(c, buf, len);
> + munmap(buf, len);
> +}
What is the point of the buf input parameter, which is promptly
overwritten?
> +int hash_sha1_fd(int fd, unsigned long len, const char *type,
> + unsigned char *sha1)
> +{
> + char hdr[32];
> + int hdrlen;
> + write_sha1_fd_prepare(fd, len, type, sha1, hdr, &hdrlen);
> + return 0;
> +}
What is the point of the hdr and hdrlen variables being passed in to
receive values, when we never actually look at them? I would think you
are conforming to an existing interface except that you just added the
function earlier in the patch.
-Peff
prev parent reply other threads:[~2009-02-17 17:34 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-02-17 14:31 [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object` Ben Hoskings
2009-02-17 15:03 ` Ben Hoskings
2009-02-17 17:33 ` Jeff King [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090217173302.GC31297@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=ben@hoskings.net \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).