* [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object`
@ 2009-02-17 14:31 Ben Hoskings
2009-02-17 15:03 ` Ben Hoskings
2009-02-17 17:33 ` Jeff King
0 siblings, 2 replies; 3+ messages in thread
From: Ben Hoskings @ 2009-02-17 14:31 UTC (permalink / raw)
To: git
Hi folks,
This patch adds a proof-of-concept implementation of streaming SHA1
calculation in sha1_file.c, as demoed with `git hash-object <large
input file>`. Instead of the command's memory footprint being equal to
the input file's size, this caps it at SHA1_CHUNK_SIZE (currently 64MB).
Capping memory use this easily seems like a win, but then all this
code does is stream-calculate a SHA1 and print it to stdout. There
seem to be a lot of disparate places throughout the codebase where
objects have their SHA1 calculated.
Then again, I presume most of these are working with blobs and not
entire files, and hence wouldn't require streaming anyway. (I'm
assuming blobs don't grow large enough to warrant it - is that
necessarily true?)
The memory usage can be verified by running
while true; do ps aux | grep hash-object | grep -v grep; sleep 0.2; done
and then running `git hash-object <large input file>` in a second
terminal. The memory use stays at or below SHA1_CHUNK_SIZE until the
streamed hash is printed on the terminal and the non-streamed hash is
subsequently calculated.
On my machine, the original implementation hashed a 700MB file in
5.8sec. My patch does it in 6.2sec with SHA1_CHUNK_SIZE set to 64MB.
Cheers
Ben Hoskings
---
sha1_file.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 47 insertions(+), 0 deletions(-)
diff --git a/sha1_file.c b/sha1_file.c
index 5b6e0f6..59f0adb 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -33,6 +33,8 @@ static unsigned long sz_fmt(size_t s) { return (unsigned long)s; }
static size_t sz_fmt(size_t s) { return s; }
#endif
+#define SHA1_CHUNK_SIZE (size_t)(1024*1024*64)
+
const unsigned char null_sha1[20];
const signed char hexval_table[256] = {
@@ -2242,6 +2244,39 @@ static void write_sha1_file_prepare(const void *buf, unsigned long len,
git_SHA1_Final(sha1, &c);
}
+inline void write_sha1_fd_process_chunk(int fd, unsigned long len,
+ unsigned long offset, git_SHA_CTX *c,
+ void *buf)
+{
+ buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
+ git_SHA1_Update(c, buf, len);
+ munmap(buf, len);
+}
+
+static void write_sha1_fd_prepare(int fd, unsigned long len,
+ const char *type, unsigned char *sha1,
+ char *hdr, int *hdrlen)
+{
+ git_SHA_CTX c;
+ void *buf = NULL;
+ unsigned long offset = 0;
+
+ *hdrlen = sprintf(hdr, "%s %lu", type, len)+1;
+
+ git_SHA1_Init(&c);
+ git_SHA1_Update(&c, hdr, *hdrlen);
+
+ for (; offset + SHA1_CHUNK_SIZE <= len; offset += SHA1_CHUNK_SIZE) {
+ write_sha1_fd_process_chunk(fd, SHA1_CHUNK_SIZE, offset, &c, buf);
+ }
+
+ if (len % SHA1_CHUNK_SIZE) {
+ write_sha1_fd_process_chunk(fd, len % SHA1_CHUNK_SIZE, offset, &c, buf);
+ }
+
+ git_SHA1_Final(sha1, &c);
+}
+
/*
* Move the just written object into its final resting place
*/
@@ -2294,6 +2329,15 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type,
return 0;
}
+int hash_sha1_fd(int fd, unsigned long len, const char *type,
+ unsigned char *sha1)
+{
+ char hdr[32];
+ int hdrlen;
+ write_sha1_fd_prepare(fd, len, type, sha1, hdr, &hdrlen);
+ return 0;
+}
+
/* Finalize a file on disk, and close it. */
static void close_sha1_file(int fd)
{
@@ -2523,6 +2567,9 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
ret = -1;
strbuf_release(&sbuf);
} else if (size) {
+ hash_sha1_fd(fd, size, typename(type), sha1);
+ printf("%s <- chunked hash\n", sha1_to_hex(sha1));
+
void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
ret = index_mem(sha1, buf, size, write_object, type, path);
munmap(buf, size);
--
1.6.1.2
* Re: [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object`
2009-02-17 14:31 [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object` Ben Hoskings
@ 2009-02-17 15:03 ` Ben Hoskings
2009-02-17 17:33 ` Jeff King
1 sibling, 0 replies; 3+ messages in thread
From: Ben Hoskings @ 2009-02-17 15:03 UTC (permalink / raw)
To: git
On 18/02/2009, at 12:31 AM, Ben Hoskings wrote:
> Hi folks,
>
> This patch adds a proof-of-concept implementation of streaming SHA1
> calculation in sha1_file.c, as demoed with `git hash-object <large
> input file>`. Instead of the command's memory footprint being equal
> to the input file's size, this caps it at SHA1_CHUNK_SIZE (currently
> 64MB).
Argh, sorry—the patch was somehow corrupted; the single spaces at the
start of the surrounding lines of each hunk became double spaces. I'm
not sure what's happening there.
Here's the commit on GitHub for anyone who would like to apply the
patch.
http://github.com/benhoskings/git/commit/33b3d26ca40becf00f57008ef9d215d9287fb8e8
Cheers
Ben
* Re: [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object`
2009-02-17 14:31 [RFC/PATCH] Proof-of-concept streamed hashing, demoed in `git hash-object` Ben Hoskings
2009-02-17 15:03 ` Ben Hoskings
@ 2009-02-17 17:33 ` Jeff King
1 sibling, 0 replies; 3+ messages in thread
From: Jeff King @ 2009-02-17 17:33 UTC (permalink / raw)
To: Ben Hoskings; +Cc: git
On Wed, Feb 18, 2009 at 12:31:35AM +1000, Ben Hoskings wrote:
> This patch adds a proof-of-concept implementation of streaming SHA1
> calculation in sha1_file.c, as demoed with `git hash-object <large input
> file>`. Instead of the command's memory footprint being equal to the input
> file's size, this caps it at SHA1_CHUNK_SIZE (currently 64MB).
This might be nice to reduce the memory footprint for _some_ operations
that only need to look at the data once, but...
> Capping memory use this easily seems like a win, but then all this code
> does is stream-calculate a SHA1 and print it to stdout. There seem to be a
> lot of disparate places throughout the codebase where objects have their
> SHA1 calculated.
...as you noticed, there are a lot of other places where we need the
whole object. So you are not suddenly making git not suck with a 700MB
file on a 512MB system.
But more importantly, hash-object (via index_fd) already mmaps the input
when it is possible to do so. So on memory-tight systems, won't the OS
already release memory when appropriate? (I.e., your virtual size is
large, but the OS is free to decide which pages are actually kept in
memory.)
I suppose this is an extra hint to the OS that we don't care about past
chunks. On systems which support it, perhaps using madvise with
MADV_SEQUENTIAL would be another useful hint.
> Then again, I presume most of these are working with blobs and not entire
> files, and hence wouldn't require streaming anyway. (I'm assuming blobs
> don't grow large enough to warrant it - is that necessarily true?)
Either I don't understand what you are saying here, or you are missing
something fundamental about how git works: blobs _are_ entire files.
> On my machine, the original implementation hashed a 700MB file in 5.8sec.
> My patch does it in 6.2sec with SHA1_CHUNK_SIZE set to 64MB.
Hmph. So it's slower? I'm not sure what the advantage is, then. On a
low-memory machine, the OS's paging strategy should be reasonable,
especially with MADV_SEQUENTIAL (though I haven't tested it). And on a
machine with enough memory, it's slower.
I guess you are evicting fewer pages from other processes on the system
in the meantime.
> +inline void write_sha1_fd_process_chunk(int fd, unsigned long len,
> + unsigned long offset, git_SHA_CTX *c,
> + void *buf)
> +{
> + buf = xmmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
> + git_SHA1_Update(c, buf, len);
> + munmap(buf, len);
> +}
What is the point of the buf input parameter, which is promptly
overwritten?
> +int hash_sha1_fd(int fd, unsigned long len, const char *type,
> + unsigned char *sha1)
> +{
> + char hdr[32];
> + int hdrlen;
> + write_sha1_fd_prepare(fd, len, type, sha1, hdr, &hdrlen);
> + return 0;
> +}
What is the point of the hdr and hdrlen variables being passed in to
receive values, when we never actually look at them? I would think you
are conforming to an existing interface except that you just added the
function earlier in the patch.
-Peff