git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
@ 2007-04-08 23:22 Dana How
  2007-04-09  0:04 ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Dana How @ 2007-04-08 23:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, danahow


Add our own version of the one in fast-import.c here.
Will need this later to correct an incorrect object
count in the pack header.

Signed-off-by: Dana How <how@deathvalley.cswitch.com>
---
 builtin-pack-objects.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 8415549..7ab0712 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -518,6 +518,46 @@ static off_t write_one(struct sha1file *f,
 	return offset + write_object(f, e);
 }
 
+/*
+ * Move this, the version in fast-import.c,
+ * and index_pack.c:readjust_pack_header_and_sha1 into sha1_file.c ?
+ */
+static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
+				char *pack_name, uint32_t object_count)
+{
+	static const int buf_sz = 128 * 1024;
+	SHA_CTX c;
+	struct pack_header hdr;
+	char *buf;
+
+	if (lseek(pack_fd, 0, SEEK_SET) != 0)
+		die("Failed seeking to start: %s", strerror(errno));
+	if (read_in_full(pack_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+		die("Unable to reread header of %s", pack_name);
+	if (lseek(pack_fd, 0, SEEK_SET) != 0)
+		die("Failed seeking to start: %s", strerror(errno));
+	hdr.hdr_entries = htonl(object_count);
+	write_or_die(pack_fd, &hdr, sizeof(hdr));
+
+	SHA1_Init(&c);
+	SHA1_Update(&c, &hdr, sizeof(hdr));
+
+	buf = xmalloc(buf_sz);
+	for (;;) {
+		size_t n = xread(pack_fd, buf, buf_sz);
+		if (!n)
+			break;
+		if (n < 0)
+			die("Failed to checksum %s", pack_name);
+		SHA1_Update(&c, buf, n);
+	}
+	free(buf);
+
+	SHA1_Final(pack_file_sha1, &c);
+	write_or_die(pack_fd, pack_file_sha1, 20);
+	close(pack_fd);
+}
+
 typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *);
 
 static entry_sort_t current_sort;
-- 
1.5.1.89.g8abf0

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-08 23:22 [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer() Dana How
@ 2007-04-09  0:04 ` Junio C Hamano
  2007-04-09  0:18   ` Nicolas Pitre
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2007-04-09  0:04 UTC (permalink / raw)
  To: Dana How; +Cc: Git Mailing List, Shawn Pearce, Nicolas Pitre

Dana How <danahow@gmail.com> writes:

> +/*
> + * Move this, the version in fast-import.c,
> + * and index_pack.c:readjust_pack_header_and_sha1 into sha1_file.c ?
> + */
> +static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
> +				char *pack_name, uint32_t object_count)
> +{

Indeed that is a very good point.

I admit I did not notice we already had the duplication between
fast-import.c and index-pack.c

Shawn, Nico, what do you think?  Wouldn't it be better to
refactor them first, independent of Dana's series?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09  0:04 ` Junio C Hamano
@ 2007-04-09  0:18   ` Nicolas Pitre
  2007-04-09 17:38     ` Shawn O. Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Nicolas Pitre @ 2007-04-09  0:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Dana How, Git Mailing List, Shawn Pearce

On Sun, 8 Apr 2007, Junio C Hamano wrote:

> Dana How <danahow@gmail.com> writes:
> 
> > +/*
> > + * Move this, the version in fast-import.c,
> > + * and index_pack.c:readjust_pack_header_and_sha1 into sha1_file.c ?
> > + */
> > +static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
> > +				char *pack_name, uint32_t object_count)
> > +{
> 
> Indeed that is a very good point.
> 
> I admit I did not notice we already had the duplication between
> fast-import.c and index-pack.c
> 
> Shawn, Nico, what do you think?  Wouldn't it be better to
> refactor them first, independent of Dana's series?

Probably, yes.  But probably not in sha1_file.c though.  This file is 
getting a bit large already, and it deals with pack reading only not 
pack writing.

I think another file with common pack writing functions could be 
created.  Pack index writing is another item that is currently 
duplicated in pack-objects and index-pack for example.


Nicolas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09  0:18   ` Nicolas Pitre
@ 2007-04-09 17:38     ` Shawn O. Pearce
  2007-04-09 18:30       ` Nicolas Pitre
  0 siblings, 1 reply; 18+ messages in thread
From: Shawn O. Pearce @ 2007-04-09 17:38 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, Dana How, Git Mailing List

Nicolas Pitre <nico@cam.org> wrote:
> On Sun, 8 Apr 2007, Junio C Hamano wrote:
> 
> > Dana How <danahow@gmail.com> writes:
> > 
> > > +/*
> > > + * Move this, the version in fast-import.c,
> > > + * and index_pack.c:readjust_pack_header_and_sha1 into sha1_file.c ?
> > > + */
> > > +static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
> > > +				char *pack_name, uint32_t object_count)
> > > +{
> > 
> > Indeed that is a very good point.
> > 
> > I admit I did not notice we already had the duplication between
> > fast-import.c and index-pack.c
> > 
> > Shawn, Nico, what do you think?  Wouldn't it be better to
> > refactor them first, independent of Dana's series?
> 
> Probably, yes.  But probably not in sha1_file.c though.  This file is 
> getting a bit large already, and it deals with pack reading only not 
> pack writing.
> 
> I think another file with common pack writing functions could be 
> created.  Pack index writing is another item that is currently 
> duplicated in pack-objects and index-pack for example.

I agree entirely.  And I'd like to see that refactoring occur
before this series, or as part of it.  At least for the nr_objects
correction routine.  To be honest we should have done that when
fast-import.c and index-pack.c both needed that logic, but we didn't.
I don't remember whose version showed up first in Junio's tree
(I think it was index-pack.c) but the other one (probably me with
fast-import.c) should have done the refactoring then.

We already have *waaay* too many functions that know packfile
structure.  I'd like to see that decline, but unfortunately a number
of them are using rather specialized data structures so it makes
things somewhat difficult.

For example, writing objects to a packfile: we have 3
implementations.  fast-import.c doesn't use sha1write_compressed
because that was a waste of time to compute the SHA_CTX when we
know we have to go back and fixup nr_objects.  It also doesn't use
it because fast-import.c's pack-splitting logic is based on the
final object size, not the starting offset.  It does the deflate
itself, decides if the end of the object will overflow, and if so,
jumps to a new packfile.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09 17:38     ` Shawn O. Pearce
@ 2007-04-09 18:30       ` Nicolas Pitre
  2007-04-09 18:40         ` Shawn O. Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Nicolas Pitre @ 2007-04-09 18:30 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Junio C Hamano, Dana How, Git Mailing List

On Mon, 9 Apr 2007, Shawn O. Pearce wrote:

> Nicolas Pitre <nico@cam.org> wrote:
> > I think another file with common pack writing functions could be 
> > created.  Pack index writing is another item that is currently 
> > duplicated in pack-objects and index-pack for example.
> 
> I agree entirely.  And I'd like to see that refactoring occur
> before this series, or as part of it.  At least for the nr_objects
> correction routine.

That's easy enough for the nr_objects correction case.

The next obvious candidate would be index generation, but I'd wait a bit 
for all features in progress to be merged and stabilize first.

> For example, writing objects to a packfile: we have 3
> implementations.  fast-import.c doesn't use sha1write_compressed
> because that was a waste of time to compute the SHA_CTX when we
> know we have to go back and fixup nr_objects.  It also doesn't use
> it because fast-import.c's pack-splitting logic is based on the
> final object size, not the starting offset.  It does the deflate
> itself, decides if the end of the object will overflow, and if so,
> jumps to a new packfile.

I'd be really tempted to create a pack v4 which only change is to still 
have the pack header at the beginning of the pack like we do today, but 
include the header in the pack SHA1 computation at the end of the stream 
only.  This way the pack SHA1 could be computed as the pack is 
generated, and the header fixed up without having to read the entire 
pack back.  I think it was Geert Bosch who proposed this and it makes 
tons of sense IMHO.


Nicolas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09 18:30       ` Nicolas Pitre
@ 2007-04-09 18:40         ` Shawn O. Pearce
  2007-04-09 19:11           ` Dana How
  0 siblings, 1 reply; 18+ messages in thread
From: Shawn O. Pearce @ 2007-04-09 18:40 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, Dana How, Git Mailing List

Nicolas Pitre <nico@cam.org> wrote:
> I'd be really tempted to create a pack v4 which only change is to still 
> have the pack header at the beginning of the pack like we do today, but 
> include the header in the pack SHA1 computation at the end of the stream 
> only.  This way the pack SHA1 could be computed as the pack is 
> generated, and the header fixed up without having to read the entire 
> pack back.  I think it was Geert Bosch who proposed this and it makes 
> tons of sense IMHO.

Yes.  If we really are heading in this direction of needing to
correct object counts, we should make that change.  Its trivial
to hang onto that header for the duration of the rest of the data
processing, and tack it onto the end for final SHA-1 computation.

Since pack v4 looks like it will be a dev cycle longer than these
index format changes and pack-splitting changes, I have to say I
agree with you.  Lets move "pack v4" back to "pack v5" and make v4
just a shift of where the header is included in the SHA-1.

We're heading where I said I didn't want to go, which is two
file format changes in 2007.  But I think that ship has already
sailed...  Folks need support for larger repositories now, and pack
v4/v5 (whatever you call our dictionary work) just isn't ready.
Nor does it solve the big packfile problems that this current work
is addressing.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09 18:40         ` Shawn O. Pearce
@ 2007-04-09 19:11           ` Dana How
  2007-04-09 19:33             ` Nicolas Pitre
  0 siblings, 1 reply; 18+ messages in thread
From: Dana How @ 2007-04-09 19:11 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Nicolas Pitre, Junio C Hamano, Git Mailing List, danahow

On 4/9/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > I'd be really tempted to create a pack v4 which only change is to still
> > have the pack header at the beginning of the pack like we do today, but
> > include the header in the pack SHA1 computation at the end of the stream
> > only.  This way the pack SHA1 could be computed as the pack is
> > generated, and the header fixed up without having to read the entire
> > pack back.  I think it was Geert Bosch who proposed this and it makes
> > tons of sense IMHO.
>
> Yes.  If we really are heading in this direction of needing to
> correct object counts, we should make that change.  Its trivial
> to hang onto that header for the duration of the rest of the data
> processing, and tack it onto the end for final SHA-1 computation.

I like the property that when an SHA-1 appears at the end of a file,
it is a checksum of every byte before it.  The ideas above are a
departure from that.  Do we want this rule to be different for each file type?

Wouldn't the following address the "object count unknown
at the start of sequential pack writing" problem:
  Write 0 for object count in the header. This is a flag to look for
  another header of same format just before the final SHA-1 which
  has the correct count. The SHA-1 is still a checksum of everything
  before it and no seeking/rewriting is needed on generation.  When
  reading the object count from a .pack file, you might need to add
      xread(pack_fd, &header, sizeof(header));
+    if (!header.object_count) {
+      lseek(pack_fd, -20-sizeof(header), SEEK_END);
+      xread(pack_fd, &header, sizeof(header);
+    }
  Or maybe you want this before the object_list_sha1 instead (20->40).

Finally, when I generate several 2GB split packfiles,  I do notice
the slight delay for fixup_header_footer(), and I do think it's a bit
ugly, but in quantitative terms it's an insignificant part of a long
operation that's infrequently performed.  Does this need to be
optimized at all?

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09 19:11           ` Dana How
@ 2007-04-09 19:33             ` Nicolas Pitre
  2007-04-09 21:38               ` Dana How
  0 siblings, 1 reply; 18+ messages in thread
From: Nicolas Pitre @ 2007-04-09 19:33 UTC (permalink / raw)
  To: Dana How; +Cc: Shawn O. Pearce, Junio C Hamano, Git Mailing List

On Mon, 9 Apr 2007, Dana How wrote:

> Wouldn't the following address the "object count unknown
> at the start of sequential pack writing" problem:
>  Write 0 for object count in the header. This is a flag to look for
>  another header of same format just before the final SHA-1 which
>  has the correct count. The SHA-1 is still a checksum of everything
>  before it and no seeking/rewriting is needed on generation.

No.  You really wants to know up front how many objects a pack contains 
when streaming it.  And this is not only for packs written to stdout.

> Finally, when I generate several 2GB split packfiles,  I do notice
> the slight delay for fixup_header_footer(), and I do think it's a bit
> ugly, but in quantitative terms it's an insignificant part of a long
> operation that's infrequently performed.  Does this need to be
> optimized at all?

Maybe, maybe not.  That depends how much data we think GIT could be used 
to manage in the future.  With a 1TB pack file you definitely want to 
optimize that case.

OTOH this could wait for the real pack v4 too.


Nicolas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09 19:33             ` Nicolas Pitre
@ 2007-04-09 21:38               ` Dana How
  2007-04-09 23:22                 ` Nicolas Pitre
  0 siblings, 1 reply; 18+ messages in thread
From: Dana How @ 2007-04-09 21:38 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Shawn O. Pearce, Junio C Hamano, Git Mailing List, danahow

On 4/9/07, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 9 Apr 2007, Dana How wrote:
> > Wouldn't the following address the "object count unknown
> > at the start of sequential pack writing" problem:
> >  Write 0 for object count in the header. This is a flag to look for
> >  another header of same format just before the final SHA-1 which
> >  has the correct count. The SHA-1 is still a checksum of everything
> >  before it and no seeking/rewriting is needed on generation.
>
> No.  You really wants to know up front how many objects a pack contains
> when streaming it.  And this is not only for packs written to stdout.
OK, let me ask a dumb question and flog one last additional obvious idea.

Does your wanting to know stem from more than wanting
to stick to one malloc of all the object info at once?

My suggestion quoted above is actually a change to the .pack format.
With all the other ideas for .pack format changes floating around,
let me withdraw that and suggest a simpler one: write a "0" in the header,
and terminate the pack with a sentinel in object format before the final SHA-1s.
The sentinel would be type=OBJ_NONE/length=0, i.e. a null byte.
"Not much" would need to be updated to tolerate it and
you could count objects while looking for it (if header has 0)
during normal processing.  (I'm reacting to your word "streaming".)

> > Finally, when I generate several 2GB split packfiles,  I do notice
> > the slight delay for fixup_header_footer(), and I do think it's a bit
> > ugly, but in quantitative terms it's an insignificant part of a long
> > operation that's infrequently performed.  Does this need to be
> > optimized at all?
> Maybe, maybe not.  That depends how much data we think GIT could be used
> to manage in the future.  With a 1TB pack file you definitely want to
> optimize that case.
OK.  Just FYI, we have a perforce repository near 200GB and
this is not what would concern me right now if we converted all
or part of it to git.  Of course that would depend on the packing schedule.

> OTOH this could wait for the real pack v4 too.
Makes sense to me.  The fewer format changes the better.

BTW,  I've caught up on reading the mailing list archives,
but I don't recall seeing any overview of the objectives of pack v2/v3/v4.
Does that exist any where? I didn't see it in Documentation or
Documentation/technical. It would probably reduce uninformed
questions like the above.  I've deduced rationales for what
miscellaneous details I have seen,
except moving the SHA-1s from .idx to .pack (?).

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-09 21:38               ` Dana How
@ 2007-04-09 23:22                 ` Nicolas Pitre
  0 siblings, 0 replies; 18+ messages in thread
From: Nicolas Pitre @ 2007-04-09 23:22 UTC (permalink / raw)
  To: Dana How; +Cc: Shawn O. Pearce, Junio C Hamano, Git Mailing List

On Mon, 9 Apr 2007, Dana How wrote:

> On 4/9/07, Nicolas Pitre <nico@cam.org> wrote:
> > On Mon, 9 Apr 2007, Dana How wrote:
> > > Wouldn't the following address the "object count unknown
> > > at the start of sequential pack writing" problem:
> > >  Write 0 for object count in the header. This is a flag to look for
> > >  another header of same format just before the final SHA-1 which
> > >  has the correct count. The SHA-1 is still a checksum of everything
> > >  before it and no seeking/rewriting is needed on generation.
> > 
> > No.  You really wants to know up front how many objects a pack contains
> > when streaming it.  And this is not only for packs written to stdout.
> OK, let me ask a dumb question and flog one last additional obvious idea.
> 
> Does your wanting to know stem from more than wanting
> to stick to one malloc of all the object info at once?

That, plus progress reporting when fetching.  And progress reporting is 
probably the most important thing for the user experience.

> My suggestion quoted above is actually a change to the .pack format.
> With all the other ideas for .pack format changes floating around,
> let me withdraw that and suggest a simpler one: write a "0" in the header,
> and terminate the pack with a sentinel in object format before the final
> SHA-1s.
> The sentinel would be type=OBJ_NONE/length=0, i.e. a null byte.
> "Not much" would need to be updated to tolerate it and
> you could count objects while looking for it (if header has 0)
> during normal processing.  (I'm reacting to your word "streaming".)

When streaming you really want to know up front how much to expect / 
wait for.

> BTW,  I've caught up on reading the mailing list archives,
> but I don't recall seeing any overview of the objectives of pack v2/v3/v4.
> Does that exist any where? I didn't see it in Documentation or
> Documentation/technical. It would probably reduce uninformed
> questions like the above.  I've deduced rationales for what
> miscellaneous details I have seen,
> except moving the SHA-1s from .idx to .pack (?).

I'm sure this exists in the archive as I recall sending a summary about 
that a while ago.

In short:

 - Pack v1 was the initial implementation from Linus.  It had some flaws
   and didn't exist for more than 2 or 3 days.  It was never used in any 
   official release.

 - Pack v2 is what GIT still produces today.

 - Pack v3 was created when a bit in the delta encoding was redefined.
   See commit d60fc1c8649f8 for the details.  Because it caused too much
   compatibility issues when we attempted to enable pack v3 at the time, 
   we reverted pack generation to v2.

 - Pack v4 has a much larger scope.  This is the WIP from Shawn Pearce 
   and I and I know for sure Shawn already posted a detailed design 
   overview to the list already.


Nicolas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
@ 2007-04-30 23:21 Dana How
  2007-05-01  5:06 ` Shawn O. Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Dana How @ 2007-04-30 23:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, danahow


Add our own version of the one in fast-import.c here.
Needed later to correct bad object count in header for split pack.

Signed-off-by: Dana L. How <danahow@gmail.com>
---
 builtin-pack-objects.c |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index bc45ca6..98066bf 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -562,6 +562,42 @@ static off_t write_one(struct sha1file *f,
 	return offset + size;
 }
 
+static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
+				char *pack_name, uint32_t object_count)
+{
+	static const int buf_sz = 128 * 1024;
+	SHA_CTX c;
+	struct pack_header hdr;
+	char *buf;
+
+	if (lseek(pack_fd, 0, SEEK_SET) != 0)
+		die("Failed seeking to start: %s", strerror(errno));
+	if (read_in_full(pack_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+		die("Unable to reread header of %s", pack_name);
+	if (lseek(pack_fd, 0, SEEK_SET) != 0)
+		die("Failed seeking to start: %s", strerror(errno));
+	hdr.hdr_entries = htonl(object_count);
+	write_or_die(pack_fd, &hdr, sizeof(hdr));
+
+	SHA1_Init(&c);
+	SHA1_Update(&c, &hdr, sizeof(hdr));
+
+	buf = xmalloc(buf_sz);
+	for (;;) {
+		size_t n = xread(pack_fd, buf, buf_sz);
+		if (!n)
+			break;
+		if (n < 0)
+			die("Failed to checksum %s", pack_name);
+		SHA1_Update(&c, buf, n);
+	}
+	free(buf);
+
+	SHA1_Final(pack_file_sha1, &c);
+	write_or_die(pack_fd, pack_file_sha1, 20);
+	close(pack_fd);
+}
+
 /* forward declarations for write_pack_file */
 static void write_index_file(off_t last_obj_offset, unsigned char *sha1);
 static int adjust_perm(const char *path, mode_t mode);
-- 
1.5.2.rc0.766.gba60-dirty

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-04-30 23:21 Dana How
@ 2007-05-01  5:06 ` Shawn O. Pearce
  2007-05-01  5:41   ` Dana How
  0 siblings, 1 reply; 18+ messages in thread
From: Shawn O. Pearce @ 2007-05-01  5:06 UTC (permalink / raw)
  To: Dana How; +Cc: Junio C Hamano, Git Mailing List

Dana How <danahow@gmail.com> wrote:
> Add our own version of the one in fast-import.c here.
> Needed later to correct bad object count in header for split pack.
...
> diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> index bc45ca6..98066bf 100644
> --- a/builtin-pack-objects.c
> +++ b/builtin-pack-objects.c
> @@ -562,6 +562,42 @@ static off_t write_one(struct sha1file *f,
>  	return offset + size;
>  }
>  
> +static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
> +				char *pack_name, uint32_t object_count)
> +{

This looks a *lot* like the code in fast-import.c.  Why not
refactor both to use the same implementation and stuff it away in
say pack-check.c (for lack of a better place), or start a new file
(pack-write.c)?

There is a *lot* of code in fast-import.c (over 2,000 lines) that
was half-copied from other core code, and that was half created
on its own.  This is also true in index-pack.c and pack-objects.c;
I'd like to see these implementations unify more rather than copy
code from each other.

I know git-blame will identify the original author quite well,
but I'd really like to avoid adding lots more code to maintain
if we can avoid it.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-05-01  5:06 ` Shawn O. Pearce
@ 2007-05-01  5:41   ` Dana How
  2007-05-01  6:03     ` Shawn O. Pearce
  0 siblings, 1 reply; 18+ messages in thread
From: Dana How @ 2007-05-01  5:41 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Junio C Hamano, Git Mailing List, danahow

On 4/30/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Dana How <danahow@gmail.com> wrote:
> > Add our own version of the one in fast-import.c here.
> > Needed later to correct bad object count in header for split pack.
> ...
> > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> > --- a/builtin-pack-objects.c
> > +++ b/builtin-pack-objects.c
> > @@ -562,6 +562,42 @@ static off_t write_one(struct sha1file *f,
> > +static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
> > +                             char *pack_name, uint32_t object_count)
> > +{
>
> This looks a *lot* like the code in fast-import.c.
I hope so!  That's where I copied it from.

> Why not
> refactor both to use the same implementation and stuff it away in
> say pack-check.c (for lack of a better place), or start a new file
> (pack-write.c)?
Actually I didn't just copy it, I tried to rewrite it for my use
as well as the fast-import.c use (note there is a 3rd copy
in some *index*.c file which I didn't try to merge in yet).
However I didn't yet put it in a new file or change fast-import.c
to call it since I wanted to change as little as possible.

> There is a *lot* of code in fast-import.c (over 2,000 lines) that
> was half-copied from other core code, and that was half created
> on its own.  This is also true in index-pack.c and pack-objects.c;
> I'd like to see these implementations unify more rather than copy
> code from each other.
>
> I know git-blame will identify the original author quite well,
> but I'd really like to avoid adding lots more code to maintain
> if we can avoid it.

I agree with all your arguments.  I had several reasons
to avoid extra rearrangements/refactorings:
(a) First patch to git, not previously known to me;
(b) I prefer to separate new functionality from "clean-up" work;
(c) I didn't really view myself as the person to make the *writing*
    code in git as well organized/minimized as the *reading* code
    [e.g. the sliding mmap stuff -- nice!];
(d) Apparently you and Nicolas Pitre have a lot of pending changes
    affecting the packing code.

I'd have no problem submitting a follow-on patch later containing
some clean-up work if you & NP clear it, so I know I won't have
problems from (d).  Note I had to completely rewrite this patch
when NP submitted some of his pending stuff.

NP wrote that you posted a summary of v3/v4 pack ideas,
but I couldn't find it in the list archives.  Could you either
email it to me, or (re-)post it to the list?

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-05-01  5:41   ` Dana How
@ 2007-05-01  6:03     ` Shawn O. Pearce
  2007-05-01  7:32       ` Johannes Schindelin
  2007-05-01 17:48       ` Nicolas Pitre
  0 siblings, 2 replies; 18+ messages in thread
From: Shawn O. Pearce @ 2007-05-01  6:03 UTC (permalink / raw)
  To: Dana How; +Cc: Junio C Hamano, Git Mailing List

Dana How <danahow@gmail.com> wrote:
> On 4/30/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> >Why not
> >refactor both to use the same implementation and stuff it away in
> >say pack-check.c (for lack of a better place), or start a new file
> >(pack-write.c)?
> Actually I didn't just copy it, I tried to rewrite it for my use
> as well as the fast-import.c use (note there is a 3rd copy
> in some *index*.c file which I didn't try to merge in yet).
> However I didn't yet put it in a new file or change fast-import.c
> to call it since I wanted to change as little as possible.
...
> I agree with all your arguments.  I had several reasons
> to avoid extra rearrangements/refactorings:
> (a) First patch to git, not previously known to me;
> (b) I prefer to separate new functionality from "clean-up" work;

A really good reason.  ;-)

But I'd still rather see it done right the first time, then done
partially (copied) and wait for someone to clean it up later.
Sometimes that cleanup doesn't happen.  Anyway, I'm not against
you copying it, I just think there's a better way (refactor at
least fast-import's use of it).
 
> (c) I didn't really view myself as the person to make the *writing*
>    code in git as well organized/minimized as the *reading* code
>    [e.g. the sliding mmap stuff -- nice!];

Not sure I follow that argument, but thanks for the compliment on
the sliding mmap work.  I think I was trying to point out that that
one particular function is actually quite simple, and you did a
direct copy of it from fast-import.c.  Instead a simple refactoring
that allows both pack-objects.c and fast-import.c share the same
implementation of those 30 or so lines of code would be a good
start towards cleaning up the writing code.  Yes it doesn't help
the index-pack.c usage of same logic, but baby steps will help us
to cleanup the mess we have already made!

We like baby steps around here...  ;-)

> (d) Apparently you and Nicolas Pitre have a lot of pending changes
>    affecting the packing code.

Yes, but Nico's work has also destroyed in pack v4 topic.  Nico has
promised to start working on porting some of that work over, but I
don't know if he has been able to start doing so yet.  I personally
have been too busy this past month and a half to really work on
packv4, but I'm hoping to circle back to it before the end of May
(if Nico doesn't beat me to it!).

> I'd have no problem submitting a follow-on patch later containing
> some clean-up work if you & NP clear it, so I know I won't have
> problems from (d).  Note I had to completely rewrite this patch
> when NP submitted some of his pending stuff.

Yea, hazard of working in this part of the code when Nico is
also active.  My own sliding mmap stuff was written twice too,
for the same reason - Nico doing much needed improvements right in
the same spot as I was working, at the same time.
 
> NP wrote that you posted a summary of v3/v4 pack ideas,
> but I couldn't find it in the list archives.  Could you either
> email it to me, or (re-)post it to the list?

You can start here:

  http://news.gmane.org/find-root.php?group=gmane.comp.version-control.git&article=40799
  http://article.gmane.org/gmane.comp.version-control.git/42423
  http://article.gmane.org/gmane.comp.version-control.git/45406
  http://news.gmane.org/find-root.php?group=gmane.comp.version-control.git&article=43046

anything by me or Nico in those threads is good reading on pack v4.
The last link is probably the best thread available on the subject.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-05-01  6:03     ` Shawn O. Pearce
@ 2007-05-01  7:32       ` Johannes Schindelin
  2007-05-01 17:48       ` Nicolas Pitre
  1 sibling, 0 replies; 18+ messages in thread
From: Johannes Schindelin @ 2007-05-01  7:32 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Dana How, Junio C Hamano, Git Mailing List

Hi,

On Tue, 1 May 2007, Shawn O. Pearce wrote:

> Dana How <danahow@gmail.com> wrote:
>
> > On 4/30/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> >
> > >Why not refactor both to use the same implementation and stuff it 
> > >away in say pack-check.c (for lack of a better place), or start a new 
> > >file (pack-write.c)?
> >
> > Actually I didn't just copy it, I tried to rewrite it for my use as 
> > well as the fast-import.c use (note there is a 3rd copy in some 
> > *index*.c file which I didn't try to merge in yet). However I didn't 
> > yet put it in a new file or change fast-import.c to call it since I 
> > wanted to change as little as possible.
> ...
> > I agree with all your arguments.  I had several reasons
> > to avoid extra rearrangements/refactorings:
> > (a) First patch to git, not previously known to me;
> > (b) I prefer to separate new functionality from "clean-up" work;
> 
> A really good reason.  ;-)
> 
> But I'd still rather see it done right the first time, then done
> partially (copied) and wait for someone to clean it up later.

FWIW it is not just a clean up in the sense of making the code more 
elegant. It is about catching bugs. The smaller the code base, and the 
more code is reused for common tasks, the easier it is to

- get at bugs,
- fix bugs, and
- prevent bugs

Some might even argue that the elegance of the code is a direct indicator 
of its quality.

Obviously, I am suggesting to go the slightly slower route. In effect, 
this will turn out to be the faster route, though, since bug fixing 
traditionally makes up for 90% of development time.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-05-01  6:03     ` Shawn O. Pearce
  2007-05-01  7:32       ` Johannes Schindelin
@ 2007-05-01 17:48       ` Nicolas Pitre
  2007-05-01 17:58         ` Dana How
  1 sibling, 1 reply; 18+ messages in thread
From: Nicolas Pitre @ 2007-05-01 17:48 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Dana How, Junio C Hamano, Git Mailing List

On Tue, 1 May 2007, Shawn O. Pearce wrote:

> Dana How <danahow@gmail.com> wrote:
> > On 4/30/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> > >Why not
> > >refactor both to use the same implementation and stuff it away in
> > >say pack-check.c (for lack of a better place), or start a new file
> > >(pack-write.c)?
> > Actually I didn't just copy it, I tried to rewrite it for my use
> > as well as the fast-import.c use (note there is a 3rd copy
> > in some *index*.c file which I didn't try to merge in yet).
> > However I didn't yet put it in a new file or change fast-import.c
> > to call it since I wanted to change as little as possible.
> ...
> > I agree with all your arguments.  I had several reasons
> > to avoid extra rearrangements/refactorings:
> > (a) First patch to git, not previously known to me;
> > (b) I prefer to separate new functionality from "clean-up" work;
> 
> A really good reason.  ;-)
> 
> But I'd still rather see it done right the first time, then done
> partially (copied) and wait for someone to clean it up later.
> Sometimes that cleanup doesn't happen.

Well I intended to do more cleanups in the pack code eventually.  That 
included the index writing and pack header fixing.  But I was expecting 
for the pack splitting changes to go in first as it is likely to impose 
some requirements of its own. It is then easier to have a proper 
interface common to all users after everything is in place.

> > (d) Apparently you and Nicolas Pitre have a lot of pending changes
> >    affecting the packing code.
> 
> Yes, but Nico's work has also destroyed in pack v4 topic.  Nico has
> promised to start working on porting some of that work over, but I
> don't know if he has been able to start doing so yet.

It shouldn't be too hard to port the existing code.  Most of it is new 
code that hooks into the existing code in a limited way.

> I personally have been too busy this past month and a half to really 
> work on packv4, but I'm hoping to circle back to it before the end of 
> May (if Nico doesn't beat me to it!).

While the current pack v4 branch is certainly valuable, I consider it 
more as a proof of concept and a test bench.  In practice it isn't 
really efficient and it won't be able to show its full potential until 
core code like tree walking is better abstracted seamless processing of 
parallel tree representations.

But we're getting there slowly.  My work on index v2 will make the pack 
v4 changes much smaller in that area.  My progress display rework was a 
direct "huh!" reaction to the extra progress reporting the current pack 
v4 code added.  And the general pack-objects.c refactoring was made with 
a look on better and easier pack v4 integration in the future.

So in my mind pack v4 already started to make its way in the main code 
in subtle ways.  ;-)  But since I may do Git work only when I'm bored 
progress doesn't happen as fast as one would have expected.

> > I'd have no problem submitting a follow-on patch later containing
> > some clean-up work if you & NP clear it, so I know I won't have
> > problems from (d).  Note I had to completely rewrite this patch
> > when NP submitted some of his pending stuff.
> 
> Yea, hazard of working in this part of the code when Nico is
> also active.  My own sliding mmap stuff was written twice too,
> for the same reason - Nico doing much needed improvements right in
> the same spot as I was working, at the same time.

Well well.  OK I'm used to be considered as the bad guy anyway.  ;-)


Nicolas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-05-01 17:48       ` Nicolas Pitre
@ 2007-05-01 17:58         ` Dana How
  2007-05-01 18:39           ` Nicolas Pitre
  0 siblings, 1 reply; 18+ messages in thread
From: Dana How @ 2007-05-01 17:58 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Shawn O. Pearce, Junio C Hamano, Git Mailing List, danahow

On 5/1/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 1 May 2007, Shawn O. Pearce wrote:
> > Dana How <danahow@gmail.com> wrote:
> > > On 4/30/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> > > > Why not
> > > > refactor both to use the same implementation and stuff it away in
> > > > say pack-check.c (for lack of a better place), or start a new file
> > > > (pack-write.c)?
> > > Actually I didn't just copy it, I tried to rewrite it for my use
> > > as well as the fast-import.c use (note there is a 3rd copy
> > > in some *index*.c file which I didn't try to merge in yet).
> > > However I didn't yet put it in a new file or change fast-import.c
> > > to call it since I wanted to change as little as possible.
> > ...
> > > I agree with all your arguments.  I had several reasons
> > > to avoid extra rearrangements/refactorings:
> > > (a) First patch to git, not previously known to me;
> > > (b) I prefer to separate new functionality from "clean-up" work;
> >
> > A really good reason.  ;-)
> >
> > But I'd still rather see it done right the first time, then done
> > partially (copied) and wait for someone to clean it up later.
> > Sometimes that cleanup doesn't happen.
>
> Well I intended to do more cleanups in the pack code eventually.  That
> included the index writing and pack header fixing.  But I was expecting
> for the pack splitting changes to go in first as it is likely to impose
> some requirements of its own. It is then easier to have a proper
> interface common to all users after everything is in place.
I was in the middle of creating pack-write.c at Shawn's suggestion.
It will only contain fixup_header_footer(), to be called by fast-import.c
and builtin-pack-object.c.  index-pack.c also has
readjust_pack_header_and_sha1(),
which is compatible except it doesn't close the file.  I was going to leave it
alone for now.  This new file should be the logical place to put other common
pack-writing-related things.  Please barf now if you don't think I
should do this
tiny refactoring at this point.

> > > I'd have no problem submitting a follow-on patch later containing
> > > some clean-up work if you & NP clear it, so I know I won't have
> > > problems from (d).  Note I had to completely rewrite this patch
> > > when NP submitted some of his pending stuff.
> >
> > Yea, hazard of working in this part of the code when Nico is
> > also active.  My own sliding mmap stuff was written twice too,
> > for the same reason - Nico doing much needed improvements right in
> > the same spot as I was working, at the same time.
>
> Well well.  OK I'm used to be considered as the bad guy anyway.  ;-)
You *did* tell me about your upcoming patches as I recall ;-)

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer()
  2007-05-01 17:58         ` Dana How
@ 2007-05-01 18:39           ` Nicolas Pitre
  0 siblings, 0 replies; 18+ messages in thread
From: Nicolas Pitre @ 2007-05-01 18:39 UTC (permalink / raw)
  To: Dana How; +Cc: Shawn O. Pearce, Junio C Hamano, Git Mailing List

On Tue, 1 May 2007, Dana How wrote:

> I was in the middle of creating pack-write.c at Shawn's suggestion. It 
> will only contain fixup_header_footer(), to be called by fast-import.c 
> and builtin-pack-object.c.  index-pack.c also has 
> readjust_pack_header_and_sha1(), which is compatible except it doesn't 
> close the file.  I was going to leave it alone for now.  This new file 
> should be the logical place to put other common pack-writing-related 
> things.  Please barf now if you don't think I should do this tiny 
> refactoring at this point.

That's fine.  Please go ahead.


Nicolas

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-05-01 18:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-08 23:22 [PATCH 4/8] git-repack --max-pack-size: add fixup_header_footer() Dana How
2007-04-09  0:04 ` Junio C Hamano
2007-04-09  0:18   ` Nicolas Pitre
2007-04-09 17:38     ` Shawn O. Pearce
2007-04-09 18:30       ` Nicolas Pitre
2007-04-09 18:40         ` Shawn O. Pearce
2007-04-09 19:11           ` Dana How
2007-04-09 19:33             ` Nicolas Pitre
2007-04-09 21:38               ` Dana How
2007-04-09 23:22                 ` Nicolas Pitre
  -- strict thread matches above, loose matches on Subject: below --
2007-04-30 23:21 Dana How
2007-05-01  5:06 ` Shawn O. Pearce
2007-05-01  5:41   ` Dana How
2007-05-01  6:03     ` Shawn O. Pearce
2007-05-01  7:32       ` Johannes Schindelin
2007-05-01 17:48       ` Nicolas Pitre
2007-05-01 17:58         ` Dana How
2007-05-01 18:39           ` Nicolas Pitre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).