git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Charles Bailey <charles@hashpling.org>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: [PATCH 8/7] cat-file: sort and de-dup output of --batch-all-objects
Date: Mon, 22 Jun 2015 07:06:32 -0400	[thread overview]
Message-ID: <20150622110632.GA26436@peff.net> (raw)
In-Reply-To: <20150622103321.GB12584@peff.net>

On Mon, Jun 22, 2015 at 06:33:21AM -0400, Jeff King wrote:

> By the way, in addition to not showing objects in order,
> list-all-objects (and my cat-file option) may show duplicates. Do we
> want to "sort -u" for the user? It might be nice for them to always get
> a de-duped and sorted list. Aside from the CPU cost of sorting, it does
> mean we'll allocate ~80MB for the kernel to store the sha1s. I guess
> that's not too much when you are talking about the kernel repo. I took
> the coward's way out and just mentioned the limitation in the
> documentation, but I'm happy to be persuaded.

The patch below does the sort/de-dup. I'd probably just squash it into
patch 7, though.

I did have one additional thought, though. We are treating this as two
separate operations: "what are the sha1s in the repo" and "show me
information about this sha1". But by integrating with cat-file, we could
actually show information not just about a particular sha1, but about a
particular on-disk object.

E.g., if there are duplicates of a particular object, some formatters
like "%(objectsize:disk)" and "%(deltabase)" pick one arbitrarily to
show. I don't know if anybody actually cares about that in practice, but
if we show duplicates, we could give the accurate information for each
instance (and in fact we could give other information like loose vs
packed, which file contains the object, etc).

I tend to think that the lack of de-duping is sufficiently confusing
that it should be the default, and we can always add a "no really, show
me the duplicates" option later. It is not as simple as skipping the
de-dup step. We'd have to actually avoid calling sha1_object_info, and
use the information found in the loose/pack traversal (which would in
turn require exposing the low-level bits of sha1_object_info).

-- >8 --
Subject: cat-file: sort and de-dup output of --batch-all-objects

The sorting we could probably live without, but printing
duplicates is just a hassle for the user, who must then
de-dup themselves (or risk a wrong answer if they are doing
something like counting objects with a particular property).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-cat-file.txt |  3 +--
 builtin/cat-file.c             | 22 +++++++++++++++-------
 t/t1006-cat-file.sh            |  3 +--
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
index 6831b08..3105fc0 100644
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -74,8 +74,7 @@ OPTIONS
 	requested batch operation on all objects in the repository and
 	any alternate object stores (not just reachable objects).
 	Requires `--batch` or `--batch-check` be specified. Note that
-	the order of the objects is unspecified, and there may be
-	duplicate entries.
+	the objects are visited in order sorted by their hashes.
 
 --buffer::
 	Normally batch output is flushed after each object is output, so
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 95604c4..07baad1 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -9,6 +9,7 @@
 #include "userdiff.h"
 #include "streaming.h"
 #include "tree-walk.h"
+#include "sha1-array.h"
 
 struct batch_options {
 	int enabled;
@@ -324,19 +325,19 @@ struct object_cb_data {
 	struct expand_data *expand;
 };
 
-static int batch_object_cb(const unsigned char *sha1,
-			   struct object_cb_data *data)
+static void batch_object_cb(const unsigned char sha1[20], void *vdata)
 {
+	struct object_cb_data *data = vdata;
 	hashcpy(data->expand->sha1, sha1);
 	batch_object_write(NULL, data->opt, data->expand);
-	return 0;
 }
 
 static int batch_loose_object(const unsigned char *sha1,
 			      const char *path,
 			      void *data)
 {
-	return batch_object_cb(sha1, data);
+	sha1_array_append(data, sha1);
+	return 0;
 }
 
 static int batch_packed_object(const unsigned char *sha1,
@@ -344,7 +345,8 @@ static int batch_packed_object(const unsigned char *sha1,
 			       uint32_t pos,
 			       void *data)
 {
-	return batch_object_cb(sha1, data);
+	sha1_array_append(data, sha1);
+	return 0;
 }
 
 static int batch_objects(struct batch_options *opt)
@@ -375,11 +377,17 @@ static int batch_objects(struct batch_options *opt)
 		data.info.typep = &data.type;
 
 	if (opt->all_objects) {
+		struct sha1_array sa = SHA1_ARRAY_INIT;
 		struct object_cb_data cb;
+
+		for_each_loose_object(batch_loose_object, &sa, 0);
+		for_each_packed_object(batch_packed_object, &sa, 0);
+
 		cb.opt = opt;
 		cb.expand = &data;
-		for_each_loose_object(batch_loose_object, &cb, 0);
-		for_each_packed_object(batch_packed_object, &cb, 0);
+		sha1_array_for_each_unique(&sa, batch_object_cb, &cb);
+
+		sha1_array_clear(&sa);
 		return 0;
 	}
 
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 2b4220a..18dbdc8 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -569,8 +569,7 @@ test_expect_success 'cat-file --batch-all-objects shows all objects' '
 	) >>expect.unsorted &&
 	sort <expect.unsorted >expect &&
 	git -C all-two cat-file --batch-all-objects \
-				--batch-check="%(objectname)" >actual.unsorted &&
-	sort <actual.unsorted >actual &&
+				--batch-check="%(objectname)" >actual &&
 	test_cmp expect actual
 '
 
-- 
2.4.4.719.g3984bc6

  parent reply	other threads:[~2015-06-22 11:06 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-19  9:10 Improvements to parse-options and a new filter-objects command Charles Bailey
2015-06-19  9:10 ` [PATCH 1/3] Correct test-parse-options to handle negative ints Charles Bailey
2015-06-19 18:28   ` Junio C Hamano
2015-06-19  9:10 ` [PATCH 2/3] Move unsigned long option parsing out of pack-objects.c Charles Bailey
2015-06-19 11:03   ` Remi Galan Alfonso
2015-06-19 11:06     ` Charles Bailey
2015-06-19 17:58   ` Junio C Hamano
2015-06-19 18:39     ` Junio C Hamano
2015-06-20 15:31       ` Jakub Narębski
2015-06-19 18:47     ` Jakub Narębski
2015-06-20 16:51     ` Charles Bailey
2015-06-20 17:47       ` Junio C Hamano
2015-06-19  9:10 ` [PATCH 3/3] Add filter-objects command Charles Bailey
2015-06-19 10:10   ` Jeff King
2015-06-19 10:33     ` Charles Bailey
2015-06-19 10:52       ` Jeff King
2015-06-19 18:28         ` Junio C Hamano
2015-06-19 10:52       ` John Keeping
2015-06-19 11:04         ` Charles Bailey
2015-06-21 18:25 ` Improvements to integer option parsing Charles Bailey
2015-06-21 18:25   ` [PATCH 1/2] Correct test-parse-options to handle negative ints Charles Bailey
2015-06-21 18:25   ` [PATCH 2/2] Move unsigned long option parsing out of pack-objects.c Charles Bailey
2015-06-21 18:30     ` Charles Bailey
2015-06-22 22:03       ` Junio C Hamano
2015-06-22 22:08     ` Junio C Hamano
2015-06-22 22:09   ` Improvements to integer option parsing Junio C Hamano
2015-06-22 22:42     ` Charles Bailey
2015-06-21 19:20 ` Fast enumeration of objects Charles Bailey
2015-06-21 19:20   ` [PATCH] Add list-all-objects command Charles Bailey
2015-06-22  8:38     ` Jeff King
2015-06-22 10:33       ` Jeff King
2015-06-22 10:40         ` [PATCH 1/7] for_each_packed_object: automatically open pack index Jeff King
2015-06-22 10:40         ` [PATCH 2/7] cat-file: minor style fix in options list Jeff King
2015-06-22 10:41         ` [PATCH 3/7] cat-file: move batch_options definition to top of file Jeff King
2015-06-22 10:45         ` [PATCH 4/7] cat-file: add --buffer option Jeff King
2015-06-22 10:45         ` [PATCH 5/7] cat-file: stop returning value from batch_one_object Jeff King
2015-06-22 10:45         ` [PATCH 6/7] cat-file: split batch_one_object into two stages Jeff King
2015-06-22 10:45         ` [PATCH 7/7] cat-file: add --batch-all-objects option Jeff King
2015-06-26  6:56           ` Eric Sunshine
2015-06-26 15:48             ` Jeff King
2015-06-22 11:06         ` Jeff King [this message]
2015-06-22 22:03           ` [PATCH 8/7] cat-file: sort and de-dup output of --batch-all-objects Charles Bailey
2015-06-22 23:46             ` Jeff King
2015-06-22 21:48         ` [PATCH] Add list-all-objects command Charles Bailey
2015-06-22 21:50         ` Junio C Hamano
2015-06-22 23:50           ` Jeff King
2015-06-22 11:38       ` Charles Bailey
2015-06-22  9:57     ` Duy Nguyen
2015-06-22 10:24       ` Jeff King
2015-06-22  8:35   ` Fast enumeration of objects Jeff King
2015-06-22 19:44     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150622110632.GA26436@peff.net \
    --to=peff@peff.net \
    --cc=charles@hashpling.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).