From: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: [PATCH v2] add: add --bulk to index all objects into a pack file
Date: Fri,  4 Oct 2013 13:57:51 +0700
Message-ID: <1380869871-31631-1-git-send-email-pclouds@gmail.com>
In-Reply-To: <1380772811-15415-1-git-send-email-pclouds@gmail.com>

The use case is

    tar -xzf bigproject.tar.gz
    cd bigproject
    git init
    git add .
    # git grep or something

The first add will generate a bunch of loose objects. With --bulk, all
of them are forced into a single pack instead, which means less clutter
on disk and possibly faster object access.

On the gdb-7.5.1 source directory, the loose .git directory takes 66M
according to `du` while the packed one takes 32M. Timing of
"git grep --cached":

          loose     packed
real    0m1.671s   0m1.372s
user    0m1.542s   0m1.313s
sys     0m0.126s   0m0.056s

It's not an all-win situation though. --bulk is slower than --no-bulk
because:

 - Triple hashing: we need to calculate both object SHA-1s _and_ pack
   SHA-1. At the end we have to fix up the pack, which means rehashing
   the entire pack again. --no-bulk only cares about object SHA-1s.

 - We write duplicate objects to the pack then truncate it, because we
   don't know if it's a duplicate until we're done writing, and cannot
   keep it in core because it's potentially big. So extra I/O (but
   hopefully not too much because duplicate objects should not happen
   often).

 - Sort and write .idx file.

 - (For the future) --no-bulk could benefit from multithreading
   because this is a CPU-bound operation. --bulk could not.
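
The write-then-truncate step from the second point can be sketched
roughly as below. This is only an illustration and not the code git
uses: append_or_rollback(), the seen[] table and the 4-byte chunk size
are hypothetical, and 64-bit FNV-1a stands in for SHA-1 to keep the
example short. The point is that the object id is only known once the
last chunk has been written, so a duplicate is undone by truncating the
file back to where the object started.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for fileno() and ftruncate() */
#endif
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_SEEN 64 /* hypothetical fixed-size table of ids seen so far */
#define CHUNK 4     /* tiny on purpose, to show the streaming loop */

static uint64_t seen[MAX_SEEN];
static size_t nr_seen;

/* Stand-in for SHA-1: 64-bit FNV-1a, fed one chunk at a time. */
static uint64_t fnv1a_update(uint64_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Stream `len` bytes to `pack` while hashing them. Returns 1 if the
 * object was kept, 0 if it turned out to be a duplicate and the file
 * was truncated back to its previous end, -1 on error.
 */
static int append_or_rollback(FILE *pack, const char *data, size_t len)
{
	long start = ftell(pack);
	uint64_t id = 14695981039346656037ULL; /* FNV-1a offset basis */
	size_t off, i;

	if (start < 0)
		return -1;

	for (off = 0; off < len; off += CHUNK) {
		size_t n = len - off < CHUNK ? len - off : CHUNK;

		if (fwrite(data + off, 1, n, pack) != n)
			return -1;
		id = fnv1a_update(id, data + off, n);
	}

	for (i = 0; i < nr_seen; i++) {
		if (seen[i] != id)
			continue;
		/* Duplicate: drop the bytes we just wrote. */
		if (fflush(pack) || ftruncate(fileno(pack), start) ||
		    fseek(pack, start, SEEK_SET))
			return -1;
		return 0;
	}

	if (nr_seen == MAX_SEEN)
		return -1;
	seen[nr_seen++] = id;
	return 1;
}
```

Calling this on a tmpfile() with "hello", "world!" and then "hello"
again leaves 11 bytes in the file: the third call writes 5 bytes and
then removes them again, which is the extra I/O mentioned above.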

Then again, this comparison is not quite fair; --bulk is closer to:

    git add . &&
    git ls-files --stage | awk '{print $2;}'| \
        git pack-objects .git/objects/pack-

except that it neither deltifies nor sorts objects.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 v2 examines pros and cons of --bulk and I'm not sure if turning it on
 automatically (with heuristics) is a good idea anymore.

 Oh and it fixes not packing empty files.
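
To illustrate the empty-file fix, the condition changed in index_fd()
can be modelled in isolation as below. The helpers old_uses_core() and
new_uses_core() are hypothetical names for this sketch only; the two
boolean parameters stand in for "type == OBJ_BLOB" and the
would_convert_to_git() check. A true result means the object takes the
index_core() path and is not streamed into the pack.

```c
#include <stdbool.h>
#include <stddef.h>

/* Condition before this patch: with threshold 0, size 0 still matches. */
static bool old_uses_core(size_t size, size_t threshold,
			  bool is_blob, bool would_convert)
{
	return size <= threshold || !is_blob || would_convert;
}

/* Condition after this patch: a threshold of 0 disables the size test. */
static bool new_uses_core(size_t size, size_t threshold,
			  bool is_blob, bool would_convert)
{
	return (threshold && size <= threshold) || !is_blob || would_convert;
}
```

With threshold 0, old_uses_core(0, 0, true, false) is true, so an empty
blob is not packed, while new_uses_core(0, 0, true, false) is false and
the empty blob goes into the pack like everything else. Paths that
would be converted still take the index_core() path in both versions.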

 Documentation/git-add.txt | 10 ++++++++++
 builtin/add.c             | 10 +++++++++-
 sha1_file.c               |  3 ++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index 48754cb..147d191 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -160,6 +160,16 @@ today's "git add <pathspec>...", ignoring removed files.
 	be ignored, no matter if they are already present in the work
 	tree or not.
 
+--bulk::
+	Normally new objects are indexed and stored in loose format,
+	one file per new object in "$GIT_DIR/objects". This option
+	forces putting all objects into a single new pack. This may
+	be useful when you need to add a lot of files initially.
++
+This option is equivalent to running `git -c core.bigFileThreshold=0 add`.
+If you want to only pack files larger than a size threshold, use the
+long form.
+
 \--::
 	This option can be used to separate command-line options from
 	the list of files, (useful when filenames might be mistaken
diff --git a/builtin/add.c b/builtin/add.c
index 226f758..40cbb71 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -336,7 +336,7 @@ static struct lock_file lock_file;
 static const char ignore_error[] =
 N_("The following paths are ignored by one of your .gitignore files:\n");
 
-static int verbose, show_only, ignored_too, refresh_only;
+static int verbose, show_only, ignored_too, refresh_only, bulk_index;
 static int ignore_add_errors, intent_to_add, ignore_missing;
 
 #define ADDREMOVE_DEFAULT 0 /* Change to 1 in Git 2.0 */
@@ -368,6 +368,7 @@ static struct option builtin_add_options[] = {
 	OPT_BOOL( 0 , "refresh", &refresh_only, N_("don't add, only refresh the index")),
 	OPT_BOOL( 0 , "ignore-errors", &ignore_add_errors, N_("just skip files which cannot be added because of errors")),
 	OPT_BOOL( 0 , "ignore-missing", &ignore_missing, N_("check if - even missing - files are ignored in dry run")),
+	OPT_BOOL( 0 , "bulk", &bulk_index, N_("pack all objects instead of creating loose ones")),
 	OPT_END(),
 };
 
@@ -560,6 +561,13 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 		free(seen);
 	}
 
+	if (bulk_index)
+		/*
+		 * Pretend all blobs are "large" files, forcing them
+		 * all into a pack
+		 */
+		big_file_threshold = 0;
+
 	plug_bulk_checkin();
 
 	if ((flags & ADD_CACHE_IMPLICIT_DOT) && prefix) {
diff --git a/sha1_file.c b/sha1_file.c
index f80bbe4..8b66840 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3137,7 +3137,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 
 	if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
-	else if (size <= big_file_threshold || type != OBJ_BLOB ||
+	else if ((big_file_threshold && size <= big_file_threshold) ||
+		 type != OBJ_BLOB ||
 		 (path && would_convert_to_git(path, NULL, 0, 0)))
 		ret = index_core(sha1, fd, size, type, path, flags);
 	else
-- 
1.8.2.82.gc24b958

Thread overview: 7+ messages
2013-10-03  4:00 [PATCH] add: add --bulk to index all objects into a pack file Nguyễn Thái Ngọc Duy
2013-10-03  6:43 ` Junio C Hamano
2013-10-03 12:26   ` Duy Nguyen
2013-10-04  6:57 ` Nguyễn Thái Ngọc Duy [this message]
2013-10-04  7:10   ` [PATCH v2] " Matthieu Moy
2013-10-04  7:19     ` Duy Nguyen
2013-10-04 12:38   ` Duy Nguyen