From: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: [PATCH v2] add: add --bulk to index all objects into a pack file
Date: Fri, 4 Oct 2013 13:57:51 +0700
Message-ID: <1380869871-31631-1-git-send-email-pclouds@gmail.com>
In-Reply-To: <1380772811-15415-1-git-send-email-pclouds@gmail.com>
The use case is:

    tar -xzf bigproject.tar.gz
    cd bigproject
    git init
    git add .
    # git grep or something

The first add will generate a bunch of loose objects. With --bulk, all
of them are forced into a single pack instead, which means less clutter
on disk and possibly faster object access.
On the gdb-7.5.1 source directory, the loose .git directory takes 66M
according to `du`, while the packed one takes 32M. Timing of
"git grep --cached":

            loose       packed
    real    0m1.671s    0m1.372s
    user    0m1.542s    0m1.313s
    sys     0m0.126s    0m0.056s

It's not an all-win situation though. --bulk is slower than --no-bulk
because:

 - Triple hashing: we need to calculate both the object SHA-1s _and_
   the pack SHA-1. At the end we have to fix up the pack, which means
   rehashing the entire pack again. --no-bulk only cares about object
   SHA-1s.

 - We write duplicate objects to the pack and then truncate it,
   because we don't know whether an object is a duplicate until we're
   done writing it, and we cannot keep it in core because it's
   potentially big. That means extra I/O (hopefully not much, because
   duplicate objects should not happen often).

 - We have to sort and write the .idx file.

 - (For the future) --no-bulk could benefit from multithreading
   because this is a CPU-bound operation. --bulk could not.

But again, this comparison is not fair; --bulk is closer to:

    git add . &&
    git ls-files --stage | awk '{print $2;}' | \
        git pack-objects .git/objects/pack-

except that it neither deltifies nor sorts objects.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
v2 examines pros and cons of --bulk and I'm not sure if turning it on
automatically (with heuristics) is a good idea anymore.
Oh and it fixes not packing empty files.
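
For anyone wondering why empty files were left out, the size test in
index_fd() can be modelled on its own. This is only a minimal standalone
sketch, not code from the patch: the function names old_uses_core() and
new_uses_core() are made up for illustration, is_blob and would_convert
stand in for "type == OBJ_BLOB" and would_convert_to_git(), and the
!S_ISREG branch is left out. A return value of 1 means the in-core path
(index_core(), i.e. a loose object); 0 means the streaming path that
ends up in the pack.

```c
#include <stddef.h>

/*
 * Condition before this patch. With threshold == 0, a zero-length
 * blob still satisfies "size <= threshold" and takes the in-core path.
 */
int old_uses_core(size_t size, size_t threshold, int is_blob, int would_convert)
{
	return size <= threshold || !is_blob || would_convert;
}

/*
 * Condition after this patch. A threshold of 0 disables the size test,
 * so only non-blobs and paths that need conversion stay in core.
 */
int new_uses_core(size_t size, size_t threshold, int is_blob, int would_convert)
{
	return (threshold && size <= threshold) || !is_blob || would_convert;
}
```

With threshold 0, old_uses_core(0, 0, 1, 0) is true while
new_uses_core(0, 0, 1, 0) is false. For a non-zero threshold the two
functions agree.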
Documentation/git-add.txt | 10 ++++++++++
builtin/add.c | 10 +++++++++-
sha1_file.c | 3 ++-
3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index 48754cb..147d191 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -160,6 +160,16 @@ today's "git add <pathspec>...", ignoring removed files.
be ignored, no matter if they are already present in the work
tree or not.
+--bulk::
+ Normally new objects are indexed and stored in loose format,
+ one file per new object in "$GIT_DIR/objects". This option
+ forces putting all objects into a single new pack. This may
+ be useful when you need to add a lot of files initially.
++
+This option is equivalent to running `git -c core.bigFileThreshold=0 add`.
+If you want to only pack files larger than a size threshold, use the
+long form.
+
\--::
This option can be used to separate command-line options from
the list of files, (useful when filenames might be mistaken
diff --git a/builtin/add.c b/builtin/add.c
index 226f758..40cbb71 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -336,7 +336,7 @@ static struct lock_file lock_file;
static const char ignore_error[] =
N_("The following paths are ignored by one of your .gitignore files:\n");
-static int verbose, show_only, ignored_too, refresh_only;
+static int verbose, show_only, ignored_too, refresh_only, bulk_index;
static int ignore_add_errors, intent_to_add, ignore_missing;
#define ADDREMOVE_DEFAULT 0 /* Change to 1 in Git 2.0 */
@@ -368,6 +368,7 @@ static struct option builtin_add_options[] = {
OPT_BOOL( 0 , "refresh", &refresh_only, N_("don't add, only refresh the index")),
OPT_BOOL( 0 , "ignore-errors", &ignore_add_errors, N_("just skip files which cannot be added because of errors")),
OPT_BOOL( 0 , "ignore-missing", &ignore_missing, N_("check if - even missing - files are ignored in dry run")),
+ OPT_BOOL( 0 , "bulk", &bulk_index, N_("pack all objects instead of creating loose ones")),
OPT_END(),
};
@@ -560,6 +561,13 @@ int cmd_add(int argc, const char **argv, const char *prefix)
free(seen);
}
+ if (bulk_index)
+ /*
+ * Pretend all blobs are "large" files, forcing them
+ * all into a pack
+ */
+ big_file_threshold = 0;
+
plug_bulk_checkin();
if ((flags & ADD_CACHE_IMPLICIT_DOT) && prefix) {
diff --git a/sha1_file.c b/sha1_file.c
index f80bbe4..8b66840 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3137,7 +3137,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
if (!S_ISREG(st->st_mode))
ret = index_pipe(sha1, fd, type, path, flags);
- else if (size <= big_file_threshold || type != OBJ_BLOB ||
+ else if ((big_file_threshold && size <= big_file_threshold) ||
+ type != OBJ_BLOB ||
(path && would_convert_to_git(path, NULL, 0, 0)))
ret = index_core(sha1, fd, size, type, path, flags);
else
--
1.8.2.82.gc24b958