git.vger.kernel.org archive mirror
* [PATCH] add: add --bulk to index all objects into a pack file
@ 2013-10-03  4:00 Nguyễn Thái Ngọc Duy
  2013-10-03  6:43 ` Junio C Hamano
  2013-10-04  6:57 ` [PATCH v2] " Nguyễn Thái Ngọc Duy
  0 siblings, 2 replies; 7+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2013-10-03  4:00 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

The use case is

    tar -xzf bigproject.tar.gz
    cd bigproject
    git init
    git add .
    # git grep or something

The first add will generate a bunch of loose objects. With --bulk, all
of them are forced into a single pack instead: less clutter on disk,
and possibly faster object access.

This is the equivalent of "git -c core.bigFileThreshold=0 add .", so
it's not really new functionality, just a convenient, public exposure
of it.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 Interestingly index_stream() seems a bit slower than standard
 index_core(). Perhaps mmap() is faster than a series of read() for
 small files. Room for improvement later.

 Documentation/git-add.txt |  6 ++++++
 builtin/add.c             | 10 +++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index 48754cb..36a77f6 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -160,6 +160,12 @@ today's "git add <pathspec>...", ignoring removed files.
 	be ignored, no matter if they are already present in the work
 	tree or not.
 
+--bulk::
+	Normally new objects are indexed and stored in loose format,
+	one file per new object in "$GIT_DIR/objects". This option
+	forces putting all objects into a single new pack. This may
+	be useful when you need to add a lot of files initially.
+
 \--::
 	This option can be used to separate command-line options from
 	the list of files, (useful when filenames might be mistaken
diff --git a/builtin/add.c b/builtin/add.c
index 226f758..40cbb71 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -336,7 +336,7 @@ static struct lock_file lock_file;
 static const char ignore_error[] =
 N_("The following paths are ignored by one of your .gitignore files:\n");
 
-static int verbose, show_only, ignored_too, refresh_only;
+static int verbose, show_only, ignored_too, refresh_only, bulk_index;
 static int ignore_add_errors, intent_to_add, ignore_missing;
 
 #define ADDREMOVE_DEFAULT 0 /* Change to 1 in Git 2.0 */
@@ -368,6 +368,7 @@ static struct option builtin_add_options[] = {
 	OPT_BOOL( 0 , "refresh", &refresh_only, N_("don't add, only refresh the index")),
 	OPT_BOOL( 0 , "ignore-errors", &ignore_add_errors, N_("just skip files which cannot be added because of errors")),
 	OPT_BOOL( 0 , "ignore-missing", &ignore_missing, N_("check if - even missing - files are ignored in dry run")),
+	OPT_BOOL( 0 , "bulk", &bulk_index, N_("pack all objects instead of creating loose ones")),
 	OPT_END(),
 };
 
@@ -560,6 +561,13 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 		free(seen);
 	}
 
+	if (bulk_index)
+		/*
+		 * Pretend all blobs are "large" files, forcing them
+		 * all into a pack
+		 */
+		big_file_threshold = 0;
+
 	plug_bulk_checkin();
 
 	if ((flags & ADD_CACHE_IMPLICIT_DOT) && prefix) {
-- 
1.8.2.82.gc24b958

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] add: add --bulk to index all objects into a pack file
  2013-10-03  4:00 [PATCH] add: add --bulk to index all objects into a pack file Nguyễn Thái Ngọc Duy
@ 2013-10-03  6:43 ` Junio C Hamano
  2013-10-03 12:26   ` Duy Nguyen
  2013-10-04  6:57 ` [PATCH v2] " Nguyễn Thái Ngọc Duy
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2013-10-03  6:43 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> The use case is
>
>     tar -xzf bigproject.tar.gz
>     cd bigproject
>     git init
>     git add .
>     # git grep or something

Two obvious thoughts, and a half.

 (1) This particular invocation of "git add" can easily detect that
     it is run in a repository with no $GIT_INDEX_FILE yet, which is
     the most typical case for a big initial import.  It could even
     ask if the current branch is unborn if you wanted to make the
     heuristic more specific to this use case.  Perhaps it would
     make sense to automatically plug the bulk import machinery in
     such a case without an option?

 (2) Imagine performing a dry-run of update_files_in_cache() using a
     different diff-files callback that is similar to the
     update_callback() but that uses the lstat(2) data to see how
     big an import this really is, instead of calling
     add_file_to_index(), before actually registering the data to
     the object database.  If you benchmark to see how expensive it
     is, you may find that such a scheme might be a workable
     auto-tuning mechanism to trigger this.  Even if it were
     moderately expensive, when combined with the heuristics above
     for (1), it might be a worthwhile thing to do only when it is
     likely to be an initial import.

 (3) Is it always a good idea to send everything to a packfile on a
     large addition, or are you often better off importing the
     initial fileset as loose objects?  If the latter, then the
     option name "--bulk" may give users a wrong hint "if you are
     doing a bulk-import, you are better off using this option".

This is a very logical extension to what was started at 568508e7
(bulk-checkin: replace fast-import based implementation,
2011-10-28), and I like it.  I suspect "--bulk=<threshold>" might
be a better alternative than setting the threshold unconditionally
to zero, though.


* Re: [PATCH] add: add --bulk to index all objects into a pack file
  2013-10-03  6:43 ` Junio C Hamano
@ 2013-10-03 12:26   ` Duy Nguyen
  0 siblings, 0 replies; 7+ messages in thread
From: Duy Nguyen @ 2013-10-03 12:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Thu, Oct 3, 2013 at 1:43 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>> The use case is
>>
>>     tar -xzf bigproject.tar.gz
>>     cd bigproject
>>     git init
>>     git add .
>>     # git grep or something
>
> Two obvious thoughts, and a half.
>
>  (1) This particular invocation of "git add" can easily detect that
>      it is run in a repository with no $GIT_INDEX_FILE yet, which is
>      the most typical case for a big initial import.  It could even
>      ask if the current branch is unborn if you wanted to make the
>      heuristic more specific to this use case.  Perhaps it would
>      make sense to automatically plug the bulk import machinery in
>      such a case without an option?

Yeah! I did not even think of that.

>  (2) Imagine performing a dry-run of update_files_in_cache() using a
>      different diff-files callback that is similar to the
>      update_callback() but that uses the lstat(2) data to see how
>      big an import this really is, instead of calling
>      add_file_to_index(), before actually registering the data to
>      the object database.  If you benchmark to see how expensive it
>      is, you may find that such a scheme might be a workable
>      auto-tuning mechanism to trigger this.  Even if it were
>      moderately expensive, when combined with the heuristics above
>      for (1), it might be a worthwhile thing to do only when it is
>      likely to be an initial import.

We do a lot of lstat() calls nowadays for refreshing the index, so
it's likely reasonably cheap, but I doubt people mass-update existing
files often. Adding a large number of new files (even when .git/index
exists) may be a better indication of an import, and we already have
that information from fill_directory().

For the no-.git/index case, packing with bulk-checkin probably
produces a reasonably good pack, because the files don't delta well
anyway: there are no previous versions to delta against. They could
delta against other files, but I don't think we'd get good compression
from that. For the case where .git/index exists, we may interfere with
"git gc --auto": we create a less optimal pack, but it is a pack, and
so may delay the next gc longer.

>  (3) Is it always a good idea to send everything to a packfile on a
>      large addition, or are you often better off importing the
>      initial fileset as loose objects?  If the latter, then the
>      option name "--bulk" may give users a wrong hint "if you are
>      doing a bulk-import, you are better off using this option".

Hard question. Fewer files are definitely a good thing, for example
when you "rm -rf" the whole thing :-) Disk usage is another. On
gdb-7.3.1, "du -sh" reports .git with loose objects takes 57M, while
the packed one takes 29M. Disk access is slightly faster on packed
.git, at least for "git grep --cached": 0.71s vs 0.83s.

> This is a very logical extension to what was started at 568508e7
> (bulk-checkin: replace fast-import based implementation,
> 2011-10-28), and I like it.  I suspect "--bulk=<threshold>" might
> be a better alternative than setting the threshold unconditionally
> to zero, though.

The threshold may be better as a config setting, because I wouldn't
want to specify it every time. But does one really want to keep some
small files around in loose format? I don't see the point, since my
goal is a clean .git, but maybe the loose format has some advantages.
-- 
Duy


* [PATCH v2] add: add --bulk to index all objects into a pack file
  2013-10-03  4:00 [PATCH] add: add --bulk to index all objects into a pack file Nguyễn Thái Ngọc Duy
  2013-10-03  6:43 ` Junio C Hamano
@ 2013-10-04  6:57 ` Nguyễn Thái Ngọc Duy
  2013-10-04  7:10   ` Matthieu Moy
  2013-10-04 12:38   ` Duy Nguyen
  1 sibling, 2 replies; 7+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2013-10-04  6:57 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

The use case is

    tar -xzf bigproject.tar.gz
    cd bigproject
    git init
    git add .
    # git grep or something

The first add will generate a bunch of loose objects. With --bulk, all
of them are forced into a single pack instead: less clutter on disk,
and possibly faster object access.

On gdb-7.5.1 source directory, the loose .git directory takes 66M
according to `du` while the packed one takes 32M. Timing of
"git grep --cached":

          loose     packed
real    0m1.671s   0m1.372s
user    0m1.542s   0m1.313s
sys     0m0.126s   0m0.056s

It's not an all-win situation though. --bulk is slower than --no-bulk
because:

 - Triple hashing: we need to calculate both object SHA-1s _and_ pack
   SHA-1. At the end we have to fix up the pack, which means rehashing
   the entire pack again. --no-bulk only cares about object SHA-1s.

 - We write duplicate objects to the pack then truncate it, because we
   don't know if it's a duplicate until we're done writing, and cannot
   keep it in core because it's potentially big. So extra I/O (but
   hopefully not too much because duplicate objects should not happen
   often).

 - Sort and write .idx file.

 - (For the future) --no-bulk could benefit from multithreading,
   because this is a CPU-bound operation. --bulk could not.

But again, this comparison is not fair; --bulk is closer to:

    git add . &&
    git ls-files --stage | awk '{print $2;}'| \
        git pack-objects .git/objects/pack-

except that it neither deltifies nor sorts objects.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 v2 examines pros and cons of --bulk and I'm not sure if turning it on
 automatically (with heuristics) is a good idea anymore.

 Oh, and it fixes empty files not being packed.

 Documentation/git-add.txt | 10 ++++++++++
 builtin/add.c             | 10 +++++++++-
 sha1_file.c               |  3 ++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-add.txt b/Documentation/git-add.txt
index 48754cb..147d191 100644
--- a/Documentation/git-add.txt
+++ b/Documentation/git-add.txt
@@ -160,6 +160,16 @@ today's "git add <pathspec>...", ignoring removed files.
 	be ignored, no matter if they are already present in the work
 	tree or not.
 
+--bulk::
+	Normally new objects are indexed and stored in loose format,
+	one file per new object in "$GIT_DIR/objects". This option
+	forces putting all objects into a single new pack. This may
+	be useful when you need to add a lot of files initially.
++
+This option is equivalent to running `git -c core.bigFileThreshold=0 add`.
+If you want to only pack files larger than a size threshold, use the
+long form.
+
 \--::
 	This option can be used to separate command-line options from
 	the list of files, (useful when filenames might be mistaken
diff --git a/builtin/add.c b/builtin/add.c
index 226f758..40cbb71 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -336,7 +336,7 @@ static struct lock_file lock_file;
 static const char ignore_error[] =
 N_("The following paths are ignored by one of your .gitignore files:\n");
 
-static int verbose, show_only, ignored_too, refresh_only;
+static int verbose, show_only, ignored_too, refresh_only, bulk_index;
 static int ignore_add_errors, intent_to_add, ignore_missing;
 
 #define ADDREMOVE_DEFAULT 0 /* Change to 1 in Git 2.0 */
@@ -368,6 +368,7 @@ static struct option builtin_add_options[] = {
 	OPT_BOOL( 0 , "refresh", &refresh_only, N_("don't add, only refresh the index")),
 	OPT_BOOL( 0 , "ignore-errors", &ignore_add_errors, N_("just skip files which cannot be added because of errors")),
 	OPT_BOOL( 0 , "ignore-missing", &ignore_missing, N_("check if - even missing - files are ignored in dry run")),
+	OPT_BOOL( 0 , "bulk", &bulk_index, N_("pack all objects instead of creating loose ones")),
 	OPT_END(),
 };
 
@@ -560,6 +561,13 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 		free(seen);
 	}
 
+	if (bulk_index)
+		/*
+		 * Pretend all blobs are "large" files, forcing them
+		 * all into a pack
+		 */
+		big_file_threshold = 0;
+
 	plug_bulk_checkin();
 
 	if ((flags & ADD_CACHE_IMPLICIT_DOT) && prefix) {
diff --git a/sha1_file.c b/sha1_file.c
index f80bbe4..8b66840 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3137,7 +3137,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 
 	if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
-	else if (size <= big_file_threshold || type != OBJ_BLOB ||
+	else if ((big_file_threshold && size <= big_file_threshold) ||
+		 type != OBJ_BLOB ||
 		 (path && would_convert_to_git(path, NULL, 0, 0)))
 		ret = index_core(sha1, fd, size, type, path, flags);
 	else
-- 
1.8.2.82.gc24b958


* Re: [PATCH v2] add: add --bulk to index all objects into a pack file
  2013-10-04  6:57 ` [PATCH v2] " Nguyễn Thái Ngọc Duy
@ 2013-10-04  7:10   ` Matthieu Moy
  2013-10-04  7:19     ` Duy Nguyen
  2013-10-04 12:38   ` Duy Nguyen
  1 sibling, 1 reply; 7+ messages in thread
From: Matthieu Moy @ 2013-10-04  7:10 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, Junio C Hamano

Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:

> except that it does not deltifies nor sort objects.

I think this should be mentioned in the doc. Otherwise, it seems like
"git add --bulk" is like "git add && git repack".

BTW, will the next "git gc" be efficient after a "add --bulk"? I mean:
will it consider the pack as "already packed" and leave it as-is, without
deltification, or will it get a chance to actually repack efficiently?

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/


* Re: [PATCH v2] add: add --bulk to index all objects into a pack file
  2013-10-04  7:10   ` Matthieu Moy
@ 2013-10-04  7:19     ` Duy Nguyen
  0 siblings, 0 replies; 7+ messages in thread
From: Duy Nguyen @ 2013-10-04  7:19 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Git Mailing List, Junio C Hamano

On Fri, Oct 4, 2013 at 2:10 PM, Matthieu Moy
<Matthieu.Moy@grenoble-inp.fr> wrote:
> Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
>
>> except that it neither deltifies nor sorts objects.
>
> I think this should be mentioned in the doc. Otherwise, it seems like
> "git add --bulk" is like "git add && git repack".

Yep. Will do.

> BTW, will the next "git gc" be efficient after a "add --bulk"? I mean:
> will it consider the pack as "already packed" and leave it as-is, without
> deltification, or will it get a chance to actually repack efficiently?

gc does "repack -A", so all separate packs will be merged. It may
delay gc though, because it will take longer to hit the loose object
limit. I think pack-objects will try to deltify, so we're good. If we
produce bad deltas in --bulk, then that's another story, because
pack-objects will blindly try to reuse them.
-- 
Duy


* Re: [PATCH v2] add: add --bulk to index all objects into a pack file
  2013-10-04  6:57 ` [PATCH v2] " Nguyễn Thái Ngọc Duy
  2013-10-04  7:10   ` Matthieu Moy
@ 2013-10-04 12:38   ` Duy Nguyen
  1 sibling, 0 replies; 7+ messages in thread
From: Duy Nguyen @ 2013-10-04 12:38 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano, Nguyễn Thái Ngọc Duy

On Fri, Oct 4, 2013 at 1:57 PM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> It's not an all-win situation though. --bulk is slower than --no-bulk
> because:
>
>  - Triple hashing: we need to calculate both object SHA-1s _and_ pack
>    SHA-1. At the end we have to fix up the pack, which means rehashing
>    the entire pack again. --no-bulk only cares about object SHA-1s.

Oops, it's quadruple hashing: one for the object SHA-1 (1), one for
the pack SHA-1 while writing the pack (2), and two for the pack header
fixup: one after the header fixup (3), and one repeating (2) just to
verify that it matches (2) exactly (4). Do we have to be this
paranoid? I think we could drop (2) and (4) in this case.
-- 
Duy

