All of lore.kernel.org
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] add: add --bulk to index all objects into a pack file
Date: Wed, 02 Oct 2013 23:43:45 -0700	[thread overview]
Message-ID: <xmqqsiwin9b2.fsf@gitster.dls.corp.google.com> (raw)
In-Reply-To: <1380772811-15415-1-git-send-email-pclouds@gmail.com> ("Nguyễn	Thái Ngọc Duy"'s message of "Thu, 3 Oct 2013 11:00:11 +0700")

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> The use case is
>
>     tar -xzf bigproject.tar.gz
>     cd bigproject
>     git init
>     git add .
>     # git grep or something

Two obvious thoughts, and a half.

 (1) This particular invocation of "git add" can easily detect that
     it is run in a repository with no $GIT_INDEX_FILE yet, which is
     the most typical case for a big initial import.  It could even
     ask if the current branch is unborn if you wanted to make the
     heuristic more specific to this use case.  Perhaps it would
     make sense to automatically plug the bulk import machinery in
     such a case without an option?

 (2) Imagine performing a dry-run of update_files_in_cache() using a
     different diff-files callback that is similar to the
     update_callback() but that uses the lstat(2) data to see how
     big an import this really is, instead of calling
     add_file_to_index(), before actually registering the data to
     the object database.  If you benchmark to see how expensive it
     is, you may find that such a scheme might be a workable
     auto-tuning mechanism to trigger this.  Even if it were
     moderately expensive, when combined with the heuristics above
     for (1), it might be a worthwhile thing to do only when it is
     likely to be an initial import.

 (3) Is it always a good idea to send everything to a packfile on a
     large addition, or are you often better off importing the
     initial fileset as loose objects?  If the latter, then the
     option name "--bulk" may give users a wrong hint "if you are
     doing a bulk-import, you are bettern off using this option".

This is a very logical extension to what was started at 568508e7
(bulk-checkin: replace fast-import based implementation,
2011-10-28), and I like it.  I suspect "--bulk=<threashold>" might
be a better alternative than setting the threshold unconditionally
to zero, though.

  reply	other threads:[~2013-10-03  6:43 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-03  4:00 [PATCH] add: add --bulk to index all objects into a pack file Nguyễn Thái Ngọc Duy
2013-10-03  6:43 ` Junio C Hamano [this message]
2013-10-03 12:26   ` Duy Nguyen
2013-10-04  6:57 ` [PATCH v2] " Nguyễn Thái Ngọc Duy
2013-10-04  7:10   ` Matthieu Moy
2013-10-04  7:19     ` Duy Nguyen
2013-10-04 12:38   ` Duy Nguyen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqsiwin9b2.fsf@gitster.dls.corp.google.com \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.