git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: [PATCH v2 00/10] Large blob fixes
Date: Sun,  4 Mar 2012 19:59:46 +0700	[thread overview]
Message-ID: <1330865996-2069-1-git-send-email-pclouds@gmail.com> (raw)
In-Reply-To: <1330329315-11407-1-git-send-email-pclouds@gmail.com>

These patches make sure we avoid keeping whole blob in memory, at
least in common cases. Blob-only streaming code paths are opened to
accomplish that.

There are a few things I'd like to see addressed, perhaps as part of
GSoC if any student steps up.

 - somehow avoid unpack-objects and keep the pack if it contains large
   blobs. I guess we could just save the pack, then decide to
   unpack-objects later. I've updated GSoC ideas page about this.
 
 - pack-objects still puts large blobs in memory if they are in loose
   format. This should not happen if we fix the above. But if anyone
   has spare energy, (s)he can try to stream large loose blobs in the
   pack too. Not sure how ugly the end result could be.

 - archive-zip with large blobs. I think two phases are required
   because we need to calculate crc32 in advance. I have a feeling
   that we could just stream compressed blobs (either in loose or
   packed format) to the zip file, i.e. no decompressing then
   compresssing, which makes two phases nearly as good as one.

 - not really large blob related, but it'd be great to see
   pack-check.c and index-pack.c share as much pack reading code as
   possible, even bettere if sha1_file.c could join the party.

 - I've been thinking whether we could just drop pack-check.c, which
   is only used by fsck, and make fsck run index-pack instead. The
   pros is we can run index-pack in parallel. The cons is, how to
   return marked object list to fsck efficiently.

Anyway changes from v1:

 - use stream_blob_to_fd() patch from Junio (better factoring)
 - split show_object() in "git show" in two separate functions, one
   for tag and one for blob, as they do not share much in the end
 - get rid of "index-pack --verify" patch. It'll come back separately

Junio C Hamano (1):
  streaming: make streaming-write-entry to be more reusable

Nguyễn Thái Ngọc Duy (9):
  Add more large blob test cases
  cat-file: use streaming interface to print blobs
  parse_object: special code path for blobs to avoid putting whole
    object in memory
  show: use streaming interface for showing blobs
  index-pack: split second pass obj handling into own function
  index-pack: reduce memory usage when the pack has large blobs
  pack-check: do not unpack blobs
  archive: support streaming large files to a tar archive
  fsck: use streaming interface for writing lost-found blobs

 archive-tar.c        |   35 +++++++++++++++----
 archive-zip.c        |    9 +++--
 archive.c            |   51 ++++++++++++++++++---------
 archive.h            |   11 +++++-
 builtin/cat-file.c   |   23 ++++++++++++
 builtin/fsck.c       |    8 +---
 builtin/index-pack.c |   95 ++++++++++++++++++++++++++++++++++++--------------
 builtin/log.c        |   34 ++++++++++-------
 cache.h              |    2 +-
 entry.c              |   53 +++-------------------------
 fast-import.c        |    2 +-
 object.c             |   11 ++++++
 pack-check.c         |   21 ++++++++++-
 sha1_file.c          |   78 +++++++++++++++++++++++++++++++++++------
 streaming.c          |   55 +++++++++++++++++++++++++++++
 streaming.h          |    2 +
 t/t1050-large.sh     |   59 ++++++++++++++++++++++++++++++-
 wrapper.c            |   27 ++++++++++++--
 18 files changed, 434 insertions(+), 142 deletions(-)

-- 
1.7.8.36.g69ee2

  parent reply	other threads:[~2012-03-04 13:02 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-27  7:55 [PATCH 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
2012-02-27 20:18   ` Peter Baumann
2012-02-27  7:55 ` [PATCH 02/11] Factor out and export large blob writing code to arbitrary file handle Nguyễn Thái Ngọc Duy
2012-02-27 17:29   ` Junio C Hamano
2012-02-27 21:50     ` Junio C Hamano
2012-02-27  7:55 ` [PATCH 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
2012-02-27 17:44   ` Junio C Hamano
2012-02-28  1:08     ` Nguyen Thai Ngoc Duy
2012-02-27  7:55 ` [PATCH 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
2012-02-27 18:00   ` Junio C Hamano
2012-02-27  7:55 ` [PATCH 06/11] index-pack --verify: skip sha-1 collision test Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 07/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 08/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 09/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 10/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
2012-02-27  7:55 ` [PATCH 11/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
2012-02-27 18:43 ` [PATCH 00/11] Large blob fixes Junio C Hamano
2012-02-28  1:23   ` Nguyen Thai Ngoc Duy
2012-03-04 12:59 ` Nguyễn Thái Ngọc Duy [this message]
2012-03-04 12:59   ` [PATCH v2 01/10] Add more large blob test cases Nguyễn Thái Ngọc Duy
2012-03-06  0:59     ` Junio C Hamano
2012-03-04 12:59   ` [PATCH v2 02/10] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 03/10] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
2012-03-04 23:12     ` Junio C Hamano
2012-03-05  2:42       ` Nguyen Thai Ngoc Duy
2012-03-04 12:59   ` [PATCH v2 04/10] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 05/10] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 06/10] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 07/10] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 08/10] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 09/10] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
2012-03-04 12:59   ` [PATCH v2 10/10] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 00/11] Large blob fixes Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 01/11] Add more large blob test cases Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 02/11] streaming: make streaming-write-entry to be more reusable Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 03/11] cat-file: use streaming interface to print blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 04/11] parse_object: special code path for blobs to avoid putting whole object in memory Nguyễn Thái Ngọc Duy
2012-03-06  0:57     ` Junio C Hamano
2012-03-05  3:43   ` [PATCH v3 05/11] show: use streaming interface for showing blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 06/11] index-pack: split second pass obj handling into own function Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 07/11] index-pack: reduce memory usage when the pack has large blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 08/11] pack-check: do not unpack blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 09/11] archive: support streaming large files to a tar archive Nguyễn Thái Ngọc Duy
2012-03-06  0:57     ` Junio C Hamano
2012-03-05  3:43   ` [PATCH v3 10/11] fsck: use streaming interface for writing lost-found blobs Nguyễn Thái Ngọc Duy
2012-03-05  3:43   ` [PATCH v3 11/11] update-server-info: respect core.bigfilethreshold Nguyễn Thái Ngọc Duy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1330865996-2069-1-git-send-email-pclouds@gmail.com \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).