git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: git@vger.kernel.org
Subject: [PATCH v3 0/6] Bulk check-in
Date: Thu,  1 Dec 2011 16:40:43 -0800	[thread overview]
Message-ID: <1322786449-25753-1-git-send-email-gitster@pobox.com> (raw)
In-Reply-To: <1322699263-14475-6-git-send-email-gitster@pobox.com>

I would declare that the earlier parts of the v2 that are about factoring
out various API pieces from existing code are basically completed, so they
are not part of this iteration.

The bulk-checkin patch from v2 has been tweaked a bit (deflate_to_pack()
initializes "already_hashed_to" pointer to 0, instead of the current file
position "seekback"), and then the rest of the series builds on top of it
to add a new in-pack encoding that I am tentatively calling "chunked".

The basic idea is to represent a large/huge blob as a concatenation of
smaller blobs. An entry in a pack in "chunked" representation records a
list of object names of the component blob objects.  The object name given
to such a blob is computed exactly the same way as before. In other words,
the name of a object does not depend on its representation; we hash "blob
<size> NUL" and the whole large blob contents to come up with its name. It
is *not* the hash of the component blob object names.

As can be seen in the log message of the "support chunked-object encoding"
patch, many pieces are still missing from this series and filling them
will be a long and tortuous journey. But we need to start somewhere.

I specifically excluded any heuristics to split large objects into chunks
in a self-synchronising way so that a small edit near the beginning of a
large blob results in a handful of new component blobs followed by the
same component blobs as used to represent the same blob before such an
edit, and I do not plan to work on that part myself. My impression from
listening Avery's plug for "bup" is that it is a solved problem; it should
be reasonably straightforward to lift the logic and plug it into the
framework presented here (once the codebase gets solid enough, that is).

After this series, the next step for me is likely to teach the streaming
interface about "chunked" objects, and then pack-objects to take notice
and reuse "chunked" representation when sending things out (which means
that sending a "chunked" encoded blob would involve sending the component
blobs it uses, among other things), but I expect that it will extend well
into next year.

Junio C Hamano (6):
  bulk-checkin: replace fast-import based implementation
  varint-in-pack: refactor varint encoding/decoding
  new representation types in the packstream
  bulk-checkin: allow the same data to be multiply hashed
  bulk-checkin: support chunked-object encoding
  chunked-object: fallback checkout codepaths

 Makefile               |    3 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |   34 ++---
 bulk-checkin.c         |  415 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   17 ++
 cache.h                |   13 ++-
 config.c               |    9 +
 environment.c          |    2 +
 pack-write.c           |   50 +++++-
 pack.h                 |    2 +
 sha1_file.c            |  150 +++++++++---------
 split-chunk.c          |   28 ++++
 t/t1050-large.sh       |  135 +++++++++++++++-
 zlib.c                 |    9 +-
 14 files changed, 760 insertions(+), 112 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h
 create mode 100644 split-chunk.c

-- 
1.7.8.rc4.177.g4d64

  parent reply	other threads:[~2011-12-02  0:40 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-01  0:27 [PATCH v2 0/5] Bulk Check-in Junio C Hamano
2011-12-01  0:27 ` [PATCH v2 1/5] write_pack_header(): a helper function Junio C Hamano
2011-12-01  0:27 ` [PATCH v2 2/5] create_tmp_packfile(): " Junio C Hamano
2011-12-01  0:27 ` [PATCH v2 3/5] finish_tmp_packfile(): " Junio C Hamano
2011-12-01  0:27 ` [PATCH v2 4/5] csum-file: introduce sha1file_checkpoint Junio C Hamano
2011-12-01  0:27 ` [PATCH v2 5/5] bulk-checkin: replace fast-import based implementation Junio C Hamano
2011-12-01  8:05   ` Nguyen Thai Ngoc Duy
2011-12-01 15:46     ` Junio C Hamano
2011-12-02  0:40   ` Junio C Hamano [this message]
2011-12-02  0:40     ` [PATCH v3 1/6] " Junio C Hamano
2011-12-02  0:40     ` [PATCH v3 2/6] varint-in-pack: refactor varint encoding/decoding Junio C Hamano
2011-12-02  0:40     ` [PATCH v3 3/6] new representation types in the packstream Junio C Hamano
2011-12-02  0:40     ` [PATCH v3 4/6] bulk-checkin: allow the same data to be multiply hashed Junio C Hamano
2011-12-02  0:40     ` [PATCH v3 5/6] bulk-checkin: support chunked-object encoding Junio C Hamano
2011-12-02  0:40     ` [PATCH v3 6/6] chunked-object: fallback checkout codepaths Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1322786449-25753-1-git-send-email-gitster@pobox.com \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).