Re: git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?]

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Shawn O. Pearce" <spearce@spearce.org>
To: Tarmigan <tarmigan+git@gmail.com>
Cc: Marek Zawirski <marek.zawirski@gmail.com>,
	git@vger.kernel.org, Daniel Barkalow <barkalow@iabervon.org>,
	Nick Hengeveld <nickh@reactrix.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>
Subject: Re: git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?]
Date: Wed, 9 Jul 2008 03:22:22 +0000	[thread overview]
Message-ID: <20080709032222.GA18520@spearce.org> (raw)
In-Reply-To: <905315640807072248w44ccdc4y2f1cf54a10c50c43@mail.gmail.com>

Tarmigan <tarmigan+git@gmail.com> wrote:
> (trimmed cc list to folks who've touched http-push.c)
> On Mon, Jul 7, 2008 at 9:19 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> > Using Marek's pack generation code I added support for push over
> > the dumb sftp:// and amazon-s3:// protocols, with the latter also
> > supporting transparent client side encryption.
> 
> Can you describe the s3 support that you added?  Did you do any
> locking when you pushed?  The objects and packs seem likely to be
> naturally OK, but I was worried about refs/ and especially
> objects/info/packs and info/refs (fetch over http works currently out
> of the box with publicly accessable s3 repos).

It behaves like http push does in C git in that it is pretty
transparent to the end-user:

  # Create a bucket using other S3 tools.
  # I used jets3t's cockpit tool to creat "gitney".

  # Create a configuration file for jgit's S3 client:
  #
  $ touch ~/.jgit_s3_public
  $ chmod 600 ~/.jgit_s3_public
  $ cat >>~/.jgit_s3_public
  accesskey: AWSAccessKeyId
  secretkey: AWSSecretAccessKey
  acl: public
  EOF

  # Configure the remote and push
  #
  $ git remote add s3 amazon-s3://.jgit_s3_public@gitney/projects/egit.git/
  $ jgit push s3 refs/heads/master
  $ jgit push --tags s3

  # Future incremental updates are just as easy
  #
  $ jgit push s3 refs/heads/master

  (or)
  $ git config remote.s3.push refs/heads/master
  $ jgit push s3

This is now cloneable[*1*]:

  $ git clone http://gitney.s3.amazonaws.com/projects/egit.git

Pushes are incremental, rather than the approach you outlined, as
that causes a full re-upload of the repository.  Consequently there
is relatively little bandwidth usage during subsequent pushes.

A jgit amazon-s3 URL is organized as:

  amazon-s3://$config@$bucket/$prefix

where:

  $config = path to configuration in $GIT_DIR/$config or $HOME/$config
  $bucket = name of the Amazon S3 bucket holding the objects
  $prefix = prefix to apply to all objects, implicitly ends in "/"

Amazon S3 atomically replaces a file, but offers no locking support.
Our crude remote VFS abstraction offers two types of file write
operations:

  - Atomic write for small (in-memory) files <~1M
  - Stream write for large (e.g. pack) files >~1M

In the S3 implementation both operations are basically the same
code, since even large streams are atomic updates.  But in sftp://
our remote VFS writes to a "$path.lock" for the atomic case and
renames to "$path".  This is not the same as a real lock, but it
avoids readers from seeing an in-progress update.

We are very carefully to order the update operations to try and
avoid any sorts of race conditions:

  - Delete loose refs which are being deleted.
  - Upload new pack:
    - If same pack already exists:
      - (atomic) Remove it from objects/info/packs
      - Delete .idx
      - Delete .pack
    - Upload new .pack
    - Upload new .idx
    - (atomic) Add to front of objects/info/packs.
  - (atomic) Create/update loose refs.
  - (atomic) Update (if necessary) packed-refs.
  - (atomic) Update info/refs.

Since we are pushing over a dumb transport we assume readers
are pulling over the same dumb transport and thus rely upon the
objects/info/packs and info/refs files to obtain the listing of
what is available.  This isn't true though for jgit's sftp://
and amazon-s3:// protocols as both support navigation of the
objects/packs and refs/{heads,tags} tree directly.

Locking on S3 is difficult.  Multi-object writes may not sync
across the S3 cluster immediately.  This means you can write to A,
then to B, then read A and see the old content still there, then
seconds later read A again and see the new content suddenly arrive.
It all depends upon when the replicas update and which replica
the load-balancer sends you into during the request.  So despite
our attempts to order writes to S3 it is still possible for an S3
write to appear "late" and for a client to see a ref in info/refs
for which the corresponding pack is not listed in object/info/packs.

However, this is the same mirroring problem that kernel.org has for
its git trees.  I believe they are moved out to the public mirrors
by dumb rsync and not some sort of smart git-aware transport.
As rsync is free to order the writes out of order kernel.org has
the same issue.  ;-)

Actually I suspect the S3 replica update occurs more quickly than
the kernel.org mirrors update, so the window under which a client
can see out-of-order writes is likely smaller.

*1* You need a bug fix in jgit to correctly initialize HEAD during
    push to a new, non-existant repository stored on S3.  The patch
    is going to be posted later this evening, its still in my tree.

-- 
Shawn.

     prev parent reply	other threads:[~2008-07-09  3:26 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-08  5:48 git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?] Tarmigan
2008-07-08  5:56 ` Mike Hommey
2008-07-09  3:26   ` Shawn O. Pearce
2008-07-09  3:22 ` Shawn O. Pearce [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080709032222.GA18520@spearce.org \
    --to=spearce@spearce.org \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=barkalow@iabervon.org \
    --cc=git@vger.kernel.org \
    --cc=marek.zawirski@gmail.com \
    --cc=nickh@reactrix.com \
    --cc=tarmigan+git@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).