From: "Shawn O. Pearce" <spearce@spearce.org>
To: Tarmigan <tarmigan+git@gmail.com>
Cc: Marek Zawirski <marek.zawirski@gmail.com>,
git@vger.kernel.org, Daniel Barkalow <barkalow@iabervon.org>,
Nick Hengeveld <nickh@reactrix.com>,
Johannes Schindelin <Johannes.Schindelin@gmx.de>
Subject: Re: git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?]
Date: Wed, 9 Jul 2008 03:22:22 +0000
Message-ID: <20080709032222.GA18520@spearce.org>
In-Reply-To: <905315640807072248w44ccdc4y2f1cf54a10c50c43@mail.gmail.com>
Tarmigan <tarmigan+git@gmail.com> wrote:
> (trimmed cc list to folks who've touched http-push.c)
> On Mon, Jul 7, 2008 at 9:19 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> > Using Marek's pack generation code I added support for push over
> > the dumb sftp:// and amazon-s3:// protocols, with the latter also
> > supporting transparent client side encryption.
>
> Can you describe the s3 support that you added? Did you do any
> locking when you pushed? The objects and packs seem likely to be
> naturally OK, but I was worried about refs/ and especially
> objects/info/packs and info/refs (fetch over http works currently out
> of the box with publicly accessible s3 repos).
It behaves like http push does in C git in that it is pretty
transparent to the end-user:
# Create a bucket using other S3 tools.
# I used jets3t's cockpit tool to create "gitney".
# Create a configuration file for jgit's S3 client:
#
$ touch ~/.jgit_s3_public
$ chmod 600 ~/.jgit_s3_public
$ cat >>~/.jgit_s3_public <<EOF
accesskey: AWSAccessKeyId
secretkey: AWSSecretAccessKey
acl: public
EOF
# Configure the remote and push
#
$ git remote add s3 amazon-s3://.jgit_s3_public@gitney/projects/egit.git/
$ jgit push s3 refs/heads/master
$ jgit push --tags s3
# Future incremental updates are just as easy
#
$ jgit push s3 refs/heads/master
(or)
$ git config remote.s3.push refs/heads/master
$ jgit push s3
This is now cloneable[*1*]:
$ git clone http://gitney.s3.amazonaws.com/projects/egit.git
Pushes are incremental, unlike the approach you outlined, which
causes a full re-upload of the repository. Consequently, subsequent
pushes use relatively little bandwidth.
A jgit amazon-s3 URL is organized as:
amazon-s3://$config@$bucket/$prefix
where:
$config = path to configuration in $GIT_DIR/$config or $HOME/$config
$bucket = name of the Amazon S3 bucket holding the objects
$prefix = prefix to apply to all objects, implicitly ends in "/"
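To make the URL layout concrete, here is a minimal Python sketch that splits such a URL into its three parts. It is not jgit's parser: the names `S3Url` and `parse_amazon_s3_url` are hypothetical, and leaving an empty prefix empty (rather than turning it into "/") is an assumption on my part.

```python
from typing import NamedTuple


class S3Url(NamedTuple):
    config: str  # name of the configuration file ($config)
    bucket: str  # Amazon S3 bucket name ($bucket)
    prefix: str  # key prefix ($prefix), normalized to end in "/"


def parse_amazon_s3_url(url: str) -> S3Url:
    """Split amazon-s3://$config@$bucket/$prefix into its parts."""
    scheme = "amazon-s3://"
    if not url.startswith(scheme):
        raise ValueError("not an amazon-s3 URL: " + url)
    rest = url[len(scheme):]

    # Everything before the first "@" names the configuration file.
    config, sep, location = rest.partition("@")
    if not sep or not config:
        raise ValueError("missing $config@ part: " + url)

    # The first path component is the bucket; the remainder is the prefix.
    bucket, _, prefix = location.partition("/")
    if not bucket:
        raise ValueError("missing bucket: " + url)

    # The prefix implicitly ends in "/". An empty prefix is left empty
    # here (assumption: it means the root of the bucket).
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return S3Url(config, bucket, prefix)


print(parse_amazon_s3_url(
    "amazon-s3://.jgit_s3_public@gitney/projects/egit.git/"))
```

For the remote configured above, this yields the config name ".jgit_s3_public", the bucket "gitney" and the prefix "projects/egit.git/".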
Amazon S3 atomically replaces a file, but offers no locking support.
Our crude remote VFS abstraction offers two types of file write
operations:
- Atomic write for small (in-memory) files <~1M
- Stream write for large (e.g. pack) files >~1M
In the S3 implementation both operations are basically the same
code, since even large streams are atomic updates. But in sftp://
our remote VFS writes to a "$path.lock" for the atomic case and
renames to "$path". This is not the same as a real lock, but it
prevents readers from seeing an in-progress update.
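The write-then-rename idea can be sketched against a local filesystem, used here only as a stand-in for the sftp:// remote. The function name `atomic_write` is hypothetical and this is not jgit's code; it just shows why a reader sees either the old file or the new one, never a partial write.

```python
import os


def atomic_write(path: str, data: bytes) -> None:
    """Write data to "<path>.lock", then rename it over <path>."""
    lock_path = path + ".lock"

    # Readers of `path` never see this file while it is being written.
    with open(lock_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # The rename publishes the complete content in one step. Note this
    # gives no mutual exclusion: two writers can still race each other.
    os.replace(lock_path, path)
```

As with the real thing, the ".lock" name only hides the in-progress file. It does not stop a second writer from doing the same thing at the same time.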
We are very careful to order the update operations to try and
avoid any sorts of race conditions:
- Delete loose refs which are being deleted.
- Upload new pack:
  - If same pack already exists:
    - (atomic) Remove it from objects/info/packs
    - Delete .idx
    - Delete .pack
  - Upload new .pack
  - Upload new .idx
  - (atomic) Add to front of objects/info/packs.
- (atomic) Create/update loose refs.
- (atomic) Update (if necessary) packed-refs.
- (atomic) Update info/refs.
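The ordering above can be sketched in Python against an in-memory dict standing in for the remote. This is an illustration under assumptions, not jgit's implementation: `push`, `store` and the argument names are hypothetical, the packed-refs step is left out, and info/refs is rebuilt from loose refs only.

```python
PACKS = "objects/info/packs"


def push(store, pack_name, pack_data, idx_data, updated_refs, deleted_refs=()):
    """Apply one push to `store` (path -> bytes); return the operation log."""
    log = []

    def put(path, data):
        store[path] = data
        log.append("put " + path)

    def delete(path):
        if path in store:
            del store[path]
            log.append("delete " + path)

    def write_pack_list(entries):
        put(PACKS, "".join(e + "\n" for e in entries).encode())

    # 1. Delete loose refs which are being deleted.
    for ref in deleted_refs:
        delete(ref)

    pack_path = "objects/pack/" + pack_name + ".pack"
    idx_path = "objects/pack/" + pack_name + ".idx"
    entry = "P " + pack_name + ".pack"
    entries = [e for e in store.get(PACKS, b"").decode().split("\n") if e]

    # 2. If the same pack already exists: unlist it first, then remove
    #    the .idx before the .pack.
    if pack_path in store:
        entries = [e for e in entries if e != entry]
        write_pack_list(entries)
        delete(idx_path)
        delete(pack_path)

    # 3. Upload .pack, then .idx, and only then advertise the pack at the
    #    front of objects/info/packs.
    put(pack_path, pack_data)
    put(idx_path, idx_data)
    write_pack_list([entry] + entries)

    # 4. Create/update loose refs (packed-refs is omitted in this sketch).
    for ref in sorted(updated_refs):
        put(ref, (updated_refs[ref] + "\n").encode())

    # 5. Update info/refs last, so it never names objects that are not
    #    yet reachable through a listed pack.
    lines = [store[p].decode().strip() + "\t" + p + "\n"
             for p in sorted(store) if p.startswith("refs/")]
    put("info/refs", "".join(lines).encode())
    return log
```

The point of the ordering is that each file a dumb reader consults is written only after everything it refers to is already in place.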
Since we are pushing over a dumb transport we assume readers
are pulling over the same dumb transport and thus rely upon the
objects/info/packs and info/refs files to obtain the listing of
what is available. This isn't true though for jgit's sftp://
and amazon-s3:// protocols as both support navigation of the
objects/pack and refs/{heads,tags} trees directly.
Locking on S3 is difficult. Multi-object writes may not sync
across the S3 cluster immediately. This means you can write to A,
then to B, then read A and see the old content still there, then
seconds later read A again and see the new content suddenly arrive.
It all depends upon when the replicas update and which replica
the load-balancer sends you into during the request. So despite
our attempts to order writes to S3 it is still possible for an S3
write to appear "late" and for a client to see a ref in info/refs
for which the corresponding pack is not listed in objects/info/packs.
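A toy model of that "late" write, in Python. It is purely illustrative and says nothing about how S3 actually replicates: `ReplicatedStore` is a hypothetical two-replica store where writes reach the second replica in whatever order `sync_one` is called.

```python
class ReplicatedStore:
    """Writes land on replica 0 at once; replica 1 catches up later."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []  # writes not yet applied to replica 1

    def put(self, key, value):
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def sync_one(self, index):
        """Apply one pending write, chosen by position, to replica 1."""
        key, value = self.pending.pop(index)
        self.replicas[1][key] = value

    def get(self, replica, key):
        return self.replicas[replica].get(key)


s = ReplicatedStore()
s.put("objects/info/packs", b"P pack-1234.pack\n")  # written first
s.put("info/refs", b"<sha1>\trefs/heads/master\n")  # written second
s.sync_one(1)  # the second write reaches replica 1 before the first

# A reader routed to replica 1 now sees the ref but not the pack listing.
print(s.get(1, "info/refs") is not None, s.get(1, "objects/info/packs"))
```

The writer did everything in the right order, yet a reader on the lagging replica still observes the ref before the pack list.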
However, this is the same mirroring problem that kernel.org has for
its git trees. I believe they are moved out to the public mirrors
by dumb rsync and not some sort of smart git-aware transport.
Since rsync is free to perform the writes in any order, kernel.org
has the same issue. ;-)
Actually I suspect the S3 replica update occurs more quickly than
the kernel.org mirrors update, so the window under which a client
can see out-of-order writes is likely smaller.
*1* You need a bug fix in jgit to correctly initialize HEAD during
push to a new, non-existent repository stored on S3. The patch
is going to be posted later this evening; it's still in my tree.
--
Shawn.