* git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?]
From: Tarmigan @ 2008-07-08  5:48 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Marek Zawirski, git, Daniel Barkalow, Nick Hengeveld,
	Johannes Schindelin

(trimmed cc list to folks who've touched http-push.c)

On Mon, Jul 7, 2008 at 9:19 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> Using Marek's pack generation code I added support for push over
> the dumb sftp:// and amazon-s3:// protocols, with the latter also
> supporting transparent client side encryption.
>
> I chose to add these features to jgit partly as an exercise to prove
> that Marek's code was built well enough to be reused for this task,
> partly because I wanted to backup some private personal repositories
> to Amazon S3, and partly to prove that multiple dumb transports
> could implement push support.

That sounds cool.  I've been looking into adding s3 push into cgit,
and was looking into modifying http-push.c, but got in over my head.
I had trouble making it fit into the DAV model that http-push
is built around, in part because s3 doesn't seem to support any
locking, and a lot of the http-push code revolves around
locking.

Can you describe the s3 support that you added?  Did you do any
locking when you pushed?  The objects and packs seem likely to be
naturally OK, but I was worried about refs/ and especially
objects/info/packs and info/refs (fetch over http currently works out
of the box with publicly accessible s3 repos).

Thanks,
Tarmigan

PS For anyone else who's interested, here are some instructions for
how I got started with s3 and git:

Start by creating an amazon s3 account

Next download and install "aws" from http://timkay.com/aws/
Set it up and install your amazon keys as specified.
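
(If I remember right, aws reads your keys from a two-line ~/.awssecret
file -- access key ID on the first line, secret key on the second --
but double-check its docs.  Obvious placeholders below:)

# ~/.awssecret  (chmod 600)
<your AWSAccessKeyId>
<your AWSSecretAccessKey>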

# I set up a bucket named git_test.
s3mkdir git_test

# Run this script to upload a git repo to amazon
# (run update-server-info first):
#!/bin/bash
for i in $(tree -fi --noreport git_test_orig.git) ; do
    # exclude directories
    if [ ! -d "$i" ] ; then
        echo "$i"
        s3put "x-amz-acl: public-read" git_test/ "$i"
    fi
done

# Then you can clone (really, feel free to clone from this url.  It
# should just work):
git clone http://s3.amazonaws.com/git_test/git_test_orig.git

# Experimenting like this on the git.git repo will set you back about US$0.05


* Re: git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?]
From: Mike Hommey @ 2008-07-08  5:56 UTC (permalink / raw)
  To: Tarmigan
  Cc: Shawn O. Pearce, Marek Zawirski, git, Daniel Barkalow,
	Nick Hengeveld, Johannes Schindelin

On Mon, Jul 07, 2008 at 10:48:59PM -0700, Tarmigan wrote:
> (trimmed cc list to folks who've touched http-push.c)
> 
> On Mon, Jul 7, 2008 at 9:19 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> > Using Marek's pack generation code I added support for push over
> > the dumb sftp:// and amazon-s3:// protocols, with the latter also
> > supporting transparent client side encryption.
> >
> > I chose to add these features to jgit partly as an exercise to prove
> > that Marek's code was built well enough to be reused for this task,
> > partly because I wanted to backup some private personal repositories
> > to Amazon S3, and partly to prove that multiple dumb transports
> > could implement push support.
> 
> That sounds cool.  I've been looking into adding s3 push into cgit,
> and was looking into modifying http-push.c, but got in over my head.
> I had trouble making it fit into the DAV model that http-push
> is built around, in part because s3 doesn't seem to support any
> locking, and a lot of the http-push code revolves around
> locking.
> 
> Can you describe the s3 support that you added?  Did you do any
> locking when you pushed?  The objects and packs seem likely to be
> naturally OK, but I was worried about refs/ and especially
> objects/info/packs and info/refs (fetch over http currently works out
> of the box with publicly accessible s3 repos).

FWIW, I'm starting to work again on the http backend overhaul.  My idea
is to provide a generic dumb-protocol vfs-like interface, so that other
dumb protocols could be built on top of it.

Mike


* Re: git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?]
From: Shawn O. Pearce @ 2008-07-09  3:22 UTC (permalink / raw)
  To: Tarmigan
  Cc: Marek Zawirski, git, Daniel Barkalow, Nick Hengeveld,
	Johannes Schindelin

Tarmigan <tarmigan+git@gmail.com> wrote:
> (trimmed cc list to folks who've touched http-push.c)
> On Mon, Jul 7, 2008 at 9:19 PM, Shawn O. Pearce <spearce@spearce.org> wrote:
> > Using Marek's pack generation code I added support for push over
> > the dumb sftp:// and amazon-s3:// protocols, with the latter also
> > supporting transparent client side encryption.
> 
> Can you describe the s3 support that you added?  Did you do any
> locking when you pushed?  The objects and packs seem likely to be
> naturally OK, but I was worried about refs/ and especially
> objects/info/packs and info/refs (fetch over http currently works out
> of the box with publicly accessible s3 repos).

It behaves like http push does in C git in that it is pretty
transparent to the end-user:

  # Create a bucket using other S3 tools.
  # I used jets3t's cockpit tool to create "gitney".

  # Create a configuration file for jgit's S3 client:
  #
  $ touch ~/.jgit_s3_public
  $ chmod 600 ~/.jgit_s3_public
  $ cat >>~/.jgit_s3_public <<EOF
  accesskey: AWSAccessKeyId
  secretkey: AWSSecretAccessKey
  acl: public
  EOF

  # Configure the remote and push
  #
  $ git remote add s3 amazon-s3://.jgit_s3_public@gitney/projects/egit.git/
  $ jgit push s3 refs/heads/master
  $ jgit push --tags s3

  # Future incremental updates are just as easy
  #
  $ jgit push s3 refs/heads/master

  (or)
  $ git config remote.s3.push refs/heads/master
  $ jgit push s3

This is now cloneable[*1*]:

  $ git clone http://gitney.s3.amazonaws.com/projects/egit.git

Pushes are incremental, unlike the approach you outlined, which
re-uploads the full repository.  Consequently there is relatively
little bandwidth usage during subsequent pushes.

A jgit amazon-s3 URL is organized as:

  amazon-s3://$config@$bucket/$prefix

where:

  $config = path to configuration in $GIT_DIR/$config or $HOME/$config
  $bucket = name of the Amazon S3 bucket holding the objects
  $prefix = prefix to apply to all objects, implicitly ends in "/"
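
For example, the remote added above breaks down as:

  amazon-s3://.jgit_s3_public@gitney/projects/egit.git/
    $config = .jgit_s3_public    (found here as ~/.jgit_s3_public)
    $bucket = gitney
    $prefix = projects/egit.git/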


Amazon S3 atomically replaces a file, but offers no locking support.
Our crude remote VFS abstraction offers two types of file write
operations:

  - Atomic write for small (in-memory) files <~1M
  - Stream write for large (e.g. pack) files >~1M

In the S3 implementation both operations are basically the same
code, since even large streams are atomic updates.  But in sftp://
our remote VFS writes to a "$path.lock" for the atomic case and
renames it to "$path".  This is not the same as a real lock, but it
prevents readers from seeing an in-progress update.
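
Very roughly, the atomic case over sftp is the moral equivalent of
this sketch (illustrative only, not the actual jgit code; the
RemoteVfs class and its helpers are made up):

  import java.io.IOException;

  abstract class RemoteVfs {
      abstract void writeFully(String path, byte[] data) throws IOException;
      abstract void rename(String from, String to) throws IOException;

      // Upload the content as "$path.lock" first, then rename it over
      // "$path", so readers never observe a half-written file.
      void writeAtomic(String path, byte[] data) throws IOException {
          writeFully(path + ".lock", data);
          rename(path + ".lock", path);
      }
  }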

We are very careful to order the update operations to try to
avoid any sort of race condition:

  - Delete loose refs which are being deleted.
  - Upload new pack:
    - If same pack already exists:
      - (atomic) Remove it from objects/info/packs
      - Delete .idx
      - Delete .pack
    - Upload new .pack
    - Upload new .idx
    - (atomic) Add to front of objects/info/packs.
  - (atomic) Create/update loose refs.
  - (atomic) Update (if necessary) packed-refs.
  - (atomic) Update info/refs.

Since we are pushing over a dumb transport we assume readers
are pulling over the same dumb transport and thus rely upon the
objects/info/packs and info/refs files to obtain the listing of
what is available.  Strictly speaking this isn't required for jgit's
sftp:// and amazon-s3:// protocols, as both support navigating the
objects/pack and refs/{heads,tags} trees directly.
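
(For anyone unfamiliar with the dumb transports, those two files look
roughly like this -- the object names below are made up:)

  # objects/info/packs: one "P <pack-name>" line per available pack
  P pack-66c1a2b8....pack

  # info/refs: "<sha1> TAB <refname>" for every advertised ref
  66c1a2b8...	refs/heads/master
  66c1a2b8...	refs/tags/v0.4.0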

Locking on S3 is difficult.  Multi-object writes may not sync
across the S3 cluster immediately.  This means you can write to A,
then to B, then read A and see the old content still there, then
seconds later read A again and see the new content suddenly arrive.
It all depends upon when the replicas update and which replica
the load-balancer routes your request to.  So despite our attempts
to order writes to S3 it is still possible for an S3 write to
appear "late" and for a client to see a ref in info/refs for which
the corresponding pack is not listed in objects/info/packs.

However, this is the same mirroring problem that kernel.org has for
its git trees.  I believe they are moved out to the public mirrors
by dumb rsync and not some sort of smart git-aware transport.
As rsync is free to apply the writes out of order, kernel.org has
the same issue.  ;-)

Actually I suspect the S3 replica update occurs more quickly than
the kernel.org mirrors update, so the window under which a client
can see out-of-order writes is likely smaller.


*1* You need a bug fix in jgit to correctly initialize HEAD during
    push to a new, non-existent repository stored on S3.  The patch
    is going to be posted later this evening; it's still in my tree.

-- 
Shawn.


* Re: git push to amazon s3 [was: [GSoC] What is status of Git's Google Summer of Code 2008 projects?]
From: Shawn O. Pearce @ 2008-07-09  3:26 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Tarmigan, Marek Zawirski, git, Daniel Barkalow, Nick Hengeveld,
	Johannes Schindelin

Mike Hommey <mh@glandium.org> wrote:
> FWIW, I'm starting to work again on the http backend overhaul.  My idea
> is to provide a generic dumb-protocol vfs-like interface, so that other
> dumb protocols could be built on top of it.

jgit has a vfs abstraction for the different dumb protocols.  Not sure
if you would find it of any value to read, since we can also rely on a
number of Java standard abstractions like InputStream/OutputStream,
but here it is:

  WalkRemoteObjectDatabase:
  http://repo.or.cz/w/egit.git?a=blob;f=org.spearce.jgit/src/org/spearce/jgit/transport/WalkRemoteObjectDatabase.java;h=915faac9eb85e59c0ed2c08b9631d03cbc4c6bf8;hb=8d085723b260f3b51a70d11b723608779160b090

Thus far this abstraction has worked for sftp:// and amazon-s3://.
WebDAV may make it more complicated, since locking is available
(and is something we would want to use to protect writes), but S3
uses HTTP PUT to upload content much like DAV does, so there wouldn't
be too many changes needed to actually implement DAV support.
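
To give a flavor of its shape (paraphrased from memory and heavily
simplified -- the real signatures are in the file above), it boils
down to roughly:

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import java.util.Collection;

  abstract class WalkRemoteObjectDatabase {
      abstract Collection<String> getPackNames() throws IOException;   // list objects/info/packs
      abstract InputStream open(String path) throws IOException;       // read any file
      abstract OutputStream writeFile(String path) throws IOException; // stream write (packs)
      abstract void writeFile(String path, byte[] b) throws IOException; // small atomic write
      abstract void deleteFile(String path) throws IOException;
  }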

-- 
Shawn.

