git.vger.kernel.org archive mirror
* is hosting a read-mostly git repo on a distributed file system practical?
@ 2011-04-13  1:40 Jon Seymour
  2011-04-13  2:06 ` Shawn Pearce
  0 siblings, 1 reply; 5+ messages in thread
From: Jon Seymour @ 2011-04-13  1:40 UTC (permalink / raw)
  To: Git Mailing List

Is it practical to host a read-mostly git repo on a WAN-based
distributed file system?

The idea is that most developers would use the DFS-based repo to track
the tip of the development stream, but only the integrator would
publish updates to the DFS-based repo.

As such, the need to repack the DFS-based repo will be somewhat, but
not completely, reduced.

Is this going to be practical, or are whole-of-repo operations
eventually going to kill me because of latency and bandwidth issues
associated with use of the DFS?

Are there things I can do with the git configuration (such as limiting
repacking behaviour) that will help?

jon.


* Re: is hosting a read-mostly git repo on a distributed file system practical?
  2011-04-13  1:40 is hosting a read-mostly git repo on a distributed file system practical? Jon Seymour
@ 2011-04-13  2:06 ` Shawn Pearce
  2011-04-13  2:29   ` Jon Seymour
  0 siblings, 1 reply; 5+ messages in thread
From: Shawn Pearce @ 2011-04-13  2:06 UTC (permalink / raw)
  To: Jon Seymour; +Cc: Git Mailing List

On Tue, Apr 12, 2011 at 21:40, Jon Seymour <jon.seymour@gmail.com> wrote:
> Is it practical to host a read-mostly git repo on a WAN-based
> distributed file system?

Usually not. But test it and find out?

> The idea is that most developers would use the DFS-based repo to track
> the tip of the development stream, but only the integrator would
> publish updates to the DFS-based repo.
>
> As such, the need to repack the DFS-based repo will be somewhat, but
> not completely, reduced.

Serving git clone is basically a repack operation when run over
git://, http:// or SSH. If the DFS were mounted as a local filesystem,
git clone would turn into a cpio-style copy of the directory contents.
I'm not sure if that is what you are suggesting to do here or not.
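
Concretely (hypothetical paths):

    # over a transport, the serving side builds a pack stream on the
    # fly -- effectively a repack:
    git clone ssh://somehost/srv/project.git

    # from a locally mounted path, git copies (or hard-links) the
    # object files as they sit on disk:
    git clone /mnt/dfs/project.git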

> Is this going to be practical, or are whole-of-repo operations
> eventually going to kill me because of latency and bandwidth issues
> associated with use of the DFS?

Latency is a problem. The Git pack file has decent locality, but there
are some things that could still stand to be improved. It really
doesn't work well unless the pack is held completely in the machine's
memory.

-- 
Shawn.


* Re: is hosting a read-mostly git repo on a distributed file system practical?
  2011-04-13  2:06 ` Shawn Pearce
@ 2011-04-13  2:29   ` Jon Seymour
  0 siblings, 0 replies; 5+ messages in thread
From: Jon Seymour @ 2011-04-13  2:29 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Git Mailing List

On Wed, Apr 13, 2011 at 12:06 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Tue, Apr 12, 2011 at 21:40, Jon Seymour <jon.seymour@gmail.com> wrote:
>> The idea is that most developers would use the DFS-based repo to track
>> the tip of the development stream, but only the integrator would
>> publish updates to the DFS-based repo.
>>
>> As such, the need to repack the DFS-based repo will be somewhat, but
>> not completely, reduced.
>
> Serving git clone is basically a repack operation when run over
> git://, http:// or SSH. If the DFS were mounted as a local filesystem,
> git clone would turn into a cpio-style copy of the directory contents.
> I'm not sure if that is what you are suggesting to do here or not.
>

All clients, including the client that occasionally updates the
read-mostly repo, would mount the DFS as a local file system. My
environment is one where DFS is easy, but establishing a shared
server is more complicated (i.e. bureaucratic).

I guess I am prepared to put up with a slow initial clone (my
developer pool will be relatively stable and pulling from a
peer via git: or ssh: will usually be acceptable for this occasional need).

What I am most interested in is the incremental performance. Can my
integrator, who occasionally updates the shared repo, avoid
automatically repacking it (and hence taking the whole-of-repo latency
hit), and can my developers, who are pulling the updates, do so
reliably without a whole-of-repo scan?

>> Is this going to be practical, or are whole-of-repo operations
>> eventually going to kill me because of latency and bandwidth issues
>> associated with use of the DFS?
>
> Latency is a problem. The Git pack file has decent locality, but there
> are some things that could still stand to be improved. It really
> doesn't work well unless the pack is held completely in the machine's
> memory.

I understand that avoiding repacking for an extended period brings its
own problems, so I guess I could live with a local repack followed by
an rsync transfer to re-initialize the shared remote, if this were
warranted.
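
Something like this, I imagine (untested; hypothetical paths; assumes
nobody fetches while the sync runs):

    # mirror locally, repack aggressively, then lay the result back
    # over the shared repo
    git clone --mirror /mnt/dfs/project.git /tmp/project.git
    cd /tmp/project.git && git repack -a -d
    rsync -a --delete /tmp/project.git/ /mnt/dfs/project.git/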

I agree, there is no substitute for testing this, but experience of
others can be helpful in deciding whether it is even worth attempting.

>
> --
> Shawn.
>


* Re: is hosting a read-mostly git repo on a distributed file system practical?
@ 2011-04-13  3:47 George Spelvin
  2011-04-13  4:57 ` Jon Seymour
  0 siblings, 1 reply; 5+ messages in thread
From: George Spelvin @ 2011-04-13  3:47 UTC (permalink / raw)
  To: jon.seymour; +Cc: git, linux, spearce

> All clients, including the client that occasionally updates the
> read-mostly repo, would mount the DFS as a local file system. My
> environment is one where DFS is easy, but establishing a shared
> server is more complicated (i.e. bureaucratic).

> I guess I am prepared to put up with a slow initial clone (my developer
> pool will be relatively stable and pulling from a peer via git: or ssh:
> will usually be acceptable for this occasional need).

> What I am most interested in is the incremental performance. Can my
> integrator, who occasionally updates the shared repo, avoid
> automatically repacking it (and hence taking the whole-of-repo latency
> hit), and can my developers, who are pulling the updates, do so
> reliably without a whole-of-repo scan?

I think the answers are yes, but I have to make a couple of things clear:
* You can *definitely* control repack behaviour.  .keep files are the
  simplest way to prevent repacking (there is a sketch of this towards
  the end of this message).
* Are you talking about hosting only a "bare" repository, or one with
  the unpacked source tree as well?  If you try to run git commands on
  a large network-mounted source tree, things can get more than a bit
  sluggish; git recursively stats the whole tree fairly frequently.
  (There are ways to prevent that, notably core.ignoreStat, but they
  make it less friendly.)
* You can clone from a repository mounted on the file system just as
  easily as you can from a network server.  So there's no need to set
  up a server if you find it inconvenient.
* Normally, the developers will clone from the integrator's repository
  before doing anything, so the source tree, and any changes they make,
  will be local.
* A local clone will try to hard-link the object files.  I think it
  will copy them if that fails, or you can force copying with "git
  clone --no-hardlinks".  For a more space-saving version, try "git
  clone -s", which points the new repository at the upstream object
  store (via .git/objects/info/alternates).  It's a git concept, not
  a filesystem link, so repacking upstream won't do any harm, but you
  Must Not delete objects from the upstream repository or you'll
  create dangling references in the downstream.  (See the sketch
  after this list.)
* If using the objects on the DFS mount turns out to be slow, you can
  just do the initial clone with --no-hardlinks.  Then the developers'
  day-to-day work is all local.
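
A minimal sketch of the above (paths are hypothetical; gc.auto is
another repack knob besides .keep files, with 0 disabling automatic
repacking):

    # in the shared bare repo: never repack automatically
    git config gc.auto 0

    # clone from the mounted path, forcing real copies instead of
    # hard links:
    git clone --no-hardlinks /mnt/dfs/project.git

    # or borrow the upstream object store instead of copying it
    # (recorded in .git/objects/info/alternates):
    git clone -s /mnt/dfs/project.git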

Indeed, you could easily do everything via DFS.  Give everyone a personal
"public" repo to push to, which is read-only to everyone else, and let
the integrator pull from those.
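
For instance (hypothetical layout; exact permissions depend on the
DFS):

    # one bare repo per developer on the mount, readable by everyone
    git init --bare /mnt/dfs/devs/alice.git
    chmod -R a+rX /mnt/dfs/devs/alice.git

    # alice pushes to it; the integrator just fetches the path
    git remote add alice /mnt/dfs/devs/alice.git
    git fetch alice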

> I understand that avoiding repacking for an extended period brings its
> own problems, so I guess I could live with a local repack followed by
> an rsync transfer to re-initial the shared remote, if this was
> warranted.

Normally, you do a generational garbage collection thing.  You repack the
current work frequently (which is fast to do, and to share, because
it's small), and the larger, slower, older packs less frequently.
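
Roughly (the pack name is a placeholder):

    # pin the big historical pack so repack leaves it alone
    touch objects/pack/pack-<sha1>.keep

    # frequent and cheap: roll new loose objects into a small pack
    git repack -d

    # occasional and expensive: repack everything except kept packs
    git repack -a -d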

Anyway, I hope this helps!


* Re: is hosting a read-mostly git repo on a distributed file system practical?
  2011-04-13  3:47 George Spelvin
@ 2011-04-13  4:57 ` Jon Seymour
  0 siblings, 0 replies; 5+ messages in thread
From: Jon Seymour @ 2011-04-13  4:57 UTC (permalink / raw)
  To: George Spelvin; +Cc: git, spearce

On Wed, Apr 13, 2011 at 1:47 PM, George Spelvin <linux@horizon.com> wrote:

> I think the answers are yes, but I have to make a couple of things clear:
> * You can *definitely* control repack behaviour.  .keep files are the
>  simplest way to prevent repacking.

Good.

> * Are you talking about hosting only a "bare" repository, or one with
>  the unpacked source tree as well?  If you try to run git commands on
>  a large network-mounted source tree, things can get more than a bit
>  sluggish; git recursively stats the whole tree fairly frequently.
>  (There are ways to prevent that, notably core.ignoreStat, but they
>  make it less friendly.)

Bare. Developers use local disk for local repos and working tree.

> * You can clone from a repository mounted on the file system just as
>  easily as you can from a network server.  So there's no need to set
>  up a server if you find it inconvenient.

Are there advantages to using rsync for the initial clone? Will I get
better restartability if the network is less than 100% reliable?

I do remember trying to use a DFS-hosted repo in the past, before I
understood pack management properly, and I seem to recall issues with
network reliability.
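
I was imagining something like (untested):

    # --partial keeps half-transferred files, so an interrupted copy
    # can be resumed by re-running the same command
    rsync -a --partial /mnt/dfs/project.git/ project.git/
    cd project.git && git fsck   # sanity check after a flaky copy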

> Indeed, you could easily do everything via DFS.  Give everyone a personal
> "public" repo to push to, which is read-only to everyone else, and let
> the integrator pull from those.
>

I'll probably use ssh-secured peer-to-peer transfers for publishing
purposes. The main thing I want the DFS-hosted repo for is to provide
a single, always-up, go-to point for the shared tag set.
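
A developer could then refresh tags at any time with something like
(path hypothetical):

    git fetch --tags /mnt/dfs/project.git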

>
> Anyway, I hope this helps!
>

Yep, thank you.

jon.

