* Synchronous replication on push
@ 2024-11-02 2:06 Taylor R Campbell
2024-11-02 10:09 ` Matěj Cepl
2024-11-04 23:47 ` Jeff King
0 siblings, 2 replies; 10+ messages in thread
From: Taylor R Campbell @ 2024-11-02 2:06 UTC (permalink / raw)
To: git
Suppose I have a front end repository:
user@frontend.example.com:/repo.git
Whenever I push anything to it, I want the push -- that is, all the
objects, and all the ref updates -- to be synchronously replicated to
another remote repository, the back end:
git@backend.example.com:/repo.git
If this replication fails -- whether because the back end is down, or
because the front end crashed and rolled back to an earlier state, or
because the back end has been updated independently and rejects a
force push, or whatever -- I want the push to fail. But, absent these
failures, I want frontend and backend to store the same set of objects
and refs.
(Actually, I want to replicate it to a quorum of multiple back ends
with a three-phase commit protocol -- but I'll start with the
single-replica case for simplicity.)
How can I do this with git?
One option, of course, is to use a replicated file system like
glusterfs, or replicated block store like DRBD. But that
(a) likely requires a lot more round-trips than git push/send-pack,
(b) can't be used for replication to other git hosts like Github, and
(c) can't be used for other remote transports like git-cinnabar.
So I'd like to do this at the git level, not at the file system or
block store level.
Here are some approaches I've tried:
1. `git clone --mirror -o backend git@backend.example.com:/repo.git'
to create the front end repository, plus the following pre-receive
hook in the front end:
#!/bin/sh
exec git push backend
This doesn't work because the pre-receive hook runs in the
quarantine environment, and `git push' wants to update
`refs/heads/main', which is forbidden in the quarantine
environment.
(However, git push to frontend doesn't actually fail with nonzero
exit status -- it prints an error message, `ref updates forbidden
inside quarantine environment', but exits with status 0.)
But maybe the ref update is harmless in this environment.
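For reference, the setup for this variant is roughly the following (a
sketch only; the frontend repository path /repo.git and the hostnames are
the placeholders from above):

#!/bin/sh
# Sketch of the setup for approach (1); paths are placeholders.
set -e

# Mirror-clone the backend to create the front end repository.
git clone --mirror -o backend git@backend.example.com:/repo.git /repo.git

# Install the pre-receive hook that pushes everything back out.
cat >/repo.git/hooks/pre-receive <<'EOF'
#!/bin/sh
exec git push backend
EOF
chmod +x /repo.git/hooks/pre-receive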
2. Same as (1), but the pre-receive hook is:
#!/bin/sh
unset GIT_QUARANTINE_PATH
exec git push backend
This doesn't work because `git push' in the pre-receive hook
doesn't find anything it needs to push -- the ref update hasn't
happened yet.
3. Same as (1), but the pre-receive hook assembles a command line of
exec git push backend ${new0}:${ref0} ${new1}:${ref1} ...,
with all the ref updates passed on stdin (ignoring the old values).
This fails because `--mirror can't be combined with refspecs'.
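For concreteness, the stdin-driven hook I have in mind for (3)-(5) looks
roughly like this (a sketch only; as noted, it ignores the old values,
and ref deletions -- where $new is all zeros -- would need special
handling):

#!/bin/sh
# pre-receive: stdin carries one "<old> <new> <refname>" line per update.
# Build "<new>:<refname>" refspecs and push them all to the backend.
set -e

refspecs=
while read old new ref
do
	refspecs="$refspecs $new:$ref"
done

# $refspecs is deliberately unquoted so each refspec is its own argument.
exec git push backend $refspecs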
4. Same as (3), but remote.backend.mirror is explicitly disabled after
`git clone --mirror' finishes.
On push to the primary, this prints an error message
remote: error: update_ref failed for ref 'refs/heads/main': ref updates forbidden inside quarantine environment
but somehow the push succeeds in spite of this message, and the
primary and replica both get updated.
And if I inject an error on push to the replica, by making the
replica's pre-receive hook fail with nonzero exit status, neither
primary nor replica is updated and the push fails with an error
message (`pre-receive hook declined') _and_ nonzero exit status --
as desired.
So maybe this actually works, but the error message on _successful_
pushes is unsettling!
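(For reference, `explicitly disabled' here just means something along the
lines of

git config remote.backend.mirror false

once the mirror clone has created the remote named backend, so that git
push no longer implies --mirror for that remote.)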
5. Same as (1), but the pre-receive hook assembles a command line of
exec git send-pack git@backend.example.com:/repo.git \
${new0}:${ref0} ${new1}:${ref1} ...
with all the ref updates passed on stdin (ignoring the old values).
This seems to work, and it propagates errors injected on push to
the replica, but it is limited to local or ssh remotes, as far as I
can tell -- it does not appear that git-send-pack works with custom
remote transports.
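(Concretely, this is the same stdin-driven hook sketched after (3), with
the final exec swapped for something like

exec git send-pack git@backend.example.com:/repo.git $refspecs

which does not attempt any local tracking-ref updates on the frontend.)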
Perhaps using mirror clones is the wrong approach here, and perhaps I
should instead explicitly create tracking branches in the primary that
are only updated if the push succeeds -- but this will still require
getting around the quarantine restrictions on git push in the
pre-receive hook.
Is there a way to achieve this (ideally, with plausible extension to a
three-phase commit protocol) that doesn't trigger unsettling nonfatal
error messages and that works with custom remote transports?
* Re: Synchronous replication on push
2024-11-02 2:06 Synchronous replication on push Taylor R Campbell
@ 2024-11-02 10:09 ` Matěj Cepl
2024-11-02 13:35 ` Taylor R Campbell
2024-11-04 23:47 ` Jeff King
1 sibling, 1 reply; 10+ messages in thread
From: Matěj Cepl @ 2024-11-02 10:09 UTC (permalink / raw)
To: Taylor R Campbell, git

On Sat Nov 2, 2024 at 3:06 AM CET, Taylor R Campbell wrote:
> Suppose I have a front end repository:
>
> user@frontend.example.com:/repo.git
>
> Whenever I push anything to it, I want the push -- that is, all the
> objects, and all the ref updates -- to be synchronously replicated to
> another remote repository, the back end:
>
> git@backend.example.com:/repo.git

https://stackoverflow.com/q/14290113/164233

--
http://matej.ceplovi.cz/blog/, @mcepl@floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8

We understand our competition isn’t with Caldera or SuSE--our
competition is with Microsoft.
  -- Bob Young of Red Hat
     http://www.linuxjournal.com/article/3553
* Re: Synchronous replication on push
2024-11-02 10:09 ` Matěj Cepl
@ 2024-11-02 13:35 ` Taylor R Campbell
2024-11-02 14:49 ` brian m. carlson
0 siblings, 1 reply; 10+ messages in thread
From: Taylor R Campbell @ 2024-11-02 13:35 UTC (permalink / raw)
To: Matěj Cepl; +Cc: git

> Date: Sat, 02 Nov 2024 11:09:52 +0100
> From: Matěj Cepl <mcepl@cepl.eu>
>
> On Sat Nov 2, 2024 at 3:06 AM CET, Taylor R Campbell wrote:
> > Suppose I have a front end repository:
> >
> > user@frontend.example.com:/repo.git
> >
> > Whenever I push anything to it, I want the push -- that is, all the
> > objects, and all the ref updates -- to be synchronously replicated to
> > another remote repository, the back end:
> >
> > git@backend.example.com:/repo.git
>
> https://stackoverflow.com/q/14290113/164233

Thanks, but that is about how to configure my local repository to use
multiple remotes for a single git push command, which is not what I'm
asking about.

I'm asking about how to configure a _single_ frontend remote, from the
perspective of developers who are pushing from their development
workstations, so that it replicates to one or many backend stores.
This is, for example, the usage model of Github's proprietary
implementation.
* Re: Synchronous replication on push
2024-11-02 13:35 ` Taylor R Campbell
@ 2024-11-02 14:49 ` brian m. carlson
2024-11-04 13:35 ` Taylor R Campbell
0 siblings, 1 reply; 10+ messages in thread
From: brian m. carlson @ 2024-11-02 14:49 UTC (permalink / raw)
To: Taylor R Campbell; +Cc: Matěj Cepl, git

On 2024-11-02 at 13:35:11, Taylor R Campbell wrote:
> I'm asking about how to configure a _single_ frontend remote, from the
> perspective of developers who are pushing from their development
> workstations, so that it replicates to one or many backend stores.
> This is, for example, the usage model of Github's proprietary
> implementation.

I don't think there's built-in functionality for this and I'm not sure
that it can be done without additional software.

If you really wanted to try to do this with out of the box Git, you
could create a `pre-receive` hook that did policy controls and then on
success, took all of the objects from the quarantine and rsynced them
(without overwriting) to the remote store, and then use the
`reference-transaction` hook to replicate the reference transaction to
the remote side via SSH or something. I haven't tested this, so it
might or might not work, but you could try it.

Note that GitHub has a separate service that does the replication and
intercepts the ref update to send it through the three-phase commit, so
they don't rely on features of core Git to implement this
functionality.
--
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA
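A minimal sketch of the pair of hooks brian describes (untested; it
assumes ssh and rsync access to a backend at
git@backend.example.com:/repo.git, and deletions -- where $new is all
zeros -- would need special-casing):

#!/bin/sh
# pre-receive sketch: copy the quarantined objects to the backend
# before any refs move, without overwriting anything already there.
set -e
if [ -n "$GIT_QUARANTINE_PATH" ]
then
	rsync -a --ignore-existing "$GIT_QUARANTINE_PATH"/ \
		git@backend.example.com:/repo.git/objects/
fi

#!/bin/sh
# reference-transaction sketch: when the local transaction reaches the
# "prepared" state, replay the queued updates on the backend; a nonzero
# exit at this state aborts the local transaction.
[ "$1" = prepared ] || exit 0
while read old new ref
do
	ssh git@backend.example.com \
		git -C /repo.git update-ref "$ref" "$new" "$old" || exit 1
done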
* Re: Synchronous replication on push
2024-11-02 14:49 ` brian m. carlson
@ 2024-11-04 13:35 ` Taylor R Campbell
2024-11-04 14:40 ` Konstantin Ryabitsev
2024-11-04 22:36 ` brian m. carlson
0 siblings, 2 replies; 10+ messages in thread
From: Taylor R Campbell @ 2024-11-04 13:35 UTC (permalink / raw)
To: brian m. carlson; +Cc: Matěj Cepl, git

> Date: Sat, 2 Nov 2024 14:49:04 +0000
> From: "brian m. carlson" <sandals@crustytoothpaste.net>
>
> On 2024-11-02 at 13:35:11, Taylor R Campbell wrote:
> > I'm asking about how to configure a _single_ frontend remote, from the
> > perspective of developers who are pushing from their development
> > workstations, so that it replicates to one or many backend stores.
> > This is, for example, the usage model of Github's proprietary
> > implementation.
>
> I don't think there's built-in functionality for this and I'm not sure
> that it can be done without additional software.

I'm happy to write some additional software. But I would like to
understand what constraints there are on, e.g., pre-receive hooks and
the ref updates of git push that make them collide in the ways I
discovered, so that I can understand how to make that additional
software reliable. For example:

- Can I suppress the local ref updates of the remote in git push, just
  like git send-pack doesn't attempt any local ref updates of the
  remote? Or can I defer them to the post-receive hook?

  (By `local ref updates of the remote', I mean updates of the refs
  that live in the local repository for the remote.backend.fetch or
  remote.backend.push refspecs, rather than refs that exist in the
  remote repository which obviously I do want to update.)

- Can I use git send-pack with a custom remote transport?

- When I git clone --mirror, explicitly disable the mirror flag, and
  then git push in the pre-receive hook, why is there an error message
  printed even though the push exits with status zero and appears to
  have had all the effects I want?

- What undesirable side effects can git push have in a mirror cloned
  with git clone --mirror, but with the mirror flag subsequently
  disabled?

- What undesirable side effects can git push have in a pre-receive
  hook if I explicitly disable the quarantine environment by unsetting
  GIT_QUARANTINE_PATH in the environment?

> If you really wanted to try to do this with out of the box Git, you
> could create a `pre-receive` hook that did policy controls and then on
> success, took all of the objects from the quarantine and rsynced them
> (without overwriting) to the remote store, and then use the
> `reference-transaction` hook to replicate the reference transaction to
> the remote side via SSH or something. I haven't tested this, so it
> might or might not work, but you could try it.

Thanks, can you expand on how this would work with the constraints I
listed in my question? Recapitulating:

One option, of course, is to use a replicated file system like
glusterfs, or replicated block store like DRBD. But that

(a) likely requires a lot more round-trips than git push/send-pack,
(b) can't be used for replication to other git hosts like Github, and
(c) can't be used for other remote transports like git-cinnabar.

It sounds like rsyncing over ssh is incompatible with (b) and (c), but
perhaps I misunderstood what you're getting at. I tried to see if
there is some way that reference-transaction hooks help me here but
there wasn't anything obvious to me.
* Re: Synchronous replication on push
2024-11-04 13:35 ` Taylor R Campbell
@ 2024-11-04 14:40 ` Konstantin Ryabitsev
2024-11-04 15:50 ` Taylor R Campbell
0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Ryabitsev @ 2024-11-04 14:40 UTC (permalink / raw)
To: Taylor R Campbell; +Cc: brian m. carlson, Matěj Cepl, git

On Mon, Nov 04, 2024 at 01:35:44PM +0000, Taylor R Campbell wrote:
> > > I'm asking about how to configure a _single_ frontend remote, from the
> > > perspective of developers who are pushing from their development
> > > workstations, so that it replicates to one or many backend stores.
> > > This is, for example, the usage model of Github's proprietary
> > > implementation.
> >
> > I don't think there's built-in functionality for this and I'm not sure
> > that it can be done without additional software.
>
> I'm happy to write some additional software.

Alternatively, you can take a look at grokmirror, which is what
kernel.org uses:

https://pypi.org/project/grokmirror/

It's pull-based instead of push-based, for several reasons:

1. We replicate to multiple worldwide frontends, and we expect that
   some of them may be unreachable at the time when we attempt a push
2. This allows us to propagate repository deletes
3. This allows us to propagate details like descriptions and authors

Grokmirror also has a listener daemon that can trigger a pull, so it's
possible to have near-instantaneous replication by notifying the remote
node that a repository has been updated and should be pulled.

-K
* Re: Synchronous replication on push
2024-11-04 14:40 ` Konstantin Ryabitsev
@ 2024-11-04 15:50 ` Taylor R Campbell
0 siblings, 0 replies; 10+ messages in thread
From: Taylor R Campbell @ 2024-11-04 15:50 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: brian m. carlson, Matěj Cepl, git

> Date: Mon, 4 Nov 2024 09:40:06 -0500
> From: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
>
> Alternatively, you can take a look at grokmirror, which is what kernel.org
> uses:
> https://pypi.org/project/grokmirror/
>
> It's pull-based instead of push-based, for several reasons:
>
> 1. We replicate to multiple worldwide frontends, and we expect that some of
>    them may be unreachable at the time when we attempt a push
> 2. This allows us to propagate repository deletes
> 3. This allows us to propagate details like descriptions and authors
>
> Grokmirror also has a listener daemon that can trigger a pull, so it's
> possible to have near-instantaneous replication by notifying the remote node
> that a repository has been updated and should be pulled.

Thanks, that looks useful, but it's not quite what I'm looking for.

Part of the goal is to provide essentially the same (qualitative)
types of service guarantee that Github advertises:[*] once the user's
`git push' command has succeeded with zero exit status, the objects
and ref updates have been written to multiple backing stores, so it
would take a failure of a quorum of those backing stores to lose the
data.

In particular, a backend may reject an update, and when this happens
(in the multi-backend case, by enough backends that no quorum is
reached), the user who ran `git push' needs to know that it failed so
they don't, e.g., delete their branch, run git gc, and go on their
merry way having silently lost data.

[*] https://github.blog/engineering/infrastructure/stretching-spokes/
* Re: Synchronous replication on push
2024-11-04 13:35 ` Taylor R Campbell
2024-11-04 14:40 ` Konstantin Ryabitsev
@ 2024-11-04 22:36 ` brian m. carlson
1 sibling, 0 replies; 10+ messages in thread
From: brian m. carlson @ 2024-11-04 22:36 UTC (permalink / raw)
To: Taylor R Campbell; +Cc: Matěj Cepl, git

On 2024-11-04 at 13:35:44, Taylor R Campbell wrote:
> Thanks, can you expand on how this would work with the constraints I
> listed in my question? Recapitulating:
>
> One option, of course, is to use a replicated file system like
> glusterfs, or replicated block store like DRBD. But that
>
> (a) likely requires a lot more round-trips than git push/send-pack,
> (b) can't be used for replication to other git hosts like Github, and
> (c) can't be used for other remote transports like git-cinnabar.
>
> It sounds like rsyncing over ssh is incompatible with (b) and (c), but
> perhaps I misunderstood what you're getting at. I tried to see if
> there is some way that reference-transaction hooks help me here but
> there wasn't anything obvious to me.

It should be noted that you cannot do what GitHub does with the
three-phase commit with arbitrary remotes. A three-phase commit
provides a prepared-to-commit stage where the backends agree that they
(or at least a majority of them) will make the change. The Git
protocol doesn't offer such functionality, so you can't use arbitrary
remotes for this purpose. You'll need to either replicate to only
hosts you control (as GitHub does), or you'll need to give up on
having your three-phase commit operation.
--
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA
* Re: Synchronous replication on push
2024-11-02 2:06 Synchronous replication on push Taylor R Campbell
2024-11-02 10:09 ` Matěj Cepl
@ 2024-11-04 23:47 ` Jeff King
2024-11-05 1:34 ` Taylor R Campbell
1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2024-11-04 23:47 UTC (permalink / raw)
To: Taylor R Campbell; +Cc: git

On Sat, Nov 02, 2024 at 02:06:53AM +0000, Taylor R Campbell wrote:

> Whenever I push anything to it, I want the push -- that is, all the
> objects, and all the ref updates -- to be synchronously replicated to
> another remote repository, the back end:

This isn't quite how replication works at, say, GitHub. But let me
first explain some of what you're seeing, and then I'll give some
higher level comments at the end.

> Here are some approaches I've tried:
>
> 1. `git clone --mirror -o backend git@backend.example.com:/repo.git'
> to create the front end repository, plus the following pre-receive
> hook in the front end:
>
> #!/bin/sh
> exec git push backend
>
> This doesn't work because the pre-receive hook runs in the
> quarantine environment, and `git push' wants to update
> `refs/heads/main', which is forbidden in the quarantine
> environment.
>
> (However, git push to frontend doesn't actually fail with nonzero
> exit status -- it prints an error message, `ref updates forbidden
> inside quarantine environment', but exits with status 0.)
>
> But maybe the ref update is harmless in this environment.

I think the quarantine error is working as designed. If your push
updates local refs in the frontend repo, any object-existence checks
it does from the quarantine area are not necessarily valid if the
quarantine environment goes away without migrating the objects (e.g.,
if you reject the push).

So this:

> 2. Same as (1), but the pre-receive hook is:
>
> #!/bin/sh
> unset GIT_QUARANTINE_PATH
> exec git push backend

is potentially dangerous. Instead, you should disable push's attempt
to update the local tracking refs. There isn't an option to do that,
but if you don't have a "fetch" config line, then there are no
tracking refs. I.e., rather than using "clone --mirror", create your
frontend repo like this:

git init --bare
git config remote.backend.url git@backend.example.com:/repo.git
git fetch backend refs/*:refs/*

And then push won't try to update anything in the frontend repo.

Side note: there's a small maybe-bug here that I noticed if the
backend is on the same local filesystem. In that case
GIT_QUARANTINE_PATH remains set for the receive-pack process running
on the backend repo, and will refuse to update refs (where it should
be safe to do so!). In your example that doesn't happen because
GIT_QUARANTINE_PATH does not make it across the ssh connection. But
arguably we should be clearing GIT_QUARANTINE_PATH in local_repo_env
like we do for GIT_DIR, etc. I don't think you ran into this, but just
another hiccup I found while trying to reproduce your situation.

Moving on...

> This doesn't work because `git push' in the pre-receive hook
> doesn't find anything it needs to push -- the ref update hasn't
> happened yet.

Right. You could do it from a post-receive, but if the point is to be
able to reject the push to the frontend, it must happen before the
refs have been updated! So...

> 3. Same as (1), but the pre-receive hook assembles a command line of
>
> exec git push backend ${new0}:${ref0} ${new1}:${ref1} ...,
>
> with all the ref updates passed on stdin (ignoring the old values).

...yes, this is the correct approach. You're not _quite_ passing all
of the relevant info, though, because you're ignoring the old value of
each ref. And ideally you'd make sure you were moving backend's ref0
from "old0" to "new0"; otherwise you risk overwriting something that
happened independently on the backend. Of course that creates new
questions, like what happens when the frontend and backend get out of
sync.

> This fails because `--mirror can't be combined with refspecs'.

Yes. I don't think you really want "--mirror" in the first place,
since you won't be fetching from the backend (or will you? If you are,
that creates new questions about atomicity and syncing). If you do the
init+fetch above, it won't be set.

> 4. Same as (3), but remote.backend.mirror is explicitly disabled after
> `git clone --mirror' finishes.
>
> On push to the primary, this prints an error message
>
> remote: error: update_ref failed for ref 'refs/heads/main': ref updates forbidden inside quarantine environment
>
> but somehow the push succeeds in spite of this message, and the
> primary and replica both get updated.

This is again the quarantine issue updating local tracking branches.
However, we don't consider that a hard error, as updating them is
opportunistic (we'd get the new values on the next fetch anyway).

If you drop the refspec as above, you shouldn't see that any more.

> 5. Same as (1), but the pre-receive hook assembles a command line of
>
> exec git send-pack git@backend.example.com:/repo.git \
> ${new0}:${ref0} ${new1}:${ref1} ...
>
> with all the ref updates passed on stdin (ignoring the old values).
>
> This seems to work, and it propagates errors injected on push to
> the replica, but it is limited to local or ssh remotes, as far as I
> can tell -- it does not appear that git-send-pack works with custom
> remote transports.

I don't remember all of the limitations of send-pack anymore. Even
though "push" is more porcelain than plumbing, I'd probably still
recommend it for a script, just because I think direct use of
send-pack isn't going to be all that exercised, so you are likely to
find missing bits of functionality and so forth. I think just dropping
the refspecs and using push would be following the more well-trodden
path.

Now back to the main point: is this a good way to do replication? I
don't think it's _terrible_, but there are two flaws I can see:

1. You're not kicking off the backend push until the frontend has
received and processed the whole pack. So you're doubling the
end-to-end latency of the push. In an ideal world you'd actually
stream the incoming packfile to the backend, which would do its own
quarantined index-pack[*] on it in real-time. And then when you get
to the pre-receive hook, all that's left is for all of the replicas
to agree to commit to the ref update.

[*] That would fix the latency, but of course you'd be spending a
bunch of CPU on each replica to do the same indexing computation.
You _could_ do that once, streaming the result out to the replicas,
and then sending them just the resulting index. But there is some
safety in repeating the computation on each replica (they _should_
all have the same objects, but if that isn't the case, you'd notice
if one of them was missing, say, a delta base that the others have).
GitHub's original replication design did repeat the computation, and
AFAIK that is still the case today.

2. Using "push" isn't a very atomic way of updating refs. The backends
will either accept the push or not, and then the frontend will try
to update its refs. What if it fails? What if another push comes in
simultaneously? Can they overwrite each other or lose pushed data?
Or get the frontend and backends out of sync?

Git's ref atomicity strategy is generally to take a lock on a ref,
then check that its current value is the expected "old" value, and
then update it to the "new" value and release the lock atomically. So
you probably want to ask each backend replica to take the ref locks
and check the old values, then respond "yes, I'm ready to commit", and
then you send back "OK, commit" at which point they do the update.

But "push" doesn't give you that kind of granularity (neither for the
backends nor on the frontend). Back when GitHub's replication system
was designed, nothing did, and we had to use custom code. These days
the reference-transaction hook lets you act in that stage where the
ref lock is held (and my understanding is that GitLab implemented it
to do the same kind of three-phase commit). But I don't have much
experience with it myself. It might be enough if the frontend
transaction hook talked to the backends, initiating an update-ref
there with a transaction hook to pause and wait for the three-phase
agreement.

Maybe some of that points you in the right direction.

-Peff
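A sketch of the old-value check Peff describes, bolted onto the
stdin-driven pre-receive hook from the original mail (untested; new refs
and deletions, where $old or $new is all zeros, would need
special-casing):

#!/bin/sh
# pre-receive sketch: carry the old values through to the backend so
# the backend push only succeeds if its refs are still where the
# frontend last saw them.
set -e

refspecs=
leases=
while read old new ref
do
	refspecs="$refspecs $new:$ref"
	leases="$leases --force-with-lease=$ref:$old"
done

exec git push $leases backend $refspecs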
* Re: Synchronous replication on push
2024-11-04 23:47 ` Jeff King
@ 2024-11-05 1:34 ` Taylor R Campbell
0 siblings, 0 replies; 10+ messages in thread
From: Taylor R Campbell @ 2024-11-05 1:34 UTC (permalink / raw)
To: Jeff King; +Cc: git

> Date: Mon, 4 Nov 2024 18:47:05 -0500
> From: Jeff King <peff@peff.net>
>
> On Sat, Nov 02, 2024 at 02:06:53AM +0000, Taylor R Campbell wrote:
>
> > Whenever I push anything to it, I want the push -- that is, all the
> > objects, and all the ref updates -- to be synchronously replicated to
> > another remote repository, the back end:
>
> This isn't quite how replication works at, say, GitHub. But let me first
> explain some of what you're seeing, and then I'll give some higher level
> comments at the end.

Great, thanks! I understand Github works differently, and I'm not
trying to replicate everything about Github's architecture, which I
expect to take substantial novel software engineering effort. But I am
trying to make sure I understand how the parts fit together well
enough to provide qualitatively similar types of guarantees about
durability once the user's `git push' exits with status zero.

I really have two different goals here, which have similar needs for
relaying pushes but which I'm sure will diverge at some point:

1. provide a synchronous push/pull git frontend to an hg backend with
   git-cinnabar (so to ordinary git clients it looks just like an
   ordinary git remote, without needing git-cinnabar), and

2. provide a git frontend that replicates to one or many git backends
   for better resilience to server loss.

> Instead, you should disable push's attempt to
> update the local tracking refs. There isn't an option to do that, but
> if you don't have a "fetch" config line, then there are no tracking
> refs. I.e., rather than using "clone --mirror", create your frontend
> repo like this:
>
> git init --bare
> git config remote.backend.url git@backend.example.com:/repo.git
> git fetch backend refs/*:refs/*
>
> And then push won't try to update anything in the frontend repo.

Thanks, that hadn't occurred to me as an option.

> Side note: there's a small maybe-bug here that I noticed if the
> backend is on the same local filesystem. In that case
> GIT_QUARANTINE_PATH remains set for the receive-pack process running
> on the backend repo, and will refuse to update refs (where it should
> be safe to do so!). In your example that doesn't happen because
> GIT_QUARANTINE_PATH does not make it across the ssh connection. But
> arguably we should be clearing GIT_QUARANTINE_PATH in local_repo_env
> like we do for GIT_DIR, etc. I don't think you ran into this, but just
> another hiccup I found while trying to reproduce your situation.

(I did actually run into this, so in my test scripts I have been using

git {clone,config,...} ext::"env -i PATH=$PATH git %s /path/to/backend.git" ...

instead of just

git {clone,config,...} /path/to/backend.git ...

in order to nix GIT_QUARANTINE_PATH from the environment -- and
anything else I might not have thought of -- while running
git-receive-pack on the backend. But it didn't seem germane to the
problem at hand so I didn't want to clutter up my already somewhat
long question with such details unless someone asked me to share my
reproducer!)

> > 3. Same as (1), but the pre-receive hook assembles a command line of
> >
> > exec git push backend ${new0}:${ref0} ${new1}:${ref1} ...,
> >
> > with all the ref updates passed on stdin (ignoring the old values).
>
> ...yes, this is the correct approach. You're not _quite_ passing all of
> the relevant info, though, because you're ignoring the old value of each
> ref. And ideally you'd make sure you were moving backend's ref0 from
> "old0" to "new0"; otherwise you risk overwriting something that happened
> independently on the backend. Of course that creates new questions,
> like what happens when the frontend and backend get out of sync.

Right -- there will be some combination of --force-with-lease or
pre-receive tests at the other end to handle this. But for now my
focus is on making git push work in pre-receive at all. As long as
anything out-of-sync leads to noisy failure, possibly requiring manual
intervention, that's good enough for now (and I'm not (yet) concerned
with ...).

> > remote: error: update_ref failed for ref 'refs/heads/main': ref updates forbidden inside quarantine environment
> >
> > but somehow the push succeeds in spite of this message, and the
> > primary and replica both get updated.
>
> This is again the quarantine issue updating local tracking branches.
> However, we don't consider that a hard error, as updating them is
> opportunistic (we'd get the new values on the next fetch anyway).
>
> If you drop the refspec as above, you shouldn't see that any more.

Yes, thanks!

> Now back to the main point: is this a good way to do replication? I
> don't think it's _terrible_, but there are two flaws I can see:

These are all good points that I will consider once I get to them now
that I can make progress past the obstacle of local tracking ref
updates in pre-receive git push, thanks.

> 1. You're not kicking off the backend push until the frontend has
> received and processed the whole pack. So you're doubling the
> end-to-end latency of the push. In an ideal world you'd actually
> stream the incoming packfile to the backend, which would do its
> own quarantined index-pack[*] on it in real-time. And then when you
> get to the pre-receive hook, all that's left is for all of the
> replicas to agree to commit to the ref update.

Git doesn't currently have any hooks for doing this, right? So
presumably this will require a custom git-receive-pack replacement
that understands the git wire protocol to stream the packfile to
backends (which is what I assume Github's spokes proxies do).

> 2. Using "push" isn't a very atomic way of updating refs. The backends
> will either accept the push or not, and then the frontend will try
> to update its refs. What if it fails? What if another push comes in
> simultaneously? Can they overwrite each other or lose pushed data?
> Or get the frontend and backends out of sync?

Right -- there's a lot to work out for the three-phase commit part.
One simplification for now is to reject non-fast-forward pushes (and
ref deletion), and to not worry too much about ordering of independent
ref updates, or whether I even want serializable isolation or just
repeatable-read or read-committed for that.

That said, regarding push atomicity: Suppose users concurrently do

alice$ git push frontend X Y
bob$ git push frontend Y X

That is, there are overlapping ref updates, and suppose Alice and Bob
have incompatible referents for X and Y (non-fast-forward, or they're
using --force-with-lease but not --atomic, or whatever).

When are the locks on X and Y taken relative to pre-receive in the
frontend? Can the pre-receive hooks for Alice's push and Bob's push
run concurrently, or are they serialized by locks on the common refs X
and Y? This can't deadlock, can it? (I assume the locks on refs are
taken in a consistent order.)

It's unclear to me from the githooks(5), git-push(1), and
git-receive-pack(1) man pages what the ordering of hooks and ref
locking is, or what serialization guarantees hooks have -- if any.