Monitoring a repository for changes

* Monitoring a repository for changes
@ 2017-06-21 14:27 Tim Hutt
  2017-06-21 15:04 ` Ævar Arnfjörð Bjarmason
  2017-06-21 21:19 ` Jonathan Nieder
  0 siblings, 2 replies; 9+ messages in thread
From: Tim Hutt @ 2017-06-21 14:27 UTC (permalink / raw)
  To: git

Hi,

Currently if you want to monitor a repository for changes there are
three options:

* Polling - run a script to check for updates every 60 seconds.
* Server side hooks
* Web hooks (on Github, Bitbucket etc.)
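
For reference, the polling option usually looks something like the
sketch below. It is only an illustration: `remote_tip`, `tip_changed`
and `poll_loop` are hypothetical helper names, and the 60-second
interval simply mirrors the figure above. `git ls-remote` is the only
Git command it relies on.

```shell
#!/bin/sh
# Polling sketch: compare the remote branch tip with the last value seen.
# Helper names and the 60-second interval are illustrative assumptions.

# Print the object ID the remote currently advertises for branch $2.
remote_tip() {
    git ls-remote "$1" "refs/heads/$2" | cut -f1
}

# Succeed when the new tip ($2) is non-empty and differs from the old one ($1).
tip_changed() {
    [ -n "$2" ] && [ "$1" != "$2" ]
}

# Poll forever; run the given command each time the tip moves.
poll_loop() {
    remote=$1
    branch=$2
    last=
    shift 2
    while true
    do
        tip=$(remote_tip "$remote" "$branch")
        if tip_changed "$last" "$tip"
        then
            # The first observation only records a baseline.
            if [ -n "$last" ]
            then
                "$@"
            fi
            last=$tip
        fi
        sleep 60
    done
}
```

It would be started with something like
`poll_loop https://github.com/git/git.git master echo PUSH`.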

Unfortunately for many (most?) cases server-side hooks and web hooks
are not suitable. They require you to both have admin access to the
repo and have a public server available to push updates to. That is a
huge faff when all I want to do is run some local code when a repo is
updated (e.g. play a sound).

Currently people resort to polling
(https://stackoverflow.com/a/5199111/265521) which is just ugly. I
would like to propose that there should be a fourth option that uses a
persistent connection to monitor the repo. It would be used something
like this:

    git watch https://github.com/git/git.git

or

    git watch git@github.com:git/git.git

It would then print simple messages to stdout. The complexity of what
it prints is up for debate: it could be something as simple as
"PUSH\n", or it could include more information, e.g. JSON-encoded
information about the commits. I'd be happy with just "PUSH\n" though.
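
A consumer of that output could be very small. The sketch below
assumes the proposed `git watch` command (which does not exist in Git
today) prints one "PUSH" line per update; `on_push` is a hypothetical
helper name.

```shell
#!/bin/sh
# Run the given command once for every "PUSH" line read from stdin.
# Any other line is ignored.
on_push() {
    while IFS= read -r line
    do
        case $line in
        PUSH)
            # Keep the command from swallowing the rest of the event stream.
            "$@" </dev/null
            ;;
        esac
    done
}
```

Hypothetical usage, if such a command existed:
`git watch https://github.com/git/git.git | on_push play-a-sound`,
where `play-a-sound` stands in for whatever local code should run.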

In terms of implementation, the HTTP transport could use Server-Sent
Events, and the SSH transport can pretty much do whatever so that
should be easy.

Thoughts?

Tim

* Re: Monitoring a repository for changes
  2017-06-21 14:27 Monitoring a repository for changes Tim Hutt
@ 2017-06-21 15:04 ` Ævar Arnfjörð Bjarmason
  2017-06-21 19:44   ` Jeff King
  2017-06-21 19:52   ` Eric Wong
  2017-06-21 21:19 ` Jonathan Nieder
  1 sibling, 2 replies; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-21 15:04 UTC (permalink / raw)
  To: Tim Hutt; +Cc: git

On Wed, Jun 21 2017, Tim Hutt jotted:

> Hi,
>
> Currently if you want to monitor a repository for changes there are
> three options:
>
> * Polling - run a script to check for updates every 60 seconds.
> * Server side hooks
> * Web hooks (on Github, Bitbucket etc.)
>
> Unfortunately for many (most?) cases server-side hooks and web hooks
> are not suitable. They require you to both have admin access to the
> repo and have a public server available to push updates to. That is a
> huge faff when all I want to do is run some local code when a repo is
> updated (e.g. play a sound).
>
> Currently people resort to polling
> (https://stackoverflow.com/a/5199111/265521) which is just ugly. I
> would like to propose that there should be a fourth option that uses a
> persistent connection to monitor the repo. It would be used something
> like this:
>
>     git watch https://github.com/git/git.git
>
> or
>
>     git watch git@github.com:git/git.git
>
> It would then print simple messages to stdout. The complexity of what
> it prints is up for debate: it could be something as simple as
> "PUSH\n", or it could include more information, e.g. JSON-encoded
> information about the commits. I'd be happy with just "PUSH\n" though.

Insofar as this could be implemented in some standard way in Git it's
likely to have a large overlap with the "protocol v2" that keeps coming
up here on-list. You might want to search for past threads discussing
that.

> In terms of implementation, the HTTP transport could use Server-Sent
> Events, and the SSH transport can pretty much do whatever so that
> should be easy.

In case you didn't know, all of the non-trivially sized git hosting
providers (e.g. GitHub, GitLab) give you access over ssh, but you
can't run arbitrary commands; only a tiny set of whitelisted commands
is allowed. See the "git-shell" manual page (GitHub doesn't use that
exact software, but something similar).

But overall, it would be nice to have some rationale for this approach
other than that you think polling is ugly. There are a lot of advantages
to polling for something you don't need near-instantly. E.g. imagine how
many active connections a site like GitHub would need to handle if
something like this became widely used. That is in a lot of ways harder
to scale and load balance than just having clients that poll something
that's trivially cached as static content.

* Re: Monitoring a repository for changes
  2017-06-21 15:04 ` Ævar Arnfjörð Bjarmason
@ 2017-06-21 19:44   ` Jeff King
  2017-06-21 19:55     ` Stefan Beller
  2017-06-21 19:52   ` Eric Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff King @ 2017-06-21 19:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Tim Hutt, git

On Wed, Jun 21, 2017 at 05:04:12PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > In terms of implementation, the HTTP transport could use Server-Sent
> > Events, and the SSH transport can pretty much do whatever so that
> > should be easy.
> 
> In case you didn't know, any of the non-trivially sized git hosting
> providers (e.g. github, gitlab) provide you access over ssh, but you
> can't just run any arbitrary command, it's a tiny set of whitelisted
> commands. See the "git-shell" manual page (github doesn't use that exact
> software, but something similar).

These days you don't even hit the actual fileservers with ssh at all.
We terminate all of the protocols (http, git://, and ssh) at a proxy
layer that kicks off git commands in the actual repositories using a
separate protocol. The ssh handshakes were a huge performance
bottleneck, so by doing it that way we can scale out the front-end tier
independently of the repository storage (and of course it also provides
a convenient layer for mapping user visible repository names into
sharded paths).

Not to take away from your point. Just a little bit of trivia.

> But overall, it would be nice to have some rationale for this approach
> other than that you think polling is ugly. There's a lot of advantages
> to polling for something you don't need near-instantly, e.g. imagine how
> many active connections a site like GitHub would need to handle if
> something like this became widely used, that's in a lot of ways harder
> to scale and load balance than just having clients that poll something
> that's trivially cached as static content.

Yeah. The naive way to implement this would be to have the client
connect and receive the ref advertisement. And then when it's a noop
(nothing to fetch), instead of saying "I want these objects", say
"Please pause until one or more refs change". But I don't think we'd
want to leave actual upload-pack processes sitting paused on the server.
Their memory usage is too high.

For this kind of "long polling" we have a separate front-end tier with a
daemon that keeps the per-client cost very low. We could possibly wedge
that into our proxy layer, but the system would be a lot simpler and
more flexible if this were done separately from the actual git protocol.
E.g., if an HTTP endpoint were defined that paused and returned data
only when a particular repository's refs were updated.

Another option is to keep polling, but just make noop fetches a lot
cheaper. The ref advertisement on some repositories can get into the
megabytes. I'd love to see protocol extensions for:

  1. The client asking only for bits of the ref namespace they care
     about. I have some preliminary patches for this, but I really need
     to polish them.

  2. Something ETag-ish where the client can say "I already saw state X,
     do you have updates?" Even just handling "no, no updates" (like an
     ETag) would be a big benefit. Bonus points if it can say "since
     state X, these are the changes; you are now at state Y".

The sticking point on both is that the client needs to speak before the
ref advertisement begins, which is why we have to deal with the protocol
v2 headache.
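
The client-side half of item 2 can be approximated today, which may
help make the idea concrete. The sketch below reduces a ref listing to
a single token and compares it with the one saved from the previous
run. It saves no bandwidth, since the server still sends its
advertisement before the client can say anything; `state_token` and
`has_updates` are hypothetical helper names.

```shell
#!/bin/sh
# Reduce a ref listing on stdin to one short token.
# Sorting first makes the token independent of line order;
# cksum prints "<crc> <byte count>" for data read from stdin.
state_token() {
    LC_ALL=C sort | cksum | tr ' ' '-'
}

# Succeed when the new token ($2) differs from the stored one ($1).
has_updates() {
    [ "$1" != "$2" ]
}

# Hypothetical usage against a real remote, looking only at branches:
#   new=$(git ls-remote "$url" 'refs/heads/*' | state_token)
#   if has_updates "$old" "$new"; then git fetch "$url"; fi
```

A protocol-level version would let the server answer "no updates"
without sending the listing at all, which is where the saving would
come from.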

-Peff

* Re: Monitoring a repository for changes
  2017-06-21 15:04 ` Ævar Arnfjörð Bjarmason
  2017-06-21 19:44   ` Jeff King
@ 2017-06-21 19:52   ` Eric Wong
  2017-06-21 21:56     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 9+ messages in thread
From: Eric Wong @ 2017-06-21 19:52 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Tim Hutt, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Wed, Jun 21 2017, Tim Hutt jotted:
> 
> > Hi,
> >
> > Currently if you want to monitor a repository for changes there are
> > three options:
> >
> > * Polling - run a script to check for updates every 60 seconds.
> > * Server side hooks
> > * Web hooks (on Github, Bitbucket etc.)
> >
> > Unfortunately for many (most?) cases server-side hooks and web hooks
> > are not suitable. They require you to both have admin access to the
> > repo and have a public server available to push updates to. That is a
> > huge faff when all I want to do is run some local code when a repo is
> > updated (e.g. play a sound).

Yeah, it kinda sucks that way.

Currently, for one of my public-inbox mirrors which has ssh
access to the primary server on public-inbox.org, I have:

	#!/bin/sh
	while true
	do
		# GNU tail(1) uses inotify to avoid polling on Linux
		ssh public-inbox.org tail -F /path/to/git-vger.git/info/refs | \
				while read sha1 ref
		do
			for GIT_DIR in git-vger.git
			do
				export GIT_DIR
				git fetch || continue
				git update-server-info
				public-inbox-index # update Xapian index
			done
		done
	done

It's not perfect as it requires multiple processes on the
server, but it's better than polling for my limited use.

> > Currently people resort to polling
> > (https://stackoverflow.com/a/5199111/265521) which is just ugly. I
> > would like to propose that there should be a fourth option that uses a
> > persistent connection to monitor the repo. It would be used something
> > like this:
> >
> >     git watch https://github.com/git/git.git
> >
> > or
> >
> >     git watch git@github.com:git/git.git
> >
> > It would then print simple messages to stdout. The complexity of what
> > it prints is up for debate: it could be something as simple as
> > "PUSH\n", or it could include more information, e.g. JSON-encoded
> > information about the commits. I'd be happy with just "PUSH\n" though.
> 
> Insofar as this could be implemented in some standard way in Git it's
> likely to have a large overlap with the "protocol v2" that keeps coming
> up here on-list. You might want to search for past threads discussing
> that.

Yeah, it hasn't been a priority for me, either...

> > In terms of implementation, the HTTP transport could use Server-Sent
> > Events, and the SSH transport can pretty much do whatever so that
> > should be easy.
> 
> In case you didn't know, any of the non-trivially sized git hosting
> providers (e.g. github, gitlab) provide you access over ssh, but you
> can't just run any arbitrary command, it's a tiny set of whitelisted
> commands. See the "git-shell" manual page (github doesn't use that exact
> software, but something similar).
> 
> But overall, it would be nice to have some rationale for this approach
> other than that you think polling is ugly. There's a lot of advantages
> to polling for something you don't need near-instantly, e.g. imagine how
> many active connections a site like GitHub would need to handle if
> something like this became widely used, that's in a lot of ways harder
> to scale and load balance than just having clients that poll something
> that's trivially cached as static content.

Polling becomes more expensive with TLS and high-latency
connections, and also increases power consumption if done
frequently for redundancy purposes.

I've long wanted to do something better to allow others to keep
public-inbox mirrors up-to-date.  Having only 64-128 bytes of
userspace overhead per connection should be totally doable
based on my experience working on cmogstored, at which point
port exhaustion will become the limiting factor (or TLS overhead
for HTTPS).

But perhaps a cheaper option might be the traditional email/IRC
notification and having a client-side process watch for that
before fetching.

* Re: Monitoring a repository for changes
  2017-06-21 19:44   ` Jeff King
@ 2017-06-21 19:55     ` Stefan Beller
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Beller @ 2017-06-21 19:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Tim Hutt,
	git@vger.kernel.org

On Wed, Jun 21, 2017 at 12:44 PM, Jeff King <peff@peff.net> wrote:
>
> Yeah. The naive way to implement this would be to have the client
> connect and receive the ref advertisement. And then when it's a noop
> (nothing to fetch), instead of saying "I want these objects", say
> "Please pause until one or more refs change". But I don't think we'd
> want to leave actual upload-pack processes sitting paused on the server.
> Their memory usage is too high.

https://git.eclipse.org/r/#/c/6587/

JGit has had its experiments with some standing connection and then
having some sort of Pub/Sub system. AFAICT it did not go anywhere
because of the number of connections (even if you optimize for
the serverside, such that each connection is just the cost of a java
thread and a file descriptor).

>
> The sticking point on both is that the client needs to speak before the
> ref advertisement begins, which is why we have to deal with the protocol
> v2 headache.

I would not call it a headache, but a large project that is not to be
tackled by one person alone. ;)

* Re: Monitoring a repository for changes
  2017-06-21 14:27 Monitoring a repository for changes Tim Hutt
  2017-06-21 15:04 ` Ævar Arnfjörð Bjarmason
@ 2017-06-21 21:19 ` Jonathan Nieder
  1 sibling, 0 replies; 9+ messages in thread
From: Jonathan Nieder @ 2017-06-21 21:19 UTC (permalink / raw)
  To: Tim Hutt; +Cc: git

Hi,

Tim Hutt wrote:

> Currently if you want to monitor a repository for changes there are
> three options:
>
> * Polling - run a script to check for updates every 60 seconds.
> * Server side hooks
> * Web hooks (on Github, Bitbucket etc.)
>
> Unfortunately for many (most?) cases server-side hooks and web hooks
> are not suitable. They require you to both have admin access to the
> repo and have a public server available to push updates to. That is a
> huge faff when all I want to do is run some local code when a repo is
> updated (e.g. play a sound).

On the polling side, it is possible to improve things a little:
https://www.kernel.org/mirroring-kernelorg-repositories.html
https://github.com/mricon/grokmirror

A hanging GET or websocket is more client-friendly but more expensive
server-side.  That doesn't rule out making it happen on some servers
if someone does the work.  If I understand correctly then this
architecture tends to lead to centralization --- a small number of
services providing notifications pushed from multiple sources, as with
https://developers.google.com/web/fundamentals/engage-and-retain/push-notifications/how-push-works

If someone wants to try adding something like grokmirror (which
describes the state of multiple repositories, amortizing the
per-request costs) to git, especially if it supports something
etag-like as Jeff King suggested, then I would be interested.

Thanks and hope that helps,
Jonathan

* Re: Monitoring a repository for changes
  2017-06-21 19:52   ` Eric Wong
@ 2017-06-21 21:56     ` Ævar Arnfjörð Bjarmason
  2017-06-21 22:20       ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-21 21:56 UTC (permalink / raw)
  To: Eric Wong; +Cc: Tim Hutt, git


On Wed, Jun 21 2017, Eric Wong jotted:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> On Wed, Jun 21 2017, Tim Hutt jotted:
>>
>> > Hi,
>> >
>> > Currently if you want to monitor a repository for changes there are
>> > three options:
>> >
>> > * Polling - run a script to check for updates every 60 seconds.
>> > * Server side hooks
>> > * Web hooks (on Github, Bitbucket etc.)
>> >
>> > Unfortunately for many (most?) cases server-side hooks and web hooks
>> > are not suitable. They require you to both have admin access to the
>> > repo and have a public server available to push updates to. That is a
>> > huge faff when all I want to do is run some local code when a repo is
>> > updated (e.g. play a sound).
>
> Yeah, it kinda sucks that way.
>
> Currently, for one of my public-inbox mirrors which has ssh
> access to the primary server on public-inbox.org, I have:
>
> 	#!/bin/sh
> 	while true
> 	do
> 		# GNU tail(1) uses inotify to avoid polling on Linux
> 		ssh public-inbox.org tail -F /path/to/git-vger.git/info/refs | \
> 				while read sha1 ref
> 		do
> 			for GIT_DIR in git-vger.git
> 			do
> 				export GIT_DIR
> 				git fetch || continue
> 				git update-server-info
> 				public-inbox-index # update Xapian index
> 			done
> 		done
> 	done
>
> It's not perfect as it requires multiple processes on the
> server, but it's better than polling for my limited use.
>
>> > Currently people resort to polling
>> > (https://stackoverflow.com/a/5199111/265521) which is just ugly. I
>> > would like to propose that there should be a fourth option that uses a
>> > persistent connection to monitor the repo. It would be used something
>> > like this:
>> >
>> >     git watch https://github.com/git/git.git
>> >
>> > or
>> >
>> >     git watch git@github.com:git/git.git
>> >
>> > It would then print simple messages to stdout. The complexity of what
>> > it prints is up for debate: it could be something as simple as
>> > "PUSH\n", or it could include more information, e.g. JSON-encoded
>> > information about the commits. I'd be happy with just "PUSH\n" though.
>>
>> Insofar as this could be implemented in some standard way in Git it's
>> likely to have a large overlap with the "protocol v2" that keeps coming
>> up here on-list. You might want to search for past threads discussing
>> that.
>
> Yeah, it hasn't been a priority for me, either...
>
>> > In terms of implementation, the HTTP transport could use Server-Sent
>> > Events, and the SSH transport can pretty much do whatever so that
>> > should be easy.
>>
>> In case you didn't know, any of the non-trivially sized git hosting
>> providers (e.g. github, gitlab) provide you access over ssh, but you
>> can't just run any arbitrary command, it's a tiny set of whitelisted
>> commands. See the "git-shell" manual page (github doesn't use that exact
>> software, but something similar).
>>
>> But overall, it would be nice to have some rationale for this approach
>> other than that you think polling is ugly. There's a lot of advantages
>> to polling for something you don't need near-instantly, e.g. imagine how
>> many active connections a site like GitHub would need to handle if
>> something like this became widely used, that's in a lot of ways harder
>> to scale and load balance than just having clients that poll something
>> that's trivially cached as static content.
>
> Polling becomes more expensive with TLS and high-latency
> connections, and also increases power consumption if done
> frequently for redundancy purposes.
>
> I've long wanted to do something better to allow others to keep
> public-inbox mirrors up-to-date.  Having only 64-128 bytes of
> overhead per userspace per-connection should be totally doable
> based on my experience working on cmogstored; at which point
> port exhaustion will become the limiting factor (or TLS overhead
> for HTTPS).

Come to think of it I should probably have asked you about this, but I
have a one-liner running that polls every 5 minutes, but will stop if I
haven't changed my git.git in a day:

    while true; do if test $(find ~/g/git -type f -mmin -1440 | wc -l) -gt 0; then git pull; else echo too old; fi ; date ; sleep 300; done
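
Spread over several lines, the same idea reads as below. This is a
sketch rather than the exact command: the helper names are
hypothetical, and it passes the directory to `git -C` instead of
relying on the current directory.

```shell
#!/bin/sh
# Succeed if any regular file under directory $1 was modified
# within the last $2 minutes.
recently_modified() {
    [ -n "$(find "$1" -type f -mmin "-$2" | head -n 1)" ]
}

# Every 5 minutes, pull if the checkout in $1 was touched in the last day
# (1440 minutes); otherwise just report that it is too old.
poll_while_active() {
    while true
    do
        if recently_modified "$1" 1440
        then
            git -C "$1" pull
        else
            echo too old
        fi
        date
        sleep 300
    done
}
```

It would be started with `poll_while_active ~/g/git`.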

> But perhaps a cheaper option might be the traditional email/IRC
> notification and having a client-side process watch for that
> before fetching.

If there was an IRC channel with this info I could/would use that.
Getting it via E-Mail would just get me into the same problem
public-inbox is currently solving for me, i.e. I might as well keep the
git ML up-to-date on that machine if I'm going to otherwise need to
subscribe to a "hey there's a new message on the git ML" list :)

* Re: Monitoring a repository for changes
  2017-06-21 21:56     ` Ævar Arnfjörð Bjarmason
@ 2017-06-21 22:20       ` Eric Wong
  2017-06-21 22:36         ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2017-06-21 22:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Tim Hutt, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Wed, Jun 21 2017, Eric Wong jotted:
> > I've long wanted to do something better to allow others to keep
> > public-inbox mirrors up-to-date.  Having only 64-128 bytes of
> > overhead per userspace per-connection should be totally doable
> > based on my experience working on cmogstored; at which point
> > port exhaustion will become the limiting factor (or TLS overhead
> > for HTTPS).
> 
> Come to think of it I should probably have asked you about this, but I
> have a one-liner running that polls every 5 minutes, but will stop if I
> haven't changed my git.git in a day:
> 
>     while true; do if test $(find ~/g/git -type f -mmin -1440 | wc -l) -gt 0; then git pull; else echo too old; fi ; date ; sleep 300; done

Polling https://public-inbox.org/git ?  no need to stop it,
every 5 seconds is fine if you're not worried about power
consumption on your end :)

> > But perhaps a cheaper option might be the traditional email/IRC
> > notification and having a client-side process watch for that
> > before fetching.
> 
> If there was a IRC channel with this info I could/would use that,
> getting it via E-Mail would just get me into the same problem
> public-inbox is currently solving for me, i.e. I might as well keep the
> git ML up-to-date on that machine if I'm going to otherwise need to
> subscribe to a "hey there's a new message on the git ML" list :)

The IRC server would have the same scalability problems faced by
maintaining persistent connections to git-daemon or HTTP
servers, however.  And, yes, email does seem redundant, and
modern header sizes (with DKIM, etc) are gigantic; but
connection lifetime and concurrency is manageable to the server
even if not instantaneous.

I also considered having clients set up a listener of some sort
(possibly using UDP), but that would have all the problems with
git:// + firewalls.

* Re: Monitoring a repository for changes
  2017-06-21 22:20       ` Eric Wong
@ 2017-06-21 22:36         ` Eric Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2017-06-21 22:36 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Tim Hutt, git

Eric Wong <e@80x24.org> wrote:
> And, yes, email does seem redundant, and
> modern header sizes (with DKIM, etc) are gigantic; but
> connection lifetime and concurrency is manageable to the server
> even if not instantaneous.

I should add that any email notification message should be
significantly shorter than a normal message going to the list;
possibly just a parsable subject line and empty body.
