git over webdav: what can I do for improving http-push ?

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* git over webdav: what can I do for improving http-push ?
@ 2007-12-30 22:59 Grégoire Barbier
  2007-12-31  3:46 ` Daniel Barkalow
  0 siblings, 1 reply; 14+ messages in thread
From: Grégoire Barbier @ 2007-12-30 22:59 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2127 bytes --]

Hi everybody.

I've just subscribed to the list, therefore I think it should be rude 
not to introduce myself. My name is Grégoire Barbier (non 
french-speaking people should call me Greg and don't bother with 
non-ascii characters), I'm working mainly as a consultant (using 
Powerpoint and wearing a tie) but have some personal and professional 
interest in programming, especially about middlewares. BTW I apologize 
for my poor english.

I'm using Git for a rather short time but enough to fall in love with 
it. For a few days I'm trying to use it over webdav, that is over 
http/https with write (push) access. As for me, the main rationale to 
use http(s) rather than git or ssh is to get through corporate 
firewalls, otherwise I would probably not bother with webdav.

With 1.5.3.6 and 1.5.4-rc2, I encounter severe issues that make me think 
that http-push is not totally ready for production. That's why I would 
like to have a discussion with some of you that use and maintain it, to 
see what I can do to improve it or to help you improve it.

Here are some issues I encountered:
- http-push does not release locks when failing due to syntax error 
(e.g. if one types "git push" instead of "git push origin master")
- http-push freezes with no message with urls not terminated by a slash
- http-push does not create directory for the object (objects/xx/) and 
if the directory exists, it does not actually push objects without 
having USE_CURL_MULTI defined (which is not the compilation default)

I've starting to look at the source code and make some little 
improvements, but I feel that I should rather discuss with you to 
understand why there are two rather independant modes in http-push 
(USE_CURL_MULTI or not) and what is the real target (I don't want to 
work twice, neither to mess up the work of someone else that would be 
currently reorginzing this part of the code).

I attached some patches I did against 1.5.4-rc2, but I'm not sure they 
are doing it the good way, so I wouldn't be suprised if you were not 
okay to apply them as is.

-- 
Grégoire Barbier - gb à gbarbier.org - +33 6 21 35 73 49

[-- Attachment #2: 0001-Removed-double-free-int-http-push.c.patch --]
[-- Type: text/plain, Size: 903 bytes --]

>From 216f27db1f768fd80519a4e57dd835a1f497d902 Mon Sep 17 00:00:00 2001
From: Gregoire Barbier, gb at gbarbier dot org <gb@panoramix.(none)>
Date: Sun, 30 Dec 2007 17:45:54 +0100
Subject: [PATCH] Removed double-free() int http-push.c.

---
 http-push.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/http-push.c b/http-push.c
index 64be904..55d0c94 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1979,7 +1979,6 @@ static int remote_exists(const char *path)
 
 	if (start_active_slot(slot)) {
 		run_active_slot(slot);
-		free(url);
 		if (results.http_code == 404)
 			ret = 0;
 		else if (results.curl_result == CURLE_OK)
@@ -1987,7 +1986,6 @@ static int remote_exists(const char *path)
 		else
 			fprintf(stderr, "HEAD HTTP error %ld\n", results.http_code);
 	} else {
-		free(url);
 		fprintf(stderr, "Unable to start HEAD request\n");
 	}
 
-- 
1.5.4.rc2.4.gcef60-dirty


[-- Attachment #3: 0002-Making-HTTP-push-more-robust-and-more-user-friendly.patch --]
[-- Type: text/plain, Size: 3780 bytes --]

>From 70226905d8f1dd6ed7d953285a6ee693f1e87b65 Mon Sep 17 00:00:00 2001
From: Gregoire Barbier, gb at gbarbier dot org <gb@panoramix.(none)>
Date: Sun, 30 Dec 2007 17:48:07 +0100
Subject: [PATCH] Making HTTP push more robust and more user-friendly: - fail when info/refs exists and is already locked (avoiding some repository corruption) - warn if the URL does not end with '/' (since 302 is not yet handled) - more explicit error message when the URL or password is not set correctly (instead of "no DAV locking support") - DAV locking time of 1 minute instead of 10 minutes (avoid waiting 10 minutes for a orphan lock to expire)

---
 http-push.c |   17 ++++++++++++++++-
 http.c      |   25 +++++++++++++++++++++++++
 http.h      |    1 +
 3 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/http-push.c b/http-push.c
index 55d0c94..c005903 100644
--- a/http-push.c
+++ b/http-push.c
@@ -57,7 +57,7 @@ enum XML_Status {
 #define PROPFIND_ALL_REQUEST "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n<D:propfind xmlns:D=\"DAV:\">\n<D:allprop/>\n</D:propfind>"
 #define LOCK_REQUEST "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n<D:lockinfo xmlns:D=\"DAV:\">\n<D:lockscope><D:exclusive/></D:lockscope>\n<D:locktype><D:write/></D:locktype>\n<D:owner>\n<D:href>mailto:%s</D:href>\n</D:owner>\n</D:lockinfo>"
 
-#define LOCK_TIME 600
+#define LOCK_TIME 60
 #define LOCK_REFRESH 30
 
 /* bits #0-15 in revision.h */
@@ -2224,6 +2224,16 @@ int main(int argc, char **argv)
 
 	no_pragma_header = curl_slist_append(no_pragma_header, "Pragma:");
 
+	/* Verify connexion string (agains bad URLs or password errors) */
+	if (remote->url && remote->url[strlen(remote->url)-1] != '/') {
+		fprintf(stderr, "Warning: remote URL does not end with a '/' which often leads to problems\n");
+	}
+	if (!http_test_connection(remote->url)) {
+		fprintf(stderr, "Error: cannot access to remote URL (maybe malformed URL, network error or bad credentials)\n");
+		rc = 1;
+		goto cleanup;
+	}
+
 	/* Verify DAV compliance/lock support */
 	if (!locking_available()) {
 		fprintf(stderr, "Error: no DAV locking support on remote repo %s\n", remote->url);
@@ -2239,6 +2249,11 @@ int main(int argc, char **argv)
 		info_ref_lock = lock_remote("info/refs", LOCK_TIME);
 		if (info_ref_lock)
 			remote->can_update_info_refs = 1;
+		else {
+			fprintf(stderr, "Error: cannot lock existing info/refs\n");
+			rc = 1;
+			goto cleanup;
+		}
 	}
 	if (remote->has_info_packs)
 		fetch_indices();
diff --git a/http.c b/http.c
index d2c11ae..8b04ae9 100644
--- a/http.c
+++ b/http.c
@@ -634,3 +634,28 @@ int http_fetch_ref(const char *base, const char *ref, unsigned char *sha1)
 	free(url);
 	return ret;
 }
+
+int http_test_connection(const char *url)
+{
+	struct strbuf buffer = STRBUF_INIT;
+	struct active_request_slot *slot;
+	struct slot_results results;
+	int ret = 0;
+
+	slot = get_active_slot();
+	slot->results = &results;
+	curl_easy_setopt(slot->curl, CURLOPT_FILE, &buffer);
+	curl_easy_setopt(slot->curl, CURLOPT_WRITEFUNCTION, fwrite_buffer);
+	curl_easy_setopt(slot->curl, CURLOPT_HTTPHEADER, NULL);
+	curl_easy_setopt(slot->curl, CURLOPT_URL, url);
+	if (start_active_slot(slot)) {
+		run_active_slot(slot);
+		if (results.curl_result == CURLE_OK)
+			ret = -1;
+		else
+			error("Cannot access to URL %s, return code %d", url, results.curl_result);
+	} else
+		error("Unable to start request");
+	strbuf_release(&buffer);
+	return ret;
+}
diff --git a/http.h b/http.h
index aeba930..b353007 100644
--- a/http.h
+++ b/http.h
@@ -77,6 +77,7 @@ extern void step_active_slots(void);
 
 extern void http_init(void);
 extern void http_cleanup(void);
+extern int http_test_connection(const char *url);
 
 extern int data_received;
 extern int active_requests;
-- 
1.5.4.rc2.4.gcef60-dirty


[-- Attachment #4: 0003-Releasing-webdav-lock-even-if-push-fails-because-of.patch --]
[-- Type: text/plain, Size: 1303 bytes --]

>From e00ae0f4b9ed0e61088fa729a7cabbfcbd006b98 Mon Sep 17 00:00:00 2001
From: Gregoire Barbier, gb at gbarbier dot org <gb@panoramix.(none)>
Date: Sun, 30 Dec 2007 19:35:31 +0100
Subject: [PATCH] Releasing webdav lock even if push fails because of bad (or no) reference on command line.

---
 http-push.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/http-push.c b/http-push.c
index c005903..cbbf432 100644
--- a/http-push.c
+++ b/http-push.c
@@ -2275,11 +2275,14 @@ int main(int argc, char **argv)
 	if (!remote_tail)
 		remote_tail = &remote_refs;
 	if (match_refs(local_refs, remote_refs, &remote_tail,
-		       nr_refspec, (const char **) refspec, push_all))
-		return -1;
+		       nr_refspec, (const char **) refspec, push_all)) {
+		rc = -1;
+		goto cleanup;
+	}
 	if (!remote_refs) {
 		fprintf(stderr, "No refs in common and none specified; doing nothing.\n");
-		return 0;
+		rc = 0;
+		goto cleanup;
 	}
 
 	new_refs = 0;
@@ -2410,10 +2413,10 @@ int main(int argc, char **argv)
 			fprintf(stderr, "Unable to update server info\n");
 		}
 	}
-	if (info_ref_lock)
-		unlock_remote(info_ref_lock);
 
  cleanup:
+	if (info_ref_lock)
+		unlock_remote(info_ref_lock);
 	free(remote);
 
 	curl_slist_free_all(no_pragma_header);
-- 
1.5.4.rc2.4.gcef60-dirty


[-- Attachment #5: 0004-Adding-define-DEFAULT_MAX_REQUESTS-for-USE_CURL_MUL.patch --]
[-- Type: text/plain, Size: 623 bytes --]

>From cef60c7940008487547894855eeed34d2edeb48e Mon Sep 17 00:00:00 2001
From: Gregoire Barbier, gb at gbarbier dot org <gb@panoramix.(none)>
Date: Sun, 30 Dec 2007 21:30:25 +0100
Subject: [PATCH] Adding #define DEFAULT_MAX_REQUESTS for USE_CURL_MULTI mode

---
 http.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/http.c b/http.c
index 8b04ae9..7b1bcb8 100644
--- a/http.c
+++ b/http.c
@@ -4,6 +4,7 @@ int data_received;
 int active_requests = 0;
 
 #ifdef USE_CURL_MULTI
+#define DEFAULT_MAX_REQUESTS 4
 static int max_requests = -1;
 static CURLM *curlm;
 #endif
-- 
1.5.4.rc2.4.gcef60-dirty


[-- Attachment #6: 0005-Moving-ifdef-endif-for-making-USE_CURL_MULTI-code.patch --]
[-- Type: text/plain, Size: 3170 bytes --]

>From b34d81c1fff43a806cb91615effd00e424bd0e6b Mon Sep 17 00:00:00 2001
From: Gregoire Barbier, gb at gbarbier dot org <gb@panoramix.(none)>
Date: Sun, 30 Dec 2007 23:40:18 +0100
Subject: [PATCH] Moving #ifdef/#endif for making USE_CURL_MULTI code more visible.

---
 http-push.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)
 mode change 100644 => 100755 http-push.c

diff --git a/http-push.c b/http-push.c
old mode 100644
new mode 100755
index cbbf432..6dd3c45
--- a/http-push.c
+++ b/http-push.c
@@ -116,10 +116,12 @@ struct transfer_request
 	struct remote_lock *lock;
 	struct curl_slist *headers;
 	struct buffer buffer;
+#ifdef USE_CURL_MULTI
 	char filename[PATH_MAX];
 	char tmpfile[PATH_MAX];
 	int local_fileno;
 	FILE *local_stream;
+#endif
 	enum transfer_state state;
 	CURLcode curl_result;
 	char errorstr[CURL_ERROR_SIZE];
@@ -175,6 +177,7 @@ struct remote_ls_ctx
 	struct remote_ls_ctx *parent;
 };
 
+#ifdef USE_CURL_MULTI
 static void finish_request(struct transfer_request *request);
 static void release_request(struct transfer_request *request);
 
@@ -186,7 +189,6 @@ static void process_response(void *callback_data)
 	finish_request(request);
 }
 
-#ifdef USE_CURL_MULTI
 static size_t fwrite_sha1_file(void *ptr, size_t eltsize, size_t nmemb,
 			       void *data)
 {
@@ -383,7 +385,6 @@ static void start_mkcol(struct transfer_request *request)
 		request->url = NULL;
 	}
 }
-#endif
 
 static void start_fetch_packed(struct transfer_request *request)
 {
@@ -581,6 +582,7 @@ static void start_move(struct transfer_request *request)
 		request->url = NULL;
 	}
 }
+#endif
 
 static int refresh_lock(struct remote_lock *lock)
 {
@@ -660,15 +662,18 @@ static void release_request(struct transfer_request *request)
 			entry->next = entry->next->next;
 	}
 
+#ifdef USE_CURL_MULTI
 	if (request->local_fileno != -1)
 		close(request->local_fileno);
 	if (request->local_stream)
 		fclose(request->local_stream);
+#endif
 	if (request->url != NULL)
 		free(request->url);
 	free(request);
 }
 
+#ifdef USE_CURL_MULTI
 static void finish_request(struct transfer_request *request)
 {
 	struct stat st;
@@ -793,7 +798,6 @@ static void finish_request(struct transfer_request *request)
 	}
 }
 
-#ifdef USE_CURL_MULTI
 static int fill_active_slot(void *unused)
 {
 	struct transfer_request *request = request_queue_head;
@@ -841,8 +845,10 @@ static void add_fetch_request(struct object *obj)
 	request->url = NULL;
 	request->lock = NULL;
 	request->headers = NULL;
+#ifdef USE_CURL_MULTI
 	request->local_fileno = -1;
 	request->local_stream = NULL;
+#endif
 	request->state = NEED_FETCH;
 	request->next = request_queue_head;
 	request_queue_head = request;
@@ -881,12 +887,13 @@ static int add_send_request(struct object *obj, struct remote_lock *lock)
 	request->url = NULL;
 	request->lock = lock;
 	request->headers = NULL;
+#ifdef USE_CURL_MULTI
 	request->local_fileno = -1;
 	request->local_stream = NULL;
+#endif
 	request->state = NEED_PUSH;
 	request->next = request_queue_head;
 	request_queue_head = request;
-
 #ifdef USE_CURL_MULTI
 	fill_active_slots();
 	step_active_slots();
-- 
1.5.4.rc2.4.gcef60-dirty


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2007-12-30 22:59 git over webdav: what can I do for improving http-push ? Grégoire Barbier
@ 2007-12-31  3:46 ` Daniel Barkalow
  2007-12-31 16:57   ` Graham Barr
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Barkalow @ 2007-12-31  3:46 UTC (permalink / raw)
  To: Grégoire Barbier; +Cc: git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3227 bytes --]

On Sun, 30 Dec 2007, Grégoire Barbier wrote:

> Hi everybody.
> 
> I've just subscribed to the list, therefore I think it should be rude not to
> introduce myself. My name is Grégoire Barbier (non french-speaking people
> should call me Greg and don't bother with non-ascii characters), I'm working
> mainly as a consultant (using Powerpoint and wearing a tie) but have some
> personal and professional interest in programming, especially about
> middlewares. BTW I apologize for my poor english.
> 
> I'm using Git for a rather short time but enough to fall in love with it. For
> a few days I'm trying to use it over webdav, that is over http/https with
> write (push) access. As for me, the main rationale to use http(s) rather than
> git or ssh is to get through corporate firewalls, otherwise I would probably
> not bother with webdav.

In general, we've been able to either get through firewalls with ssh or 
it's all in the same VPN. So it's kind of unloved at this point. People 
poke at it occasionally, but mostly in the context of other fixes, I 
think.

> With 1.5.3.6 and 1.5.4-rc2, I encounter severe issues that make me think that
> http-push is not totally ready for production. That's why I would like to have
> a discussion with some of you that use and maintain it, to see what I can do
> to improve it or to help you improve it.
> 
> Here are some issues I encountered:
> - http-push does not release locks when failing due to syntax error (e.g. if
> one types "git push" instead of "git push origin master")
> - http-push freezes with no message with urls not terminated by a slash
> - http-push does not create directory for the object (objects/xx/) and if the
> directory exists, it does not actually push objects without having
> USE_CURL_MULTI defined (which is not the compilation default)
> 
> I've starting to look at the source code and make some little improvements,
> but I feel that I should rather discuss with you to understand why there are
> two rather independant modes in http-push (USE_CURL_MULTI or not) and what is
> the real target (I don't want to work twice, neither to mess up the work of
> someone else that would be currently reorginzing this part of the code).

I think the issue is the CURL_MULTI library code is either not supported 
or is broken in versions of curl that many distros still ship, but we can 
do a lot better with it, so the duplicate implementation is plausibly 
worthwhile.

One thing I personally thought would be worthwhile would be to separate 
out the logic for sending stuff like the fetching logic is in 
walker.{c,h}, and include the necessary methods in struct walker. There 
were people interested in sftp (for the case where you can get ssh through 
firewalls, but you aren't allowed to install programs on the file server 
and git isn't installed system-wide).

One thing that's worth doing when looking at the code is using "git blame" 
to find out where the lines you're changing came from, and "git log 
<hash>" to find out what the person writing them was trying to do. This 
will also turn up the people who've been working in the area, who you 
might want to cc, since they'll be good reviewers.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2007-12-31  3:46 ` Daniel Barkalow
@ 2007-12-31 16:57   ` Graham Barr
  2008-01-01 11:33     ` Jan Hudec
  0 siblings, 1 reply; 14+ messages in thread
From: Graham Barr @ 2007-12-31 16:57 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Grégoire Barbier, git

Daniel Barkalow wrote:
> On Sun, 30 Dec 2007, Grégoire Barbier wrote:

>> I'm using Git for a rather short time but enough to fall in love with it. For
>> a few days I'm trying to use it over webdav, that is over http/https with
>> write (push) access. As for me, the main rationale to use http(s) rather than
>> git or ssh is to get through corporate firewalls, otherwise I would probably
>> not bother with webdav.

> In general, we've been able to either get through firewalls with ssh or 
> it's all in the same VPN. So it's kind of unloved at this point. People 
> poke at it occasionally, but mostly in the context of other fixes, I 
> think.

If you have a http proxy that you can use, the you can use ssh via that with
something like corkscrew. http://wiki.kartbuilding.net/index.php/Corkscrew_-_ssh_over_https

A simple shell script wrapper around ssh to detect when you are behind a firewall
can inject the ProxyCommand into the command line arguments with -o

Graham.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2007-12-31 16:57   ` Graham Barr
@ 2008-01-01 11:33     ` Jan Hudec
  2008-01-01 11:41       ` Grégoire Barbier
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Hudec @ 2008-01-01 11:33 UTC (permalink / raw)
  To: Graham Barr; +Cc: Daniel Barkalow, Grégoire Barbier, git

On Mon, Dec 31, 2007 at 10:57:52 -0600, Graham Barr wrote:
> Daniel Barkalow wrote:
>> On Sun, 30 Dec 2007, Grégoire Barbier wrote:
>
>>> I'm using Git for a rather short time but enough to fall in love with it. For
>>> a few days I'm trying to use it over webdav, that is over http/https with
>>> write (push) access. As for me, the main rationale to use http(s) rather than
>>> git or ssh is to get through corporate firewalls, otherwise I would probably
>>> not bother with webdav.
>
>> In general, we've been able to either get through firewalls with ssh or 
>> it's all in the same VPN. So it's kind of unloved at this point. People 
>> poke at it occasionally, but mostly in the context of other fixes, I 
>> think.
>
> If you have a http proxy that you can use, the you can use ssh via that with
> something like corkscrew. http://wiki.kartbuilding.net/index.php/Corkscrew_-_ssh_over_https

This, obviously, requires, that ssh is running on port 443, because most HTTP
proxies won't let you CONNECT anywhere else. I have also heared of a HTTP
proxy, that will check whether the session inside CONNECT starts with SSL
handshake and will break your connection if it does not.

> A simple shell script wrapper around ssh to detect when you are behind a firewall
> can inject the ProxyCommand into the command line arguments with -o

Most of the time simply setting the parameter in .ssh/config works better
-- because you are often behind a proxy for some sites only.

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-01 11:33     ` Jan Hudec
@ 2008-01-01 11:41       ` Grégoire Barbier
  2008-01-01 18:12         ` Jakub Narebski
  0 siblings, 1 reply; 14+ messages in thread
From: Grégoire Barbier @ 2008-01-01 11:41 UTC (permalink / raw)
  To: git

Jan Hudec a écrit :
> On Mon, Dec 31, 2007 at 10:57:52 -0600, Graham Barr wrote:
>   
>> Daniel Barkalow wrote:
>>     
>>> On Sun, 30 Dec 2007, Grégoire Barbier wrote:
>>>       
>>>> As for me, the main rationale to use http(s) rather than
>>>> git or ssh is to get through corporate firewalls, otherwise I would probably
>>>> not bother with webdav.
>>>>         
>>> In general, we've been able to either get through firewalls with ssh or 
>>> it's all in the same VPN. So it's kind of unloved at this point. People 
>>> poke at it occasionally, but mostly in the context of other fixes, I 
>>> think.
>>>       
>> If you have a http proxy that you can use, the you can use ssh via that with
>> something like corkscrew. http://wiki.kartbuilding.net/index.php/Corkscrew_-_ssh_over_https
>>     
>
> This, obviously, requires, that ssh is running on port 443, because most HTTP
> proxies won't let you CONNECT anywhere else. I have also heared of a HTTP
> proxy, that will check whether the session inside CONNECT starts with SSL
> handshake and will break your connection if it does not.
>   

Hello Jan.

I think we have similar experiences. I have personnaly be faced to 
proxies that not only scan for the SSL handshake but do 
man-in-the-middle "attack" to break the SSL into two parts, checking for 
HTTP inside it (and probably scanning for viruses and things like hat, I 
think).

I first replied privatly to Graham because I didn't think it was 
interesting for the whole list.
It was a mistake, here is my answer:

In fact, I already use this hack where it is possible.

However some well advised companies does not allow CONNECT through their HTTP proxy without some limitations that make this tip unusable (for instance: allowing only port 443, allowing only sites of a white-list, forcing a man-in-the-middle that not only breaks the confidentiality but too forbids the use of other protocols such as SSH, even on port 443).

BTW such circumvention of the security facilities is often (at less where I live and with my clients) forbidden in some corporate rules, even when it is technically possible.
Therefore I'm not allowed to do so and, furthermore, I cannot tell my clients to do so and write documents that tell it's the good way.

I think that real HTTP support is better than all workarounds we will be able to find to get through firewalls (when CONNECT is not available, some awful VPNs that send Etherne over HTTP may work ;-)).
That's why I'm ok to work several hours on git code to enhance real HTTP(S) support.

-- 
Grégoire Barbier - gb à gbarbier.org - +33 6 21 35 73 49

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-01 11:41       ` Grégoire Barbier
@ 2008-01-01 18:12         ` Jakub Narebski
  2008-01-01 20:23           ` Jan Hudec
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2008-01-01 18:12 UTC (permalink / raw)
  To: Grégoire Barbier; +Cc: git

Grégoire Barbier <gb@gbarbier.org> writes:

> I think that real HTTP support is better than all workarounds we
> will be able to find to get through firewalls (when CONNECT is not
> available, some awful VPNs that send Etherne over HTTP may work
> ;-)).  That's why I'm ok to work several hours on git code to
> enhance real HTTP(S) support.

There was also an idea to create a CGI program, or enhance gitweb
to use for pushing. I don't know if it would be better way to pursue
to work around corporate firewalls, or not...

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-01 18:12         ` Jakub Narebski
@ 2008-01-01 20:23           ` Jan Hudec
  2008-01-03 19:14             ` Grégoire Barbier
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Hudec @ 2008-01-01 20:23 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Grégoire Barbier, git

On Tue, Jan 01, 2008 at 10:12:28 -0800, Jakub Narebski wrote:
> Grégoire Barbier <gb@gbarbier.org> writes:
> 
> > I think that real HTTP support is better than all workarounds we
> > will be able to find to get through firewalls (when CONNECT is not
> > available, some awful VPNs that send Etherne over HTTP may work
> > ;-)).  That's why I'm ok to work several hours on git code to
> > enhance real HTTP(S) support.
> 
> There was also an idea to create a CGI program, or enhance gitweb
> to use for pushing. I don't know if it would be better way to pursue
> to work around corporate firewalls, or not...

It is what bzr and mercurial do and I think it would be quite good way to go
for cases like this. Eg. while our corporate firewall does allow anything
through connect on 443 (so I can use ssh that way), it does *not* support
web-dav in non-ssl mode. So I eg. can't even get from public subversion
repositories at work.

I have also thought about optimizing download using CGI, but than I thought,
that maybe there is a way to statically generate packs so, that if the client
wants n revisions, the number of revisions it downloads is O(n) and the
number of packs it gets them from (and thus number of round-trips) is
O(log(n)). Assuming the client always wants everything up to the tip, of
course. Now this is trivial with linear history (pack first half, than half
of what's left, etc., gives logarithmic number of packs and you always
download at most twice as much as you need), but it would be nice if somebody
found a way (even one that satisfies the conditions on average only) to do
this with non-linear history, it would be very nice improvement to the http
download -- native git server optimizes amount of data transfered very well,
but at the cost of quite heavy CPU load on the server.

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-01 20:23           ` Jan Hudec
@ 2008-01-03 19:14             ` Grégoire Barbier
  2008-01-03 21:15               ` Jan Hudec
  0 siblings, 1 reply; 14+ messages in thread
From: Grégoire Barbier @ 2008-01-03 19:14 UTC (permalink / raw)
  To: Jan Hudec, Jakub Narebski; +Cc: git

Jan Hudec a écrit :
> On Tue, Jan 01, 2008 at 10:12:28 -0800, Jakub Narebski wrote:
>   
>> Grégoire Barbier <gb@gbarbier.org> writes:
>>
>>     
>>> I think that real HTTP support is better than all workarounds we
>>> will be able to find to get through firewalls (when CONNECT is not
>>> available, some awful VPNs that send Etherne over HTTP may work
>>> ;-)).  That's why I'm ok to work several hours on git code to
>>> enhance real HTTP(S) support.
>>>       
>> There was also an idea to create a CGI program, or enhance gitweb
>> to use for pushing. I don't know if it would be better way to pursue
>> to work around corporate firewalls, or not...
>>     
I subscribe to this point of view.
I will look at the list archive to search for what has been said before 
about this.
>
> It is what bzr and mercurial do and I think it would be quite good way to go
> for cases like this.
Ok, I will have to look at bzr and mercurial...

>  Eg. while our corporate firewall does allow anything
> through connect on 443 (so I can use ssh that way), it does *not* support
> web-dav in non-ssl mode. So I eg. can't even get from public subversion
> repositories at work.
>
> I have also thought about optimizing download using CGI, but than I thought,
> that maybe there is a way to statically generate packs so, that if the client
> wants n revisions, the number of revisions it downloads is O(n) and the
> number of packs it gets them from (and thus number of round-trips) is
> O(log(n)). Assuming the client always wants everything up to the tip, of
> course. Now this is trivial with linear history (pack first half, than half
> of what's left, etc., gives logarithmic number of packs and you always
> download at most twice as much as you need), but it would be nice if somebody
> found a way (even one that satisfies the conditions on average only) to do
> this with non-linear history, it would be very nice improvement to the http
> download -- native git server optimizes amount of data transfered very well,
> but at the cost of quite heavy CPU load on the server.
>   
Well... frankly I don't think I'm able of such things.
Writing a walker over webdav or  a simple cgi is a thing I can do (I 
think),  but I'm not tought enough (or not ready to take the time 
needed) to have a look on the internals of packing revisions (whereas I 
can imagine it would means that "my" walker would be suitable only for 
small projects in terms of code amount and commit frequency).

I had a quick look on bzr and hg, and it seems that bzr use the easy way 
(walker, no optimizations) and hg a cgi (therefore, maybe 
optimizations). By quick look I mean that I sniff the HTTP queries on 
the network during a clone. I need to look harder...

BTW I never looked at the git:// protocol. Do you think that by 
tunneling the git protocol in a cgi (hg uses URLs of the form 
"/mycgi?cmd=mycommand&...", therefore I think "tunnel" is not a bad 
word...) the performance would be good?
Maybe it's not that hard to write a performant HTTP/CGI protocol for Git 
if it's based upon existing code such as the git protocol.

-- 
Grégoire Barbier - gb à gbarbier.org - +33 6 21 35 73 49

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-03 19:14             ` Grégoire Barbier
@ 2008-01-03 21:15               ` Jan Hudec
  2008-01-03 21:43                 ` Linus Torvalds
                                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jan Hudec @ 2008-01-03 21:15 UTC (permalink / raw)
  To: Grégoire Barbier; +Cc: Jakub Narebski, git

On Thu, Jan 03, 2008 at 20:14:09 +0100, Grégoire Barbier wrote:
> Jan Hudec a écrit :
[...]
>> It is what bzr and mercurial do and I think it would be quite good way to go
>> for cases like this.
> Ok, I will have to look at bzr and mercurial...

Bzr is quite far, design-wise, I fear. The mercurial might be a little more
interesting to study, but being in python and internally somewhat
file-oriented, I wouldn't think it is of much use.

You should start with upload, leaving the download direction to the dumb
machinery git currently uses.

[...]
>> I have also thought about optimizing download using CGI, but than I thought,
>> that maybe there is a way to statically generate packs so, that if the client
>> wants n revisions, the number of revisions it downloads is O(n) and the
>> number of packs it gets them from (and thus number of round-trips) is
>> O(log(n)). Assuming the client always wants everything up to the tip, of
>> course. Now this is trivial with linear history (pack first half, than half
>> of what's left, etc., gives logarithmic number of packs and you always
>> download at most twice as much as you need), but it would be nice if somebody
>> found a way (even one that satisfies the conditions on average only) to do
>> this with non-linear history, it would be very nice improvement to the http
>> download -- native git server optimizes amount of data transfered very well,
>> but at the cost of quite heavy CPU load on the server.
>>   
> Well... frankly I don't think I'm able of such things.
> Writing a walker over webdav or  a simple cgi is a thing I can do (I 
> think),  but I'm not tought enough (or not ready to take the time needed) 
> to have a look on the internals of packing revisions (whereas I can imagine 
> it would means that "my" walker would be suitable only for small projects 
> in terms of code amount and commit frequency).

Well, it does not depend on the walker -- the walker is quite simple and
already written anyway.

> I had a quick look on bzr and hg, and it seems that bzr use the easy way 
> (walker, no optimizations)

That's not quite true -- bzr has both dumb (walker over plain HTTP) and smart
(CGI) methods. But their CGI is really just tunelling their custom protocol
over HTTP and that protocol will not be anywhere near what we want for git
because of vastly different design of the storage.

> and hg a cgi (therefore, maybe optimizations). 
> By quick look I mean that I sniff the HTTP queries on the network during a 
> clone. I need to look harder...

Yes, mercurial uses a CGI. But I am not sure how similar their approach is to
anything that would make sense for git, so looking at the details might or
might not be useful.

> BTW I never looked at the git:// protocol. Do you think that by tunneling 
> the git protocol in a cgi (hg uses URLs of the form 
> "/mycgi?cmd=mycommand&...", therefore I think "tunnel" is not a bad 
> word...) the performance would be good?

It would be pretty hard to tunnel it and it would loose all it's nice
properties. The git protocol, for pull, basically works like this:

 - server sends a list of it's refs
 - client tells server which ones it wants
 - client starts listing revisions it has, newest to oldest
 - server tells client whenever it finds common ancestor with one of the
   heads desired
 - client restarts the listing from next ref
 - server starts sending the data when client runs out of refs to list

The main point about the protocol is, that the client is listing the refs, as
fast as it can and server will stop it when it sees a revision it knows.
Therefore there will only be one round-trip to discover each common ancestor.

However, you can't do this over HTTP, because response won't be started until
the request is received. You could be sending a lot of smallish requests and
quick, often empty, responses to them. However, that will waste a lot of
bandwidth (because of the HTTP overhead) and loose much of the speed anyway.
Also the HTTP protocol is stateless, but this is inherently stateful, so you
would have to work that around somehow too. Therefore a different approach is
preferable on HTTP.

Now to keep it stateless, I thought that:
 - client would first ask for list of refs
 - client would than ask for pack containing the first ref
 - server would respond with pack containing just that commit plus all
   objects that are not referenced by any of it's parents
 - if client does not have it's parent, it would ask for pack containing that
 - since it's second request, server would pack 2 revisions (with necessary
   objects) this time
 - if client still does not have all parents, it would again ask for a pack,
   receiving 4 revisions this time (next 8, 16, etc...)

This would guarantee, that when you want n revisions, you make at most
log2(n) requests and get at most 2*n revisions (well, the requests are for
each ref separately, so it's more like m*log2(n) where m is number of refs,
but still). Additionally, it would be stateless, because the client would
simply say 'I want commit abcdef01 and this is my 3rd request' and server
would provide that commit and 7 it's parents (ie. 2^3 commits).

Now generating the packs takes it's CPU. The servers like git.kernel.org have
quite high CPU load. But in this schema, all clients would most of the time
get the same packs (unlike native git protocol, where the client always gets
single pack with exactly what it needs). So the idea struck me, that they
could simply be statically generated and fetched via the existing dumb
protocol. It would get the efficiency and save a lot of CPU power, which
is allows one to serve quite busy git repository from a limited (and
therefore cheap) virtual machine or even (yes, I saw such idea on the list)
serving any repository from NSLU2.

Now to create a packing policy to create such packs, you don't actually need
to touch any C -- because git-repack is still in shell -- and you don't
really need to touch any internals, because you only need to change how the
pack-objects command will be called and leave all the dirty details to that
command.

I would personally not re-split the already generated packs. Only find some
algorithm for choosing when packs are deep enough in history so they should
be merged together. It would also might not make sense to ever pack unpacked
objects to more than one pack -- a dumb-HTTP-served might have a requirement
of running this kind of repack after every push and clients will rarely have
part of single push to the server.

> Maybe it's not that hard to write a performant HTTP/CGI protocol for Git if 
> it's based upon existing code such as the git protocol.

For push it might, or might not be easy. But in the worst case you should be
able to calculate a pack to upload locally (fetching from the server
beforehead if necessary), upload that pack (or bundle) and update all the
refs.

For pull it certainly won't be. You might be able to reimplement the common
ref discovery using some kind of gradually increasing ref list and then
generate a bundle for the server, but optimizing the dumb protocol seems more
useful to me. As I said, generating the packs only requires devising a way of
selecting which objects should go together and git pack-objects will take
care of the dirty details of generating the packs and git update-server-info
will take care of the dirty details of presenting the list of packs to
client.

-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-03 21:15               ` Jan Hudec
@ 2008-01-03 21:43                 ` Linus Torvalds
  2008-01-03 21:47                 ` Jakub Narebski
  2008-01-03 23:54                 ` Martin Langhoff
  2 siblings, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2008-01-03 21:43 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Grégoire Barbier, Jakub Narebski, git



On Thu, 3 Jan 2008, Jan Hudec wrote:
> 
> That's not quite true -- bzr has both dumb (walker over plain HTTP) and smart
> (CGI) methods. But their CGI is really just tunelling their custom protocol
> over HTTP and that protocol will not be anywhere near what we want for git
> because of vastly different design of the storage.

Well, tunnelling the native git protocol is *exactly* what you'd want to 
do with some git CGI thing. So no, I don't think the actual stuff you 
tunnel would have any relationship, but the actual code to set up a tunnel 
and making it all look like some fake html sequence might be something 
that can be used as a base.

		Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-03 21:15               ` Jan Hudec
  2008-01-03 21:43                 ` Linus Torvalds
@ 2008-01-03 21:47                 ` Jakub Narebski
  2008-01-03 23:29                   ` Grégoire Barbier
  2008-01-03 23:54                 ` Martin Langhoff
  2 siblings, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2008-01-03 21:47 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Grégoire Barbier, git

Jan Hudec wrote:
> On Thu, Jan 03, 2008 at 20:14:09 +0100, Grégoire Barbier wrote:
>
> > I had a quick look on bzr and hg, and it seems that bzr use the easy way 
> > (walker, no optimizations)
> 
> That's not quite true -- bzr has both dumb (walker over plain HTTP) and smart
> (CGI) methods. But their CGI is really just tunelling their custom protocol
> over HTTP and that protocol will not be anywhere near what we want for git
> because of vastly different design of the storage.

Perhaps we could also simply tunnel git protocol over HTTP / HTTPS?
 
> > and hg a cgi (therefore, maybe optimizations). 
> > By quick look I mean that I sniff the HTTP queries on the network during a 
> > clone. I need to look harder...
> 
> Yes, mercurial uses a CGI. But I am not sure how similar their approach is to
> anything that would make sense for git, so looking at the details might or
> might not be useful.
> 
> > BTW I never looked at the git:// protocol. Do you think that by tunneling 
> > the git protocol in a cgi (hg uses URLs of the form 
> > "/mycgi?cmd=mycommand&...", therefore I think "tunnel" is not a bad 
> > word...) the performance would be good?
> 
> It would be pretty hard to tunnel it and it would loose all it's nice
> properties. The git protocol, for pull, basically works like this:
> 
>  - server sends a list of it's refs
>  - client tells server which ones it wants
>  - client starts listing revisions it has, newest to oldest
>  - server tells client whenever it finds common ancestor with one of the
>    heads desired
>  - client restarts the listing from next ref
>  - server starts sending the data when client runs out of refs to list
> 
> The main point about the protocol is, that the client is listing the refs, as
> fast as it can and server will stop it when it sees a revision it knows.
> Therefore there will only be one round-trip to discover each common ancestor.
> 
> However, you can't do this over HTTP, because response won't be started until
> the request is received. You could be sending a lot of smallish requests and
> quick, often empty, responses to them. However, that will waste a lot of
> bandwidth (because of the HTTP overhead) and loose much of the speed anyway.
> Also the HTTP protocol is stateless, but this is inherently stateful, so you
> would have to work that around somehow too. Therefore a different approach is
> preferable on HTTP.

Perhaps we could use AJAX (XMLHttpRequest for communication, plain HTTP or
IFRAMES for sending data) or something like this for git protocol tunneling?

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-03 21:47                 ` Jakub Narebski
@ 2008-01-03 23:29                   ` Grégoire Barbier
  0 siblings, 0 replies; 14+ messages in thread
From: Grégoire Barbier @ 2008-01-03 23:29 UTC (permalink / raw)
  To: git

Jakub Narebski a écrit :
> Perhaps we could use AJAX (XMLHttpRequest for communication, plain HTTP or
> IFRAMES for sending data) or something like this for git protocol tunneling?
>   
well... I think I may manage to avoid javascript... ;-)
more seriously, I was thinking of using http in an automated, 
un-human-browsable manner, not a full html user interface.

-- 
Grégoire Barbier - gb à gbarbier.org - +33 6 21 35 73 49

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-03 21:15               ` Jan Hudec
  2008-01-03 21:43                 ` Linus Torvalds
  2008-01-03 21:47                 ` Jakub Narebski
@ 2008-01-03 23:54                 ` Martin Langhoff
  2008-01-04 19:59                   ` Jan Hudec
  2 siblings, 1 reply; 14+ messages in thread
From: Martin Langhoff @ 2008-01-03 23:54 UTC (permalink / raw)
  To: Jan Hudec; +Cc: Grégoire Barbier, Jakub Narebski, git

On Jan 4, 2008 10:15 AM, Jan Hudec <bulb@ucw.cz> wrote:
> Now to keep it stateless, I thought that:
...
> This would guarantee, that when you want n revisions, you make at most
> log2(n) requests and get at most 2*n revisions (well, the requests are for

That is still a lot! How about, for each ref

 - Client sends a POST listing the ref and the latest related commit
it has that the server is likely to have (from origin/heads/<ref>).
Optionally, it can provide a blacklist of <treeish> (where every
object refered is known) and blob sha1s.
 - Server sends the new sha1 of the ref, and a thin pack that covers the changes
 - The client can disconnect to stop the transaction. For example --
if it sees the sha1 of a huge object that it already has. It can
re-request, with a blacklist.

A good number of objects will be sent unnecesarily - with no option to
the client to say "I have this" - but by using the hint of letting the
server know we have origin/heads/<ref> I suspect that it will be
minimal.

Also:
 - It will probably be useful to list all the refs the client knows
from that server in the request.
 - If the ref has changed with a non-fast-forward, the server needs to
say so, and provide a listing of the commits. As soon as the client
spots a common commit, it can close the connection -- it now knows
what ref to tell the server about in a subsequent command.

This way, you ideally have 1 request per ref, 2 if it has been
rebased/rewound. This can probably get reorganised to do several refs
in one request.

cheers,


m

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: git over webdav: what can I do for improving http-push ?
  2008-01-03 23:54                 ` Martin Langhoff
@ 2008-01-04 19:59                   ` Jan Hudec
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Hudec @ 2008-01-04 19:59 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Grégoire Barbier, Jakub Narebski, git

On Fri, Jan 04, 2008 at 12:54:58 +1300, Martin Langhoff wrote:
> On Jan 4, 2008 10:15 AM, Jan Hudec <bulb@ucw.cz> wrote:
> > Now to keep it stateless, I thought that:
> ...
> > This would guarantee, that when you want n revisions, you make at most
> > log2(n) requests and get at most 2*n revisions (well, the requests are for
> 
> That is still a lot! How about, for each ref

The whole point of that is that the packs can be statically precomputed and
served with quite low CPU load, which is useful for serving from shared
computers (like servers in school computer labs or cheapo web hosting) or
slow servers like NSLU2. Also it makes HTTP caching actually useful, because
the set of possible requests is quite limited.

Also, while I said it's for each ref, the packs should really be optimized
for the common case of fetching all refs, which would really make it just
log2(n) packs and 2*n revisions for each whole download.

>  - Client sends a POST listing the ref and the latest related commit
> it has that the server is likely to have (from origin/heads/<ref>).
> Optionally, it can provide a blacklist of <treeish> (where every
> object refered is known) and blob sha1s.
>  - Server sends the new sha1 of the ref, and a thin pack that covers the changes
>  - The client can disconnect to stop the transaction. For example --
> if it sees the sha1 of a huge object that it already has. It can
> re-request, with a blacklist.
> 
> A good number of objects will be sent unnecesarily - with no option to
> the client to say "I have this" - but by using the hint of letting the
> server know we have origin/heads/<ref> I suspect that it will be
> minimal.

It would be better to only unnecesarily send revlists. Since each HTTP packed
will likely have something like 1kb overhead, sending few kb worth of revlist
is still pretty efficient. So just send part of revlist, than more revlist
and so on until you find exactly which revisions you need and than ask for
them. That will save *both* bandwidth *and* server CPU. The only reason to
waste bandwidth is to save CPU and you are not doing that.

> Also:
>  - It will probably be useful to list all the refs the client knows
> from that server in the request.
>  - If the ref has changed with a non-fast-forward, the server needs to
> say so, and provide a listing of the commits. As soon as the client
> spots a common commit, it can close the connection -- it now knows
> what ref to tell the server about in a subsequent command.
> 
> This way, you ideally have 1 request per ref, 2 if it has been
> rebased/rewound. This can probably get reorganised to do several refs
> in one request.
> 
> cheers,
> 
> 
> m
-- 
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-01-04 19:59 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-30 22:59 git over webdav: what can I do for improving http-push ? Grégoire Barbier
2007-12-31  3:46 ` Daniel Barkalow
2007-12-31 16:57   ` Graham Barr
2008-01-01 11:33     ` Jan Hudec
2008-01-01 11:41       ` Grégoire Barbier
2008-01-01 18:12         ` Jakub Narebski
2008-01-01 20:23           ` Jan Hudec
2008-01-03 19:14             ` Grégoire Barbier
2008-01-03 21:15               ` Jan Hudec
2008-01-03 21:43                 ` Linus Torvalds
2008-01-03 21:47                 ` Jakub Narebski
2008-01-03 23:29                   ` Grégoire Barbier
2008-01-03 23:54                 ` Martin Langhoff
2008-01-04 19:59                   ` Jan Hudec

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).