git.vger.kernel.org archive mirror
* [PATCH 2/3] Support for partial HTTP transfers
@ 2005-09-26 17:52 Nick Hengeveld
  2005-09-26 21:19 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Nick Hengeveld @ 2005-09-26 17:52 UTC (permalink / raw)
  To: git


Support for partial HTTP transfers - if a previous temp file is detected,
read it in and start the HTTP transfer from where the previous left off.

Signed-off-by: Nick Hengeveld <nickh@reactrix.com>


---

 http-fetch.c |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 67 insertions(+), 5 deletions(-)

34a692953368188cbaefbd6c60e400053f8528b4
diff --git a/http-fetch.c b/http-fetch.c
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -13,8 +13,12 @@
 #define curl_global_init(a) do { /* nothing */ } while(0)
 #endif
 
+#define PREV_BUF_SIZE 4096
+#define RANGE_HEADER_SIZE 30
+
 static CURL *curl;
 static struct curl_slist *no_pragma_header;
+static struct curl_slist *no_range_header;
 
 static char *initial_base;
 
@@ -351,14 +355,26 @@ int fetch_object(struct alt_base *repo, 
 	char *filename = sha1_file_name(sha1);
 	unsigned char real_sha1[20];
 	char tmpfile[PATH_MAX];
+	char prevfile[PATH_MAX];
 	int ret;
 	char *url;
 	char *posn;
+	int prevlocal;
+	unsigned char prev_buf[PREV_BUF_SIZE];
+	ssize_t prev_read = 0;
+	long prev_posn = 0;
+	char range[RANGE_HEADER_SIZE];
+	struct curl_slist *range_header = NULL;
+	CURLcode curl_result;
+
+	snprintf(tmpfile, sizeof(tmpfile), "%s.temp", filename);
+	snprintf(prevfile, sizeof(prevfile), "%s.prev", filename);
+	unlink(prevfile);
+	rename(tmpfile, prevfile);
+	unlink(tmpfile);
 
-	snprintf(tmpfile, sizeof(tmpfile), "%s/obj_XXXXXX",
-		 get_object_directory());
+	local = open(tmpfile, O_WRONLY | O_CREAT | O_EXCL, 0666);
 
-	local = mkstemp(tmpfile);
 	if (local < 0)
 		return error("Couldn't create temporary file %s for %s: %s\n",
 			     tmpfile, filename, strerror(errno));
@@ -386,8 +402,52 @@ int fetch_object(struct alt_base *repo, 
 
 	curl_easy_setopt(curl, CURLOPT_URL, url);
 
-	if (curl_easy_perform(curl)) {
-		unlink(filename);
+	/* If a previous temp file is present, process what was already
+	   fetched. */
+	prevlocal = open(prevfile, O_RDONLY);
+	if (prevlocal != -1) {
+		do {
+			prev_read = read(prevlocal, prev_buf, PREV_BUF_SIZE);
+			if (prev_read>0) {
+				if (fwrite_sha1_file(prev_buf,
+						     1,
+						     prev_read,
+						     NULL) == prev_read) {
+					prev_posn += prev_read;
+				} else {
+					prev_read = -1;
+				}
+			}
+		} while (prev_read > 0);
+		close(prevlocal);
+	}
+	unlink(prevfile);
+
+	/* Reset inflate/SHA1 if there was an error reading the previous temp
+	   file; also rewind to the beginning of the local file. */
+	if (prev_read == -1) {
+		memset(&stream, 0, sizeof(stream));
+		inflateInit(&stream);
+		SHA1_Init(&c);
+		if (prev_posn>0) {
+			prev_posn = 0;
+			lseek(local, 0, SEEK_SET);
+		}
+	}
+
+	/* If we have successfully processed data from a previous fetch
+	   attempt, only fetch the data we don't already have. */
+	if (prev_posn>0) {
+		sprintf(range, "Range: bytes=%ld-", prev_posn);
+		range_header = curl_slist_append(range_header, range);
+		curl_easy_setopt(curl, CURLOPT_HTTPHEADER, range_header);
+	}
+
+	/* Clear out the Range: header after performing the request, so
+	   other curl requests don't inherit inappropriate header data */
+	curl_result = curl_easy_perform(curl);
+	curl_easy_setopt(curl, CURLOPT_HTTPHEADER, no_range_header);
+	if (curl_result != 0) {
 		return -1;
 	}
 
@@ -517,6 +577,7 @@ int main(int argc, char **argv)
 
 	curl = curl_easy_init();
 	no_pragma_header = curl_slist_append(no_pragma_header, "Pragma:");
+	no_range_header = curl_slist_append(no_range_header, "Range:");
 
         /* Set SSL parameters if they were provided */
 	if (ssl_cert != NULL) {
@@ -549,6 +610,7 @@ int main(int argc, char **argv)
 		return 1;
 
 	curl_slist_free_all(no_pragma_header);
+	curl_slist_free_all(no_range_header);
 	curl_global_cleanup();
 	return 0;
 }

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-26 17:52 [PATCH 2/3] Support for partial HTTP transfers Nick Hengeveld
@ 2005-09-26 21:19 ` Junio C Hamano
  2005-09-27  0:09   ` Nick Hengeveld
  2005-09-26 22:23 ` Daniel Barkalow
  2005-09-27 10:35 ` sf
  2 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2005-09-26 21:19 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> +	unlink(prevfile);
> +	rename(tmpfile, prevfile);
> +	unlink(tmpfile);

If rename() succeeds then tmpfile is no more.  If rename() fails
because there were no tmpfile to begin with, it is not an error
(i.e. you are not recovering) and that case there would not be
tmpfile either.  Otherwise, if tmpfile still remains to unlink()
because rename() failed for any other reason, wouldn't you
rather report it as an error and abort?

I wonder what happens if by mistake or intentionally we run two
http-fetch instances simultaneously.  IIRC, the current code is
safe -- the resulting object database will have the object file
fetched by one of the instances, and the updating of the ref is done
via write_ref_sha1(), so it also is safe.  But your change may
introduce an interesting case where one creates a tmpfile, the
other one moves it to prevfile and starts using its partial
contents, and possibly gets confused -- it will probably fail
at the end detecting inconsistent object so it is probably not a
big loss.

Personally, I do not think people would mind much if we
introduce the BKL at the beginning of git-fetch.sh to prevent
multiple fetches stomping on each other, if somebody cared
enough (hint, hint).

> -	snprintf(tmpfile, sizeof(tmpfile), "%s/obj_XXXXXX",
> -		 get_object_directory());
> +	local = open(tmpfile, O_WRONLY | O_CREAT | O_EXCL, 0666);
> -	local = mkstemp(tmpfile);

I introduced this mkstemp() part recently to mimic the local
object creation where newly created object files are built in
the same directory, hoping that they would be allocated close to
each other when more than one are created in sequence.  I do not
know if it matters much in practice -- has anybody measured, and
does anybody care?

> +	/* Reset inflate/SHA1 if there was an error reading the previous temp
> +	   file; also rewind to the beginning of the local file. */

Maybe not just rewind but truncate as well?  It probably does
not matter much in practice, but the previous representation,
whose fetch was interrupted in the middle, could have been much
larger than the representation you are slurping right now.

There was a discussion about how an object file with the same SHA1
and the same contents can have different compressed representations
(we hash then compress so the resulting filesize depends on the
compression level without affecting the contents of the object).
In a "doctor, it hurts when I do this -- don't do it, then" kind
of corner case, a DNS rotated pair of webservers could be
serving the same object in different representations and you may
get interrupted while fetching from one, and restart the
transfer from the other.  The SHA1 check at the end hopefully
would catch this kind of situation, and that round of http-fetch
would fail -- the user needs to re-run the fetch so it is not a
big loss, but it is something to keep in mind.

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-26 17:52 [PATCH 2/3] Support for partial HTTP transfers Nick Hengeveld
  2005-09-26 21:19 ` Junio C Hamano
@ 2005-09-26 22:23 ` Daniel Barkalow
  2005-09-27 10:35 ` sf
  2 siblings, 0 replies; 10+ messages in thread
From: Daniel Barkalow @ 2005-09-26 22:23 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git

On Mon, 26 Sep 2005, Nick Hengeveld wrote:

> Support for partial HTTP transfers - if a previous temp file is detected,
> read it in and start the HTTP transfer from where the previous left off.

You probably want the corresponding code for where it's downloading pack
and index files, which tend to be a lot bigger than individual objects. 

	-Daniel
*This .sig left intentionally blank*

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-26 21:19 ` Junio C Hamano
@ 2005-09-27  0:09   ` Nick Hengeveld
  2005-09-27  5:46     ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Hengeveld @ 2005-09-27  0:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Sep 26, 2005 at 02:19:31PM -0700, Junio C Hamano wrote:

> If rename() succeeds then tmpfile is no more.  If rename() fails
> because there were no tmpfile to begin with, it is not an error
> (i.e. you are not recovering) and that case there would not be
> tmpfile either.  Otherwise, if tmpfile still remains to unlink()
> because rename() failed for any other reason, wouldn't you
> rather report it as an error and abort?

I didn't see that as a fatal condition since the transfer would still
take place, albeit a full transfer rather than a partial one.  It does
seem worth at least reporting it as a warning though.  If a previous
tmpfile exists and both the rename and unlink fail, the subsequent
tmpfile open will also fail - unlikely but probably also worth
detecting and reporting.

> I wonder what happens if by mistake or intentionally we run two
> http-fetch instances simultaneously.  IIRC, the current code is
> safe -- the resulting object database will have the object file
> fetched by one of the instances, and the updating of the ref is done
> via write_ref_sha1(), so it also is safe.  But your change may
> introduce an interesting case where one creates a tmpfile, the
> other one moves it to prevfile and starts using its partial
> contents, and possibly gets confused -- it will probably fail
> at the end detecting inconsistent object so it is probably not a
> big loss.

How about using mkstemp on the prev file to keep multiple instances
from stepping on each other?  Since O_CREAT | O_EXCL is used to
open the tmpfile, only one instance will be able to succeed and
continue.

> > +	/* Reset inflate/SHA1 if there was an error reading the previous temp
> > +	   file; also rewind to the beginning of the local file. */
> 
> Maybe not just rewind but truncate as well?  It probably does
> not matter much in practice, but the previous representation,
> whose fetch was interrupted in the middle, could have been much
> larger than the representation you are slurping right now.

Good point, I'll update the patch.

> There was a discussion about how an object file with the same SHA1
> and the same contents can have different compressed representations
> (we hash then compress so the resulting filesize depends on the
> compression level without affecting the contents of the object).
> In a "doctor, it hurts when I do this -- don't do it, then" kind
> of corner case, a DNS rotated pair of webservers could be
> serving the same object in different representations and you may
> get interrupted while fetching from one, and restart the
> transfer from the other.  The SHA1 check at the end hopefully
> would catch this kind of situation, and that round of http-fetch
> would fail -- the user needs to re-run the fetch so it is not a
> big loss, but it is something to keep in mind.

That's an annoying case, all right...  Would it be worth including a 
full retry if a partial failed the SHA1 check?

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-27  0:09   ` Nick Hengeveld
@ 2005-09-27  5:46     ` Junio C Hamano
  2005-09-27 15:36       ` Nick Hengeveld
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2005-09-27  5:46 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> How about using mkstemp on the prev file to keep multiple instances
> from stepping on each other?  Since O_CREAT | O_EXCL is used to
> open the tmpfile, only one instance will be able to succeed and
> continue.

I think (O_CREAT|O_EXCL) in your code can be easily defeated by
this sequence:

    one                 	two

    unlink prevfile
    rename tmpfile, prevfile
    unlink tmpfile
    local = open tmpfile
                                unlink prevfile
                                rename tmpfile, prevfile
                                unlink tmpfile
                                local = open tmpfile
                                prevlocal = open prevfile
    write local
                                copyfile prevlocal, local
    write local
				???

Not that I think the multiple instances should be prevented at
this low level --- if they stomp on each other at this level, it
is very likely that they are doing duplicated work on the network
side as well, and should be prevented from doing so at much
higher level than this, I think.  That's why I said I do not
mind BKL upfront in git-fetch.sh.

> That's an annoying case, all right...  Would it be worth including a 
> full retry if a partial failed the SHA1 check?

Probably not.  Just failing loudly and asking the upper layer to
retry would be fine, I think.

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-26 17:52 [PATCH 2/3] Support for partial HTTP transfers Nick Hengeveld
  2005-09-26 21:19 ` Junio C Hamano
  2005-09-26 22:23 ` Daniel Barkalow
@ 2005-09-27 10:35 ` sf
  2 siblings, 0 replies; 10+ messages in thread
From: sf @ 2005-09-27 10:35 UTC (permalink / raw)
  To: git

Nick Hengeveld wrote:
> Support for partial HTTP transfers - if a previous temp file is detected,
> read it in and start the HTTP transfer from where the previous left off.

Please read my recent posting:

	http://permalink.gmane.org/gmane.comp.version-control.git/8991.

Regards

	Stephan

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-27  5:46     ` Junio C Hamano
@ 2005-09-27 15:36       ` Nick Hengeveld
  2005-09-27 17:09         ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Hengeveld @ 2005-09-27 15:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Sep 26, 2005 at 10:46:30PM -0700, Junio C Hamano wrote:

> I think (O_CREAT|O_EXCL) in your code can be easily defeated by
> this sequence:
> 
>     one                 	two
> 
>     unlink prevfile
>     rename tmpfile, prevfile
>     unlink tmpfile
>     local = open tmpfile
>                                 unlink prevfile
>                                 rename tmpfile, prevfile
>                                 unlink tmpfile
>                                 local = open tmpfile
>                                 prevlocal = open prevfile
>     write local
>                                 copyfile prevlocal, local
>     write local
> 				???

The way I understand this sequence, one and two will not be writing to
the same file.  When two unlinks tmpfile, one will still be able to 
write to its local, but the body of the file that one is writing will
be removed when it closes local.

> Not that I think the multiple instances should be prevented at
> this low level --- if they stomp on each other at this level, it
> is very likely the they are doing duplicated work on the network
> side as well, and should be prevented from doing so at much
> higher level than this, I think.  That's why I said I do not
> mind BKL upfront in git-fetch.sh.

True, is that something I should include with the partial patch?

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-27 15:36       ` Nick Hengeveld
@ 2005-09-27 17:09         ` Junio C Hamano
  2005-09-27 17:19           ` Nick Hengeveld
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2005-09-27 17:09 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> The way I understand this sequence, one and two will not be writing to
> the same file.  When two unlinks tmpfile, one will still be able to 
> write to its local, but the body of the file that one is writing will
> be removed when it closes local.

True.  What I meant by "interesting" is that two is reading from
what one is writing.

>> Not that I think the multiple instances should be prevented at
>> this low level --- if they stomp on each other at this level, it
> is very likely that they are doing duplicated work on the network
>> side as well, and should be prevented from doing so at much
>> higher level than this, I think.  That's why I said I do not
>> mind BKL upfront in git-fetch.sh.
>
> True, is that something I should include with the partial patch?

No.

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-27 17:09         ` Junio C Hamano
@ 2005-09-27 17:19           ` Nick Hengeveld
  2005-09-27 22:24             ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Hengeveld @ 2005-09-27 17:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Sep 27, 2005 at 10:09:49AM -0700, Junio C Hamano wrote:

> True.  What I meant by "interesting" is that two is reading from
> what one is writing.

Excellent point.  Given the potential for problems related to this
issue or to compressed representation, would it make sense to enable
partial transfers via a command-line option and leave the feature
disabled by default?

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.

* Re: [PATCH 2/3] Support for partial HTTP transfers
  2005-09-27 17:19           ` Nick Hengeveld
@ 2005-09-27 22:24             ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2005-09-27 22:24 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> On Tue, Sep 27, 2005 at 10:09:49AM -0700, Junio C Hamano wrote:
>
>> True.  What I meant by "interesting" is that two is reading from
>> what one is writing.
>
> Excellent point.  Given the potential for problems related to this
> issue or to compressed representation, would it make sense to enable
> partial transfers via a command-line option and leave the feature
> disabled by default?

Here is what I think.  Multiple transfers at the same time, if
it becomes a problem in practice, should be prevented at the upper
layer anyway, so this "interesting" problem is a moot point.  So
I'd say let's not worry about that.

Doing partial transfer support for packs and indices would be
a more useful addition to your code than dealing with the above --
packs are much bigger than individual objects and would get much
bigger benefit from restartable download.

Different compressed representation is just a theoretical
curiosity and we do not know if it is a problem in practice.  As
long as you can detect the situation and die, it would be OK, I
think.


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).