Cloning from sites with 404 overridden


* Cloning from sites with 404 overridden
@ 2006-03-19 10:52 Marco Costalba
  2006-03-19 13:25 ` Paolo Ciarrocchi
  0 siblings, 1 reply; 35+ messages in thread
From: Marco Costalba @ 2006-03-19 10:52 UTC (permalink / raw)
  To: git; +Cc: junkio

Hi all,

    I have set up a git repository on a hosted public site:
http://digilander.libero.it/mcostalba/scm/qgit.git

I cannot run any process (read git-daemon) on that site, so git-clone uses
a 'dumb server' type protocol and this is what I got.

$ git clone http://digilander.libero.it/mcostalba/scm/qgit.git
error: File 8dea03519e75f47da91108330dde3043defddd60
(http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47da91108330dde3043defddd60)
corrupt
Getting pack list for http://digilander.libero.it/mcostalba/scm/qgit.git/
Getting index for pack fe1f3586b38e70e963de47f31379ef170adc5ca9
Getting pack fe1f3586b38e70e963de47f31379ef170adc5ca9
 which contains 8dea03519e75f47da91108330dde3043defddd60
walk 8dea03519e75f47da91108330dde3043defddd60
walk ec47dab590fb838ba2be7af5bf9aa46d9f2e502d

-------------- cut ------------------------

walk 907d47e836f4f174386d02d21e38aeafc1e79626
walk 5d3454248bbb3aaba080057dc9666a3c3aaeca1f
$

The error above occurs because git requests a nonexistent object
(8dea03519e75f47da91108330dde3043defddd60) _and_ the site answers with
a pre-canned 'page not found' HTML page instead of reporting a 404 error.
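
To make the failing request concrete: the URL in the transcript follows
the loose-object layout, that is, "objects/", then the first two hex
digits of the object name as a directory, then the remaining 38 digits
as the file name. A minimal sketch in C, assuming a hypothetical helper
name (loose_object_path is an illustration, not a function taken from
git):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the relative path of a loose object from its 40-character
 * lowercase hex name: "objects/" + 2 digits + "/" + 38 digits.
 * Returns 0 on success, -1 if the name is malformed or `out` is too
 * small.  Hypothetical helper for illustration only.
 */
int loose_object_path(const char *hex, char *out, size_t outlen)
{
	size_t i;

	assert(hex != NULL && out != NULL);
	if (strlen(hex) != 40)
		return -1;
	for (i = 0; i < 40; i++) {
		char c = hex[i];
		if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')))
			return -1;
	}
	/* "objects/" (8) + 2 + "/" (1) + 38 + terminating NUL (1) = 50 */
	if (outlen < 50)
		return -1;
	snprintf(out, outlen, "objects/%.2s/%s", hex, hex + 2);
	return 0;
}
```

For the object name in the transcript this produces
objects/8d/ea03519e75f47da91108330dde3043defddd60, which is the path
the site answered with an HTML page.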

After some research I found that it is quite common for public hosting
sites to serve a pre-canned 'Sorry, no page here' HTML page instead of
a 404.

So my request is whether it is possible for git to _learn_ this and
avoid being fooled by these kinds of public sites.

Thanks
Marco

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 10:52 Marco Costalba
@ 2006-03-19 13:25 ` Paolo Ciarrocchi
  2006-03-19 14:04   ` Marco Costalba
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Ciarrocchi @ 2006-03-19 13:25 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git, junkio

On 3/19/06, Marco Costalba <mcostalba@gmail.com> wrote:
> Hi all,

Ciao Marco,

>     I have set up a git repository on a hosted public site:
> http://digilander.libero.it/mcostalba/scm/qgit.git
>
> I cannot run any process (read git-daemon) on that site, so git-clone uses
> a 'dumb server' type protocol and this is what I got.
>
> $ git clone http://digilander.libero.it/mcostalba/scm/qgit.git
> error: File 8dea03519e75f47da91108330dde3043defddd60
> (http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47da91108330dde3043defddd60)
> corrupt
> Getting pack list for http://digilander.libero.it/mcostalba/scm/qgit.git/
> Getting index for pack fe1f3586b38e70e963de47f31379ef170adc5ca9
> Getting pack fe1f3586b38e70e963de47f31379ef170adc5ca9
>  which contains 8dea03519e75f47da91108330dde3043defddd60
> walk 8dea03519e75f47da91108330dde3043defddd60
> walk ec47dab590fb838ba2be7af5bf9aa46d9f2e502d
>
> -------------- cut ------------------------
>
> walk 907d47e836f4f174386d02d21e38aeafc1e79626
> walk 5d3454248bbb3aaba080057dc9666a3c3aaeca1f
> $
>
> The error above occurs because git requests a nonexistent object
> (8dea03519e75f47da91108330dde3043defddd60) _and_ the site answers with
> a pre-canned 'page not found' HTML page instead of reporting a 404 error.
>
> After some research I found that it is quite common for public hosting
> sites to serve a pre-canned 'Sorry, no page here' HTML page instead of
> a 404.
>
> So my request is whether it is possible for git to _learn_ this and
> avoid being fooled by these kinds of public sites.
>

How about getting an account on kernel.org?

Anyway, here is what I did:
paolo@Italia:~$ cg-clone http://digilander.libero.it/mcostalba/scm/qgit.git qgit
defaulting to local storage area
Fetching head...
Fetching objects...
error: File 8dea03519e75f47da91108330dde3043defddd60
(http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47da91108330dde3043defddd60)
corrupt

Getting pack list for http://digilander.libero.it/mcostalba/scm/qgit.git/
Getting index for pack fe1f3586b38e70e963de47f31379ef170adc5ca9
Getting pack fe1f3586b38e70e963de47f31379ef170adc5ca9
 which contains 8dea03519e75f47da91108330dde3043defddd60
Fetching tags...
Missing tag qgit-0.93... retrieved
Missing tag qgit-0.94... retrieved
Missing tag qgit-0.94.1... retrieved
Missing tag qgit-0.95.1... retrieved
Missing tag qgit-0.96... retrieved
Missing tag qgit-0.96.1... retrieved
Missing tag qgit-0.97... retrieved
Missing tag qgit-0.97.1... retrieved
Missing tag qgit-0.97.2... retrieved
Missing tag qgit-1.0... retrieved
Missing tag qgit-1.1rc1... retrieved
Missing tag qgit-1.1rc3... retrieved
New branch: 8dea03519e75f47da91108330dde3043defddd60
Cloned to qgit/ (origin http://digilander.libero.it/mcostalba/scm/qgit.git available as branch "origin")

Why am I getting this error?
error: File 8dea03519e75f47da91108330dde3043defddd60
(http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47da91108330dde3043defddd60)
corrupt

--
Paolo
http://paolociarrocchi.googlepages.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 13:25 ` Paolo Ciarrocchi
@ 2006-03-19 14:04   ` Marco Costalba
  2006-03-19 19:37     ` Junio C Hamano
  2006-03-19 19:47     ` Junio C Hamano
  0 siblings, 2 replies; 35+ messages in thread
From: Marco Costalba @ 2006-03-19 14:04 UTC (permalink / raw)
  To: Paolo Ciarrocchi; +Cc: git, junkio

On 3/19/06, Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com> wrote:
> On 3/19/06, Marco Costalba <mcostalba@gmail.com> wrote:
> >
>
> How about getting an account on kernel.org?
>

I don't think I have the credentials to ask for ;-)

> Anyway, here is what I did:
> paolo@Italia:~$ cg-clone
> http://digilander.libero.it/mcostalba/scm/qgit.git qgit defaulting to
>
> Why am I getting this error?
> error: File 8dea03519e75f47da91108330dde3043defddd60
> (http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47da91108330dde3043defddd60)
> corrupt
>

Because the HTTP server of digilander.libero.it, instead of responding
with a 404 code (page not found), sends a non-standard HTML page as its
answer. To see the page, just point your browser to:
http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47d

Git does not understand that the object is missing; it thinks that what
the site sends _is_ the requested object and then finds that it is (of
course) corrupt.
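
One way to see why the HTML page is rejected as an object: a loose
object is stored as a zlib-compressed stream, and such a stream starts
with a two-byte header (RFC 1950) that an HTML page will almost never
match. The sketch below is only a cheap pre-check built on that
observation. The function name is hypothetical, and this is not how
http-fetch.c decides; judging from the patches posted in this thread,
it inflates the data and looks at the zlib return status.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if the first two bytes form a valid zlib (RFC 1950) header,
 * 0 otherwise.  A passing check does not prove the data is a valid
 * object; a failing check means it cannot be a zlib stream at all.
 * Hypothetical helper for illustration only.
 */
int looks_like_zlib_stream(const unsigned char *buf, size_t len)
{
	unsigned int cmf, flg;

	assert(buf != NULL);
	if (len < 2)
		return 0;
	cmf = buf[0];
	flg = buf[1];
	if ((cmf & 0x0f) != 8)	/* compression method must be deflate */
		return 0;
	if ((cmf >> 4) > 7)	/* window sizes above 32K are not allowed */
		return 0;
	/* FCHECK: the 16-bit header must be a multiple of 31 */
	return ((cmf << 8) | flg) % 31 == 0;
}
```

A page that begins with '<' fails at the first test, because the low
four bits of 0x3c are not 8.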


Marco

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 14:04   ` Marco Costalba
@ 2006-03-19 19:37     ` Junio C Hamano
  2006-03-19 21:40       ` Marco Costalba
  2006-03-20 18:29       ` Lukas Sandström
  2006-03-19 19:47     ` Junio C Hamano
  1 sibling, 2 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-03-19 19:37 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Paolo Ciarrocchi, git, junkio

"Marco Costalba" <mcostalba@gmail.com> writes:

> http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47d
>
> Git does not understand that the object is missing; it thinks that what
> the site sends _is_ the requested object and then finds that it is (of
> course) corrupt.

To be fair, the site is _not_ missing anything from the HTTP
protocol perspective, because when git asks for the 8d/ea0351... file,
the server responds with a regular "HTTP/1.0 200 OK" response.
So it is _your_ repository that is corrupt -- instead of
correctly _lacking_ the file you should have removed with
prune-packed, it has a garbage file.

Having said that, I agree that it would be nicer if we supported
such a site, in the same spirit that we already bend over backwards
to support really dumb hosted http servers that do not give
directory index by using objects/info/packs and info/refs.

I think it wouldn't be too much of a hassle to add logic to
http-fetch.c (perhaps with an additional "--no-404" option or
somesuch) to fall back on pack transfer upon seeing a corrupt
loose object.  We do the falling back when getting 404 error to
a request for a loose object, so the new code would essentially
do the same and you might be OK.
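
The proposal above, read literally, could be sketched as the following
decision function. This is a simplified model of the idea as worded in
this message, not the actual control flow of http-fetch.c; all names
(classify_loose_fetch, the enum values, the no_404 parameter standing
in for a "--no-404"-style option) are hypothetical.

```c
#include <assert.h>

enum loose_result {
	LOOSE_OK,		/* object fetched and inflated cleanly */
	LOOSE_TRY_PACKS,	/* quietly look for the object in a pack */
	LOOSE_FAIL		/* report a hard error */
};

/*
 * Decide what to do after requesting a loose object.
 *   http_code   - HTTP status of the response
 *   inflated_ok - nonzero if the body decompressed as a complete object
 *   no_404      - nonzero if the user says the server cannot be trusted
 *                 to answer 404 for missing files
 */
enum loose_result classify_loose_fetch(long http_code, int inflated_ok,
				       int no_404)
{
	assert(http_code >= 0);
	if (http_code == 404)
		return LOOSE_TRY_PACKS;	/* a real "not found": fall back */
	if (http_code == 200 && inflated_ok)
		return LOOSE_OK;
	if (http_code == 200 && no_404)
		return LOOSE_TRY_PACKS;	/* garbage body treated like a 404 */
	return LOOSE_FAIL;
}
```

The trade-off is visible in the third branch: with the option set, a
genuinely corrupt loose object is indistinguishable from a disguised
"not found" page.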

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 14:04   ` Marco Costalba
  2006-03-19 19:37     ` Junio C Hamano
@ 2006-03-19 19:47     ` Junio C Hamano
  2006-03-19 21:31       ` Petr Baudis
  2006-03-20  4:32       ` Randal L. Schwartz
  1 sibling, 2 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-03-19 19:47 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git

"Marco Costalba" <mcostalba@gmail.com> writes:

> On 3/19/06, Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com> wrote:
>>
>> How about getting an account on kernel.org?
>
> I don't think I have the credentials to ask for ;-)

Heh, it has a striking resemblance to the first thing I said
when Linus asked me if I want to take over git.git: "It would
be embarrassing to be the first person to have an account there
without having a single line of code in the kernel" ;-).

Well, you won't be the first (in fact it appears I wasn't
either), and it would never hurt to ask.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 19:47     ` Junio C Hamano
@ 2006-03-19 21:31       ` Petr Baudis
  2006-03-19 21:43         ` Petr Baudis
  2006-03-19 21:45         ` Marco Costalba
  2006-03-20  4:32       ` Randal L. Schwartz
  1 sibling, 2 replies; 35+ messages in thread
From: Petr Baudis @ 2006-03-19 21:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Marco Costalba, git

Dear diary, on Sun, Mar 19, 2006 at 08:47:21PM CET, I got a letter
where Junio C Hamano <junkio@cox.net> said that...
> "Marco Costalba" <mcostalba@gmail.com> writes:
> 
> > On 3/19/06, Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com> wrote:
> >>
> >> How about getting an account on kernel.org?
> >
> > I don't think I have the credentials to ask for ;-)
> 
> Heh, it has a striking resemblance to the first thing I said
> when Linus asked me if I want to take over git.git: "It would
> be embarrassing to be the first person to have an account there
> without having a single line of code in the kernel" ;-).
> 
> Well, you won't be the first (in fact it appears I wasn't
> either), and it would never hurt to ask.

Yeah, I think I was there before you... ;-)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 19:37     ` Junio C Hamano
@ 2006-03-19 21:40       ` Marco Costalba
  2006-03-19 23:21         ` Junio C Hamano
  2006-03-20 18:29       ` Lukas Sandström
  1 sibling, 1 reply; 35+ messages in thread
From: Marco Costalba @ 2006-03-19 21:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Paolo Ciarrocchi, git

On 3/19/06, Junio C Hamano <junkio@cox.net> wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>
> > http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47d
> >
> > Git does not understand that the object is missing; it thinks that what
> > the site sends _is_ the requested object and then finds that it is (of
> > course) corrupt.
>
> To be fair, the site is _not_ missing anything from HTTP
> protocol perspective, because when git asks 8d/ea0351... file,
> the server responds with a regular "HTTP/1.0 200 OK" response.
> So it is _your_ repository that is corrupt -- instead of
> correctly _lacking_ the file you should have removed with
> prune-packed, it has a garbage file.
>

Currently my git repo layout is as follows:
$ pwd
<local master copy>/qgit.git/.git
$ ls
branches/  description  HEAD    index  objects/   refs/
config     FETCH_HEAD   hooks/  info/  ORIG_HEAD  remotes/
$ ls objects
2c/  32/  53/  5c/  6a/  info/  pack/

The host copy should be an exact mirror of the local copy (I use
sitecopy to sync the host). I have also verified this by accessing the
host directly with ftp.

So the 8d/ea0351... file really does not exist. BTW, I have also run
git prune and git-prune-packed.

Finally, accessing the missing object with a browser

http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47da91108330dde3043defddd60

gives a pre-canned (in Italian) 'Sorry, page not found' page.

So I really think the site's "HTTP/1.0 200 OK" response is a fake.
Perhaps it is security related, to avoid sniffing (just a guess, because
I have absolutely zero competence in security-related things).


Marco

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 21:31       ` Petr Baudis
@ 2006-03-19 21:43         ` Petr Baudis
  2006-03-19 21:45         ` Marco Costalba
  1 sibling, 0 replies; 35+ messages in thread
From: Petr Baudis @ 2006-03-19 21:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Marco Costalba, git

Dear diary, on Sun, Mar 19, 2006 at 10:31:25PM CET, I got a letter
where Petr Baudis <pasky@suse.cz> said that...
> Dear diary, on Sun, Mar 19, 2006 at 08:47:21PM CET, I got a letter
> where Junio C Hamano <junkio@cox.net> said that...
> > Heh, it has a striking resemblance to the first thing I said
> > when Linus asked me if I want to take over git.git: "It would
> > be embarrassing to be the first person to have an account there
> > without having a single line of code in the kernel" ;-).
> > 
> > Well, you won't be the first (in fact it appears I wasn't
> > either), and it would never hurt to ask.
> 
> Yeah, I think I was there before you... ;-)

Silly me, on a second thought I've realized that I already had some
stuff in the kernel by then. Sorry for the noise.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 21:31       ` Petr Baudis
  2006-03-19 21:43         ` Petr Baudis
@ 2006-03-19 21:45         ` Marco Costalba
  1 sibling, 0 replies; 35+ messages in thread
From: Marco Costalba @ 2006-03-19 21:45 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Junio C Hamano, git

On 3/19/06, Petr Baudis <pasky@suse.cz> wrote:
> Dear diary, on Sun, Mar 19, 2006 at 08:47:21PM CET, I got a letter
> where Junio C Hamano <junkio@cox.net> said that...
> > "Marco Costalba" <mcostalba@gmail.com> writes:
> >
> > > On 3/19/06, Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com> wrote:
> > >>
> > >> How about getting an account on kernel.org?
> > >
> > > I don't think I have the credentials to ask for ;-)
> >
> > Heh, it has a striking resemblance to the first thing I said
> > when Linus asked me if I want to take over git.git: "It would
> > be embarrassing to be the first person to have an account there
> > without having a single line of code in the kernel" ;-).
> >
> > Well, you won't be the first (in fact it appears I wasn't
> > either), and it would never hurt to ask.
>
> Yeah, I think I was there before you... ;-)
>
> --

Please could someone tell me what door I should knock at?

Thanks
Marco

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 21:40       ` Marco Costalba
@ 2006-03-19 23:21         ` Junio C Hamano
  2006-03-20  6:31           ` Marco Costalba
  0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2006-03-19 23:21 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git

"Marco Costalba" <mcostalba@gmail.com> writes:

> Finally, accessing the missing object with a browser
>
> http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47da91108330dde3043defddd60
>
> gives a pre-canned (in Italian) 'Sorry, page not found' page.
>
> So I really think the site's "HTTP/1.0 200 OK" response is a fake.
> Perhaps it is security related, to avoid sniffing (just a guess, because
> I have absolutely zero competence in security-related things).

I think you are just rephrasing what I said.  From the HTTP
protocol perspective, you _do_ have that 8d/ea0351 thing on that
server, because you do not correctly say "No, we do not have it"
using a 404 response.

Your inability to produce 404 is a different matter -- often the
hosting server is not under your control.  But that does not
change the fact that the repository observed by your clients is
"broken".  That is why a workaround flag like I suggested may be
needed for such a setup.

This is totally untested, but maybe something like this?

---
diff --git a/http-fetch.c b/http-fetch.c
index 7de818b..d523798 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -8,6 +8,7 @@
 #define RANGE_HEADER_SIZE 30
 
 static int got_alternates = -1;
+static int unreliable_404 = 0;
 
 static struct curl_slist *no_pragma_header;
 
@@ -822,12 +823,18 @@ static int fetch_object(struct alt_base 
 		close(obj_req->local); obj_req->local = -1;
 	}
 
+	
+
 	if (obj_req->state == ABORTED) {
 		ret = error("Request for %s aborted", hex);
-	} else if (obj_req->curl_result != CURLE_OK &&
-		   obj_req->http_code != 416) {
+	} else if ((obj_req->curl_result != CURLE_OK &&
+		    obj_req->http_code != 416)  ||
+		   (unreliable_404 &&
+		    obj_req->curl_result == CURLE_OK &&
+		    obj_req->zret != Z_STREAM_END)) {
 		if (obj_req->http_code == 404 ||
-		    obj_req->curl_result == CURLE_FILE_COULDNT_READ_FILE)
+		    obj_req->curl_result == CURLE_FILE_COULDNT_READ_FILE ||
+		    unreliable_404)
 			ret = -1; /* Be silent, it is probably in a pack. */
 		else
 			ret = error("%s (curl_result = %d, http_code = %ld, sha1 = %s)",
@@ -966,6 +973,8 @@ int main(int argc, char **argv)
 			arg++;
 		} else if (!strcmp(argv[arg], "--recover")) {
 			get_recover = 1;
+		} else if (!strcmp(argv[arg], "--unreliable-404")) {
+			unreliable_404 = 1;
 		}
 		arg++;
 	}

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 19:47     ` Junio C Hamano
  2006-03-19 21:31       ` Petr Baudis
@ 2006-03-20  4:32       ` Randal L. Schwartz
  1 sibling, 0 replies; 35+ messages in thread
From: Randal L. Schwartz @ 2006-03-20  4:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Marco Costalba, git

>>>>> "Junio" == Junio C Hamano <junkio@cox.net> writes:

Junio> Heh, it has a striking resemblance to the first thing I said
Junio> when Linus asked me if I want to take over git.git: "It would
Junio> be embarrassing to be the first person to have an account there
Junio> without having a single line of code in the kernel" ;-).

Junio> Well, you won't be the first (in fact it appears I wasn't
Junio> either), and it would never hurt to ask.

Wow.  That would perhaps completely rule out people who have never owned
anything that can execute the x86 instruction set except in emulation. :)

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 23:21         ` Junio C Hamano
@ 2006-03-20  6:31           ` Marco Costalba
  2006-03-20  8:44             ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Marco Costalba @ 2006-03-20  6:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 3/20/06, Junio C Hamano <junkio@cox.net> wrote:
>
> Your inability to produce 404 is a different matter -- often the
> hosting server is not under your control.  But that does not
> change the fact that the repository observed by your clients is
> "broken".  That is why a workaround flag like I suggested may be
> needed for such a setup.
>
> This is totally untested, but maybe something like this?
>

It works for me. Just some trailing white space warning when applying.

I didn't find a way to pass the '--unreliable-404' flag from git-clone
(perhaps my fault), so I tested by forcing the flag in the sources.


Marco

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-20  6:31           ` Marco Costalba
@ 2006-03-20  8:44             ` Junio C Hamano
  2006-03-20 12:17               ` Marco Costalba
  0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2006-03-20  8:44 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git

"Marco Costalba" <mcostalba@gmail.com> writes:

>> This is totally untested, but maybe something like this?
>
> It works for me. Just some trailing white space warning when applying.

The change only removes the error message without changing any
other logic, so if that works for you, I wonder if leaving
things as they are is a better option than doing anything short
of implementing an AI that tries to pattern-match the "allegedly
corrupt file" with "sorry no such page found" in many natural
languages.

My test patch makes it impossible to track down the real
breakage when an HTTP-reachable repository _does_ have a corrupt
object.

So how about doing this instead?

-- >8 --
diff --git a/http-fetch.c b/http-fetch.c
index 8fd9de0..1405c1f 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -8,6 +8,7 @@
 #define RANGE_HEADER_SIZE 30
 
 static int got_alternates = -1;
+static int corrupt_object_found = 0;
 
 static struct curl_slist *no_pragma_header;
 
@@ -830,6 +831,7 @@ static int fetch_object(struct alt_base 
 				    obj_req->errorstr, obj_req->curl_result,
 				    obj_req->http_code, hex);
 	} else if (obj_req->zret != Z_STREAM_END) {
+		corrupt_object_found++;
 		ret = error("File %s (%s) corrupt", hex, obj_req->url);
 	} else if (memcmp(obj_req->sha1, obj_req->real_sha1, 20)) {
 		ret = error("File %s has bad hash", hex);
@@ -989,5 +991,11 @@ int main(int argc, char **argv)
 
 	http_cleanup();
 
+	if (corrupt_object_found) {
+		fprintf(stderr,
+"Some loose object were found to be corrupt, but they might be just\n"
+"a false '404 Not Found' error message sent with incorrect HTTP\n"
+"status code.  Suggest running git fsck-objects.\n");
+	}
 	return rc;
 }

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-20  8:44             ` Junio C Hamano
@ 2006-03-20 12:17               ` Marco Costalba
  0 siblings, 0 replies; 35+ messages in thread
From: Marco Costalba @ 2006-03-20 12:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 3/20/06, Junio C Hamano <junkio@cox.net> wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>
> >> This is totally untested, but maybe something like this?
> >
> > It works for me. Just some trailing white space warning when applying.
>
> The change only removes the error message without changing any
> other logic, so if that works for you, I wonder if leaving
> things as they are is a better option than doing anything short
> of implementing an AI that tries to pattern-match the "allegedly
> corrupt file" with "sorry no such page found" in many natural
> languages.
>
> My test patch makes it impossible to track down the real
> breakage when an HTTP-reachable repository _does_ have a corrupt
> object.
>
> So how about doing this instead?
>
> -- >8 --

> +               fprintf(stderr,
> +"Some loose object were found to be corrupt, but they might be just\n"
> +"a false '404 Not Found' error message sent with incorrect HTTP\n"
> +"status code.  Suggest running git fsck-objects.\n");
> +       }
>         return rc;
>  }
>

I think it's better, that is, more correct.

It could be a really corrupt file or just a false 404, so a warning is
better than an error message, and also better than nothing.

Marco

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-19 19:37     ` Junio C Hamano
  2006-03-19 21:40       ` Marco Costalba
@ 2006-03-20 18:29       ` Lukas Sandström
  2006-03-20 19:43         ` Petr Baudis
  2006-03-20 19:54         ` Nick Hengeveld
  1 sibling, 2 replies; 35+ messages in thread
From: Lukas Sandström @ 2006-03-20 18:29 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Paolo Ciarrocchi

Junio C Hamano wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>>http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47d
> 
> To be fair, the site is _not_ missing anything from HTTP
> protocol perspective, because when git asks 8d/ea0351... file,
> the server responds with a regular "HTTP/1.0 200 OK" response.
> So it is _your_ repository that is corrupt -- instead of
> correctly _lacking_ the file you should have removed with
> prune-packed, it has a garbage file.

Actually, it sends a 302 redirect. 

Perhaps a repository config option to treat a 302 as a 404?

/Lukas Sandström

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-20 18:29       ` Lukas Sandström
@ 2006-03-20 19:43         ` Petr Baudis
  2006-03-20 19:54         ` Nick Hengeveld
  1 sibling, 0 replies; 35+ messages in thread
From: Petr Baudis @ 2006-03-20 19:43 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git, Junio C Hamano, Paolo Ciarrocchi

Dear diary, on Mon, Mar 20, 2006 at 07:29:02PM CET, I got a letter
where Lukas Sandström <lukass@etek.chalmers.se> said that...
> Junio C Hamano wrote:
> > "Marco Costalba" <mcostalba@gmail.com> writes:
> >>http://digilander.libero.it/mcostalba/scm/qgit.git/objects/8d/ea03519e75f47d
> > 
> > To be fair, the site is _not_ missing anything from HTTP
> > protocol perspective, because when git asks 8d/ea0351... file,
> > the server responds with a regular "HTTP/1.0 200 OK" response.
> > So it is _your_ repository that is corrupt -- instead of
> > correctly _lacking_ the file you should have removed with
> > prune-packed, it has a garbage file.
> 
> Actually, it sends a 302 redirect. 
> 
> Perhaps a repository config option to treat a 302 as a 404?

I think that would be too ugly _and_ specific a workaround for the
particular site. It's reasonable to keep it generalized for all the
broken repositories when already doing it.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Right now I am having amnesia and deja-vu at the same time.  I think
I have forgotten this before.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-20 18:29       ` Lukas Sandström
  2006-03-20 19:43         ` Petr Baudis
@ 2006-03-20 19:54         ` Nick Hengeveld
  1 sibling, 0 replies; 35+ messages in thread
From: Nick Hengeveld @ 2006-03-20 19:54 UTC (permalink / raw)
  To: Lukas Sandström; +Cc: git, Junio C Hamano, Paolo Ciarrocchi

On Mon, Mar 20, 2006 at 07:29:02PM +0100, Lukas Sandström wrote:

> Perhaps a repository config option to treat a 302 as a 404?

FWIW, it used to work that way and was modified to follow redirects back at
commit 66c9ec25553ce7332c46e2017b9c4d7c26310fff.

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
@ 2006-03-22  2:59 linux
  2006-03-22  3:12 ` Shawn Pearce
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: linux @ 2006-03-22  2:59 UTC (permalink / raw)
  To: git

If someone feels ambitious, you can detect this condition automatically
by searching for a file that you know won't be there and seeing if you
get a 404 response to that.

To avoid punishing good servers, it would be nice to defer the test
until receiving the first corrupted object.

I'm not sure what the best "object that's not supposed to be there" is.
It could just be a random hash, or would a malformed object file name
be better?  Any fixed name has a finite chance of being created by
someone somewhere, but generating 160-bit random numbers is a PITA on
non-freenix platforms.
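
A rough sketch of the bookkeeping this would need, assuming the HTTP
request itself is done elsewhere: remember whether the probe has run,
run it only after the first corrupt-looking object, and record whether
the server answered the known-missing file with a real 404. All type
and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum probe_verdict {
	PROBE_NOT_RUN,		/* no probe request has been made yet */
	SERVER_SENDS_404,	/* probe for a missing file got a real 404 */
	SERVER_FAKES_OK		/* probe got something else, e.g. 200 + HTML */
};

struct probe_state {
	enum probe_verdict verdict;
};

/* Probe at most once, and only after the first corrupt-looking object. */
int probe_needed(const struct probe_state *st, int object_looked_corrupt)
{
	assert(st != NULL);
	return object_looked_corrupt && st->verdict == PROBE_NOT_RUN;
}

/* Record the HTTP status returned for the known-missing probe file. */
void probe_record(struct probe_state *st, long http_code)
{
	assert(st != NULL);
	st->verdict = (http_code == 404) ? SERVER_SENDS_404 : SERVER_FAKES_OK;
}

/* On such a server, a corrupt object is probably a disguised "not found". */
int corrupt_means_missing(const struct probe_state *st)
{
	assert(st != NULL);
	return st->verdict == SERVER_FAKES_OK;
}
```

Well-behaved servers never pay for the extra request, because
probe_needed() stays false until something looks corrupt.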

(As an aside, I suspect this is all caused by Microsoft's "friendly HTML
error messages" invention.)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-22  2:59 Cloning from sites with 404 overridden linux
@ 2006-03-22  3:12 ` Shawn Pearce
  2006-03-22  4:13   ` Linus Torvalds
  2006-03-22  6:06 ` Marco Costalba
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Shawn Pearce @ 2006-03-22  3:12 UTC (permalink / raw)
  To: linux; +Cc: git

'0' x 40.  :-) There's some places already in the GIT source
which would have ``issues'' if they got an object with this hash.
Not sure if it is actually an entirely impossible hash or just one
that is highly improbable.

My own website has this problem, and it's because I'm using WordPress
to handle all URLs on the site; I haven't yet found a way to
configure WordPress to return a proper 404 when the URL can't be
mapped to something on the server.  Note that 404 status codes can
in fact return pretty HTML content for the user, and many websites
do this and many browsers display that pretty HTML.  But a bot can
then also recognize the status code and DTRT.

The webservers are just plain broken, mine included.  I think the
best option is to delay corrupt object reporting to the end of
the download process if you get only one corrupt object and that
corrupt object was actually attainable from a pack.  And in this
case it's just a minor warning:

	Warning: The server appears to not return proper HTTP status
	codes on missing files.  The files were found in one or
	more packs so the download is OK, but the server administrator
	should really fix their server.  If you know the server
	administrator you might want to prod them to do so.
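
A minimal sketch of that deferred reporting, generalized from "only
one corrupt object" to a pair of counters; the struct, the function
name, and the message wording are all hypothetical and only meant to
show the shape of the idea:

```c
#include <assert.h>
#include <stdio.h>

struct fetch_stats {
	int corrupt_loose;	/* loose objects whose body failed to inflate */
	int recovered_in_pack;	/* of those, how many were found in a pack */
};

/*
 * Print the end-of-download verdict to `out`.
 * Returns 0 if every corrupt loose object was recovered from a pack
 * (a warning at most), -1 if at least one object is still missing.
 */
int report_corrupt_objects(const struct fetch_stats *s, FILE *out)
{
	assert(s != NULL && out != NULL);
	assert(s->recovered_in_pack <= s->corrupt_loose);
	if (s->corrupt_loose == 0)
		return 0;
	if (s->recovered_in_pack == s->corrupt_loose) {
		fprintf(out, "warning: %d loose object(s) looked corrupt but "
			"were found in a pack; the server may not send "
			"proper 404 responses\n", s->corrupt_loose);
		return 0;
	}
	fprintf(out, "error: %d corrupt object(s) could not be recovered\n",
		s->corrupt_loose - s->recovered_in_pack);
	return -1;
}
```

Nothing is printed while the download is in progress; the caller
updates the counters as it goes and calls the function once at the end.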

But that's already been suggested and I thought someone worked up
a patch based on that idea?  If not I could try to do so since my
own damn server has the problem.  :-)

linux@horizon.com wrote:
> If someone feels ambitious, you can detect this condition automatically
> by searching for a file that you know won't be there and seeing if you
> get a 404 response to that.
> 
> To avoid punishing good servers, it would be nice to defer the test
> until receiving the first corrupted object.
> 
> I'm not sure what the best "object that's not supposed to be there" is.
> It could just be a random hash, or would a malformed object file name
> be better?  Any fixed name has a finite chance of being created by
> someone somewhere, but generating 160-bit random numbers is a PITA on
> non-freenix platforms.
> 
> 
> (As an aside, I suspect this is all caused by Microsoft's "friendly HTML
> error messages" invention.)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-22  3:12 ` Shawn Pearce
@ 2006-03-22  4:13   ` Linus Torvalds
  0 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2006-03-22  4:13 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: linux, git



On Tue, 21 Mar 2006, Shawn Pearce wrote:
>
> '0' x 40.  :-) There's some places already in the GIT source
> which would have ``issues'' if they got an object with this hash.
> Not sure if it is actually an entirely impossible hash or just one
> that is highly improbable.

The all-zeroes hash is as improbable as any other one, and finding a 
"collision" (ie a "real object") with that hash is as improbable as any 
other collision, ie we can (and do) depend on it being a unique identifier 
for "does not exist".

			Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cloning from sites with 404 overridden
  2006-03-22  2:59 Cloning from sites with 404 overridden linux
  2006-03-22  3:12 ` Shawn Pearce
@ 2006-03-22  6:06 ` Marco Costalba
  2006-03-22  6:47   ` Junio C Hamano
  2006-03-22 13:36 ` Andreas Ericsson
  2006-03-22 17:22 ` Nick Hengeveld
  3 siblings, 1 reply; 35+ messages in thread
From: Marco Costalba @ 2006-03-22  6:06 UTC (permalink / raw)
  To: linux@horizon.com; +Cc: git, spearce, torvalds

On 21 Mar 2006 21:59:21 -0500, linux@horizon.com <linux@horizon.com> wrote:
> If someone feels ambitious, you can detect this condition automatically
> by searching for a file that you know won't be there and seeing if you
> get a 404 response to that.
>

Perhaps I am proposing something totally idiotic, since I don't know the
git-fetch internals, but wouldn't it be better to avoid trying to download
a nonexistent object in the first place, and so fix the problem at its
origin?

I don't know whether it is possible to list the contents before trying to
download, so as to avoid asking for a nonexistent object.

Marco

* Re: Cloning from sites with 404 overridden
  2006-03-22  6:06 ` Marco Costalba
@ 2006-03-22  6:47   ` Junio C Hamano
  0 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-03-22  6:47 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git

"Marco Costalba" <mcostalba@gmail.com> writes:

> Perhaps I am proposing something totally idiotic, since I don't know the
> git-fetch internals, but wouldn't it be better to avoid trying to download
> a nonexistent object in the first place, and so fix the problem at its
> origin?
>
> I don't know whether it is possible to list the contents before trying to
> download, so as to avoid asking for a nonexistent object.

There is no way for the downloader to know which objects the
upstream repository has packed.  What is happening is that
the commit walker asks for loose object first because it does
not know.  Upon getting a "no such file" (or in the case of
misconfigured HTTP server that does not say 404, "corrupt
object"), it then checks if the object appears in the pack by
downloading the pack index.  It can tell what objects are in the
packs by looking at the pack index and downloads the pack that
contains needed object.
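
The fallback order described above can be sketched roughly as follows.
This is an illustration only, not the http-fetch code: the enum names and
locate_object() are hypothetical, and the network steps are reduced to
two inputs (the outcome of the loose-object request, and whether the
downloaded pack index lists the object).

```c
/* Outcome of asking the server for a loose object (hypothetical names). */
enum loose_result { LOOSE_OK, LOOSE_MISSING, LOOSE_CORRUPT };

/* Where the walker ends up getting the object from. */
enum object_source { FROM_LOOSE, FROM_PACK, NOT_FOUND };

/*
 * loose:         result of the first request, for the loose object file
 * in_pack_index: non-zero if a downloaded pack index lists the object
 */
static enum object_source locate_object(enum loose_result loose,
					int in_pack_index)
{
	if (loose == LOOSE_OK)
		return FROM_LOOSE;
	/*
	 * A real 404 (LOOSE_MISSING) and a 200 carrying an HTML error
	 * page (seen as LOOSE_CORRUPT) both fall through to the packs.
	 */
	if (in_pack_index)
		return FROM_PACK;
	return NOT_FOUND;
}
```

Both failure modes take the same path; a misconfigured server only
changes which message is printed before the pack index is consulted.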

* Re: Cloning from sites with 404 overridden
  2006-03-22  2:59 Cloning from sites with 404 overridden linux
  2006-03-22  3:12 ` Shawn Pearce
  2006-03-22  6:06 ` Marco Costalba
@ 2006-03-22 13:36 ` Andreas Ericsson
  2006-03-24 17:29   ` Mark Wooding
  2006-03-22 17:22 ` Nick Hengeveld
  3 siblings, 1 reply; 35+ messages in thread
From: Andreas Ericsson @ 2006-03-22 13:36 UTC (permalink / raw)
  To: linux; +Cc: git

linux@horizon.com wrote:
> If someone feels ambitious, you can detect this condition automatically
> by searching for a file that you know won't be there and seeing if you
> get a 404 response to that.
> 
> To avoid punishing good servers, it would be nice to defer the test
> until receiving the first corrupted object.
> 
> I'm not sure what the best "object that's not supposed to be there" is.

.git/objects/00/hoping-for-a-404-or-webadmin-should-fix

It has the right number of chars so it should fit in wherever a real 
object name does but is obviously bogus anyways.


> It could just be a random hash, or would a malformed object file name
> be better?

A malformed object name is infinitely better. Otherwise we'd end up with 
a wild guess that hits home some day, to much surprise and a bug-report 
I wouldn't want to track. Not to mention the embarrassment when 
explaining why that object-name was chosen.

> 
> (As an aside, I suspect this is all caused by Microsoft's "friendly HTML
> error messages" invention.)

The body of the 404-page has absolutely nothing to do with it.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

* Re: Cloning from sites with 404 overridden
  2006-03-22  2:59 Cloning from sites with 404 overridden linux
                   ` (2 preceding siblings ...)
  2006-03-22 13:36 ` Andreas Ericsson
@ 2006-03-22 17:22 ` Nick Hengeveld
  2006-03-22 18:36   ` Nick Hengeveld
  3 siblings, 1 reply; 35+ messages in thread
From: Nick Hengeveld @ 2006-03-22 17:22 UTC (permalink / raw)
  To: linux; +Cc: git

On Tue, Mar 21, 2006 at 09:59:21PM -0500, linux@horizon.com wrote:

> If someone feels ambitious, you can detect this condition automatically
> by searching for a file that you know won't be there and seeing if you
> get a 404 response to that.

It might be feasible to detect this condition using the Content-Type:
header in the server response.  So far, all the GIT repositories I've
tried return text/plain for loose objects and a special 404 page will
likely be text/html.
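
A rough sketch of such a check, assuming the Content-Type value has
already been obtained from the HTTP library as a C string (or NULL if the
header was absent).  The function name is hypothetical.  It also accepts
a trailing parameter such as "; charset=...", which is an assumption
about what servers send, and it compares case-sensitively for simplicity.

```c
#include <string.h>

/* Returns 1 if the Content-Type value names text/html, else 0. */
static int is_html_content_type(const char *content_type)
{
	static const char html[] = "text/html";
	const size_t n = sizeof(html) - 1;

	if (!content_type)
		return 0;	/* no header: nothing to conclude */
	if (strncmp(content_type, html, n))
		return 0;	/* some other media type */
	/* Accept "text/html" alone or followed by parameters. */
	return content_type[n] == '\0' ||
	       content_type[n] == ';' ||
	       content_type[n] == ' ';
}
```

A loose object served as text/plain returns 0 here, while a canned
"page not found" document served as text/html returns 1.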

-- 
For a successful technology, reality must take precedence over public
relations, for nature cannot be fooled.

* Re: Cloning from sites with 404 overridden
  2006-03-22 17:22 ` Nick Hengeveld
@ 2006-03-22 18:36   ` Nick Hengeveld
  2006-03-22 19:05     ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Hengeveld @ 2006-03-22 18:36 UTC (permalink / raw)
  To: git

On Wed, Mar 22, 2006 at 09:22:27AM -0800, Nick Hengeveld wrote:

> It might be feasible to detect this condition using the Content-Type:
> header in the server response.  So far, all the GIT repositories I've
> tried return text/plain for loose objects and a special 404 page will
> likely be text/html.

Something like this:

http_fetch: report text/html responses for loose objects

Some HTTP server environments return a 200 status and text/html error
document or a redirect to one rather than a 404 status if a loose
object does not exist.  This patch detects and reports this condition
to differentiate between a misconfigured server and an actual corrupt
object on the server.

Signed-off-by: Nick Hengeveld <nickh@reactrix.com>


---

 http-fetch.c |   19 ++++++++++++++++++-
 1 files changed, 18 insertions(+), 1 deletions(-)

61069cc348640fef2b8c503b8b8f00f689872cab
diff --git a/http-fetch.c b/http-fetch.c
index dc67218..ee5b585 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -41,6 +41,7 @@ struct object_request
 	CURLcode curl_result;
 	char errorstr[CURL_ERROR_SIZE];
 	long http_code;
+	char *content_type;
 	unsigned char real_sha1[20];
 	SHA_CTX c;
 	z_stream stream;
@@ -258,9 +259,15 @@ static void finish_object_request(struct
 
 static void process_object_response(void *callback_data)
 {
+	char *content_type;
 	struct object_request *obj_req =
 		(struct object_request *)callback_data;
 
+	curl_easy_getinfo(obj_req->slot->curl, CURLINFO_CONTENT_TYPE,
+			  &content_type);
+	if (content_type)
+		obj_req->content_type = strdup(content_type);
+
 	obj_req->curl_result = obj_req->slot->curl_result;
 	obj_req->http_code = obj_req->slot->http_code;
 	obj_req->slot = NULL;
@@ -298,6 +305,8 @@ static void release_object_request(struc
 			entry->next = entry->next->next;
 	}
 
+	if (obj_req->content_type)
+		free(obj_req->content_type);
 	free(obj_req->url);
 	free(obj_req);
 }
@@ -340,6 +349,7 @@ void prefetch(unsigned char *sha1)
 	memcpy(newreq->sha1, sha1, 20);
 	newreq->repo = alt;
 	newreq->url = NULL;
+	newreq->content_type = NULL;
 	newreq->local = -1;
 	newreq->state = WAITING;
 	snprintf(newreq->filename, sizeof(newreq->filename), "%s", filename);
@@ -836,7 +846,14 @@ static int fetch_object(struct alt_base 
 				    obj_req->http_code, hex);
 	} else if (obj_req->zret != Z_STREAM_END) {
 		corrupt_object_found++;
-		ret = error("File %s (%s) corrupt", hex, obj_req->url);
+		if (obj_req->content_type &&
+		    !strcmp(obj_req->content_type, "text/html")) {
+			ret = error("text/html response for file %s (%s)",
+				    sha1_to_hex(obj_req->sha1), obj_req->url);
+		} else {
+			ret = error("File %s (%s) corrupt",
+				    sha1_to_hex(obj_req->sha1), obj_req->url);
+		}
 	} else if (memcmp(obj_req->sha1, obj_req->real_sha1, 20)) {
 		ret = error("File %s has bad hash", hex);
 	} else if (obj_req->rename < 0) {
-- 
1.2.4.gb1bc1d-dirty

* Re: Cloning from sites with 404 overridden
  2006-03-22 18:36   ` Nick Hengeveld
@ 2006-03-22 19:05     ` Junio C Hamano
  2006-03-22 19:22       ` Junio C Hamano
  2006-03-22 21:24       ` Radoslaw Szkodzinski
  0 siblings, 2 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-03-22 19:05 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> Some HTTP server environments return a 200 status and text/html error
> document or a redirect to one rather than a 404 status if a loose
> object does not exist.  This patch detects and reports this condition
> to differentiate between a misconfigured server and an actual corrupt
> object on the server.

> 61069cc348640fef2b8c503b8b8f00f689872cab
> diff --git a/http-fetch.c b/http-fetch.c
> index dc67218..ee5b585 100644
> --- a/http-fetch.c
> +++ b/http-fetch.c
> @@ -41,6 +41,7 @@ struct object_request
>  	CURLcode curl_result;
>...
> +	char *content_type;
>  	unsigned char real_sha1[20];
>...

You probably need only one bit here,...

> @@ -258,9 +259,15 @@ static void finish_object_request(struct
>  
>  static void process_object_response(void *callback_data)
>...  
> +	curl_easy_getinfo(obj_req->slot->curl, CURLINFO_CONTENT_TYPE,
> +			  &content_type);
> +	if (content_type)
> +		obj_req->content_type = strdup(content_type);
> +

... and note if that is an HTML document or not.

We do bend over backwards to support ISP HTTP servers, but this
might be going a bit too far.  Also I wonder if some ISPs run a
really dumb-friendly configured server that defaults to text/html
unless the mimemap says otherwise.  Loose object files do not
have suffixes and I am expecting these servers would give
whatever the server default is.

* Re: Cloning from sites with 404 overridden
  2006-03-22 19:05     ` Junio C Hamano
@ 2006-03-22 19:22       ` Junio C Hamano
  2006-03-23 18:43         ` Nick Hengeveld
  2006-03-22 21:24       ` Radoslaw Szkodzinski
  1 sibling, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2006-03-22 19:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> We do bend over backwards to support ISP HTTP servers, but this
> might be going a bit too far.  Also I wonder if some ISPs run a
> really dumb-friendly configured server that defaults to text/html
> unless the mimemap says otherwise.  Loose object files do not
> have suffixes and I am expecting these servers would give
> whatever the server default is.

Clarification.  Even if a server configured as such existed and
sent an otherwise valid loose object with text/html, your code
does the right thing.

However the patch would not help when such a server also did a
"Sorry, did you mistype the URL?" HTML response, and I was
wondering how typical that would be.

* Re: Cloning from sites with 404 overridden
  2006-03-22 19:05     ` Junio C Hamano
  2006-03-22 19:22       ` Junio C Hamano
@ 2006-03-22 21:24       ` Radoslaw Szkodzinski
  1 sibling, 0 replies; 35+ messages in thread
From: Radoslaw Szkodzinski @ 2006-03-22 21:24 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Nick Hengeveld

On Wednesday 22 March 2006 20:05, Junio C Hamano wrote yet:
>
> .. and note if that is an HTML document or not.
>

Better yet, see first if the object is corrupt. If it is and its Content-Type 
is text/html, error out.

> We do bend over backwards to support ISP HTTP servers, but this
> might be going a bit too far.  Also I wonder if some ISPs run a
> really dumb-friendly configured server that defaults to text/html
> unless the mimemap says otherwise.  Loose object files do not
> have suffixes and I am expecting these servers would give
> whatever the server default is.

That server would break a *lot* of file types. That admin should be hanged, 
shot, then burned.

I can think of only one reason for doing that: to restrict file types posted
on the server to, say, zip and html.

-- 
GPG Key id:  0xD1F10BA2
Fingerprint: 96E2 304A B9C4 949A 10A0  9105 9543 0453 D1F1 0BA2

AstralStorm

* Re: Cloning from sites with 404 overridden
  2006-03-22 19:22       ` Junio C Hamano
@ 2006-03-23 18:43         ` Nick Hengeveld
  2006-03-23 20:45           ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Hengeveld @ 2006-03-23 18:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Mar 22, 2006 at 11:22:14AM -0800, Junio C Hamano wrote:

> You probably need only one bit here,...
> ... and note if that is an HTML document or not.

/me smacks self...

> However the patch would not help when such a server also did a
> "Sorry, did you mistype the URL?" HTML response, and I was
> wondering how typical that would be.

Seems like there are three cases to worry about:

1) the server returns a 200 status and a text/html response instead of a
   404, and the server's default content type is not text/html
2) the server returns a 200 status and a text/html response instead of a
   404, and the server's default content type is text/html
3) the server returns a corrupt object from the repository

I don't think there's a way to distinguish between #2 and #3, so all we
can really do is display as helpful an error message as possible.

We can detect #1 if there has been a previous successful loose object
transfer by tracking whether the repo's default content type is
text/html.  In such a case should http-fetch behave as if the server
returned 404?  If there have been no successful loose object transfers,
we'd have to respond as with #2.  This approach could potentially break
if requests are load-balanced to servers with different
misconfigurations - but I think trying to detect that is bending over
backwards a little too far.
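
As a rough sketch of that decision, with hypothetical names, and with the
per-repository state kept as a tri-state (-1 unknown, 0 default is not
text/html, 1 default is text/html):

```c
enum verdict { OBJECT_OK, TREAT_AS_404, REPORT_CORRUPT };

/* Record the default content type after a successful loose transfer. */
static void note_successful_transfer(int *repo_default_html,
				     int response_is_html)
{
	if (*repo_default_html == -1)	/* only the first success counts */
		*repo_default_html = response_is_html ? 1 : 0;
}

/*
 * inflate_ok:        the body inflated cleanly as a loose object
 * response_is_html:  this response was served as text/html
 * repo_default_html: -1 unknown, 0 or 1 as learned above
 */
static enum verdict classify_response(int inflate_ok, int response_is_html,
				      int repo_default_html)
{
	if (inflate_ok)
		return OBJECT_OK;
	/* Case #1: HTML body from a server known not to default to HTML. */
	if (response_is_html && repo_default_html == 0)
		return TREAT_AS_404;
	/* Cases #2 and #3, or nothing learned yet: cannot tell apart. */
	return REPORT_CORRUPT;
}
```

Until one loose object has been transferred successfully the state stays
at -1, so an HTML response is still reported as corrupt, as with #2.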

On a related note, I noticed that http-fetch will continue to try
inflating/sha1_updating the response after an inflate error has been
detected.  It's probably not a huge deal, but we could just error out
immediately at that point or at least stop the unnecessary processing.

Something like this?  Tested by cloning
http://digilander.libero.it/mcostalba/scm/qgit.git


[PATCH] http-fetch: try to detect 404s from misconfigured servers

Some HTTP server environments return a 200 status and text/html error
document or a redirect to one rather than a 404 status if a loose
object does not exist.  This patch tries to detect such a response
and treat it as a 404.

Signed-off-by: Nick Hengeveld <nickh@reactrix.com>


---

 http-fetch.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)

ab97429c5b0a4b4466ee0072f75706399e42b675
diff --git a/http-fetch.c b/http-fetch.c
index dc67218..bb75050 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -16,6 +16,7 @@ struct alt_base
 {
 	char *base;
 	int got_indices;
+	int default_html_content_type;
 	struct packed_git *packs;
 	struct alt_base *next;
 };
@@ -41,6 +42,7 @@ struct object_request
 	CURLcode curl_result;
 	char errorstr[CURL_ERROR_SIZE];
 	long http_code;
+	char html_content_type;
 	unsigned char real_sha1[20];
 	SHA_CTX c;
 	z_stream stream;
@@ -249,6 +251,9 @@ static void finish_object_request(struct
 		unlink(obj_req->tmpfile);
 		return;
 	}
+	if (obj_req->repo->default_html_content_type == -1)
+		obj_req->repo->default_html_content_type =
+			obj_req->html_content_type;
 	obj_req->rename =
 		move_temp_to_file(obj_req->tmpfile, obj_req->filename);
 
@@ -258,9 +263,15 @@ static void finish_object_request(struct
 
 static void process_object_response(void *callback_data)
 {
+	char *content_type;
 	struct object_request *obj_req =
 		(struct object_request *)callback_data;
 
+	curl_easy_getinfo(obj_req->slot->curl, CURLINFO_CONTENT_TYPE,
+			  &content_type);
+	if (content_type && !strcmp(content_type, "text/html"))
+		obj_req->html_content_type = 1;
+
 	obj_req->curl_result = obj_req->slot->curl_result;
 	obj_req->http_code = obj_req->slot->http_code;
 	obj_req->slot = NULL;
@@ -340,6 +351,7 @@ void prefetch(unsigned char *sha1)
 	memcpy(newreq->sha1, sha1, 20);
 	newreq->repo = alt;
 	newreq->url = NULL;
+	newreq->html_content_type = 0;
 	newreq->local = -1;
 	newreq->state = WAITING;
 	snprintf(newreq->filename, sizeof(newreq->filename), "%s", filename);
@@ -539,6 +551,7 @@ static void process_alternates_response(
 				newalt->next = NULL;
 				newalt->base = target;
 				newalt->got_indices = 0;
+				newalt->default_html_content_type = -1;
 				newalt->packs = NULL;
 				while (tail->next != NULL)
 					tail = tail->next;
@@ -835,8 +848,14 @@ static int fetch_object(struct alt_base 
 				    obj_req->errorstr, obj_req->curl_result,
 				    obj_req->http_code, hex);
 	} else if (obj_req->zret != Z_STREAM_END) {
-		corrupt_object_found++;
-		ret = error("File %s (%s) corrupt", hex, obj_req->url);
+		if (obj_req->html_content_type &&
+		    !obj_req->repo->default_html_content_type)
+			ret = -1; /* Be silent, looks like a 404 */
+		else {
+			corrupt_object_found++;
+			ret = error("File %s (%s) corrupt",
+				    sha1_to_hex(obj_req->sha1), obj_req->url);
+		}
 	} else if (memcmp(obj_req->sha1, obj_req->real_sha1, 20)) {
 		ret = error("File %s has bad hash", hex);
 	} else if (obj_req->rename < 0) {
@@ -985,6 +1004,7 @@ int main(int argc, char **argv)
 	alt = xmalloc(sizeof(*alt));
 	alt->base = url;
 	alt->got_indices = 0;
+	alt->default_html_content_type = -1;
 	alt->packs = NULL;
 	alt->next = NULL;
 
-- 
1.2.4.gb1bc1d-dirty

* Re: Cloning from sites with 404 overridden
  2006-03-23 18:43         ` Nick Hengeveld
@ 2006-03-23 20:45           ` Junio C Hamano
  0 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-03-23 20:45 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git

Nick Hengeveld <nickh@reactrix.com> writes:

> Seems like there are three cases to worry about:
>
> 1) the server returns a 200 status and a text/html response instead of a
>    404, and the server's default content type is not text/html
> 2) the server returns a 200 status and a text/html response instead of a
>    404, and the server's default content type is text/html
> 3) the server returns a corrupt object from the repository

> I don't think there's a way to distinguish between #2 and #3, so all we
> can really do is display as helpful an error message as possible.

The code behaves correctly, and in the same way, whether the
server says 404 or 200 with a human-readable "No such object".
This is just about formatting error messages, and to be honest I
do not really care at this point.  I think the existing error message
at the end of transfer we added recently should be sufficient.

> On a related note, I noticed that http-fetch will continue to try
> inflating/sha1_updating the response after an inflate error has been
> detected.  It's probably not a huge deal, but we could just error out
> immediately at that point or at least stop the unnecessary processing.

That would probably be more helpful.

* Re: Cloning from sites with 404 overridden
  2006-03-22 13:36 ` Andreas Ericsson
@ 2006-03-24 17:29   ` Mark Wooding
  2006-03-24 17:52     ` Junio C Hamano
                       ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Mark Wooding @ 2006-03-24 17:29 UTC (permalink / raw)
  To: git

Andreas Ericsson <ae@op5.se> wrote:

>> I'm not sure what the best "object that's not supposed to be there" is.
>
> .git/objects/00/hoping-for-a-404-or-webadmin-should-fix

If .git/objects/00/00000000000000000000000000000000000000 exists, the
repository has big problems already.

(Aside: `C-u 38 0' doesn't work because Emacs hears `C-u 380' and waits
for a key.  `M-: (insert-char ?0 38) RET' does the right thing, but is
ugly.  Any better suggestions?)

-- [mdw]

* Re: Cloning from sites with 404 overridden
  2006-03-24 17:29   ` Mark Wooding
@ 2006-03-24 17:52     ` Junio C Hamano
  2006-03-24 17:53     ` Linus Torvalds
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-03-24 17:52 UTC (permalink / raw)
  To: Mark Wooding; +Cc: git

Mark Wooding <mdw@distorted.org.uk> writes:

> (Aside: `C-u 38 0' doesn't work because Emacs hears `C-u 380' and waits
> for a key.  `M-: (insert-char ?0 38) RET' does the right thing, but is
> ugly.  Any better suggestions?)

C-u 38 C-u 0

* Re: Cloning from sites with 404 overridden
  2006-03-24 17:29   ` Mark Wooding
  2006-03-24 17:52     ` Junio C Hamano
@ 2006-03-24 17:53     ` Linus Torvalds
  2006-03-24 18:16     ` Morten Welinder
  2006-03-24 18:40     ` Andreas Ericsson
  3 siblings, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2006-03-24 17:53 UTC (permalink / raw)
  To: Mark Wooding; +Cc: git

On Fri, 24 Mar 2006, Mark Wooding wrote:
> 
> (Aside: `C-u 38 0' doesn't work because Emacs hears `C-u 380' and waits
> for a key.  `M-: (insert-char ?0 38) RET' does the right thing, but is
> ugly.  Any better suggestions?)

I don't do GNU emacs, but the way to do it in some other editors that do
repeats somewhat similarly is to do the action that starts with a number
as a macro, and do that macro 37 more times. 

On uemacs: ^X '(' '0' ^X ')' ESC '3' '7' ^X 'E'

(Of course, the easier way is to just do '0' LEFT ^K to put the 0 in the
buffer, and then ESC '3' '8' ^Y to yank it 38 times, but the macro trick
is generic, even if it's a few more keystrokes). 

		Linus "teaching people the one true editor" Torvalds

* Re: Cloning from sites with 404 overridden
  2006-03-24 17:29   ` Mark Wooding
  2006-03-24 17:52     ` Junio C Hamano
  2006-03-24 17:53     ` Linus Torvalds
@ 2006-03-24 18:16     ` Morten Welinder
  2006-03-24 18:40     ` Andreas Ericsson
  3 siblings, 0 replies; 35+ messages in thread
From: Morten Welinder @ 2006-03-24 18:16 UTC (permalink / raw)
  To: Mark Wooding; +Cc: git

> (Aside: `C-u 38 0' doesn't work because Emacs hears `C-u 380' and waits
> for a key.  `M-: (insert-char ?0 38) RET' does the right thing, but is
> ugly.  Any better suggestions?)

There's a million ways to skin that cat.

ESC 38 C-q 60 RET

[Octal 060 == '0']

M.

* Re: Cloning from sites with 404 overridden
  2006-03-24 17:29   ` Mark Wooding
                       ` (2 preceding siblings ...)
  2006-03-24 18:16     ` Morten Welinder
@ 2006-03-24 18:40     ` Andreas Ericsson
  3 siblings, 0 replies; 35+ messages in thread
From: Andreas Ericsson @ 2006-03-24 18:40 UTC (permalink / raw)
  To: Mark Wooding; +Cc: git

Mark Wooding wrote:
> Andreas Ericsson <ae@op5.se> wrote:
> 
> 
>>>I'm not sure what the best "object that's not supposed to be there" is.
>>
>>.git/objects/00/hoping-for-a-404-or-webadmin-should-fix
> 
> 
> If .git/objects/00/00000000000000000000000000000000000000 exists, the
> repository has big problems already.
> 

Indeed. I'm off sobriety again, it being Friday and all, but I'm 
assuming there are more than 18 zeroes there, yes? The "feature" of the 
above line is that it will fit in any buffer that already exists, and 
will match any third argument to send(2) that already exists.


> (Aside: `C-u 38 0' doesn't work because Emacs hears `C-u 380' and waits
> for a key.  `M-: (insert-char ?0 38) RET' does the right thing, but is
> ugly.  Any better suggestions?)
> 

This I happily don't understand at all. I'm also happily ignorant of what 
it has to do with the issue at hand.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

end of thread, other threads:[~2006-03-24 18:40 UTC | newest]

Thread overview: 35+ messages
2006-03-22  2:59 Cloning from sites with 404 overridden linux
2006-03-22  3:12 ` Shawn Pearce
2006-03-22  4:13   ` Linus Torvalds
2006-03-22  6:06 ` Marco Costalba
2006-03-22  6:47   ` Junio C Hamano
2006-03-22 13:36 ` Andreas Ericsson
2006-03-24 17:29   ` Mark Wooding
2006-03-24 17:52     ` Junio C Hamano
2006-03-24 17:53     ` Linus Torvalds
2006-03-24 18:16     ` Morten Welinder
2006-03-24 18:40     ` Andreas Ericsson
2006-03-22 17:22 ` Nick Hengeveld
2006-03-22 18:36   ` Nick Hengeveld
2006-03-22 19:05     ` Junio C Hamano
2006-03-22 19:22       ` Junio C Hamano
2006-03-23 18:43         ` Nick Hengeveld
2006-03-23 20:45           ` Junio C Hamano
2006-03-22 21:24       ` Radoslaw Szkodzinski
  -- strict thread matches above, loose matches on Subject: below --
2006-03-19 10:52 Marco Costalba
2006-03-19 13:25 ` Paolo Ciarrocchi
2006-03-19 14:04   ` Marco Costalba
2006-03-19 19:37     ` Junio C Hamano
2006-03-19 21:40       ` Marco Costalba
2006-03-19 23:21         ` Junio C Hamano
2006-03-20  6:31           ` Marco Costalba
2006-03-20  8:44             ` Junio C Hamano
2006-03-20 12:17               ` Marco Costalba
2006-03-20 18:29       ` Lukas Sandström
2006-03-20 19:43         ` Petr Baudis
2006-03-20 19:54         ` Nick Hengeveld
2006-03-19 19:47     ` Junio C Hamano
2006-03-19 21:31       ` Petr Baudis
2006-03-19 21:43         ` Petr Baudis
2006-03-19 21:45         ` Marco Costalba
2006-03-20  4:32       ` Randal L. Schwartz
