* [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer
@ 2024-06-12 11:50 Carlos Martín Nieto
2024-06-12 11:50 ` [PATCH 1/4] t/lib-http: add serve-git.py Carlos Martín Nieto
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Carlos Martín Nieto @ 2024-06-12 11:50 UTC (permalink / raw)
To: git; +Cc: Carlos Martín Nieto
Hello git list,
While investigating a difference between HTTP and SSH when rejecting a push due
to it being too large, I noticed that rejecting a push without receiving the
entire packfile causes git to print the error message "pack exceeds maximum
allowed size", but it also shows "Everything up-to-date" instead of the rejection
of every ref update as the server specified.
This is the result of two issues in git, of which I aim to fix one here, namely
1) when the server sends the response and closes the connection, remote-curl
sees that as an error and stops processing the send-pack output, combined with
2) git does not remember what it asked the remote helper to push so it cannot
distinguish whether an empty report means "I had an error and did nothing" or
"everything was up to date and I didn't have to do anything".
The latter issue is more complex so here I'm concentrating on the former, which
has a simple solution but a complex test. The solution is to read to the end
of what send-pack is telling us (it's creating the whole packfile that we're
throwing away anyway) so we can report back to the user.
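Concretely, "reading to the end" here means consuming pkt-line framed data until a flush packet. A rough Python sketch of that framing, for illustration only (the actual fix in the final patch is C, using packet_read()):

```python
import io

def read_pkt_line(stream):
    """Read one pkt-line; return b'' for a flush packet ("0000")."""
    hdr = stream.read(4)
    if len(hdr) < 4:
        raise EOFError("truncated pkt-line header")
    length = int(hdr, 16)
    if length == 0:  # flush-pkt: marks the end of a section
        return b''
    return stream.read(length - 4)

def drain_until_flush(stream):
    """Consume pkt-lines up to and including the next flush packet."""
    while read_pkt_line(stream) != b'':
        pass

# Skip the remaining payload, then read the report that follows it.
stream = io.BytesIO(b"0008data" + b"0000" + b"0010unpack error")
drain_until_flush(stream)
print(read_pkt_line(stream))  # b'unpack error'
```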
The testing however proved a bit complicated as this bug requires the server to
cut off the connection while git is uploading the packfile. The existing HTTP
tests use CGI and as far as I've been able to test, httpd buffers too much for
us to be able to replicate the situation.
This is why there's a python Git server in this patch series that doesn't rely
on CGI but streams the data both ways so it can close the stream as soon as
receive-pack exits. There's already some python tooling in the project and I'm
much more familiar with it than e.g. perl, so I hope that's fine. I tried to
make it as simple as possible while still being able to stream bidirectionally.
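The core of that bidirectional streaming can be sketched like this (a simplified, hypothetical version of the loop; the real script also handles chunked encoding and content-length limits):

```python
import select
import subprocess

def stream_both_ways(rfile, wfile, argv):
    """Copy the request body into the child's stdin while also polling
    its stdout, so server output flows back before the upload finishes."""
    with subprocess.Popen(argv, stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE) as proc:
        while True:
            buf = rfile.read(4096)
            if not buf:
                break
            try:
                proc.stdin.write(buf)
            except BrokenPipeError:
                break  # child exited early; its report may still be pending
            # forward anything the child has already produced
            r, _, _ = select.select([proc.stdout], [], [], 0)
            if proc.stdout in r:
                out = proc.stdout.read1()
                if out:
                    wfile.write(out)
        proc.stdin.close()
        wfile.write(proc.stdout.read())  # drain the rest of the output
```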
Cheers,
cmn
Carlos Martín Nieto (4):
t/lib-http: add serve-git.py
t/lib-http.sh: add functions related to serve-git.py
t5541: add test for rejecting a push due to packfile size
remote-curl: read in the push report even if we fail to finish sending
data
remote-curl.c | 24 ++-
t/lib-httpd.sh | 20 +++
t/lib-httpd/serve-git.py | 353 ++++++++++++++++++++++++++++++++++++++
t/t5546-receive-limits.sh | 24 +++
4 files changed, 414 insertions(+), 7 deletions(-)
create mode 100755 t/lib-httpd/serve-git.py
--
2.43.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/4] t/lib-http: add serve-git.py
2024-06-12 11:50 [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Carlos Martín Nieto
@ 2024-06-12 11:50 ` Carlos Martín Nieto
2024-06-12 21:51 ` Junio C Hamano
2024-06-12 11:50 ` [PATCH 2/4] t/lib-http.sh: add functions related to serve-git.py Carlos Martín Nieto
` (3 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Carlos Martín Nieto @ 2024-06-12 11:50 UTC (permalink / raw)
To: git; +Cc: Carlos Martín Nieto
This is a basic HTTP server that serves Git content by calling out to
the underlying git commands. This avoids relying on CGI, which can add
complexity when trying to replicate some behaviours, in particular when
it matters whether each direction of the bidirectional stream is open
or closed.
Signed-off-by: Carlos Martín Nieto <cmn@dwim.me>
---
t/lib-httpd/serve-git.py | 353 +++++++++++++++++++++++++++++++++++++++
1 file changed, 353 insertions(+)
create mode 100755 t/lib-httpd/serve-git.py
diff --git a/t/lib-httpd/serve-git.py b/t/lib-httpd/serve-git.py
new file mode 100755
index 00000000000..dc2aba992d7
--- /dev/null
+++ b/t/lib-httpd/serve-git.py
@@ -0,0 +1,353 @@
+#!/usr/bin/env python3
+
+"""Serve a Git repository via unauthenticated HTTP.
+
+This script can run stand-alone, but it is mostly intended as a base whose
+Git-specific parts can be overridden in order to test git against it.
+
+It also exists as an alternative to the tests that use apache to handle HTTP,
+as its CGI implementation can have features or behaviours that prevent us
+from testing everything we want to.
+
+"""
+
+import os
+import http.server
+import subprocess
+import select
+from enum import Enum
+from urllib.parse import urlparse, parse_qs, unquote
+
+RequestType = Enum('RequestType', ['UPLOAD_PACK', 'RECEIVE_PACK'])
+
+class GitHandler(http.server.BaseHTTPRequestHandler):
+ """Base handler class for serving HTTP Git requests. It will handle
+ fetches and pushes by delegating back to upload-pack and receive-pack as
+ found on its PATH.
+
+ In order to test error conditions, you can override the do_upload_pack
+ and/or do_receive_pack methods in your own class.
+
+ """
+
+ def __init__(self, document_root, *args, **kwargs):
+ self.document_root = document_root
+ super().__init__(*args, **kwargs)
+
+ # Let's not do any logging
+ def log_message(self, _fmt, *_args):
+ pass
+
+ def gitdir(self, path):
+ """Calculate the gitdir from the current directory and the given path"""
+ if path.endswith('/info/refs'):
+ path = path[:-len('/info/refs')]
+ elif path.endswith('/git-upload-pack') or path.endswith('/git-receive-pack'):
+ path = path[:-len(os.path.basename(path))]
+
+ gitdir = os.path.realpath(os.path.join(self.document_root, unquote(path[1:])))
+
+ commonprefix = os.path.commonprefix([self.document_root, os.path.join(gitdir, '')])
+ if commonprefix != self.document_root:
+ raise ValueError("request is trying to escape the document root")
+
+ return gitdir
+
+ def command_from_params(self, query_params):
+        """Returns the type of request based on the URL query parameters"""
+ if 'git-upload-pack' in query_params['service']:
+ return RequestType.UPLOAD_PACK
+ if 'git-receive-pack' in query_params['service']:
+ return RequestType.RECEIVE_PACK
+
+ return None
+
+ def command_from_path(self, path):
+        """Returns the type of request based on the URL path"""
+ cmd = os.path.basename(path)
+ if cmd == 'git-upload-pack':
+ return RequestType.UPLOAD_PACK
+ if cmd == 'git-receive-pack':
+ return RequestType.RECEIVE_PACK
+
+ return None
+
+ def send_git_headers(self, request_type, advertisement):
+ """Send the common HTTP headers we use"""
+ self.protocol_version = 'HTTP/1.1'
+ self.send_response(200)
+ ct = "application/x-git-upload-pack"
+ if request_type == RequestType.RECEIVE_PACK:
+ ct = "application/x-git-receive-pack"
+ if advertisement:
+ ct += "-advertisement"
+ else:
+ ct += "-result"
+
+ self.send_header('content-type', ct)
+ self.send_header('cache-control', 'no-cache')
+ self.send_header('transfer-encoding', 'chunked')
+ self.end_headers()
+
+ def respond_error(self):
+ """Generic user-caused error message"""
+ self.send_error(400)
+
+ def do_GET(self): # pylint: disable=invalid-name
+ """Handle the GET request. This is the initial one that wants the ref
+ advertisement"""
+ request = urlparse(self.path)
+ query_params = parse_qs(request.query)
+
+ gitdir = self.gitdir(request.path)
+ if not gitdir:
+ self.respond_error()
+ return
+
+ request_type = self.command_from_params(query_params)
+ protocol = self.headers['git-protocol']
+
+ responder = ChunkedResponder(self.wfile)
+
+ if request_type == RequestType.UPLOAD_PACK:
+ self.send_git_headers(request_type, True)
+ self.do_upload_pack(responder, gitdir, True, protocol)
+ elif request_type == RequestType.RECEIVE_PACK:
+ self.send_git_headers(request_type, True)
+ self.do_receive_pack(responder, gitdir, True, protocol)
+ else:
+ self.respond_error()
+
+ def do_POST(self): # pylint: disable=invalid-name
+ """Handle the POST. These requests form the negotiation and data
+ response."""
+ request = urlparse(self.path)
+
+ gitdir = self.gitdir(request.path)
+ if not gitdir:
+ self.respond_error()
+ return
+
+ request_type = self.command_from_path(request.path)
+ protocol = self.headers['git-protocol']
+
+ responder = ChunkedResponder(self.wfile)
+
+ if request_type == RequestType.UPLOAD_PACK:
+ self.send_git_headers(request_type, False)
+ self.do_upload_pack(responder, gitdir, False, protocol)
+ elif request_type == RequestType.RECEIVE_PACK:
+ self.send_git_headers(request_type, False)
+ self.do_receive_pack(responder, gitdir, False, protocol)
+ else:
+ self.respond_error()
+
+
+ def _copy_input_output(self, proc, resp):
+ """Copy the network data into the process' stdin and the process'
+ stdout to the network
+
+ """
+
+ # The stream of data we read from the client. Default to None for the
+ # GET/advertisement cases so we can override it if we're not there
+ stream = self.rfile
+ if self.headers['content-length']:
+ stream = LimitedReader(self.rfile, int(self.headers['content-length']))
+ elif self.headers['transfer-encoding'] == 'chunked':
+ stream = ChunkDecoder(self.rfile)
+
+ while True:
+ buf = stream.read(4096)
+ if len(buf) == 0:
+ break
+ try:
+ proc.stdin.write(buf)
+ except BrokenPipeError:
+ # The process has exited or closed its stdin so let's
+ # jump back out to finish copying its stdout.
+ break
+
+ r, _w, _e = select.select([proc.stdout], [], [], 0)
+ if proc.stdout in r:
+ buf = proc.stdout.read1()
+ if len(buf) > 0:
+ resp.write(buf)
+
+ def _run_command(self, resp, command, gitdir, advertisement, protocol):
+ """Run the relevant command and copy the request body into its stdin."""
+ argv = ['git', command, '--stateless-rpc']
+ if advertisement:
+ argv.append('--advertise-refs')
+ argv.append('.')
+
+ env = {k:v for k, v in os.environ.items() if k.startswith('GIT_')}
+ if protocol is not None:
+ env['GIT_PROTOCOL'] = protocol
+
+ stdin = None
+ if not advertisement:
+ stdin = subprocess.PIPE
+
+ with subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE, cwd=gitdir, env=env) as proc:
+ if not advertisement:
+ self._copy_input_output(proc, resp)
+                # Close stdin so the git command isn't waiting forever, but if it
+ # did already die then we want to keep going anyway. We ignore
+ # the error to avoid potential race conditions, rather than try
+ # to figure out if the process has exited
+ try:
+ proc.stdin.close()
+ except BrokenPipeError:
+ pass
+
+ # Copy whatever the git process has output back to the client as
+ # long as it has something left to say. This covers the ref
+ # advertisement case as well as reporting errors during a push.
+ while True:
+ buf = proc.stdout.read1()
+ if len(buf) == 0:
+ break
+ resp.write(buf)
+
+ # Close our outgoing connection
+ resp.finish()
+
+ proc.wait()
+
+ def do_upload_pack(self, resp, gitdir, advertisement, protocol):
+ """Run git-upload-pack. advertisement is True if we had a GET so we
+ need to deal with the initial ref advertisement.
+
+ Override this with your own.
+
+ """
+ if advertisement:
+ resp.write(packet_line("# service=git-upload-pack\n"))
+ resp.write(b"0000")
+
+ self._run_command(resp, 'upload-pack', gitdir, advertisement, protocol)
+
+ def do_receive_pack(self, resp, gitdir, advertisement, protocol):
+ """Run git-receive-pack. advertisement is True if we had a GET so we
+ need to deal with the initial ref advertisement.
+
+ Override this with your own.
+
+ """
+ if advertisement:
+ resp.write(packet_line("# service=git-receive-pack\n"))
+ resp.write(b"0000")
+
+ self._run_command(resp, 'receive-pack', gitdir, advertisement, protocol)
+
+def packet_line(line):
+ """Format some data as a pkt-line"""
+ return f'{len(line)+4:04x}{line}'.encode('ascii')
+
+class ChunkDecoder: # pylint: disable=too-few-public-methods
+ """Decode a chunked-encoded stream of data"""
+
+ def __init__(self, rfile):
+ self.rfile = rfile
+ self.remaining = 0
+ self.finished = False
+
+ def read(self, size = -1):
+ """Read up to size bytes from the chunked stream
+
+ This function will not cross a chunk when reading.
+
+ """
+
+ if self.finished:
+ return b''
+
+ # return whatever we have up to the requested size if we're inside a chunk
+ if self.remaining > 0:
+            to_read = self.remaining if size < 0 else min(size, self.remaining)
+ buf = self.rfile.read1(to_read)
+ self.remaining -= len(buf)
+
+ # each chunk ends with CRLF so let's read it here now that we've
+ # finished reading this chunk
+ if self.remaining == 0:
+ terminator = self.rfile.read(2)
+ if terminator != b'\r\n':
+ raise ValueError("invalid chunk terminator")
+
+ return buf
+
+ # no data left in the chunk so we have to read the next size
+ line = self.rfile.readline(16)
+        if not line:
+ raise ValueError("invalid chunk header")
+
+ next_length = int(line, 16)
+ if next_length == 0:
+ self.finished = True
+ self.remaining = next_length
+
+ return self.read(size)
+
+class LimitedReader: # pylint: disable=too-few-public-methods
+ """Read from an underlying buffer up to a maximum size"""
+
+ def __init__(self, rfile, size):
+ self.rfile = rfile
+ self.remaining = size
+
+ def read(self, size = -1):
+ """Read up to size bytes from what is available in the input.
+
+ This function will not wait for more data in the event of a short read
+
+ """
+ if self.remaining == 0:
+ return b''
+
+        buf = self.rfile.read1(self.remaining if size < 0 else min(size, self.remaining))
+ if len(buf) == 0:
+ return b''
+
+ self.remaining -= len(buf)
+
+ return buf
+
+class ChunkedResponder:
+ """Reply with chunked encoding"""
+
+ def __init__(self, wfile):
+ self.wfile = wfile
+ self.finished = False
+
+ def finish(self):
+ """Send the final zero-length chunk header indicating we're done"""
+ if not self.finished:
+ self.wfile.write(b"0\r\n\r\n")
+ self.finished = True
+
+ def write(self, buf):
+ """Write a chunk including the size header"""
+ self.wfile.write(f"{len(buf):x}\r\n".encode('ascii'))
+ self.wfile.write(buf)
+ self.wfile.write(b"\r\n")
+
+if __name__ == "__main__":
+ import argparse
+ import functools
+
+ parser = argparse.ArgumentParser(description="Serve Git repositories over HTTP")
+ parser.add_argument('--document-root',
+ help="Directory where the repositories live",
+ default=os.getcwd())
+ parser.add_argument('--port',
+ help="The port to listen on",
+ default=8080,
+ type=int)
+ parsed_args = parser.parse_args()
+
+ handler_partial = functools.partial(GitHandler, parsed_args.document_root)
+ httpd = http.server.HTTPServer(('', parsed_args.port), handler_partial)
+ httpd.serve_forever()
--
2.43.0
* [PATCH 2/4] t/lib-http.sh: add functions related to serve-git.py
2024-06-12 11:50 [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Carlos Martín Nieto
2024-06-12 11:50 ` [PATCH 1/4] t/lib-http: add serve-git.py Carlos Martín Nieto
@ 2024-06-12 11:50 ` Carlos Martín Nieto
2024-06-13 9:19 ` Jeff King
2024-06-12 11:50 ` [PATCH 3/4] t5541: add test for rejecting a push due to packfile size Carlos Martín Nieto
` (2 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Carlos Martín Nieto @ 2024-06-12 11:50 UTC (permalink / raw)
To: git; +Cc: Carlos Martín Nieto
These functions manage the custom git serving script for use in tests.
Signed-off-by: Carlos Martín Nieto <cmn@dwim.me>
---
t/lib-httpd.sh | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
index d83bafeab32..6454300a041 100644
--- a/t/lib-httpd.sh
+++ b/t/lib-httpd.sh
@@ -94,6 +94,8 @@ esac
LIB_HTTPD_PATH=${LIB_HTTPD_PATH-"$DEFAULT_HTTPD_PATH"}
test_set_port LIB_HTTPD_PORT
+test_set_port LIB_GIT_SERVE_PORT
+
TEST_PATH="$TEST_DIRECTORY"/lib-httpd
HTTPD_ROOT_PATH="$PWD"/httpd
HTTPD_DOCUMENT_ROOT_PATH=$HTTPD_ROOT_PATH/www
@@ -250,6 +252,24 @@ stop_httpd() {
-f "$TEST_PATH/apache.conf" $HTTPD_PARA -k stop
}
+start_serve_git() {
+ test_atexit stop_serve_git
+
+ "$TEST_DIRECTORY"/lib-httpd/serve-git.py \
+ --document-root "$HTTPD_ROOT_PATH"/www \
+ --port "$LIB_GIT_SERVE_PORT" &
+
+ mkdir -p "$HTTPD_ROOT_PATH"
+ echo $! >"$HTTPD_ROOT_PATH"/git-serve.pid
+
+ GIT_SERVE_DEST=127.0.0.1:$LIB_GIT_SERVE_PORT
+ GIT_SERVE_URL=http://$GIT_SERVE_DEST
+}
+
+stop_serve_git() {
+ kill -9 $(cat "$HTTPD_ROOT_PATH"/git-serve.pid)
+}
+
test_http_push_nonff () {
REMOTE_REPO=$1
LOCAL_REPO=$2
--
2.43.0
* [PATCH 3/4] t5541: add test for rejecting a push due to packfile size
2024-06-12 11:50 [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Carlos Martín Nieto
2024-06-12 11:50 ` [PATCH 1/4] t/lib-http: add serve-git.py Carlos Martín Nieto
2024-06-12 11:50 ` [PATCH 2/4] t/lib-http.sh: add functions related to serve-git.py Carlos Martín Nieto
@ 2024-06-12 11:50 ` Carlos Martín Nieto
2024-06-12 21:49 ` Taylor Blau
` (2 more replies)
2024-06-12 11:50 ` [PATCH 4/4] remote-curl: read in the push report even if we fail to finish sending data Carlos Martín Nieto
2024-06-13 9:11 ` [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Jeff King
4 siblings, 3 replies; 14+ messages in thread
From: Carlos Martín Nieto @ 2024-06-12 11:50 UTC (permalink / raw)
To: git; +Cc: Carlos Martín Nieto
This kind of rejection requires us to handle the error correctly rather
than throw away the report in remote-curl and end up with "Everything
up-to-date" due to the missing report.
Signed-off-by: Carlos Martín Nieto <cmn@dwim.me>
---
t/t5546-receive-limits.sh | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/t/t5546-receive-limits.sh b/t/t5546-receive-limits.sh
index 9fc9ba552f1..ccbdf3945ab 100755
--- a/t/t5546-receive-limits.sh
+++ b/t/t5546-receive-limits.sh
@@ -5,6 +5,11 @@ test_description='check receive input limits'
TEST_PASSES_SANITIZE_LEAK=true
. ./test-lib.sh
+
+ROOT_PATH="$PWD"
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_serve_git
+
# Let's run tests with different unpack limits: 1 and 10000
# When the limit is 1, `git receive-pack` will call `git index-pack`.
# When the limit is 10000, `git receive-pack` will call `git unpack-objects`.
@@ -83,4 +88,23 @@ test_expect_success "create known-size (1024 bytes) commit" '
test_pack_input_limit index
test_pack_input_limit unpack
+test_expect_success 'reject too-large push over HTTP' '
+ git init "$HTTPD_DOCUMENT_ROOT_PATH/error_too_large" &&
+ git -C "$HTTPD_DOCUMENT_ROOT_PATH/error_too_large" config receive.maxInputSize 128 &&
+ test-tool genrandom foo $((10*1024*1024)) >large-file &&
+ git add large-file &&
+ test_commit large-file &&
+ test_must_fail git push --porcelain \
+ $GIT_SERVE_URL/error_too_large \
+ HEAD:refs/tags/will-fail >actual &&
+ test_must_fail git -C "$HTTPD_DOCUMENT_ROOT_PATH/error_too_large" \
+ rev-parse --verify refs/tags/will-fail &&
+ cat >expect <<-EOF &&
+ To $GIT_SERVE_URL/error_too_large
+ ! HEAD:refs/tags/will-fail [remote rejected] (unpacker error)
+ Done
+ EOF
+ test_cmp expect actual
+'
+
test_done
--
2.43.0
* [PATCH 4/4] remote-curl: read in the push report even if we fail to finish sending data
2024-06-12 11:50 [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Carlos Martín Nieto
` (2 preceding siblings ...)
2024-06-12 11:50 ` [PATCH 3/4] t5541: add test for rejecting a push due to packfile size Carlos Martín Nieto
@ 2024-06-12 11:50 ` Carlos Martín Nieto
2024-06-13 9:55 ` Jeff King
2024-06-13 9:11 ` [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Jeff King
4 siblings, 1 reply; 14+ messages in thread
From: Carlos Martín Nieto @ 2024-06-12 11:50 UTC (permalink / raw)
To: git; +Cc: Carlos Martín Nieto
When the remote rejects a push mid-transfer, it might still send us an
error report even if we fail to send the complete packfile. This can
happen e.g. if the remote has set a maximum input size.
If we just consume send-pack's output and don't pass the report back,
git push will not update any of its structures and will report
"Everything up-to-date" next to the error message.
Signed-off-by: Carlos Martín Nieto <cmn@dwim.me>
---
remote-curl.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/remote-curl.c b/remote-curl.c
index 0b6d7815fdd..9e45e14afec 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1114,15 +1114,25 @@ static int rpc_service(struct rpc_state *rpc, struct discovery *heads,
close(client.in);
client.in = -1;
- if (!err) {
- strbuf_read(rpc_result, client.out, 0);
- } else {
- char buf[4096];
- for (;;)
- if (xread(client.out, buf, sizeof(buf)) <= 0)
- break;
+
+ /*
+ * If we encountered an error, we might still get a report. Consume the
+ * rest of the packfile and an extra flush and then we can copy
+ * over the report the same way as in the success case.
+ */
+ if (err) {
+ int n;
+ do {
+ n = packet_read(rpc->out, rpc->buf, rpc->alloc, 0);
+ } while (n > 0);
+
+ /* Read the final flush separating the payload from the report */
+ packet_read(rpc->out, rpc->buf, rpc->alloc, 0);
}
+ /* Copy the report of successes/failures */
+ strbuf_read(rpc_result, client.out, 0);
+
close(client.out);
client.out = -1;
--
2.43.0
* Re: [PATCH 3/4] t5541: add test for rejecting a push due to packfile size
2024-06-12 11:50 ` [PATCH 3/4] t5541: add test for rejecting a push due to packfile size Carlos Martín Nieto
@ 2024-06-12 21:49 ` Taylor Blau
2024-06-13 9:21 ` Jeff King
2024-06-13 10:07 ` Jeff King
2 siblings, 0 replies; 14+ messages in thread
From: Taylor Blau @ 2024-06-12 21:49 UTC (permalink / raw)
To: Carlos Martín Nieto; +Cc: git
On Wed, Jun 12, 2024 at 01:50:27PM +0200, Carlos Martín Nieto wrote:
> This kind of rejection requires us to handle the error correctly rather
> than throw away the report in remote-curl and end up with "Everything
> up-to-date" due to the missing report.
>
> Signed-off-by: Carlos Martín Nieto <cmn@dwim.me>
> ---
> t/t5546-receive-limits.sh | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
I haven't looked at the CGI stuff to know whether or not this behavior
can be reasonably emulated without the new Python script you wrote, but
a couple of small things in the meantime...
This patch's subject line states that it is modifying script t5541, but
the patch modifies t5546. I think the latter is correct, and there is
just a typo in the subject line, but wanted to make sure I pointed it
out regardless.
More importantly, this test fails after applying through this patch, but
not 4/4. After applying the final patch, it looks like it is still
failing for me. I figure that I am probably holding it wrong, but
regardless, here is the error message I see on my machine:
--- 8< ---
+ git init /home/ttaylorr/src/git/t/trash directory.t5546-receive-limits/httpd/www/error_too_large
Initialized empty Git repository in /home/ttaylorr/src/git/t/trash directory.t5546-receive-limits/httpd/www/error_too_large/.git/
+ git -C /home/ttaylorr/src/git/t/trash directory.t5546-receive-limits/httpd/www/error_too_large config receive.maxInputSize 128
+ test-tool genrandom foo 10485760
+ git add large-file
+ test_commit large-file
+ local notick=
+ local echo=echo
+ local append=
+ local author=
+ local signoff=
+ local indir=
+ local tag=light
+ test 1 != 0
+ break
+ indir=
+ local file=large-file.t
+ test -n
+ echo large-file
+ git add -- large-file.t
+ test -z
+ test_tick
+ test -z set
+ test_tick=1112912053
+ GIT_COMMITTER_DATE=1112912053 -0700
+ GIT_AUTHOR_DATE=1112912053 -0700
+ export GIT_COMMITTER_DATE GIT_AUTHOR_DATE
+ git commit -m large-file
[master 03a3078] large-file
Author: A U Thor <author@example.com>
2 files changed, 1 insertion(+)
create mode 100644 large-file
create mode 100644 large-file.t
+ git tag large-file
+ test_must_fail git push --porcelain http://127.0.0.1:5546/error_too_large HEAD:refs/tags/will-fail
+ _test_ok=
+ test_must_fail_acceptable git push --porcelain http://127.0.0.1:5546/error_too_large HEAD:refs/tags/will-fail
+ test git = env
+ return 0
+ git push --porcelain http://127.0.0.1:5546/error_too_large HEAD:refs/tags/will-fail
Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 20 threads
Compressing objects: 100% (6/6), done.
error: RPC failed; curl 55 Send failure: Broken pipe
send-pack: unexpected disconnect while reading sideband packet
Writing objects: 100% (8/8), 10.00 MiB | 34.04 MiB/s, done.
Total 8 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
error: failed to push some refs to 'http://127.0.0.1:5546/error_too_large'
+ exit_code=1
+ test 1 -eq 0
+ test_match_signal 13 1
+ test 1 = 141
+ test 1 = 269
+ return 1
+ test 1 -gt 129
+ test 1 -eq 127
+ test 1 -eq 126
+ return 0
+ test_must_fail git -C /home/ttaylorr/src/git/t/trash directory.t5546-receive-limits/httpd/www/error_too_large rev-parse --verify refs/tags/will-fail
+ _test_ok=
+ test_must_fail_acceptable git -C /home/ttaylorr/src/git/t/trash directory.t5546-receive-limits/httpd/www/error_too_large rev-parse --verify refs/tags/will-fail
+ test git = env
+ return 0
+ git -C /home/ttaylorr/src/git/t/trash directory.t5546-receive-limits/httpd/www/error_too_large rev-parse --verify refs/tags/will-fail
fatal: Needed a single revision
+ exit_code=128
+ test 128 -eq 0
+ test_match_signal 13 128
+ test 128 = 141
+ test 128 = 269
+ return 1
+ test 128 -gt 129
+ test 128 -eq 127
+ test 128 -eq 126
+ return 0
+ cat
+ test_cmp expect actual
+ test 2 -ne 2
+ eval diff -u "$@"
+ diff -u expect actual
--- expect 2024-06-12 21:48:50.005929827 +0000
+++ actual 2024-06-12 21:48:49.677930723 +0000
@@ -1,3 +0,0 @@
-To http://127.0.0.1:5546/error_too_large
-! HEAD:refs/tags/will-fail [remote rejected] (unpacker error)
-Done
error: last command exited with $?=1
not ok 18 - reject too-large push over HTTP
--- >8 ---
Thanks,
Taylor
* Re: [PATCH 1/4] t/lib-http: add serve-git.py
2024-06-12 11:50 ` [PATCH 1/4] t/lib-http: add serve-git.py Carlos Martín Nieto
@ 2024-06-12 21:51 ` Junio C Hamano
0 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2024-06-12 21:51 UTC (permalink / raw)
To: Carlos Martín Nieto; +Cc: git
Carlos Martín Nieto <cmn@dwim.me> writes:
> This is a basic HTTP server that serves Git content by calling out to
> the underlying git commands. This avoids relying on CGI, which can add
> complexity when trying to replicate some behaviours, in particular when
> it matters whether each direction of the bidirectional stream is open
> or closed.
You wrote it in the cover letter already, but it won't be available
to "git log" readers.
Give a bit of explanation on the reason why we do this in the first
place, in other words, what we are shipping an HTTP server with our
source for, before introducing "this is a server".
E.g. "instead of relying on an overly complex Apache with CGI, add a
simple HTTP server based on http.server class that offers THIS AND
THAT BENEFIT FOR OUR PURPOSE to help testing" or something like
that.
* Re: [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer
2024-06-12 11:50 [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Carlos Martín Nieto
` (3 preceding siblings ...)
2024-06-12 11:50 ` [PATCH 4/4] remote-curl: read in the push report even if we fail to finish sending data Carlos Martín Nieto
@ 2024-06-13 9:11 ` Jeff King
2024-07-23 13:46 ` Carlos Martín Nieto
4 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2024-06-13 9:11 UTC (permalink / raw)
To: Carlos Martín Nieto; +Cc: git
On Wed, Jun 12, 2024 at 01:50:24PM +0200, Carlos Martín Nieto wrote:
> While investigating a difference between HTTP and SSH when rejecting a push due
> to it being too large, I noticed that rejecting a push without receiving the
> entire packfile causes git to print the error message "pack exceeds maximum
> allowed size", but it also shows "Everything up-to-date" instead of the rejection
> of every ref update as the server specified.
>
> This is the result of two issues in git, of which I aim to fix one here, namely
>
> 1) when the server sends the response and closes the connection, remote-curl
> sees that as an error and stops processing the send-pack output, combined with
>
> 2) git does not remember what it asked the remote helper to push so it cannot
> distinguish whether an empty report means "I had an error and did nothing" or
> "everything was up to date and I didn't have to do anything".
>
> The latter issue is more complex so here I'm concentrating on the former, which
> has a simple solution but a complex test. The solution is to read to the end
> of what send-pack is telling us (it's creating the whole packfile that we're
> throwing away anyway) so we can report back to the user.
Hmm. I think the second one is much more important, though. If the
remote helper dies unexpectedly for any reason, shouldn't the parent
git-push realize this and propagate the error? It sounds like you are
fixing one _specific_ case where the remote helper dies so that we don't
run into the problem caused by (2).
But if we fixed (2), then (1) would not be nearly as important anymore
(if at all). But I think there is a little more going on with this
specific case...
> The testing however proved a bit complicated as this bug requires the server to
> cut off the connection while git is uploading the packfile. The existing HTTP
> tests use CGI and as far as I've been able to test, httpd buffers too much for
> us to be able to replicate the situation.
It is not buffering that gets in your way, but rather that Apache will
actually drain the remainder of the request (throwing away the extra
bytes) if the CGI dies. You can see this by running your test against
apache and strace-ing the apache process. It is in a read/write loop
streaming the packfile from network to the CGI's stdin, it hits EPIPE on
the write(), and then it switches to just read().
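A minimal sketch of that drain behaviour (hypothetical Python; Apache's actual implementation is C):

```python
import io

def drain_request_body(rfile, remaining):
    """After the backend dies, keep reading (and discarding) the rest of
    the request body so a well-formed HTTP response can still be sent."""
    while remaining > 0:
        buf = rfile.read(min(65536, remaining))
        if not buf:  # client hung up early
            break
        remaining -= len(buf)

body = io.BytesIO(b"x" * 1000)
drain_request_body(body, 1000)
assert body.read() == b""  # request fully consumed
```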
Which makes sense, because you can't send back the HTTP 200 response
until you've read the whole request. RFC 7230 does make an exception for
error responses:
A client sending a message body SHOULD monitor the network connection
for an error response while it is transmitting the request. If the
client sees a response that indicates the server does not wish to
receive the message body and is closing the connection, the client
SHOULD immediately cease transmitting the body and close its side of
the connection.
That's just a "SHOULD" but I think curl does this properly. In the case
of GitHub's servers, a fatal condition from index-pack seems to trigger
an HTTP 500. So that is OK by the standard above, but it does mean we
lose the opportunity to provide any error report at the Git protocol
level. And so you end up with output like:
error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
send-pack: unexpected disconnect while reading sideband packet
fatal: the remote end hung up unexpectedly
whereas the same push over ssh produces[1]:
remote: fatal: non-blob object size limit exceeded (commit ddc6d4e7091229184537d078666afb241ea8ef62 is 104857798 bytes)
error: remote unpack failed: index-pack failed
To github.com:peff/test.git
! [remote rejected] main -> main (failed)
error: failed to push some refs to 'github.com:peff/test.git'
And even if we teach the remote helper to soak up the whole pack, we're
still going to see the same useless output. So I would argue that the
first fix here needs to be at the server level. It should be soaking up
all of the request bytes if possible so that we can write back an HTTP
200 with the appropriate error report.
Of course it's kind of lame that we don't cut off the client and let it
keep sending useless bytes (though really it is not any different than
fsck errors, which we queue after reading the whole input).
It might be possible to send an application/x-git-receive-pack-result as
the body of the 500 response. And then we could teach the client side to
notice and handle that better. We could possibly even use a different
code (413 Payload Too Large?), and I suspect existing clients would
behave sensibly (and new ones could give a better message). We _might_
even be able to get away with a "200" response here, though even if curl
is OK with it, I think we are stretching the RFC a bit.
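To make that concrete, here is a rough sketch (the helper names are mine, not from the patches) of what the pkt-line body of such an error response could look like, framed the same way receive-pack frames its report:

```python
def pkt_line(s: str) -> bytes:
    """Encode one pkt-line: 4-byte hex length (counting the prefix) + payload."""
    data = s.encode()
    return b"%04x" % (len(data) + 4) + data

FLUSH_PKT = b"0000"

def receive_pack_error_body(refname: str, reason: str) -> bytes:
    # An unpack status line plus an "ng" (rejection) line for the ref,
    # then a flush -- mirroring the report receive-pack sends on the
    # happy path.
    return (pkt_line("unpack %s\n" % reason)
            + pkt_line("ng %s %s\n" % (refname, reason))
            + FLUSH_PKT)
```

Served with content-type application/x-git-receive-pack-result on the
500, a client that recognizes the type could then parse this as a
normal push report instead of discarding the body.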
[1] The keen-eyed observer may notice this is failing due to a different
reason than "pack exceeds maximum allowed size". I happen to know
that GitHub's version of index-pack will also reject commits over 100MB
in the same unceremonious way, so we can generate the same condition
without having to spend tons of bandwidth:
# start with one small commit
git init
git commit --allow-empty -m foo
# now make a too-big one. Note that it will compress well with
# zlib!
git cat-file commit HEAD >commit
perl -e 'print "a" x 79, "\n" for (1..1310720)' >>commit
commit=$(git hash-object -t commit -w commit)
git update-ref HEAD $commit
# pushing just the commit above will not generate the issue,
# because we'll (racily) have sent the rest of the pack by
# the time the server notices the problem. So tack on a bunch of
# big (and uncompressible) irrelevant data.
dd if=/dev/urandom of=foo.rand bs=1M count=90
git add foo.rand
git commit -m 'big data'
And now pushing over http will quickly-ish get you to the ugly
output (but over ssh is OK). But note this only works with GitHub!
Upstream Git does not have the same object-size check.
> This is why there's a python Git server in this patch series that doesn't rely
> on CGI but streams the data both ways so it can close the stream as soon as
> receive-pack exits. There's already some python tooling in the project and I'm
> much more familiar with it than e.g. perl, so I hope that's fine. I tried to
> make it as simple as possible while still being able to stream bidirectionally.
I admit that I don't love carrying a separate webserver implementation,
but I doubt we can convince Apache not to do the draining behavior.
Though based on what I wrote above, I'm not sure that non-draining is a
behavior we care about testing that much (we _do_ care about getting an
HTTP 500 and bailing from the helper, but that is much easier to test).
If we are going to carry some custom server code, python brings some
complications. We don't currently depend on python at all in the test
suite, outside of the git-p4 tests. And those are disabled if NO_PYTHON
is set. We do assume that we have access to perl in the test suite even
if NO_PERL is set, but we try to limit ourselves to a very vanilla
subset that would work even on ancient versions of perl (so basically
treating perl like sed/awk as being in the "it's everywhere" unix
toolbox).
I don't know if we could do the same with python. I suspect there are
systems without it, even these days. I'm also not sure what portability
issues we might see on the specific components you're using. E.g., is
http.server always available?
So I think at the very least we'd need to be able to push this behind a
test prereq and skip the test when we can't spin up the webserver.
I do think doing something similar in perl would be nicer, if only
because it keeps the number of languages we use in the test suite the
same. But I also have a personal bias towards perl (mostly from
exposure). And even with perl, we'd be relying on non-standard modules
that would still require a prereq/skip setup.
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] t/lib-http.sh: add functions related to serve-git.py
2024-06-12 11:50 ` [PATCH 2/4] t/lib-http.sh: add functions related to serve-git.py Carlos Martín Nieto
@ 2024-06-13 9:19 ` Jeff King
0 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2024-06-13 9:19 UTC (permalink / raw)
To: Carlos Martín Nieto; +Cc: git
On Wed, Jun 12, 2024 at 01:50:26PM +0200, Carlos Martín Nieto wrote:
> diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
> index d83bafeab32..6454300a041 100644
> --- a/t/lib-httpd.sh
> +++ b/t/lib-httpd.sh
> [...]
> +start_serve_git() {
I can see why you'd stick this in lib-httpd.sh. But note that we'll bail
from that script's setup phase if we don't find Apache. That's not the
end of the world, but it does mean we'd fail to run this test on
platforms that otherwise could.
> + test_atexit stop_serve_git
OK, you do the auto-kill on exit, which is good.
> + "$TEST_DIRECTORY"/lib-httpd/serve-git.py \
> + --document-root "$HTTPD_ROOT_PATH"/www \
> + --port "$LIB_GIT_SERVE_PORT" &
> +
> + mkdir -p "$HTTPD_ROOT_PATH"
> + echo $! >"$HTTPD_ROOT_PATH"/git-serve.pid
> +
> + GIT_SERVE_DEST=127.0.0.1:$LIB_GIT_SERVE_PORT
> + GIT_SERVE_URL=http://$GIT_SERVE_DEST
> +}
But I suspect this part is racy. We started serve-git.py in the
background, but we have no guarantee that it finished starting up, or
even started listening on the port.
We've run into those kinds of races with git-daemon; you can find the
gross fifo-based solution in lib-git-daemon.sh. We don't do anything
special for apache, but I think that's because we depend on "apache -k
start" to handle this (we don't background it ourselves).
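A minimal way to close that race (a sketch of the idea, not code from the series; the helper name is mine) is to poll the port before declaring the server started:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Poll until something accepts connections on host:port.

    A simpler alternative to the fifo trick in lib-git-daemon.sh: keep
    retrying the connect until the backgrounded server is listening, or
    give up after the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.25):
                return True
        except OSError:
            time.sleep(0.05)
    return False
```

start_serve_git() could call something like this after backgrounding
serve-git.py and only proceed once it returns true.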
> +stop_serve_git() {
> + kill -9 $(cat "$HTTPD_ROOT_PATH"/git-serve.pid)
> +}
This looks reasonable. You probably want to redirect stderr to
/dev/null, since any script which calls stop_serve_git() itself will
double-kill and see an error on the second one.
-Peff
* Re: [PATCH 3/4] t5541: add test for rejecting a push due to packfile size
2024-06-12 11:50 ` [PATCH 3/4] t5541: add test for rejecting a push due to packfile size Carlos Martín Nieto
2024-06-12 21:49 ` Taylor Blau
@ 2024-06-13 9:21 ` Jeff King
2024-06-13 10:07 ` Jeff King
2 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2024-06-13 9:21 UTC (permalink / raw)
To: Carlos Martín Nieto; +Cc: git
On Wed, Jun 12, 2024 at 01:50:27PM +0200, Carlos Martín Nieto wrote:
> diff --git a/t/t5546-receive-limits.sh b/t/t5546-receive-limits.sh
> index 9fc9ba552f1..ccbdf3945ab 100755
> --- a/t/t5546-receive-limits.sh
> +++ b/t/t5546-receive-limits.sh
> @@ -5,6 +5,11 @@ test_description='check receive input limits'
> TEST_PASSES_SANITIZE_LEAK=true
> . ./test-lib.sh
>
> +
> +ROOT_PATH="$PWD"
I don't think this ROOT_PATH is doing anything?
> +. "$TEST_DIRECTORY"/lib-httpd.sh
> +start_serve_git
Since you're adding to an existing script, these should go near the
bottom (or possibly even in their own script). If we don't have apache
or curl support, then loading lib-httpd.sh at all will cause us to bail
from the test script immediately. So we'll never run these existing
tests at all on such platforms, even though we could (and do now).
-Peff
* Re: [PATCH 4/4] remote-curl: read in the push report even if we fail to finish sending data
2024-06-12 11:50 ` [PATCH 4/4] remote-curl: read in the push report even if we fail to finish sending data Carlos Martín Nieto
@ 2024-06-13 9:55 ` Jeff King
2024-07-23 15:07 ` Carlos Martín Nieto
0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2024-06-13 9:55 UTC (permalink / raw)
To: Carlos Martín Nieto; +Cc: git
On Wed, Jun 12, 2024 at 01:50:28PM +0200, Carlos Martín Nieto wrote:
> If we just consume send-pack's output and don't send anything to
> remote-helper, it will not update any of its structures and will report
> "Everything up-to-date" next to the error message.
OK, consuming the output at the helper level makes some sense to me.
But...
> diff --git a/remote-curl.c b/remote-curl.c
> index 0b6d7815fdd..9e45e14afec 100644
> --- a/remote-curl.c
> +++ b/remote-curl.c
> @@ -1114,15 +1114,25 @@ static int rpc_service(struct rpc_state *rpc, struct discovery *heads,
>
> close(client.in);
> client.in = -1;
> - if (!err) {
> - strbuf_read(rpc_result, client.out, 0);
> - } else {
> - char buf[4096];
> - for (;;)
> - if (xread(client.out, buf, sizeof(buf)) <= 0)
> - break;
> +
> + /*
> + * If we encountered an error, we might still get a report. Consume the
> + * rest of the packfile and an extra flush and then we can copy
> + * over the report the same way as in the success case.
> + */
> + if (err) {
> + int n;
> + do {
> + n = packet_read(rpc->out, rpc->buf, rpc->alloc, 0);
> + } while (n > 0);
> +
> + /* Read the final flush separating the payload from the report */
> + packet_read(rpc->out, rpc->buf, rpc->alloc, 0);
> }
Isn't this existing code already trying to read everything? I think
rpc->out and client.out are synonyms.
So now instead of reading to EOF, we are reading some set number of
packets. This function is used for both fetches and pushes, isn't it? Is
the expected number of packets the same for both? What about
stateless-connect mode?
> + /* Copy the report of successes/failures */
> + strbuf_read(rpc_result, client.out, 0);
OK, so this is where we read the result. Which again, only makes sense
for send-pack. And in theory we've synchronized the protocol through the
packet reads above (are we sure that we always enter the read loop above
from a predictable synchronization point in the protocol, given that we
saw an error?).
What if send-pack doesn't send us anything useful (e.g., it hangs up
without sending the report). Shouldn't we take the presence of "err"
being non-zero as an indication that things are not well, even if we
never get to the send-pack report?
-Peff
* Re: [PATCH 3/4] t5541: add test for rejecting a push due to packfile size
2024-06-12 11:50 ` [PATCH 3/4] t5541: add test for rejecting a push due to packfile size Carlos Martín Nieto
2024-06-12 21:49 ` Taylor Blau
2024-06-13 9:21 ` Jeff King
@ 2024-06-13 10:07 ` Jeff King
2 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2024-06-13 10:07 UTC (permalink / raw)
To: Carlos Martín Nieto; +Cc: git
On Wed, Jun 12, 2024 at 01:50:27PM +0200, Carlos Martín Nieto wrote:
> +test_expect_success 'reject too-large push over HTTP' '
> + git init "$HTTPD_DOCUMENT_ROOT_PATH/error_too_large" &&
> + git -C "$HTTPD_DOCUMENT_ROOT_PATH/error_too_large" config receive.maxInputSize 128 &&
> + test-tool genrandom foo $((10*1024*1024)) >large-file &&
> + git add large-file &&
> + test_commit large-file &&
> + test_must_fail git push --porcelain \
> + $GIT_SERVE_URL/error_too_large \
> + HEAD:refs/tags/will-fail >actual &&
> + test_must_fail git -C "$HTTPD_DOCUMENT_ROOT_PATH/error_too_large" \
> + rev-parse --verify refs/tags/will-fail &&
> + cat >expect <<-EOF &&
> + To $GIT_SERVE_URL/error_too_large
> + ! HEAD:refs/tags/will-fail [remote rejected] (unpacker error)
> + Done
> + EOF
> + test_cmp expect actual
> +'
This test fails for me (even with the fix in the next patch) with:
Exception occurred during processing of request from ('127.0.0.1', 47480)
Traceback (most recent call last):
File "/usr/lib/python3.11/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.11/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.11/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/home/peff/compile/git/t/lib-httpd/serve-git.py", line 35, in __init__
super().__init__(*args, **kwargs)
File "/usr/lib/python3.11/socketserver.py", line 755, in __init__
self.handle()
File "/usr/lib/python3.11/http/server.py", line 436, in handle
self.handle_one_request()
File "/usr/lib/python3.11/http/server.py", line 424, in handle_one_request
method()
File "/home/peff/compile/git/t/lib-httpd/serve-git.py", line 117, in do_GET
self.do_receive_pack(responder, gitdir, True, protocol)
File "/home/peff/compile/git/t/lib-httpd/serve-git.py", line 243, in do_receive_pack
self._run_command(resp, 'receive-pack', gitdir, advertisement, protocol)
File "/home/peff/compile/git/t/lib-httpd/serve-git.py", line 192, in _run_command
with subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE, cwd=gitdir, env=env) as proc:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.11/subprocess.py", line 1955, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'git'
which seems...odd. It should be finding "git" in bin-wrappers/, but I
think it is using a restricted/vanilla path. If I put "git" into
/usr/bin, it stops complaining (but obviously this is very wrong, as we
are not running the Git we're trying to test!).
Ah, I think I see it. You set up an environment for the Popen like:
env = {k:v for k, v in os.environ.items() if k.startswith('GIT_')}
if protocol is not None:
env['GIT_PROTOCOL'] = protocol
so it does not contain $PATH (nor other possibly useful things!), so
presumably a fallback $PATH is used. I think you'd want to start with
"env = os.environ.copy()" and then modify it from there.
-Peff
* Re: [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer
2024-06-13 9:11 ` [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Jeff King
@ 2024-07-23 13:46 ` Carlos Martín Nieto
0 siblings, 0 replies; 14+ messages in thread
From: Carlos Martín Nieto @ 2024-07-23 13:46 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Thu, 2024-06-13 at 05:11 -0400, Jeff King wrote:
> On Wed, Jun 12, 2024 at 01:50:24PM +0200, Carlos Martín Nieto wrote:
>
> > While investigating a difference between HTTP and SSH when rejecting a push due
> > to it being too large, I noticed that rejecting a push without receiving the
> > entire packfile causes git to print out the error message "pack exceeds maximum
> > allowed size" but it also shows "Everything up-to-date" instead of the rejection
> > of every ref update like the server has specified.
> >
> > This is the result of two issues in git, of which I aim to fix one here, namely
> >
> > 1) when the server sends the response and closes the connection, remote-curl
> > sees that as an error and stops processing the send-pack output, combined with
> >
> > 2) git does not remember what it asked the remote helper to push so it cannot
> > distinguish whether an empty report means "I had an error and did nothing" or
> > "everything was up to date and I didn't have to do anything".
> >
> > The latter issue is more complex so here I'm concentrating on the former, which
> > has a simple solution but a complex test. The solution is to read in to the end
> > of what send-pack is telling us (it's creating the whole packfile that we're
> > throwing away anyway) so we can report back to the user.
>
> Hmm. I think the second one is much more important, though. If the
> remote helper dies unexpectedly for any reason, shouldn't the parent
> git-push realize this and propagate the error? It sounds like you are
> fixing one _specific_ case where the remote helper dies so that we don't
> run into the problem caused by (2).
I am concentrating on this one bug, which seems relatively
self-contained (other than the testing). Bug (2) requires a lot more
reworking of the remote-helper mechanism, and fixing it would only
address one aspect, namely showing the ref report instead of
"Everything up-to-date". It still wouldn't display the message sent
back from the server saying why the push was rejected (though as you
later argue, it might be correct to ignore it when we send back the
200).
>
> But if we fixed (2), then (1) would not be nearly as important anymore
It's not as confusing regarding the up-to-date message, but (1) would
still hide the error message from the server; we'd just show some
other generic message about the helper dying. And for HTTP it would
still be different from what we see over SSH.
> (if at all). But I think there is a little more going on with this
> specific case...
>
> > The testing however proved a bit complicated as this bug requires the server to
> > cut off the connection while git is uploading the packfile. The existing HTTP
> > tests use CGI and as far as I've been able to test, httpd buffers too much for
> > us to be able to replicate the situation.
>
> It is not buffering that gets in your way, but rather that Apache will
> actually drain the remainder of the request (throwing away the extra
> bytes) if the CGI dies. You can see this by running your test against
> apache and strace-ing the apache process. It is in a read/write loop
> streaming the packfile from network to the CGI's stdin, it hits EPIPE on
> the write(), and then it switches to just read().
>
Yes, that was bad wording on my side. I think the effect here was that
I did not see the same issues when remote-curl or send-pack managed to
get further into their own data stream than what we see with the 500
from the server that disconnects.
> Which makes sense, because you can't send back the HTTP 200 response
> until you've read the whole request. RFC 7230 does make an exception for
> error responses:
>
> A client sending a message body SHOULD monitor the network connection
> for an error response while it is transmitting the request. If the
> client sees a response that indicates the server does not wish to
> receive the message body and is closing the connection, the client
> SHOULD immediately cease transmitting the body and close its side of
> the connection.
>
> That's just a "SHOULD" but I think curl does this properly. In the case
> of GitHub's servers, a fatal condition from index-pack seems to trigger
> an HTTP 500. So that is OK by the standard above, but it does mean we
> lose the opportunity to provide any error report at the Git protocol
> level. And so you end up with output like:
>
> error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
> send-pack: unexpected disconnect while reading sideband packet
> fatal: the remote end hung up unexpectedly
>
> whereas the same push over ssh produces[1]:
>
> remote: fatal: non-blob object size limit exceeded (commit ddc6d4e7091229184537d078666afb241ea8ef62 is 104857798 bytes)
> error: remote unpack failed: index-pack failed
> To github.com:peff/test.git
> ! [remote rejected] main -> main (failed)
> error: failed to push some refs to 'github.com:peff/test.git'
>
> And even if we teach the remote helper to soak up the whole pack, we're
> still going to see the same useless output. So I would argue that the
> first fix here needs to be at the server level. It should be soaking up
> all of the request bytes if possible so that we can write back an HTTP
> 200 with the appropriate error report.
>
> Of course it's kind of lame that we don't cut off the client and let it
> keep sending useless bytes (though really it is not any different than
> fsck errors, which we queue after reading the whole input).
Ideally we would be able to tell the client quickly too, but that
presents some challenges both on the server and client.
>
> It might be possible to send an application/x-git-receive-pack-result as
> the body of the 500 response. And then we could teach the client side to
> notice and handle that better. We could possibly even use a different
> code (413 Payload Too Large?), and I suspect existing clients would
> behave sensibly (and new ones could give a better message). We _might_
> even be able to get away with a "200" response here, though even if curl
> is OK with it, I think we are stretching the RFC a bit.
Yes, I had completely missed this before and thought it was a curl
behaviour, but in rpc_in we actually just ignore any output from the
server if the status is >= 300. This makes sense, as we don't want to
try to parse a generic error page, but if the server sets the
content-type then it can tell git that it does want the message parsed
even though it's an error. A slight modification to my server script
suggests this does work, and it might have been a significant part of
what I was fighting against before.
If git (send-pack here I think) can reliably detect the server sending
an error, then it might be able to stop sending the data early. That
would then work for any protocol.
>
> [1] The keen-eyed observer may notice this is failing due to a different
> reason than "pack exceeds maximum allowed size". I happen to know
> that GitHub's version of index-pack will also reject commits over 100MB
> in the same unceremonious way, so we can generate the same condition
> without having to spend tons of bandwidth:
>
> # start with one small commit
> git init
> git commit --allow-empty -m foo
>
> # now make a too-big one. Note that it will compress well with
> # zlib!
> git cat-file commit HEAD >commit
> perl -e 'print "a" x 79, "\n" for (1..1310720)' >>commit
> commit=$(git hash-object -t commit -w commit)
> git update-ref HEAD $commit
>
> # pushing just the commit above will not generate the issue,
> # because we'll (racily) have sent the rest of the pack by
> # the time the server notices the problem. So tack on a bunch of
> # big (and uncompressible) irrelevant data.
> dd if=/dev/urandom of=foo.rand bs=1M count=90
> git add foo.rand
> git commit -m 'big data'
>
> And now pushing over http will quickly-ish get you to the ugly
> output (but over ssh is OK). But note this only works with GitHub!
> Upstream Git does not have the same object-size check.
>
> > This is why there's a python Git server in this patch series that doesn't rely
> > on CGI but streams the data both ways so it can close the stream as soon as
> > receive-pack exits. There's already some python tooling in the project and I'm
> > much more familiar with it than e.g. perl, so I hope that's fine. I tried to
> > make it as simple as possible while still being able to stream bidirectionally.
>
> I admit that I don't love carrying a separate webserver implementation,
> but I doubt we can convince Apache not to do the draining behavior.
> Though based on what I wrote above, I'm not sure that non-draining is a
> behavior we care about testing that much (we _do_ care about getting an
> HTTP 500 and bailing from the helper, but that is much easier to test).
Coming at this from the side where you see it for a large packfile,
I'm partial to git working in such a way that we don't have to eat up
the entire rest of the packfile. It would also be nice if we could
return other errors to the user quickly for their own sake.
>
> If we are going to carry some custom server code, python brings some
> complications. We don't currently depend on python at all in the test
> suite, outside of the git-p4 tests. And those are disabled if NO_PYTHON
> is set. We do assume that we have access to perl in the test suite even
> if NO_PERL is set, but we try to limit ourselves to a very vanilla
> subset that would work even on ancient versions of perl (so basically
> treating perl like sed/awk as being in the "it's everywhere" unix
> toolbox).
>
> I don't know if we could do the same with python. I suspect there are
> systems without it, even these days. I'm also not sure what portability
> issues we might see on the specific components you're using. E.g., is
> http.server always available?
AFAICT I did not rely on anything that doesn't come with the standard
library, although I did write it for python 3, which may not be easy
to install everywhere git has ever run. It should also work with
python 2, but I appreciate the extra requirements. I did see a lot of
python in git-p4, but if that's the only thing, that's a different
support level than the core of git.
Cheers,
cmn
* Re: [PATCH 4/4] remote-curl: read in the push report even if we fail to finish sending data
2024-06-13 9:55 ` Jeff King
@ 2024-07-23 15:07 ` Carlos Martín Nieto
0 siblings, 0 replies; 14+ messages in thread
From: Carlos Martín Nieto @ 2024-07-23 15:07 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Thu, 2024-06-13 at 05:55 -0400, Jeff King wrote:
> On Wed, Jun 12, 2024 at 01:50:28PM +0200, Carlos Martín Nieto wrote:
>
> > If we just consume send-pack's output and don't send anything to
> > remote-helper, it will not update any of its structures and will report
> > "Everything up-to-date" next to the error message.
>
> OK, consuming the output at the helper level makes some sense to me.
> But...
>
> > diff --git a/remote-curl.c b/remote-curl.c
> > index 0b6d7815fdd..9e45e14afec 100644
> > --- a/remote-curl.c
> > +++ b/remote-curl.c
> > @@ -1114,15 +1114,25 @@ static int rpc_service(struct rpc_state *rpc, struct discovery *heads,
> >
> > close(client.in);
> > client.in = -1;
> > - if (!err) {
> > - strbuf_read(rpc_result, client.out, 0);
> > - } else {
> > - char buf[4096];
> > - for (;;)
> > - if (xread(client.out, buf, sizeof(buf)) <= 0)
> > - break;
> > +
> > + /*
> > + * If we encountered an error, we might still get a report. Consume the
> > + * rest of the packfile and an extra flush and then we can copy
> > + * over the report the same way as in the success case.
> > + */
> > + if (err) {
> > + int n;
> > + do {
> > + n = packet_read(rpc->out, rpc->buf, rpc->alloc, 0);
> > + } while (n > 0);
> > +
> > + /* Read the final flush separating the payload from the report */
> > + packet_read(rpc->out, rpc->buf, rpc->alloc, 0);
> > }
>
> Isn't this existing code already trying to read everything? I think
> rpc->out and client.out are synonyms.
The existing code will read to EOF if we saw an error, ignoring
anything including any reports.
>
> So now instead of reading to EOF, we are reading some set number of
> packets. This function is used for both fetches and pushes, isn't it? Is
> the expected number of packets the same for both? What about
> stateless-connect mode?
>
> > + /* Copy the report of successes/failures */
> > + strbuf_read(rpc_result, client.out, 0);
>
> OK, so this is where we read the result. Which again, only makes sense
> for send-pack. And in theory we've synchronized the protocol through the
The existing code already tries to read the report regardless of
pushing or fetch in the non-err case.
> packet reads above (are we sure that we always enter the read loop above
> from a predictable synchronization point in the protocol, given that we
> saw an error?).
That's what the loop is trying to do: it reads the rest of the
packfile, trying to get to the flush at the end. This is what the
while loop above does, which copies between packet_read and post_rpc.
Given that send-pack is still sending the rest of the packfile, it
should be the same as if we had been sending the data over the
network.
It doesn't quite work, which is why there's an extra read there, but I
take your point that I forgot we also run fetches through this, so
it's probably going to be more complicated anyway.
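As a side note, the kind of draining being described can be modeled roughly like this (a python sketch, not the actual remote-curl C code; the function name is mine):

```python
import io

def drain_until_flush(stream) -> int:
    """Read pkt-lines until a flush packet ("0000"); return packets drained.

    Each pkt-line starts with a 4-byte hex length that includes the
    4-byte prefix itself; a length of 0 marks a flush packet.
    """
    drained = 0
    while True:
        hdr = stream.read(4)
        if len(hdr) < 4:
            raise EOFError("stream ended before flush packet")
        size = int(hdr, 16)
        if size == 0:  # flush packet terminates the drain
            return drained
        stream.read(size - 4)
        drained += 1

# One 5-byte payload ("0009hello") followed by a flush:
assert drain_until_flush(io.BytesIO(b"0009hello0000")) == 1
```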
>
> What if send-pack doesn't send us anything useful (e.g., it hangs up
> without sending the report). Shouldn't we take the presence of "err"
> being non-zero as an indication that things are not well, even if we
> never get to the send-pack report?
In this case err is non-zero because we got a non-200 HTTP response, if
I followed the code correctly, so it does mean there was an error, and
it's true that we don't necessarily know why with just that variable.
It's still nicer if we can try to get more data out of send-pack (and
fetch-pack if it does have more information there), but yes this code
isn't quite it.
I might just step back on this a bit and split my fixes here, as
making send-pack show an error message should just need a change to
still forward non-2xx responses when the server claims the body is in
the Git protocol. That would still show the error message from the
server even if we provide the "Everything up-to-date" message as well.
Cheers,
cmn
end of thread, other threads: [~2024-07-23 15:07 UTC | newest]
Thread overview: 14+ messages
2024-06-12 11:50 [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Carlos Martín Nieto
2024-06-12 11:50 ` [PATCH 1/4] t/lib-http: add serve-git.py Carlos Martín Nieto
2024-06-12 21:51 ` Junio C Hamano
2024-06-12 11:50 ` [PATCH 2/4] t/lib-http.sh: add functions related to serve-git.py Carlos Martín Nieto
2024-06-13 9:19 ` Jeff King
2024-06-12 11:50 ` [PATCH 3/4] t5541: add test for rejecting a push due to packfile size Carlos Martín Nieto
2024-06-12 21:49 ` Taylor Blau
2024-06-13 9:21 ` Jeff King
2024-06-13 10:07 ` Jeff King
2024-06-12 11:50 ` [PATCH 4/4] remote-curl: read in the push report even if we fail to finish sending data Carlos Martín Nieto
2024-06-13 9:55 ` Jeff King
2024-07-23 15:07 ` Carlos Martín Nieto
2024-06-13 9:11 ` [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer Jeff King
2024-07-23 13:46 ` Carlos Martín Nieto