From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?UTF-8?B?UmVuw6kgU2NoYXJmZQ==?= Subject: Re: [PATCH 5/5] archive-zip: stream large blobs into zip file Date: Tue, 01 May 2012 00:54:01 +0200 Message-ID: <4F9F1809.1060803@lsrfire.ath.cx> References: <1335761837-12482-1-git-send-email-pclouds@gmail.com> <1335761837-12482-6-git-send-email-pclouds@gmail.com> <7vipghf2z0.fsf@alter.siamese.dyndns.org> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: =?UTF-8?B?Tmd1eeG7hW4gVGjDoWkgTmfhu41jIER1eQ==?= , git@vger.kernel.org To: Junio C Hamano X-From: git-owner@vger.kernel.org Tue May 01 00:54:10 2012 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1SOzTt-0005Xf-Pu for gcvg-git-2@plane.gmane.org; Tue, 01 May 2012 00:54:10 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757189Ab2D3WyE convert rfc822-to-quoted-printable (ORCPT ); Mon, 30 Apr 2012 18:54:04 -0400 Received: from india601.server4you.de ([85.25.151.105]:57811 "EHLO india601.server4you.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757033Ab2D3WyD (ORCPT ); Mon, 30 Apr 2012 18:54:03 -0400 Received: from [192.168.2.105] (p4FFDA315.dip.t-dialin.net [79.253.163.21]) by india601.server4you.de (Postfix) with ESMTPSA id 7E7FE2F805D; Tue, 1 May 2012 00:54:01 +0200 (CEST) User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120420 Thunderbird/12.0 In-Reply-To: <7vipghf2z0.fsf@alter.siamese.dyndns.org> Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: Am 30.04.2012 21:12, schrieb Junio C Hamano: > Nguy=E1=BB=85n Th=C3=A1i Ng=E1=BB=8Dc Duy writes: > >> A large blob will be read twice. One for calculating crc32, one for >> actual writing. > > Is that because you need the checksum before the payload? That is > unfortunate. It would be nice (read: not a rejection of this patch--= -it > is a good first step to do it stupid but correct way before trying to > optimize it) to avoid it when the output is seekable, especially beca= use > we are talking about a *large* payload. The ZIP format optionally allows writing the CRC and the sizes after th= e=20 data. This adds a data descriptor with a size of 16 bytes to each=20 streamed entry. Seeking back and correcting these values in an output=20 file would avoid that. >> diff --git a/t/t1050-large.sh b/t/t1050-large.sh >> index fe47554..458fdde 100755 >> --- a/t/t1050-large.sh >> +++ b/t/t1050-large.sh >> @@ -138,4 +138,8 @@ test_expect_success 'tar achiving' ' >> git archive --format=3Dtar HEAD>/dev/null >> ' >> >> +test_expect_success 'zip achiving' ' >> + git archive --format=3Dzip HEAD>/dev/null >> +' > > Can't we do better than "we only check if it finishes without barfing= ; we > cannot afford to check the correctness of the output"? The same comm= ent > applies to all the tests you added to this file in the past 3 months. Streaming to tar can be tested by setting core.big_file_threshold big=20 enough, creating a non-streamed version of the archive and comparing it= =20 to the streamed one. With the seek trick, this would work for ZIP as=20 well. For streaming with an added data descriptor we'd need to actuall= y=20 unpack the ZIP file, though. Ren=C3=A9