From mboxrd@z Thu Jan 1 00:00:00 1970
From: Neal Kreitzinger
Subject: Re: GSoC - Some questions on the idea of
Date: Sat, 31 Mar 2012 15:28:06 -0500
Message-ID: <4F7768D6.3010400@gmail.com>
References: <20120330203430.GB20376@sigill.intra.peff.net>
In-Reply-To: <20120330203430.GB20376@sigill.intra.peff.net>
To: Jeff King
Cc: Bo Chen, Sergio, git@vger.kernel.org
Newsgroups: gmane.comp.version-control.git

On 3/30/2012 3:34 PM, Jeff King wrote:
> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>
>> The sub-problems of "delta for large file" problem.
>>
>> 1 large file
>>
> But let's take a step back for a moment. Forget about whether a file is
> binary or not. Imagine you want to store a very large file in git.
>
> What are the operations that will perform badly? How can we make them
> perform acceptably, and what tradeoffs must we make? E.g., the way the
> diff code is written, it would be very difficult to run "git diff" on a
> 2 gigabyte file. But is that actually a problem? Answering that means
> talking about the characteristics of 2 gigabyte files, and what we
> expect to see, and to what degree our tradeoffs will impact them.
>
> Here's a more concrete example. At first, even storing a 2 gigabyte file
> with "git add" was painful, because we would load the whole thing in
> memory. Repacking the repository was painful, because we had to rewrite
> the whole 2G file into a packfile. Nowadays, we stream large files
> directly into their own packfiles, and we have to pay the I/O only once
> (and the memory cost never). As a tradeoff, we no longer get delta
> compression of large objects. That's OK for some large objects, like
> movie files (which don't tend to delta well, anyway). But it's not for
> other objects, like virtual machine images, which do tend to delta well.
>
> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements we
> got with the stream-directly-to-packfile approach?
>
> One possible solution is breaking large files into smaller chunks using
> something like the bupsplit algorithm (and I won't go into the details
> here, as links to bup have already been mentioned elsewhere, and Junio's
> patches make a start at this sort of splitting).
>

(I'm no expert on "big-files" in git or elsewhere, but this thread is
immensely interesting to me as a git user who wants to track all sorts
of binary files and possibly large text files in the very near future,
i.e., all components tied to a server build and upgrades beyond the
linux-distro/rpms, and perhaps including them also.)

Let's take an even bigger step back for a moment. Who determines whether
a file shall be a big-file or not? Git or the user? How is it determined
whether a file shall be a "big-file" or not?

Who decides bigness:
Bigness seems to be relative to system resources. Does the user crunch
the numbers to determine whether a file is a big-file, or does git? If
the numbers are relative, should git query the system and make the
determination? Either way, once the system resources are upgraded and
formerly "big-files" are no longer considered "big", how is the previous
history refactored to behave "non-big-file-like"? Conversely, if the
system resources are redistributed so that formerly non-big files are
now relatively big (i.e., moved from a powerful central server login to
laptops), how is the history refactored to accommodate the
newly-relative-bigness?

How bigness is decided:
There seem to be two basic types of big-files: big-worktree-files and
big-history-files. A big-worktree-file that is delta-friendly is not a
big-history-file. A non-big-worktree-file that is delta-unfriendly is a
big-history-file problem.
If you are working alone on an old computer, you are probably more
concerned about big-worktree-files (memory). If you are working in a
large group making lots of changes to the same files on a powerful
server, then you are probably more concerned about big-history-file
size (disk space). Of course, everyone is concerned about
big-worktree-files that are delta-unfriendly.

At what point is a delta-friendly file considered a "big-file"? I assume
that may depend on the degree of delta-friendliness. I imagine that a
text file and a vm-image differ in delta-friendliness by several
degrees.

At what point(s) is a delta-unfriendly file considered a "big-file"? I
assume that may depend on the degree(s) of delta-unfriendliness. I
imagine a compiled program and a compressed container differ in
delta-unfriendliness by several degrees.

My understanding is that git does not ever delta-compress binary files.
That would mean even a small-worktree-binary-file becomes a
big-history-file over time.

v/r,
neal
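
The "degree of delta-friendliness" discussed above can be made roughly
measurable. What follows is only a minimal sketch and not anything git
itself does: it assumes that zlib with a preset dictionary is an
acceptable stand-in for delta compression, and the names
delta_friendliness and sample_text are hypothetical. The score is the
fraction of compressed size saved when the old version is available as a
reference while compressing the new one.

```python
import random
import zlib


def delta_friendliness(old: bytes, new: bytes) -> float:
    """Rough score in [0, 1]: the fraction of compressed size saved when
    `old` is available as a reference while compressing `new`.

    Assumption: a zlib preset dictionary stands in for delta compression.
    This is not git's delta algorithm, and deflate can only look back
    about 32 KiB, so keep the inputs small.
    """
    alone = len(zlib.compress(new, 9))
    compressor = zlib.compressobj(level=9, zdict=old)
    with_reference = len(compressor.compress(new) + compressor.flush())
    return max(0.0, 1.0 - with_reference / alone)


def sample_text(rng: random.Random, lines: int = 1000) -> list:
    """Build about 22 KB of made-up, text-like lines (hypothetical data)."""
    return [f"{i:04d} {rng.getrandbits(64):016x}\n" for i in range(lines)]


if __name__ == "__main__":
    rng = random.Random(0)
    old_lines = sample_text(rng)
    new_lines = list(old_lines)
    new_lines[0] = "0000 edited line\n"  # one small edit near the start

    old_text = "".join(old_lines).encode()
    new_text = "".join(new_lines).encode()

    # Plain text with a one-line edit: most of `new` can be found in `old`.
    print("text:      ", round(delta_friendliness(old_text, new_text), 2))

    # The same two versions stored compressed. A small edit tends to
    # change much of the compressed stream, so this usually scores lower.
    old_packed = zlib.compress(old_text)
    new_packed = zlib.compress(new_text)
    print("compressed:", round(delta_friendliness(old_packed, new_packed), 2))
```

Inputs larger than the deflate window would need a real delta or
chunking approach; the sketch only illustrates that friendliness is a
matter of degree rather than a yes/no property.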