From: "C. Scott Ananian" <cscott@cscott.net>
To: Linus Torvalds <torvalds@osdl.org>
Cc: git@vger.kernel.org
Subject: Re: space compression (again)
Date: Fri, 15 Apr 2005 14:45:55 -0400 (EDT)
Message-ID: <Pine.LNX.4.61.0504151437100.27637@cag.csail.mit.edu>
In-Reply-To: <Pine.LNX.4.58.0504151117360.7211@ppc970.osdl.org>
On Fri, 15 Apr 2005, Linus Torvalds wrote:
> The problem with chunking is:
> - it complicates a lot of the routines. Things like "is this file
> unchanged" suddenly become "is this file still the same set of chunks",
> which is just a _lot_ more code and a lot more likely to have bugs.
The blob still has the same hash; therefore the file is still the same.
Nothing looks inside blobs; they just want either the hash or the full
contents (if I understand the algorithms correctly).
I agree it's more code, but I think it can be nicely layered.
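To illustrate the layering, here is a minimal sketch and not git's actual code. It assumes a blob is named by the SHA-1 of its full contents; `ChunkedStore`, `CHUNK_SIZE`, and `unchanged` are hypothetical names invented for this example. Chunks are purely a storage detail behind `put`/`get`, so the "is this file unchanged" test still compares one hash and never looks at the chunk list.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


class ChunkedStore:
    """Hypothetical store: blobs are named by the hash of their full contents."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes
        self.blobs = {}   # blob hash -> ordered list of chunk hashes

    def put(self, data: bytes) -> str:
        # The blob id depends only on the full contents, not on the chunking.
        blob_id = hashlib.sha1(data).hexdigest()
        chunk_ids = []
        for i in range(0, len(data), CHUNK_SIZE):
            piece = data[i:i + CHUNK_SIZE]
            cid = hashlib.sha1(piece).hexdigest()
            self.chunks[cid] = piece  # identical chunks are stored once
            chunk_ids.append(cid)
        self.blobs[blob_id] = chunk_ids
        return blob_id

    def get(self, blob_id: str) -> bytes:
        # Callers get the full contents back; chunks stay hidden.
        return b"".join(self.chunks[c] for c in self.blobs[blob_id])


def unchanged(old_id: str, new_data: bytes) -> bool:
    # Same check as in the unchunked case: compare hashes only.
    return hashlib.sha1(new_data).hexdigest() == old_id
```

Callers that only want "the hash or the full contents" would use `put`, `get`, and `unchanged` and would not need to know whether the store chunks at all.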
> - you have to find a blocking factor. I thought of just going it fixed
> chunks, and that just doesn't help at all.
rsync uses a fixed chunk size, but a chunk can start at any offset (i.e.,
not constrained to fixed boundaries). This means that adding a single
line to the file works like you'd expect, even though all the chunk
boundaries change. [I think this is what you're talking about.]
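A rough sketch of that idea, in the spirit of rsync's rolling checksum but not rsync's actual code. `BLOCK`, `weak`, `roll`, `build_table`, and `find_matches` are hypothetical names, and the tiny block size is an assumption made for readability. The old data is cut into fixed-size blocks and indexed in a table; the new data is then scanned one byte at a time with a cheap rolling checksum, and a strong hash confirms each candidate hit.

```python
import hashlib

BLOCK = 8      # hypothetical block size, tiny for illustration
MOD = 1 << 16


def weak(block: bytes):
    """Cheap two-part checksum over one block."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide the window one byte to the right without rescanning it."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def build_table(old: bytes):
    """Index the old data's fixed-boundary blocks by weak checksum."""
    table = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        blk = old[off:off + BLOCK]
        table.setdefault(weak(blk), []).append((hashlib.sha1(blk).digest(), off))
    return table


def find_matches(table, new: bytes):
    """Return (new_offset, old_offset) for each known block found in new."""
    matches = []
    if len(new) < BLOCK:
        return matches
    pos = 0
    a, b = weak(new[:BLOCK])
    while True:
        hit = None
        for strong, old_off in table.get((a, b), []):
            # The weak checksum can collide, so confirm with the strong hash.
            if hashlib.sha1(new[pos:pos + BLOCK]).digest() == strong:
                hit = old_off
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += BLOCK  # skip past the matched block
            if pos + BLOCK > len(new):
                break
            a, b = weak(new[pos:pos + BLOCK])
        else:
            if pos + BLOCK >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + BLOCK], BLOCK)
            pos += 1
    return matches
```

For example, with `old = b"0123456789abcdefghijklmn"` and `new = b"NEW\n" + old`, `find_matches(build_table(old), new)` returns `[(4, 0), (12, 8), (20, 16)]`: every old block is still found, just shifted by the four inserted bytes.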
> - we already have wasted space due to the low-level filesystem (as
> opposed to "git") usually being block-based, which means that space
> utilization for small objects tends to suck. So you really want to
> prefer objects that are several kB (compressed), and a small block just
> wastes tons of space.
Not on (say) reiserfs, and not over the network. I'm proposing (at the
moment) easy conversion from chunked to unchunked disk representation,
so that you can leave things unchunked if (for example) you know you're
running ext2 with a large block size.
> - there _is_ a natural blocking factor already. That's what a file
> boundary really is within the project, and finding any other is really
> quite hard.
Well, yes, it may be nontrivial. But 'quite hard' depends on your
perspective, I guess. Given a cache of existing chunks, it's just a
few table lookups. =)
> So I'm personally 100% sure that it's not worth it. But I'm not opposed to
> the _concept_: it makes total sense in the "filesystem" view, and is 100%
> equivalent to having an inode with pointers to blocks. I just don't think
> the concept plays out well in reality.
So I guess I'll have to implement this and find out, won't I? =)
--scott
( http://cscott.net/ )