From: Jeff Garzik <jgarzik@pobox.com>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Erik Mouw <erik@harddisk-recovery.com>,
Josh Litherland <josh@temp123.org>,
linux-kernel@vger.kernel.org
Subject: Re: Transparent compression in the FS
Date: Thu, 16 Oct 2003 13:10:20 -0400
Message-ID: <3F8ED0FC.9070706@pobox.com>
In-Reply-To: <20031016162926.GF1663@velociraptor.random>
Andrea Arcangeli wrote:
> Hi Jeff,
>
> On Wed, Oct 15, 2003 at 11:13:27AM -0400, Jeff Garzik wrote:
>
>>Josh and others should take a look at Plan9's venti file storage method
>>-- archival storage is a series of unordered blocks, all of which are
>>indexed by the sha1 hash of their contents. This magically coalesces
>>all duplicate blocks by its very nature, including the loooooong runs of
>>zeroes that you'll find in many filesystems. I bet savings on "all
>>bytes in this block are zero" are worth a bunch right there.
>
>
> I had a few ideas on the above.
>
> if the zero blocks are the problem, there's a tool called zum that nukes
> them and replaces them with holes. I use it sometime, example:
>
> andrea@velociraptor:~> dd if=/dev/zero of=zero bs=1M count=100
> 100+0 records in
> 100+0 records out
> andrea@velociraptor:~> ls -ls zero
> 102504 -rw-r--r-- 1 andrea andrea 104857600 2003-10-16 18:24 zero
> andrea@velociraptor:~> ~/bin/i686/zum zero
> zero [820032K] [1 link]
> andrea@velociraptor:~> ls -ls zero
> 0 -rw-r--r-- 1 andrea andrea 104857600 2003-10-16 18:24 zero
> andrea@velociraptor:~>

Neat.

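For anyone who hasn't seen the hole trick before, here is a minimal Python sketch of the idea. It is not zum's actual implementation (zum's source isn't shown in this thread, and zum rewrites the file in place, whereas this copies to a new file); the name `sparsify` and the 4096-byte block size are assumptions for illustration only. It seeks over all-zero blocks instead of writing them, so a filesystem that supports sparse files can leave holes there.

```python
import os


def sparsify(src_path, dst_path, block_size=4096):
    """Copy src_path to dst_path, skipping all-zero blocks.

    Hypothetical helper, not part of zum. Returns the number of
    blocks that were skipped (candidate holes).
    """
    zero_block = bytes(block_size)
    holes = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            if block == zero_block[:len(block)]:
                # Advance the write offset without writing any data;
                # the skipped range reads back as zeroes.
                dst.seek(len(block), os.SEEK_CUR)
                holes += 1
            else:
                dst.write(block)
        # If the file ends in a skipped block, extend it to full length.
        dst.truncate(dst.tell())
    return holes
```

Whether the skipped ranges actually save disk blocks depends on the filesystem; comparing `ls -ls` output before and after, as in the transcript above, is the easy way to check.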
> the hash to the data is interesting, but 1) you lose the zerocopy
> behaviour for the I/O, it's like doing a checksum for all the data going to
> disk that you normally would never do (except for the tiny files in reiserfs
> with tail packing enabled, but that's not bulk I/O), 2) I wonder how much data
> is really duplicate besides the "zero" holes trivially fixable in userspace
> (modulo bzImage or similar where I'm unsure if the fs code in the bootloader
> can handle holes ;).

FWIW archival storage doesn't really care... Since all data written to
disk is hashed with SHA1 (sha1 hash == block's unique id), you gain (a)
duplicate block coalescing and (b) _real_ data integrity guaranteed, but
OTOH, you lose performance and possibly lose zero-copy.

I _really_ like the checksum aspect of Plan9's archival storage (venti).
As Andre H and Larry McVoy love to point out, data isn't _really_ secure
until it's been checksummed, and that checksum data is verified on
reads. LM has an anecdote (doesn't he always? <g>) about how BitKeeper
-- which checksums its data inside the app -- has found data-corrupting
kernel bugs, in days long past.

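To illustrate the "sha1 hash == block's unique id" scheme, here is a rough in-memory Python sketch. It is not venti's code or on-disk format; the `BlockStore` class and its method names are made up for this example, and a dict stands in for the disk. It shows the two properties mentioned above: identical blocks coalesce because they hash to the same key, and every read is verified against that key.

```python
import hashlib


class BlockStore:
    """Toy content-addressed block store (hypothetical, in-memory only)."""

    def __init__(self):
        self._blocks = {}  # sha1 hex digest -> block contents

    def put(self, data):
        """Store a block and return its SHA1-based id."""
        key = hashlib.sha1(data).hexdigest()
        # A duplicate block maps to the same key, so it is stored once.
        self._blocks.setdefault(key, data)
        return key

    def get(self, key):
        """Fetch a block, re-hashing it to verify integrity on read."""
        data = self._blocks[key]
        if hashlib.sha1(data).hexdigest() != key:
            raise IOError("block %s failed checksum verification" % key)
        return data

    def __len__(self):
        return len(self._blocks)
```

Writing the same all-zero block a thousand times leaves `len(store)` at 1, which is the coalescing effect; the hashing cost on every `put` and `get` is the performance price mentioned above.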
Jeff