From: John Ripley <jripley@riohome.com>
To: linux-kernel@vger.kernel.org
Cc: VDA <VDA@port.imtp.ilyichevsk.odessa.ua>
Subject: Re: COW fs (Re: Editing-in-place of a large file)
Date: Sun, 09 Sep 2001 17:30:15 +0100
Message-ID: <3B9B9917.DA1CC12F@riohome.com>
In-Reply-To: <20010902152137.L23180@draal.physics.wisc.edu> <318476047.20010903002818@port.imtp.ilyichevsk.odessa.ua> <3B9B80E2.C9D5B947@riohome.com>

John Ripley wrote:
> 
> VDA wrote:
> >
> > Sunday, September 02, 2001, 11:21:37 PM, Bob McElrath wrote:
> > BM> I would like to take an extremely large file (multi-gigabyte) and edit
> > BM> it by removing a chunk out of the middle.  This is easy enough by
> > BM> reading in the entire file and spitting it back out again, but it's
> > BM> hardly efficient to read in an 8GB file just to remove a 100MB segment.
> 
> > BM> Is there another way to do this?
> 
> > BM> Is it possible to modify the inode structure of the underlying
> > BM> filesystem to free blocks in the middle?  (What to do with the half-full
> > BM> blocks that are left?)  Has anyone written a tool to do something like
> > BM> this?
> 
> > BM> Is there a way to do this in a filesystem-independent manner?
> 
> > A COW fs is far more useful and cool. A fs where a copy of a file
> > does not duplicate all blocks. Blocks get copied-on-write only when
> > copy of a file is written to. There could be even a fs compressor
> > which looks for and merges blocks with exactly same contents from
> > different files.
> >
> > Maybe ext2/3 folks will play with this idea after ext3?
> >
> > I'm planning to write a test program which will scan my ext2 fs and
> > report how many duplicate blocks with the same contents it sees (i.e.
> > how many I would save with a COW fs).
> 
> I've tried this idea. I did an MD5 of every block (4KB) in a partition
> and counted the number of blocks with the same hash. Only about 5-10% of
> blocks on several filesystems were actually duplicates. This might be
> better if you reduced the block size to 512 bytes, but there's a
> question of how much extra space filesystem structures would then take
> up.

Thought I'd reply to myself with some more details :)

Scanning for duplicates gave the following results:

 512 byte blocks
----------------

/dev/sda5 - swap	-   32122 blocks,  11488 duplicates, 35.76%
/dev/sdb3 - swap	-   25297 blocks,   2302 duplicates,  9.09%
/dev/sdc5 - swap	-   34122 blocks,  10239 duplicates, 30.00%

/dev/sda6 - /tmp	-  210845 blocks,  17697 duplicates,  8.39%
/dev/sda7 - /var	-   32122 blocks,   5327 duplicates, 16.58%
/dev/sdb5 - /home	-  220885 blocks,  24541 duplicates, 11.11%
/dev/sdc7 - /usr	- 1084379 blocks, 122370 duplicates, 11.28%

4096 byte blocks
----------------

/dev/sda5 - swap	-   32122 blocks,   9799 duplicates, 30.50%
/dev/sdb3 - swap	-   26105 blocks,      0 duplicates,  0.00%
/dev/sdc5 - swap	-   34122 blocks,  10539 duplicates, 30.88%

/dev/sda6 - /tmp 	-  210845 blocks,  17880 duplicates,  8.48%
/dev/sda7 - /var	-   32122 blocks,   2816 duplicates,  8.76%
/dev/sdb5 - /home	-  220885 blocks,   8908 duplicates,  4.03%
/dev/sdc7 - /usr	- 1084379 blocks,  71778 duplicates,  6.61%

The results for the swap partitions are interesting; they are probably
mostly full of zeros. The time between runs probably explains the
difference in /tmp.

You can grab the program I used from
http://www.pslam.demon.co.uk/md5-stuff.tar.gz
Run with ./md5device </dev/blah
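
For reference, the counting approach described above (MD5 each fixed-size
block, count blocks whose hash has already been seen) can be sketched in a
few lines of Python. This is only an illustration, not the md5device program
from the tarball: the function name count_duplicate_blocks, the choice to
ignore a trailing partial block, and the separate count of all-zero blocks
(to check the guess about swap) are all assumptions.

```python
import hashlib


def count_duplicate_blocks(stream, block_size=4096):
    """Return (total, duplicates, zero_blocks) for a binary stream.

    A block is counted as a duplicate when an earlier block in the same
    stream had the same MD5 digest. A trailing partial block is ignored
    (an assumption of this sketch).
    """
    seen = set()                      # MD5 digests of blocks seen so far
    zero_block = bytes(block_size)    # reference block of all zero bytes
    total = duplicates = zero_blocks = 0

    while True:
        block = stream.read(block_size)
        if len(block) < block_size:   # EOF or partial final block
            break
        total += 1
        if block == zero_block:
            zero_blocks += 1
        digest = hashlib.md5(block).digest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)

    return total, duplicates, zero_blocks
```

To mimic the command line above, one would pass sys.stdin.buffer as the
stream and 512 or 4096 as block_size. The set holds one 16-byte digest per
unique block, so memory grows with the number of distinct blocks scanned.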

-- 
John Ripley
