From: Scott Young <youngs1@sunyit.edu>
To: reiserfs-list@namesys.com
Subject: Re: Can compression at filesystem level improve overall performance?
Date: Mon, 22 Mar 2004 13:00:28 -0500
Message-ID: <1079978427.4658.63.camel@localhost.localdomain>
In-Reply-To: <16475.9613.375262.677576@laputa.namesys.com>


> 
> That's common misconception. :)
> 
> The goal of compression is to conserve disk bandwidth rather than space.
> 
> By compressing it is possible to transfer data (== uncompressed data
> user works with), at a rate higher than raw device bandwidth.

I will be doing some research on an algorithm that speeds up data
transfers over a network by adaptively selecting a compression
algorithm.  It can be applied to filesystem reads and writes too.  When
the send queue is reasonably full on the server, it starts compressing
data at the tail of the queue while sending the data at the head of the
queue.  If the output stream catches up to the segment currently being
compressed, then that segment is sent uncompressed.  If the compressed
data is not significantly smaller, then the uncompressed data is sent
instead.  For network applications that are not network interface bound
(like rsync over a 100 Mbit connection), the buffer will be empty most of
the time and therefore little compression would be needed or wanted as
it would only slow the application down.  Compression is chosen from a
pool of algorithms and varied depending on the history of buffer
overflows and under-runs.  Slower, better compression algorithms are
used when the buffer is mostly full and the compression is observably
effective.  The idea here is to minimize the time between the client
requesting the data and having usable data in hand.  This can be seen as
a time-versus-amount-of-usable-data-on-client graph: some applications
prefer low latency for the initial stream of data (such as a web page),
whereas others care about the total time to retrieve a very large piece
of data (such as scp scott@1.2.3.4:SomeBigDocument.sxw /home/scott over
a 56k modem).
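
A minimal sketch of the send-queue idea described above, in Python.
Nothing here is implemented yet; the codec pool, the 10% "significantly
smaller" threshold, the 25%/75% fill marks, and the names pick_level,
encode_segment and simulate are all assumptions made for illustration.
The simulation also assumes the queue starts full, no new data arrives,
and one segment can be compressed in the time it takes to send one.

```python
import bz2
import zlib
from collections import deque

# Hypothetical codec pool, ordered from fastest/weakest to slowest/strongest.
CODECS = [
    ("raw", lambda b: b),
    ("zlib-1", lambda b: zlib.compress(b, 1)),
    ("zlib-9", lambda b: zlib.compress(b, 9)),
    ("bz2", lambda b: bz2.compress(b, 9)),
]
MIN_SAVING = 0.10  # assumed threshold for "significantly smaller"


def pick_level(queue_len, capacity, level, last_ratio):
    """Pick a codec index from the queue fill and the last observed ratio."""
    fill = queue_len / capacity
    if fill < 0.25:
        return 0  # queue nearly empty: the link is not the bottleneck
    if fill > 0.75 and last_ratio < 1.0 - MIN_SAVING:
        # queue mostly full and compression is working: try a stronger codec
        return min(level + 1, len(CODECS) - 1)
    return max(level, 1)


def encode_segment(data, level):
    """Return (codec_name, payload); fall back to raw if the gain is small."""
    name, fn = CODECS[level]
    out = fn(data)
    if len(out) > len(data) * (1.0 - MIN_SAVING):
        return "raw", data
    return name, out


def simulate(segments, capacity):
    """Compress from the tail of the queue while sending from the head."""
    queue = deque({"data": s, "codec": None, "payload": None} for s in segments)
    level, last_ratio, sent = 0, 1.0, []
    while queue:
        level = pick_level(len(queue), capacity, level, last_ratio)
        if level > 0:
            # Walk from the tail towards the head, skipping the head itself.
            for seg in list(queue)[:0:-1]:
                if seg["codec"] is None:
                    seg["codec"], seg["payload"] = encode_segment(seg["data"], level)
                    last_ratio = len(seg["payload"]) / len(seg["data"])
                    break
        head = queue.popleft()
        if head["codec"] is None:
            # The sender caught up with the compressor: send uncompressed.
            head["codec"], head["payload"] = "raw", head["data"]
        sent.append((head["codec"], head["payload"]))
    return sent
```

Calling simulate([b"a" * 4096] * 4, capacity=4) returns the segments in
send order, each tagged with the codec that was actually used, so the
point where the sender catches up with the compressor is visible in the
output.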

Adapting this to filesystem concepts, the server can be seen as the
write process and the client can be seen as the read process.  The idea
can be applied to Reiser4 by compressing the overwrite set while the
journal data is being written, and then compressing the tail of the
relocate set moving backwards until the write stream catches up to the
compression.  It could also take into account the estimated
decompression time when reading the data back, and use it for deciding
whether the compression ratio is good enough to write the compressed
data instead of the uncompressed data.
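
The write-time decision could look roughly like the sketch below.  It
is an assumption-laden model, not Reiser4 code: should_store_compressed
and its parameters are hypothetical names, throughput figures are in
bytes per second, decompression speed is measured in output bytes, and
the read and the decompression are assumed not to overlap (which is
pessimistic for the compressed case).

```python
def should_store_compressed(raw_len, comp_len, disk_bw, decomp_bw):
    """Return True if reading the compressed copy back is estimated to be faster.

    raw_len   -- size of the uncompressed data in bytes
    comp_len  -- size of the compressed data in bytes
    disk_bw   -- estimated device read bandwidth, bytes/second
    decomp_bw -- estimated decompression speed, output bytes/second
    """
    read_raw = raw_len / disk_bw
    # Compressed path: read fewer bytes, then pay for decompression.
    read_comp = comp_len / disk_bw + raw_len / decomp_bw
    return read_comp < read_raw
```

For example, with 1 MB that compresses to 500 kB on a 50 MB/s device,
a 200 MB/s decompressor makes the compressed copy the better choice,
while a 50 MB/s decompressor does not.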

Another interesting twist would be to cache the compressed data if the
same data is going to be sent from the server several times.  This
reduces CPU overhead on the server (and possibly its memory
requirements for caching the data, as well as the amount of data that
needs to be read from the drive), but it is complicated in the context
of a network algorithm and is mostly application-dependent.  This is
research for another day, maybe in the form of a derived-data plugin for
ReiserFS where an application tells the filesystem how to construct the
file, and the filesystem can store the original, the result, or both,
depending on space needs and performance analysis, with copy-on-write
metadata flags when appropriate.
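
In its simplest form, the server-side caching idea might be sketched
like this.  CompressedCache and its load callback are hypothetical, the
codec is fixed to zlib for brevity, and eviction and invalidation when
the underlying data changes are deliberately left out.

```python
import zlib


class CompressedCache:
    """Keep the compressed form of data that is sent more than once."""

    def __init__(self, level=6):
        self.level = level
        self._store = {}
        self.compress_calls = 0  # counts how often compression actually ran

    def get(self, key, load):
        """Return compressed bytes for key; call load() only on a miss."""
        if key not in self._store:
            self._store[key] = zlib.compress(load(), self.level)
            self.compress_calls += 1
        return self._store[key]
```

Repeated get() calls with the same key return the stored compressed
bytes without reading or compressing the source again.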

I haven't started coding the adaptive compression algorithm yet, but I
have a general idea about how I am going to implement it.  For the
proof-of-concept, I want to write this using sockets and some basic
library compression algorithms (gzip, bzip2, and maybe a simple MTF +
Adaptive Huffman).  Later variants may work with TCP or other protocols
around that layer.  Any suggestions will be appreciated.


Scott Young





