* Transparent compression with ext4 - especially with zstd
@ 2025-01-19 14:37 Gerhard Wiesinger
2025-01-21 4:01 ` Theodore Ts'o
0 siblings, 1 reply; 13+ messages in thread
From: Gerhard Wiesinger @ 2025-01-19 14:37 UTC (permalink / raw)
To: linux-ext4
Hello,
Are there any plans to include transparent compression with ext4
(especially with zstd)?
Thnx.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-19 14:37 Transparent compression with ext4 - especially with zstd Gerhard Wiesinger
@ 2025-01-21 4:01 ` Theodore Ts'o
2025-01-21 9:42 ` Artem Blagodarenko
2025-01-21 18:47 ` Gerhard Wiesinger
0 siblings, 2 replies; 13+ messages in thread
From: Theodore Ts'o @ 2025-01-21 4:01 UTC (permalink / raw)
To: Gerhard Wiesinger; +Cc: linux-ext4
On Sun, Jan 19, 2025 at 03:37:27PM +0100, Gerhard Wiesinger wrote:
>
> Are there any plans to include transparent compression with ext4 (especially
> with zstd)?
I'm not aware of anyone in the ext4 development community working on
something like this. Fully transparent compression is challenging,
since supporting random writes into a compressed file is tricky.
There are solutions (for example, the Stac patent which resulted in
Microsoft paying $120 million), but even ignoring the
intellectual property issues, they tend to compromise the efficiency
of the compression.
More to the point, given how cheap byte storage tends to be (dollars
per IOPS tend to be far more of a constraint than dollars per GB),
it's unclear what the business case would be for any company to fund
development work in this area, when the cost of a slightly larger HDD
or SSD is going to be far cheaper than the necessary software
engineering investment needed, even for a hyperscaler cloud company
(and even there, it's unclear that transparent compression is really
needed).
What is the business and/or technical problem which you are trying to
solve?
Cheers,
- Ted
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-21 4:01 ` Theodore Ts'o
@ 2025-01-21 9:42 ` Artem Blagodarenko
2025-01-21 18:47 ` Gerhard Wiesinger
1 sibling, 0 replies; 13+ messages in thread
From: Artem Blagodarenko @ 2025-01-21 9:42 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Gerhard Wiesinger, linux-ext4
Hi Gerhard, Theodore,
>even for a hyperscaler cloud company
>(and even there, it's unclear that transparent compression is really
>needed).
Regarding exascale storage: Lustre FS (which uses EXT4 (LDISKFS) as a
backend) has a “Client-side data compression” project (LU-10026) which
adds transparent compression with an extendable set of algorithms. The
initial release includes the gzip, lz4, lz4hc, lzo, zstd and zstdfast
algorithms with levels.
More details are in the LUG and LAD presentations from 2023-2024.
Best regards,
Artem Blagodarenko
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-21 4:01 ` Theodore Ts'o
2025-01-21 9:42 ` Artem Blagodarenko
@ 2025-01-21 18:47 ` Gerhard Wiesinger
2025-01-21 19:33 ` Theodore Ts'o
2025-01-21 21:26 ` Dave Chinner
1 sibling, 2 replies; 13+ messages in thread
From: Gerhard Wiesinger @ 2025-01-21 18:47 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
On 21.01.2025 05:01, Theodore Ts'o wrote:
> On Sun, Jan 19, 2025 at 03:37:27PM +0100, Gerhard Wiesinger wrote:
>> Are there any plans to include transparent compression with ext4 (especially
>> with zstd)?
> I'm not aware of anyone in the ext4 development community working on
> something like this. Fully transparent compression is challenging,
> since supporting random writes into a compressed file is tricky.
> There are solutions (for example, the Stac patent which resulted in
> Microsoft paying $120 million), but even ignoring the
> intellectual property issues, they tend to compromise the efficiency
> of the compression.
>
> More to the point, given how cheap byte storage tends to be (dollars
> per IOPS tend to be far more of a constraint than dollars per GB),
> it's unclear what the business case would be for any company to fund
> development work in this area, when the cost of a slightly larger HDD
> or SSD is going to be far cheaper than the necessary software
> engineering investment needed, even for a hyperscaler cloud company
> (and even there, it's unclear that transparent compression is really
> needed).
>
> What is the business and/or technical problem which you are trying to
> solve?
>
Regarding necessity:
In some scenarios we are talking about saving disk space by large
factors. E.g. in my database scenario with PostgreSQL, around 85% of
disk space can be saved (roughly a factor of 7).
In cloud usage scenarios you can easily reduce the amount of allocated
disk space by around a factor of 7 and therefore reduce cost.
You might also get a performance boost by using caching mechanisms more
efficiently (e.g. using less RAM).
Also with precompressed files (e.g. photos, videos) you can save around
5-10% of overall disk space, which sounds small, but in the area of
several hundred gigabytes or even some petabytes this is a lot of
storage. On an evenly distributed data store you can save even more.
The technical issue is that IMHO no stable and practically usable Linux
filesystem with transparent compression exists in the mainline kernel:
- ZFS works but is not included in the mainline kernel
- BTRFS has stability and repair issues (see mailing lists) and bugs
with compression (it does not compress on the fly in some scenarios)
- bcachefs is experimental
Regarding patents: IMHO at least the Stac patents are no longer valid.
Thnx.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-21 18:47 ` Gerhard Wiesinger
@ 2025-01-21 19:33 ` Theodore Ts'o
2025-01-22 0:19 ` Kiselev, Oleg
2025-01-22 7:29 ` Gerhard Wiesinger
2025-01-21 21:26 ` Dave Chinner
1 sibling, 2 replies; 13+ messages in thread
From: Theodore Ts'o @ 2025-01-21 19:33 UTC (permalink / raw)
To: Gerhard Wiesinger; +Cc: linux-ext4
On Tue, Jan 21, 2025 at 07:47:24PM +0100, Gerhard Wiesinger wrote:
> We are talking in some scenarios about some factors of diskspace. E.g. in
> my database scenario with PostgreSQL around 85% of disk space can be saved
> (e.g. around factor 7).
So the problem with using compression with databases is that they need
to be able to do random writes into the middle of a file. So that
means you need to use tricks such as writing into clusters, typically
32k or 64k. What this means is that a single 4k random write gets
amplified into a 32k or 64k write.
> In cloud usage scenarios you can easily reduce that amount of allocated
> diskspace by around a factor 7 and reduce cost therefore.
If you are running this on a cloud platform, where you are limited (on
GCE) or charged (on AWS) by IOPS and throughput, this can be a
performance bottleneck (or cost you extra). At the minimum the extra
I/O throughput will very likely show up on various performance
benchmarks.
Worse, using transparent compression breaks the ACID properties of
the database. If you crash or have a power failure while rewriting
the 64k compression cluster, all or part of that 64k compression
cluster can be corrupted. And if your customers care about (their)
data integrity, the fact that you cheaped out on disk space might not
be something that would impress them terribly.
The short version is that transparent compression is not free, even if
you ignore the SWE development costs of implementing such a feature,
and then getting that feature to be fit for use in an enterprise use
case. No matter what file system you might want to use, I *strongly*
suggest that you get a power fail rack and try putting the whole stack
on said power fail rack, and try dropping power while running a stress
test --- over, and over, and over again. What you might find would
surprise you.
> The technical topic is that IMHO no stable and practical usable Linux
> filesystem which is included in the default kernel exists.
> - ZFS works but is not included in the default kernel
> - BTRFS has stability and repair issues (see mailing lists) and bugs with
> compression (does not compress on the fly in some scenarios)
> - bcachefs is experimental
When I started work at Google 15 years ago to deploy ext4 into
production, we did precisely this, as well as deploying to a small
percentage of Google's test fleet to do A:B comparisons before we
deployed to the entire production fleet.
Whether or not it is "practical" and "usable" depends on your
definition, I guess, but from my perspective "stable" and "not losing
users' data" is job #1.
But hey, if it's worth so much to you, I suggest you work out what it
would cost to actually implement the features that you want so much,
or how much it would cost to make the more complex file systems
stable for production use. You might decide that paying the extra
storage costs is way cheaper than software engineering investment
costs involved. At Google, and when I was at IBM before that, we were
always super disciplined about trying to figure out the ROI costs of
some particular project and not just doing it because it was "cool".
There's a famous story about how the engineers working on ZFS didn't
ask for management's permission or input from the sales team before
they started. Sounds great, and there was some cool technology there
in ZFS --- but note that Sun had to put the company up for sale
because they were losing money...
Cheers,
- Ted
P.S. Note: using a compression cluster is the only real way to
support transparent compression if you are using an update-in-place
file system like ext4 or xfs. (And that is what was covered by the
Stac patents that I mentioned.)
If you are using a log-structured file system, such as ZFS, then you can
simply rewrite the compression cluster *and* update the file system
metadata to point at the new compression cluster --- but then the
garbage collection costs, and the file system metadata update costs
for each database commit are *huge*, and the I/O throughput hit is
even higher. So much so that ZFS recommends that you turn off the
log-structured write and do update-in-place if you want to use a
database on ZFS. But I'm pretty sure that this disables transparent
compression if you are using update-in-place. TNSTAAFL.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-21 18:47 ` Gerhard Wiesinger
2025-01-21 19:33 ` Theodore Ts'o
@ 2025-01-21 21:26 ` Dave Chinner
2025-01-22 6:47 ` Gerhard Wiesinger
1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-01-21 21:26 UTC (permalink / raw)
To: Gerhard Wiesinger; +Cc: Theodore Ts'o, linux-ext4
On Tue, Jan 21, 2025 at 07:47:24PM +0100, Gerhard Wiesinger wrote:
> On 21.01.2025 05:01, Theodore Ts'o wrote:
> > On Sun, Jan 19, 2025 at 03:37:27PM +0100, Gerhard Wiesinger wrote:
> > > Are there any plans to include transparent compression with ext4 (especially
> > > with zstd)?
> > I'm not aware of anyone in the ext4 development community working on
> > something like this. Fully transparent compression is challenging,
> > since supporting random writes into a compressed file is tricky.
> > There are solutions (for example, the Stac patent which resulted in
> > Microsoft paying $120 million), but even ignoring the
> > intellectual property issues, they tend to compromise the efficiency
> > of the compression.
> >
> > More to the point, given how cheap byte storage tends to be (dollars
> > per IOPS tend to be far more of a constraint than dollars per GB),
> > it's unclear what the business case would be for any company to fund
> > development work in this area, when the cost of a slightly larger HDD
> > or SSD is going to be far cheaper than the necessary software
> > engineering investment needed, even for a hyperscaler cloud company
> > (and even there, it's unclear that transparent compression is really
> > needed).
> >
> > What is the business and/or technical problem which you are trying to
> > solve?
> >
> Regarding necessity:
> We are talking in some scenarios about some factors of diskspace. E.g. in my
> database scenario with PostgreSQL around 85% of disk space can be saved
> (e.g. around factor 7).
So use a database that has built-in data compression capabilities.
e.g. MySQL has transparent table compression functionality.
This requires sparse files and FALLOC_FL_PUNCH_HOLE support in the
filesystem, but there is no need for any special filesystem side
support for data compression to get space gains of up to 75% on
compressible data sets with the default database (16kB record size)
and filesystem configs (4kB block size).
The argument that "application level compression is hard, so we want
the filesystem to do it for us" ignores the fact that it is -much
harder- to do efficient compression in the filesystem than at the
application level.
The OS and filesystem don't have the freedom to control
application level data access patterns nor tailor the compression
algorithms to match how the application manages data, so everything
the filesystem implements is a compromise. It will never be optimal
for any given workload, because we have to make sure that it is
not complete garbage for any given workload...
> In cloud usage scenarios you can easily reduce that amount of allocated
> diskspace by around a factor 7 and reduce cost therefore.
Same argument: cloud applications should be managing their data
sets appropriately and efficiently, not relying on the cloud storage
infrastructure to magically do stuff to "reduce costs" for them.
Remember: there's a massive conflict of interest on the vendor side
here - the less efficient the application (be it CPU, RAM or storage
capacity), the more money the cloud vendor makes from users running
that application. Hence they have little motivation to provide
infrastructure or application functionality that costs them money to
implement and has the impact of reducing their overall revenue
stream...
> You might also get a performance boost by using caching mechanisms more
> efficiently (e.g. using less RAM).
Not true. Linux caches uncompressed data in the page cache - caching
compressed data will significantly increase the memory footprint and
CPU consumption as it has to be constantly uncompressed and
recompressed as the data changes. This is not a viable caching
strategy for a general purpose OS.
> Also with precompressed files (e.g. photos, videos) you can save around 5-10%
Video and photos do not compress sufficiently to be a viable runtime
compression target for filesystem based compression. It's a massive
waste of resources to attempt compression of internally compressed
data formats for anything but cold data storage. And even then, if
it's cold storage then the data should be compressed and checksummed
by the cold storage application before it is written to the
filesystem.
> The technical topic is that IMHO no stable and practical usable Linux
> filesystem which is included in the default kernel exists.
> - ZFS works but is not included in the default kernel
> - BTRFS has stability and repair issues (see mailing lists) and bugs with
> compression (does not compress on the fly in some scenarios)
I hear this sort of generic "btrfs is not stable/has bugs" complaint
as a reason for not using btrfs all the time.
I hear just as many, if not more, generic "XFS is unstable and loses
data" claims as a reason for not using XFS, too.
Anecdotal claims are not proof of fact, and I don't see any real
evidence that btrfs is unstable. e.g. Fedora has been using btrfs
as the root filesystem (and has for quite a while now) and there has
been no noticeable increase in bug reports (either for fs
functionality or data loss) compared to when ext4 or XFS was used as
the default filesystem type...
IOWs, I redirect generic "btrfs is unstable" complaints to /dev/null
these days, just like I do with generic "XFS is unstable"
complaints.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-21 19:33 ` Theodore Ts'o
@ 2025-01-22 0:19 ` Kiselev, Oleg
2025-01-22 6:10 ` Gerhard Wiesinger
2025-01-22 7:29 ` Gerhard Wiesinger
1 sibling, 1 reply; 13+ messages in thread
From: Kiselev, Oleg @ 2025-01-22 0:19 UTC (permalink / raw)
To: Theodore Ts'o, Gerhard Wiesinger; +Cc: linux-ext4@vger.kernel.org
MySQL, MariaDB and PostgreSQL do their own, schema and page-size aware compression. Why not let the databases do this? They are in a better position to do it and trade off the costs where and when it matters to them.
--
Oleg Kiselev
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-22 0:19 ` Kiselev, Oleg
@ 2025-01-22 6:10 ` Gerhard Wiesinger
0 siblings, 0 replies; 13+ messages in thread
From: Gerhard Wiesinger @ 2025-01-22 6:10 UTC (permalink / raw)
To: Kiselev, Oleg, Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org
On 22.01.2025 01:19, Kiselev, Oleg wrote:
> MySQL, MariaDB and PostgreSQL do their own, schema and page-size aware compression. Why not let the databases do this? They are in a better position to do it and trade off the costs where and when it matters to them.
Hello Oleg,
Thnx for the input. For PostgreSQL: AFAIK compression only works for
larger values (e.g. >2kB, via TOAST) and it looks like it doesn't work
for my use case. But I will have a deeper look into it.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-21 21:26 ` Dave Chinner
@ 2025-01-22 6:47 ` Gerhard Wiesinger
0 siblings, 0 replies; 13+ messages in thread
From: Gerhard Wiesinger @ 2025-01-22 6:47 UTC (permalink / raw)
To: Dave Chinner; +Cc: Theodore Ts'o, linux-ext4
On 21.01.2025 22:26, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 07:47:24PM +0100, Gerhard Wiesinger wrote:
>> On 21.01.2025 05:01, Theodore Ts'o wrote:
>>> On Sun, Jan 19, 2025 at 03:37:27PM +0100, Gerhard Wiesinger wrote:
>>>> Are there any plans to include transparent compression with ext4 (especially
>>>> with zstd)?
>>> I'm not aware of anyone in the ext4 development community working on
>>> something like this. Fully transparent compression is challenging,
>>> since supporting random writes into a compressed file is tricky.
>>> There are solutions (for example, the Stac patent which resulted in
>>> Microsoft paying $120 million), but even ignoring the
>>> intellectual property issues, they tend to compromise the efficiency
>>> of the compression.
>>>
>>> More to the point, given how cheap byte storage tends to be (dollars
>>> per IOPS tend to be far more of a constraint than dollars per GB),
>>> it's unclear what the business case would be for any company to fund
>>> development work in this area, when the cost of a slightly larger HDD
>>> or SSD is going to be far cheaper than the necessary software
>>> engineering investment needed, even for a hyperscaler cloud company
>>> (and even there, it's unclear that transparent compression is really
>>> needed).
>>>
>>> What is the business and/or technical problem which you are trying to
>>> solve?
>>>
>> Regarding necessity:
>> We are talking in some scenarios about some factors of diskspace. E.g. in my
>> database scenario with PostgreSQL around 85% of disk space can be saved
>> (e.g. around factor 7).
> So use a database that has built-in data compression capabilities.
>
> e.g. Mysql has transparent table compression functionality.
> This requires sparse files and FALLOC_FL_PUNCH_HOLE support in the
> filesystem, but there is no need for any special filesystem side
> support for data compression to get space gains of up to 75% on
> compressible data sets with the default database (16kB record size)
> and filesystem configs (4kB block size).
>
> The argument that "application level compression is hard, so we want
> the filesystem to do it for us" ignores the fact that it is -much
> harder- to do efficient compression in the filesystem than at the
> application level.
>
> The OS and filesystem doesn't have the freedom to control
> application level data access patterns nor tailor the compression
> algorithms to match how the application manages data, so everything
> the filesystem implements is a compromise. It will never be optimal
> for any given workload, because we have to make sure that it is
> not complete garbage for any given workload...
MySQL/MariaDB isn't an option for me. But I will look into this.
>
>> In cloud usage scenarios you can easily reduce that amount of allocated
>> diskspace by around a factor 7 and reduce cost therefore.
> Same argument: cloud applications should be managing their data
> sets appropriately and efficiently, not relying on the cloud storage
> infrastructure to magically do stuff to "reduce costs" for them.
>
> Remeber: there's a massive conflict of interest on the vendor side
> here - the less efficient the application (be it CPU, RAM or storage
> capacity), the more money the cloud vendor makes from users running
> that application. Hence they have little motivation to provide
> infrastructure or application functionality that costs them money to
> implement and has the impact of reducing their overall revenue
> stream...
Right, therefore we want to make the storage usage as small as possible,
either at the application level or the filesystem level.
>> You might also get a performance boost by using caching mechanisms more
>> efficiently (e.g. using less RAM).
> Not true. Linux caches uncompressed data in the page cache - caching
> compressed data will significantly increase the memory footprint and
> CPU consumption as it has to be constantly uncompressed and
> recompressed as the data changes. This is not a viable caching
> strategy for a general purpose OS.
AFAIK ZFS caches compressed data in the ARC cache. zstd really has
very low decompression overhead with a very good compression ratio
(even better than gzip and bzip2).
>> Also with precompressed files (e.g. photos, videos) you can save around 5-10%
> Video and photos do not compress sufficiently to be a viable runtime
> compression target for filesystem based compression. It's a massive
> waste of resources to attempt compression of internally compressed
> data formats for anything but cold data storage. And even then, if
> it's cold storage then the data should be compressed and checksummed
> by the cold storage application before it is written to the
> filesystem.
With zstd, ZFS uses an lz4-based "early abort" feature which detects,
with very low CPU usage, that compression is not worthwhile, aborts
the compression and stores the data uncompressed. If lz4 doesn't abort
early, zstd compression is used. So there are solutions for low
resource usage.
Regarding ratios: in my case 3%:
zfs list -o name,compressratio,compression big/shares/fotovideo
NAME                  RATIO  COMPRESS
big/shares/fotovideo  1.03x  zstd-3
>
>> The technical topic is that IMHO no stable and practical usable Linux
>> filesystem which is included in the default kernel exists.
>> - ZFS works but is not included in the default kernel
>> - BTRFS has stability and repair issues (see mailing lists) and bugs with
>> compression (does not compress on the fly in some scenarios)
> I hear this sort of generic "btrfs is not stable/has bugs" complaint
> as a reason for not using btrfs all the time.
That's my practical experience. I tried BTRFS several times and it
failed in testing and in production. I had a storage incident where
several thousand 4k blocks were damaged, with several VMs running on
top.
All other filesystems (XFS, ext4, ZFS, UFS2, ...) except BTRFS and
bcachefs (which is experimental) were repairable to a consistent state
(of course with some blocks lost).
You can repair BTRFS "forever" without getting it into a consistent
state.
A friend of mine also had the experience that it was not mountable and
crashed immediately after a reboot ...
Find the details here on the mailing list:
https://marc.info/?l=linux-btrfs&m=172519149923874&w=2
>
> I hear just as many, if not more, generic "XFS is unstable and loses
> data" claims as a reason for not using XFS, too.
That's not my experience. But I primarily try to use ext4 as it is
best for "repair" scenarios.
>
> Anecdotal claims are not proof of fact, and I don't see any real
> evidence that btrfs is unstable. e.g. Fedora has been using btrfs
> as the root filesystem (and has for quite a while now) and there has
> been no noticable increase in bug reports (either for fs
> functionality or data loss) compared to when ext4 or XFS was used as
> the default filesystem type...
These are not anecdotal claims; it's my practical experience that BTRFS
is not stable and not repairable to a consistent state. It is
reproducible, you can try it for yourself.
I'm using Fedora since Fedora FC1 for all production systems.
>
> IOWs, I redirect generic "btrfs is unstable" complaints to /dev/null
> these days, just like I do with generic "XFS is unstable"
> complaints.
>
Try it and you will see that it is not repairable. You can find
details and a test case (a simulation of what I had: overwriting random
blocks) in the link.
As with Fedora I'm using the latest "fresh" stable kernel versions as
well as filesystem utilities. I still have that "unrepairable"
original BTRFS filesystem and will try to repair it to a consistent
state from time to time. Until now without success.
Find the details here on the mailing list:
https://marc.info/?l=linux-btrfs&m=172519149923874&w=2
So you shouldn't redirect the complaints to /dev/null, to get BTRFS better :-)
Thnx.
Ciao,
Gerhard
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Transparent compression with ext4 - especially with zstd
2025-01-21 19:33 ` Theodore Ts'o
2025-01-22 0:19 ` Kiselev, Oleg
@ 2025-01-22 7:29 ` Gerhard Wiesinger
2025-01-22 7:37 ` Christoph Hellwig
1 sibling, 1 reply; 13+ messages in thread
From: Gerhard Wiesinger @ 2025-01-22 7:29 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
On 21.01.2025 20:33, Theodore Ts'o wrote:
> On Tue, Jan 21, 2025 at 07:47:24PM +0100, Gerhard Wiesinger wrote:
>> We are talking in some scenarios about some factors of diskspace. E.g. in
>> my database scenario with PostgreSQL around 85% of disk space can be saved
>> (e.g. around factor 7).
> Worse, using transparent compression breaks the ACID properties of
> the database. If you crash or have a power failure while rewriting
> the 64k compression cluster, all or part of that 64k compression
> cluster can be corrupted. And if your customers care about (their)
> data integrity, the fact that you cheaped out on disk space might not
> be something that would impress them terribly.
>
BTW: Why does it break the ACID properties?
Typically the transaction log will be (and has to be) flushed/synced to
disk (fsync). If that succeeds, everything is fine and all DB
transactions can be rolled forward if necessary. If it fails, the last
transaction is simply not recorded.
I also don't see anything compression-specific here; that can also
happen without compression.
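The write-ahead pattern described above can be sketched like this (a
minimal, hypothetical POSIX sketch in Python, not any particular
database's code; the record format is made up for illustration):

```python
import os
import tempfile

def commit(wal_fd, data_fd, offset, payload):
    """Append the change to the transaction log and force it to stable
    storage before touching the data file. If the machine crashes after
    the fdatasync, recovery can roll the logged change forward; if it
    crashes before, the transaction is simply not recorded."""
    record = (offset.to_bytes(8, "little")
              + len(payload).to_bytes(4, "little")
              + payload)
    os.write(wal_fd, record)
    os.fdatasync(wal_fd)                 # transaction is durable from here on
    os.pwrite(data_fd, payload, offset)  # data page may be written lazily

tmp = tempfile.mkdtemp()
wal = os.open(os.path.join(tmp, "wal"),
              os.O_WRONLY | os.O_CREAT | os.O_APPEND)
data = os.open(os.path.join(tmp, "table"), os.O_RDWR | os.O_CREAT)
commit(wal, data, 0, b"credit +100")
os.close(wal)
os.close(data)
```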
Any clarification?
Ciao,
Gerhard
* Re: Transparent compression with ext4 - especially with zstd
2025-01-22 7:29 ` Gerhard Wiesinger
@ 2025-01-22 7:37 ` Christoph Hellwig
2025-01-22 13:19 ` Theodore Ts'o
0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-01-22 7:37 UTC (permalink / raw)
To: Gerhard Wiesinger; +Cc: Theodore Ts'o, linux-ext4
On Wed, Jan 22, 2025 at 08:29:09AM +0100, Gerhard Wiesinger wrote:
> BTW: Why does it break the ACID properties?
It doesn't if implemented properly, which of course means out of place
writes.
The only sane way to implement compression in XFS would be using out
of place writes, which we support for reflinks and which is heavily
used by the new zoned mode. For the latter retrofitting compression
would be relatively easy, but it first needs to get merged, then
stabilize and mature, and then we'll need to see if we have enough
use cases. So don't plan for it.
* Re: Transparent compression with ext4 - especially with zstd
2025-01-22 7:37 ` Christoph Hellwig
@ 2025-01-22 13:19 ` Theodore Ts'o
2025-01-22 14:11 ` Christoph Hellwig
0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2025-01-22 13:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Gerhard Wiesinger, linux-ext4
On Tue, Jan 21, 2025 at 11:37:38PM -0800, Christoph Hellwig wrote:
> On Wed, Jan 22, 2025 at 08:29:09AM +0100, Gerhard Wiesinger wrote:
> > BTW: Why does it break the ACID properties?
>
> It doesn't if implemented properly, which of course means out of place
> writes.
>
> The only sane way to implement compression in XFS would be using out
> of place writes, which we support for reflinks and which is heavily
> used by the new zoned mode. For the latter retrofitting compression
> would be relatively easy, but it first needs to get merged, then
> stabilize and mature, and then we'll need to see if we have enough
> use cases. So don't plan for it.
... but out of place writes mean that every single fdatasync() called
by the database now requires a file system level transaction commit.
So now every single fdatasync(2) results in the data blocks getting
written out to a new location on disk (this is what out of place
writes mean), followed by a CACHE FLUSH, followed by the metadata
updates to point at the new location on the disk, first written to the
file system transaction log, followed by the fs commit block, followed
by a *second* CACHE FLUSH command.
So now let's look at a sample scenario where the database needs to
update 3 different 4k blocks (for example, where you are crediting
$100 to an income account, followed by a $100 debit to an expense
account, followed by the database commit).
Without transparent compression, the commit looks like this (assuming
the database is properly using fdatasync, so it's not asking the file
system to update the ctime/mtime of the database file):
1) random write A (4k write)
2) random write B (4k write)
3) random write C (4k write)
4) CACHE FLUSH
With transparent compression:
1) random write A
2) random write B
3) random write C
4) CACHE FLUSH
5) update the location of compression cluster A written to the fs journal
6) update the location of compression cluster B written to the fs journal
7) update the location of compression cluster C written to the fs journal
8) write the commit block to the fs journal
9) CACHE FLUSH
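Counting the steps in the two sequences above directly (a toy tally in
Python; each numbered step is modeled as one operation, which ignores
real device behavior):

```python
# Model each step of the two I/O sequences above and count writes and
# cache flushes. This is only arithmetic on the lists, not a storage
# benchmark.

def cost(steps):
    writes = sum(1 for s in steps if "write" in s)
    flushes = steps.count("CACHE FLUSH")
    return writes, flushes

without_compression = [
    "random write A", "random write B", "random write C",
    "CACHE FLUSH",
]
with_compression = without_compression + [
    "journal write: new location of cluster A",
    "journal write: new location of cluster B",
    "journal write: new location of cluster C",
    "journal write: commit block",
    "CACHE FLUSH",
]

print(cost(without_compression))  # (3, 1)
print(cost(with_compression))     # (7, 2)
```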
This kills performance, and as I mentioned, in general, IOPS are
expensive and write bandwidth is often far more expensive than byte
storage. This is true for the raw storage at the cloud provider, for
the extra network bandwidth between the host and the cluster file
system storing the emulated cloud block device, and for the amount of
money charged to the cloud customer, because it does cost the cloud
provider more money.
If you try to do transparent compression using update-in-place (for
example, via the technique in the Stac patent) then you don't need to
update the location on disk, but given that you are replacing a 64k
compression cluster every time you update a 4k block, if you crash in
the middle of the 64k compression cluster update, that cluster could
get corrupted --- at which point you break the database's ACID
properties.
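The torn-cluster failure mode can be demonstrated in miniature (a toy
sketch using zlib as a stand-in compressor; a real filesystem does not
store clusters this way, but the effect on a spliced compressed stream
is the same in spirit):

```python
import zlib

CLUSTER = 64 * 1024

# Two versions of one 64k compression cluster.
old = zlib.compress(b"A" * CLUSTER)
new = zlib.compress(b"B" * CLUSTER)

# Simulate a crash halfway through an update-in-place rewrite: the
# front of the new compressed image has reached the disk, while the
# tail still holds the old bytes.
torn = new[: len(new) // 2] + old[len(new) // 2 :]

try:
    zlib.decompress(torn)
    torn_cluster_readable = True
except zlib.error:
    torn_cluster_readable = False

print(torn_cluster_readable)  # False: the whole cluster is lost,
                              # not just the 4k block being updated
```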
Finally, note that both Amazon and Google have first party cloud
products (RDS and CloudSQL, respectively) that provide to the customer
the full MySQL and Postgres feature set. So if you want to enable
database level compression, I believe you *can* do it. Compression is
not free, and not magic, but if it works for you, you *can* enable it
if you are using MySQL or Postgres.
Now, if you are using a database that doesn't support database-level
compression, then why not ask the vendor providing the database to add
compression as a feature? Of course, they might ask you, as the
customer, to pay $$$, but the development cost of adding new features,
whether in the database or the file system, is also not free.
Cheers,
- Ted
* Re: Transparent compression with ext4 - especially with zstd
2025-01-22 13:19 ` Theodore Ts'o
@ 2025-01-22 14:11 ` Christoph Hellwig
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-01-22 14:11 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Christoph Hellwig, Gerhard Wiesinger, linux-ext4
On Wed, Jan 22, 2025 at 08:19:12AM -0500, Theodore Ts'o wrote:
> ... but out of place writes mean that every single fdatasync() called
> by the database now requires a file system level transaction commit.
Yes.
> So now every single fdatasync(2) results in the data blocks getting
> written out to a new location on disk (this is what out of place
> writes mean), followed by a CACHE FLUSH, followed by the metadata
> updates to point at the new location on the disk, first written to the
> file system transaction log, followed by the fs commit block, followed
> by a *second* CACHE FLUSH command.
Or you put the compressed data in the log and have a single FUA
write.
end of thread, other threads: [~2025-01-22 14:11 UTC | newest]
Thread overview: 13+ messages
2025-01-19 14:37 Transparent compression with ext4 - especially with zstd Gerhard Wiesinger
2025-01-21 4:01 ` Theodore Ts'o
2025-01-21 9:42 ` Artem Blagodarenko
2025-01-21 18:47 ` Gerhard Wiesinger
2025-01-21 19:33 ` Theodore Ts'o
2025-01-22 0:19 ` Kiselev, Oleg
2025-01-22 6:10 ` Gerhard Wiesinger
2025-01-22 7:29 ` Gerhard Wiesinger
2025-01-22 7:37 ` Christoph Hellwig
2025-01-22 13:19 ` Theodore Ts'o
2025-01-22 14:11 ` Christoph Hellwig
2025-01-21 21:26 ` Dave Chinner
2025-01-22 6:47 ` Gerhard Wiesinger