Readahead for compressed data

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Readahead for compressed data
@ 2021-10-21 20:17 Matthew Wilcox
  2021-10-22  0:22 ` Gao Xiang
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Matthew Wilcox @ 2021-10-21 20:17 UTC (permalink / raw)
  To: linux-fsdevel, Jan Kara, Phillip Lougher, linux-erofs,
	linux-btrfs, linux-ntfs-dev, ntfs3, linux-bcache
  Cc: David Howells, Hsin-Yi Wang

As far as I can tell, the following filesystems support compressed data:

bcachefs, btrfs, erofs, ntfs, squashfs, zisofs

I'd like to make it easier and more efficient for filesystems to
implement compressed data.  There are a lot of approaches in use today,
but none of them seem quite right to me.  I'm going to lay out a few
design considerations next and then propose a solution.  Feel free to
tell me I've got the constraints wrong, or suggest alternative solutions.

When we call ->readahead from the VFS, the VFS has decided which pages
are going to be the most useful to bring in, but it doesn't know how
pages are bundled together into blocks.  As I've learned from talking to
Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
something we can teach the VFS.

We (David) added readahead_expand() recently to let the filesystem
opportunistically add pages to the page cache "around" the area requested
by the VFS.  That reduces the number of times the filesystem has to
decompress the same block.  But it can fail (due to memory allocation
failures or pages already being present in the cache).  So filesystems
still have to implement some kind of fallback.

For many (all?) compression algorithms (all?) the data must be mapped at
all times.  Calling kmap() and kunmap() would be an intolerable overhead.
At the same time, we cannot write to a page in the page cache which is
marked Uptodate.  It might be mapped into userspace, or a read() be in
progress against it.  For writable filesystems, it might even be dirty!
As far as I know, no compression algorithm supports "holes", implying
that we must allocate memory which is then discarded.

To me, this calls for a vmap() based approach.  So I'm thinking
something like ...

void *readahead_get_block(struct readahead_control *ractl, loff_t start,
			size_t len);
void readahead_put_block(struct readahead_control *ractl, void *addr,
			bool success);

Once you've figured out which bytes this encrypted block expands to, you
call readahead_get_block(), specifying the offset in the file and length
and get back a pointer.  When you're done decompressing that block of
the file, you get rid of it again.

It's the job of readahead_get_block() to allocate additional pages
into the page cache or temporary pages.  readahead_put_block() will
mark page cache pages as Uptodate if 'success' is true, and unlock
them.  It'll free any temporary pages.

Thoughts?  Anyone want to be the guinea pig?  ;-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-21 20:17 Readahead for compressed data Matthew Wilcox
@ 2021-10-22  0:22 ` Gao Xiang
  2021-10-22  1:04 ` Phillip Susi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Gao Xiang @ 2021-10-22  0:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Jan Kara, Phillip Lougher, linux-erofs,
	linux-btrfs, linux-ntfs-dev, ntfs3, linux-bcache, David Howells,
	Hsin-Yi Wang

On Thu, Oct 21, 2021 at 09:17:40PM +0100, Matthew Wilcox wrote:
> As far as I can tell, the following filesystems support compressed data:
> 
> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> 
> I'd like to make it easier and more efficient for filesystems to
> implement compressed data.  There are a lot of approaches in use today,
> but none of them seem quite right to me.  I'm going to lay out a few
> design considerations next and then propose a solution.  Feel free to
> tell me I've got the constraints wrong, or suggest alternative solutions.
> 
> When we call ->readahead from the VFS, the VFS has decided which pages
> are going to be the most useful to bring in, but it doesn't know how
> pages are bundled together into blocks.  As I've learned from talking to
> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> something we can teach the VFS.
> 
> We (David) added readahead_expand() recently to let the filesystem
> opportunistically add pages to the page cache "around" the area requested
> by the VFS.  That reduces the number of times the filesystem has to
> decompress the same block.  But it can fail (due to memory allocation
> failures or pages already being present in the cache).  So filesystems
> still have to implement some kind of fallback.
> 
> For many (all?) compression algorithms (all?) the data must be mapped at
> all times.  Calling kmap() and kunmap() would be an intolerable overhead.
> At the same time, we cannot write to a page in the page cache which is
> marked Uptodate.  It might be mapped into userspace, or a read() be in
> progress against it.  For writable filesystems, it might even be dirty!
> As far as I know, no compression algorithm supports "holes", implying
> that we must allocate memory which is then discarded.
> 
> To me, this calls for a vmap() based approach.  So I'm thinking
> something like ...
> 
> void *readahead_get_block(struct readahead_control *ractl, loff_t start,
> 			size_t len);
> void readahead_put_block(struct readahead_control *ractl, void *addr,
> 			bool success);
> 
> Once you've figured out which bytes this encrypted block expands to, you
> call readahead_get_block(), specifying the offset in the file and length
> and get back a pointer.  When you're done decompressing that block of
> the file, you get rid of it again.
> 
> It's the job of readahead_get_block() to allocate additional pages
> into the page cache or temporary pages.  readahead_put_block() will
> mark page cache pages as Uptodate if 'success' is true, and unlock
> them.  It'll free any temporary pages.
> 
> Thoughts?  Anyone want to be the guinea pig?  ;-)

Copied from IRC for reference as well..

As for vmap() strategy, No need to allocate some temporary pages in advance
before compressed I/O is done and async work is triggered, since I/Os /
works could cause noticable latencies. Logically, only inplace or cached
I/O strategy should be decided before I/O and compressed pages need to be
prepared. The benefits of fixed-sized output compression aside from the
relative higher compression ratios is that each compressed pcluster can
be completely decompressed independently, you can select inplace or cache
I/O for each pclusters. And when you decide inplace I/O for some pcluster,
no extra incompressible data is readed from disk or cached uselessly.

As I said, even overall read request (with many pclusters) is 1MiB or
2MiB or some else, also only need allocate 16 pages (64KiB) at most for lz4
for each read request (used for lz4 sliding window), no need such many
extra temporary pages.

Allocating too many pages in advance before I/O IMO is just increasing the
overall memory overhead. Low-ended devices cannot handle I/O quickly but
has limited memory. Temporary pages are only needed before decompression,
not exactly before I/O.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-21 20:17 Readahead for compressed data Matthew Wilcox
  2021-10-22  0:22 ` Gao Xiang
@ 2021-10-22  1:04 ` Phillip Susi
  2021-10-22  1:28   ` Gao Xiang
                     ` (2 more replies)
  2021-10-22  4:36 ` Phillip Lougher
  2021-10-29  6:15 ` Coly Li
  3 siblings, 3 replies; 16+ messages in thread
From: Phillip Susi @ 2021-10-22  1:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, Jan Kara, Phillip Lougher, linux-erofs,
	linux-btrfs, linux-ntfs-dev, ntfs3, linux-bcache, David Howells,
	Hsin-Yi Wang


Matthew Wilcox <willy@infradead.org> writes:

> As far as I can tell, the following filesystems support compressed data:
>
> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
>
> I'd like to make it easier and more efficient for filesystems to
> implement compressed data.  There are a lot of approaches in use today,
> but none of them seem quite right to me.  I'm going to lay out a few
> design considerations next and then propose a solution.  Feel free to
> tell me I've got the constraints wrong, or suggest alternative solutions.
>
> When we call ->readahead from the VFS, the VFS has decided which pages
> are going to be the most useful to bring in, but it doesn't know how
> pages are bundled together into blocks.  As I've learned from talking to
> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> something we can teach the VFS.
>
> We (David) added readahead_expand() recently to let the filesystem
> opportunistically add pages to the page cache "around" the area requested
> by the VFS.  That reduces the number of times the filesystem has to
> decompress the same block.  But it can fail (due to memory allocation
> failures or pages already being present in the cache).  So filesystems
> still have to implement some kind of fallback.

Wouldn't it be better to keep the *compressed* data in the cache and
decompress it multiple times if needed rather than decompress it once
and cache the decompressed data?  You would use more CPU time
decompressing multiple times, but be able to cache more data and avoid
more disk IO, which is generally far slower than the CPU can decompress
the data.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  1:04 ` Phillip Susi
@ 2021-10-22  1:28   ` Gao Xiang
  2021-10-22  1:39     ` Gao Xiang
  2021-10-22  2:09   ` Phillip Lougher
  2021-10-22  8:41   ` Jan Kara
  2 siblings, 1 reply; 16+ messages in thread
From: Gao Xiang @ 2021-10-22  1:28 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Matthew Wilcox, Jan Kara, linux-ntfs-dev, David Howells,
	linux-bcache, Hsin-Yi Wang, linux-fsdevel, Phillip Lougher, ntfs3,
	linux-erofs, linux-btrfs

On Thu, Oct 21, 2021 at 09:04:45PM -0400, Phillip Susi wrote:
> 
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > As far as I can tell, the following filesystems support compressed data:
> >
> > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> >
> > I'd like to make it easier and more efficient for filesystems to
> > implement compressed data.  There are a lot of approaches in use today,
> > but none of them seem quite right to me.  I'm going to lay out a few
> > design considerations next and then propose a solution.  Feel free to
> > tell me I've got the constraints wrong, or suggest alternative solutions.
> >
> > When we call ->readahead from the VFS, the VFS has decided which pages
> > are going to be the most useful to bring in, but it doesn't know how
> > pages are bundled together into blocks.  As I've learned from talking to
> > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > something we can teach the VFS.
> >
> > We (David) added readahead_expand() recently to let the filesystem
> > opportunistically add pages to the page cache "around" the area requested
> > by the VFS.  That reduces the number of times the filesystem has to
> > decompress the same block.  But it can fail (due to memory allocation
> > failures or pages already being present in the cache).  So filesystems
> > still have to implement some kind of fallback.
> 
> Wouldn't it be better to keep the *compressed* data in the cache and
> decompress it multiple times if needed rather than decompress it once
> and cache the decompressed data?  You would use more CPU time
> decompressing multiple times, but be able to cache more data and avoid
> more disk IO, which is generally far slower than the CPU can decompress
> the data.

Yes, that was also my another point yesterday talked on #xfs IRC. For
high-decompresion speed algorithms like lz4, yes, thinking about the
benefits of zcache or zram solutions, caching compressed data for
incomplete read requests is more effective than caching uncompressed
data (so we don't need zcache for EROFS at all). But if such data will
be used completely immediately, EROFS will only do inplace I/O only
(since cached compressed data can only increase memory overhead).
Also, considering some algorithms is slow, inplace I/O is more useful
for them. Anyway, it depends on the detail strategy of different
algorithms and can be fined later.

Thanks,
Gao Xiang

> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  1:28   ` Gao Xiang
@ 2021-10-22  1:39     ` Gao Xiang
  0 siblings, 0 replies; 16+ messages in thread
From: Gao Xiang @ 2021-10-22  1:39 UTC (permalink / raw)
  To: Phillip Susi, Matthew Wilcox, Jan Kara, linux-ntfs-dev,
	David Howells, linux-bcache, Hsin-Yi Wang, linux-fsdevel,
	Phillip Lougher, ntfs3, linux-erofs, linux-btrfs

On Fri, Oct 22, 2021 at 09:28:14AM +0800, Gao Xiang wrote:
> On Thu, Oct 21, 2021 at 09:04:45PM -0400, Phillip Susi wrote:
> > 
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > > As far as I can tell, the following filesystems support compressed data:
> > >
> > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > >
> > > I'd like to make it easier and more efficient for filesystems to
> > > implement compressed data.  There are a lot of approaches in use today,
> > > but none of them seem quite right to me.  I'm going to lay out a few
> > > design considerations next and then propose a solution.  Feel free to
> > > tell me I've got the constraints wrong, or suggest alternative solutions.
> > >
> > > When we call ->readahead from the VFS, the VFS has decided which pages
> > > are going to be the most useful to bring in, but it doesn't know how
> > > pages are bundled together into blocks.  As I've learned from talking to
> > > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > > something we can teach the VFS.
> > >
> > > We (David) added readahead_expand() recently to let the filesystem
> > > opportunistically add pages to the page cache "around" the area requested
> > > by the VFS.  That reduces the number of times the filesystem has to
> > > decompress the same block.  But it can fail (due to memory allocation
> > > failures or pages already being present in the cache).  So filesystems
> > > still have to implement some kind of fallback.
> > 
> > Wouldn't it be better to keep the *compressed* data in the cache and
> > decompress it multiple times if needed rather than decompress it once
> > and cache the decompressed data?  You would use more CPU time
> > decompressing multiple times, but be able to cache more data and avoid
> > more disk IO, which is generally far slower than the CPU can decompress
> > the data.
> 
> Yes, that was also my another point yesterday talked on #xfs IRC. For
> high-decompresion speed algorithms like lz4, yes, thinking about the
> benefits of zcache or zram solutions, caching compressed data for
> incomplete read requests is more effective than caching uncompressed
> data (so we don't need zcache for EROFS at all). But if such data will
> be used completely immediately, EROFS will only do inplace I/O only
> (since cached compressed data can only increase memory overhead).
> Also, considering some algorithms is slow, inplace I/O is more useful
> for them. Anyway, it depends on the detail strategy of different
> algorithms and can be fined later.
> 

But I don't think endless compressed data caching is generally useful
especially for devices that need extra memory as much possible (For
example, we need to keep apps alive as many as possible in Android
without jagging, otherwise it would be bad for user experience). So
limited prefetching/caching only increasing memory reclaim overhead
and page cache thrashing without benefits since compressed data cannot
be used immediately without decompression.

Thanks,
Gao Xiang

> Thanks,
> Gao Xiang
> 
> > 
> > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  1:04 ` Phillip Susi
  2021-10-22  1:28   ` Gao Xiang
@ 2021-10-22  2:09   ` Phillip Lougher
  2021-10-22  2:31     ` Gao Xiang
  2021-10-22  8:41   ` Jan Kara
  2 siblings, 1 reply; 16+ messages in thread
From: Phillip Lougher @ 2021-10-22  2:09 UTC (permalink / raw)
  To: Phillip Susi, Matthew Wilcox
  Cc: linux-fsdevel, Jan Kara, linux-erofs, linux-btrfs, linux-ntfs-dev,
	ntfs3, linux-bcache, David Howells, Hsin-Yi Wang

On 22/10/2021 02:04, Phillip Susi wrote:
> 
> Matthew Wilcox <willy@infradead.org> writes:
> 
>> As far as I can tell, the following filesystems support compressed data:
>>
>> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
>>
>> I'd like to make it easier and more efficient for filesystems to
>> implement compressed data.  There are a lot of approaches in use today,
>> but none of them seem quite right to me.  I'm going to lay out a few
>> design considerations next and then propose a solution.  Feel free to
>> tell me I've got the constraints wrong, or suggest alternative solutions.
>>
>> When we call ->readahead from the VFS, the VFS has decided which pages
>> are going to be the most useful to bring in, but it doesn't know how
>> pages are bundled together into blocks.  As I've learned from talking to
>> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
>> something we can teach the VFS.
>>
>> We (David) added readahead_expand() recently to let the filesystem
>> opportunistically add pages to the page cache "around" the area requested
>> by the VFS.  That reduces the number of times the filesystem has to
>> decompress the same block.  But it can fail (due to memory allocation
>> failures or pages already being present in the cache).  So filesystems
>> still have to implement some kind of fallback.
> 
> Wouldn't it be better to keep the *compressed* data in the cache and
> decompress it multiple times if needed rather than decompress it once
> and cache the decompressed data?  You would use more CPU time
> decompressing multiple times

There is a fairly obvious problem here.   A malicious attacker could
"trick" the filesystem into endlessly decompressing the same blocks,
which because the compressed data is cached, could cause it to use
all CPU, and cause a DOS attack.  Caching the uncompressed data prevents
these decompression attacks from occurring so easily.

Phillip



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  2:09   ` Phillip Lougher
@ 2021-10-22  2:31     ` Gao Xiang
  0 siblings, 0 replies; 16+ messages in thread
From: Gao Xiang @ 2021-10-22  2:31 UTC (permalink / raw)
  To: Phillip Lougher
  Cc: Phillip Susi, Matthew Wilcox, linux-fsdevel, Jan Kara,
	linux-erofs, linux-btrfs, linux-ntfs-dev, ntfs3, linux-bcache,
	David Howells, Hsin-Yi Wang

On Fri, Oct 22, 2021 at 03:09:03AM +0100, Phillip Lougher wrote:
> On 22/10/2021 02:04, Phillip Susi wrote:
> > 
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > > As far as I can tell, the following filesystems support compressed data:
> > > 
> > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > 
> > > I'd like to make it easier and more efficient for filesystems to
> > > implement compressed data.  There are a lot of approaches in use today,
> > > but none of them seem quite right to me.  I'm going to lay out a few
> > > design considerations next and then propose a solution.  Feel free to
> > > tell me I've got the constraints wrong, or suggest alternative solutions.
> > > 
> > > When we call ->readahead from the VFS, the VFS has decided which pages
> > > are going to be the most useful to bring in, but it doesn't know how
> > > pages are bundled together into blocks.  As I've learned from talking to
> > > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > > something we can teach the VFS.
> > > 
> > > We (David) added readahead_expand() recently to let the filesystem
> > > opportunistically add pages to the page cache "around" the area requested
> > > by the VFS.  That reduces the number of times the filesystem has to
> > > decompress the same block.  But it can fail (due to memory allocation
> > > failures or pages already being present in the cache).  So filesystems
> > > still have to implement some kind of fallback.
> > 
> > Wouldn't it be better to keep the *compressed* data in the cache and
> > decompress it multiple times if needed rather than decompress it once
> > and cache the decompressed data?  You would use more CPU time
> > decompressing multiple times
> 
> There is a fairly obvious problem here.   A malicious attacker could
> "trick" the filesystem into endlessly decompressing the same blocks,
> which because the compressed data is cached, could cause it to use
> all CPU, and cause a DOS attack.  Caching the uncompressed data prevents
> these decompression attacks from occurring so easily.

Yes, that is another good point. But with artifact memory pressure,
there is no difference to cache compressed data or decompressed data,
otherwise these pages will be unreclaimable, reclaim any of cached
decompressed data will also need decompress the whole pcluster.

The same to zram or zswap or what else software compression solution,
what we can do is to limit the CPU utilization if any such requirement.
But that is quite hard for lz4 since in-memory lz4 decompression speed
is usually fast than the backend storage.

By the way, as far as I know there were some experimental hardware
accelerator integrated to storage devices. So they don't need to expand
decompress pages as well. And do inplace I/O only for such devices.

Thanks,
Gao Xiang

> 
> Phillip
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  1:04 ` Phillip Susi
  2021-10-22  1:28   ` Gao Xiang
  2021-10-22  2:09   ` Phillip Lougher
@ 2021-10-22  8:41   ` Jan Kara
  2021-10-22  9:11     ` Gao Xiang
  2021-10-25 18:59     ` Phillip Susi
  2 siblings, 2 replies; 16+ messages in thread
From: Jan Kara @ 2021-10-22  8:41 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Matthew Wilcox, linux-fsdevel, Jan Kara, Phillip Lougher,
	linux-erofs, linux-btrfs, linux-ntfs-dev, ntfs3, linux-bcache,
	David Howells, Hsin-Yi Wang

On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> 
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > As far as I can tell, the following filesystems support compressed data:
> >
> > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> >
> > I'd like to make it easier and more efficient for filesystems to
> > implement compressed data.  There are a lot of approaches in use today,
> > but none of them seem quite right to me.  I'm going to lay out a few
> > design considerations next and then propose a solution.  Feel free to
> > tell me I've got the constraints wrong, or suggest alternative solutions.
> >
> > When we call ->readahead from the VFS, the VFS has decided which pages
> > are going to be the most useful to bring in, but it doesn't know how
> > pages are bundled together into blocks.  As I've learned from talking to
> > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > something we can teach the VFS.
> >
> > We (David) added readahead_expand() recently to let the filesystem
> > opportunistically add pages to the page cache "around" the area requested
> > by the VFS.  That reduces the number of times the filesystem has to
> > decompress the same block.  But it can fail (due to memory allocation
> > failures or pages already being present in the cache).  So filesystems
> > still have to implement some kind of fallback.
> 
> Wouldn't it be better to keep the *compressed* data in the cache and
> decompress it multiple times if needed rather than decompress it once
> and cache the decompressed data?  You would use more CPU time
> decompressing multiple times, but be able to cache more data and avoid
> more disk IO, which is generally far slower than the CPU can decompress
> the data.

Well, one of the problems with keeping compressed data is that for mmap(2)
you have to have pages decompressed so that CPU can access them. So keeping
compressed data in the page cache would add a bunch of complexity. That
being said keeping compressed data cached somewhere else than in the page
cache may certainly me worth it and then just filling page cache on demand
from this data...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  8:41   ` Jan Kara
@ 2021-10-22  9:11     ` Gao Xiang
  2021-10-22  9:22       ` Qu Wenruo
  2021-10-25 18:59     ` Phillip Susi
  1 sibling, 1 reply; 16+ messages in thread
From: Gao Xiang @ 2021-10-22  9:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Phillip Susi, linux-ntfs-dev, Matthew Wilcox, David Howells,
	linux-bcache, Hsin-Yi Wang, linux-fsdevel, Phillip Lougher, ntfs3,
	linux-erofs, linux-btrfs

On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > 
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > > As far as I can tell, the following filesystems support compressed data:
> > >
> > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > >
> > > I'd like to make it easier and more efficient for filesystems to
> > > implement compressed data.  There are a lot of approaches in use today,
> > > but none of them seem quite right to me.  I'm going to lay out a few
> > > design considerations next and then propose a solution.  Feel free to
> > > tell me I've got the constraints wrong, or suggest alternative solutions.
> > >
> > > When we call ->readahead from the VFS, the VFS has decided which pages
> > > are going to be the most useful to bring in, but it doesn't know how
> > > pages are bundled together into blocks.  As I've learned from talking to
> > > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > > something we can teach the VFS.
> > >
> > > We (David) added readahead_expand() recently to let the filesystem
> > > opportunistically add pages to the page cache "around" the area requested
> > > by the VFS.  That reduces the number of times the filesystem has to
> > > decompress the same block.  But it can fail (due to memory allocation
> > > failures or pages already being present in the cache).  So filesystems
> > > still have to implement some kind of fallback.
> > 
> > Wouldn't it be better to keep the *compressed* data in the cache and
> > decompress it multiple times if needed rather than decompress it once
> > and cache the decompressed data?  You would use more CPU time
> > decompressing multiple times, but be able to cache more data and avoid
> > more disk IO, which is generally far slower than the CPU can decompress
> > the data.
> 
> Well, one of the problems with keeping compressed data is that for mmap(2)
> you have to have pages decompressed so that CPU can access them. So keeping
> compressed data in the page cache would add a bunch of complexity. That
> being said keeping compressed data cached somewhere else than in the page
> cache may certainly me worth it and then just filling page cache on demand
> from this data...

It can be cached with a special internal inode, so no need to take
care of the memory reclaim or migration by yourself.

Otherwise, these all need to be take care of. For fixed-sized input
compression, since they are reclaimed in page unit, so it won't be
quite friendly since such data is all coupling. But for fixed-sized
output compression, it's quite natural.

Thanks,
Gao Xiang

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  9:11     ` Gao Xiang
@ 2021-10-22  9:22       ` Qu Wenruo
  2021-10-22  9:39         ` Gao Xiang
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2021-10-22  9:22 UTC (permalink / raw)
  To: Jan Kara, Phillip Susi, linux-ntfs-dev, Matthew Wilcox,
	David Howells, linux-bcache, Hsin-Yi Wang, linux-fsdevel,
	Phillip Lougher, ntfs3, linux-erofs, linux-btrfs



On 2021/10/22 17:11, Gao Xiang wrote:
> On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
>> On Thu 21-10-21 21:04:45, Phillip Susi wrote:
>>>
>>> Matthew Wilcox <willy@infradead.org> writes:
>>>
>>>> As far as I can tell, the following filesystems support compressed data:
>>>>
>>>> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
>>>>
>>>> I'd like to make it easier and more efficient for filesystems to
>>>> implement compressed data.  There are a lot of approaches in use today,
>>>> but none of them seem quite right to me.  I'm going to lay out a few
>>>> design considerations next and then propose a solution.  Feel free to
>>>> tell me I've got the constraints wrong, or suggest alternative solutions.
>>>>
>>>> When we call ->readahead from the VFS, the VFS has decided which pages
>>>> are going to be the most useful to bring in, but it doesn't know how
>>>> pages are bundled together into blocks.  As I've learned from talking to
>>>> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
>>>> something we can teach the VFS.
>>>>
>>>> We (David) added readahead_expand() recently to let the filesystem
>>>> opportunistically add pages to the page cache "around" the area requested
>>>> by the VFS.  That reduces the number of times the filesystem has to
>>>> decompress the same block.  But it can fail (due to memory allocation
>>>> failures or pages already being present in the cache).  So filesystems
>>>> still have to implement some kind of fallback.
>>>
>>> Wouldn't it be better to keep the *compressed* data in the cache and
>>> decompress it multiple times if needed rather than decompress it once
>>> and cache the decompressed data?  You would use more CPU time
>>> decompressing multiple times, but be able to cache more data and avoid
>>> more disk IO, which is generally far slower than the CPU can decompress
>>> the data.
>>
>> Well, one of the problems with keeping compressed data is that for mmap(2)
>> you have to have pages decompressed so that CPU can access them. So keeping
>> compressed data in the page cache would add a bunch of complexity. That
>> being said keeping compressed data cached somewhere else than in the page
>> cache may certainly me worth it and then just filling page cache on demand
>> from this data...
>
> It can be cached with a special internal inode, so no need to take
> care of the memory reclaim or migration by yourself.

There is another problem for btrfs (and maybe other fses).

Btrfs can refer to part of the uncompressed data extent.

Thus we could have the following cases:

	0	4K	8K	12K
	|	|	|	|
		    |	    \- 4K of an 128K compressed extent,
		    |		compressed size is 16K
		    \- 4K of an 64K compressed extent,
			compressed size is 8K

In above case, we really only need 8K for page cache, but if we're
caching the compressed extent, it will take extra 24K.

It's definitely not really worthy for this particular corner case.

But it would be pretty common for btrfs, as CoW on compressed data can
easily lead to above cases.

Thanks,
Qu

>
> Otherwise, these all need to be take care of. For fixed-sized input
> compression, since they are reclaimed in page unit, so it won't be
> quite friendly since such data is all coupling. But for fixed-sized
> output compression, it's quite natural.
>
> Thanks,
> Gao Xiang
>
>>
>> 								Honza
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  9:22       ` Qu Wenruo
@ 2021-10-22  9:39         ` Gao Xiang
  2021-10-22  9:54           ` Gao Xiang
  0 siblings, 1 reply; 16+ messages in thread
From: Gao Xiang @ 2021-10-22  9:39 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Jan Kara, Phillip Susi, linux-ntfs-dev, Matthew Wilcox,
	David Howells, linux-bcache, Hsin-Yi Wang, linux-fsdevel,
	Phillip Lougher, ntfs3, linux-erofs, linux-btrfs

Hi Qu,

On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/10/22 17:11, Gao Xiang wrote:
> > On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > > 
> > > > Matthew Wilcox <willy@infradead.org> writes:
> > > > 
> > > > > As far as I can tell, the following filesystems support compressed data:
> > > > > 
> > > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > > > 
> > > > > I'd like to make it easier and more efficient for filesystems to
> > > > > implement compressed data.  There are a lot of approaches in use today,
> > > > > but none of them seem quite right to me.  I'm going to lay out a few
> > > > > design considerations next and then propose a solution.  Feel free to
> > > > > tell me I've got the constraints wrong, or suggest alternative solutions.
> > > > > 
> > > > > When we call ->readahead from the VFS, the VFS has decided which pages
> > > > > are going to be the most useful to bring in, but it doesn't know how
> > > > > pages are bundled together into blocks.  As I've learned from talking to
> > > > > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > > > > something we can teach the VFS.
> > > > > 
> > > > > We (David) added readahead_expand() recently to let the filesystem
> > > > > opportunistically add pages to the page cache "around" the area requested
> > > > > by the VFS.  That reduces the number of times the filesystem has to
> > > > > decompress the same block.  But it can fail (due to memory allocation
> > > > > failures or pages already being present in the cache).  So filesystems
> > > > > still have to implement some kind of fallback.
> > > > 
> > > > Wouldn't it be better to keep the *compressed* data in the cache and
> > > > decompress it multiple times if needed rather than decompress it once
> > > > and cache the decompressed data?  You would use more CPU time
> > > > decompressing multiple times, but be able to cache more data and avoid
> > > > more disk IO, which is generally far slower than the CPU can decompress
> > > > the data.
> > > 
> > > Well, one of the problems with keeping compressed data is that for mmap(2)
> > > you have to have pages decompressed so that CPU can access them. So keeping
> > > compressed data in the page cache would add a bunch of complexity. That
> > > being said keeping compressed data cached somewhere else than in the page
> > > cache may certainly me worth it and then just filling page cache on demand
> > > from this data...
> > 
> > It can be cached with a special internal inode, so no need to take
> > care of the memory reclaim or migration by yourself.
> 
> There is another problem for btrfs (and maybe other fses).
> 
> Btrfs can refer to part of the uncompressed data extent.
> 
> Thus we could have the following cases:
> 
> 	0	4K	8K	12K
> 	|	|	|	|
> 		    |	    \- 4K of an 128K compressed extent,
> 		    |		compressed size is 16K
> 		    \- 4K of an 64K compressed extent,
> 			compressed size is 8K

Thanks for this, but the diagram is broken on my side.
Could you help fix this?

> 
> In above case, we really only need 8K for page cache, but if we're
> caching the compressed extent, it will take extra 24K.

Apart from that, with my wild guess, could we cache whatever the
real I/O is rather than the whole compressed extent unconditionally?
If the whole compressed extent is needed then, we could check if
it's all available in cache, or read the rest instead?

Also, I think no need to cache uncompressed COW data...

Thanks,
Gao Xiang

> 
> It's definitely not really worthy for this particular corner case.
> 
> But it would be pretty common for btrfs, as CoW on compressed data can
> easily lead to above cases.
> 
> Thanks,
> Qu
> 
> > 
> > Otherwise, these all need to be take care of. For fixed-sized input
> > compression, since they are reclaimed in page unit, so it won't be
> > quite friendly since such data is all coupling. But for fixed-sized
> > output compression, it's quite natural.
> > 
> > Thanks,
> > Gao Xiang
> > 
> > > 
> > > 								Honza
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  9:39         ` Gao Xiang
@ 2021-10-22  9:54           ` Gao Xiang
  2021-10-22 10:40             ` Qu Wenruo
  0 siblings, 1 reply; 16+ messages in thread
From: Gao Xiang @ 2021-10-22  9:54 UTC (permalink / raw)
  To: Qu Wenruo, Jan Kara, Phillip Susi, linux-ntfs-dev, Matthew Wilcox,
	David Howells, linux-bcache, Hsin-Yi Wang, linux-fsdevel,
	Phillip Lougher, ntfs3, linux-erofs, linux-btrfs

On Fri, Oct 22, 2021 at 05:39:29PM +0800, Gao Xiang wrote:
> Hi Qu,
> 
> On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
> > 
> > 
> > On 2021/10/22 17:11, Gao Xiang wrote:
> > > On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > > > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > > > 
> > > > > Matthew Wilcox <willy@infradead.org> writes:
> > > > > 
> > > > > > As far as I can tell, the following filesystems support compressed data:
> > > > > > 
> > > > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > > > > 
> > > > > > I'd like to make it easier and more efficient for filesystems to
> > > > > > implement compressed data.  There are a lot of approaches in use today,
> > > > > > but none of them seem quite right to me.  I'm going to lay out a few
> > > > > > design considerations next and then propose a solution.  Feel free to
> > > > > > tell me I've got the constraints wrong, or suggest alternative solutions.
> > > > > > 
> > > > > > When we call ->readahead from the VFS, the VFS has decided which pages
> > > > > > are going to be the most useful to bring in, but it doesn't know how
> > > > > > pages are bundled together into blocks.  As I've learned from talking to
> > > > > > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> > > > > > something we can teach the VFS.
> > > > > > 
> > > > > > We (David) added readahead_expand() recently to let the filesystem
> > > > > > opportunistically add pages to the page cache "around" the area requested
> > > > > > by the VFS.  That reduces the number of times the filesystem has to
> > > > > > decompress the same block.  But it can fail (due to memory allocation
> > > > > > failures or pages already being present in the cache).  So filesystems
> > > > > > still have to implement some kind of fallback.
> > > > > 
> > > > > Wouldn't it be better to keep the *compressed* data in the cache and
> > > > > decompress it multiple times if needed rather than decompress it once
> > > > > and cache the decompressed data?  You would use more CPU time
> > > > > decompressing multiple times, but be able to cache more data and avoid
> > > > > more disk IO, which is generally far slower than the CPU can decompress
> > > > > the data.
> > > > 
> > > > Well, one of the problems with keeping compressed data is that for mmap(2)
> > > > you have to have pages decompressed so that CPU can access them. So keeping
> > > > compressed data in the page cache would add a bunch of complexity. That
> > > > being said keeping compressed data cached somewhere else than in the page
> > > > cache may certainly me worth it and then just filling page cache on demand
> > > > from this data...
> > > 
> > > It can be cached with a special internal inode, so no need to take
> > > care of the memory reclaim or migration by yourself.
> > 
> > There is another problem for btrfs (and maybe other fses).
> > 
> > Btrfs can refer to part of the uncompressed data extent.
> > 
> > Thus we could have the following cases:
> > 
> > 	0	4K	8K	12K
> > 	|	|	|	|
> > 		    |	    \- 4K of an 128K compressed extent,
> > 		    |		compressed size is 16K
> > 		    \- 4K of an 64K compressed extent,
> > 			compressed size is 8K
> 
> Thanks for this, but the diagram is broken on my side.
> Could you help fix this?

Ok, I understand it. I think here is really a strategy problem
out of CoW, since only 2*4K is needed, you could
 1) cache the whole compressed extent and hope they can be accessed
    later, so no I/O later at all;
 2) don't cache such incomplete compressed extents;
 3) add some trace record and do some finer strategy.

> 
> > 
> > In above case, we really only need 8K for page cache, but if we're
> > caching the compressed extent, it will take extra 24K.
> 
> Apart from that, with my wild guess, could we cache whatever the
> real I/O is rather than the whole compressed extent unconditionally?
> If the whole compressed extent is needed then, we could check if
> it's all available in cache, or read the rest instead?
> 
> Also, I think no need to cache uncompressed COW data...
> 
> Thanks,
> Gao Xiang
> 
> > 
> > It's definitely not really worthy for this particular corner case.
> > 
> > But it would be pretty common for btrfs, as CoW on compressed data can
> > easily lead to above cases.
> > 
> > Thanks,
> > Qu
> > 
> > > 
> > > Otherwise, these all need to be take care of. For fixed-sized input
> > > compression, since they are reclaimed in page unit, so it won't be
> > > quite friendly since such data is all coupling. But for fixed-sized
> > > output compression, it's quite natural.
> > > 
> > > Thanks,
> > > Gao Xiang
> > > 
> > > > 
> > > > 								Honza
> > > > --
> > > > Jan Kara <jack@suse.com>
> > > > SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  9:54           ` Gao Xiang
@ 2021-10-22 10:40             ` Qu Wenruo
  0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2021-10-22 10:40 UTC (permalink / raw)
  To: Jan Kara, Phillip Susi, linux-ntfs-dev, Matthew Wilcox,
	David Howells, linux-bcache, Hsin-Yi Wang, linux-fsdevel,
	Phillip Lougher, ntfs3, linux-erofs, linux-btrfs



On 2021/10/22 17:54, Gao Xiang wrote:
> On Fri, Oct 22, 2021 at 05:39:29PM +0800, Gao Xiang wrote:
>> Hi Qu,
>>
>> On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/10/22 17:11, Gao Xiang wrote:
>>>> On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
>>>>> On Thu 21-10-21 21:04:45, Phillip Susi wrote:
>>>>>>
>>>>>> Matthew Wilcox <willy@infradead.org> writes:
>>>>>>
>>>>>>> As far as I can tell, the following filesystems support compressed data:
>>>>>>>
>>>>>>> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
>>>>>>>
>>>>>>> I'd like to make it easier and more efficient for filesystems to
>>>>>>> implement compressed data.  There are a lot of approaches in use today,
>>>>>>> but none of them seem quite right to me.  I'm going to lay out a few
>>>>>>> design considerations next and then propose a solution.  Feel free to
>>>>>>> tell me I've got the constraints wrong, or suggest alternative solutions.
>>>>>>>
>>>>>>> When we call ->readahead from the VFS, the VFS has decided which pages
>>>>>>> are going to be the most useful to bring in, but it doesn't know how
>>>>>>> pages are bundled together into blocks.  As I've learned from talking to
>>>>>>> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
>>>>>>> something we can teach the VFS.
>>>>>>>
>>>>>>> We (David) added readahead_expand() recently to let the filesystem
>>>>>>> opportunistically add pages to the page cache "around" the area requested
>>>>>>> by the VFS.  That reduces the number of times the filesystem has to
>>>>>>> decompress the same block.  But it can fail (due to memory allocation
>>>>>>> failures or pages already being present in the cache).  So filesystems
>>>>>>> still have to implement some kind of fallback.
>>>>>>
>>>>>> Wouldn't it be better to keep the *compressed* data in the cache and
>>>>>> decompress it multiple times if needed rather than decompress it once
>>>>>> and cache the decompressed data?  You would use more CPU time
>>>>>> decompressing multiple times, but be able to cache more data and avoid
>>>>>> more disk IO, which is generally far slower than the CPU can decompress
>>>>>> the data.
>>>>>
>>>>> Well, one of the problems with keeping compressed data is that for mmap(2)
>>>>> you have to have pages decompressed so that CPU can access them. So keeping
>>>>> compressed data in the page cache would add a bunch of complexity. That
>>>>> being said keeping compressed data cached somewhere else than in the page
>>>>> cache may certainly me worth it and then just filling page cache on demand
>>>>> from this data...
>>>>
>>>> It can be cached with a special internal inode, so no need to take
>>>> care of the memory reclaim or migration by yourself.
>>>
>>> There is another problem for btrfs (and maybe other fses).
>>>
>>> Btrfs can refer to part of the uncompressed data extent.
>>>
>>> Thus we could have the following cases:
>>>
>>> 	0	4K	8K	12K
>>> 	|	|	|	|
>>> 		    |	    \- 4K of an 128K compressed extent,
>>> 		    |		compressed size is 16K
>>> 		    \- 4K of an 64K compressed extent,
>>> 			compressed size is 8K
>>
>> Thanks for this, but the diagram is broken on my side.
>> Could you help fix this?
>
> Ok, I understand it. I think here is really a strategy problem
> out of CoW, since only 2*4K is needed, you could
>   1) cache the whole compressed extent and hope they can be accessed
>      later, so no I/O later at all;
>   2) don't cache such incomplete compressed extents;
>   3) add some trace record and do some finer strategy.

Yeah, this should be determined by each fs, like whether they want to
cache compressed extent at all, and at which condition to cache
compressed extent.

(e.g. for btrfs, if we find the range we want is even smaller than the
compressed size, we can skip such cache)

Thus I don't think there would be a silver bullet for such case.

>
>>
>>>
>>> In above case, we really only need 8K for page cache, but if we're
>>> caching the compressed extent, it will take extra 24K.
>>
>> Apart from that, with my wild guess, could we cache whatever the
>> real I/O is rather than the whole compressed extent unconditionally?
>> If the whole compressed extent is needed then, we could check if
>> it's all available in cache, or read the rest instead?
>>
>> Also, I think no need to cache uncompressed COW data...

Yeah, that's definitely the case, as page cache is already doing the
work for us.

Thanks,
Qu

>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>> It's definitely not really worthy for this particular corner case.
>>>
>>> But it would be pretty common for btrfs, as CoW on compressed data can
>>> easily lead to above cases.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> Otherwise, these all need to be take care of. For fixed-sized input
>>>> compression, since they are reclaimed in page unit, so it won't be
>>>> quite friendly since such data is all coupling. But for fixed-sized
>>>> output compression, it's quite natural.
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>>
>>>>>
>>>>> 								Honza
>>>>> --
>>>>> Jan Kara <jack@suse.com>
>>>>> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-22  8:41   ` Jan Kara
  2021-10-22  9:11     ` Gao Xiang
@ 2021-10-25 18:59     ` Phillip Susi
  1 sibling, 0 replies; 16+ messages in thread
From: Phillip Susi @ 2021-10-25 18:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-fsdevel, Phillip Lougher, linux-erofs,
	linux-btrfs, linux-ntfs-dev, ntfs3, linux-bcache, David Howells,
	Hsin-Yi Wang


Jan Kara <jack@suse.cz> writes:

> Well, one of the problems with keeping compressed data is that for mmap(2)
> you have to have pages decompressed so that CPU can access them. So keeping
> compressed data in the page cache would add a bunch of complexity. That
> being said keeping compressed data cached somewhere else than in the page
> cache may certainly me worth it and then just filling page cache on demand
> from this data...

True... Did that multi generational LRU cache ever get merged?  I was
thinking you could use that to make sure that the kernel prefers to
reclaim the decompressed pages in favor of keeping the compressed ones
around.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-21 20:17 Readahead for compressed data Matthew Wilcox
  2021-10-22  0:22 ` Gao Xiang
  2021-10-22  1:04 ` Phillip Susi
@ 2021-10-22  4:36 ` Phillip Lougher
  2021-10-29  6:15 ` Coly Li
  3 siblings, 0 replies; 16+ messages in thread
From: Phillip Lougher @ 2021-10-22  4:36 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel, Jan Kara, linux-erofs, linux-btrfs,
	linux-ntfs-dev, ntfs3, linux-bcache
  Cc: David Howells, Hsin-Yi Wang

On 21/10/2021 21:17, Matthew Wilcox wrote:
 > As far as I can tell, the following filesystems support compressed data:
 >
 > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
 >
 > I'd like to make it easier and more efficient for filesystems to
 > implement compressed data.  There are a lot of approaches in use today,
 > but none of them seem quite right to me.  I'm going to lay out a few
 > design considerations next and then propose a solution.  Feel free to
 > tell me I've got the constraints wrong, or suggest alternative solutions.
 >
 > When we call ->readahead from the VFS, the VFS has decided which pages
 > are going to be the most useful to bring in, but it doesn't know how
 > pages are bundled together into blocks.  As I've learned from talking to
 > Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
 > something we can teach the VFS.
 >
 > We (David) added readahead_expand() recently to let the filesystem
 > opportunistically add pages to the page cache "around" the area requested
 > by the VFS.  That reduces the number of times the filesystem has to
 > decompress the same block.  But it can fail (due to memory allocation
 > failures or pages already being present in the cache).  So filesystems
 > still have to implement some kind of fallback.
 >
 > For many (all?) compression algorithms (all?) the data must be mapped at
 > all times.  Calling kmap() and kunmap() would be an intolerable overhead.
 > At the same time, we cannot write to a page in the page cache which is
 > marked Uptodate.  It might be mapped into userspace, or a read() be in
 > progress against it.  For writable filesystems, it might even be dirty!
 > As far as I know, no compression algorithm supports "holes", implying
 > that we must allocate memory which is then discarded.
 >
 > To me, this calls for a vmap() based approach.  So I'm thinking
 > something like ...
 >
 > void *readahead_get_block(struct readahead_control *ractl, loff_t start,
 > 			size_t len);
 > void readahead_put_block(struct readahead_control *ractl, void *addr,
 > 			bool success);
 >
 > Once you've figured out which bytes this encrypted block expands to, you
 > call readahead_get_block(), specifying the offset in the file and length
 > and get back a pointer.  When you're done decompressing that block of
 > the file, you get rid of it again.

Hi Matthew,

I quite like this new interface.  It should be fairly straight-forward
to make Squashfs use this interface, and it will simplify some of the
code, and make some of the decompressors more efficient.

As I see it, it removes many of the hoops that Squashfs has to go
through to push the additional uncompressed data into the page cache.

It is also a generic solution, and one which doesn't rely on a 
particular decompressor API or way of working, which means it shouldn't 
break any of the existing decompressor usage in the kernel.

The one issue with this generic solution is I fear a lot of people
will complain it is too generic, and prevents some optimisations
which they could have made on their particular decompressor or
filesystem.  The issue that it requires temporary pages to be
allocated upfront (if the page cannot be added to the page cache) has
already been brought up.

At this point I will try to play devil's advocate.  What is the
alternative to passing back a vmapped area of memory to the
filesystem?  The obvious alternative is to pass back an array
of pointers to the individual page structures, or a NULL pointer
representing a hole where the page could not be allocated in the
cache.

This alternative interface has the advantage that a NULL pointer is
passed representing a hole, rather than temporary memory being allocated
upfront.  It is then up to the filesystem and/or decompressor to
deal with the NULL pointer hole which may avoid the use of so
much temporary memory.

But the fact is in the kernel there are many different decompressors
with different APIs and different ways of working.  There are some
(like zlib, zstd, xz) which are "multi-shot" and some (like lzo, lz4)
which are "single-shot".

Multi-shot decompressors allow themselves to be called with only a small
output buffer.  Once the output buffer runs out, they exit and ask to be
called again with another output buffer.  Squashfs uses that to pass in
the set of discontiguous PAGE sized buffers allocated from the page
cache.  Obviously, if Squashfs got a NULL pointer hole, it could switch
to using a single temporary 4K buffer at that point.

But single-shot decompressors expect to be called once, with a single
contiguous output buffer.  They don't work with a set of discontiguous 
PAGE sized buffers.  Due to this Squashfs has to use a contiguous 
"bounce buffer" which the decompressor outputs to, and then copy it to 
the page cache buffers.

The vmap based interface proposed is a generic interface, it works with
both "multi-shot" and "single-shot" decompressors, because it presents
a single contiguous output buffer, and avoids making the filesystem
work with page structures.  There is a lot to like about this approach.

Avoiding using page structures also ties in with some of the other
work Matthew has been doing to clean up the kernel's over reliance
on page structures.  This is something which I am in agreement.

Phillip

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Readahead for compressed data
  2021-10-21 20:17 Readahead for compressed data Matthew Wilcox
                   ` (2 preceding siblings ...)
  2021-10-22  4:36 ` Phillip Lougher
@ 2021-10-29  6:15 ` Coly Li
  3 siblings, 0 replies; 16+ messages in thread
From: Coly Li @ 2021-10-29  6:15 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel, Jan Kara, Phillip Lougher,
	linux-erofs, linux-btrfs, linux-ntfs-dev, ntfs3, linux-bcache,
	Hsin-Yi Wang, David Howells
  Cc: linux-bcachefs, Kent Overstreet

On 10/22/21 4:17 AM, Matthew Wilcox wrote:
> As far as I can tell, the following filesystems support compressed data:
>
> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs

Hi Matthew,

There is a new bcachefs mailing list linux-bcachefs@vger.kernel.org for 
bcachefs. I add it in Cc in this reply email.

Just FYI for you and other receivers.

Thanks.

Coly Li


>
> I'd like to make it easier and more efficient for filesystems to
> implement compressed data.  There are a lot of approaches in use today,
> but none of them seem quite right to me.  I'm going to lay out a few
> design considerations next and then propose a solution.  Feel free to
> tell me I've got the constraints wrong, or suggest alternative solutions.
>
> When we call ->readahead from the VFS, the VFS has decided which pages
> are going to be the most useful to bring in, but it doesn't know how
> pages are bundled together into blocks.  As I've learned from talking to
> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> something we can teach the VFS.
>
> We (David) added readahead_expand() recently to let the filesystem
> opportunistically add pages to the page cache "around" the area requested
> by the VFS.  That reduces the number of times the filesystem has to
> decompress the same block.  But it can fail (due to memory allocation
> failures or pages already being present in the cache).  So filesystems
> still have to implement some kind of fallback.
>
> For many (all?) compression algorithms (all?) the data must be mapped at
> all times.  Calling kmap() and kunmap() would be an intolerable overhead.
> At the same time, we cannot write to a page in the page cache which is
> marked Uptodate.  It might be mapped into userspace, or a read() be in
> progress against it.  For writable filesystems, it might even be dirty!
> As far as I know, no compression algorithm supports "holes", implying
> that we must allocate memory which is then discarded.
>
> To me, this calls for a vmap() based approach.  So I'm thinking
> something like ...
>
> void *readahead_get_block(struct readahead_control *ractl, loff_t start,
> 			size_t len);
> void readahead_put_block(struct readahead_control *ractl, void *addr,
> 			bool success);
>
> Once you've figured out which bytes this encrypted block expands to, you
> call readahead_get_block(), specifying the offset in the file and length
> and get back a pointer.  When you're done decompressing that block of
> the file, you get rid of it again.
>
> It's the job of readahead_get_block() to allocate additional pages
> into the page cache or temporary pages.  readahead_put_block() will
> mark page cache pages as Uptodate if 'success' is true, and unlock
> them.  It'll free any temporary pages.
>
> Thoughts?  Anyone want to be the guinea pig?  ;-)


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-10-29  6:15 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-10-21 20:17 Readahead for compressed data Matthew Wilcox
2021-10-22  0:22 ` Gao Xiang
2021-10-22  1:04 ` Phillip Susi
2021-10-22  1:28   ` Gao Xiang
2021-10-22  1:39     ` Gao Xiang
2021-10-22  2:09   ` Phillip Lougher
2021-10-22  2:31     ` Gao Xiang
2021-10-22  8:41   ` Jan Kara
2021-10-22  9:11     ` Gao Xiang
2021-10-22  9:22       ` Qu Wenruo
2021-10-22  9:39         ` Gao Xiang
2021-10-22  9:54           ` Gao Xiang
2021-10-22 10:40             ` Qu Wenruo
2021-10-25 18:59     ` Phillip Susi
2021-10-22  4:36 ` Phillip Lougher
2021-10-29  6:15 ` Coly Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).