linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
@ 2024-01-16 10:50 Christian Brauner
  2024-01-16 11:45 ` Jan Kara
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Christian Brauner @ 2024-01-16 10:50 UTC (permalink / raw)
  To: lsf-pc
  Cc: Christian Brauner, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig

Hey,

I'm not sure this even needs a full LSFMM discussion but since I
currently don't have time to work on the patch I may as well submit it.

Gnome was recently awarded 1M Euro by the Sovereign Tech Fund (STF). The
STF was created by the German government to fund public infrastructure:

"The Sovereign Tech Fund supports the development, improvement and
 maintenance of open digital infrastructure. Our goal is to sustainably
 strengthen the open source ecosystem. We focus on security, resilience,
 technological diversity, and the people behind the code." (cf. [1])

Gnome has proposed various specific projects including integrating
systemd-homed with Gnome. Systemd-homed provides various features and if
you're interested in details then you might find it useful to read [2].
It makes use of various new VFS and fs specific developments over the
last years.

One feature is encrypting the home directory via LUKS. An appropriate
image or device must contain a GPT partition table. Currently there's
only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
a Linux filesystem. Currently supported are btrfs (see [4] though),
ext4, and xfs.

The following issue isn't specific to systemd-homed. Gnome wants to be
able to support locking encrypted home directories. For example, when
the laptop is suspended. To do this the luksSuspend command can be used.

The luksSuspend call is nothing else than a device mapper ioctl to
suspend the block device and its owning superblock/filesystem. Which in
turn is nothing but a freeze initiated from the block layer:

dm_suspend()
-> __dm_suspend()
   -> lock_fs()
      -> bdev_freeze()

So when we say luksSuspend we really mean block layer initiated freeze.
The overall goal or expectation of userspace is that after a luksSuspend
call all sensitive material has been evicted from relevant caches to
harden against various attacks. And luksSuspend does wipe the encryption
key and suspend the block device. However, the encryption key can still
be available clear-text in the page cache. To illustrate this problem
more simply:

truncate -s 500M /tmp/img
echo password | cryptsetup luksFormat /tmp/img --force-password
echo password | cryptsetup open /tmp/img test
mkfs.xfs /dev/mapper/test
mount /dev/mapper/test /mnt
echo "secrets" > /mnt/data
cryptsetup luksSuspend test
cat /mnt/data

This will still happily print the contents of /mnt/data even though the
block device and the owning filesystem are frozen because the data is
still in the page cache.

To my knowledge, the only current way to get the contents of /mnt/data
or the encryption key out of the page cache is via
/proc/sys/vm/drop_caches which is a big hammer.
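For reference, the big hammer boils down to something like this (a
minimal sketch; the write requires CAP_SYS_ADMIN and affects every
filesystem on the system):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

        if (fd < 0)
                return 1;
        /* "1" drops the page cache, "2" drops slab (dentries/inodes),
         * "3" drops both */
        write(fd, "3", 1);
        close(fd);
        return 0;
}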

My initial reaction is to give userspace an API to drop the page cache
of a specific filesystem which may have additional uses. I initially had
started drafting an ioctl() and then got swayed towards a
posix_fadvise() flag. I found out that this was already proposed a few
years ago but got rejected as it was suspected this might just be
someone toying around without a real world use-case. I think this here
might qualify as a real-world use-case.
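To illustrate, a rough sketch of what I mean from the userspace side.
POSIX_FADV_FS_DONTNEED is a made-up flag name used purely for
illustration - no such flag exists today:

#include <fcntl.h>

/* hypothetical flag value, for illustration only */
#define POSIX_FADV_FS_DONTNEED 42

int drop_fs_cache(const char *mntpoint)
{
        int fd = open(mntpoint, O_RDONLY | O_DIRECTORY);

        if (fd < 0)
                return -1;
        /* proposed: best-effort eviction for the whole filesystem,
         * generalizing the existing per-file POSIX_FADV_DONTNEED */
        return posix_fadvise(fd, 0, 0, POSIX_FADV_FS_DONTNEED);
}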

This may at least help secure users with a regular dm-crypt setup
where dm-crypt is the top layer. Users that stack additional layers on
top of dm-crypt may still leak plaintext of course if they introduce
additional caching. But that's on them.

Of course other ideas welcome.

[1]: https://www.sovereigntechfund.de/en
[2]: https://systemd.io/HOME_DIRECTORY
[3]: https://lore.kernel.org/linux-btrfs/20230908-merklich-bebauen-11914a630db4@brauner/
[4]: A bdev_freeze() call ideally does the following:

     (1) Freeze the block device @bdev
     (2) Find the owning superblock of the block device @bdev and freeze the
         filesystem as well.

     Especially (2) wasn't true for a long time. A filesystem could only be
     frozen via its main block device. For example, an xfs filesystem using
     an external log device could not be frozen if the block layer request
     came via the external log device. This has been fixed since v6.8 for
     all filesystems using appropriate holder operations.

     Except for btrfs, where block device initiated freezes don't work at all,
     not even for the main block device. I pointed this out months ago in [3].

     Which is why we currently can't use btrfs with LUKS2 encryption, as a
     luksSuspend call will leave the filesystem unfrozen.
[5]: https://gitlab.com/cryptsetup/cryptsetup/-/issues/855
     https://gitlab.gnome.org/Teams/STF/homed/-/issues/23


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
@ 2024-01-16 11:45 ` Jan Kara
  2024-01-17 12:53   ` Christian Brauner
  2024-01-16 15:25 ` James Bottomley
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-01-16 11:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig

On Tue 16-01-24 11:50:32, Christian Brauner wrote:

<snip the usecase details>

> My initial reaction is to give userspace an API to drop the page cache
> of a specific filesystem which may have additional uses. I initially had
> started drafting an ioctl() and then got swayed towards a
> posix_fadvise() flag. I found out that this was already proposed a few
> years ago but got rejected as it was suspected this might just be
> someone toying around without a real world use-case. I think this here
> might qualify as a real-world use-case.
> 
> This may at least help secure users with a regular dm-crypt setup
> where dm-crypt is the top layer. Users that stack additional layers on
> top of dm-crypt may still leak plaintext of course if they introduce
> additional caching. But that's on them.

Well, your usecase has one substantial difference from drop_caches. You
actually *require* pages to be evicted from the page cache for security
purposes. And giving any kind of guarantees is going to be tough. Think for
example when someone grabs page cache folio reference through vmsplice(2),
then you initiate your dmSuspend and want to evict page cache. What are you
going to do? You cannot free the folio while the refcount is elevated, you
could possibly detach it from the page cache so it at least isn't visible
but that has side effects too - after you resume the folio would remain
detached so it will not see changes happening to the file anymore. So IMHO
the only thing you could do without problematic side-effects is report
error. Which would be user unfriendly and could be actually surprisingly
frequent due to transient folio references taken by various code paths.
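For concreteness, a minimal userspace sketch of that scenario (error
handling omitted; /mnt/data as in your example) - a MAP_SHARED file
mapping vmspliced into a pipe leaves the pipe buffers holding
references to the file's page cache folios:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/data", O_RDONLY);
        /* the shared mapping is backed directly by page cache folios */
        char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        struct iovec iov = { .iov_base = map, .iov_len = 4096 };
        int pipefd[2];

        pipe(pipefd);
        /* the pipe now references the mapped folios; until it is drained
         * their refcounts stay elevated and eviction cannot free them */
        vmsplice(pipefd[1], &iov, 1, 0);
        pause();
        return 0;
}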

Sure we could report an error only if the page has its pincount elevated,
not just its refcount, but it needs some serious thinking about how this
would interact.

Also what is going to be the interaction with mlock(2)?

Overall this doesn't seem like "just tweak drop_caches a bit" kind of
work...

								Honza


-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
  2024-01-16 11:45 ` Jan Kara
@ 2024-01-16 15:25 ` James Bottomley
  2024-01-16 15:40   ` Matthew Wilcox
  2024-01-16 20:56 ` Dave Chinner
  2024-02-17  4:04 ` Kent Overstreet
  3 siblings, 1 reply; 27+ messages in thread
From: James Bottomley @ 2024-01-16 15:25 UTC (permalink / raw)
  To: Christian Brauner, lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox,
	Jan Kara, Christoph Hellwig

On Tue, 2024-01-16 at 11:50 +0100, Christian Brauner wrote:
> So when we say luksSuspend we really mean block layer initiated
> freeze. The overall goal or expectation of userspace is that after a
> luksSuspend call all sensitive material has been evicted from
> relevant caches to harden against various attacks. And luksSuspend
> does wipe the encryption key and suspend the block device. However,
> the encryption key can still be available clear-text in the page
> cache. To illustrate this problem more simply:
> 
> truncate -s 500M /tmp/img
> echo password | cryptsetup luksFormat /tmp/img --force-password
> echo password | cryptsetup open /tmp/img test
> mkfs.xfs /dev/mapper/test
> mount /dev/mapper/test /mnt
> echo "secrets" > /mnt/data
> cryptsetup luksSuspend test
> cat /mnt/data

Not really anything to do with the drop caches problem, but luks can
use the kernel keyring API for this.  That should ensure the key itself
can be shredded on suspend without replication anywhere in memory.  Of
course the real problem is likely that the key has or is derived from a
password and that password is in the user space gnome-keyring, which
will be much harder to purge ... although if the keyring were using
secret memory it would be way easier ...
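For the key material half, a minimal sketch of the secret memory idea
(assumes a kernel with CONFIG_SECRETMEM, v5.14+, and headers that define
SYS_memfd_secret; glibc has no wrapper, hence the raw syscall):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        /* secretmem pages are removed from the kernel's direct map */
        int fd = syscall(SYS_memfd_secret, 0);

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;
        char *key = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);
        /* ... keep the key here; wipe it explicitly before suspend ... */
        memset(key, 0, len);
        munmap(key, len);
        close(fd);
        return 0;
}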

So perhaps before we start bending the kernel out of shape in the name
of security, we should also ensure that the various user space
components are secured first.  The most important thing to get right
first is key management (lose the key and someone who can steal the
encrypted data can access it).  Then you can worry about data leaks due
to the cache, which are somewhat harder to exploit easily (to exploit
this you have to get into the cache in the first place, which is
harder).

> This will still happily print the contents of /mnt/data even though
> the block device and the owning filesystem are frozen because the
> data is still in the page cache.
> 
> To my knowledge, the only current way to get the contents of
> /mnt/data or the encryption key out of the page cache is via
> /proc/sys/vm/drop_caches which is a big hammer.

To be honest, why is this too big a hammer?  Secret data could be
sprayed all over the cache, so killing all of it (assuming we can as
Jan points out) would be a security benefit.  I'm sure people would be
willing to pay the additional start up time of an entirely empty cache
on resume in exchange for the nicely evaluateable security guarantee it
gives.  In other words, dropping caches by device is harder to analyse
from security terms (because now you have to figure out where secret
data is and which caches you need to drop) and it's not clear it really
has much advantage in terms of faster resume for the complexity it
would introduce.

James



* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 15:25 ` James Bottomley
@ 2024-01-16 15:40   ` Matthew Wilcox
  2024-01-16 15:54     ` James Bottomley
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2024-01-16 15:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Jan Kara, Christoph Hellwig

On Tue, Jan 16, 2024 at 10:25:20AM -0500, James Bottomley wrote:
> On Tue, 2024-01-16 at 11:50 +0100, Christian Brauner wrote:
> > So when we say luksSuspend we really mean block layer initiated
> > freeze. The overall goal or expectation of userspace is that after a
> > luksSuspend call all sensitive material has been evicted from
> > relevant caches to harden against various attacks. And luksSuspend
> > does wipe the encryption key and suspend the block device. However,
> > the encryption key can still be available clear-text in the page
> > cache. To illustrate this problem more simply:
> > 
> > truncate -s 500M /tmp/img
> > echo password | cryptsetup luksFormat /tmp/img --force-password
> > echo password | cryptsetup open /tmp/img test
> > mkfs.xfs /dev/mapper/test
> > mount /dev/mapper/test /mnt
> > echo "secrets" > /mnt/data
> > cryptsetup luksSuspend test
> > cat /mnt/data
> 
> Not really anything to do with the drop caches problem, but luks can
> use the kernel keyring API for this.  That should ensure the key itself
> can be shredded on suspend without replication anywhere in memory.  Of
> course the real problem is likely that the key has or is derived from a
> password and that password is in the user space gnome-keyring, which
> will be much harder to purge ... although if the keyring were using
> secret memory it would be way easier ...

I think you've misunderstood the problem.  Let's try it again.

add-password-to-kernel-keyring
create-encrypted-volume-using-password
write-detailed-confession-to-encrypted-volume
suspend-volume
delete-password-from-kernel-keyring
cat-volume reveals the detailed confession

ie the page cache contains the decrypted data, even though what's on
disc is encrypted.  Nothing to do with key management.

Yes, there are various things we can do that will prevent the page
cache from being dropped, but I strongly suggest _not_ registering your
detailed confession with an RDMA card.  A 99% solution is better than
a 0% solution.

The tricky part, I think, is that the page cache is not indexed physically
but virtually.  We need each inode on the suspended volume to drop
its cache.  Dropping the cache of just the bdev is going to hide the
directory structure, inode tables, etc, but the real privacy gains are
to be had from dropping file contents.
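The kernel already has the per-superblock walk this implies; a sketch in
the spirit of drop_pagecache_sb() in fs/drop_caches.c (inode state
checks and locking elided):

#include <linux/fs.h>
#include <linux/pagemap.h>

/* best effort: walk every inode of one superblock and invalidate its
 * page cache; folios that are busy (dirty, locked, mapped) are skipped */
static void drop_pagecache_one_sb(struct super_block *sb)
{
        struct inode *inode;

        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                if (inode->i_mapping->nrpages == 0)
                        continue;
                invalidate_mapping_pages(inode->i_mapping, 0, -1);
        }
}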


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 15:40   ` Matthew Wilcox
@ 2024-01-16 15:54     ` James Bottomley
  0 siblings, 0 replies; 27+ messages in thread
From: James Bottomley @ 2024-01-16 15:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Jan Kara, Christoph Hellwig

On Tue, 2024-01-16 at 15:40 +0000, Matthew Wilcox wrote:
> On Tue, Jan 16, 2024 at 10:25:20AM -0500, James Bottomley wrote:
> > On Tue, 2024-01-16 at 11:50 +0100, Christian Brauner wrote:
> > > So when we say luksSuspend we really mean block layer initiated
> > > freeze. The overall goal or expectation of userspace is that
> > > after a luksSuspend call all sensitive material has been evicted
> > > from relevant caches to harden against various attacks. And
> > > luksSuspend does wipe the encryption key and suspend the block
> > > device. However, the encryption key can still be available clear-
> > > text in the page cache. To illustrate this problem more simply:
> > > 
> > > truncate -s 500M /tmp/img
> > > echo password | cryptsetup luksFormat /tmp/img --force-password
> > > echo password | cryptsetup open /tmp/img test
> > > mkfs.xfs /dev/mapper/test
> > > mount /dev/mapper/test /mnt
> > > echo "secrets" > /mnt/data
> > > cryptsetup luksSuspend test
> > > cat /mnt/data
> > 
> > Not really anything to do with the drop caches problem, but luks
> > can use the kernel keyring API for this.  That should ensure the
> > key itself can be shredded on suspend without replication anywhere
> > in memory.  Of course the real problem is likely that the key has
> > or is derived from a password and that password is in the user
> > space gnome-keyring, which will be much harder to purge ...
> > although if the keyring were using secret memory it would be way
> > easier ...
> 
> I think you've misunderstood the problem.  Let's try it again.
> 
> add-password-to-kernel-keyring
> create-encrypted-volume-using-password
> write-detailed-confession-to-encrypted-volume
> suspend-volume
> delete-password-from-kernel-keyring
> cat-volume reveals the detailed confession
> 
> ie the page cache contains the decrypted data, even though what's on
> disc is encrypted.  Nothing to do with key management.

No I didn't; you cut the bit where I referred to that in the second
half of my email, which you don't quote.

But my point is that caching key material is by far the biggest
security problem because if that happens and it can be recovered, every
secret on the disk is toast.  Caching clear pages from the disk is a
problem, but it's way less severe than caching key material, so making
sure the former is solved should be priority number one (because in
security you start with the biggest exposure first).

I then went on to say that for the second problem, I think making drop
all caches actually do that has the best security properties rather
than segmented cache dropping.

James



* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
  2024-01-16 11:45 ` Jan Kara
  2024-01-16 15:25 ` James Bottomley
@ 2024-01-16 20:56 ` Dave Chinner
  2024-01-17  6:17   ` Theodore Ts'o
  2024-01-17 13:19   ` Christian Brauner
  2024-02-17  4:04 ` Kent Overstreet
  3 siblings, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2024-01-16 20:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig

On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote:
> Hey,
> 
> I'm not sure this even needs a full LSFMM discussion but since I
> currently don't have time to work on the patch I may as well submit it.
> 
> Gnome was recently awarded 1M Euro by the Sovereign Tech Fund (STF). The
> STF was created by the German government to fund public infrastructure:
> 
> "The Sovereign Tech Fund supports the development, improvement and
>  maintenance of open digital infrastructure. Our goal is to sustainably
>  strengthen the open source ecosystem. We focus on security, resilience,
>  technological diversity, and the people behind the code." (cf. [1])
> 
> Gnome has proposed various specific projects including integrating
> systemd-homed with Gnome. Systemd-homed provides various features and if
> you're interested in details then you might find it useful to read [2].
> It makes use of various new VFS and fs specific developments over the
> last years.
> 
> One feature is encrypting the home directory via LUKS. An appropriate
> image or device must contain a GPT partition table. Currently there's
> only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
> a Linux filesystem. Currently supported are btrfs (see [4] though),
> ext4, and xfs.
> 
> The following issue isn't specific to systemd-homed. Gnome wants to be
> able to support locking encrypted home directories. For example, when
> the laptop is suspended. To do this the luksSuspend command can be used.
> 
> The luksSuspend call is nothing else than a device mapper ioctl to
> suspend the block device and its owning superblock/filesystem. Which in
> turn is nothing but a freeze initiated from the block layer:
> 
> dm_suspend()
> -> __dm_suspend()
>    -> lock_fs()
>       -> bdev_freeze()
> 
> So when we say luksSuspend we really mean block layer initiated freeze.
> The overall goal or expectation of userspace is that after a luksSuspend
> call all sensitive material has been evicted from relevant caches to
> harden against various attacks. And luksSuspend does wipe the encryption
> key and suspend the block device. However, the encryption key can still
> be available clear-text in the page cache.

The wiping of secrets is completely orthogonal to the freezing of
the device and filesystem - the freeze does not need to occur to
allow the encryption keys and decrypted data to be purged. They
should not be conflated; purging needs to be a completely separate
operation that can be run regardless of device/fs freeze status.

FWIW, focussing on purging the page cache omits the fact that
having access to the directory structure is a problem - one can
still retrieve other user information that is stored in metadata
(e.g. xattrs) that isn't part of the page cache. Even the directory
structure that is cached in dentries could reveal secrets someone
wants to keep hidden (e.g code names for operations/products).

So if we want luksSuspend to actually protect user information when
it runs, then it effectively needs to bring the filesystem right
back to its "just mounted" state where the only thing in memory is
the root directory dentry and inode and nothing else.

And, of course, this is largely impossible to do because anything
with an open file on the filesystem will prevent this robust cache
purge from occurring....

Which brings us back to "best effort" only, and at this point we
already have drop-caches....

Mind you, I do wonder if drop caches is fast enough for this sort of
use case. It is single threaded, and if the filesystem/system has
millions of cached inodes it can take minutes to run. Unmount has
the same problem - purging large dentry/inode caches takes a *lot*
of CPU time and these operations are single threaded.

So it may not be practical in the luks context to purge caches; e.g.
suspending a laptop shouldn't take minutes. However laptops are
getting to the hundreds of GB of RAM these days and so they can
cache millions of inodes, so cache purge runtime is definitely a
consideration here.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 20:56 ` Dave Chinner
@ 2024-01-17  6:17   ` Theodore Ts'o
  2024-01-30  1:14     ` Adrian Vovk
  2024-01-17 13:19   ` Christian Brauner
  1 sibling, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2024-01-17  6:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig

On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote:
> 
> The wiping of secrets is completely orthogonal to the freezing of
> the device and filesystem - the freeze does not need to occur to
> allow the encryption keys and decrypted data to be purged. They
> should not be conflated; purging needs to be a completely separate
> operation that can be run regardless of device/fs freeze status.
> 
> FWIW, focussing on purging the page cache omits the fact that
> having access to the directory structure is a problem - one can
> still retrieve other user information that is stored in metadata
> (e.g. xattrs) that isn't part of the page cache. Even the directory
> structure that is cached in dentries could reveal secrets someone
> wants to keep hidden (e.g code names for operations/products).

Yeah, I think we need to really revisit the implicit requirements
which were made upfront about wanting to protect against the page
cache being exposed.

What is the threat model that you are trying to protect against?  If
the attacker has access to the memory of the suspended processor, then
the number of things you need to protect against becomes *vast*.  For one
thing, if you're going to blow away the LUKS encryption on suspend,
then during the resume process, *before* you allow general user
processes to start running again (when they might try to read from the
file system whose encryption key is no longer available, and thus will
be treated to EIO errors), you're going to have to request that user
to provide the encryption key, either directly or indirectly.

And if the attacker has access to the suspended memory, is it
read-only access, or can the attacker modify the memory image to
include a trojan that records the encryption key once it is demanded of
the user, and then mails it off to Moscow or Beijing or Fort Meade?

To address the whole set of problems, the answer might lie in something
like confidential compute, where all of the memory is encrypted.  Now
you don't need to worry about wiping the page cache, since it's all
encrypted.  Of course, you still need to solve the problem of how to
re-establish the confidential compute keys after they have been wiped
as part of the suspend, but you needed to solve that with the LUKS key
anyway.

This also addresses Dave's concern that it might not be practical to
drop all of the caches if there are millions of cached inodes and
cached pages that all need to be dropped at suspend time.


Another potential approach is a bit more targeted, which is to mark
certain files as containing keying information, so the system can
focus on making sure those pages are wiped at suspend time.  It still
has issues, such as how the desire to wipe them from the memory at
suspend time interacts with mlock(), which is often done by programs
to prevent them from getting written to swap.  And of course, we still
need to worry about what to do if the file is pinned because it's
being accessed by RDMA or by sendfile(2) --- but perhaps a keyfile has
no business being accessed via RDMA or blasted out (unencrypted!)
at high speed to a network connection via sendfile(2) --- and so
perhaps those sorts of things should be disallowed if the file is
marked as "this file contains secret keys --- treat it specially".
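(For reference, the mlock(2) pattern referred to above; a sketch of how
programs typically pin key material today:)

#include <string.h>
#include <sys/mman.h>

/* lock key material so it cannot be written to swap; wipe it before
 * unlocking (explicit_bzero() where available, to survive optimization) */
int hold_key(unsigned char *key, size_t len)
{
        if (mlock(key, len) < 0)
                return -1;
        /* ... use the key ... */
        memset(key, 0, len);
        return munlock(key, len);
}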

	 		    	      	 - Ted


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 11:45 ` Jan Kara
@ 2024-01-17 12:53   ` Christian Brauner
  2024-01-17 14:35     ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Christian Brauner @ 2024-01-17 12:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Christoph Hellwig

On Tue, Jan 16, 2024 at 12:45:19PM +0100, Jan Kara wrote:
> On Tue 16-01-24 11:50:32, Christian Brauner wrote:
> 
> <snip the usecase details>
> 
> > My initial reaction is to give userspace an API to drop the page cache
> > of a specific filesystem which may have additional uses. I initially had
> > started drafting an ioctl() and then got swayed towards a
> > posix_fadvise() flag. I found out that this was already proposed a few
> > years ago but got rejected as it was suspected this might just be
> > someone toying around without a real world use-case. I think this here
> > might qualify as a real-world use-case.
> > 
> > This may at least help secure users with a regular dm-crypt setup
> > where dm-crypt is the top layer. Users that stack additional layers on
> > top of dm-crypt may still leak plaintext of course if they introduce
> > additional caching. But that's on them.
> 
> Well, your usecase has one substantial difference from drop_caches. You
> actually *require* pages to be evicted from the page cache for security
> purposes. And giving any kind of guarantees is going to be tough. Think for
> example when someone grabs page cache folio reference through vmsplice(2),
> then you initiate your dmSuspend and want to evict page cache. What are you
> going to do? You cannot free the folio while the refcount is elevated, you
> could possibly detach it from the page cache so it at least isn't visible
> but that has side effects too - after you resume the folio would remain
> detached so it will not see changes happening to the file anymore. So IMHO
> the only thing you could do without problematic side-effects is report
> error. Which would be user unfriendly and could be actually surprisingly
> frequent due to transient folio references taken by various code paths.

I wonder though: if you start suspending userspace and the filesystem,
how likely are you to encounter these transient errors?

> 
> > Sure we could report an error only if the page has its pincount elevated,
> > not just its refcount, but it needs some serious thinking about how this
> > would interact.
> 
> Also what is going to be the interaction with mlock(2)?
> 
> Overall this doesn't seem like "just tweak drop_caches a bit" kind of
> work...

So when I talked to the Gnome people they were interested in a
best-effort solution, not necessarily an optimal one. So returning an
error might actually be useful.

I specifically put this here because my knowledge of the page cache
isn't sufficient to make a judgement about what guarantees are and
aren't feasible. So I'm grateful for any insight here.


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 20:56 ` Dave Chinner
  2024-01-17  6:17   ` Theodore Ts'o
@ 2024-01-17 13:19   ` Christian Brauner
  2024-01-17 22:26     ` Dave Chinner
  2024-02-05 17:39     ` Russell Haley
  1 sibling, 2 replies; 27+ messages in thread
From: Christian Brauner @ 2024-01-17 13:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk

On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote:
> On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote:
> > Hey,
> > 
> > I'm not sure this even needs a full LSFMM discussion but since I
> > currently don't have time to work on the patch I may as well submit it.
> > 
> > Gnome was recently awarded 1M Euro by the Sovereign Tech Fund (STF). The
> > STF was created by the German government to fund public infrastructure:
> > 
> > "The Sovereign Tech Fund supports the development, improvement and
> >  maintenance of open digital infrastructure. Our goal is to sustainably
> >  strengthen the open source ecosystem. We focus on security, resilience,
> >  technological diversity, and the people behind the code." (cf. [1])
> > 
> > Gnome has proposed various specific projects including integrating
> > systemd-homed with Gnome. Systemd-homed provides various features and if
> > you're interested in details then you might find it useful to read [2].
> > It makes use of various new VFS and fs specific developments over the
> > last years.
> > 
> > One feature is encrypting the home directory via LUKS. An appropriate
> > image or device must contain a GPT partition table. Currently there's
> > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
> > a Linux filesystem. Currently supported are btrfs (see [4] though),
> > ext4, and xfs.
> > 
> > The following issue isn't specific to systemd-homed. Gnome wants to be
> > able to support locking encrypted home directories. For example, when
> > the laptop is suspended. To do this the luksSuspend command can be used.
> > 
> > The luksSuspend call is nothing else than a device mapper ioctl to
> > suspend the block device and its owning superblock/filesystem. Which in
> > turn is nothing but a freeze initiated from the block layer:
> > 
> > dm_suspend()
> > -> __dm_suspend()
> >    -> lock_fs()
> >       -> bdev_freeze()
> > 
> > So when we say luksSuspend we really mean block layer initiated freeze.
> > The overall goal or expectation of userspace is that after a luksSuspend
> > call all sensitive material has been evicted from relevant caches to
> > harden against various attacks. And luksSuspend does wipe the encryption
> > key and suspend the block device. However, the encryption key can still
> > be available clear-text in the page cache.
> 
> The wiping of secrets is completely orthogonal to the freezing of
> the device and filesystem - the freeze does not need to occur to
> allow the encryption keys and decrypted data to be purged. They
> should not be conflated; purging needs to be a completely separate
> operation that can be run regardless of device/fs freeze status.

Yes, I'm aware. I didn't mean to imply that these things are in any way
necessarily connected. Just that there are use-cases where they are. And
the encrypted home directory case is one. Once one has frozen the block
device and filesystem, one would now also like to drop the page cache,
which holds most of the interesting data.

The fact that after a block layer initiated freeze - again mostly a
device mapper problem - one may or may not be able to successfully read
from the filesystem is annoying. Of course one can't write, that will
hang one immediately. But if one still has some data in the page cache
one can still dump the contents of that file. That's at least odd
behavior from a user's POV even if for us it's clear why that's the
case.

And a freeze does do a sync_filesystem() and a sync_blockdev() to flush
out any dirty data for that specific filesystem. So it would be fitting
to give users an api that allows them to also drop the page cache
contents.

For some use-cases like the Gnome use-case one wants to do a freeze and
drop everything that one can from the page cache for that specific
filesystem.

And drop_caches is a big hammer simply because there are workloads where
that isn't feasible. Even on a modern boring laptop system one may have
lots of services. On a large scale system one may have thousands of
services and they may all use separate images (and the border between
isolated services and containers is fuzzy at best). And here invoking
drop_caches penalizes every service.

One may want to drop the contents of _some_ services but not all of
them. Especially during suspend where one cares about dropping the page
cache of the home directory that gets suspended - encrypted or
unencrypted.

Ignoring the security aspect itself: once one has frozen the block
device and the owning filesystem, one may want to go and drop the
page cache as well without impacting every other filesystem on the
system. Which may be thousands. One doesn't want to penalize them all.

Ignoring the specific use-case, I know that David has been interested in
a way to drop the page cache for afs. So this is not just for the home
directory case. I mostly wanted to make it clear that there are users of
an interface like this; even if it were just best effort.

> 
> FWIW, focussing on purging the page cache omits the fact that
> having access to the directory structure is a problem - one can
> still retrieve other user information that is stored in metadata
> (e.g. xattrs) that isn't part of the page cache. Even the directory
> structure that is cached in dentries could reveal secrets someone
> wants to keep hidden (e.g code names for operations/products).

Yes, of course but that's fine. The most sensitive data and the biggest
chunks of data will be the contents of files. We don't necessarily need
to cater to the paranoid with this.

> 
> So if we want luksSuspend to actually protect user information when
> it runs, then it effectively needs to bring the filesystem right
> back to its "just mounted" state where the only thing in memory is
> the root directory dentry and inode and nothing else.

Yes, which we know isn't feasible.

> 
> And, of course, this is largely impossible to do because anything
> with an open file on the filesystem will prevent this robust cache
> purge from occurring....
> 
> Which brings us back to "best effort" only, and at this point we
> already have drop-caches....
> 
> Mind you, I do wonder if drop caches is fast enough for this sort of
> use case. It is single threaded, and if the filesystem/system has
> millions of cached inodes it can take minutes to run. Unmount has
> the same problem - purging large dentry/inode caches takes a *lot*
> of CPU time and these operations are single threaded.
> 
> So it may not be practical in the luks context to purge caches; e.g.
> suspending a laptop shouldn't take minutes. However laptops are
> getting to the hundreds of GB of RAM these days and so they can
> cache millions of inodes, so cache purge runtime is definitely a
> consideration here.

I'm really trying to look for a practical API that doesn't require users
to drop the caches for every mounted image on the system.

FYI, I've tried to get some users to reply here so they could speak to
the fact that they don't expect this to be an optimal solution, but none
of them know how to reply to lore mboxes, so I can only relay the
information.


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 12:53   ` Christian Brauner
@ 2024-01-17 14:35     ` Jan Kara
  2024-01-17 14:52       ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-01-17 14:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Matthew Wilcox, Christoph Hellwig

On Wed 17-01-24 13:53:20, Christian Brauner wrote:
> On Tue, Jan 16, 2024 at 12:45:19PM +0100, Jan Kara wrote:
> > On Tue 16-01-24 11:50:32, Christian Brauner wrote:
> > 
> > <snip the usecase details>
> > 
> > > My initial reaction is to give userspace an API to drop the page cache
> > > of a specific filesystem which may have additional uses. I initially had
> > > started drafting an ioctl() and then got swayed towards a
> > > posix_fadvise() flag. I found out that this was already proposed a few
> > > years ago but got rejected as it was suspected this might just be
> > > someone toying around without a real world use-case. I think this here
> > > might qualify as a real-world use-case.
> > > 
> > > This may at least help secure users with a regular dm-crypt setup
> > > where dm-crypt is the top layer. Users that stack additional layers on
> > > top of dm-crypt may still leak plaintext of course if they introduce
> > > additional caching. But that's on them.
> > 
> > Well, your usecase has one substantial difference from drop_caches. You
> > actually *require* pages to be evicted from the page cache for security
> > purposes. And giving any kind of guarantees is going to be tough. Think for
> > example when someone grabs page cache folio reference through vmsplice(2),
> > then you initiate your dmSuspend and want to evict page cache. What are you
> > going to do? You cannot free the folio while the refcount is elevated, you
> > could possibly detach it from the page cache so it at least isn't visible
> > but that has side effects too - after you resume the folio would remain
> > detached so it will not see changes happening to the file anymore. So IMHO
> > the only thing you could do without problematic side-effects is report
> > error. Which would be user unfriendly and could be actually surprisingly
> > frequent due to transient folio references taken by various code paths.
> 
> I wonder though, if you start suspending userspace and the filesystem
> how likely are you to encounter these transient errors?

Yeah, my expectation is it should not be frequent in that case. But there
could be surprises there - e.g. pages mapping running executable code are
practically unevictable. Userspace should be mostly sleeping so there
shouldn't be many but there would be some so in the worst case that could
result in always returning error from the page cache eviction which would
not be very useful.

> > Sure we could report an error only if the page has its pincount elevated,
> > not just its refcount, but it needs some serious thinking about how this
> > would interact.
> > 
> > Also what is going to be the interaction with mlock(2)?
> > 
> > Overall this doesn't seem like "just tweak drop_caches a bit" kind of
> > work...
> 
> So when I talked to the Gnome people they were interested in an optimal
> or a best-effort solution. So returning an error might actually be useful.

OK. So could we then define the effect of your desired call as calling
posix_fadvise(..., POSIX_FADV_DONTNEED) for every file? This is kind of
best-effort eviction which is reasonably well understood by everybody.
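For illustration, that definition can even be approximated from
userspace today by walking the mount; a best-effort sketch where
FTW_MOUNT keeps the walk on a single filesystem and files that fail to
open are simply skipped:

#define _GNU_SOURCE
#include <fcntl.h>
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>

static int drop_one(const char *path, const struct stat *sb,
                    int type, struct FTW *ftwbuf)
{
        if (type == FTW_F) {
                int fd = open(path, O_RDONLY | O_NOFOLLOW);

                if (fd >= 0) {
                        /* best-effort: drop this file's cached pages */
                        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
                        close(fd);
                }
        }
        return 0;       /* keep walking */
}

int main(int argc, char **argv)
{
        if (argc != 2)
                return 1;
        return nftw(argv[1], drop_one, 64, FTW_PHYS | FTW_MOUNT);
}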

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 14:35     ` Jan Kara
@ 2024-01-17 14:52       ` Matthew Wilcox
  2024-01-17 20:51         ` Phillip Susi
                           ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Matthew Wilcox @ 2024-01-17 14:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Christoph Hellwig

On Wed, Jan 17, 2024 at 03:35:28PM +0100, Jan Kara wrote:
> OK. So could we then define the effect of your desired call as calling
> posix_fadvise(..., POSIX_FADV_DONTNEED) for every file? This is kind of
> best-effort eviction which is reasonably well understood by everybody.

I feel like we're in an XY trap [1].  What Christian actually wants is
to not be able to access the contents of a file while the device it's
on is suspended, and we've gone from there to "must drop the page cache".

We have numerous ways to intercept file reads and make them either
block or fail.  The obvious one to me is security_file_permission()
called from rw_verify_area().  Can we do everything we need with an LSM?

[1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
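As a sketch of the shape only: the "lsm/file_permission" hook and the
BPF_PROG macro are real BPF LSM plumbing, while the suspended_devs map
and whoever populates it are hypothetical:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* hypothetical map: userspace adds the s_dev of any filesystem whose
 * backing device is suspended, and removes it again on resume */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 64);
        __type(key, u32);
        __type(value, u8);
} suspended_devs SEC(".maps");

SEC("lsm/file_permission")
int BPF_PROG(deny_suspended_io, struct file *file, int mask)
{
        u32 dev = file->f_inode->i_sb->s_dev;

        if (bpf_map_lookup_elem(&suspended_devs, &dev))
                return -EPERM;  /* fail read()/write() while suspended */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";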


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 14:52       ` Matthew Wilcox
@ 2024-01-17 20:51         ` Phillip Susi
  2024-01-17 20:58           ` Matthew Wilcox
  2024-01-18 14:26         ` Christian Brauner
  2024-01-30  0:13         ` Adrian Vovk
  2 siblings, 1 reply; 27+ messages in thread
From: Phillip Susi @ 2024-01-17 20:51 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Christoph Hellwig

Matthew Wilcox <willy@infradead.org> writes:

> We have numerous ways to intercept file reads and make them either
> block or fail.  The obvious one to me is security_file_permission()
> called from rw_verify_area().  Can we do everything we need with an LSM?

I like the idea.  That runs when someone opens a file, right?  What about
if they already had the file open or mapped before the volume was
locked?  If not, is that OK?  Are we just trying to deny open requests
of files while the volume is locked?

Is that in addition to, or instead of throwing out the key and
suspending IO at the block layer?  If it is in addition, then that would
mean that trying to open a file would fail cleanly, but accessing a page
that is already mapped could hang the task.  In an unkillable state.
For a long time.  Even the OOM killer can't kill a task blocked like
that, can it?  Or did that get fixed at some point?



* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 20:51         ` Phillip Susi
@ 2024-01-17 20:58           ` Matthew Wilcox
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2024-01-17 20:58 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Jan Kara, Christian Brauner, lsf-pc, linux-fsdevel, linux-mm,
	linux-btrfs, linux-block, Christoph Hellwig

On Wed, Jan 17, 2024 at 03:51:37PM -0500, Phillip Susi wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > We have numerous ways to intercept file reads and make them either
> > block or fail.  The obvious one to me is security_file_permission()
> > called from rw_verify_area().  Can we do everything we need with an LSM?
> 
> I like the idea.  That runs when someone opens a file right?  What about

Every read() and write() call goes through there, e.g. ksys_read ->
vfs_read -> rw_verify_area -> security_file_permission

It wouldn't cover mmap accesses.  So if you had the file mmapped
before suspend, you'd still be able to load from the mmap.  There's
no security_ hook for that right now, afaik.

> Is that in addition to, or instead of throwing out the key and
> suspending IO at the block layer?  If it is in addition, then that would
> mean that trying to open a file would fail cleanly, but accessing a page
> that is already mapped could hang the task.  In an unkillable state.
> For a long time.  Even the OOM killer can't kill a task blocked like
> that can it?  Or did that get fixed at some point?

TASK_KILLABLE was added in 2008, but it's up to each individual call
site whether to use killable or uninterruptible sleep.



* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 13:19   ` Christian Brauner
@ 2024-01-17 22:26     ` Dave Chinner
  2024-01-18 14:09       ` Christian Brauner
  2024-02-05 17:39     ` Russell Haley
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2024-01-17 22:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk

On Wed, Jan 17, 2024 at 02:19:43PM +0100, Christian Brauner wrote:
> On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote:
> > On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote:
> > > Hey,
> > > 
> > > I'm not sure this even needs a full LSFMM discussion but since I
> > > currently don't have time to work on the patch I may as well submit it.
> > > 
> > > Gnome was recently awarded 1M Euro by the Sovereign Tech Fund (STF). The
> > > STF was created by the German government to fund public infrastructure:
> > > 
> > > "The Sovereign Tech Fund supports the development, improvement and
> > >  maintenance of open digital infrastructure. Our goal is to sustainably
> > >  strengthen the open source ecosystem. We focus on security, resilience,
> > >  technological diversity, and the people behind the code." (cf. [1])
> > > 
> > > Gnome has proposed various specific projects including integrating
> > > systemd-homed with Gnome. Systemd-homed provides various features and if
> > > you're interested in details then you might find it useful to read [2].
> > > It makes use of various new VFS and fs specific developments over the
> > > last years.
> > > 
> > > One feature is encrypting the home directory via LUKS. An appropriate
> > > image or device must contain a GPT partition table. Currently there's
> > > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
> > > a Linux filesystem. Currently supported are btrfs (see [4] though),
> > > ext4, and xfs.
> > > 
> > > The following issue isn't specific to systemd-homed. Gnome wants to be
> > > able to support locking encrypted home directories. For example, when
> > > the laptop is suspended. To do this the luksSuspend command can be used.
> > > 
> > > The luksSuspend call is nothing else than a device mapper ioctl to
> > > suspend the block device and its owning superblock/filesystem. Which in
> > > turn is nothing but a freeze initiated from the block layer:
> > > 
> > > dm_suspend()
> > > -> __dm_suspend()
> > >    -> lock_fs()
> > >       -> bdev_freeze()
> > > 
> > > So when we say luksSuspend we really mean block layer initiated freeze.
> > > The overall goal or expectation of userspace is that after a luksSuspend
> > > call all sensitive material has been evicted from relevant caches to
> > > harden against various attacks. And luksSuspend does wipe the encryption
> > > key and suspend the block device. However, the encryption key can still
> > > be available clear-text in the page cache.
> > 
> > The wiping of secrets is completely orthogonal to the freezing of
> > the device and filesystem - the freeze does not need to occur to
> > allow the encryption keys and decrypted data to be purged. They
> > should not be conflated; purging needs to be a completely separate
> > operation that can be run regardless of device/fs freeze status.
> 
> Yes, I'm aware. I didn't mean to imply that these things are in any way
> necessarily connected. Just that there are use-cases where they are. And
> the encrypted home directory case is one. Once one has frozen the block
> device and filesystem, one would now also like to drop the page cache,
> which holds most of the interesting data.
> 
> The fact that after a block layer initiated freeze - again mostly a
> device mapper problem - one may or may not be able to successfully read
> from the filesystem is annoying. Of course one can't write, that will
> hang one immediately. But if one still has some data in the page cache
> one can still dump the contents of that file. That's at least odd
> behavior from a user's POV even if for us it's clear why that's the
> case.

A frozen filesystem doesn't prevent read operations from occurring.

> And a freeze does do a sync_filesystem() and a sync_blockdev() to flush
> out any dirty data for that specific filesystem.

Yes, it's required to do that - the whole point of freezing a
filesystem is to bring the filesystem into a *consistent physical
state on persistent storage* and to hold it in that state until it
is thawed.

> So it would be fitting
> to give users an api that allows them to also drop the page cache
> contents.

Not as part of a freeze operation.

Read operations have *always* been allowed from frozen filesystems;
they are intended to be allowed because one of the use cases for
freezing is to create a consistent filesystem state for backup of
the filesystem. That requires everything in the filesystem can be
read whilst it is frozen, and that means the page cache needs to
remain operational.

What the underlying device allows when it has been *suspended* is a
different issue altogether. The key observation here is that storage
device suspend != filesystem freeze and they can have very different
semantics depending on the operation being performed on the block
device while it is suspended.

IOWs, a device suspend implementation might freeze the filesystem to
bring the contents of the storage device, whilst frozen, into a
consistent, up-to-date state (e.g. for device level backups), but
block device level suspend does not *require* that the filesystem is
frozen whilst the device IO operations are suspended.

> For some use-cases like the Gnome use-case one wants to do a freeze and
> drop everything that one can from the page cache for that specific
> filesystem.

So they have to do an extra system call between FS_IOC_FREEZE and
FS_IOC_THAW. What's the problem with that? What are you trying to
optimise by colliding cache purging with FS_IOC_FREEZE?

If the user/application/infrastructure already has to iterate all
the mounted filesystems to freeze them, then it's trivial for them
to add a cache purging step to that infrastructure for the storage
configurations that might need it. I just don't see why this needs
to be part of a block device freeze operation, especially as the
"purge caches on this filesystem" operation has potential use cases
outside of the luksSuspend context....
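i.e. with the real FIFREEZE/FITHAW ioctls the shape is simply this, with
the purge call in the middle being the hypothetical new piece:

#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

int freeze_purge_thaw(const char *mntpoint)
{
        int fd = open(mntpoint, O_RDONLY);

        if (fd < 0)
                return -1;
        ioctl(fd, FIFREEZE, 0); /* flush dirty data, block new writes */
        /* ... hypothetical "purge caches on this filesystem" call ... */
        ioctl(fd, FITHAW, 0);
        close(fd);
        return 0;
}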

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 22:26     ` Dave Chinner
@ 2024-01-18 14:09       ` Christian Brauner
  0 siblings, 0 replies; 27+ messages in thread
From: Christian Brauner @ 2024-01-18 14:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk

> > The fact that after a block layer initiated freeze - again mostly a
> > device mapper problem - one may or may not be able to successfully read
> > from the filesystem is annoying. Of course one can't write, that will
> > hang one immediately. But if one still has some data in the page cache
> > one can still dump the contents of that file. That's at least odd
> > behavior from a user's POV even if for us it's clear why that's the
> > case.
> 
> A frozen filesystem doesn't prevent read operations from occurring.

Yes, that's what I was saying. I'm not disputing that.

> 
> > And a freeze does do a sync_filesystem() and a sync_blockdev() to flush
> > out any dirty data for that specific filesystem.
> 
> Yes, it's required to do that - the whole point of freezing a
> filesystem is to bring the filesystem into a *consistent physical
> state on persistent storage* and to hold it in that state until it
> is thawed.
> 
> > So it would be fitting
> > to give users an api that allows them to also drop the page cache
> > contents.
> 
> Not as part of a freeze operation.

Yes, that's why I'd like to have something separate, e.g. a flag for fadvise.

> > For some use-cases like the Gnome use-case one wants to do a freeze and
> > drop everything that one can from the page cache for that specific
> > filesystem.
> 
> So they have to do an extra system call between FS_IOC_FREEZE and
> FS_IOC_THAW. What's the problem with that? What are you trying to
> optimise by colliding cache purging with FS_IOC_FREEZE?
> 
> If the user/application/infrastructure already has to iterate all
> the mounted filesystems to freeze them, then it's trivial for them
> to add a cache purging step to that infrastructure for the storage
> configurations that might need it. I just don't see why this needs
> to be part of a block device freeze operation, especially as the
> "purge caches on this filesystem" operation has potential use cases
> outside of the luksSuspend context....

Ah, I'm sorry, I think we're accidentally talking past each other... I'm
_not_ trying to tie block layer freezing and cache purging. I'm trying
to expose something like:

posix_fadvise(fs_fd, [...], POSIX_FADV_FS_DONTNEED/DROP);

The Gnome people could then do:

cryptsetup luksSuspend
posix_fadvise(fs_fd, [...], POSIX_FADV_FS_DONTNEED/DROP);

as two separate operations.

Because the cache-dropping step is useful to other users as well,
completely independent of the block layer freeze that I used to
motivate this.


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 14:52       ` Matthew Wilcox
  2024-01-17 20:51         ` Phillip Susi
@ 2024-01-18 14:26         ` Christian Brauner
  2024-01-30  0:13         ` Adrian Vovk
  2 siblings, 0 replies; 27+ messages in thread
From: Christian Brauner @ 2024-01-18 14:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Christoph Hellwig

On Wed, Jan 17, 2024 at 02:52:32PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 17, 2024 at 03:35:28PM +0100, Jan Kara wrote:
> > OK. So could we then define the effect of your desired call as calling
> > posix_fadvise(..., POSIX_FADV_DONTNEED) for every file? This is kind of
> > best-effort eviction which is reasonably well understood by everybody.
> 
> I feel like we're in an XY trap [1].  What Christian actually wants is
> to not be able to access the contents of a file while the device it's
> on is suspended, and we've gone from there to "must drop the page cache".
> 
> We have numerous ways to intercept file reads and make them either
> block or fail.  The obvious one to me is security_file_permission()
> called from rw_verify_area().  Can we do everything we need with an LSM?
> 
> [1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

Nice idea and we do stuff like that in other scenarios such as [1] where
we care about preventing _writes_ from occurring while a specific service
hasn't been fully set up. So that has been going through my mind as
well. And the LSM approach might be complementary. For example, if
feasible, it could be activated _before_ the freeze operation, allowing
only the block layer initiated freeze. And then we can drop the page
cache.

But in this case the LSM approach isn't easily workable, nor does it by
itself solve the problem for Gnome. It would most likely force the usage
of a bpf LSM as
well. And the LSM would have to be activated when the filesystem is
frozen and then deactivated when it is unfrozen. I'm not even sure
that's currently easily doable.

But the Gnome use-case wants to be able to drop file contents before
they suspend the system. So the threat model is wider than just someone
being able to read contents on an active system. But it's best-effort
of course. So failing and reporting an error would be totally fine and
then policy could dictate whether to not even suspend. It actually might
help userspace in general.

The ability to drop the page cache of a specific filesystem is useful
independent of the Gnome use-case especially in systems with thousands
or ten-thousands of services that use separate filesystem images
something that's not uncommon.

[1]: https://github.com/systemd/systemd/blob/74e6a7d84a40de18bb3b18eeef6284f870f30a6e/src/nsresourced/bpf/userns_restrict/userns-restrict.bpf.c


* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 14:52       ` Matthew Wilcox
  2024-01-17 20:51         ` Phillip Susi
  2024-01-18 14:26         ` Christian Brauner
@ 2024-01-30  0:13         ` Adrian Vovk
  2024-02-15 13:57           ` Jan Kara
  2 siblings, 1 reply; 27+ messages in thread
From: Adrian Vovk @ 2024-01-30  0:13 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Christoph Hellwig

Hello! I'm the "GNOME people" who Christian is referring to.

On 1/17/24 09:52, Matthew Wilcox wrote:
> I feel like we're in an XY trap [1].  What Christian actually wants is
> to not be able to access the contents of a file while the device it's
> on is suspended, and we've gone from there to "must drop the page cache".

What we really want is for the plaintext contents of the files to be 
gone from memory while the dm-crypt device backing them is suspended.

Ultimately my goal is to limit the chance that an attacker with access
to a user's suspended laptop will be able to access the user's encrypted
data. I need to achieve this without forcing the user to completely log
out/power off/etc their system; it must be invisible to the user. The
key word here is limit; if we can remove _most_ files from memory _most_
of the time, I think luksSuspend would be a lot more useful against
cold-boot attacks than it is today.

I understand that perfectly wiping all the files out of memory without 
completely unmounting the filesystem isn't feasible, and that's probably 
OK for our use-case. As long as most files can be removed from memory 
most of the time, anyway...

> We have numerous ways to intercept file reads and make them either
> block or fail.  The obvious one to me is security_file_permission()
> called from rw_verify_area().  Can we do everything we need with an LSM?
>
> [1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

As Christian mentioned: the LSM may be a good addition, but it would
have to be in addition to wiping the data out of the page cache, not
instead of it. An LSM will not help against a cold-boot attack.

Adrian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17  6:17   ` Theodore Ts'o
@ 2024-01-30  1:14     ` Adrian Vovk
  0 siblings, 0 replies; 27+ messages in thread
From: Adrian Vovk @ 2024-01-30  1:14 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Chinner
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig

On 1/17/24 01:17, Theodore Ts'o wrote:
> What is the threat model that you are trying to protect against?  If
> the attacker has access to the memory of the suspended processor, then
> the number of things you need to protect against becomes *vast*.  For one
> thing, if you're going to blow away the LUKS encryption on suspend,
> then during the resume process, *before* you allow general user
> processes to start running again (when they might try to read from the
> file system whose encryption key is no longer available, and thus will
> be treated to EIO errors), you're going to have to request that user
> to provide the encryption key, either directly or indirectly.

The threat we have in mind is cold-boot attacks, the same threat that
dm-crypt protects against when it lets us wipe the LUKS volume key. We
want to limit the amount of plaintext user data an attacker can acquire
from a suspended system.

As I mention elsewhere in this thread, the key word for me is "limit".
I'm not expecting perfect security, but I'd like most plaintext file
contents to be removed from memory on suspend so that an attacker cannot
access the most recently used files. Ideally it would be "all" not
"most", of course, but I'll happily take what's feasible.

> And if the attacker has access to the suspended memory, is it
> read-only access, or can the attacker modify the memory image to
> include a trojan that records the encryption key once it is demanded
> of the user, and then mails it off to Moscow or Beijing or Fort Meade?

Yes, it's read-only access.

If the attacker has write access to the memory image while the system
is suspended, then it's complete game-over on all fronts. At that point
they can completely replace the kernel if they so choose. This is not
something I expect to be able to defend against outside of the solutions
you mention, but those are not feasible on commodity consumer hardware.
I'm looking to achieve the best we can with what we have. This is also
not an attack I've heard of in the wild against consumer hardware; I
know it's possible because I know people who've done it, but it takes
many weeks (at least) of research and effort to prepare for a given chip
- definitely not as easy as a cold-boot attack, which can take seconds
and works pretty universally.

> To address the whole set of problems, it might be that the answer
> lies in something like confidential compute, where all of the
> memory is encrypted.  Now you don't need to worry about wiping the page
> cache, since it's all encrypted.  Of course, you still need to solve
> the problem of how to re-establish the confidential compute keys after
> they have been wiped as part of the suspend, but you needed to solve
> that with the LUKS key anyway.

Without special hardware support you'll need to re-establish keys via
unencrypted software, and unencrypted software can be replaced by an
attacker if they're able to write to RAM. So it doesn't solve the
problem you bring up. But anyway, I feel this part of the discussion is
starting to border on the theoretical...

Though I suppose encrypting all the memory belonging to just the one
user with that user's LUKS volume key could be an alternative solution.
That way wiping out the key has the effect of "wiping out" all the
user's related memory, at least until we can re-authenticate and bring
it all back. But I suspect this would not only be extremely difficult to
implement in the kernel but would also have a huge performance cost
without special hardware.

> Another potential approach is a bit more targeted, which is to mark
> certain files as containing keying information, so the system can
> focus on making sure those pages are wiped at suspend time.  It still
> has issues, such as how the desire to wipe them from memory at
> suspend time interacts with mlock(), which is often done by programs
> to prevent them from getting written to swap.  And of course, we still
> need to worry about what to do if the file is pinned because it's
> being accessed by RDMA or by sendfile(2) --- but perhaps a keyfile has
> no business being accessed via RDMA or blasted out (unencrypted!)
> at high speed to a network connection via sendfile(2) --- and so
> perhaps those sorts of things should be disallowed if the file is
> marked as "this file contains secret keys --- treat it specially".

Secret keys are not necessarily what we're trying to protect here.
Random user documents are often sensitive. People store tax documents,
corporate secrets, or any number of other sensitive things on their
computers. If an attacker can perform a cold-boot attack on the device,
then depending on how recently those tax documents or corporate secrets
were accessed, they might just be in memory in plaintext, which is not
good. No amount of protecting the keys prevents this.

That said, having an extra security layer for secret keys would be
useful. There are definitely files that contain sensitive data, and it
would be useful to tell the kernel which files those are so that it can
treat them extra carefully in the ways you suggest. Maybe even avoid
putting them into the page cache in plaintext? But this would have to
be an extra step, since it's not feasible to make the user mark all the
files that they consider sensitive.

Adrian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 13:19   ` Christian Brauner
  2024-01-17 22:26     ` Dave Chinner
@ 2024-02-05 17:39     ` Russell Haley
  1 sibling, 0 replies; 27+ messages in thread
From: Russell Haley @ 2024-02-05 17:39 UTC (permalink / raw)
  To: Christian Brauner, Dave Chinner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk

On 1/17/24 07:19, Christian Brauner wrote:

> And drop_caches is a big hammer simply because there are workloads where
> that isn't feasible. Even on a modern boring laptop system one may have
> lots of services. On a large-scale system one may have thousands of
> services and they may all use separate images (and the border between
> isolated services and containers is fuzzy at best). And here invoking
> drop_caches penalizes every service.
> 
> One may want to drop the contents of _some_ services but not all of
> them. Especially during suspend where one cares about dropping the page
> cache of the home directory that gets suspended - encrypted or
> unencrypted.
> 
> Ignoring the security aspect itself: just from the fact that one froze
> the block device and the owning filesystem, one may want to go and drop
> the page cache as well without impacting every other filesystem on the
> system, of which there may be thousands. One doesn't want to penalize
> them all.

I'm not following the problem with dropping all the caches, at least for
the suspend use case rather than fast user switching. Suspend takes all
the services on the machine offline for hundreds of milliseconds
minimum.  If they don't hit the ground running... so what?

drop_caches=3 gets the metadata too, I think, which should protect the
directory structure.
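
For reference, writing "3" to /proc/sys/vm/drop_caches from C looks
something like this; it drops the page cache plus dentries and inodes
system-wide and needs CAP_SYS_ADMIN:

#include <fcntl.h>
#include <unistd.h>

int drop_all_caches(void)
{
	int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "3", 1);	/* "1": page cache, "2": slab, "3": both */
	close(fd);
	return n == 1 ? 0 : -1;
}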

>>
>> FWIW, focussing on purging the page cache omits the fact that
>> having access to the directory structure is a problem - one can
>> still retrieve other user information that is stored in metadata
>> (e.g. xattrs) that isn't part of the page cache. Even the directory
>> structure that is cached in dentries could reveal secrets someone
>> wants to keep hidden (e.g. code names for operations/products).
> 
> Yes, of course but that's fine. The most sensitive data and the biggest
> chunks of data will be the contents of files. We don't necessarily need
> to cater to the paranoid with this.
> 

If actual security is not required, maybe look into whatever Android is
doing? As far as I know it has a similar usage pattern and threat model
(Wi-Fi passwords, session cookies, and credit card numbers matter;
exposing high-entropy metadata that probably uniquely identifies files
to anyone who has seen the same data elsewhere is fine).

But then, perhaps what Android does is nothing, relying on locked
bootloaders and device-specific kernels to make booting into a generic
memory dumper sufficiently difficult.

>>
>> So if we want luksSuspend to actually protect user information when
>> it runs, then it effectively needs to bring the filesystem right
>> back to it's "just mounted" state where the only thing in memory is
>> the root directory dentry and inode and nothing else.
> 
> Yes, which we know isn't feasible.
> 
>>
>> And, of course, this is largely impossible to do because anything
>> with an open file on the filesystem will prevent this robust cache
>> purge from occurring....
>>
>> Which brings us back to "best effort" only, and at this point we
>> already have drop-caches....
>>
>> Mind you, I do wonder if drop caches is fast enough for this sort of
>> use case. It is single threaded, and if the filesystem/system has
>> millions of cached inodes it can take minutes to run. Unmount has
>> the same problem - purging large dentry/inode caches takes a *lot*
>> of CPU time and these operations are single threaded.
>>
>> So it may not be practical in the luks context to purge caches e.g.
>> suspending a laptop shouldn't take minutes. However laptops are
>> getting to the hundreds of GB of RAM these days and so they can
>> cache millions of inodes, so cache purge runtime is definitely a
>> consideration here.
> 
> I'm really trying to look for a practical API that doesn't require users
> to drop the caches for every mounted image on the system.
> 
> FYI, I've tried to get some users to reply here so they could speak to
> the fact that they don't expect this to be an optimal solution, but none
> of them know how to reply to lore mboxes, so I can only relay
> information.
> 

User replying here :-)

One possible alternative would be to use suspend-to-encrypted-swap
instead of suspend-to-RAM. It feels like it was left to rot as memory
sizes kept growing and disk speeds didn't, but that trend has reversed.
NVMe SSDs can write several GB/s if fed properly. And hasn't there been
a recent push for authenticated hibernation images?

That would also protect the application memory, which I think could be
quite sensitive because of session cookies, OAuth tokens, and the like.
I assume that a sophisticated adversary with access to a memory image of
my logged-in PC would be able to read my email and impersonate me for at
least a week.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-30  0:13         ` Adrian Vovk
@ 2024-02-15 13:57           ` Jan Kara
  2024-02-15 19:46             ` Adrian Vovk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-02-15 13:57 UTC (permalink / raw)
  To: Adrian Vovk
  Cc: Matthew Wilcox, Jan Kara, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Christoph Hellwig

On Mon 29-01-24 19:13:17, Adrian Vovk wrote:
> Hello! I'm the "GNOME people" who Christian is referring to

Got back to thinking about this after a while...

> On 1/17/24 09:52, Matthew Wilcox wrote:
> > I feel like we're in an XY trap [1].  What Christian actually wants is
> > to not be able to access the contents of a file while the device it's
> > on is suspended, and we've gone from there to "must drop the page cache".
> 
> What we really want is for the plaintext contents of the files to be gone
> from memory while the dm-crypt device backing them is suspended.
> 
> Ultimately my goal is to limit the chance that an attacker with access to a
> user's suspended laptop will be able to access the user's encrypted data. I
> need to achieve this without forcing the user to completely log out/power
> off/etc their system; it must be invisible to the user. The key word here is
> limit; if we can remove _most_ files from memory _most_ of the time I think
> luksSuspend would be a lot more useful against cold boot than it is today.

Well, but if your attack vector is cold-boot attacks, then how does
freeing pages from the page cache help you? I mean, sure, the page
allocator will start tracking those pages with potentially sensitive
content as free, but unless you also zero all of them, this doesn't help
at all against cold-boot attacks. The sensitive memory content is still
there...

So you would also have to enable something like zero-on-page-free, and
generally the cost of this is going to be pretty big?

> I understand that perfectly wiping all the files out of memory without
> completely unmounting the filesystem isn't feasible, and that's probably OK
> for our use-case. As long as most files can be removed from memory most of
> the time, anyway...

OK, understood. I guess in that case something like the BLKFLSBUF ioctl
on steroids (one that also evicts the filesystem's caches, not only the
block device's) could be useful for you.
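
For reference, a minimal sketch of how the existing ioctl is used today;
it flushes and invalidates only the block device's own cache:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKFLSBUF */

int flush_bdev(const char *dev)	/* e.g. "/dev/mapper/test" */
{
	int ret, fd = open(dev, O_RDONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;
	ret = ioctl(fd, BLKFLSBUF, 0);	/* needs CAP_SYS_ADMIN */
	close(fd);
	return ret;
}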

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-15 13:57           ` Jan Kara
@ 2024-02-15 19:46             ` Adrian Vovk
  2024-02-15 23:17               ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Adrian Vovk @ 2024-02-15 19:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, Christian Brauner, lsf-pc, linux-fsdevel,
	linux-mm, linux-btrfs, linux-block, Christoph Hellwig

On 2/15/24 08:57, Jan Kara wrote:
> On Mon 29-01-24 19:13:17, Adrian Vovk wrote:
>> Hello! I'm the "GNOME people" who Christian is referring to
> Got back to thinking about this after a while...
>
>> On 1/17/24 09:52, Matthew Wilcox wrote:
>>> I feel like we're in an XY trap [1].  What Christian actually wants is
>>> to not be able to access the contents of a file while the device it's
>>> on is suspended, and we've gone from there to "must drop the page cache".
>> What we really want is for the plaintext contents of the files to be gone
>> from memory while the dm-crypt device backing them is suspended.
>>
>> Ultimately my goal is to limit the chance that an attacker with access to a
>> user's suspended laptop will be able to access the user's encrypted data. I
>> need to achieve this without forcing the user to completely log out/power
>> off/etc their system; it must be invisible to the user. The key word here is
>> limit; if we can remove _most_ files from memory _most_ of the time I think
>> luksSuspend would be a lot more useful against cold boot than it is today.
> Well, but if your attack vector are cold-boot attacks, then how does
> freeing pages from the page cache help you? I mean sure the page allocator
> will start tracking those pages with potentially sensitive content as free
> but unless you also zero all of them, this doesn't help anything against
> cold-boot attacks? The sensitive memory content is still there...
>
> So you would also have to enable something like zero-on-page-free and
> generally the cost of this is going to be pretty big?

Yes, you are right. Just marking pages as free isn't enough.

I'm sure it's reasonable enough to zero out the pages that are getting
freed at our request. But the difficulty here is trying to clear pages
that were freed previously for other reasons, unless we're zeroing out
all pages on free. So I suppose that leaves me with a couple of
questions:

- As far as I know, the kernel only naturally frees pages from the page
cache when they're about to be given to some program for imminent use.
But in that case the page isn't only freed, but also zeroed out before
it's handed over to the program (because giving a program access to a
page filled with potentially sensitive data would be a bad idea!). Is
this correct?

- Are there other situations (aside from drop_caches) where the kernel
frees pages from the page cache, especially without having to zero them
anyway? In other words, in what situations would turning on some
zero-pages-on-free setting actually hurt performance?

- Does unmounting a filesystem completely zero out the removed fs's
pages from the page cache?

- I remember hearing somewhere of some Linux support for zeroing out all
pages in memory when they're freed from the page cache. However, I spent
a while trying to find this (how to turn it on, benchmarks) and I
couldn't find it. Do you know if such a thing exists, and if so how to
turn it on? I'm curious about its actual performance impact.

>> I understand that perfectly wiping all the files out of memory without
>> completely unmounting the filesystem isn't feasible, and that's probably OK
>> for our use-case. As long as most files can be removed from memory most of
>> the time, anyway...
> OK, understood. I guess in that case something like BLKFLSBUF ioctl on
> steroids (to also evict filesystem caches, not only the block device) could
> be useful for you.
>
> 								Honza

Best,

Adrian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-15 19:46             ` Adrian Vovk
@ 2024-02-15 23:17               ` Dave Chinner
  2024-02-16  1:14                 ` Adrian Vovk
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2024-02-15 23:17 UTC (permalink / raw)
  To: Adrian Vovk
  Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Christoph Hellwig

On Thu, Feb 15, 2024 at 02:46:52PM -0500, Adrian Vovk wrote:
> On 2/15/24 08:57, Jan Kara wrote:
> > On Mon 29-01-24 19:13:17, Adrian Vovk wrote:
> > > Hello! I'm the "GNOME people" who Christian is referring to
> > Got back to thinking about this after a while...
> > 
> > > On 1/17/24 09:52, Matthew Wilcox wrote:
> > > > I feel like we're in an XY trap [1].  What Christian actually wants is
> > > > to not be able to access the contents of a file while the device it's
> > > > on is suspended, and we've gone from there to "must drop the page cache".
> > > What we really want is for the plaintext contents of the files to be gone
> > > from memory while the dm-crypt device backing them is suspended.
> > > 
> > > Ultimately my goal is to limit the chance that an attacker with access to a
> > > user's suspended laptop will be able to access the user's encrypted data. I
> > > need to achieve this without forcing the user to completely log out/power
> > > off/etc their system; it must be invisible to the user. The key word here is
> > > limit; if we can remove _most_ files from memory _most_ of the time I think
> > > luksSuspend would be a lot more useful against cold boot than it is today.
> > Well, but if your attack vector are cold-boot attacks, then how does
> > freeing pages from the page cache help you? I mean sure the page allocator
> > will start tracking those pages with potentially sensitive content as free
> > but unless you also zero all of them, this doesn't help anything against
> > cold-boot attacks? The sensitive memory content is still there...
> > 
> > So you would also have to enable something like zero-on-page-free and
> > generally the cost of this is going to be pretty big?
> 
> Yes you are right. Just marking pages as free isn't enough.
> 
> I'm sure it's reasonable enough to zero out the pages that are getting
> free'd at our request. But the difficulty here is to try and clear pages
> that were freed previously for other reasons, unless we're zeroing out all
> pages on free. So I suppose that leaves me with a couple questions:
> 
> - As far as I know, the kernel only naturally frees pages from the page
> cache when they're about to be given to some program for imminent use.

Memory pressure does cause cache reclaim: not just page cache, but
also slab caches and anything else various subsystems can clean up
to free memory.

> But
> then in the case the page isn't only free'd, but also zero'd out before it's
> handed over to the program (because giving a program access to a page filled
> with potentially sensitive data is a bad idea!). Is this correct?

Memory exposed to userspace is zeroed before userspace can access
it.  Kernel memory is not zeroed unless the caller specifically asks
for it to be zeroed.

> - Are there other situations (aside from drop_caches) where the kernel frees
> pages from the page cache? Especially without having to zero them anyway? In

truncate(), fallocate(), direct IO, fadvise(), madvise(), etc. IOWs,
there are lots of runtime vectors that cause page cache to be freed.
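
For example, the best-effort per-file eviction userspace can already
request today (dirty pages survive unless they are written back first):

#include <fcntl.h>
#include <unistd.h>

int evict_file(int fd)
{
	if (fsync(fd))		/* write back dirty pages first */
		return -1;
	/* offset 0, len 0 means "to the end of the file" */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}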

> other words, what situations would turning on some zero-pages-on-free
> setting actually hurt performance?

Lots.  Page contents are typically cold when the page is freed, so
the zeroing is typically memory latency and bandwidth bound. And
doing it on free means there isn't any of the "cache priming"
performance benefit that we get with zeroing at allocation, because
the page contents are not going to be immediately accessed by the
kernel or userspace.

> - Does unmounting a filesystem completely zero out the removed fs's pages
> from the page cache?

No. It just frees them. No explicit zeroing.

> - I remember hearing somewhere of some Linux support for zeroing out all
> pages in memory if they're free'd from the page cache. However, I spent a
> while trying to find this (how to turn it on, benchmarks) and I couldn't
> find it. Do you know if such a thing exists, and if so how to turn it on?
> I'm curious of the actual performance impact of it.

You can test it for yourself: the init_on_free kernel command line
option controls whether the kernel zeroes on free.

Typical distro configuration is: 

$ sudo dmesg |grep auto-init
[    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
$

So this kernel zeroes all stack memory, page and heap memory on
allocation, and does nothing on free...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-15 23:17               ` Dave Chinner
@ 2024-02-16  1:14                 ` Adrian Vovk
  2024-02-16 20:38                   ` init_on_alloc digression: " John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: Adrian Vovk @ 2024-02-16  1:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Christoph Hellwig

On 2/15/24 18:17, Dave Chinner wrote:
> On Thu, Feb 15, 2024 at 02:46:52PM -0500, Adrian Vovk wrote:
>> On 2/15/24 08:57, Jan Kara wrote:
>>> On Mon 29-01-24 19:13:17, Adrian Vovk wrote:
>>>> Hello! I'm the "GNOME people" who Christian is referring to
>>> Got back to thinking about this after a while...
>>>
>>>> On 1/17/24 09:52, Matthew Wilcox wrote:
>>>>> I feel like we're in an XY trap [1].  What Christian actually wants is
>>>>> to not be able to access the contents of a file while the device it's
>>>>> on is suspended, and we've gone from there to "must drop the page cache".
>>>> What we really want is for the plaintext contents of the files to be gone
>>>> from memory while the dm-crypt device backing them is suspended.
>>>>
>>>> Ultimately my goal is to limit the chance that an attacker with access to a
>>>> user's suspended laptop will be able to access the user's encrypted data. I
>>>> need to achieve this without forcing the user to completely log out/power
>>>> off/etc their system; it must be invisible to the user. The key word here is
>>>> limit; if we can remove _most_ files from memory _most_ of the time I think
>>>> luksSuspend would be a lot more useful against cold boot than it is today.
>>> Well, but if your attack vector are cold-boot attacks, then how does
>>> freeing pages from the page cache help you? I mean sure the page allocator
>>> will start tracking those pages with potentially sensitive content as free
>>> but unless you also zero all of them, this doesn't help anything against
>>> cold-boot attacks? The sensitive memory content is still there...
>>>
>>> So you would also have to enable something like zero-on-page-free and
>>> generally the cost of this is going to be pretty big?
>> Yes you are right. Just marking pages as free isn't enough.
>>
>> I'm sure it's reasonable enough to zero out the pages that are getting
>> free'd at our request. But the difficulty here is to try and clear pages
>> that were freed previously for other reasons, unless we're zeroing out all
>> pages on free. So I suppose that leaves me with a couple questions:
>>
>> - As far as I know, the kernel only naturally frees pages from the page
>> cache when they're about to be given to some program for imminent use.
> Memory pressure does cause cache reclaim. Not just page cache, but
> also slab caches and anything else various subsystems can clean up
> to free memory..
>
>> But
>> then in the case the page isn't only free'd, but also zero'd out before it's
>> handed over to the program (because giving a program access to a page filled
>> with potentially sensitive data is a bad idea!). Is this correct?
> Memory exposed to userspace is zeroed before userspace can access
> it.  Kernel memory is not zeroed unless the caller specifically asks
> for it to be zeroed.
>
>> - Are there other situations (aside from drop_caches) where the kernel frees
>> pages from the page cache? Especially without having to zero them anyway? In
> truncate(), fallocate(), direct IO, fadvise(), madvise(), etc. IOWs,
> there are lots of runtime vectors that cause page cache to be freed.
>
>> other words, what situations would turning on some zero-pages-on-free
>> setting actually hurt performance?
> Lots.  page contents are typically cold when the page is freed so
> the zeroing is typically memory latency and bandwidth bound. And
> doing it on free means there isn't any sort of "cache priming"
> performance benefits that we get with zeroing at allocation because
> the page contents are not going to be immediately accessed by the
> kernel or userspace.
>
>> - Does unmounting a filesystem completely zero out the removed fs's pages
>> from the page cache?
> No. It just frees them. No explicit zeroing.

I see. So even unmounting a filesystem and removing the device
completely doesn't fully protect against a cold-boot attack. Good to
know.
>
>> - I remember hearing somewhere of some Linux support for zeroing out all
>> pages in memory if they're free'd from the page cache. However, I spent a
>> while trying to find this (how to turn it on, benchmarks) and I couldn't
>> find it. Do you know if such a thing exists, and if so how to turn it on?
>> I'm curious of the actual performance impact of it.
> You can test it for yourself: the init_on_free kernel command line
> option controls whether the kernel zeroes on free.
>
> Typical distro configuration is:
>
> $ sudo dmesg |grep auto-init
> [    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
> $
>
> So this kernel zeroes all stack memory, page and heap memory on
> allocation, and does nothing on free...

I see. Thank you for all the information.

So a ~5% performance penalty isn't trivial, especially to protect
against something as rare/unlikely as a cold-boot attack. But it would
be quite nice to have some semblance of effort put into making sure the
data is actually out of memory if we claim that we've done our best to
harden the system against this scenario. Again, I'm all for best-effort
solutions here; doing 90% is better than doing 0%...

I've got an alternative idea. How feasible would a second API be that
just goes through free regions of memory and zeroes them out? This would
be something we call immediately after we tell the kernel to drop
everything it can relating to a given filesystem. So the flow would be
something like the following:

1. user puts systemd-homed into this "locked" mode, homed wipes the
dm-crypt key out of memory and suspends the block device (this already
exists)
2. homed asks the kernel to drop whatever caches it can relating to that
filesystem (the topic of this email thread)
3. homed asks the kernel to zero out all unallocated memory to make sure
that the data is really gone (the second call I'm proposing now).

Sure, this operation can take a while, but for our use-cases that's
probably fine. We would do this only in response to a direct user action
(and we can show a nice little progress spinner on screen), or right
before suspend. A couple of extra seconds of work while entering suspend
isn't going to be noticed by the user. If the hardware supports
something faster/better to mitigate cold-boot attacks, like memory
encryption / SEV, then we'd prefer to use that instead, of course, but
for unsupported hardware I think just zeroing out all the memory that
has been marked free should do the trick just fine.
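
Purely to illustrate, the flow could look something like this from
homed's side. Both the per-fs drop ioctl and the zero-free-pages knob
are made-up names; nothing like them exists in the kernel today:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define FS_IOC_DROPCACHE 0	/* placeholder for the proposed per-fs ioctl */

int homed_lock_home(int fs_fd)
{
	int fd;

	/* 1: wipe the dm-crypt key and suspend the device (exists today) */
	if (system("cryptsetup luksSuspend home") != 0)
		return -1;

	/* 2: drop whatever caches the kernel can for this filesystem */
	if (ioctl(fs_fd, FS_IOC_DROPCACHE) != 0)
		return -1;

	/* 3: zero all currently-free page frames (hypothetical knob) */
	fd = open("/proc/sys/vm/zero_free_pages", O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}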

By the way, something like cryptsetup might want to use this second API
too, to ensure data is purged from memory after it closes a LUKS volume.
So, for example, if you have an encrypted USB stick you use on your
computer, the data really gets wiped after you unplug it.

> -Dave.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* init_on_alloc digression: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-16  1:14                 ` Adrian Vovk
@ 2024-02-16 20:38                   ` John Hubbard
  2024-02-16 21:11                     ` Adrian Vovk
  0 siblings, 1 reply; 27+ messages in thread
From: John Hubbard @ 2024-02-16 20:38 UTC (permalink / raw)
  To: Adrian Vovk, Dave Chinner
  Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Christoph Hellwig

On 2/15/24 17:14, Adrian Vovk wrote:
...
>> Typical distro configuration is:
>>
>> $ sudo dmesg |grep auto-init
>> [    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap 
>> free:off
>> $
>>
>> So this kernel zeroes all stack memory, page and heap memory on
>> allocation, and does nothing on free...
> 
> I see. Thank you for all the information.
> 
> So ~5% performance penalty isn't trivial, especially to protect against 

And it's more like 600% or more on some systems. For example, imagine
someone had a memory-coherent system that included both CPUs and GPUs,
each with their own NUMA memory nodes. The GPU has fast DMA engines that
can zero a lot of that memory very, very quickly, orders of magnitude
faster than the CPU can clear it.

So, the GPU driver is going to clear that memory before handing it
out to user space, and all is well so far.

But init_on_alloc forces the CPU to clear the memory first, because of
the belief here that this is somehow required in order to get defense
in depth. (True, if you can convince yourself that some parts of the
kernel are in a different trust boundary than others. I lack faith
here and am not a believer in such make-believe boundaries.)

Anyway, this situation has wasted much time, and at this point, I
wish I could delete the whole init_on_alloc feature.

Just in case you wanted an alt perspective. :)


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: init_on_alloc digression: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-16 20:38                   ` init_on_alloc digression: " John Hubbard
@ 2024-02-16 21:11                     ` Adrian Vovk
  2024-02-16 21:19                       ` John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: Adrian Vovk @ 2024-02-16 21:11 UTC (permalink / raw)
  To: John Hubbard, Dave Chinner
  Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Christoph Hellwig

On 2/16/24 15:38, John Hubbard wrote:
> On 2/15/24 17:14, Adrian Vovk wrote:
> ...
>>> Typical distro configuration is:
>>>
>>> $ sudo dmesg |grep auto-init
>>> [    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap 
>>> free:off
>>> $
>>>
>>> So this kernel zeroes all stack memory, page and heap memory on
>>> allocation, and does nothing on free...
>>
>> I see. Thank you for all the information.
>>
>> So ~5% performance penalty isn't trivial, especially to protect against 
>
> And it's more like 600% or more, on some systems. For example, imagine if
> someone had a memory-coherent system that included both CPUs and GPUs,
> each with their own NUMA memory nodes. The GPU has fast DMA engines that
> can zero a lot of that memory very very quickly, order(s) of magnitude
> faster than the CPU can clear it.
>
> So, the GPU driver is going to clear that memory before handing it
> out to user space, and all is well so far.
>
> But init_on_alloc forces the CPU to clear the memory first, because of
> the belief here that this is somehow required in order to get defense
> in depth. (True, if you can convince yourself that some parts of the
> kernel are in a different trust boundary than others. I lack faith
> here and am not a believer in such make belief boundaries.)

As far as I can tell, init_on_alloc isn't about drawing a trust boundary
between parts of the kernel, but about hardening the kernel against
mistakes made by developers, i.e. if they forget to initialize some
memory. If the memory isn't zeroed and the developer forgets to
initialize it, then memory potentially under user control (from the page
cache or so) can control the flow of execution in the kernel. Thus,
zeroing out the memory provides a second layer of defense even in
situations where the first layer (not using uninitialized memory)
failed. Hence, defense in depth.

Is this just an NVIDIA embedded thing (AFAIK your desktop/laptop cards 
don't share memory with the CPU), or would it affect something like 
Intel/AMD APUs as well?

If the GPU is so much faster at zeroing out blocks of memory on these
systems, maybe the kernel should use the GPU's DMA engine whenever it
needs to zero out some blocks of memory. (I'm joking, mostly; I can
imagine it's not quite so simple.)

> Anyway, this situation has wasted much time, and at this point, I
> wish I could delete the whole init_on_alloc feature.
>
> Just in case you wanted an alt perspective. :)

This is all good to know, thanks.

I'm not particularly interested in init_on_alloc since it doesn't help 
against cold-boot scenarios. Does init_on_free have similar performance 
issues on such systems? (i.e. are you often freeing memory and then 
immediately allocating the same memory in the GPU driver?)

Either way, I'd much prefer to have both turned off and to only zero out
freed memory periodically / on user request, not on every
allocation/free.

> thanks,

Best,
Adrian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: init_on_alloc digression: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-16 21:11                     ` Adrian Vovk
@ 2024-02-16 21:19                       ` John Hubbard
  0 siblings, 0 replies; 27+ messages in thread
From: John Hubbard @ 2024-02-16 21:19 UTC (permalink / raw)
  To: Adrian Vovk, Dave Chinner
  Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Christoph Hellwig

On 2/16/24 13:11, Adrian Vovk wrote:
...
>> But init_on_alloc forces the CPU to clear the memory first, because of
>> the belief here that this is somehow required in order to get defense
>> in depth. (True, if you can convince yourself that some parts of the
>> kernel are in a different trust boundary than others. I lack faith
>> here and am not a believer in such make belief boundaries.)
> 
> As far as I can tell init_on_alloc isn't about drawing a trust boundary 
> between parts of the kernel, but about hardening the kernel against 
> mistakes made by developers, i.e. if they forget to initialize some 

So this is writing code in order to protect against other code, in
the same kernel. So now we need some more code in case this new code
forgets to do something, or has a bug.

This will recurse into an infinite amount of code. :)

> memory. If the memory isn't zero'd and the developer forgets to 
> initialize it, then potentially memory under user control (from page 
> cache or so) can control flow of execution in the kernel. Thus, zeroing 
> out the memory provides a second layer of defense even in situations 
> where the first layer (not using uninitialized memory) failed. Thus, 
> defense in depth.

Why not initialize memory at the entry of every function that sees
the page, then, and call it defense-really-in-depth? It's hard to see
where the silliness ends.

> 
> Is this just an NVIDIA embedded thing (AFAIK your desktop/laptop cards 

Nope. Any system that has slow CPU access to fast accelerator memory
would suffer like this. And many are being built.

> don't share memory with the CPU), or would it affect something like 
> Intel/AMD APUs as well?
> 
> If the GPU is so much faster at zeroing out blocks of memory in these 
> systems, maybe the kernel should use the GPU's DMA engine whenever it 
> needs to zero out some blocks of memory (I'm joking, mostly; I can 
> imagine it's not quite so simple)

Yes, it's conceivable to put in a callback hook from init_on_alloc
so that it could use a driver to fast-zero the memory. Except that
will never be accepted by anyone who accepts your first argument:
this is "protection" against those forgetful, silly driver writers.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
                   ` (2 preceding siblings ...)
  2024-01-16 20:56 ` Dave Chinner
@ 2024-02-17  4:04 ` Kent Overstreet
  3 siblings, 0 replies; 27+ messages in thread
From: Kent Overstreet @ 2024-02-17  4:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig

On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote:
> Hey,
> 
> I'm not sure this even needs a full LSFMM discussion but since I
> currently don't have time to work on the patch I may as well submit it.
> 
> Gnome recently got awared 1M Euro by the Sovereign Tech Fund (STF). The
> STF was created by the German government to fund public infrastructure:
> 
> "The Sovereign Tech Fund supports the development, improvement and
>  maintenance of open digital infrastructure. Our goal is to sustainably
>  strengthen the open source ecosystem. We focus on security, resilience,
>  technological diversity, and the people behind the code." (cf. [1])
> 
> Gnome has proposed various specific projects including integrating
> systemd-homed with Gnome. Systemd-homed provides various features and if
> you're interested in details then you might find it useful to read [2].
> It makes use of various new VFS and fs specific developments over the
> last years.
> 
> One feature is encrypting the home directory via LUKS. An approriate
> image or device must contain a GPT partition table. Currently there's
> only one partition which is a LUKS2 volume. Inside that LUKS2 volume is
> a Linux filesystem. Currently supported are btrfs (see [4] though),
> ext4, and xfs.
> 
> The following issue isn't specific to systemd-homed. Gnome wants to be
> able to support locking encrypted home directories. For example, when
> the laptop is suspended. To do this the luksSuspend command can be used.
> 
> The luksSuspend call is nothing else than a device mapper ioctl to
> suspend the block device and it's owning superblock/filesystem. Which in
> turn is nothing but a freeze initiated from the block layer:
> 
> dm_suspend()
> -> __dm_suspend()
>    -> lock_fs()
>       -> bdev_freeze()
> 
> So when we say luksSuspend we really mean block layer initiated freeze.
> The overall goal or expectation of userspace is that after a luksSuspend
> call all sensitive material has been evicted from relevant caches to
> harden against various attacks. And luksSuspend does wipe the encryption
> key and suspend the block device. However, the encryption key can still
> be available clear-text in the page cache. To illustrate this problem
> more simply:
> 
> truncate -s 500M /tmp/img
> echo password | cryptsetup luksFormat /tmp/img --force-password
> echo password | cryptsetup open /tmp/img test
> mkfs.xfs /dev/mapper/test
> mount /dev/mapper/test /mnt
> echo "secrets" > /mnt/data
> cryptsetup luksSuspend test
> cat /mnt/data
> 
> This will still happily print the contents of /mnt/data even though the
> block device and the owning filesystem are frozen because the data is
> still in the page cache.
> 
> To my knowledge, the only current way to get the contents of /mnt/data
> or the encryption key out of the page cache is via
> /proc/sys/vm/drop_caches which is a big hammer.
> 
> My initial reaction is to give userspace an API to drop the page cache
> of a specific filesystem which may have additional uses. I initially had
> started drafting an ioctl() and then got swayed towards a
> posix_fadvise() flag. I found out that this was already proposed a few
> years ago but got rejected as it was suspected this might just be
> someone toying around without a real world use-case. I think this here
> might qualify as a real-world use-case.
> 
> This may at least help securing users with a regular dm-crypt setup
> where dm-crypt is the top layer. Users that stack additional layers on
> top of dm-crypt may still leak plaintext of course if they introduce
> additional caching. But that's on them.
> 
> Of course other ideas welcome.

This isn't entirely unlike snapshot deletion, where we also need to
shoot down the pagecache.

Technically, the code I have now for snapshot deletion isn't quite what
I want; snapshot deletion probably wants something closer to revoke()
instead of waiting for files to be closed. But maybe the code I have is
close to what you need - maybe we could turn this into a common shared
API?

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs.c#n1569

The need for page zeroing is pretty orthogonal; if you want page zeroing
you want that enabled for all page cache folios at all times.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-02-17  4:04 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
2024-01-16 11:45 ` Jan Kara
2024-01-17 12:53   ` Christian Brauner
2024-01-17 14:35     ` Jan Kara
2024-01-17 14:52       ` Matthew Wilcox
2024-01-17 20:51         ` Phillip Susi
2024-01-17 20:58           ` Matthew Wilcox
2024-01-18 14:26         ` Christian Brauner
2024-01-30  0:13         ` Adrian Vovk
2024-02-15 13:57           ` Jan Kara
2024-02-15 19:46             ` Adrian Vovk
2024-02-15 23:17               ` Dave Chinner
2024-02-16  1:14                 ` Adrian Vovk
2024-02-16 20:38                   ` init_on_alloc digression: " John Hubbard
2024-02-16 21:11                     ` Adrian Vovk
2024-02-16 21:19                       ` John Hubbard
2024-01-16 15:25 ` James Bottomley
2024-01-16 15:40   ` Matthew Wilcox
2024-01-16 15:54     ` James Bottomley
2024-01-16 20:56 ` Dave Chinner
2024-01-17  6:17   ` Theodore Ts'o
2024-01-30  1:14     ` Adrian Vovk
2024-01-17 13:19   ` Christian Brauner
2024-01-17 22:26     ` Dave Chinner
2024-01-18 14:09       ` Christian Brauner
2024-02-05 17:39     ` Russell Haley
2024-02-17  4:04 ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).