* [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
@ 2024-01-16 10:50 Christian Brauner
  2024-01-16 11:45 ` Jan Kara
  ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Christian Brauner @ 2024-01-16 10:50 UTC (permalink / raw)
  To: lsf-pc
  Cc: Christian Brauner, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig

Hey,

I'm not sure this even needs a full LSFMM discussion but since I
currently don't have time to work on the patch I may as well submit
it.

Gnome recently got awarded 1M Euro by the Sovereign Tech Fund (STF).
The STF was created by the German government to fund public
infrastructure:

"The Sovereign Tech Fund supports the development, improvement and
maintenance of open digital infrastructure. Our goal is to sustainably
strengthen the open source ecosystem. We focus on security,
resilience, technological diversity, and the people behind the code."
(cf. [1])

Gnome has proposed various specific projects including integrating
systemd-homed with Gnome. Systemd-homed provides various features and
if you're interested in details then you might find it useful to read
[2]. It makes use of various new VFS and fs-specific developments over
the last few years.

One feature is encrypting the home directory via LUKS. An appropriate
image or device must contain a GPT partition table. Currently there's
only one partition, which is a LUKS2 volume. Inside that LUKS2 volume
is a Linux filesystem. Currently supported are btrfs (see [4] though),
ext4, and xfs.

The following issue isn't specific to systemd-homed. Gnome wants to be
able to support locking encrypted home directories, for example when
the laptop is suspended. To do this the luksSuspend command can be
used.

The luksSuspend call is nothing else than a device mapper ioctl to
suspend the block device and its owning superblock/filesystem. Which
in turn is nothing but a freeze initiated from the block layer:

    dm_suspend()
    -> __dm_suspend()
       -> lock_fs()
          -> bdev_freeze()

So when we say luksSuspend we really mean block layer initiated
freeze. The overall goal or expectation of userspace is that after a
luksSuspend call all sensitive material has been evicted from relevant
caches to harden against various attacks. And luksSuspend does wipe
the encryption key and suspend the block device. However, the
encryption key can still be available clear-text in the page cache. To
illustrate this problem more simply:

    truncate -s 500M /tmp/img
    echo password | cryptsetup luksFormat /tmp/img --force-password
    echo password | cryptsetup open /tmp/img test
    mkfs.xfs /dev/mapper/test
    mount /dev/mapper/test /mnt
    echo "secrets" > /mnt/data
    cryptsetup luksSuspend test
    cat /mnt/data

This will still happily print the contents of /mnt/data even though
the block device and the owning filesystem are frozen, because the
data is still in the page cache.

To my knowledge, the only current way to get the contents of /mnt/data
or the encryption key out of the page cache is via
/proc/sys/vm/drop_caches, which is a big hammer.

My initial reaction is to give userspace an API to drop the page cache
of a specific filesystem, which may have additional uses. I initially
had started drafting an ioctl() and then got swayed towards a
posix_fadvise() flag. I found out that this was already proposed a few
years ago but got rejected as it was suspected this might just be
someone toying around without a real-world use-case. I think this here
might qualify as a real-world use-case.
This may at least help secure users with a regular dm-crypt setup
where dm-crypt is the top layer. Users that stack additional layers on
top of dm-crypt may still leak plaintext of course if they introduce
additional caching. But that's on them.

Of course other ideas welcome.

[1]: https://www.sovereigntechfund.de/en
[2]: https://systemd.io/HOME_DIRECTORY
[3]: https://lore.kernel.org/linux-btrfs/20230908-merklich-bebauen-11914a630db4@brauner/
[4]: A bdev_freeze() call ideally does the following:
     (1) Freeze the block device @bdev
     (2) Find the owning superblock of the block device @bdev and
         freeze the filesystem as well.
     Especially (2) wasn't true for a long time: a filesystem could
     only be frozen via its main block device. For example, an xfs
     filesystem using an external log device could not be frozen if
     the block layer request came via the external log device. This
     has been fixed since v6.8 for all filesystems using appropriate
     holder operations. Except for btrfs, where block device initiated
     freezes don't work at all; not even for the main block device.
     I've pointed this out months ago in [3]. Which is why we
     currently can't use btrfs with LUKS2 encryption, as a luksSuspend
     call will leave the filesystem unfrozen.
[5]: https://gitlab.com/cryptsetup/cryptsetup/-/issues/855
     https://gitlab.gnome.org/Teams/STF/homed/-/issues/23

^ permalink raw reply	[flat|nested] 27+ messages in thread
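For concreteness, here is roughly what the interface floated above
could look like from userspace. Both shapes are hypothetical: no
FS_IOC_DROPCACHES ioctl and no filesystem-wide fadvise flag exist in
any released kernel, so treat this purely as a sketch of the proposal:

    /* Hypothetical sketch: drop the page cache of the filesystem
     * backing the given directory. FS_IOC_DROPCACHES is a made-up
     * ioctl number; nothing like it exists today.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define FS_IOC_DROPCACHES _IO('f', 0x42) /* invented for illustration */

    int main(int argc, char **argv)
    {
            int fd;

            if (argc != 2)
                    return 1;
            fd = open(argv[1], O_RDONLY | O_DIRECTORY);
            if (fd < 0 || ioctl(fd, FS_IOC_DROPCACHES) < 0) {
                    perror("drop fs caches");
                    return 1;
            }
            close(fd);
            return 0;
    }

The posix_fadvise() shape would instead be a new advice value with
superblock-wide rather than per-file scope, invoked the same way on
any fd referring to the filesystem.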
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
@ 2024-01-16 11:45 ` Jan Kara
  2024-01-17 12:53   ` Christian Brauner
  2024-01-16 15:25 ` James Bottomley
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-01-16 11:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig

On Tue 16-01-24 11:50:32, Christian Brauner wrote:

<snip the usecase details>

> My initial reaction is to give userspace an API to drop the page cache
> of a specific filesystem, which may have additional uses. I initially
> had started drafting an ioctl() and then got swayed towards a
> posix_fadvise() flag. I found out that this was already proposed a few
> years ago but got rejected as it was suspected this might just be
> someone toying around without a real-world use-case. I think this here
> might qualify as a real-world use-case.
>
> This may at least help secure users with a regular dm-crypt setup
> where dm-crypt is the top layer. Users that stack additional layers on
> top of dm-crypt may still leak plaintext of course if they introduce
> additional caching. But that's on them.

Well, your usecase has one substantial difference from drop_caches.
You actually *require* pages to be evicted from the page cache for
security purposes. And giving any kind of guarantee is going to be
tough. Think for example of someone grabbing a page cache folio
reference through vmsplice(2) just before you initiate your dm suspend
and want to evict the page cache. What are you going to do? You cannot
free the folio while the refcount is elevated. You could possibly
detach it from the page cache so that at least it isn't visible, but
that has side effects too - after you resume, the folio would remain
detached and so would no longer see changes happening to the file. So
IMHO the only thing you could do without problematic side effects is
report an error. Which would be user unfriendly and could actually be
surprisingly frequent due to transient folio references taken by
various code paths.

Sure, we could report an error only if the page has its pincount
elevated, not just its refcount, but it needs some serious thinking
about how this would interact.

Also what is going to be the interaction with mlock(2)?

Overall this doesn't seem like "just tweak drop_caches a bit" kind of
work...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 11:45 ` Jan Kara
@ 2024-01-17 12:53   ` Christian Brauner
  2024-01-17 14:35     ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Christian Brauner @ 2024-01-17 12:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Christoph Hellwig

On Tue, Jan 16, 2024 at 12:45:19PM +0100, Jan Kara wrote:
> On Tue 16-01-24 11:50:32, Christian Brauner wrote:
>
> <snip the usecase details>
>
> > My initial reaction is to give userspace an API to drop the page cache
> > of a specific filesystem, which may have additional uses. I initially
> > had started drafting an ioctl() and then got swayed towards a
> > posix_fadvise() flag. I found out that this was already proposed a few
> > years ago but got rejected as it was suspected this might just be
> > someone toying around without a real-world use-case. I think this here
> > might qualify as a real-world use-case.
> >
> > This may at least help secure users with a regular dm-crypt setup
> > where dm-crypt is the top layer. Users that stack additional layers on
> > top of dm-crypt may still leak plaintext of course if they introduce
> > additional caching. But that's on them.
>
> Well, your usecase has one substantial difference from drop_caches.
> You actually *require* pages to be evicted from the page cache for
> security purposes. And giving any kind of guarantee is going to be
> tough. Think for example of someone grabbing a page cache folio
> reference through vmsplice(2) just before you initiate your dm suspend
> and want to evict the page cache. What are you going to do? You cannot
> free the folio while the refcount is elevated. You could possibly
> detach it from the page cache so that at least it isn't visible, but
> that has side effects too - after you resume, the folio would remain
> detached and so would no longer see changes happening to the file. So
> IMHO the only thing you could do without problematic side effects is
> report an error. Which would be user unfriendly and could actually be
> surprisingly frequent due to transient folio references taken by
> various code paths.

I wonder though, if you start suspending userspace and the filesystem,
how likely are you to encounter these transient errors?

> Sure, we could report an error only if the page has its pincount
> elevated, not just its refcount, but it needs some serious thinking
> about how this would interact.
>
> Also what is going to be the interaction with mlock(2)?
>
> Overall this doesn't seem like "just tweak drop_caches a bit" kind of
> work...

So when I talked to the Gnome people they were interested in an
optimal or a best-effort solution. So returning an error might
actually be useful.

I'm specifically putting this here because my knowledge of the page
cache isn't sufficient to make a judgement about what guarantees are
and aren't feasible. So I'm grateful for any insight here.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 12:53 ` Christian Brauner
@ 2024-01-17 14:35   ` Jan Kara
  2024-01-17 14:52     ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-01-17 14:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Matthew Wilcox, Christoph Hellwig

On Wed 17-01-24 13:53:20, Christian Brauner wrote:
> On Tue, Jan 16, 2024 at 12:45:19PM +0100, Jan Kara wrote:
> > On Tue 16-01-24 11:50:32, Christian Brauner wrote:
> >
> > <snip the usecase details>
> >
> > > My initial reaction is to give userspace an API to drop the page cache
> > > of a specific filesystem, which may have additional uses. I initially
> > > had started drafting an ioctl() and then got swayed towards a
> > > posix_fadvise() flag. I found out that this was already proposed a few
> > > years ago but got rejected as it was suspected this might just be
> > > someone toying around without a real-world use-case. I think this here
> > > might qualify as a real-world use-case.
> > >
> > > This may at least help secure users with a regular dm-crypt setup
> > > where dm-crypt is the top layer. Users that stack additional layers on
> > > top of dm-crypt may still leak plaintext of course if they introduce
> > > additional caching. But that's on them.
> >
> > Well, your usecase has one substantial difference from drop_caches.
> > You actually *require* pages to be evicted from the page cache for
> > security purposes. And giving any kind of guarantee is going to be
> > tough. Think for example of someone grabbing a page cache folio
> > reference through vmsplice(2) just before you initiate your dm suspend
> > and want to evict the page cache. What are you going to do? You cannot
> > free the folio while the refcount is elevated. You could possibly
> > detach it from the page cache so that at least it isn't visible, but
> > that has side effects too - after you resume, the folio would remain
> > detached and so would no longer see changes happening to the file. So
> > IMHO the only thing you could do without problematic side effects is
> > report an error. Which would be user unfriendly and could actually be
> > surprisingly frequent due to transient folio references taken by
> > various code paths.
>
> I wonder though, if you start suspending userspace and the filesystem,
> how likely are you to encounter these transient errors?

Yeah, my expectation is that it should not be frequent in that case.
But there could be surprises there - e.g. pages mapping running
executable code are practically unevictable. Userspace should be
mostly sleeping, so there shouldn't be many of those, but there would
be some, and in the worst case that could result in the page cache
eviction always returning an error, which would not be very useful.

> > Sure, we could report an error only if the page has its pincount
> > elevated, not just its refcount, but it needs some serious thinking
> > about how this would interact.
> >
> > Also what is going to be the interaction with mlock(2)?
> >
> > Overall this doesn't seem like "just tweak drop_caches a bit" kind of
> > work...
>
> So when I talked to the Gnome people they were interested in an
> optimal or a best-effort solution. So returning an error might
> actually be useful.

OK. So could we then define the effect of your desired call as calling
posix_fadvise(..., POSIX_FADV_DONTNEED) for every file? This is kind
of best-effort eviction which is reasonably well understood by
everybody.
Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
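Jan's POSIX_FADV_DONTNEED semantics can already be approximated from
userspace today by walking a mount and advising every regular file. A
best-effort sketch follows; note that POSIX_FADV_DONTNEED silently
skips dirty pages, so data is written back first, and that any folio
holding a transient elevated reference simply stays cached - exactly
the guarantee problem discussed above. It would also have to run
before the filesystem is frozen, since fsync() on a frozen filesystem
blocks:

    /* Best-effort page cache eviction for one mounted tree,
     * approximating "POSIX_FADV_DONTNEED for every file".
     */
    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <ftw.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int evict(const char *path, const struct stat *st,
                     int type, struct FTW *ftw)
    {
            (void)st; (void)ftw;
            if (type == FTW_F) {
                    int fd = open(path, O_RDONLY | O_NOFOLLOW);

                    if (fd >= 0) {
                            fsync(fd); /* DONTNEED only drops clean pages */
                            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
                            close(fd);
                    }
            }
            return 0; /* keep walking even if one file fails */
    }

    int main(int argc, char **argv)
    {
            if (argc != 2)
                    return 1;
            /* FTW_MOUNT: stay within this one filesystem */
            return nftw(argv[1], evict, 64, FTW_PHYS | FTW_MOUNT);
    }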
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-17 14:35 ` Jan Kara @ 2024-01-17 14:52 ` Matthew Wilcox 2024-01-17 20:51 ` Phillip Susi ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Matthew Wilcox @ 2024-01-17 14:52 UTC (permalink / raw) To: Jan Kara Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig On Wed, Jan 17, 2024 at 03:35:28PM +0100, Jan Kara wrote: > OK. So could we then define the effect of your desired call as calling > posix_fadvise(..., POSIX_FADV_DONTNEED) for every file? This is kind of > best-effort eviction which is reasonably well understood by everybody. I feel like we're in an XY trap [1]. What Christian actually wants is to not be able to access the contents of a file while the device it's on is suspended, and we've gone from there to "must drop the page cache". We have numerous ways to intercept file reads and make them either block or fail. The obvious one to me is security_file_permission() called from rw_verify_area(). Can we do everything we need with an LSM? [1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem ^ permalink raw reply [flat|nested] 27+ messages in thread
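To make the suggestion concrete: the per-call check itself would be
cheap, something like the sketch below. The frozen-state test against
sb->s_writers is real kernel infrastructure, but the hook registration
boilerplate is trimmed and varies across kernel versions, so read this
as a shape rather than a patch. As discussed in the replies, a
file_permission hook also would not cover already-established mmap()
mappings.

    /* Sketch of an LSM hook denying read/write on files whose
     * superblock is frozen. Registration details omitted.
     */
    #include <linux/fs.h>
    #include <linux/lsm_hooks.h>

    static int frozenfs_file_permission(struct file *file, int mask)
    {
            struct super_block *sb = file_inode(file)->i_sb;

            /* SB_FREEZE_COMPLETE: data and metadata writes frozen */
            if (sb->s_writers.frozen == SB_FREEZE_COMPLETE)
                    return -EPERM;
            return 0;
    }

    static struct security_hook_list frozenfs_hooks[] = {
            LSM_HOOK_INIT(file_permission, frozenfs_file_permission),
    };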
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 14:52 ` Matthew Wilcox
@ 2024-01-17 20:51   ` Phillip Susi
  2024-01-17 20:58     ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: Phillip Susi @ 2024-01-17 20:51 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Christoph Hellwig

Matthew Wilcox <willy@infradead.org> writes:

> We have numerous ways to intercept file reads and make them either
> block or fail. The obvious one to me is security_file_permission()
> called from rw_verify_area(). Can we do everything we need with an
> LSM?

I like the idea. That runs when someone opens a file, right? What
about files they already had open or mapped before the volume was
locked? If those aren't covered, is that OK? Are we just trying to
deny new open requests while the volume is locked?

Is that in addition to, or instead of, throwing out the key and
suspending IO at the block layer? If it is in addition, then trying to
open a file would fail cleanly, but accessing a page that is already
mapped could hang the task. In an unkillable state. For a long time.
Even the OOM killer can't kill a task blocked like that, can it? Or
did that get fixed at some point?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-17 20:51 ` Phillip Susi @ 2024-01-17 20:58 ` Matthew Wilcox 0 siblings, 0 replies; 27+ messages in thread From: Matthew Wilcox @ 2024-01-17 20:58 UTC (permalink / raw) To: Phillip Susi Cc: Jan Kara, Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig On Wed, Jan 17, 2024 at 03:51:37PM -0500, Phillip Susi wrote: > Matthew Wilcox <willy@infradead.org> writes: > > > We have numerous ways to intercept file reads and make them either > > block or fail. The obvious one to me is security_file_permission() > > called from rw_verify_area(). Can we do everything we need with an LSM? > > I like the idea. That runs when someone opens a file right? What about Every read() and write() call goes through there. eg ksys_read -> vfs_read -> rw_verify_area -> security_file_permission It wouldn't cover mmap accesses. So if you had the file mmaped before suspend, you'd still be able to load from the mmap. There's no security_ hook for that right now, afaik. > Is that in addition to, or instead of throwing out the key and > suspending IO at the block layer? If it is in addition, then that would > mean that trying to open a file would fail cleanly, but accessing a page > that is already mapped could hang the task. In an unkillable state. > For a long time. Even the OOM killer can't kill a task blocked like > that can it? Or did that get fixed at some point? TASK_KILLABLE was added in 2008, but it's up to each individual call site whether to use killable or uninterruptible sleep. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 14:52 ` Matthew Wilcox
  2024-01-17 20:51   ` Phillip Susi
@ 2024-01-18 14:26   ` Christian Brauner
  2024-01-30  0:13   ` Adrian Vovk
  2 siblings, 0 replies; 27+ messages in thread
From: Christian Brauner @ 2024-01-18 14:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Christoph Hellwig

On Wed, Jan 17, 2024 at 02:52:32PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 17, 2024 at 03:35:28PM +0100, Jan Kara wrote:
> > OK. So could we then define the effect of your desired call as calling
> > posix_fadvise(..., POSIX_FADV_DONTNEED) for every file? This is kind
> > of best-effort eviction which is reasonably well understood by
> > everybody.
>
> I feel like we're in an XY trap [1]. What Christian actually wants is
> to not be able to access the contents of a file while the device it's
> on is suspended, and we've gone from there to "must drop the page
> cache".
>
> We have numerous ways to intercept file reads and make them either
> block or fail. The obvious one to me is security_file_permission()
> called from rw_verify_area(). Can we do everything we need with an
> LSM?
>
> [1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

Nice idea, and we do stuff like that in other scenarios such as [1]
where we care about preventing _writes_ from occurring while a
specific service hasn't been fully set up. So that has been going
through my mind as well. And the LSM approach might be complementary.
For example, if feasible, it could be activated _before_ the freeze
operation, only allowing the block layer initiated freeze. And then we
can drop the page cache.

But in this case the LSM approach isn't easily workable and doesn't by
itself solve the problem for Gnome. It would most likely force the
usage of a bpf LSM as well. And the LSM would have to be activated
when the filesystem is frozen and then deactivated when it is
unfrozen. I'm not even sure that's currently easily doable.

But the Gnome use-case wants to be able to drop file contents before
they suspend the system. So the threat model is wider than just
someone being able to read contents on an active system. But it's
best-effort of course. So failing and reporting an error would be
totally fine, and then policy could dictate whether to not even
suspend.

It actually might help userspace in general. The ability to drop the
page cache of a specific filesystem is useful independent of the Gnome
use-case, especially on systems with thousands or tens of thousands of
services that use separate filesystem images, something that's not
uncommon.

[1]: https://github.com/systemd/systemd/blob/74e6a7d84a40de18bb3b18eeef6284f870f30a6e/src/nsresourced/bpf/userns_restrict/userns-restrict.bpf.c

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-17 14:52 ` Matthew Wilcox
  2024-01-17 20:51   ` Phillip Susi
  2024-01-18 14:26   ` Christian Brauner
@ 2024-01-30  0:13   ` Adrian Vovk
  2024-02-15 13:57     ` Jan Kara
  2 siblings, 1 reply; 27+ messages in thread
From: Adrian Vovk @ 2024-01-30 0:13 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Christoph Hellwig

Hello! I'm the "GNOME people" who Christian is referring to.

On 1/17/24 09:52, Matthew Wilcox wrote:
> I feel like we're in an XY trap [1]. What Christian actually wants is
> to not be able to access the contents of a file while the device it's
> on is suspended, and we've gone from there to "must drop the page
> cache".

What we really want is for the plaintext contents of the files to be
gone from memory while the dm-crypt device backing them is suspended.

Ultimately my goal is to limit the chance that an attacker with access
to a user's suspended laptop will be able to access the user's
encrypted data. I need to achieve this without forcing the user to
completely log out/power off/etc their system; it must be invisible to
the user. The key word here is limit; if we can remove _most_ files
from memory _most_ of the time I think luksSuspend would be a lot more
useful against cold boot than it is today.

I understand that perfectly wiping all the files out of memory without
completely unmounting the filesystem isn't feasible, and that's
probably OK for our use-case. As long as most files can be removed
from memory most of the time, anyway...

> We have numerous ways to intercept file reads and make them either
> block or fail. The obvious one to me is security_file_permission()
> called from rw_verify_area(). Can we do everything we need with an
> LSM?
>
> [1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

As Christian mentioned: the LSM may be a good addition, but it would
have to be in addition to wiping the data out of the page cache, not
instead of it. An LSM will not help against a cold boot attack.

Adrian

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-30  0:13 ` Adrian Vovk
@ 2024-02-15 13:57   ` Jan Kara
  2024-02-15 19:46     ` Adrian Vovk
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-02-15 13:57 UTC (permalink / raw)
  To: Adrian Vovk
  Cc: Matthew Wilcox, Jan Kara, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig

On Mon 29-01-24 19:13:17, Adrian Vovk wrote:
> Hello! I'm the "GNOME people" who Christian is referring to.

Got back to thinking about this after a while...

> On 1/17/24 09:52, Matthew Wilcox wrote:
> > I feel like we're in an XY trap [1]. What Christian actually wants is
> > to not be able to access the contents of a file while the device it's
> > on is suspended, and we've gone from there to "must drop the page
> > cache".
>
> What we really want is for the plaintext contents of the files to be
> gone from memory while the dm-crypt device backing them is suspended.
>
> Ultimately my goal is to limit the chance that an attacker with access
> to a user's suspended laptop will be able to access the user's
> encrypted data. I need to achieve this without forcing the user to
> completely log out/power off/etc their system; it must be invisible to
> the user. The key word here is limit; if we can remove _most_ files
> from memory _most_ of the time I think luksSuspend would be a lot more
> useful against cold boot than it is today.

Well, but if your attack vector is cold-boot attacks, then how does
freeing pages from the page cache help you? I mean, sure, the page
allocator will start tracking those pages with potentially sensitive
content as free, but unless you also zero all of them, this doesn't
help anything against cold-boot attacks? The sensitive memory content
is still there...

So you would also have to enable something like zero-on-page-free, and
generally the cost of this is going to be pretty big?

> I understand that perfectly wiping all the files out of memory without
> completely unmounting the filesystem isn't feasible, and that's
> probably OK for our use-case. As long as most files can be removed
> from memory most of the time, anyway...

OK, understood. I guess in that case something like BLKFLSBUF ioctl on
steroids (to also evict filesystem caches, not only the block device)
could be useful for you.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread
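For reference, the existing BLKFLSBUF ioctl Jan mentions is trivial to
invoke; the point is that it only flushes and invalidates the block
device's own cache, so file contents cached above it by the filesystem
survive - hence the need for the "steroids":

    /* Flush and invalidate a block device's cached buffers. The
     * per-inode page cache of the filesystem on top is untouched.
     */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int fd;

            if (argc != 2)
                    return 1;
            fd = open(argv[1], O_RDONLY);
            if (fd < 0 || ioctl(fd, BLKFLSBUF, 0) < 0) {
                    perror("BLKFLSBUF");
                    return 1;
            }
            close(fd);
            return 0;
    }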
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-15 13:57 ` Jan Kara
@ 2024-02-15 19:46   ` Adrian Vovk
  2024-02-15 23:17     ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Adrian Vovk @ 2024-02-15 19:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, Christian Brauner, lsf-pc, linux-fsdevel,
	linux-mm, linux-btrfs, linux-block, Christoph Hellwig

On 2/15/24 08:57, Jan Kara wrote:
> On Mon 29-01-24 19:13:17, Adrian Vovk wrote:
>> Hello! I'm the "GNOME people" who Christian is referring to.
> Got back to thinking about this after a while...
>
>> On 1/17/24 09:52, Matthew Wilcox wrote:
>>> I feel like we're in an XY trap [1]. What Christian actually wants is
>>> to not be able to access the contents of a file while the device it's
>>> on is suspended, and we've gone from there to "must drop the page
>>> cache".
>> What we really want is for the plaintext contents of the files to be
>> gone from memory while the dm-crypt device backing them is suspended.
>>
>> Ultimately my goal is to limit the chance that an attacker with access
>> to a user's suspended laptop will be able to access the user's
>> encrypted data. I need to achieve this without forcing the user to
>> completely log out/power off/etc their system; it must be invisible to
>> the user. The key word here is limit; if we can remove _most_ files
>> from memory _most_ of the time I think luksSuspend would be a lot more
>> useful against cold boot than it is today.
> Well, but if your attack vector is cold-boot attacks, then how does
> freeing pages from the page cache help you? I mean, sure, the page
> allocator will start tracking those pages with potentially sensitive
> content as free, but unless you also zero all of them, this doesn't
> help anything against cold-boot attacks? The sensitive memory content
> is still there...
>
> So you would also have to enable something like zero-on-page-free, and
> generally the cost of this is going to be pretty big?

Yes you are right. Just marking pages as free isn't enough.

I'm sure it's reasonable enough to zero out the pages that are getting
free'd at our request. But the difficulty here is to try and clear
pages that were freed previously for other reasons, unless we're
zeroing out all pages on free. So I suppose that leaves me with a
couple questions:

- As far as I know, the kernel only naturally frees pages from the
page cache when they're about to be given to some program for imminent
use. But then in that case the page isn't only free'd, but also zero'd
out before it's handed over to the program (because giving a program
access to a page filled with potentially sensitive data is a bad
idea!). Is this correct?

- Are there other situations (aside from drop_caches) where the kernel
frees pages from the page cache? Especially without having to zero
them anyway? In other words, what situations would turning on some
zero-pages-on-free setting actually hurt performance?

- Does dismounting a filesystem completely zero out the removed fs's
pages from the page cache?

- I remember hearing somewhere of some Linux support for zeroing out
all pages in memory if they're free'd from the page cache. However, I
spent a while trying to find this (how to turn it on, benchmarks) and
I couldn't find it. Do you know if such a thing exists, and if so how
to turn it on? I'm curious of the actual performance impact of it.
>> I understand that perfectly wiping all the files out of memory without >> completely unmounting the filesystem isn't feasible, and that's probably OK >> for our use-case. As long as most files can be removed from memory most of >> the time, anyway... > OK, understood. I guess in that case something like BLKFLSBUF ioctl on > steroids (to also evict filesystem caches, not only the block device) could > be useful for you. > > Honza Best, Adrian ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-02-15 19:46 ` Adrian Vovk @ 2024-02-15 23:17 ` Dave Chinner 2024-02-16 1:14 ` Adrian Vovk 0 siblings, 1 reply; 27+ messages in thread From: Dave Chinner @ 2024-02-15 23:17 UTC (permalink / raw) To: Adrian Vovk Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig On Thu, Feb 15, 2024 at 02:46:52PM -0500, Adrian Vovk wrote: > On 2/15/24 08:57, Jan Kara wrote: > > On Mon 29-01-24 19:13:17, Adrian Vovk wrote: > > > Hello! I'm the "GNOME people" who Christian is referring to > > Got back to thinking about this after a while... > > > > > On 1/17/24 09:52, Matthew Wilcox wrote: > > > > I feel like we're in an XY trap [1]. What Christian actually wants is > > > > to not be able to access the contents of a file while the device it's > > > > on is suspended, and we've gone from there to "must drop the page cache". > > > What we really want is for the plaintext contents of the files to be gone > > > from memory while the dm-crypt device backing them is suspended. > > > > > > Ultimately my goal is to limit the chance that an attacker with access to a > > > user's suspended laptop will be able to access the user's encrypted data. I > > > need to achieve this without forcing the user to completely log out/power > > > off/etc their system; it must be invisible to the user. The key word here is > > > limit; if we can remove _most_ files from memory _most_ of the time Ithink > > > luksSuspend would be a lot more useful against cold boot than it is today. > > Well, but if your attack vector are cold-boot attacks, then how does > > freeing pages from the page cache help you? I mean sure the page allocator > > will start tracking those pages with potentially sensitive content as free > > but unless you also zero all of them, this doesn't help anything against > > cold-boot attacks? The sensitive memory content is still there... > > > > So you would also have to enable something like zero-on-page-free and > > generally the cost of this is going to be pretty big? > > Yes you are right. Just marking pages as free isn't enough. > > I'm sure it's reasonable enough to zero out the pages that are getting > free'd at our request. But the difficulty here is to try and clear pages > that were freed previously for other reasons, unless we're zeroing out all > pages on free. So I suppose that leaves me with a couple questions: > > - As far as I know, the kernel only naturally frees pages from the page > cache when they're about to be given to some program for imminent use. Memory pressure does cause cache reclaim. Not just page cache, but also slab caches and anything else various subsystems can clean up to free memory.. > But > then in the case the page isn't only free'd, but also zero'd out before it's > handed over to the program (because giving a program access to a page filled > with potentially sensitive data is a bad idea!). Is this correct? Memory exposed to userspace is zeroed before userspace can access it. Kernel memory is not zeroed unless the caller specifically asks for it to be zeroed. > - Are there other situations (aside from drop_caches) where the kernel frees > pages from the page cache? Especially without having to zero them anyway? In truncate(), fallocate(), direct IO, fadvise(), madvise(), etc. IOWs, there are lots of runtime vectors that cause page cache to be freed. 
> other words, what situations would turning on some zero-pages-on-free
> setting actually hurt performance?

Lots. Page contents are typically cold when the page is freed, so the
zeroing is typically memory latency and bandwidth bound. And doing it
on free means there aren't any of the "cache priming" performance
benefits that we get with zeroing at allocation, because the page
contents are not going to be immediately accessed by the kernel or
userspace.

> - Does dismounting a filesystem completely zero out the removed fs's
> pages from the page cache?

No. It just frees them. No explicit zeroing.

> - I remember hearing somewhere of some Linux support for zeroing out
> all pages in memory if they're free'd from the page cache. However, I
> spent a while trying to find this (how to turn it on, benchmarks) and
> I couldn't find it. Do you know if such a thing exists, and if so how
> to turn it on? I'm curious of the actual performance impact of it.

You can test it for yourself: the init_on_free kernel command line
option controls whether the kernel zeroes on free.

Typical distro configuration is:

$ sudo dmesg | grep auto-init
[    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
$

So this kernel zeroes all stack memory, page and heap memory on
allocation, and does nothing on free...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-02-15 23:17 ` Dave Chinner @ 2024-02-16 1:14 ` Adrian Vovk 2024-02-16 20:38 ` init_on_alloc digression: " John Hubbard 0 siblings, 1 reply; 27+ messages in thread From: Adrian Vovk @ 2024-02-16 1:14 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig On 2/15/24 18:17, Dave Chinner wrote: > On Thu, Feb 15, 2024 at 02:46:52PM -0500, Adrian Vovk wrote: >> On 2/15/24 08:57, Jan Kara wrote: >>> On Mon 29-01-24 19:13:17, Adrian Vovk wrote: >>>> Hello! I'm the "GNOME people" who Christian is referring to >>> Got back to thinking about this after a while... >>> >>>> On 1/17/24 09:52, Matthew Wilcox wrote: >>>>> I feel like we're in an XY trap [1]. What Christian actually wants is >>>>> to not be able to access the contents of a file while the device it's >>>>> on is suspended, and we've gone from there to "must drop the page cache". >>>> What we really want is for the plaintext contents of the files to be gone >>>> from memory while the dm-crypt device backing them is suspended. >>>> >>>> Ultimately my goal is to limit the chance that an attacker with access to a >>>> user's suspended laptop will be able to access the user's encrypted data. I >>>> need to achieve this without forcing the user to completely log out/power >>>> off/etc their system; it must be invisible to the user. The key word here is >>>> limit; if we can remove _most_ files from memory _most_ of the time Ithink >>>> luksSuspend would be a lot more useful against cold boot than it is today. >>> Well, but if your attack vector are cold-boot attacks, then how does >>> freeing pages from the page cache help you? I mean sure the page allocator >>> will start tracking those pages with potentially sensitive content as free >>> but unless you also zero all of them, this doesn't help anything against >>> cold-boot attacks? The sensitive memory content is still there... >>> >>> So you would also have to enable something like zero-on-page-free and >>> generally the cost of this is going to be pretty big? >> Yes you are right. Just marking pages as free isn't enough. >> >> I'm sure it's reasonable enough to zero out the pages that are getting >> free'd at our request. But the difficulty here is to try and clear pages >> that were freed previously for other reasons, unless we're zeroing out all >> pages on free. So I suppose that leaves me with a couple questions: >> >> - As far as I know, the kernel only naturally frees pages from the page >> cache when they're about to be given to some program for imminent use. > Memory pressure does cause cache reclaim. Not just page cache, but > also slab caches and anything else various subsystems can clean up > to free memory.. > >> But >> then in the case the page isn't only free'd, but also zero'd out before it's >> handed over to the program (because giving a program access to a page filled >> with potentially sensitive data is a bad idea!). Is this correct? > Memory exposed to userspace is zeroed before userspace can access > it. Kernel memory is not zeroed unless the caller specifically asks > for it to be zeroed. > >> - Are there other situations (aside from drop_caches) where the kernel frees >> pages from the page cache? Especially without having to zero them anyway? In > truncate(), fallocate(), direct IO, fadvise(), madvise(), etc. IOWs, > there are lots of runtime vectors that cause page cache to be freed. 
>
>> other words, what situations would turning on some zero-pages-on-free
>> setting actually hurt performance?
> Lots. Page contents are typically cold when the page is freed, so the
> zeroing is typically memory latency and bandwidth bound. And doing it
> on free means there aren't any of the "cache priming" performance
> benefits that we get with zeroing at allocation, because the page
> contents are not going to be immediately accessed by the kernel or
> userspace.
>
>> - Does dismounting a filesystem completely zero out the removed fs's
>> pages from the page cache?
> No. It just frees them. No explicit zeroing.

I see. So even dismounting a filesystem and removing the device
completely doesn't fully protect against a cold-boot attack. Good to
know.

>
>> - I remember hearing somewhere of some Linux support for zeroing out
>> all pages in memory if they're free'd from the page cache. However, I
>> spent a while trying to find this (how to turn it on, benchmarks) and
>> I couldn't find it. Do you know if such a thing exists, and if so how
>> to turn it on? I'm curious of the actual performance impact of it.
> You can test it for yourself: the init_on_free kernel command line
> option controls whether the kernel zeroes on free.
>
> Typical distro configuration is:
>
> $ sudo dmesg | grep auto-init
> [    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
> $
>
> So this kernel zeroes all stack memory, page and heap memory on
> allocation, and does nothing on free...

I see. Thank you for all the information.

So a ~5% performance penalty isn't trivial, especially to protect
against something rare/unlikely like a cold-boot attack, but it would
be quite nice if we could put some semblance of effort into making
sure the data is actually out of memory if we claim that we've done
our best to harden the system against this scenario. Again, I'm all
for best-effort solutions here; doing 90% is better than doing 0%...

I've got an alternative idea. How feasible would a second API be that
just goes through free regions of memory and zeroes them out? This
would be something we call immediately after we tell the kernel to
drop everything it can relating to a given filesystem. So the flow
would be something like the following:

1. user puts systemd-homed into this "locked" mode; homed wipes the
   dm-crypt key out of memory and suspends the block device (this
   already exists)
2. homed asks the kernel to drop whatever caches it can relating to
   that filesystem (the topic of this email thread)
3. homed asks the kernel to zero out all unallocated memory to make
   sure that the data is really gone (the second call I'm proposing
   now).

Sure, this operation can take a while, but for our use-cases it's
probably fine. We would do this only in response to a direct user
action (and we can show a nice little progress spinner on screen), or
right before suspend. A couple of extra seconds of work while entering
suspend isn't going to be noticed by the user.

If the hardware supports something faster/better to mitigate cold-boot
attacks, like memory encryption / SEV, then we'd prefer to use that
instead of course, but for unsupported hardware I think just zeroing
out all the memory that has been marked free should do the trick just
fine.

By the way, something like cryptsetup might want to use this second
API too, to ensure data is purged from memory after it closes a LUKS
volume. So for example if you have an encrypted USB stick you use on
your computer, the data really gets wiped after you unplug it.
> -Dave. ^ permalink raw reply [flat|nested] 27+ messages in thread
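Expressed as code, the proposed locking flow might look like the
sketch below. Step 1 uses the real libdevmapper API (luksSuspend
additionally wipes the volume key before suspending); step 2 writes
the existing global drop_caches knob as a stand-in for the
per-filesystem call this thread is about; step 3 is entirely
hypothetical - no such zero-free-memory interface exists:

    /* Sketch of the proposed "lock" sequence (link with -ldevmapper). */
    #include <libdevmapper.h>
    #include <stdio.h>

    static int suspend_dm_device(const char *name)
    {
            struct dm_task *dmt = dm_task_create(DM_DEVICE_SUSPEND);
            int ok = dmt && dm_task_set_name(dmt, name) && dm_task_run(dmt);

            if (dmt)
                    dm_task_destroy(dmt);
            return ok ? 0 : -1;
    }

    int main(void)
    {
            /* 1. suspend I/O (luksSuspend also wipes the volume key) */
            suspend_dm_device("home");

            /* 2. evict clean caches: global today, per-fs if the API
             *    proposed in this thread existed */
            FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
            if (f) {
                    fputs("3", f);
                    fclose(f);
            }

            /* 3. hypothetical: scrub pages the allocator holds free */
            /* zero_free_memory(); -- does not exist today */
            return 0;
    }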
* init_on_alloc digression: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-16  1:14 ` Adrian Vovk
@ 2024-02-16 20:38   ` John Hubbard
  2024-02-16 21:11     ` Adrian Vovk
  0 siblings, 1 reply; 27+ messages in thread
From: John Hubbard @ 2024-02-16 20:38 UTC (permalink / raw)
  To: Adrian Vovk, Dave Chinner
  Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig

On 2/15/24 17:14, Adrian Vovk wrote:
...
>> Typical distro configuration is:
>>
>> $ sudo dmesg | grep auto-init
>> [    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
>> $
>>
>> So this kernel zeroes all stack memory, page and heap memory on
>> allocation, and does nothing on free...
>
> I see. Thank you for all the information.
>
> So a ~5% performance penalty isn't trivial, especially to protect against

And it's more like 600% or more, on some systems. For example, imagine
if someone had a memory-coherent system that included both CPUs and
GPUs, each with their own NUMA memory nodes. The GPU has fast DMA
engines that can zero a lot of that memory very very quickly, order(s)
of magnitude faster than the CPU can clear it.

So, the GPU driver is going to clear that memory before handing it out
to user space, and all is well so far.

But init_on_alloc forces the CPU to clear the memory first, because of
the belief here that this is somehow required in order to get defense
in depth. (True, if you can convince yourself that some parts of the
kernel are in a different trust boundary than others. I lack faith
here and am not a believer in such make-believe boundaries.)

Anyway, this situation has wasted much time, and at this point, I wish
I could delete the whole init_on_alloc feature.

Just in case you wanted an alt perspective. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: init_on_alloc digression: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-02-16 20:38 ` John Hubbard
@ 2024-02-16 21:11   ` Adrian Vovk
  2024-02-16 21:19     ` John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: Adrian Vovk @ 2024-02-16 21:11 UTC (permalink / raw)
  To: John Hubbard, Dave Chinner
  Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc,
	linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig

On 2/16/24 15:38, John Hubbard wrote:
> On 2/15/24 17:14, Adrian Vovk wrote:
> ...
>>> Typical distro configuration is:
>>>
>>> $ sudo dmesg | grep auto-init
>>> [    0.018882] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
>>> $
>>>
>>> So this kernel zeroes all stack memory, page and heap memory on
>>> allocation, and does nothing on free...
>>
>> I see. Thank you for all the information.
>>
>> So a ~5% performance penalty isn't trivial, especially to protect against
>
> And it's more like 600% or more, on some systems. For example, imagine
> if someone had a memory-coherent system that included both CPUs and
> GPUs, each with their own NUMA memory nodes. The GPU has fast DMA
> engines that can zero a lot of that memory very very quickly, order(s)
> of magnitude faster than the CPU can clear it.
>
> So, the GPU driver is going to clear that memory before handing it out
> to user space, and all is well so far.
>
> But init_on_alloc forces the CPU to clear the memory first, because of
> the belief here that this is somehow required in order to get defense
> in depth. (True, if you can convince yourself that some parts of the
> kernel are in a different trust boundary than others. I lack faith
> here and am not a believer in such make-believe boundaries.)

As far as I can tell init_on_alloc isn't about drawing a trust
boundary between parts of the kernel, but about hardening the kernel
against mistakes made by developers, i.e. if they forget to initialize
some memory. If the memory isn't zero'd and the developer forgets to
initialize it, then memory potentially under user control (from the
page cache or so) can end up influencing the flow of execution in the
kernel. Thus, zeroing out the memory provides a second layer of
defense even in situations where the first layer (not using
uninitialized memory) failed. Thus, defense in depth.

Is this just an NVIDIA embedded thing (AFAIK your desktop/laptop cards
don't share memory with the CPU), or would it affect something like
Intel/AMD APUs as well?

If the GPU is so much faster at zeroing out blocks of memory in these
systems, maybe the kernel should use the GPU's DMA engine whenever it
needs to zero out some blocks of memory (I'm joking, mostly; I can
imagine it's not quite so simple).

> Anyway, this situation has wasted much time, and at this point, I wish
> I could delete the whole init_on_alloc feature.
>
> Just in case you wanted an alt perspective. :)

This is all good to know, thanks. I'm not particularly interested in
init_on_alloc since it doesn't help against cold-boot scenarios. Does
init_on_free have similar performance issues on such systems? (i.e.
are you often freeing memory and then immediately allocating the same
memory in the GPU driver?)

Either way, I'd much prefer to have both turned off and only zero out
free'd memory periodically / on user request. Not on every
allocation/free.

> thanks,

Best,
Adrian

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: init_on_alloc digression: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-02-16 21:11 ` Adrian Vovk @ 2024-02-16 21:19 ` John Hubbard 0 siblings, 0 replies; 27+ messages in thread From: John Hubbard @ 2024-02-16 21:19 UTC (permalink / raw) To: Adrian Vovk, Dave Chinner Cc: Jan Kara, Matthew Wilcox, Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Christoph Hellwig On 2/16/24 13:11, Adrian Vovk wrote: ... >> But init_on_alloc forces the CPU to clear the memory first, because of >> the belief here that this is somehow required in order to get defense >> in depth. (True, if you can convince yourself that some parts of the >> kernel are in a different trust boundary than others. I lack faith >> here and am not a believer in such make belief boundaries.) > > As far as I can tell init_on_alloc isn't about drawing a trust boundary > between parts of the kernel, but about hardening the kernel against > mistakes made by developers, i.e. if they forget to initialize some So this is writing code in order to protect against other code, in the same kernel. So now we need some more code in case this new code forgets to do something, or has a bug. This will recurse into an infinite amount of code. :) > memory. If the memory isn't zero'd and the developer forgets to > initialize it, then potentially memory under user control (from page > cache or so) can control flow of execution in the kernel. Thus, zeroing > out the memory provides a second layer of defense even in situations > where the first layer (not using uninitialized memory) failed. Thus, > defense in depth. Why not initialize memory at the entry of every function that sees the page, then, and call it defense-really-in-depth? It's hard to see where the silliness ends. > > Is this just an NVIDIA embedded thing (AFAIK your desktop/laptop cards Nope. Any system that has slow CPU access to fast accelerator memory would suffer like this. And many are being built. > don't share memory with the CPU), or would it affect something like > Intel/AMD APUs as well? > > If the GPU is so much faster at zeroing out blocks of memory in these > systems, maybe the kernel should use the GPU's DMA engine whenever it > needs to zero out some blocks of memory (I'm joking, mostly; I can > imagine it's not quite so simple) Yes, it's conceivable to put in a callback hook from the init_on_alloc so that it could use a driver to fast-zero the memory. Except that will never be accepted by anyone who accepts your first argument: this is "protection" against those forgetful, silly driver writers. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner
  2024-01-16 11:45 ` Jan Kara
@ 2024-01-16 15:25 ` James Bottomley
  2024-01-16 15:40   ` Matthew Wilcox
  2024-01-16 20:56 ` Dave Chinner
  2024-02-17  4:04 ` Kent Overstreet
  3 siblings, 1 reply; 27+ messages in thread
From: James Bottomley @ 2024-01-16 15:25 UTC (permalink / raw)
  To: Christian Brauner, lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-btrfs, linux-block,
	Matthew Wilcox, Jan Kara, Christoph Hellwig

On Tue, 2024-01-16 at 11:50 +0100, Christian Brauner wrote:
> So when we say luksSuspend we really mean block layer initiated
> freeze. The overall goal or expectation of userspace is that after a
> luksSuspend call all sensitive material has been evicted from
> relevant caches to harden against various attacks. And luksSuspend
> does wipe the encryption key and suspend the block device. However,
> the encryption key can still be available clear-text in the page
> cache. To illustrate this problem more simply:
>
>     truncate -s 500M /tmp/img
>     echo password | cryptsetup luksFormat /tmp/img --force-password
>     echo password | cryptsetup open /tmp/img test
>     mkfs.xfs /dev/mapper/test
>     mount /dev/mapper/test /mnt
>     echo "secrets" > /mnt/data
>     cryptsetup luksSuspend test
>     cat /mnt/data

Not really anything to do with the drop caches problem, but luks can
use the kernel keyring API for this. That should ensure the key itself
can be shredded on suspend without replication anywhere in memory. Of
course the real problem is likely that the key has or is derived from
a password, and that password is in the user space gnome-keyring,
which will be much harder to purge ... although if the keyring were
using secret memory it would be way easier ...

So perhaps before we start bending the kernel out of shape in the name
of security, we should also ensure that the various user space
components are secured first. The most important thing to get right
first is key management (lose the key and someone who can steal the
encrypted data can access it). Then you can worry about data leaks due
to the cache, which are somewhat harder to exploit (to exploit this
you have to get into the cache in the first place, which is harder).

> This will still happily print the contents of /mnt/data even though
> the block device and the owning filesystem are frozen, because the
> data is still in the page cache.
>
> To my knowledge, the only current way to get the contents of
> /mnt/data or the encryption key out of the page cache is via
> /proc/sys/vm/drop_caches, which is a big hammer.

To be honest, why is this too big a hammer? Secret data could be
sprayed all over the cache, so killing all of it (assuming we can, as
Jan points out) would be a security benefit. I'm sure people would be
willing to pay the additional start-up time of an entirely empty cache
on resume in exchange for the easily analysable security guarantee it
gives.

In other words, dropping caches by device is harder to analyse in
security terms (because now you have to figure out where secret data
is and which caches you need to drop) and it's not clear it really has
much advantage in terms of faster resume for the complexity it would
introduce.

James

^ permalink raw reply	[flat|nested] 27+ messages in thread
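On the keyring half of James's point: dm-crypt can already reference a
volume key held in the kernel keyring instead of embedding it in the
device-mapper table, and revoking the key makes the kernel shred its
payload. A minimal sketch using the real keyutils API follows (link
with -lkeyutils); the key bytes and description here are placeholders:

    /* Hold a volume key as a "logon" key (unreadable by userspace)
     * and shred it on suspend by revoking it.
     */
    #include <keyutils.h>

    int main(void)
    {
            char key[32] = "placeholder-not-a-real-key";
            key_serial_t id;

            id = add_key("logon", "cryptsetup:home-vk", key, sizeof(key),
                         KEY_SPEC_USER_KEYRING);
            if (id < 0)
                    return 1;

            /* ... volume in use; dm-crypt can reference the key as
             * ":32:logon:cryptsetup:home-vk" in its table line ... */

            /* on suspend: revoke; the kernel wipes the payload */
            keyctl_revoke(id);
            return 0;
    }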
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs
  2024-01-16 15:25 ` James Bottomley
@ 2024-01-16 15:40   ` Matthew Wilcox
  2024-01-16 15:54     ` James Bottomley
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2024-01-16 15:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs,
	linux-block, Jan Kara, Christoph Hellwig

On Tue, Jan 16, 2024 at 10:25:20AM -0500, James Bottomley wrote:
> On Tue, 2024-01-16 at 11:50 +0100, Christian Brauner wrote:
> > So when we say luksSuspend we really mean block layer initiated
> > freeze. The overall goal or expectation of userspace is that after a
> > luksSuspend call all sensitive material has been evicted from
> > relevant caches to harden against various attacks. And luksSuspend
> > does wipe the encryption key and suspend the block device. However,
> > the encryption key can still be available clear-text in the page
> > cache. To illustrate this problem more simply:
> >
> >     truncate -s 500M /tmp/img
> >     echo password | cryptsetup luksFormat /tmp/img --force-password
> >     echo password | cryptsetup open /tmp/img test
> >     mkfs.xfs /dev/mapper/test
> >     mount /dev/mapper/test /mnt
> >     echo "secrets" > /mnt/data
> >     cryptsetup luksSuspend test
> >     cat /mnt/data
>
> Not really anything to do with the drop caches problem, but luks can
> use the kernel keyring API for this. That should ensure the key itself
> can be shredded on suspend without replication anywhere in memory. Of
> course the real problem is likely that the key has or is derived from
> a password, and that password is in the user space gnome-keyring,
> which will be much harder to purge ... although if the keyring were
> using secret memory it would be way easier ...

I think you've misunderstood the problem. Let's try it again.

add-password-to-kernel-keyring
create-encrypted-volume-using-password
write-detailed-confession-to-encrypted-volume
suspend-volume
delete-password-from-kernel-keyring
cat-volume reveals the detailed confession

ie the page cache contains the decrypted data, even though what's on
disc is encrypted. Nothing to do with key management.

Yes, there are various things we can do that will prevent the page
cache from being dropped, but I strongly suggest _not_ registering
your detailed confession with an RDMA card. A 99% solution is better
than a 0% solution.

The tricky part, I think, is that the page cache is not indexed
physically but virtually. We need each inode on the suspended volume
to drop its cache. Dropping the cache of just the bdev is going to
hide the directory structure, inode tables, etc, but the real privacy
gains are to be had from dropping file contents.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-16 15:40 ` Matthew Wilcox @ 2024-01-16 15:54 ` James Bottomley 0 siblings, 0 replies; 27+ messages in thread From: James Bottomley @ 2024-01-16 15:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Jan Kara, Christoph Hellwig On Tue, 2024-01-16 at 15:40 +0000, Matthew Wilcox wrote: > On Tue, Jan 16, 2024 at 10:25:20AM -0500, James Bottomley wrote: > > On Tue, 2024-01-16 at 11:50 +0100, Christian Brauner wrote: > > > So when we say luksSuspend we really mean block layer initiated > > > freeze. The overall goal or expectation of userspace is that > > > after a luksSuspend call all sensitive material has been evicted > > > from relevant caches to harden against various attacks. And > > > luksSuspend does wipe the encryption key and suspend the block > > > device. However, the encryption key can still be available clear- > > > text in the page cache. To illustrate this problem more simply: > > > > > > truncate -s 500M /tmp/img > > > echo password | cryptsetup luksFormat /tmp/img --force-password > > > echo password | cryptsetup open /tmp/img test > > > mkfs.xfs /dev/mapper/test > > > mount /dev/mapper/test /mnt > > > echo "secrets" > /mnt/data > > > cryptsetup luksSuspend test > > > cat /mnt/data > > > > Not really anything to do with the drop caches problem, but luks > > can use the kernel keyring API for this. That should ensure the > > key itself can be shredded on suspend without replication anywhere > > in memory. Of course the real problem is likely that the key has > > or is derived from a password and that password is in the user > > space gnome-keyring, which will be much harder to purge ... > > although if the keyring were using secret memory it would be way > > easier ... > > I think you've misunderstood the problem. Let's try it again. > > add-password-to-kernel-keyring > create-encrypted-volume-using-password > write-detailed-confession-to-encrypted-volume > suspend-volume > delete-password-from-kernel-keyring > cat-volume reveals the detailed confession > > ie the page cache contains the decrypted data, even though what's on > disc is encrypted. Nothing to do with key management. No I didn't; you cut the bit where I referred to that in the second half of my email you don't quote. But my point is that caching key material is by far the biggest security problem because if that happens and it can be recovered, every secret on the disk is toast. Caching clear pages from the disk is a problem, but it's way less severe than caching key material, so making sure the former is solved should be priority number one (because in security you start with the biggest exposure first). I then went on to say that for the second problem, I think making drop all caches actually do that has the best security properties rather than segmented cache dropping. James ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner 2024-01-16 11:45 ` Jan Kara 2024-01-16 15:25 ` James Bottomley @ 2024-01-16 20:56 ` Dave Chinner 2024-01-17 6:17 ` Theodore Ts'o 2024-01-17 13:19 ` Christian Brauner 2024-02-17 4:04 ` Kent Overstreet 3 siblings, 2 replies; 27+ messages in thread From: Dave Chinner @ 2024-01-16 20:56 UTC (permalink / raw) To: Christian Brauner Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote: > Hey, > > I'm not sure this even needs a full LSFMM discussion but since I > currently don't have time to work on the patch I may as well submit it. > > Gnome recently got awared 1M Euro by the Sovereign Tech Fund (STF). The > STF was created by the German government to fund public infrastructure: > > "The Sovereign Tech Fund supports the development, improvement and > maintenance of open digital infrastructure. Our goal is to sustainably > strengthen the open source ecosystem. We focus on security, resilience, > technological diversity, and the people behind the code." (cf. [1]) > > Gnome has proposed various specific projects including integrating > systemd-homed with Gnome. Systemd-homed provides various features and if > you're interested in details then you might find it useful to read [2]. > It makes use of various new VFS and fs specific developments over the > last years. > > One feature is encrypting the home directory via LUKS. An approriate > image or device must contain a GPT partition table. Currently there's > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is > a Linux filesystem. Currently supported are btrfs (see [4] though), > ext4, and xfs. > > The following issue isn't specific to systemd-homed. Gnome wants to be > able to support locking encrypted home directories. For example, when > the laptop is suspended. To do this the luksSuspend command can be used. > > The luksSuspend call is nothing else than a device mapper ioctl to > suspend the block device and it's owning superblock/filesystem. Which in > turn is nothing but a freeze initiated from the block layer: > > dm_suspend() > -> __dm_suspend() > -> lock_fs() > -> bdev_freeze() > > So when we say luksSuspend we really mean block layer initiated freeze. > The overall goal or expectation of userspace is that after a luksSuspend > call all sensitive material has been evicted from relevant caches to > harden against various attacks. And luksSuspend does wipe the encryption > key and suspend the block device. However, the encryption key can still > be available clear-text in the page cache. The wiping of secrets is completely orthogonal to the freezing of the device and filesystem - the freeze does not need to occur to allow the encryption keys and decrypted data to be purged. They should not be conflated; purging needs to be a completely separate operation that can be run regardless of device/fs freeze status. FWIW, focussing on purging the page cache omits the fact that having access to the directory structure is a problem - one can still retrieve other user information that is stored in metadata (e.g. xattrs) that isn't part of the page cache. Even the directory structure that is cached in dentries could reveal secrets someone wants to keep hidden (e.g code names for operations/products). 
So if we want luksSuspend to actually protect user information when it runs, then it effectively needs to bring the filesystem right back to its "just mounted" state where the only thing in memory is the root directory dentry and inode and nothing else. And, of course, this is largely impossible to do because anything with an open file on the filesystem will prevent this robust cache purge from occurring.... Which brings us back to "best effort" only, and at this point we already have drop-caches.... Mind you, I do wonder if drop caches is fast enough for this sort of use case. It is single threaded, and if the filesystem/system has millions of cached inodes it can take minutes to run. Unmount has the same problem - purging large dentry/inode caches takes a *lot* of CPU time and these operations are single threaded. So it may not be practical in the luks context to purge caches e.g. suspending a laptop shouldn't take minutes. However laptops are getting to the hundreds of GB of RAM these days and so they can cache millions of inodes, so cache purge runtime is definitely a consideration here. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
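Dave's runtime worry is easy to check empirically, since the existing big hammer is just a procfs write - a quick sketch that times it (needs root; "3" drops pagecache plus dentries/inodes, and only clean, unpinned objects go away, so running sync first gives the most honest number):

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            struct timespec a, b;
            int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            clock_gettime(CLOCK_MONOTONIC, &a);
            if (write(fd, "3", 1) != 1)
                    perror("write");
            clock_gettime(CLOCK_MONOTONIC, &b);
            close(fd);

            printf("drop_caches took %.3f s\n",
                   (double)(b.tv_sec - a.tv_sec) +
                   (double)(b.tv_nsec - a.tv_nsec) / 1e9);
            return 0;
    }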
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-16 20:56 ` Dave Chinner @ 2024-01-17 6:17 ` Theodore Ts'o 2024-01-30 1:14 ` Adrian Vovk 2024-01-17 13:19 ` Christian Brauner 1 sibling, 1 reply; 27+ messages in thread From: Theodore Ts'o @ 2024-01-17 6:17 UTC (permalink / raw) To: Dave Chinner Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote: > > The wiping of secrets is completely orthogonal to the freezing of > the device and filesystem - the freeze does not need to occur to > allow the encryption keys and decrypted data to be purged. They > should not be conflated; purging needs to be a completely separate > operation that can be run regardless of device/fs freeze status. > > FWIW, focussing on purging the page cache omits the fact that > having access to the directory structure is a problem - one can > still retrieve other user information that is stored in metadata > (e.g. xattrs) that isn't part of the page cache. Even the directory > structure that is cached in dentries could reveal secrets someone > wants to keep hidden (e.g code names for operations/products). Yeah, I think we need to really revisit the implicit requirements which were made upfront about wanting to protect against the page cache being exposed. What is the threat model that you are trying to protect against? If the attacker has access to the memory of the suspended processor, then the number of things you need to protect against becomes *vast*. For one thing, if you're going to blow away the LUKS encryption on suspend, then during the resume process, *before* you allow general user processes to start running again (when they might try to read from the file system whose encryption key is no longer available, and thus will be treated to EIO errors), you're going to have to request that the user provide the encryption key, either directly or indirectly. And if the attacker has access to the suspended memory, is it read-only access, or can the attacker modify the memory image to include a trojan that records the encryption key once it is demanded of the user, and then mails it off to Moscow or Beijing or Fort Meade? To address the whole set of problems, it might be that the answer lies in something like confidential compute, where all of the memory is encrypted. Now you don't need to worry about wiping the page cache, since it's all encrypted. Of course, you still need to solve the problem of how to re-establish the confidential compute keys after it has been wiped as part of the suspend, but you needed to solve that with the LUKS key anyway. This also addresses Dave's concern that it might not be practical to drop all of the caches if there are millions of cached inodes and cached pages that all need to be dropped at suspend time. Another potential approach is a bit more targeted, which is to mark certain files as containing keying information, so the system can focus on making sure those pages are wiped at suspend time. It still has issues, such as how the desire to wipe them from the memory at suspend time interacts with mlock(), which is often done by programs to prevent them from getting written to swap. And of course, we still need to worry about what to do if the file is pinned because it's being accessed by RDMA or by sendfile(2) --- but perhaps a keyfile has no business of being accessed via RDMA or blasted out (unencrypted!)
at high speed to a network connection via sendfile(2) --- and so perhaps those sorts of things should be disallowed if the file is marked as "this file contains secret keys --- treat it specially". - Ted ^ permalink raw reply [flat|nested] 27+ messages in thread
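On the userspace side, the discipline Ted alludes to for key material is well established - a sketch of the usual pattern (buffer name and size are illustrative; explicit_bzero() is glibc 2.25+, and unlike a plain memset() it cannot be optimised away by the compiler):

    #include <string.h>
    #include <sys/mman.h>

    #define KEY_LEN 64  /* illustrative */

    static unsigned char key[KEY_LEN];

    void key_acquire(void)
    {
            mlock(key, sizeof(key));  /* keep it out of swap */
            /* ... fill 'key' from the KDF or kernel keyring ... */
    }

    void key_wipe_on_suspend(void)
    {
            explicit_bzero(key, sizeof(key));  /* survives optimisation */
            munlock(key, sizeof(key));
    }

The open question in this thread is precisely that no analogous discipline exists for page cache copies of file contents, which userspace never sees or controls.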
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-17 6:17 ` Theodore Ts'o @ 2024-01-30 1:14 ` Adrian Vovk 0 siblings, 0 replies; 27+ messages in thread From: Adrian Vovk @ 2024-01-30 1:14 UTC (permalink / raw) To: Theodore Ts'o, Dave Chinner Cc: Christian Brauner, lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig On 1/17/24 01:17, Theodore Ts'o wrote: > What is the threat model that you are trying to protect against? If > the attacker has access to the memory of the suspended processor, then > the number of things you need to protect against becomes *vast*. For one > thing, if you're going to blow away the LUKS encryption on suspend, > then during the resume process, *before* you allow general user > processes to start running again (when they might try to read from the > file system whose encryption key is no longer available, and thus will > be treated to EIO errors), you're going to have to request that the user > provide the encryption key, either directly or indirectly. The threat we have in mind is cold-boot attacks, same as the threat that dm-crypt protects against when it lets us wipe the LUKS volume key. We want to limit the amount of plain-text user data an attacker can acquire from a suspended system. As I mention elsewhere in this thread, the key word for me is "limit". I'm not expecting perfect security, but I'd like most plaintext file contents to be removed from memory on suspend so that an attacker cannot access most recently accessed files. Ideally it would be "all" not "most", of course, but I'll happily take what's feasible. > And if the attacker has access to the suspended memory, is it > read-only access, or can the attacker modify the memory image to > include a trojan that records the encryption key once it is demanded of > the user, and then mails it off to Moscow or Beijing or Fort Meade? Yes, it's read-only access. If the attacker has write access to the memory image while the system is suspended then it's complete game-over on all fronts. At that point they can completely replace the kernel if they so choose. This is not something I expect to be able to defend against outside of the solutions you mention, but those are not feasible on commodity consumer hardware. I'm looking to achieve the best we can with what we have. This is also not an attack I've heard of in the wild against consumer hardware; I know it's possible because I know people who've done it, but it takes many weeks (at least) of research and effort to prepare for a given chip - definitely not as easy as a cold-boot attack which can take seconds and works pretty universally. > To address the whole set of problems, it might be that the answer > lies in something like confidential compute, where all of the > memory is encrypted. Now you don't need to worry about wiping the page > cache, since it's all encrypted. Of course, you still need to solve > the problem of how to re-establish the confidential compute keys after > it has been wiped as part of the suspend, but you needed to solve that > with the LUKS key anyway. Without special hardware support you'll need to re-establish keys via unencrypted software, and unencrypted software can be replaced by an attacker if they're able to write to RAM. So it doesn't solve the problem you bring up. But anyway I feel this part of the discussion is starting to border on theoretical...
Though I suppose encrypting all the memory belonging to just the one user with that user's LUKS volume key could be an alternative solution. That way wiping out the key has the effect of "wiping out" all the user's related memory, at least until we can re-authenticate and bring it all back. But I suspect this would not only be extremely difficult to implement in the kernel but would also have a huge performance cost without special hardware. > Another potential approach is a bit more targeted, which is to mark > certain files as containing keying information, so the system can > focus on making sure those pages are wiped at suspend time. It still > has issues, such as how the desire to wipe them from the memory at > suspend time interacts with mlock(), which is often done by programs > to prevent them from getting written to swap. And of course, we still > need to worry about what to do if the file is pinned because it's > being accessed by RDMA or by sendfile(2) --- but perhaps a keyfile has > no business of being accessed via RDMA or blasted out (unencrypted!) > at high speed to a network connection via sendfile(2) --- and so > perhaps those sorts of things should be disallowed if the file is > marked as "this file contains secret keys --- treat it specially". Secret keys are not what we're trying to protect here necessarily. Random user documents are often sensitive. People store tax documents, corporate secrets, or any number of other sensitive things on their computers. If an attacker can perform a cold boot attack on the device then depending on how recently these tax documents or corporate secrets were accessed they might just be in memory in plain text, which is not good. No amount of protecting the keys prevents this. That said, having an extra security layer for secret keys would be useful. There are definitely files that contain sensitive data, and it would be useful to tell the kernel which files those are so that it can treat them extra carefully in the ways you suggest. Maybe even avoid putting them in plain text into the page cache? But this would have to be an extra step, since it's not feasible to make the user mark all the files that they consider to be sensitive. Adrian ^ permalink raw reply [flat|nested] 27+ messages in thread
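One existing mechanism that points in the direction Adrian sketches at the end is memfd_secret(2) (Linux 5.14+ with CONFIG_SECRETMEM): its pages are removed from the kernel's direct map, so the plaintext never sits in the ordinary page cache and is invisible to most kernel-level memory dumps. A minimal sketch with error handling trimmed (SYS_memfd_secret needs reasonably recent kernel headers):

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = (size_t)sysconf(_SC_PAGESIZE);
            int fd = (int)syscall(SYS_memfd_secret, 0);
            char *p;

            if (fd < 0 || ftruncate(fd, (off_t)len) < 0)
                    return 1;
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;

            strcpy(p, "sensitive bytes");  /* never in the direct map */
            /* ... use the secret, then wipe and unmap ... */
            memset(p, 0, len);
            munmap(p, len);
            close(fd);
            return 0;
    }

As Adrian says, though, this only helps for data an application knowingly marks as sensitive; it does nothing for ordinary documents read through the normal page cache path.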
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-16 20:56 ` Dave Chinner 2024-01-17 6:17 ` Theodore Ts'o @ 2024-01-17 13:19 ` Christian Brauner 2024-01-17 22:26 ` Dave Chinner 2024-02-05 17:39 ` Russell Haley 1 sibling, 2 replies; 27+ messages in thread From: Christian Brauner @ 2024-01-17 13:19 UTC (permalink / raw) To: Dave Chinner Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote: > On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote: > > Hey, > > > > I'm not sure this even needs a full LSFMM discussion but since I > > currently don't have time to work on the patch I may as well submit it. > > > > Gnome recently got awared 1M Euro by the Sovereign Tech Fund (STF). The > > STF was created by the German government to fund public infrastructure: > > > > "The Sovereign Tech Fund supports the development, improvement and > > maintenance of open digital infrastructure. Our goal is to sustainably > > strengthen the open source ecosystem. We focus on security, resilience, > > technological diversity, and the people behind the code." (cf. [1]) > > > > Gnome has proposed various specific projects including integrating > > systemd-homed with Gnome. Systemd-homed provides various features and if > > you're interested in details then you might find it useful to read [2]. > > It makes use of various new VFS and fs specific developments over the > > last years. > > > > One feature is encrypting the home directory via LUKS. An approriate > > image or device must contain a GPT partition table. Currently there's > > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is > > a Linux filesystem. Currently supported are btrfs (see [4] though), > > ext4, and xfs. > > > > The following issue isn't specific to systemd-homed. Gnome wants to be > > able to support locking encrypted home directories. For example, when > > the laptop is suspended. To do this the luksSuspend command can be used. > > > > The luksSuspend call is nothing else than a device mapper ioctl to > > suspend the block device and it's owning superblock/filesystem. Which in > > turn is nothing but a freeze initiated from the block layer: > > > > dm_suspend() > > -> __dm_suspend() > > -> lock_fs() > > -> bdev_freeze() > > > > So when we say luksSuspend we really mean block layer initiated freeze. > > The overall goal or expectation of userspace is that after a luksSuspend > > call all sensitive material has been evicted from relevant caches to > > harden against various attacks. And luksSuspend does wipe the encryption > > key and suspend the block device. However, the encryption key can still > > be available clear-text in the page cache. > > The wiping of secrets is completely orthogonal to the freezing of > the device and filesystem - the freeze does not need to occur to > allow the encryption keys and decrypted data to be purged. They > should not be conflated; purging needs to be a completely separate > operation that can be run regardless of device/fs freeze status. Yes, I'm aware. I didn't mean to imply that these things are in any way necessarily connected. Just that there are use-cases where they are. And the encrypted home directory case is one. One froze the block device and filesystem one would now also like to drop the page cache which has most of the interesting data. 
The fact that after a block layer initiated freeze - again mostly a device mapper problem - one may or may not be able to successfully read from the filesystem is annoying. Of course one can't write, that will hang one immediately. But if one still has some data in the page cache one can still dump the contents of that file. That's at least odd behavior from a users POV even if for us it's cleary why that's the case. And a freeze does do a sync_filesystem() and a sync_blockdev() to flush out any dirty data for that specific filesystem. So it would be fitting to give users an api that allows them to also drop the page cache contents. For some use-cases like the Gnome use-case one wants to do a freeze and drop everything that one can from the page cache for that specific filesystem. And drop_caches is a big hammer simply because there are workloads where that isn't feasible. Even on a modern boring laption system one may have lots of services. On a large scale system one may have thousands of services and they may all uses separate images (And the border between isolated services and containers is fuzzy at best.). And here invoking drop_caches penalizes every service. One may want to drop the contents of _some_ services but not all of them. Especially during suspend where one cares about dropping the page cache of the home directory that gets suspended - encrypted or unencrypted. Ignoring the security aspect itself. Just the fact that one froze the block device and the owning filesystem one may want to go and drop the page cache as well without impacting every other filesystem on the system. Which may be thousands. One doesn't want to penalize them all. Ignoring the specific use-case I know that David has been interested in a way to drop the page cache for afs. So this is not just for the home directory case. I mostly wanted to make it clear that there are users of an interface like this; even if it were just best effort. > > FWIW, focussing on purging the page cache omits the fact that > having access to the directory structure is a problem - one can > still retrieve other user information that is stored in metadata > (e.g. xattrs) that isn't part of the page cache. Even the directory > structure that is cached in dentries could reveal secrets someone > wants to keep hidden (e.g code names for operations/products). Yes, of course but that's fine. The most sensitive data and the biggest chunks of data will be the contents of files. We don't necessarily need to cater to the paranoid with this. > > So if we want luksSuspend to actually protect user information when > it runs, then it effectively needs to bring the filesystem right > back to it's "just mounted" state where the only thing in memory is > the root directory dentry and inode and nothing else. Yes, which we know isn't feasible. > > And, of course, this is largely impossible to do because anything > with an open file on the filesystem will prevent this robust cache > purge from occurring.... > > Which brings us back to "best effort" only, and at this point we > already have drop-caches.... > > Mind you, I do wonder if drop caches is fast enough for this sort of > use case. It is single threaded, and if the filesystem/system has > millions of cached inodes it can take minutes to run. Unmount has > the same problem - purging large dentry/inode caches takes a *lot* > of CPU time and these operations are single threaded. > > So it may not be practical in the luks context to purge caches e.g. > suspending a laptop shouldn't take minutes. 
However laptops are > getting to the hundreds of GB of RAM these days and so they can > cache millions of inodes, so cache purge runtime is definitely a > consideration here. I'm really trying to look for a practical api that doesn't require users to drop the caches for every mounted image on the system. FYI, I've tried to get some users to reply here so they could speak to the fact that they don't expect this to be an optimal solution but none of them know how to reply to lore mboxes so I can just relay information. ^ permalink raw reply [flat|nested] 27+ messages in thread
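For comparison, the closest userspace can get today without a new interface is to walk the filesystem and issue per-file advice - which illustrates the gap Christian describes: it opens every file, drops only clean unmapped pages, and leaves dentries and inodes untouched. A sketch:

    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <ftw.h>
    #include <stdio.h>
    #include <unistd.h>

    static int drop_one(const char *path, const struct stat *sb,
                        int type, struct FTW *ftwbuf)
    {
            (void)sb; (void)ftwbuf;
            if (type == FTW_F) {
                    int fd = open(path, O_RDONLY | O_NOFOLLOW);
                    if (fd >= 0) {
                            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
                            close(fd);
                    }
            }
            return 0;  /* keep walking */
    }

    int main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                    return 1;
            }
            /* FTW_MOUNT: stay within this one filesystem */
            return nftw(argv[1], drop_one, 64, FTW_PHYS | FTW_MOUNT) ? 1 : 0;
    }

This is exactly the kind of per-inode, single-threaded walk Dave worries about, done even less efficiently from userspace - hence the appeal of a single per-superblock call.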
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-17 13:19 ` Christian Brauner @ 2024-01-17 22:26 ` Dave Chinner 2024-01-18 14:09 ` Christian Brauner 2024-02-05 17:39 ` Russell Haley 1 sibling, 1 reply; 27+ messages in thread From: Dave Chinner @ 2024-01-17 22:26 UTC (permalink / raw) To: Christian Brauner Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk On Wed, Jan 17, 2024 at 02:19:43PM +0100, Christian Brauner wrote: > On Wed, Jan 17, 2024 at 07:56:01AM +1100, Dave Chinner wrote: > > On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote: > > > Hey, > > > > > > I'm not sure this even needs a full LSFMM discussion but since I > > > currently don't have time to work on the patch I may as well submit it. > > > > > > Gnome recently got awared 1M Euro by the Sovereign Tech Fund (STF). The > > > STF was created by the German government to fund public infrastructure: > > > > > > "The Sovereign Tech Fund supports the development, improvement and > > > maintenance of open digital infrastructure. Our goal is to sustainably > > > strengthen the open source ecosystem. We focus on security, resilience, > > > technological diversity, and the people behind the code." (cf. [1]) > > > > > > Gnome has proposed various specific projects including integrating > > > systemd-homed with Gnome. Systemd-homed provides various features and if > > > you're interested in details then you might find it useful to read [2]. > > > It makes use of various new VFS and fs specific developments over the > > > last years. > > > > > > One feature is encrypting the home directory via LUKS. An approriate > > > image or device must contain a GPT partition table. Currently there's > > > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is > > > a Linux filesystem. Currently supported are btrfs (see [4] though), > > > ext4, and xfs. > > > > > > The following issue isn't specific to systemd-homed. Gnome wants to be > > > able to support locking encrypted home directories. For example, when > > > the laptop is suspended. To do this the luksSuspend command can be used. > > > > > > The luksSuspend call is nothing else than a device mapper ioctl to > > > suspend the block device and it's owning superblock/filesystem. Which in > > > turn is nothing but a freeze initiated from the block layer: > > > > > > dm_suspend() > > > -> __dm_suspend() > > > -> lock_fs() > > > -> bdev_freeze() > > > > > > So when we say luksSuspend we really mean block layer initiated freeze. > > > The overall goal or expectation of userspace is that after a luksSuspend > > > call all sensitive material has been evicted from relevant caches to > > > harden against various attacks. And luksSuspend does wipe the encryption > > > key and suspend the block device. However, the encryption key can still > > > be available clear-text in the page cache. > > > > The wiping of secrets is completely orthogonal to the freezing of > > the device and filesystem - the freeze does not need to occur to > > allow the encryption keys and decrypted data to be purged. They > > should not be conflated; purging needs to be a completely separate > > operation that can be run regardless of device/fs freeze status. > > Yes, I'm aware. I didn't mean to imply that these things are in any way > necessarily connected. Just that there are use-cases where they are. And > the encrypted home directory case is one. 
Once one has frozen the block device and > filesystem, one would now also like to drop the page cache which has most > of the interesting data. > > The fact that after a block layer initiated freeze - again mostly a > device mapper problem - one may or may not be able to successfully read > from the filesystem is annoying. Of course one can't write, that will > hang one immediately. But if one still has some data in the page cache > one can still dump the contents of that file. That's at least odd > behavior from a user's POV even if for us it's clear why that's the > case. A frozen filesystem doesn't prevent read operations from occurring. > And a freeze does do a sync_filesystem() and a sync_blockdev() to flush > out any dirty data for that specific filesystem. Yes, it's required to do that - the whole point of freezing a filesystem is to bring the filesystem into a *consistent physical state on persistent storage* and to hold it in that state until it is thawed. > So it would be fitting > to give users an api that allows them to also drop the page cache > contents. Not as part of a freeze operation. Read operations have *always* been allowed from frozen filesystems; they are intended to be allowed because one of the use cases for freezing is to create a consistent filesystem state for backup of the filesystem. That requires everything in the filesystem can be read whilst it is frozen, and that means the page cache needs to remain operational. What the underlying device allows when it has been *suspended* is a different issue altogether. The key observation here is that storage device suspend != filesystem freeze and they can have very different semantics depending on the operation being performed on the block device while it is suspended. IOWs, a device suspend implementation might freeze the filesystem to bring the contents of the storage device whilst frozen into a consistent, uptodate state (e.g. for device level backups), but block device level suspend does not *require* that the filesystem is frozen whilst the device IO operations are suspended. > For some use-cases like the Gnome use-case one wants to do a freeze and > drop everything that one can from the page cache for that specific > filesystem. So they have to do an extra system call between FS_IOC_FREEZE and FS_IOC_THAW. What's the problem with that? What are you trying to optimise by colliding cache purging with FS_IOC_FREEZE? If the user/application/infrastructure already has to iterate all the mounted filesystems to freeze them, then it's trivial for them to add a cache purging step to that infrastructure for the storage configurations that might need it. I just don't see why this needs to be part of a block device freeze operation, especially as the "purge caches on this filesystem" operation has potential use cases outside of the luksSuspend context.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
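The sequencing Dave suggests, expressed as a userspace sketch: freeze, purge, thaw as three separate operations. The VFS spells the freeze ioctls FIFREEZE/FITHAW in linux/fs.h (CAP_SYS_ADMIN required); the purge step in the middle is the missing piece this thread is about, so it is only a placeholder here:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char **argv)
    {
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                    return 1;
            }
            fd = open(argv[1], O_RDONLY);  /* the filesystem's root dir */
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (ioctl(fd, FIFREEZE, 0) < 0)
                    perror("FIFREEZE");

            /* ... hypothetical per-fs cache purge would go here ... */

            if (ioctl(fd, FITHAW, 0) < 0)
                    perror("FITHAW");
            close(fd);
            return 0;
    }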
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-17 22:26 ` Dave Chinner @ 2024-01-18 14:09 ` Christian Brauner 0 siblings, 0 replies; 27+ messages in thread From: Christian Brauner @ 2024-01-18 14:09 UTC (permalink / raw) To: Dave Chinner Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk > > The fact that after a block layer initiated freeze - again mostly a > > device mapper problem - one may or may not be able to successfully read > > from the filesystem is annoying. Of course one can't write, that will > > hang one immediately. But if one still has some data in the page cache > > one can still dump the contents of that file. That's at least odd > > behavior from a user's POV even if for us it's clear why that's the > > case. > > A frozen filesystem doesn't prevent read operations from occurring. Yes, that's what I was saying. I'm not disputing that. > > > And a freeze does do a sync_filesystem() and a sync_blockdev() to flush > > out any dirty data for that specific filesystem. > > Yes, it's required to do that - the whole point of freezing a > filesystem is to bring the filesystem into a *consistent physical > state on persistent storage* and to hold it in that state until it > is thawed. > > > So it would be fitting > > to give users an api that allows them to also drop the page cache > > contents. > > Not as part of a freeze operation. Yes, that's why I'd like to have a separate flag, e.g., for fadvise. > > For some use-cases like the Gnome use-case one wants to do a freeze and > > drop everything that one can from the page cache for that specific > > filesystem. > > So they have to do an extra system call between FS_IOC_FREEZE and > FS_IOC_THAW. What's the problem with that? What are you trying to > optimise by colliding cache purging with FS_IOC_FREEZE? > > If the user/application/infrastructure already has to iterate all > the mounted filesystems to freeze them, then it's trivial for them > to add a cache purging step to that infrastructure for the storage > configurations that might need it. I just don't see why this needs > to be part of a block device freeze operation, especially as the > "purge caches on this filesystem" operation has potential use cases > outside of the luksSuspend context.... Ah, I'm sorry, I think we're accidentally talking past each other... I'm _not_ trying to tie block layer freezing and cache purging. I'm trying to expose something like: posix_fadvise(fs_fd, [...], POSIX_FADV_FS_DONTNEED/DROP); The Gnome people could then do: cryptsetup luksSuspend posix_fadvise(fs_fd, [...], POSIX_FADV_FS_DONTNEED/DROP); as two separate operations. Because the drop-the-caches step is useful to other users as well; completely independent of the block layer freeze that I used to motivate this. ^ permalink raw reply [flat|nested] 27+ messages in thread
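Fleshing Christian's pseudo-code out into something compilable - to be clear, POSIX_FADV_FS_DONTNEED is hypothetical and exists in no kernel; the flag value below is made up for illustration, and a current kernel simply returns EINVAL for unknown advice:

    #include <fcntl.h>
    #include <unistd.h>

    #ifndef POSIX_FADV_FS_DONTNEED
    #define POSIX_FADV_FS_DONTNEED 7  /* hypothetical, not in any kernel */
    #endif

    /* Drop the page cache of the whole filesystem backing 'mountpoint'. */
    int drop_fs_cache(const char *mountpoint)
    {
            int fd = open(mountpoint, O_RDONLY | O_DIRECTORY);
            int ret;

            if (fd < 0)
                    return -1;
            /* offset/len would be meaningless for a whole-fs operation */
            ret = posix_fadvise(fd, 0, 0, POSIX_FADV_FS_DONTNEED);
            close(fd);
            return ret;  /* 0, or an errno value (EINVAL today) */
    }

The luksSuspend flow would then be the two separate steps Christian lists: suspend the device, then call drop_fs_cache() on the frozen filesystem's mount point.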
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-17 13:19 ` Christian Brauner 2024-01-17 22:26 ` Dave Chinner @ 2024-02-05 17:39 ` Russell Haley 1 sibling, 0 replies; 27+ messages in thread From: Russell Haley @ 2024-02-05 17:39 UTC (permalink / raw) To: Christian Brauner, Dave Chinner Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig, adrianvovk On 1/17/24 07:19, Christian Brauner wrote: > And drop_caches is a big hammer simply because there are workloads where > that isn't feasible. Even on a modern boring laptop system one may have > lots of services. On a large scale system one may have thousands of > services and they may all use separate images (And the border between > isolated services and containers is fuzzy at best.). And here invoking > drop_caches penalizes every service. > > One may want to drop the contents of _some_ services but not all of > them. Especially during suspend where one cares about dropping the page > cache of the home directory that gets suspended - encrypted or > unencrypted. > > Ignoring the security aspect itself: having frozen the > block device and the owning filesystem, one may want to go and drop the > page cache as well without impacting every other filesystem on the > system. Which may be thousands. One doesn't want to penalize them all. I'm not following the problem with dropping all the caches, at least for the suspend use case rather than quick user switching. Suspend takes all the services on the machine offline for hundreds of milliseconds minimum. If they don't hit the ground running... so what? drop_caches=3 gets the metadata too, I think, which should protect the directory structure. >> >> FWIW, focussing on purging the page cache omits the fact that >> having access to the directory structure is a problem - one can >> still retrieve other user information that is stored in metadata >> (e.g. xattrs) that isn't part of the page cache. Even the directory >> structure that is cached in dentries could reveal secrets someone >> wants to keep hidden (e.g code names for operations/products). > > Yes, of course but that's fine. The most sensitive data and the biggest > chunks of data will be the contents of files. We don't necessarily need > to cater to the paranoid with this. > If actual security is not required, maybe look into whatever Android is doing? As far as I know it has a similar use pattern and threat model (wifi passwords, session cookies, and credit card numbers matter; exposing high-entropy metadata that probably uniquely identifies files to anyone who has seen the same data elsewhere is fine). But then, perhaps what Android does is nothing, relying on locked bootloaders and device-specific kernels to make booting into a generic memory dumper sufficiently difficult. >> >> So if we want luksSuspend to actually protect user information when >> it runs, then it effectively needs to bring the filesystem right >> back to its "just mounted" state where the only thing in memory is >> the root directory dentry and inode and nothing else. > > Yes, which we know isn't feasible. > >> >> And, of course, this is largely impossible to do because anything >> with an open file on the filesystem will prevent this robust cache >> purge from occurring.... >> >> Which brings us back to "best effort" only, and at this point we >> already have drop-caches.... >> >> Mind you, I do wonder if drop caches is fast enough for this sort of >> use case.
It is single threaded, and if the filesystem/system has >> millions of cached inodes it can take minutes to run. Unmount has >> the same problem - purging large dentry/inode caches takes a *lot* >> of CPU time and these operations are single threaded. >> >> So it may not be practical in the luks context to purge caches e.g. >> suspending a laptop shouldn't take minutes. However laptops are >> getting to the hundreds of GB of RAM these days and so they can >> cache millions of inodes, so cache purge runtime is definitely a >> consideration here. > > I'm really trying to look for a practical api that doesn't require users > to drop the caches for every mounted image on the system. > > FYI, I've tried to get some users to reply here so they could speak to > the fact that they don't expect this to be an optimal solution but none > of them know how to reply to lore mboxes so I can just relay > information. > User replying here :-) One possible alternative would be to use suspend-to-encrypted-swap instead of suspend-to-RAM. It feels like it was left to rot as memory sizes kept growing and disk speeds didn't, but that trend has reversed. NVMe SSDs can write several GB/s if fed properly. And hasn't there been a recent push for authenticated hibernation images? That would also protect the application memory, which could be quite sensitive I think because of session cookies, oauth tokens, and the like. I assume that a sophisticated adversary with access to a memory image of my logged-in PC would be able to read my email and impersonate me for at least a week. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Dropping page cache of individual fs 2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner ` (2 preceding siblings ...) 2024-01-16 20:56 ` Dave Chinner @ 2024-02-17 4:04 ` Kent Overstreet 3 siblings, 0 replies; 27+ messages in thread From: Kent Overstreet @ 2024-02-17 4:04 UTC (permalink / raw) To: Christian Brauner Cc: lsf-pc, linux-fsdevel, linux-mm, linux-btrfs, linux-block, Matthew Wilcox, Jan Kara, Christoph Hellwig On Tue, Jan 16, 2024 at 11:50:32AM +0100, Christian Brauner wrote: > Hey, > > I'm not sure this even needs a full LSFMM discussion but since I > currently don't have time to work on the patch I may as well submit it. > > Gnome recently got awared 1M Euro by the Sovereign Tech Fund (STF). The > STF was created by the German government to fund public infrastructure: > > "The Sovereign Tech Fund supports the development, improvement and > maintenance of open digital infrastructure. Our goal is to sustainably > strengthen the open source ecosystem. We focus on security, resilience, > technological diversity, and the people behind the code." (cf. [1]) > > Gnome has proposed various specific projects including integrating > systemd-homed with Gnome. Systemd-homed provides various features and if > you're interested in details then you might find it useful to read [2]. > It makes use of various new VFS and fs specific developments over the > last years. > > One feature is encrypting the home directory via LUKS. An approriate > image or device must contain a GPT partition table. Currently there's > only one partition which is a LUKS2 volume. Inside that LUKS2 volume is > a Linux filesystem. Currently supported are btrfs (see [4] though), > ext4, and xfs. > > The following issue isn't specific to systemd-homed. Gnome wants to be > able to support locking encrypted home directories. For example, when > the laptop is suspended. To do this the luksSuspend command can be used. > > The luksSuspend call is nothing else than a device mapper ioctl to > suspend the block device and it's owning superblock/filesystem. Which in > turn is nothing but a freeze initiated from the block layer: > > dm_suspend() > -> __dm_suspend() > -> lock_fs() > -> bdev_freeze() > > So when we say luksSuspend we really mean block layer initiated freeze. > The overall goal or expectation of userspace is that after a luksSuspend > call all sensitive material has been evicted from relevant caches to > harden against various attacks. And luksSuspend does wipe the encryption > key and suspend the block device. However, the encryption key can still > be available clear-text in the page cache. To illustrate this problem > more simply: > > truncate -s 500M /tmp/img > echo password | cryptsetup luksFormat /tmp/img --force-password > echo password | cryptsetup open /tmp/img test > mkfs.xfs /dev/mapper/test > mount /dev/mapper/test /mnt > echo "secrets" > /mnt/data > cryptsetup luksSuspend test > cat /mnt/data > > This will still happily print the contents of /mnt/data even though the > block device and the owning filesystem are frozen because the data is > still in the page cache. > > To my knowledge, the only current way to get the contents of /mnt/data > or the encryption key out of the page cache is via > /proc/sys/vm/drop_caches which is a big hammer. > > My initial reaction is to give userspace an API to drop the page cache > of a specific filesystem which may have additional uses. 
I initially had > started drafting an ioctl() and then got swayed towards a > posix_fadvise() flag. I found out that this was already proposed a few > years ago but got rejected as it was suspected this might just be > someone toying around without a real world use-case. I think this here > might qualify as a real-world use-case. > > This may at least help securing users with a regular dm-crypt setup > where dm-crypt is the top layer. Users that stack additional layers on > top of dm-crypt may still leak plaintext of course if they introduce > additional caching. But that's on them. > > Of course other ideas welcome. This isn't entirely unlike snapshot deletion, where we also need to shoot down the pagecache. Technically, the code I have now for snapshot deletion isn't quite what I want; snapshot deletion probably wants something closer to revoke() instead of waiting for files to be closed. But maybe the code I have is close to what you need - maybe we could turn this into a common shared API? https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs.c#n1569 The need for page zeroing is pretty orthogonal; if you want page zeroing you want that enabled for all page cache folios at all times. ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2024-02-17 4:04 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-01-16 10:50 [LSF/MM/BPF TOPIC] Dropping page cache of individual fs Christian Brauner 2024-01-16 11:45 ` Jan Kara 2024-01-17 12:53 ` Christian Brauner 2024-01-17 14:35 ` Jan Kara 2024-01-17 14:52 ` Matthew Wilcox 2024-01-17 20:51 ` Phillip Susi 2024-01-17 20:58 ` Matthew Wilcox 2024-01-18 14:26 ` Christian Brauner 2024-01-30 0:13 ` Adrian Vovk 2024-02-15 13:57 ` Jan Kara 2024-02-15 19:46 ` Adrian Vovk 2024-02-15 23:17 ` Dave Chinner 2024-02-16 1:14 ` Adrian Vovk 2024-02-16 20:38 ` init_on_alloc digression: " John Hubbard 2024-02-16 21:11 ` Adrian Vovk 2024-02-16 21:19 ` John Hubbard 2024-01-16 15:25 ` James Bottomley 2024-01-16 15:40 ` Matthew Wilcox 2024-01-16 15:54 ` James Bottomley 2024-01-16 20:56 ` Dave Chinner 2024-01-17 6:17 ` Theodore Ts'o 2024-01-30 1:14 ` Adrian Vovk 2024-01-17 13:19 ` Christian Brauner 2024-01-17 22:26 ` Dave Chinner 2024-01-18 14:09 ` Christian Brauner 2024-02-05 17:39 ` Russell Haley 2024-02-17 4:04 ` Kent Overstreet