public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* deadlock when swapping to encrypted swapfile
@ 2025-09-08 18:27 Robert Beckett
  2025-09-08 19:56 ` Mikulas Patocka
  2025-09-09 14:37 ` Mikulas Patocka
  0 siblings, 2 replies; 9+ messages in thread
From: Robert Beckett @ 2025-09-08 18:27 UTC (permalink / raw)
  To: dm-devel, linux-block, Mikulas Patocka; +Cc: kernel

[-- Attachment #1: Type: text/plain, Size: 1834 bytes --]

Hi,

While testing resiliency of encrypted swap using dmcrypt we encounter easily reproducible deadlocks.
The setup is a simple 1GB encrypted swap file [1] with a little mem chewer program [2] to consume all ram.

Usually the first run will oomkill the memchewer successfully.
However, after 1-3 runs typically, it will deadlock the machine.

Using softdog and the lockup detectors, it looks like [3] the dmcrypt_write thread
is stuck for over 2 minutes while everything else is waiting on the swap bio limiter [4].

I wondered whether it might be hitting tag exhaustion in blk_mq_get_tag, but adding trace debug and
enabling the block trace events seems to suggest that generally progress is being made [5].

Also note lockdep doesn't complain.

Looks to me like a soft lockup, possibly due to swap out hitting a similar or the same issue as [4], but
not self-inflicted this time. However, once general memory exhaustion occurs, it seems to result
in the same issue.

I'm not intimately familiar with the dm and block-mq code, so I'd appreciate any help in
debugging it further or a fix.
I guess the main question is: why doesn't it oomkill? oomkill seems like a
sensible action in this scenario. Any advice on making oomkill more reliable here?
Would [4] need to be tweaked in any way for swap files vs partitions?

Thanks

Bob


[1] Swap file setup
```
$ swapoff /home/swapfile
$ echo 'swap /home/swapfile /dev/urandom swap,cipher=aes-cbc-essiv:sha256,size=256' >> /etc/crypttab
$ systemctl daemon-reload
$ systemctl start systemd-cryptsetup@swap.service
$ swapon /dev/mapper/swap
```

[2] See attached memchewer.c
[3] See attached dmesg-pstore.202509081711-0.bz2
[4] https://lore.kernel.org/dm-devel/alpine.LRH.2.02.2102101518320.18125@file01.intranet.prod.int.rdu2.redhat.com/
[5] See attached dmesg-pstore.202509081649-0.bz2

[-- Attachment #2: memchewer.c --]
[-- Type: application/octet-stream, Size: 525 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

const int mb = 512;        /* allocation chunk size in MB */
const int sleep_ms = 2000; /* delay between allocations */

int main(void)
{
    size_t bufsize = (size_t) mb * 1024 * 1024;
    int count = 0;
    while (1) {
        /* Allocate and dirty another chunk; intentionally never freed,
         * so resident memory grows until RAM and swap are exhausted. */
        char *buf = malloc(bufsize);
        if (buf == NULL) {
            puts("Not enough memory");
            return 1;
        }
        memset(buf, count % 256, bufsize);
        printf("Total memory allocated: %d MB\n", ++count * mb);
        usleep((useconds_t) 1000 * sleep_ms);
    }
}

[-- Attachment #3: dmesg-pstore.202509081711-0.bz2 --]
[-- Type: application/octet-stream, Size: 173643 bytes --]

[-- Attachment #4: dmesg-pstore.202509081649-0.bz2 --]
[-- Type: application/octet-stream, Size: 381885 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: deadlock when swapping to encrypted swapfile
  2025-09-08 18:27 deadlock when swapping to encrypted swapfile Robert Beckett
@ 2025-09-08 19:56 ` Mikulas Patocka
  2025-09-09 14:37 ` Mikulas Patocka
  1 sibling, 0 replies; 9+ messages in thread
From: Mikulas Patocka @ 2025-09-08 19:56 UTC (permalink / raw)
  To: Robert Beckett; +Cc: dm-devel, linux-block, kernel



On Mon, 8 Sep 2025, Robert Beckett wrote:

> Hi,
> 
> While testing resiliency of encrypted swap using dmcrypt we encounter easily reproducible deadlocks.
> The setup is a simple 1GB encrypted swap file [1] with a little mem chewer program [2] to consume all ram.
> 
> Usually the first run will oomkill the memchewer successfully.
> However, after 1-3 runs typically, it will deadlock the machine.
> 
> Using softdog and the lockup detectors, it looks like [3] the dmcrypt_write thread
> is stuck for over 2 minutes while everything else is waiting on the swap bio limiter [4].
> 
> I wondered whether it might be hitting tag exhaustion in blk_mq_get_tag, but adding trace debug and
> enabling the block trace events seems to suggest that generally progress is being made [5].
> 
> Also note lockdep doesn't complain.
> 
> Looks to me like a soft lockup possibly due to swap out hitting similar or same issue as [4] but
> not self inflicted this time. However, once general memory exhaustion occurs, it seems to result
> in the same issue.
> 
> I'm not intimately familiar with the dm and block-mq code, so I'd appreciate any help in
> debugging it further or a fix.
> I guess the main question is: why doesn't it oomkill? oomkill seems like a
> sensible action in this scenario. Any advice on making oomkill more reliable here?
> Would [4] need to be tweaked in any way for swap files vs partition?
> 
> Thanks
> 
> Bob

Hi

What happens if you lower /sys/module/dm_mod/parameters/swap_bios? Does it 
help?
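For example, the limit can be read and changed at runtime (the value 8 below is an arbitrary example; the default varies by kernel version and page size):

```shell
# Current limit on in-flight swap bios per dm device.
cat /sys/module/dm_mod/parameters/swap_bios

# Try a lower limit, e.g. 8 (arbitrary example value).
echo 8 > /sys/module/dm_mod/parameters/swap_bios
```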

The general problem with swapping to an encrypted device is that for each 
swapped-out page, the dm-crypt driver needs to allocate another page that 
holds the encrypted data. So, the harder it tries to swap, the more memory 
it consumes. The device mapper stack uses mempools, so it should keep 
working even under total memory exhaustion, but perhaps some part of the 
kernel doesn't use them and deadlocks. I could try to reproduce it.

Mikulas


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: deadlock when swapping to encrypted swapfile
  2025-09-08 18:27 deadlock when swapping to encrypted swapfile Robert Beckett
  2025-09-08 19:56 ` Mikulas Patocka
@ 2025-09-09 14:37 ` Mikulas Patocka
  2025-09-09 16:50   ` Robert Beckett
  1 sibling, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2025-09-09 14:37 UTC (permalink / raw)
  To: Robert Beckett; +Cc: dm-devel, linux-block, kernel



On Mon, 8 Sep 2025, Robert Beckett wrote:

> Hi,
> 
> While testing resiliency of encrypted swap using dmcrypt we encounter easily reproducible deadlocks.
> The setup is a simple 1GB encrypted swap file [1] with a little mem chewer program [2] to consume all ram.
> 
> [1] Swap file setup
> ```
> $ swapoff /home/swapfile
> $ echo 'swap /home/swapfile /dev/urandom swap,cipher=aes-cbc-essiv:sha256,size=256' >> /etc/crypttab
> $ systemctl daemon-reload
> $ systemctl start systemd-cryptsetup@swap.service
> $ swapon /dev/mapper/swap
> ```

I have tried swapping on an encrypted block device and it worked for me.

I've just realized that you are swapping to a file with the loopback 
driver on top of it and with the dm-crypt device on top of the 
loopback device.

This can't work in principle - the problem is that the filesystem needs to 
allocate memory when you write to it, so it deadlocks when the machine 
runs out of memory and needs to write back some pages. There is no easy 
fix - fixing this would require a major rewrite of the VFS layer.

When you swap to a file directly, the kernel bypasses the filesystem, so 
it should work - but when you put encryption on top of a file, there 
is no way to bypass the filesystem.

So, I suggest creating a partition or a logical volume for swap and 
putting dm-crypt on top of it.
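For reference, such a setup could look roughly like this (untested sketch; the partition name is illustrative, and the plain-mode parameters mirror the crypttab entry from [1]):

```shell
# /dev/nvme0n1p5 stands in for a dedicated swap partition.
# Plain-mode dm-crypt with a random key, as systemd-cryptsetup
# does for /dev/urandom crypttab entries:
cryptsetup open --type plain --key-file /dev/urandom \
    --cipher aes-cbc-essiv:sha256 --key-size 256 \
    /dev/nvme0n1p5 swap

mkswap /dev/mapper/swap
swapon /dev/mapper/swap
```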

Mikulas


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: deadlock when swapping to encrypted swapfile
  2025-09-09 14:37 ` Mikulas Patocka
@ 2025-09-09 16:50   ` Robert Beckett
  2025-09-10 11:26     ` Mikulas Patocka
  0 siblings, 1 reply; 9+ messages in thread
From: Robert Beckett @ 2025-09-09 16:50 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, linux-block, kernel


 ---- On Tue, 09 Sep 2025 15:37:09 +0100  Mikulas Patocka <mpatocka@redhat.com> wrote --- 
 > 
 > 
 > On Mon, 8 Sep 2025, Robert Beckett wrote:
 > 
 > > Hi,
 > > 
 > > While testing resiliency of encrypted swap using dmcrypt we encounter easily reproducible deadlocks.
 > > The setup is a simple 1GB encrypted swap file [1] with a little mem chewer program [2] to consume all ram.
 > > 
 > > [1] Swap file setup
 > > ```
 > > $ swapoff /home/swapfile
 > > $ echo 'swap /home/swapfile /dev/urandom swap,cipher=aes-cbc-essiv:sha256,size=256' >> /etc/crypttab
 > > $ systemctl daemon-reload
 > > $ systemctl start systemd-cryptsetup@swap.service
 > > $ swapon /dev/mapper/swap
 > > ```
 > 
 > I have tried to swap on encrypted block device and it worked for me.
 > 
 > I've just realized that you are swapping to a file with the loopback 
 > driver on the top of it and with the dm-crypt device on the top of the 
 > loopback device.
 > 
 > This can't work in principle - the problem is that the filesystem needs to 
 > allocate memory when you write to it, so it deadlocks when the machine 
 > runs out of memory and needs to write back some pages. There is no easy 
 > fix - fixing this would require major rewrite of the VFS layer.
 > 
 > When you swap to a file directly, the kernel bypasses the filesystem, so 
 > it should work - but when you put encryption on the top of a file, there 
 > is no way how to bypass the filesystem.
 > 
 > So, I suggest to create a partition or a logical volume for swap and put 
 > dm-crypt on the top of it.
 > 
 > Mikulas
 > 
 > 

Yeah, unfortunately we are currently restricted to using a swapfile due to many units already shipped with that.
We have longer term plans to dynamically allocate the swapfiles as needed based on a new query for estimated size
required for hibernation etc. Moving to a swap partition is just not viable currently.

I tried halving /sys/module/dm_mod/parameters/swap_bios but it didn't help, which based on your more recent
reply is not unexpected.

I have a workaround for now, which is to run a userland earlyoom daemon. That seems to get in and oomkill in time.
I guess another option would be to have the swapfile in a LUKS-encrypted partition, but that equally is not viable for
the Steam Deck currently.

However, I'm still interested in the longer term solution of fixing the kernel so that it can handle scenarios
like this no matter how ill advised they may be. Telling users not to do something seems like a bad solution :)

Do you have any ideas about why the kernel oomkiller isn't stepping in? I definitely fill RAM and swap, so it seems like
it should be firing.

Thanks

Bob




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: deadlock when swapping to encrypted swapfile
  2025-09-09 16:50   ` Robert Beckett
@ 2025-09-10 11:26     ` Mikulas Patocka
  2025-09-10 15:24       ` Robert Beckett
  0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2025-09-10 11:26 UTC (permalink / raw)
  To: Robert Beckett; +Cc: dm-devel, linux-block, kernel



On Tue, 9 Sep 2025, Robert Beckett wrote:

> 
>  ---- On Tue, 09 Sep 2025 15:37:09 +0100  Mikulas Patocka <mpatocka@redhat.com> wrote --- 
>  > 
>  > 
>  > On Mon, 8 Sep 2025, Robert Beckett wrote:
>  > 
>  > > Hi,
>  > > 
>  > > While testing resiliency of encrypted swap using dmcrypt we encounter easily reproducible deadlocks.
>  > > The setup is a simple 1GB encrypted swap file [1] with a little mem chewer program [2] to consume all ram.
>  > > 
>  > > [1] Swap file setup
>  > > ```
>  > > $ swapoff /home/swapfile
>  > > $ echo 'swap /home/swapfile /dev/urandom swap,cipher=aes-cbc-essiv:sha256,size=256' >> /etc/crypttab
>  > > $ systemctl daemon-reload
>  > > $ systemctl start systemd-cryptsetup@swap.service
>  > > $ swapon /dev/mapper/swap
>  > > ```
>  > 
>  > I have tried to swap on encrypted block device and it worked for me.
>  > 
>  > I've just realized that you are swapping to a file with the loopback 
>  > driver on the top of it and with the dm-crypt device on the top of the 
>  > loopback device.
>  > 
>  > This can't work in principle - the problem is that the filesystem needs to 
>  > allocate memory when you write to it, so it deadlocks when the machine 
>  > runs out of memory and needs to write back some pages. There is no easy 
>  > fix - fixing this would require major rewrite of the VFS layer.
>  > 
>  > When you swap to a file directly, the kernel bypasses the filesystem, so 
>  > it should work - but when you put encryption on the top of a file, there 
>  > is no way how to bypass the filesystem.
>  > 
>  > So, I suggest to create a partition or a logical volume for swap and put 
>  > dm-crypt on the top of it.
>  > 
>  > Mikulas
>  > 
>  > 
> 
> Yeah, unfortunately we are currently restricted to using a swapfile due to many units already shipped with that.
> We have longer term plans to dynamically allocate the swapfiles as needed based on a new query for estimated size
> required for hibernation etc. Moving to swap partition is just not viable currently.

You can try the dm-loop target that I created some time ago. It won't be 
in the official kernel because the Linux developers don't like the idea of 
creating a fixed mapping for a file, but it may work better for you. Unlike 
the in-kernel loop driver, the dm-loop driver doesn't allocate memory when 
processing reads and writes.

Create a swap file on the filesystem, load the dm-loop target on top 
of that file and then create dm-crypt on top of the dm-loop target. 
Then, run mkswap and swapon on the dm-crypt device.
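With the patch below applied, the sequence could look something like this (untested sketch; the file path, size, and device names are illustrative - the dm-loop table takes the backing file path and a byte offset, as in the constructor below):

```shell
# Create a fully-allocated (non-sparse) 1GB backing file;
# dm-loop rejects sparse files for writable mappings.
dd if=/dev/zero of=/home/swapfile2 bs=1M count=1024
sync

# dm-loop table: <start> <length in 512-byte sectors> loop <path> <offset>
echo "0 $((1024 * 2048)) loop /home/swapfile2 0" | dmsetup create loopswap

# Plain dm-crypt on top of the dm-loop device, then swap on it.
cryptsetup open --type plain --key-file /dev/urandom \
    --cipher aes-cbc-essiv:sha256 --key-size 256 \
    /dev/mapper/loopswap cryptswap
mkswap /dev/mapper/cryptswap
swapon /dev/mapper/cryptswap
```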

> I tried halving /sys/module/dm_mod/parameters/swap_bios but it didn't help, which based on your more recent
> reply is not unexpected.
> 
> I have a work around for now, which is to run a userland earlyoom daemon. That seems to get in and oomkill in time.
> I guess another option would be to have the swapfile in a luks encrypted partition, but that equally is not viable for
> steamdeck currently.
> 
> However, I'm still interested in the longer term solution of fixing the kernel so that it can handle scenarios
> like this no matter how ill advised they may be. Telling users not to do something seems like a bad solution :)

You would have to rewrite the filesystems not to allocate memory when 
processing reads and writes. I think that this is not feasible.

> Do you have any ideas about the unreliable kernel oomkiller stepping in? I definitely fill ram and swap, seems like
> it should be firing.

I think that the main problem with the OOM killer is that it sometimes 
doesn't fire for big applications.

If you create a simple program that does malloc in a loop, the OOM killer 
will kill it reliably. However, if some big program (such as web browser, 
text editor, ...) allocates too much memory, the OOM killer may not kill 
it.

The problem is that when the kernel runs out of memory, it tries to flush 
filesystem caches first. However, the program code is also executed from 
filesystem cache. So, it flushes the pages with the program code, which 
makes the big program run slower. The program that is running out of 
memory allocates more pages, so more code will be flushed and the program 
will run even slower - and so on, in a loop. So, the system slows down 
gradually to a halt without really going out of memory and killing the big 
program.

I think that using a userspace OOM killer is appropriate to prevent this 
problem with the kernel OOM killer.

Mikulas

> Thanks
> 
> Bob

---
 drivers/md/Kconfig   |    9 +
 drivers/md/Makefile  |    1 
 drivers/md/dm-loop.c |  404 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 414 insertions(+)

Index: linux-2.6/drivers/md/Kconfig
===================================================================
--- linux-2.6.orig/drivers/md/Kconfig	2025-09-10 13:06:15.000000000 +0200
+++ linux-2.6/drivers/md/Kconfig	2025-09-10 13:06:15.000000000 +0200
@@ -647,6 +647,15 @@ config DM_ZONED
 
 	  If unsure, say N.
 
+config DM_LOOP
+	tristate "Loop target"
+	depends on BLK_DEV_DM
+	help
+	  This device-mapper target allows you to treat a regular file as
+	  a block device.
+
+	  If unsure, say N.
+
 config DM_AUDIT
 	bool "DM audit events"
 	depends on BLK_DEV_DM
Index: linux-2.6/drivers/md/Makefile
===================================================================
--- linux-2.6.orig/drivers/md/Makefile	2025-09-10 13:06:15.000000000 +0200
+++ linux-2.6/drivers/md/Makefile	2025-09-10 13:06:15.000000000 +0200
@@ -79,6 +79,7 @@ obj-$(CONFIG_DM_CLONE)		+= dm-clone.o
 obj-$(CONFIG_DM_LOG_WRITES)	+= dm-log-writes.o
 obj-$(CONFIG_DM_INTEGRITY)	+= dm-integrity.o
 obj-$(CONFIG_DM_ZONED)		+= dm-zoned.o
+obj-$(CONFIG_DM_LOOP)		+= dm-loop.o
 obj-$(CONFIG_DM_WRITECACHE)	+= dm-writecache.o
 obj-$(CONFIG_SECURITY_LOADPIN_VERITY)	+= dm-verity-loadpin.o
 
Index: linux-2.6/drivers/md/dm-loop.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/drivers/md/dm-loop.c	2025-09-10 13:06:15.000000000 +0200
@@ -0,0 +1,404 @@
+#include <linux/device-mapper.h>
+
+#include <linux/module.h>
+#include <linux/pagemap.h>
+
+#define DM_MSG_PREFIX "loop"
+
+struct loop_c {
+	struct file *filp;
+	char *path;
+	loff_t offset;
+	struct block_device *bdev;
+	struct inode *inode;
+	unsigned blkbits;
+	bool read_only;
+	sector_t mapped_sectors;
+
+	sector_t nr_extents;
+	struct dm_loop_extent *map;
+};
+
+struct dm_loop_extent {
+	sector_t start; 		/* start sector in mapped device */
+	sector_t to;			/* start sector on target device */
+	sector_t len;			/* length in sectors */
+};
+
+static sector_t blk2sect(struct loop_c *lc, blkcnt_t block)
+{
+	return block << (lc->blkbits - SECTOR_SHIFT);
+}
+
+static blkcnt_t sec2blk(struct loop_c *lc, sector_t sector)
+{
+	return sector >> (lc->blkbits - SECTOR_SHIFT);
+}
+
+static blkcnt_t sec2blk_roundup(struct loop_c *lc, sector_t sector)
+{
+	return (sector + (1 << (lc->blkbits - SECTOR_SHIFT)) - 1) >> (lc->blkbits - SECTOR_SHIFT);
+}
+
+static struct dm_loop_extent *extent_binary_lookup(struct loop_c *lc, sector_t sector)
+{
+	ssize_t first = 0;
+	ssize_t last = lc->nr_extents - 1;
+
+	while (first <= last) {
+		ssize_t middle = (first + last) >> 1;
+		struct dm_loop_extent *ex = &lc->map[middle];
+		if (sector < ex->start) {
+			last = middle - 1;
+			continue;
+		}
+		if (likely(sector >= ex->start + ex->len)) {
+			first = middle + 1;
+			continue;
+		}
+		return ex;
+	}
+
+	return NULL;
+}
+
+static int loop_map(struct dm_target *ti, struct bio *bio)
+{
+	struct loop_c *lc = ti->private;
+	sector_t sector, len;
+	struct dm_loop_extent *ex;
+
+	sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
+	ex = extent_binary_lookup(lc, sector);
+	if (!ex)
+		return DM_MAPIO_KILL;
+
+	bio_set_dev(bio, lc->bdev);
+	bio->bi_iter.bi_sector = ex->to + (sector - ex->start);
+	len = ex->len - (sector - ex->start);
+	if (len < bio_sectors(bio))
+		dm_accept_partial_bio(bio, len);
+
+	if (unlikely(!ex->to)) {
+		if (unlikely(!lc->read_only))
+			return DM_MAPIO_KILL;
+		zero_fill_bio(bio);
+		bio_endio(bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+
+	return DM_MAPIO_REMAPPED;
+}
+
+static void loop_status(struct dm_target *ti, status_type_t type,
+		unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct loop_c *lc = ti->private;
+	size_t sz = 0;
+
+	switch (type) {
+		case STATUSTYPE_INFO:
+			result[0] = '\0';
+			break;
+		case STATUSTYPE_TABLE:
+			DMEMIT("%s %llu", lc->path, lc->offset);
+			break;
+		case STATUSTYPE_IMA:
+			DMEMIT_TARGET_NAME_VERSION(ti->type);
+			DMEMIT(",file_name=%s,offset=%llu;", lc->path, lc->offset);
+			break;
+	}
+}
+
+static int loop_iterate_devices(struct dm_target *ti,
+				iterate_devices_callout_fn fn, void *data)
+{
+	return 0;
+}
+
+static int extent_range(struct loop_c *lc,
+			sector_t logical_blk, sector_t last_blk,
+			sector_t *begin_blk, sector_t *nr_blks,
+			char **error)
+{
+	sector_t dist = 0, phys_blk, probe_blk = logical_blk;
+	int r;
+
+	/* Find beginning physical block of extent starting at logical_blk. */
+	*begin_blk = probe_blk;
+	*nr_blks = 0;
+	r = bmap(lc->inode, begin_blk);
+	if (r) {
+		*error = "bmap failed";
+		return r;
+	}
+	if (!*begin_blk) {
+		if (!lc->read_only) {
+			*error = "File is sparse";
+			return -ENXIO;
+		}
+	}
+
+	for (phys_blk = *begin_blk; phys_blk == *begin_blk + dist; dist += !!*begin_blk) {
+		cond_resched();
+
+		(*nr_blks)++;
+		if (++probe_blk > last_blk)
+			break;
+
+		phys_blk = probe_blk;
+		r = bmap(lc->inode, &phys_blk);
+		if (r) {
+			*error = "bmap failed";
+			return r;
+		}
+		if (unlikely(!phys_blk)) {
+			if (!lc->read_only) {
+				*error = "File is sparse";
+				return -ENXIO;
+			}
+		}
+	}
+
+	return 0;
+}
+
+static int loop_extents(struct loop_c *lc, sector_t *nr_extents,
+			struct dm_loop_extent *map, char **error)
+{
+	int r;
+	sector_t start = 0;
+	sector_t nr_blks, begin_blk;
+	sector_t after_last_blk = sec2blk_roundup(lc,
+			(lc->mapped_sectors + (lc->offset >> 9)));
+	sector_t logical_blk = sec2blk(lc, lc->offset >> 9);
+
+	*nr_extents = 0;
+
+	/* for each block in the mapped region */
+	while (logical_blk < after_last_blk) {
+		r = extent_range(lc, logical_blk, after_last_blk - 1,
+				 &begin_blk, &nr_blks, error);
+
+		if (unlikely(r))
+			return r;
+
+		if (map) {
+			if (*nr_extents >= lc->nr_extents) {
+				*error = "The file changed while mapping it";
+				return -EBUSY;
+			}
+			map[*nr_extents].start = start;
+			map[*nr_extents].to = blk2sect(lc, begin_blk);
+			map[*nr_extents].len = blk2sect(lc, nr_blks);
+		}
+
+		(*nr_extents)++;
+		start += blk2sect(lc, nr_blks);
+		logical_blk += nr_blks;
+	}
+
+	if (*nr_extents != lc->nr_extents) {
+		*error = "The file changed while mapping it";
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+static int setup_block_map(struct loop_c *lc, struct dm_target *ti)
+{
+	int r;
+	sector_t n_file_sectors, offset_sector, nr_extents_tmp;
+
+	if (!S_ISREG(lc->inode->i_mode) || !lc->inode->i_sb || !lc->inode->i_sb->s_bdev) {
+		ti->error = "The file is not a regular file";
+		return -ENXIO;
+	}
+
+	lc->bdev = lc->inode->i_sb->s_bdev;
+	lc->blkbits = lc->inode->i_blkbits;
+	n_file_sectors = i_size_read(lc->inode) >> lc->blkbits << (lc->blkbits - 9);
+
+	if (lc->offset & ((1 << lc->blkbits) - 1)) {
+		ti->error = "Unaligned offset";
+		return -EINVAL;
+	}
+	offset_sector = lc->offset >> 9;
+	if (offset_sector >= n_file_sectors) {
+		ti->error = "Offset is greater than file size";
+		return -EINVAL;
+	}
+	if (ti->len > (n_file_sectors - offset_sector)) {
+		ti->error = "Target maps area after file end";
+		return -EINVAL;
+	}
+	lc->mapped_sectors = ti->len >> (lc->blkbits - 9) << (lc->blkbits - 9);
+
+	r = loop_extents(lc, &lc->nr_extents, NULL, &ti->error);
+	if (r)
+		return r;
+
+	if (lc->nr_extents != (size_t)lc->nr_extents) {
+		ti->error = "Too many extents";
+		return -EOVERFLOW;
+	}
+
+	lc->map = kvcalloc(lc->nr_extents, sizeof(struct dm_loop_extent), GFP_KERNEL);
+	if (!lc->map) {
+		ti->error = "Failed to allocate extent map";
+		return -ENOMEM;
+	}
+
+	r = loop_extents(lc, &nr_extents_tmp, lc->map, &ti->error);
+	if (r)
+		return r;
+
+	return 0;
+}
+
+static int loop_lock_inode(struct inode *inode)
+{
+	int r;
+	inode_lock(inode);
+	if (IS_SWAPFILE(inode)) {
+		inode_unlock(inode);
+		return -EBUSY;
+	}
+	inode->i_flags |= S_SWAPFILE;
+	r = inode_drain_writes(inode);
+	if (r) {
+		inode->i_flags &= ~S_SWAPFILE;
+		inode_unlock(inode);
+		return r;
+	}
+	inode_unlock(inode);
+	return 0;
+}
+
+static void loop_unlock_inode(struct inode *inode)
+{
+	inode_lock(inode);
+	inode->i_flags &= ~S_SWAPFILE;
+	inode_unlock(inode);
+}
+
+static void loop_free(struct loop_c *lc)
+{
+	if (!lc)
+		return;
+	if (!IS_ERR_OR_NULL(lc->filp)) {
+		loop_unlock_inode(lc->inode);
+		filp_close(lc->filp, NULL);
+	}
+	kvfree(lc->map);
+	kfree(lc->path);
+	kfree(lc);
+}
+
+static int loop_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	struct loop_c *lc = NULL;
+	int r;
+	char dummy;
+
+	if (argc != 2) {
+		r = -EINVAL;
+		ti->error = "Invalid number of arguments";
+		goto err;
+	}
+
+	lc = kzalloc(sizeof(*lc), GFP_KERNEL);
+	if (!lc) {
+		r = -ENOMEM;
+		ti->error = "Cannot allocate loop context";
+		goto err;
+	}
+	ti->private = lc;
+
+	lc->path = kstrdup(argv[0], GFP_KERNEL);
+	if (!lc->path) {
+		r = -ENOMEM;
+		ti->error = "Cannot allocate loop path";
+		goto err;
+	}
+
+	if (sscanf(argv[1], "%lld%c", &lc->offset, &dummy) != 1) {
+		r = -EINVAL;
+		ti->error = "Invalid file offset";
+		goto err;
+	}
+
+	lc->read_only = !(dm_table_get_mode(ti->table) & FMODE_WRITE);
+
+	lc->filp = filp_open(lc->path, lc->read_only ? O_RDONLY : O_RDWR, 0);
+	if (IS_ERR(lc->filp)) {
+		r = PTR_ERR(lc->filp);
+		ti->error = "Could not open backing file";
+		goto err;
+	}
+
+	lc->inode = lc->filp->f_mapping->host;
+
+	r = loop_lock_inode(lc->inode);
+	if (r) {
+		ti->error = "Could not lock inode";
+		goto err;
+	}
+
+	r = setup_block_map(lc, ti);
+	if (r) {
+		goto err;
+	}
+
+	return 0;
+
+err:
+	loop_free(lc);
+	return r;
+}
+
+static void loop_dtr(struct dm_target *ti)
+{
+	struct loop_c *lc = ti->private;
+	loop_free(lc);
+}
+
+static struct target_type loop_target = {
+	.name = "loop",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr = loop_ctr,
+	.dtr = loop_dtr,
+	.map = loop_map,
+	.status = loop_status,
+	.iterate_devices = loop_iterate_devices,
+};
+
+static int __init dm_loop_init(void)
+{
+	int r;
+
+	r = dm_register_target(&loop_target);
+	if (r < 0) {
+		DMERR("register failed %d", r);
+		goto err_target;
+	}
+
+	return 0;
+
+err_target:
+	return r;
+}
+
+static void __exit dm_loop_exit(void)
+{
+	dm_unregister_target(&loop_target);
+}
+
+module_init(dm_loop_init);
+module_exit(dm_loop_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
+MODULE_DESCRIPTION("device-mapper loop target");


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: deadlock when swapping to encrypted swapfile
  2025-09-10 11:26     ` Mikulas Patocka
@ 2025-09-10 15:24       ` Robert Beckett
  2025-09-10 17:45         ` Bryn M. Reeves
  2025-09-11 16:56         ` Mikulas Patocka
  0 siblings, 2 replies; 9+ messages in thread
From: Robert Beckett @ 2025-09-10 15:24 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, linux-block, kernel






 ---- On Wed, 10 Sep 2025 12:26:06 +0100  Mikulas Patocka <mpatocka@redhat.com> wrote --- 
 > 
 > 
 > On Tue, 9 Sep 2025, Robert Beckett wrote:
 > 
 > > 
 > >  ---- On Tue, 09 Sep 2025 15:37:09 +0100  Mikulas Patocka <mpatocka@redhat.com> wrote --- 
 > >  > 
 > >  > 
 > >  > On Mon, 8 Sep 2025, Robert Beckett wrote:
 > >  > 
 > >  > > Hi,
 > >  > > 
 > >  > > While testing resiliency of encrypted swap using dmcrypt we encounter easily reproducible deadlocks.
 > >  > > The setup is a simple 1GB encrypted swap file [1] with a little mem chewer program [2] to consume all ram.
 > >  > > 
 > >  > > [1] Swap file setup
 > >  > > ```
 > >  > > $ swapoff /home/swapfile
 > >  > > $ echo 'swap /home/swapfile /dev/urandom swap,cipher=aes-cbc-essiv:sha256,size=256' >> /etc/crypttab
 > >  > > $ systemctl daemon-reload
 > >  > > $ systemctl start systemd-cryptsetup@swap.service
 > >  > > $ swapon /dev/mapper/swap
 > >  > > ```
 > >  > 
 > >  > I have tried to swap on encrypted block device and it worked for me.
 > >  > 
 > >  > I've just realized that you are swapping to a file with the loopback 
 > >  > driver on the top of it and with the dm-crypt device on the top of the 
 > >  > loopback device.
 > >  > 
 > >  > This can't work in principle - the problem is that the filesystem needs to 
 > >  > allocate memory when you write to it, so it deadlocks when the machine 
 > >  > runs out of memory and needs to write back some pages. There is no easy 
 > >  > fix - fixing this would require major rewrite of the VFS layer.
 > >  > 
 > >  > When you swap to a file directly, the kernel bypasses the filesystem, so 
 > >  > it should work - but when you put encryption on the top of a file, there 
 > >  > is no way how to bypass the filesystem.
 > >  > 
 > >  > So, I suggest to create a partition or a logical volume for swap and put 
 > >  > dm-crypt on the top of it.
 > >  > 
 > >  > Mikulas
 > >  > 
 > >  > 
 > > 
 > > Yeah, unfortunately we are currently restricted to using a swapfile due to many units already shipped with that.
 > > We have longer term plans to dynamically allocate the swapfiles as needed based on a new query for estimated size
 > > required for hibernation etc. Moving to swap partition is just not viable currently.
 > 
 > You can try the dm-loop target that I created some times ago. It won't be 
 > in the official kernel because the Linux developers don't like the idea of 
 > creating fixed mapping for a file, but it may work better for you. Unlike 
 > the in-kernel loop driver, the dm-loop driver doesn't allocate memory when 
 > processing reads and write.

oh interesting. I hadn't seen that.
I was discussing a quick idea of potentially adding a new fallocate mode bit to request contiguous, non-movable
block assignment as it pre-allocates, which filesystems could then implement support for. Then use the known
file range with dm-crypt directly instead of going via the block device.
I guess this is roughly analogous to that idea.

I see that dm-loop is very old at this point. Do you know the rationale for rejection?
Was there any hope to get it included with more work?
If the main objection was regarding file spans that they can't guarantee persist, maybe a new fallocate-based
contract with the filesystems could alleviate the worries?


 > 
 > Create a swap file on the filesystem, load the dm-loop target on the top 
 > of that file and then create dm-crypt on the top of the dm-loop target. 
 > Then, run mkswap and swapon on the dm-crypt device.
 > 
 > > I tried halving /sys/module/dm_mod/parameters/swap_bios but it didn't help, which based on your more recent
 > > reply is not unexpected.
 > > 
 > > I have a work around for now, which is to run a userland earlyoom daemon. That seems to get in and oomkill in time.
 > > I guess another option would be to have the swapfile in a luks encrypted partition, but that equally is not viable for
 > > steamdeck currently.
 > > 
 > > However, I'm still interested in the longer term solution of fixing the kernel so that it can handle scenarios
 > > like this no matter how ill advised they may be. Telling users not to do something seems like a bad solution :)
 > 
 > You would have to rewrite the filesystems not to allocate memory when 
 > processing reads and writes. I think that this is not feasible.
 > 
 > > Do you have any ideas about the unreliable kernel oomkiller stepping in? I definitely fill ram and swap, seems like
 > > it should be firing.
 > 
 > I think that the main problem with the OOM killer is that it sometimes 
 > doesn't fire for big applications.

Perhaps oom_kill_allocating_task helps in that scenario?
In this lockup scenario I don't see the oomkiller starting at all. It looks like it soft
locks and never feels the need to step in.
Perhaps because it sees some (tiny) amount of forward progress with
some swapout requests completing?
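For reference, that knob is a runtime sysctl (whether it helps in this particular lockup is untested):

```shell
# When set, the OOM killer kills the task that triggered the OOM
# instead of scanning for the "best" victim - cheaper and more
# predictable, but it may kill an innocent large allocator.
sysctl -w vm.oom_kill_allocating_task=1

# Equivalent:
echo 1 > /proc/sys/vm/oom_kill_allocating_task
```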

 > 
 > If you create a simple program that does malloc in a loop, the OOM killer 
 > will kill it reliably. However, if some big program (such as web browser, 
 > text editor, ...) allocates too much memory, the OOM killer may not kill 
 > it.
 > 

This doesn't appear to be the case. The memchewer program that allocates 512MB
chunks in a loop never triggers the oomkiller.


 > The problem is that when the kernel runs out of memory, it tries to flush 
 > filesystem caches first. However the program code is also executed from 
 > filesystem cache. So, it flushes the pages with the program code, which 
 > makes the big program run slower. The program that is running out of 
 > memory allocates more pages, so more code will be flushed and the program 
 > will run even slower - and so on, in a loop. So, the system slows down 
 > gradually to a halt without really going out of memory and killing the big 
 > program.
 > 
 > I think that using userspace OOM killer is appropriate to prevent this 
 > problem with the kernel OOM killer. 

Turns out I spoke too soon on the userland earlyoom daemon being a solution.
It worked for some patterns, but not others.
It mostly worked well when swap was either pre-filled with data beyond its
threshold, so it stepped in as soon as ram was exhausted, or when the allocations were
sufficiently spaced for it to fill past its threshold without many more
outstanding swapouts before it got to evaluate again.

For now the only really reliable workaround is to disable memory
overcommit, but we really don't want to go down that route as it
will have all sorts of other impacts.
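(For completeness, "disable memory overcommit" here means something like the following sysctl fragment — values are illustrative:)

```shell
# /etc/sysctl.d/99-overcommit.conf (illustrative)
# Strict accounting: refuse allocations beyond swap + ratio% of RAM,
# so malloc() fails up front instead of the OOM killer firing later.
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
```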

 > 
 > Mikulas
 > 
 > > Thanks
 > > 
 > > Bob
 > 
 > ---
 >  drivers/md/Kconfig   |    9 +
 >  drivers/md/Makefile  |    1 
 >  drivers/md/dm-loop.c |  404 +++++++++++++++++++++++++++++++++++++++++++++++++++
 >  3 files changed, 414 insertions(+)
 > 
 > Index: linux-2.6/drivers/md/Kconfig
 > ===================================================================
 > --- linux-2.6.orig/drivers/md/Kconfig    2025-09-10 13:06:15.000000000 +0200
 > +++ linux-2.6/drivers/md/Kconfig    2025-09-10 13:06:15.000000000 +0200
 > @@ -647,6 +647,15 @@ config DM_ZONED
 >  
 >        If unsure, say N.
 >  
 > +config DM_LOOP
 > +    tristate "Loop target"
 > +    depends on BLK_DEV_DM
 > +    help
 > +      This device-mapper target allows you to treat a regular file as
 > +      a block device.
 > +
 > +      If unsure, say N.
 > +
 >  config DM_AUDIT
 >      bool "DM audit events"
 >      depends on BLK_DEV_DM
 > Index: linux-2.6/drivers/md/Makefile
 > ===================================================================
 > --- linux-2.6.orig/drivers/md/Makefile    2025-09-10 13:06:15.000000000 +0200
 > +++ linux-2.6/drivers/md/Makefile    2025-09-10 13:06:15.000000000 +0200
 > @@ -79,6 +79,7 @@ obj-$(CONFIG_DM_CLONE)        += dm-clone.o
 >  obj-$(CONFIG_DM_LOG_WRITES)    += dm-log-writes.o
 >  obj-$(CONFIG_DM_INTEGRITY)    += dm-integrity.o
 >  obj-$(CONFIG_DM_ZONED)        += dm-zoned.o
 > +obj-$(CONFIG_DM_LOOP)        += dm-loop.o
 >  obj-$(CONFIG_DM_WRITECACHE)    += dm-writecache.o
 >  obj-$(CONFIG_SECURITY_LOADPIN_VERITY)    += dm-verity-loadpin.o
 >  
 > Index: linux-2.6/drivers/md/dm-loop.c
 > ===================================================================
 > --- /dev/null    1970-01-01 00:00:00.000000000 +0000
 > +++ linux-2.6/drivers/md/dm-loop.c    2025-09-10 13:06:15.000000000 +0200
 > @@ -0,0 +1,404 @@
 > +#include <linux/device-mapper.h>
 > +
 > +#include <linux/module.h>
 > +#include <linux/pagemap.h>
 > +
 > +#define DM_MSG_PREFIX "loop"
 > +
 > +struct loop_c {
 > +    struct file *filp;
 > +    char *path;
 > +    loff_t offset;
 > +    struct block_device *bdev;
 > +    struct inode *inode;
 > +    unsigned blkbits;
 > +    bool read_only;
 > +    sector_t mapped_sectors;
 > +
 > +    sector_t nr_extents;
 > +    struct dm_loop_extent *map;
 > +};
 > +
 > +struct dm_loop_extent {
 > +    sector_t start;         /* start sector in mapped device */
 > +    sector_t to;            /* start sector on target device */
 > +    sector_t len;            /* length in sectors */
 > +};
 > +
 > +static sector_t blk2sect(struct loop_c *lc, blkcnt_t block)
 > +{
 > +    return block << (lc->blkbits - SECTOR_SHIFT);
 > +}
 > +
 > +static blkcnt_t sec2blk(struct loop_c *lc, sector_t sector)
 > +{
 > +    return sector >> (lc->blkbits - SECTOR_SHIFT);
 > +}
 > +
 > +static blkcnt_t sec2blk_roundup(struct loop_c *lc, sector_t sector)
 > +{
 > +    return (sector + (1 << (lc->blkbits - SECTOR_SHIFT)) - 1) >> (lc->blkbits - SECTOR_SHIFT);
 > +}
 > +
 > +static struct dm_loop_extent *extent_binary_lookup(struct loop_c *lc, sector_t sector)
 > +{
 > +    ssize_t first = 0;
 > +    ssize_t last = lc->nr_extents - 1;
 > +
 > +    while (first <= last) {
 > +        ssize_t middle = (first + last) >> 1;
 > +        struct dm_loop_extent *ex = &lc->map[middle];
 > +        if (sector < ex->start) {
 > +            last = middle - 1;
 > +            continue;
 > +        }
 > +        if (likely(sector >= ex->start + ex->len)) {
 > +            first = middle + 1;
 > +            continue;
 > +        }
 > +        return ex;
 > +    }
 > +
 > +    return NULL;
 > +}
 > +
 > +static int loop_map(struct dm_target *ti, struct bio *bio)
 > +{
 > +    struct loop_c *lc = ti->private;
 > +    sector_t sector, len;
 > +    struct dm_loop_extent *ex;
 > +
 > +    sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
 > +    ex = extent_binary_lookup(lc, sector);
 > +    if (!ex)
 > +        return DM_MAPIO_KILL;
 > +
 > +    bio_set_dev(bio, lc->bdev);
 > +    bio->bi_iter.bi_sector = ex->to + (sector - ex->start);
 > +    len = ex->len - (sector - ex->start);
 > +    if (len < bio_sectors(bio))
 > +        dm_accept_partial_bio(bio, len);
 > +
 > +    if (unlikely(!ex->to)) {
 > +        if (unlikely(!lc->read_only))
 > +            return DM_MAPIO_KILL;
 > +        zero_fill_bio(bio);
 > +        bio_endio(bio);
 > +        return DM_MAPIO_SUBMITTED;
 > +    }
 > +
 > +    return DM_MAPIO_REMAPPED;
 > +}
 > +
 > +static void loop_status(struct dm_target *ti, status_type_t type,
 > +        unsigned status_flags, char *result, unsigned maxlen)
 > +{
 > +    struct loop_c *lc = ti->private;
 > +    size_t sz = 0;
 > +
 > +    switch (type) {
 > +        case STATUSTYPE_INFO:
 > +            result[0] = '\0';
 > +            break;
 > +        case STATUSTYPE_TABLE:
 > +            DMEMIT("%s %llu", lc->path, lc->offset);
 > +            break;
 > +        case STATUSTYPE_IMA:
 > +            DMEMIT_TARGET_NAME_VERSION(ti->type);
 > +            DMEMIT(",file_name=%s,offset=%llu;", lc->path, lc->offset);
 > +            break;
 > +    }
 > +}
 > +
 > +static int loop_iterate_devices(struct dm_target *ti,
 > +                iterate_devices_callout_fn fn, void *data)
 > +{
 > +    return 0;
 > +}
 > +
 > +static int extent_range(struct loop_c *lc,
 > +            sector_t logical_blk, sector_t last_blk,
 > +            sector_t *begin_blk, sector_t *nr_blks,
 > +            char **error)
 > +{
 > +    sector_t dist = 0, phys_blk, probe_blk = logical_blk;
 > +    int r;
 > +
 > +    /* Find beginning physical block of extent starting at logical_blk. */
 > +    *begin_blk = probe_blk;
 > +    *nr_blks = 0;
 > +    r = bmap(lc->inode, begin_blk);
 > +    if (r) {
 > +        *error = "bmap failed";
 > +        return r;
 > +    }
 > +    if (!*begin_blk) {
 > +        if (!lc->read_only) {
 > +            *error = "File is sparse";
 > +            return -ENXIO;
 > +        }
 > +    }
 > +
 > +    for (phys_blk = *begin_blk; phys_blk == *begin_blk + dist; dist += !!*begin_blk) {
 > +        cond_resched();
 > +
 > +        (*nr_blks)++;
 > +        if (++probe_blk > last_blk)
 > +            break;
 > +
 > +        phys_blk = probe_blk;
 > +        r = bmap(lc->inode, &phys_blk);
 > +        if (r) {
 > +            *error = "bmap failed";
 > +            return r;
 > +        }
 > +        if (unlikely(!phys_blk)) {
 > +            if (!lc->read_only) {
 > +                *error = "File is sparse";
 > +                return -ENXIO;
 > +            }
 > +        }
 > +    }
 > +
 > +    return 0;
 > +}
 > +
 > +static int loop_extents(struct loop_c *lc, sector_t *nr_extents,
 > +            struct dm_loop_extent *map, char **error)
 > +{
 > +    int r;
 > +    sector_t start = 0;
 > +    sector_t nr_blks, begin_blk;
 > +    sector_t after_last_blk = sec2blk_roundup(lc,
 > +            (lc->mapped_sectors + (lc->offset >> 9)));
 > +    sector_t logical_blk = sec2blk(lc, lc->offset >> 9);
 > +
 > +    *nr_extents = 0;
 > +
 > +    /* for each block in the mapped region */
 > +    while (logical_blk < after_last_blk) {
 > +        r = extent_range(lc, logical_blk, after_last_blk - 1,
 > +                 &begin_blk, &nr_blks, error);
 > +
 > +        if (unlikely(r))
 > +            return r;
 > +
 > +        if (map) {
 > +            if (*nr_extents >= lc->nr_extents) {
 > +                *error = "The file changed while mapping it";
 > +                return -EBUSY;
 > +            }
 > +            map[*nr_extents].start = start;
 > +            map[*nr_extents].to = blk2sect(lc, begin_blk);
 > +            map[*nr_extents].len = blk2sect(lc, nr_blks);
 > +        }
 > +
 > +        (*nr_extents)++;
 > +        start += blk2sect(lc, nr_blks);
 > +        logical_blk += nr_blks;
 > +    }
 > +
 > +    if (*nr_extents != lc->nr_extents) {
 > +        *error = "The file changed while mapping it";
 > +        return -EBUSY;
 > +    }
 > +
 > +    return 0;
 > +}
 > +
 > +static int setup_block_map(struct loop_c *lc, struct dm_target *ti)
 > +{
 > +    int r;
 > +    sector_t n_file_sectors, offset_sector, nr_extents_tmp;
 > +
 > +    if (!S_ISREG(lc->inode->i_mode) || !lc->inode->i_sb || !lc->inode->i_sb->s_bdev) {
 > +        ti->error = "The file is not a regular file";
 > +        return -ENXIO;
 > +    }
 > +
 > +    lc->bdev = lc->inode->i_sb->s_bdev;
 > +    lc->blkbits = lc->inode->i_blkbits;
 > +    n_file_sectors = i_size_read(lc->inode) >> lc->blkbits << (lc->blkbits - 9);
 > +
 > +    if (lc->offset & ((1 << lc->blkbits) - 1)) {
 > +        ti->error = "Unaligned offset";
 > +        return -EINVAL;
 > +    }
 > +    offset_sector = lc->offset >> 9;
 > +    if (offset_sector >= n_file_sectors) {
 > +        ti->error = "Offset is greater than file size";
 > +        return -EINVAL;
 > +    }
 > +    if (ti->len > (n_file_sectors - offset_sector)) {
 > +        ti->error = "Target maps area after file end";
 > +        return -EINVAL;
 > +    }
 > +    lc->mapped_sectors = ti->len >> (lc->blkbits - 9) << (lc->blkbits - 9);
 > +
 > +    r = loop_extents(lc, &lc->nr_extents, NULL, &ti->error);
 > +    if (r)
 > +        return r;
 > +
 > +    if (lc->nr_extents != (size_t)lc->nr_extents) {
 > +        ti->error = "Too many extents";
 > +        return -EOVERFLOW;
 > +    }
 > +
 > +    lc->map = kvcalloc(lc->nr_extents, sizeof(struct dm_loop_extent), GFP_KERNEL);
 > +    if (!lc->map) {
 > +        ti->error = "Failed to allocate extent map";
 > +        return -ENOMEM;
 > +    }
 > +
 > +    r = loop_extents(lc, &nr_extents_tmp, lc->map, &ti->error);
 > +    if (r)
 > +        return r;
 > +
 > +    return 0;
 > +}
 > +
 > +static int loop_lock_inode(struct inode *inode)
 > +{
 > +    int r;
 > +    inode_lock(inode);
 > +    if (IS_SWAPFILE(inode)) {
 > +        inode_unlock(inode);
 > +        return -EBUSY;
 > +    }
 > +    inode->i_flags |= S_SWAPFILE;
 > +    r = inode_drain_writes(inode);
 > +    if (r) {
 > +        inode->i_flags &= ~S_SWAPFILE;
 > +        inode_unlock(inode);
 > +        return r;
 > +    }
 > +    inode_unlock(inode);
 > +    return 0;
 > +}
 > +
 > +static void loop_unlock_inode(struct inode *inode)
 > +{
 > +    inode_lock(inode);
 > +    inode->i_flags &= ~S_SWAPFILE;
 > +    inode_unlock(inode);
 > +}
 > +
 > +static void loop_free(struct loop_c *lc)
 > +{
 > +    if (!lc)
 > +        return;
 > +    if (!IS_ERR_OR_NULL(lc->filp)) {
 > +        loop_unlock_inode(lc->inode);
 > +        filp_close(lc->filp, NULL);
 > +    }
 > +    kvfree(lc->map);
 > +    kfree(lc->path);
 > +    kfree(lc);
 > +}
 > +
 > +static int loop_ctr(struct dm_target *ti, unsigned argc, char **argv)
 > +{
 > +    struct loop_c *lc = NULL;
 > +    int r;
 > +    char dummy;
 > +
 > +    if (argc != 2) {
 > +        r = -EINVAL;
 > +        ti->error = "Invalid number of arguments";
 > +        goto err;
 > +    }
 > +
 > +    lc = kzalloc(sizeof(*lc), GFP_KERNEL);
 > +    if (!lc) {
 > +        r = -ENOMEM;
 > +        ti->error = "Cannot allocate loop context";
 > +        goto err;
 > +    }
 > +    ti->private = lc;
 > +
 > +    lc->path = kstrdup(argv[0], GFP_KERNEL);
 > +    if (!lc->path) {
 > +        r = -ENOMEM;
 > +        ti->error = "Cannot allocate loop path";
 > +        goto err;
 > +    }
 > +
 > +    if (sscanf(argv[1], "%lld%c", &lc->offset, &dummy) != 1) {
 > +        r = -EINVAL;
 > +        ti->error = "Invalid file offset";
 > +        goto err;
 > +    }
 > +
 > +    lc->read_only = !(dm_table_get_mode(ti->table) & FMODE_WRITE);
 > +
 > +    lc->filp = filp_open(lc->path, lc->read_only ? O_RDONLY : O_RDWR, 0);
 > +    if (IS_ERR(lc->filp)) {
 > +        r = PTR_ERR(lc->filp);
 > +        ti->error = "Could not open backing file";
 > +        goto err;
 > +    }
 > +
 > +    lc->inode = lc->filp->f_mapping->host;
 > +
 > +    r = loop_lock_inode(lc->inode);
 > +    if (r) {
 > +        ti->error = "Could not lock inode";
 > +        goto err;
 > +    }
 > +
 > +    r = setup_block_map(lc, ti);
 > +    if (r) {
 > +        goto err;
 > +    }
 > +
 > +    return 0;
 > +
 > +err:
 > +    loop_free(lc);
 > +    return r;
 > +}
 > +
 > +static void loop_dtr(struct dm_target *ti)
 > +{
 > +    struct loop_c *lc = ti->private;
 > +    loop_free(lc);
 > +}
 > +
 > +static struct target_type loop_target = {
 > +    .name = "loop",
 > +    .version = {1, 0, 0},
 > +    .module = THIS_MODULE,
 > +    .ctr = loop_ctr,
 > +    .dtr = loop_dtr,
 > +    .map = loop_map,
 > +    .status = loop_status,
 > +    .iterate_devices = loop_iterate_devices,
 > +};
 > +
 > +static int __init dm_loop_init(void)
 > +{
 > +    int r;
 > +
 > +    r = dm_register_target(&loop_target);
 > +    if (r < 0) {
 > +        DMERR("register failed %d", r);
 > +        goto err_target;
 > +    }
 > +
 > +    return 0;
 > +
 > +err_target:
 > +    return r;
 > +}
 > +
 > +static void __exit dm_loop_exit(void)
 > +{
 > +    dm_unregister_target(&loop_target);
 > +}
 > +
 > +module_init(dm_loop_init);
 > +module_exit(dm_loop_exit);
 > +
 > +MODULE_LICENSE("GPL");
 > +MODULE_AUTHOR("Mikulas Patocka <mpatocka@redhat.com>");
 > +MODULE_DESCRIPTION("device-mapper loop target");
 > 
 > 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: deadlock when swapping to encrypted swapfile
  2025-09-10 15:24       ` Robert Beckett
@ 2025-09-10 17:45         ` Bryn M. Reeves
  2025-09-11 16:56         ` Mikulas Patocka
  1 sibling, 0 replies; 9+ messages in thread
From: Bryn M. Reeves @ 2025-09-10 17:45 UTC (permalink / raw)
  To: Robert Beckett; +Cc: Mikulas Patocka, dm-devel, linux-block, kernel

On Wed, Sep 10, 2025 at 04:24:46PM +0100, Robert Beckett wrote:
> I see that dm-loop is very old at this point. Do you know the rationale for rejection?
> was there any hope to get it included with more work?
> If the main objection was regarding file spans that they can't guarantee persist, maybe a new fallocate based
> contract with the filesystems could alleviate the worries?

Right: I first wrote it back in 2006. When it finally made it onto a
mailing list in 2008 the concerns were basically threefold: "why is DM
reinventing everything?", the borrowing of the S_SWAPFILE flag to keep
the file mapping stable while dm-loop goes behind the filesystem's
back, and the greedy population of the extent table (lazily filling the
extent table reduces start up time and the amount of pinned memory, but
has the drawback that the target needs to allocate memory for unmapped
extents while it is running, reintroducing the possibility of deadlock
in low memory situations).

Most of the interesting discussions happened in this thread after Jens
posted an RFC patch taking a similar approach for /dev/loop:

  https://lkml.iu.edu/hypermail/linux/kernel/0801.1/0716.html

This used a prio tree instead of a simple table and binary search.

There have been various different approaches proposed down the years but
none have made it to mainline to date. I wrote one in 2011 that
refactored drivers/block/loop.c so that it could be reused by
device-mapper: that seemed like it might be more acceptable upstream but
we didn't pursue it at the time (it also removes the main benefit for
your case, since it uses the regular loop.c machinery for IO).

The version Mikulas posted is most closely related to a version I was
working on in 2008-9:

  https://www.sourceware.org/pub/dm/patches/2.6-unstable/editing/patches-2.6.31/dm-loop.patch

Which is the one discussed in the thread above - I think roughly the
same objections exist today.

(historical note - dmsetup still has code to generate dm-loop tables if
symlinked to the name 'dmlosetup' or 'losetup':

  # dmlosetup 
  dmlosetup: Please specify loop_device.
  Usage:
  
  dmlosetup [-d|-a] [-e encryption] [-o offset] [-f|loop_device] [file]
  
  Couldn't process command line)

Regards,
Bryn.



* Re: deadlock when swapping to encrypted swapfile
  2025-09-10 15:24       ` Robert Beckett
  2025-09-10 17:45         ` Bryn M. Reeves
@ 2025-09-11 16:56         ` Mikulas Patocka
  2025-09-11 17:12           ` Robert Beckett
  1 sibling, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2025-09-11 16:56 UTC (permalink / raw)
  To: Robert Beckett; +Cc: dm-devel, linux-block, kernel



On Wed, 10 Sep 2025, Robert Beckett wrote:

>  > > Yeah, unfortunately we are currently restricted to using a swapfile due to many units already shipped with that.
>  > > We have longer term plans to dynamically allocate the swapfiles as needed based on a new query for estimated size
>  > > required for hibernation etc. Moving to swap partition is just not viable currently.
>  > 
>  > You can try the dm-loop target that I created some times ago. It won't be 
>  > in the official kernel because the Linux developers don't like the idea of 
>  > creating fixed mapping for a file, but it may work better for you. Unlike 
>  > the in-kernel loop driver, the dm-loop driver doesn't allocate memory when 
>  > processing reads and write.
> 
> oh interesting. I hadn't seen that.
> I was discussing a quick idea of potentially adding a new fallocate mode bit to request contiguous non-moveable
> block assignment as it pre-allocates, which filesystems could then implement support for.

You can ask the VFS maintainers about this, but I think they'll reject it.

> Then use the known
> file range with dm-crypt directly instead of going via the block device.
> I guess this is roughly analogous to that idea.
> 
> I see that dm-loop is very old at this point. Do you know the rationale for rejection?

The reason was that the filesystem developers think that the filesystems 
should have freedom to move the allocated blocks around.

The dm-loop patch sets the flag S_SWAPFILE to prevent that from happening, 
but they don't want more code to use this flag.

> was there any hope to get it included with more work?

No - because they don't like the idea of creating a map of file blocks in 
advance.

One could rework the dm-loop patch to use standard filesystem methods read 
and write, but then it would allocate memory when processing requests and 
it would be unsuitable for swapping.

> If the main objection was regarding file spans that they can't guarantee persist, maybe a new fallocate based
> contract with the filesystems could alleviate the worries?
> 
> 
>  > 
>  > Create a swap file on the filesystem, load the dm-loop target on the top 
>  > of that file and then create dm-crypt on the top of the dm-loop target. 
>  > Then, run mkswap and swapon on the dm-crypt device.
>  > 
>  > > I tried halving /sys/module/dm_mod/parameters/swap_bios but it didn't help, which based on your more recent
>  > > reply is not unexpected.
>  > > 
>  > > I have a work around for now, which is to run a userland earlyoom daemon. That seems to get in and oomkill in time.
>  > > I guess another option would be to have the swapfile in a luks encrypted partition, but that equally is not viable for
>  > > steamdeck currently.
>  > > 
>  > > However, I'm still interested in the longer term solution of fixing the kernel so that it can handle scenarios
>  > > like this no matter how ill advised they may be. Telling users not to do something seems like a bad solution :)
>  > 
>  > You would have to rewrite the filesystems not to allocate memory when 
>  > processing reads and writes. I think that this is not feasible.
>  > 
>  > > Do you have any ideas about the unreliable kernel oomkiller stepping in? I definitely fill ram and swap, seems like
>  > > it should be firing.
>  > 
>  > I think that the main problem with the OOM killer is that it sometimes 
>  > doesn't fire for big applications.
> 
> perhaps oom_kill_allocating_task helps in that scenario?
> in this lockup scenario I don't see oomkiller starting at all. It looks like it soft
> locks and never feels the need to step in.
> Perhaps because it sees some (tiny) amount of forward progress with
> some swapout requests completing?

If you are swapping to an encrypted file, it may deadlock even before you 
exhaust the memory and swap.

>  > I think that using userspace OOM killer is appropriate to prevent this 
>  > problem with the kernel OOM killer. 
> 
> Turns out I spoke too soon on the userland earlyoom daemon being a solution.
> It worked for some patterns, but not others.
> It mostly worked well when swap was either pre-filled with data greater than its
> threshold so as soon as ram is exhausted it stepped in, or when the allocations are
> sufficiently spaced for it to fill greater than its threshold without many more
> outstanding swapouts before it gets to evaluate again.
> 
> For now the only really reliable way to work around is to disable memory
> overcommit, but we really don't want to go looking down that route as it 
> will have all sorts of other impacts.

You can allocate a file, use e4defrag to reduce the number of fragments, 
then use the FS_IOC_FIEMAP ioctl to find out the location of the file and 
then use the dm-linear target to map the file - and place encryption and 
swapping on the top of that.

If you have control over the whole device and make sure that no one moves 
the file, it should work.

Note that you can't use device mapper on a block device that has 
filesystem mounted, so you'll have to add one dm-linear device underneath 
the filesystem.
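Sketching the approach described above (device names, extent numbers and the single-extent assumption are all illustrative — a real setup must walk the full FIEMAP extent list and refuse fragmented files):

```shell
# filefrag -v reports extents in filesystem blocks; dm tables want
# 512-byte sectors, so convert first.
blk2sect() {                     # $1 = blocks, $2 = filesystem block size
	echo $(( $1 * ($2 / 512) ))
}

# fallocate -l 1G /home/swapfile
# e4defrag /home/swapfile                  # reduce the fragment count
# filefrag -v /home/swapfile               # uses FS_IOC_FIEMAP underneath
# ...suppose one extent: physical block 34816, 262144 blocks, 4096-byte blocks:
# dmsetup create swapmap --table \
#   "0 $(blk2sect 262144 4096) linear /dev/nvme0n1p2 $(blk2sect 34816 4096)"
# then layer dm-crypt on /dev/mapper/swapmap and run mkswap/swapon on that.
```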

Mikulas



* Re: deadlock when swapping to encrypted swapfile
  2025-09-11 16:56         ` Mikulas Patocka
@ 2025-09-11 17:12           ` Robert Beckett
  0 siblings, 0 replies; 9+ messages in thread
From: Robert Beckett @ 2025-09-11 17:12 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, linux-block, kernel






 ---- On Thu, 11 Sep 2025 17:56:01 +0100  Mikulas Patocka <mpatocka@redhat.com> wrote --- 
 > 
 > 
 > On Wed, 10 Sep 2025, Robert Beckett wrote:
 > 
 > >  > > Yeah, unfortunately we are currently restricted to using a swapfile due to many units already shipped with that.
 > >  > > We have longer term plans to dynamically allocate the swapfiles as needed based on a new query for estimated size
 > >  > > required for hibernation etc. Moving to swap partition is just not viable currently.
 > >  > 
 > >  > You can try the dm-loop target that I created some times ago. It won't be 
 > >  > in the official kernel because the Linux developers don't like the idea of 
 > >  > creating fixed mapping for a file, but it may work better for you. Unlike 
 > >  > the in-kernel loop driver, the dm-loop driver doesn't allocate memory when 
 > >  > processing reads and write.
 > > 
 > > oh interesting. I hadn't seen that.
 > > I was discussing a quick idea of potentially adding a new fallocate mode bit to request contiguous non-moveable
 > > block assignment as it pre-allocates, which filesystems could then implement support for.
 > 
 > You can ask the VFS maintainers about this, but I think they'll reject it.
 > 
 > > Then use the known
 > > file range with dm-crypt directly instead of going via the block device.
 > > I guess this is roughly analogous to that idea.
 > > 
 > > I see that dm-loop is very old at this point. Do you know the rationale for rejection?
 > 
 > The reason was that the filesystem developers think that the filesystems 
 > should have freedom to move the allocated blocks around.
 > 
 > The dm-loop patch sets the flag S_SWAPFILE to prevent that from happening, 
 > but they don't want more code to use this flag.
 > 
 > > was there any hope to get it included with more work?
 > 
 > No - because they don't like the idea of creating a map of file blocks in 
 > advance.
 > 
 > One could rework the dm-loop patch to use standard filesystem methods read 
 > and write, but then it would allocate memory when processing requests and 
 > it would be unsuitable for swapping.
 > 
 > > If the main objection was regarding file spans that they can't guarantee persist, maybe a new fallocate based
 > > contract with the filesystems could alleviate the worries?
 > > 
 > > 
 > >  > 
 > >  > Create a swap file on the filesystem, load the dm-loop target on the top 
 > >  > of that file and then create dm-crypt on the top of the dm-loop target. 
 > >  > Then, run mkswap and swapon on the dm-crypt device.
 > >  > 
 > >  > > I tried halving /sys/module/dm_mod/parameters/swap_bios but it didn't help, which based on your more recent
 > >  > > reply is not unexpected.
 > >  > > 
 > >  > > I have a work around for now, which is to run a userland earlyoom daemon. That seems to get in and oomkill in time.
 > >  > > I guess another option would be to have the swapfile in a luks encrypted partition, but that equally is not viable for
 > >  > > steamdeck currently.
 > >  > > 
 > >  > > However, I'm still interested in the longer term solution of fixing the kernel so that it can handle scenarios
 > >  > > like this no matter how ill advised they may be. Telling users not to do something seems like a bad solution :)
 > >  > 
 > >  > You would have to rewrite the filesystems not to allocate memory when 
 > >  > processing reads and writes. I think that this is not feasible.
 > >  > 
 > >  > > Do you have any ideas about the unreliable kernel oomkiller stepping in? I definitely fill ram and swap, seems like
 > >  > > it should be firing.
 > >  > 
 > >  > I think that the main problem with the OOM killer is that it sometimes 
 > >  > doesn't fire for big applications.
 > > 
 > > perhaps oom_kill_allocating_task helps in that scenario?
 > > in this lockup scenario I don't see oomkiller starting at all. It looks like it soft
 > > locks and never feels the need to step in.
 > > Perhaps because it sees some (tiny) amount of forward progress with
 > > some swapout requests completing?
 > 
 > If you are swapping to an encrypted file, it may deadlock even before you 
 > exhaust the memory and swap.
 > 
 > >  > I think that using userspace OOM killer is appropriate to prevent this 
 > >  > problem with the kernel OOM killer. 
 > > 
 > > Turns out I spoke too soon on the userland earlyoom daemon being a solution.
 > > It worked for some patterns, but not others.
 > > It mostly worked well when swap was either pre-filled with data greater than its
 > > threshold so as soon as ram is exhausted it stepped in, or when the allocations are
 > > sufficiently spaced for it to fill greater than its threshold without many more
 > > outstanding swapouts before it gets to evaluate again.
 > > 
 > > For now the only really reliable way to work around is to disable memory
 > > overcommit, but we really don't want to go looking down that route as it 
 > > will have all sorts of other impacts.
 > 
 > You can allocate a file, use e4defrag to reduce the number of fragments, 
 > then use the FS_IOC_FIEMAP ioctl to find out the location of the file and 
 > then use the dm-linear target to map the file - and place encryption and 
 > swapping on the top of that.
 > 
 > If you have control over the whole device and make sure that no one moves 
 > the file, it should work.
 > 
 > Note that you can't use device mapper on a block device that has 
 > filesystem mounted, so you'll have to add one dm-linear device underneath 
 > the filesystem.

Yeah, I already looked into that and discounted it.
We want to support dynamic swapfile sizing, which means files, not statically reserved regions of disk.
Besides, if static regions were suitable for our design, we could just use partitions instead.

At this point, we accept that any sort of mapping over the top of dependable extents is just not
going to happen. And based on the discussions from your previous attempt
 https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/
I can definitely agree with the fs folks that it makes sense to avoid that approach.

I think instead we'll look at encrypting swap data during swap out.
zswap already transforms data during swap in/out; I suspect we could do
something similar with encryption (to be investigated).

Thanks for the discussion, it was interesting diving into it.

 > 
 > Mikulas
 > 
 > 



end of thread, other threads:[~2025-09-11 17:13 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2025-09-08 18:27 deadlock when swapping to encrypted swapfile Robert Beckett
2025-09-08 19:56 ` Mikulas Patocka
2025-09-09 14:37 ` Mikulas Patocka
2025-09-09 16:50   ` Robert Beckett
2025-09-10 11:26     ` Mikulas Patocka
2025-09-10 15:24       ` Robert Beckett
2025-09-10 17:45         ` Bryn M. Reeves
2025-09-11 16:56         ` Mikulas Patocka
2025-09-11 17:12           ` Robert Beckett
