* [PATCH 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
@ 2024-11-11 9:28 Johannes Thumshirn
2024-11-11 9:28 ` [PATCH 1/2] " Johannes Thumshirn
2024-11-11 9:28 ` [PATCH 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
0 siblings, 2 replies; 10+ messages in thread
From: Johannes Thumshirn @ 2024-11-11 9:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Johannes Thumshirn, Johannes Thumshirn,
Mark Harmstone, Omar Sandoval
Shinichiro reported an occasional memory corruption in our CI system with
btrfs/248 that led to panics. He also managed to reproduce this
corruption reliably on one host. See patch 1/2 for details on the
corruption and the fix; patch 2/2 is a cleanup Damien suggested on top of
the fix to make the code more obvious.
Mark, I've tried your test tool for the encoded read io_uring path, but didn't
succeed in building it. Can you please have a look at these patches with an
io_uring hat on?
Johannes Thumshirn (2):
btrfs: fix use-after-free in btrfs_encoded_read_endio
btrfs: simplify waiting for encoded read endios
fs/btrfs/inode.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-11 9:28 [PATCH 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn
@ 2024-11-11 9:28 ` Johannes Thumshirn
2024-11-11 13:27 ` Filipe Manana
2024-11-11 18:17 ` Mark Harmstone
2024-11-11 9:28 ` [PATCH 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
1 sibling, 2 replies; 10+ messages in thread
From: Johannes Thumshirn @ 2024-11-11 9:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Johannes Thumshirn, Mark Harmstone, Omar Sandoval,
Shinichiro Kawasaki, Damien Le Moal
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Shinichiro reported the following use-after-free that sometimes
happens in our CI system when running fstests' btrfs/284 on a TCMU
runner device:
==================================================================
BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
? lock_release+0x708/0x780
print_report+0x174/0x505
? lock_release+0x708/0x780
? __virt_addr_valid+0x224/0x410
? lock_release+0x708/0x780
kasan_report+0xda/0x1b0
? lock_release+0x708/0x780
? __wake_up+0x44/0x60
lock_release+0x708/0x780
? __pfx_lock_release+0x10/0x10
? __pfx_do_raw_spin_lock+0x10/0x10
? lock_is_held_type+0x9a/0x110
_raw_spin_unlock_irqrestore+0x1f/0x60
__wake_up+0x44/0x60
btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
? lock_release+0x1b0/0x780
? trace_lock_acquire+0x12f/0x1a0
? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
? process_one_work+0x7e3/0x1460
? lock_acquire+0x31/0xc0
? process_one_work+0x7e3/0x1460
process_one_work+0x85c/0x1460
? __pfx_process_one_work+0x10/0x10
? assign_work+0x16c/0x240
worker_thread+0x5e6/0xfc0
? __pfx_worker_thread+0x10/0x10
kthread+0x2c3/0x3a0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x70
__kasan_slab_free+0x4f/0x70
kfree+0x143/0x490
btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The buggy address belongs to the object at ffff888106a83f00
which belongs to the cache kmalloc-rnd-07-96 of size 96
The buggy address is located 24 bytes inside of
freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Further analyzing the trace and the crash dump's vmcore file shows that
the wake_up() call in btrfs_encoded_read_endio() operates on the
wait_queue that is in the private data passed to the end_io handler.
Commit 4ff47df40447 ("btrfs: move priv off stack in
btrfs_encoded_read_regular_fill_pages()") moved 'struct
btrfs_encoded_read_private' off the stack.
Before that commit one can see a corruption of the private data when
analyzing the vmcore after a crash:
*(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
.wait = (wait_queue_head_t){
.lock = (spinlock_t){
.rlock = (struct raw_spinlock){
.raw_lock = (arch_spinlock_t){
.val = (atomic_t){
.counter = (int)-2005885696,
},
.locked = (u8)0,
.pending = (u8)157,
.locked_pending = (u16)40192,
.tail = (u16)34928,
},
.magic = (unsigned int)536325682,
.owner_cpu = (unsigned int)29,
.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.head = (struct list_head){
.next = (struct list_head *)0x112cca,
.prev = (struct list_head *)0x47,
},
},
.pending = (atomic_t){
.counter = (int)-1491499288,
},
.status = (blk_status_t)130,
}
Move the call to bio_put() before the atomic decrement and also
change atomic_dec_return() to atomic_dec_and_test() to fix the
corruption, as atomic_dec_return() is defined as two instructions on
x86_64, whereas atomic_dec_and_test() is defined as a single atomic
operation.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 22b8e2764619..cb8b23a3e56b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
*/
WRITE_ONCE(priv->status, bbio->bio.bi_status);
}
- if (atomic_dec_return(&priv->pending) == 0) {
+ bio_put(&bbio->bio);
+ if (atomic_dec_and_test(&priv->pending)) {
int err = blk_status_to_errno(READ_ONCE(priv->status));
if (priv->uring_ctx) {
@@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
wake_up(&priv->wait);
}
}
- bio_put(&bbio->bio);
}
int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/2] btrfs: simplify waiting for encoded read endios
2024-11-11 9:28 [PATCH 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn
2024-11-11 9:28 ` [PATCH 1/2] " Johannes Thumshirn
@ 2024-11-11 9:28 ` Johannes Thumshirn
2024-11-11 9:56 ` Damien Le Moal
2024-11-11 18:36 ` Mark Harmstone
1 sibling, 2 replies; 10+ messages in thread
From: Johannes Thumshirn @ 2024-11-11 9:28 UTC (permalink / raw)
To: linux-btrfs
Cc: Damien Le Moal, Johannes Thumshirn, Mark Harmstone, Omar Sandoval,
Damien Le Moal
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Simplify the I/O completion path for encoded reads by using a
completion instead of a wait_queue.
Furthermore, skip taking an extra reference that is instantly
dropped anyway.
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/inode.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cb8b23a3e56b..916c9d7ca112 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9068,7 +9068,7 @@ static ssize_t btrfs_encoded_read_inline(
}
struct btrfs_encoded_read_private {
- wait_queue_head_t wait;
+ struct completion done;
void *uring_ctx;
atomic_t pending;
blk_status_t status;
@@ -9097,7 +9097,7 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
btrfs_uring_read_extent_endio(priv->uring_ctx, err);
kfree(priv);
} else {
- wake_up(&priv->wait);
+ complete(&priv->done);
}
}
}
@@ -9116,7 +9116,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
if (!priv)
return -ENOMEM;
- init_waitqueue_head(&priv->wait);
+ init_completion(&priv->done);
atomic_set(&priv->pending, 1);
priv->status = 0;
priv->uring_ctx = uring_ctx;
@@ -9145,11 +9145,10 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
disk_io_size -= bytes;
} while (disk_io_size);
- atomic_inc(&priv->pending);
btrfs_submit_bbio(bbio, 0);
if (uring_ctx) {
- if (atomic_dec_return(&priv->pending) == 0) {
+ if (atomic_read(&priv->pending) == 0) {
ret = blk_status_to_errno(READ_ONCE(priv->status));
btrfs_uring_read_extent_endio(uring_ctx, ret);
kfree(priv);
@@ -9158,8 +9157,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
return -EIOCBQUEUED;
} else {
- if (atomic_dec_return(&priv->pending) != 0)
- io_wait_event(priv->wait, !atomic_read(&priv->pending));
+ wait_for_completion(&priv->done);
/* See btrfs_encoded_read_endio() for ordering. */
ret = blk_status_to_errno(READ_ONCE(priv->status));
kfree(priv);
--
2.43.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] btrfs: simplify waiting for encoded read endios
2024-11-11 9:28 ` [PATCH 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
@ 2024-11-11 9:56 ` Damien Le Moal
2024-11-11 10:13 ` Johannes Thumshirn
2024-11-11 18:36 ` Mark Harmstone
1 sibling, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2024-11-11 9:56 UTC (permalink / raw)
To: Johannes Thumshirn, linux-btrfs
Cc: Johannes Thumshirn, Mark Harmstone, Omar Sandoval, Damien Le Moal
On 11/11/24 18:28, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Simplify the I/O completion path for encoded reads by using a
> completion instead of a wait_queue.
>
> Furthermore, skip taking an extra reference that is instantly
> dropped anyway.
>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/inode.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index cb8b23a3e56b..916c9d7ca112 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9068,7 +9068,7 @@ static ssize_t btrfs_encoded_read_inline(
> }
>
> struct btrfs_encoded_read_private {
> - wait_queue_head_t wait;
> + struct completion done;
> void *uring_ctx;
> atomic_t pending;
> blk_status_t status;
> @@ -9097,7 +9097,7 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> btrfs_uring_read_extent_endio(priv->uring_ctx, err);
> kfree(priv);
> } else {
> - wake_up(&priv->wait);
> + complete(&priv->done);
> }
> }
> }
> @@ -9116,7 +9116,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> if (!priv)
> return -ENOMEM;
>
> - init_waitqueue_head(&priv->wait);
> + init_completion(&priv->done);
> atomic_set(&priv->pending, 1);
> priv->status = 0;
> priv->uring_ctx = uring_ctx;
> @@ -9145,11 +9145,10 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> disk_io_size -= bytes;
> } while (disk_io_size);
>
> - atomic_inc(&priv->pending);
> btrfs_submit_bbio(bbio, 0);
>
> if (uring_ctx) {
> - if (atomic_dec_return(&priv->pending) == 0) {
> + if (atomic_read(&priv->pending) == 0) {
This does not look right... Testing this essentially means that you are doing
wait_for_completion(). So I would move this hunk after the call to
wait_for_completion(). That will also allow avoiding duplicating the cleanup
path (getting ret and doing the kfree(priv)).
> ret = blk_status_to_errno(READ_ONCE(priv->status));
> btrfs_uring_read_extent_endio(uring_ctx, ret);
> kfree(priv);
> @@ -9158,8 +9157,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
>
> return -EIOCBQUEUED;
> } else {
> - if (atomic_dec_return(&priv->pending) != 0)
> - io_wait_event(priv->wait, !atomic_read(&priv->pending));
> + wait_for_completion(&priv->done);
> /* See btrfs_encoded_read_endio() for ordering. */
> ret = blk_status_to_errno(READ_ONCE(priv->status));
> kfree(priv);
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/2] btrfs: simplify waiting for encoded read endios
2024-11-11 9:56 ` Damien Le Moal
@ 2024-11-11 10:13 ` Johannes Thumshirn
0 siblings, 0 replies; 10+ messages in thread
From: Johannes Thumshirn @ 2024-11-11 10:13 UTC (permalink / raw)
To: Damien Le Moal, Johannes Thumshirn, linux-btrfs@vger.kernel.org
Cc: Mark Harmstone, Omar Sandoval, Damien Le Moal
On 11.11.24 10:56, Damien Le Moal wrote:
>> @@ -9145,11 +9145,10 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
>> disk_io_size -= bytes;
>> } while (disk_io_size);
>>
>> - atomic_inc(&priv->pending);
>> btrfs_submit_bbio(bbio, 0);
>>
>> if (uring_ctx) {
>> - if (atomic_dec_return(&priv->pending) == 0) {
>> + if (atomic_read(&priv->pending) == 0) {
>
> This does not look right... Testing this essentially means that you are doing
> wait_for_completion(). So I would move this hunk after the call for
> wait_for_completion(). That will also allow avoiding duplicating the cleanup
> path getting ret and doing the kfree(priv).
>
Are you sure? Because we initialize the counter to 1, then submit and test for 0.
Until the bio completes we're returning -EIOCBQUEUED; once it completes,
it drops a reference and atomic_read() will return 0.
BUT I just saw that it's doing a double free:
In 'btrfs_encoded_read_regular_fill_pages':
if (uring_ctx) {
if (atomic_read(&priv->pending) == 0) {
ret = blk_status_to_errno(READ_ONCE(priv->status));
btrfs_uring_read_extent_endio(uring_ctx, ret);
kfree(priv);
return ret;
}
return -EIOCBQUEUED;
}
and 'btrfs_encoded_read_endio':
bio_put(&bbio->bio);
if (atomic_dec_and_test(&priv->pending)) {
int err = blk_status_to_errno(READ_ONCE(priv->status));
if (priv->uring_ctx) {
btrfs_uring_read_extent_endio(priv->uring_ctx, err);
kfree(priv);
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-11 9:28 ` [PATCH 1/2] " Johannes Thumshirn
@ 2024-11-11 13:27 ` Filipe Manana
2024-11-11 13:44 ` Damien Le Moal
2024-11-11 18:17 ` Mark Harmstone
1 sibling, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2024-11-11 13:27 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: linux-btrfs, Damien Le Moal, Johannes Thumshirn, Mark Harmstone,
Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal
On Mon, Nov 11, 2024 at 9:28 AM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Shinichiro reported the following use-after-free that sometimes
> happens in our CI system when running fstests' btrfs/284 on a TCMU
> runner device:
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
> Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
>
> CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
> Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
> Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x6e/0xa0
> ? lock_release+0x708/0x780
> print_report+0x174/0x505
> ? lock_release+0x708/0x780
> ? __virt_addr_valid+0x224/0x410
> ? lock_release+0x708/0x780
> kasan_report+0xda/0x1b0
> ? lock_release+0x708/0x780
> ? __wake_up+0x44/0x60
> lock_release+0x708/0x780
> ? __pfx_lock_release+0x10/0x10
> ? __pfx_do_raw_spin_lock+0x10/0x10
> ? lock_is_held_type+0x9a/0x110
> _raw_spin_unlock_irqrestore+0x1f/0x60
> __wake_up+0x44/0x60
> btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
> btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
> ? lock_release+0x1b0/0x780
> ? trace_lock_acquire+0x12f/0x1a0
> ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
> ? process_one_work+0x7e3/0x1460
> ? lock_acquire+0x31/0xc0
> ? process_one_work+0x7e3/0x1460
> process_one_work+0x85c/0x1460
> ? __pfx_process_one_work+0x10/0x10
> ? assign_work+0x16c/0x240
> worker_thread+0x5e6/0xfc0
> ? __pfx_worker_thread+0x10/0x10
> kthread+0x2c3/0x3a0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x31/0x70
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> Allocated by task 3661:
> kasan_save_stack+0x30/0x50
> kasan_save_track+0x14/0x30
> __kasan_kmalloc+0xaa/0xb0
> btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
> send_extent_data+0xf0f/0x24a0 [btrfs]
> process_extent+0x48a/0x1830 [btrfs]
> changed_cb+0x178b/0x2ea0 [btrfs]
> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
> _btrfs_ioctl_send+0x117/0x330 [btrfs]
> btrfs_ioctl+0x184a/0x60a0 [btrfs]
> __x64_sys_ioctl+0x12e/0x1a0
> do_syscall_64+0x95/0x180
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Freed by task 3661:
> kasan_save_stack+0x30/0x50
> kasan_save_track+0x14/0x30
> kasan_save_free_info+0x3b/0x70
> __kasan_slab_free+0x4f/0x70
> kfree+0x143/0x490
> btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
> send_extent_data+0xf0f/0x24a0 [btrfs]
> process_extent+0x48a/0x1830 [btrfs]
> changed_cb+0x178b/0x2ea0 [btrfs]
> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
> _btrfs_ioctl_send+0x117/0x330 [btrfs]
> btrfs_ioctl+0x184a/0x60a0 [btrfs]
> __x64_sys_ioctl+0x12e/0x1a0
> do_syscall_64+0x95/0x180
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The buggy address belongs to the object at ffff888106a83f00
> which belongs to the cache kmalloc-rnd-07-96 of size 96
> The buggy address is located 24 bytes inside of
> freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
>
> The buggy address belongs to the physical page:
> page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
> flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
> page_type: f5(slab)
> raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
> raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ^
> ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ==================================================================
>
> Further analyzing the trace and the crash dump's vmcore file shows that
> the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
> the wait_queue that is in the private data passed to the end_io handler.
>
> Commit 4ff47df40447 ("btrfs: move priv off stack in
> btrfs_encoded_read_regular_fill_pages()") moved 'struct
> btrfs_encoded_read_private' off the stack.
>
> Before that commit one can see a corruption of the private data when
> analyzing the vmcore after a crash:
Can you highlight what the corruption is?
Just dumping a large structure isn't immediately obvious. I suppose
you mean it's related to the large negative values of the atomic
counters?
>
> *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
> .wait = (wait_queue_head_t){
> .lock = (spinlock_t){
> .rlock = (struct raw_spinlock){
> .raw_lock = (arch_spinlock_t){
> .val = (atomic_t){
> .counter = (int)-2005885696,
> },
> .locked = (u8)0,
> .pending = (u8)157,
> .locked_pending = (u16)40192,
> .tail = (u16)34928,
> },
> .magic = (unsigned int)536325682,
> .owner_cpu = (unsigned int)29,
> .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
> .dep_map = (struct lockdep_map){
> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> .name = (const char *)0xffff88815626f100 = "",
> .wait_type_outer = (u8)37,
> .wait_type_inner = (u8)178,
> .lock_type = (u8)154,
> },
> },
> .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
> .dep_map = (struct lockdep_map){
> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> .name = (const char *)0xffff88815626f100 = "",
> .wait_type_outer = (u8)37,
> .wait_type_inner = (u8)178,
> .lock_type = (u8)154,
> },
> },
> .head = (struct list_head){
> .next = (struct list_head *)0x112cca,
> .prev = (struct list_head *)0x47,
> },
> },
> .pending = (atomic_t){
> .counter = (int)-1491499288,
> },
> .status = (blk_status_t)130,
> }
>
> Move the call to bio_put() before the atomic decrement and also
> change atomic_dec_return() to atomic_dec_and_test() to fix the
> corruption, as atomic_dec_return() is defined as two instructions on
> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> operation.
And why does this fix the problem?
Can we also get a description here about why the corruption happens?
Having this may be enough to understand why these changes fix the bug.
Thanks.
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 22b8e2764619..cb8b23a3e56b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> */
> WRITE_ONCE(priv->status, bbio->bio.bi_status);
> }
> - if (atomic_dec_return(&priv->pending) == 0) {
> + bio_put(&bbio->bio);
> + if (atomic_dec_and_test(&priv->pending)) {
> int err = blk_status_to_errno(READ_ONCE(priv->status));
>
> if (priv->uring_ctx) {
> @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> wake_up(&priv->wait);
> }
> }
> - bio_put(&bbio->bio);
> }
>
> int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-11 13:27 ` Filipe Manana
@ 2024-11-11 13:44 ` Damien Le Moal
2024-11-11 14:21 ` Filipe Manana
0 siblings, 1 reply; 10+ messages in thread
From: Damien Le Moal @ 2024-11-11 13:44 UTC (permalink / raw)
To: Filipe Manana, Johannes Thumshirn
Cc: linux-btrfs, Johannes Thumshirn, Mark Harmstone, Omar Sandoval,
Shinichiro Kawasaki, Damien Le Moal
On 11/11/24 22:27, Filipe Manana wrote:
> On Mon, Nov 11, 2024 at 9:28 AM Johannes Thumshirn <jth@kernel.org> wrote:
>>
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> Shinichiro reported the following use-after-free that sometimes
>> happens in our CI system when running fstests' btrfs/284 on a TCMU
>> runner device:
>>
>> ==================================================================
>> BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
>> Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
>>
>> CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
>> Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
>> Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x6e/0xa0
>> ? lock_release+0x708/0x780
>> print_report+0x174/0x505
>> ? lock_release+0x708/0x780
>> ? __virt_addr_valid+0x224/0x410
>> ? lock_release+0x708/0x780
>> kasan_report+0xda/0x1b0
>> ? lock_release+0x708/0x780
>> ? __wake_up+0x44/0x60
>> lock_release+0x708/0x780
>> ? __pfx_lock_release+0x10/0x10
>> ? __pfx_do_raw_spin_lock+0x10/0x10
>> ? lock_is_held_type+0x9a/0x110
>> _raw_spin_unlock_irqrestore+0x1f/0x60
>> __wake_up+0x44/0x60
>> btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
>> btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
>> ? lock_release+0x1b0/0x780
>> ? trace_lock_acquire+0x12f/0x1a0
>> ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
>> ? process_one_work+0x7e3/0x1460
>> ? lock_acquire+0x31/0xc0
>> ? process_one_work+0x7e3/0x1460
>> process_one_work+0x85c/0x1460
>> ? __pfx_process_one_work+0x10/0x10
>> ? assign_work+0x16c/0x240
>> worker_thread+0x5e6/0xfc0
>> ? __pfx_worker_thread+0x10/0x10
>> kthread+0x2c3/0x3a0
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork+0x31/0x70
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork_asm+0x1a/0x30
>> </TASK>
>>
>> Allocated by task 3661:
>> kasan_save_stack+0x30/0x50
>> kasan_save_track+0x14/0x30
>> __kasan_kmalloc+0xaa/0xb0
>> btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
>> send_extent_data+0xf0f/0x24a0 [btrfs]
>> process_extent+0x48a/0x1830 [btrfs]
>> changed_cb+0x178b/0x2ea0 [btrfs]
>> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
>> _btrfs_ioctl_send+0x117/0x330 [btrfs]
>> btrfs_ioctl+0x184a/0x60a0 [btrfs]
>> __x64_sys_ioctl+0x12e/0x1a0
>> do_syscall_64+0x95/0x180
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Freed by task 3661:
>> kasan_save_stack+0x30/0x50
>> kasan_save_track+0x14/0x30
>> kasan_save_free_info+0x3b/0x70
>> __kasan_slab_free+0x4f/0x70
>> kfree+0x143/0x490
>> btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
>> send_extent_data+0xf0f/0x24a0 [btrfs]
>> process_extent+0x48a/0x1830 [btrfs]
>> changed_cb+0x178b/0x2ea0 [btrfs]
>> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
>> _btrfs_ioctl_send+0x117/0x330 [btrfs]
>> btrfs_ioctl+0x184a/0x60a0 [btrfs]
>> __x64_sys_ioctl+0x12e/0x1a0
>> do_syscall_64+0x95/0x180
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> The buggy address belongs to the object at ffff888106a83f00
>> which belongs to the cache kmalloc-rnd-07-96 of size 96
>> The buggy address is located 24 bytes inside of
>> freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
>>
>> The buggy address belongs to the physical page:
>> page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
>> flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
>> page_type: f5(slab)
>> raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
>> raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
>> page dumped because: kasan: bad access detected
>>
>> Memory state around the buggy address:
>> ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>> ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>> >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>> ^
>> ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>> ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ==================================================================
>>
>> Further analyzing the trace and the crash dump's vmcore file shows that
>> the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
>> the wait_queue that is in the private data passed to the end_io handler.
>>
>> Commit 4ff47df40447 ("btrfs: move priv off stack in
>> btrfs_encoded_read_regular_fill_pages()") moved 'struct
>> btrfs_encoded_read_private' off the stack.
>>
>> Before that commit one can see a corruption of the private data when
>> analyzing the vmcore after a crash:
>
> Can you highlight what's the corruption?
> Just dumping a large structure isn't immediately obvious, I suppose
> you mean it's related to the large negative values of the atomic
> counters?
>
>>
>> *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
>> .wait = (wait_queue_head_t){
>> .lock = (spinlock_t){
>> .rlock = (struct raw_spinlock){
>> .raw_lock = (arch_spinlock_t){
>> .val = (atomic_t){
>> .counter = (int)-2005885696,
>> },
>> .locked = (u8)0,
>> .pending = (u8)157,
>> .locked_pending = (u16)40192,
>> .tail = (u16)34928,
>> },
>> .magic = (unsigned int)536325682,
>> .owner_cpu = (unsigned int)29,
>> .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
>> .dep_map = (struct lockdep_map){
>> .key = (struct lock_class_key *)0xffff8881575a3b6c,
>> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
>> .name = (const char *)0xffff88815626f100 = "",
>> .wait_type_outer = (u8)37,
>> .wait_type_inner = (u8)178,
>> .lock_type = (u8)154,
>> },
>> },
>> .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
>> .dep_map = (struct lockdep_map){
>> .key = (struct lock_class_key *)0xffff8881575a3b6c,
>> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
>> .name = (const char *)0xffff88815626f100 = "",
>> .wait_type_outer = (u8)37,
>> .wait_type_inner = (u8)178,
>> .lock_type = (u8)154,
>> },
>> },
>> .head = (struct list_head){
>> .next = (struct list_head *)0x112cca,
>> .prev = (struct list_head *)0x47,
>> },
>> },
>> .pending = (atomic_t){
>> .counter = (int)-1491499288,
>> },
>> .status = (blk_status_t)130,
>> }
>>
>> Move the call to bio_put() before the atomic decrement and also
>> change atomic_dec_return() to atomic_dec_and_test() to fix the
>> corruption, as atomic_dec_return() is defined as two instructions on
>> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
>> operation.
>
> And why does this fix the problem?
>
> Can we also get a description here about why the corruption happens?
> Having this may be enough to understand why these changes fix the bug.
Waking up the issuer before doing the bio_put() can result in the issuer function
returning before the bio is cleaned up, which is weird... But thinking more
about this, it is probably not a bug in itself, since the bbio is allocated and
not on the stack of the issuer.
What is definitely a bug is that there is nothing atomic with:
if (atomic_dec_return(&priv->pending) == 0)
wake_up(&priv->wait);
on the completion side and
if (atomic_dec_return(&priv.pending))
io_wait_event(priv.wait, !atomic_read(&priv.pending));
on the issuer side (btrfs_encoded_read_regular_fill_pages()).
The atomic_dec_return() on the completion side can cause io_wait_event() to
return and thus have the issuer return *BEFORE* the completion side has a chance
to call wake_up(&priv->wait), because the "if" is not fully processed yet as the
comparison with 0 is still needed. And given that priv is on the stack of the
issuer, the wake_up() call can end up referencing invalid memory.
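To make the window concrete, here is a reduced user-space sketch of that interleaving using C11 atomics. The names mirror the kernel code, but this is a hypothetical single-threaded model of the two-CPU race, not the btrfs implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for priv->pending and for the kfree()/stack reuse of priv. */
static atomic_int pending;
static bool priv_freed;

/* Completion side, modelled on the pre-fix code: atomic_dec_return() is
 * only the decrement; the comparison and wake_up() are separate steps. */
static int dec_return(void)
{
	return atomic_fetch_sub(&pending, 1) - 1;  /* atomic_dec_return() */
}

/* wake_up(&priv->wait) dereferences priv; if the issuer already freed
 * it, this is the use-after-free. */
static bool wake_up_touches_freed_priv(int newval)
{
	return newval == 0 && priv_freed;
}

/* Issuer side: io_wait_event(priv.wait, !atomic_read(&priv.pending)) */
static bool issuer_sees_done(void)
{
	return atomic_load(&pending) == 0;
}

/* Serialise the losing interleaving by hand. */
static bool demonstrate_window(void)
{
	atomic_store(&pending, 1);
	priv_freed = false;

	int newval = dec_return();  /* completion side: pending -> 0 */
	/* Window: the issuer's atomic_read() observes 0 right here ... */
	if (issuer_sees_done())
		priv_freed = true;  /* ... returns, and frees priv ... */
	/* ... before the completion side reaches its wake_up(). */
	return wake_up_touches_freed_priv(newval);
}
```

The sketch serialises the interleaving explicitly; on real hardware the same ordering only requires the two CPUs to hit the window between the decrement and the wake_up().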
Johannes,
I think more and more that the proper fix is to squash your 2 patches together
to avoid all these dangerous games played with the pending atomic.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-11 13:44 ` Damien Le Moal
@ 2024-11-11 14:21 ` Filipe Manana
0 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2024-11-11 14:21 UTC (permalink / raw)
To: Damien Le Moal
Cc: Johannes Thumshirn, linux-btrfs, Johannes Thumshirn,
Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki,
Damien Le Moal
On Mon, Nov 11, 2024 at 1:44 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 11/11/24 22:27, Filipe Manana wrote:
> > On Mon, Nov 11, 2024 at 9:28 AM Johannes Thumshirn <jth@kernel.org> wrote:
> >>
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> Shinichiro reported the following use-after-free that sometimes
> >> happens in our CI system when running fstests' btrfs/284 on a TCMU
> >> runner device:
> >>
> >> ==================================================================
> >> BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
> >> Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
> >>
> >> CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
> >> Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
> >> Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
> >> Call Trace:
> >> <TASK>
> >> dump_stack_lvl+0x6e/0xa0
> >> ? lock_release+0x708/0x780
> >> print_report+0x174/0x505
> >> ? lock_release+0x708/0x780
> >> ? __virt_addr_valid+0x224/0x410
> >> ? lock_release+0x708/0x780
> >> kasan_report+0xda/0x1b0
> >> ? lock_release+0x708/0x780
> >> ? __wake_up+0x44/0x60
> >> lock_release+0x708/0x780
> >> ? __pfx_lock_release+0x10/0x10
> >> ? __pfx_do_raw_spin_lock+0x10/0x10
> >> ? lock_is_held_type+0x9a/0x110
> >> _raw_spin_unlock_irqrestore+0x1f/0x60
> >> __wake_up+0x44/0x60
> >> btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
> >> btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
> >> ? lock_release+0x1b0/0x780
> >> ? trace_lock_acquire+0x12f/0x1a0
> >> ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
> >> ? process_one_work+0x7e3/0x1460
> >> ? lock_acquire+0x31/0xc0
> >> ? process_one_work+0x7e3/0x1460
> >> process_one_work+0x85c/0x1460
> >> ? __pfx_process_one_work+0x10/0x10
> >> ? assign_work+0x16c/0x240
> >> worker_thread+0x5e6/0xfc0
> >> ? __pfx_worker_thread+0x10/0x10
> >> kthread+0x2c3/0x3a0
> >> ? __pfx_kthread+0x10/0x10
> >> ret_from_fork+0x31/0x70
> >> ? __pfx_kthread+0x10/0x10
> >> ret_from_fork_asm+0x1a/0x30
> >> </TASK>
> >>
> >> Allocated by task 3661:
> >> kasan_save_stack+0x30/0x50
> >> kasan_save_track+0x14/0x30
> >> __kasan_kmalloc+0xaa/0xb0
> >> btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
> >> send_extent_data+0xf0f/0x24a0 [btrfs]
> >> process_extent+0x48a/0x1830 [btrfs]
> >> changed_cb+0x178b/0x2ea0 [btrfs]
> >> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
> >> _btrfs_ioctl_send+0x117/0x330 [btrfs]
> >> btrfs_ioctl+0x184a/0x60a0 [btrfs]
> >> __x64_sys_ioctl+0x12e/0x1a0
> >> do_syscall_64+0x95/0x180
> >> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >>
> >> Freed by task 3661:
> >> kasan_save_stack+0x30/0x50
> >> kasan_save_track+0x14/0x30
> >> kasan_save_free_info+0x3b/0x70
> >> __kasan_slab_free+0x4f/0x70
> >> kfree+0x143/0x490
> >> btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
> >> send_extent_data+0xf0f/0x24a0 [btrfs]
> >> process_extent+0x48a/0x1830 [btrfs]
> >> changed_cb+0x178b/0x2ea0 [btrfs]
> >> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
> >> _btrfs_ioctl_send+0x117/0x330 [btrfs]
> >> btrfs_ioctl+0x184a/0x60a0 [btrfs]
> >> __x64_sys_ioctl+0x12e/0x1a0
> >> do_syscall_64+0x95/0x180
> >> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >>
> >> The buggy address belongs to the object at ffff888106a83f00
> >> which belongs to the cache kmalloc-rnd-07-96 of size 96
> >> The buggy address is located 24 bytes inside of
> >> freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
> >>
> >> The buggy address belongs to the physical page:
> >> page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
> >> flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
> >> page_type: f5(slab)
> >> raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
> >> raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
> >> page dumped because: kasan: bad access detected
> >>
> >> Memory state around the buggy address:
> >> ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> >> ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> >> >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> >> ^
> >> ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> >> ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ==================================================================
> >>
> >> Further analyzing the trace and the crash dump's vmcore file shows that
> >> the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
> >> the wait_queue that is in the private data passed to the end_io handler.
> >>
> >> Commit 4ff47df40447 ("btrfs: move priv off stack in
> >> btrfs_encoded_read_regular_fill_pages()") moved 'struct
> >> btrfs_encoded_read_private' off the stack.
> >>
> >> Before that commit one can see a corruption of the private data when
> >> analyzing the vmcore after a crash:
> >
> > Can you highlight what's the corruption?
> > Just dumping a large structure isn't immediately obvious, I suppose
> > you mean it's related to the large negative values of the atomic
> > counters?
> >
> >>
> >> *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
> >> .wait = (wait_queue_head_t){
> >> .lock = (spinlock_t){
> >> .rlock = (struct raw_spinlock){
> >> .raw_lock = (arch_spinlock_t){
> >> .val = (atomic_t){
> >> .counter = (int)-2005885696,
> >> },
> >> .locked = (u8)0,
> >> .pending = (u8)157,
> >> .locked_pending = (u16)40192,
> >> .tail = (u16)34928,
> >> },
> >> .magic = (unsigned int)536325682,
> >> .owner_cpu = (unsigned int)29,
> >> .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
> >> .dep_map = (struct lockdep_map){
> >> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> >> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> >> .name = (const char *)0xffff88815626f100 = "",
> >> .wait_type_outer = (u8)37,
> >> .wait_type_inner = (u8)178,
> >> .lock_type = (u8)154,
> >> },
> >> },
> >> .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
> >> .dep_map = (struct lockdep_map){
> >> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> >> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> >> .name = (const char *)0xffff88815626f100 = "",
> >> .wait_type_outer = (u8)37,
> >> .wait_type_inner = (u8)178,
> >> .lock_type = (u8)154,
> >> },
> >> },
> >> .head = (struct list_head){
> >> .next = (struct list_head *)0x112cca,
> >> .prev = (struct list_head *)0x47,
> >> },
> >> },
> >> .pending = (atomic_t){
> >> .counter = (int)-1491499288,
> >> },
> >> .status = (blk_status_t)130,
> >> }
> >>
> >> Move the call to bio_put() before the atomic_test operation and
> >> also change atomic_dec_return() to atomic_dec_and_test() to fix the
> >> corruption, as atomic_dec_return() is defined as two instructions on
> >> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> >> operation.
> >
> > And why does this fix the problem?
> >
> > Can we also get a description here about why the corruption happens?
> > Having this may be enough to understand why these changes fix the bug.
>
> Waking up the issuer before doing the bio put can result in the issuer function
> returning before the bio is cleaned up. Which is weird... But thinking more
> about this, it is probably not a bug in itself since the bbio is allocated and
> not on the stack of the issuer.
>
> What is definitely a bug is that there is nothing atomic with:
>
> if (atomic_dec_return(&priv->pending) == 0)
> wake_up(&priv->wait);
>
> on the completion side and
>
> if (atomic_dec_return(&priv.pending))
> io_wait_event(priv.wait, !atomic_read(&priv.pending));
>
> on the issuer side (btrfs_encoded_read_regular_fill_pages()).
>
> The atomic_dec_return() on the completion side can cause io_wait_event() to
> return and thus have the issuer return *BEFORE* the completion side has a chance
> to call wake_up(&priv->wait), because the "if" is not fully processed yet as the
> comparison with 0 is still needed. And given that priv is on the stack of the
> issuer, the wake_up() call can end up referencing invalid memory.
Sure, I get that.
What I'm asking for is to have the change log explain how/why the bug
happens and why the proposed solution fixes the bug.
And not just dump a structure's content and say what the change does
without any explanation at all.
Every patch fixing a bug should have those details.
>
> Johannes,
>
> I think more and more that the proper fix is to squash your 2 patches together
> to avoid all these dangerous games played with the pending atomic.
>
> --
> Damien Le Moal
> Western Digital Research
* Re: [PATCH 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-11 9:28 ` [PATCH 1/2] " Johannes Thumshirn
2024-11-11 13:27 ` Filipe Manana
@ 2024-11-11 18:17 ` Mark Harmstone
1 sibling, 0 replies; 10+ messages in thread
From: Mark Harmstone @ 2024-11-11 18:17 UTC (permalink / raw)
To: Johannes Thumshirn, linux-btrfs@vger.kernel.org
Cc: Damien Le Moal, Johannes Thumshirn, Mark Harmstone, Omar Sandoval,
Shinichiro Kawasaki, Damien Le Moal
On 11/11/24 09:28, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Shinichiro reported the following use-after-free that sometimes
> happens in our CI system when running fstests' btrfs/284 on a TCMU
> runner device:
>
> ==================================================================
> BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
> Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
>
> CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
> Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
> Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x6e/0xa0
> ? lock_release+0x708/0x780
> print_report+0x174/0x505
> ? lock_release+0x708/0x780
> ? __virt_addr_valid+0x224/0x410
> ? lock_release+0x708/0x780
> kasan_report+0xda/0x1b0
> ? lock_release+0x708/0x780
> ? __wake_up+0x44/0x60
> lock_release+0x708/0x780
> ? __pfx_lock_release+0x10/0x10
> ? __pfx_do_raw_spin_lock+0x10/0x10
> ? lock_is_held_type+0x9a/0x110
> _raw_spin_unlock_irqrestore+0x1f/0x60
> __wake_up+0x44/0x60
> btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
> btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
> ? lock_release+0x1b0/0x780
> ? trace_lock_acquire+0x12f/0x1a0
> ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
> ? process_one_work+0x7e3/0x1460
> ? lock_acquire+0x31/0xc0
> ? process_one_work+0x7e3/0x1460
> process_one_work+0x85c/0x1460
> ? __pfx_process_one_work+0x10/0x10
> ? assign_work+0x16c/0x240
> worker_thread+0x5e6/0xfc0
> ? __pfx_worker_thread+0x10/0x10
> kthread+0x2c3/0x3a0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x31/0x70
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> Allocated by task 3661:
> kasan_save_stack+0x30/0x50
> kasan_save_track+0x14/0x30
> __kasan_kmalloc+0xaa/0xb0
> btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
> send_extent_data+0xf0f/0x24a0 [btrfs]
> process_extent+0x48a/0x1830 [btrfs]
> changed_cb+0x178b/0x2ea0 [btrfs]
> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
> _btrfs_ioctl_send+0x117/0x330 [btrfs]
> btrfs_ioctl+0x184a/0x60a0 [btrfs]
> __x64_sys_ioctl+0x12e/0x1a0
> do_syscall_64+0x95/0x180
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Freed by task 3661:
> kasan_save_stack+0x30/0x50
> kasan_save_track+0x14/0x30
> kasan_save_free_info+0x3b/0x70
> __kasan_slab_free+0x4f/0x70
> kfree+0x143/0x490
> btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
> send_extent_data+0xf0f/0x24a0 [btrfs]
> process_extent+0x48a/0x1830 [btrfs]
> changed_cb+0x178b/0x2ea0 [btrfs]
> btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
> _btrfs_ioctl_send+0x117/0x330 [btrfs]
> btrfs_ioctl+0x184a/0x60a0 [btrfs]
> __x64_sys_ioctl+0x12e/0x1a0
> do_syscall_64+0x95/0x180
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The buggy address belongs to the object at ffff888106a83f00
> which belongs to the cache kmalloc-rnd-07-96 of size 96
> The buggy address is located 24 bytes inside of
> freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
>
> The buggy address belongs to the physical page:
> page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
> flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
> page_type: f5(slab)
> raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
> raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ^
> ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ==================================================================
>
> Further analyzing the trace and the crash dump's vmcore file shows that
> the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
> the wait_queue that is in the private data passed to the end_io handler.
>
> Commit 4ff47df40447 ("btrfs: move priv off stack in
> btrfs_encoded_read_regular_fill_pages()") moved 'struct
> btrfs_encoded_read_private' off the stack.
>
> Before that commit one can see a corruption of the private data when
> analyzing the vmcore after a crash:
>
> *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
> .wait = (wait_queue_head_t){
> .lock = (spinlock_t){
> .rlock = (struct raw_spinlock){
> .raw_lock = (arch_spinlock_t){
> .val = (atomic_t){
> .counter = (int)-2005885696,
> },
> .locked = (u8)0,
> .pending = (u8)157,
> .locked_pending = (u16)40192,
> .tail = (u16)34928,
> },
> .magic = (unsigned int)536325682,
> .owner_cpu = (unsigned int)29,
> .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
> .dep_map = (struct lockdep_map){
> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> .name = (const char *)0xffff88815626f100 = "",
> .wait_type_outer = (u8)37,
> .wait_type_inner = (u8)178,
> .lock_type = (u8)154,
> },
> },
> .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
> .dep_map = (struct lockdep_map){
> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> .name = (const char *)0xffff88815626f100 = "",
> .wait_type_outer = (u8)37,
> .wait_type_inner = (u8)178,
> .lock_type = (u8)154,
> },
> },
> .head = (struct list_head){
> .next = (struct list_head *)0x112cca,
> .prev = (struct list_head *)0x47,
> },
> },
> .pending = (atomic_t){
> .counter = (int)-1491499288,
> },
> .status = (blk_status_t)130,
> }
>
> Move the call to bio_put() before the atomic_test operation and
> also change atomic_dec_return() to atomic_dec_and_test() to fix the
> corruption, as atomic_dec_return() is defined as two instructions on
> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> operation.
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 22b8e2764619..cb8b23a3e56b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> */
> WRITE_ONCE(priv->status, bbio->bio.bi_status);
> }
> - if (atomic_dec_return(&priv->pending) == 0) {
> + bio_put(&bbio->bio);
> + if (atomic_dec_and_test(&priv->pending)) {
> int err = blk_status_to_errno(READ_ONCE(priv->status));
>
> if (priv->uring_ctx) {
> @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> wake_up(&priv->wait);
> }
> }
> - bio_put(&bbio->bio);
> }
>
> int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
Thanks Johannes, this patch looks good to me. Thank you for fixing.
It looks like atomic_dec_return is a bit of a footgun... does this imply
that all instances of atomic_dec_return(&foo) == 0 should actually be
atomic_dec_and_test(&foo)?
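For comparison, user-space C11 analogues of the two primitives (a sketch only, not the kernel implementations): both perform exactly one atomic RMW on the counter, and dec_and_test(v) returns the same truth value as dec_return(v) == 0, so wherever only the zero test is consumed the two spellings observe the same transition:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* atomic_dec_return() analogue: decrement, hand back the new value. */
static int dec_return(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) - 1;
}

/* atomic_dec_and_test() analogue: decrement, report whether it hit
 * zero. On x86-64 the kernel implements this with a flag-setting
 * "lock decl", so the zero test comes out of the same instruction
 * as the RMW rather than a separate compare. */
static bool dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) - 1 == 0;
}
```

Either way the decrement itself is atomic; the difference is only in how the result is materialised and consumed afterwards.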
Mark
Reviewed-by: Mark Harmstone <maharmstone@fb.com>
* Re: [PATCH 2/2] btrfs: simplify waiting for encoded read endios
2024-11-11 9:28 ` [PATCH 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
2024-11-11 9:56 ` Damien Le Moal
@ 2024-11-11 18:36 ` Mark Harmstone
1 sibling, 0 replies; 10+ messages in thread
From: Mark Harmstone @ 2024-11-11 18:36 UTC (permalink / raw)
To: Johannes Thumshirn, linux-btrfs@vger.kernel.org
Cc: Damien Le Moal, Johannes Thumshirn, Mark Harmstone, Omar Sandoval,
Damien Le Moal
On 11/11/24 09:28, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Simplify the I/O completion path for encoded reads by using a
> completion instead of a wait_queue.
>
> Furthermore, skip taking an extra reference that is instantly
> dropped anyway.
>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/inode.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index cb8b23a3e56b..916c9d7ca112 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9068,7 +9068,7 @@ static ssize_t btrfs_encoded_read_inline(
> }
>
> struct btrfs_encoded_read_private {
> - wait_queue_head_t wait;
> + struct completion done;
> void *uring_ctx;
> atomic_t pending;
> blk_status_t status;
> @@ -9097,7 +9097,7 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> btrfs_uring_read_extent_endio(priv->uring_ctx, err);
> kfree(priv);
> } else {
> - wake_up(&priv->wait);
> + complete(&priv->done);
> }
> }
> }
> @@ -9116,7 +9116,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> if (!priv)
> return -ENOMEM;
>
> - init_waitqueue_head(&priv->wait);
> + init_completion(&priv->done);
> atomic_set(&priv->pending, 1);
> priv->status = 0;
> priv->uring_ctx = uring_ctx;
> @@ -9145,11 +9145,10 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> disk_io_size -= bytes;
> } while (disk_io_size);
>
> - atomic_inc(&priv->pending);
> btrfs_submit_bbio(bbio, 0);
Is this safe for the io_uring path? The bio completion calls
btrfs_encoded_read_endio, which frees priv if it decreases pending to 0.
>
> if (uring_ctx) {
> - if (atomic_dec_return(&priv->pending) == 0) {
> + if (atomic_read(&priv->pending) == 0) {
> ret = blk_status_to_errno(READ_ONCE(priv->status));
> btrfs_uring_read_extent_endio(uring_ctx, ret);
> kfree(priv);
> @@ -9158,8 +9157,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
>
> return -EIOCBQUEUED;
> } else {
> - if (atomic_dec_return(&priv->pending) != 0)
> - io_wait_event(priv->wait, !atomic_read(&priv->pending));
> + wait_for_completion(&priv->done);
Probably a nitpick, but completion.txt implies that this ought to be
wait_for_completion_io.
> /* See btrfs_encoded_read_endio() for ordering. */
> ret = blk_status_to_errno(READ_ONCE(priv->status));
> kfree(priv);
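For reference, the reason a completion closes the lifetime race can be seen from a user-space pthread analogue (a sketch under the assumption that the kernel completion behaves similarly; this is not the kernel API): complete() publishes done and signals while holding the lock, so the waiter cannot reacquire the mutex, return, and free the structure until complete() has stopped touching it.

```c
#include <pthread.h>
#include <stdbool.h>

/* Minimal user-space analogue of struct completion (sketch only). */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	/* The waiter cannot return from wait_for_completion() until this
	 * unlock, so 'c' is guaranteed live for this whole function. */
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```

The kernel's complete() likewise issues the wakeup under the wait-queue lock, which is exactly the mutual exclusion the open-coded atomic_dec_return()/wake_up() pair lacked.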
Mark
Thread overview: 10+ messages
2024-11-11 9:28 [PATCH 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn
2024-11-11 9:28 ` [PATCH 1/2] " Johannes Thumshirn
2024-11-11 13:27 ` Filipe Manana
2024-11-11 13:44 ` Damien Le Moal
2024-11-11 14:21 ` Filipe Manana
2024-11-11 18:17 ` Mark Harmstone
2024-11-11 9:28 ` [PATCH 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
2024-11-11 9:56 ` Damien Le Moal
2024-11-11 10:13 ` Johannes Thumshirn
2024-11-11 18:36 ` Mark Harmstone