* [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio @ 2024-11-12 13:53 Johannes Thumshirn 2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn 2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn 0 siblings, 2 replies; 9+ messages in thread From: Johannes Thumshirn @ 2024-11-12 13:53 UTC (permalink / raw) To: linux-btrfs Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn, Johannes Thumshirn, Mark Harmstone, Omar Sandoval Shinichiro reported an occasional memory corruption in our CI system with btrfs/248 that led to panics. He also managed to reproduce this corruption reliably on one host. See patch 1/2 for details on the corruption and the fix; patch 2/2 is a cleanup Damien suggested on top of the fix to make the code more obvious. Changes to v1: - Update commit message of patch 1/2 - Prevent double-free of 'priv' in case of io_uring in 2/2 - Use wait_for_completion_io() in 2/2 - Convert priv->pending from atomic_t to refcount_t, calling it refs, in 2/2 Link to v1: https://lore.kernel.org/linux-btrfs/cover.1731316882.git.jth@kernel.org Johannes Thumshirn (2): btrfs: fix use-after-free in btrfs_encoded_read_endio btrfs: simplify waiting for encoded read endios fs/btrfs/inode.c | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio 2024-11-12 13:53 [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn @ 2024-11-12 13:53 ` Johannes Thumshirn 2024-11-12 14:45 ` Filipe Manana 2024-11-12 20:50 ` Qu Wenruo 2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn 1 sibling, 2 replies; 9+ messages in thread From: Johannes Thumshirn @ 2024-11-12 13:53 UTC (permalink / raw) To: linux-btrfs Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn, Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal From: Johannes Thumshirn <johannes.thumshirn@wdc.com> Shinichiro reported the following use-after free that sometimes is happening in our CI system when running fstests' btrfs/284 on a TCMU runner device: ================================================================== BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780 Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219 CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15 Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020 Workqueue: btrfs-endio btrfs_end_bio_work [btrfs] Call Trace: <TASK> dump_stack_lvl+0x6e/0xa0 ? lock_release+0x708/0x780 print_report+0x174/0x505 ? lock_release+0x708/0x780 ? __virt_addr_valid+0x224/0x410 ? lock_release+0x708/0x780 kasan_report+0xda/0x1b0 ? lock_release+0x708/0x780 ? __wake_up+0x44/0x60 lock_release+0x708/0x780 ? __pfx_lock_release+0x10/0x10 ? __pfx_do_raw_spin_lock+0x10/0x10 ? lock_is_held_type+0x9a/0x110 _raw_spin_unlock_irqrestore+0x1f/0x60 __wake_up+0x44/0x60 btrfs_encoded_read_endio+0x14b/0x190 [btrfs] btrfs_check_read_bio+0x8d9/0x1360 [btrfs] ? lock_release+0x1b0/0x780 ? trace_lock_acquire+0x12f/0x1a0 ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs] ? process_one_work+0x7e3/0x1460 ? lock_acquire+0x31/0xc0 ? process_one_work+0x7e3/0x1460 process_one_work+0x85c/0x1460 ? 
__pfx_process_one_work+0x10/0x10 ? assign_work+0x16c/0x240 worker_thread+0x5e6/0xfc0 ? __pfx_worker_thread+0x10/0x10 kthread+0x2c3/0x3a0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x31/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 3661: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0xaa/0xb0 btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs] send_extent_data+0xf0f/0x24a0 [btrfs] process_extent+0x48a/0x1830 [btrfs] changed_cb+0x178b/0x2ea0 [btrfs] btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs] _btrfs_ioctl_send+0x117/0x330 [btrfs] btrfs_ioctl+0x184a/0x60a0 [btrfs] __x64_sys_ioctl+0x12e/0x1a0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 3661: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x70 __kasan_slab_free+0x4f/0x70 kfree+0x143/0x490 btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs] send_extent_data+0xf0f/0x24a0 [btrfs] process_extent+0x48a/0x1830 [btrfs] changed_cb+0x178b/0x2ea0 [btrfs] btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs] _btrfs_ioctl_send+0x117/0x330 [btrfs] btrfs_ioctl+0x184a/0x60a0 [btrfs] __x64_sys_ioctl+0x12e/0x1a0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e The buggy address belongs to the object at ffff888106a83f00 which belongs to the cache kmalloc-rnd-07-96 of size 96 The buggy address is located 24 bytes inside of freed 96-byte region [ffff888106a83f00, ffff888106a83f60) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83 flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) page_type: f5(slab) raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004 raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc 
ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ^ ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Further analyzing the trace and the crash dump's vmcore file shows that the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on the wait_queue that is in the private data passed to the end_io handler. Commit 4ff47df40447 ("btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()") moved 'struct btrfs_encoded_read_private' off the stack. Before that commit one can see a corruption of the private data when analyzing the vmcore after a crash: *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = { .wait = (wait_queue_head_t){ .lock = (spinlock_t){ .rlock = (struct raw_spinlock){ .raw_lock = (arch_spinlock_t){ .val = (atomic_t){ .counter = (int)-2005885696, }, .locked = (u8)0, .pending = (u8)157, .locked_pending = (u16)40192, .tail = (u16)34928, }, .magic = (unsigned int)536325682, .owner_cpu = (unsigned int)29, .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0, .dep_map = (struct lockdep_map){ .key = (struct lock_class_key *)0xffff8881575a3b6c, .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 }, .name = (const char *)0xffff88815626f100 = "", .wait_type_outer = (u8)37, .wait_type_inner = (u8)178, .lock_type = (u8)154, }, }, .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 }, .dep_map = (struct lockdep_map){ .key = (struct lock_class_key *)0xffff8881575a3b6c, .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 }, .name = (const char *)0xffff88815626f100 = "", .wait_type_outer = (u8)37, .wait_type_inner = (u8)178, .lock_type = (u8)154, }, }, .head = (struct list_head){ .next = (struct list_head *)0x112cca, .prev = 
(struct list_head *)0x47, }, }, .pending = (atomic_t){ .counter = (int)-1491499288, }, .status = (blk_status_t)130, } Here we can see several indicators of in-memory data corruption, e.g. the large negative atomic values of ->pending or ->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic 0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead, or the bogus pointer values for ->wait->head. To fix this, move the call to bio_put() before the atomic dec-and-test operation so the submitter side in btrfs_encoded_read_regular_fill_pages() is not woken up before the bio is cleaned up. Also change atomic_dec_return() to atomic_dec_and_test() to fix the corruption, as atomic_dec_return() is defined as two instructions on x86_64, whereas atomic_dec_and_test() is defined as a single atomic operation. This can lead to a situation where the counter value is already decremented but the if statement in btrfs_encoded_read_endio() is not completely processed, i.e. the 0 test has not completed. If another thread continues executing btrfs_encoded_read_regular_fill_pages(), the atomic_dec_return() there can see an already updated ->pending counter and continues by freeing the private data. Continuing in the endio handler, the test for 0 succeeds and the wait_queue is woken up, resulting in a use-after-free. 
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com> Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> --- fs/btrfs/inode.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 22b8e2764619..cb8b23a3e56b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio) */ WRITE_ONCE(priv->status, bbio->bio.bi_status); } - if (atomic_dec_return(&priv->pending) == 0) { + bio_put(&bbio->bio); + if (atomic_dec_and_test(&priv->pending)) { int err = blk_status_to_errno(READ_ONCE(priv->status)); if (priv->uring_ctx) { @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio) wake_up(&priv->wait); } } - bio_put(&bbio->bio); } int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, -- 2.43.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio 2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn @ 2024-11-12 14:45 ` Filipe Manana 2024-11-12 17:20 ` Johannes Thumshirn 2024-11-12 20:50 ` Qu Wenruo 1 sibling, 1 reply; 9+ messages in thread From: Filipe Manana @ 2024-11-12 14:45 UTC (permalink / raw) To: Johannes Thumshirn Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Johannes Thumshirn, Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal On Tue, Nov 12, 2024 at 1:54 PM Johannes Thumshirn <jth@kernel.org> wrote: > > From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > Shinichiro reported the following use-after free that sometimes is > happening in our CI system when running fstests' btrfs/284 on a TCMU > runner device: > > ================================================================== > BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780 > Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219 > > CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15 > Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020 > Workqueue: btrfs-endio btrfs_end_bio_work [btrfs] > Call Trace: > <TASK> > dump_stack_lvl+0x6e/0xa0 > ? lock_release+0x708/0x780 > print_report+0x174/0x505 > ? lock_release+0x708/0x780 > ? __virt_addr_valid+0x224/0x410 > ? lock_release+0x708/0x780 > kasan_report+0xda/0x1b0 > ? lock_release+0x708/0x780 > ? __wake_up+0x44/0x60 > lock_release+0x708/0x780 > ? __pfx_lock_release+0x10/0x10 > ? __pfx_do_raw_spin_lock+0x10/0x10 > ? lock_is_held_type+0x9a/0x110 > _raw_spin_unlock_irqrestore+0x1f/0x60 > __wake_up+0x44/0x60 > btrfs_encoded_read_endio+0x14b/0x190 [btrfs] > btrfs_check_read_bio+0x8d9/0x1360 [btrfs] > ? lock_release+0x1b0/0x780 > ? trace_lock_acquire+0x12f/0x1a0 > ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs] > ? process_one_work+0x7e3/0x1460 > ? lock_acquire+0x31/0xc0 > ? process_one_work+0x7e3/0x1460 > process_one_work+0x85c/0x1460 > ? 
__pfx_process_one_work+0x10/0x10 > ? assign_work+0x16c/0x240 > worker_thread+0x5e6/0xfc0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x2c3/0x3a0 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x31/0x70 > ? __pfx_kthread+0x10/0x10 > ret_from_fork_asm+0x1a/0x30 > </TASK> > > Allocated by task 3661: > kasan_save_stack+0x30/0x50 > kasan_save_track+0x14/0x30 > __kasan_kmalloc+0xaa/0xb0 > btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs] > send_extent_data+0xf0f/0x24a0 [btrfs] > process_extent+0x48a/0x1830 [btrfs] > changed_cb+0x178b/0x2ea0 [btrfs] > btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs] > _btrfs_ioctl_send+0x117/0x330 [btrfs] > btrfs_ioctl+0x184a/0x60a0 [btrfs] > __x64_sys_ioctl+0x12e/0x1a0 > do_syscall_64+0x95/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Freed by task 3661: > kasan_save_stack+0x30/0x50 > kasan_save_track+0x14/0x30 > kasan_save_free_info+0x3b/0x70 > __kasan_slab_free+0x4f/0x70 > kfree+0x143/0x490 > btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs] > send_extent_data+0xf0f/0x24a0 [btrfs] > process_extent+0x48a/0x1830 [btrfs] > changed_cb+0x178b/0x2ea0 [btrfs] > btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs] > _btrfs_ioctl_send+0x117/0x330 [btrfs] > btrfs_ioctl+0x184a/0x60a0 [btrfs] > __x64_sys_ioctl+0x12e/0x1a0 > do_syscall_64+0x95/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > The buggy address belongs to the object at ffff888106a83f00 > which belongs to the cache kmalloc-rnd-07-96 of size 96 > The buggy address is located 24 bytes inside of > freed 96-byte region [ffff888106a83f00, ffff888106a83f60) > > The buggy address belongs to the physical page: > page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83 > flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) > page_type: f5(slab) > raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004 > raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000 > page dumped because: kasan: bad access detected > > Memory 
state around the buggy address: > ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > ^ > ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > ================================================================== > > Further analyzing the trace and the crash dump's vmcore file shows that > the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on > the wait_queue that is in the private data passed to the end_io handler. > > Commit 4ff47df40447 ("btrfs: move priv off stack in > btrfs_encoded_read_regular_fill_pages()") moved 'struct > btrfs_encoded_read_private' off the stack. > > Before that commit one can see a corruption of the private data when > analyzing the vmcore after a crash: > > *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = { > .wait = (wait_queue_head_t){ > .lock = (spinlock_t){ > .rlock = (struct raw_spinlock){ > .raw_lock = (arch_spinlock_t){ > .val = (atomic_t){ > .counter = (int)-2005885696, > }, > .locked = (u8)0, > .pending = (u8)157, > .locked_pending = (u16)40192, > .tail = (u16)34928, > }, > .magic = (unsigned int)536325682, > .owner_cpu = (unsigned int)29, > .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0, > .dep_map = (struct lockdep_map){ > .key = (struct lock_class_key *)0xffff8881575a3b6c, > .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 }, > .name = (const char *)0xffff88815626f100 = "", > .wait_type_outer = (u8)37, > .wait_type_inner = (u8)178, > .lock_type = (u8)154, > }, > }, > .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 }, > .dep_map = (struct lockdep_map){ > .key = (struct lock_class_key *)0xffff8881575a3b6c, > .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 }, > .name = (const 
char *)0xffff88815626f100 = "", > .wait_type_outer = (u8)37, > .wait_type_inner = (u8)178, > .lock_type = (u8)154, > }, > }, > .head = (struct list_head){ > .next = (struct list_head *)0x112cca, > .prev = (struct list_head *)0x47, > }, > }, > .pending = (atomic_t){ > .counter = (int)-1491499288, > }, > .status = (blk_status_t)130, > } > > Here we can see several indicators of in-memory data corruption, e.g. the > large negative atomic values of ->pending or > ->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic > 0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus > pointer values for ->wait->head. > > To fix this, move the call to bio_put() before the atomic_test operation > so the submitter side in btrfs_encoded_read_regular_fill_pages() is not > woken up before the bio is cleaned up. This is the part I don't see what's the relation to the use-after-free problem on the private structure. This seems like a cleanup that should be a separate patch with its own changelog. > > Also change atomic_dec_return() to atomic_dec_and_test() to fix the > corruption, as atomic_dec_return() is defined as two instructions on > x86_64, whereas atomic_dec_and_test() is defined as a single atomic > operation. This can lead to a situation where counter value is already > decremented but the if statement in btrfs_encoded_read_endio() is not > completely processed, i.e. the 0 test has not completed. If another thread > continues executing btrfs_encoded_read_regular_fill_pages() the > atomic_dec_return() there can see an already updated ->pending counter and > continues by freeing the private data. Continuing in the endio handler the > test for 0 succeeds and the wait_queue is woken up, resulting in a > use-after-free. This is the sort of explanation that should have been in v1. 
Basically, that the non-atomicity of atomic_dec_return() can make the waiter see the 0 value and free the private structure before the waker does a wake_up() against the private's wait queue. So with bio_put() change in a separate patch: Reviewed-by: Filipe Manana <fdmanana@suse.com> (or with it but with an explanation on how this relates to the use-after-free, which I can't see) Thanks. > > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> > Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com> > Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl") > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > --- > fs/btrfs/inode.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 22b8e2764619..cb8b23a3e56b 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio) > */ > WRITE_ONCE(priv->status, bbio->bio.bi_status); > } > - if (atomic_dec_return(&priv->pending) == 0) { > + bio_put(&bbio->bio); > + if (atomic_dec_and_test(&priv->pending)) { > int err = blk_status_to_errno(READ_ONCE(priv->status)); > > if (priv->uring_ctx) { > @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio) > wake_up(&priv->wait); > } > } > - bio_put(&bbio->bio); > } > > int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio 2024-11-12 14:45 ` Filipe Manana @ 2024-11-12 17:20 ` Johannes Thumshirn 0 siblings, 0 replies; 9+ messages in thread From: Johannes Thumshirn @ 2024-11-12 17:20 UTC (permalink / raw) To: Filipe Manana, Johannes Thumshirn Cc: linux-btrfs@vger.kernel.org, Filipe Manana, Damien Le Moal, Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal On 12.11.24 15:46, Filipe Manana wrote: > This is the part I don't see what's the relation to the use-after-free > problem on the private structure. > This seems like a cleanup that should be a separate patch with its own > changelog. I need to re-test first, because AFAIR both the bio_put() and the atomic_dec_and_test() were needed to fix the bug on the test system. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio 2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn 2024-11-12 14:45 ` Filipe Manana @ 2024-11-12 20:50 ` Qu Wenruo 2024-11-13 8:01 ` Johannes Thumshirn 1 sibling, 1 reply; 9+ messages in thread From: Qu Wenruo @ 2024-11-12 20:50 UTC (permalink / raw) To: Johannes Thumshirn, linux-btrfs Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn, Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal 在 2024/11/13 00:23, Johannes Thumshirn 写道: > From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > Shinichiro reported the following use-after free that sometimes is > happening in our CI system when running fstests' btrfs/284 on a TCMU > runner device: > > ================================================================== > BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780 > Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219 > > CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15 > Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020 > Workqueue: btrfs-endio btrfs_end_bio_work [btrfs] > Call Trace: > <TASK> > dump_stack_lvl+0x6e/0xa0 > ? lock_release+0x708/0x780 > print_report+0x174/0x505 > ? lock_release+0x708/0x780 > ? __virt_addr_valid+0x224/0x410 > ? lock_release+0x708/0x780 > kasan_report+0xda/0x1b0 > ? lock_release+0x708/0x780 > ? __wake_up+0x44/0x60 > lock_release+0x708/0x780 > ? __pfx_lock_release+0x10/0x10 > ? __pfx_do_raw_spin_lock+0x10/0x10 > ? lock_is_held_type+0x9a/0x110 > _raw_spin_unlock_irqrestore+0x1f/0x60 > __wake_up+0x44/0x60 > btrfs_encoded_read_endio+0x14b/0x190 [btrfs] > btrfs_check_read_bio+0x8d9/0x1360 [btrfs] > ? lock_release+0x1b0/0x780 > ? trace_lock_acquire+0x12f/0x1a0 > ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs] > ? process_one_work+0x7e3/0x1460 > ? lock_acquire+0x31/0xc0 > ? process_one_work+0x7e3/0x1460 > process_one_work+0x85c/0x1460 > ? __pfx_process_one_work+0x10/0x10 > ? 
assign_work+0x16c/0x240 > worker_thread+0x5e6/0xfc0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x2c3/0x3a0 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x31/0x70 > ? __pfx_kthread+0x10/0x10 > ret_from_fork_asm+0x1a/0x30 > </TASK> > > Allocated by task 3661: > kasan_save_stack+0x30/0x50 > kasan_save_track+0x14/0x30 > __kasan_kmalloc+0xaa/0xb0 > btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs] > send_extent_data+0xf0f/0x24a0 [btrfs] > process_extent+0x48a/0x1830 [btrfs] > changed_cb+0x178b/0x2ea0 [btrfs] > btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs] > _btrfs_ioctl_send+0x117/0x330 [btrfs] > btrfs_ioctl+0x184a/0x60a0 [btrfs] > __x64_sys_ioctl+0x12e/0x1a0 > do_syscall_64+0x95/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Freed by task 3661: > kasan_save_stack+0x30/0x50 > kasan_save_track+0x14/0x30 > kasan_save_free_info+0x3b/0x70 > __kasan_slab_free+0x4f/0x70 > kfree+0x143/0x490 > btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs] > send_extent_data+0xf0f/0x24a0 [btrfs] > process_extent+0x48a/0x1830 [btrfs] > changed_cb+0x178b/0x2ea0 [btrfs] > btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs] > _btrfs_ioctl_send+0x117/0x330 [btrfs] > btrfs_ioctl+0x184a/0x60a0 [btrfs] > __x64_sys_ioctl+0x12e/0x1a0 > do_syscall_64+0x95/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > The buggy address belongs to the object at ffff888106a83f00 > which belongs to the cache kmalloc-rnd-07-96 of size 96 > The buggy address is located 24 bytes inside of > freed 96-byte region [ffff888106a83f00, ffff888106a83f60) > > The buggy address belongs to the physical page: > page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83 > flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) > page_type: f5(slab) > raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004 > raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000 > page dumped because: kasan: bad access detected > > Memory state around the buggy address: > 
ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > ^ > ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc > ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > ================================================================== > > Further analyzing the trace and the crash dump's vmcore file shows that > the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on > the wait_queue that is in the private data passed to the end_io handler. > > Commit 4ff47df40447 ("btrfs: move priv off stack in > btrfs_encoded_read_regular_fill_pages()") moved 'struct > btrfs_encoded_read_private' off the stack. > > Before that commit one can see a corruption of the private data when > analyzing the vmcore after a crash: > > *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = { > .wait = (wait_queue_head_t){ > .lock = (spinlock_t){ > .rlock = (struct raw_spinlock){ > .raw_lock = (arch_spinlock_t){ > .val = (atomic_t){ > .counter = (int)-2005885696, > }, > .locked = (u8)0, > .pending = (u8)157, > .locked_pending = (u16)40192, > .tail = (u16)34928, > }, > .magic = (unsigned int)536325682, > .owner_cpu = (unsigned int)29, > .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0, > .dep_map = (struct lockdep_map){ > .key = (struct lock_class_key *)0xffff8881575a3b6c, > .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 }, > .name = (const char *)0xffff88815626f100 = "", > .wait_type_outer = (u8)37, > .wait_type_inner = (u8)178, > .lock_type = (u8)154, > }, > }, > .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 }, > .dep_map = (struct lockdep_map){ > .key = (struct lock_class_key *)0xffff8881575a3b6c, > .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 }, > .name = (const char *)0xffff88815626f100 = "", > 
.wait_type_outer = (u8)37, > .wait_type_inner = (u8)178, > .lock_type = (u8)154, > }, > }, > .head = (struct list_head){ > .next = (struct list_head *)0x112cca, > .prev = (struct list_head *)0x47, > }, > }, > .pending = (atomic_t){ > .counter = (int)-1491499288, > }, > .status = (blk_status_t)130, > } > > Here we can see several indicators of in-memory data corruption, e.g. the > large negative atomic values of ->pending or > ->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic > 0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus > pointer values for ->wait->head. > > To fix this, move the call to bio_put() before the atomic_test operation > so the submitter side in btrfs_encoded_read_regular_fill_pages() is not > woken up before the bio is cleaned up. > > Also change atomic_dec_return() to atomic_dec_and_test() to fix the > corruption, as atomic_dec_return() is defined as two instructions on > x86_64, whereas atomic_dec_and_test() is defined as a single atomic > operation. This means we should not utilize "atomic_dec_return() == 0" as a way to do synchronization. And unfortunately I'm also seeing other locations still utilizing the same pattern inside btrfs_encoded_read_regular_fill_pages(). Shouldn't we also fix that call site even just for the sake of consistency? Thanks, Qu > This can lead to a situation where counter value is already > decremented but the if statement in btrfs_encoded_read_endio() is not > completely processed, i.e. the 0 test has not completed. If another thread > continues executing btrfs_encoded_read_regular_fill_pages() the > atomic_dec_return() there can see an already updated ->pending counter and > continues by freeing the private data. Continuing in the endio handler the > test for 0 succeeds and the wait_queue is woken up, resulting in a > use-after-free. 
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> > Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com> > Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl") > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > --- > fs/btrfs/inode.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 22b8e2764619..cb8b23a3e56b 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio) > */ > WRITE_ONCE(priv->status, bbio->bio.bi_status); > } > - if (atomic_dec_return(&priv->pending) == 0) { > + bio_put(&bbio->bio); > + if (atomic_dec_and_test(&priv->pending)) { > int err = blk_status_to_errno(READ_ONCE(priv->status)); > > if (priv->uring_ctx) { > @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio) > wake_up(&priv->wait); > } > } > - bio_put(&bbio->bio); > } > > int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio 2024-11-12 20:50 ` Qu Wenruo @ 2024-11-13 8:01 ` Johannes Thumshirn 2024-11-25 15:55 ` David Sterba 0 siblings, 1 reply; 9+ messages in thread From: Johannes Thumshirn @ 2024-11-13 8:01 UTC (permalink / raw) To: Qu Wenruo, Johannes Thumshirn, linux-btrfs@vger.kernel.org Cc: Filipe Manana, Damien Le Moal, Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal On 12.11.24 21:51, Qu Wenruo wrote: >> To fix this, move the call to bio_put() before the atomic_test operation >> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not >> woken up before the bio is cleaned up. >> >> Also change atomic_dec_return() to atomic_dec_and_test() to fix the >> corruption, as atomic_dec_return() is defined as two instructions on >> x86_64, whereas atomic_dec_and_test() is defined as a single atomic >> operation. > > This means we should not utilize "atomic_dec_return() == 0" as a way to > do synchronization. At least not for reference counting, hence refcount_t doesn't even have an equivalent. > And unfortunately I'm also seeing other locations still utilizing the > same patter inside btrfs_encoded_read_regular_fill_pages() > > Shouldn't we also fix that call site even just for the sake of consistency? I have no idea, TBH. The other user of atomic_dec_return() in btrfs is in delayed-inode.c:finish_one_item(): /* atomic_dec_return implies a barrier */ if ((atomic_dec_return(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND || seq % BTRFS_DELAYED_BATCH == 0)) cond_wake_up_nomb(&delayed_root->wait); And that one looks safe in my eyes. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio 2024-11-13 8:01 ` Johannes Thumshirn @ 2024-11-25 15:55 ` David Sterba 0 siblings, 0 replies; 9+ messages in thread From: David Sterba @ 2024-11-25 15:55 UTC (permalink / raw) To: Johannes Thumshirn Cc: Qu Wenruo, Johannes Thumshirn, linux-btrfs@vger.kernel.org, Filipe Manana, Damien Le Moal, Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal On Wed, Nov 13, 2024 at 08:01:29AM +0000, Johannes Thumshirn wrote: > On 12.11.24 21:51, Qu Wenruo wrote: > >> To fix this, move the call to bio_put() before the atomic_test operation > >> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not > >> woken up before the bio is cleaned up. > >> > >> Also change atomic_dec_return() to atomic_dec_and_test() to fix the > >> corruption, as atomic_dec_return() is defined as two instructions on > >> x86_64, whereas atomic_dec_and_test() is defined as a single atomic > >> operation. > > > > This means we should not utilize "atomic_dec_return() == 0" as a way to > > do synchronization. > > At least not for reference counting, hence recount_t doesn't even have > an equivalent. > > > And unfortunately I'm also seeing other locations still utilizing the > > same patter inside btrfs_encoded_read_regular_fill_pages() > > > > Shouldn't we also fix that call site even just for the sake of consistency? > > I have no idea, TBH. The other user of atomic_dec_return() in btrfs is > in delayed-inode.c:finish_one_item(): > > /* atomic_dec_return implies a barrier */ > if ((atomic_dec_return(&delayed_root->items) < > BTRFS_DELAYED_BACKGROUND || seq % BTRFS_DELAYED_BATCH == 0)) > cond_wake_up_nomb(&delayed_root->wait); > > And that one looks safe in my eyes. A safe pattern for atomic_dec_return() is when, once the counter reaches 0, there will be no further increment. Which is for example when setting the pending counter, possibly adding e.g. pages one by one (atomic_inc) and then starting that. 
Each processed page then decrements the counter, once all are done the 0 will be there. There are atomic_dec_return() that could be wrong, I haven't examined all of them but it seems to be always safer to use atomic_dec_and_test(). ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios 2024-11-12 13:53 [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn 2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn @ 2024-11-12 13:53 ` Johannes Thumshirn 2024-11-12 14:57 ` Filipe Manana 1 sibling, 1 reply; 9+ messages in thread From: Johannes Thumshirn @ 2024-11-12 13:53 UTC (permalink / raw) To: linux-btrfs Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn, Damien Le Moal From: Johannes Thumshirn <johannes.thumshirn@wdc.com> Simplify the I/O completion path for encoded reads by using a completion instead of a wait_queue. Furthermore, skip taking an extra reference that is instantly dropped anyway and convert btrfs_encoded_read_private::pending into a refcount_t field instead of atomic_t. Freeing of the private data is now handled at a common place in btrfs_encoded_read_regular_fill_pages() and if btrfs_encoded_read_endio() is freeing the data in case it has an io_uring context associated, the btrfs_bio's private field is NULLed to avoid a double free of the private data.
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/inode.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cb8b23a3e56b..3093905364e5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9068,9 +9068,9 @@ static ssize_t btrfs_encoded_read_inline(
 }
 
 struct btrfs_encoded_read_private {
-	wait_queue_head_t wait;
+	struct completion done;
 	void *uring_ctx;
-	atomic_t pending;
+	refcount_t refs;
 	blk_status_t status;
 };
 
@@ -9090,14 +9090,15 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
 		WRITE_ONCE(priv->status, bbio->bio.bi_status);
 	}
 	bio_put(&bbio->bio);
-	if (atomic_dec_and_test(&priv->pending)) {
+	if (refcount_dec_and_test(&priv->refs)) {
 		int err = blk_status_to_errno(READ_ONCE(priv->status));
 
 		if (priv->uring_ctx) {
 			btrfs_uring_read_extent_endio(priv->uring_ctx, err);
+			bbio->private = NULL;
 			kfree(priv);
 		} else {
-			wake_up(&priv->wait);
+			complete(&priv->done);
 		}
 	}
 }
@@ -9116,8 +9117,8 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 	if (!priv)
 		return -ENOMEM;
 
-	init_waitqueue_head(&priv->wait);
-	atomic_set(&priv->pending, 1);
+	init_completion(&priv->done);
+	refcount_set(&priv->refs, 1);
 	priv->status = 0;
 	priv->uring_ctx = uring_ctx;
 
@@ -9130,7 +9131,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 		size_t bytes = min_t(u64, disk_io_size, PAGE_SIZE);
 
 		if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
-			atomic_inc(&priv->pending);
+			refcount_inc(&priv->refs);
 			btrfs_submit_bbio(bbio, 0);
 
 			bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
@@ -9145,26 +9146,26 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 		disk_io_size -= bytes;
 	} while (disk_io_size);
 
-	atomic_inc(&priv->pending);
 	btrfs_submit_bbio(bbio, 0);
 
 	if (uring_ctx) {
-		if (atomic_dec_return(&priv->pending) == 0) {
+		if (bbio->private && refcount_read(&priv->refs) == 0) {
 			ret = blk_status_to_errno(READ_ONCE(priv->status));
 			btrfs_uring_read_extent_endio(uring_ctx, ret);
-			kfree(priv);
-			return ret;
+			goto out;
 		}
 
 		return -EIOCBQUEUED;
 	} else {
-		if (atomic_dec_return(&priv->pending) != 0)
-			io_wait_event(priv->wait, !atomic_read(&priv->pending));
+		wait_for_completion_io(&priv->done);
 		/* See btrfs_encoded_read_endio() for ordering. */
 		ret = blk_status_to_errno(READ_ONCE(priv->status));
-		kfree(priv);
-		return ret;
 	}
+
+out:
+	kfree(priv);
+	return ret;
+
 }
 
 ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.43.0

^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios 2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn @ 2024-11-12 14:57 ` Filipe Manana 0 siblings, 0 replies; 9+ messages in thread From: Filipe Manana @ 2024-11-12 14:57 UTC (permalink / raw) To: Johannes Thumshirn Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Johannes Thumshirn, Damien Le Moal On Tue, Nov 12, 2024 at 1:54 PM Johannes Thumshirn <jth@kernel.org> wrote: > > From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > Simplify the I/O completion path for encoded reads by using a > completion instead of a wait_queue. > > Furthermore, skip taking an extra reference that is instantly > dropped anyway and convert btrfs_encoded_read_private::pending into a > refcount_t field instead of atomic_t. > > Freeing of the private data is now handled at a common place in > btrfs_encoded_read_regular_fill_pages() and if btrfs_encoded_read_endio() > is freeing the data in case it has an io_uring context associated, the > btrfs_bio's private field is NULLed to avoid a double free of the private > data.
> > Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > --- > fs/btrfs/inode.c | 31 ++++++++++++++++--------------- > 1 file changed, 16 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index cb8b23a3e56b..3093905364e5 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9068,9 +9068,9 @@ static ssize_t btrfs_encoded_read_inline( > } > > struct btrfs_encoded_read_private { > - wait_queue_head_t wait; > + struct completion done; > void *uring_ctx; > - atomic_t pending; > + refcount_t refs; > blk_status_t status; > }; > > @@ -9090,14 +9090,15 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio) > WRITE_ONCE(priv->status, bbio->bio.bi_status); > } > bio_put(&bbio->bio); > - if (atomic_dec_and_test(&priv->pending)) { > + if (refcount_dec_and_test(&priv->refs)) { > int err = blk_status_to_errno(READ_ONCE(priv->status)); > > if (priv->uring_ctx) { > btrfs_uring_read_extent_endio(priv->uring_ctx, err); > + bbio->private = NULL; Isn't this racy? We decremented priv->refs to 0, the task at btrfs_encoded_read_regular_fill_pages() sees it as 0 and sees bbio->private still as non-NULL, does a kfree() on it and then we do it here again. I.e. we should set bbio->private to NULL before decrementing, and possibly some barriers. Thanks. 
> kfree(priv); > } else { > - wake_up(&priv->wait); > + complete(&priv->done); > } > } > } > @@ -9116,8 +9117,8 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, > if (!priv) > return -ENOMEM; > > - init_waitqueue_head(&priv->wait); > - atomic_set(&priv->pending, 1); > + init_completion(&priv->done); > + refcount_set(&priv->refs, 1); > priv->status = 0; > priv->uring_ctx = uring_ctx; > > @@ -9130,7 +9131,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, > size_t bytes = min_t(u64, disk_io_size, PAGE_SIZE); > > if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) { > - atomic_inc(&priv->pending); > + refcount_inc(&priv->refs); > btrfs_submit_bbio(bbio, 0); > > bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info, > @@ -9145,26 +9146,26 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode, > disk_io_size -= bytes; > } while (disk_io_size); > > - atomic_inc(&priv->pending); > btrfs_submit_bbio(bbio, 0); > > if (uring_ctx) { > - if (atomic_dec_return(&priv->pending) == 0) { > + if (bbio->private && refcount_read(&priv->refs) == 0) { > ret = blk_status_to_errno(READ_ONCE(priv->status)); > btrfs_uring_read_extent_endio(uring_ctx, ret); > - kfree(priv); > - return ret; > + goto out; > } > > return -EIOCBQUEUED; > } else { > - if (atomic_dec_return(&priv->pending) != 0) > - io_wait_event(priv->wait, !atomic_read(&priv->pending)); > + wait_for_completion_io(&priv->done); > /* See btrfs_encoded_read_endio() for ordering. */ > ret = blk_status_to_errno(READ_ONCE(priv->status)); > - kfree(priv); > - return ret; > } > + > +out: > + kfree(priv); > + return ret; > + > } > > ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter, > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-11-25 15:55 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-11-12 13:53 [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn 2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn 2024-11-12 14:45 ` Filipe Manana 2024-11-12 17:20 ` Johannes Thumshirn 2024-11-12 20:50 ` Qu Wenruo 2024-11-13 8:01 ` Johannes Thumshirn 2024-11-25 15:55 ` David Sterba 2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn 2024-11-12 14:57 ` Filipe Manana