* [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
@ 2024-11-12 13:53 Johannes Thumshirn
2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn
2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
0 siblings, 2 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2024-11-12 13:53 UTC
To: linux-btrfs
Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn,
Johannes Thumshirn, Mark Harmstone, Omar Sandoval
Shinichiro reported an occasional memory corruption in our CI system with
btrfs/284 that led to panics. He also managed to reproduce this
corruption reliably on one host. See patch 1/2 for details on the
corruption and the fix; patch 2/2 is a cleanup Damien suggested on top of
the fix to make the code more obvious.
Changes since v1:
- Update commit message of patch 1/2
- Prevent double-free of 'priv' in case of io_uring in 2/2
- Use wait_for_completion_io() in 2/2
- Convert priv->pending from atomic_t to refcount_t, renaming it to refs, in 2/2
Link to v1:
https://lore.kernel.org/linux-btrfs/cover.1731316882.git.jth@kernel.org
Johannes Thumshirn (2):
btrfs: fix use-after-free in btrfs_encoded_read_endio
btrfs: simplify waiting for encoded read endios
fs/btrfs/inode.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
--
2.43.0
* [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-12 13:53 [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn
@ 2024-11-12 13:53 ` Johannes Thumshirn
2024-11-12 14:45 ` Filipe Manana
2024-11-12 20:50 ` Qu Wenruo
2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
1 sibling, 2 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2024-11-12 13:53 UTC
To: linux-btrfs
Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn, Mark Harmstone,
Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Shinichiro reported the following use-after-free, which occasionally
happens in our CI system when running fstests' btrfs/284 on a TCMU
runner device:
==================================================================
BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219
CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
? lock_release+0x708/0x780
print_report+0x174/0x505
? lock_release+0x708/0x780
? __virt_addr_valid+0x224/0x410
? lock_release+0x708/0x780
kasan_report+0xda/0x1b0
? lock_release+0x708/0x780
? __wake_up+0x44/0x60
lock_release+0x708/0x780
? __pfx_lock_release+0x10/0x10
? __pfx_do_raw_spin_lock+0x10/0x10
? lock_is_held_type+0x9a/0x110
_raw_spin_unlock_irqrestore+0x1f/0x60
__wake_up+0x44/0x60
btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
? lock_release+0x1b0/0x780
? trace_lock_acquire+0x12f/0x1a0
? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
? process_one_work+0x7e3/0x1460
? lock_acquire+0x31/0xc0
? process_one_work+0x7e3/0x1460
process_one_work+0x85c/0x1460
? __pfx_process_one_work+0x10/0x10
? assign_work+0x16c/0x240
worker_thread+0x5e6/0xfc0
? __pfx_worker_thread+0x10/0x10
kthread+0x2c3/0x3a0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 3661:
kasan_save_stack+0x30/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x70
__kasan_slab_free+0x4f/0x70
kfree+0x143/0x490
btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
send_extent_data+0xf0f/0x24a0 [btrfs]
process_extent+0x48a/0x1830 [btrfs]
changed_cb+0x178b/0x2ea0 [btrfs]
btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
_btrfs_ioctl_send+0x117/0x330 [btrfs]
btrfs_ioctl+0x184a/0x60a0 [btrfs]
__x64_sys_ioctl+0x12e/0x1a0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The buggy address belongs to the object at ffff888106a83f00
which belongs to the cache kmalloc-rnd-07-96 of size 96
The buggy address is located 24 bytes inside of
freed 96-byte region [ffff888106a83f00, ffff888106a83f60)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Further analysis of the trace and the crash dump's vmcore file shows that
the wake_up() call in btrfs_encoded_read_endio() operates on the
wait_queue embedded in the private data passed to the end_io handler,
which has already been freed by then.
Commit 4ff47df40447 ("btrfs: move priv off stack in
btrfs_encoded_read_regular_fill_pages()") moved 'struct
btrfs_encoded_read_private' off the stack.
Before that commit, corruption of the private data can be seen when
analyzing the vmcore after a crash:
*(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
.wait = (wait_queue_head_t){
.lock = (spinlock_t){
.rlock = (struct raw_spinlock){
.raw_lock = (arch_spinlock_t){
.val = (atomic_t){
.counter = (int)-2005885696,
},
.locked = (u8)0,
.pending = (u8)157,
.locked_pending = (u16)40192,
.tail = (u16)34928,
},
.magic = (unsigned int)536325682,
.owner_cpu = (unsigned int)29,
.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
.dep_map = (struct lockdep_map){
.key = (struct lock_class_key *)0xffff8881575a3b6c,
.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
.name = (const char *)0xffff88815626f100 = "",
.wait_type_outer = (u8)37,
.wait_type_inner = (u8)178,
.lock_type = (u8)154,
},
},
.head = (struct list_head){
.next = (struct list_head *)0x112cca,
.prev = (struct list_head *)0x47,
},
},
.pending = (atomic_t){
.counter = (int)-1491499288,
},
.status = (blk_status_t)130,
}
Here we can see several indicators of in-memory data corruption, e.g. the
large negative values of the atomics ->pending and
->wait.lock.rlock.raw_lock.val, the bogus spinlock magic 0x1ff7ae32
(decimal 536325682 above) instead of 0xdead4ead, and the bogus pointer
values in ->wait.head.
To fix this, move the call to bio_put() before the atomic_dec_and_test()
call, so that the submitter side in btrfs_encoded_read_regular_fill_pages()
is not woken up before the bio is cleaned up.
Also change atomic_dec_return() to atomic_dec_and_test() to fix the
corruption, as atomic_dec_return() is defined as two instructions on
x86_64, whereas atomic_dec_and_test() is defined as a single atomic
operation. This can lead to a situation where the counter value has
already been decremented but the if statement in
btrfs_encoded_read_endio() has not been fully processed, i.e. the test
for 0 has not completed. If another thread continues executing
btrfs_encoded_read_regular_fill_pages(), the atomic_dec_return() there
can see the already updated ->pending counter and go on to free the
private data. When the endio handler then continues, its test for 0
succeeds and the wait_queue is woken up, resulting in a use-after-free.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 22b8e2764619..cb8b23a3e56b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
*/
WRITE_ONCE(priv->status, bbio->bio.bi_status);
}
- if (atomic_dec_return(&priv->pending) == 0) {
+ bio_put(&bbio->bio);
+ if (atomic_dec_and_test(&priv->pending)) {
int err = blk_status_to_errno(READ_ONCE(priv->status));
if (priv->uring_ctx) {
@@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
wake_up(&priv->wait);
}
}
- bio_put(&bbio->bio);
}
int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
--
2.43.0
* [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios
2024-11-12 13:53 [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn
2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn
@ 2024-11-12 13:53 ` Johannes Thumshirn
2024-11-12 14:57 ` Filipe Manana
1 sibling, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2024-11-12 13:53 UTC
To: linux-btrfs
Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn, Damien Le Moal
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Simplify the I/O completion path for encoded reads by using a
completion instead of a wait_queue.
Furthermore, skip taking an extra reference that is dropped immediately
afterwards, and convert btrfs_encoded_read_private::pending from an
atomic_t into a refcount_t field named refs.
Freeing of the private data is now handled in a single place in
btrfs_encoded_read_regular_fill_pages(). If btrfs_encoded_read_endio()
frees the data because an io_uring context is associated with it, the
btrfs_bio's private field is set to NULL to avoid a double free of the
private data.
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/inode.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cb8b23a3e56b..3093905364e5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9068,9 +9068,9 @@ static ssize_t btrfs_encoded_read_inline(
}
struct btrfs_encoded_read_private {
- wait_queue_head_t wait;
+ struct completion done;
void *uring_ctx;
- atomic_t pending;
+ refcount_t refs;
blk_status_t status;
};
@@ -9090,14 +9090,15 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
WRITE_ONCE(priv->status, bbio->bio.bi_status);
}
bio_put(&bbio->bio);
- if (atomic_dec_and_test(&priv->pending)) {
+ if (refcount_dec_and_test(&priv->refs)) {
int err = blk_status_to_errno(READ_ONCE(priv->status));
if (priv->uring_ctx) {
btrfs_uring_read_extent_endio(priv->uring_ctx, err);
+ bbio->private = NULL;
kfree(priv);
} else {
- wake_up(&priv->wait);
+ complete(&priv->done);
}
}
}
@@ -9116,8 +9117,8 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
if (!priv)
return -ENOMEM;
- init_waitqueue_head(&priv->wait);
- atomic_set(&priv->pending, 1);
+ init_completion(&priv->done);
+ refcount_set(&priv->refs, 1);
priv->status = 0;
priv->uring_ctx = uring_ctx;
@@ -9130,7 +9131,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
size_t bytes = min_t(u64, disk_io_size, PAGE_SIZE);
if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
- atomic_inc(&priv->pending);
+ refcount_inc(&priv->refs);
btrfs_submit_bbio(bbio, 0);
bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
@@ -9145,26 +9146,26 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
disk_io_size -= bytes;
} while (disk_io_size);
- atomic_inc(&priv->pending);
btrfs_submit_bbio(bbio, 0);
if (uring_ctx) {
- if (atomic_dec_return(&priv->pending) == 0) {
+ if (bbio->private && refcount_read(&priv->refs) == 0) {
ret = blk_status_to_errno(READ_ONCE(priv->status));
btrfs_uring_read_extent_endio(uring_ctx, ret);
- kfree(priv);
- return ret;
+ goto out;
}
return -EIOCBQUEUED;
} else {
- if (atomic_dec_return(&priv->pending) != 0)
- io_wait_event(priv->wait, !atomic_read(&priv->pending));
+ wait_for_completion_io(&priv->done);
/* See btrfs_encoded_read_endio() for ordering. */
ret = blk_status_to_errno(READ_ONCE(priv->status));
- kfree(priv);
- return ret;
}
+
+out:
+ kfree(priv);
+ return ret;
+
}
ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
--
2.43.0
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn
@ 2024-11-12 14:45 ` Filipe Manana
2024-11-12 17:20 ` Johannes Thumshirn
2024-11-12 20:50 ` Qu Wenruo
1 sibling, 1 reply; 9+ messages in thread
From: Filipe Manana @ 2024-11-12 14:45 UTC
To: Johannes Thumshirn
Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Johannes Thumshirn,
Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki,
Damien Le Moal
On Tue, Nov 12, 2024 at 1:54 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Shinichiro reported the following use-after-free, which occasionally
> happens in our CI system when running fstests' btrfs/284 on a TCMU
> runner device:
>
> [...]
>
> Further analysis of the trace and the crash dump's vmcore file shows that
> the wake_up() call in btrfs_encoded_read_endio() operates on the
> wait_queue embedded in the private data passed to the end_io handler,
> which has already been freed by then.
>
> Commit 4ff47df40447 ("btrfs: move priv off stack in
> btrfs_encoded_read_regular_fill_pages()") moved 'struct
> btrfs_encoded_read_private' off the stack.
>
> Before that commit, corruption of the private data can be seen when
> analyzing the vmcore after a crash:
>
> [...]
>
> Here we can see several indicators of in-memory data corruption, e.g. the
> large negative values of the atomics ->pending and
> ->wait.lock.rlock.raw_lock.val, the bogus spinlock magic 0x1ff7ae32
> (decimal 536325682 above) instead of 0xdead4ead, and the bogus pointer
> values in ->wait.head.
>
> To fix this, move the call to bio_put() before the atomic_dec_and_test()
> call, so that the submitter side in btrfs_encoded_read_regular_fill_pages()
> is not woken up before the bio is cleaned up.
This is the part where I don't see the relation to the use-after-free
problem on the private structure.
This seems like a cleanup that should be a separate patch with its own
changelog.
>
> Also change atomic_dec_return() to atomic_dec_and_test() to fix the
> corruption, as atomic_dec_return() is defined as two instructions on
> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> operation. This can lead to a situation where the counter value has
> already been decremented but the if statement in
> btrfs_encoded_read_endio() has not been fully processed, i.e. the test
> for 0 has not completed. If another thread continues executing
> btrfs_encoded_read_regular_fill_pages(), the atomic_dec_return() there
> can see the already updated ->pending counter and go on to free the
> private data. When the endio handler then continues, its test for 0
> succeeds and the wait_queue is woken up, resulting in a use-after-free.
This is the sort of explanation that should have been in v1.
Basically, the non-atomicity of atomic_dec_return() can let the
waiter see the 0 value and free the private structure before the waker
calls wake_up() on the private structure's wait queue.
So with the bio_put() change moved to a separate patch:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
(or with it, but with an explanation of how it relates to the
use-after-free, which I can't see)
Thanks.
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 22b8e2764619..cb8b23a3e56b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> */
> WRITE_ONCE(priv->status, bbio->bio.bi_status);
> }
> - if (atomic_dec_return(&priv->pending) == 0) {
> + bio_put(&bbio->bio);
> + if (atomic_dec_and_test(&priv->pending)) {
> int err = blk_status_to_errno(READ_ONCE(priv->status));
>
> if (priv->uring_ctx) {
> @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> wake_up(&priv->wait);
> }
> }
> - bio_put(&bbio->bio);
> }
>
> int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> --
> 2.43.0
>
>
* Re: [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios
2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
@ 2024-11-12 14:57 ` Filipe Manana
0 siblings, 0 replies; 9+ messages in thread
From: Filipe Manana @ 2024-11-12 14:57 UTC
To: Johannes Thumshirn
Cc: linux-btrfs, Filipe Manana, Damien Le Moal, Johannes Thumshirn,
Damien Le Moal
On Tue, Nov 12, 2024 at 1:54 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Simplify the I/O completion path for encoded reads by using a
> completion instead of a wait_queue.
>
> Furthermore, skip taking an extra reference that is dropped immediately
> afterwards, and convert btrfs_encoded_read_private::pending from an
> atomic_t into a refcount_t field named refs.
>
> Freeing of the private data is now handled in a single place in
> btrfs_encoded_read_regular_fill_pages(). If btrfs_encoded_read_endio()
> frees the data because an io_uring context is associated with it, the
> btrfs_bio's private field is set to NULL to avoid a double free of the
> private data.
>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/inode.c | 31 ++++++++++++++++---------------
> 1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index cb8b23a3e56b..3093905364e5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9068,9 +9068,9 @@ static ssize_t btrfs_encoded_read_inline(
> }
>
> struct btrfs_encoded_read_private {
> - wait_queue_head_t wait;
> + struct completion done;
> void *uring_ctx;
> - atomic_t pending;
> + refcount_t refs;
> blk_status_t status;
> };
>
> @@ -9090,14 +9090,15 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> WRITE_ONCE(priv->status, bbio->bio.bi_status);
> }
> bio_put(&bbio->bio);
> - if (atomic_dec_and_test(&priv->pending)) {
> + if (refcount_dec_and_test(&priv->refs)) {
> int err = blk_status_to_errno(READ_ONCE(priv->status));
>
> if (priv->uring_ctx) {
> btrfs_uring_read_extent_endio(priv->uring_ctx, err);
> + bbio->private = NULL;
Isn't this racy?
We decrement priv->refs to 0, then the task in
btrfs_encoded_read_regular_fill_pages() sees it as 0, still sees
bbio->private as non-NULL, does a kfree() on it, and then we do
it here again.
I.e. we should set bbio->private to NULL before decrementing,
possibly with some barriers.
Thanks.
> kfree(priv);
> } else {
> - wake_up(&priv->wait);
> + complete(&priv->done);
> }
> }
> }
> @@ -9116,8 +9117,8 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> if (!priv)
> return -ENOMEM;
>
> - init_waitqueue_head(&priv->wait);
> - atomic_set(&priv->pending, 1);
> + init_completion(&priv->done);
> + refcount_set(&priv->refs, 1);
> priv->status = 0;
> priv->uring_ctx = uring_ctx;
>
> @@ -9130,7 +9131,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> size_t bytes = min_t(u64, disk_io_size, PAGE_SIZE);
>
> if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
> - atomic_inc(&priv->pending);
> + refcount_inc(&priv->refs);
> btrfs_submit_bbio(bbio, 0);
>
> bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
> @@ -9145,26 +9146,26 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
> disk_io_size -= bytes;
> } while (disk_io_size);
>
> - atomic_inc(&priv->pending);
> btrfs_submit_bbio(bbio, 0);
>
> if (uring_ctx) {
> - if (atomic_dec_return(&priv->pending) == 0) {
> + if (bbio->private && refcount_read(&priv->refs) == 0) {
> ret = blk_status_to_errno(READ_ONCE(priv->status));
> btrfs_uring_read_extent_endio(uring_ctx, ret);
> - kfree(priv);
> - return ret;
> + goto out;
> }
>
> return -EIOCBQUEUED;
> } else {
> - if (atomic_dec_return(&priv->pending) != 0)
> - io_wait_event(priv->wait, !atomic_read(&priv->pending));
> + wait_for_completion_io(&priv->done);
> /* See btrfs_encoded_read_endio() for ordering. */
> ret = blk_status_to_errno(READ_ONCE(priv->status));
> - kfree(priv);
> - return ret;
> }
> +
> +out:
> + kfree(priv);
> + return ret;
> +
> }
>
> ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
> --
> 2.43.0
>
>
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-12 14:45 ` Filipe Manana
@ 2024-11-12 17:20 ` Johannes Thumshirn
0 siblings, 0 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2024-11-12 17:20 UTC
To: Filipe Manana, Johannes Thumshirn
Cc: linux-btrfs@vger.kernel.org, Filipe Manana, Damien Le Moal,
Mark Harmstone, Omar Sandoval, Shinichiro Kawasaki,
Damien Le Moal
On 12.11.24 15:46, Filipe Manana wrote:
> This is the part where I don't see the relation to the use-after-free
> problem on the private structure.
> This seems like a cleanup that should be a separate patch with its own
> changelog.
I need to re-test first, because AFAIR both the bio_put() and the
atomic_dec_and_test() were needed to fix the bug on the test system.
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn
2024-11-12 14:45 ` Filipe Manana
@ 2024-11-12 20:50 ` Qu Wenruo
2024-11-13 8:01 ` Johannes Thumshirn
1 sibling, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2024-11-12 20:50 UTC
To: Johannes Thumshirn, linux-btrfs
Cc: Filipe Manana, Damien Le Moal, Johannes Thumshirn, Mark Harmstone,
Omar Sandoval, Shinichiro Kawasaki, Damien Le Moal
On 2024/11/13 00:23, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Shinichiro reported the following use-after-free, which occasionally
> happens in our CI system when running fstests' btrfs/284 on a TCMU
> runner device:
>
> [...]
>
> Further analyzing the trace and the crash dump's vmcore file shows that
> btrfs_encoded_read_endio() calls wake_up() on the wait_queue that is
> embedded in the private data passed to the end_io handler.
>
> Commit 4ff47df40447 ("btrfs: move priv off stack in
> btrfs_encoded_read_regular_fill_pages()") moved 'struct
> btrfs_encoded_read_private' off the stack.
>
> Before that commit one can see a corruption of the private data when
> analyzing the vmcore after a crash:
>
> *(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
> .wait = (wait_queue_head_t){
> .lock = (spinlock_t){
> .rlock = (struct raw_spinlock){
> .raw_lock = (arch_spinlock_t){
> .val = (atomic_t){
> .counter = (int)-2005885696,
> },
> .locked = (u8)0,
> .pending = (u8)157,
> .locked_pending = (u16)40192,
> .tail = (u16)34928,
> },
> .magic = (unsigned int)536325682,
> .owner_cpu = (unsigned int)29,
> .owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
> .dep_map = (struct lockdep_map){
> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> .name = (const char *)0xffff88815626f100 = "",
> .wait_type_outer = (u8)37,
> .wait_type_inner = (u8)178,
> .lock_type = (u8)154,
> },
> },
> .__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
> .dep_map = (struct lockdep_map){
> .key = (struct lock_class_key *)0xffff8881575a3b6c,
> .class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
> .name = (const char *)0xffff88815626f100 = "",
> .wait_type_outer = (u8)37,
> .wait_type_inner = (u8)178,
> .lock_type = (u8)154,
> },
> },
> .head = (struct list_head){
> .next = (struct list_head *)0x112cca,
> .prev = (struct list_head *)0x47,
> },
> },
> .pending = (atomic_t){
> .counter = (int)-1491499288,
> },
> .status = (blk_status_t)130,
> }
>
> Here we can see several indicators of in-memory data corruption, e.g. the
> large negative atomic values of ->pending or
> ->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
> 0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus
> pointer values for ->wait->head.
>
> To fix this, move the call to bio_put() before the atomic_test operation
> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not
> woken up before the bio is cleaned up.
>
> Also change atomic_dec_return() to atomic_dec_and_test() to fix the
> corruption, as atomic_dec_return() is defined as two instructions on
> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> operation.

This means we should not utilize "atomic_dec_return() == 0" as a way to
do synchronization.

And unfortunately I'm also seeing other locations still utilizing the
same pattern inside btrfs_encoded_read_regular_fill_pages().

Shouldn't we also fix that call site, even just for the sake of consistency?

Thanks,
Qu

> This can lead to a situation where the counter value is already
> decremented but the if statement in btrfs_encoded_read_endio() has not
> been fully processed, i.e. the test for 0 has not completed. If another
> thread continues executing btrfs_encoded_read_regular_fill_pages(), the
> atomic_dec_return() there can see an already updated ->pending counter
> and go on to free the private data. Back in the endio handler, the test
> for 0 then succeeds and the wait_queue is woken up, resulting in a
> use-after-free.
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
> Fixes: 1881fba89bd5 ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> fs/btrfs/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 22b8e2764619..cb8b23a3e56b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9089,7 +9089,8 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> */
> WRITE_ONCE(priv->status, bbio->bio.bi_status);
> }
> - if (atomic_dec_return(&priv->pending) == 0) {
> + bio_put(&bbio->bio);
> + if (atomic_dec_and_test(&priv->pending)) {
> int err = blk_status_to_errno(READ_ONCE(priv->status));
>
> if (priv->uring_ctx) {
> @@ -9099,7 +9100,6 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
> wake_up(&priv->wait);
> }
> }
> - bio_put(&bbio->bio);
> }
>
> int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
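
The ordering the fix aims for can be modelled in userspace. This is a minimal sketch, assuming C11 atomics and POSIX threads stand in for the kernel's atomic_t and wait queue; struct read_ctx, struct request, end_io(), ctx_put() and read_all() are hypothetical names, not btrfs code. Each completion releases its own request first (the bio_put() analogue), then drops its count; the one that takes the count to zero performs the wake-up as its final access to the shared context. The waiter sleeps on an explicit done flag under the same lock rather than on the counter itself, so it cannot free the context while a waker is still touching it (roughly what a kernel completion provides).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct btrfs_encoded_read_private. */
struct read_ctx {
	atomic_int pending;   /* in-flight requests plus the submitter's count */
	pthread_mutex_t lock; /* protects 'done' (plays the wait-queue role) */
	pthread_cond_t cond;
	bool done;
};

/* Hypothetical stand-in for a bio; owned and released by end_io(). */
struct request {
	struct read_ctx *ctx;
	char *buf;
};

/* Wake the waiter. This is the caller's final access to ctx. */
static void ctx_complete(struct read_ctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	ctx->done = true;
	pthread_cond_signal(&ctx->cond);
	pthread_mutex_unlock(&ctx->lock);
}

/* Drop one count; only whoever takes it from 1 to 0 wakes the waiter. */
static void ctx_put(struct read_ctx *ctx)
{
	int old = atomic_fetch_sub(&ctx->pending, 1);

	assert(old > 0);
	if (old == 1)
		ctx_complete(ctx);
}

/* Completion handler, ordered like the patched btrfs_encoded_read_endio(). */
static void *end_io(void *arg)
{
	struct request *req = arg;
	struct read_ctx *ctx = req->ctx;

	free(req->buf); /* release per-request state first ... */
	free(req);
	ctx_put(ctx);   /* ... and only then give up the claim on ctx */
	return NULL;
}

/* Submit n requests, wait for all of them, then free ctx. 0 on success. */
static int read_all(int n)
{
	struct read_ctx *ctx = malloc(sizeof(*ctx));
	pthread_t *tids = malloc((size_t)n * sizeof(*tids));
	int nthreads = 0, ret = 0;

	if (!ctx || !tids) {
		free(ctx);
		free(tids);
		return -1;
	}
	atomic_init(&ctx->pending, 1); /* the submitter's own count */
	pthread_mutex_init(&ctx->lock, NULL);
	pthread_cond_init(&ctx->cond, NULL);
	ctx->done = false;

	for (int i = 0; i < n; i++) {
		struct request *req = malloc(sizeof(*req));

		if (!req) {
			ret = -1;
			break;
		}
		req->ctx = ctx;
		req->buf = malloc(64);
		atomic_fetch_add(&ctx->pending, 1); /* count it before it can complete */
		if (pthread_create(&tids[nthreads], NULL, end_io, req) == 0)
			nthreads++;
		else
			end_io(req); /* no thread: complete inline, like an immediate endio */
	}
	ctx_put(ctx); /* drop the submitter's count */

	pthread_mutex_lock(&ctx->lock);
	while (!ctx->done)
		pthread_cond_wait(&ctx->cond, &ctx->lock);
	pthread_mutex_unlock(&ctx->lock);

	/* The last completer's final access to ctx was ctx_complete(). */
	pthread_cond_destroy(&ctx->cond);
	pthread_mutex_destroy(&ctx->lock);
	free(ctx);

	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	free(tids);
	return ret;
}
```

Build with -pthread. The design choice worth noting: nothing frees ctx based on a value read from the counter; the free is gated on the done flag, which the last completer sets as its final action.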
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-12 20:50 ` Qu Wenruo
@ 2024-11-13 8:01 ` Johannes Thumshirn
2024-11-25 15:55 ` David Sterba
0 siblings, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2024-11-13 8:01 UTC (permalink / raw)
To: Qu Wenruo, Johannes Thumshirn, linux-btrfs@vger.kernel.org
Cc: Filipe Manana, Damien Le Moal, Mark Harmstone, Omar Sandoval,
Shinichiro Kawasaki, Damien Le Moal
On 12.11.24 21:51, Qu Wenruo wrote:
>> To fix this, move the call to bio_put() before the atomic_test operation
>> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not
>> woken up before the bio is cleaned up.
>>
>> Also change atomic_dec_return() to atomic_dec_and_test() to fix the
>> corruption, as atomic_dec_return() is defined as two instructions on
>> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
>> operation.
>
> This means we should not utilize "atomic_dec_return() == 0" as a way to
> do synchronization.
At least not for reference counting, hence refcount_t doesn't even have
an equivalent.
> And unfortunately I'm also seeing other locations still utilizing the
> same pattern inside btrfs_encoded_read_regular_fill_pages()
>
> Shouldn't we also fix that call site even just for the sake of consistency?
I have no idea, TBH. The other user of atomic_dec_return() in btrfs is
in delayed-inode.c:finish_one_item():
	/* atomic_dec_return implies a barrier */
	if ((atomic_dec_return(&delayed_root->items) <
	    BTRFS_DELAYED_BACKGROUND || seq % BTRFS_DELAYED_BATCH == 0))
		cond_wake_up_nomb(&delayed_root->wait);
And that one looks safe in my eyes.
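
The difference from the endio case can be sketched in C11, assuming userspace atomics in place of atomic_t; BACKGROUND_LIMIT, BATCH, items and this finish_one_item() are hypothetical stand-ins, not the btrfs definitions. Here the decremented value only feeds a wake-up decision, so no object's lifetime depends on which value a given thread observes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical thresholds; the real BTRFS_DELAYED_* values may differ. */
#define BACKGROUND_LIMIT 128
#define BATCH 32

static atomic_int items; /* stand-in for delayed_root->items */

/*
 * Threshold-style use: the decremented value only decides whether the
 * caller should wake throttled waiters. Nothing is freed based on it.
 */
static bool finish_one_item(int seq)
{
	/* atomic_fetch_sub() is seq_cst by default, i.e. fully ordered. */
	int now = atomic_fetch_sub(&items, 1) - 1; /* atomic_dec_return() analogue */

	assert(now >= 0); /* the counter must never underflow */
	return now < BACKGROUND_LIMIT || seq % BATCH == 0;
}
```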
* Re: [PATCH v2 1/2] btrfs: fix use-after-free in btrfs_encoded_read_endio
2024-11-13 8:01 ` Johannes Thumshirn
@ 2024-11-25 15:55 ` David Sterba
0 siblings, 0 replies; 9+ messages in thread
From: David Sterba @ 2024-11-25 15:55 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Qu Wenruo, Johannes Thumshirn, linux-btrfs@vger.kernel.org,
Filipe Manana, Damien Le Moal, Mark Harmstone, Omar Sandoval,
Shinichiro Kawasaki, Damien Le Moal
On Wed, Nov 13, 2024 at 08:01:29AM +0000, Johannes Thumshirn wrote:
> On 12.11.24 21:51, Qu Wenruo wrote:
> >> To fix this, move the call to bio_put() before the atomic_test operation
> >> so the submitter side in btrfs_encoded_read_regular_fill_pages() is not
> >> woken up before the bio is cleaned up.
> >>
> >> Also change atomic_dec_return() to atomic_dec_and_test() to fix the
> >> corruption, as atomic_dec_return() is defined as two instructions on
> >> x86_64, whereas atomic_dec_and_test() is defined as a single atomic
> >> operation.
> >
> > This means we should not utilize "atomic_dec_return() == 0" as a way to
> > do synchronization.
>
> At least not for reference counting, hence refcount_t doesn't even have
> an equivalent.
>
> > And unfortunately I'm also seeing other locations still utilizing the
> > same pattern inside btrfs_encoded_read_regular_fill_pages()
> >
> > Shouldn't we also fix that call site even just for the sake of consistency?
>
> I have no idea, TBH. The other user of atomic_dec_return() in btrfs is
> in delayed-inode.c:finish_one_item():
>
> 	/* atomic_dec_return implies a barrier */
> 	if ((atomic_dec_return(&delayed_root->items) <
> 	    BTRFS_DELAYED_BACKGROUND || seq % BTRFS_DELAYED_BATCH == 0))
> 		cond_wake_up_nomb(&delayed_root->wait);
>
> And that one looks safe in my eyes.
A safe pattern for atomic_dec_return() is one where, once the counter
reaches 0, there will be no further increment. That is the case, for
example, when setting up the pending counter: possibly adding e.g. pages
one by one (atomic_inc) and then starting the processing. Each processed
page then decrements the counter, and once all are done it will be at 0.

There are atomic_dec_return() calls that could be wrong. I haven't
examined all of them, but it always seems safer to use
atomic_dec_and_test().
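
The "no increment after reaching 0" condition can be illustrated with a single-threaded C11 sketch in which every item finishes as soon as it starts, the worst case for ordering. dec_and_test() and run_items() are hypothetical names; dec_and_test() only mimics the semantics of the kernel's atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* atomic_dec_and_test() analogue: true only for the decrement that hits 0. */
static bool dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

/*
 * Model n items that finish immediately after being started and return how
 * often the counter reached 0. count_first: raise the counter for every
 * item before starting any of them, as in the pattern described above.
 */
static int run_items(int n, bool count_first)
{
	atomic_int pending;
	int zero_hits = 0;

	atomic_init(&pending, 0);
	if (count_first)
		for (int i = 0; i < n; i++)
			atomic_fetch_add(&pending, 1); /* atomic_inc() per item */

	for (int i = 0; i < n; i++) {
		if (!count_first)
			atomic_fetch_add(&pending, 1); /* counted just before starting */
		/* the item is processed at once and drops its count */
		if (dec_and_test(&pending))
			zero_hits++;
	}
	assert(atomic_load(&pending) == 0);
	return zero_hits;
}
```

With count_first set, the "all done" condition fires exactly once. Interleaving increments with completions lets it fire after every item, which is precisely the increment-after-zero case that makes a bare zero test unsafe as a signal to free shared state.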
end of thread
Thread overview: 9+ messages
2024-11-12 13:53 [PATCH v2 0/2] btrfs: fix use-after-free in btrfs_encoded_read_endio Johannes Thumshirn
2024-11-12 13:53 ` [PATCH v2 1/2] " Johannes Thumshirn
2024-11-12 14:45 ` Filipe Manana
2024-11-12 17:20 ` Johannes Thumshirn
2024-11-12 20:50 ` Qu Wenruo
2024-11-13 8:01 ` Johannes Thumshirn
2024-11-25 15:55 ` David Sterba
2024-11-12 13:53 ` [PATCH v2 2/2] btrfs: simplify waiting for encoded read endios Johannes Thumshirn
2024-11-12 14:57 ` Filipe Manana