* while (1) in btrfs_relocate_block_group didn't end
From: Cebtenzzre @ 2019-09-14 21:36 UTC
To: linux-btrfs

Hi,

I started a balance of one block group, and I saw this in dmesg:

BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents

It continued like that for a total of 754 lines until I rebooted. Before
that, I captured some debug info. I ran this in my shell for a few
seconds, where PID is the pid of the process that called the balance
ioctl:

integer i=0; while true; do sudo cat /proc/PID/stack >stack$i; sleep .01010101; i+=1; done

Which effectively gave me stack samples at (close to) 99Hz. Maybe not
ideal, but I was in a hurry and I didn't want my disks to sustain such
heavy, repetitive I/O for too long.

I've attached the stack samples as stacks.tar.gz. A few of them are
empty. To me, it looks like the kernel never left the while (1) loop in
btrfs_relocate_block_group. The kernel messages seem to confirm this.

I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
"slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
against a use-after-free that I found when I had KASAN enabled. Would
that kernel parameter result in a silent retry if it hit the
use-after-free?

--
Cebtenzzre <cebtenzzre@gmail.com>

[-- Attachment #2: stacks.tar.gz --]
[-- Type: application/x-compressed-tar, Size: 9068 bytes --]

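A rough way to check the claim that the task never left
btrfs_relocate_block_group is to count how many of the captured samples
mention that function and how many are empty. The sketch below assumes the
samples were unpacked into one directory as stack0, stack1, ... (the names
produced by the loop above). The helper names count_samples_with and
count_empty_samples are made up for illustration and are not part of any
tool.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a directory of /proc/PID/stack samples.

# Print the number of stack* files in directory $1 that contain the
# fixed string $2 (for example a kernel function name).
count_samples_with() {
    grep -lF -- "$2" "$1"/stack* 2>/dev/null | wc -l | tr -d ' '
}

# Print the number of stack* files in directory $1 that are empty.
count_empty_samples() {
    n=0
    for f in "$1"/stack*; do
        # Skip the unexpanded glob when no sample files exist.
        [ -f "$f" ] || continue
        # -s is true for non-empty files, so count the ones that fail it.
        [ -s "$f" ] || n=$((n + 1))
    done
    echo "$n"
}
```

For example, `count_samples_with . btrfs_relocate_block_group` next to
`count_empty_samples .` shows how many samples were taken inside the
function and how many captured nothing. If nearly every non-empty sample
contains the function, that is consistent with the loop never exiting, but
it does not prove it.
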
* Re: while (1) in btrfs_relocate_block_group didn't end
From: Cebtenzzre @ 2019-09-16 21:20 UTC
To: linux-btrfs

On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
> Hi,
>
> I started a balance of one block group, and I saw this in dmesg:
>
> BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
> BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
>
> It continued like that for a total of 754 lines until I rebooted. Before
> that, I captured some debug info. I ran this in my shell for a few
> seconds, where PID is the pid of the process that called the balance
> ioctl:
>
> integer i=0; while true; do sudo cat /proc/PID/stack >stack$i; sleep .01010101; i+=1; done
>
> Which effectively gave me stack samples at (close to) 99Hz. Maybe not
> ideal, but I was in a hurry and I didn't want my disks to sustain such
> heavy, repetitive I/O for too long.
>
> I've attached the stack samples as stacks.tar.gz. A few of them are
> empty. To me, it looks like the kernel never left the while (1) loop in
> btrfs_relocate_block_group. The kernel messages seem to confirm this.
>
> I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
> "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
> against a use-after-free that I found when I had KASAN enabled. Would
> that kernel parameter result in a silent retry if it hit the
> use-after-free?

Please disregard the quoted message. This behavior does appear to be a
result of using the slub_debug option instead of KASAN. It is not
directly caused by BTRFS.

--
Cebtenzzre <cebtenzzre@gmail.com>

* Re: while (1) in btrfs_relocate_block_group didn't end
From: Cebtenzzre @ 2019-09-28 18:36 UTC
To: linux-btrfs

On Mon, 2019-09-16 at 17:20 -0400, Cebtenzzre wrote:
> On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
> > Hi,
> >
> > I started a balance of one block group, and I saw this in dmesg:
> >
> > BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
> > BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> >
> > [...]
> >
> > I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
> > "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
> > against a use-after-free that I found when I had KASAN enabled. Would
> > that kernel parameter result in a silent retry if it hit the
> > use-after-free?
>
> Please disregard the quoted message. This behavior does appear to be a
> result of using the slub_debug option instead of KASAN. It is not
> directly caused by BTRFS.

Actually, I just reproduced this behavior without slub_debug in the
cmdline, on Linux 5.3.0 with "[PATCH] btrfs: relocation: Fix KASAN
report about use-after-free due to dead reloc tree cleanup race"
(https://patchwork.kernel.org/patch/11153729/) applied.

So, this issue is still relevant and possible to trigger, though under
different conditions (different volume, kernel version, and cmdline).

--
Cebtenzzre <cebtenzzre@gmail.com>

* Re: while (1) in btrfs_relocate_block_group didn't end
From: Qu Wenruo @ 2019-09-28 23:37 UTC
To: Cebtenzzre, linux-btrfs

On 2019/9/29 2:36 AM, Cebtenzzre wrote:
> On Mon, 2019-09-16 at 17:20 -0400, Cebtenzzre wrote:
>> On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
>>> Hi,
>>>
>>> I started a balance of one block group, and I saw this in dmesg:
>>>
>>> BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
>>> BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>>
>>> [...]
>>>
>>> I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
>>> "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
>>> against a use-after-free that I found when I had KASAN enabled. Would
>>> that kernel parameter result in a silent retry if it hit the
>>> use-after-free?
>>
>> Please disregard the quoted message. This behavior does appear to be a
>> result of using the slub_debug option instead of KASAN. It is not
>> directly caused by BTRFS.
>
> Actually, I just reproduced this behavior without slub_debug in the
> cmdline, on Linux 5.3.0 with "[PATCH] btrfs: relocation: Fix KASAN
> report about use-after-free due to dead reloc tree cleanup race"
> (https://patchwork.kernel.org/patch/11153729/) applied.
>
> So, this issue is still relevant and possible to trigger, though under
> different conditions (different volume, kernel version, and cmdline).
>

That patch is not to solve the while loop problem, so we still need some
extra info for this problem.

Is the problem always reproducible on that fs or still with some randomness?

And, can you still reproduce it with v5.1/v5.2?

Thanks,
Qu

* Re: while (1) in btrfs_relocate_block_group didn't end
From: Cebtenzzre @ 2019-10-04 13:51 UTC
To: Qu Wenruo, linux-btrfs

On Sun, 2019-09-29 at 07:37 +0800, Qu Wenruo wrote:
>
> On 2019/9/29 2:36 AM, Cebtenzzre wrote:
> > On Mon, 2019-09-16 at 17:20 -0400, Cebtenzzre wrote:
> > > On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
> > > > Hi,
> > > >
> > > > I started a balance of one block group, and I saw this in dmesg:
> > > >
> > > > BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
> > > > BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > >
> > > > [...]
> > > >
> > > > I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
> > > > "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
> > > > against a use-after-free that I found when I had KASAN enabled. Would
> > > > that kernel parameter result in a silent retry if it hit the
> > > > use-after-free?
> > >
> > > Please disregard the quoted message. This behavior does appear to be a
> > > result of using the slub_debug option instead of KASAN. It is not
> > > directly caused by BTRFS.
> >
> > Actually, I just reproduced this behavior without slub_debug in the
> > cmdline, on Linux 5.3.0 with "[PATCH] btrfs: relocation: Fix KASAN
> > report about use-after-free due to dead reloc tree cleanup race"
> > (https://patchwork.kernel.org/patch/11153729/) applied.
> >
> > So, this issue is still relevant and possible to trigger, though under
> > different conditions (different volume, kernel version, and cmdline).
> >
>
> That patch is not to solve the while loop problem, so we still need some
> extra info for this problem.
>
> Is the problem always reproducible on that fs or still with some randomness?
>
> And, can you still reproduce it with v5.1/v5.2?
>
> Thanks,
> Qu
>

I mentioned that patch because it was the only patch I had applied to my
kernel at the time. The "issue" I was referring to was the looping issue
that I reported in the first email.

I have only come across this behavior without slub_debug once or twice,
so I don't have enough of a sample size to say whether it can happen on
older kernels. It's caused by running a balance with *just* the right
amount of free space, such that the correct behavior is probably ENOSPC.

I might eventually dedicate a volume to reproducing this issue, and
bisect the kernel. But I need all of my disks to be usable right now.

--
Cebtenzzre <cebtenzzre@gmail.com>
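
If a bisect is ever attempted, each kernel under test needs an automatic
way to tell "looping" from "finished or failed with ENOSPC". One possible
check, sketched below, counts the repeated "found N extents" messages
reported at the start of this thread. The helper name
looks_like_relocation_loop and the default threshold of 100 lines are
assumptions made for illustration; neither comes from the kernel or from
btrfs-progs.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and succeed (exit 0)
# when more than $1 "found N extents" lines are present.
# The default threshold (100) is an arbitrary assumption.
looks_like_relocation_loop() {
    threshold=${1:-100}
    # grep -c prints the match count; it exits non-zero on zero matches,
    # so "|| true" keeps the assignment safe under "set -e".
    count=$(grep -c 'BTRFS info (device [^)]*): found [0-9][0-9]* extents' || true)
    [ "$count" -gt "$threshold" ]
}
```

A bisect step could then run something like
`dmesg | looks_like_relocation_loop 100` a short while after starting the
balance and mark the kernel as bad when it succeeds. The right threshold
and waiting time would have to be worked out on a volume that actually
reproduces the problem.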