while (1) in btrfs_relocate_block_group didn't end

public inbox for linux-btrfs@vger.kernel.org

* while (1) in btrfs_relocate_block_group didn't end
@ 2019-09-14 21:36 Cebtenzzre
  2019-09-16 21:20 ` Cebtenzzre
  0 siblings, 1 reply; 5+ messages in thread
From: Cebtenzzre @ 2019-09-14 21:36 UTC (permalink / raw)
  To: linux-btrfs


Hi,

I started a balance of one block group, and I saw this in dmesg:

BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents
BTRFS info (device sdi1): found 1 extents

It continued like that for a total of 754 lines until I rebooted. Before
that, I captured some debug info. I ran this in my shell for a few
seconds, where PID is the pid of the process that called the balance
ioctl:

integer i=0; while true; do sudo cat /proc/PID/stack >stack$i; sleep .01010101; i+=1; done

Which effectively gave me stack samples at (close to) 99Hz. Maybe not
ideal, but I was in a hurry and I didn't want my disks to sustain such
heavy, repetitive I/O for too long.
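
A bounded variant of the sampling loop above might look like the following sketch. It is plain sh rather than the zsh one-liner, and the function name `sample_stacks`, its arguments, and the `stackN` output naming are illustrative assumptions, not part of the original report. It also assumes a `sleep` that accepts fractional seconds (GNU coreutils does), and reading /proc/PID/stack normally requires root.

```shell
#!/bin/sh
# Sketch only: copy a file (e.g. /proc/PID/stack) into numbered snapshot
# files a fixed number of times. All names here are illustrative.
# usage: sample_stacks SRC OUTDIR COUNT INTERVAL_SECONDS
sample_stacks() {
    src=$1
    outdir=$2
    count=$3
    interval=$4
    mkdir -p "$outdir" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # A failed read leaves an empty sample instead of aborting the run.
        cat "$src" > "$outdir/stack$i" 2>/dev/null || :
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, `sample_stacks /proc/PID/stack ./stacks 300 0.01` would take 300 samples and then stop. The real rate ends up somewhat below 1/INTERVAL, because every iteration also pays for starting `cat` and `sleep`.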

I've attached the stack samples as stacks.tar.gz. A few of them are
empty. To me, it looks like the kernel never left the while (1) loop in
btrfs_relocate_block_group. The kernel messages seem to confirm this.
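
One way to check that reading of the samples is to count how many non-empty samples contain a given function. The sketch below assumes the usual /proc/PID/stack line format, `[<0>] symbol+0xOFF/0xSIZE [module]`, and that the archive was extracted into a directory of `stack*` files. The function name `count_samples_with` is hypothetical.

```shell
#!/bin/sh
# Sketch only: report how many non-empty stack samples mention a symbol.
# usage: count_samples_with DIR SYMBOL
# Prints "HITS/NONEMPTY".
count_samples_with() {
    dir=$1
    symbol=$2
    hits=0
    total=0
    for f in "$dir"/stack*; do
        [ -s "$f" ] || continue   # skip empty (or missing) samples
        total=$((total + 1))
        # Anchor on "] " before and "+0x" after the name, so that a short
        # symbol does not match inside a longer one.
        if grep -qF -- "] ${symbol}+0x" "$f"; then
            hits=$((hits + 1))
        fi
    done
    printf '%s/%s\n' "$hits" "$total"
}
```

Running `count_samples_with ./stacks btrfs_relocate_block_group` prints the number of samples that include that frame, followed by the number of non-empty samples it was checked against.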

I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
"slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
against a use-after-free that I found when I had KASAN enabled. Would
that kernel parameter result in a silent retry if it hit the use-after-
free?
-- 
Cebtenzzre <cebtenzzre@gmail.com>

[-- Attachment #2: stacks.tar.gz --]
[-- Type: application/x-compressed-tar, Size: 9068 bytes --]


* Re: while (1) in btrfs_relocate_block_group didn't end
  2019-09-14 21:36 while (1) in btrfs_relocate_block_group didn't end Cebtenzzre
@ 2019-09-16 21:20 ` Cebtenzzre
  2019-09-28 18:36   ` Cebtenzzre
  0 siblings, 1 reply; 5+ messages in thread
From: Cebtenzzre @ 2019-09-16 21:20 UTC (permalink / raw)
  To: linux-btrfs

On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
> Hi,
> 
> I started a balance of one block group, and I saw this in dmesg:
> 
> BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
> BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
> BTRFS info (device sdi1): found 1 extents
> 
> It continued like that for a total of 754 lines until I rebooted. Before
> that, I captured some debug info. I ran this in my shell for a few
> seconds, where PID is the pid of the process that called the balance
> ioctl:
> 
> integer i=0; while true; do sudo cat /proc/PID/stack >stack$i; sleep .01010101; i+=1; done
> 
> Which effectively gave me stack samples at (close to) 99Hz. Maybe not
> ideal, but I was in a hurry and I didn't want my disks to sustain such
> heavy, repetitive I/O for too long.
> 
> I've attached the stack samples as stacks.tar.gz. A few of them are
> empty. To me, it looks like the kernel never left the while (1) loop in
> btrfs_relocate_block_group. The kernel messages seem to confirm this.
> 
> I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
> "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
> against a use-after-free that I found when I had KASAN enabled. Would
> that kernel parameter result in a silent retry if it hit the use-after-
> free?

Please disregard the quoted message. This behavior does appear to be a
result of using the slub_debug option instead of KASAN. It is not
directly caused by BTRFS.
-- 
Cebtenzzre <cebtenzzre@gmail.com>



* Re: while (1) in btrfs_relocate_block_group didn't end
  2019-09-16 21:20 ` Cebtenzzre
@ 2019-09-28 18:36   ` Cebtenzzre
  2019-09-28 23:37     ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Cebtenzzre @ 2019-09-28 18:36 UTC (permalink / raw)
  To: linux-btrfs

On Mon, 2019-09-16 at 17:20 -0400, Cebtenzzre wrote:
> On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
> > Hi,
> > 
> > I started a balance of one block group, and I saw this in dmesg:
> > 
> > BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
> > BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> > BTRFS info (device sdi1): found 1 extents
> > 
> > [...]
> > 
> > I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
> > "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
> > against a use-after-free that I found when I had KASAN enabled. Would
> > that kernel parameter result in a silent retry if it hit the use-after-
> > free?
> 
> Please disregard the quoted message. This behavior does appear to be a
> result of using the slub_debug option instead of KASAN. It is not
> directly caused by BTRFS.

Actually, I just reproduced this behavior without slub_debug in the
cmdline, on Linux 5.3.0 with "[PATCH] btrfs: relocation: Fix KASAN
report about use-after-free due to dead reloc tree cleanup race" (
https://patchwork.kernel.org/patch/11153729/) applied.

So, this issue is still relevant and possible to trigger, though under
different conditions (different volume, kernel version, and cmdline).
-- 
Cebtenzzre <cebtenzzre@gmail.com>



* Re: while (1) in btrfs_relocate_block_group didn't end
  2019-09-28 18:36   ` Cebtenzzre
@ 2019-09-28 23:37     ` Qu Wenruo
  2019-10-04 13:51       ` Cebtenzzre
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2019-09-28 23:37 UTC (permalink / raw)
  To: Cebtenzzre, linux-btrfs





On 2019/9/29 2:36 AM, Cebtenzzre wrote:
> On Mon, 2019-09-16 at 17:20 -0400, Cebtenzzre wrote:
>> On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
>>> Hi,
>>>
>>> I started a balance of one block group, and I saw this in dmesg:
>>>
>>> BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
>>> BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>> BTRFS info (device sdi1): found 1 extents
>>>
>>> [...]
>>>
>>> I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
>>> "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
>>> against a use-after-free that I found when I had KASAN enabled. Would
>>> that kernel parameter result in a silent retry if it hit the use-after-
>>> free?
>>
>> Please disregard the quoted message. This behavior does appear to be a
>> result of using the slub_debug option instead of KASAN. It is not
>> directly caused by BTRFS.
> 
> Actually, I just reproduced this behavior without slub_debug in the
> cmdline, on Linux 5.3.0 with "[PATCH] btrfs: relocation: Fix KASAN
> report about use-after-free due to dead reloc tree cleanup race" (
> https://patchwork.kernel.org/patch/11153729/) applied.
> 
> So, this issue is still relevant and possible to trigger, though under
> different conditions (different volume, kernel version, and cmdline).
> 

That patch is not meant to solve the while loop problem, so we still
need some extra info for this problem.

Is the problem always reproducible on that fs, or does it still occur
with some randomness?

And, can you still reproduce it with v5.1/v5.2?

Thanks,
Qu




* Re: while (1) in btrfs_relocate_block_group didn't end
  2019-09-28 23:37     ` Qu Wenruo
@ 2019-10-04 13:51       ` Cebtenzzre
  0 siblings, 0 replies; 5+ messages in thread
From: Cebtenzzre @ 2019-10-04 13:51 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Sun, 2019-09-29 at 07:37 +0800, Qu Wenruo wrote:
> 
> On 2019/9/29 2:36 AM, Cebtenzzre wrote:
> > On Mon, 2019-09-16 at 17:20 -0400, Cebtenzzre wrote:
> > > On Sat, 2019-09-14 at 17:36 -0400, Cebtenzzre wrote:
> > > > Hi,
> > > > 
> > > > I started a balance of one block group, and I saw this in dmesg:
> > > > 
> > > > BTRFS info (device sdi1): balance: start -dvrange=2236714319872..2236714319873
> > > > BTRFS info (device sdi1): relocating block group 2236714319872 flags data|raid0
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > BTRFS info (device sdi1): found 1 extents
> > > > 
> > > > [...]
> > > > 
> > > > I am using Arch Linux with kernel version 5.2.14-arch2, and I specified
> > > > "slub_debug=P,kmalloc-2k" in the kernel cmdline to detect and protect
> > > > against a use-after-free that I found when I had KASAN enabled. Would
> > > > that kernel parameter result in a silent retry if it hit the use-after-
> > > > free?
> > > 
> > > Please disregard the quoted message. This behavior does appear to be a
> > > result of using the slub_debug option instead of KASAN. It is not
> > > directly caused by BTRFS.
> > 
> > Actually, I just reproduced this behavior without slub_debug in the
> > cmdline, on Linux 5.3.0 with "[PATCH] btrfs: relocation: Fix KASAN
> > report about use-after-free due to dead reloc tree cleanup race" (
> > https://patchwork.kernel.org/patch/11153729/) applied.
> > 
> > So, this issue is still relevant and possible to trigger, though under
> > different conditions (different volume, kernel version, and cmdline).
> > 
> 
> That patch is not meant to solve the while loop problem, so we still
> need some extra info for this problem.
> 
> Is the problem always reproducible on that fs, or does it still occur
> with some randomness?
> 
> And, can you still reproduce it with v5.1/v5.2?
> 
> Thanks,
> Qu
> 

I mentioned that patch because it was the only patch I had applied to my
kernel at the time. The "issue" I was referring to was the looping issue
that I reported in the first email.

I have only come across this behavior without slub_debug once or twice,
so I don't have a large enough sample to say whether it can happen on
older kernels. It's caused by running a balance with *just* the right
amount of free space, such that the correct behavior is probably ENOSPC.

I might eventually dedicate a volume to reproducing this issue, and
bisect the kernel. But I need all of my disks to be usable right now.
-- 
Cebtenzzre <cebtenzzre@gmail.com>


