From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mout01.posteo.de (mout01.posteo.de [185.67.36.65])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 956FF524F
	for <linux-kernel@vger.kernel.org>; Mon,  6 Jan 2025 12:53:41 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.67.36.65
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1736168024; cv=none; b=ikCwRpVf4nM+FkOvAowSA+alvkiUAgDZjH4uniTGdwTWThMnM3urR3BTxkhF2STNW3yZNJdQrnSrJ1cVBATNT26jr4J4JftJaZjMKAFTEjMYm2EbBGWFNMWOxNhLovbS88T6Bz35FHrxfO6+wyUkZyLhHTiJEZuzVlXRuzcIQpE=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1736168024; c=relaxed/simple;
	bh=NwPD9OTbzu7amAfdX3cPOauvw7fdeW5apjUajyr5o4k=;
	h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID:
	 MIME-Version:Content-Type; b=a5oJ26MfoUldWWTOp08bLUn8F8jirXJqdy8ixw7LJvq7lkeavEV3YtI9uYQyQgplWikr+MUUWvyyUD539R2AZFXPqIAW+39S6BtDAwp3VHE0xtyOtsUnZEo7CzzkQebj9luTIb4W1hqnVK5KN0UvsyBQDKEOXP5UK0G3ohR/n48=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=posteo.net; spf=pass smtp.mailfrom=posteo.net; dkim=pass (2048-bit key) header.d=posteo.net header.i=@posteo.net header.b=r7z6+hlt; arc=none smtp.client-ip=185.67.36.65
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=posteo.net
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=posteo.net
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=posteo.net header.i=@posteo.net header.b="r7z6+hlt"
Received: from submission (posteo.de [185.67.36.169]) 
	by mout01.posteo.de (Postfix) with ESMTPS id 822DE240027
	for <linux-kernel@vger.kernel.org>; Mon,  6 Jan 2025 13:53:39 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017;
	t=1736168019; bh=NwPD9OTbzu7amAfdX3cPOauvw7fdeW5apjUajyr5o4k=;
	h=From:To:Cc:Subject:Date:Message-ID:MIME-Version:Content-Type:
	 From;
	b=r7z6+hlt/T/E9O0wr+I1iPvdeBdT83r5+1v8XkkhIei2ukm1bCT9t3mX7Nyqcvms5
	 eylCh67J7TXlKhgE34SpUFUBlUP58Ya6R1SGXDsloRrmywdq6KA/eK1VsI8qFoxhOk
	 3hj4feXBs/KxZSG0GQQ9UOlaoM0JkMUXi60WrqAuybQKNuRbiKW/XfHq8uQPBe/VLs
	 qXHfZt/PL8mx8hy+nMJD9dLr51uWDNbgTjEh0ndTo/MnNQ/Z4egfeNMdSOMj5LWy8c
	 0Q1z4X7jd+VrFiz7PDCuQssUYyedtApktCv3g/fiUu5dAlKGr53m4xa3Bwkr4z2L0+
	 BMkMrebZEXL1A==
Received: from customer (localhost [127.0.0.1])
	by submission (posteo.de) with ESMTPSA id 4YRYyT5nydz9rxL;
	Mon,  6 Jan 2025 13:53:37 +0100 (CET)
From: Charalampos Mitrodimas <charmitro@posteo.net>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Koichiro Den <koichiro.den@canonical.com>,  linux-mm@kvack.org,
  akpm@linux-foundation.org,  linux-kernel@vger.kernel.org,  Thomas
 Gleixner <tglx@linutronix.de>,  Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v2] vmstat: disable vmstat_work on vmstat_cpu_down_prep()
In-Reply-To: <7ed97096-859e-46d0-8f27-16a2298a8914@lucifer.local> (Lorenzo
	Stoakes's message of "Mon, 6 Jan 2025 10:52:37 +0000")
References: <20241221033321.4154409-1-koichiro.den@canonical.com>
	<ff6461df-25d1-494f-ad34-763faf249309@lucifer.local>
	<2q7ge6cgzeowqffyn6w6ed4trhaaumv5ubdgud2tsoolen7wpw@4akuomhbacyh>
	<7ed97096-859e-46d0-8f27-16a2298a8914@lucifer.local>
Date: Mon, 06 Jan 2025 12:53:36 +0000
Message-ID: <m2h66ct86n.fsf@posteo.net>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain

Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:

> +cc tglx, peterz for insight on CPU hot plug
>
> On Sat, Jan 04, 2025 at 01:00:17PM +0900, Koichiro Den wrote:
>> On Fri, Jan 03, 2025 at 11:33:19PM +0000, Lorenzo Stoakes wrote:
>> > On Sat, Dec 21, 2024 at 12:33:20PM +0900, Koichiro Den wrote:
>> > > Even after mm/vmstat:online teardown, shepherd may still queue work for
>> > > the dying cpu until the cpu is removed from online mask. While it's
>> > > quite rare, this means that after unbind_workers() unbinds a per-cpu
>> > > kworker, it potentially runs vmstat_update for the dying CPU on an
>> > > irrelevant cpu before entering atomic AP states.
>> > > When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the
>> > > backtrace.
>> > >
>> > >   BUG: using smp_processor_id() in preemptible [00000000] code: \
>> > >                                                kworker/7:3/1702
>> > >   caller is refresh_cpu_vm_stats+0x235/0x5f0
>> > >   CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
>> > >   Tainted: [N]=TEST
>> > >   Workqueue: mm_percpu_wq vmstat_update
>> > >   Call Trace:
>> > >    <TASK>
>> > >    dump_stack_lvl+0x8d/0xb0
>> > >    check_preemption_disabled+0xce/0xe0
>> > >    refresh_cpu_vm_stats+0x235/0x5f0
>> > >    vmstat_update+0x17/0xa0
>> > >    process_one_work+0x869/0x1aa0
>> > >    worker_thread+0x5e5/0x1100
>> > >    kthread+0x29e/0x380
>> > >    ret_from_fork+0x2d/0x70
>> > >    ret_from_fork_asm+0x1a/0x30
>> > >    </TASK>
>> > >
>> > > So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
>> > > symmetrically enable it on startup.
>> > >
>> > > Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
>> >
>> > Hi,
>> >
>> > I observed a warning in my qemu and real hardware, which I bisected to this commit:
>> >
>> > [    0.087733] ------------[ cut here ]------------
>> > [    0.087733] workqueue: work disable count underflowed
>> > [    0.087733] WARNING: CPU: 1 PID: 21 at kernel/workqueue.c:4313 enable_work+0xb5/0xc0
>> >
>> > This is:
>> >
>> > static void work_offqd_enable(struct work_offq_data *offqd)
>> > {
>> > 	if (likely(offqd->disable > 0))
>> > 		offqd->disable--;
>> > 	else
>> > 		WARN_ONCE(true, "workqueue: work disable count underflowed\n"); <-- this line
>> > }
>> >
>> > So (based on this code) presumably an enable is only required if previously
>> > disabled, and this code is being called on startup unconditionally without
>> > the work having been disabled previously? I'm not hugely familiar with
>> > delayed workqueue implementation details.
>> >
>> > [    0.087733] Modules linked in:
>> > [    0.087733] CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.13.0-rc4+ #58
>> > [    0.087733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
>> > [    0.087733] RIP: 0010:enable_work+0xb5/0xc0
>> > [    0.087733] Code: 6f b8 01 00 74 0f 31 d2 be 01 00 00 00 eb b5 90 0f 0b 90 eb ca c6 05 60 6f b8 01 01 90 48 c7 c7 b0 a9 6e 82 e8 4c a4 fd ff 90 <0f> 0b 90 90 eb d6 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90
>> > [    0.087733] RSP: 0018:ffffc900000cbe30 EFLAGS: 00010092
>> > [    0.087733] RAX: 0000000000000029 RBX: ffff888263ca9d60 RCX: 0000000000000000
>> > [    0.087733] RDX: 0000000000000001 RSI: ffffc900000cbce8 RDI: 0000000000000001
>> > [    0.087733] RBP: ffffc900000cbe30 R08: 00000000ffffdfff R09: ffffffff82b12f08
>> > [    0.087733] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000000c4
>> > [    0.087733] R13: ffffffff81278d90 R14: 0000000000000000 R15: ffff888263c9c648
>> > [    0.087733] FS:  0000000000000000(0000) GS:ffff888263c80000(0000) knlGS:0000000000000000
>> > [    0.087733] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [    0.087733] CR2: 0000000000000000 CR3: 0000000002a2e000 CR4: 0000000000750ef0
>> > [    0.087733] PKRU: 55555554
>> > [    0.087733] Call Trace:
>> > [    0.087733]  <TASK>
>> > [    0.087733]  ? enable_work+0xb5/0xc0
>> > [    0.087733]  ? __warn.cold+0x93/0xf2
>> > [    0.087733]  ? enable_work+0xb5/0xc0
>> > [    0.087733]  ? report_bug+0xff/0x140
>> > [    0.087733]  ? handle_bug+0x54/0x90
>> > [    0.087733]  ? exc_invalid_op+0x17/0x70
>> > [    0.087733]  ? asm_exc_invalid_op+0x1a/0x20
>> > [    0.087733]  ? __pfx_vmstat_cpu_online+0x10/0x10
>> > [    0.087733]  ? enable_work+0xb5/0xc0
>> > [    0.087733]  vmstat_cpu_online+0x5c/0x70
>> > [    0.087733]  cpuhp_invoke_callback+0x133/0x440
>> > [    0.087733]  cpuhp_thread_fun+0x95/0x150
>> > [    0.087733]  smpboot_thread_fn+0xd5/0x1d0
>> > [    0.087734]  ? __pfx_smpboot_thread_fn+0x10/0x10
>> > [    0.087735]  kthread+0xc8/0xf0
>> > [    0.087737]  ? __pfx_kthread+0x10/0x10
>> > [    0.087738]  ret_from_fork+0x2c/0x50
>> > [    0.087739]  ? __pfx_kthread+0x10/0x10
>> > [    0.087740]  ret_from_fork_asm+0x1a/0x30
>> > [    0.087742]  </TASK>
>> > [    0.087742] ---[ end trace 0000000000000000 ]---
>> >
>> >
>> > > ---
>> > > v1: https://lore.kernel.org/all/20241220134234.3809621-1-koichiro.den@canonical.com/
>> > > ---
>> > >  mm/vmstat.c | 3 ++-
>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > > index 4d016314a56c..0889b75cef14 100644
>> > > --- a/mm/vmstat.c
>> > > +++ b/mm/vmstat.c
>> > > @@ -2148,13 +2148,14 @@ static int vmstat_cpu_online(unsigned int cpu)
>> > >  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>> > >  		node_set_state(cpu_to_node(cpu), N_CPU);
>> > >  	}
>> > > +	enable_delayed_work(&per_cpu(vmstat_work, cpu));
>> >
>> > Probably needs to be 'if disabled' here, as this is invoked on normal
>> > startup when the work won't have been disabled?
>> >
>> > Had a brief look at code and couldn't see how that could be done
>> > however... and one would need to be careful about races... Tricky!
>> >
>> > >
>> > >  	return 0;
>> > >  }
>> > >
>> > >  static int vmstat_cpu_down_prep(unsigned int cpu)
>> > >  {
>> > > -	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>> > > +	disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>> > >  	return 0;
>> > >  }
>> > >
>> > > --
>> > > 2.43.0
>> > >
>> > >
>> >
>> > Let me know if you need any more details, .config etc.
>> >
>> > I noticed this warning on a real box too (in both cases running akpm's
>> > mm-unstable branch), FWIW.
>>
>> Thank you for the report. I was able to reproduce the warning and now
>> wonder how I missed it.. My oversight, apologies.
>>
>> In my current view, the simplest solution would be to make sure a local
>> vmstat_work is disabled until vmstat_cpu_online() runs for the cpu, even
>> during boot-up. The following patch suppresses the warning:
>>
>>   diff --git a/mm/vmstat.c b/mm/vmstat.c
>>   index 0889b75cef14..19ceed5d34bf 100644
>>   --- a/mm/vmstat.c
>>   +++ b/mm/vmstat.c
>>   @@ -2122,10 +2122,14 @@ static void __init start_shepherd_timer(void)
>>    {
>>           int cpu;
>>
>>   -       for_each_possible_cpu(cpu)
>>   +       for_each_possible_cpu(cpu) {
>>                   INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
>>                           vmstat_update);
>>
>>   +               /* will be enabled on vmstat_cpu_online */
>>   +               disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
>>   +       }
>>   +
>>           schedule_delayed_work(&shepherd,
>>                   round_jiffies_relative(sysctl_stat_interval));
>>    }
>>
>> If you think of a better solution later, please let me know. Otherwise,
>> I'll submit a follow-up fix patch with the above diff.
>
> Thanks, this resolves the problem, but are we sure that _all_ CPUs will
> definitely call vmstat_cpu_online()?
>
> I did a bit of printk output and it seems like this _didn't_ online CPU 0,
> presumably the boot CPU which calls this function in the first instance?

FWIW, with the proposed fix applied, I can see that all CPUs are online:
  grep -H . /sys/devices/system/cpu/cpu*/online
  /sys/devices/system/cpu/cpu0/online:1
  /sys/devices/system/cpu/cpu1/online:1
  /sys/devices/system/cpu/cpu2/online:1
  /sys/devices/system/cpu/cpu3/online:1
  /sys/devices/system/cpu/cpu4/online:1
  /sys/devices/system/cpu/cpu5/online:1
  /sys/devices/system/cpu/cpu6/online:1
  /sys/devices/system/cpu/cpu7/online:1

>
> I also see that init_mm_internals() invokes cpuhp_setup_state_nocalls()
> explicitly which does _not_ call the callback, though even if it did this
> would be too early as it calls start_shepherd_timer() _after_ this anyway.
>
> So yeah, unless I'm missing something, I think this patch is broken.
>
> I have added Thomas and Peter to give some insight on the CPU hotplug side.
>
> It feels like the patch really needs an 'enable if not already enabled'
> call in vmstat_cpu_online().
>
>>
>> Thanks.
>>
>> -Koichiro
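
As a rough illustration of the counting issue discussed above, here is a
small userspace model in C. It is only a sketch, not kernel code: the
model_* names and struct model_work are made up for this example, and the
boot sequence it assumes (CPU 0 never gets the online callback because the
hotplug state is registered with the _nocalls variant, while secondary CPUs
do get it) is based on Lorenzo's printk observation rather than on anything
verified here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-work disable count. */
struct model_work {
	int disable;    /* > 0 means the work is disabled */
	bool underflow; /* set where the kernel would WARN_ONCE() */
};

static void model_init_work(struct model_work *w)
{
	w->disable = 0;
	w->underflow = false;
}

static void model_disable(struct model_work *w)
{
	w->disable++;
}

/* Same branch structure as the work_offqd_enable() snippet quoted above. */
static void model_enable(struct model_work *w)
{
	if (w->disable > 0)
		w->disable--;
	else
		w->underflow = true;
}

static bool model_work_enabled(const struct model_work *w)
{
	return w->disable == 0;
}

/*
 * Model of boot under the assumptions stated in the lead-in:
 *  - every possible CPU gets its work initialised, and optionally
 *    disabled right away (the proposed start_shepherd_timer() change);
 *  - CPU 0 is already online and never sees the online callback;
 *  - CPUs 1..nr_cpus-1 come up later and each runs the online
 *    callback, which does an unconditional enable.
 */
static void model_boot(struct model_work *works, int nr_cpus,
		       bool disable_at_init)
{
	int cpu;

	assert(nr_cpus > 0);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		model_init_work(&works[cpu]);
		if (disable_at_init)
			model_disable(&works[cpu]);
	}

	for (cpu = 1; cpu < nr_cpus; cpu++)
		model_enable(&works[cpu]);
}
```

In this model, calling model_boot() with disable_at_init=false flags an
underflow for every secondary CPU, which matches the reported warning. With
disable_at_init=true the underflow goes away, but CPU 0's work stays
disabled, which is the boot-CPU concern raised above. Whether the real
kernel behaves this way depends on whether vmstat_cpu_online() is ever
invoked for the boot CPU.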