public inbox for linux-kernel@vger.kernel.org
* Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
@ 2024-06-04  7:16 Max Kellermann
  2024-06-10  8:28 ` Max Kellermann
  2024-06-12  5:01 ` Suren Baghdasaryan
  0 siblings, 2 replies; 19+ messages in thread
From: Max Kellermann @ 2024-06-04  7:16 UTC (permalink / raw)
  To: Johannes Weiner, surenb, Peter Zijlstra, linux-kernel

Hi kernel people,
I have a problem that I have been trying to debug for a few days, but
I got lost in the depths of the scheduler code; I'm stuck and I need
your help.
We have several servers which show a constant memory.pressure value of
30 to 100 (or more), even when the server is idle. I tracked this down
to psi_group_cpu.tasks[NR_MEMSTALL]==1 even though no such process
exists, but I can't figure out why the kernel thinks there is still
one task stuck in memstall. I tried to track down all the code paths
that lead to psi_group_change(), but found nothing conclusive, and
failed to reproduce it on a test machine with kernel patches injecting
delays (trying to cause data race bugs that may have caused this
problem).

This happened on servers that were very busy and indeed were in
memstall often due to going over memory.high frequently. We have one
"main" cgroup with memory.high configured, and all the workload
processes live in sub-cgroups, of which we always have a few thousand.
When memory.events gets triggered, our process manager stops a bunch
of idle processes to free up memory, which then deletes the sub-cgroup
they belong to. In other words: sub-cgroups get created and deleted
very often, and they get deleted when there is indeed memory stall
happening. My theory was that there could be a data race bug that
forgets to decrement tasks[NR_MEMSTALL], maybe when a stalled child
cgroup gets deleted.
On our Grafana, I can easily track the beginning of this bug to a
point two weeks ago; in the system log, I can see that hundreds of
processes needed to be terminated due to memory pressure at that time.

The affected servers run kernel 6.8.7 with a few custom patches, but
none of these patches affect the scheduler or cgroups; they're about
unrelated things like denying access to Ceph snapshots and adjusting
debugfs permissions. (I submitted most of those patches to LKML long
ago but nobody cared.)
Newer kernels don't seem to have fixes for my problem; the relevant
parts of the code are unchanged.

One of the servers is still running with this problem, and I can
access it with gdb on /proc/kcore. I'll keep it that way for some more
time, so if you have any idea what to look for, let me know.

This is the psi_group of the "main" cgroup:

 $1 = {parent = 0xffff9de707287800, enabled = true, avgs_lock = {owner
= {counter = 0}, wait_lock = {raw_lock = {{val = {counter = 0},
{locked = 0 '\000', pending = 0 '\000'}, {locked_pending = 0, tail =
0}}}}, osq = {tail = {
        counter = 0}}, wait_list = {next = 0xffff9de70f772820, prev =
0xffff9de70f772820}}, pcpu = 0x3fb640033900, avg_total =
{6133960836647, 5923217690044, 615319825665255, 595479374843164,
19259777147170, 12847590051880},
  avg_last_update = 1606208471280060, avg_next_update =
1606210394507082, avgs_work = {work = {data = {counter = 321}, entry =
{next = 0xffff9de70f772880, prev = 0xffff9de70f772880}, func =
0xffffffff880dcc00}, timer = {entry = {
        next = 0x0 <fixed_percpu_data>, pprev = 0xffff9e05bef5bc48},
expires = 4455558105, function = 0xffffffff880a1ca0, flags =
522190853}, wq = 0xffff9de700051400, cpu = 64}, avg_triggers = {next =
0xffff9de70f7728d0,
    prev = 0xffff9de70f7728d0}, avg_nr_triggers = {0, 0, 0, 0, 0, 0},
total = {{6133960836647, 5923217690044, 615328415599847,
595487964777756, 19281251983650, 12869064888360}, {6092994926,
5559819737, 105947464151, 100672353730,
      8196529519, 7678536634}}, avg = {{0, 0, 0}, {0, 0, 0}, {203596,
203716, 198499}, {203596, 203716, 198288}, {0, 0, 60}, {0, 0, 0}},
rtpoll_task = 0x0 <fixed_percpu_data>, rtpoll_timer = {entry = {next =
0xdead000000000122,
      pprev = 0x0 <fixed_percpu_data>}, expires = 4405010639, function
= 0xffffffff880dac50, flags = 67108868}, rtpoll_wait = {lock = {{rlock
= {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0
'\000'}, {
                locked_pending = 0, tail = 0}}}}}}, head = {next =
0xffff9de70f772a20, prev = 0xffff9de70f772a20}}, rtpoll_wakeup =
{counter = 0}, rtpoll_scheduled = {counter = 0}, rtpoll_trigger_lock =
{owner = {counter = 0},
    wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000',
pending = 0 '\000'}, {locked_pending = 0, tail = 0}}}}, osq = {tail =
{counter = 0}}, wait_list = {next = 0xffff9de70f772a48, prev =
0xffff9de70f772a48}},
  rtpoll_triggers = {next = 0xffff9de70f772a58, prev =
0xffff9de70f772a58}, rtpoll_nr_triggers = {0, 0, 0, 0, 0, 0},
rtpoll_states = 0, rtpoll_min_period = 18446744073709551615,
rtpoll_total = {6092994926, 5559819737, 105947464151,
    100672353730, 8196529519, 7678536634}, rtpoll_next_update =
1100738436720135, rtpoll_until = 0}

This is a summary of all psi_group_cpu entries for the 32 CPU cores
(along the way, I wrote a small gdb script to dump interesting details
like these, but it led nowhere):

  state_mask 0 = 0x0 tasks {0, 0, 0, 0}
  state_mask 1 = 0x0 tasks {0, 0, 0, 0}
  state_mask 2 = 0x0 tasks {0, 0, 0, 0}
  state_mask 3 = 0x0 tasks {0, 0, 0, 0}
  state_mask 4 = 0x0 tasks {0, 0, 0, 0}
  state_mask 5 = 0x0 tasks {0, 0, 0, 0}
  state_mask 6 = 0x0 tasks {0, 0, 0, 0}
  state_mask 7 = 0x0 tasks {0, 0, 0, 0}
  state_mask 8 = 0x0 tasks {0, 0, 0, 0}
  state_mask 9 = 0x0 tasks {0, 0, 0, 0}
  state_mask 10 = 0x0 tasks {0, 0, 0, 0}
  state_mask 11 = 0x0 tasks {0, 0, 0, 0}
  state_mask 12 = 0x0 tasks {0, 0, 0, 0}
  state_mask 13 = 0x0 tasks {0, 0, 0, 0}
  state_mask 14 = 0x0 tasks {0, 0, 0, 0}
  state_mask 15 = 0x0 tasks {0, 0, 0, 0}
  state_mask 16 = 0x0 tasks {0, 0, 0, 0}
  state_mask 17 = 0x0 tasks {0, 0, 0, 0}
  state_mask 18 = 0x0 tasks {0, 0, 0, 0}
  state_mask 19 = 0x0 tasks {0, 0, 0, 0}
  state_mask 20 = 0x0 tasks {0, 0, 0, 0}
  state_mask 21 = 0x0 tasks {0, 0, 0, 0}
  state_mask 22 = 0x0 tasks {0, 0, 0, 0}
  state_mask 23 = 0x0 tasks {0, 0, 0, 0}
  state_mask 24 = 0x0 tasks {0, 0, 0, 0}
  state_mask 25 = 0x0 tasks {0, 0, 0, 0}
  state_mask 26 = 0x0 tasks {0, 0, 0, 0}
  state_mask 27 = 0x0 tasks {0, 0, 0, 0}
  state_mask 28 = 0x0 tasks {0, 0, 0, 0}
  state_mask 29 = 0x0 tasks {0, 0, 0, 0}
  state_mask 30 = 0x4c tasks {0, 1, 0, 0}
  state_mask 31 = 0x0 tasks {0, 0, 0, 0}

CPU core 30 is stuck with this bogus value. state_mask 0x4c =
PSI_MEM_SOME|PSI_MEM_FULL|PSI_NONIDLE.

The memory pressure at the time of this writing:

 # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/memory.pressure
 some avg10=99.22 avg60=99.39 avg300=97.62 total=615423620626
 full avg10=99.22 avg60=99.39 avg300=97.54 total=595583169804
 # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/_/memory.pressure
 some avg10=0.00 avg60=0.00 avg300=0.00 total=0
 full avg10=0.00 avg60=0.00 avg300=0.00 total=0
 # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/cgroup.stat
 nr_descendants 1
 nr_dying_descendants 2224

There is currently no worker process; there is only one idle dummy
process in a single sub-cgroup called "_", only there to keep the
systemd scope populated. It should therefore be impossible to have
memory.pressure when the only leaf cgroup has pressure=0.

(nr_dying_descendants is decremented extremely slowly; I deactivated
the server shortly before collecting these numbers, to make sure it's
really idle and there are really no processes left to cause this
pressure. I don't think nr_dying_descendants is relevant for this
problem; even after two days of full idle, the counter and the
pressure didn't go back to zero.)

Please help :-)

Max

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-04  7:16 Bad psi_group_cpu.tasks[NR_MEMSTALL] counter Max Kellermann
@ 2024-06-10  8:28 ` Max Kellermann
  2024-06-12  5:01 ` Suren Baghdasaryan
  1 sibling, 0 replies; 19+ messages in thread
From: Max Kellermann @ 2024-06-10  8:28 UTC (permalink / raw)
  To: Johannes Weiner, surenb, Peter Zijlstra, linux-kernel

On Tue, Jun 4, 2024 at 9:16 AM Max Kellermann <max.kellermann@ionos.com> wrote:
> One of the servers is still running with this problem, and I can
> access it with gdb on /proc/kcore. I'll keep it that way for some more
> time, so if you have any idea what to look for, let me know.

It's been a week since I reported this kernel bug, but nobody replied.
I will have to reboot this server soon, so this is the last chance to
extract information about the bug.

Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-04  7:16 Bad psi_group_cpu.tasks[NR_MEMSTALL] counter Max Kellermann
  2024-06-10  8:28 ` Max Kellermann
@ 2024-06-12  5:01 ` Suren Baghdasaryan
  2024-06-12  6:49   ` Max Kellermann
  2024-08-05 12:34   ` Max Kellermann
  1 sibling, 2 replies; 19+ messages in thread
From: Suren Baghdasaryan @ 2024-06-12  5:01 UTC (permalink / raw)
  To: Max Kellermann; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Tue, Jun 4, 2024 at 12:16 AM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> Hi kernel people,
> I have a problem that I have been trying to debug for a few days, but
> I got lost in the depths of the scheduler code; I'm stuck and I need
> your help.
> We have several servers which show a constant memory.pressure value of
> 30 to 100 (or more), even when the server is idle. I tracked this down
> to psi_group_cpu.tasks[NR_MEMSTALL]==1 even though no such process
> exists, but I can't figure out why the kernel thinks there is still
> one task stuck in memstall. I tried to track down all the code paths
> that lead to psi_group_change(), but found nothing conclusive, and
> failed to reproduce it on a test machine with kernel patches injecting
> delays (trying to cause data race bugs that may have caused this
> problem).
>
> This happened on servers that were very busy and indeed were in
> memstall often due to going over memory.high frequently. We have one
> "main" cgroup with memory.high configured, and all the workload
> processes live in sub-cgroups, of which we always have a few thousand.
> When memory.events gets triggered, our process manager stops a bunch
> of idle processes to free up memory, which then deletes the sub-cgroup
> they belong to. In other words: sub-cgroups get created and deleted
> very often, and they get deleted when there is indeed memory stall
> happening. My theory was that there could be a data race bug that
> forgets to decrement tasks[NR_MEMSTALL], maybe when a stalled child
> cgroup gets deleted.

Hi Max,
I'm not an expert in the scheduler (I maintain mostly PSI triggers),
so my feedback might be utterly wrong.
I looked a bit into the relevant code and I think if your theory was
correct and psi_task_change() was called while task's cgroup is
destroyed then task_psi_group() would have returned an invalid pointer
and we would crash once that value is dereferenced.
Instead I think what might be happening is that the task is terminated
while it's in memstall. do_exit() calls do_task_dead() at the very
end, which sets current->__state to TASK_DEAD and calls the last
__schedule() for this task. __schedule() will call deactivate_task(rq,
prev, DEQUEUE_SLEEP) which will set prev->on_rq = NULL and call
dequeue_task(..., DEQUEUE_SLEEP) leading to psi_dequeue(..., true).
Note that because that last parameter of psi_dequeue() is "true",
psi_task_change() will not be called at this time. Later on
__schedule() calls psi_sched_switch(). That leads to psi_task_switch()
but note that the last parameter will be "true" because prev->on_rq ==
0. So we end up calling psi_task_switch(..., true). Now, note this
line: https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L955.
It will clear TSK_MEMSTALL_RUNNING but not TSK_MEMSTALL. So, if the
task was in TSK_MEMSTALL state then that state won't be cleared, which
might be the problem you are facing.
I think you can check if this theory pans out by adding a WARN_ON() at
the end of psi_task_switch():

void psi_task_switch(struct task_struct *prev, struct task_struct
*next, bool sleep)
{
...
        if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
                clear &= ~TSK_ONCPU;
                for (; group; group = group->parent)
                        psi_group_change(group, cpu, clear, set, now,
wake_clock);
        }
+        WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
}

Again, I am by no means an expert in this area. Johannes or Peter
would be much better people to consult with.
Thanks,
Suren.

> On our Grafana, I can easily track the beginning of this bug to a
> point two weeks ago; in the system log, I can see that hundreds of
> processes needed to be terminated due to memory pressure at that time.
>
> The affected servers run kernel 6.8.7 with a few custom patches, but
> none of these patches affect the scheduler or cgroups; they're about
> unrelated things like denying access to Ceph snapshots and adjusting
> debugfs permissions. (I submitted most of those patches to LKML long
> ago but nobody cared.)
> Newer kernels don't seem to have fixes for my problem; the relevant
> parts of the code are unchanged.
>
> One of the servers is still running with this problem, and I can
> access it with gdb on /proc/kcore. I'll keep it that way for some more
> time, so if you have any idea what to look for, let me know.
>
> This is the psi_group of the "main" cgroup:
>
>  $1 = {parent = 0xffff9de707287800, enabled = true, avgs_lock = {owner
> = {counter = 0}, wait_lock = {raw_lock = {{val = {counter = 0},
> {locked = 0 '\000', pending = 0 '\000'}, {locked_pending = 0, tail =
> 0}}}}, osq = {tail = {
>         counter = 0}}, wait_list = {next = 0xffff9de70f772820, prev =
> 0xffff9de70f772820}}, pcpu = 0x3fb640033900, avg_total =
> {6133960836647, 5923217690044, 615319825665255, 595479374843164,
> 19259777147170, 12847590051880},
>   avg_last_update = 1606208471280060, avg_next_update =
> 1606210394507082, avgs_work = {work = {data = {counter = 321}, entry =
> {next = 0xffff9de70f772880, prev = 0xffff9de70f772880}, func =
> 0xffffffff880dcc00}, timer = {entry = {
>         next = 0x0 <fixed_percpu_data>, pprev = 0xffff9e05bef5bc48},
> expires = 4455558105, function = 0xffffffff880a1ca0, flags =
> 522190853}, wq = 0xffff9de700051400, cpu = 64}, avg_triggers = {next =
> 0xffff9de70f7728d0,
>     prev = 0xffff9de70f7728d0}, avg_nr_triggers = {0, 0, 0, 0, 0, 0},
> total = {{6133960836647, 5923217690044, 615328415599847,
> 595487964777756, 19281251983650, 12869064888360}, {6092994926,
> 5559819737, 105947464151, 100672353730,
>       8196529519, 7678536634}}, avg = {{0, 0, 0}, {0, 0, 0}, {203596,
> 203716, 198499}, {203596, 203716, 198288}, {0, 0, 60}, {0, 0, 0}},
> rtpoll_task = 0x0 <fixed_percpu_data>, rtpoll_timer = {entry = {next =
> 0xdead000000000122,
>       pprev = 0x0 <fixed_percpu_data>}, expires = 4405010639, function
> = 0xffffffff880dac50, flags = 67108868}, rtpoll_wait = {lock = {{rlock
> = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0
> '\000'}, {
>                 locked_pending = 0, tail = 0}}}}}}, head = {next =
> 0xffff9de70f772a20, prev = 0xffff9de70f772a20}}, rtpoll_wakeup =
> {counter = 0}, rtpoll_scheduled = {counter = 0}, rtpoll_trigger_lock =
> {owner = {counter = 0},
>     wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000',
> pending = 0 '\000'}, {locked_pending = 0, tail = 0}}}}, osq = {tail =
> {counter = 0}}, wait_list = {next = 0xffff9de70f772a48, prev =
> 0xffff9de70f772a48}},
>   rtpoll_triggers = {next = 0xffff9de70f772a58, prev =
> 0xffff9de70f772a58}, rtpoll_nr_triggers = {0, 0, 0, 0, 0, 0},
> rtpoll_states = 0, rtpoll_min_period = 18446744073709551615,
> rtpoll_total = {6092994926, 5559819737, 105947464151,
>     100672353730, 8196529519, 7678536634}, rtpoll_next_update =
> 1100738436720135, rtpoll_until = 0}
>
> This is a summary of all psi_group_pcpu for the 32 CPU cores (on the
> way, I wrote a small gdb script to dump interesting details like these
> but that went nowhere):
>
>   state_mask 0 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 1 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 2 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 3 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 4 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 5 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 6 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 7 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 8 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 9 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 10 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 11 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 12 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 13 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 14 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 15 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 16 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 17 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 18 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 19 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 20 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 21 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 22 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 23 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 24 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 25 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 26 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 27 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 28 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 29 = 0x0 tasks {0, 0, 0, 0}
>   state_mask 30 = 0x4c tasks {0, 1, 0, 0}
>   state_mask 31 = 0x0 tasks {0, 0, 0, 0}
>
> CPU core 30 is stuck with this bogus value. state_mask 0x4c =
> PSI_MEM_SOME|PSI_MEM_FULL|PSI_NONIDLE.
>
> The memory pressure at the time of this writing:
>
>  # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/memory.pressure
>  some avg10=99.22 avg60=99.39 avg300=97.62 total=615423620626
>  full avg10=99.22 avg60=99.39 avg300=97.54 total=595583169804
>  # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/_/memory.pressure
>  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
>  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
>  # cat /sys/fs/cgroup/system.slice/system-cm4all.slice/bp-spawn.scope/cgroup.stat
>  nr_descendants 1
>  nr_dying_descendants 2224
>
> There is currently no worker process; there is only one idle dummy
> process in a single sub-cgroup called "_", only there to keep the
> systemd scope populated. It should therefore be impossible to have
> memory.pressure when the only leaf cgroup has pressure=0.
>
> (nr_dying_descendants is decremented extremely slowly; I deactivated
> the server shortly before collecting these numbers, to make sure it's
> really idle and there are really no processes left to cause this
> pressure. I don't think nr_dying_descendants is relevant for this
> problem; even after two days of full idle, the counter and the
> pressure didn't go back to zero.)
>
> Please help :-)
>
> Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-12  5:01 ` Suren Baghdasaryan
@ 2024-06-12  6:49   ` Max Kellermann
  2024-06-12  9:48     ` Peter Zijlstra
  2024-06-12 15:17     ` Suren Baghdasaryan
  2024-08-05 12:34   ` Max Kellermann
  1 sibling, 2 replies; 19+ messages in thread
From: Max Kellermann @ 2024-06-12  6:49 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> Instead I think what might be happening is that the task is terminated
> while it's in memstall.

How is it possible to terminate a task that's in memstall?
This must be between psi_memstall_enter() and psi_memstall_leave(),
but I had already checked all the callers and found nothing
suspicious; no obvious way to escape the section without
psi_memstall_leave(). In my understanding, it's impossible to
terminate a task that's currently stuck in the kernel. First, it needs
to leave the kernel and go back to userspace, doesn't it?

> I think if your theory was
> correct and psi_task_change() was called while task's cgroup is
> destroyed then task_psi_group() would have returned an invalid pointer
> and we would crash once that value is dereferenced.

I was thinking of something slightly different; something about the
cgroup being deleted or a task being terminated and the bookkeeping of
the PSI flags getting wrong, maybe some data race. I found the whole
PSI code with per-task flags, per-cpu per-cgroup counters and flags
somewhat obscure (but somebody else's code is always obscure, of
course); I thought there was a lot of potential for mistakes with the
bookkeeping, but I found nothing specific.

Anyway, thanks for looking into this - I hope we can get a grip on
this issue, as it's preventing me from using PSI values for actual
process management; the servers that go into this state will always
appear overloaded and that would lead to killing all the workload
processes forever.

Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-12  6:49   ` Max Kellermann
@ 2024-06-12  9:48     ` Peter Zijlstra
  2024-06-12 10:20       ` Max Kellermann
  2024-06-12 15:17     ` Suren Baghdasaryan
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2024-06-12  9:48 UTC (permalink / raw)
  To: Max Kellermann; +Cc: Suren Baghdasaryan, Johannes Weiner, linux-kernel

On Wed, Jun 12, 2024 at 08:49:02AM +0200, Max Kellermann wrote:
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
> 
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Yeah, this. I can't find anything obvious either. The trickiest one is
read-ahead, though; I couldn't immediately track all the
readahead_expand() callers. Any such caller must then end up calling
read_pages() in order to land on the psi_memstall_leave(). This is
typically through page_cache_ra_*().

The erofs one is also not entirely obvious, but irrelevant if you're not using
it... the below should make it a little more obvious, but what do I know.

(whitespace mangled)

--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1698,9 +1698,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
                        submit_bio(bio);
                else
                        erofs_fscache_submit_bio(bio);
-               if (memstall)
-                       psi_memstall_leave(&pflags);
        }
+       if (memstall)
+               psi_memstall_leave(&pflags);
 
        /*
         * although background is preferred, no one is pending for submission.


> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
> 
> I was thinking of something slightly different; something about the
> cgroup being deleted or a task being terminated and the bookkeeping of
> the PSI flags getting wrong, maybe some data race. I found the whole
> PSI code with per-task flags, per-cpu per-cgroup counters and flags
> somewhat obscure (but somebody else's code is always obscure, of
> course);

It really is a bit tricky, performance and all that.

> I thought there was a lot of potential for mistakes with the
> bookkeeping, but I found nothing specific.
> 
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; the servers that go into this state will always
> appear overloaded and that would lead to killing all the workload
> processes forever.

Best case would be if you could somehow find a reproducer, but
I realize this might be tricky.


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-12  9:48     ` Peter Zijlstra
@ 2024-06-12 10:20       ` Max Kellermann
  2024-11-21 14:02         ` Gao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Max Kellermann @ 2024-06-12 10:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suren Baghdasaryan, Johannes Weiner, linux-kernel

On Wed, Jun 12, 2024 at 11:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
> The erofs one is also not entirely obvious, but irrelevant if you're not using
> it... the below should make it a little more obvious, but what do I know.

We do use erofs a lot, and I read that very function the other day -
it is weird code with two loop levels plus continue and even goto; but
I thought it was okay. psi_memstall_enter() is only ever called if
bio!=NULL, and the function takes care to call psi_memstall_leave()
when NULLing bio. Therefore I think your patch is not necessary (but
adds a tiny bit of overhead). What am I missing?

> Best case would be if you could somehow find a reproducer, but
> I realize this might be tricky.

Oh, I wish. I tried for several days, adding artificial delays
everywhere, in order to make some race more likely; I created and
deleted millions of cgroups and killed just as many processes under
(artificial) memory pressure, but nothing.

Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-12  6:49   ` Max Kellermann
  2024-06-12  9:48     ` Peter Zijlstra
@ 2024-06-12 15:17     ` Suren Baghdasaryan
  1 sibling, 0 replies; 19+ messages in thread
From: Suren Baghdasaryan @ 2024-06-12 15:17 UTC (permalink / raw)
  To: Max Kellermann; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Tue, Jun 11, 2024 at 11:49 PM Max Kellermann
<max.kellermann@ionos.com> wrote:
>
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
>
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Doh! I made an assumption that this can happen while it should not,
unless psi_memstall_enter()/psi_memstall_leave() are not balanced. My
bad.

Since the issue is hard to reproduce, maybe you could add debugging
code to store _RET_IP_ inside the task_struct at the end of
psi_memstall_enter() and clear it inside psi_memstall_leave(). Then in
do_exit() you check if it's still set and generate a warning reporting
recorded _RET_IP_. This should hint us to which psi_memstall_enter()
was missing its psi_memstall_leave().

>
> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
>
> I was thinking of something slightly different; something about the
> cgroup being deleted or a task being terminated and the bookkeeping of
> the PSI flags getting wrong, maybe some data race. I found the whole
> PSI code with per-task flags, per-cpu per-cgroup counters and flags
> somewhat obscure (but somebody else's code is always obscure, of
> course); I thought there was a lot of potential for mistakes with the
> bookkeeping, but I found nothing specific.
>
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; the servers that go into this state will always
> appear overloaded and that would lead to killing all the workload
> processes forever.
>
> Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-12  5:01 ` Suren Baghdasaryan
  2024-06-12  6:49   ` Max Kellermann
@ 2024-08-05 12:34   ` Max Kellermann
  2024-08-06 15:56     ` Suren Baghdasaryan
  1 sibling, 1 reply; 19+ messages in thread
From: Max Kellermann @ 2024-08-05 12:34 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> I think you can check if this theory pans out by adding a WARN_ON() at
> the end of psi_task_switch():
>
> void psi_task_switch(struct task_struct *prev, struct task_struct
> *next, bool sleep)
> {
> ...
>         if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
>                 clear &= ~TSK_ONCPU;
>                 for (; group; group = group->parent)
>                         psi_group_change(group, cpu, clear, set, now,
> wake_clock);
>         }
> +        WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
> }

Our servers have been running with this experimental WARN_ON line you
suggested, and today I found one of them had produced more than 300
warnings since it was rebooted yesterday:

 ------------[ cut here ]------------
 WARNING: CPU: 38 PID: 448145 at kernel/sched/psi.c:992
psi_task_switch+0x114/0x218
 Modules linked in:
 CPU: 38 PID: 448145 Comm: php-cgi8.1 Not tainted 6.9.12-cm4all1-ampere+ #178
 Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023
 pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : psi_task_switch+0x114/0x218
 lr : psi_task_switch+0x98/0x218
 sp : ffff8000c5493c80
 x29: ffff8000c5493c80 x28: ffff0837ccd18640 x27: ffff07ff81ee3300
 x26: ffff0837ccd18000 x25: 0000000000000000 x24: 0000000000000001
 x23: 000000000000001c x22: 0000000000000026 x21: 00003010d610f448
 x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
 x14: 0000000000000004 x13: ffff08072ca62000 x12: ffffc22f32e1a000
 x11: 0000000000000001 x10: 0000000000000026 x9 : ffffc22f3129b150
 x8 : ffffc22f32e1aa88 x7 : 000000000000000c x6 : 0000d7ed3b360390
 x5 : ffff08faff6fb88c x4 : 0000000000000000 x3 : 0000000000e9de78
 x2 : 000000008ff70300 x1 : 000000008ff8d518 x0 : 0000000000000002
 Call trace:
  psi_task_switch+0x114/0x218
  __schedule+0x390/0xbc8
  do_task_dead+0x64/0xa0
  do_exit+0x5ac/0x9c0
  __arm64_sys_exit+0x1c/0x28
  invoke_syscall.constprop.0+0x54/0xf0
  do_el0_svc+0xa4/0xc8
  el0_svc+0x18/0x58
  el0t_64_sync_handler+0xf8/0x128
  el0t_64_sync+0x14c/0x150
 ---[ end trace 0000000000000000 ]---

And indeed, it has a constant (and bogus) memory pressure value:

 # cat /proc/pressure/memory
 some avg10=99.99 avg60=98.65 avg300=98.70 total=176280880996
 full avg10=98.16 avg60=96.70 avg300=96.82 total=173950123267

It's taken nearly two months, and none of the other servers have
produced this; this seems to be a really rare bug.

Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-08-05 12:34   ` Max Kellermann
@ 2024-08-06 15:56     ` Suren Baghdasaryan
  2024-08-12  8:06       ` Max Kellermann
  2024-11-20 16:30       ` Max Kellermann
  0 siblings, 2 replies; 19+ messages in thread
From: Suren Baghdasaryan @ 2024-08-06 15:56 UTC (permalink / raw)
  To: Max Kellermann; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Mon, Aug 5, 2024 at 5:35 AM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > I think you can check if this theory pans out by adding a WARN_ON() at
> > the end of psi_task_switch():
> >
> > void psi_task_switch(struct task_struct *prev, struct task_struct
> > *next, bool sleep)
> > {
> > ...
> >         if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
> >                 clear &= ~TSK_ONCPU;
> >                 for (; group; group = group->parent)
> >                         psi_group_change(group, cpu, clear, set, now, wake_clock);
> >         }
> > +        WARN_ON(prev->__state & TASK_DEAD && prev->psi_flags & TSK_MEMSTALL);
> > }
>
> Our servers have been running with this experimental WARN_ON line you
> suggested, and today I found one of them had produced more than 300
> warnings since it was rebooted yesterday:
>
>  ------------[ cut here ]------------
>  WARNING: CPU: 38 PID: 448145 at kernel/sched/psi.c:992
> psi_task_switch+0x114/0x218
>  Modules linked in:
>  CPU: 38 PID: 448145 Comm: php-cgi8.1 Not tainted 6.9.12-cm4all1-ampere+ #178
>  Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023
>  pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : psi_task_switch+0x114/0x218
>  lr : psi_task_switch+0x98/0x218
>  sp : ffff8000c5493c80
>  x29: ffff8000c5493c80 x28: ffff0837ccd18640 x27: ffff07ff81ee3300
>  x26: ffff0837ccd18000 x25: 0000000000000000 x24: 0000000000000001
>  x23: 000000000000001c x22: 0000000000000026 x21: 00003010d610f448
>  x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
>  x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>  x14: 0000000000000004 x13: ffff08072ca62000 x12: ffffc22f32e1a000
>  x11: 0000000000000001 x10: 0000000000000026 x9 : ffffc22f3129b150
>  x8 : ffffc22f32e1aa88 x7 : 000000000000000c x6 : 0000d7ed3b360390
>  x5 : ffff08faff6fb88c x4 : 0000000000000000 x3 : 0000000000e9de78
>  x2 : 000000008ff70300 x1 : 000000008ff8d518 x0 : 0000000000000002
>  Call trace:
>   psi_task_switch+0x114/0x218
>   __schedule+0x390/0xbc8
>   do_task_dead+0x64/0xa0
>   do_exit+0x5ac/0x9c0
>   __arm64_sys_exit+0x1c/0x28
>   invoke_syscall.constprop.0+0x54/0xf0
>   do_el0_svc+0xa4/0xc8
>   el0_svc+0x18/0x58
>   el0t_64_sync_handler+0xf8/0x128
>   el0t_64_sync+0x14c/0x150
>  ---[ end trace 0000000000000000 ]---
>
> And indeed, it has a constant (and bogus) memory pressure value:
>
>  # cat /proc/pressure/memory
>  some avg10=99.99 avg60=98.65 avg300=98.70 total=176280880996
>  full avg10=98.16 avg60=96.70 avg300=96.82 total=173950123267

Hmm. The original scenario I was thinking about when I proposed this
WARN_ON() was deemed impossible, so I think the only other possibility
is that the task being killed somehow skipped psi_memstall_leave()
before its death... Did you have the instrumentation I suggested to
track imbalance between psi_memstall_enter()/psi_memstall_leave() and
to record the _RET_IP_? If so, did it trigger at all?

>
> It's taken nearly two months, and none of the other servers have
> produced this; this seems to be a really rare bug.
>
> Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-08-06 15:56     ` Suren Baghdasaryan
@ 2024-08-12  8:06       ` Max Kellermann
  2024-08-12 21:33         ` Suren Baghdasaryan
  2024-11-20 16:30       ` Max Kellermann
  1 sibling, 1 reply; 19+ messages in thread
From: Max Kellermann @ 2024-08-12  8:06 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Tue, Aug 6, 2024 at 5:56 PM Suren Baghdasaryan <surenb@google.com> wrote:
> Hmm. The original scenario I was thinking about when I proposed this
> WARN_ON() was deemed impossible, so I think the only other possibility
> is that the task being killed somehow skipped psi_memstall_leave()
> before its death... Did you have the instrumentation I suggested to
> track imbalance between psi_memstall_enter()/psi_memstall_leave() and
> to record the _RET_IP_? If so, did it trigger at all?

No, unfortunately I did not have the instrumentation because I don't
know how this works (and didn't have the time to find out). If you
have a patch for me, I can merge it into our kernel fork so we have
the data next time it occurs.

Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-08-12  8:06       ` Max Kellermann
@ 2024-08-12 21:33         ` Suren Baghdasaryan
  0 siblings, 0 replies; 19+ messages in thread
From: Suren Baghdasaryan @ 2024-08-12 21:33 UTC (permalink / raw)
  To: Max Kellermann; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Mon, Aug 12, 2024 at 1:06 AM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Tue, Aug 6, 2024 at 5:56 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > Hmm. The original scenario I was thinking about when I proposed this
> > WARN_ON() was deemed impossible, so I think the only other possibility
> > is that the task being killed somehow skipped psi_memstall_leave()
> > before its death... Did you have the instrumentation I suggested to
> > track imbalance between psi_memstall_enter()/psi_memstall_leave() and
> > to record the _RET_IP_? If so, did it trigger at all?
>
> No, unfortunately I did not have the instrumentation because I don't
> know how this works (and didn't have the time to find out). If you
> have a patch for me, I can merge it into our kernel fork so we have
> the data next time it occurs.

Ok. Will you be able to enable CONFIG_DEBUG_INFO for this test to
symbolize reported _RET_IP_ once we capture it?

>
> Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-08-06 15:56     ` Suren Baghdasaryan
  2024-08-12  8:06       ` Max Kellermann
@ 2024-11-20 16:30       ` Max Kellermann
  2024-11-20 16:56         ` Suren Baghdasaryan
  1 sibling, 1 reply; 19+ messages in thread
From: Max Kellermann @ 2024-11-20 16:30 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: Johannes Weiner, Peter Zijlstra, linux-kernel

On Tue, Aug 6, 2024 at 5:56 PM Suren Baghdasaryan <surenb@google.com> wrote:

> Hmm. The original scenario I was thinking about when I proposed this
> WARN_ON() was deemed impossible, so I think the only other possibility
> is that the task being killed somehow skipped psi_memstall_leave()
> before its death... Did you have the instrumentation I suggested to
> track imbalance between psi_memstall_enter()/psi_memstall_leave() and
> to record the _RET_IP_? If so, did it trigger at all?

Hi Suren,

it has taken a few months, but today, this bug was trapped by 6.11.9
with your patch:

 ------------[ cut here ]------------
 Stall from readahead_expand+0xcf/0x1d0 was never cleared
 WARNING: CPU: 2 PID: 22644 at kernel/exit.c:933 do_exit+0x8c7/0x970
 Modules linked in:
 CPU: 2 UID: 2148030876 PID: 22644 Comm: php-was Not tainted
6.11.9-cm4all2-vm+ #242
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 RIP: 0010:do_exit+0x8c7/0x970
 Code: 75 bb 4c 89 ef 48 89 dd e8 b6 21 0d 00 48 c7 c7 80 60 60 a5 e8
ca dd ba 00 e9 5b fa ff ff 48 c7 c7 d8 4d fc a4 e8 e9 9e ff ff <0f> 0b
e9 1d fe ff ff e8 4d e6 24 00 e9 9b fd ff ff 48 8b 78 68 e8
 RSP: 0018:ffffc2bdd2fafc50 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffff9f96762bab80 RCX: 0000000000000027
 RDX: ffff9fb1bee9b188 RSI: 0000000000000001 RDI: ffff9fb1bee9b180
 RBP: ffffc2bdd2faf6a8 R08: 00000000ffffbfff R09: 0000000000000001
 R10: 0000000000000003 R11: ffff9fb1beca0000 R12: ffffc2bdd2fafc60
 R13: ffff9faa09fed5d8 R14: 0000000000000001 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff9fb1bee80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f0bfd7b0000 CR3: 000000027762e003 CR4: 00000000003706b0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __warn+0x7c/0x110
  ? do_exit+0x8c7/0x970
  ? report_bug+0x14c/0x170
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x13/0x60
  ? asm_exc_invalid_op+0x16/0x20
  ? do_exit+0x8c7/0x970
  do_group_exit+0x2c/0x80
  __x64_sys_exit_group+0x14/0x20
  x64_sys_call+0x14aa/0x16a0
  do_syscall_64+0x64/0x100
  ? __memcg_kmem_charge_page+0xdc/0x1d0
  ? do_fault+0x270/0x490
  ? __handle_mm_fault+0x321/0x1280
  ? __count_memcg_events+0x53/0xf0
  ? handle_mm_fault+0xa0/0x260
  ? do_user_addr_fault+0x33b/0x5b0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f0c00c5365a
 Code: Unable to access opcode bytes at 0x7f0c00c53630.
 RSP: 002b:00007ffc678fba38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0c00c5365a
 RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 00007ffc678fba88 R08: 00007ffc678fb938 R09: 0000000000000018
 R10: 0000000000000008 R11: 0000000000000246 R12: 00007f0c004db8a0
 R13: 00007ffc678fbab0 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 ---[ end trace 0000000000000000 ]---
 ------------[ cut here ]------------
 Stall from readahead_expand+0xcf/0x1d0 was never cleared
 WARNING: CPU: 2 PID: 22644 at kernel/sched/psi.c:989
psi_task_switch+0x126/0x230
 Modules linked in:
 CPU: 2 UID: 2148030876 PID: 22644 Comm: php-was Tainted: G        W
       6.11.9-cm4all2-vm+ #242
 Tainted: [W]=WARN
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 RIP: 0010:psi_task_switch+0x126/0x230
 Code: f6 75 e6 41 f6 44 24 18 80 74 36 41 f6 84 24 a8 08 00 00 02 74
2b 49 8b b4 24 b0 08 00 00 48 c7 c7 d8 4d fc a4 e8 5a de f9 ff <0f> 0b
41 f6 44 24 18 80 74 0d 41 f6 84 24 a8 08 00 00 02 74 02 0f
 RSP: 0018:ffffc2bdd2fafb48 EFLAGS: 00010086
 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000027
 RDX: ffff9fb1bee9b188 RSI: 0000000000000001 RDI: ffff9fb1bee9b180
 RBP: 000000000000001c R08: 00000000ffffbfff R09: 0000000000000001
 R10: 0000000000000003 R11: ffff9fb1beca0000 R12: ffff9f96762bab80
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff9fb1bee80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f0bfd7b0000 CR3: 000000027762e003 CR4: 00000000003706b0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __warn+0x7c/0x110
  ? psi_task_switch+0x126/0x230
  ? report_bug+0x14c/0x170
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x13/0x60
  ? asm_exc_invalid_op+0x16/0x20
  ? psi_task_switch+0x126/0x230
  ? psi_task_switch+0x126/0x230
  __schedule+0x954/0x1150
  do_task_dead+0x3e/0x40
  do_exit+0x6f0/0x970
  do_group_exit+0x2c/0x80
  __x64_sys_exit_group+0x14/0x20
  x64_sys_call+0x14aa/0x16a0
  do_syscall_64+0x64/0x100
  ? __memcg_kmem_charge_page+0xdc/0x1d0
  ? do_fault+0x270/0x490
  ? __handle_mm_fault+0x321/0x1280
  ? __count_memcg_events+0x53/0xf0
  ? handle_mm_fault+0xa0/0x260
  ? do_user_addr_fault+0x33b/0x5b0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f0c00c5365a
 Code: Unable to access opcode bytes at 0x7f0c00c53630.
 RSP: 002b:00007ffc678fba38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0c00c5365a
 RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 00007ffc678fba88 R08: 00007ffc678fb938 R09: 0000000000000018
 R10: 0000000000000008 R11: 0000000000000246 R12: 00007f0c004db8a0
 R13: 00007ffc678fbab0 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 ---[ end trace 0000000000000000 ]---
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 22644 at kernel/sched/psi.c:994
psi_task_switch+0x13b/0x230
 Modules linked in:
 CPU: 2 UID: 2148030876 PID: 22644 Comm: php-was Tainted: G        W
       6.11.9-cm4all2-vm+ #242
 Tainted: [W]=WARN
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 RIP: 0010:psi_task_switch+0x13b/0x230
 Code: 2b 49 8b b4 24 b0 08 00 00 48 c7 c7 d8 4d fc a4 e8 5a de f9 ff
0f 0b 41 f6 44 24 18 80 74 0d 41 f6 84 24 a8 08 00 00 02 74 02 <0f> 0b
48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41
 RSP: 0018:ffffc2bdd2fafb48 EFLAGS: 00010002
 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000027
 RDX: ffff9fb1bee9b188 RSI: 0000000000000001 RDI: ffff9fb1bee9b180
 RBP: 000000000000001c R08: 00000000ffffbfff R09: 0000000000000001
 R10: 0000000000000003 R11: ffff9fb1beca0000 R12: ffff9f96762bab80
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff9fb1bee80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f0bfd7b0000 CR3: 000000027762e003 CR4: 00000000003706b0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __warn+0x7c/0x110
  ? psi_task_switch+0x13b/0x230
  ? report_bug+0x14c/0x170
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x13/0x60
  ? asm_exc_invalid_op+0x16/0x20
  ? psi_task_switch+0x13b/0x230
  ? psi_task_switch+0x126/0x230
  __schedule+0x954/0x1150
  do_task_dead+0x3e/0x40
  do_exit+0x6f0/0x970
  do_group_exit+0x2c/0x80
  __x64_sys_exit_group+0x14/0x20
  x64_sys_call+0x14aa/0x16a0
  do_syscall_64+0x64/0x100
  ? __memcg_kmem_charge_page+0xdc/0x1d0
  ? do_fault+0x270/0x490
  ? __handle_mm_fault+0x321/0x1280
  ? __count_memcg_events+0x53/0xf0
  ? handle_mm_fault+0xa0/0x260
  ? do_user_addr_fault+0x33b/0x5b0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  ? clear_bhb_loop+0x45/0xa0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f0c00c5365a
 Code: Unable to access opcode bytes at 0x7f0c00c53630.
 RSP: 002b:00007ffc678fba38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0c00c5365a
 RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 00007ffc678fba88 R08: 00007ffc678fb938 R09: 0000000000000018
 R10: 0000000000000008 R11: 0000000000000246 R12: 00007f0c004db8a0
 R13: 00007ffc678fbab0 R14: 0000000000000000 R15: 0000000000000000
  </TASK>
 ---[ end trace 0000000000000000 ]---


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-11-20 16:30       ` Max Kellermann
@ 2024-11-20 16:56         ` Suren Baghdasaryan
  2024-11-21  4:51           ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Suren Baghdasaryan @ 2024-11-20 16:56 UTC (permalink / raw)
  To: Max Kellermann
  Cc: Johannes Weiner, Peter Zijlstra, linux-kernel, Christoph Hellwig

On Wed, Nov 20, 2024 at 8:30 AM Max Kellermann <max.kellermann@ionos.com> wrote:
>
> On Tue, Aug 6, 2024 at 5:56 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> > Hmm. The original scenario I was thinking about when I proposed this
> > WARN_ON() was deemed impossible, so I think the only other possibility
> > is that the task being killed somehow skipped psi_memstall_leave()
> > before its death... Did you have the instrumentation I suggested to
> > track imbalance between psi_memstall_enter()/psi_memstall_leave() and
> > to record the _RET_IP_? If so, did it trigger at all?
>
> Hi Suren,
>
> it has taken a few months, but today, this bug was trapped by 6.11.9
> with your patch:
>
>  ------------[ cut here ]------------
>  Stall from readahead_expand+0xcf/0x1d0 was never cleared

This is interesting. readahead_expand() indeed has
psi_memstall_enter() without a matching psi_memstall_leave():
https://elixir.bootlin.com/linux/v6.11.9/source/mm/readahead.c#L727
and https://elixir.bootlin.com/linux/v6.11.9/source/mm/readahead.c#L754.
Looks like this was introduced in [1].  I'm not sure if that's a bug
or psi_memstall_leave() is supposed to be called from somewhere else.
CC'ing Christoph to clarify.

[1] 176042404ee6 "mm: add PSI accounting around ->read_folio and
->readahead calls"

>  WARNING: CPU: 2 PID: 22644 at kernel/exit.c:933 do_exit+0x8c7/0x970
>  Modules linked in:
>  CPU: 2 UID: 2148030876 PID: 22644 Comm: php-was Not tainted
> 6.11.9-cm4all2-vm+ #242
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>  RIP: 0010:do_exit+0x8c7/0x970
>  Code: 75 bb 4c 89 ef 48 89 dd e8 b6 21 0d 00 48 c7 c7 80 60 60 a5 e8
> ca dd ba 00 e9 5b fa ff ff 48 c7 c7 d8 4d fc a4 e8 e9 9e ff ff <0f> 0b
> e9 1d fe ff ff e8 4d e6 24 00 e9 9b fd ff ff 48 8b 78 68 e8
>  RSP: 0018:ffffc2bdd2fafc50 EFLAGS: 00010286
>  RAX: 0000000000000000 RBX: ffff9f96762bab80 RCX: 0000000000000027
>  RDX: ffff9fb1bee9b188 RSI: 0000000000000001 RDI: ffff9fb1bee9b180
>  RBP: ffffc2bdd2faf6a8 R08: 00000000ffffbfff R09: 0000000000000001
>  R10: 0000000000000003 R11: ffff9fb1beca0000 R12: ffffc2bdd2fafc60
>  R13: ffff9faa09fed5d8 R14: 0000000000000001 R15: 0000000000000000
>  FS:  0000000000000000(0000) GS:ffff9fb1bee80000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007f0bfd7b0000 CR3: 000000027762e003 CR4: 00000000003706b0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <TASK>
>   ? __warn+0x7c/0x110
>   ? do_exit+0x8c7/0x970
>   ? report_bug+0x14c/0x170
>   ? handle_bug+0x53/0x90
>   ? exc_invalid_op+0x13/0x60
>   ? asm_exc_invalid_op+0x16/0x20
>   ? do_exit+0x8c7/0x970
>   do_group_exit+0x2c/0x80
>   __x64_sys_exit_group+0x14/0x20
>   x64_sys_call+0x14aa/0x16a0
>   do_syscall_64+0x64/0x100
>   ? __memcg_kmem_charge_page+0xdc/0x1d0
>   ? do_fault+0x270/0x490
>   ? __handle_mm_fault+0x321/0x1280
>   ? __count_memcg_events+0x53/0xf0
>   ? handle_mm_fault+0xa0/0x260
>   ? do_user_addr_fault+0x33b/0x5b0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>  RIP: 0033:0x7f0c00c5365a
>  Code: Unable to access opcode bytes at 0x7f0c00c53630.
>  RSP: 002b:00007ffc678fba38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0c00c5365a
>  RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000000
>  RBP: 00007ffc678fba88 R08: 00007ffc678fb938 R09: 0000000000000018
>  R10: 0000000000000008 R11: 0000000000000246 R12: 00007f0c004db8a0
>  R13: 00007ffc678fbab0 R14: 0000000000000000 R15: 0000000000000000
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>  ------------[ cut here ]------------
>  Stall from readahead_expand+0xcf/0x1d0 was never cleared
>  WARNING: CPU: 2 PID: 22644 at kernel/sched/psi.c:989
> psi_task_switch+0x126/0x230
>  Modules linked in:
>  CPU: 2 UID: 2148030876 PID: 22644 Comm: php-was Tainted: G        W
>        6.11.9-cm4all2-vm+ #242
>  Tainted: [W]=WARN
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>  RIP: 0010:psi_task_switch+0x126/0x230
>  Code: f6 75 e6 41 f6 44 24 18 80 74 36 41 f6 84 24 a8 08 00 00 02 74
> 2b 49 8b b4 24 b0 08 00 00 48 c7 c7 d8 4d fc a4 e8 5a de f9 ff <0f> 0b
> 41 f6 44 24 18 80 74 0d 41 f6 84 24 a8 08 00 00 02 74 02 0f
>  RSP: 0018:ffffc2bdd2fafb48 EFLAGS: 00010086
>  RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000027
>  RDX: ffff9fb1bee9b188 RSI: 0000000000000001 RDI: ffff9fb1bee9b180
>  RBP: 000000000000001c R08: 00000000ffffbfff R09: 0000000000000001
>  R10: 0000000000000003 R11: ffff9fb1beca0000 R12: ffff9f96762bab80
>  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  FS:  0000000000000000(0000) GS:ffff9fb1bee80000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007f0bfd7b0000 CR3: 000000027762e003 CR4: 00000000003706b0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <TASK>
>   ? __warn+0x7c/0x110
>   ? psi_task_switch+0x126/0x230
>   ? report_bug+0x14c/0x170
>   ? handle_bug+0x53/0x90
>   ? exc_invalid_op+0x13/0x60
>   ? asm_exc_invalid_op+0x16/0x20
>   ? psi_task_switch+0x126/0x230
>   ? psi_task_switch+0x126/0x230
>   __schedule+0x954/0x1150
>   do_task_dead+0x3e/0x40
>   do_exit+0x6f0/0x970
>   do_group_exit+0x2c/0x80
>   __x64_sys_exit_group+0x14/0x20
>   x64_sys_call+0x14aa/0x16a0
>   do_syscall_64+0x64/0x100
>   ? __memcg_kmem_charge_page+0xdc/0x1d0
>   ? do_fault+0x270/0x490
>   ? __handle_mm_fault+0x321/0x1280
>   ? __count_memcg_events+0x53/0xf0
>   ? handle_mm_fault+0xa0/0x260
>   ? do_user_addr_fault+0x33b/0x5b0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>  RIP: 0033:0x7f0c00c5365a
>  Code: Unable to access opcode bytes at 0x7f0c00c53630.
>  RSP: 002b:00007ffc678fba38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0c00c5365a
>  RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000000
>  RBP: 00007ffc678fba88 R08: 00007ffc678fb938 R09: 0000000000000018
>  R10: 0000000000000008 R11: 0000000000000246 R12: 00007f0c004db8a0
>  R13: 00007ffc678fbab0 R14: 0000000000000000 R15: 0000000000000000
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>  ------------[ cut here ]------------
>  WARNING: CPU: 2 PID: 22644 at kernel/sched/psi.c:994
> psi_task_switch+0x13b/0x230
>  Modules linked in:
>  CPU: 2 UID: 2148030876 PID: 22644 Comm: php-was Tainted: G        W
>        6.11.9-cm4all2-vm+ #242
>  Tainted: [W]=WARN
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>  RIP: 0010:psi_task_switch+0x13b/0x230
>  Code: 2b 49 8b b4 24 b0 08 00 00 48 c7 c7 d8 4d fc a4 e8 5a de f9 ff
> 0f 0b 41 f6 44 24 18 80 74 0d 41 f6 84 24 a8 08 00 00 02 74 02 <0f> 0b
> 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41
>  RSP: 0018:ffffc2bdd2fafb48 EFLAGS: 00010002
>  RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000027
>  RDX: ffff9fb1bee9b188 RSI: 0000000000000001 RDI: ffff9fb1bee9b180
>  RBP: 000000000000001c R08: 00000000ffffbfff R09: 0000000000000001
>  R10: 0000000000000003 R11: ffff9fb1beca0000 R12: ffff9f96762bab80
>  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  FS:  0000000000000000(0000) GS:ffff9fb1bee80000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007f0bfd7b0000 CR3: 000000027762e003 CR4: 00000000003706b0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <TASK>
>   ? __warn+0x7c/0x110
>   ? psi_task_switch+0x13b/0x230
>   ? report_bug+0x14c/0x170
>   ? handle_bug+0x53/0x90
>   ? exc_invalid_op+0x13/0x60
>   ? asm_exc_invalid_op+0x16/0x20
>   ? psi_task_switch+0x13b/0x230
>   ? psi_task_switch+0x126/0x230
>   __schedule+0x954/0x1150
>   do_task_dead+0x3e/0x40
>   do_exit+0x6f0/0x970
>   do_group_exit+0x2c/0x80
>   __x64_sys_exit_group+0x14/0x20
>   x64_sys_call+0x14aa/0x16a0
>   do_syscall_64+0x64/0x100
>   ? __memcg_kmem_charge_page+0xdc/0x1d0
>   ? do_fault+0x270/0x490
>   ? __handle_mm_fault+0x321/0x1280
>   ? __count_memcg_events+0x53/0xf0
>   ? handle_mm_fault+0xa0/0x260
>   ? do_user_addr_fault+0x33b/0x5b0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   ? clear_bhb_loop+0x45/0xa0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>  RIP: 0033:0x7f0c00c5365a
>  Code: Unable to access opcode bytes at 0x7f0c00c53630.
>  RSP: 002b:00007ffc678fba38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0c00c5365a
>  RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000000
>  RBP: 00007ffc678fba88 R08: 00007ffc678fb938 R09: 0000000000000018
>  R10: 0000000000000008 R11: 0000000000000246 R12: 00007f0c004db8a0
>  R13: 00007ffc678fbab0 R14: 0000000000000000 R15: 0000000000000000
>   </TASK>
>  ---[ end trace 0000000000000000 ]---


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-11-20 16:56         ` Suren Baghdasaryan
@ 2024-11-21  4:51           ` Christoph Hellwig
  2024-11-21  8:43             ` Max Kellermann
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-11-21  4:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Max Kellermann, Johannes Weiner, Peter Zijlstra, linux-kernel,
	Christoph Hellwig

On Wed, Nov 20, 2024 at 08:56:21AM -0800, Suren Baghdasaryan wrote:
> This is interesting. readahead_expand() indeed has
> psi_memstall_enter() without a matching psi_memstall_leave():
> https://elixir.bootlin.com/linux/v6.11.9/source/mm/readahead.c#L727
> and https://elixir.bootlin.com/linux/v6.11.9/source/mm/readahead.c#L754.
> Looks like this was introduced in [1].  I'm not sure if that's a bug
> or psi_memstall_leave() is supposed to be called from somewhere else.
> CC'ing Christoph to clarify.

So the readahead_control structure tracks if psi_memstall_enter has
been called using the _workingset member, and the assumption was
that readahead_expand is only called from aops->readahead so that
read_pages() takes care of the psi_memstall_leave.  This still seems
reasonable given that readahead_expand operates on the ractl structure
only used by ->readahead, and tested fine back then.

Something seems to be going wrong here, though; the trace below
doesn't really tell me anything about the workload or the file system
used, or whether this is even calling into readahead.



* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-11-21  4:51           ` Christoph Hellwig
@ 2024-11-21  8:43             ` Max Kellermann
  2024-11-21 13:18               ` Gao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Max Kellermann @ 2024-11-21  8:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Suren Baghdasaryan, Johannes Weiner, Peter Zijlstra, linux-kernel

On Thu, Nov 21, 2024 at 5:51 AM Christoph Hellwig <hch@lst.de> wrote:
> Something seems to be going wrong here, though, but the trace below
> doesn't really tell me anything about the workload or file system
> used, and if this is even calling into readahead.

In case you were asking :-) these are web servers (shared webhosting),
running PHP most of the time. The host itself runs on an ext4, but I
don't think the ext4 system partition has anything to do with this.
PHP runs in containers that are erofs, the PHP sources plus
memory-mapped opcache files are in btrfs (read-only snapshot) and the
runtime data is on NFS or Ceph (there have been stalls on both server
types).
My limited experience with Linux MM suggests that this happens during
the page fault of a memory mapped file. PHP processes usually mmap
only files from erofs and btrfs.
The servers are always somewhat under memory pressure; our container
manager keeps as many containers alive as possible and only shuts them
down when the server reaches the memory limit. At any given time,
there are thousands of containers.

Max


* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-11-21  8:43             ` Max Kellermann
@ 2024-11-21 13:18               ` Gao Xiang
  2024-11-28 10:00                 ` Max Kellermann
  0 siblings, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2024-11-21 13:18 UTC (permalink / raw)
  To: Max Kellermann, Christoph Hellwig
  Cc: Suren Baghdasaryan, Johannes Weiner, Peter Zijlstra, linux-kernel

Hi Max!

On 2024/11/21 16:43, Max Kellermann wrote:
> On Thu, Nov 21, 2024 at 5:51 AM Christoph Hellwig <hch@lst.de> wrote:
>> Something seems to be going wrong here, though, but the trace below
>> doesn't really tell me anything about the workload or file system
>> used, and if this is even calling into readahead.
> 
> In case you were asking :-) these are web servers (shared webhosting),
> running PHP most of the time. The host itself runs on an ext4, but I
> don't think the ext4 system partition has anything to do with this.
> PHP runs in containers that are erofs, the PHP sources plus
> memory-mapped opcache files are in btrfs (read-only snapshot) and the
> runtime data is on NFS or Ceph (there have been stalls on both server
> types).
> My limited experience with Linux MM suggests that this happens during
> the page fault of a memory mapped file. PHP processes usually mmap
> only files from erofs and btrfs.
> The servers are always somewhat under memory pressure; our container
> manager keeps as many containers alive as possible and only shuts them
> down when the server reaches the memory limit. At any given time,
> there are thousands of containers.

Just saw this. I guess your _recent_ 6.11.9 bug is actually
related to EROFS since EROFS uses readahead_expand().  I think
your recent report was introduced by a recent backport fix
commit 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.11.9&id=9cfa199bcbbbba31cbf97b2786f44f4464f3f29a

bio can be NULL after this patch and causes
unbalanced psi_memstall_{enter,leave}().  It can be fixed as
(the diff below could be damaged due to my email client):

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 01f147505487..19ef4ff2a134 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1792,9 +1792,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
                         erofs_fscache_submit_bio(bio);
                 else
                         submit_bio(bio);
-               if (memstall)
-                       psi_memstall_leave(&pflags);
         }
+       if (memstall)
+               psi_memstall_leave(&pflags);

         /*
          * although background is preferred, no one is pending for submission.

But your original report predates the very recent commit
9e2f9d34dd12; before this commit, bio cannot be NULL, so
I don't think they are the same issue.

I will submit a formal fix for the recent bug later,
thanks!

Thanks,
Gao Xiang



* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-06-12 10:20       ` Max Kellermann
@ 2024-11-21 14:02         ` Gao Xiang
  0 siblings, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2024-11-21 14:02 UTC (permalink / raw)
  To: Max Kellermann, Peter Zijlstra
  Cc: Suren Baghdasaryan, Johannes Weiner, linux-kernel



On 2024/6/12 18:20, Max Kellermann wrote:
> On Wed, Jun 12, 2024 at 11:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> The erofs one is also not entirely obvious, but irrelevant if you're not using
>> it... the below should make it a little more obvious, but what do I know.
> 
> We do use erofs a lot, and I read that very function the other day -
> it is weird code with two loop levels plus continue and even goto; but
> I thought it was okay. psi_memstall_enter() is only ever called if

The outer loop handles the whole in-flight chain (all compression
extents that need to be decompressed).

The inner loop handles each compression extent (called a pcluster
-- physical cluster -- in EROFS) in the chain; a pcluster can span
several consecutive physical blocks, so it needs its own loop:

   If a physical block doesn't need real I/O (because the data is
already cached, for example), it does "continue;".
   Otherwise it checks whether the I/O is contiguous; if not, it
submits the current bio, then gets a new bio if needed and adds the
physical block to it; if the physical block cannot be added to the
bio, it retries:
    -- some similar logic can also be found in:
       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/mpage.c?h=v6.12#n298

Hopefully it helps.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-11-21 13:18               ` Gao Xiang
@ 2024-11-28 10:00                 ` Max Kellermann
  2024-11-28 10:46                   ` Gao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Max Kellermann @ 2024-11-28 10:00 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Suren Baghdasaryan, Johannes Weiner,
	Peter Zijlstra, linux-kernel

On Thu, Nov 21, 2024 at 2:18 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> Just saw this. I guess your _recent_ 6.11.9 bug is actually
> related to EROFS since EROFS uses readahead_expand().  I think
> your recent report was introduced by a recent backport fix
> commit 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly")
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.11.9&id=9cfa199bcbbbba31cbf97b2786f44f4464f3f29a
>
> bio can be NULL after this patch and causes
> unbalanced psi_memstall_{enter,leave}().  It can be fixed as
> (the diff below could be damaged due to my email client):

With your patch, the PSI warning (from Suren's debugging patch) fired
again last night, which means there may be other instances of this
bug left.

 ------------[ cut here ]------------
 Stall from readahead_expand+0xca/0x1d0 was never cleared
 WARNING: CPU: 133 PID: 91645 at kernel/sched/psi.c:989
psi_task_switch+0x126/0x230
 Modules linked in:
 CPU: 133 UID: 3221274747 PID: 91645 Comm: php-cgi8.1 Tainted: G
 W          6.11.10-cm4all2-es+ #267
 Tainted: [W]=WARN
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 RIP: 0010:psi_task_switch+0x126/0x230
 Code: f6 75 e6 41 f6 44 24 18 80 74 36 41 f6 84 24 d0 08 00 00 02 74
2b 49 8b b4 24 d8 08 00 00 48 c7 c7 20 c8 8d a8 e8 fa 1f f9 ff <0f> 0b
41 f6 44 24 18 80 74 0d 41 f6 84 24 d0 08 00 00 02 74 02 0f
 RSP: 0018:ffff96be9c28b9a8 EFLAGS: 00010086
 RAX: 0000000000000000 RBX: 0000000000000085 RCX: 0000000000000027
 RDX: ffff8997b995c8c8 RSI: 0000000000000001 RDI: ffff8997b995c8c0
 RBP: 000000000000001c R08: 00000000ffff7fff R09: 0000000000000058
 R10: 00000000ffff7fff R11: ffff899abd2a1000 R12: ffff891db3b85c00
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff8997b9940000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f26d7aba480 CR3: 000000c07c61a006 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ? __warn+0x93/0x140
  ? psi_task_switch+0x126/0x230
  ? report_bug+0x174/0x1a0
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x17/0x70
  ? asm_exc_invalid_op+0x16/0x20
  ? psi_task_switch+0x126/0x230
  ? psi_task_switch+0x126/0x230
  __schedule+0x980/0x10f0
  do_task_dead+0x3e/0x40
  do_exit+0x6ed/0x970
  do_group_exit+0x2c/0x80
  __x64_sys_exit_group+0x14/0x20
  x64_sys_call+0x15aa/0x17b0
  do_syscall_64+0x64/0x100
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? get_page_from_freelist+0x60e/0x1140
  ? cgroup_rstat_updated+0x88/0x210
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __mod_memcg_lruvec_state+0x91/0x140
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __lruvec_stat_mod_folio+0x80/0xd0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? folio_add_file_rmap_ptes+0x37/0xb0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? set_pte_range+0xb7/0x280
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? next_uptodate_folio+0x83/0x270
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? filemap_map_pages+0x4a2/0x590
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_fault+0x291/0x4d0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __handle_mm_fault+0x31c/0x1060
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? __count_memcg_events+0x53/0xf0
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? handle_mm_fault+0xb6/0x280
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_user_addr_fault+0x386/0x610
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? exc_page_fault+0x6f/0x120
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f26dad48349
 Code: Unable to access opcode bytes at 0x7f26dad4831f.
 RSP: 002b:00007ffcd05a7848 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
 RAX: ffffffffffffffda RBX: 00007f26dae429e0 RCX: 00007f26dad48349
 RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
 RBP: 0000000000000000 R08: fffffffffffffd48 R09: 000055c238e82190
 R10: 00007f26d8a781a8 R11: 0000000000000246 R12: 00007f26dae429e0
 R13: 00007f26dae482e0 R14: 000000000000001e R15: 00007f26dae482c8
  </TASK>
 ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter
  2024-11-28 10:00                 ` Max Kellermann
@ 2024-11-28 10:46                   ` Gao Xiang
  0 siblings, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2024-11-28 10:46 UTC (permalink / raw)
  To: Max Kellermann
  Cc: Christoph Hellwig, Suren Baghdasaryan, Johannes Weiner,
	Peter Zijlstra, linux-kernel



On 2024/11/28 18:00, Max Kellermann wrote:
> On Thu, Nov 21, 2024 at 2:18 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>> Just saw this. I guess your _recent_ 6.11.9 bug is actually
>> related to EROFS since EROFS uses readahead_expand().  I think
>> your recent report was introduced by a recent backport fix
>> commit 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly")
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.11.9&id=9cfa199bcbbbba31cbf97b2786f44f4464f3f29a
>>
>> bio can be NULL after this patch and causes
>> unbalanced psi_memstall_{enter,leave}().  It can be fixed as
>> (the diff below could be damaged due to my email client):
> 
> With your patch, the PSI warning (from Suren's debugging patch) fired
> again last night, which means there may be other instances of this
> bug left.

Ok, but honestly that is somewhat strange: on the
EROFS side, readahead_expand() can only be used in
.readahead() context and with the original
readahead_control.

I don't have more clues without a deeper understanding
of psi memstall.  From my limited POV, the callers
of psi_memstall_{enter,leave}() all seem fine.

Thanks,
Gao Xiang

> [full stack trace quoted above, snipped]


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-11-28 10:46 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-04  7:16 Bad psi_group_cpu.tasks[NR_MEMSTALL] counter Max Kellermann
2024-06-10  8:28 ` Max Kellermann
2024-06-12  5:01 ` Suren Baghdasaryan
2024-06-12  6:49   ` Max Kellermann
2024-06-12  9:48     ` Peter Zijlstra
2024-06-12 10:20       ` Max Kellermann
2024-11-21 14:02         ` Gao Xiang
2024-06-12 15:17     ` Suren Baghdasaryan
2024-08-05 12:34   ` Max Kellermann
2024-08-06 15:56     ` Suren Baghdasaryan
2024-08-12  8:06       ` Max Kellermann
2024-08-12 21:33         ` Suren Baghdasaryan
2024-11-20 16:30       ` Max Kellermann
2024-11-20 16:56         ` Suren Baghdasaryan
2024-11-21  4:51           ` Christoph Hellwig
2024-11-21  8:43             ` Max Kellermann
2024-11-21 13:18               ` Gao Xiang
2024-11-28 10:00                 ` Max Kellermann
2024-11-28 10:46                   ` Gao Xiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox