From: "Eric W. Biederman" <ebiederm@xmission.com>
To: Zhang Qiao <zhangqiao22@huawei.com>
Cc: lkml <linux-kernel@vger.kernel.org>, <keescook@chromium.org>,
	<tglx@linutronix.de>, Peter Zijlstra <peterz@infradead.org>,
	<elver@google.com>, <legion@kernel.org>, <oleg@redhat.com>,
	<brauner@kernel.org>
Subject: Re: Question about kill a process group
Date: Wed, 13 Apr 2022 10:47:12 -0500
Message-ID: <87ilrd2dfj.fsf@email.froward.int.ebiederm.org>
In-Reply-To: <e70b350e-abfc-7f8d-4590-d18801a9a722@huawei.com> (Zhang Qiao's message of "Wed, 13 Apr 2022 09:56:29 +0800")

Zhang Qiao <zhangqiao22@huawei.com> writes:

> Gentle ping. Any comments on this problem?

Is fork12 a new test?

Is there a real world use case that connects to this?

How many children are being created in this test?  Several million?

I would like to blame this on the old issue of tasklist_lock being
a global lock.  Given the number of child processes (as many as can be
created) I don't think we are hurt much by using a global lock.  The
problem for solvability is that we have a lock.

Fundamentally there must be a lock taken to maintain the parent's
list of children.

I only see SIGQUIT being called once in the parent process so that
should not be an issue.
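
For comparison, signalling a whole process group exactly once from the
parent can be sketched as below.  This is not fork12's code: the function
name signal_group_once, the use of SIGTERM instead of SIGQUIT, and the
separate process group (so the caller itself is not signalled) are all
assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork nchildren processes into one new process
 * group, send SIGTERM to that group once, and return how many children
 * were terminated by the signal.  Returns -1 if a fork() failed.
 */
int signal_group_once(int nchildren)
{
    pid_t pgid = 0;
    int forked = 0, killed = 0;

    for (int i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Child: join the group (pgid 0 means "become leader"). */
            signal(SIGTERM, SIG_DFL);
            setpgid(0, pgid);
            for (;;)
                pause();
        }
        if (pgid == 0)
            pgid = pid;         /* first child leads the group */
        setpgid(pid, pgid);     /* also set from the parent to avoid a race */
        forked++;
    }

    if (forked > 0)
        kill(-pgid, SIGTERM);   /* one signal delivery pass over the group */

    for (int i = 0; i < forked; i++) {
        int status;
        if (waitpid(-pgid, &status, 0) > 0 &&
            WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM)
            killed++;
    }
    return forked == nchildren ? killed : -1;
}
```

Calling signal_group_once(8) forks eight children and walks the group a
single time.  If every child instead signalled the group from its own
handler, the group would be walked once per child.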

There is a minor issue in fork12 that it calls exit(0) instead of
_exit(0) in the children.  Not the problem you are dealing with
but it does look like it can be a distraction.
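
One generic reason to prefer _exit() in a forked child (not necessarily
the one that matters for fork12) is that exit() flushes stdio buffers the
child inherited from its parent, while _exit() does not.  A minimal sketch
of that difference follows.  The function name bytes_flushed_by_child and
the pipe-based setup are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: leave unflushed data in a stdio buffer, fork,
 * and report how many bytes the child wrote to the pipe while exiting.
 * Returns -1 on setup failure.
 */
long bytes_flushed_by_child(int use_underscore_exit)
{
    int fds[2];
    char buf[64];
    FILE *out;
    pid_t pid;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    out = fdopen(fds[1], "w");
    if (out == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* A pipe-backed stream is fully buffered, so this is not written yet. */
    fputs("pending", out);

    pid = fork();
    if (pid < 0) {
        fclose(out);
        close(fds[0]);
        return -1;
    }
    if (pid == 0) {
        if (use_underscore_exit)
            _exit(0);   /* skips stdio flushing and atexit handlers */
        exit(0);        /* flushes the inherited copy of the buffer */
    }
    waitpid(pid, NULL, 0);

    /* Non-blocking read: only data written by the child can be there. */
    fcntl(fds[0], F_SETFL, O_NONBLOCK);
    n = read(fds[0], buf, sizeof(buf));

    fclose(out);        /* the parent's own copy is flushed here, after the read */
    close(fds[0]);
    return n > 0 ? (long)n : 0;
}
```

With exit() the child writes its inherited copy of the buffer, so the
parent sees the 7 bytes of "pending" arrive from the child.  With _exit()
nothing is written by the child.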

I suspect the issue really is the thundering herd of a million+
processes synchronizing on a single lock.

I don't think this is a hard lockup, just a global slowdown.
I expect everything will eventually exit.


To do something about this is going to take a deep and fundamental
redesign of how we maintain process lists to handle a parent
with millions of children well.

Is there any real world reason to care about this case?  Without
real world motivation I am inclined to just note that this is
something that is handled poorly, and leave it as is.

Eric

>
> On 2022/4/2 10:22, Zhang Qiao wrote:
>> ping...
>> 
>> Any suggestions for this problem?
>> 
>> Thanks!
>> Qiao
>> 
>> 
>> On 2022/3/29 16:07, Zhang Qiao wrote:
>>> Hello everyone,
>>>
>>> I got a hard lockup panic when running the LTP syscall test cases.
>>>
>>> [348439.713178] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
>>> [348439.713236] irq event stamp: 0
>>> [348439.713237] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>>> [348439.713238] hardirqs last disabled at (0): [<ffffffff87cd1ea5>] copy_process+0x7f5/0x2160
>>> [348439.713239] softirqs last  enabled at (0): [<ffffffff87cd1ea5>] copy_process+0x7f5/0x2160
>>> [348439.713240] softirqs last disabled at (0): [<0000000000000000>] 0x0
>>> [348439.713241] CPU: 32 PID: 1151212 Comm: fork12 Kdump: loaded Tainted: G S                5.10.0+ #1
>>> [348439.713242] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.35 10/20/2016
>>> [348439.713243] RIP: 0010:queued_write_lock_slowpath+0x4d/0x80
>>> [348439.713245] RSP: 0018:ffffa3a6bed4fe60 EFLAGS: 00000006
>>> [348439.713246] RAX: 0000000000000500 RBX: ffffffff892060c0 RCX: 00000000000000ff
>>> [348439.713247] RDX: 0000000000000500 RSI: 0000000000000100 RDI: ffffffff892060c0
>>> [348439.713248] RBP: ffffffff892060c4 R08: 0000000000000001 R09: 0000000000000000
>>> [348439.713249] R10: ffffa3a6bed4fde8 R11: 0000000000000000 R12: ffff96dfd3b68001
>>> [348439.713250] R13: ffff96dfd3b68000 R14: ffff96dfd3b68c38 R15: ffff96e2cf1f51c0
>>> [348439.713251] FS:  0000000000000000(0000) GS:ffff96edbc200000(0000) knlGS:0000000000000000
>>> [348439.713252] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [348439.713253] CR2: 0000000000416ea0 CR3: 0000002d91812004 CR4: 00000000001706e0
>>> [348439.713254] Call Trace:
>>> [348439.713255]  do_raw_write_lock+0xa9/0xb0
>>> [348439.713256]  _raw_write_lock_irq+0x5a/0x70
>>> [348439.713256]  do_exit+0x429/0xd00
>>> [348439.713257]  do_group_exit+0x39/0xb0
>>> [348439.713258]  __x64_sys_exit_group+0x14/0x20
>>> [348439.713259]  do_syscall_64+0x33/0x40
>>> [348439.713260]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [348439.713260] RIP: 0033:0x7f59295a7066
>>> [348439.713261] Code: Unable to access opcode bytes at RIP 0x7f59295a703c.
>>> [348439.713262] RSP: 002b:00007fff0afeac38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>>> [348439.713264] RAX: ffffffffffffffda RBX: 00007f5929694530 RCX: 00007f59295a7066
>>> [348439.713265] RDX: 0000000000000002 RSI: 000000000000003c RDI: 0000000000000002
>>> [348439.713266] RBP: 0000000000000002 R08: 00000000000000e7 R09: ffffffffffffff80
>>> [348439.713267] R10: 0000000000000002 R11: 0000000000000246 R12: 00007f5929694530
>>> [348439.713268] R13: 0000000000000001 R14: 00007f5929697f68 R15: 0000000000000000
>>> [348439.713269] Kernel panic - not syncing: Hard LOCKUP
>>> [348439.713270] CPU: 32 PID: 1151212 Comm: fork12 Kdump: loaded Tainted: G S                5.10.0+ #1
>>> [348439.713272] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.35 10/20/2016
>>> [348439.713272] Call Trace:
>>> [348439.713273]  <NMI>
>>> [348439.713274]  dump_stack+0x77/0x97
>>> [348439.713275]  panic+0x10c/0x2fb
>>> [348439.713275]  nmi_panic+0x35/0x40
>>> [348439.713276]  watchdog_hardlockup_check+0xeb/0x110
>>> [348439.713277]  __perf_event_overflow+0x52/0xf0
>>> [348439.713278]  handle_pmi_common+0x21a/0x320
>>> [348439.713286]  intel_pmu_handle_irq+0xc9/0x1b0
>>> [348439.713287]  perf_event_nmi_handler+0x24/0x40
>>> [348439.713288]  nmi_handle+0xc3/0x2a0
>>> [348439.713289]  default_do_nmi+0x49/0xf0
>>> [348439.713289]  exc_nmi+0x146/0x160
>>> [348439.713290]  end_repeat_nmi+0x16/0x55
>>> [348439.713291] RIP: 0010:queued_write_lock_slowpath+0x4d/0x80
>>> [348439.713293] RSP: 0018:ffffa3a6bed4fe60 EFLAGS: 00000006
>>> [348439.713295] RAX: 0000000000000500 RBX: ffffffff892060c0 RCX: 00000000000000ff
>>> [348439.713296] RDX: 0000000000000500 RSI: 0000000000000100 RDI: ffffffff892060c0
>>> [348439.713296] RBP: ffffffff892060c4 R08: 0000000000000001 R09: 0000000000000000
>>> [348439.713297] R10: ffffa3a6bed4fde8 R11: 0000000000000000 R12: ffff96dfd3b68001
>>> [348439.713298] R13: ffff96dfd3b68000 R14: ffff96dfd3b68c38 R15: ffff96e2cf1f51c0
>>> [348439.713300]  </NMI>
>>> [348439.713301]  do_raw_write_lock+0xa9/0xb0
>>> [348439.713302]  _raw_write_lock_irq+0x5a/0x70
>>> [348439.713303]  do_exit+0x429/0xd00
>>> [348439.713303]  do_group_exit+0x39/0xb0
>>> [348439.713304]  __x64_sys_exit_group+0x14/0x20
>>> [348439.713305]  do_syscall_64+0x33/0x40
>>> [348439.713306]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [348439.713307] RIP: 0033:0x7f59295a7066
>>> [348439.713308] Code: Unable to access opcode bytes at RIP 0x7f59295a703c.
>>>
>>>
>>> When analyzing the vmcore, I noticed that lots of fork12 processes are waiting for the tasklist read lock
>>> or write lock (see the attached all_cpu_stacks.log). Every fork12 process (all in the same process group)
>>> calls kill(0, SIGQUIT) in its signal handler [1]. Each call traverses all the processes in the process group
>>> and sends the signal to them one by one, which is very time-consuming and holds the tasklist read lock for a
>>> long time. At the same time, other processes exit after receiving the signal and try to take the tasklist
>>> write lock in exit_notify().
>>>
>>> [1] fork12 testcase: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/fork/fork12.c
>>>
>>> Some processes call kill(0, SIGQUIT) and wait for the tasklist read lock:
>>>
>>>  #5 [ffff972a9b16fd78] native_queued_spin_lock_slowpath at ffffffff9931ed47
>>>  #6 [ffff972a9b16fd78] queued_read_lock_slowpath at ffffffff99320a58
>>>  #7 [ffff972a9b16fd90] do_wait at ffffffff992bc17d
>>>  #8 [ffff972a9b16fdd0] kernel_wait4 at ffffffff992bd88d
>>>  #9 [ffff972a9b16fe58] __do_sys_wait4 at ffffffff992bd9e5
>>> #10 [ffff972a9b16ff30] do_syscall_64 at ffffffff9920432d
>>> #11 [ffff972a9b16ff50] entry_SYSCALL_64_after_hwframe at ffffffff99c000ad
>>>
>>> At the same time, some processes are exiting and wait for the tasklist write lock:
>>>
>>>  #5 [ffff972aa49a7e60] native_queued_spin_lock_slowpath at ffffffff9931ecb0
>>>  #6 [ffff972aa49a7e60] queued_write_lock_slowpath at ffffffff993209e4
>>>  #7 [ffff972aa49a7e78] do_raw_write_lock at ffffffff99320834
>>>  #8 [ffff972aa49a7e88] do_exit at ffffffff992bcd78
>>>  #9 [ffff972aa49a7f00] do_group_exit at ffffffff992bd719
>>> #10 [ffff972aa49a7f28] __x64_sys_exit_group at ffffffff992bd7a4
>>> #11 [ffff972aa49a7f30] do_syscall_64 at ffffffff9920432d
>>> #12 [ffff972aa49a7f50] entry_SYSCALL_64_after_hwframe at ffffffff99c000ad
>>>
>>> In this scenario, lots of processes are waiting for the tasklist read lock or the tasklist write lock,
>>> so they queue up. If the wait queue is long enough, it might cause a hard lockup when a process waits to
>>> take the write lock in exit_notify().
>>>
>>> I tried to solve this problem by avoiding traversing the process group multiple times when kill(0, xxxx)
>>> is called multiple times from the same process group, but it doesn't look like a good solution.
>>>
>>> Is there a good way to fix this problem?
>>>
>>> Thanks!
>>>
>>> Qiao
