public inbox for linux-kernel@vger.kernel.org
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Yonghong Song <yhs@fb.com>, Alexei Starovoitov <ast@kernel.org>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	netdev@vger.kernel.org, bpf@vger.kernel.org
Subject: [PATCH AUTOSEL 5.8 55/63] bpf: Fix a rcu_sched stall issue with bpf task/task_file iterator
Date: Mon, 24 Aug 2020 12:34:55 -0400
Message-ID: <20200824163504.605538-55-sashal@kernel.org>
In-Reply-To: <20200824163504.605538-1-sashal@kernel.org>

From: Yonghong Song <yhs@fb.com>

[ Upstream commit e679654a704e5bd676ea6446fa7b764cbabf168a ]

In our production system, we observed rcu stalls when
`bpftool prog` was running:
  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 	7-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
  	(t=21031 jiffies g=2534773 q=179750)
  NMI backtrace for cpu 7
  CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
  Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
  Call Trace:
  <IRQ>
  dump_stack+0x57/0x70
  nmi_cpu_backtrace.cold+0x14/0x53
  ? lapic_can_unplug_cpu.cold+0x39/0x39
  nmi_trigger_cpumask_backtrace+0xb7/0xc7
  rcu_dump_cpu_stacks+0xa2/0xd0
  rcu_sched_clock_irq.cold+0x1ff/0x3d9
  ? tick_nohz_handler+0x100/0x100
  update_process_times+0x5b/0x90
  tick_sched_timer+0x5e/0xf0
  __hrtimer_run_queues+0x12a/0x2a0
  hrtimer_interrupt+0x10e/0x280
  __sysvec_apic_timer_interrupt+0x51/0xe0
  asm_call_on_stack+0xf/0x20
  </IRQ>
  sysvec_apic_timer_interrupt+0x6f/0x80
  asm_sysvec_apic_timer_interrupt+0x12/0x20
  RIP: 0010:task_file_seq_get_next+0x71/0x220
  Code: 00 00 8b 53 1c 49 8b 7d 00 89 d6 48 8b 47 20 44 8b 18 41 39 d3 76 75 48 8b 4f 20 8b 01 39 d0 76 61 41 89 d1 49 39 c1 48 19 c0 <48> 8b 49 08 21 d0 48 8d 04 c1 4c 8b 08 4d 85 c9 74 46 49 8b 41 38
  RSP: 0018:ffffc90006223e10 EFLAGS: 00000297
  RAX: ffffffffffffffff RBX: ffff888f0d172388 RCX: ffff888c8c07c1c0
  RDX: 00000000000f017b RSI: 00000000000f017b RDI: ffff888c254702c0
  RBP: ffffc90006223e68 R08: ffff888be2a1c140 R09: 00000000000f017b
  R10: 0000000000000002 R11: 0000000000100000 R12: ffff888f23c24118
  R13: ffffc90006223e60 R14: ffffffff828509a0 R15: 00000000ffffffff
  task_file_seq_next+0x52/0xa0
  bpf_seq_read+0xb9/0x320
  vfs_read+0x9d/0x180
  ksys_read+0x5f/0xe0
  do_syscall_64+0x38/0x60
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f8815f4f76e
  Code: c0 e9 f6 fe ff ff 55 48 8d 3d 76 70 0a 00 48 89 e5 e8 36 06 02 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 52 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5
  RSP: 002b:00007fff8f9df578 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 000000000170b9c0 RCX: 00007f8815f4f76e
  RDX: 0000000000001000 RSI: 00007fff8f9df5b0 RDI: 0000000000000007
  RBP: 00007fff8f9e05f0 R08: 0000000000000049 R09: 0000000000000010
  R10: 00007f881601fa40 R11: 0000000000000246 R12: 00007fff8f9e05a8
  R13: 00007fff8f9e05a8 R14: 0000000001917f90 R15: 000000000000e22e

Note that `bpftool prog` actually calls a task_file bpf iterator
program to establish an association between prog/map/link/btf anon
files and processes.

In the case where the above rcu stall occurred, we had a process
with 1587 tasks, each task having roughly 81305 files.
This implied 129 million bpf prog invocations. Unfortunately, none of
these files are prog/map/link/btf files, so the bpf iterator/prog needs
to traverse all of them and is not able to return to user space,
since there is no seq_file buffer overflow.

This patch fixes the issue in bpf_seq_read() by limiting the number
of visited objects. If the maximum number of visited objects is
reached, no more objects will be visited in the current syscall.
If nothing has been written to the seq_file buffer, -EAGAIN is
returned to the user so that the user can try again.
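
On the user-space side, the -EAGAIN behaviour described above means a
reader of a bpf iterator fd may need to repeat the read() call. A minimal
sketch of such a loop follows; iter_read_retry() and its max_retries
parameter are hypothetical names chosen for illustration and are not part
of this patch or of bpftool.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not code from this patch):
 * read from an iterator fd and retry while the kernel reports EAGAIN,
 * i.e. the object limit was hit before anything reached the buffer.
 * Returns bytes read, 0 at end of iteration, or -1 with errno set.
 */
ssize_t iter_read_retry(int fd, void *buf, size_t len, int max_retries)
{
	for (;;) {
		ssize_t n = read(fd, buf, len);

		if (n >= 0)
			return n;	/* data, or 0 when the iteration is done */
		if (errno != EAGAIN || max_retries-- <= 0)
			return -1;	/* real error, or retry budget exhausted */
	}
}
```

Bounding the retries is a design choice of this sketch; a caller that
wants to walk everything could instead loop until read() returns 0.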

The maximum number of visited objects is set at 1 million.
On our Intel Xeon D-2191 2.3 GHz 18-core server, bpf_seq_read()
visiting 1 million files takes around 0.18 seconds.
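
As a rough check of the figures quoted in this message, the arithmetic
can be sketched as below. The function names are hypothetical, and the
result assumes every task has exactly 81305 files, whereas the message
only says "roughly". With these inputs, the walk that previously ran
inside a single read() is split into about 130 bounded calls of
around 0.18 seconds each.

```c
#include <stdint.h>

/* Total objects a task_file iterator visits: tasks times files per task. */
uint64_t total_objects(uint64_t tasks, uint64_t files_per_task)
{
	return tasks * files_per_task;
}

/*
 * Number of read() calls needed to walk `total` objects when each call
 * visits at most `cap` of them (ceiling division). This ignores
 * off-by-one details of the kernel loop, so treat it as an estimate.
 */
uint64_t reads_needed(uint64_t total, uint64_t cap)
{
	return (total + cap - 1) / cap;
}
```

For example, total_objects(1587, 81305) gives 129031035, which matches
the "129 million" figure, and reads_needed(129031035, 1000000) gives 130.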

We did not use cond_resched() because for some iterators, e.g. the
netlink iterator, the rcu read_lock critical section spans
consecutive seq_ops->next() calls, which makes it impossible to call
cond_resched() in the key while loop of bpf_seq_read().

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/bpf/20200818222309.2181348-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/bpf/bpf_iter.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index dd612b80b9fea..3c18090cd73dc 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -64,6 +64,9 @@ static void bpf_iter_done_stop(struct seq_file *seq)
 	iter_priv->done_stop = true;
 }
 
+/* maximum visited objects before bailing out */
+#define MAX_ITER_OBJECTS	1000000
+
 /* bpf_seq_read, a customized and simpler version for bpf iterator.
  * no_llseek is assumed for this file.
  * The following are differences from seq_read():
@@ -76,7 +79,7 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 {
 	struct seq_file *seq = file->private_data;
 	size_t n, offs, copied = 0;
-	int err = 0;
+	int err = 0, num_objs = 0;
 	void *p;
 
 	mutex_lock(&seq->lock);
@@ -132,6 +135,7 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 	while (1) {
 		loff_t pos = seq->index;
 
+		num_objs++;
 		offs = seq->count;
 		p = seq->op->next(seq, p, &seq->index);
 		if (pos == seq->index) {
@@ -150,6 +154,15 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 		if (seq->count >= size)
 			break;
 
+		if (num_objs >= MAX_ITER_OBJECTS) {
+			if (offs == 0) {
+				err = -EAGAIN;
+				seq->op->stop(seq, p);
+				goto done;
+			}
+			break;
+		}
+
 		err = seq->op->show(seq, p);
 		if (err > 0) {
 			bpf_iter_dec_seq_num(seq);
-- 
2.25.1

