* [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
@ 2025-02-06 10:54 Kumar Kartikeya Dwivedi
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Changelog:
----------
v1 -> v2
v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com
* Address nits from Waiman and Peter
* Fix arm64 WFE bug pointed out by Peter.
* Fix incorrect memory ordering in release_held_lock_entry, and
document subtleties. Explain why release is sufficient in unlock
but not in release_held_lock_entry.
* Remove dependence on CONFIG_QUEUED_SPINLOCKS and introduce a
test-and-set fallback when queued spinlock support is missing on an
architecture.
* Enforce FIFO ordering for BPF program spin unlocks.
* Address comments from Eduard on verifier plumbing.
* Add comments as suggested by Waiman.
* Refactor paravirt TAS lock to use the implemented TAS fallback.
* Use rqspinlock_t as the type throughout so that it can be replaced
with a non-qspinlock type in case of fallback.
* Testing and benchmarking on arm64, added numbers to the cover letter.
* Fix kernel test robot errors.
* Fix a BPF selftest bug leading to spurious failures on arm64.
Introduction
------------
This patch set introduces Resilient Queued Spin Lock (or rqspinlock with
res_spin_lock() and res_spin_unlock() APIs).
This is a qspinlock variant which recovers the kernel from a stalled
state when the lock acquisition path cannot make forward progress. This
can occur when a lock acquisition attempt enters a deadlock situation
(e.g. AA, or ABBA), or more generally, when the owner of the lock (which
we’re trying to acquire) isn’t making forward progress.
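For illustration, a minimal usage sketch is shown below. It assumes that lock acquisition reports failure through a non-zero return value; the error handling shown here is indicative, not a definitive description of the API.
#include <asm-generic/rqspinlock.h>

static rqspinlock_t lock;

static int do_protected_work(void)
{
	int ret;

	/* Acquisition can fail at runtime, e.g. on deadlock or timeout. */
	ret = res_spin_lock(&lock);
	if (ret)
		return ret;	/* lock was not taken; bail out */

	/* ... critical section ... */

	res_spin_unlock(&lock);
	return 0;
}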
The cover letter provides an overview of the motivation, design, and
alternative approaches. We then provide evaluation numbers showcasing
that while rqspinlock incurs overhead, the performance of rqspinlock
approaches that of the normal qspinlock used by the kernel.
The evaluations for rqspinlock were performed by replacing the default
qspinlock implementation with it and booting the kernel to run the
experiments. Support for locktorture is also included with numbers in
this series.
The cover letter's design section provides an overview of the
algorithmic approach. A technical document describing the implementation
in more detail is available here:
https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf
We have a WIP TLA+ proof for liveness and mutual exclusion of rqspinlock
built on top of the qspinlock TLA+ proof from Catalin Marinas [3]. We
will share more details and the links in the near future.
Motivation
----------
In regular kernel code, lock usage is assumed to be correct, so deadlocks
and stalls are avoided by construction; the same is not true for BPF
programs. Users write normal C code, and the in-kernel eBPF runtime
ensures the safety of the kernel by rejecting unsafe programs. Users can
load programs that use locks in an improper fashion and may cause
deadlocks when these programs run inside the kernel. The verifier is
responsible for rejecting such programs from being loaded into the
kernel.
Until now, the eBPF verifier ensured deadlock safety by permitting only
one lock acquisition at a time, and by preventing any functions from
being called from within the critical section. Additionally, only a few
restricted program types are allowed to use spin locks. As the usage of
eBPF grows (e.g. with sched_ext) beyond its conventional applications in
networking, tracing, and security, these limitations on locking are
becoming a bottleneck for users.
The rqspinlock implementation allows us to permit more flexible locking
patterns in BPF programs, without limiting them to the subset that can
be proven safe statically (which is fairly small and requires complex
static analysis), while ensuring that the kernel will recover if a
locking violation is encountered at runtime. We make a tradeoff here: we
accept programs that may potentially deadlock, and recover the kernel
quickly at runtime to ensure availability.
Additionally, eBPF programs attached to different parts of the kernel
can introduce new control flow into the kernel, which increases the
likelihood of deadlocks in code not written to handle reentrancy. There
have been multiple syzbot reports surfacing deadlocks in internal kernel
code due to the diverse ways in which eBPF programs can be attached to
different parts of the kernel. By switching the BPF subsystem’s lock
usage to rqspinlock, all of these issues can be mitigated at runtime.
This spin lock implementation allows BPF maps to become safer and to
remove mechanisms that have fallen short of assuring safety when
programs nest in arbitrary ways, in the same context or across different
contexts. The red diffs due to patches 19-21 demonstrate this
simplification.
> kernel/bpf/hashtab.c | 102 ++++++++++++++++++++++++++++++++--------------------------...
> kernel/bpf/lpm_trie.c | 25 ++++++++++++++-----------
> kernel/bpf/percpu_freelist.c | 113 +++++++++++++++++++++++++---------------------------------...
> kernel/bpf/percpu_freelist.h | 4 ++--
> 4 files changed, 73 insertions(+), 171 deletions(-)
Design
------
Deadlocks mostly manifest as stalls in the waiting loops of the
qspinlock slow path. Thus, using stalls as a signal for deadlocks avoids
introducing cost to the normal fast path, and ensures bounded
termination of the waiting loop. Our recovery algorithm is focused on
terminating the waiting loops of the qspinlock algorithm when it gets
stuck, and implementing bespoke recovery procedures for each class of
waiter to restore the lock to a usable state. Deadlock detection is the
main mechanism used to provide faster recovery, with the timeout
mechanism acting as a final line of defense.
Deadlock Detection
~~~~~~~~~~~~~~~~~~
We handle two cases of deadlocks: AA deadlocks (attempts to acquire the
same lock again), and ABBA deadlocks (attempts to acquire two locks in
the opposite order from two distinct threads). Variants of ABBA
deadlocks may be encountered with more than two locks being held in the
incorrect order. These are not diagnosed explicitly, as they reduce to
ABBA deadlocks.
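As a concrete illustration, using the res_spin_lock() API sketched earlier (lock names are hypothetical and return values are elided for brevity):
static rqspinlock_t A, B;

/* AA: the same context tries to acquire a lock it already holds. */
static void aa_example(void)
{
	res_spin_lock(&A);
	res_spin_lock(&A);	/* reported as an AA deadlock */
}

/* ABBA: two CPUs acquire the same pair of locks in opposite order. */
static void cpu0_path(void)
{
	res_spin_lock(&A);
	res_spin_lock(&B);	/* may be reported as an ABBA deadlock */
}

static void cpu1_path(void)
{
	res_spin_lock(&B);
	res_spin_lock(&A);	/* may be reported as an ABBA deadlock */
}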
Deadlock detection is triggered immediately when beginning the waiting
loop of a lock slow path.
While timeouts ensure that any waiting loops in the locking slow path
terminate and return to the caller, the wait can be excessively long in
some situations. Even though the default timeout is short (0.5s), a
stall of this duration inside the kernel can set off alerts for
latency-critical services with strict SLOs. Ideally, the kernel should
recover from an undesired state of the lock as soon as possible.
A multi-step strategy is used to recover the kernel from waiting loops
in the locking algorithm which may fail to terminate in a bounded amount
of time.
* Each CPU maintains a table of held locks. Entries are inserted upon
entry into lock and removed upon exit from unlock.
* Deadlock detection for AA locks is thus simple: we have an AA
deadlock if we find a held lock entry for the lock we’re attempting
to acquire on the same CPU.
* During deadlock detection for ABBA, we search through the tables of
all other CPUs to find situations where we are holding a lock the
remote CPU is attempting to acquire, and they are holding a lock we
are attempting to acquire. Upon encountering such a condition, we
report an ABBA deadlock.
* We divide the duration between entry into the waiting loop and the
timeout into intervals of 1 ms, and perform deadlock detection until
the timeout expires. Upon entry into the slow path, and then upon
completion of each 1 ms interval, we perform detection of both AA and
ABBA deadlocks. If deadlock detection yields a positive result,
recovery happens sooner than the timeout; otherwise, it happens as a
last resort upon completion of the timeout. A simplified sketch of the
detection scheme follows this list.
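The following sketch illustrates the per-CPU table and the AA/ABBA checks described above. Names, sizes, and the flat table layout are illustrative only; the actual implementation also distinguishes the lock currently being acquired and copes with concurrent updates to remote tables.
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/types.h>

#define NR_HELD	16	/* illustrative table size */

struct held_locks {
	int cnt;
	void *locks[NR_HELD];	/* locks held or being acquired by this CPU */
};

static DEFINE_PER_CPU(struct held_locks, held_locks_table);

static bool table_has(struct held_locks *t, void *lock)
{
	int i;

	for (i = 0; i < t->cnt; i++) {
		if (READ_ONCE(t->locks[i]) == lock)
			return true;
	}
	return false;
}

/* AA: this CPU already has an entry for the lock it is about to wait on. */
static bool detect_aa(void *lock)
{
	return table_has(this_cpu_ptr(&held_locks_table), lock);
}

/*
 * ABBA: some remote CPU has an entry for the lock we want, while a lock
 * present in our own table also appears in that remote CPU's table.
 */
static bool detect_abba(void *lock)
{
	struct held_locks *mine = this_cpu_ptr(&held_locks_table);
	int cpu, i;

	for_each_possible_cpu(cpu) {
		struct held_locks *rem = per_cpu_ptr(&held_locks_table, cpu);

		if (rem == mine || !table_has(rem, lock))
			continue;
		for (i = 0; i < mine->cnt; i++) {
			void *held = READ_ONCE(mine->locks[i]);

			if (held != lock && table_has(rem, held))
				return true;
		}
	}
	return false;
}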
Timeouts
~~~~~~~~
Timeouts act as the final line of defense against stalls in the waiting
loops. The ktime_get_mono_fast_ns() function is used to poll the current
time, which is compared against a timestamp indicating the end of the
waiting period. Each waiting loop is instrumented to check an extra
condition using a macro. Internally, the macro implementation amortizes
the timeout check to avoid sampling the clock on every iteration;
specifically, the timeout checks are invoked every 64k iterations.
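The following is a rough sketch of how such a check macro can amortize the clock sampling. The macro name, structure layout, and error code are illustrative rather than the exact ones used in this series; ret must be initialized to 0 by the caller.
#include <linux/errno.h>
#include <linux/ktime.h>
#include <linux/types.h>

struct rqtimeout {
	u64 deadline;	/* absolute end time in ns */
	u16 spin;	/* iteration counter, wraps every 64k iterations */
};

static noinline int deadline_expired(struct rqtimeout *ts)
{
	return ktime_get_mono_fast_ns() > ts->deadline ? -ETIMEDOUT : 0;
}

/*
 * Check the extra loop condition cheaply: sample the clock only when the
 * 16-bit counter wraps around, i.e. once every 64k iterations.
 */
#define RES_CHECK_TIMEOUT(ts, ret)				\
	({							\
		if (!(ts).spin++)				\
			(ret) = deadline_expired(&(ts));	\
		(ret);						\
	})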
Recovery
~~~~~~~~
There is extensive literature in academia on designing locks that
support timeouts [0][1], as timeouts can be used as a proxy for
detecting the presence of deadlocks and recovering from them, without
maintaining explicit metadata to construct a waits-for relationship
between two threads at runtime.
In the case of rqspinlock, the key simplification in our algorithm comes
from the fact that, upon a timeout, waiters always leave the queue in
FIFO order. As such, the timeout is only enforced by the head of the
wait queue, while the other waiters rely on the head to signal them when
a timeout has occurred and they need to exit. Unlike previous approaches
[0][1], we do not have to implement complex algorithms or extra
synchronization to handle waiters in the middle of the queue timing out
before their predecessor or successor.
There are three forms of waiters in the original queued spin lock
algorithm. The first is the waiter that acquires the pending bit and
spins on the lock word without forming a wait queue. The second is the
head waiter, i.e. the first waiter at the front of the wait queue. The
third comprises all the non-head waiters queued behind the head, waiting
to be signalled through their MCS node to take over the responsibility
of the head.
In rqspinlock's recovery algorithm, we are concerned with the second and
third kind. First, we augment the waiting loop of the head of the wait
queue with a timeout. When this timeout happens, all waiters part of the
wait queue will abort their lock acquisition attempts. This happens in
three steps.
* First, the head breaks out of its loop waiting for pending and locked
bits to turn to 0, and non-head waiters break out of their MCS node
spin (more on that later).
* Next, every waiter (head or non-head) checks whether it is also the
tail waiter; if so, it attempts to zero out the tail word, allowing a
new queue to be built up for this lock. If it succeeds, there is no one
left in the queue to signal to stop spinning.
* Otherwise, it signals the MCS node of the next waiter to break out of
its spin and try resetting the tail word back to 0. This goes on until
the tail waiter is found. In case of races, the new tail will be
responsible for performing the same task, as the old tail will fail to
reset the tail word and instead wait for its next pointer to be updated
before it signals the new tail to do the same. A simplified sketch of
this abort sequence follows this list.
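The following sketch illustrates steps two and three in simplified form. It uses a pointer-sized tail instead of the encoded tail field of the 32-bit lock word, and elides the memory ordering details handled by the real implementation; all names are illustrative.
#include <linux/atomic.h>
#include <asm/processor.h>

struct rq_node {
	struct rq_node *next;
	int locked;		/* 0: keep spinning, 1: lock handed off */
};

#define RES_ABORT_VAL	2	/* illustrative "stop spinning and abort" signal */

static void signal_abort_chain(atomic_long_t *tail, struct rq_node *node)
{
	struct rq_node *next;

	/*
	 * If we are the tail, reset it so that a fresh queue can form for
	 * this lock; in that case nobody is queued behind us to signal.
	 */
	if (atomic_long_cmpxchg_relaxed(tail, (long)node, 0) == (long)node)
		return;

	/*
	 * Someone is queued behind us (possibly racing to become the new
	 * tail); wait for the next pointer to be published, then tell the
	 * next waiter to abort. It then repeats this same sequence.
	 */
	while (!(next = READ_ONCE(node->next)))
		cpu_relax();
	WRITE_ONCE(next->locked, RES_ABORT_VAL);
}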
Timeout Bound
~~~~~~~~~~~~~
The timeout is applied by two types of waiters: the pending bit waiter
and the wait queue head waiter. As such, for the pending waiter, only
the lock owner is ahead of it, and for the wait queue head waiter, only
the lock owner and the pending waiter take precedence in executing their
critical sections.
Therefore, the timeout value must span at most 2 critical section
lengths, and thus, it is unaffected by the amount of contention or the
number of CPUs on the host. Non-head waiters simply wait for the wait
queue head to signal them on a timeout.
In Meta's production, we have noticed uncore PMU reads and SMIs
consuming tens of msecs. While these events are rare, a 0.5 second
timeout should absorb such tail events and not raise false alarms for
timeouts. We will continue monitoring this in production and adjust the
timeout if necessary in the future.
More details of the recovery algorithm are described in patch 9, and a
detailed description is available at [2].
Alternatives
------------
Lockdep: We do not rely on the lockdep facility for reporting violations
for two primary reasons:
* Overhead: The lockdep infrastructure can add significant overhead to
the lock acquisition path, and is not recommended for production use for
this reason. While its reports are more useful and exhaustive, the
overhead can be prohibitive, especially as BPF programs run in hot paths
of the kernel. Moreover, it also increases the size of the lock word to
store extra metadata, which is not feasible for BPF spin locks that are
4 bytes in size today (similar to qspinlock).
* Debug Tool: Lockdep is intended to be used as a debugging facility,
providing extra context to the user about locking violations occurring
at runtime. It is turned off on production kernels, and therefore isn't
available most of the time.
We require a mechanism for detecting common variants of deadlocks that
is always available in production kernels and never turned off. At the
same time, it must not introduce overhead in terms of time (for the slow
path) and memory (for the lock word size).
Evaluation
----------
We run benchmarks that stress locking scalability and perform comparison
against the baseline (qspinlock). For the rqspinlock case, we replace
the default qspinlock with it in the kernel, such that all spin locks in
the kernel use the rqspinlock slow path. As such, benchmarks that stress
kernel spin locks end up exercising rqspinlock.
Evaluation setup
~~~~~~~~~~~~~~~~
We set the CPU governor to performance for all experiments.
Note: The arm64 numbers were obtained without the no-WFE fallback in
this series, to allow a fair comparison with the WFE-based qspinlock
baseline.
x86_64:
Intel Xeon Platinum 8468 (Sapphire Rapids)
96 cores (48 x 2 sockets)
2 threads per core, 0-95, siblings from 96-191
2 NUMA nodes (every 48 cores), 2 LLCs (every 48 cores), 1 LLC per NUMA node
Hyperthreading enabled
arm64:
Ampere Max Neoverse-N1 256-Core Processor
256 cores (128 cores x 2 sockets)
1 thread per core
2 NUMA nodes (every 128 cores), 1 L2 per core (256 instances), no shared L3
No hyperthreading available
The locktorture experiment is run for 30 seconds.
Average of 25 runs is used for will-it-scale, after an initial warm up.
More information on the locks contended in the will-it-scale experiments
is available in the evaluation section of the CNA paper, in table 1 [4].
Legend:
QL - qspinlock (avg. throughput)
RQL - rqspinlock (avg. throughput)
Results
~~~~~~~
locktorture - x86_64
Threads QL RQL Speedup
-----------------------------------------------
1 46910437 45057327 0.96
2 29871063 25085034 0.84
4 13876024 19242776 1.39
8 14638499 13346847 0.91
16 14380506 14104716 0.98
24 17278144 15293077 0.89
32 19494283 17826675 0.91
40 27760955 21002910 0.76
48 28638897 26432549 0.92
56 29336194 26512029 0.9
64 30040731 27421403 0.91
72 29523599 27010618 0.91
80 28846738 27885141 0.97
88 29277418 25963753 0.89
96 28472339 27423865 0.96
104 28093317 26634895 0.95
112 29914000 27872339 0.93
120 29199580 26682695 0.91
128 27755880 27314662 0.98
136 30349095 27092211 0.89
144 29193933 27805445 0.95
152 28956663 26071497 0.9
160 28950009 28183864 0.97
168 29383520 28135091 0.96
176 28475883 27549601 0.97
184 31958138 28602434 0.89
192 31342633 33394385 1.07
will-it-scale open1_threads - x86_64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 1396323.92 7373.12 0.53 1366616.8 4152.08 0.3 0.98
2 1844403.8 3165.26 0.17 1700301.96 2396.58 0.14 0.92
4 2370590.6 24545.54 1.04 1655872.32 47938.71 2.9 0.7
8 2185227.04 9537.9 0.44 1691205.16 9783.25 0.58 0.77
16 2110672.36 10972.99 0.52 1781696.24 15021.43 0.84 0.84
24 1655042.72 18037.23 1.09 2165125.4 5422.54 0.25 1.31
32 1738928.24 7166.64 0.41 1829468.24 9081.59 0.5 1.05
40 1854430.52 6148.24 0.33 1731062.28 3311.95 0.19 0.93
48 1766529.96 5063.86 0.29 1749375.28 2311.27 0.13 0.99
56 1303016.28 6168.4 0.47 1452656 7695.29 0.53 1.11
64 1169557.96 4353.67 0.37 1287370.56 8477.2 0.66 1.1
72 1036023.4 7116.53 0.69 1135513.92 9542.55 0.84 1.1
80 1097913.64 11356 1.03 1176864.8 6771.41 0.58 1.07
88 1123907.36 12843.13 1.14 1072416.48 7412.25 0.69 0.95
96 1166981.52 9402.71 0.81 1129678.76 9499.14 0.84 0.97
104 1108954.04 8171.46 0.74 1032044.44 7840.17 0.76 0.93
112 1000777.76 8445.7 0.84 1078498.8 6551.47 0.61 1.08
120 1029448.4 6992.29 0.68 1093743 8378.94 0.77 1.06
128 1106670.36 10102.15 0.91 1241438.68 23212.66 1.87 1.12
136 1183776.88 6394.79 0.54 1116799.64 18111.38 1.62 0.94
144 1201122 25917.69 2.16 1301779.96 15792.6 1.21 1.08
152 1099737.08 13567.82 1.23 1053647.2 12704.29 1.21 0.96
160 1031186.32 9048.07 0.88 1069961.4 8293.18 0.78 1.04
168 1068817 16486.06 1.54 1096495.36 14021.93 1.28 1.03
176 966633.96 9623.27 1 1081129.84 9474.81 0.88 1.12
184 1004419.04 12111.11 1.21 1037771.24 12001.66 1.16 1.03
192 1088858.08 16522.93 1.52 1027943.12 14238.57 1.39 0.94
will-it-scale open2_threads - x86_64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 1337797.76 4649.19 0.35 1332609.4 3813.14 0.29 1
2 1598300.2 1059.93 0.07 1771891.36 5667.12 0.32 1.11
4 1736573.76 13025.33 0.75 1396901.2 2682.46 0.19 0.8
8 1794367.84 4879.6 0.27 1917478.56 3751.98 0.2 1.07
16 1990998.44 8332.78 0.42 1864165.56 9648.59 0.52 0.94
24 1868148.56 4248.23 0.23 1710136.68 2760.58 0.16 0.92
32 1955180 6719 0.34 1936149.88 1980.87 0.1 0.99
40 1769646.4 4686.54 0.26 1729653.68 4551.22 0.26 0.98
48 1724861.16 4056.66 0.24 1764900 971.11 0.06 1.02
56 1318568 7758.86 0.59 1385660.84 7039.8 0.51 1.05
64 1143290.28 5351.43 0.47 1316686.6 5597.69 0.43 1.15
72 1196762.68 10655.67 0.89 1230173.24 9858.2 0.8 1.03
80 1126308.24 6901.55 0.61 1085391.16 7444.34 0.69 0.96
88 1035672.96 5452.95 0.53 1035541.52 8095.33 0.78 1
96 1030203.36 6735.71 0.65 1020113.48 8683.13 0.85 0.99
104 1039432.88 6583.59 0.63 1083902.48 5775.72 0.53 1.04
112 1113609.04 4380.62 0.39 1072010.36 8983.14 0.84 0.96
120 1109420.96 7183.5 0.65 1079424.12 10929.97 1.01 0.97
128 1095400.04 4274.6 0.39 1095475.2 12042.02 1.1 1
136 1071605.4 11103.73 1.04 1114757.2 10516.55 0.94 1.04
144 1104147.2 9714.75 0.88 1044954.16 7544.2 0.72 0.95
152 1164280.24 13386.15 1.15 1101213.92 11568.49 1.05 0.95
160 1084892.04 7941.25 0.73 1152273.76 9593.38 0.83 1.06
168 983654.76 11772.85 1.2 1111772.28 9806.83 0.88 1.13
176 1087544.24 11262.35 1.04 1077507.76 9442.02 0.88 0.99
184 1101682.4 24701.68 2.24 1095223.2 16707.29 1.53 0.99
192 983712.08 13453.59 1.37 1051244.2 15662.05 1.49 1.07
will-it-scale lock1_threads - x86_64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 4307484.96 3959.31 0.09 4252908.56 10375.78 0.24 0.99
2 7701844.32 4169.88 0.05 7219233.52 6437.11 0.09 0.94
4 14781878.72 22854.85 0.15 15260565.12 37305.71 0.24 1.03
8 12949698.64 99270.42 0.77 9954660.4 142805.68 1.43 0.77
16 12947690.64 72977.27 0.56 10865245.12 49520.31 0.46 0.84
24 11142990.64 33200.39 0.3 11444391.68 37884.46 0.33 1.03
32 9652335.84 22369.48 0.23 9344086.72 21639.22 0.23 0.97
40 9185931.12 5508.96 0.06 8881506.32 5072.33 0.06 0.97
48 9084385.36 10871.05 0.12 8863579.12 4583.37 0.05 0.98
56 6595540.96 33100.59 0.5 6640389.76 46619.96 0.7 1.01
64 5946726.24 47160.5 0.79 6572155.84 91973.73 1.4 1.11
72 6744894.72 43166.65 0.64 5991363.36 80637.56 1.35 0.89
80 6234502.16 118983.16 1.91 5157894.32 73592.72 1.43 0.83
88 5053879.6 199713.75 3.95 4479758.08 36202.27 0.81 0.89
96 5184302.64 99199.89 1.91 5249210.16 122348.69 2.33 1.01
104 4612391.92 40803.05 0.88 4850209.6 26813.28 0.55 1.05
112 4809209.68 24070.68 0.5 4869477.84 27489.04 0.56 1.01
120 5130746.4 34265.5 0.67 4620047.12 44229.54 0.96 0.9
128 5376465.28 95028.05 1.77 4781179.6 43700.93 0.91 0.89
136 5453742.4 86718.87 1.59 5412457.12 40339.68 0.75 0.99
144 5805040.72 84669.31 1.46 5595382.48 68701.65 1.23 0.96
152 5842897.36 31120.33 0.53 5787587.12 43521.68 0.75 0.99
160 5837665.12 14179.44 0.24 5118808.72 45193.23 0.88 0.88
168 5660332.72 27467.09 0.49 5104959.04 40891.75 0.8 0.9
176 5180312.24 28656.39 0.55 4718407.6 58734.13 1.24 0.91
184 4706824.16 50469.31 1.07 4692962.64 92266.85 1.97 1
192 5126054.56 51082.02 1 4680866.8 58743.51 1.25 0.91
will-it-scale lock2_threads - x86_64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 4316091.2 4933.28 0.11 4293104 30369.71 0.71 0.99
2 3500046.4 19852.62 0.57 4507627.76 23667.66 0.53 1.29
4 3639098.96 26370.65 0.72 3673166.32 30822.71 0.84 1.01
8 3714548.56 49953.44 1.34 4055818.56 71630.41 1.77 1.09
16 4188724.64 105414.49 2.52 4316077.12 68956.15 1.6 1.03
24 3737908.32 47391.46 1.27 3762254.56 55345.7 1.47 1.01
32 3820952.8 45207.66 1.18 3710368.96 52651.92 1.42 0.97
40 3791280.8 28630.55 0.76 3661933.52 37671.27 1.03 0.97
48 3765721.84 59553.83 1.58 3604738.64 50861.36 1.41 0.96
56 3175505.76 64336.17 2.03 2771022.48 66586.99 2.4 0.87
64 2620294.48 71651.34 2.73 2650171.68 44810.83 1.69 1.01
72 2861893.6 86542.61 3.02 2537437.2 84571.75 3.33 0.89
80 2976297.2 83566.43 2.81 2645132.8 85992.34 3.25 0.89
88 2547724.8 102014.36 4 2336852.16 80570.25 3.45 0.92
96 2945310.32 82673.25 2.81 2513316.96 45741.81 1.82 0.85
104 3028818.64 90643.36 2.99 2581787.52 52967.48 2.05 0.85
112 2546264.16 102605.82 4.03 2118812.64 62043.19 2.93 0.83
120 2917334.64 112220.01 3.85 2720418.64 64035.96 2.35 0.93
128 2906621.84 69428.1 2.39 2795310.32 56736.87 2.03 0.96
136 2841833.76 105541.11 3.71 3063404.48 62288.94 2.03 1.08
144 3032822.32 134796.56 4.44 3169985.6 149707.83 4.72 1.05
152 2557694.96 62218.15 2.43 2469887.6 68343.78 2.77 0.97
160 2810214.72 61468.79 2.19 2323768.48 54226.71 2.33 0.83
168 2651146.48 76573.27 2.89 2385936.64 52433.98 2.2 0.9
176 2720616.32 89026.19 3.27 2941400.08 59296.64 2.02 1.08
184 2696086 88541.24 3.28 2598225.2 76365.7 2.94 0.96
192 2908194.48 87023.91 2.99 2377677.68 53299.82 2.24 0.82
locktorture - arm64
Threads QL RQL Speedup
-----------------------------------------------
1 43320464 44718174 1.03
2 21056971 29255448 1.39
4 16040120 11563981 0.72
8 12786398 12838909 1
16 13646408 13436730 0.98
24 13597928 13669457 1.01
32 16456220 14600324 0.89
40 16667726 13883101 0.83
48 14347691 14608641 1.02
56 15624580 15180758 0.97
64 18105114 16009137 0.88
72 16606438 14772256 0.89
80 16550202 14124056 0.85
88 16716082 15930618 0.95
96 16489242 16817657 1.02
104 17915808 17165324 0.96
112 17217482 21343282 1.24
120 20449845 20576123 1.01
128 18700902 20286275 1.08
136 17913378 21142921 1.18
144 18225673 18971921 1.04
152 18374206 19229854 1.05
160 23136514 20129504 0.87
168 21096269 17167777 0.81
176 21376794 21594914 1.01
184 23542989 20638298 0.88
192 22793754 20655980 0.91
200 20933027 19628316 0.94
208 23105684 25572720 1.11
216 24158081 23173848 0.96
224 23388984 22485353 0.96
232 21916401 23899343 1.09
240 22292129 22831784 1.02
248 25812762 22636787 0.88
256 24294738 26127113 1.08
will-it-scale open1_threads - arm64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 844452.32 801 0.09 804936.92 900.25 0.11 0.95
2 1309419.08 9495.78 0.73 1265080.24 3171.13 0.25 0.97
4 2113074.24 5363.19 0.25 2041158.28 7883.65 0.39 0.97
8 1916650.96 15749.86 0.82 2039850.04 7562.87 0.37 1.06
16 1835540.72 12940.45 0.7 1937398.56 11461.15 0.59 1.06
24 1876760.48 12581.67 0.67 1966659.16 10012.69 0.51 1.05
32 1834525.6 5571.08 0.3 1929180.4 6221.96 0.32 1.05
40 1851592.76 7848.18 0.42 1937504.44 5991.55 0.31 1.05
48 1845067 4118.68 0.22 1773331.56 6068.23 0.34 0.96
56 1742709.36 6874.03 0.39 1716184.92 6713.16 0.39 0.98
64 1685339.72 6688.91 0.4 1676046.16 5844.06 0.35 0.99
72 1694838.84 2433.41 0.14 1821189.6 2906.89 0.16 1.07
80 1738778.68 2916.74 0.17 1729212.6 3714.41 0.21 0.99
88 1753131.76 2734.34 0.16 1713294.32 4652.82 0.27 0.98
96 1694112.52 4449.69 0.26 1714438.36 5621.66 0.33 1.01
104 1780279.76 2420.52 0.14 1767679.12 3067.66 0.17 0.99
112 1700284.72 9796.23 0.58 1796674.6 4066.06 0.23 1.06
120 1760466.72 3978.65 0.23 1704706.08 4080.04 0.24 0.97
128 1634067.96 5187.94 0.32 1764115.48 3545.02 0.2 1.08
136 1170303.84 7602.29 0.65 1227188.04 8090.84 0.66 1.05
144 953186.16 7859.02 0.82 964822.08 10536.61 1.09 1.01
152 818893.96 7238.86 0.88 853412.44 5932.25 0.7 1.04
160 707460.48 3868.26 0.55 746985.68 10363.03 1.39 1.06
168 658380.56 4938.77 0.75 672101.12 5442.95 0.81 1.02
176 614692.04 3137.74 0.51 615143.36 6197.19 1.01 1
184 574808.88 4741.61 0.82 592395.08 8840.92 1.49 1.03
192 548142.92 6116.31 1.12 571299.68 8388.56 1.47 1.04
200 511621.96 2182.33 0.43 532144.88 5467.04 1.03 1.04
208 506583.32 6834.39 1.35 521427.08 10318.65 1.98 1.03
216 480438.04 3608.96 0.75 510697.76 8086.47 1.58 1.06
224 470644.96 3451.35 0.73 467433.92 5008.59 1.07 0.99
232 466973.72 6599.97 1.41 444345.92 2144.96 0.48 0.95
240 442927.68 2351.56 0.53 440503.56 4289.01 0.97 0.99
248 432991.16 5829.92 1.35 445462.6 5944.03 1.33 1.03
256 409455.44 1430.5 0.35 422219.4 4007.04 0.95 1.03
will-it-scale open2_threads - arm64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 818645.4 1097.02 0.13 774110.24 1562.45 0.2 0.95
2 1281013.04 2188.78 0.17 1238346.24 2149.97 0.17 0.97
4 2058514.16 13105.36 0.64 1985375 3204.48 0.16 0.96
8 1920414.8 16154.63 0.84 1911667.92 8882.98 0.46 1
16 1943729.68 8714.38 0.45 1978946.72 7465.65 0.38 1.02
24 1915846.88 7749.9 0.4 1914442.72 9841.71 0.51 1
32 1964695.92 8854.83 0.45 1914650.28 9357.82 0.49 0.97
40 1845071.12 5103.26 0.28 1891685.44 4278.34 0.23 1.03
48 1838897.6 5123.61 0.28 1843498.2 5391.94 0.29 1
56 1823768.32 3214.14 0.18 1736477.48 5675.49 0.33 0.95
64 1627162.36 3528.1 0.22 1685727.16 6102.63 0.36 1.04
72 1725320.16 4709.83 0.27 1710174.4 6707.54 0.39 0.99
80 1692288.44 9110.89 0.54 1773676.24 4327.94 0.24 1.05
88 1725496.64 4249.71 0.25 1695173.84 5097.14 0.3 0.98
96 1766093.08 2280.09 0.13 1732782.64 3606.1 0.21 0.98
104 1647753 2926.83 0.18 1710876.4 4416.04 0.26 1.04
112 1763785.52 3838.26 0.22 1803813.76 1859.2 0.1 1.02
120 1684095.16 2385.31 0.14 1766903.08 3258.34 0.18 1.05
128 1733528.56 2800.62 0.16 1677446.32 3201.14 0.19 0.97
136 1179187.84 6804.86 0.58 1241839.52 10698.51 0.86 1.05
144 969456.36 6421.85 0.66 1018441.96 8732.19 0.86 1.05
152 839295.64 10422.66 1.24 817531.92 6778.37 0.83 0.97
160 743010.72 6957.98 0.94 749291.16 9388.47 1.25 1.01
168 666049.88 13159.73 1.98 689408.08 10192.66 1.48 1.04
176 609185.56 5685.18 0.93 653744.24 10847.35 1.66 1.07
184 602232.08 12089.72 2.01 597718.6 13856.45 2.32 0.99
192 563919.32 9870.46 1.75 560080.4 8388.47 1.5 0.99
200 522396.28 4155.61 0.8 539168.64 10456.64 1.94 1.03
208 520328.28 9353.14 1.8 510011.4 6061.19 1.19 0.98
216 479797.72 5824.58 1.21 486955.32 4547.05 0.93 1.01
224 467943.8 4484.86 0.96 473252.76 5608.58 1.19 1.01
232 456914.24 3129.5 0.68 457463.2 7474.83 1.63 1
240 450535 5149.78 1.14 437653.56 4604.92 1.05 0.97
248 435475.2 2350.87 0.54 435589.24 6176.01 1.42 1
256 416737.88 2592.76 0.62 424178.28 3932.2 0.93 1.02
will-it-scale lock1_threads - arm64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 2512077.52 3026.1 0.12 2085365.92 1612.44 0.08 0.83
2 4840180.4 3646.31 0.08 4326922.24 3802.17 0.09 0.89
4 9358779.44 6673.07 0.07 8467588.56 5577.05 0.07 0.9
8 9374436.88 18826.26 0.2 8635110.16 4217.66 0.05 0.92
16 9527184.08 14111.94 0.15 8561174.16 3258.6 0.04 0.9
24 8873099.76 17242.32 0.19 9286778.72 4124.51 0.04 1.05
32 8457640.4 10790.92 0.13 8700401.52 5110 0.06 1.03
40 8478771.76 13250.8 0.16 8746198.16 7606.42 0.09 1.03
48 8329097.76 7958.92 0.1 8774265.36 6082.08 0.07 1.05
56 8330143.04 11586.93 0.14 8472426.48 7402.13 0.09 1.02
64 8334684.08 10478.03 0.13 7979193.52 8436.63 0.11 0.96
72 7941815.52 16031.38 0.2 8016885.52 12640.56 0.16 1.01
80 8042221.68 10219.93 0.13 8072222.88 12479.54 0.15 1
88 8190336.8 10751.38 0.13 8432977.6 11865.67 0.14 1.03
96 8235010.08 7267.8 0.09 8022101.28 11910.63 0.15 0.97
104 8154434.08 7770.8 0.1 7987812 7647.42 0.1 0.98
112 7738464.56 11067.72 0.14 7968483.92 20632.93 0.26 1.03
120 8228919.36 10395.79 0.13 8304329.28 11913.76 0.14 1.01
128 7798646.64 8877.8 0.11 8197938.4 7527.81 0.09 1.05
136 5567293.68 66259.82 1.19 5642017.12 126584.59 2.24 1.01
144 4425655.52 55729.96 1.26 4519874.64 82996.01 1.84 1.02
152 3871300.8 77793.78 2.01 3850025.04 80167.3 2.08 0.99
160 3558041.68 55108.3 1.55 3495924.96 83626.42 2.39 0.98
168 3302042.72 45011.89 1.36 3298002.8 59393.64 1.8 1
176 3066165.2 34896.54 1.14 3063027.44 58219.26 1.9 1
184 2817899.6 43585.27 1.55 2859393.84 45258.03 1.58 1.01
192 2690403.76 42236.77 1.57 2630652.24 35953.13 1.37 0.98
200 2563141.44 28145.43 1.1 2539964.32 38556.52 1.52 0.99
208 2502968.8 27687.81 1.11 2477757.28 28240.81 1.14 0.99
216 2474917.76 24128.71 0.97 2483161.44 32198.37 1.3 1
224 2386874.72 32954.66 1.38 2398068.48 37667.29 1.57 1
232 2379248.24 27413.4 1.15 2327601.68 24565.28 1.06 0.98
240 2302146.64 19914.19 0.87 2236074.64 20968.17 0.94 0.97
248 2241798.32 21542.52 0.96 2173312.24 26498.36 1.22 0.97
256 2198765.12 20832.66 0.95 2136159.52 25027.96 1.17 0.97
will-it-scale lock2_threads - arm64
Threads QL QL stddev stddev% RQL RQL stddev stddev% Speedup
-----------------------------------------------------------------------------------------------
1 2499414.32 1932.27 0.08 2075704.8 24589.71 1.18 0.83
2 3887820 34198.36 0.88 4057432.64 11896.04 0.29 1.04
4 3445307.6 7958.3 0.23 3869960.4 3788.5 0.1 1.12
8 4310597.2 14405.9 0.33 3931319.76 5845.33 0.15 0.91
16 3995159.84 22621.85 0.57 3953339.68 15668.9 0.4 0.99
24 4048456.88 22956.51 0.57 3887812.64 30584.77 0.79 0.96
32 3974808.64 20465.87 0.51 3718778.08 27407.24 0.74 0.94
40 3941154.88 15136.68 0.38 3551464.24 33378.67 0.94 0.9
48 3725436.32 17090.67 0.46 3714356.08 19035.26 0.51 1
56 3558449.44 10123.46 0.28 3449656.08 36476.87 1.06 0.97
64 3514616.08 16470.99 0.47 3493197.04 25639.82 0.73 0.99
72 3461700.88 16780.97 0.48 3376565.04 16930.19 0.5 0.98
80 3797008.64 17599.05 0.46 3505856.16 34320.34 0.98 0.92
88 3737459.44 10774.93 0.29 3631757.68 24231.29 0.67 0.97
96 3612816.16 21865.86 0.61 3545354.56 16391.15 0.46 0.98
104 3765167.36 17763.8 0.47 3466467.12 22235.45 0.64 0.92
112 3713386 15455.21 0.42 3402210 18349.66 0.54 0.92
120 3699986.08 15153.08 0.41 3580303.92 19823.01 0.55 0.97
128 3648694.56 11891.62 0.33 3426445.28 22993.32 0.67 0.94
136 800046.88 6039.73 0.75 784412.16 9062.03 1.16 0.98
144 769483.36 5231.74 0.68 714132.8 8953.57 1.25 0.93
152 821081.52 4249.12 0.52 743694.64 8155.18 1.1 0.91
160 789040.16 9187.4 1.16 834865.44 6159.29 0.74 1.06
168 867742.4 8967.66 1.03 734905.36 15582.75 2.12 0.85
176 838650.32 7949.72 0.95 846939.68 8959.8 1.06 1.01
184 854984.48 19475.51 2.28 794549.92 11924.54 1.5 0.93
192 846262.32 13795.86 1.63 899915.12 8639.82 0.96 1.06
200 942602.16 12665.42 1.34 900385.76 8592.23 0.95 0.96
208 954183.68 12853.22 1.35 1166186.96 13045.03 1.12 1.22
216 929319.76 10157.79 1.09 926773.76 10577.01 1.14 1
224 967896.56 9819.6 1.01 951144.32 12343.83 1.3 0.98
232 990621.12 7771.97 0.78 916361.2 17878.44 1.95 0.93
240 995285.04 20104.22 2.02 972119.6 12856.42 1.32 0.98
248 1029436 20404.97 1.98 965301.28 11102.95 1.15 0.94
256 1038724.8 19201.03 1.85 1029942.08 12563.07 1.22 0.99
Written By
----------
Alexei Starovoitov <ast@kernel.org>
Kumar Kartikeya Dwivedi <memxor@gmail.com>
[0]: https://www.cs.rochester.edu/research/synchronization/pseudocode/timeout.html
[1]: https://dl.acm.org/doi/10.1145/571825.571830
[2]: https://github.com/kkdwivedi/rqspinlock/blob/main/rqspinlock.pdf
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/plain/qspinlock.tla
[4]: https://arxiv.org/pdf/1810.05600
Kumar Kartikeya Dwivedi (26):
locking: Move MCS struct definition to public header
locking: Move common qspinlock helpers to a private header
locking: Allow obtaining result of arch_mcs_spin_lock_contended
locking: Copy out qspinlock.c to rqspinlock.c
rqspinlock: Add rqspinlock.h header
rqspinlock: Drop PV and virtualization support
rqspinlock: Add support for timeouts
rqspinlock: Protect pending bit owners from stalls
rqspinlock: Protect waiters in queue from stalls
rqspinlock: Protect waiters in trylock fallback from stalls
rqspinlock: Add deadlock detection and recovery
rqspinlock: Add a test-and-set fallback
rqspinlock: Add basic support for CONFIG_PARAVIRT
rqspinlock: Add helper to print a splat on timeout or deadlock
rqspinlock: Add macros for rqspinlock usage
rqspinlock: Add locktorture support
rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
rqspinlock: Add entry to Makefile, MAINTAINERS
bpf: Convert hashtab.c to rqspinlock
bpf: Convert percpu_freelist.c to rqspinlock
bpf: Convert lpm_trie.c to rqspinlock
bpf: Introduce rqspinlock kfuncs
bpf: Handle allocation failure in acquire_lock_state
bpf: Implement verifier support for rqspinlock
bpf: Maintain FIFO property for rqspinlock unlock
selftests/bpf: Add tests for rqspinlock
MAINTAINERS | 3 +
arch/x86/include/asm/rqspinlock.h | 29 +
include/asm-generic/Kbuild | 1 +
include/asm-generic/mcs_spinlock.h | 6 +
include/asm-generic/rqspinlock.h | 215 +++++
include/linux/bpf.h | 10 +
include/linux/bpf_verifier.h | 20 +-
kernel/bpf/btf.c | 26 +-
kernel/bpf/hashtab.c | 102 +--
kernel/bpf/lpm_trie.c | 25 +-
kernel/bpf/percpu_freelist.c | 113 +--
kernel/bpf/percpu_freelist.h | 4 +-
kernel/bpf/syscall.c | 6 +-
kernel/bpf/verifier.c | 250 +++++-
kernel/locking/Makefile | 1 +
kernel/locking/lock_events_list.h | 5 +
kernel/locking/locktorture.c | 51 ++
kernel/locking/mcs_spinlock.h | 10 +-
kernel/locking/qspinlock.c | 193 +----
kernel/locking/qspinlock.h | 200 +++++
kernel/locking/rqspinlock.c | 766 ++++++++++++++++++
kernel/locking/rqspinlock.h | 48 ++
.../selftests/bpf/prog_tests/res_spin_lock.c | 99 +++
tools/testing/selftests/bpf/progs/irq.c | 53 ++
.../selftests/bpf/progs/res_spin_lock.c | 143 ++++
.../selftests/bpf/progs/res_spin_lock_fail.c | 244 ++++++
26 files changed, 2207 insertions(+), 416 deletions(-)
create mode 100644 arch/x86/include/asm/rqspinlock.h
create mode 100644 include/asm-generic/rqspinlock.h
create mode 100644 kernel/locking/qspinlock.h
create mode 100644 kernel/locking/rqspinlock.c
create mode 100644 kernel/locking/rqspinlock.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/res_spin_lock.c
create mode 100644 tools/testing/selftests/bpf/progs/res_spin_lock.c
create mode 100644 tools/testing/selftests/bpf/progs/res_spin_lock_fail.c
base-commit: 0abff462d802a352c87b7f5e71b442b09bf9cfff
--
2.43.5
* [PATCH bpf-next v2 01/26] locking: Move MCS struct definition to public header
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
Move the definition of the struct mcs_spinlock from the private
mcs_spinlock.h header in kernel/locking to the mcs_spinlock.h
asm-generic header, since we will need to reference it from the
qspinlock.h header in subsequent commits.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/mcs_spinlock.h | 6 ++++++
kernel/locking/mcs_spinlock.h | 6 ------
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
index 10cd4ffc6ba2..39c94012b88a 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -1,6 +1,12 @@
#ifndef __ASM_MCS_SPINLOCK_H
#define __ASM_MCS_SPINLOCK_H
+struct mcs_spinlock {
+ struct mcs_spinlock *next;
+ int locked; /* 1 if lock acquired */
+ int count; /* nesting count, see qspinlock.c */
+};
+
/*
* Architectures can define their own:
*
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 85251d8771d9..16160ca8907f 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -15,12 +15,6 @@
#include <asm/mcs_spinlock.h>
-struct mcs_spinlock {
- struct mcs_spinlock *next;
- int locked; /* 1 if lock acquired */
- int count; /* nesting count, see qspinlock.c */
-};
-
#ifndef arch_mcs_spin_lock_contended
/*
* Using smp_cond_load_acquire() provides the acquire semantics
--
2.43.5
* [PATCH bpf-next v2 02/26] locking: Move common qspinlock helpers to a private header
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
Move the qspinlock helper functions that encode and decode the tail
word, set and clear the pending and locked bits, and other miscellaneous
definitions and macros to a private header. To this end, create a
qspinlock.h header file in kernel/locking. Subsequent commits will
introduce a modified qspinlock slow path function, so moving the shared
code to a private header helps minimize unnecessary code duplication.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/qspinlock.c | 193 +----------------------------------
kernel/locking/qspinlock.h | 200 +++++++++++++++++++++++++++++++++++++
2 files changed, 205 insertions(+), 188 deletions(-)
create mode 100644 kernel/locking/qspinlock.h
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7d96bed718e4..af8d122bb649 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -25,8 +25,9 @@
#include <trace/events/lock.h>
/*
- * Include queued spinlock statistics code
+ * Include queued spinlock definitions and statistics code
*/
+#include "qspinlock.h"
#include "qspinlock_stat.h"
/*
@@ -67,36 +68,6 @@
*/
#include "mcs_spinlock.h"
-#define MAX_NODES 4
-
-/*
- * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
- * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
- * that, we insert two more long words to pad it up to 32 bytes. IOW, only
- * two of them can fit in a cacheline in this case. That is OK as it is rare
- * to have more than 2 levels of slowpath nesting in actual use. We don't
- * want to penalize pvqspinlocks to optimize for a rare case in native
- * qspinlocks.
- */
-struct qnode {
- struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
- long reserved[2];
-#endif
-};
-
-/*
- * The pending bit spinning loop count.
- * This heuristic is used to limit the number of lockword accesses
- * made by atomic_cond_read_relaxed when waiting for the lock to
- * transition out of the "== _Q_PENDING_VAL" state. We don't spin
- * indefinitely because there's no guarantee that we'll make forward
- * progress.
- */
-#ifndef _Q_PENDING_LOOPS
-#define _Q_PENDING_LOOPS 1
-#endif
/*
* Per-CPU queue node structures; we can never have more than 4 nested
@@ -106,161 +77,7 @@ struct qnode {
*
* PV doubles the storage and uses the second cacheline for PV state.
*/
-static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
-
-/*
- * We must be able to distinguish between no-tail and the tail at 0:0,
- * therefore increment the cpu number by one.
- */
-
-static inline __pure u32 encode_tail(int cpu, int idx)
-{
- u32 tail;
-
- tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
- tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
-
- return tail;
-}
-
-static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
-{
- int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
- int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
-
- return per_cpu_ptr(&qnodes[idx].mcs, cpu);
-}
-
-static inline __pure
-struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
-{
- return &((struct qnode *)base + idx)->mcs;
-}
-
-#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
-
-#if _Q_PENDING_BITS == 8
-/**
- * clear_pending - clear the pending bit.
- * @lock: Pointer to queued spinlock structure
- *
- * *,1,* -> *,0,*
- */
-static __always_inline void clear_pending(struct qspinlock *lock)
-{
- WRITE_ONCE(lock->pending, 0);
-}
-
-/**
- * clear_pending_set_locked - take ownership and clear the pending bit.
- * @lock: Pointer to queued spinlock structure
- *
- * *,1,0 -> *,0,1
- *
- * Lock stealing is not allowed if this function is used.
- */
-static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
-{
- WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
-}
-
-/*
- * xchg_tail - Put in the new queue tail code word & retrieve previous one
- * @lock : Pointer to queued spinlock structure
- * @tail : The new queue tail code word
- * Return: The previous queue tail code word
- *
- * xchg(lock, tail), which heads an address dependency
- *
- * p,*,* -> n,*,* ; prev = xchg(lock, node)
- */
-static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
-{
- /*
- * We can use relaxed semantics since the caller ensures that the
- * MCS node is properly initialized before updating the tail.
- */
- return (u32)xchg_relaxed(&lock->tail,
- tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
-}
-
-#else /* _Q_PENDING_BITS == 8 */
-
-/**
- * clear_pending - clear the pending bit.
- * @lock: Pointer to queued spinlock structure
- *
- * *,1,* -> *,0,*
- */
-static __always_inline void clear_pending(struct qspinlock *lock)
-{
- atomic_andnot(_Q_PENDING_VAL, &lock->val);
-}
-
-/**
- * clear_pending_set_locked - take ownership and clear the pending bit.
- * @lock: Pointer to queued spinlock structure
- *
- * *,1,0 -> *,0,1
- */
-static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
-{
- atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
-}
-
-/**
- * xchg_tail - Put in the new queue tail code word & retrieve previous one
- * @lock : Pointer to queued spinlock structure
- * @tail : The new queue tail code word
- * Return: The previous queue tail code word
- *
- * xchg(lock, tail)
- *
- * p,*,* -> n,*,* ; prev = xchg(lock, node)
- */
-static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
-{
- u32 old, new;
-
- old = atomic_read(&lock->val);
- do {
- new = (old & _Q_LOCKED_PENDING_MASK) | tail;
- /*
- * We can use relaxed semantics since the caller ensures that
- * the MCS node is properly initialized before updating the
- * tail.
- */
- } while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
-
- return old;
-}
-#endif /* _Q_PENDING_BITS == 8 */
-
-/**
- * queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
- * @lock : Pointer to queued spinlock structure
- * Return: The previous lock value
- *
- * *,*,* -> *,1,*
- */
-#ifndef queued_fetch_set_pending_acquire
-static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
-{
- return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
-}
-#endif
-
-/**
- * set_locked - Set the lock bit and own the lock
- * @lock: Pointer to queued spinlock structure
- *
- * *,*,0 -> *,0,1
- */
-static __always_inline void set_locked(struct qspinlock *lock)
-{
- WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
-}
-
+static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
/*
* Generate the native code for queued_spin_unlock_slowpath(); provide NOPs for
@@ -410,7 +227,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* any MCS node. This is not the most elegant solution, but is
* simple enough.
*/
- if (unlikely(idx >= MAX_NODES)) {
+ if (unlikely(idx >= _Q_MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
@@ -465,7 +282,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* head of the waitqueue.
*/
if (old & _Q_TAIL_MASK) {
- prev = decode_tail(old);
+ prev = decode_tail(old, qnodes);
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
diff --git a/kernel/locking/qspinlock.h b/kernel/locking/qspinlock.h
new file mode 100644
index 000000000000..d4ceb9490365
--- /dev/null
+++ b/kernel/locking/qspinlock.h
@@ -0,0 +1,200 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Queued spinlock defines
+ *
+ * This file contains macro definitions and functions shared between different
+ * qspinlock slow path implementations.
+ */
+#ifndef __LINUX_QSPINLOCK_H
+#define __LINUX_QSPINLOCK_H
+
+#include <asm-generic/percpu.h>
+#include <linux/percpu-defs.h>
+#include <asm-generic/qspinlock.h>
+#include <asm-generic/mcs_spinlock.h>
+
+#define _Q_MAX_NODES 4
+
+/*
+ * The pending bit spinning loop count.
+ * This heuristic is used to limit the number of lockword accesses
+ * made by atomic_cond_read_relaxed when waiting for the lock to
+ * transition out of the "== _Q_PENDING_VAL" state. We don't spin
+ * indefinitely because there's no guarantee that we'll make forward
+ * progress.
+ */
+#ifndef _Q_PENDING_LOOPS
+#define _Q_PENDING_LOOPS 1
+#endif
+
+/*
+ * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
+ * size and four of them will fit nicely in one 64-byte cacheline. For
+ * pvqspinlock, however, we need more space for extra data. To accommodate
+ * that, we insert two more long words to pad it up to 32 bytes. IOW, only
+ * two of them can fit in a cacheline in this case. That is OK as it is rare
+ * to have more than 2 levels of slowpath nesting in actual use. We don't
+ * want to penalize pvqspinlocks to optimize for a rare case in native
+ * qspinlocks.
+ */
+struct qnode {
+ struct mcs_spinlock mcs;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ long reserved[2];
+#endif
+};
+
+/*
+ * We must be able to distinguish between no-tail and the tail at 0:0,
+ * therefore increment the cpu number by one.
+ */
+
+static inline __pure u32 encode_tail(int cpu, int idx)
+{
+ u32 tail;
+
+ tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
+ tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
+
+ return tail;
+}
+
+static inline __pure struct mcs_spinlock *decode_tail(u32 tail, struct qnode *qnodes)
+{
+ int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
+ int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
+
+ return per_cpu_ptr(&qnodes[idx].mcs, cpu);
+}
+
+static inline __pure
+struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
+{
+ return &((struct qnode *)base + idx)->mcs;
+}
+
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
+#if _Q_PENDING_BITS == 8
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ WRITE_ONCE(lock->pending, 0);
+}
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+ WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail), which heads an address dependency
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ /*
+ * We can use relaxed semantics since the caller ensures that the
+ * MCS node is properly initialized before updating the tail.
+ */
+ return (u32)xchg_relaxed(&lock->tail,
+ tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
+/**
+ * clear_pending - clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,* -> *,0,*
+ */
+static __always_inline void clear_pending(struct qspinlock *lock)
+{
+ atomic_andnot(_Q_PENDING_VAL, &lock->val);
+}
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+ atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ u32 old, new;
+
+ old = atomic_read(&lock->val);
+ do {
+ new = (old & _Q_LOCKED_PENDING_MASK) | tail;
+ /*
+ * We can use relaxed semantics since the caller ensures that
+ * the MCS node is properly initialized before updating the
+ * tail.
+ */
+ } while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
+
+ return old;
+}
+#endif /* _Q_PENDING_BITS == 8 */
+
+/**
+ * queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
+ * @lock : Pointer to queued spinlock structure
+ * Return: The previous lock value
+ *
+ * *,*,* -> *,1,*
+ */
+#ifndef queued_fetch_set_pending_acquire
+static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lock)
+{
+ return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
+}
+#endif
+
+/**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+ WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
+}
+
+#endif /* __LINUX_QSPINLOCK_H */
--
2.43.5
* [PATCH bpf-next v2 03/26] locking: Allow obtaining result of arch_mcs_spin_lock_contended
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
To support upcoming changes that require inspecting the return value
once the conditional waiting loop in arch_mcs_spin_lock_contended
terminates, modify the macro to preserve the result of
smp_cond_load_acquire. This enables checking the return value as needed,
which will help disambiguate the MCS node’s locked state in future
patches.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/mcs_spinlock.h | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 16160ca8907f..5c92ba199b90 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -24,9 +24,7 @@
* spinning, and smp_cond_load_acquire() provides that behavior.
*/
#define arch_mcs_spin_lock_contended(l) \
-do { \
- smp_cond_load_acquire(l, VAL); \
-} while (0)
+ smp_cond_load_acquire(l, VAL)
#endif
#ifndef arch_mcs_spin_unlock_contended
--
2.43.5
* [PATCH bpf-next v2 04/26] locking: Copy out qspinlock.c to rqspinlock.c
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
In preparation for introducing a new lock implementation, Resilient
Queued Spin Lock, or rqspinlock, we begin our modifications by using the
existing qspinlock.c code as the base. Simply copy the code to a new
file and rename functions and variables from 'queued' to
'resilient_queued'.
This helps each subsequent commit clearly show how and where the code is
being changed. The only change after a literal copy in this commit is
renaming the functions where necessary.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/rqspinlock.c | 410 ++++++++++++++++++++++++++++++++++++
1 file changed, 410 insertions(+)
create mode 100644 kernel/locking/rqspinlock.c
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
new file mode 100644
index 000000000000..caaa7c9bbc79
--- /dev/null
+++ b/kernel/locking/rqspinlock.c
@@ -0,0 +1,410 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Resilient Queued Spin Lock
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2013-2014,2018 Red Hat, Inc.
+ * (C) Copyright 2015 Intel Corp.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <longman@redhat.com>
+ * Peter Zijlstra <peterz@infradead.org>
+ */
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/prefetch.h>
+#include <asm/byteorder.h>
+#include <asm/qspinlock.h>
+#include <trace/events/lock.h>
+
+/*
+ * Include queued spinlock definitions and statistics code
+ */
+#include "qspinlock.h"
+#include "qspinlock_stat.h"
+
+/*
+ * The basic principle of a queue-based spinlock can best be understood
+ * by studying a classic queue-based spinlock implementation called the
+ * MCS lock. A copy of the original MCS lock paper ("Algorithms for Scalable
+ * Synchronization on Shared-Memory Multiprocessors by Mellor-Crummey and
+ * Scott") is available at
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=206115
+ *
+ * This queued spinlock implementation is based on the MCS lock, however to
+ * make it fit the 4 bytes we assume spinlock_t to be, and preserve its
+ * existing API, we must modify it somehow.
+ *
+ * In particular; where the traditional MCS lock consists of a tail pointer
+ * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
+ * unlock the next pending (next->locked), we compress both these: {tail,
+ * next->locked} into a single u32 value.
+ *
+ * Since a spinlock disables recursion of its own context and there is a limit
+ * to the contexts that can nest; namely: task, softirq, hardirq, nmi. As there
+ * are at most 4 nesting levels, it can be encoded by a 2-bit number. Now
+ * we can encode the tail by combining the 2-bit nesting level with the cpu
+ * number. With one byte for the lock value and 3 bytes for the tail, only a
+ * 32-bit word is now needed. Even though we only need 1 bit for the lock,
+ * we extend it to a full byte to achieve better performance for architectures
+ * that support atomic byte write.
+ *
+ * We also change the first spinner to spin on the lock bit instead of its
+ * node; whereby avoiding the need to carry a node from lock to unlock, and
+ * preserving existing lock API. This also makes the unlock code simpler and
+ * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ * atomic operations on smaller 8-bit and 16-bit data types.
+ *
+ */
+
+#include "mcs_spinlock.h"
+
+/*
+ * Per-CPU queue node structures; we can never have more than 4 nested
+ * contexts: task, softirq, hardirq, nmi.
+ *
+ * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
+
+/*
+ * Generate the native code for resilient_queued_spin_unlock_slowpath(); provide NOPs
+ * for all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+ struct mcs_spinlock *prev) { }
+static __always_inline void __pv_kick_node(struct qspinlock *lock,
+ struct mcs_spinlock *node) { }
+static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
+ struct mcs_spinlock *node)
+ { return 0; }
+
+#define pv_enabled() false
+
+#define pv_init_node __pv_init_node
+#define pv_wait_node __pv_wait_node
+#define pv_kick_node __pv_kick_node
+#define pv_wait_head_or_lock __pv_wait_head_or_lock
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define resilient_queued_spin_lock_slowpath native_resilient_queued_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
+/**
+ * resilient_queued_spin_lock_slowpath - acquire the queued spinlock
+ * @lock: Pointer to queued spinlock structure
+ * @val: Current value of the queued spinlock 32-bit word
+ *
+ * (queue tail, pending bit, lock value)
+ *
+ * fast : slow : unlock
+ * : :
+ * uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ * : | ^--------.------. / :
+ * : v \ \ | :
+ * pending : (0,1,1) +--> (0,1,0) \ | :
+ * : | ^--' | | :
+ * : v | | :
+ * uncontended : (n,x,y) +--> (n,0,0) --' | :
+ * queue : | ^--' | :
+ * : v | :
+ * contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
+ * queue : ^--' :
+ */
+void __lockfunc resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ struct mcs_spinlock *prev, *next, *node;
+ u32 old, tail;
+ int idx;
+
+ BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
+ if (pv_enabled())
+ goto pv_queue;
+
+ if (virt_spin_lock(lock))
+ return;
+
+ /*
+ * Wait for in-progress pending->locked hand-overs with a bounded
+ * number of spins so that we guarantee forward progress.
+ *
+ * 0,1,0 -> 0,0,1
+ */
+ if (val == _Q_PENDING_VAL) {
+ int cnt = _Q_PENDING_LOOPS;
+ val = atomic_cond_read_relaxed(&lock->val,
+ (VAL != _Q_PENDING_VAL) || !cnt--);
+ }
+
+ /*
+ * If we observe any contention; queue.
+ */
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+
+ /*
+ * trylock || pending
+ *
+ * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
+ */
+ val = queued_fetch_set_pending_acquire(lock);
+
+ /*
+ * If we observe contention, there is a concurrent locker.
+ *
+ * Undo and queue; our setting of PENDING might have made the
+ * n,0,0 -> 0,0,0 transition fail and it will now be waiting
+ * on @next to become !NULL.
+ */
+ if (unlikely(val & ~_Q_LOCKED_MASK)) {
+
+ /* Undo PENDING if we set it. */
+ if (!(val & _Q_PENDING_MASK))
+ clear_pending(lock);
+
+ goto queue;
+ }
+
+ /*
+ * We're pending, wait for the owner to go away.
+ *
+ * 0,1,1 -> *,1,0
+ *
+ * this wait loop must be a load-acquire such that we match the
+ * store-release that clears the locked bit and create lock
+ * sequentiality; this is because not all
+ * clear_pending_set_locked() implementations imply full
+ * barriers.
+ */
+ if (val & _Q_LOCKED_MASK)
+ smp_cond_load_acquire(&lock->locked, !VAL);
+
+ /*
+ * take ownership and clear the pending bit.
+ *
+ * 0,1,0 -> 0,0,1
+ */
+ clear_pending_set_locked(lock);
+ lockevent_inc(lock_pending);
+ return;
+
+ /*
+ * End of pending bit optimistic spinning and beginning of MCS
+ * queuing.
+ */
+queue:
+ lockevent_inc(lock_slowpath);
+pv_queue:
+ node = this_cpu_ptr(&qnodes[0].mcs);
+ idx = node->count++;
+ tail = encode_tail(smp_processor_id(), idx);
+
+ trace_contention_begin(lock, LCB_F_SPIN);
+
+ /*
+ * 4 nodes are allocated based on the assumption that there will
+ * not be nested NMIs taking spinlocks. That may not be true in
+ * some architectures even though the chance of needing more than
+ * 4 nodes will still be extremely unlikely. When that happens,
+ * we fall back to spinning on the lock directly without using
+ * any MCS node. This is not the most elegant solution, but is
+ * simple enough.
+ */
+ if (unlikely(idx >= _Q_MAX_NODES)) {
+ lockevent_inc(lock_no_node);
+ while (!queued_spin_trylock(lock))
+ cpu_relax();
+ goto release;
+ }
+
+ node = grab_mcs_node(node, idx);
+
+ /*
+ * Keep counts of non-zero index values:
+ */
+ lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
+
+ /*
+ * Ensure that we increment the head node->count before initialising
+ * the actual node. If the compiler is kind enough to reorder these
+ * stores, then an IRQ could overwrite our assignments.
+ */
+ barrier();
+
+ node->locked = 0;
+ node->next = NULL;
+ pv_init_node(node);
+
+ /*
+ * We touched a (possibly) cold cacheline in the per-cpu queue node;
+ * attempt the trylock once more in the hope someone let go while we
+ * weren't watching.
+ */
+ if (queued_spin_trylock(lock))
+ goto release;
+
+ /*
+ * Ensure that the initialisation of @node is complete before we
+ * publish the updated tail via xchg_tail() and potentially link
+ * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
+ */
+ smp_wmb();
+
+ /*
+ * Publish the updated tail.
+ * We have already touched the queueing cacheline; don't bother with
+ * pending stuff.
+ *
+ * p,*,* -> n,*,*
+ */
+ old = xchg_tail(lock, tail);
+ next = NULL;
+
+ /*
+ * if there was a previous node; link it and wait until reaching the
+ * head of the waitqueue.
+ */
+ if (old & _Q_TAIL_MASK) {
+ prev = decode_tail(old, qnodes);
+
+ /* Link @node into the waitqueue. */
+ WRITE_ONCE(prev->next, node);
+
+ pv_wait_node(node, prev);
+ arch_mcs_spin_lock_contended(&node->locked);
+
+ /*
+ * While waiting for the MCS lock, the next pointer may have
+ * been set by another lock waiter. We optimistically load
+ * the next pointer & prefetch the cacheline for writing
+ * to reduce latency in the upcoming MCS unlock operation.
+ */
+ next = READ_ONCE(node->next);
+ if (next)
+ prefetchw(next);
+ }
+
+ /*
+ * we're at the head of the waitqueue, wait for the owner & pending to
+ * go away.
+ *
+ * *,x,y -> *,0,0
+ *
+ * this wait loop must use a load-acquire such that we match the
+ * store-release that clears the locked bit and create lock
+ * sequentiality; this is because the set_locked() function below
+ * does not imply a full barrier.
+ *
+ * The PV pv_wait_head_or_lock function, if active, will acquire
+ * the lock and return a non-zero value. So we have to skip the
+ * atomic_cond_read_acquire() call. As the next PV queue head hasn't
+ * been designated yet, there is no way for the locked value to become
+ * _Q_SLOW_VAL. So both the set_locked() and the
+ * atomic_cmpxchg_relaxed() calls will be safe.
+ *
+ * If PV isn't active, 0 will be returned instead.
+ *
+ */
+ if ((val = pv_wait_head_or_lock(lock, node)))
+ goto locked;
+
+ val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+
+locked:
+ /*
+ * claim the lock:
+ *
+ * n,0,0 -> 0,0,1 : lock, uncontended
+ * *,*,0 -> *,*,1 : lock, contended
+ *
+ * If the queue head is the only one in the queue (lock value == tail)
+ * and nobody is pending, clear the tail code and grab the lock.
+ * Otherwise, we only need to grab the lock.
+ */
+
+ /*
+ * In the PV case we might already have _Q_LOCKED_VAL set, because
+ * of lock stealing; therefore we must also allow:
+ *
+ * n,0,1 -> 0,0,1
+ *
+ * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
+ * above wait condition, therefore any concurrent setting of
+ * PENDING will make the uncontended transition fail.
+ */
+ if ((val & _Q_TAIL_MASK) == tail) {
+ if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+ goto release; /* No contention */
+ }
+
+ /*
+ * Either somebody is queued behind us or _Q_PENDING_VAL got set
+ * which will then detect the remaining tail and queue behind us
+ * ensuring we'll see a @next.
+ */
+ set_locked(lock);
+
+ /*
+ * contended path; wait for next if not observed yet, release.
+ */
+ if (!next)
+ next = smp_cond_load_relaxed(&node->next, (VAL));
+
+ arch_mcs_spin_unlock_contended(&next->locked);
+ pv_kick_node(lock, next);
+
+release:
+ trace_contention_end(lock, 0);
+
+ /*
+ * release the node
+ */
+ __this_cpu_dec(qnodes[0].mcs.count);
+}
+EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
+
+/*
+ * Generate the paravirt code for resilient_queued_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef pv_enabled
+#define pv_enabled() true
+
+#undef pv_init_node
+#undef pv_wait_node
+#undef pv_kick_node
+#undef pv_wait_head_or_lock
+
+#undef resilient_queued_spin_lock_slowpath
+#define resilient_queued_spin_lock_slowpath __pv_resilient_queued_spin_lock_slowpath
+
+#include "qspinlock_paravirt.h"
+#include "rqspinlock.c"
+
+bool nopvspin;
+static __init int parse_nopvspin(char *arg)
+{
+ nopvspin = true;
+ return 0;
+}
+early_param("nopvspin", parse_nopvspin);
+#endif
--
2.43.5
* [PATCH bpf-next v2 05/26] rqspinlock: Add rqspinlock.h header
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (3 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 04/26] locking: Copy out qspinlock.c to rqspinlock.c Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 06/26] rqspinlock: Drop PV and virtualization support Kumar Kartikeya Dwivedi
` (23 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
This header contains the public declarations usable in the rest of the
kernel for rqspinlock.
Let's also add a type alias, rqspinlock_t, for struct qspinlock to ensure
consistent use of the new lock type. We want to remove the dependence on
the qspinlock type in later patches, as we need to provide a test-and-set
fallback; hence, begin abstracting it away from now on.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/rqspinlock.h | 19 +++++++++++++++++++
kernel/locking/rqspinlock.c | 3 ++-
2 files changed, 21 insertions(+), 1 deletion(-)
create mode 100644 include/asm-generic/rqspinlock.h
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
new file mode 100644
index 000000000000..54860b519571
--- /dev/null
+++ b/include/asm-generic/rqspinlock.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Resilient Queued Spin Lock
+ *
+ * (C) Copyright 2024 Meta Platforms, Inc. and affiliates.
+ *
+ * Authors: Kumar Kartikeya Dwivedi <memxor@gmail.com>
+ */
+#ifndef __ASM_GENERIC_RQSPINLOCK_H
+#define __ASM_GENERIC_RQSPINLOCK_H
+
+#include <linux/types.h>
+
+struct qspinlock;
+typedef struct qspinlock rqspinlock_t;
+
+extern void resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val);
+
+#endif /* __ASM_GENERIC_RQSPINLOCK_H */
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index caaa7c9bbc79..18eb9ef3e908 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -23,6 +23,7 @@
#include <asm/byteorder.h>
#include <asm/qspinlock.h>
#include <trace/events/lock.h>
+#include <asm/rqspinlock.h>
/*
* Include queued spinlock definitions and statistics code
@@ -127,7 +128,7 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
-void __lockfunc resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
--
2.43.5
* [PATCH bpf-next v2 06/26] rqspinlock: Drop PV and virtualization support
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (4 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 05/26] rqspinlock: Add rqspinlock.h header Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts Kumar Kartikeya Dwivedi
` (22 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
Changes to rqspinlock in subsequent commits will be algorithmic
modifications that will no longer agree with the implementations of
paravirt spinlock and virt_spin_lock support. These future changes
include measures for terminating waiting loops in the slow path after a
certain point. While using a fair lock like qspinlock directly inside
virtual machines leads to suboptimal performance under certain
conditions, we cannot reuse the existing virtualization support before we
make it resilient as well. Therefore, drop it for now.
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/rqspinlock.c | 89 -------------------------------------
1 file changed, 89 deletions(-)
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 18eb9ef3e908..52db60cd9691 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -11,8 +11,6 @@
* Peter Zijlstra <peterz@infradead.org>
*/
-#ifndef _GEN_PV_LOCK_SLOWPATH
-
#include <linux/smp.h>
#include <linux/bug.h>
#include <linux/cpumask.h>
@@ -75,38 +73,9 @@
* contexts: task, softirq, hardirq, nmi.
*
* Exactly fits one 64-byte cacheline on a 64-bit architecture.
- *
- * PV doubles the storage and uses the second cacheline for PV state.
*/
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
-/*
- * Generate the native code for resilient_queued_spin_unlock_slowpath(); provide NOPs
- * for all the PV callbacks.
- */
-
-static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
- struct mcs_spinlock *prev) { }
-static __always_inline void __pv_kick_node(struct qspinlock *lock,
- struct mcs_spinlock *node) { }
-static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
- struct mcs_spinlock *node)
- { return 0; }
-
-#define pv_enabled() false
-
-#define pv_init_node __pv_init_node
-#define pv_wait_node __pv_wait_node
-#define pv_kick_node __pv_kick_node
-#define pv_wait_head_or_lock __pv_wait_head_or_lock
-
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define resilient_queued_spin_lock_slowpath native_resilient_queued_spin_lock_slowpath
-#endif
-
-#endif /* _GEN_PV_LOCK_SLOWPATH */
-
/**
* resilient_queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
@@ -136,12 +105,6 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
- if (pv_enabled())
- goto pv_queue;
-
- if (virt_spin_lock(lock))
- return;
-
/*
* Wait for in-progress pending->locked hand-overs with a bounded
* number of spins so that we guarantee forward progress.
@@ -212,7 +175,6 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
*/
queue:
lockevent_inc(lock_slowpath);
-pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -251,7 +213,6 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
node->locked = 0;
node->next = NULL;
- pv_init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -288,7 +249,6 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
- pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
/*
@@ -312,23 +272,9 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
* store-release that clears the locked bit and create lock
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
- *
- * The PV pv_wait_head_or_lock function, if active, will acquire
- * the lock and return a non-zero value. So we have to skip the
- * atomic_cond_read_acquire() call. As the next PV queue head hasn't
- * been designated yet, there is no way for the locked value to become
- * _Q_SLOW_VAL. So both the set_locked() and the
- * atomic_cmpxchg_relaxed() calls will be safe.
- *
- * If PV isn't active, 0 will be returned instead.
- *
*/
- if ((val = pv_wait_head_or_lock(lock, node)))
- goto locked;
-
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
-locked:
/*
* claim the lock:
*
@@ -341,11 +287,6 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
*/
/*
- * In the PV case we might already have _Q_LOCKED_VAL set, because
- * of lock stealing; therefore we must also allow:
- *
- * n,0,1 -> 0,0,1
- *
* Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
* above wait condition, therefore any concurrent setting of
* PENDING will make the uncontended transition fail.
@@ -369,7 +310,6 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
@@ -380,32 +320,3 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
__this_cpu_dec(qnodes[0].mcs.count);
}
EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
-
-/*
- * Generate the paravirt code for resilient_queued_spin_unlock_slowpath().
- */
-#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
-#define _GEN_PV_LOCK_SLOWPATH
-
-#undef pv_enabled
-#define pv_enabled() true
-
-#undef pv_init_node
-#undef pv_wait_node
-#undef pv_kick_node
-#undef pv_wait_head_or_lock
-
-#undef resilient_queued_spin_lock_slowpath
-#define resilient_queued_spin_lock_slowpath __pv_resilient_queued_spin_lock_slowpath
-
-#include "qspinlock_paravirt.h"
-#include "rqspinlock.c"
-
-bool nopvspin;
-static __init int parse_nopvspin(char *arg)
-{
- nopvspin = true;
- return 0;
-}
-early_param("nopvspin", parse_nopvspin);
-#endif
--
2.43.5
* [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (5 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 06/26] rqspinlock: Drop PV and virtualization support Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-10 9:56 ` Peter Zijlstra
2025-02-06 10:54 ` [PATCH bpf-next v2 08/26] rqspinlock: Protect pending bit owners from stalls Kumar Kartikeya Dwivedi
` (21 subsequent siblings)
28 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
Introduce the policy macro RES_CHECK_TIMEOUT, which can be used to detect
when the timeout has expired, so that the slow path can return an error.
It depends on being passed two variables initialized to 0: 'ts' and
'ret'. The 'ts' parameter is of type rqspinlock_timeout.
This macro resolves to the (ret) expression, so that it can be used as
part of conditions such as the one passed to smp_cond_load_acquire to
break out of the waiting loop.
The 'spin' member is used to amortize the cost of checking time by
dispatching to the implementation every 64k iterations. The
'timeout_end' member is used to keep track of the timestamp that denotes
the end of the waiting period. The 'ret' parameter denotes the status of
the timeout, and can be checked in the slow path to detect timeouts
after waiting loops.
The 'duration' member is used to store the timeout duration for each
waiting loop, which is passed down from the caller of the slow path
function. Use the RES_INIT_TIMEOUT macro to initialize it. The default
timeout value defined in the header (RES_DEF_TIMEOUT) is 0.5 seconds.
This macro will be used as a condition for waiting loops in the slow
path. Since each waiting loop applies a fresh timeout using the same
rqspinlock_timeout, we add a new RES_RESET_TIMEOUT as well to ensure the
values can be easily reinitialized to the default state.
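For illustration, this is roughly how the macros are meant to compose in
a waiting loop (the actual call sites are wired up in the following
patches):

  struct rqspinlock_timeout ts;
  int ret = 0;

  RES_INIT_TIMEOUT(ts, RES_DEF_TIMEOUT); /* set 'duration' once per acquisition */
  ...
  RES_RESET_TIMEOUT(ts);                 /* fresh deadline for this waiting loop */
  smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret));
  if (ret)                               /* -ETIMEDOUT once the deadline passes */
          return ret;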
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/rqspinlock.h | 8 +++++-
kernel/locking/rqspinlock.c | 46 +++++++++++++++++++++++++++++++-
2 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 54860b519571..c89733cbe643 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -10,10 +10,16 @@
#define __ASM_GENERIC_RQSPINLOCK_H
#include <linux/types.h>
+#include <vdso/time64.h>
struct qspinlock;
typedef struct qspinlock rqspinlock_t;
-extern void resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val);
+/*
+ * Default timeout for waiting loops is 0.5 seconds
+ */
+#define RES_DEF_TIMEOUT (NSEC_PER_SEC / 2)
+
+extern void resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
#endif /* __ASM_GENERIC_RQSPINLOCK_H */
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 52db60cd9691..200454e9c636 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -6,9 +6,11 @@
* (C) Copyright 2013-2014,2018 Red Hat, Inc.
* (C) Copyright 2015 Intel Corp.
* (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
+ * (C) Copyright 2024 Meta Platforms, Inc. and affiliates.
*
* Authors: Waiman Long <longman@redhat.com>
* Peter Zijlstra <peterz@infradead.org>
+ * Kumar Kartikeya Dwivedi <memxor@gmail.com>
*/
#include <linux/smp.h>
@@ -22,6 +24,7 @@
#include <asm/qspinlock.h>
#include <trace/events/lock.h>
#include <asm/rqspinlock.h>
+#include <linux/timekeeping.h>
/*
* Include queued spinlock definitions and statistics code
@@ -68,6 +71,44 @@
#include "mcs_spinlock.h"
+struct rqspinlock_timeout {
+ u64 timeout_end;
+ u64 duration;
+ u16 spin;
+};
+
+static noinline int check_timeout(struct rqspinlock_timeout *ts)
+{
+ u64 time = ktime_get_mono_fast_ns();
+
+ if (!ts->timeout_end) {
+ ts->timeout_end = time + ts->duration;
+ return 0;
+ }
+
+ if (time > ts->timeout_end)
+ return -ETIMEDOUT;
+
+ return 0;
+}
+
+#define RES_CHECK_TIMEOUT(ts, ret) \
+ ({ \
+ if (!(ts).spin++) \
+ (ret) = check_timeout(&(ts)); \
+ (ret); \
+ })
+
+/*
+ * Initialize the 'duration' member with the chosen timeout.
+ */
+#define RES_INIT_TIMEOUT(ts, _timeout) ({ (ts).spin = 1; (ts).duration = _timeout; })
+
+/*
+ * We only need to reset 'timeout_end', 'spin' will just wrap around as necessary.
+ */
+#define RES_RESET_TIMEOUT(ts) ({ (ts).timeout_end = 0; })
+
/*
* Per-CPU queue node structures; we can never have more than 4 nested
* contexts: task, softirq, hardirq, nmi.
@@ -97,14 +138,17 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
-void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
+void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout)
{
struct mcs_spinlock *prev, *next, *node;
+ struct rqspinlock_timeout ts;
u32 old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+ RES_INIT_TIMEOUT(ts, timeout);
+
/*
* Wait for in-progress pending->locked hand-overs with a bounded
* number of spins so that we guarantee forward progress.
--
2.43.5
* [PATCH bpf-next v2 08/26] rqspinlock: Protect pending bit owners from stalls
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (6 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 09/26] rqspinlock: Protect waiters in queue " Kumar Kartikeya Dwivedi
` (20 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
The pending bit is used to avoid queueing in case the lock is
uncontended, and has demonstrated benefits for the 2 contender scenario,
esp. on x86. In case the pending bit is acquired and we wait for the
locked bit to disappear, we may get stuck due to the lock owner not
making progress. Hence, this waiting loop must be protected with a
timeout check.
To perform a graceful recovery once we decide to abort our lock
acquisition attempt in this case, we must unset the pending bit since we
own it. All waiters undoing their changes and exiting gracefully allows
the lock word to be restored to the unlocked state once all participants
(owner, waiters) have been recovered, and the lock remains usable.
Hence, set the pending bit back to zero before returning to the caller.
Introduce a lockevent (rqspinlock_lock_timeout) to capture timeout
event statistics.
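The recovery step on timeout boils down to (condensed from the diff
below):

  if (ret) {
          /* We own the pending bit; undo it so the lock word stays consistent. */
          clear_pending(lock);
          lockevent_inc(rqspinlock_lock_timeout);
          return ret;
  }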
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/rqspinlock.h | 2 +-
kernel/locking/lock_events_list.h | 5 +++++
kernel/locking/rqspinlock.c | 28 +++++++++++++++++++++++-----
3 files changed, 29 insertions(+), 6 deletions(-)
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index c89733cbe643..0981162c8ac7 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -20,6 +20,6 @@ typedef struct qspinlock rqspinlock_t;
*/
#define RES_DEF_TIMEOUT (NSEC_PER_SEC / 2)
-extern void resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
+extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
#endif /* __ASM_GENERIC_RQSPINLOCK_H */
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 97fb6f3f840a..c5286249994d 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -49,6 +49,11 @@ LOCK_EVENT(lock_use_node4) /* # of locking ops that use 4th percpu node */
LOCK_EVENT(lock_no_node) /* # of locking ops w/o using percpu node */
#endif /* CONFIG_QUEUED_SPINLOCKS */
+/*
+ * Locking events for Resilient Queued Spin Lock
+ */
+LOCK_EVENT(rqspinlock_lock_timeout) /* # of locking ops that timeout */
+
/*
* Locking events for rwsem
*/
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 200454e9c636..8e512feb37ce 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -138,12 +138,12 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
-void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout)
+int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout)
{
struct mcs_spinlock *prev, *next, *node;
struct rqspinlock_timeout ts;
+ int idx, ret = 0;
u32 old, tail;
- int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
@@ -201,8 +201,25 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
* clear_pending_set_locked() implementations imply full
* barriers.
*/
- if (val & _Q_LOCKED_MASK)
- smp_cond_load_acquire(&lock->locked, !VAL);
+ if (val & _Q_LOCKED_MASK) {
+ RES_RESET_TIMEOUT(ts);
+ smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret));
+ }
+
+ if (ret) {
+ /*
+ * We waited for the locked bit to go back to 0, as the pending
+ * waiter, but timed out. We need to clear the pending bit since
+ * we own it. Once a stuck owner has been recovered, the lock
+ * must be restored to a valid state, hence removing the pending
+ * bit is necessary.
+ *
+ * *,1,* -> *,0,*
+ */
+ clear_pending(lock);
+ lockevent_inc(rqspinlock_lock_timeout);
+ return ret;
+ }
/*
* take ownership and clear the pending bit.
@@ -211,7 +228,7 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
- return;
+ return 0;
/*
* End of pending bit optimistic spinning and beginning of MCS
@@ -362,5 +379,6 @@ void __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
* release the node
*/
__this_cpu_dec(qnodes[0].mcs.count);
+ return 0;
}
EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
--
2.43.5
* [PATCH bpf-next v2 09/26] rqspinlock: Protect waiters in queue from stalls
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (7 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 08/26] rqspinlock: Protect pending bit owners from stalls Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-10 10:17 ` Peter Zijlstra
2025-02-06 10:54 ` [PATCH bpf-next v2 10/26] rqspinlock: Protect waiters in trylock fallback " Kumar Kartikeya Dwivedi
` (19 subsequent siblings)
28 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
Implement the wait queue cleanup algorithm for rqspinlock. There are
three forms of waiters in the original queued spin lock algorithm. The
first is the waiter which acquires the pending bit and spins on the lock
word without forming a wait queue. The second is the head waiter that is
the first waiter heading the wait queue. The third form is of all the
non-head waiters queued behind the head, waiting to be signalled through
their MCS node to take over the responsibility of the head.
In this commit, we are concerned with the second and third kind. First,
we augment the waiting loop of the head of the wait queue with a
timeout. When this timeout happens, all waiters part of the wait queue
will abort their lock acquisition attempts. This happens in three steps.
First, the head breaks out of its loop waiting for pending and locked
bits to turn to 0, and non-head waiters break out of their MCS node spin
(more on that later). Next, every waiter (head or non-head) attempts to
check whether they are also the tail waiter, in such a case they attempt
to zero out the tail word and allow a new queue to be built up for this
lock. If they succeed, they have no one to signal next in the queue to
stop spinning. Otherwise, they signal the MCS node of the next waiter to
break out of its spin and try resetting the tail word back to 0. This
goes on until the tail waiter is found. In case of races, the new tail
will be responsible for performing the same task, as the old tail will
then fail to reset the tail word and wait for its next pointer to be
updated before it signals the new tail to do the same.
Lastly, all of these waiters release the rqnode and return to the
caller. This patch underscores the point that rqspinlock's timeout does
not apply to each waiter individually, and cannot be relied upon as an
upper bound. It is possible for the rqspinlock waiters to return early
from a failed lock acquisition attempt as soon as stalls are detected.
The head waiter cannot directly WRITE_ONCE the tail to zero, as it may
race with a concurrent xchg and a non-head waiter linking its MCS node
to the head's MCS node through 'prev->next' assignment.
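Condensed from the diff below, the abort is propagated through the wait
queue roughly as follows:

  /* Non-head waiter: a predecessor may tell us to abort via our MCS node. */
  val = arch_mcs_spin_lock_contended(&node->locked);
  if (val == RES_TIMEOUT_VAL) {
          ret = -EDEADLK;
          goto waitq_timeout;
  }
  ...
  waitq_timeout:
  if (ret) {
          /*
           * If the tail still points at us, reset it so a new queue can form;
           * otherwise signal the next waiter to repeat this same step.
           */
          if (!try_cmpxchg_tail(lock, tail, 0)) {
                  next = smp_cond_load_relaxed(&node->next, VAL);
                  WRITE_ONCE(next->locked, RES_TIMEOUT_VAL);
          }
          lockevent_inc(rqspinlock_lock_timeout);
          goto release;
  }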
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/rqspinlock.c | 42 +++++++++++++++++++++++++++++---
kernel/locking/rqspinlock.h | 48 +++++++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+), 3 deletions(-)
create mode 100644 kernel/locking/rqspinlock.h
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 8e512feb37ce..fdc20157d0c9 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -77,6 +77,8 @@ struct rqspinlock_timeout {
u16 spin;
};
+#define RES_TIMEOUT_VAL 2
+
static noinline int check_timeout(struct rqspinlock_timeout *ts)
{
u64 time = ktime_get_mono_fast_ns();
@@ -305,12 +307,18 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
* head of the waitqueue.
*/
if (old & _Q_TAIL_MASK) {
+ int val;
+
prev = decode_tail(old, qnodes);
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
- arch_mcs_spin_lock_contended(&node->locked);
+ val = arch_mcs_spin_lock_contended(&node->locked);
+ if (val == RES_TIMEOUT_VAL) {
+ ret = -EDEADLK;
+ goto waitq_timeout;
+ }
/*
* While waiting for the MCS lock, the next pointer may have
@@ -334,7 +342,35 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
*/
- val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+ RES_RESET_TIMEOUT(ts);
+ val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
+ RES_CHECK_TIMEOUT(ts, ret));
+
+waitq_timeout:
+ if (ret) {
+ /*
+ * If the tail is still pointing to us, then we are the final waiter,
+ * and are responsible for resetting the tail back to 0. Otherwise, if
+ * the cmpxchg operation fails, we signal the next waiter to take exit
+ * and try the same. For a waiter with tail node 'n':
+ *
+ * n,*,* -> 0,*,*
+ *
+ * When performing cmpxchg for the whole word (NR_CPUS > 16k), it is
+ * possible locked/pending bits keep changing and we see failures even
+ * when we remain the head of wait queue. However, eventually,
+ * pending bit owner will unset the pending bit, and new waiters
+ * will queue behind us. This will leave the lock owner in
+ * charge, and it will eventually either set locked bit to 0, or
+ * leave it as 1, allowing us to make progress.
+ */
+ if (!try_cmpxchg_tail(lock, tail, 0)) {
+ next = smp_cond_load_relaxed(&node->next, VAL);
+ WRITE_ONCE(next->locked, RES_TIMEOUT_VAL);
+ }
+ lockevent_inc(rqspinlock_lock_timeout);
+ goto release;
+ }
/*
* claim the lock:
@@ -379,6 +415,6 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
* release the node
*/
__this_cpu_dec(qnodes[0].mcs.count);
- return 0;
+ return ret;
}
EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
diff --git a/kernel/locking/rqspinlock.h b/kernel/locking/rqspinlock.h
new file mode 100644
index 000000000000..3cec3a0f2d7e
--- /dev/null
+++ b/kernel/locking/rqspinlock.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Resilient Queued Spin Lock defines
+ *
+ * (C) Copyright 2024 Meta Platforms, Inc. and affiliates.
+ *
+ * Authors: Kumar Kartikeya Dwivedi <memxor@gmail.com>
+ */
+#ifndef __LINUX_RQSPINLOCK_H
+#define __LINUX_RQSPINLOCK_H
+
+#include "qspinlock.h"
+
+/*
+ * try_cmpxchg_tail - Return result of cmpxchg of tail word with a new value
+ * @lock: Pointer to queued spinlock structure
+ * @tail: The tail to compare against
+ * @new_tail: The new queue tail code word
+ * Return: Bool to indicate whether the cmpxchg operation succeeded
+ *
+ * This is used by the head of the wait queue to clean up the queue.
+ * Provides relaxed ordering, since observers only rely on initialized
+ * state of the node which was made visible through the xchg_tail operation,
+ * i.e. through the smp_wmb preceding xchg_tail.
+ *
+ * We avoid using 16-bit cmpxchg, which is not available on all architectures.
+ */
+static __always_inline bool try_cmpxchg_tail(struct qspinlock *lock, u32 tail, u32 new_tail)
+{
+ u32 old, new;
+
+ old = atomic_read(&lock->val);
+ do {
+ /*
+ * Is the tail part we compare to already stale? Fail.
+ */
+ if ((old & _Q_TAIL_MASK) != tail)
+ return false;
+ /*
+ * Encode latest locked/pending state for new tail.
+ */
+ new = (old & _Q_LOCKED_PENDING_MASK) | new_tail;
+ } while (!atomic_try_cmpxchg_relaxed(&lock->val, &old, new));
+
+ return true;
+}
+
+#endif /* __LINUX_RQSPINLOCK_H */
--
2.43.5
* [PATCH bpf-next v2 10/26] rqspinlock: Protect waiters in trylock fallback from stalls
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (8 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 09/26] rqspinlock: Protect waiters in queue " Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery Kumar Kartikeya Dwivedi
` (18 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Barret Rhoden, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
When we run out of maximum rqnodes, the original queued spin lock slow
path falls back to a try lock. In such a case, we are again susceptible
to stalls in case the lock owner fails to make progress. We use the
timeout as a fallback to break out of this loop and return to the
caller. This is a fallback for an extreme edge case, when on the same
CPU we run out of all 4 qnodes. When could this happen? We are in the
slow path in task context, we get interrupted by an IRQ, which while in
the slow path gets interrupted by an NMI, which in the slow path gets
another nested NMI, which then enters the slow path. All of these
interruptions happen after node->count++.
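That is, the trylock loop becomes, roughly (condensed from the diff
below):

  if (unlikely(idx >= _Q_MAX_NODES)) {
          lockevent_inc(lock_no_node);
          RES_RESET_TIMEOUT(ts);
          while (!queued_spin_trylock(lock)) {
                  if (RES_CHECK_TIMEOUT(ts, ret)) { /* give up instead of spinning forever */
                          lockevent_inc(rqspinlock_lock_timeout);
                          break;
                  }
                  cpu_relax();
          }
          goto release;
  }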
Reviewed-by: Barret Rhoden <brho@google.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/rqspinlock.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index fdc20157d0c9..df7adec59cec 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -255,8 +255,14 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
if (unlikely(idx >= _Q_MAX_NODES)) {
lockevent_inc(lock_no_node);
- while (!queued_spin_trylock(lock))
+ RES_RESET_TIMEOUT(ts);
+ while (!queued_spin_trylock(lock)) {
+ if (RES_CHECK_TIMEOUT(ts, ret)) {
+ lockevent_inc(rqspinlock_lock_timeout);
+ break;
+ }
cpu_relax();
+ }
goto release;
}
--
2.43.5
* [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (9 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 10/26] rqspinlock: Protect waiters in trylock fallback " Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-08 1:53 ` Alexei Starovoitov
` (2 more replies)
2025-02-06 10:54 ` [PATCH bpf-next v2 12/26] rqspinlock: Add a test-and-set fallback Kumar Kartikeya Dwivedi
` (17 subsequent siblings)
28 siblings, 3 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
While the timeout logic provides guarantees for the waiter's forward
progress, the time until a stalling waiter unblocks can still be long.
The default timeout of 1/2 sec can be excessively long for some use
cases. Additionally, custom timeouts may exacerbate recovery time.
Introduce logic to detect common cases of deadlocks and perform quicker
recovery. This is done by dividing the time from entry into the locking
slow path until the timeout into intervals of 1 ms. Then, after each
interval elapses, deadlock detection is performed, while also polling
the lock word to ensure we can quickly break out of the detection logic
and proceed with lock acquisition.
A 'held_locks' table is maintained per-CPU where the entry at the bottom
denotes a lock being waited for or already taken. Entries coming before
it denote locks that are already held. The current CPU's table can thus
be looked at to detect AA deadlocks. The tables from other CPUs can be
looked at to discover ABBA situations. Finally, when a matching entry
for the lock being taken on the current CPU is found on some other CPU,
a deadlock situation is detected. This function can take a long time,
therefore the lock word is constantly polled in each loop iteration to
ensure we can preempt detection and proceed with lock acquisition, using
the is_lock_released check.
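For reference, the per-CPU bookkeeping and the AA check at the heart of
the detection logic (both part of the diff below) boil down to:

  struct rqspinlock_held {
          int cnt;
          void *locks[RES_NR_HELD];  /* last entry: lock being acquired */
  };

  /* AA: are we attempting to take a lock this CPU already holds? */
  for (int i = 0; i < cnt - 1; i++) {
          if (rqh->locks[i] == lock)
                  return -EDEADLK;
  }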
We set 'spin' member of rqspinlock_timeout struct to 0 to trigger
deadlock checks immediately to perform faster recovery.
Note: Extending lock word size by 4 bytes to record owner CPU can allow
faster detection for ABBA. It is typically the owner which participates
in a ABBA situation. However, to keep compatibility with existing lock
words in the kernel (struct qspinlock), and given deadlocks are a rare
event triggered by bugs, we choose to favor compatibility over faster
detection.
The release_held_lock_entry function requires an smp_wmb, while the
release store on unlock will provide the necessary ordering for us. Add
comments to document the subtleties of why this is correct. It is
possible for stores to be reordered still, but in the context of the
deadlock detection algorithm, a release barrier is sufficient and
needn't be stronger for unlock's case.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/rqspinlock.h | 83 +++++++++++++-
kernel/locking/rqspinlock.c | 183 ++++++++++++++++++++++++++++---
2 files changed, 252 insertions(+), 14 deletions(-)
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 0981162c8ac7..c1dbd25287a1 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -11,15 +11,96 @@
#include <linux/types.h>
#include <vdso/time64.h>
+#include <linux/percpu.h>
struct qspinlock;
typedef struct qspinlock rqspinlock_t;
+extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
+
/*
* Default timeout for waiting loops is 0.5 seconds
*/
#define RES_DEF_TIMEOUT (NSEC_PER_SEC / 2)
-extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
+#define RES_NR_HELD 32
+
+struct rqspinlock_held {
+ int cnt;
+ void *locks[RES_NR_HELD];
+};
+
+DECLARE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
+
+static __always_inline void grab_held_lock_entry(void *lock)
+{
+ int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);
+
+ if (unlikely(cnt > RES_NR_HELD)) {
+ /* Still keep the inc so we decrement later. */
+ return;
+ }
+
+ /*
+ * Implied compiler barrier in per-CPU operations; otherwise we can have
+ * the compiler reorder inc with write to table, allowing interrupts to
+ * overwrite and erase our write to the table (as on interrupt exit it
+ * will be reset to NULL).
+ */
+ this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
+}
+
+/*
+ * It is possible to run into misdetection scenarios of AA deadlocks on the same
+ * CPU, and missed ABBA deadlocks on remote CPUs when this function pops entries
+ * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
+ * logic to preserve right entries in the table would be to walk the array of
+ * held locks and swap and clear out-of-order entries, but that's too
+ * complicated and we don't have a compelling use case for out of order unlocking.
+ *
+ * Therefore, we simply don't support such cases and keep the logic simple here.
+ */
+static __always_inline void release_held_lock_entry(void)
+{
+ struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+
+ if (unlikely(rqh->cnt > RES_NR_HELD))
+ goto dec;
+ WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
+dec:
+ this_cpu_dec(rqspinlock_held_locks.cnt);
+ /*
+ * This helper is invoked when we unwind upon failing to acquire the
+ * lock. Unlike the unlock path which constitutes a release store after
+ * we clear the entry, we need to emit a write barrier here. Otherwise,
+ * we may have a situation as follows:
+ *
+ * <error> for lock B
+ * release_held_lock_entry
+ *
+ * try_cmpxchg_acquire for lock A
+ * grab_held_lock_entry
+ *
+ * Since these are attempts for different locks, no sequentiality is
+ * guaranteed and reordering may occur such that dec, inc are done
+ * before entry is overwritten. This permits a remote lock holder of
+ * lock B to now observe it as being attempted on this CPU, and may lead
+ * to misdetection.
+ *
+ * In case of unlock, we will always do a release on the lock word after
+ * releasing the entry, ensuring that other CPUs cannot hold the lock
+ * (and make conclusions about deadlocks) until the entry has been
+ * cleared on the local CPU, preventing any anomalies. Reordering is
+ * still possible there, but a remote CPU cannot observe a lock in our
+ * table which it is already holding, since visibility entails our
+ * release store for the said lock has not retired.
+ *
+ * We don't have a problem if the dec and WRITE_ONCE above get reordered
+ * with each other, we either notice an empty NULL entry on top (if dec
+ * succeeds WRITE_ONCE), or a potentially stale entry which cannot be
+ * observed (if dec precedes WRITE_ONCE).
+ */
+ smp_wmb();
+}
#endif /* __ASM_GENERIC_RQSPINLOCK_H */
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index df7adec59cec..42e8a56534b6 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -30,6 +30,7 @@
* Include queued spinlock definitions and statistics code
*/
#include "qspinlock.h"
+#include "rqspinlock.h"
#include "qspinlock_stat.h"
/*
@@ -74,16 +75,146 @@
struct rqspinlock_timeout {
u64 timeout_end;
u64 duration;
+ u64 cur;
u16 spin;
};
#define RES_TIMEOUT_VAL 2
-static noinline int check_timeout(struct rqspinlock_timeout *ts)
+DEFINE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
+
+static bool is_lock_released(rqspinlock_t *lock, u32 mask, struct rqspinlock_timeout *ts)
+{
+ if (!(atomic_read_acquire(&lock->val) & (mask)))
+ return true;
+ return false;
+}
+
+static noinline int check_deadlock_AA(rqspinlock_t *lock, u32 mask,
+ struct rqspinlock_timeout *ts)
+{
+ struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+ int cnt = min(RES_NR_HELD, rqh->cnt);
+
+ /*
+ * Return an error if we hold the lock we are attempting to acquire.
+ * We'll iterate over max 32 locks; no need to do is_lock_released.
+ */
+ for (int i = 0; i < cnt - 1; i++) {
+ if (rqh->locks[i] == lock)
+ return -EDEADLK;
+ }
+ return 0;
+}
+
+/*
+ * This focuses on the most common case of ABBA deadlocks (or ABBA involving
+ * more locks, which reduce to ABBA). This is not exhaustive, and we rely on
+ * timeouts as the final line of defense.
+ */
+static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask,
+ struct rqspinlock_timeout *ts)
+{
+ struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+ int rqh_cnt = min(RES_NR_HELD, rqh->cnt);
+ void *remote_lock;
+ int cpu;
+
+ /*
+ * Find the CPU holding the lock that we want to acquire. If there is a
+ * deadlock scenario, we will read a stable set on the remote CPU and
+ * find the target. This would be a constant time operation instead of
+ * O(NR_CPUS) if we could determine the owning CPU from a lock value, but
+ * that requires increasing the size of the lock word.
+ */
+ for_each_possible_cpu(cpu) {
+ struct rqspinlock_held *rqh_cpu = per_cpu_ptr(&rqspinlock_held_locks, cpu);
+ int real_cnt = READ_ONCE(rqh_cpu->cnt);
+ int cnt = min(RES_NR_HELD, real_cnt);
+
+ /*
+ * Let's ensure to break out of this loop if the lock is available for
+ * us to potentially acquire.
+ */
+ if (is_lock_released(lock, mask, ts))
+ return 0;
+
+ /*
+ * Skip ourselves, and CPUs whose count is less than 2, as they need at
+ * least one held lock and one acquisition attempt (reflected as top
+ * most entry) to participate in an ABBA deadlock.
+ *
+ * If cnt is more than RES_NR_HELD, it means the current lock being
+ * acquired won't appear in the table, and other locks in the table are
+ * already held, so we can't determine ABBA.
+ */
+ if (cpu == smp_processor_id() || real_cnt < 2 || real_cnt > RES_NR_HELD)
+ continue;
+
+ /*
+ * Obtain the entry at the top, this corresponds to the lock the
+ * remote CPU is attempting to acquire in a deadlock situation,
+ * and would be one of the locks we hold on the current CPU.
+ */
+ remote_lock = READ_ONCE(rqh_cpu->locks[cnt - 1]);
+ /*
+ * If it is NULL, we've raced and cannot determine a deadlock
+ * conclusively, skip this CPU.
+ */
+ if (!remote_lock)
+ continue;
+ /*
+ * Find if the lock we're attempting to acquire is held by this CPU.
+ * Don't consider the topmost entry, as that must be the latest lock
+ * being held or acquired. For a deadlock, the target CPU must also
+ * attempt to acquire a lock we hold, so for this search only 'cnt - 1'
+ * entries are important.
+ */
+ for (int i = 0; i < cnt - 1; i++) {
+ if (READ_ONCE(rqh_cpu->locks[i]) != lock)
+ continue;
+ /*
+ * We found our lock as held on the remote CPU. Is the
+ * acquisition attempt on the remote CPU for a lock held
+ * by us? If so, we have a deadlock situation, and need
+ * to recover.
+ */
+ for (int i = 0; i < rqh_cnt - 1; i++) {
+ if (rqh->locks[i] == remote_lock)
+ return -EDEADLK;
+ }
+ /*
+ * Inconclusive; retry again later.
+ */
+ return 0;
+ }
+ }
+ return 0;
+}
+
+static noinline int check_deadlock(rqspinlock_t *lock, u32 mask,
+ struct rqspinlock_timeout *ts)
+{
+ int ret;
+
+ ret = check_deadlock_AA(lock, mask, ts);
+ if (ret)
+ return ret;
+ ret = check_deadlock_ABBA(lock, mask, ts);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
+ struct rqspinlock_timeout *ts)
{
u64 time = ktime_get_mono_fast_ns();
+ u64 prev = ts->cur;
if (!ts->timeout_end) {
+ ts->cur = time;
ts->timeout_end = time + ts->duration;
return 0;
}
@@ -91,20 +222,30 @@ static noinline int check_timeout(struct rqspinlock_timeout *ts)
if (time > ts->timeout_end)
return -ETIMEDOUT;
+ /*
+ * A millisecond interval passed from last time? Trigger deadlock
+ * checks.
+ */
+ if (prev + NSEC_PER_MSEC < time) {
+ ts->cur = time;
+ return check_deadlock(lock, mask, ts);
+ }
+
return 0;
}
-#define RES_CHECK_TIMEOUT(ts, ret) \
- ({ \
- if (!(ts).spin++) \
- (ret) = check_timeout(&(ts)); \
- (ret); \
+#define RES_CHECK_TIMEOUT(ts, ret, mask) \
+ ({ \
+ if (!(ts).spin++) \
+ (ret) = check_timeout((lock), (mask), &(ts)); \
+ (ret); \
})
/*
* Initialize the 'duration' member with the chosen timeout.
+ * Set spin member to 0 to trigger AA/ABBA checks immediately.
*/
-#define RES_INIT_TIMEOUT(ts, _timeout) ({ (ts).spin = 1; (ts).duration = _timeout; })
+#define RES_INIT_TIMEOUT(ts, _timeout) ({ (ts).spin = 0; (ts).duration = _timeout; })
/*
* We only need to reset 'timeout_end', 'spin' will just wrap around as necessary.
@@ -192,6 +333,11 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
goto queue;
}
+ /*
+ * Grab an entry in the held locks array, to enable deadlock detection.
+ */
+ grab_held_lock_entry(lock);
+
/*
* We're pending, wait for the owner to go away.
*
@@ -205,7 +351,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
if (val & _Q_LOCKED_MASK) {
RES_RESET_TIMEOUT(ts);
- smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret));
+ smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
}
if (ret) {
@@ -220,7 +366,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
clear_pending(lock);
lockevent_inc(rqspinlock_lock_timeout);
- return ret;
+ goto err_release_entry;
}
/*
@@ -238,6 +384,11 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
queue:
lockevent_inc(lock_slowpath);
+ /*
+ * Grab deadlock detection entry for the queue path.
+ */
+ grab_held_lock_entry(lock);
+
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -257,9 +408,9 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
lockevent_inc(lock_no_node);
RES_RESET_TIMEOUT(ts);
while (!queued_spin_trylock(lock)) {
- if (RES_CHECK_TIMEOUT(ts, ret)) {
+ if (RES_CHECK_TIMEOUT(ts, ret, ~0u)) {
lockevent_inc(rqspinlock_lock_timeout);
- break;
+ goto err_release_node;
}
cpu_relax();
}
@@ -350,7 +501,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
RES_RESET_TIMEOUT(ts);
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
- RES_CHECK_TIMEOUT(ts, ret));
+ RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
waitq_timeout:
if (ret) {
@@ -375,7 +526,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
WRITE_ONCE(next->locked, RES_TIMEOUT_VAL);
}
lockevent_inc(rqspinlock_lock_timeout);
- goto release;
+ goto err_release_node;
}
/*
@@ -422,5 +573,11 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
__this_cpu_dec(qnodes[0].mcs.count);
return ret;
+err_release_node:
+ trace_contention_end(lock, ret);
+ __this_cpu_dec(qnodes[0].mcs.count);
+err_release_entry:
+ release_held_lock_entry();
+ return ret;
}
EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
--
2.43.5
* [PATCH bpf-next v2 12/26] rqspinlock: Add a test-and-set fallback
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (10 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 13/26] rqspinlock: Add basic support for CONFIG_PARAVIRT Kumar Kartikeya Dwivedi
` (16 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Include a test-and-set fallback for when queued spinlock support is not
available. Introduce an rqspinlock type to act as a fallback when
qspinlock support is absent.
Include ifdef guards to ensure the slow path in this file is only
compiled when CONFIG_QUEUED_SPINLOCKS=y. Subsequent patches will add
further logic to ensure fallback to the test-and-set implementation
when queued spinlock support is unavailable on an architecture.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/rqspinlock.h | 17 +++++++++++++++
kernel/locking/rqspinlock.c | 37 ++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index c1dbd25287a1..92e53b2aafb9 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -12,11 +12,28 @@
#include <linux/types.h>
#include <vdso/time64.h>
#include <linux/percpu.h>
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm/qspinlock.h>
+#endif
+
+struct rqspinlock {
+ union {
+ atomic_t val;
+ u32 locked;
+ };
+};
struct qspinlock;
+#ifdef CONFIG_QUEUED_SPINLOCKS
typedef struct qspinlock rqspinlock_t;
+#else
+typedef struct rqspinlock rqspinlock_t;
+#endif
+extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);
+#ifdef CONFIG_QUEUED_SPINLOCKS
extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
+#endif
/*
* Default timeout for waiting loops is 0.5 seconds
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 42e8a56534b6..ea034e80f855 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -21,7 +21,9 @@
#include <linux/mutex.h>
#include <linux/prefetch.h>
#include <asm/byteorder.h>
+#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h>
+#endif
#include <trace/events/lock.h>
#include <asm/rqspinlock.h>
#include <linux/timekeeping.h>
@@ -29,8 +31,10 @@
/*
* Include queued spinlock definitions and statistics code
*/
+#ifdef CONFIG_QUEUED_SPINLOCKS
#include "qspinlock.h"
#include "rqspinlock.h"
+#endif
#include "qspinlock_stat.h"
/*
@@ -252,6 +256,37 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
*/
#define RES_RESET_TIMEOUT(ts) ({ (ts).timeout_end = 0; })
+/*
+ * Provide a test-and-set fallback for cases when queued spin lock support is
+ * absent from the architecture.
+ */
+int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
+{
+ struct rqspinlock_timeout ts;
+ int val, ret = 0;
+
+ RES_INIT_TIMEOUT(ts, timeout);
+ grab_held_lock_entry(lock);
+retry:
+ val = atomic_read(&lock->val);
+
+ if (val || !atomic_try_cmpxchg(&lock->val, &val, 1)) {
+ if (RES_CHECK_TIMEOUT(ts, ret, ~0u)) {
+ lockevent_inc(rqspinlock_lock_timeout);
+ goto out;
+ }
+ cpu_relax();
+ goto retry;
+ }
+
+ return 0;
+out:
+ release_held_lock_entry();
+ return ret;
+}
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+
/*
* Per-CPU queue node structures; we can never have more than 4 nested
* contexts: task, softirq, hardirq, nmi.
@@ -581,3 +616,5 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
return ret;
}
EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
+
+#endif /* CONFIG_QUEUED_SPINLOCKS */
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 13/26] rqspinlock: Add basic support for CONFIG_PARAVIRT
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (11 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 12/26] rqspinlock: Add a test-and-set fallback Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 14/26] rqspinlock: Add helper to print a splat on timeout or deadlock Kumar Kartikeya Dwivedi
` (15 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
We ripped out the PV and virtualization-related bits from rqspinlock in
an earlier commit; however, a fair lock performs poorly inside a virtual
machine when the lock holder is preempted. As such, retain the
virt_spin_lock fallback to a test-and-set lock, but with timeout and
deadlock detection. We can do this by simply relying on the
resilient_tas_spin_lock implementation from the previous patch.
We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that
requires more involved algorithmic changes and introduces more
complexity. It can be done when the need arises in the future.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
arch/x86/include/asm/rqspinlock.h | 29 +++++++++++++++++++++++++++++
include/asm-generic/rqspinlock.h | 14 ++++++++++++++
kernel/locking/rqspinlock.c | 3 +++
3 files changed, 46 insertions(+)
create mode 100644 arch/x86/include/asm/rqspinlock.h
diff --git a/arch/x86/include/asm/rqspinlock.h b/arch/x86/include/asm/rqspinlock.h
new file mode 100644
index 000000000000..cbd65212c177
--- /dev/null
+++ b/arch/x86/include/asm/rqspinlock.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_RQSPINLOCK_H
+#define _ASM_X86_RQSPINLOCK_H
+
+#include <asm/paravirt.h>
+
+#ifdef CONFIG_PARAVIRT
+DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);
+
+#define resilient_virt_spin_lock_enabled resilient_virt_spin_lock_enabled
+static __always_inline bool resilient_virt_spin_lock_enabled(void)
+{
+ return static_branch_likely(&virt_spin_lock_key);
+}
+
+struct qspinlock;
+extern int resilient_tas_spin_lock(struct qspinlock *lock, u64 timeout);
+
+#define resilient_virt_spin_lock resilient_virt_spin_lock
+static inline int resilient_virt_spin_lock(struct qspinlock *lock, u64 timeout)
+{
+ return resilient_tas_spin_lock(lock, timeout);
+}
+
+#endif /* CONFIG_PARAVIRT */
+
+#include <asm-generic/rqspinlock.h>
+
+#endif /* _ASM_X86_RQSPINLOCK_H */
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 92e53b2aafb9..bbe049dcf70d 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -35,6 +35,20 @@ extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);
extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
#endif
+#ifndef resilient_virt_spin_lock_enabled
+static __always_inline bool resilient_virt_spin_lock_enabled(void)
+{
+ return false;
+}
+#endif
+
+#ifndef resilient_virt_spin_lock
+static __always_inline int resilient_virt_spin_lock(struct qspinlock *lock, u64 timeout)
+{
+ return 0;
+}
+#endif
+
/*
* Default timeout for waiting loops is 0.5 seconds
*/
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index ea034e80f855..13d1759c9353 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -325,6 +325,9 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+ if (resilient_virt_spin_lock_enabled())
+ return resilient_virt_spin_lock(lock, timeout);
+
RES_INIT_TIMEOUT(ts, timeout);
/*
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 14/26] rqspinlock: Add helper to print a splat on timeout or deadlock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (12 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 13/26] rqspinlock: Add basic support for CONFIG_PARAVIRT Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 15/26] rqspinlock: Add macros for rqspinlock usage Kumar Kartikeya Dwivedi
` (14 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Whenever a timeout or a deadlock occurs, we want to print a message to
the dmesg console, including the CPU where the event occurred, the list
of locks in the held locks table, and the stack trace of the caller,
which allows determining exactly where in the slow path the waiter timed
out or detected a deadlock.
Splats are limited to at most one per CPU during machine uptime, and a
lock is acquired to ensure that no interleaving occurs when multiple
CPUs conflict, enter a deadlock situation, and start printing data
concurrently.
Later patches will use this to inspect the return value of the
rqspinlock API and report a violation if necessary.
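For illustration only (the CPU number, lock count, and addresses below
are placeholders, and the message string is whatever the caller passes),
a splat produced by the helper added in this patch would look roughly
like:
  CPU 7: <caller-provided message, e.g. timeout or deadlock report>
  Held locks: 2
  Held lock[ 0] = 0x...
  Held lock[ 1] = 0x...
  <dump_stack() output follows>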
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/rqspinlock.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 13d1759c9353..93f928bc4e9c 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -196,6 +196,35 @@ static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask,
return 0;
}
+static DEFINE_PER_CPU(int, report_nest_cnt);
+static DEFINE_PER_CPU(bool, report_flag);
+static arch_spinlock_t report_lock;
+
+static void rqspinlock_report_violation(const char *s, void *lock)
+{
+ struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+
+ if (this_cpu_inc_return(report_nest_cnt) != 1) {
+ this_cpu_dec(report_nest_cnt);
+ return;
+ }
+ if (this_cpu_read(report_flag))
+ goto end;
+ this_cpu_write(report_flag, true);
+ arch_spin_lock(&report_lock);
+
+ pr_err("CPU %d: %s", smp_processor_id(), s);
+ pr_info("Held locks: %d\n", rqh->cnt + 1);
+ pr_info("Held lock[%2d] = 0x%px\n", 0, lock);
+ for (int i = 0; i < min(RES_NR_HELD, rqh->cnt); i++)
+ pr_info("Held lock[%2d] = 0x%px\n", i + 1, rqh->locks[i]);
+ dump_stack();
+
+ arch_spin_unlock(&report_lock);
+end:
+ this_cpu_dec(report_nest_cnt);
+}
+
static noinline int check_deadlock(rqspinlock_t *lock, u32 mask,
struct rqspinlock_timeout *ts)
{
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 15/26] rqspinlock: Add macros for rqspinlock usage
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (13 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 14/26] rqspinlock: Add helper to print a splat on timeout or deadlock Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 16/26] rqspinlock: Add locktorture support Kumar Kartikeya Dwivedi
` (13 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Introduce helper macros that wrap the rqspinlock slow path and provide
an interface analogous to the raw_spin_lock API. Note that in case of
error conditions, preemption and IRQ disabling are automatically undone
before the error is returned to the caller.
Ensure that in the absence of CONFIG_QUEUED_SPINLOCKS support, we fall
back to the test-and-set implementation.
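As a rough usage sketch (a hypothetical caller, not part of this patch;
the lock is assumed to have been initialized with raw_res_spin_lock_init
and <asm/rqspinlock.h> to be included):
  static rqspinlock_t lock;
  static int do_work(void)
  {
          unsigned long flags;
          int ret;
          ret = raw_res_spin_lock_irqsave(&lock, flags);
          if (ret)
                  return ret; /* timeout/deadlock; IRQ and preemption state already restored */
          /* ... critical section ... */
          raw_res_spin_unlock_irqrestore(&lock, flags);
          return 0;
  }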
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/rqspinlock.h | 71 ++++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index bbe049dcf70d..46119fc768b8 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -134,4 +134,75 @@ static __always_inline void release_held_lock_entry(void)
smp_wmb();
}
+#ifdef CONFIG_QUEUED_SPINLOCKS
+
+/**
+ * res_spin_lock - acquire a queued spinlock
+ * @lock: Pointer to queued spinlock structure
+ */
+static __always_inline int res_spin_lock(rqspinlock_t *lock)
+{
+ int val = 0;
+
+ if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) {
+ grab_held_lock_entry(lock);
+ return 0;
+ }
+ return resilient_queued_spin_lock_slowpath(lock, val, RES_DEF_TIMEOUT);
+}
+
+#else
+
+#define res_spin_lock(lock) resilient_tas_spin_lock(lock, RES_DEF_TIMEOUT)
+
+#endif /* CONFIG_QUEUED_SPINLOCKS */
+
+static __always_inline void res_spin_unlock(rqspinlock_t *lock)
+{
+ struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+
+ if (unlikely(rqh->cnt > RES_NR_HELD))
+ goto unlock;
+ WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
+unlock:
+ this_cpu_dec(rqspinlock_held_locks.cnt);
+ /*
+ * Release barrier, ensures correct ordering. See release_held_lock_entry
+ * for details. Perform release store instead of queued_spin_unlock,
+ * since we use this function for test-and-set fallback as well. When we
+ * have CONFIG_QUEUED_SPINLOCKS=n, we clear the full 4-byte lockword.
+ */
+ smp_store_release(&lock->locked, 0);
+}
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#define raw_res_spin_lock_init(lock) ({ *(lock) = (rqspinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; })
+#else
+#define raw_res_spin_lock_init(lock) ({ *(lock) = (rqspinlock_t){0}; })
+#endif
+
+#define raw_res_spin_lock(lock) \
+ ({ \
+ int __ret; \
+ preempt_disable(); \
+ __ret = res_spin_lock(lock); \
+ if (__ret) \
+ preempt_enable(); \
+ __ret; \
+ })
+
+#define raw_res_spin_unlock(lock) ({ res_spin_unlock(lock); preempt_enable(); })
+
+#define raw_res_spin_lock_irqsave(lock, flags) \
+ ({ \
+ int __ret; \
+ local_irq_save(flags); \
+ __ret = raw_res_spin_lock(lock); \
+ if (__ret) \
+ local_irq_restore(flags); \
+ __ret; \
+ })
+
+#define raw_res_spin_unlock_irqrestore(lock, flags) ({ raw_res_spin_unlock(lock); local_irq_restore(flags); })
+
#endif /* __ASM_GENERIC_RQSPINLOCK_H */
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 16/26] rqspinlock: Add locktorture support
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (14 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 15/26] rqspinlock: Add macros for rqspinlock usage Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation Kumar Kartikeya Dwivedi
` (12 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Introduce locktorture support for rqspinlock using the newly added
macros as the first in-kernel user and consumer.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/locktorture.c | 51 ++++++++++++++++++++++++++++++++++++
kernel/locking/rqspinlock.c | 1 +
2 files changed, 52 insertions(+)
diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index cc33470f4de9..a055ff38d1f5 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -362,6 +362,56 @@ static struct lock_torture_ops raw_spin_lock_irq_ops = {
.name = "raw_spin_lock_irq"
};
+#include <asm/rqspinlock.h>
+static rqspinlock_t rqspinlock;
+
+static int torture_raw_res_spin_write_lock(int tid __maybe_unused)
+{
+ raw_res_spin_lock(&rqspinlock);
+ return 0;
+}
+
+static void torture_raw_res_spin_write_unlock(int tid __maybe_unused)
+{
+ raw_res_spin_unlock(&rqspinlock);
+}
+
+static struct lock_torture_ops raw_res_spin_lock_ops = {
+ .writelock = torture_raw_res_spin_write_lock,
+ .write_delay = torture_spin_lock_write_delay,
+ .task_boost = torture_rt_boost,
+ .writeunlock = torture_raw_res_spin_write_unlock,
+ .readlock = NULL,
+ .read_delay = NULL,
+ .readunlock = NULL,
+ .name = "raw_res_spin_lock"
+};
+
+static int torture_raw_res_spin_write_lock_irq(int tid __maybe_unused)
+{
+ unsigned long flags;
+
+ raw_res_spin_lock_irqsave(&rqspinlock, flags);
+ cxt.cur_ops->flags = flags;
+ return 0;
+}
+
+static void torture_raw_res_spin_write_unlock_irq(int tid __maybe_unused)
+{
+ raw_res_spin_unlock_irqrestore(&rqspinlock, cxt.cur_ops->flags);
+}
+
+static struct lock_torture_ops raw_res_spin_lock_irq_ops = {
+ .writelock = torture_raw_res_spin_write_lock_irq,
+ .write_delay = torture_spin_lock_write_delay,
+ .task_boost = torture_rt_boost,
+ .writeunlock = torture_raw_res_spin_write_unlock_irq,
+ .readlock = NULL,
+ .read_delay = NULL,
+ .readunlock = NULL,
+ .name = "raw_res_spin_lock_irq"
+};
+
static DEFINE_RWLOCK(torture_rwlock);
static int torture_rwlock_write_lock(int tid __maybe_unused)
@@ -1168,6 +1218,7 @@ static int __init lock_torture_init(void)
&lock_busted_ops,
&spin_lock_ops, &spin_lock_irq_ops,
&raw_spin_lock_ops, &raw_spin_lock_irq_ops,
+ &raw_res_spin_lock_ops, &raw_res_spin_lock_irq_ops,
&rw_lock_ops, &rw_lock_irq_ops,
&mutex_lock_ops,
&ww_mutex_lock_ops,
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 93f928bc4e9c..49b4f3c75a3e 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -86,6 +86,7 @@ struct rqspinlock_timeout {
#define RES_TIMEOUT_VAL 2
DEFINE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
+EXPORT_SYMBOL_GPL(rqspinlock_held_locks);
static bool is_lock_released(rqspinlock_t *lock, u32 mask, struct rqspinlock_timeout *ts)
{
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (15 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 16/26] rqspinlock: Add locktorture support Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-08 1:58 ` Alexei Starovoitov
2025-02-10 9:53 ` Peter Zijlstra
2025-02-06 10:54 ` [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS Kumar Kartikeya Dwivedi
` (11 subsequent siblings)
28 siblings, 2 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Ankur Arora, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Currently, for rqspinlock usage, the implementations of
smp_cond_load_acquire (and thus atomic_cond_read_acquire) are
susceptible to stalls on arm64, because they do not guarantee that the
conditional expression will be repeatedly invoked if the address being
loaded from is not written to by other CPUs. When support for the event
stream is absent (it unblocks stuck WFE-based loops every ~100us), we
may end up being stuck forever.
This causes a problem for us, as we need to repeatedly invoke
RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
expires.
Hardcode the implementation to the asm-generic version in rqspinlock.c
until support for smp_cond_load_acquire_timewait [0] lands upstream.
[0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/locking/rqspinlock.c | 41 ++++++++++++++++++++++++++++++++++---
1 file changed, 38 insertions(+), 3 deletions(-)
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index 49b4f3c75a3e..b4cceeecf29c 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -325,6 +325,41 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
*/
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
+/*
+ * Hardcode smp_cond_load_acquire and atomic_cond_read_acquire implementations
+ * to the asm-generic implementation. In rqspinlock code, our conditional
+ * expression involves checking the value _and_ additionally a timeout. However,
+ * on arm64, the WFE-based implementation may never spin again if no stores
+ * occur to the locked byte in the lock word. As such, we may be stuck forever
+ * if event-stream based unblocking is not available on the platform for WFE
+ * spin loops (arch_timer_evtstrm_available).
+ *
+ * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
+ * workaround.
+ *
+ * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
+ */
+#define res_smp_cond_load_relaxed(ptr, cond_expr) ({ \
+ typeof(ptr) __PTR = (ptr); \
+ __unqual_scalar_typeof(*ptr) VAL; \
+ for (;;) { \
+ VAL = READ_ONCE(*__PTR); \
+ if (cond_expr) \
+ break; \
+ cpu_relax(); \
+ } \
+ (typeof(*ptr))VAL; \
+})
+
+#define res_smp_cond_load_acquire(ptr, cond_expr) ({ \
+ __unqual_scalar_typeof(*ptr) _val; \
+ _val = res_smp_cond_load_relaxed(ptr, cond_expr); \
+ smp_acquire__after_ctrl_dep(); \
+ (typeof(*ptr))_val; \
+})
+
+#define res_atomic_cond_read_acquire(v, c) res_smp_cond_load_acquire(&(v)->counter, (c))
+
/**
* resilient_queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
@@ -419,7 +454,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
*/
if (val & _Q_LOCKED_MASK) {
RES_RESET_TIMEOUT(ts);
- smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
+ res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
}
if (ret) {
@@ -568,8 +603,8 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
* does not imply a full barrier.
*/
RES_RESET_TIMEOUT(ts);
- val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
- RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
+ val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
+ RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
waitq_timeout:
if (ret) {
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (16 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-07 14:14 ` kernel test robot
` (2 more replies)
2025-02-06 10:54 ` [PATCH bpf-next v2 19/26] bpf: Convert hashtab.c to rqspinlock Kumar Kartikeya Dwivedi
` (10 subsequent siblings)
28 siblings, 3 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Ensure that rqspinlock is built when qspinlock support and the BPF
subsystem are enabled. Also, add the file under the BPF MAINTAINERS
entry so that all patches changing code in the file end up Cc'ing
bpf@vger and the maintainers/reviewers.
Ensure that the rqspinlock code is only built when the BPF subsystem is
compiled in. Depending on queued spinlock support, we may or may not end
up building the queued spinlock slowpath, and may instead fall back to
the test-and-set implementation.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
MAINTAINERS | 3 +++
include/asm-generic/Kbuild | 1 +
kernel/locking/Makefile | 1 +
3 files changed, 5 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 896a307fa065..4d81f3303c79 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4305,6 +4305,9 @@ F: include/uapi/linux/filter.h
F: kernel/bpf/
F: kernel/trace/bpf_trace.c
F: lib/buildid.c
+F: arch/*/include/asm/rqspinlock.h
+F: include/asm-generic/rqspinlock.h
+F: kernel/locking/rqspinlock.c
F: lib/test_bpf.c
F: net/bpf/
F: net/core/filter.c
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 1b43c3a77012..8675b7b4ad23 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -45,6 +45,7 @@ mandatory-y += pci.h
mandatory-y += percpu.h
mandatory-y += pgalloc.h
mandatory-y += preempt.h
+mandatory-y += rqspinlock.h
mandatory-y += runtime-const.h
mandatory-y += rwonce.h
mandatory-y += sections.h
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 0db4093d17b8..5645e9029bc0 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
+obj-$(CONFIG_BPF_SYSCALL) += rqspinlock.o
obj-$(CONFIG_RT_MUTEXES) += rtmutex_api.o
obj-$(CONFIG_PREEMPT_RT) += spinlock_rt.o ww_rt_mutex.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 19/26] bpf: Convert hashtab.c to rqspinlock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (17 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-08 2:01 ` Alexei Starovoitov
2025-02-06 10:54 ` [PATCH bpf-next v2 20/26] bpf: Convert percpu_freelist.c " Kumar Kartikeya Dwivedi
` (9 subsequent siblings)
28 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed
per-CPU counter crud from the code base, which is no longer necessary.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/hashtab.c | 102 ++++++++++++++-----------------------------
1 file changed, 32 insertions(+), 70 deletions(-)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4a9eeb7aef85..9b394e147967 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -16,6 +16,7 @@
#include "bpf_lru_list.h"
#include "map_in_map.h"
#include <linux/bpf_mem_alloc.h>
+#include <asm/rqspinlock.h>
#define HTAB_CREATE_FLAG_MASK \
(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
@@ -78,7 +79,7 @@
*/
struct bucket {
struct hlist_nulls_head head;
- raw_spinlock_t raw_lock;
+ rqspinlock_t raw_lock;
};
#define HASHTAB_MAP_LOCK_COUNT 8
@@ -104,8 +105,6 @@ struct bpf_htab {
u32 n_buckets; /* number of hash buckets */
u32 elem_size; /* size of each element in bytes */
u32 hashrnd;
- struct lock_class_key lockdep_key;
- int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
};
/* each htab element is struct htab_elem + key + value */
@@ -140,45 +139,26 @@ static void htab_init_buckets(struct bpf_htab *htab)
for (i = 0; i < htab->n_buckets; i++) {
INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
- raw_spin_lock_init(&htab->buckets[i].raw_lock);
- lockdep_set_class(&htab->buckets[i].raw_lock,
- &htab->lockdep_key);
+ raw_res_spin_lock_init(&htab->buckets[i].raw_lock);
cond_resched();
}
}
-static inline int htab_lock_bucket(const struct bpf_htab *htab,
- struct bucket *b, u32 hash,
- unsigned long *pflags)
+static inline int htab_lock_bucket(struct bucket *b, unsigned long *pflags)
{
unsigned long flags;
+ int ret;
- hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
-
- preempt_disable();
- local_irq_save(flags);
- if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
- __this_cpu_dec(*(htab->map_locked[hash]));
- local_irq_restore(flags);
- preempt_enable();
- return -EBUSY;
- }
-
- raw_spin_lock(&b->raw_lock);
+ ret = raw_res_spin_lock_irqsave(&b->raw_lock, flags);
+ if (ret)
+ return ret;
*pflags = flags;
-
return 0;
}
-static inline void htab_unlock_bucket(const struct bpf_htab *htab,
- struct bucket *b, u32 hash,
- unsigned long flags)
+static inline void htab_unlock_bucket(struct bucket *b, unsigned long flags)
{
- hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
- raw_spin_unlock(&b->raw_lock);
- __this_cpu_dec(*(htab->map_locked[hash]));
- local_irq_restore(flags);
- preempt_enable();
+ raw_res_spin_unlock_irqrestore(&b->raw_lock, flags);
}
static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
@@ -483,14 +463,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
struct bpf_htab *htab;
- int err, i;
+ int err;
htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE);
if (!htab)
return ERR_PTR(-ENOMEM);
- lockdep_register_key(&htab->lockdep_key);
-
bpf_map_init_from_attr(&htab->map, attr);
if (percpu_lru) {
@@ -536,15 +514,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
if (!htab->buckets)
goto free_elem_count;
- for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++) {
- htab->map_locked[i] = bpf_map_alloc_percpu(&htab->map,
- sizeof(int),
- sizeof(int),
- GFP_USER);
- if (!htab->map_locked[i])
- goto free_map_locked;
- }
-
if (htab->map.map_flags & BPF_F_ZERO_SEED)
htab->hashrnd = 0;
else
@@ -607,15 +576,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
free_map_locked:
if (htab->use_percpu_counter)
percpu_counter_destroy(&htab->pcount);
- for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
- free_percpu(htab->map_locked[i]);
bpf_map_area_free(htab->buckets);
bpf_mem_alloc_destroy(&htab->pcpu_ma);
bpf_mem_alloc_destroy(&htab->ma);
free_elem_count:
bpf_map_free_elem_count(&htab->map);
free_htab:
- lockdep_unregister_key(&htab->lockdep_key);
bpf_map_area_free(htab);
return ERR_PTR(err);
}
@@ -817,7 +783,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
b = __select_bucket(htab, tgt_l->hash);
head = &b->head;
- ret = htab_lock_bucket(htab, b, tgt_l->hash, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret)
return false;
@@ -828,7 +794,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
break;
}
- htab_unlock_bucket(htab, b, tgt_l->hash, flags);
+ htab_unlock_bucket(b, flags);
if (l == tgt_l)
check_and_free_fields(htab, l);
@@ -1147,7 +1113,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
*/
}
- ret = htab_lock_bucket(htab, b, hash, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret)
return ret;
@@ -1198,7 +1164,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
check_and_free_fields(htab, l_old);
}
}
- htab_unlock_bucket(htab, b, hash, flags);
+ htab_unlock_bucket(b, flags);
if (l_old) {
if (old_map_ptr)
map->ops->map_fd_put_ptr(map, old_map_ptr, true);
@@ -1207,7 +1173,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
}
return 0;
err:
- htab_unlock_bucket(htab, b, hash, flags);
+ htab_unlock_bucket(b, flags);
return ret;
}
@@ -1254,7 +1220,7 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
copy_map_value(&htab->map,
l_new->key + round_up(map->key_size, 8), value);
- ret = htab_lock_bucket(htab, b, hash, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret)
goto err_lock_bucket;
@@ -1275,7 +1241,7 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
ret = 0;
err:
- htab_unlock_bucket(htab, b, hash, flags);
+ htab_unlock_bucket(b, flags);
err_lock_bucket:
if (ret)
@@ -1312,7 +1278,7 @@ static long __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
b = __select_bucket(htab, hash);
head = &b->head;
- ret = htab_lock_bucket(htab, b, hash, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret)
return ret;
@@ -1337,7 +1303,7 @@ static long __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
}
ret = 0;
err:
- htab_unlock_bucket(htab, b, hash, flags);
+ htab_unlock_bucket(b, flags);
return ret;
}
@@ -1378,7 +1344,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
return -ENOMEM;
}
- ret = htab_lock_bucket(htab, b, hash, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret)
goto err_lock_bucket;
@@ -1402,7 +1368,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
}
ret = 0;
err:
- htab_unlock_bucket(htab, b, hash, flags);
+ htab_unlock_bucket(b, flags);
err_lock_bucket:
if (l_new) {
bpf_map_dec_elem_count(&htab->map);
@@ -1444,7 +1410,7 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
b = __select_bucket(htab, hash);
head = &b->head;
- ret = htab_lock_bucket(htab, b, hash, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret)
return ret;
@@ -1454,7 +1420,7 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
else
ret = -ENOENT;
- htab_unlock_bucket(htab, b, hash, flags);
+ htab_unlock_bucket(b, flags);
if (l)
free_htab_elem(htab, l);
@@ -1480,7 +1446,7 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
b = __select_bucket(htab, hash);
head = &b->head;
- ret = htab_lock_bucket(htab, b, hash, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret)
return ret;
@@ -1491,7 +1457,7 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
else
ret = -ENOENT;
- htab_unlock_bucket(htab, b, hash, flags);
+ htab_unlock_bucket(b, flags);
if (l)
htab_lru_push_free(htab, l);
return ret;
@@ -1558,7 +1524,6 @@ static void htab_map_free_timers_and_wq(struct bpf_map *map)
static void htab_map_free(struct bpf_map *map)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
- int i;
/* bpf_free_used_maps() or close(map_fd) will trigger this map_free callback.
* bpf_free_used_maps() is called after bpf prog is no longer executing.
@@ -1583,9 +1548,6 @@ static void htab_map_free(struct bpf_map *map)
bpf_mem_alloc_destroy(&htab->ma);
if (htab->use_percpu_counter)
percpu_counter_destroy(&htab->pcount);
- for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
- free_percpu(htab->map_locked[i]);
- lockdep_unregister_key(&htab->lockdep_key);
bpf_map_area_free(htab);
}
@@ -1628,7 +1590,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
b = __select_bucket(htab, hash);
head = &b->head;
- ret = htab_lock_bucket(htab, b, hash, &bflags);
+ ret = htab_lock_bucket(b, &bflags);
if (ret)
return ret;
@@ -1665,7 +1627,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
hlist_nulls_del_rcu(&l->hash_node);
out_unlock:
- htab_unlock_bucket(htab, b, hash, bflags);
+ htab_unlock_bucket(b, bflags);
if (l) {
if (is_lru_map)
@@ -1787,7 +1749,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
head = &b->head;
/* do not grab the lock unless need it (bucket_cnt > 0). */
if (locked) {
- ret = htab_lock_bucket(htab, b, batch, &flags);
+ ret = htab_lock_bucket(b, &flags);
if (ret) {
rcu_read_unlock();
bpf_enable_instrumentation();
@@ -1810,7 +1772,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
/* Note that since bucket_cnt > 0 here, it is implicit
* that the locked was grabbed, so release it.
*/
- htab_unlock_bucket(htab, b, batch, flags);
+ htab_unlock_bucket(b, flags);
rcu_read_unlock();
bpf_enable_instrumentation();
goto after_loop;
@@ -1821,7 +1783,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
/* Note that since bucket_cnt > 0 here, it is implicit
* that the locked was grabbed, so release it.
*/
- htab_unlock_bucket(htab, b, batch, flags);
+ htab_unlock_bucket(b, flags);
rcu_read_unlock();
bpf_enable_instrumentation();
kvfree(keys);
@@ -1884,7 +1846,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
dst_val += value_size;
}
- htab_unlock_bucket(htab, b, batch, flags);
+ htab_unlock_bucket(b, flags);
locked = false;
while (node_to_free) {
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 20/26] bpf: Convert percpu_freelist.c to rqspinlock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (18 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 19/26] bpf: Convert hashtab.c to rqspinlock Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 21/26] bpf: Convert lpm_trie.c " Kumar Kartikeya Dwivedi
` (8 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Convert the percpu_freelist.c code to use rqspinlock, and remove the
extralist fallback and trylock-based acquisitions to avoid deadlocks.
The key thing to note is the retained while (true) loop that searches
through other CPUs when we fail to push a node due to locking errors.
This retains the behavior of the old code, which would keep trying until
it was able to successfully push the node back onto the freelist of some
CPU.
Technically, we should start iteration for this loop from
raw_smp_processor_id() + 1, but to avoid hitting the edge of nr_cpus,
we skip execution in the loop body instead.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/percpu_freelist.c | 113 ++++++++---------------------------
kernel/bpf/percpu_freelist.h | 4 +-
2 files changed, 27 insertions(+), 90 deletions(-)
diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c
index 034cf87b54e9..632762b57299 100644
--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -14,11 +14,9 @@ int pcpu_freelist_init(struct pcpu_freelist *s)
for_each_possible_cpu(cpu) {
struct pcpu_freelist_head *head = per_cpu_ptr(s->freelist, cpu);
- raw_spin_lock_init(&head->lock);
+ raw_res_spin_lock_init(&head->lock);
head->first = NULL;
}
- raw_spin_lock_init(&s->extralist.lock);
- s->extralist.first = NULL;
return 0;
}
@@ -34,58 +32,39 @@ static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
WRITE_ONCE(head->first, node);
}
-static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
+static inline bool ___pcpu_freelist_push(struct pcpu_freelist_head *head,
struct pcpu_freelist_node *node)
{
- raw_spin_lock(&head->lock);
- pcpu_freelist_push_node(head, node);
- raw_spin_unlock(&head->lock);
-}
-
-static inline bool pcpu_freelist_try_push_extra(struct pcpu_freelist *s,
- struct pcpu_freelist_node *node)
-{
- if (!raw_spin_trylock(&s->extralist.lock))
+ if (raw_res_spin_lock(&head->lock))
return false;
-
- pcpu_freelist_push_node(&s->extralist, node);
- raw_spin_unlock(&s->extralist.lock);
+ pcpu_freelist_push_node(head, node);
+ raw_res_spin_unlock(&head->lock);
return true;
}
-static inline void ___pcpu_freelist_push_nmi(struct pcpu_freelist *s,
- struct pcpu_freelist_node *node)
+void __pcpu_freelist_push(struct pcpu_freelist *s,
+ struct pcpu_freelist_node *node)
{
- int cpu, orig_cpu;
+ struct pcpu_freelist_head *head;
+ int cpu;
- orig_cpu = raw_smp_processor_id();
- while (1) {
- for_each_cpu_wrap(cpu, cpu_possible_mask, orig_cpu) {
- struct pcpu_freelist_head *head;
+ if (___pcpu_freelist_push(this_cpu_ptr(s->freelist), node))
+ return;
+ while (true) {
+ for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
+ if (cpu == raw_smp_processor_id())
+ continue;
head = per_cpu_ptr(s->freelist, cpu);
- if (raw_spin_trylock(&head->lock)) {
- pcpu_freelist_push_node(head, node);
- raw_spin_unlock(&head->lock);
- return;
- }
- }
-
- /* cannot lock any per cpu lock, try extralist */
- if (pcpu_freelist_try_push_extra(s, node))
+ if (raw_res_spin_lock(&head->lock))
+ continue;
+ pcpu_freelist_push_node(head, node);
+ raw_res_spin_unlock(&head->lock);
return;
+ }
}
}
-void __pcpu_freelist_push(struct pcpu_freelist *s,
- struct pcpu_freelist_node *node)
-{
- if (in_nmi())
- ___pcpu_freelist_push_nmi(s, node);
- else
- ___pcpu_freelist_push(this_cpu_ptr(s->freelist), node);
-}
-
void pcpu_freelist_push(struct pcpu_freelist *s,
struct pcpu_freelist_node *node)
{
@@ -120,71 +99,29 @@ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
{
+ struct pcpu_freelist_node *node = NULL;
struct pcpu_freelist_head *head;
- struct pcpu_freelist_node *node;
int cpu;
for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
head = per_cpu_ptr(s->freelist, cpu);
if (!READ_ONCE(head->first))
continue;
- raw_spin_lock(&head->lock);
+ if (raw_res_spin_lock(&head->lock))
+ continue;
node = head->first;
if (node) {
WRITE_ONCE(head->first, node->next);
- raw_spin_unlock(&head->lock);
+ raw_res_spin_unlock(&head->lock);
return node;
}
- raw_spin_unlock(&head->lock);
+ raw_res_spin_unlock(&head->lock);
}
-
- /* per cpu lists are all empty, try extralist */
- if (!READ_ONCE(s->extralist.first))
- return NULL;
- raw_spin_lock(&s->extralist.lock);
- node = s->extralist.first;
- if (node)
- WRITE_ONCE(s->extralist.first, node->next);
- raw_spin_unlock(&s->extralist.lock);
- return node;
-}
-
-static struct pcpu_freelist_node *
-___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
-{
- struct pcpu_freelist_head *head;
- struct pcpu_freelist_node *node;
- int cpu;
-
- for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
- head = per_cpu_ptr(s->freelist, cpu);
- if (!READ_ONCE(head->first))
- continue;
- if (raw_spin_trylock(&head->lock)) {
- node = head->first;
- if (node) {
- WRITE_ONCE(head->first, node->next);
- raw_spin_unlock(&head->lock);
- return node;
- }
- raw_spin_unlock(&head->lock);
- }
- }
-
- /* cannot pop from per cpu lists, try extralist */
- if (!READ_ONCE(s->extralist.first) || !raw_spin_trylock(&s->extralist.lock))
- return NULL;
- node = s->extralist.first;
- if (node)
- WRITE_ONCE(s->extralist.first, node->next);
- raw_spin_unlock(&s->extralist.lock);
return node;
}
struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s)
{
- if (in_nmi())
- return ___pcpu_freelist_pop_nmi(s);
return ___pcpu_freelist_pop(s);
}
diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h
index 3c76553cfe57..914798b74967 100644
--- a/kernel/bpf/percpu_freelist.h
+++ b/kernel/bpf/percpu_freelist.h
@@ -5,15 +5,15 @@
#define __PERCPU_FREELIST_H__
#include <linux/spinlock.h>
#include <linux/percpu.h>
+#include <asm/rqspinlock.h>
struct pcpu_freelist_head {
struct pcpu_freelist_node *first;
- raw_spinlock_t lock;
+ rqspinlock_t lock;
};
struct pcpu_freelist {
struct pcpu_freelist_head __percpu *freelist;
- struct pcpu_freelist_head extralist;
};
struct pcpu_freelist_node {
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 21/26] bpf: Convert lpm_trie.c to rqspinlock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (19 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 20/26] bpf: Convert percpu_freelist.c " Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 22/26] bpf: Introduce rqspinlock kfuncs Kumar Kartikeya Dwivedi
` (7 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Convert all LPM trie usage of raw_spinlock to rqspinlock.
Note that rcu_dereference_protected in trie_delete_elem is switched over
to plain rcu_dereference: the RCU read lock should be held from the BPF
program side or the eBPF syscall path, and the trie->lock is just
acquired before the dereference. It is not clear from the commit history
why the protected variant was used, but the above reasoning makes sense,
so switch over.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/lpm_trie.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index e8a772e64324..be66d7e520e0 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -15,6 +15,7 @@
#include <net/ipv6.h>
#include <uapi/linux/btf.h>
#include <linux/btf_ids.h>
+#include <asm/rqspinlock.h>
#include <linux/bpf_mem_alloc.h>
/* Intermediate node */
@@ -36,7 +37,7 @@ struct lpm_trie {
size_t n_entries;
size_t max_prefixlen;
size_t data_size;
- raw_spinlock_t lock;
+ rqspinlock_t lock;
};
/* This trie implements a longest prefix match algorithm that can be used to
@@ -342,7 +343,9 @@ static long trie_update_elem(struct bpf_map *map,
if (!new_node)
return -ENOMEM;
- raw_spin_lock_irqsave(&trie->lock, irq_flags);
+ ret = raw_res_spin_lock_irqsave(&trie->lock, irq_flags);
+ if (ret)
+ goto out_free;
new_node->prefixlen = key->prefixlen;
RCU_INIT_POINTER(new_node->child[0], NULL);
@@ -356,8 +359,7 @@ static long trie_update_elem(struct bpf_map *map,
*/
slot = &trie->root;
- while ((node = rcu_dereference_protected(*slot,
- lockdep_is_held(&trie->lock)))) {
+ while ((node = rcu_dereference(*slot))) {
matchlen = longest_prefix_match(trie, node, key);
if (node->prefixlen != matchlen ||
@@ -442,8 +444,8 @@ static long trie_update_elem(struct bpf_map *map,
rcu_assign_pointer(*slot, im_node);
out:
- raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
-
+ raw_res_spin_unlock_irqrestore(&trie->lock, irq_flags);
+out_free:
if (ret)
bpf_mem_cache_free(&trie->ma, new_node);
bpf_mem_cache_free_rcu(&trie->ma, free_node);
@@ -467,7 +469,9 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
if (key->prefixlen > trie->max_prefixlen)
return -EINVAL;
- raw_spin_lock_irqsave(&trie->lock, irq_flags);
+ ret = raw_res_spin_lock_irqsave(&trie->lock, irq_flags);
+ if (ret)
+ return ret;
/* Walk the tree looking for an exact key/length match and keeping
* track of the path we traverse. We will need to know the node
@@ -478,8 +482,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
trim = &trie->root;
trim2 = trim;
parent = NULL;
- while ((node = rcu_dereference_protected(
- *trim, lockdep_is_held(&trie->lock)))) {
+ while ((node = rcu_dereference(*trim))) {
matchlen = longest_prefix_match(trie, node, key);
if (node->prefixlen != matchlen ||
@@ -543,7 +546,7 @@ static long trie_delete_elem(struct bpf_map *map, void *_key)
free_node = node;
out:
- raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
+ raw_res_spin_unlock_irqrestore(&trie->lock, irq_flags);
bpf_mem_cache_free_rcu(&trie->ma, free_parent);
bpf_mem_cache_free_rcu(&trie->ma, free_node);
@@ -592,7 +595,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
offsetof(struct bpf_lpm_trie_key_u8, data);
trie->max_prefixlen = trie->data_size * 8;
- raw_spin_lock_init(&trie->lock);
+ raw_res_spin_lock_init(&trie->lock);
/* Allocate intermediate and leaf nodes from the same allocator */
leaf_size = sizeof(struct lpm_trie_node) + trie->data_size +
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 22/26] bpf: Introduce rqspinlock kfuncs
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (20 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 21/26] bpf: Convert lpm_trie.c " Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-07 13:43 ` kernel test robot
2025-02-06 10:54 ` [PATCH bpf-next v2 23/26] bpf: Handle allocation failure in acquire_lock_state Kumar Kartikeya Dwivedi
` (6 subsequent siblings)
28 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Introduce four new kfuncs: bpf_res_spin_lock and bpf_res_spin_unlock,
plus their irqsave/irqrestore variants, which wrap the rqspinlock APIs.
bpf_res_spin_lock returns a conditional result, depending on whether the
lock was acquired (NULL is returned when lock acquisition succeeds,
non-NULL upon failure). The memory pointed to by the returned pointer
upon failure can be dereferenced after the NULL check to obtain the
error code.
Instead of using the old bpf_spin_lock type, introduce a new type with
the same layout, and the same alignment, but a different name to avoid
type confusion.
Preemption is disabled upon successful lock acquisition; however, IRQs
are not. Special kfuncs can be introduced later to allow disabling IRQs
when taking a spin lock. Resilient locks are safe against AA deadlocks,
hence not disabling IRQs currently does not allow kernel safety to be
violated.
__irq_flag annotation is used to accept IRQ flags for the IRQ-variants,
with the same semantics as existing bpf_local_irq_{save, restore}.
These kfuncs will require additional verifier-side support in subsequent
commits, to allow programs to hold multiple locks at the same time.
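As an illustrative BPF-side sketch (the includes, kfunc declarations,
map layout, and section names here are assumptions made for the example
and may differ from the selftests added later in the series):
  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  struct val {
          struct bpf_res_spin_lock lock;
          int data;
  };
  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, struct val);
  } arr SEC(".maps");
  extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __weak __ksym;
  extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __weak __ksym;
  SEC("tc")
  int use_res_lock(void *ctx)
  {
          int key = 0, ret;
          struct val *v = bpf_map_lookup_elem(&arr, &key);
          if (!v)
                  return 0;
          ret = bpf_res_spin_lock(&v->lock);
          if (ret)
                  return 0; /* lock not acquired: timeout or deadlock was detected */
          v->data++;
          bpf_res_spin_unlock(&v->lock);
          return 0;
  }
  char _license[] SEC("license") = "GPL";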
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/asm-generic/rqspinlock.h | 7 +++
include/linux/bpf.h | 1 +
kernel/locking/rqspinlock.c | 78 ++++++++++++++++++++++++++++++++
3 files changed, 86 insertions(+)
diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 46119fc768b8..8249c2da09ad 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -23,6 +23,13 @@ struct rqspinlock {
};
};
+/* Even though this is same as struct rqspinlock, we need to emit a distinct
+ * type in BTF for BPF programs.
+ */
+struct bpf_res_spin_lock {
+ u32 val;
+};
+
struct qspinlock;
#ifdef CONFIG_QUEUED_SPINLOCKS
typedef struct qspinlock rqspinlock_t;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f3f50e29d639..35af09ee6a2c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -30,6 +30,7 @@
#include <linux/static_call.h>
#include <linux/memcontrol.h>
#include <linux/cfi.h>
+#include <asm/rqspinlock.h>
struct bpf_verifier_env;
struct bpf_verifier_log;
diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
index b4cceeecf29c..d05333203671 100644
--- a/kernel/locking/rqspinlock.c
+++ b/kernel/locking/rqspinlock.c
@@ -15,6 +15,8 @@
#include <linux/smp.h>
#include <linux/bug.h>
+#include <linux/bpf.h>
+#include <linux/err.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
@@ -686,3 +688,79 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val,
EXPORT_SYMBOL(resilient_queued_spin_lock_slowpath);
#endif /* CONFIG_QUEUED_SPINLOCKS */
+
+__bpf_kfunc_start_defs();
+
+#define REPORT_STR(ret) ({ ret == -ETIMEDOUT ? "Timeout detected" : "AA or ABBA deadlock detected"; })
+
+__bpf_kfunc int bpf_res_spin_lock(struct bpf_res_spin_lock *lock)
+{
+ int ret;
+
+ BUILD_BUG_ON(sizeof(rqspinlock_t) != sizeof(struct bpf_res_spin_lock));
+ BUILD_BUG_ON(__alignof__(rqspinlock_t) != __alignof__(struct bpf_res_spin_lock));
+
+ preempt_disable();
+ ret = res_spin_lock((rqspinlock_t *)lock);
+ if (unlikely(ret)) {
+ preempt_enable();
+ rqspinlock_report_violation(REPORT_STR(ret), lock);
+ return ret;
+ }
+ return 0;
+}
+
+__bpf_kfunc void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock)
+{
+ res_spin_unlock((rqspinlock_t *)lock);
+ preempt_enable();
+}
+
+__bpf_kfunc int bpf_res_spin_lock_irqsave(struct bpf_res_spin_lock *lock, unsigned long *flags__irq_flag)
+{
+ u64 *ptr = (u64 *)flags__irq_flag;
+ unsigned long flags;
+ int ret;
+
+ preempt_disable();
+ local_irq_save(flags);
+ ret = res_spin_lock((rqspinlock_t *)lock);
+ if (unlikely(ret)) {
+ local_irq_restore(flags);
+ preempt_enable();
+ rqspinlock_report_violation(REPORT_STR(ret), lock);
+ return ret;
+ }
+ *ptr = flags;
+ return 0;
+}
+
+__bpf_kfunc void bpf_res_spin_unlock_irqrestore(struct bpf_res_spin_lock *lock, unsigned long *flags__irq_flag)
+{
+ u64 *ptr = (u64 *)flags__irq_flag;
+ unsigned long flags = *ptr;
+
+ res_spin_unlock((rqspinlock_t *)lock);
+ local_irq_restore(flags);
+ preempt_enable();
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(rqspinlock_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_res_spin_lock, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_res_spin_unlock)
+BTF_ID_FLAGS(func, bpf_res_spin_lock_irqsave, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_res_spin_unlock_irqrestore)
+BTF_KFUNCS_END(rqspinlock_kfunc_ids)
+
+static const struct btf_kfunc_id_set rqspinlock_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &rqspinlock_kfunc_ids,
+};
+
+static __init int rqspinlock_register_kfuncs(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &rqspinlock_kfunc_set);
+}
+late_initcall(rqspinlock_register_kfuncs);
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 23/26] bpf: Handle allocation failure in acquire_lock_state
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (21 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 22/26] bpf: Introduce rqspinlock kfuncs Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-08 2:04 ` Alexei Starovoitov
2025-02-06 10:54 ` [PATCH bpf-next v2 24/26] bpf: Implement verifier support for rqspinlock Kumar Kartikeya Dwivedi
` (5 subsequent siblings)
28 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
The acquire_lock_state function needs to handle possible NULL values
returned by acquire_reference_state, and return -ENOMEM.
Fixes: 769b0f1c8214 ("bpf: Refactor {acquire,release}_reference_state")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9971c03adfd5..d6999d085c7d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1501,6 +1501,8 @@ static int acquire_lock_state(struct bpf_verifier_env *env, int insn_idx, enum r
struct bpf_reference_state *s;
s = acquire_reference_state(env, insn_idx);
+ if (!s)
+ return -ENOMEM;
s->type = type;
s->id = id;
s->ptr = ptr;
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 24/26] bpf: Implement verifier support for rqspinlock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (22 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 23/26] bpf: Handle allocation failure in acquire_lock_state Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-12 0:08 ` Eduard Zingerman
2025-02-06 10:54 ` [PATCH bpf-next v2 25/26] bpf: Maintain FIFO property for rqspinlock unlock Kumar Kartikeya Dwivedi
` (4 subsequent siblings)
28 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Introduce verifier-side support for rqspinlock kfuncs. The first step is
allowing the bpf_res_spin_lock type to be defined in map values and
allocated objects, so the BTF side is updated with a new
BPF_RES_SPIN_LOCK field type to recognize and validate it.
An object cannot contain both bpf_spin_lock and bpf_res_spin_lock; only
one of the two (and, as before, at most one per object) may be present.
bpf_res_spin_lock can also be used to protect objects that require lock
protection for their kfuncs, like BPF rbtree and linked list.
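As a rough illustration (the struct and map names below are placeholders,
not part of this patch; the selftests added later in the series use the
same pattern), a map value carrying the new field might look like this on
the BPF program side:

  struct elem {
          struct bpf_res_spin_lock lock;  /* at most one lock field per value */
          int data;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, struct elem);
  } elem_map SEC(".maps");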
The verifier plumbing to simulate the success and failure cases when
calling these kfuncs is done by pushing a new verifier state onto the
verifier state stack; the forked state verifies the failure case. The
success path creates the lock reference state and, for the irqsave
variants, the IRQ state. The failure path clears registers r0-r5, sets
the return value, and skips kfunc processing, proceeding to the next
instruction.
The return value is marked as 0 for the success case and as
[-MAX_ERRNO, -1] for the failure case. Hence, whenever the program
checks the return value as 'if (ret)' or 'if (ret < 0)', the verifier
never traverses those branches in the success case, and knows that the
lock is not held when they are taken.
We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs
are invoked. We introduce a kfunc_class state to avoid mixing lock
irqrestore kfuncs with IRQ state created by bpf_local_irq_save.
With all this infrastructure, these kfuncs become usable in programs
while satisfying all safety properties required by the kernel.
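As a rough usage sketch (mirroring the selftests added later in this
series; the program section and the placeholder map from the sketch above
are assumptions, not part of this patch), a program must check the return
value before treating the lock as held:

  SEC("tc")
  int res_lock_usage(struct __sk_buff *ctx)
  {
          struct elem *e = bpf_map_lookup_elem(&elem_map, &(int){0});

          if (!e)
                  return 0;
          /* Verifier treats the lock as held only on the ret == 0 path. */
          if (bpf_res_spin_lock(&e->lock))
                  return 0;
          e->data++;
          bpf_res_spin_unlock(&e->lock);
          return 0;
  }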
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 9 ++
include/linux/bpf_verifier.h | 17 ++-
kernel/bpf/btf.c | 26 ++++-
kernel/bpf/syscall.c | 6 +-
kernel/bpf/verifier.c | 219 ++++++++++++++++++++++++++++-------
5 files changed, 232 insertions(+), 45 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 35af09ee6a2c..91dddf7396f9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -205,6 +205,7 @@ enum btf_field_type {
BPF_REFCOUNT = (1 << 9),
BPF_WORKQUEUE = (1 << 10),
BPF_UPTR = (1 << 11),
+ BPF_RES_SPIN_LOCK = (1 << 12),
};
typedef void (*btf_dtor_kfunc_t)(void *);
@@ -240,6 +241,7 @@ struct btf_record {
u32 cnt;
u32 field_mask;
int spin_lock_off;
+ int res_spin_lock_off;
int timer_off;
int wq_off;
int refcount_off;
@@ -315,6 +317,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
switch (type) {
case BPF_SPIN_LOCK:
return "bpf_spin_lock";
+ case BPF_RES_SPIN_LOCK:
+ return "bpf_res_spin_lock";
case BPF_TIMER:
return "bpf_timer";
case BPF_WORKQUEUE:
@@ -347,6 +351,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
switch (type) {
case BPF_SPIN_LOCK:
return sizeof(struct bpf_spin_lock);
+ case BPF_RES_SPIN_LOCK:
+ return sizeof(struct bpf_res_spin_lock);
case BPF_TIMER:
return sizeof(struct bpf_timer);
case BPF_WORKQUEUE:
@@ -377,6 +383,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
switch (type) {
case BPF_SPIN_LOCK:
return __alignof__(struct bpf_spin_lock);
+ case BPF_RES_SPIN_LOCK:
+ return __alignof__(struct bpf_res_spin_lock);
case BPF_TIMER:
return __alignof__(struct bpf_timer);
case BPF_WORKQUEUE:
@@ -420,6 +428,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
case BPF_RB_ROOT:
/* RB_ROOT_CACHED 0-inits, no need to do anything after memset */
case BPF_SPIN_LOCK:
+ case BPF_RES_SPIN_LOCK:
case BPF_TIMER:
case BPF_WORKQUEUE:
case BPF_KPTR_UNREF:
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 32c23f2a3086..ed444e44f524 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -115,6 +115,15 @@ struct bpf_reg_state {
int depth:30;
} iter;
+ /* For irq stack slots */
+ struct {
+ enum {
+ IRQ_KFUNC_IGNORE,
+ IRQ_NATIVE_KFUNC,
+ IRQ_LOCK_KFUNC,
+ } kfunc_class;
+ } irq;
+
/* Max size from any of the above. */
struct {
unsigned long raw1;
@@ -255,9 +264,11 @@ struct bpf_reference_state {
* default to pointer reference on zero initialization of a state.
*/
enum ref_state_type {
- REF_TYPE_PTR = 1,
- REF_TYPE_IRQ = 2,
- REF_TYPE_LOCK = 3,
+ REF_TYPE_PTR = (1 << 1),
+ REF_TYPE_IRQ = (1 << 2),
+ REF_TYPE_LOCK = (1 << 3),
+ REF_TYPE_RES_LOCK = (1 << 4),
+ REF_TYPE_RES_LOCK_IRQ = (1 << 5),
} type;
/* Track each reference created with a unique id, even if the same
* instruction creates the reference multiple times (eg, via CALL).
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 9433b6467bbe..aba6183253ea 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3480,6 +3480,15 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
goto end;
}
}
+ if (field_mask & BPF_RES_SPIN_LOCK) {
+ if (!strcmp(name, "bpf_res_spin_lock")) {
+ if (*seen_mask & BPF_RES_SPIN_LOCK)
+ return -E2BIG;
+ *seen_mask |= BPF_RES_SPIN_LOCK;
+ type = BPF_RES_SPIN_LOCK;
+ goto end;
+ }
+ }
if (field_mask & BPF_TIMER) {
if (!strcmp(name, "bpf_timer")) {
if (*seen_mask & BPF_TIMER)
@@ -3658,6 +3667,7 @@ static int btf_find_field_one(const struct btf *btf,
switch (field_type) {
case BPF_SPIN_LOCK:
+ case BPF_RES_SPIN_LOCK:
case BPF_TIMER:
case BPF_WORKQUEUE:
case BPF_LIST_NODE:
@@ -3951,6 +3961,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
return ERR_PTR(-ENOMEM);
rec->spin_lock_off = -EINVAL;
+ rec->res_spin_lock_off = -EINVAL;
rec->timer_off = -EINVAL;
rec->wq_off = -EINVAL;
rec->refcount_off = -EINVAL;
@@ -3978,6 +3989,11 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
/* Cache offset for faster lookup at runtime */
rec->spin_lock_off = rec->fields[i].offset;
break;
+ case BPF_RES_SPIN_LOCK:
+ WARN_ON_ONCE(rec->spin_lock_off >= 0);
+ /* Cache offset for faster lookup at runtime */
+ rec->res_spin_lock_off = rec->fields[i].offset;
+ break;
case BPF_TIMER:
WARN_ON_ONCE(rec->timer_off >= 0);
/* Cache offset for faster lookup at runtime */
@@ -4021,9 +4037,15 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
rec->cnt++;
}
+ if (rec->spin_lock_off >= 0 && rec->res_spin_lock_off >= 0) {
+ ret = -EINVAL;
+ goto end;
+ }
+
/* bpf_{list_head, rb_node} require bpf_spin_lock */
if ((btf_record_has_field(rec, BPF_LIST_HEAD) ||
- btf_record_has_field(rec, BPF_RB_ROOT)) && rec->spin_lock_off < 0) {
+ btf_record_has_field(rec, BPF_RB_ROOT)) &&
+ (rec->spin_lock_off < 0 && rec->res_spin_lock_off < 0)) {
ret = -EINVAL;
goto end;
}
@@ -5636,7 +5658,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
type = &tab->types[tab->cnt];
type->btf_id = i;
- record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
+ record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT |
BPF_KPTR, t->size);
/* The record cannot be unset, treat it as an error if so */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c420edbfb7c8..054707215d28 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -648,6 +648,7 @@ void btf_record_free(struct btf_record *rec)
case BPF_RB_ROOT:
case BPF_RB_NODE:
case BPF_SPIN_LOCK:
+ case BPF_RES_SPIN_LOCK:
case BPF_TIMER:
case BPF_REFCOUNT:
case BPF_WORKQUEUE:
@@ -700,6 +701,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
case BPF_RB_ROOT:
case BPF_RB_NODE:
case BPF_SPIN_LOCK:
+ case BPF_RES_SPIN_LOCK:
case BPF_TIMER:
case BPF_REFCOUNT:
case BPF_WORKQUEUE:
@@ -777,6 +779,7 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
switch (fields[i].type) {
case BPF_SPIN_LOCK:
+ case BPF_RES_SPIN_LOCK:
break;
case BPF_TIMER:
bpf_timer_cancel_and_free(field_ptr);
@@ -1203,7 +1206,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
return -EINVAL;
map->record = btf_parse_fields(btf, value_type,
- BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
+ BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
map->value_size);
if (!IS_ERR_OR_NULL(map->record)) {
@@ -1222,6 +1225,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
case 0:
continue;
case BPF_SPIN_LOCK:
+ case BPF_RES_SPIN_LOCK:
if (map->map_type != BPF_MAP_TYPE_HASH &&
map->map_type != BPF_MAP_TYPE_ARRAY &&
map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6999d085c7d..294761dd0072 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -456,7 +456,7 @@ static bool subprog_is_exc_cb(struct bpf_verifier_env *env, int subprog)
static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
{
- return btf_record_has_field(reg_btf_record(reg), BPF_SPIN_LOCK);
+ return btf_record_has_field(reg_btf_record(reg), BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK);
}
static bool type_is_rdonly_mem(u32 type)
@@ -1148,7 +1148,8 @@ static int release_irq_state(struct bpf_verifier_state *state, int id);
static int mark_stack_slot_irq_flag(struct bpf_verifier_env *env,
struct bpf_kfunc_call_arg_meta *meta,
- struct bpf_reg_state *reg, int insn_idx)
+ struct bpf_reg_state *reg, int insn_idx,
+ int kfunc_class)
{
struct bpf_func_state *state = func(env, reg);
struct bpf_stack_state *slot;
@@ -1170,6 +1171,7 @@ static int mark_stack_slot_irq_flag(struct bpf_verifier_env *env,
st->type = PTR_TO_STACK; /* we don't have dedicated reg type */
st->live |= REG_LIVE_WRITTEN;
st->ref_obj_id = id;
+ st->irq.kfunc_class = kfunc_class;
for (i = 0; i < BPF_REG_SIZE; i++)
slot->slot_type[i] = STACK_IRQ_FLAG;
@@ -1178,7 +1180,8 @@ static int mark_stack_slot_irq_flag(struct bpf_verifier_env *env,
return 0;
}
-static int unmark_stack_slot_irq_flag(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+static int unmark_stack_slot_irq_flag(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
+ int kfunc_class)
{
struct bpf_func_state *state = func(env, reg);
struct bpf_stack_state *slot;
@@ -1192,6 +1195,15 @@ static int unmark_stack_slot_irq_flag(struct bpf_verifier_env *env, struct bpf_r
slot = &state->stack[spi];
st = &slot->spilled_ptr;
+ if (kfunc_class != IRQ_KFUNC_IGNORE && st->irq.kfunc_class != kfunc_class) {
+ const char *flag_kfunc = st->irq.kfunc_class == IRQ_NATIVE_KFUNC ? "native" : "lock";
+ const char *used_kfunc = kfunc_class == IRQ_NATIVE_KFUNC ? "native" : "lock";
+
+ verbose(env, "irq flag acquired by %s kfuncs cannot be restored with %s kfuncs\n",
+ flag_kfunc, used_kfunc);
+ return -EINVAL;
+ }
+
err = release_irq_state(env->cur_state, st->ref_obj_id);
WARN_ON_ONCE(err && err != -EACCES);
if (err) {
@@ -1591,7 +1603,7 @@ static struct bpf_reference_state *find_lock_state(struct bpf_verifier_state *st
for (i = 0; i < state->acquired_refs; i++) {
struct bpf_reference_state *s = &state->refs[i];
- if (s->type != type)
+ if (!(s->type & type))
continue;
if (s->id == id && s->ptr == ptr)
@@ -7985,6 +7997,12 @@ static int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg
return err;
}
+enum {
+ PROCESS_SPIN_LOCK = (1 << 0),
+ PROCESS_RES_LOCK = (1 << 1),
+ PROCESS_LOCK_IRQ = (1 << 2),
+};
+
/* Implementation details:
* bpf_map_lookup returns PTR_TO_MAP_VALUE_OR_NULL.
* bpf_obj_new returns PTR_TO_BTF_ID | MEM_ALLOC | PTR_MAYBE_NULL.
@@ -8007,30 +8025,33 @@ static int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg
* env->cur_state->active_locks remembers which map value element or allocated
* object got locked and clears it after bpf_spin_unlock.
*/
-static int process_spin_lock(struct bpf_verifier_env *env, int regno,
- bool is_lock)
+static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
{
+ bool is_lock = flags & PROCESS_SPIN_LOCK, is_res_lock = flags & PROCESS_RES_LOCK;
+ const char *lock_str = is_res_lock ? "bpf_res_spin" : "bpf_spin";
struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
struct bpf_verifier_state *cur = env->cur_state;
bool is_const = tnum_is_const(reg->var_off);
+ bool is_irq = flags & PROCESS_LOCK_IRQ;
u64 val = reg->var_off.value;
struct bpf_map *map = NULL;
struct btf *btf = NULL;
struct btf_record *rec;
+ u32 spin_lock_off;
int err;
if (!is_const) {
verbose(env,
- "R%d doesn't have constant offset. bpf_spin_lock has to be at the constant offset\n",
- regno);
+ "R%d doesn't have constant offset. %s_lock has to be at the constant offset\n",
+ regno, lock_str);
return -EINVAL;
}
if (reg->type == PTR_TO_MAP_VALUE) {
map = reg->map_ptr;
if (!map->btf) {
verbose(env,
- "map '%s' has to have BTF in order to use bpf_spin_lock\n",
- map->name);
+ "map '%s' has to have BTF in order to use %s_lock\n",
+ map->name, lock_str);
return -EINVAL;
}
} else {
@@ -8038,36 +8059,53 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
}
rec = reg_btf_record(reg);
- if (!btf_record_has_field(rec, BPF_SPIN_LOCK)) {
- verbose(env, "%s '%s' has no valid bpf_spin_lock\n", map ? "map" : "local",
- map ? map->name : "kptr");
+ if (!btf_record_has_field(rec, is_res_lock ? BPF_RES_SPIN_LOCK : BPF_SPIN_LOCK)) {
+ verbose(env, "%s '%s' has no valid %s_lock\n", map ? "map" : "local",
+ map ? map->name : "kptr", lock_str);
return -EINVAL;
}
- if (rec->spin_lock_off != val + reg->off) {
- verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock' that is at %d\n",
- val + reg->off, rec->spin_lock_off);
+ spin_lock_off = is_res_lock ? rec->res_spin_lock_off : rec->spin_lock_off;
+ if (spin_lock_off != val + reg->off) {
+ verbose(env, "off %lld doesn't point to 'struct %s_lock' that is at %d\n",
+ val + reg->off, lock_str, spin_lock_off);
return -EINVAL;
}
if (is_lock) {
void *ptr;
+ int type;
if (map)
ptr = map;
else
ptr = btf;
- if (cur->active_locks) {
- verbose(env,
- "Locking two bpf_spin_locks are not allowed\n");
- return -EINVAL;
+ if (!is_res_lock && cur->active_locks) {
+ if (find_lock_state(env->cur_state, REF_TYPE_LOCK, 0, NULL)) {
+ verbose(env,
+ "Locking two bpf_spin_locks are not allowed\n");
+ return -EINVAL;
+ }
+ } else if (is_res_lock) {
+ if (find_lock_state(env->cur_state, REF_TYPE_RES_LOCK, reg->id, ptr)) {
+ verbose(env, "Acquiring the same lock again, AA deadlock detected\n");
+ return -EINVAL;
+ }
}
- err = acquire_lock_state(env, env->insn_idx, REF_TYPE_LOCK, reg->id, ptr);
+
+ if (is_res_lock && is_irq)
+ type = REF_TYPE_RES_LOCK_IRQ;
+ else if (is_res_lock)
+ type = REF_TYPE_RES_LOCK;
+ else
+ type = REF_TYPE_LOCK;
+ err = acquire_lock_state(env, env->insn_idx, type, reg->id, ptr);
if (err < 0) {
verbose(env, "Failed to acquire lock state\n");
return err;
}
} else {
void *ptr;
+ int type;
if (map)
ptr = map;
@@ -8075,12 +8113,18 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
ptr = btf;
if (!cur->active_locks) {
- verbose(env, "bpf_spin_unlock without taking a lock\n");
+ verbose(env, "%s_unlock without taking a lock\n", lock_str);
return -EINVAL;
}
- if (release_lock_state(env->cur_state, REF_TYPE_LOCK, reg->id, ptr)) {
- verbose(env, "bpf_spin_unlock of different lock\n");
+ if (is_res_lock && is_irq)
+ type = REF_TYPE_RES_LOCK_IRQ;
+ else if (is_res_lock)
+ type = REF_TYPE_RES_LOCK;
+ else
+ type = REF_TYPE_LOCK;
+ if (release_lock_state(cur, type, reg->id, ptr)) {
+ verbose(env, "%s_unlock of different lock\n", lock_str);
return -EINVAL;
}
@@ -9391,11 +9435,11 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
return -EACCES;
}
if (meta->func_id == BPF_FUNC_spin_lock) {
- err = process_spin_lock(env, regno, true);
+ err = process_spin_lock(env, regno, PROCESS_SPIN_LOCK);
if (err)
return err;
} else if (meta->func_id == BPF_FUNC_spin_unlock) {
- err = process_spin_lock(env, regno, false);
+ err = process_spin_lock(env, regno, 0);
if (err)
return err;
} else {
@@ -11274,7 +11318,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
regs[BPF_REG_0].map_uid = meta.map_uid;
regs[BPF_REG_0].type = PTR_TO_MAP_VALUE | ret_flag;
if (!type_may_be_null(ret_flag) &&
- btf_record_has_field(meta.map_ptr->record, BPF_SPIN_LOCK)) {
+ btf_record_has_field(meta.map_ptr->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
regs[BPF_REG_0].id = ++env->id_gen;
}
break;
@@ -11446,10 +11490,10 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
/* mark_btf_func_reg_size() is used when the reg size is determined by
* the BTF func_proto's return value size and argument.
*/
-static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
- size_t reg_size)
+static void __mark_btf_func_reg_size(struct bpf_verifier_env *env, struct bpf_reg_state *regs,
+ u32 regno, size_t reg_size)
{
- struct bpf_reg_state *reg = &cur_regs(env)[regno];
+ struct bpf_reg_state *reg = &regs[regno];
if (regno == BPF_REG_0) {
/* Function return value */
@@ -11467,6 +11511,12 @@ static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
}
}
+static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
+ size_t reg_size)
+{
+ return __mark_btf_func_reg_size(env, cur_regs(env), regno, reg_size);
+}
+
static bool is_kfunc_acquire(struct bpf_kfunc_call_arg_meta *meta)
{
return meta->kfunc_flags & KF_ACQUIRE;
@@ -11604,6 +11654,7 @@ enum {
KF_ARG_RB_ROOT_ID,
KF_ARG_RB_NODE_ID,
KF_ARG_WORKQUEUE_ID,
+ KF_ARG_RES_SPIN_LOCK_ID,
};
BTF_ID_LIST(kf_arg_btf_ids)
@@ -11613,6 +11664,7 @@ BTF_ID(struct, bpf_list_node)
BTF_ID(struct, bpf_rb_root)
BTF_ID(struct, bpf_rb_node)
BTF_ID(struct, bpf_wq)
+BTF_ID(struct, bpf_res_spin_lock)
static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
const struct btf_param *arg, int type)
@@ -11661,6 +11713,11 @@ static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID);
}
+static bool is_kfunc_arg_res_spin_lock(const struct btf *btf, const struct btf_param *arg)
+{
+ return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RES_SPIN_LOCK_ID);
+}
+
static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
const struct btf_param *arg)
{
@@ -11732,6 +11789,7 @@ enum kfunc_ptr_arg_type {
KF_ARG_PTR_TO_MAP,
KF_ARG_PTR_TO_WORKQUEUE,
KF_ARG_PTR_TO_IRQ_FLAG,
+ KF_ARG_PTR_TO_RES_SPIN_LOCK,
};
enum special_kfunc_type {
@@ -11768,6 +11826,10 @@ enum special_kfunc_type {
KF_bpf_iter_num_new,
KF_bpf_iter_num_next,
KF_bpf_iter_num_destroy,
+ KF_bpf_res_spin_lock,
+ KF_bpf_res_spin_unlock,
+ KF_bpf_res_spin_lock_irqsave,
+ KF_bpf_res_spin_unlock_irqrestore,
};
BTF_SET_START(special_kfunc_set)
@@ -11846,6 +11908,10 @@ BTF_ID(func, bpf_local_irq_restore)
BTF_ID(func, bpf_iter_num_new)
BTF_ID(func, bpf_iter_num_next)
BTF_ID(func, bpf_iter_num_destroy)
+BTF_ID(func, bpf_res_spin_lock)
+BTF_ID(func, bpf_res_spin_unlock)
+BTF_ID(func, bpf_res_spin_lock_irqsave)
+BTF_ID(func, bpf_res_spin_unlock_irqrestore)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
@@ -11939,6 +12005,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
if (is_kfunc_arg_irq_flag(meta->btf, &args[argno]))
return KF_ARG_PTR_TO_IRQ_FLAG;
+ if (is_kfunc_arg_res_spin_lock(meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_RES_SPIN_LOCK;
+
if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
if (!btf_type_is_struct(ref_t)) {
verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
@@ -12046,13 +12115,19 @@ static int process_irq_flag(struct bpf_verifier_env *env, int regno,
struct bpf_kfunc_call_arg_meta *meta)
{
struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+ int err, kfunc_class = IRQ_NATIVE_KFUNC;
bool irq_save;
- int err;
- if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_save]) {
+ if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_save] ||
+ meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave]) {
irq_save = true;
- } else if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_restore]) {
+ if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave])
+ kfunc_class = IRQ_LOCK_KFUNC;
+ } else if (meta->func_id == special_kfunc_list[KF_bpf_local_irq_restore] ||
+ meta->func_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore]) {
irq_save = false;
+ if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore])
+ kfunc_class = IRQ_LOCK_KFUNC;
} else {
verbose(env, "verifier internal error: unknown irq flags kfunc\n");
return -EFAULT;
@@ -12068,7 +12143,7 @@ static int process_irq_flag(struct bpf_verifier_env *env, int regno,
if (err)
return err;
- err = mark_stack_slot_irq_flag(env, meta, reg, env->insn_idx);
+ err = mark_stack_slot_irq_flag(env, meta, reg, env->insn_idx, kfunc_class);
if (err)
return err;
} else {
@@ -12082,7 +12157,7 @@ static int process_irq_flag(struct bpf_verifier_env *env, int regno,
if (err)
return err;
- err = unmark_stack_slot_irq_flag(env, reg);
+ err = unmark_stack_slot_irq_flag(env, reg, kfunc_class);
if (err)
return err;
}
@@ -12209,7 +12284,8 @@ static int check_reg_allocation_locked(struct bpf_verifier_env *env, struct bpf_
if (!env->cur_state->active_locks)
return -EINVAL;
- s = find_lock_state(env->cur_state, REF_TYPE_LOCK, id, ptr);
+ s = find_lock_state(env->cur_state, REF_TYPE_LOCK | REF_TYPE_RES_LOCK | REF_TYPE_RES_LOCK_IRQ,
+ id, ptr);
if (!s) {
verbose(env, "held lock and object are not in the same allocation\n");
return -EINVAL;
@@ -12245,9 +12321,18 @@ static bool is_bpf_graph_api_kfunc(u32 btf_id)
btf_id == special_kfunc_list[KF_bpf_refcount_acquire_impl];
}
+static bool is_bpf_res_spin_lock_kfunc(u32 btf_id)
+{
+ return btf_id == special_kfunc_list[KF_bpf_res_spin_lock] ||
+ btf_id == special_kfunc_list[KF_bpf_res_spin_unlock] ||
+ btf_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave] ||
+ btf_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore];
+}
+
static bool kfunc_spin_allowed(u32 btf_id)
{
- return is_bpf_graph_api_kfunc(btf_id) || is_bpf_iter_num_api_kfunc(btf_id);
+ return is_bpf_graph_api_kfunc(btf_id) || is_bpf_iter_num_api_kfunc(btf_id) ||
+ is_bpf_res_spin_lock_kfunc(btf_id);
}
static bool is_sync_callback_calling_kfunc(u32 btf_id)
@@ -12679,6 +12764,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
case KF_ARG_PTR_TO_CONST_STR:
case KF_ARG_PTR_TO_WORKQUEUE:
case KF_ARG_PTR_TO_IRQ_FLAG:
+ case KF_ARG_PTR_TO_RES_SPIN_LOCK:
break;
default:
WARN_ON_ONCE(1);
@@ -12977,6 +13063,28 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
if (ret < 0)
return ret;
break;
+ case KF_ARG_PTR_TO_RES_SPIN_LOCK:
+ {
+ int flags = PROCESS_RES_LOCK;
+
+ if (reg->type != PTR_TO_MAP_VALUE && reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
+ verbose(env, "arg#%d doesn't point to map value or allocated object\n", i);
+ return -EINVAL;
+ }
+
+ if (!is_bpf_res_spin_lock_kfunc(meta->func_id))
+ return -EFAULT;
+ if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock] ||
+ meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave])
+ flags |= PROCESS_SPIN_LOCK;
+ if (meta->func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave] ||
+ meta->func_id == special_kfunc_list[KF_bpf_res_spin_unlock_irqrestore])
+ flags |= PROCESS_LOCK_IRQ;
+ ret = process_spin_lock(env, regno, flags);
+ if (ret < 0)
+ return ret;
+ break;
+ }
}
}
@@ -13062,6 +13170,33 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
insn_aux->is_iter_next = is_iter_next_kfunc(&meta);
+ if (!insn->off &&
+ (insn->imm == special_kfunc_list[KF_bpf_res_spin_lock] ||
+ insn->imm == special_kfunc_list[KF_bpf_res_spin_lock_irqsave])) {
+ struct bpf_verifier_state *branch;
+ struct bpf_reg_state *regs;
+
+ branch = push_stack(env, env->insn_idx + 1, env->insn_idx, false);
+ if (!branch) {
+ verbose(env, "failed to push state for failed lock acquisition\n");
+ return -ENOMEM;
+ }
+
+ regs = branch->frame[branch->curframe]->regs;
+
+ /* Clear r0-r5 registers in forked state */
+ for (i = 0; i < CALLER_SAVED_REGS; i++)
+ mark_reg_not_init(env, regs, caller_saved[i]);
+
+ mark_reg_unknown(env, regs, BPF_REG_0);
+ err = __mark_reg_s32_range(env, regs, BPF_REG_0, -MAX_ERRNO, -1);
+ if (err) {
+ verbose(env, "failed to mark s32 range for retval in forked state for lock\n");
+ return err;
+ }
+ __mark_btf_func_reg_size(env, regs, BPF_REG_0, sizeof(u32));
+ }
+
if (is_kfunc_destructive(&meta) && !capable(CAP_SYS_BOOT)) {
verbose(env, "destructive kfunc calls require CAP_SYS_BOOT capability\n");
return -EACCES;
@@ -13232,6 +13367,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
if (btf_type_is_scalar(t)) {
mark_reg_unknown(env, regs, BPF_REG_0);
+ if (meta.btf == btf_vmlinux && (meta.func_id == special_kfunc_list[KF_bpf_res_spin_lock] ||
+ meta.func_id == special_kfunc_list[KF_bpf_res_spin_lock_irqsave]))
+ __mark_reg_const_zero(env, &regs[BPF_REG_0]);
mark_btf_func_reg_size(env, BPF_REG_0, t->size);
} else if (btf_type_is_ptr(t)) {
ptr_type = btf_type_skip_modifiers(desc_btf, t->type, &ptr_type_id);
@@ -18114,7 +18252,8 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
case STACK_IRQ_FLAG:
old_reg = &old->stack[spi].spilled_ptr;
cur_reg = &cur->stack[spi].spilled_ptr;
- if (!check_ids(old_reg->ref_obj_id, cur_reg->ref_obj_id, idmap))
+ if (!check_ids(old_reg->ref_obj_id, cur_reg->ref_obj_id, idmap) ||
+ old_reg->irq.kfunc_class != cur_reg->irq.kfunc_class)
return false;
break;
case STACK_MISC:
@@ -18158,6 +18297,8 @@ static bool refsafe(struct bpf_verifier_state *old, struct bpf_verifier_state *c
case REF_TYPE_IRQ:
break;
case REF_TYPE_LOCK:
+ case REF_TYPE_RES_LOCK:
+ case REF_TYPE_RES_LOCK_IRQ:
if (old->refs[i].ptr != cur->refs[i].ptr)
return false;
break;
@@ -19491,7 +19632,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
}
}
- if (btf_record_has_field(map->record, BPF_SPIN_LOCK)) {
+ if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
return -EINVAL;
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 25/26] bpf: Maintain FIFO property for rqspinlock unlock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (23 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 24/26] bpf: Implement verifier support for rqspinlock Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 26/26] selftests/bpf: Add tests for rqspinlock Kumar Kartikeya Dwivedi
` (3 subsequent siblings)
28 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Since out-of-order unlocks are unsupported for rqspinlock, and the
irqsave variants already enforce strict FIFO ordering, apply the same
rule to the normal non-irqsave variants, so that FIFO ordering is
enforced for them as well.
Two new verifier state fields (active_lock_id, active_lock_ptr) track
the top of the lock stack, and the previous entry's id and pointer are
restored into them whenever the topmost entry is popped by an unlock.
Take special care to make these fields part of the state comparison in
refsafe.
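For instance (the lock names are placeholders; the selftest added in the
next patch exercises the failing variant of this pattern), with two
resilient locks held, only unlocks in reverse order of acquisition are
accepted:

  if (bpf_res_spin_lock(&lockA))
          return 0;
  if (bpf_res_spin_lock(&lockB)) {
          bpf_res_spin_unlock(&lockA);
          return 0;
  }
  /* lockB must be released before lockA; releasing lockA first is now
   * rejected with "bpf_res_spin_unlock cannot be out of order".
   */
  bpf_res_spin_unlock(&lockB);
  bpf_res_spin_unlock(&lockA);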
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf_verifier.h | 3 +++
kernel/bpf/verifier.c | 33 ++++++++++++++++++++++++++++-----
2 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ed444e44f524..92cd2289b743 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -269,6 +269,7 @@ struct bpf_reference_state {
REF_TYPE_LOCK = (1 << 3),
REF_TYPE_RES_LOCK = (1 << 4),
REF_TYPE_RES_LOCK_IRQ = (1 << 5),
+ REF_TYPE_LOCK_MASK = REF_TYPE_LOCK | REF_TYPE_RES_LOCK | REF_TYPE_RES_LOCK_IRQ,
} type;
/* Track each reference created with a unique id, even if the same
* instruction creates the reference multiple times (eg, via CALL).
@@ -435,6 +436,8 @@ struct bpf_verifier_state {
u32 active_locks;
u32 active_preempt_locks;
u32 active_irq_id;
+ u32 active_lock_id;
+ void *active_lock_ptr;
bool active_rcu_lock;
bool speculative;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 294761dd0072..9cac6ea4f844 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1421,6 +1421,8 @@ static int copy_reference_state(struct bpf_verifier_state *dst, const struct bpf
dst->active_preempt_locks = src->active_preempt_locks;
dst->active_rcu_lock = src->active_rcu_lock;
dst->active_irq_id = src->active_irq_id;
+ dst->active_lock_id = src->active_lock_id;
+ dst->active_lock_ptr = src->active_lock_ptr;
return 0;
}
@@ -1520,6 +1522,8 @@ static int acquire_lock_state(struct bpf_verifier_env *env, int insn_idx, enum r
s->ptr = ptr;
state->active_locks++;
+ state->active_lock_id = id;
+ state->active_lock_ptr = ptr;
return 0;
}
@@ -1559,16 +1563,24 @@ static void release_reference_state(struct bpf_verifier_state *state, int idx)
static int release_lock_state(struct bpf_verifier_state *state, int type, int id, void *ptr)
{
+ void *prev_ptr = NULL;
+ u32 prev_id = 0;
int i;
for (i = 0; i < state->acquired_refs; i++) {
- if (state->refs[i].type != type)
- continue;
- if (state->refs[i].id == id && state->refs[i].ptr == ptr) {
+ if (state->refs[i].type == type && state->refs[i].id == id &&
+ state->refs[i].ptr == ptr) {
release_reference_state(state, i);
state->active_locks--;
+ /* Reassign active lock (id, ptr). */
+ state->active_lock_id = prev_id;
+ state->active_lock_ptr = prev_ptr;
return 0;
}
+ if (state->refs[i].type & REF_TYPE_LOCK_MASK) {
+ prev_id = state->refs[i].id;
+ prev_ptr = state->refs[i].ptr;
+ }
}
return -EINVAL;
}
@@ -8123,6 +8135,14 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, int flags)
type = REF_TYPE_RES_LOCK;
else
type = REF_TYPE_LOCK;
+ if (!find_lock_state(cur, type, reg->id, ptr)) {
+ verbose(env, "%s_unlock of different lock\n", lock_str);
+ return -EINVAL;
+ }
+ if (reg->id != cur->active_lock_id || ptr != cur->active_lock_ptr) {
+ verbose(env, "%s_unlock cannot be out of order\n", lock_str);
+ return -EINVAL;
+ }
if (release_lock_state(cur, type, reg->id, ptr)) {
verbose(env, "%s_unlock of different lock\n", lock_str);
return -EINVAL;
@@ -12284,8 +12304,7 @@ static int check_reg_allocation_locked(struct bpf_verifier_env *env, struct bpf_
if (!env->cur_state->active_locks)
return -EINVAL;
- s = find_lock_state(env->cur_state, REF_TYPE_LOCK | REF_TYPE_RES_LOCK | REF_TYPE_RES_LOCK_IRQ,
- id, ptr);
+ s = find_lock_state(env->cur_state, REF_TYPE_LOCK_MASK, id, ptr);
if (!s) {
verbose(env, "held lock and object are not in the same allocation\n");
return -EINVAL;
@@ -18288,6 +18307,10 @@ static bool refsafe(struct bpf_verifier_state *old, struct bpf_verifier_state *c
if (!check_ids(old->active_irq_id, cur->active_irq_id, idmap))
return false;
+ if (!check_ids(old->active_lock_id, cur->active_lock_id, idmap) ||
+ old->active_lock_ptr != cur->active_lock_ptr)
+ return false;
+
for (i = 0; i < old->acquired_refs; i++) {
if (!check_ids(old->refs[i].id, cur->refs[i].id, idmap) ||
old->refs[i].type != cur->refs[i].type)
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH bpf-next v2 26/26] selftests/bpf: Add tests for rqspinlock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (24 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 25/26] bpf: Maintain FIFO property for rqspinlock unlock Kumar Kartikeya Dwivedi
@ 2025-02-06 10:54 ` Kumar Kartikeya Dwivedi
2025-02-12 0:14 ` Eduard Zingerman
2025-02-10 9:31 ` [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Peter Zijlstra
` (2 subsequent siblings)
28 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:54 UTC (permalink / raw)
To: bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Introduce selftests that trigger AA and ABBA deadlocks, and test the
edge case where the held locks table runs out of entries, since we then
fall back to the timeout as the final line of defense. Also exercise
the verifier's AA detection where applicable.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
.../selftests/bpf/prog_tests/res_spin_lock.c | 99 +++++++
tools/testing/selftests/bpf/progs/irq.c | 53 ++++
.../selftests/bpf/progs/res_spin_lock.c | 143 ++++++++++
.../selftests/bpf/progs/res_spin_lock_fail.c | 244 ++++++++++++++++++
4 files changed, 539 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/res_spin_lock.c
create mode 100644 tools/testing/selftests/bpf/progs/res_spin_lock.c
create mode 100644 tools/testing/selftests/bpf/progs/res_spin_lock_fail.c
diff --git a/tools/testing/selftests/bpf/prog_tests/res_spin_lock.c b/tools/testing/selftests/bpf/prog_tests/res_spin_lock.c
new file mode 100644
index 000000000000..5a46b3e4a842
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/res_spin_lock.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <network_helpers.h>
+
+#include "res_spin_lock.skel.h"
+#include "res_spin_lock_fail.skel.h"
+
+static void test_res_spin_lock_failure(void)
+{
+ RUN_TESTS(res_spin_lock_fail);
+}
+
+static volatile int skip;
+
+static void *spin_lock_thread(void *arg)
+{
+ int err, prog_fd = *(u32 *) arg;
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .repeat = 10000,
+ );
+
+ while (!READ_ONCE(skip)) {
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "test_run");
+ ASSERT_OK(topts.retval, "test_run retval");
+ }
+ pthread_exit(arg);
+}
+
+static void test_res_spin_lock_success(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .repeat = 1,
+ );
+ struct res_spin_lock *skel;
+ pthread_t thread_id[16];
+ int prog_fd, i, err;
+ void *ret;
+
+ skel = res_spin_lock__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "res_spin_lock__open_and_load"))
+ return;
+ /* AA deadlock */
+ prog_fd = bpf_program__fd(skel->progs.res_spin_lock_test);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "error");
+ ASSERT_OK(topts.retval, "retval");
+
+ prog_fd = bpf_program__fd(skel->progs.res_spin_lock_test_held_lock_max);
+ err = bpf_prog_test_run_opts(prog_fd, &topts);
+ ASSERT_OK(err, "error");
+ ASSERT_OK(topts.retval, "retval");
+
+ /* Multi-threaded ABBA deadlock. */
+
+ prog_fd = bpf_program__fd(skel->progs.res_spin_lock_test_AB);
+ for (i = 0; i < 16; i++) {
+ int err;
+
+ err = pthread_create(&thread_id[i], NULL, &spin_lock_thread, &prog_fd);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto end;
+ }
+
+ topts.repeat = 1000;
+ int fd = bpf_program__fd(skel->progs.res_spin_lock_test_BA);
+ while (!topts.retval && !err && !READ_ONCE(skel->bss->err)) {
+ err = bpf_prog_test_run_opts(fd, &topts);
+ }
+
+ WRITE_ONCE(skip, true);
+
+ for (i = 0; i < 16; i++) {
+ if (!ASSERT_OK(pthread_join(thread_id[i], &ret), "pthread_join"))
+ goto end;
+ if (!ASSERT_EQ(ret, &prog_fd, "ret == prog_fd"))
+ goto end;
+ }
+
+ ASSERT_EQ(READ_ONCE(skel->bss->err), -EDEADLK, "timeout err");
+ ASSERT_OK(err, "err");
+ ASSERT_EQ(topts.retval, -EDEADLK, "timeout");
+end:
+ res_spin_lock__destroy(skel);
+ return;
+}
+
+void test_res_spin_lock(void)
+{
+ if (test__start_subtest("res_spin_lock_success"))
+ test_res_spin_lock_success();
+ if (test__start_subtest("res_spin_lock_failure"))
+ test_res_spin_lock_failure();
+}
diff --git a/tools/testing/selftests/bpf/progs/irq.c b/tools/testing/selftests/bpf/progs/irq.c
index b0b53d980964..3d4fee83a5be 100644
--- a/tools/testing/selftests/bpf/progs/irq.c
+++ b/tools/testing/selftests/bpf/progs/irq.c
@@ -11,6 +11,9 @@ extern void bpf_local_irq_save(unsigned long *) __weak __ksym;
extern void bpf_local_irq_restore(unsigned long *) __weak __ksym;
extern int bpf_copy_from_user_str(void *dst, u32 dst__sz, const void *unsafe_ptr__ign, u64 flags) __weak __ksym;
+struct bpf_res_spin_lock lockA __hidden SEC(".data.A");
+struct bpf_res_spin_lock lockB __hidden SEC(".data.B");
+
SEC("?tc")
__failure __msg("arg#0 doesn't point to an irq flag on stack")
int irq_save_bad_arg(struct __sk_buff *ctx)
@@ -441,4 +444,54 @@ int irq_ooo_refs_array(struct __sk_buff *ctx)
return 0;
}
+SEC("?tc")
+__failure __msg("cannot restore irq state out of order")
+int irq_ooo_lock_cond_inv(struct __sk_buff *ctx)
+{
+ unsigned long flags1, flags2;
+
+ if (bpf_res_spin_lock_irqsave(&lockA, &flags1))
+ return 0;
+ if (bpf_res_spin_lock_irqsave(&lockB, &flags2)) {
+ bpf_res_spin_unlock_irqrestore(&lockA, &flags1);
+ return 0;
+ }
+
+ bpf_res_spin_unlock_irqrestore(&lockB, &flags1);
+ bpf_res_spin_unlock_irqrestore(&lockA, &flags2);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("function calls are not allowed")
+int irq_wrong_kfunc_class_1(struct __sk_buff *ctx)
+{
+ unsigned long flags1;
+
+ if (bpf_res_spin_lock_irqsave(&lockA, &flags1))
+ return 0;
+ /* For now, bpf_local_irq_restore is not allowed in critical section,
+ * but this test ensures error will be caught with kfunc_class when it's
+ * opened up. Tested by temporarily permitting this kfunc in critical
+ * section.
+ */
+ bpf_local_irq_restore(&flags1);
+ bpf_res_spin_unlock_irqrestore(&lockA, &flags1);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("function calls are not allowed")
+int irq_wrong_kfunc_class_2(struct __sk_buff *ctx)
+{
+ unsigned long flags1, flags2;
+
+ bpf_local_irq_save(&flags1);
+ if (bpf_res_spin_lock_irqsave(&lockA, &flags2))
+ return 0;
+ bpf_local_irq_restore(&flags2);
+ bpf_res_spin_unlock_irqrestore(&lockA, &flags1);
+ return 0;
+}
+
char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/res_spin_lock.c b/tools/testing/selftests/bpf/progs/res_spin_lock.c
new file mode 100644
index 000000000000..f68aa2ccccc2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/res_spin_lock.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#define EDEADLK 35
+#define ETIMEDOUT 110
+
+struct arr_elem {
+ struct bpf_res_spin_lock lock;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 64);
+ __type(key, int);
+ __type(value, struct arr_elem);
+} arrmap SEC(".maps");
+
+struct bpf_res_spin_lock lockA __hidden SEC(".data.A");
+struct bpf_res_spin_lock lockB __hidden SEC(".data.B");
+
+SEC("tc")
+int res_spin_lock_test(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem1, *elem2;
+ int r;
+
+ elem1 = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem1)
+ return -1;
+ elem2 = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem2)
+ return -1;
+
+ r = bpf_res_spin_lock(&elem1->lock);
+ if (r)
+ return r;
+ if (!bpf_res_spin_lock(&elem2->lock)) {
+ bpf_res_spin_unlock(&elem2->lock);
+ bpf_res_spin_unlock(&elem1->lock);
+ return -1;
+ }
+ bpf_res_spin_unlock(&elem1->lock);
+ return 0;
+}
+
+SEC("tc")
+int res_spin_lock_test_AB(struct __sk_buff *ctx)
+{
+ int r;
+
+ r = bpf_res_spin_lock(&lockA);
+ if (r)
+ return !r;
+ /* Only unlock if we took the lock. */
+ if (!bpf_res_spin_lock(&lockB))
+ bpf_res_spin_unlock(&lockB);
+ bpf_res_spin_unlock(&lockA);
+ return 0;
+}
+
+int err;
+
+SEC("tc")
+int res_spin_lock_test_BA(struct __sk_buff *ctx)
+{
+ int r;
+
+ r = bpf_res_spin_lock(&lockB);
+ if (r)
+ return !r;
+ if (!bpf_res_spin_lock(&lockA))
+ bpf_res_spin_unlock(&lockA);
+ else
+ err = -EDEADLK;
+ bpf_res_spin_unlock(&lockB);
+ return err ?: 0;
+}
+
+SEC("tc")
+int res_spin_lock_test_held_lock_max(struct __sk_buff *ctx)
+{
+ struct bpf_res_spin_lock *locks[48] = {};
+ struct arr_elem *e;
+ u64 time_beg, time;
+ int ret = 0, i;
+
+ _Static_assert(ARRAY_SIZE(((struct rqspinlock_held){}).locks) == 32,
+ "RES_NR_HELD assumed to be 32");
+
+ for (i = 0; i < 34; i++) {
+ int key = i;
+
+ /* We cannot pass in i as it will get spilled/filled by the compiler and
+ * loses bounds in verifier state.
+ */
+ e = bpf_map_lookup_elem(&arrmap, &key);
+ if (!e)
+ return 1;
+ locks[i] = &e->lock;
+ }
+
+ for (; i < 48; i++) {
+ int key = i - 2;
+
+ /* We cannot pass in i as it will get spilled/filled by the compiler and
+ * loses bounds in verifier state.
+ */
+ e = bpf_map_lookup_elem(&arrmap, &key);
+ if (!e)
+ return 1;
+ locks[i] = &e->lock;
+ }
+
+ time_beg = bpf_ktime_get_ns();
+ for (i = 0; i < 34; i++) {
+ if (bpf_res_spin_lock(locks[i]))
+ goto end;
+ }
+
+ /* Trigger AA, after exhausting entries in the held lock table. This
+ * time, only the timeout can save us, as AA detection won't succeed.
+ */
+ if (!bpf_res_spin_lock(locks[34])) {
+ bpf_res_spin_unlock(locks[34]);
+ ret = 1;
+ goto end;
+ }
+
+end:
+ for (i = i - 1; i >= 0; i--)
+ bpf_res_spin_unlock(locks[i]);
+ time = bpf_ktime_get_ns() - time_beg;
+ /* Time spent should be easily above our limit (1/2 s), since AA
+ * detection won't be expedited due to lack of held lock entry.
+ */
+ return ret ?: (time > 1000000000 / 2 ? 0 : 1);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/res_spin_lock_fail.c b/tools/testing/selftests/bpf/progs/res_spin_lock_fail.c
new file mode 100644
index 000000000000..3222e9283c78
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/res_spin_lock_fail.c
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+struct arr_elem {
+ struct bpf_res_spin_lock lock;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, int);
+ __type(value, struct arr_elem);
+} arrmap SEC(".maps");
+
+long value;
+
+struct bpf_spin_lock lock __hidden SEC(".data.A");
+struct bpf_res_spin_lock res_lock __hidden SEC(".data.B");
+
+SEC("?tc")
+__failure __msg("point to map value or allocated object")
+int res_spin_lock_arg(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ bpf_res_spin_lock((struct bpf_res_spin_lock *)bpf_core_cast(&elem->lock, struct __sk_buff));
+ bpf_res_spin_lock(&elem->lock);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("AA deadlock detected")
+int res_spin_lock_AA(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ bpf_res_spin_lock(&elem->lock);
+ bpf_res_spin_lock(&elem->lock);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("AA deadlock detected")
+int res_spin_lock_cond_AA(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ if (bpf_res_spin_lock(&elem->lock))
+ return 0;
+ bpf_res_spin_lock(&elem->lock);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("unlock of different lock")
+int res_spin_lock_mismatch_1(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ if (bpf_res_spin_lock(&elem->lock))
+ return 0;
+ bpf_res_spin_unlock(&res_lock);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("unlock of different lock")
+int res_spin_lock_mismatch_2(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ if (bpf_res_spin_lock(&res_lock))
+ return 0;
+ bpf_res_spin_unlock(&elem->lock);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("unlock of different lock")
+int res_spin_lock_irq_mismatch_1(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+ unsigned long f1;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ bpf_local_irq_save(&f1);
+ if (bpf_res_spin_lock(&res_lock))
+ return 0;
+ bpf_res_spin_unlock_irqrestore(&res_lock, &f1);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("unlock of different lock")
+int res_spin_lock_irq_mismatch_2(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+ unsigned long f1;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ if (bpf_res_spin_lock_irqsave(&res_lock, &f1))
+ return 0;
+ bpf_res_spin_unlock(&res_lock);
+ return 0;
+}
+
+SEC("?tc")
+__success
+int res_spin_lock_ooo(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ if (bpf_res_spin_lock(&res_lock))
+ return 0;
+ if (bpf_res_spin_lock(&elem->lock)) {
+ bpf_res_spin_unlock(&res_lock);
+ return 0;
+ }
+ bpf_res_spin_unlock(&elem->lock);
+ bpf_res_spin_unlock(&res_lock);
+ return 0;
+}
+
+SEC("?tc")
+__success
+int res_spin_lock_ooo_irq(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+ unsigned long f1, f2;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ if (bpf_res_spin_lock_irqsave(&res_lock, &f1))
+ return 0;
+ if (bpf_res_spin_lock_irqsave(&elem->lock, &f2)) {
+ bpf_res_spin_unlock_irqrestore(&res_lock, &f1);
+ /* We won't have an unreleased IRQ flag error here. */
+ return 0;
+ }
+ bpf_res_spin_unlock_irqrestore(&elem->lock, &f2);
+ bpf_res_spin_unlock_irqrestore(&res_lock, &f1);
+ return 0;
+}
+
+struct bpf_res_spin_lock lock1 __hidden SEC(".data.OO1");
+struct bpf_res_spin_lock lock2 __hidden SEC(".data.OO2");
+
+SEC("?tc")
+__failure __msg("bpf_res_spin_unlock cannot be out of order")
+int res_spin_lock_ooo_unlock(struct __sk_buff *ctx)
+{
+ if (bpf_res_spin_lock(&lock1))
+ return 0;
+ if (bpf_res_spin_lock(&lock2)) {
+ bpf_res_spin_unlock(&lock1);
+ return 0;
+ }
+ bpf_res_spin_unlock(&lock1);
+ bpf_res_spin_unlock(&lock2);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("off 1 doesn't point to 'struct bpf_res_spin_lock' that is at 0")
+int res_spin_lock_bad_off(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem)
+ return 0;
+ bpf_res_spin_lock((void *)&elem->lock + 1);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("R1 doesn't have constant offset. bpf_res_spin_lock has to be at the constant offset")
+int res_spin_lock_var_off(struct __sk_buff *ctx)
+{
+ struct arr_elem *elem;
+ u64 val = value;
+
+ elem = bpf_map_lookup_elem(&arrmap, &(int){0});
+ if (!elem) {
+ // FIXME: Only inline assembly use in assert macro doesn't emit
+ // BTF definition.
+ bpf_throw(0);
+ return 0;
+ }
+ bpf_assert_range(val, 0, 40);
+ bpf_res_spin_lock((void *)&value + val);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("map 'res_spin.bss' has no valid bpf_res_spin_lock")
+int res_spin_lock_no_lock_map(struct __sk_buff *ctx)
+{
+ bpf_res_spin_lock((void *)&value + 1);
+ return 0;
+}
+
+SEC("?tc")
+__failure __msg("local 'kptr' has no valid bpf_res_spin_lock")
+int res_spin_lock_no_lock_kptr(struct __sk_buff *ctx)
+{
+ struct { int i; } *p = bpf_obj_new(typeof(*p));
+
+ if (!p)
+ return 0;
+ bpf_res_spin_lock((void *)p);
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.43.5
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 22/26] bpf: Introduce rqspinlock kfuncs
2025-02-06 10:54 ` [PATCH bpf-next v2 22/26] bpf: Introduce rqspinlock kfuncs Kumar Kartikeya Dwivedi
@ 2025-02-07 13:43 ` kernel test robot
0 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2025-02-07 13:43 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf, linux-kernel
Cc: llvm, oe-kbuild-all, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Hi Kumar,
kernel test robot noticed the following build errors:
[auto build test ERROR on 0abff462d802a352c87b7f5e71b442b09bf9cfff]
url: https://github.com/intel-lab-lkp/linux/commits/Kumar-Kartikeya-Dwivedi/locking-Move-MCS-struct-definition-to-public-header/20250206-190258
base: 0abff462d802a352c87b7f5e71b442b09bf9cfff
patch link: https://lore.kernel.org/r/20250206105435.2159977-23-memxor%40gmail.com
patch subject: [PATCH bpf-next v2 22/26] bpf: Introduce rqspinlock kfuncs
config: x86_64-buildonly-randconfig-004-20250207 (https://download.01.org/0day-ci/archive/20250207/202502072155.DbOeX8Le-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250207/202502072155.DbOeX8Le-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502072155.DbOeX8Le-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from fs/timerfd.c:26:
In file included from include/linux/syscalls.h:94:
In file included from include/trace/syscall.h:7:
In file included from include/linux/trace_events.h:10:
In file included from include/linux/perf_event.h:62:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:33:
In file included from arch/x86/include/asm/rqspinlock.h:27:
>> include/asm-generic/rqspinlock.h:40:12: error: conflicting types for 'resilient_tas_spin_lock'
40 | extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);
| ^
arch/x86/include/asm/rqspinlock.h:17:12: note: previous declaration is here
17 | extern int resilient_tas_spin_lock(struct qspinlock *lock, u64 timeout);
| ^
1 error generated.
--
In file included from fs/splice.c:27:
include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
47 | __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
| ~~~~~~~~~~~ ^ ~~~
include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
49 | NR_ZONE_LRU_BASE + lru, nr_pages);
| ~~~~~~~~~~~~~~~~ ^ ~~~
In file included from fs/splice.c:31:
In file included from include/linux/syscalls.h:94:
In file included from include/trace/syscall.h:7:
In file included from include/linux/trace_events.h:10:
In file included from include/linux/perf_event.h:62:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:33:
In file included from arch/x86/include/asm/rqspinlock.h:27:
>> include/asm-generic/rqspinlock.h:40:12: error: conflicting types for 'resilient_tas_spin_lock'
40 | extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);
| ^
arch/x86/include/asm/rqspinlock.h:17:12: note: previous declaration is here
17 | extern int resilient_tas_spin_lock(struct qspinlock *lock, u64 timeout);
| ^
2 warnings and 1 error generated.
--
In file included from fs/aio.c:20:
In file included from include/linux/syscalls.h:94:
In file included from include/trace/syscall.h:7:
In file included from include/linux/trace_events.h:10:
In file included from include/linux/perf_event.h:62:
In file included from include/linux/security.h:35:
In file included from include/linux/bpf.h:33:
In file included from arch/x86/include/asm/rqspinlock.h:27:
>> include/asm-generic/rqspinlock.h:40:12: error: conflicting types for 'resilient_tas_spin_lock'
40 | extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);
| ^
arch/x86/include/asm/rqspinlock.h:17:12: note: previous declaration is here
17 | extern int resilient_tas_spin_lock(struct qspinlock *lock, u64 timeout);
| ^
In file included from fs/aio.c:29:
include/linux/mman.h:159:9: warning: division by zero is undefined [-Wdivision-by-zero]
159 | _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:137:21: note: expanded from macro '_calc_vm_trans'
137 | : ((x) & (bit1)) / ((bit1) / (bit2))))
| ^ ~~~~~~~~~~~~~~~~~
include/linux/mman.h:160:9: warning: division by zero is undefined [-Wdivision-by-zero]
160 | _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) |
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:137:21: note: expanded from macro '_calc_vm_trans'
137 | : ((x) & (bit1)) / ((bit1) / (bit2))))
| ^ ~~~~~~~~~~~~~~~~~
2 warnings and 1 error generated.
vim +/resilient_tas_spin_lock +40 include/asm-generic/rqspinlock.h
c34e46edef2a89 Kumar Kartikeya Dwivedi 2025-02-06 39
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 @40 extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 41 #ifdef CONFIG_QUEUED_SPINLOCKS
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 42 extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val, u64 timeout);
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 43 #endif
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 44
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS
2025-02-06 10:54 ` [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS Kumar Kartikeya Dwivedi
@ 2025-02-07 14:14 ` kernel test robot
2025-02-07 14:45 ` kernel test robot
2025-02-08 0:43 ` kernel test robot
2 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2025-02-07 14:14 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf, linux-kernel
Cc: oe-kbuild-all, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Hi Kumar,
kernel test robot noticed the following build errors:
[auto build test ERROR on 0abff462d802a352c87b7f5e71b442b09bf9cfff]
url: https://github.com/intel-lab-lkp/linux/commits/Kumar-Kartikeya-Dwivedi/locking-Move-MCS-struct-definition-to-public-header/20250206-190258
base: 0abff462d802a352c87b7f5e71b442b09bf9cfff
patch link: https://lore.kernel.org/r/20250206105435.2159977-19-memxor%40gmail.com
patch subject: [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS
config: i386-randconfig-014-20250207 (https://download.01.org/0day-ci/archive/20250207/202502072249.IXcsG9Tu-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250207/202502072249.IXcsG9Tu-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502072249.IXcsG9Tu-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/x86/include/asm/rqspinlock.h:27,
from kernel/locking/rqspinlock.c:28:
include/asm-generic/rqspinlock.h:33:12: error: conflicting types for 'resilient_tas_spin_lock'; have 'int(rqspinlock_t *, u64)' {aka 'int(struct rqspinlock *, long long unsigned int)'}
33 | extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);
| ^~~~~~~~~~~~~~~~~~~~~~~
arch/x86/include/asm/rqspinlock.h:17:12: note: previous declaration of 'resilient_tas_spin_lock' with type 'int(struct qspinlock *, u64)' {aka 'int(struct qspinlock *, long long unsigned int)'}
17 | extern int resilient_tas_spin_lock(struct qspinlock *lock, u64 timeout);
| ^~~~~~~~~~~~~~~~~~~~~~~
>> kernel/locking/rqspinlock.c:293:16: error: conflicting types for 'resilient_tas_spin_lock'; have 'int(rqspinlock_t *, u64)' {aka 'int(struct rqspinlock *, long long unsigned int)'}
293 | int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
| ^~~~~~~~~~~~~~~~~~~~~~~
arch/x86/include/asm/rqspinlock.h:17:12: note: previous declaration of 'resilient_tas_spin_lock' with type 'int(struct qspinlock *, u64)' {aka 'int(struct qspinlock *, long long unsigned int)'}
17 | extern int resilient_tas_spin_lock(struct qspinlock *lock, u64 timeout);
| ^~~~~~~~~~~~~~~~~~~~~~~
kernel/locking/rqspinlock.c:204:13: warning: 'rqspinlock_report_violation' defined but not used [-Wunused-function]
204 | static void rqspinlock_report_violation(const char *s, void *lock)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
vim +293 kernel/locking/rqspinlock.c
65ba402b78bc5d Kumar Kartikeya Dwivedi 2025-02-06 288
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 289 /*
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 290 * Provide a test-and-set fallback for cases when queued spin lock support is
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 291 * absent from the architecture.
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 292 */
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 @293 int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 294 {
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 295 struct rqspinlock_timeout ts;
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 296 int val, ret = 0;
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 297
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 298 RES_INIT_TIMEOUT(ts, timeout);
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 299 grab_held_lock_entry(lock);
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 300 retry:
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 301 val = atomic_read(&lock->val);
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 302
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 303 if (val || !atomic_try_cmpxchg(&lock->val, &val, 1)) {
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 304 if (RES_CHECK_TIMEOUT(ts, ret, ~0u)) {
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 305 lockevent_inc(rqspinlock_lock_timeout);
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 306 goto out;
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 307 }
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 308 cpu_relax();
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 309 goto retry;
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 310 }
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 311
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 312 return 0;
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 313 out:
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 314 release_held_lock_entry();
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 315 return ret;
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 316 }
7a9d3b27f7bf9c Kumar Kartikeya Dwivedi 2025-02-06 317
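For clarity, the conflict the robot reports above is a prototype mismatch: the x86 override header still spells the declaration with struct qspinlock, while the generic header and the definition in rqspinlock.c use the rqspinlock_t typedef. A minimal sketch of a reconciled prototype (shown only to illustrate the mismatch, not necessarily how the series resolves it):

  /*
   * Sketch only: either drop the duplicate arch-level prototype, or spell
   * both declarations with the shared typedef so they agree.
   */
  extern int resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout);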
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS
2025-02-06 10:54 ` [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS Kumar Kartikeya Dwivedi
2025-02-07 14:14 ` kernel test robot
@ 2025-02-07 14:45 ` kernel test robot
2025-02-08 0:43 ` kernel test robot
2 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2025-02-07 14:45 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf, linux-kernel
Cc: oe-kbuild-all, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Hi Kumar,
kernel test robot noticed the following build errors:
[auto build test ERROR on 0abff462d802a352c87b7f5e71b442b09bf9cfff]
url: https://github.com/intel-lab-lkp/linux/commits/Kumar-Kartikeya-Dwivedi/locking-Move-MCS-struct-definition-to-public-header/20250206-190258
base: 0abff462d802a352c87b7f5e71b442b09bf9cfff
patch link: https://lore.kernel.org/r/20250206105435.2159977-19-memxor%40gmail.com
patch subject: [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS
config: arm-randconfig-001-20250207 (https://download.01.org/0day-ci/archive/20250207/202502072210.Fzbbpkun-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250207/202502072210.Fzbbpkun-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502072210.Fzbbpkun-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
In file included from kernel/locking/rqspinlock.c:77:
>> kernel/locking/mcs_spinlock.h:57:27: warning: 'struct mcs_spinlock' declared inside parameter list will not be visible outside of this definition or declaration
57 | void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
| ^~~~~~~~~~~~
kernel/locking/mcs_spinlock.h: In function 'mcs_spin_lock':
>> kernel/locking/mcs_spinlock.h:62:13: error: invalid use of undefined type 'struct mcs_spinlock'
62 | node->locked = 0;
| ^~
kernel/locking/mcs_spinlock.h:63:13: error: invalid use of undefined type 'struct mcs_spinlock'
63 | node->next = NULL;
| ^~
In file included from <command-line>:
kernel/locking/mcs_spinlock.h:83:24: error: invalid use of undefined type 'struct mcs_spinlock'
83 | WRITE_ONCE(prev->next, node);
| ^~
include/linux/compiler_types.h:522:23: note: in definition of macro '__compiletime_assert'
522 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
542 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:60:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
60 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/locking/mcs_spinlock.h:83:9: note: in expansion of macro 'WRITE_ONCE'
83 | WRITE_ONCE(prev->next, node);
| ^~~~~~~~~~
vim +62 kernel/locking/mcs_spinlock.h
e207552e64ea05 include/linux/mcs_spinlock.h Will Deacon 2014-01-21 39
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 40 /*
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 41 * Note: the smp_load_acquire/smp_store_release pair is not
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 42 * sufficient to form a full memory barrier across
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 43 * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 44 * For applications that need a full barrier across multiple cpus
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 45 * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 46 * used after mcs_lock.
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 47 */
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 48
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 49 /*
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 50 * In order to acquire the lock, the caller should declare a local node and
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 51 * pass a reference of the node to this function in addition to the lock.
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 52 * If the lock has already been acquired, then this will proceed to spin
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 53 * on this node->locked until the previous lock holder sets the node->locked
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 54 * in mcs_spin_unlock().
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 55 */
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 56 static inline
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 @57 void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 58 {
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 59 struct mcs_spinlock *prev;
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 60
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 61 /* Init node */
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 @62 node->locked = 0;
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 63 node->next = NULL;
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 64
920c720aa5aa39 kernel/locking/mcs_spinlock.h Peter Zijlstra 2016-02-01 65 /*
920c720aa5aa39 kernel/locking/mcs_spinlock.h Peter Zijlstra 2016-02-01 66 * We rely on the full barrier with global transitivity implied by the
920c720aa5aa39 kernel/locking/mcs_spinlock.h Peter Zijlstra 2016-02-01 67 * below xchg() to order the initialization stores above against any
920c720aa5aa39 kernel/locking/mcs_spinlock.h Peter Zijlstra 2016-02-01 68 * observation of @node. And to provide the ACQUIRE ordering associated
920c720aa5aa39 kernel/locking/mcs_spinlock.h Peter Zijlstra 2016-02-01 69 * with a LOCK primitive.
920c720aa5aa39 kernel/locking/mcs_spinlock.h Peter Zijlstra 2016-02-01 70 */
920c720aa5aa39 kernel/locking/mcs_spinlock.h Peter Zijlstra 2016-02-01 71 prev = xchg(lock, node);
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 72 if (likely(prev == NULL)) {
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 73 /*
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 74 * Lock acquired, don't need to set node->locked to 1. Threads
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 75 * only spin on its own node->locked value for lock acquisition.
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 76 * However, since this thread can immediately acquire the lock
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 77 * and does not proceed to spin on its own node->locked, this
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 78 * value won't be used. If a debug mode is needed to
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 79 * audit lock status, then set node->locked value here.
5faeb8adb956a5 include/linux/mcs_spinlock.h Jason Low 2014-01-21 80 */
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 81 return;
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 82 }
4d3199e4ca8e66 kernel/locking/mcs_spinlock.h Davidlohr Bueso 2015-02-22 83 WRITE_ONCE(prev->next, node);
e207552e64ea05 include/linux/mcs_spinlock.h Will Deacon 2014-01-21 84
e207552e64ea05 include/linux/mcs_spinlock.h Will Deacon 2014-01-21 85 /* Wait until the lock holder passes the lock down. */
e207552e64ea05 include/linux/mcs_spinlock.h Will Deacon 2014-01-21 86 arch_mcs_spin_lock_contended(&node->locked);
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 87 }
e72246748ff006 include/linux/mcs_spinlock.h Tim Chen 2014-01-21 88
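All of the errors above stem from struct mcs_spinlock not being visible when mcs_spinlock.h is pulled into rqspinlock.c on this config. For reference, the node structure the header expects in scope is the three-field MCS node, reproduced here from the mainline definition that the series' first patch relocates to a public header:

  struct mcs_spinlock {
  	struct mcs_spinlock *next;
  	int locked;	/* 1 if lock acquired */
  	int count;	/* nesting count, see qspinlock.c */
  };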
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 02/26] locking: Move common qspinlock helpers to a private header
2025-02-06 10:54 ` [PATCH bpf-next v2 02/26] locking: Move common qspinlock helpers to a private header Kumar Kartikeya Dwivedi
@ 2025-02-07 23:21 ` kernel test robot
0 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2025-02-07 23:21 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf, linux-kernel
Cc: oe-kbuild-all, Barret Rhoden, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
Hi Kumar,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 0abff462d802a352c87b7f5e71b442b09bf9cfff]
url: https://github.com/intel-lab-lkp/linux/commits/Kumar-Kartikeya-Dwivedi/locking-Move-MCS-struct-definition-to-public-header/20250206-190258
base: 0abff462d802a352c87b7f5e71b442b09bf9cfff
patch link: https://lore.kernel.org/r/20250206105435.2159977-3-memxor%40gmail.com
patch subject: [PATCH bpf-next v2 02/26] locking: Move common qspinlock helpers to a private header
config: x86_64-randconfig-121-20250207 (https://download.01.org/0day-ci/archive/20250208/202502080738.raao5j60-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250208/202502080738.raao5j60-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502080738.raao5j60-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> kernel/locking/qspinlock.c:285:41: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct qnode *qnodes @@ got struct qnode [noderef] __percpu * @@
kernel/locking/qspinlock.c:285:41: sparse: expected struct qnode *qnodes
kernel/locking/qspinlock.c:285:41: sparse: got struct qnode [noderef] __percpu *
kernel/locking/qspinlock.c: note: in included file:
>> kernel/locking/qspinlock.h:67:16: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct mcs_spinlock * @@
kernel/locking/qspinlock.h:67:16: sparse: expected void const [noderef] __percpu *__vpp_verify
kernel/locking/qspinlock.h:67:16: sparse: got struct mcs_spinlock *
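To decode the sparse output: decode_tail() now takes a plain struct qnode pointer, but the caller hands it the bare per-cpu qnodes symbol, which sparse types with the __percpu address-space tag; the qspinlock.h warning is the inverse case of a plain pointer reaching a per-cpu accessor. A stripped-down illustration of the first mismatch (signatures paraphrased from the diagnostics above, not the actual helper bodies):

  static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);

  static inline struct mcs_spinlock *decode_tail(u32 tail, struct qnode *qnodes);

  	/* Passing the bare per-cpu symbol drops its __percpu annotation: */
  	prev = decode_tail(old, qnodes);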
vim +285 kernel/locking/qspinlock.c
108
109 /**
110 * queued_spin_lock_slowpath - acquire the queued spinlock
111 * @lock: Pointer to queued spinlock structure
112 * @val: Current value of the queued spinlock 32-bit word
113 *
114 * (queue tail, pending bit, lock value)
115 *
116 * fast : slow : unlock
117 * : :
118 * uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
119 * : | ^--------.------. / :
120 * : v \ \ | :
121 * pending : (0,1,1) +--> (0,1,0) \ | :
122 * : | ^--' | | :
123 * : v | | :
124 * uncontended : (n,x,y) +--> (n,0,0) --' | :
125 * queue : | ^--' | :
126 * : v | :
127 * contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
128 * queue : ^--' :
129 */
130 void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
131 {
132 struct mcs_spinlock *prev, *next, *node;
133 u32 old, tail;
134 int idx;
135
136 BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
137
138 if (pv_enabled())
139 goto pv_queue;
140
141 if (virt_spin_lock(lock))
142 return;
143
144 /*
145 * Wait for in-progress pending->locked hand-overs with a bounded
146 * number of spins so that we guarantee forward progress.
147 *
148 * 0,1,0 -> 0,0,1
149 */
150 if (val == _Q_PENDING_VAL) {
151 int cnt = _Q_PENDING_LOOPS;
152 val = atomic_cond_read_relaxed(&lock->val,
153 (VAL != _Q_PENDING_VAL) || !cnt--);
154 }
155
156 /*
157 * If we observe any contention; queue.
158 */
159 if (val & ~_Q_LOCKED_MASK)
160 goto queue;
161
162 /*
163 * trylock || pending
164 *
165 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
166 */
167 val = queued_fetch_set_pending_acquire(lock);
168
169 /*
170 * If we observe contention, there is a concurrent locker.
171 *
172 * Undo and queue; our setting of PENDING might have made the
173 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
174 * on @next to become !NULL.
175 */
176 if (unlikely(val & ~_Q_LOCKED_MASK)) {
177
178 /* Undo PENDING if we set it. */
179 if (!(val & _Q_PENDING_MASK))
180 clear_pending(lock);
181
182 goto queue;
183 }
184
185 /*
186 * We're pending, wait for the owner to go away.
187 *
188 * 0,1,1 -> *,1,0
189 *
190 * this wait loop must be a load-acquire such that we match the
191 * store-release that clears the locked bit and create lock
192 * sequentiality; this is because not all
193 * clear_pending_set_locked() implementations imply full
194 * barriers.
195 */
196 if (val & _Q_LOCKED_MASK)
197 smp_cond_load_acquire(&lock->locked, !VAL);
198
199 /*
200 * take ownership and clear the pending bit.
201 *
202 * 0,1,0 -> 0,0,1
203 */
204 clear_pending_set_locked(lock);
205 lockevent_inc(lock_pending);
206 return;
207
208 /*
209 * End of pending bit optimistic spinning and beginning of MCS
210 * queuing.
211 */
212 queue:
213 lockevent_inc(lock_slowpath);
214 pv_queue:
215 node = this_cpu_ptr(&qnodes[0].mcs);
216 idx = node->count++;
217 tail = encode_tail(smp_processor_id(), idx);
218
219 trace_contention_begin(lock, LCB_F_SPIN);
220
221 /*
222 * 4 nodes are allocated based on the assumption that there will
223 * not be nested NMIs taking spinlocks. That may not be true in
224 * some architectures even though the chance of needing more than
225 * 4 nodes will still be extremely unlikely. When that happens,
226 * we fall back to spinning on the lock directly without using
227 * any MCS node. This is not the most elegant solution, but is
228 * simple enough.
229 */
230 if (unlikely(idx >= _Q_MAX_NODES)) {
231 lockevent_inc(lock_no_node);
232 while (!queued_spin_trylock(lock))
233 cpu_relax();
234 goto release;
235 }
236
237 node = grab_mcs_node(node, idx);
238
239 /*
240 * Keep counts of non-zero index values:
241 */
242 lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
243
244 /*
245 * Ensure that we increment the head node->count before initialising
246 * the actual node. If the compiler is kind enough to reorder these
247 * stores, then an IRQ could overwrite our assignments.
248 */
249 barrier();
250
251 node->locked = 0;
252 node->next = NULL;
253 pv_init_node(node);
254
255 /*
256 * We touched a (possibly) cold cacheline in the per-cpu queue node;
257 * attempt the trylock once more in the hope someone let go while we
258 * weren't watching.
259 */
260 if (queued_spin_trylock(lock))
261 goto release;
262
263 /*
264 * Ensure that the initialisation of @node is complete before we
265 * publish the updated tail via xchg_tail() and potentially link
266 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
267 */
268 smp_wmb();
269
270 /*
271 * Publish the updated tail.
272 * We have already touched the queueing cacheline; don't bother with
273 * pending stuff.
274 *
275 * p,*,* -> n,*,*
276 */
277 old = xchg_tail(lock, tail);
278 next = NULL;
279
280 /*
281 * if there was a previous node; link it and wait until reaching the
282 * head of the waitqueue.
283 */
284 if (old & _Q_TAIL_MASK) {
> 285 prev = decode_tail(old, qnodes);
286
287 /* Link @node into the waitqueue. */
288 WRITE_ONCE(prev->next, node);
289
290 pv_wait_node(node, prev);
291 arch_mcs_spin_lock_contended(&node->locked);
292
293 /*
294 * While waiting for the MCS lock, the next pointer may have
295 * been set by another lock waiter. We optimistically load
296 * the next pointer & prefetch the cacheline for writing
297 * to reduce latency in the upcoming MCS unlock operation.
298 */
299 next = READ_ONCE(node->next);
300 if (next)
301 prefetchw(next);
302 }
303
304 /*
305 * we're at the head of the waitqueue, wait for the owner & pending to
306 * go away.
307 *
308 * *,x,y -> *,0,0
309 *
310 * this wait loop must use a load-acquire such that we match the
311 * store-release that clears the locked bit and create lock
312 * sequentiality; this is because the set_locked() function below
313 * does not imply a full barrier.
314 *
315 * The PV pv_wait_head_or_lock function, if active, will acquire
316 * the lock and return a non-zero value. So we have to skip the
317 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
318 * been designated yet, there is no way for the locked value to become
319 * _Q_SLOW_VAL. So both the set_locked() and the
320 * atomic_cmpxchg_relaxed() calls will be safe.
321 *
322 * If PV isn't active, 0 will be returned instead.
323 *
324 */
325 if ((val = pv_wait_head_or_lock(lock, node)))
326 goto locked;
327
328 val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
329
330 locked:
331 /*
332 * claim the lock:
333 *
334 * n,0,0 -> 0,0,1 : lock, uncontended
335 * *,*,0 -> *,*,1 : lock, contended
336 *
337 * If the queue head is the only one in the queue (lock value == tail)
338 * and nobody is pending, clear the tail code and grab the lock.
339 * Otherwise, we only need to grab the lock.
340 */
341
342 /*
343 * In the PV case we might already have _Q_LOCKED_VAL set, because
344 * of lock stealing; therefore we must also allow:
345 *
346 * n,0,1 -> 0,0,1
347 *
348 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
349 * above wait condition, therefore any concurrent setting of
350 * PENDING will make the uncontended transition fail.
351 */
352 if ((val & _Q_TAIL_MASK) == tail) {
353 if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
354 goto release; /* No contention */
355 }
356
357 /*
358 * Either somebody is queued behind us or _Q_PENDING_VAL got set
359 * which will then detect the remaining tail and queue behind us
360 * ensuring we'll see a @next.
361 */
362 set_locked(lock);
363
364 /*
365 * contended path; wait for next if not observed yet, release.
366 */
367 if (!next)
368 next = smp_cond_load_relaxed(&node->next, (VAL));
369
370 arch_mcs_spin_unlock_contended(&next->locked);
371 pv_kick_node(lock, next);
372
373 release:
374 trace_contention_end(lock, 0);
375
376 /*
377 * release the node
378 */
379 __this_cpu_dec(qnodes[0].mcs.count);
380 }
381 EXPORT_SYMBOL(queued_spin_lock_slowpath);
382
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS
2025-02-06 10:54 ` [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS Kumar Kartikeya Dwivedi
2025-02-07 14:14 ` kernel test robot
2025-02-07 14:45 ` kernel test robot
@ 2025-02-08 0:43 ` kernel test robot
2 siblings, 0 replies; 67+ messages in thread
From: kernel test robot @ 2025-02-08 0:43 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf, linux-kernel
Cc: oe-kbuild-all, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
Hi Kumar,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 0abff462d802a352c87b7f5e71b442b09bf9cfff]
url: https://github.com/intel-lab-lkp/linux/commits/Kumar-Kartikeya-Dwivedi/locking-Move-MCS-struct-definition-to-public-header/20250206-190258
base: 0abff462d802a352c87b7f5e71b442b09bf9cfff
patch link: https://lore.kernel.org/r/20250206105435.2159977-19-memxor%40gmail.com
patch subject: [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS
config: x86_64-randconfig-121-20250207 (https://download.01.org/0day-ci/archive/20250208/202502080835.XRxxo7P5-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250208/202502080835.XRxxo7P5-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502080835.XRxxo7P5-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> kernel/locking/rqspinlock.c:101:39: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct rqspinlock_held * @@
kernel/locking/rqspinlock.c:101:39: sparse: expected void const [noderef] __percpu *__vpp_verify
kernel/locking/rqspinlock.c:101:39: sparse: got struct rqspinlock_held *
kernel/locking/rqspinlock.c:123:39: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct rqspinlock_held * @@
kernel/locking/rqspinlock.c:123:39: sparse: expected void const [noderef] __percpu *__vpp_verify
kernel/locking/rqspinlock.c:123:39: sparse: got struct rqspinlock_held *
kernel/locking/rqspinlock.c:136:51: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct rqspinlock_held * @@
kernel/locking/rqspinlock.c:136:51: sparse: expected void const [noderef] __percpu *__vpp_verify
kernel/locking/rqspinlock.c:136:51: sparse: got struct rqspinlock_held *
kernel/locking/rqspinlock.c:206:39: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct rqspinlock_held * @@
kernel/locking/rqspinlock.c:206:39: sparse: expected void const [noderef] __percpu *__vpp_verify
kernel/locking/rqspinlock.c:206:39: sparse: got struct rqspinlock_held *
>> kernel/locking/rqspinlock.c:572:41: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct qnode *qnodes @@ got struct qnode [noderef] __percpu * @@
kernel/locking/rqspinlock.c:572:41: sparse: expected struct qnode *qnodes
kernel/locking/rqspinlock.c:572:41: sparse: got struct qnode [noderef] __percpu *
kernel/locking/rqspinlock.c: note: in included file:
kernel/locking/qspinlock.h:67:16: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected void const [noderef] __percpu *__vpp_verify @@ got struct mcs_spinlock * @@
kernel/locking/qspinlock.h:67:16: sparse: expected void const [noderef] __percpu *__vpp_verify
kernel/locking/qspinlock.h:67:16: sparse: got struct mcs_spinlock *
vim +101 kernel/locking/rqspinlock.c
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 97
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 98 static noinline int check_deadlock_AA(rqspinlock_t *lock, u32 mask,
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 99 struct rqspinlock_timeout *ts)
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 100 {
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 @101 struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 102 int cnt = min(RES_NR_HELD, rqh->cnt);
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 103
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 104 /*
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 105 * Return an error if we hold the lock we are attempting to acquire.
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 106 * We'll iterate over max 32 locks; no need to do is_lock_released.
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 107 */
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 108 for (int i = 0; i < cnt - 1; i++) {
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 109 if (rqh->locks[i] == lock)
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 110 return -EDEADLK;
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 111 }
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 112 return 0;
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 113 }
6516ce00a1482f Kumar Kartikeya Dwivedi 2025-02-06 114
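For comparison with the AA check above, a rough sketch of the ABBA search the cover letter describes: walk every other CPU's held-locks table and look for the crossing condition (we hold something the remote CPU wants, and it holds what we want). This is an illustration built from the same helpers as check_deadlock_AA(), not the code from patch 11:

  static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask,
  					  struct rqspinlock_timeout *ts)
  {
  	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
  	int our_cnt = min(RES_NR_HELD, rqh->cnt);
  	int cpu;

  	for_each_possible_cpu(cpu) {
  		struct rqspinlock_held *rqh_cpu = per_cpu_ptr(&rqspinlock_held_locks, cpu);
  		int cnt = min(RES_NR_HELD, READ_ONCE(rqh_cpu->cnt));
  		void *remote_lock;

  		if (cpu == smp_processor_id() || cnt < 2)
  			continue;

  		/*
  		 * The remote CPU's topmost entry is the lock it is attempting
  		 * to acquire; report ABBA if that lock is one we already hold
  		 * while the remote CPU holds the lock we are waiting for.
  		 */
  		remote_lock = READ_ONCE(rqh_cpu->locks[cnt - 1]);
  		for (int i = 0; i < our_cnt - 1; i++) {
  			if (rqh->locks[i] != remote_lock)
  				continue;
  			for (int j = 0; j < cnt - 1; j++) {
  				if (READ_ONCE(rqh_cpu->locks[j]) == lock)
  					return -EDEADLK;
  			}
  		}
  	}
  	return 0;
  }

(mask and ts only mirror the AA helper's signature here; the comment in check_deadlock_AA() hints that the real ABBA path additionally checks is_lock_released() while scanning.)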
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery
2025-02-06 10:54 ` [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery Kumar Kartikeya Dwivedi
@ 2025-02-08 1:53 ` Alexei Starovoitov
2025-02-08 3:03 ` Kumar Kartikeya Dwivedi
2025-02-10 10:21 ` Peter Zijlstra
2025-02-10 10:36 ` Peter Zijlstra
2 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-08 1:53 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, LKML, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Thu, Feb 6, 2025 at 2:54 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> +/*
> + * It is possible to run into misdetection scenarios of AA deadlocks on the same
> + * CPU, and missed ABBA deadlocks on remote CPUs when this function pops entries
> + * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
> + * logic to preserve right entries in the table would be to walk the array of
> + * held locks and swap and clear out-of-order entries, but that's too
> + * complicated and we don't have a compelling use case for out of order unlocking.
> + *
> + * Therefore, we simply don't support such cases and keep the logic simple here.
> + */
The comment looks obsolete from the old version of this patch.
Patch 25 now enforces the FIFO order in the verifier
and code review will do the same for use of res_spin_lock()
in bpf internals. So pls drop the comment or reword.
> +static __always_inline void release_held_lock_entry(void)
> +{
> + struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> +
> + if (unlikely(rqh->cnt > RES_NR_HELD))
> + goto dec;
> + WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
> +dec:
> + this_cpu_dec(rqspinlock_held_locks.cnt);
..
> + * We don't have a problem if the dec and WRITE_ONCE above get reordered
> + * with each other, we either notice an empty NULL entry on top (if dec
> + * succeeds WRITE_ONCE), or a potentially stale entry which cannot be
> + * observed (if dec precedes WRITE_ONCE).
> + */
> + smp_wmb();
since smp_wmb() is needed to address ordering weakness vs try_cmpxchg_acquire()
would it make sense to move it before this_cpu_dec() to address
the 2nd part of the harmless race as well?
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
2025-02-06 10:54 ` [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation Kumar Kartikeya Dwivedi
@ 2025-02-08 1:58 ` Alexei Starovoitov
2025-02-08 3:04 ` Kumar Kartikeya Dwivedi
2025-02-10 9:53 ` Peter Zijlstra
1 sibling, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-08 1:58 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, LKML, Ankur Arora, Linus Torvalds, Peter Zijlstra,
Will Deacon, Waiman Long, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Paul E. McKenney, Tejun Heo, Barret Rhoden, Josh Don, Dohyun Kim,
linux-arm-kernel, Kernel Team
On Thu, Feb 6, 2025 at 2:55 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> Currently, for rqspinlock usage, the implementation of
> smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
> susceptible to stalls on arm64, because they do not guarantee that the
> conditional expression will be repeatedly invoked if the address being
> loaded from is not written to by other CPUs. When support for
> event-streams is absent (which unblocks stuck WFE-based loops every
> ~100us), we may end up being stuck forever.
>
> This causes a problem for us, as we need to repeatedly invoke the
> RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
> expires.
>
> Hardcode the implementation to the asm-generic version in rqspinlock.c
> until support for smp_cond_load_acquire_timewait [0] lands upstream.
>
> [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
>
> Cc: Ankur Arora <ankur.a.arora@oracle.com>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> kernel/locking/rqspinlock.c | 41 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
> index 49b4f3c75a3e..b4cceeecf29c 100644
> --- a/kernel/locking/rqspinlock.c
> +++ b/kernel/locking/rqspinlock.c
> @@ -325,6 +325,41 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
> */
> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
>
> +/*
> + * Hardcode smp_cond_load_acquire and atomic_cond_read_acquire implementations
> + * to the asm-generic implementation. In rqspinlock code, our conditional
> + * expression involves checking the value _and_ additionally a timeout. However,
> + * on arm64, the WFE-based implementation may never spin again if no stores
> + * occur to the locked byte in the lock word. As such, we may be stuck forever
> + * if event-stream based unblocking is not available on the platform for WFE
> + * spin loops (arch_timer_evtstrm_available).
> + *
> + * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
> + * workaround.
> + *
> + * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
> + */
It's fine as a workaround for now to avoid being blocked
on Ankur's set (which will go via a different tree too),
but in v3 pls add an extra patch that demonstrates the final result
with the WFE stuff working as designed, without the amortizing
in the RES_CHECK_TIMEOUT() macro.
Guessing RES_CHECK_TIMEOUT will have some ifdef to handle that case?
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 19/26] bpf: Convert hashtab.c to rqspinlock
2025-02-06 10:54 ` [PATCH bpf-next v2 19/26] bpf: Convert hashtab.c to rqspinlock Kumar Kartikeya Dwivedi
@ 2025-02-08 2:01 ` Alexei Starovoitov
2025-02-08 3:06 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-08 2:01 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, LKML, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Thu, Feb 6, 2025 at 2:55 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed
> per-cpu counter crud from the code base which is no longer necessary.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> kernel/bpf/hashtab.c | 102 ++++++++++++++-----------------------------
> 1 file changed, 32 insertions(+), 70 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 4a9eeb7aef85..9b394e147967 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -16,6 +16,7 @@
> #include "bpf_lru_list.h"
> #include "map_in_map.h"
> #include <linux/bpf_mem_alloc.h>
> +#include <asm/rqspinlock.h>
>
> #define HTAB_CREATE_FLAG_MASK \
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> @@ -78,7 +79,7 @@
> */
> struct bucket {
> struct hlist_nulls_head head;
> - raw_spinlock_t raw_lock;
> + rqspinlock_t raw_lock;
Pls add known syzbot reports as 'Closes:' to commit log.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 23/26] bpf: Handle allocation failure in acquire_lock_state
2025-02-06 10:54 ` [PATCH bpf-next v2 23/26] bpf: Handle allocation failure in acquire_lock_state Kumar Kartikeya Dwivedi
@ 2025-02-08 2:04 ` Alexei Starovoitov
0 siblings, 0 replies; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-08 2:04 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, LKML, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Thu, Feb 6, 2025 at 2:55 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> The acquire_lock_state function needs to handle possible NULL values
> returned by acquire_reference_state, and return -ENOMEM.
>
> Fixes: 769b0f1c8214 ("bpf: Refactor {acquire,release}_reference_state")
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> kernel/bpf/verifier.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 9971c03adfd5..d6999d085c7d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1501,6 +1501,8 @@ static int acquire_lock_state(struct bpf_verifier_env *env, int insn_idx, enum r
> struct bpf_reference_state *s;
>
> s = acquire_reference_state(env, insn_idx);
> + if (!s)
> + return -ENOMEM;
I'll grab this fix into the bpf tree.
Next time just send it separately, so the fix is not lost
in the patch bomb.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery
2025-02-08 1:53 ` Alexei Starovoitov
@ 2025-02-08 3:03 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-08 3:03 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, LKML, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Sat, 8 Feb 2025 at 02:54, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 6, 2025 at 2:54 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > +/*
> > + * It is possible to run into misdetection scenarios of AA deadlocks on the same
> > + * CPU, and missed ABBA deadlocks on remote CPUs when this function pops entries
> > + * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
> > + * logic to preserve right entries in the table would be to walk the array of
> > + * held locks and swap and clear out-of-order entries, but that's too
> > + * complicated and we don't have a compelling use case for out of order unlocking.
> > + *
> > + * Therefore, we simply don't support such cases and keep the logic simple here.
> > + */
>
> The comment looks obsolete from the old version of this patch.
> Patch 25 now enforces the FIFO order in the verifier
> and code review will do the same for use of res_spin_lock()
> in bpf internals. So pls drop the comment or reword.
>
Ok.
> > +static __always_inline void release_held_lock_entry(void)
> > +{
> > + struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
> > +
> > + if (unlikely(rqh->cnt > RES_NR_HELD))
> > + goto dec;
> > + WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
> > +dec:
> > + this_cpu_dec(rqspinlock_held_locks.cnt);
>
> ..
> > + * We don't have a problem if the dec and WRITE_ONCE above get reordered
> > + * with each other, we either notice an empty NULL entry on top (if dec
> > + * succeeds WRITE_ONCE), or a potentially stale entry which cannot be
> > + * observed (if dec precedes WRITE_ONCE).
> > + */
> > + smp_wmb();
>
> since smp_wmb() is needed to address ordering weakness vs try_cmpxchg_acquire()
> would it make sense to move it before this_cpu_dec() to address
> the 2nd part of the harmless race as well?
So you mean that even if the dec gets reordered with the inc, the other
side is bound to notice a NULL entry rather than a stale one, because
we'll ensure the NULL write is always visible before the dec.
Sounds like it should also work; I will think it over for a bit and
probably make this change (and perhaps do likewise on the unlock side).
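To make the proposed reordering concrete, here is a sketch of the shape Alexei is suggesting on top of the helper quoted above: the barrier sits between the NULL write and the decrement, so a remote deadlock check can only ever observe an empty top slot for this entry, never a stale lock pointer. This is an illustration of the suggestion, not the code that will land in v3:

  static __always_inline void release_held_lock_entry(void)
  {
  	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);

  	if (unlikely(rqh->cnt > RES_NR_HELD))
  		goto dec;
  	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
  dec:
  	/* Publish the NULL entry before the decrement can become visible. */
  	smp_wmb();
  	this_cpu_dec(rqspinlock_held_locks.cnt);
  }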
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
2025-02-08 1:58 ` Alexei Starovoitov
@ 2025-02-08 3:04 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-08 3:04 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, LKML, Ankur Arora, Linus Torvalds, Peter Zijlstra,
Will Deacon, Waiman Long, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Paul E. McKenney, Tejun Heo, Barret Rhoden, Josh Don, Dohyun Kim,
linux-arm-kernel, Kernel Team
On Sat, 8 Feb 2025 at 02:58, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 6, 2025 at 2:55 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > Currently, for rqspinlock usage, the implementation of
> > smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
> > susceptible to stalls on arm64, because they do not guarantee that the
> > conditional expression will be repeatedly invoked if the address being
> > loaded from is not written to by other CPUs. When support for
> > event-streams is absent (which unblocks stuck WFE-based loops every
> > ~100us), we may end up being stuck forever.
> >
> > This causes a problem for us, as we need to repeatedly invoke the
> > RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
> > expires.
> >
> > Hardcode the implementation to the asm-generic version in rqspinlock.c
> > until support for smp_cond_load_acquire_timewait [0] lands upstream.
> >
> > [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
> >
> > Cc: Ankur Arora <ankur.a.arora@oracle.com>
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > kernel/locking/rqspinlock.c | 41 ++++++++++++++++++++++++++++++++++---
> > 1 file changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
> > index 49b4f3c75a3e..b4cceeecf29c 100644
> > --- a/kernel/locking/rqspinlock.c
> > +++ b/kernel/locking/rqspinlock.c
> > @@ -325,6 +325,41 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
> > */
> > static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
> >
> > +/*
> > + * Hardcode smp_cond_load_acquire and atomic_cond_read_acquire implementations
> > + * to the asm-generic implementation. In rqspinlock code, our conditional
> > + * expression involves checking the value _and_ additionally a timeout. However,
> > + * on arm64, the WFE-based implementation may never spin again if no stores
> > + * occur to the locked byte in the lock word. As such, we may be stuck forever
> > + * if event-stream based unblocking is not available on the platform for WFE
> > + * spin loops (arch_timer_evtstrm_available).
> > + *
> > + * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
> > + * workaround.
> > + *
> > + * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
> > + */
>
> It's fine as a workaround for now to avoid being blocked
> on Ankur's set (which will go via a different tree too),
> but in v3 pls add an extra patch that demonstrates the final result
> with the WFE stuff working as designed, without the amortizing
> in the RES_CHECK_TIMEOUT() macro.
> Guessing RES_CHECK_TIMEOUT will have some ifdef to handle that case?
Yes, or we can pass in the check_timeout expression directly. I'll
make the change in v3.
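For readers following along, the asm-generic definitions being hardcoded here boil down to a plain READ_ONCE()/cpu_relax() polling loop, which is what guarantees the timeout check folded into cond_expr is re-evaluated on every iteration. Reproduced below from memory of include/asm-generic/barrier.h (lightly trimmed), so treat it as a reference sketch rather than the exact code in the patch:

  #define smp_cond_load_relaxed(ptr, cond_expr) ({	\
  	typeof(ptr) __PTR = (ptr);			\
  	__unqual_scalar_typeof(*ptr) VAL;		\
  	for (;;) {					\
  		VAL = READ_ONCE(*__PTR);		\
  		if (cond_expr)				\
  			break;				\
  		cpu_relax();				\
  	}						\
  	(typeof(*ptr))VAL;				\
  })

  #define smp_cond_load_acquire(ptr, cond_expr) ({	\
  	__unqual_scalar_typeof(*ptr) _val;		\
  	_val = smp_cond_load_relaxed(ptr, cond_expr);	\
  	smp_acquire__after_ctrl_dep();			\
  	(typeof(*ptr))_val;				\
  })

The arm64 override instead parks the CPU in WFE and only re-evaluates cond_expr after a store to *ptr (or an event-stream wakeup), which is exactly the behaviour the commit message above describes as problematic for the embedded timeout check.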
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 19/26] bpf: Convert hashtab.c to rqspinlock
2025-02-08 2:01 ` Alexei Starovoitov
@ 2025-02-08 3:06 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-08 3:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, LKML, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Sat, 8 Feb 2025 at 03:01, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 6, 2025 at 2:55 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > Convert hashtab.c from raw_spinlock to rqspinlock, and drop the hashed
> > per-cpu counter crud from the code base which is no longer necessary.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > kernel/bpf/hashtab.c | 102 ++++++++++++++-----------------------------
> > 1 file changed, 32 insertions(+), 70 deletions(-)
> >
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 4a9eeb7aef85..9b394e147967 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -16,6 +16,7 @@
> > #include "bpf_lru_list.h"
> > #include "map_in_map.h"
> > #include <linux/bpf_mem_alloc.h>
> > +#include <asm/rqspinlock.h>
> >
> > #define HTAB_CREATE_FLAG_MASK \
> > (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
> > @@ -78,7 +79,7 @@
> > */
> > struct bucket {
> > struct hlist_nulls_head head;
> > - raw_spinlock_t raw_lock;
> > + rqspinlock_t raw_lock;
>
> Pls add known syzbot reports as 'Closes:' to commit log.
Ack, I've found [0] and [1]. I will dig for more, see which ones this
applies to, and add them to the commit log.
[0]: https://lore.kernel.org/bpf/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com/
[1]: https://lore.kernel.org/bpf/CAADnVQJVJADKw0KC6GzhSOjA8DJFammARKwVh+TeNAD7U3h91A@mail.gmail.com/
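For readers unfamiliar with what the conversion entails at call sites: the essential change is that acquisition can now fail, so the bucket-lock helpers have to return an error that the map operations propagate. A rough sketch, assuming an irqsave-style wrapper around the res_spin_lock()/res_spin_unlock() API from the cover letter (the wrapper names here are assumptions for illustration, not necessarily the ones in the patch):

  static inline int htab_lock_bucket(struct bucket *b, unsigned long *pflags)
  {
  	unsigned long flags;
  	int ret;

  	/* May return -EDEADLK or -ETIMEDOUT instead of spinning forever. */
  	ret = raw_res_spin_lock_irqsave(&b->raw_lock, flags);
  	if (ret)
  		return ret;
  	*pflags = flags;
  	return 0;
  }

  static inline void htab_unlock_bucket(struct bucket *b, unsigned long flags)
  {
  	raw_res_spin_unlock_irqrestore(&b->raw_lock, flags);
  }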
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (25 preceding siblings ...)
2025-02-06 10:54 ` [PATCH bpf-next v2 26/26] selftests/bpf: Add tests for rqspinlock Kumar Kartikeya Dwivedi
@ 2025-02-10 9:31 ` Peter Zijlstra
2025-02-10 9:38 ` Peter Zijlstra
2025-02-10 9:49 ` Peter Zijlstra
28 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 9:31 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Linus Torvalds, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:
> Additionally, eBPF programs attached to different parts of the kernel
> can introduce new control flow into the kernel, which increases the
> likelihood of deadlocks in code not written to handle reentrancy. There
> have been multiple syzbot reports surfacing deadlocks in internal kernel
> code due to the diverse ways in which eBPF programs can be attached to
> different parts of the kernel. By switching the BPF subsystem’s lock
> usage to rqspinlock, all of these issues can be mitigated at runtime.
Only if the called stuff is using this new lock. IIRC we've had a number
of cases where eBPF was used to tie together 'normal' kernel functions
in a way that wasn't sound. You can't help there.
As an example, eBPF calling strncpy_from_user(), which ends up in fault
injection and badness happens -- this has since been fixed, but still.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (26 preceding siblings ...)
2025-02-10 9:31 ` [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Peter Zijlstra
@ 2025-02-10 9:38 ` Peter Zijlstra
2025-02-10 10:49 ` Peter Zijlstra
2025-02-10 9:49 ` Peter Zijlstra
28 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 9:38 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Linus Torvalds, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:
> Deadlock Detection
> ~~~~~~~~~~~~~~~~~~
> We handle two cases of deadlocks: AA deadlocks (attempts to acquire the
> same lock again), and ABBA deadlocks (attempts to acquire two locks in
> the opposite order from two distinct threads). Variants of ABBA
> deadlocks may be encountered with more than two locks being held in the
> incorrect order. These are not diagnosed explicitly, as they reduce to
> ABBA deadlocks.
>
> Deadlock detection is triggered immediately when beginning the waiting
> loop of a lock slow path.
>
> While timeouts ensure that any waiting loops in the locking slow path
> terminate and return to the caller, it can be excessively long in some
> situations. While the default timeout is short (0.5s), a stall for this
> duration inside the kernel can set off alerts for latency-critical
> services with strict SLOs. Ideally, the kernel should recover from an
> undesired state of the lock as soon as possible.
>
> A multi-step strategy is used to recover the kernel from waiting loops
> in the locking algorithm which may fail to terminate in a bounded amount
> of time.
>
> * Each CPU maintains a table of held locks. Entries are inserted and
> removed upon entry into lock, and exit from unlock, respectively.
> * Deadlock detection for AA locks is thus simple: we have an AA
> deadlock if we find a held lock entry for the lock we’re attempting
> to acquire on the same CPU.
> * During deadlock detection for ABBA, we search through the tables of
> all other CPUs to find situations where we are holding a lock the
> remote CPU is attempting to acquire, and they are holding a lock we
> are attempting to acquire. Upon encountering such a condition, we
> report an ABBA deadlock.
> * We divide the duration between entry time point into the waiting loop
> and the timeout time point into intervals of 1 ms, and perform
> deadlock detection until timeout happens. Upon entry into the slow
> path, and then completion of each 1 ms interval, we perform detection
> of both AA and ABBA deadlocks. In the event that deadlock detection
> yields a positive result, the recovery happens sooner than the
> timeout. Otherwise, it happens as a last resort upon completion of
> the timeout.
>
> Timeouts
> ~~~~~~~~
> Timeouts act as final line of defense against stalls for waiting loops.
> The ‘ktime_get_mono_fast_ns’ function is used to poll for the current
> time, and it is compared to the timestamp indicating the end time in the
> waiter loop. Each waiting loop is instrumented to check an extra
> condition using a macro. Internally, the macro implementation amortizes
> the checking of the timeout to avoid sampling the clock in every
> iteration. Precisely, the timeout checks are invoked every 64k
> iterations.
>
> Recovery
> ~~~~~~~~
I'm probably bad at reading, but I failed to find anything that
explained how you recover from a deadlock.
Do you force unload the BPF program?
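For readers skimming the quoted design text, a minimal sketch of the per-CPU held-locks table it refers to, with the field names taken from the helpers quoted elsewhere in this thread (the array size and alignment are assumptions):

  struct rqspinlock_held {
  	int cnt;
  	void *locks[RES_NR_HELD];	/* RES_NR_HELD is a small fixed bound */
  };

  static DEFINE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);

Entries are pushed by grab_held_lock_entry() on the acquisition path and popped on unlock via release_held_lock_entry(), and these tables are what the AA and ABBA searches described above walk.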
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
` (27 preceding siblings ...)
2025-02-10 9:38 ` Peter Zijlstra
@ 2025-02-10 9:49 ` Peter Zijlstra
2025-02-10 19:16 ` Ankur Arora
28 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 9:49 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Linus Torvalds, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:
> Changelog:
> ----------
> v1 -> v2
> v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com
>
> * Address nits from Waiman and Peter
> * Fix arm64 WFE bug pointed out by Peter.
What's the state of that smp_cond_relaxed_timeout() patch-set? That
still seems like what you're needing, right?
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
2025-02-06 10:54 ` [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation Kumar Kartikeya Dwivedi
2025-02-08 1:58 ` Alexei Starovoitov
@ 2025-02-10 9:53 ` Peter Zijlstra
2025-02-10 10:03 ` Peter Zijlstra
1 sibling, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 9:53 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Ankur Arora, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Thu, Feb 06, 2025 at 02:54:25AM -0800, Kumar Kartikeya Dwivedi wrote:
> Currently, for rqspinlock usage, the implementation of
> smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
> susceptible to stalls on arm64, because they do not guarantee that the
> conditional expression will be repeatedly invoked if the address being
> loaded from is not written to by other CPUs. When support for
> event-streams is absent (which unblocks stuck WFE-based loops every
> ~100us), we may end up being stuck forever.
>
> This causes a problem for us, as we need to repeatedly invoke the
> RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
> expires.
>
> Hardcode the implementation to the asm-generic version in rqspinlock.c
> until support for smp_cond_load_acquire_timewait [0] lands upstream.
>
*sigh*.. this patch should go *before* patch 8. As is that's still
horribly broken and I was WTF-ing because your 0/n changelog said you
fixed it.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts
2025-02-06 10:54 ` [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts Kumar Kartikeya Dwivedi
@ 2025-02-10 9:56 ` Peter Zijlstra
2025-02-11 4:55 ` Alexei Starovoitov
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 9:56 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Barret Rhoden, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
On Thu, Feb 06, 2025 at 02:54:15AM -0800, Kumar Kartikeya Dwivedi wrote:
> @@ -68,6 +71,44 @@
>
> #include "mcs_spinlock.h"
>
> +struct rqspinlock_timeout {
> + u64 timeout_end;
> + u64 duration;
> + u16 spin;
> +};
> +
> +static noinline int check_timeout(struct rqspinlock_timeout *ts)
> +{
> + u64 time = ktime_get_mono_fast_ns();
This is only sane if you have a TSC clocksource. If you ever manage to
hit the HPET fallback, you're *really* sad.
> +
> + if (!ts->timeout_end) {
> + ts->timeout_end = time + ts->duration;
> + return 0;
> + }
> +
> + if (time > ts->timeout_end)
> + return -ETIMEDOUT;
> +
> + return 0;
> +}
> +
> +#define RES_CHECK_TIMEOUT(ts, ret) \
> + ({ \
> + if (!(ts).spin++) \
> + (ret) = check_timeout(&(ts)); \
> + (ret); \
> + })
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
2025-02-10 9:53 ` Peter Zijlstra
@ 2025-02-10 10:03 ` Peter Zijlstra
2025-02-13 6:15 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 10:03 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Ankur Arora, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Mon, Feb 10, 2025 at 10:53:25AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 02:54:25AM -0800, Kumar Kartikeya Dwivedi wrote:
> > Currently, for rqspinlock usage, the implementation of
> > smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
> > susceptible to stalls on arm64, because they do not guarantee that the
> > conditional expression will be repeatedly invoked if the address being
> > loaded from is not written to by other CPUs. When support for
> > event-streams is absent (which unblocks stuck WFE-based loops every
> > ~100us), we may end up being stuck forever.
> >
> > This causes a problem for us, as we need to repeatedly invoke the
> > RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
> > expires.
> >
> > Hardcode the implementation to the asm-generic version in rqspinlock.c
> > until support for smp_cond_load_acquire_timewait [0] lands upstream.
> >
>
> *sigh*.. this patch should go *before* patch 8. As is that's still
> horribly broken and I was WTF-ing because your 0/n changelog said you
> fixed it.
And since you're doing local copies of things, why not take a local copy
of the smp_cond_load_acquire_timewait() thing?
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 09/26] rqspinlock: Protect waiters in queue from stalls
2025-02-06 10:54 ` [PATCH bpf-next v2 09/26] rqspinlock: Protect waiters in queue " Kumar Kartikeya Dwivedi
@ 2025-02-10 10:17 ` Peter Zijlstra
2025-02-13 6:20 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 10:17 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Barret Rhoden, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
On Thu, Feb 06, 2025 at 02:54:17AM -0800, Kumar Kartikeya Dwivedi wrote:
> Implement the wait queue cleanup algorithm for rqspinlock. There are
> three forms of waiters in the original queued spin lock algorithm. The
> first is the waiter which acquires the pending bit and spins on the lock
> word without forming a wait queue. The second is the head waiter that is
> the first waiter heading the wait queue. The third form is of all the
> non-head waiters queued behind the head, waiting to be signalled through
> their MCS node to overtake the responsibility of the head.
>
> In this commit, we are concerned with the second and third kind. First,
> we augment the waiting loop of the head of the wait queue with a
> timeout. When this timeout happens, all waiters part of the wait queue
> will abort their lock acquisition attempts.
Why? Why terminate the whole wait-queue?
I *think* I understand, but it would be good to spell it out. Also, in the
comment.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery
2025-02-06 10:54 ` [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery Kumar Kartikeya Dwivedi
2025-02-08 1:53 ` Alexei Starovoitov
@ 2025-02-10 10:21 ` Peter Zijlstra
2025-02-13 6:11 ` Kumar Kartikeya Dwivedi
2025-02-10 10:36 ` Peter Zijlstra
2 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 10:21 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Linus Torvalds, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Thu, Feb 06, 2025 at 02:54:19AM -0800, Kumar Kartikeya Dwivedi wrote:
> +#define RES_NR_HELD 32
> +
> +struct rqspinlock_held {
> + int cnt;
> + void *locks[RES_NR_HELD];
> +};
That cnt field makes the whole thing overflow a cacheline boundary.
Making it 31 makes it fit again.
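
For reference, assuming 8-byte pointers and 64-byte cache lines, the
arithmetic behind that suggestion works out roughly as follows (a
back-of-the-envelope check, not code from the patch):

struct held32 { int cnt; void *locks[32]; };  /* 4 + 4 (pad) + 256 = 264 bytes */
struct held31 { int cnt; void *locks[31]; };  /* 4 + 4 (pad) + 248 = 256 bytes */

/* 264 bytes spills into a fifth 64-byte line; 256 fits exactly in four. */
_Static_assert(sizeof(struct held32) > 4 * 64, "needs a fifth cache line");
_Static_assert(sizeof(struct held31) <= 4 * 64, "fits in four cache lines");
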
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery
2025-02-06 10:54 ` [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery Kumar Kartikeya Dwivedi
2025-02-08 1:53 ` Alexei Starovoitov
2025-02-10 10:21 ` Peter Zijlstra
@ 2025-02-10 10:36 ` Peter Zijlstra
2 siblings, 0 replies; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 10:36 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Linus Torvalds, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Thu, Feb 06, 2025 at 02:54:19AM -0800, Kumar Kartikeya Dwivedi wrote:
> + /*
> + * Find the CPU holding the lock that we want to acquire. If there is a
> + * deadlock scenario, we will read a stable set on the remote CPU and
> + * find the target. This would be a constant time operation instead of
> + * O(NR_CPUS) if we could determine the owning CPU from a lock value, but
> + * that requires increasing the size of the lock word.
> + */
Is increasing the size of rqspinlock_t really a problem? For the kernel
as a whole there's very little code that really relies on spinlock_t
being u32 (lockref is an example that does care).
And it seems to me this thing might benefit somewhat significantly from
adding this little extra bit.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-10 9:38 ` Peter Zijlstra
@ 2025-02-10 10:49 ` Peter Zijlstra
2025-02-11 4:37 ` Alexei Starovoitov
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-10 10:49 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, linux-kernel, Linus Torvalds, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Mon, Feb 10, 2025 at 10:38:41AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:
>
>
> > Deadlock Detection
> > ~~~~~~~~~~~~~~~~~~
> > We handle two cases of deadlocks: AA deadlocks (attempts to acquire the
> > same lock again), and ABBA deadlocks (attempts to acquire two locks in
> > the opposite order from two distinct threads). Variants of ABBA
> > deadlocks may be encountered with more than two locks being held in the
> > incorrect order. These are not diagnosed explicitly, as they reduce to
> > ABBA deadlocks.
> >
> > Deadlock detection is triggered immediately when beginning the waiting
> > loop of a lock slow path.
> >
> > While timeouts ensure that any waiting loops in the locking slow path
> > terminate and return to the caller, it can be excessively long in some
> > situations. While the default timeout is short (0.5s), a stall for this
> > duration inside the kernel can set off alerts for latency-critical
> > services with strict SLOs. Ideally, the kernel should recover from an
> > undesired state of the lock as soon as possible.
> >
> > A multi-step strategy is used to recover the kernel from waiting loops
> > in the locking algorithm which may fail to terminate in a bounded amount
> > of time.
> >
> > * Each CPU maintains a table of held locks. Entries are inserted and
> > removed upon entry into lock, and exit from unlock, respectively.
> > * Deadlock detection for AA locks is thus simple: we have an AA
> > deadlock if we find a held lock entry for the lock we’re attempting
> > to acquire on the same CPU.
> > * During deadlock detection for ABBA, we search through the tables of
> > all other CPUs to find situations where we are holding a lock the
> > remote CPU is attempting to acquire, and they are holding a lock we
> > are attempting to acquire. Upon encountering such a condition, we
> > report an ABBA deadlock.
> > * We divide the duration between the entry into the waiting loop
> > and the timeout point into intervals of 1 ms, and perform
> > deadlock detection until the timeout happens. Upon entry into the slow
> > path, and then completion of each 1 ms interval, we perform detection
> > of both AA and ABBA deadlocks. In the event that deadlock detection
> > yields a positive result, the recovery happens sooner than the
> > timeout. Otherwise, it happens as a last resort upon completion of
> > the timeout.
> >
> > Timeouts
> > ~~~~~~~~
> > Timeouts act as the final line of defense against stalls in waiting loops.
> > The ‘ktime_get_mono_fast_ns’ function is used to poll for the current
> > time, and it is compared to the timestamp indicating the end time in the
> > waiter loop. Each waiting loop is instrumented to check an extra
> > condition using a macro. Internally, the macro implementation amortizes
> > the checking of the timeout to avoid sampling the clock in every
> > iteration. Precisely, the timeout checks are invoked every 64k
> > iterations.
> >
> > Recovery
> > ~~~~~~~~
>
> I'm probably bad at reading, but I failed to find anything that
> explained how you recover from a deadlock.
>
> Do you force unload the BPF program?
Even the simple AB-BA case,
CPU0 CPU1
lock-A lock-B
lock-B lock-A <-
just having a random lock op return -ETIMO doesn't actually solve
anything. Suppose CPU1's lock-A will time out; it will have to unwind
and release lock-B before CPU0 can make progress.
Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
lock-B will also time out.
At which point they'll both try again and you're stuck in the same
place, no?
Given you *have* to unwind to make progress; why not move the entire
thing to a wound-wait style lock? Then you also get rid of the whole
timeout mess.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-10 9:49 ` Peter Zijlstra
@ 2025-02-10 19:16 ` Ankur Arora
0 siblings, 0 replies; 67+ messages in thread
From: Ankur Arora @ 2025-02-10 19:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kumar Kartikeya Dwivedi, bpf, linux-kernel, Linus Torvalds,
Will Deacon, Waiman Long, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Paul E. McKenney, Tejun Heo, Barret Rhoden, Josh Don, Dohyun Kim,
linux-arm-kernel, kernel-team
Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Feb 06, 2025 at 02:54:08AM -0800, Kumar Kartikeya Dwivedi wrote:
>> Changelog:
>> ----------
>> v1 -> v2
>> v1: https://lore.kernel.org/bpf/20250107140004.2732830-1-memxor@gmail.com
>>
>> * Address nits from Waiman and Peter
>> * Fix arm64 WFE bug pointed out by Peter.
>
> What's the state of that smp_cond_relaxed_timeout() patch-set?
Just waiting for review comments: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
--
ankur
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-10 10:49 ` Peter Zijlstra
@ 2025-02-11 4:37 ` Alexei Starovoitov
2025-02-11 10:43 ` Peter Zijlstra
0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-11 4:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Do you force unload the BPF program?
Not yet. As you can imagine, cancelling bpf program is much
harder than sending sigkill to the user space process.
The prog needs to safely free all the resources it holds.
This work was ongoing for a couple years now with numerous discussions.
Many steps in-between are being considered as well,
including detaching a misbehaving prog, but there is always a counter
argument.
> Even the simple AB-BA case,
>
> CPU0 CPU1
> lock-A lock-B
> lock-B lock-A <-
>
> just having a random lock op return -ETIMO doesn't actually solve
> anything. Suppose CPU1's lock-A will time out; it will have to unwind
> and release lock-B before CPU0 can make progress.
>
> Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> lock-B will also time out.
>
> At which point they'll both try again and you're stuck in the same
> place, no?
Not really. You're missing that deadlock is not a normal case.
As soon as we have cancellation logic working we will be "sigkilling"
prog where deadlock was detected.
In this patch the verifier guarantees that the prog must check
the return value from bpf_res_spin_lock().
The prog cannot keep re-trying.
The only thing it can do is to exit.
Failing to grab res_spin_lock() is not a normal condition.
The prog has to implement a fallback path for it,
but it has the look and feel of normal spin_lock and algorithms
are written assuming that the lock will be taken.
If res_spin_lock errors, it's a bug in the prog or the prog
was invoked from an unexpected context.
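As a rough sketch of the usage pattern being described, a program might look
like this (BPF C; the kfunc signatures and lock placement are assumptions
for illustration, not prescriptions from the series):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Assumed kfunc signatures, for illustration only. */
extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

struct val {
        struct bpf_res_spin_lock lock;
        long counter;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct val);
} vals SEC(".maps");

SEC("tc")
int prog(struct __sk_buff *skb)
{
        int key = 0;
        struct val *v = bpf_map_lookup_elem(&vals, &key);

        if (!v)
                return 0;
        /* The verifier requires consuming this return value. */
        if (bpf_res_spin_lock(&v->lock))
                return 0;        /* fallback path: nothing is held, just bail */
        v->counter++;            /* critical section */
        bpf_res_spin_unlock(&v->lock);
        return 0;
}

char _license[] SEC("license") = "GPL";

The point is only the shape: the return value must be checked, and on
failure the program exits without holding anything.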
Same thing for patches 19,20,21 where we're addressing years
of accumulated tech debt in the bpf core parts, like bpf hashmap.
Once res_spin_lock() fails in kernel/bpf/hashtab.c,
bpf_map_update_elem() will return EBUSY
(just like it does now when it detects re-entrance on the bucket lock).
There is no retry.
If res_spin_lock fails in the bpf hashmap, in 99% of cases it's syzbot
doing "clever" attaching of bpf progs to bpf internals and
trying hard to break things.
> Given you *have* to unwind to make progress; why not move the entire
> thing to a wound-wait style lock? Then you also get rid of the whole
> timeout mess.
We looked at things like ww_mutex_lock, but they don't fit.
wound-wait is for databases where deadlock is normal and expected.
The transaction has to be aborted and retried.
res_spin_lock is different. It's kinda safe spin_lock that doesn't
brick the kernel.
To be a drop-in replacement it has to perform at the same speed
as spin_lock. Hence the massive benchmarking effort that
you see in the cover letter. That's also the reason to keep it at 4 bytes.
We don't want to increase it to 8 or whatever unless it's absolutely
necessary.
In the other email you say:
> And it seems to me this thing might benefit somewhat significantly from
> adding this little extra bit.
referring to optimization that 8 byte res_spin_lock can potentially
do O(1) ABBA deadlock detection instead of O(NR_CPUS).
That was a conscious trade-off. Deadlocks are not normal.
If it takes a bit longer to detect it's fine.
The res_spin_lock is optimized to proceed as normal qspinlock.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts
2025-02-10 9:56 ` Peter Zijlstra
@ 2025-02-11 4:55 ` Alexei Starovoitov
2025-02-11 10:11 ` Peter Zijlstra
0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-11 4:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Barret Rhoden, Linus Torvalds,
Will Deacon, Waiman Long, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Paul E. McKenney, Tejun Heo, Josh Don, Dohyun Kim,
linux-arm-kernel, Kernel Team
On Mon, Feb 10, 2025 at 1:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 06, 2025 at 02:54:15AM -0800, Kumar Kartikeya Dwivedi wrote:
> > @@ -68,6 +71,44 @@
> >
> > #include "mcs_spinlock.h"
> >
> > +struct rqspinlock_timeout {
> > + u64 timeout_end;
> > + u64 duration;
> > + u16 spin;
> > +};
> > +
> > +static noinline int check_timeout(struct rqspinlock_timeout *ts)
> > +{
> > + u64 time = ktime_get_mono_fast_ns();
>
> This is only sane if you have a TSC clocksource. If you ever manage to
> hit the HPET fallback, you're *really* sad.
ktime_get_mono_fast_ns() is the best NMI safe time source we're aware of.
perf, rcu, even hardlockup detector are using it.
The clock source can drop to hpet on buggy hw and everything is indeed
sad in that case, but it's not like we have a choice.
Note that the timeout detection is the last resort.
The logic goes through AA and ABBA detection first.
So timeout means that the locking dependency is quite complex.
Periodically checking "are we spinning too long" via
ktime_get_mono_fast_ns() is what lets us abort the lock.
Maybe I'm missing the concern.
Should we use
__arch_get_hw_counter(VDSO_CLOCKMODE_TSC, NULL) instead ?
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts
2025-02-11 4:55 ` Alexei Starovoitov
@ 2025-02-11 10:11 ` Peter Zijlstra
2025-02-11 18:00 ` Alexei Starovoitov
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-11 10:11 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Barret Rhoden, Linus Torvalds,
Will Deacon, Waiman Long, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Paul E. McKenney, Tejun Heo, Josh Don, Dohyun Kim,
linux-arm-kernel, Kernel Team
On Mon, Feb 10, 2025 at 08:55:56PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 10, 2025 at 1:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Feb 06, 2025 at 02:54:15AM -0800, Kumar Kartikeya Dwivedi wrote:
> > > @@ -68,6 +71,44 @@
> > >
> > > #include "mcs_spinlock.h"
> > >
> > > +struct rqspinlock_timeout {
> > > + u64 timeout_end;
> > > + u64 duration;
> > > + u16 spin;
> > > +};
> > > +
> > > +static noinline int check_timeout(struct rqspinlock_timeout *ts)
> > > +{
> > > + u64 time = ktime_get_mono_fast_ns();
> >
> > This is only sane if you have a TSC clocksource. If you ever manage to
> > hit the HPET fallback, you're *really* sad.
>
> ktime_get_mono_fast_ns() is the best NMI safe time source we're aware of.
> perf, rcu, even hardlockup detector are using it.
perf is primarily using local_clock(), as is the scheduler.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-11 4:37 ` Alexei Starovoitov
@ 2025-02-11 10:43 ` Peter Zijlstra
2025-02-11 18:33 ` Alexei Starovoitov
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-11 10:43 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Mon, Feb 10, 2025 at 08:37:06PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Do you force unload the BPF program?
>
> Not yet. As you can imagine, cancelling bpf program is much
> harder than sending sigkill to the user space process.
So you are killing the user program? Because it wasn't at all clear what,
if anything, is done when this failure case is tripped.
> The prog needs to safely free all the resources it holds.
> This work was ongoing for a couple years now with numerous discussions.
Well, for you maybe, I'm new here. This is only the second submission,
and really only the first one I got to mostly read.
> > Even the simple AB-BA case,
> >
> > CPU0 CPU1
> > lock-A lock-B
> > lock-B lock-A <-
> >
> > just having a random lock op return -ETIMO doesn't actually solve
> > anything. Suppose CPU1's lock-A will time out; it will have to unwind
> > and release lock-B before CPU0 can make progress.
> >
> > Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> > lock-B will also time out.
> >
> > At which point they'll both try again and you're stuck in the same
> > place, no?
>
> Not really. You're missing that deadlock is not a normal case.
Well, if this is unpriv user programs, you should most definitely
consider them the normal case. Must assume user space is malicious.
> As soon as we have cancellation logic working we will be "sigkilling"
> prog where deadlock was detected.
Ah, so that's the plan, but not yet included here? This means that every
BPF program invocation must be 'cancellable'? What if kernel thread is
hitting tracepoint or somesuch?
So much details not clear to me and not explained either :/
> In this patch the verifier guarantees that the prog must check
> the return value from bpf_res_spin_lock().
Yeah, but so what? It can check and still not do the right thing. Only
checking that the return value is consumed somehow doesn't really help much.
> The prog cannot keep re-trying.
> The only thing it can do is to exit.
Right, but it might have already modified things, how are you going to
recover from that?
> Failing to grab res_spin_lock() is not a normal condition.
If you're going to be exposing this to unpriv, I really do think you
should assume it to be the normal case.
> The prog has to implement a fallback path for it,
But verifier must verify it is sane fallback, how can it do that?
> > Given you *have* to unwind to make progress; why not move the entire
> > thing to a wound-wait style lock? Then you also get rid of the whole
> > timeout mess.
>
> We looked at things like ww_mutex_lock, but they don't fit.
> wound-wait is for databases where deadlock is normal and expected.
> The transaction has to be aborted and retried.
Right, which to me sounds exactly like what you want for unpriv.
Have the program structured such that it must acquire all locks before
it does a modification / store -- and have the verifier enforce this.
Then any lock failure can be handled by the bpf core, not the program
itself. Core can unlock all previously acquired locks, and core can
either re-attempt the program or 'skip' it after N failures.
It does mean the bpf core needs to track the acquired locks -- which you
already do, except it becomes mandatory, prog cannot acquire more than
~32 locks.
> res_spin_lock is different. It's kinda safe spin_lock that doesn't
> brick the kernel.
Well, 1/2 second is pretty much bricked imo.
> That was a conscious trade-off. Deadlocks are not normal.
I really do think you should assume they are normal, unpriv and all
that.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts
2025-02-11 10:11 ` Peter Zijlstra
@ 2025-02-11 18:00 ` Alexei Starovoitov
0 siblings, 0 replies; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-11 18:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Barret Rhoden, Linus Torvalds,
Will Deacon, Waiman Long, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Paul E. McKenney, Tejun Heo, Josh Don, Dohyun Kim,
linux-arm-kernel, Kernel Team
On Tue, Feb 11, 2025 at 2:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 10, 2025 at 08:55:56PM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 10, 2025 at 1:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Feb 06, 2025 at 02:54:15AM -0800, Kumar Kartikeya Dwivedi wrote:
> > > > @@ -68,6 +71,44 @@
> > > >
> > > > #include "mcs_spinlock.h"
> > > >
> > > > +struct rqspinlock_timeout {
> > > > + u64 timeout_end;
> > > > + u64 duration;
> > > > + u16 spin;
> > > > +};
> > > > +
> > > > +static noinline int check_timeout(struct rqspinlock_timeout *ts)
> > > > +{
> > > > + u64 time = ktime_get_mono_fast_ns();
> > >
> > > This is only sane if you have a TSC clocksource. If you ever manage to
> > > hit the HPET fallback, you're *really* sad.
> >
> > ktime_get_mono_fast_ns() is the best NMI safe time source we're aware of.
> > perf, rcu, even hardlockup detector are using it.
>
> perf is primarily using local_clock(), as is the scheduler.
We considered it, but I think it won't tick when irqs are disabled,
since the generic part is jiffies-based?
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-11 10:43 ` Peter Zijlstra
@ 2025-02-11 18:33 ` Alexei Starovoitov
2025-02-13 9:59 ` Peter Zijlstra
0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-11 18:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Tue, Feb 11, 2025 at 2:44 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 10, 2025 at 08:37:06PM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 10, 2025 at 2:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > Do you force unload the BPF program?
> >
> > Not yet. As you can imagine, cancelling bpf program is much
> > harder than sending sigkill to the user space process.
>
> So you are killing the user program? Because it wasn't at all clear what,
> if anything, is done when this failure case is tripped.
No. We're not killing the user process. bpf progs often run
when there is no owner process. They're just attached
somewhere and doing things.
Like XDP firewall will work just fine without any user space.
> > The prog needs to safely free all the resources it holds.
> > This work was ongoing for a couple years now with numerous discussions.
>
> Well, for you maybe, I'm new here. This is only the second submission,
> and really only the first one I got to mostly read.
>
> > > Even the simple AB-BA case,
> > >
> > > CPU0 CPU1
> > > lock-A lock-B
> > > lock-B lock-A <-
> > >
> > > just having a random lock op return -ETIMO doesn't actually solve
> > > anything. Suppose CPU1's lock-A will time out; it will have to unwind
> > > and release lock-B before CPU0 can make progress.
> > >
> > > Worse, if CPU1 isn't quick enough to unwind and release B, then CPU0's
> > > lock-B will also time out.
> > >
> > > At which point they'll both try again and you're stuck in the same
> > > place, no?
> >
> > Not really. You're missing that deadlock is not a normal case.
>
> Well, if this is unpriv user programs, you should most definitely
> consider them the normal case. Must assume user space is malicious.
Ohh. No unpriv here.
Since spectre was discovered unpriv bpf died.
BPF_UNPRIV_DEFAULT_OFF=y was the default for distros and
all hyperscalers for quite some time.
> > As soon as we have cancellation logic working we will be "sigkilling"
> > prog where deadlock was detected.
>
> Ah, so that's the plan, but not yet included here? This means that every
> BPF program invocation must be 'cancellable'? What if kernel thread is
> hitting tracepoint or somesuch?
>
> So much details not clear to me and not explained either :/
Yes. The plan is to "kill" bpf prog when it misbehaves.
But this is orthogonal to this res_spin_lock set which is
a building block.
> Right, but it might have already modified things, how are you going to
> recover from that?
Tracking resources acquisition and release by the bpf prog
is a normal verifier job.
When bpf prog does bpf_rcu_read_lock() the verifier makes sure
that all execution paths from there on have bpf_rcu_read_unlock()
before program reaches the exit.
Same thing with locks.
If bpf_res_spin_lock() succeeds the verifier will make sure
there is matching bpf_res_spin_unlock().
If some resource was acquired before bpf_res_spin_lock() and
it returned -EDEADLK the verifier will not allow early return
without releasing all acquired resources.
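Schematically, when something else is already held, the shape the verifier
enforces looks roughly like this (reusing the map, struct, and kfunc externs
from the sketch earlier in the thread; the bpf_rcu_read_lock()/
bpf_rcu_read_unlock() externs are shown the same way, as assumptions):

extern void bpf_rcu_read_lock(void) __ksym;
extern void bpf_rcu_read_unlock(void) __ksym;

SEC("tc")
int prog_nested(struct __sk_buff *skb)
{
        struct val *v;
        int key = 0, ret;

        v = bpf_map_lookup_elem(&vals, &key);
        if (!v)
                return 0;
        bpf_rcu_read_lock();                /* resource acquired first */
        ret = bpf_res_spin_lock(&v->lock);
        if (ret) {
                /* e.g. -EDEADLK: must still release everything held so far */
                bpf_rcu_read_unlock();
                return 0;
        }
        v->counter++;                        /* critical section */
        bpf_res_spin_unlock(&v->lock);
        bpf_rcu_read_unlock();
        return 0;
}
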
> > Failing to grab res_spin_lock() is not a normal condition.
>
> If you're going to be exposing this to unpriv, I really do think you
> should assume it to be the normal case.
No unpriv for foreseeable future.
> > The prog has to implement a fallback path for it,
>
> But verifier must verify it is sane fallback, how can it do that?
>
> > > Given you *have* to unwind to make progress; why not move the entire
> > > thing to a wound-wait style lock? Then you also get rid of the whole
> > > timeout mess.
> >
> > We looked at things like ww_mutex_lock, but they don't fit.
> > wound-wait is for databases where deadlock is normal and expected.
> > The transaction has to be aborted and retried.
>
> Right, which to me sounds exactly like what you want for unpriv.
>
> Have the program structured such that it must acquire all locks before
> it does a modification / store -- and have the verifier enforce this.
> Then any lock failure can be handled by the bpf core, not the program
> itself. Core can unlock all previously acquired locks, and core can
> either re-attempt the program or 'skip' it after N failures.
We definitely don't want the bpf core to keep track of acquired resources.
That just doesn't scale.
There could be rcu_read_locks, all kinds of refcounted objects,
locks taken, and so on.
The verifier makes sure that the program does the release no matter
what the execution path.
That's how it scales.
On my devserver I have 152 bpf programs running.
All of them keep acquiring and releasing resources (locks, sockets,
memory) million times a second.
The verifier checks that each prog is doing its job individually.
> It does mean the bpf core needs to track the acquired locks -- which you
> already do,
We don't. The bpf infra does static checks only.
The core doesn't track objects at run-time.
The only exceptions are map elements.
bpf prog might store an acquired object in a map.
Only in that case bpf infra will free that object when it frees
the whole map.
But that doesn't apply to short-lived things like RCU CS and
locks. Those cannot last long. They must complete within a single
execution of the prog.
> > That was a conscious trade-off. Deadlocks are not normal.
>
> I really do think you should assume they are normal, unpriv and all
> that.
No unpriv and no, we don't want deadlocks to be considered normal
by bpf users. They need to hear "fix your broken prog" message loud
and clear. Patch 14 splat is a step in that direction.
Currently it's only for in-kernel res_spin_lock() usage
(like in bpf hashtab). Eventually we will deliver the message to users
without polluting dmesg. Still debating the actual mechanism.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 24/26] bpf: Implement verifier support for rqspinlock
2025-02-06 10:54 ` [PATCH bpf-next v2 24/26] bpf: Implement verifier support for rqspinlock Kumar Kartikeya Dwivedi
@ 2025-02-12 0:08 ` Eduard Zingerman
2025-02-13 6:41 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 67+ messages in thread
From: Eduard Zingerman @ 2025-02-12 0:08 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Paul E. McKenney, Tejun Heo, Barret Rhoden,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
On Thu, 2025-02-06 at 02:54 -0800, Kumar Kartikeya Dwivedi wrote:
> Introduce verifier-side support for rqspinlock kfuncs. The first step is
> allowing bpf_res_spin_lock type to be defined in map values and
> allocated objects, so BTF-side is updated with a new BPF_RES_SPIN_LOCK
> field to recognize and validate.
>
> Any object cannot have both bpf_spin_lock and bpf_res_spin_lock, only
> one of them (and at most one of them per-object, like before) must be
> present. The bpf_res_spin_lock can also be used to protect objects that
> require lock protection for their kfuncs, like BPF rbtree and linked
> list.
>
> The verifier plumbing to simulate success and failure cases when calling
> the kfuncs is done by pushing a new verifier state to the verifier state
> stack which will verify the failure case upon calling the kfunc. The
> path where success is indicated creates all lock reference state and IRQ
> state (if necessary for irqsave variants). In the case of failure, the
> state clears the registers r0-r5, sets the return value, and skips kfunc
> processing, proceeding to the next instruction.
>
> When marking the return value for success case, the value is marked as
> 0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program,
> whenever user checks the return value as 'if (ret)' or 'if (ret < 0)'
> the verifier never traverses such branches for success cases, and would
> be aware that the lock is not held in such cases.
>
> We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs
> are invoked. We introduce a kfunc_class state to avoid mixing lock
> irqrestore kfuncs with IRQ state created by bpf_local_irq_save.
>
> With all this infrastructure, these kfuncs become usable in programs
> while satisfying all safety properties required by the kernel.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
Apart from a few nits, I think this patch looks good.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
[...]
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 32c23f2a3086..ed444e44f524 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -115,6 +115,15 @@ struct bpf_reg_state {
> int depth:30;
> } iter;
>
> + /* For irq stack slots */
> + struct {
> + enum {
> + IRQ_KFUNC_IGNORE,
Is this state ever used?
mark_stack_slot_irq_flag() is always called with either NATIVE or LOCK.
> + IRQ_NATIVE_KFUNC,
> + IRQ_LOCK_KFUNC,
> + } kfunc_class;
> + } irq;
> +
> /* Max size from any of the above. */
> struct {
> unsigned long raw1;
[...]
> @@ -8038,36 +8059,53 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
> }
>
> rec = reg_btf_record(reg);
> - if (!btf_record_has_field(rec, BPF_SPIN_LOCK)) {
> - verbose(env, "%s '%s' has no valid bpf_spin_lock\n", map ? "map" : "local",
> - map ? map->name : "kptr");
> + if (!btf_record_has_field(rec, is_res_lock ? BPF_RES_SPIN_LOCK : BPF_SPIN_LOCK)) {
> + verbose(env, "%s '%s' has no valid %s_lock\n", map ? "map" : "local",
> + map ? map->name : "kptr", lock_str);
> return -EINVAL;
> }
> - if (rec->spin_lock_off != val + reg->off) {
> - verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock' that is at %d\n",
> - val + reg->off, rec->spin_lock_off);
> + spin_lock_off = is_res_lock ? rec->res_spin_lock_off : rec->spin_lock_off;
> + if (spin_lock_off != val + reg->off) {
> + verbose(env, "off %lld doesn't point to 'struct %s_lock' that is at %d\n",
> + val + reg->off, lock_str, spin_lock_off);
> return -EINVAL;
> }
> if (is_lock) {
> void *ptr;
> + int type;
>
> if (map)
> ptr = map;
> else
> ptr = btf;
>
> - if (cur->active_locks) {
> - verbose(env,
> - "Locking two bpf_spin_locks are not allowed\n");
> - return -EINVAL;
> + if (!is_res_lock && cur->active_locks) {
Nit: having '&& cur->active_locks' in this branch but not the one for
'is_res_lock' is a bit confusing. As far as I understand this is
just an optimization, and active_locks check could be done (or dropped)
in both cases.
> + if (find_lock_state(env->cur_state, REF_TYPE_LOCK, 0, NULL)) {
> + verbose(env,
> + "Locking two bpf_spin_locks are not allowed\n");
> + return -EINVAL;
> + }
> + } else if (is_res_lock) {
> + if (find_lock_state(env->cur_state, REF_TYPE_RES_LOCK, reg->id, ptr)) {
> + verbose(env, "Acquiring the same lock again, AA deadlock detected\n");
> + return -EINVAL;
> + }
> }
Nit: there is no branch for find_lock_state(... REF_TYPE_RES_LOCK_IRQ ...).
This is not a problem, as other checks catch an imbalance in the
number of unlocks or an unlock of the same lock, but the verifier won't
report the above "AA deadlock" message for bpf_res_spin_lock_irqsave().
The above two checks make it legal to take resilient lock while
holding regular lock and vice versa. This is probably ok, can't figure
out an example when this causes trouble.
> - err = acquire_lock_state(env, env->insn_idx, REF_TYPE_LOCK, reg->id, ptr);
> +
> + if (is_res_lock && is_irq)
> + type = REF_TYPE_RES_LOCK_IRQ;
> + else if (is_res_lock)
> + type = REF_TYPE_RES_LOCK;
> + else
> + type = REF_TYPE_LOCK;
> + err = acquire_lock_state(env, env->insn_idx, type, reg->id, ptr);
> if (err < 0) {
> verbose(env, "Failed to acquire lock state\n");
> return err;
> }
> } else {
> void *ptr;
> + int type;
>
> if (map)
> ptr = map;
[...]
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 26/26] selftests/bpf: Add tests for rqspinlock
2025-02-06 10:54 ` [PATCH bpf-next v2 26/26] selftests/bpf: Add tests for rqspinlock Kumar Kartikeya Dwivedi
@ 2025-02-12 0:14 ` Eduard Zingerman
2025-02-13 6:25 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 67+ messages in thread
From: Eduard Zingerman @ 2025-02-12 0:14 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf, linux-kernel
Cc: Linus Torvalds, Peter Zijlstra, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Paul E. McKenney, Tejun Heo, Barret Rhoden,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
On Thu, 2025-02-06 at 02:54 -0800, Kumar Kartikeya Dwivedi wrote:
[...]
> +void test_res_spin_lock(void)
> +{
> + if (test__start_subtest("res_spin_lock_success"))
> + test_res_spin_lock_success();
> + if (test__start_subtest("res_spin_lock_failure"))
> + test_res_spin_lock_failure();
> +}
Such organization makes it impossible to select sub-tests from
res_spin_lock_failure using ./test_progs -t.
I suggest doing something like below:
@@ -6,7 +6,7 @@
#include "res_spin_lock.skel.h"
#include "res_spin_lock_fail.skel.h"
-static void test_res_spin_lock_failure(void)
+void test_res_spin_lock_failure(void)
{
RUN_TESTS(res_spin_lock_fail);
}
@@ -30,7 +30,7 @@ static void *spin_lock_thread(void *arg)
pthread_exit(arg);
}
-static void test_res_spin_lock_success(void)
+void test_res_spin_lock_success(void)
{
LIBBPF_OPTS(bpf_test_run_opts, topts,
.data_in = &pkt_v4,
@@ -89,11 +89,3 @@ static void test_res_spin_lock_success(void)
res_spin_lock__destroy(skel);
return;
}
-
-void test_res_spin_lock(void)
-{
- if (test__start_subtest("res_spin_lock_success"))
- test_res_spin_lock_success();
- if (test__start_subtest("res_spin_lock_failure"))
- test_res_spin_lock_failure();
-}
[...]
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery
2025-02-10 10:21 ` Peter Zijlstra
@ 2025-02-13 6:11 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-13 6:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bpf, linux-kernel, Linus Torvalds, Will Deacon, Waiman Long,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Mon, 10 Feb 2025 at 11:21, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 06, 2025 at 02:54:19AM -0800, Kumar Kartikeya Dwivedi wrote:
> > +#define RES_NR_HELD 32
> > +
> > +struct rqspinlock_held {
> > + int cnt;
> > + void *locks[RES_NR_HELD];
> > +};
>
> That cnt field makes the whole thing overflow a cacheline boundary.
> Making it 31 makes it fit again.
Makes sense, I can make RES_NR_HELD 31, it doesn't matter too much.
That's one less cacheline to pull into the local CPU during remote CPU reads.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
2025-02-10 10:03 ` Peter Zijlstra
@ 2025-02-13 6:15 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-13 6:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bpf, linux-kernel, Ankur Arora, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
kernel-team
On Mon, 10 Feb 2025 at 11:03, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 10, 2025 at 10:53:25AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2025 at 02:54:25AM -0800, Kumar Kartikeya Dwivedi wrote:
> > > Currently, for rqspinlock usage, the implementation of
> > > smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
> > > susceptible to stalls on arm64, because they do not guarantee that the
> > > conditional expression will be repeatedly invoked if the address being
> > > loaded from is not written to by other CPUs. When support for
> > > event-streams is absent (which unblocks stuck WFE-based loops every
> > > ~100us), we may end up being stuck forever.
> > >
> > > This causes a problem for us, as we need to repeatedly invoke the
> > > RES_CHECK_TIMEOUT in the spin loop to break out when the timeout
> > > expires.
> > >
> > > Hardcode the implementation to the asm-generic version in rqspinlock.c
> > > until support for smp_cond_load_acquire_timewait [0] lands upstream.
> > >
> >
> > *sigh*.. this patch should go *before* patch 8. As is that's still
> > horribly broken and I was WTF-ing because your 0/n changelog said you
> > fixed it.
>
Sorry about that, I will move it before the patch using this.
> And since you're doing local copies of things, why not take a lobal copy
> of the smp_cond_load_acquire_timewait() thing?
Ack, I'll address this in v3.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 09/26] rqspinlock: Protect waiters in queue from stalls
2025-02-10 10:17 ` Peter Zijlstra
@ 2025-02-13 6:20 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-13 6:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bpf, linux-kernel, Barret Rhoden, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
On Mon, 10 Feb 2025 at 11:17, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 06, 2025 at 02:54:17AM -0800, Kumar Kartikeya Dwivedi wrote:
> > Implement the wait queue cleanup algorithm for rqspinlock. There are
> > three forms of waiters in the original queued spin lock algorithm. The
> > first is the waiter which acquires the pending bit and spins on the lock
> > word without forming a wait queue. The second is the head waiter that is
> > the first waiter heading the wait queue. The third form is of all the
> > non-head waiters queued behind the head, waiting to be signalled through
> > their MCS node to overtake the responsibility of the head.
> >
> > In this commit, we are concerned with the second and third kind. First,
> > we augment the waiting loop of the head of the wait queue with a
> > timeout. When this timeout happens, all waiters part of the wait queue
> > will abort their lock acquisition attempts.
>
> Why? Why terminate the whole wait-queue?
>
> I *think* I understand, but it would be good to spell it out. Also, in the
> comment.
Ack. The main reason is that we eschew per-waiter timeouts in favor of
a single one applied at the head of the wait queue.
This allows everyone to break out faster once the head has seen the owner /
pending waiter not responding for the timeout duration.
Secondly, it avoids complicated synchronization, because when waiters don't
leave in FIFO order, prev's next pointer needs to be fixed up, etc.
Let me know if this explanation differs from your understanding.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 26/26] selftests/bpf: Add tests for rqspinlock
2025-02-12 0:14 ` Eduard Zingerman
@ 2025-02-13 6:25 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-13 6:25 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, linux-kernel, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Paul E. McKenney, Tejun Heo, Barret Rhoden,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
On Wed, 12 Feb 2025 at 01:14, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2025-02-06 at 02:54 -0800, Kumar Kartikeya Dwivedi wrote:
>
> [...]
>
> > +void test_res_spin_lock(void)
> > +{
> > + if (test__start_subtest("res_spin_lock_success"))
> > + test_res_spin_lock_success();
> > + if (test__start_subtest("res_spin_lock_failure"))
> > + test_res_spin_lock_failure();
> > +}
>
> Such organization makes it impossible to select sub-tests from
> res_spin_lock_failure using ./test_progs -t.
> I suggest doing something like below:
>
> @@ -6,7 +6,7 @@
> #include "res_spin_lock.skel.h"
> #include "res_spin_lock_fail.skel.h"
>
> -static void test_res_spin_lock_failure(void)
> +void test_res_spin_lock_failure(void)
> {
> RUN_TESTS(res_spin_lock_fail);
> }
> @@ -30,7 +30,7 @@ static void *spin_lock_thread(void *arg)
> pthread_exit(arg);
> }
>
> -static void test_res_spin_lock_success(void)
> +void test_res_spin_lock_success(void)
> {
> LIBBPF_OPTS(bpf_test_run_opts, topts,
> .data_in = &pkt_v4,
> @@ -89,11 +89,3 @@ static void test_res_spin_lock_success(void)
> res_spin_lock__destroy(skel);
> return;
> }
> -
> -void test_res_spin_lock(void)
> -{
> - if (test__start_subtest("res_spin_lock_success"))
> - test_res_spin_lock_success();
> - if (test__start_subtest("res_spin_lock_failure"))
> - test_res_spin_lock_failure();
> -}
>
Ack, will fix.
> [...]
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 24/26] bpf: Implement verifier support for rqspinlock
2025-02-12 0:08 ` Eduard Zingerman
@ 2025-02-13 6:41 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-13 6:41 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, linux-kernel, Linus Torvalds, Peter Zijlstra, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Paul E. McKenney, Tejun Heo, Barret Rhoden,
Josh Don, Dohyun Kim, linux-arm-kernel, kernel-team
On Wed, 12 Feb 2025 at 01:08, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2025-02-06 at 02:54 -0800, Kumar Kartikeya Dwivedi wrote:
> > Introduce verifier-side support for rqspinlock kfuncs. The first step is
> > allowing bpf_res_spin_lock type to be defined in map values and
> > allocated objects, so BTF-side is updated with a new BPF_RES_SPIN_LOCK
> > field to recognize and validate.
> >
> > Any object cannot have both bpf_spin_lock and bpf_res_spin_lock, only
> > one of them (and at most one of them per-object, like before) must be
> > present. The bpf_res_spin_lock can also be used to protect objects that
> > require lock protection for their kfuncs, like BPF rbtree and linked
> > list.
> >
> > The verifier plumbing to simulate success and failure cases when calling
> > the kfuncs is done by pushing a new verifier state to the verifier state
> > stack which will verify the failure case upon calling the kfunc. The
> > path where success is indicated creates all lock reference state and IRQ
> > state (if necessary for irqsave variants). In the case of failure, the
> > state clears the registers r0-r5, sets the return value, and skips kfunc
> > processing, proceeding to the next instruction.
> >
> > When marking the return value for success case, the value is marked as
> > 0, and for the failure case as [-MAX_ERRNO, -1]. Then, in the program,
> > whenever user checks the return value as 'if (ret)' or 'if (ret < 0)'
> > the verifier never traverses such branches for success cases, and would
> > be aware that the lock is not held in such cases.
> >
> > We push the kfunc state in check_kfunc_call whenever rqspinlock kfuncs
> > are invoked. We introduce a kfunc_class state to avoid mixing lock
> > irqrestore kfuncs with IRQ state created by bpf_local_irq_save.
> >
> > With all this infrastructure, these kfuncs become usable in programs
> > while satisfying all safety properties required by the kernel.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
>
> Apart from a few nits, I think this patch looks good.
>
> Acked-by: Eduard Zingerman <eddyz87@gmail.com>
>
Thanks!
> [...]
>
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 32c23f2a3086..ed444e44f524 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -115,6 +115,15 @@ struct bpf_reg_state {
> > int depth:30;
> > } iter;
> >
> > + /* For irq stack slots */
> > + struct {
> > + enum {
> > + IRQ_KFUNC_IGNORE,
>
> Is this state ever used?
> mark_stack_slot_irq_flag() is always called with either NATIVE or LOCK.
Hm, no, it was just the default / invalid value, I guess it can be dropped.
>
> > + IRQ_NATIVE_KFUNC,
> > + IRQ_LOCK_KFUNC,
> > + } kfunc_class;
> > + } irq;
> > +
> > /* Max size from any of the above. */
> > struct {
> > unsigned long raw1;
>
> [...]
>
> > @@ -8038,36 +8059,53 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
> > }
> >
> > rec = reg_btf_record(reg);
> > - if (!btf_record_has_field(rec, BPF_SPIN_LOCK)) {
> > - verbose(env, "%s '%s' has no valid bpf_spin_lock\n", map ? "map" : "local",
> > - map ? map->name : "kptr");
> > + if (!btf_record_has_field(rec, is_res_lock ? BPF_RES_SPIN_LOCK : BPF_SPIN_LOCK)) {
> > + verbose(env, "%s '%s' has no valid %s_lock\n", map ? "map" : "local",
> > + map ? map->name : "kptr", lock_str);
> > return -EINVAL;
> > }
> > - if (rec->spin_lock_off != val + reg->off) {
> > - verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock' that is at %d\n",
> > - val + reg->off, rec->spin_lock_off);
> > + spin_lock_off = is_res_lock ? rec->res_spin_lock_off : rec->spin_lock_off;
> > + if (spin_lock_off != val + reg->off) {
> > + verbose(env, "off %lld doesn't point to 'struct %s_lock' that is at %d\n",
> > + val + reg->off, lock_str, spin_lock_off);
> > return -EINVAL;
> > }
> > if (is_lock) {
> > void *ptr;
> > + int type;
> >
> > if (map)
> > ptr = map;
> > else
> > ptr = btf;
> >
> > - if (cur->active_locks) {
> > - verbose(env,
> > - "Locking two bpf_spin_locks are not allowed\n");
> > - return -EINVAL;
> > + if (!is_res_lock && cur->active_locks) {
>
> Nit: having '&& cur->active_locks' in this branch but not the one for
> 'is_res_lock' is a bit confusing. As far as I understand this is
> just an optimization, and active_locks check could be done (or dropped)
> in both cases.
Yeah, I can make it consistent by adding the check to both.
>
> > + if (find_lock_state(env->cur_state, REF_TYPE_LOCK, 0, NULL)) {
> > + verbose(env,
> > + "Locking two bpf_spin_locks are not allowed\n");
> > + return -EINVAL;
> > + }
> > + } else if (is_res_lock) {
> > + if (find_lock_state(env->cur_state, REF_TYPE_RES_LOCK, reg->id, ptr)) {
> > + verbose(env, "Acquiring the same lock again, AA deadlock detected\n");
> > + return -EINVAL;
> > + }
> > }
>
> Nit: there is no branch for find_lock_state(... REF_TYPE_RES_LOCK_IRQ ...).
> This is not a problem, as other checks catch an imbalance in the
> number of unlocks or an unlock of the same lock, but the verifier won't
> report the above "AA deadlock" message for bpf_res_spin_lock_irqsave().
>
Good point, will fix.
> The above two checks make it legal to take resilient lock while
> holding regular lock and vice versa. This is probably ok, can't figure
> out an example when this causes trouble.
Yeah, that shouldn't cause a problem.
>
> > - err = acquire_lock_state(env, env->insn_idx, REF_TYPE_LOCK, reg->id, ptr);
> > +
> > + if (is_res_lock && is_irq)
> > + type = REF_TYPE_RES_LOCK_IRQ;
> > + else if (is_res_lock)
> > + type = REF_TYPE_RES_LOCK;
> > + else
> > + type = REF_TYPE_LOCK;
> > + err = acquire_lock_state(env, env->insn_idx, type, reg->id, ptr);
> > if (err < 0) {
> > verbose(env, "Failed to acquire lock state\n");
> > return err;
> > }
> > } else {
> > void *ptr;
> > + int type;
> >
> > if (map)
> > ptr = map;
>
> [...]
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-11 18:33 ` Alexei Starovoitov
@ 2025-02-13 9:59 ` Peter Zijlstra
2025-02-14 2:37 ` Alexei Starovoitov
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-02-13 9:59 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Tue, Feb 11, 2025 at 10:33:00AM -0800, Alexei Starovoitov wrote:
> Ohh. No unpriv here.
> Since spectre was discovered unpriv bpf died.
> BPF_UNPRIV_DEFAULT_OFF=y was the default for distros and
> all hyperscalers for quite some time.
Ah, okay. Time to remove the option then?
> > So much details not clear to me and not explained either :/
>
> Yes. The plan is to "kill" bpf prog when it misbehaves.
> But this is orthogonal to this res_spin_lock set which is
> a building block.
>
> > Right, but it might have already modified things, how are you going to
> > recover from that?
>
> Tracking resources acquisition and release by the bpf prog
> is a normal verifier job.
> When bpf prog does bpf_rcu_read_lock() the verifier makes sure
> that all execution paths from there on have bpf_rcu_read_unlock()
> before program reaches the exit.
> Same thing with locks.
Ah, okay, this wasn't stated anywhere. This is rather crucial
information.
> If bpf_res_spin_lock() succeeds the verifier will make sure
> there is matching bpf_res_spin_unlock().
> If some resource was acquired before bpf_res_spin_lock() and
> it returned -EDEADLK the verifier will not allow early return
> without releasing all acquired resources.
Good.
> > Have the program structured such that it must acquire all locks before
> > it does a modification / store -- and have the verifier enforce this.
> > Then any lock failure can be handled by the bpf core, not the program
> > itself. Core can unlock all previously acquired locks, and core can
> > either re-attempt the program or 'skip' it after N failures.
>
> > We definitely don't want the bpf core to keep track of acquired resources.
> That just doesn't scale.
> There could be rcu_read_locks, all kinds of refcounted objects,
> locks taken, and so on.
> The verifier makes sure that the program does the release no matter
> what the execution path.
> That's how it scales.
> On my devserver I have 152 bpf programs running.
> All of them keep acquiring and releasing resources (locks, sockets,
> memory) million times a second.
> The verifier checks that each prog is doing its job individually.
Well, this patch set tracks the held lock stack -- which is required in
order to do the deadlock thing after all.
> > It does mean the bpf core needs to track the acquired locks -- which you
> > already do,
>
> We don't.
This patch set does exactly that. It is required for deadlock analysis.
> The bpf infra does static checks only.
> The core doesn't track objects at run-time.
> The only exceptions are map elements.
> bpf prog might store an acquired object in a map.
> Only in that case bpf infra will free that object when it frees
> the whole map.
> But that doesn't apply to short lived things like RCU CS and
> locks. Those cannot last long. They must complete within single
> execution of the prog.
Right. Held lock stack is like that.
> > > That was a conscious trade-off. Deadlocks are not normal.
> >
> > I really do think you should assume they are normal, unpriv and all
> > that.
>
> No unpriv and no, we don't want deadlocks to be considered normal
> by bpf users. They need to hear "fix your broken prog" message loud
> and clear. Patch 14 splat is a step in that direction.
> Currently it's only for in-kernel res_spin_lock() usage
> (like in bpf hashtab). Eventually we will deliver the message to users
> without polluting dmesg. Still debating the actual mechanism.
OK; how is the user supposed to handle locking two hash buckets? Does
the BPF prog create some global lock to serialize the multi bucket case?
Anyway, I wonder. Since the verifier tracks all this, it can determine
lock order for the prog. Can't it do what lockdep does and maintain lock
order graph of all loaded BPF programs?
This is load-time overhead, rather than runtime.
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-13 9:59 ` Peter Zijlstra
@ 2025-02-14 2:37 ` Alexei Starovoitov
2025-03-04 10:46 ` Peter Zijlstra
0 siblings, 1 reply; 67+ messages in thread
From: Alexei Starovoitov @ 2025-02-14 2:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Thu, Feb 13, 2025 at 1:59 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Feb 11, 2025 at 10:33:00AM -0800, Alexei Starovoitov wrote:
>
> > Ohh. No unpriv here.
> > Since spectre was discovered unpriv bpf died.
> > BPF_UNPRIV_DEFAULT_OFF=y was the default for distros and
> > all hyperscalers for quite some time.
>
> Ah, okay. Time to remove the option then?
Good point. Indeed.
Will accept the patch if anyone has cycles to prep and test it.
> > > So much details not clear to me and not explained either :/
> >
> > Yes. The plan is to "kill" bpf prog when it misbehaves.
> > But this is orthogonal to this res_spin_lock set which is
> > a building block.
> >
> > > Right, but it might have already modified things, how are you going to
> > > recover from that?
> >
> > Tracking resources acquisition and release by the bpf prog
> > is a normal verifier job.
> > When bpf prog does bpf_rcu_read_lock() the verifier makes sure
> > that all execution paths from there on have bpf_rcu_read_unlock()
> > before program reaches the exit.
> > Same thing with locks.
>
> Ah, okay, this wasn't stated anywhere. This is rather crucial
> information.
This is kinda verifier 101. I don't think it needs to be in the log.
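For anyone outside bpf land reading along, here is a minimal illustration
of that pairing rule (a made-up program, not taken from this series;
assumes <linux/bpf.h> and <bpf/bpf_helpers.h>):

        struct elem {
                struct bpf_spin_lock lock;
                int counter;
        };

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, 1);
                __type(key, int);
                __type(value, struct elem);
        } arr SEC(".maps");

        SEC("tc")
        int paired_lock(void *ctx)
        {
                int key = 0;
                struct elem *e = bpf_map_lookup_elem(&arr, &key);

                if (!e)
                        return 0;
                bpf_spin_lock(&e->lock);
                e->counter++;
                /* dropping this unlock, or returning while the lock is
                 * held, makes the program fail verification
                 */
                bpf_spin_unlock(&e->lock);
                return 0;
        }

        char _license[] SEC("license") = "GPL";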
> > We definitely don't want the bpf core to keep track of acquired resources.
> > That just doesn't scale.
> > There could be rcu_read_locks, all kinds of refcounted objects,
> > locks taken, and so on.
> > The verifier makes sure that the program does the release no matter
> > what the execution path.
> > That's how it scales.
> > On my devserver I have 152 bpf programs running.
> > All of them keep acquiring and releasing resources (locks, sockets,
> > memory) million times a second.
> > The verifier checks that each prog is doing its job individually.
>
> Well, this patch set tracks the held lock stack -- which is required in
> order to do the deadlock thing after all.
Right, but the held lock set is per-cpu, global, and not exhaustive.
It cannot detect 3-lock cycles _by design_.
We rely on the timeout for extreme cases.
> > The bpf infra does static checks only.
> > The core doesn't track objects at run-time.
> > The only exceptions are map elements.
> > bpf prog might store an acquired object in a map.
> > Only in that case bpf infra will free that object when it frees
> > the whole map.
> > But that doesn't apply to short lived things like RCU CS and
> > locks. Those cannot last long. They must complete within single
> > execution of the prog.
>
> Right. Held lock stack is like that.
They're not equivalent and are not used for correctness.
See patch 26 and the res_spin_lock_test_held_lock_max() selftest
that was added specifically to overwhelm this table:
+struct rqspinlock_held {
+ int cnt;
+ void *locks[RES_NR_HELD];
+};
It's an impossible case in reality, but the res_spin_lock
code should be prepared for extreme cases like that,
just like the existing qspinlock has 4 per-cpu qnodes and
a test-and-set fallback for the "if (unlikely(idx >= MAX_NODES))"
case at qspinlock.c:413.
Can it happen in practice? Probably never.
But the code has to be ready to handle it.
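To sketch the idea in kernel-side code (the struct is the one quoted
above; the bound and helper below are illustrative, not the series'
exact implementation):

        #define RES_NR_HELD 32  /* illustrative bound, not the series' value */

        struct rqspinlock_held {
                int cnt;
                void *locks[RES_NR_HELD];
        };

        static DEFINE_PER_CPU(struct rqspinlock_held, rqspinlock_held_locks);

        static __always_inline void grab_held_lock_entry(void *lock)
        {
                struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
                int cnt = rqh->cnt++;

                if (unlikely(cnt >= RES_NR_HELD))
                        return; /* table full: AA/ABBA detection degrades, timeout still applies */
                rqh->locks[cnt] = lock;
        }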
> > > > That was a conscious trade-off. Deadlocks are not normal.
> > >
> > > I really do think you should assume they are normal, unpriv and all
> > > that.
> >
> > No unpriv and no, we don't want deadlocks to be considered normal
> > by bpf users. They need to hear "fix your broken prog" message loud
> > and clear. Patch 14 splat is a step in that direction.
> > Currently it's only for in-kernel res_spin_lock() usage
> > (like in bpf hashtab). Eventually we will deliver the message to users
> > without polluting dmesg. Still debating the actual mechanism.
>
> OK; how is the user supposed to handle locking two hash buckets? Does
> the BPF prog create some global lock to serialize the multi bucket case?
Not following.
Are you talking about patch 19, where we convert the per-bucket
raw_spinlock_t in the bpf hashmap to rqspinlock_t?
Only one bucket lock is held at a time by map update code,
but due to reentrance and crazy kprobes in the wrong places
two bucket locks of a single map can be held on the same cpu.
bpf_prog_A -> bpf_map_update -> res_spin_lock(bucket_A)
  -> kprobe or tracepoint
    -> bpf_prog_B -> bpf_map_update -> res_spin_lock(bucket_B)
and that's why we currently have:
if (__this_cpu_inc_return(*(htab->map_locked[hash])) ...
return -EBUSY;
.. workaround to prevent the most obvious AA deadlock,
but it's not enough.
People were able to hit ABBA.
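To make that reentrance concrete, here is a toy pair of programs
(attach points and map are made up, not taken from a real report;
assumes <linux/bpf.h> and <bpf/bpf_helpers.h>):

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 1024);
                __type(key, __u32);
                __type(value, __u64);
        } shared SEC(".maps");

        SEC("tp/syscalls/sys_enter_write")
        int prog_a(void *ctx)
        {
                __u32 key = 1;  /* lands in bucket_A */
                __u64 val = 42;

                /* bpf_map_update_elem() takes the bucket lock internally */
                bpf_map_update_elem(&shared, &key, &val, BPF_ANY);
                return 0;
        }

        SEC("tp/lock/contention_begin")
        int prog_b(void *ctx)
        {
                __u32 key = 2;  /* may land in bucket_B of the same map */
                __u64 val = 7;

                /* can fire on the same cpu in the middle of prog_a's update,
                 * grabbing a second bucket lock of the same map
                 */
                bpf_map_update_elem(&shared, &key, &val, BPF_ANY);
                return 0;
        }

        char _license[] SEC("license") = "GPL";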
Note, raw_spin_lock today (and res_spin_lock after patch 19) is
used by proper kernel code in kernel/bpf/hashtab.c.
The bpf prog just calls bpf_map_update(), which is a normal
helper call from the verifier's point of view.
The verifier doesn't know whether there are locks inside or not.
The bpf_ktime_get_ns() helper is similar.
The verifier knows that it's safe from NMI,
but it doesn't care what kind of locks are inside.
> Anyway, I wonder. Since the verifier tracks all this, it can determine
> lock order for the prog. Can't it do what lockdep does and maintain lock
> order graph of all loaded BPF programs?
>
> This is load-time overhead, rather than runtime.
I wish it was possible. Locks are dynamic. They protect
dynamically allocated objects, so the order cannot be statically
verified. We pushed the limit of static analysis a lot.
Maybe too much.
For example,
the verifier can statically validate the following code:
struct node_data *n, *m, *o;
struct bpf_rb_node *res, *res2;
// here we allocate an object of type known to the verifier
n = bpf_obj_new(typeof(*n));
if (!n)
return 1;
n->key = 41;
n->data = 42;
// here the verifier knows that the glock spin_lock
// protects the rbtree groot
bpf_spin_lock(&glock);
// here it checks that the lock is held and type of
// objects in rbtree matches the type of 'n'
bpf_rbtree_add(&groot, &n->node, less);
bpf_spin_unlock(&glock);
and all kinds of other more complex stuff,
but it is not enough to cover necessary algorithms.
Here is an example from real code that shows
why we cannot verify two held locks:
struct bpf_vqueue {
struct bpf_spin_lock lock;
int credit;
unsigned long long lasttime;
unsigned int rate;
};
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, ...);
__type(key, int);
__type(value, struct bpf_vqueue);
} vqueue SEC(".maps");
q = bpf_map_lookup_elem(&vqueue, &key);
if (!q)
goto err;
curtime = bpf_ktime_get_ns();
bpf_spin_lock(&q->lock);
q->lasttime = curtime;
q->credit -= ...;
credit = q->credit;
bpf_spin_unlock(&q->lock);
the above is safe, but if there are two lookups:
q1 = bpf_map_lookup_elem(&vqueue, &key1);
q2 = bpf_map_lookup_elem(&vqueue, &key2);
both will point to two different locks,
and since the key is dynamic there is no way to know
the order of q1->lock vs q2->lock.
So we allow only one lock at a time with
bare minimal operations while holding the lock,
but it's not enough to do any meaningful work.
The top feature request is to allow calls
while holding locks (currently they're disallowed,
like above bpf_ktime_get_ns() cannot be done
while holding the lock)
and allow grabbing more than one lock.
That's what res_spin_lock() is achieving.
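To spell out where this is heading, a rough sketch of the intended
usage (the lock field type and kfunc signatures are assumptions based
on this discussion, not copied from patches 22-24; __ksym comes from
<bpf/bpf_helpers.h>, and struct bpf_res_spin_lock is assumed to be
provided by the kernel headers):

        struct resq {
                struct bpf_res_spin_lock lock;  /* assumed prog-visible lock type */
                int credit;
        };

        extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym;
        extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym;

        static int transfer(struct resq *q1, struct resq *q2, int amount)
        {
                if (bpf_res_spin_lock(&q1->lock))
                        return -1;                      /* AA/ABBA/timeout detected */
                if (bpf_res_spin_lock(&q2->lock)) {
                        bpf_res_spin_unlock(&q1->lock); /* verifier insists everything is released */
                        return -1;
                }
                q1->credit -= amount;
                q2->credit += amount;
                bpf_res_spin_unlock(&q2->lock);
                bpf_res_spin_unlock(&q1->lock);
                return 0;
        }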
Having said all that, I think the discussion is diverging into
all-things-bpf instead of focusing on res_spin_lock.
Just to make it clear... there is a patch 18:
F: kernel/bpf/
F: kernel/trace/bpf_trace.c
F: lib/buildid.c
+F: arch/*/include/asm/rqspinlock.h
+F: include/asm-generic/rqspinlock.h
+F: kernel/locking/rqspinlock.c
F: lib/test_bpf.c
F: net/bpf/
that adds maintainer entries to BPF scope.
We're not asking locking experts to maintain this new res_spin_lock.
It's not a generic kernel infra.
It will only be used by bpf infra and by bpf progs.
We will maintain it and we will fix whatever bugs
we introduce.
We can place it in kernel/bpf/rqspinlock.c
to make things more obvious,
but kernel/locking/ feels a bit cleaner.
We're not asking to review patches 14 and higher.
They are presented for completeness.
(Patch 17 was out of order; it will be moved earlier in the series. Sorry about that.)
But we welcome feedback on patches 1-13.
Like the way you spotted broken smp_cond_load_acquire()
on arm64 due to WFE.
That was a great catch. We really appreciate it.
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-02-14 2:37 ` Alexei Starovoitov
@ 2025-03-04 10:46 ` Peter Zijlstra
2025-03-05 3:26 ` Alexei Starovoitov
0 siblings, 1 reply; 67+ messages in thread
From: Peter Zijlstra @ 2025-03-04 10:46 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
The new posting reminded me we had this thread...
On Thu, Feb 13, 2025 at 06:37:05PM -0800, Alexei Starovoitov wrote:
> > > When bpf prog does bpf_rcu_read_lock() the verifier makes sure
> > > that all execution paths from there on have bpf_rcu_read_unlock()
> > > before program reaches the exit.
> > > Same thing with locks.
> >
> > Ah, okay, this wasn't stated anywhere. This is rather crucial
> > information.
>
> This is kinda verifier 101. I don't think it needs to be in the log.
Right, but I didn't take that class. I'm a BPF n00b. Meanwhile you're
asking me to review this :-/
> > OK; how is the user supposed to handle locking two hash buckets? Does
> > the BPF prog create some global lock to serialize the multi bucket case?
>
> Not following.
> Are you talking about patch 19, where we convert the per-bucket
> raw_spinlock_t in the bpf hashmap to rqspinlock_t?
I'm not sure -- see the BPF n00b thing, I don't know how this is
supposed to be used.
Like really; I have absolutely 0 clues.
Anyway; the situation I was thinking of was something along the lines
of: you need data from 2 buckets, so you need to lock 2 buckets, but
since hash-table, there is no sane order, so you need a 3rd lock to
impose order.
But also, see below, you've illustrated this exact case with q1,q2.
> Only one bucket lock is held at a time by map update code,
> but due to reentrance and crazy kprobes in the wrong places
> two bucket locks of a single map can be held on the same cpu.
>
> bpf_prog_A -> bpf_map_update -> res_spin_lock(bucket_A)
>   -> kprobe or tracepoint
>     -> bpf_prog_B -> bpf_map_update -> res_spin_lock(bucket_B)
>
> and that's why we currently have:
> if (__this_cpu_inc_return(*(htab->map_locked[hash])) ...
> return -EBUSY;
>
> .. workaround to prevent the most obvious AA deadlock,
> but it's not enough.
> People were able to hit ABBA.
Right, you can create an arbitrary lock chain with this; chain length is
limited by nesting-depth*nr-cpus or some such.
> > Anyway, I wonder. Since the verifier tracks all this, it can determine
> > lock order for the prog. Can't it do what lockdep does and maintain lock
> > order graph of all loaded BPF programs?
> >
> > This is load-time overhead, rather than runtime.
>
> I wish it was possible. Locks are dynamic. They protect
> dynamically allocated objects, so the order cannot be statically
> verified. We pushed the limit of static analysis a lot.
> Maybe too much.
> For example,
> the verifier can statically validate the following code:
> struct node_data *n, *m, *o;
> struct bpf_rb_node *res, *res2;
>
> // here we allocate an object of type known to the verifier
> n = bpf_obj_new(typeof(*n));
> if (!n)
> return 1;
> n->key = 41;
> n->data = 42;
>
> // here the verifier knows that the glock spin_lock
> // protects the rbtree groot
> bpf_spin_lock(&glock);
>
> // here it checks that the lock is held and type of
> // objects in rbtree matches the type of 'n'
> bpf_rbtree_add(&groot, &n->node, less);
> bpf_spin_unlock(&glock);
>
> and all kinds of other more complex stuff,
> but it is not enough to cover necessary algorithms.
>
> Here is an example from real code that shows
> why we cannot verify two held locks:
>
> struct bpf_vqueue {
> struct bpf_spin_lock lock;
> int credit;
> unsigned long long lasttime;
> unsigned int rate;
> };
>
> struct {
> __uint(type, BPF_MAP_TYPE_HASH);
> __uint(max_entries, ...);
> __type(key, int);
> __type(value, struct bpf_vqueue);
> } vqueue SEC(".maps");
>
> q = bpf_map_lookup_elem(&vqueue, &key);
> if (!q)
> goto err;
> curtime = bpf_ktime_get_ns();
> bpf_spin_lock(&q->lock);
> q->lasttime = curtime;
> q->credit -= ...;
> credit = q->credit;
> bpf_spin_unlock(&q->lock);
>
> the above is safe, but if there are two lookups:
>
> q1 = bpf_map_lookup_elem(&vqueue, &key1);
> q2 = bpf_map_lookup_elem(&vqueue, &key2);
>
> both will point to two different locks,
> and since the key is dynamic there is no way to know
> the order of q1->lock vs q2->lock.
I still feel like I'm missing things, but while they are two dynamic
locks, they are both locks of vqueue object. What lockdep does is
classify locks by initialization site (by default). Same can be done
here, classify per dynamic object.
So verifier can know the above is invalid. Both locks are same class, so
treat as A-A order (trivial case is where q1 and q2 are in fact the same
object since the keys hash the same).
Now, going back to 3rd lock, if instead you write it like:
bpf_spin_lock(&glock);
q1 = bpf_map_lookup_elem(&vqueue, &key1);
q2 = bpf_map_lookup_elem(&vqueue, &key2);
...
bpf_spin_unlock(&glock);
then (assuming q1 != q2) things are fine, since glock will serialize
everybody taking two vqueue locks.
And the above program snippet seems to imply maps are global state, so
you can keep lock graph of maps, such that:
bpf_map_lookup_elem(&map-A, &key-A);
bpf_map_lookup_elem(&map-B, &key-B);
vs
bpf_map_lookup_elem(&map-B, &key-B);
bpf_map_lookup_elem(&map-A, &key-A);
trips AB-BA
> So we allow only one lock at a time with
> bare minimal operations while holding the lock,
> but it's not enough to do any meaningful work.
Yes, I can see that being a problem.
> The top feature request is to allow calls
> while holding locks (currently they're disallowed,
> like above bpf_ktime_get_ns() cannot be done
> while holding the lock)
So bpf_ktime_get_ns() is a trivial example; it is always safe to call,
you can simply whitelist it.
> and allow grabbing more than one lock.
> That's what res_spin_lock() is achieving.
I am not at all sure how res_spin_lock is helping with the q1,q2 thing.
That will trivially result in lock cycles.
And you said any program that would trigger deadlock is invalid.
Therefore the q1,q2 example from above is still invalid and
res_spin_lock has not helped.
> Having said all that, I think the discussion is diverging into
> all-things-bpf instead of focusing on res_spin_lock.
I disagree, all of this is needed to understand res_spin_lock.
From the above, I'm not yet convinced you cannot extend the verifier
with something lockdep-like.
> Just to make it clear... there is a patch 18:
>
> F: kernel/bpf/
> F: kernel/trace/bpf_trace.c
> F: lib/buildid.c
> +F: arch/*/include/asm/rqspinlock.h
> +F: include/asm-generic/rqspinlock.h
> +F: kernel/locking/rqspinlock.c
> F: lib/test_bpf.c
> F: net/bpf/
>
> that adds maintainer entries to BPF scope.
>
> We're not asking locking experts to maintain this new res_spin_lock.
> It's not a generic kernel infra.
> It will only be used by bpf infra and by bpf progs.
> We will maintain it and we will fix whatever bugs
> we introduce.
While that is appreciated, the whole kernel is subject to the worst case
behaviour of this thing. As such, I feel I need to care.
* Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
2025-03-04 10:46 ` Peter Zijlstra
@ 2025-03-05 3:26 ` Alexei Starovoitov
0 siblings, 0 replies; 67+ messages in thread
From: Alexei Starovoitov @ 2025-03-05 3:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kumar Kartikeya Dwivedi, bpf, LKML, Linus Torvalds, Will Deacon,
Waiman Long, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Paul E. McKenney, Tejun Heo,
Barret Rhoden, Josh Don, Dohyun Kim, linux-arm-kernel,
Kernel Team
On Tue, Mar 4, 2025 at 2:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Anyway; the situation I was thinking of was something along the lines
> of: you need data from 2 buckets, so you need to lock 2 buckets, but
> since hash-table, there is no sane order, so you need a 3rd lock to
> impose order.
Not quite. This is a typical request to allow locking two buckets,
and the solution to that is a run-time lock address check.
> > q1 = bpf_map_lookup_elem(&vqueue, &key1);
> > q2 = bpf_map_lookup_elem(&vqueue, &key2);
> >
> > both will point to two different locks,
> > and since the key is dynamic there is no way to know
> > the order of q1->lock vs q2->lock.
>
> I still feel like I'm missing things, but while they are two dynamic
> locks, they are both locks of vqueue object. What lockdep does is
> classify locks by initialization site (by default). Same can be done
> here, classify per dynamic object.
>
> So verifier can know the above is invalid. Both locks are same class, so
> treat as A-A order (trivial case is where q1 and q2 are in fact the same
> object since the keys hash the same).
Sounds like you're saying that the verifier should reject
the case when two locks of the same class, like q1->lock and q2->lock,
need to be taken?
But that is one of the use cases where people requested that we allow
multiple locks.
The typical solution to this is to order locks by address at runtime,
and nf_conntrack_double_lock() in net/netfilter/nf_conntrack_core.c
does exactly that.
        if (lock1 < lock2) {
                spin_lock(lock1); spin_lock(lock2);
        } else {
                spin_lock(lock2); spin_lock(lock1);
        }
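An in-kernel rqspinlock user could do the same thing; a minimal sketch
(assumes rqspinlock_t and the res_spin_lock()/res_spin_unlock() API from
this series, returning 0 on success and an error such as -EDEADLK
otherwise, plus swap() from <linux/minmax.h>):

        static int lock_two_buckets(rqspinlock_t *a, rqspinlock_t *b)
        {
                int ret;

                if (a == b)
                        return res_spin_lock(a);
                if (a > b)
                        swap(a, b);             /* order by address, nf_conntrack style */
                ret = res_spin_lock(a);
                if (ret)
                        return ret;
                ret = res_spin_lock(b);
                if (ret)
                        res_spin_unlock(a);     /* back out on deadlock/timeout */
                return ret;
        }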
> Now, going back to 3rd lock, if instead you write it like:
>
> bpf_spin_lock(&glock);
> q1 = bpf_map_lookup_elem(&vqueue, &key1);
> q2 = bpf_map_lookup_elem(&vqueue, &key2);
> ...
> bpf_spin_unlock(&glock);
>
> then (assuming q1 != q2) things are fine, since glock will serialize
> everybody taking two vqueue locks.
>
> And the above program snippet seems to imply maps are global state, so
Not quite. Some maps are global, but there are dynamic maps too.
That's what map-in-map is for.
> you can keep lock graph of maps, such that:
>
> bpf_map_lookup_elem(&map-A, &key-A);
> bpf_map_lookup_elem(&map-B, &key-B);
>
> vs
>
> bpf_map_lookup_elem(&map-B, &key-B);
> bpf_map_lookup_elem(&map-A, &key-A);
>
> trips AB-BA
If everything were static and the _keys_ were known statically too, then yes,
such analysis by the verifier would be possible.
But both maps and keys are dynamic.
Note, to make sure that the above example doesn't confuse people:
the bpf_map_lookup_elem() lookup itself is completely lockless,
so there is nothing wrong with the above sequence as written.
Only when:
q1 = bpf_map_lookup_elem(&map-A, &key-A);
q2 = bpf_map_lookup_elem(&map-B, &key-B);
if (bpf_res_spin_lock(&q1->lock))
if (bpf_res_spin_lock(&q2->lock))
do deadlocks become a possibility.
Both maps and keys are only known at run-time,
so the locking logic has to do run-time checks too.
> I am not at all sure how res_spin_lock is helping with the q1,q2 thing.
> That will trivially result in lock cycles.
Right, and AA or ABBA will be instantly detected at run-time.
> And you said any program that would trigger deadlock is invalid.
> Therefore the q1,q2 example from above is still invalid and
> res_spin_lock has not helped.
res_spin_lock will do its job and will prevent a deadlock.
As we explained earlier, such a program will be marked as broken
and will be detached/stopped by the bpf infra.
Also, we're talking root privileges here;
none of this is allowed in unpriv.
> > Just to make it clear... there is a patch 18:
> >
> > F: kernel/bpf/
> > F: kernel/trace/bpf_trace.c
> > F: lib/buildid.c
> > +F: arch/*/include/asm/rqspinlock.h
> > +F: include/asm-generic/rqspinlock.h
> > +F: kernel/locking/rqspinlock.c
> > F: lib/test_bpf.c
> > F: net/bpf/
> >
> > that adds maintainer entries to BPF scope.
> >
> > We're not asking locking experts to maintain this new res_spin_lock.
> > It's not a generic kernel infra.
> > It will only be used by bpf infra and by bpf progs.
> > We will maintain it and we will fix whatever bugs
> > we introduce.
>
> While that is appreciated, the whole kernel is subject to the worst case
> behaviour of this thing. As such, I feel I need to care.
Not sure why you're trying to relitigate the years' worth of
discussions around locks in the bpf community.
Static analysis of 2+ locks by the verifier is impossible.
Full lockdep-style lock graph cycle detection is too slow at run-time.
Hence res_spin_lock, with AA and ABBA detection and a timeout as a last resort,
is our solution to real reported bugs.
This res_spin_lock patchset fixes the following syzbot reports:
https://lore.kernel.org/bpf/675302fd.050a0220.2477f.0004.GAE@google.com
https://lore.kernel.org/bpf/000000000000b3e63e061eed3f6b@google.com
https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@mail.gmail.com
https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com
https://lore.kernel.org/lkml/000000000000adb08b061413919e@google.com
It fixes the real issues.
Some of them have hacky workarounds, some are not fixed yet.
More syzbot reports will be fixed in follow ups when we
adopt res_spin_lock in other parts of bpf infra.
Note, all of the above syzbot reports are _not_ using direct
locks inside the bpf programs. All of them hit proper kernel
spin_locks inside bpf infra (like inside map implementations and such).
The verifier cannot do anything here. The syzbot-generated programs
are trivial; they do one bpf_map_update_elem() call or similar.
The trigger is a combination of attaching to tricky tracepoints
like trace_contention_begin or to hooks deep inside the bpf infra.
We already have these workarounds:
CFLAGS_REMOVE_percpu_freelist.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_bpf_lru_list.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_queue_stack_maps.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_lpm_trie.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_ringbuf.o = $(CC_FLAGS_FTRACE)
to prevent recursion anywhere in these files,
but it only helps so much.
So please take a look at patches 1-18 and help us make
sure we implemented the AA, ABBA, and timeout logic without obvious bugs.
I think we did, but extra review would be great.
Thanks!
Thread overview: 67+ messages
2025-02-06 10:54 [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 01/26] locking: Move MCS struct definition to public header Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 02/26] locking: Move common qspinlock helpers to a private header Kumar Kartikeya Dwivedi
2025-02-07 23:21 ` kernel test robot
2025-02-06 10:54 ` [PATCH bpf-next v2 03/26] locking: Allow obtaining result of arch_mcs_spin_lock_contended Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 04/26] locking: Copy out qspinlock.c to rqspinlock.c Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 05/26] rqspinlock: Add rqspinlock.h header Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 06/26] rqspinlock: Drop PV and virtualization support Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 07/26] rqspinlock: Add support for timeouts Kumar Kartikeya Dwivedi
2025-02-10 9:56 ` Peter Zijlstra
2025-02-11 4:55 ` Alexei Starovoitov
2025-02-11 10:11 ` Peter Zijlstra
2025-02-11 18:00 ` Alexei Starovoitov
2025-02-06 10:54 ` [PATCH bpf-next v2 08/26] rqspinlock: Protect pending bit owners from stalls Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 09/26] rqspinlock: Protect waiters in queue " Kumar Kartikeya Dwivedi
2025-02-10 10:17 ` Peter Zijlstra
2025-02-13 6:20 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 10/26] rqspinlock: Protect waiters in trylock fallback " Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 11/26] rqspinlock: Add deadlock detection and recovery Kumar Kartikeya Dwivedi
2025-02-08 1:53 ` Alexei Starovoitov
2025-02-08 3:03 ` Kumar Kartikeya Dwivedi
2025-02-10 10:21 ` Peter Zijlstra
2025-02-13 6:11 ` Kumar Kartikeya Dwivedi
2025-02-10 10:36 ` Peter Zijlstra
2025-02-06 10:54 ` [PATCH bpf-next v2 12/26] rqspinlock: Add a test-and-set fallback Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 13/26] rqspinlock: Add basic support for CONFIG_PARAVIRT Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 14/26] rqspinlock: Add helper to print a splat on timeout or deadlock Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 15/26] rqspinlock: Add macros for rqspinlock usage Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 16/26] rqspinlock: Add locktorture support Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation Kumar Kartikeya Dwivedi
2025-02-08 1:58 ` Alexei Starovoitov
2025-02-08 3:04 ` Kumar Kartikeya Dwivedi
2025-02-10 9:53 ` Peter Zijlstra
2025-02-10 10:03 ` Peter Zijlstra
2025-02-13 6:15 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 18/26] rqspinlock: Add entry to Makefile, MAINTAINERS Kumar Kartikeya Dwivedi
2025-02-07 14:14 ` kernel test robot
2025-02-07 14:45 ` kernel test robot
2025-02-08 0:43 ` kernel test robot
2025-02-06 10:54 ` [PATCH bpf-next v2 19/26] bpf: Convert hashtab.c to rqspinlock Kumar Kartikeya Dwivedi
2025-02-08 2:01 ` Alexei Starovoitov
2025-02-08 3:06 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 20/26] bpf: Convert percpu_freelist.c " Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 21/26] bpf: Convert lpm_trie.c " Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 22/26] bpf: Introduce rqspinlock kfuncs Kumar Kartikeya Dwivedi
2025-02-07 13:43 ` kernel test robot
2025-02-06 10:54 ` [PATCH bpf-next v2 23/26] bpf: Handle allocation failure in acquire_lock_state Kumar Kartikeya Dwivedi
2025-02-08 2:04 ` Alexei Starovoitov
2025-02-06 10:54 ` [PATCH bpf-next v2 24/26] bpf: Implement verifier support for rqspinlock Kumar Kartikeya Dwivedi
2025-02-12 0:08 ` Eduard Zingerman
2025-02-13 6:41 ` Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 25/26] bpf: Maintain FIFO property for rqspinlock unlock Kumar Kartikeya Dwivedi
2025-02-06 10:54 ` [PATCH bpf-next v2 26/26] selftests/bpf: Add tests for rqspinlock Kumar Kartikeya Dwivedi
2025-02-12 0:14 ` Eduard Zingerman
2025-02-13 6:25 ` Kumar Kartikeya Dwivedi
2025-02-10 9:31 ` [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock Peter Zijlstra
2025-02-10 9:38 ` Peter Zijlstra
2025-02-10 10:49 ` Peter Zijlstra
2025-02-11 4:37 ` Alexei Starovoitov
2025-02-11 10:43 ` Peter Zijlstra
2025-02-11 18:33 ` Alexei Starovoitov
2025-02-13 9:59 ` Peter Zijlstra
2025-02-14 2:37 ` Alexei Starovoitov
2025-03-04 10:46 ` Peter Zijlstra
2025-03-05 3:26 ` Alexei Starovoitov
2025-02-10 9:49 ` Peter Zijlstra
2025-02-10 19:16 ` Ankur Arora