[PATCH 0/4] locking/osq_lock: Update osq

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] locking/osq_lock: Update osq_lock to dynamic
@ 2024-09-14  8:53 yongli-oc
  2024-09-14  8:53 ` [PATCH 1/4] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-14  8:53 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

    The dynamic numa-aware osq_lock supports numa architecture based on
    the kernel in-box osq_lock.
    After enable by echo 1 > /proc/zx_numa_lock/dynamic_enable, the patch
    keeps checking how many processes are waiting for a osq lock. If more
    than a threshold, the patch stops these processes and switches to the
    numa-aware osq lock, then restarts. By a cleanup work queue,
    numa-aware osq lock turns back to osq_lock when all nodes unlocked,
    all the numa-aware lock memory returns to the pre-allocated Linux
    kernel memory cache.
    
    The struct optimistic_spin_queue of dynamic numa-aware lock is also
    4 bytes, the same as the in-box osq lock. If enable dynamic switch,
    it will be accessed as three members by union. The tail is tail16,
    2 bytes, supports 65534 cpu cores. The other two members are for lock
    switch state and numa memory index, each 1 byte.
    
    The serial and numa are added to the struct optimistic_spin_node to
    know how many processes is waiting an osq lock. Each process applies
    an osq lock, the serial will add 1. The numa is a condition for
    turning program from osq_lock to numa-aware lock.
    
    We have done some performance evaluation for the dynamic numa-aware
    osq lock by perf, locktorture, unixbench, fxmark etc.
    fxmark: Filesystem Multicore Scalability Benchmark
    https://github.com/sslab-gatech/fxmark
    
    The following results are tested by Zhaoxin KH40000 32 cores processor
    or 32+32 cores, two sockets processor, and AMD EPYC 7551 32-core
    processor, two sockets. Since I do not know well about AMD CPU, the
    code to support AMD CPU is a sample only.
    
    The number under Average represents an average of test five times.
     The CV is the Coefficient of Variation.
    The kernel source code is 6.6.28 stable, compiled in the default
    configuration.
    The 6.6.28-osq-dynamic is the kernel 6.6.28 with the patch and enable
    dynamic switch.
    The OS is Ubuntu 22.04.02 LTS, gcc version 9.5.0.
    
    perf bench     Zhaoxin KH40000 32 cores
    kernel     6.6.28          6.6.28-osq-dynamic
    epoll  Average  CV       Average  CV    Improve
    ADD     25620  0.78%     64609  2.55%  152.18%
    WAIT     7134  1.77%     11098  0.52%   55.56%
    
    locktortue                Zhaoxin KH40000 32 cores
    kernel                 6.6.28            6.6.28-osq-dynamic
    lock torture        Average   CV     Average      CV    Improve
    mutex_lock Writes   7433503  1.59%   17979058   1.90%   141.87%
    
    unixbench             Zhaoxin KH40000 32+32 cores, run on ssd
    64 copys                          6.6.28           6.6.28-osq-dynamic
    System Benchmarks Partial   Average    CV     Average   CV     Improve
         Execl Throughput       1460.18   1.18%   1865.22  0.25%    27.74%
    File Copy 1024 bufsize 200   549.94   0.62%   1221.32  6.71%   122.08%
    File Copy 256 bufsize 500    339.62   2.20%    896.58  6.57%   164.00%
    File Copy 4096 bufsize 800  1173.68   1.88%   2089.7   5.20%    78.05%
         Pipe Throughput       52122.26   0.18%  53842.72  0.15%     3.30%
    Pipe-based Context Switchi 18340.38   0.92%  19874.66  0.80%     8.37%
         Process Creation       2325.12   0.18%   2178.16  0.21%    -6.32%
    Shell Scripts (1 concurren  7414.32   0.29%   8458.5   0.10%    14.08%
    Shell Scripts (16 concurre                                    
    Shell Scripts (8 concurren  7156.48   0.10%   8132.42  0.14%    13.64%
       System Call Overhead     1476.9    0.14%   1574.32  0.09%     6.60%
    System Benchmarks Index Sc  2982.64   0.33%   4008.66  0.94%    34.40%
    
    fxmark                Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
    parallel cores              32                           24
              6.6.28  vs   6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item  Improve      Average,CV:Average,CV         Improv      CV:CV
    DWAL  -0.17% (  455895, 0.14%:  455115, 0.37%)   -0.07% ( 0.42%: 0.44%)
    DWOL   1.10% (32166648, 2.64%:32521877, 2.06%)   -0.68% ( 2.54%: 3.04%)
    DWOM  51.63% (  496955, 4.34%:  753509, 8.32%)   45.93% ( 3.14%: 2.57%)
    DWSL   1.67% (   20229, 2.34%:   20566, 3.18%)   -1.74% ( 1.96%: 2.66%)
    MWRL  71.00% (  348097, 0.92%:  595241, 1.26%)   65.95% ( 0.65%: 2.27%)
    MWRM  63.06% (    6750, 3.33%:   11007, 4.31%)   60.18% ( 5.67%: 4.81%)
    MWCL  16.99% (  149628, 1.66%:  175054, 0.82%)   16.96% ( 2.57%: 0.51%)
    MWCM  80.97% (    9448, 4.66%:   17098, 0.96%)   73.79% ( 5.37%: 1.79%)
    MWUM  37.73% (   16858, 3.13%:   23220, 3.42%)   31.16% ( 3.59%: 1.62%)
    MWUL  12.83% (   45275, 3.90%:   51083, 3.25%)   19.94% ( 4.19%: 1.98%)
    DWTL  41.44% (   85255, 5.01%:  120583, 9.83%)   45.07% ( 6.42%: 6.11%)
    MRPL  -2.63% (11448731, 1.91%:11147179, 4.18%)   -0.56% ( 1.65%: 3.33%)
    MRPM   0.29% ( 5423233, 1.77%: 5438929, 2.59%)  -10.54% (15.85%:16.42%)
    MRPH  -0.49% (  688629, 2.84%:  685266, 2.88%)  -18.99% (15.00%:24.98%)
    MRDM   8.42% ( 3662627, 0.76%: 3971133, 0.45%)    4.53% ( 1.77%: 1.72%)
    MRDL   6.25% (  530518, 2.75%:  563671, 5.33%)   12.43% (25.88%:26.91%)
    DRBH   7.16% (  388144, 7.88%:  415933,17.87%)  -20.61% (29.12%:21.91%)
    DRBM  -4.34% (  381710, 5.51%:  365159, 3.15%)  -16.93% (27.15%:29.85%)
    DRBL  -0.17% (46227341, 2.50%:46147935, 2.89%)   -4.03% ( 4.01%: 5.30%)
    
    fxmark      Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
    parallel cores      2                        1         
    	6.6.28 vs 6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item     Improve    CV:CV          Improve    CV: CV
    DWAL     1.78%  (0.31%:0.20%)      6.36%  (2.52%: 0.67%)
    DWOL     2.46%  (2.26%:2.53%)      1.83%  (2.69%: 3.07%)
    DWOM     2.70%  (2.58%:3.12%)      2.22%  (2.67%: 3.79%)
    DWSL     3.28%  (2.90%:3.38%)      4.41%  (1.32%: 1.36%)
    MWRL    -0.76%  (1.46%:1.94%)     -0.82%  (2.04%: 2.32%)
    MWRM     1.94%  (4.38%:0.89%)     -2.05%  (4.07%: 5.16%)
    MWCL    -0.07%  (1.36%:3.84%)     -2.17%  (1.58%: 3.04%)
    MWCM     1.85%  (2.95%:4.68%)      0.28%  (0.45%: 2.48%)
    MWUM    -2.85%  (1.48%:2.01%)     -3.06%  (1.47%: 1.97%)
    MWUL    -1.46%  (0.58%:2.27%)     -2.98%  (0.71%: 2.11%)
    DWTL     0.40%  (3.89%:4.35%)     -2.68%  (4.04%: 3.15%)
    MRPL     3.11%  (1.38%:0.35%)     -4.81%  (0.32%:16.52%)
    MRPM     2.99%  (0.29%:1.19%)      3.50%  (0.56%: 0.78%)
    MRPH     3.01%  (1.10%:1.42%)      5.06%  (1.18%: 1.73%)
    MRDM    -1.67%  (4.59%:5.58%)     -3.30%  (0.23%: 8.01%)
    MRDL     1.94%  (1.56%:4.39%)     -0.55%  (0.88%: 9.57%)
    DRBH     7.24%  (7.07%:7.10%)      3.36%  (3.30%: 2.95%)
    DRBM     4.40%  (5.11%:0.74%)     -2.55%  (0.46%: 3.28%)
    DRBL     5.50%  (5.58%:0.30%)     -1.00%  (0.71%: 5.21%)
    
    (some tests has more than 10% loss, CV is also more than 10%,
    the result is not stable)
    
    perf bench    AMD EPYC 7551 32-core, 2 sockets
    kernel     6.6.28          6.6.28-osq-dynamic
    epoll  Average  CV       Average  CV    Improve
    ADD     15258  2.30%      62160  2.40%  307.38%
    WAIT     3861  4.20%       6990 16.77%   81.03%
    
    locktortue            AMD EPYC 7551 32-core, 2 sockets
    kernel                 6.6.28            6.6.28-osq-dynamic
    lock torture        Average   CV     Average      CV    Improve
    mutex_lock Writes  10435064  3.14%   22627890   4.92%   116.84%
    
    unixbench     AMD EPYC 7551 32-core, 2 sockets. run on ramdisk
    64 copys                          6.6.28        6.6.28-osq-dynamic
    System Benchmarks Partial   Average   CV      Average   CV   Improve
         Execl Throughput       2677.18  0.90%    3451.76  0.22%  28.93%
    File Copy 1024 bufsize 200   815.2   0.59%    1999.54  0.36% 145.28%
    File Copy 256 bufsize 500    504.6   0.69%    1359.6   0.49% 169.44%
    File Copy 4096 bufsize 800  1842.76  1.24%    3236.48  1.40%  75.63%
         Pipe Throughput       57748.74  0.01%   57539.6   0.03%  -0.36%
    Pipe-based Context Switchi 20882.18  0.57%   20525.38  0.57%  -1.71%
         Process Creation       4523.98  0.20%    4784.98  0.10%   5.77%
    Shell Scripts (1 concurren 13136.54  0.06%   15883.6   0.35%  20.91%
    Shell Scripts (16 concurre                                   
    Shell Scripts (8 concurren 12883.82  0.14%   15640.32  0.20%  21.40%
       System Call Overhead     3533.74  0.04%    3544.16  0.02%   0.29%
    System Benchmarks Index Sc  4809.38  0.23%    6575.44  0.14%  36.72%
    
    fxmark  AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
    parallel cores         64                                32
                6.6.28 vs 6.6.28-osq-dynamic   6.6.28 vs 6.6.28-osq-dynamic
    item  Improve   Average,   CV :  Average,   CV    Improve  CV  :   CV
    DWAL  -0.22% ( 24091112, 0.31%: 24038426, 0.52%)  -0.26% (0.10%: 0.12%)
    DWOL   2.21% ( 86569869, 0.36%: 88479947, 0.27%)   1.99% (0.41%: 0.29%)
    DWOM 210.41% (   425986, 0.77%:  1322320, 0.28%) 128.86% (0.59%: 0.46%)
    DWSL   1.27% ( 70260252, 0.39%: 71149334, 0.37%)   1.19% (0.31%: 0.22%)
    MWRL   0.85% (   489865, 0.22%:   494045, 0.25%)   2.29% (0.12%: 0.33%)
    MWRM  96.28% (   149042, 0.45%:   292540, 3.55%)  60.10% (2.49%: 0.38%)
    MWCL  -5.44% (   772582, 2.92%:   730585, 0.80%)   0.32% (2.41%: 2.56%)
    MWCM  53.89% (   153857, 1.92%:   236774, 0.46%)  23.84% (0.72%: 0.50%)
    MWUM  88.20% (   214551, 3.90%:   403790, 0.41%)  62.81% (0.80%: 1.12%)
    MWUL  -8.26% (   970810, 1.63%:   890615, 1.63%)  -6.73% (3.01%: 1.61%)
    DWTL   5.90% (  5522297, 0.49%:  5847951, 0.18%)   5.03% (0.44%: 0.08%)
    MRPL  -1.10% ( 39707577, 0.07%: 39268812, 0.03%)  -1.30% (0.18%: 0.07%)
    MRPM  -0.63% ( 16446350, 0.47%: 16341936, 0.40%)   0.45% (0.15%: 0.45%)
    MRPH  -0.03% (  3805484, 0.50%:  3804248, 0.12%)   3.02% (1.54%: 0.36%)
    MRDM  49.41% ( 20178742, 1.89%: 30148449, 1.01%)  17.58% (1.19%: 0.85%)
    MRDL  -1.95% (227253170, 0.48%:222825409, 1.34%)  -1.80% (0.32%: 0.54%)
    DRBH   6.01% (  1045587, 1.91%:  1108467, 0.64%)   0.12% (0.13%: 0.30%)
    DRBM   0.65% (117702744, 0.31%:118473408, 0.87%)   1.12% (0.25%: 1.18%)
    DRBL   0.93% (121770444, 0.42%:122905957, 0.25%)   1.59% (0.31%: 0.40%)
    
    fxmark  AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
    parallel cores      2                        1         
    	6.6.28 vs 6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item     Improve    CV :  CV           Improve    CV :  CV
    DWAL    -0.74%   (0.33%:  0.19%)      -1.02%   (0.19%:  0.34%)
    DWOL     1.50%   (0.36%:  0.44%)       1.89%   (0.30%:  0.36%)
    DWOM    -2.00%   (0.73%:  0.38%)       2.43%   (0.35%:  0.29%)
    DWSL     1.03%   (0.34%:  0.54%)       1.18%   (0.46%:  0.61%)
    MWRL     0.93%   (0.39%:  0.18%)       2.25%   (1.28%:  1.78%)
    MWRM    -0.30%   (0.60%:  0.47%)       0.17%   (0.58%:  0.47%)
    MWCL    -1.28%   (0.41%:  0.66%)      -0.38%   (0.19%:  0.44%)
    MWCM    -1.23%   (0.36%:  0.23%)      -1.42%   (0.41%:  0.54%)
    MWUM    -2.28%   (0.57%:  0.75%)      -1.11%   (0.82%:  0.21%)
    MWUL    -1.87%   (0.64%:  0.50%)      -1.75%   (0.58%:  0.65%)
    DWTL     0.36%   (0.09%:  0.12%)       0.19%   (0.09%:  0.09%)
    MRPL    -1.45%   (0.37%:  0.31%)      -1.35%   (0.12%:  0.54%)
    MRPM    -0.58%   (0.30%:  0.11%)      -1.04%   (0.18%:  0.31%)
    MRPH     0.79%   (3.92%:  0.48%)      -0.53%   (0.68%:  0.33%)
    MRDM    -0.55%   (0.93%:  0.44%)      -0.13%   (0.43%:  0.67%)
    MRDL    -0.11%   (0.56%:  0.19%)       0.68%   (0.71%:  0.49%)
    DRBH     0.09%   (1.31%:  0.87%)       2.75%   (0.68%:  0.45%)
    DRBM     1.09%   (0.19%:  1.05%)       1.60%   (0.15%:  0.72%)
    DRBL     3.26%   (1.00%:  0.56%)       2.34%   (0.36%:  0.23%)

    From the test result, when heavy contention, the performance of
    dynamic numa-aware lock is better than the performance of in-box
    osq_lock. If not too many processes apply a lock, the performance
    is nearly the same as the in-box osq_lock.

yongli-oc (4):
  locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
  locking/osq_lock: Turn to dynamic switch function from
    osq_lock/unlock.
  locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock.
  locking/osq_lock: The numa-aware lock memory prepare, assign and
    cleanup.

 include/linux/osq_lock.h     |  33 ++-
 kernel/Kconfig.numalocks     |  17 ++
 kernel/locking/Makefile      |   1 +
 kernel/locking/numa.h        |  98 +++++++
 kernel/locking/numa_osq.h    |  32 +++
 kernel/locking/osq_lock.c    |  62 +++-
 kernel/locking/x_osq_lock.c  | 332 ++++++++++++++++++++++
 kernel/locking/zx_numa.c     | 537 +++++++++++++++++++++++++++++++++++
 kernel/locking/zx_numa_osq.c | 433 ++++++++++++++++++++++++++++
 lib/Kconfig.debug            |   2 +
 10 files changed, 1543 insertions(+), 4 deletions(-)
 create mode 100644 kernel/Kconfig.numalocks
 create mode 100644 kernel/locking/numa.h
 create mode 100644 kernel/locking/numa_osq.h
 create mode 100644 kernel/locking/x_osq_lock.c
 create mode 100644 kernel/locking/zx_numa.c
 create mode 100644 kernel/locking/zx_numa_osq.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
  2024-09-14  8:53 [PATCH 0/4] locking/osq_lock: Update osq_lock to dynamic yongli-oc
@ 2024-09-14  8:53 ` yongli-oc
  2024-09-14  8:53 ` [PATCH 2/4] locking/osq_lock: Turn to dynamic switch function from osq_lock/unlock yongli-oc
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-14  8:53 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

The make menu to choose if compile the dynamic numa-aware osq lock
or not, default is N.

Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
---
 kernel/Kconfig.numalocks | 17 +++++++++++++++++
 lib/Kconfig.debug        |  2 ++
 2 files changed, 19 insertions(+)
 create mode 100644 kernel/Kconfig.numalocks

diff --git a/kernel/Kconfig.numalocks b/kernel/Kconfig.numalocks
new file mode 100644
index 000000000000..feb1751b637c
--- /dev/null
+++ b/kernel/Kconfig.numalocks
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "NUMA Lock Supporting (OSQ)"
+
+config LOCK_SPIN_ON_OWNER_NUMA
+	bool "Enable dynamic numa-aware osq lock"
+	depends on LOCK_SPIN_ON_OWNER && X86_64
+	default n
+	help
+	  According the cpu numa architecture, the numa-aware lock
+	  always unlocks the process waiting on the same numa node
+	  first. It is different from the kernel inbox osq_lock.
+	  The dynamic numa-aware osq lock switchies between osq_lock and
+	  numa-aware lock automatically, according contention level.
+	  Enable: echo 1 > /proc/zx_numa_lock/dynamic_enable.
+
+endmenu #  NUMA Lock Supporting
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 27539c2626bf..eb2314d47420 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1337,6 +1337,8 @@ config DEBUG_PREEMPT
 	  depending on workload as it triggers debugging routines for each
 	  this_cpu operation. It should only be used for debugging purposes.
 
+source "kernel/Kconfig.numalocks"
+
 menu "Lock Debugging (spinlocks, mutexes, etc...)"
 
 config LOCK_DEBUGGING_SUPPORT
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] locking/osq_lock: Turn to dynamic switch function from osq_lock/unlock.
  2024-09-14  8:53 [PATCH 0/4] locking/osq_lock: Update osq_lock to dynamic yongli-oc
  2024-09-14  8:53 ` [PATCH 1/4] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
@ 2024-09-14  8:53 ` yongli-oc
  2024-09-14 16:06   ` Waiman Long
  2024-09-14  8:53 ` [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock yongli-oc
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: yongli-oc @ 2024-09-14  8:53 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

To support numa-aware osq lock, the struct optimistic_spin_queue
is accessed as three members, numa_enable, index, tail16, by union.
The size of the struct is the same as before.
If dynamic numa-ware lock enable, turns to the crossing, x_osq_lock to
check contention level and starts dynamic switch.

Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
---
 include/linux/osq_lock.h  | 33 ++++++++++++++++++++-
 kernel/locking/osq_lock.c | 62 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index ea8fb31379e3..37a7bc4ab530 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -12,14 +12,42 @@ struct optimistic_spin_queue {
 	 * Stores an encoded value of the CPU # of the tail node in the queue.
 	 * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
 	 */
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+	union {
+		atomic_t tail;
+		u32 val;
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u16 tail16;
+			u8 index;
+			u8 numa_enable;
+		};
+#else
+		struct {
+			u8 numa_enable;
+			u8 index;
+			u16 tail16;
+		};
+#endif
+	};
+#else
 	atomic_t tail;
+#endif
 };
 
 #define OSQ_UNLOCKED_VAL (0)
 
 /* Init macro and function. */
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+
+#define OSQ_LOCK_UNLOCKED { .tail = ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
+
+#else
+
 #define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
 
+#endif
+
 static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 {
 	atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
@@ -28,9 +56,12 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+extern bool osq_is_locked(struct optimistic_spin_queue *lock);
+#else
 static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
 {
 	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
 }
-
+#endif
 #endif
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..a7b516939e00 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -2,7 +2,10 @@
 #include <linux/percpu.h>
 #include <linux/sched.h>
 #include <linux/osq_lock.h>
-
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+#include "numa.h"
+#include "numa_osq.h"
+#endif
 /*
  * An MCS like lock especially tailored for optimistic spinning for sleeping
  * lock implementations (mutex, rwsem, etc).
@@ -12,12 +15,34 @@
  * spinning.
  */
 
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
+/*
+ * We use the value 0 to represent "no CPU", thus the encoded value
+ * will be the CPU number incremented by 1.
+ */
+inline int encode_cpu(int cpu_nr)
+{
+	return cpu_nr + 1;
+}
+
+inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
+inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
+{
+	int cpu_nr = encoded_cpu_val - 1;
+
+	return per_cpu_ptr(&osq_node, cpu_nr);
+}
+#else
 struct optimistic_spin_node {
 	struct optimistic_spin_node *next, *prev;
 	int locked; /* 1 if lock acquired */
 	int cpu; /* encoded CPU # + 1 value */
 };
-
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
 
 /*
@@ -40,6 +65,7 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 
 	return per_cpu_ptr(&osq_node, cpu_nr);
 }
+#endif
 
 /*
  * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
@@ -97,6 +123,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	int curr = encode_cpu(smp_processor_id());
 	int old;
 
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+	if (unlikely(enable_zx_numa_osq_lock > 1)) {
+		node->numa = 1;
+		return x_osq_lock(lock);
+	}
+	node->numa = 0;
+#endif
+
 	node->locked = 0;
 	node->next = NULL;
 	node->cpu = curr;
@@ -108,6 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * the lock tail.
 	 */
 	old = atomic_xchg(&lock->tail, curr);
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+	if (enable_zx_numa_osq_lock > 0)
+	//enable means all cpu cores are less tan 65534.
+		old = old & 0xffff;
+#endif
 	if (old == OSQ_UNLOCKED_VAL)
 		return true;
 
@@ -212,6 +251,14 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	struct optimistic_spin_node *node, *next;
 	int curr = encode_cpu(smp_processor_id());
 
+	node = this_cpu_ptr(&osq_node);
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+	if (unlikely(enable_zx_numa_osq_lock > 1 &&
+		node->numa == 1))
+		return x_osq_unlock(lock);
+#endif
+
 	/*
 	 * Fast path for the uncontended case.
 	 */
@@ -222,7 +269,6 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	/*
 	 * Second most likely case.
 	 */
-	node = this_cpu_ptr(&osq_node);
 	next = xchg(&node->next, NULL);
 	if (next) {
 		WRITE_ONCE(next->locked, 1);
@@ -233,3 +279,13 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	if (next)
 		WRITE_ONCE(next->locked, 1);
 }
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+	if (unlikely(enable_zx_numa_osq_lock > 1))
+		return x_osq_is_locked(lock);
+	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
+}
+#endif
+
+
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock.
  2024-09-14  8:53 [PATCH 0/4] locking/osq_lock: Update osq_lock to dynamic yongli-oc
  2024-09-14  8:53 ` [PATCH 1/4] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
  2024-09-14  8:53 ` [PATCH 2/4] locking/osq_lock: Turn to dynamic switch function from osq_lock/unlock yongli-oc
@ 2024-09-14  8:53 ` yongli-oc
  2024-09-14 17:11   ` Waiman Long
  2024-09-15 18:44   ` Uros Bizjak
  2024-09-14  8:53 ` [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup yongli-oc
  2024-09-30  7:23 ` [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic yongli-oc
  4 siblings, 2 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-14  8:53 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

According to the contention level, switches from x_osq_lock to
numa-aware osq_lock.
The numa-aware lock is a two level osq_lock.
The Makefile for dynamic numa-aware osq lock.

Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
---
 kernel/locking/Makefile      |   1 +
 kernel/locking/numa.h        |  98 ++++++++
 kernel/locking/numa_osq.h    |  32 +++
 kernel/locking/x_osq_lock.c  | 332 +++++++++++++++++++++++++++
 kernel/locking/zx_numa_osq.c | 433 +++++++++++++++++++++++++++++++++++
 5 files changed, 896 insertions(+)
 create mode 100644 kernel/locking/numa.h
 create mode 100644 kernel/locking/numa_osq.h
 create mode 100644 kernel/locking/x_osq_lock.c
 create mode 100644 kernel/locking/zx_numa_osq.c

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 0db4093d17b8..00c1d5bb6eb9 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
 obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
+obj-$(CONFIG_LOCK_SPIN_ON_OWNER_NUMA) += x_osq_lock.o zx_numa_osq.o zx_numa.o
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
 obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex_api.o
diff --git a/kernel/locking/numa.h b/kernel/locking/numa.h
new file mode 100644
index 000000000000..01e5aef3257a
--- /dev/null
+++ b/kernel/locking/numa.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NUMA_LOCK_H
+#define __LINUX_NUMA_LOCK_H
+#include "mcs_spinlock.h"
+
+struct optimistic_spin_node {
+	struct optimistic_spin_node *next, *prev;
+	int numa;
+	int locked; /* 1 if lock acquired */
+	int cpu; /* encoded CPU # + 1 value */
+	u32 serial;
+};
+
+
+struct _numa_buf {
+	void *numa_ptr;
+	struct list_head list;
+	u32 lockaddr;
+	u32 highaddr;
+	u8 idle;
+	u8 type;
+	u16 index;
+};
+
+struct cache_padding {
+	char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define CACHE_PADDING(name)	struct cache_padding name
+
+struct _numa_lock {
+	atomic_t tail ____cacheline_aligned_in_smp;
+	atomic_t addr;
+	u8 shift;
+	u8 stopping;
+	u16 numa_nodes;
+	u32 accessed;
+	uint64_t totalaccessed;
+	u32 nodeswitched;
+	atomic_t initlock;
+	atomic_t pending;
+	union {
+		struct mcs_spinlock mcs_node;
+		struct optimistic_spin_node osq_node;
+	};
+	CACHE_PADDING(pad);
+};
+
+struct numa_cpu_info {
+	__u8	x86_model;
+	/* CPU family */
+	__u8	x86;
+	/* CPU vendor */
+	__u8	x86_vendor;
+	__u8	x86_reserved;
+	u32	feature1;
+};
+
+#define NUMAEXPAND 1
+
+#define COHORT_START 1
+#define ACQUIRE_NUMALOCK (UINT_MAX-1)
+#define NODE_WAIT UINT_MAX
+#define LOCK_NUMALOCK 1
+#define UNLOCK_NUMALOCK 0
+
+#define NUMALOCKDYNAMIC 0xff
+#define TURNTONUMAREADY 0xa5a5
+#define NUMATURNBACKREADY 0x5a5a
+
+#define NUMA_LOCKED_VAL 0xf5efef
+#define NUMA_UNLOCKED_VAL 0
+
+#define NUMASTEERMASK 0xf0000000
+#define HIGH32BITMASK 0xffffffff00000000
+#define LOW32MASK 0xffffffff
+
+extern int NUMASHIFT;
+extern int NUMACLUSTERS;
+extern int zx_numa_lock_total;
+extern struct _numa_buf *zx_numa_entry;
+extern atomic_t numa_count;
+extern int enable_zx_numa_osq_lock;
+extern u32 zx_numa_lock;
+extern int dynamic_enable;
+extern struct kmem_cache *zx_numa_lock_cachep;
+
+static inline u32 ptrmask(void *s)
+{
+	return (uint64_t)s & LOW32MASK;
+}
+inline void *get_numa_lock(int index);
+
+int zx_check_numa_dynamic_locked(u32 lockaddr, struct _numa_lock *_numa_lock,
+		int t);
+int zx_numa_lock_ptr_get(void *p);
+void numa_lock_init_data(struct _numa_lock *s, int clusters, u32 lockval,
+		u32 lockaddr);
+#endif
diff --git a/kernel/locking/numa_osq.h b/kernel/locking/numa_osq.h
new file mode 100644
index 000000000000..4c99ad60c4e0
--- /dev/null
+++ b/kernel/locking/numa_osq.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NUMA_OSQ_H
+#define __LINUX_NUMA_OSQ_H
+
+#include <linux/osq_lock.h>
+#include "mcs_spinlock.h"
+
+#define OSQLOCKINITED 0
+#define OSQTONUMADETECT 0x10
+#define OSQLOCKSTOPPING 0xfc
+#define OSQ_LOCKED_VAL 0xffff
+
+extern u16 osq_keep_times;
+extern u16 osq_lock_depth;
+extern int osq_node_max;
+
+inline int encode_cpu(int cpu_nr);
+inline int node_cpu(struct optimistic_spin_node *node);
+inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val);
+
+void zx_osq_lock_stopping(struct optimistic_spin_queue *lock);
+void zx_osq_numa_start(struct optimistic_spin_queue *lock);
+void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock);
+
+bool x_osq_lock(struct optimistic_spin_queue *lock);
+void x_osq_unlock(struct optimistic_spin_queue *lock);
+bool x_osq_is_locked(struct optimistic_spin_queue *lock);
+inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *n);
+inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *n);
+#endif
diff --git a/kernel/locking/x_osq_lock.c b/kernel/locking/x_osq_lock.c
new file mode 100644
index 000000000000..6da8905ae7d6
--- /dev/null
+++ b/kernel/locking/x_osq_lock.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * crossing from osq_lock to numa-aware lock
+ */
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/osq_lock.h>
+#include "numa.h"
+#include "numa_osq.h"
+
+u16 osq_lock_depth = 8;
+u16 osq_keep_times = 32;
+
+/*
+ * An MCS like lock especially tailored for optimistic spinning for sleeping
+ * lock implementations (mutex, rwsem, etc).
+ *
+ * Using a single mcs node per CPU is safe because sleeping locks should not be
+ * called from interrupt context and we have preemption disabled while
+ * spinning.
+ */
+DECLARE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
+
+/*
+ * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
+ * Can return NULL in case we were the last queued and we updated @lock instead.
+ *
+ * If osq_lock() is being cancelled there must be a previous node
+ * and 'old_cpu' is its CPU #.
+ * For osq_unlock() there is never a previous node and old_cpu is
+ * set to OSQ_UNLOCKED_VAL.
+ */
+static inline struct optimistic_spin_node *
+osq_wait_next_stop(struct optimistic_spin_queue *lock,
+	      struct optimistic_spin_node *node,
+	      int old_cpu)
+{
+	u16 curr = encode_cpu(smp_processor_id());
+	u16 old = old_cpu;
+
+	if (lock->numa_enable == OSQLOCKSTOPPING && old == OSQ_UNLOCKED_VAL)
+		old = OSQ_LOCKED_VAL;
+
+	for (;;) {
+		if (READ_ONCE(lock->tail16) == curr &&
+		    cmpxchg(&lock->tail16, curr, old) == curr) {
+
+			/*
+			 * We were the last queued, we moved @lock back. @prev
+			 * will now observe @lock and will complete its
+			 * unlock()/unqueue().
+			 */
+			return NULL;
+		}
+
+		/*
+		 * We must xchg() the @node->next value, because if we were to
+		 * leave it in, a concurrent unlock()/unqueue() from
+		 * @node->next might complete Step-A and think its @prev is
+		 * still valid.
+		 *
+		 * If the concurrent unlock()/unqueue() wins the race, we'll
+		 * wait for either @lock to point to us, through its Step-B, or
+		 * wait for a new @node->next from its Step-C.
+		 */
+		if (node->next) {
+			struct optimistic_spin_node *next;
+
+			next = xchg(&node->next, NULL);
+			if (next)
+				return next;
+		}
+
+		cpu_relax();
+	}
+}
+
+bool x_osq_lock(struct optimistic_spin_queue *lock)
+{
+	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_node *prev, *next;
+	int cpu = smp_processor_id();
+	u16 curr = encode_cpu(cpu);
+	struct optimistic_spin_queue tail;
+	u16 old;
+
+	tail.val = READ_ONCE(lock->val);
+	if (unlikely(tail.numa_enable == OSQLOCKSTOPPING)) {
+		zx_osq_turn_numa_waiting(lock);
+		return x_osq_lock(lock);
+	}
+
+	if (unlikely(tail.numa_enable == NUMALOCKDYNAMIC)) {
+		struct _numa_lock *_numa_lock = NULL;
+		struct _numa_lock *node_lock = NULL;
+
+		_numa_lock = get_numa_lock(tail.index);
+		node_lock = (struct _numa_lock *) _numa_lock +
+					(cpu >> NUMASHIFT);
+
+		prefetch(node_lock);
+		return zx_numa_osq_lock(lock, _numa_lock);
+	}
+
+	node->locked = 0;
+	node->next = NULL;
+	node->cpu = curr;
+
+	/*
+	 * We need both ACQUIRE (pairs with corresponding RELEASE in
+	 * unlock() uncontended, or fastpath) and RELEASE (to publish
+	 * the node fields we just initialised) semantics when updating
+	 * the lock tail.
+	 */
+
+	if (likely(tail.numa_enable >= OSQTONUMADETECT)) {
+		struct optimistic_spin_queue ss;
+
+		while (1) {
+			ss.val = atomic_read(&lock->tail);
+			if (ss.tail16 == OSQ_LOCKED_VAL) {
+				zx_osq_turn_numa_waiting(lock);
+				return x_osq_lock(lock);
+			}
+			if (cmpxchg(&lock->tail16, ss.tail16, curr)
+					== ss.tail16) {
+				old = ss.tail16;
+				break;
+			}
+			cpu_relax();
+		}
+	} else
+		old = xchg(&lock->tail16, curr);
+
+	if (old == OSQ_UNLOCKED_VAL) {
+		node->serial = 1;
+		return true;
+	}
+
+	prev = decode_cpu(old);
+	node->prev = prev;
+
+	node->serial = prev->serial + 1;
+	/*
+	 * osq_lock()			unqueue
+	 *
+	 * node->prev = prev		osq_wait_next()
+	 * WMB				MB
+	 * prev->next = node		next->prev = prev // unqueue-C
+	 *
+	 * Here 'node->prev' and 'next->prev' are the same variable and we need
+	 * to ensure these stores happen in-order to avoid corrupting the list.
+	 */
+	smp_wmb();
+
+	WRITE_ONCE(prev->next, node);
+
+	/*
+	 * Normally @prev is untouchable after the above store; because at that
+	 * moment unlock can proceed and wipe the node element from stack.
+	 *
+	 * However, since our nodes are static per-cpu storage, we're
+	 * guaranteed their existence -- this allows us to apply
+	 * cmpxchg in an attempt to undo our queueing.
+	 */
+
+	/*
+	 * Wait to acquire the lock or cancellation. Note that need_resched()
+	 * will come with an IPI, which will wake smp_cond_load_relaxed() if it
+	 * is implemented with a monitor-wait. vcpu_is_preempted() relies on
+	 * polling, be careful.
+	 */
+	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+				  vcpu_is_preempted(node_cpu(node->prev))))
+		return true;
+
+	/* unqueue */
+	/*
+	 * Step - A  -- stabilize @prev
+	 *
+	 * Undo our @prev->next assignment; this will make @prev's
+	 * unlock()/unqueue() wait for a next pointer since @lock points to us
+	 * (or later).
+	 */
+
+	for (;;) {
+		/*
+		 * cpu_relax() below implies a compiler barrier which would
+		 * prevent this comparison being optimized away.
+		 */
+		if (data_race(prev->next) == node &&
+		    cmpxchg(&prev->next, node, NULL) == node)
+			break;
+
+		/*
+		 * We can only fail the cmpxchg() racing against an unlock(),
+		 * in which case we should observe @node->locked becoming
+		 * true.
+		 */
+		if (smp_load_acquire(&node->locked))
+			return true;
+
+		cpu_relax();
+
+		/*
+		 * Or we race against a concurrent unqueue()'s step-B, in which
+		 * case its step-C will write us a new @node->prev pointer.
+		 */
+		prev = READ_ONCE(node->prev);
+	}
+
+	/*
+	 * Step - B -- stabilize @next
+	 *
+	 * Similar to unlock(), wait for @node->next or move @lock from @node
+	 * back to @prev.
+	 */
+
+	next = osq_wait_next_stop(lock, node, prev->cpu);
+	if (!next)
+		return false;
+
+	/*
+	 * Step - C -- unlink
+	 *
+	 * @prev is stable because its still waiting for a new @prev->next
+	 * pointer, @next is stable because our @node->next pointer is NULL and
+	 * it will wait in Step-A.
+	 */
+
+	WRITE_ONCE(next->prev, prev);
+	WRITE_ONCE(prev->next, next);
+
+	return false;
+}
+
+
+
+void x_osq_unlock(struct optimistic_spin_queue *lock)
+{
+	struct optimistic_spin_node *node, *next;
+	int threadshold = osq_lock_depth;
+	int cpu = smp_processor_id();
+	u16 curr = encode_cpu(cpu);
+	int depth = 0;
+	u32 count = 0;
+
+	if (unlikely(lock->numa_enable == NUMALOCKDYNAMIC)) {
+		struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
+
+		prefetch((struct _numa_lock *) _numa_lock + (cpu >> NUMASHIFT));
+		return zx_numa_osq_unlock(lock, _numa_lock);
+	}
+	/*
+	 * Fast path for the uncontended case.
+	 */
+	if (unlikely(lock->numa_enable == OSQTONUMADETECT)) {
+		struct optimistic_spin_node *node_last = NULL;
+		u16 tail = 0;
+
+		tail = cmpxchg(&lock->tail16, curr, OSQ_UNLOCKED_VAL);
+		if (tail == curr)
+			return;
+
+		node = this_cpu_ptr(&osq_node);
+		node_last = decode_cpu(tail);
+		depth = node_last->serial - node->serial;
+		count = READ_ONCE(node->locked);
+		if (count > osq_keep_times && (dynamic_enable & 0x1))
+			zx_osq_lock_stopping(lock);
+	} else if (unlikely(lock->numa_enable == OSQLOCKSTOPPING)) {
+		if (cmpxchg(&lock->tail16, curr, OSQ_LOCKED_VAL)
+					== curr) {
+			zx_osq_numa_start(lock);
+			return;
+		}
+	} else {
+		struct optimistic_spin_queue t;
+
+		t.val = 0;
+		if (dynamic_enable & 0x1) {
+			if (atomic_read(&numa_count) < zx_numa_lock_total)
+				t.numa_enable = OSQTONUMADETECT;
+		}
+		if (t.numa_enable == OSQTONUMADETECT) {
+			if (atomic_cmpxchg_release(&lock->tail, curr,
+				(t.val | OSQ_UNLOCKED_VAL)) == curr)
+				return;
+		} else if (cmpxchg(&lock->tail16, curr,
+				OSQ_UNLOCKED_VAL) == curr)
+			return;
+	}
+
+	/*
+	 * Second most likely case.
+	 */
+	node = this_cpu_ptr(&osq_node);
+	next = xchg(&node->next, NULL);
+	if (next) {
+		if (depth > threadshold)
+			WRITE_ONCE(next->locked, count + 1);
+		else
+			WRITE_ONCE(next->locked, 1);
+		return;
+	}
+
+	next = osq_wait_next_stop(lock, node, OSQ_UNLOCKED_VAL);
+	if (next) {
+		if (depth > threadshold)
+			WRITE_ONCE(next->locked, count + 1);
+		else
+			WRITE_ONCE(next->locked, 1);
+	}
+}
+
+bool x_osq_is_locked(struct optimistic_spin_queue *lock)
+{
+	struct optimistic_spin_queue val;
+
+	val.val = atomic_read(&lock->tail);
+	if (val.tail16 == OSQ_UNLOCKED_VAL)
+		return false;
+
+	if (val.tail16 == OSQ_LOCKED_VAL) {
+		if (val.numa_enable != NUMALOCKDYNAMIC)
+			return true;
+		return zx_check_numa_dynamic_locked(ptrmask(lock),
+					get_numa_lock(val.index), 0);
+	}
+
+	return true;
+}
diff --git a/kernel/locking/zx_numa_osq.c b/kernel/locking/zx_numa_osq.c
new file mode 100644
index 000000000000..f4c3dba208d3
--- /dev/null
+++ b/kernel/locking/zx_numa_osq.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Dynamic numa-aware osq lock
+ * Author: LiYong <yongli-oc@zhaoxin.com>
+ *
+ */
+#include <linux/cpumask.h>
+#include <asm/byteorder.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/osq_lock.h>
+#include "numa.h"
+#include "numa_osq.h"
+
+int osq_node_max = 256;
+
+/*
+ * The pending bit spinning loop count.
+ * This heuristic is used to limit the number of lockword accesses
+ * made by atomic_cond_read_relaxed when waiting for the lock to
+ * transition out of the "== _Q_PENDING_VAL" state. We don't spin
+ * indefinitely because there's no guarantee that we'll make forward
+ * progress.
+ */
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_cpu_node);
+
+/*
+ * We use the value 0 to represent "no CPU", thus the encoded value
+ * will be the CPU number incremented by 1.
+ */
+static inline int decode(int cpu_nr)
+{
+	return cpu_nr - 1;
+}
+
+static inline struct optimistic_spin_node *decode_curr(int encoded_cpu_val)
+{
+	int cpu_nr = decode(encoded_cpu_val);
+
+	return per_cpu_ptr(&osq_cpu_node, cpu_nr);
+}
+
+static int atomic64_cmpxchg_notequal(void *qslock, atomic_t *tail, int curr)
+{
+	u64 ss = 0;
+	u32 addr = ptrmask(qslock);
+	u64 addrcurr = (((u64)addr) << 32) | curr;
+
+	while (1) {
+		ss = atomic64_read((atomic64_t *) tail);
+		if ((ss >> 32) != addr)
+			return NUMA_LOCKED_VAL;
+		if ((ss & LOW32MASK) == NUMA_LOCKED_VAL)
+			return NUMA_LOCKED_VAL;
+		if (atomic64_cmpxchg((atomic64_t *) tail, ss, addrcurr) == ss)
+			return ss & LOW32MASK;
+		cpu_relax();
+	}
+}
+
+void zx_osq_lock_stopping(struct optimistic_spin_queue *lock)
+{
+	int s = 0;
+
+	s = zx_numa_lock_ptr_get(lock);
+	if (s < zx_numa_lock_total) {
+		numa_lock_init_data(zx_numa_entry[s].numa_ptr,
+			NUMACLUSTERS, NUMA_UNLOCKED_VAL,
+			ptrmask(lock));
+
+		WRITE_ONCE(lock->index, s);
+		zx_numa_entry[s].type = 1;
+		smp_mb();/*should set these before enable*/
+		prefetchw(&lock->numa_enable);
+		WRITE_ONCE(lock->numa_enable, OSQLOCKSTOPPING);
+	} else {
+		prefetchw(&lock->numa_enable);
+		WRITE_ONCE(lock->numa_enable, OSQLOCKINITED);
+	}
+}
+
+void zx_osq_numa_start(struct optimistic_spin_queue *lock)
+{
+	struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
+
+	prefetchw(&lock->numa_enable);
+	WRITE_ONCE(lock->numa_enable, NUMALOCKDYNAMIC);
+	smp_mb(); /*should keep lock->numa_enable modified first*/
+	atomic_set(&_numa_lock->initlock, TURNTONUMAREADY);
+}
+
+
+void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock)
+{
+	struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
+
+	atomic_inc(&_numa_lock->pending);
+	while (1) {
+		int s = atomic_read(&_numa_lock->initlock);
+
+		if (s == TURNTONUMAREADY)
+			break;
+		cpu_relax();
+		cpu_relax();
+		cpu_relax();
+		cpu_relax();
+
+	}
+	atomic_dec(&_numa_lock->pending);
+}
+
+
+
+
+static struct optimistic_spin_node *
+zx_numa_osq_wait_next(struct _numa_lock *lock,
+		struct optimistic_spin_node *node,
+		struct optimistic_spin_node *prev, int cpu)
+{
+	struct optimistic_spin_node *next = NULL;
+	int curr = encode_cpu(cpu);
+	int old;
+
+	old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
+	for (;;) {
+		if (atomic_read(&lock->tail) == curr &&
+		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
+
+			break;
+		}
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next)
+				break;
+		}
+		cpu_relax();
+	}
+	return next;
+}
+static void zx_numa_turn_osq_waiting(struct optimistic_spin_queue *lock,
+			struct _numa_lock *_numa_lock)
+{
+	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
+	int lockaddr = ptrmask(lock);
+	u64 s = 0;
+	struct optimistic_spin_queue tail;
+
+	tail.numa_enable = NUMALOCKDYNAMIC;
+	tail.index = lock->index;
+	tail.tail16 = OSQ_LOCKED_VAL;
+	while (1) {
+		cpu_relax(); cpu_relax(); cpu_relax(); cpu_relax();
+		s = atomic64_read((atomic64_t *) &numa_lock->tail);
+		if ((s >> 32) != lockaddr)
+			break;
+		if ((s & LOW32MASK) == NUMA_LOCKED_VAL)
+			break;
+	}
+	prefetchw(&lock->tail);
+	if (atomic_cmpxchg(&lock->tail, tail.val, OSQ_UNLOCKED_VAL)
+			== tail.val) {
+		;
+	}
+
+}
+
+static int _zx_node_osq_lock_internal(struct optimistic_spin_queue *qslock,
+	struct optimistic_spin_node *node, struct optimistic_spin_node *prev,
+	struct _numa_lock *node_lock, int cpu, int *cur_status)
+{
+	struct optimistic_spin_node *next = NULL;
+
+	for (;;) {
+		struct optimistic_spin_node *node_prev = NULL;
+
+		if (prev->next == node &&
+		    cmpxchg(&prev->next, node, NULL) == node) {
+			break;
+		}
+		/*load locked first each time*/
+		*cur_status = smp_load_acquire(&node->locked);
+
+		if (*cur_status != NODE_WAIT)
+			return 0; //goto NODE_UNLOCK;
+
+		cpu_relax();
+		node_prev = READ_ONCE(node->prev);
+		if (node_prev != prev)
+			prev = node_prev;
+	}
+
+	next = zx_numa_osq_wait_next(node_lock, node, prev, cpu);
+	if (!next)
+		return -1;
+
+	WRITE_ONCE(next->prev, prev);
+	WRITE_ONCE(prev->next, next);
+
+	return -1;
+}
+
+static int _zx_node_osq_lock(struct optimistic_spin_queue *qslock,
+					struct _numa_lock *_numa_lock)
+{
+	struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
+	struct optimistic_spin_node *prev = NULL;
+	int cpu = smp_processor_id();
+	int curr = encode_cpu(cpu);
+	int numa = cpu >> _numa_lock->shift;
+	struct _numa_lock *node_lock = _numa_lock + numa;
+	int cur_status = 0;
+	int old = 0;
+
+	node->locked = NODE_WAIT;
+	node->next = NULL;
+	node->cpu = curr;
+
+	old = atomic64_cmpxchg_notequal(qslock, &node_lock->tail, curr);
+
+	if (old == NUMA_LOCKED_VAL) {
+		bool s = true;
+
+		zx_numa_turn_osq_waiting(qslock, _numa_lock);
+		s = osq_lock(qslock);
+		if (s == true)
+			return 1;
+		else
+			return -1;
+	}
+
+	if (old == 0) {
+		node->locked = COHORT_START;
+		return ACQUIRE_NUMALOCK;
+	}
+
+	prev = decode_curr(old);
+	node->prev = prev;
+
+	smp_mb(); /* make sure node set before set pre->next */
+
+	WRITE_ONCE(prev->next, node);
+
+	while ((cur_status = READ_ONCE(node->locked)) == NODE_WAIT) {
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) {
+			int ddd = _zx_node_osq_lock_internal(qslock, node, prev,
+						node_lock, cpu, &cur_status);
+
+			if (cur_status != NODE_WAIT)
+				goto NODE_UNLOCK;
+			if (ddd == -1)
+				return -1;
+		}
+		cpu_relax();
+	}
+NODE_UNLOCK:
+	if (cur_status == ACQUIRE_NUMALOCK)
+		node->locked = COHORT_START;
+	return cur_status;
+}
+static int _zx_numa_osq_lock(struct optimistic_spin_queue *qslock, int cpu,
+				struct _numa_lock *_numa_lock)
+{
+	int numacpu = cpu >> _numa_lock->shift;
+	int numacurr = encode_cpu(numacpu);
+
+	struct optimistic_spin_node *node = &(_numa_lock + numacpu)->osq_node;
+	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
+	struct optimistic_spin_node *prevnode = NULL;
+	int prev = 0;
+
+	node->next = NULL;
+	node->locked = LOCK_NUMALOCK;
+	node->cpu = numacurr;
+
+	prev = atomic_xchg(&numa_lock->tail, numacurr);
+	if (prev == 0) {
+		node->locked = UNLOCK_NUMALOCK;
+		return 0;
+	}
+
+	prevnode = &(_numa_lock + prev - 1)->osq_node;
+	node->prev = prevnode;
+	smp_mb(); /*node->prev should be set before next*/
+	WRITE_ONCE(prevnode->next, node);
+
+	while (READ_ONCE(node->locked) == LOCK_NUMALOCK) {
+		cpu_relax();
+		cpu_relax();
+		cpu_relax();
+		cpu_relax();
+	}
+	return 0;
+}
+inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *_numa_lock)
+{
+	struct _numa_lock *node_lock = NULL;
+	int cpu = smp_processor_id();
+	int numa = cpu >> _numa_lock->shift;
+	int status = 0;
+
+	node_lock = _numa_lock + numa;
+
+	if (node_lock->stopping) {
+		zx_numa_turn_osq_waiting(qslock, _numa_lock);
+		return osq_lock(qslock);
+	}
+
+	status = _zx_node_osq_lock(qslock, _numa_lock);
+	if (status == ACQUIRE_NUMALOCK)
+		status = _zx_numa_osq_lock(qslock, smp_processor_id(),
+				_numa_lock);
+
+	if (status == -1)
+		return false;
+	return true;
+}
+
+static int atomic64_checktail_osq(struct optimistic_spin_queue *qslock,
+	struct _numa_lock *node_lock, int ctail)
+{
+	u64 addr = ((u64)ptrmask(qslock)) << 32;
+	u64 addrtail = addr | ctail;
+	u64 ss = 0;
+	bool mark;
+
+	ss = atomic64_read((atomic64_t *) &node_lock->tail);
+	if (node_lock->stopping == 0)
+		mark = (ss == addrtail &&
+			atomic64_cmpxchg_acquire(
+				(atomic64_t *) &node_lock->tail,
+				addrtail, addr|NUMA_UNLOCKED_VAL) == addrtail);
+	else
+		mark = (ss == addrtail &&
+			atomic64_cmpxchg_acquire(
+				(atomic64_t *) &node_lock->tail,
+				addrtail, NUMA_LOCKED_VAL) == addrtail);
+	return mark;
+}
+
+static void node_lock_release(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *node_lock, struct optimistic_spin_node *node,
+		int val, int cpu, int numa_end)
+{
+	struct optimistic_spin_node *next = NULL;
+	int curr = encode_cpu(cpu);
+
+	while (1) {
+		if (atomic64_checktail_osq(qslock, node_lock, curr)) {
+			if (qslock->numa_enable == NUMALOCKDYNAMIC) {
+				int index = qslock->index;
+
+				if (numa_end == OSQ_UNLOCKED_VAL &&
+					zx_numa_entry[index].idle == 0) {
+					cmpxchg(&zx_numa_entry[index].idle,
+							0, 1);
+				}
+			}
+			return;
+		}
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next) {
+				WRITE_ONCE(next->locked, val);
+				return;
+			}
+		}
+		cpu_relax();
+	}
+}
+
+static int numa_lock_release(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *numa_lock,
+		struct optimistic_spin_node *node, int cpu)
+{
+	struct optimistic_spin_node *next = NULL;
+	int curr = cpu >> numa_lock->shift;
+	int numacurr = encode_cpu(curr);
+
+	while (1) {
+		if (atomic_read(&numa_lock->tail) == numacurr &&
+		    atomic_cmpxchg_acquire(&numa_lock->tail, numacurr,
+					   OSQ_UNLOCKED_VAL) == numacurr) {
+			return OSQ_UNLOCKED_VAL;
+		}
+
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next) {
+				WRITE_ONCE(next->locked, UNLOCK_NUMALOCK);
+				return 1;
+			}
+		}
+		cpu_relax();
+	}
+}
+
+inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
+		 struct _numa_lock *_numa_lock)
+{
+	u32 cpu =  smp_processor_id();
+	struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
+	int numa = cpu >> _numa_lock->shift;
+	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
+	struct _numa_lock *node_lock = _numa_lock + numa;
+	struct optimistic_spin_node *numa_node =
+						&(_numa_lock + numa)->osq_node;
+	struct optimistic_spin_node *next = NULL;
+	int cur_count = 0;
+	int numa_end = 0;
+
+	cur_count = READ_ONCE(node->locked);
+
+	if (cur_count >= osq_node_max - 1) {
+		numa_end = numa_lock_release(qslock,
+				numa_lock, numa_node, cpu);
+		node_lock_release(qslock, node_lock, node,
+				ACQUIRE_NUMALOCK, cpu, numa_end);
+		return;
+	}
+
+	next = xchg(&node->next, NULL);
+	if (next) {
+		WRITE_ONCE(next->locked, cur_count + 1);
+		return;
+	}
+
+	numa_end = numa_lock_release(qslock, numa_lock, numa_node, cpu);
+	node_lock_release(qslock, node_lock, node, ACQUIRE_NUMALOCK,
+			cpu, numa_end);
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup.
  2024-09-14  8:53 [PATCH 0/4] locking/osq_lock: Update osq_lock to dynamic yongli-oc
                   ` (2 preceding siblings ...)
  2024-09-14  8:53 ` [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock yongli-oc
@ 2024-09-14  8:53 ` yongli-oc
  2024-09-14 17:21   ` Waiman Long
  2024-09-15 16:43   ` kernel test robot
  2024-09-30  7:23 ` [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic yongli-oc
  4 siblings, 2 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-14  8:53 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

The numa-aware lock kernel memory cache preparation, and a
workqueue to turn numa-aware lock back to osq lock.
The /proc interface. Enable dynamic switch by
echo 1 > /proc/zx_numa_lock/dynamic_enable

Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
---
 kernel/locking/zx_numa.c | 537 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 537 insertions(+)
 create mode 100644 kernel/locking/zx_numa.c

diff --git a/kernel/locking/zx_numa.c b/kernel/locking/zx_numa.c
new file mode 100644
index 000000000000..89df6670a024
--- /dev/null
+++ b/kernel/locking/zx_numa.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Dynamic numa-aware osq lock
+ * Crossing from numa-aware lock to osq_lock
+ * Numa lock memory initialize and /proc interface
+ * Author: LiYong <yongli-oc@zhaoxin.com>
+ *
+ */
+#include <linux/cpumask.h>
+#include <asm/byteorder.h>
+#include <asm/kvm_para.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/osq_lock.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/uaccess.h>
+#include <linux/reboot.h>
+
+#include "numa.h"
+#include "numa_osq.h"
+
+int enable_zx_numa_osq_lock;
+struct delayed_work zx_numa_start_work;
+struct delayed_work zx_numa_cleanup_work;
+
+atomic_t numa_count;
+struct _numa_buf *zx_numa_entry;
+int zx_numa_lock_total = 256;
+LIST_HEAD(_zx_numa_head);
+LIST_HEAD(_zx_numa_lock_head);
+
+struct kmem_cache *zx_numa_entry_cachep;
+struct kmem_cache *zx_numa_lock_cachep;
+int NUMASHIFT;
+int NUMACLUSTERS;
+static atomic_t lockindex;
+int dynamic_enable;
+
+static const struct numa_cpu_info numa_cpu_list[] = {
+	/*feature1=1, a numa node includes two clusters*/
+	//{1, 23, X86_VENDOR_AMD, 0, 1},
+	{0x5b, 7, X86_VENDOR_CENTAUR, 0, 1},
+	{0x5b, 7, X86_VENDOR_ZHAOXIN, 0, 1}
+};
+
+inline void *get_numa_lock(int index)
+{
+	if (index >= 0 && index < zx_numa_lock_total)
+		return zx_numa_entry[index].numa_ptr;
+	else
+		return NULL;
+}
+
+static int zx_get_numa_shift(int all_cpus, int clusters)
+{
+	int cpus = (int) all_cpus/clusters;
+	int count = 0;
+
+	while (cpus) {
+		cpus >>= 1;
+		count++;
+	}
+	return count-1;
+}
+
+void numa_lock_init_data(struct _numa_lock *s, int clusters,
+			u32 lockval, u32 lockaddr)
+{
+	int j = 0;
+
+	for (j = 0; j < clusters + NUMAEXPAND; j++) {
+		atomic_set(&(s + j)->tail, lockval);
+		atomic_set(&(s + j)->addr, lockaddr);
+		(s + j)->shift = NUMASHIFT;
+		(s + j)->stopping = 0;
+		(s + j)->numa_nodes = clusters;
+		(s + j)->accessed = 0;
+		(s + j)->totalaccessed = 0;
+		(s + j)->nodeswitched = 0;
+		atomic_set(&(s + j)->initlock, 0);
+		atomic_set(&(s + j)->pending, 0);
+	}
+}
+
+int zx_numa_lock_ptr_get(void *p)
+{
+	int i = 0;
+	int index = 0;
+
+	if (atomic_read(&numa_count) >= zx_numa_lock_total)
+		return zx_numa_lock_total;
+
+	index = atomic_inc_return(&lockindex);
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		if (index >= zx_numa_lock_total)
+			index = 0;
+		if (cmpxchg(&zx_numa_entry[index].lockaddr,
+					0, ptrmask(p)) == 0) {
+			while (1) {
+				struct _numa_lock *node_lock =
+					zx_numa_entry[index].numa_ptr;
+				struct _numa_lock *numa_lock = node_lock +
+						node_lock->numa_nodes;
+
+				if (atomic_read(&numa_lock->tail) ==
+								NUMA_LOCKED_VAL)
+					break;
+				cpu_relax();
+
+			}
+			atomic_inc(&numa_count);
+			zx_numa_entry[index].highaddr = ((u64)p) >> 32;
+			atomic_set(&lockindex, index);
+			return index;
+		}
+		index++;
+		if (atomic_read(&numa_count) >= zx_numa_lock_total)
+			break;
+	}
+	return zx_numa_lock_total;
+}
+
+int zx_check_numa_dynamic_locked(u32 lockaddr,
+		struct _numa_lock *_numa_lock, int t)
+{
+	struct _numa_lock *node_lock = NULL;
+	u64 s = -1;
+	int i = 0;
+
+	if (atomic_read(&_numa_lock->pending) != 0)
+		return 1;
+
+	for (i = 0; i < _numa_lock->numa_nodes + 1; i++) {
+		node_lock = _numa_lock + i;
+		cpu_relax(); cpu_relax(); cpu_relax(); cpu_relax();
+		s = atomic64_read((atomic64_t *) &node_lock->tail);
+		if ((s >> 32) != lockaddr)
+			continue;
+		if ((s & LOW32MASK) == NUMA_LOCKED_VAL
+				|| (s & LOW32MASK) == NUMA_UNLOCKED_VAL)
+			continue;
+		break;
+	}
+
+	if (i == _numa_lock->numa_nodes + 1)
+		return 0;
+	return i+1;
+}
+
+static int zx_numa_lock64_try_to_freeze(u32 lockaddr, struct _numa_lock *_numa_lock,
+			int index)
+{
+	struct _numa_lock *node_lock = NULL;
+	u64 addr = ((u64)lockaddr) << 32;
+	u64 s = 0;
+	u64 ff = 0;
+	int i = 0;
+
+	for (i = 0; i < _numa_lock->numa_nodes+1; i++) {
+		node_lock = _numa_lock + i;
+		cpu_relax();
+
+		s = atomic64_read((atomic64_t *)&node_lock->tail);
+		if ((s & HIGH32BITMASK) != addr)
+			continue;
+
+		if ((s & LOW32MASK) == NUMA_LOCKED_VAL)
+			continue;
+
+		if ((s & LOW32MASK) == NUMA_UNLOCKED_VAL) {
+			ff = atomic64_cmpxchg((atomic64_t *)&node_lock->tail,
+				(addr|NUMA_UNLOCKED_VAL), NUMA_LOCKED_VAL);
+			if (ff == (addr|NUMA_UNLOCKED_VAL))
+				continue;
+		}
+		break;
+	}
+
+	if (i == _numa_lock->numa_nodes + 1) {
+		zx_numa_entry[index].idle = 0;
+		zx_numa_entry[index].type = 0;
+		zx_numa_entry[index].highaddr = 0;
+		xchg(&zx_numa_entry[index].lockaddr, 0);
+	}
+
+	return i;
+}
+
+static void zx_numa_lock_stopping(struct _numa_lock *_numa_lock)
+{
+	struct _numa_lock *node_lock = NULL;
+	int i = 0;
+
+	for (i = 0; i < _numa_lock->numa_nodes+1; i++) {
+		node_lock = _numa_lock + i;
+		WRITE_ONCE(node_lock->stopping, 1);
+	}
+}
+
+static void zx_numa_cleanup(struct work_struct *work)
+{
+	int i = 0;
+	int checktimes = 2;
+
+	//reboot or power off state
+	if (READ_ONCE(enable_zx_numa_osq_lock) == 0xf)
+		return;
+
+	if (atomic_read(&numa_count) == 0) {
+		if (READ_ONCE(dynamic_enable) != 0)
+			schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
+		return;
+	}
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		int s = 0;
+		u32 lockaddr = READ_ONCE(zx_numa_entry[i].lockaddr);
+		u32 type = zx_numa_entry[i].type;
+		struct _numa_lock *buf =  zx_numa_entry[i].numa_ptr;
+		int nodes = 0;
+
+		if (lockaddr == 0 || type == 3 || zx_numa_entry[i].idle == 0)
+			continue;
+		nodes = buf->numa_nodes;
+		if (zx_numa_entry[i].idle < checktimes) {
+
+			s = zx_check_numa_dynamic_locked(lockaddr, buf, 1);
+			if (s != 0) {
+				zx_numa_entry[i].idle = 1;
+				continue;
+			}
+			zx_numa_entry[i].idle++;
+		}
+
+		if (zx_numa_entry[i].idle == checktimes) {
+			zx_numa_lock_stopping(buf);
+			zx_numa_entry[i].idle++;
+
+		}
+
+		if (zx_numa_entry[i].idle == checktimes+1) {
+			while (1) {
+				if (zx_numa_lock64_try_to_freeze(lockaddr, buf,
+						i) == nodes + 1) {
+					//all node has been locked
+					u32 left = 0;
+
+					left = atomic_dec_return(&numa_count);
+					break;
+				}
+				cpu_relax(); cpu_relax();
+				cpu_relax(); cpu_relax();
+			}
+		}
+	}
+	schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
+}
+
+static int create_numa_buffer_list(int clusters, int len)
+{
+	int i = 0;
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		struct _numa_lock *s = (struct _numa_lock *)kmem_cache_alloc(
+				zx_numa_lock_cachep, GFP_KERNEL);
+		if (!s) {
+			while (i > 0) {
+				kmem_cache_free(zx_numa_lock_cachep,
+						zx_numa_entry[i-1].numa_ptr);
+				i--;
+			}
+			return 0;
+		}
+		memset((char *)s, 0,
+			len * L1_CACHE_BYTES * (clusters + NUMAEXPAND));
+		numa_lock_init_data(s, clusters, NUMA_LOCKED_VAL, 0);
+		zx_numa_entry[i].numa_ptr = s;
+		zx_numa_entry[i].lockaddr = 0;
+		zx_numa_entry[i].highaddr = 0;
+		zx_numa_entry[i].idle = 0;
+		zx_numa_entry[i].type = 0;
+	}
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		zx_numa_entry[i].index = i;
+		list_add_tail(&(zx_numa_entry[i].list), &_zx_numa_lock_head);
+	}
+	return 1;
+}
+
+static int zx_numa_lock_init(int numa)
+{
+	int align = max_t(int, L1_CACHE_BYTES, ARCH_MIN_TASKALIGN);
+	int d = 0;
+	int status = 0;
+
+	atomic_set(&lockindex, 0);
+	atomic_set(&numa_count, 0);
+
+	if (sizeof(struct _numa_lock) & 0x3f)
+		d = (int)((sizeof(struct _numa_lock) + L1_CACHE_BYTES) /
+			  L1_CACHE_BYTES);
+	else
+		d = (int)(sizeof(struct _numa_lock) / L1_CACHE_BYTES);
+
+	zx_numa_entry_cachep = kmem_cache_create(
+		"zx_numa_entry",
+		sizeof(struct _numa_buf) * zx_numa_lock_total, align,
+		SLAB_PANIC | SLAB_ACCOUNT, NULL);
+
+	zx_numa_lock_cachep = kmem_cache_create(
+		"zx_numa_lock",
+		d * L1_CACHE_BYTES * (numa + NUMAEXPAND), align,
+		SLAB_PANIC | SLAB_ACCOUNT, NULL);
+
+
+	if (zx_numa_entry_cachep && zx_numa_lock_cachep) {
+		zx_numa_entry = (struct _numa_buf *)kmem_cache_alloc(
+				zx_numa_entry_cachep, GFP_KERNEL);
+		if (zx_numa_entry) {
+			memset((char *)zx_numa_entry, 0,
+				sizeof(struct _numa_buf) * zx_numa_lock_total);
+			create_numa_buffer_list(numa, d);
+			status = 1;
+		}
+	}
+
+	pr_info("enable dynamic numa-aware osq_lock, clusters %d\n",
+		numa);
+	return status;
+}
+
+
+#define numa_lock_proc_dir "zx_numa_lock"
+#define zx_numa_enable_dir "dynamic_enable"
+#define numa_entry_total 8
+struct proc_dir_entry *numa_lock_proc;
+struct proc_dir_entry *numa_lock_enable;
+struct proc_dir_entry *numa_proc_entry[numa_entry_total];
+
+static ssize_t numa_lock_proc_read(struct file *file,
+		char __user *usrbuf, size_t len, loff_t *off)
+{
+	int id = (long) pde_data(file_inode(file));
+	char kbuffer[128];
+	ssize_t retval = 0;
+	size_t n = 0;
+
+	memset(kbuffer, 0, sizeof(kbuffer));
+	if (id == 0)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(dynamic_enable));
+	else if (id == 1)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_lock_depth));
+	else if (id == 2)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_keep_times));
+	else if (id == 3)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_node_max));
+	else if (id == 4)
+		n = sprintf(kbuffer, "%d\n", atomic_read(&numa_count));
+	retval = simple_read_from_buffer(usrbuf, len, off, kbuffer, n);
+
+	return retval;
+}
+
+static ssize_t numa_lock_proc_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *f_pos)
+{
+	int id = (long) pde_data(file_inode(file));
+	char kbuffer[128];
+	unsigned long new = 0;
+	int err = 0;
+
+	memset(kbuffer, 0, sizeof(kbuffer));
+	if (copy_from_user(kbuffer, buffer, count))
+		return count;
+	kbuffer[count] = '\0';
+	err = kstrtoul(kbuffer, 10, &new);
+
+	if (id == 0) {
+		int last = READ_ONCE(dynamic_enable);
+
+		if (new < 0 || new >= 2 || last == new)
+			return count;
+
+		if (last == 0) {
+			prefetchw(&enable_zx_numa_osq_lock);
+			//enable to the 2-bytes-tail osq-lock
+			prefetchw(&enable_zx_numa_osq_lock);
+			WRITE_ONCE(enable_zx_numa_osq_lock, 2);
+			schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
+		}
+		prefetchw(&dynamic_enable);
+		WRITE_ONCE(dynamic_enable, new);
+		return count;
+	}
+
+	if (READ_ONCE(dynamic_enable) != 0) {
+		pr_info("dynamic %d: change setting should disable dynamic\n",
+			dynamic_enable);
+		return count;
+	}
+	if (id == 1 && new > 4 && new <= 32)
+		WRITE_ONCE(osq_lock_depth, new);
+	else if (id == 2 && new >= 16 && new <= 2048)
+		WRITE_ONCE(osq_keep_times, new);
+	else if (id == 3 && new > 4 && new <= 2048)
+		WRITE_ONCE(osq_node_max, new);
+	return count;
+}
+static int numa_lock_proc_show(struct seq_file *m, void *v)
+{
+	return 0;
+}
+
+static int numa_lock_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, numa_lock_proc_show, NULL);
+}
+static const struct proc_ops numa_lock_proc_fops = {
+	.proc_open = numa_lock_proc_open,
+	.proc_read = numa_lock_proc_read,
+	.proc_write = numa_lock_proc_write
+};
+
+static int numalock_proc_init(void)
+{
+	int index = 0;
+	int i = 0;
+
+	numa_lock_proc = proc_mkdir(numa_lock_proc_dir, NULL);
+	if (numa_lock_proc == NULL) {
+		pr_info("%s proc create %s failed\n", __func__,
+				numa_lock_proc_dir);
+		return -EINVAL;
+	}
+
+	numa_lock_enable = proc_create_data(zx_numa_enable_dir, 0666,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	if (!numa_lock_enable) {
+		pr_info("%s proc_create_data %s failed!\n", __func__,
+				zx_numa_enable_dir);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < numa_entry_total; i++)
+		numa_proc_entry[i] = NULL;
+
+	numa_proc_entry[0] =  proc_create_data("osq_lock_depth", 0664,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	numa_proc_entry[1] =  proc_create_data("osq_keep_times", 0664,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	numa_proc_entry[2] =  proc_create_data("osq_node_max", 0664,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	numa_proc_entry[3] =  proc_create_data("numa_osq_lock", 0444,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	return 0;
+}
+
+static void numalock_proc_exit(void)
+{
+	int i = 0;
+
+	for (i = 0; i < numa_entry_total; i++) {
+		if (numa_proc_entry[i])
+			proc_remove(numa_proc_entry[i]);
+	}
+	if (numa_lock_enable)
+		proc_remove(numa_lock_enable);
+	if (numa_lock_proc)
+		remove_proc_entry(numa_lock_proc_dir, NULL);
+
+}
+
+static int numalock_shutdown_notify(struct notifier_block *unused1,
+		unsigned long unused2, void *unused3)
+{
+	if (READ_ONCE(enable_zx_numa_osq_lock) == 2) {
+		WRITE_ONCE(dynamic_enable, 0);
+		WRITE_ONCE(enable_zx_numa_osq_lock, 0xf);
+	}
+	return NOTIFY_DONE;
+}
+static struct notifier_block numalock_shutdown_nb = {
+	.notifier_call = numalock_shutdown_notify,
+};
+static int __init zx_numa_base_init(void)
+{
+	int cpu = num_possible_cpus();
+	int i = 0;
+
+	WRITE_ONCE(enable_zx_numa_osq_lock, 0);
+	if (kvm_para_available())
+		return 0;
+	if (cpu >= 65534 || cpu < 16 || (cpu & 0x7) != 0)
+		return 0;
+
+	for (i = 0; i < ARRAY_SIZE(numa_cpu_list); i++) {
+		if (boot_cpu_data.x86_vendor == numa_cpu_list[i].x86_vendor &&
+			boot_cpu_data.x86 == numa_cpu_list[i].x86 &&
+			boot_cpu_data.x86_model == numa_cpu_list[i].x86_model) {
+
+			if (numa_cpu_list[i].feature1 == 1)
+				NUMACLUSTERS = nr_node_ids + nr_node_ids;
+			NUMASHIFT = zx_get_numa_shift(num_possible_cpus(),
+					NUMACLUSTERS);
+
+			if (zx_numa_lock_init(NUMACLUSTERS) == 0)
+				return -ENOMEM;
+			register_reboot_notifier(&numalock_shutdown_nb);
+			numalock_proc_init();
+			INIT_DELAYED_WORK(&zx_numa_cleanup_work,
+				zx_numa_cleanup);
+			prefetchw(&enable_zx_numa_osq_lock);
+			WRITE_ONCE(enable_zx_numa_osq_lock, 1);
+			return 0;
+		}
+	}
+	return 0;
+}
+
+static void __exit zx_numa_lock_exit(void)
+{
+	numalock_proc_exit();
+	prefetchw(&dynamic_enable);
+	WRITE_ONCE(dynamic_enable, 0);
+}
+
+late_initcall(zx_numa_base_init);
+module_exit(zx_numa_lock_exit);
+MODULE_AUTHOR("LiYong <yongli-oc@zhaoxin.com>");
+MODULE_DESCRIPTION("zx dynamic numa-aware osq lock");
+MODULE_LICENSE("GPL");
+
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] locking/osq_lock: Turn to dynamic switch function from osq_lock/unlock.
  2024-09-14  8:53 ` [PATCH 2/4] locking/osq_lock: Turn to dynamic switch function from osq_lock/unlock yongli-oc
@ 2024-09-14 16:06   ` Waiman Long
  2024-09-19  9:32     ` yongli-os
  0 siblings, 1 reply; 19+ messages in thread
From: Waiman Long @ 2024-09-14 16:06 UTC (permalink / raw)
  To: yongli-oc, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang


On 9/14/24 04:53, yongli-oc wrote:
> To support numa-aware osq lock, the struct optimistic_spin_queue
> is accessed as three members, numa_enable, index, tail16, by union.
> The size of the struct is the same as before.
> If dynamic numa-ware lock enable, turns to the crossing, x_osq_lock to
> check contention level and starts dynamic switch.
>
> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
> ---
>   include/linux/osq_lock.h  | 33 ++++++++++++++++++++-
>   kernel/locking/osq_lock.c | 62 +++++++++++++++++++++++++++++++++++++--
>   2 files changed, 91 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index ea8fb31379e3..37a7bc4ab530 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -12,14 +12,42 @@ struct optimistic_spin_queue {
>   	 * Stores an encoded value of the CPU # of the tail node in the queue.
>   	 * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
>   	 */
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +	union {
> +		atomic_t tail;
> +		u32 val;
> +#ifdef __LITTLE_ENDIAN
> +		struct {
> +			u16 tail16;
> +			u8 index;
> +			u8 numa_enable;
> +		};
> +#else
> +		struct {
> +			u8 numa_enable;
> +			u8 index;
> +			u16 tail16;
> +		};
> +#endif
> +	};
> +#else
>   	atomic_t tail;
> +#endif
>   };
>   
>   #define OSQ_UNLOCKED_VAL (0)
>   
>   /* Init macro and function. */
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +
> +#define OSQ_LOCK_UNLOCKED { .tail = ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
> +
> +#else
> +
>   #define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
>   
> +#endif
> +
>   static inline void osq_lock_init(struct optimistic_spin_queue *lock)
>   {
>   	atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
> @@ -28,9 +56,12 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
>   extern bool osq_lock(struct optimistic_spin_queue *lock);
>   extern void osq_unlock(struct optimistic_spin_queue *lock);
>   
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +extern bool osq_is_locked(struct optimistic_spin_queue *lock);
> +#else
>   static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
>   {
>   	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
>   }
> -
> +#endif
>   #endif
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 75a6f6133866..a7b516939e00 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -2,7 +2,10 @@
>   #include <linux/percpu.h>
>   #include <linux/sched.h>
>   #include <linux/osq_lock.h>
> -
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +#include "numa.h"
> +#include "numa_osq.h"
> +#endif

These header files are defined in patch 3. You need to rethink about 
patch ordering in order not to break bisection.


>   /*
>    * An MCS like lock especially tailored for optimistic spinning for sleeping
>    * lock implementations (mutex, rwsem, etc).
> @@ -12,12 +15,34 @@
>    * spinning.
>    */
>   
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> +/*
> + * We use the value 0 to represent "no CPU", thus the encoded value
> + * will be the CPU number incremented by 1.
> + */
> +inline int encode_cpu(int cpu_nr)
> +{
> +	return cpu_nr + 1;
> +}
> +
> +inline int node_cpu(struct optimistic_spin_node *node)
> +{
> +	return node->cpu - 1;
> +}
> +
> +inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +{
> +	int cpu_nr = encoded_cpu_val - 1;
> +
> +	return per_cpu_ptr(&osq_node, cpu_nr);
> +}
> +#else
>   struct optimistic_spin_node {
>   	struct optimistic_spin_node *next, *prev;
>   	int locked; /* 1 if lock acquired */
>   	int cpu; /* encoded CPU # + 1 value */
>   };
> -
>   static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>   
>   /*
> @@ -40,6 +65,7 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>   
>   	return per_cpu_ptr(&osq_node, cpu_nr);
>   }
> +#endif
>   
>   /*
>    * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
> @@ -97,6 +123,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   	int curr = encode_cpu(smp_processor_id());
>   	int old;
>   
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +	if (unlikely(enable_zx_numa_osq_lock > 1)) {
> +		node->numa = 1;
> +		return x_osq_lock(lock);
> +	}
> +	node->numa = 0;
> +#endif
> +
>   	node->locked = 0;
>   	node->next = NULL;
>   	node->cpu = curr;
> @@ -108,6 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   	 * the lock tail.
>   	 */
>   	old = atomic_xchg(&lock->tail, curr);
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +	if (enable_zx_numa_osq_lock > 0)
> +	//enable means all cpu cores are less tan 65534.
> +		old = old & 0xffff;
> +#endif
>   	if (old == OSQ_UNLOCKED_VAL)
>   		return true;
>   
> @@ -212,6 +251,14 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>   	struct optimistic_spin_node *node, *next;
>   	int curr = encode_cpu(smp_processor_id());
>   
> +	node = this_cpu_ptr(&osq_node);
> +
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +	if (unlikely(enable_zx_numa_osq_lock > 1 &&
> +		node->numa == 1))
> +		return x_osq_unlock(lock);
> +#endif
> +
>   	/*
>   	 * Fast path for the uncontended case.
>   	 */
> @@ -222,7 +269,6 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>   	/*
>   	 * Second most likely case.
>   	 */
> -	node = this_cpu_ptr(&osq_node);
>   	next = xchg(&node->next, NULL);
>   	if (next) {
>   		WRITE_ONCE(next->locked, 1);
> @@ -233,3 +279,13 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>   	if (next)
>   		WRITE_ONCE(next->locked, 1);
>   }
> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
> +bool osq_is_locked(struct optimistic_spin_queue *lock)
> +{
> +	if (unlikely(enable_zx_numa_osq_lock > 1))
> +		return x_osq_is_locked(lock);
> +	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
> +}
> +#endif
> +
> +


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock.
  2024-09-14  8:53 ` [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock yongli-oc
@ 2024-09-14 17:11   ` Waiman Long
  2024-09-19  9:35     ` yongli-os
  2024-09-15 18:44   ` Uros Bizjak
  1 sibling, 1 reply; 19+ messages in thread
From: Waiman Long @ 2024-09-14 17:11 UTC (permalink / raw)
  To: yongli-oc, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

On 9/14/24 04:53, yongli-oc wrote:
> According to the contention level, switches from x_osq_lock to
> numa-aware osq_lock.
> The numa-aware lock is a two level osq_lock.
> The Makefile for dynamic numa-aware osq lock.
>
> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
> ---
>   kernel/locking/Makefile      |   1 +
>   kernel/locking/numa.h        |  98 ++++++++
>   kernel/locking/numa_osq.h    |  32 +++
>   kernel/locking/x_osq_lock.c  | 332 +++++++++++++++++++++++++++
>   kernel/locking/zx_numa_osq.c | 433 +++++++++++++++++++++++++++++++++++
>   5 files changed, 896 insertions(+)
>   create mode 100644 kernel/locking/numa.h
>   create mode 100644 kernel/locking/numa_osq.h
>   create mode 100644 kernel/locking/x_osq_lock.c
>   create mode 100644 kernel/locking/zx_numa_osq.c
>
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 0db4093d17b8..00c1d5bb6eb9 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -22,6 +22,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
>   endif
>   obj-$(CONFIG_SMP) += spinlock.o
>   obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
> +obj-$(CONFIG_LOCK_SPIN_ON_OWNER_NUMA) += x_osq_lock.o zx_numa_osq.o zx_numa.o
>   obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
>   obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
>   obj-$(CONFIG_RT_MUTEXES) += rtmutex_api.o
> diff --git a/kernel/locking/numa.h b/kernel/locking/numa.h
> new file mode 100644
> index 000000000000..01e5aef3257a
> --- /dev/null
> +++ b/kernel/locking/numa.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_NUMA_LOCK_H
> +#define __LINUX_NUMA_LOCK_H
> +#include "mcs_spinlock.h"
> +
> +struct optimistic_spin_node {
> +	struct optimistic_spin_node *next, *prev;
> +	int numa;
> +	int locked; /* 1 if lock acquired */
> +	int cpu; /* encoded CPU # + 1 value */
> +	u32 serial;
> +};
> +
> +
> +struct _numa_buf {
> +	void *numa_ptr;
> +	struct list_head list;
> +	u32 lockaddr;
> +	u32 highaddr;
> +	u8 idle;
> +	u8 type;
> +	u16 index;
> +};
> +
> +struct cache_padding {
> +	char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define CACHE_PADDING(name)	struct cache_padding name
We have struct cacheline_padding and CACHELINE_PADDING() in 
include/linux/cache. What is the point of reinventing the same thing here?
> +
> +struct _numa_lock {
> +	atomic_t tail ____cacheline_aligned_in_smp;
> +	atomic_t addr;
> +	u8 shift;
> +	u8 stopping;
> +	u16 numa_nodes;
> +	u32 accessed;
> +	uint64_t totalaccessed;
> +	u32 nodeswitched;
> +	atomic_t initlock;
> +	atomic_t pending;
> +	union {
> +		struct mcs_spinlock mcs_node;
> +		struct optimistic_spin_node osq_node;
> +	};
> +	CACHE_PADDING(pad);
> +};
> +
> +struct numa_cpu_info {
> +	__u8	x86_model;
> +	/* CPU family */
> +	__u8	x86;
> +	/* CPU vendor */
> +	__u8	x86_vendor;
> +	__u8	x86_reserved;
> +	u32	feature1;
> +};
> +
> +#define NUMAEXPAND 1
> +
> +#define COHORT_START 1
> +#define ACQUIRE_NUMALOCK (UINT_MAX-1)
> +#define NODE_WAIT UINT_MAX
> +#define LOCK_NUMALOCK 1
> +#define UNLOCK_NUMALOCK 0
> +
> +#define NUMALOCKDYNAMIC 0xff
> +#define TURNTONUMAREADY 0xa5a5
> +#define NUMATURNBACKREADY 0x5a5a
> +
> +#define NUMA_LOCKED_VAL 0xf5efef
What are these special bit values for?
> +#define NUMA_UNLOCKED_VAL 0
> +
> +#define NUMASTEERMASK 0xf0000000
> +#define HIGH32BITMASK 0xffffffff00000000
> +#define LOW32MASK 0xffffffff
> +
> +extern int NUMASHIFT;
> +extern int NUMACLUSTERS;
In the kernel, uppercase names are used for macros. Variables should use 
all lowercase names to avoid confusion.
> +extern int zx_numa_lock_total;
> +extern struct _numa_buf *zx_numa_entry;
> +extern atomic_t numa_count;
> +extern int enable_zx_numa_osq_lock;
> +extern u32 zx_numa_lock;
> +extern int dynamic_enable;
> +extern struct kmem_cache *zx_numa_lock_cachep;
> +
> +static inline u32 ptrmask(void *s)
> +{
> +	return (uint64_t)s & LOW32MASK;
> +}
> +inline void *get_numa_lock(int index);
> +
> +int zx_check_numa_dynamic_locked(u32 lockaddr, struct _numa_lock *_numa_lock,
> +		int t);
> +int zx_numa_lock_ptr_get(void *p);
> +void numa_lock_init_data(struct _numa_lock *s, int clusters, u32 lockval,
> +		u32 lockaddr);
> +#endif
> diff --git a/kernel/locking/numa_osq.h b/kernel/locking/numa_osq.h
> new file mode 100644
> index 000000000000..4c99ad60c4e0
> --- /dev/null
> +++ b/kernel/locking/numa_osq.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_NUMA_OSQ_H
> +#define __LINUX_NUMA_OSQ_H
> +
> +#include <linux/osq_lock.h>
> +#include "mcs_spinlock.h"
> +
> +#define OSQLOCKINITED 0
> +#define OSQTONUMADETECT 0x10
> +#define OSQLOCKSTOPPING 0xfc
> +#define OSQ_LOCKED_VAL 0xffff
> +
> +extern u16 osq_keep_times;
> +extern u16 osq_lock_depth;
> +extern int osq_node_max;
> +
> +inline int encode_cpu(int cpu_nr);
> +inline int node_cpu(struct optimistic_spin_node *node);
> +inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val);
> +
> +void zx_osq_lock_stopping(struct optimistic_spin_queue *lock);
> +void zx_osq_numa_start(struct optimistic_spin_queue *lock);
> +void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock);
> +
> +bool x_osq_lock(struct optimistic_spin_queue *lock);
> +void x_osq_unlock(struct optimistic_spin_queue *lock);
> +bool x_osq_is_locked(struct optimistic_spin_queue *lock);
> +inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
> +		struct _numa_lock *n);
> +inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
> +		struct _numa_lock *n);
> +#endif
> diff --git a/kernel/locking/x_osq_lock.c b/kernel/locking/x_osq_lock.c
> new file mode 100644
> index 000000000000..6da8905ae7d6
> --- /dev/null
> +++ b/kernel/locking/x_osq_lock.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * crossing from osq_lock to numa-aware lock
> + */
> +#include <linux/percpu.h>
> +#include <linux/sched.h>
> +#include <linux/osq_lock.h>
> +#include "numa.h"
> +#include "numa_osq.h"
> +
> +u16 osq_lock_depth = 8;
> +u16 osq_keep_times = 32;
> +
> +/*
> + * An MCS like lock especially tailored for optimistic spinning for sleeping
> + * lock implementations (mutex, rwsem, etc).
> + *
> + * Using a single mcs node per CPU is safe because sleeping locks should not be
> + * called from interrupt context and we have preemption disabled while
> + * spinning.
> + */
> +DECLARE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> +
> +/*
> + * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
> + * Can return NULL in case we were the last queued and we updated @lock instead.
> + *
> + * If osq_lock() is being cancelled there must be a previous node
> + * and 'old_cpu' is its CPU #.
> + * For osq_unlock() there is never a previous node and old_cpu is
> + * set to OSQ_UNLOCKED_VAL.
> + */
> +static inline struct optimistic_spin_node *
> +osq_wait_next_stop(struct optimistic_spin_queue *lock,
> +	      struct optimistic_spin_node *node,
> +	      int old_cpu)
> +{
> +	u16 curr = encode_cpu(smp_processor_id());
> +	u16 old = old_cpu;
> +
> +	if (lock->numa_enable == OSQLOCKSTOPPING && old == OSQ_UNLOCKED_VAL)
> +		old = OSQ_LOCKED_VAL;
> +
> +	for (;;) {
> +		if (READ_ONCE(lock->tail16) == curr &&
> +		    cmpxchg(&lock->tail16, curr, old) == curr) {
> +
> +			/*
> +			 * We were the last queued, we moved @lock back. @prev
> +			 * will now observe @lock and will complete its
> +			 * unlock()/unqueue().
> +			 */
> +			return NULL;
> +		}
> +
> +		/*
> +		 * We must xchg() the @node->next value, because if we were to
> +		 * leave it in, a concurrent unlock()/unqueue() from
> +		 * @node->next might complete Step-A and think its @prev is
> +		 * still valid.
> +		 *
> +		 * If the concurrent unlock()/unqueue() wins the race, we'll
> +		 * wait for either @lock to point to us, through its Step-B, or
> +		 * wait for a new @node->next from its Step-C.
> +		 */
> +		if (node->next) {
> +			struct optimistic_spin_node *next;
> +
> +			next = xchg(&node->next, NULL);
> +			if (next)
> +				return next;
> +		}
> +
> +		cpu_relax();
> +	}
> +}
> +
> +bool x_osq_lock(struct optimistic_spin_queue *lock)
> +{
> +	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *prev, *next;
> +	int cpu = smp_processor_id();
> +	u16 curr = encode_cpu(cpu);
> +	struct optimistic_spin_queue tail;
> +	u16 old;
> +
> +	tail.val = READ_ONCE(lock->val);
> +	if (unlikely(tail.numa_enable == OSQLOCKSTOPPING)) {
> +		zx_osq_turn_numa_waiting(lock);
> +		return x_osq_lock(lock);
> +	}
> +
> +	if (unlikely(tail.numa_enable == NUMALOCKDYNAMIC)) {
> +		struct _numa_lock *_numa_lock = NULL;
> +		struct _numa_lock *node_lock = NULL;
> +
> +		_numa_lock = get_numa_lock(tail.index);
> +		node_lock = (struct _numa_lock *) _numa_lock +
> +					(cpu >> NUMASHIFT);
> +
> +		prefetch(node_lock);
> +		return zx_numa_osq_lock(lock, _numa_lock);
> +	}
> +
> +	node->locked = 0;
> +	node->next = NULL;
> +	node->cpu = curr;
> +
> +	/*
> +	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> +	 * unlock() uncontended, or fastpath) and RELEASE (to publish
> +	 * the node fields we just initialised) semantics when updating
> +	 * the lock tail.
> +	 */
> +
> +	if (likely(tail.numa_enable >= OSQTONUMADETECT)) {
> +		struct optimistic_spin_queue ss;
> +
> +		while (1) {
> +			ss.val = atomic_read(&lock->tail);
> +			if (ss.tail16 == OSQ_LOCKED_VAL) {
> +				zx_osq_turn_numa_waiting(lock);
> +				return x_osq_lock(lock);
> +			}
> +			if (cmpxchg(&lock->tail16, ss.tail16, curr)
> +					== ss.tail16) {
> +				old = ss.tail16;
> +				break;
> +			}
> +			cpu_relax();
> +		}
> +	} else
> +		old = xchg(&lock->tail16, curr);
> +
> +	if (old == OSQ_UNLOCKED_VAL) {
> +		node->serial = 1;
> +		return true;
> +	}
> +
> +	prev = decode_cpu(old);
> +	node->prev = prev;
> +
> +	node->serial = prev->serial + 1;
> +	/*
> +	 * osq_lock()			unqueue
> +	 *
> +	 * node->prev = prev		osq_wait_next()
> +	 * WMB				MB
> +	 * prev->next = node		next->prev = prev // unqueue-C
> +	 *
> +	 * Here 'node->prev' and 'next->prev' are the same variable and we need
> +	 * to ensure these stores happen in-order to avoid corrupting the list.
> +	 */
> +	smp_wmb();
> +
> +	WRITE_ONCE(prev->next, node);
> +
> +	/*
> +	 * Normally @prev is untouchable after the above store; because at that
> +	 * moment unlock can proceed and wipe the node element from stack.
> +	 *
> +	 * However, since our nodes are static per-cpu storage, we're
> +	 * guaranteed their existence -- this allows us to apply
> +	 * cmpxchg in an attempt to undo our queueing.
> +	 */
> +
> +	/*
> +	 * Wait to acquire the lock or cancellation. Note that need_resched()
> +	 * will come with an IPI, which will wake smp_cond_load_relaxed() if it
> +	 * is implemented with a monitor-wait. vcpu_is_preempted() relies on
> +	 * polling, be careful.
> +	 */
> +	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> +				  vcpu_is_preempted(node_cpu(node->prev))))
> +		return true;
> +
> +	/* unqueue */
> +	/*
> +	 * Step - A  -- stabilize @prev
> +	 *
> +	 * Undo our @prev->next assignment; this will make @prev's
> +	 * unlock()/unqueue() wait for a next pointer since @lock points to us
> +	 * (or later).
> +	 */
> +
> +	for (;;) {
> +		/*
> +		 * cpu_relax() below implies a compiler barrier which would
> +		 * prevent this comparison being optimized away.
> +		 */
> +		if (data_race(prev->next) == node &&
> +		    cmpxchg(&prev->next, node, NULL) == node)
> +			break;
> +
> +		/*
> +		 * We can only fail the cmpxchg() racing against an unlock(),
> +		 * in which case we should observe @node->locked becoming
> +		 * true.
> +		 */
> +		if (smp_load_acquire(&node->locked))
> +			return true;
> +
> +		cpu_relax();
> +
> +		/*
> +		 * Or we race against a concurrent unqueue()'s step-B, in which
> +		 * case its step-C will write us a new @node->prev pointer.
> +		 */
> +		prev = READ_ONCE(node->prev);
> +	}
> +
> +	/*
> +	 * Step - B -- stabilize @next
> +	 *
> +	 * Similar to unlock(), wait for @node->next or move @lock from @node
> +	 * back to @prev.
> +	 */
> +
> +	next = osq_wait_next_stop(lock, node, prev->cpu);
> +	if (!next)
> +		return false;
> +
> +	/*
> +	 * Step - C -- unlink
> +	 *
> +	 * @prev is stable because its still waiting for a new @prev->next
> +	 * pointer, @next is stable because our @node->next pointer is NULL and
> +	 * it will wait in Step-A.
> +	 */
> +
> +	WRITE_ONCE(next->prev, prev);
> +	WRITE_ONCE(prev->next, next);
> +
> +	return false;
> +}
> +
> +
> +
> +void x_osq_unlock(struct optimistic_spin_queue *lock)
> +{
> +	struct optimistic_spin_node *node, *next;
> +	int threadshold = osq_lock_depth;
> +	int cpu = smp_processor_id();
> +	u16 curr = encode_cpu(cpu);
> +	int depth = 0;
> +	u32 count = 0;
> +
> +	if (unlikely(lock->numa_enable == NUMALOCKDYNAMIC)) {
> +		struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
> +
> +		prefetch((struct _numa_lock *) _numa_lock + (cpu >> NUMASHIFT));
> +		return zx_numa_osq_unlock(lock, _numa_lock);
> +	}
> +	/*
> +	 * Fast path for the uncontended case.
> +	 */
> +	if (unlikely(lock->numa_enable == OSQTONUMADETECT)) {
> +		struct optimistic_spin_node *node_last = NULL;
> +		u16 tail = 0;
> +
> +		tail = cmpxchg(&lock->tail16, curr, OSQ_UNLOCKED_VAL);
> +		if (tail == curr)
> +			return;
> +
> +		node = this_cpu_ptr(&osq_node);
> +		node_last = decode_cpu(tail);
> +		depth = node_last->serial - node->serial;
> +		count = READ_ONCE(node->locked);
> +		if (count > osq_keep_times && (dynamic_enable & 0x1))
> +			zx_osq_lock_stopping(lock);
> +	} else if (unlikely(lock->numa_enable == OSQLOCKSTOPPING)) {
> +		if (cmpxchg(&lock->tail16, curr, OSQ_LOCKED_VAL)
> +					== curr) {
> +			zx_osq_numa_start(lock);
> +			return;
> +		}
> +	} else {
> +		struct optimistic_spin_queue t;
> +
> +		t.val = 0;
> +		if (dynamic_enable & 0x1) {
> +			if (atomic_read(&numa_count) < zx_numa_lock_total)
> +				t.numa_enable = OSQTONUMADETECT;
> +		}
> +		if (t.numa_enable == OSQTONUMADETECT) {
> +			if (atomic_cmpxchg_release(&lock->tail, curr,
> +				(t.val | OSQ_UNLOCKED_VAL)) == curr)
> +				return;
> +		} else if (cmpxchg(&lock->tail16, curr,
> +				OSQ_UNLOCKED_VAL) == curr)
> +			return;
> +	}
> +
> +	/*
> +	 * Second most likely case.
> +	 */
> +	node = this_cpu_ptr(&osq_node);
> +	next = xchg(&node->next, NULL);
> +	if (next) {
> +		if (depth > threadshold)
> +			WRITE_ONCE(next->locked, count + 1);
> +		else
> +			WRITE_ONCE(next->locked, 1);
> +		return;
> +	}
> +
> +	next = osq_wait_next_stop(lock, node, OSQ_UNLOCKED_VAL);
> +	if (next) {
> +		if (depth > threadshold)
> +			WRITE_ONCE(next->locked, count + 1);
> +		else
> +			WRITE_ONCE(next->locked, 1);
> +	}
> +}
> +
> +bool x_osq_is_locked(struct optimistic_spin_queue *lock)
> +{
> +	struct optimistic_spin_queue val;
> +
> +	val.val = atomic_read(&lock->tail);
> +	if (val.tail16 == OSQ_UNLOCKED_VAL)
> +		return false;
> +
> +	if (val.tail16 == OSQ_LOCKED_VAL) {
> +		if (val.numa_enable != NUMALOCKDYNAMIC)
> +			return true;
> +		return zx_check_numa_dynamic_locked(ptrmask(lock),
> +					get_numa_lock(val.index), 0);
> +	}
> +
> +	return true;
> +}
These functions are mostly based on the  osq_* functions with some extra 
code stuffed in. Common code should be shared instead of duplicated.
> diff --git a/kernel/locking/zx_numa_osq.c b/kernel/locking/zx_numa_osq.c
> new file mode 100644
> index 000000000000..f4c3dba208d3
> --- /dev/null
> +++ b/kernel/locking/zx_numa_osq.c
> @@ -0,0 +1,433 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Dynamic numa-aware osq lock
> + * Author: LiYong <yongli-oc@zhaoxin.com>
> + *
> + */
> +#include <linux/cpumask.h>
> +#include <asm/byteorder.h>
> +#include <linux/percpu.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/osq_lock.h>
> +#include "numa.h"
> +#include "numa_osq.h"
> +
> +int osq_node_max = 256;
> +
> +/*
> + * The pending bit spinning loop count.
> + * This heuristic is used to limit the number of lockword accesses
> + * made by atomic_cond_read_relaxed when waiting for the lock to
> + * transition out of the "== _Q_PENDING_VAL" state. We don't spin
> + * indefinitely because there's no guarantee that we'll make forward
> + * progress.
> + */
> +
> +static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_cpu_node);
> +
> +/*
> + * We use the value 0 to represent "no CPU", thus the encoded value
> + * will be the CPU number incremented by 1.
> + */
> +static inline int decode(int cpu_nr)
> +{
> +	return cpu_nr - 1;
> +}
> +
> +static inline struct optimistic_spin_node *decode_curr(int encoded_cpu_val)
> +{
> +	int cpu_nr = decode(encoded_cpu_val);
> +
> +	return per_cpu_ptr(&osq_cpu_node, cpu_nr);
> +}
> +
> +static int atomic64_cmpxchg_notequal(void *qslock, atomic_t *tail, int curr)
> +{
> +	u64 ss = 0;
> +	u32 addr = ptrmask(qslock);
> +	u64 addrcurr = (((u64)addr) << 32) | curr;
> +
> +	while (1) {
> +		ss = atomic64_read((atomic64_t *) tail);
> +		if ((ss >> 32) != addr)
> +			return NUMA_LOCKED_VAL;
> +		if ((ss & LOW32MASK) == NUMA_LOCKED_VAL)
> +			return NUMA_LOCKED_VAL;
> +		if (atomic64_cmpxchg((atomic64_t *) tail, ss, addrcurr) == ss)
> +			return ss & LOW32MASK;
> +		cpu_relax();
> +	}
> +}
> +
> +void zx_osq_lock_stopping(struct optimistic_spin_queue *lock)
> +{
> +	int s = 0;
> +
> +	s = zx_numa_lock_ptr_get(lock);
> +	if (s < zx_numa_lock_total) {
> +		numa_lock_init_data(zx_numa_entry[s].numa_ptr,
> +			NUMACLUSTERS, NUMA_UNLOCKED_VAL,
> +			ptrmask(lock));
> +
> +		WRITE_ONCE(lock->index, s);
> +		zx_numa_entry[s].type = 1;
> +		smp_mb();/*should set these before enable*/
> +		prefetchw(&lock->numa_enable);
> +		WRITE_ONCE(lock->numa_enable, OSQLOCKSTOPPING);
> +	} else {
> +		prefetchw(&lock->numa_enable);
> +		WRITE_ONCE(lock->numa_enable, OSQLOCKINITED);
> +	}
> +}
> +
> +void zx_osq_numa_start(struct optimistic_spin_queue *lock)
> +{
> +	struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
> +
> +	prefetchw(&lock->numa_enable);
> +	WRITE_ONCE(lock->numa_enable, NUMALOCKDYNAMIC);
> +	smp_mb(); /*should keep lock->numa_enable modified first*/
> +	atomic_set(&_numa_lock->initlock, TURNTONUMAREADY);
> +}
> +
> +
> +void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock)
> +{
> +	struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
> +
> +	atomic_inc(&_numa_lock->pending);
> +	while (1) {
> +		int s = atomic_read(&_numa_lock->initlock);
> +
> +		if (s == TURNTONUMAREADY)
> +			break;
> +		cpu_relax();
> +		cpu_relax();
> +		cpu_relax();
> +		cpu_relax();
We don't usually call cpu_relax() multiple times like that as the actual 
delay is CPU dependent and it is hard to get it right for all. Can you 
explain why it is called exactly 4 times here?
> +
> +	}
> +	atomic_dec(&_numa_lock->pending);
> +}
> +
> +
> +
> +
> +static struct optimistic_spin_node *
> +zx_numa_osq_wait_next(struct _numa_lock *lock,
> +		struct optimistic_spin_node *node,
> +		struct optimistic_spin_node *prev, int cpu)
> +{
> +	struct optimistic_spin_node *next = NULL;
> +	int curr = encode_cpu(cpu);
> +	int old;
> +
> +	old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
> +	for (;;) {
> +		if (atomic_read(&lock->tail) == curr &&
> +		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
> +
> +			break;
> +		}
> +		if (node->next) {
> +			next = xchg(&node->next, NULL);
> +			if (next)
> +				break;
> +		}
> +		cpu_relax();
> +	}
> +	return next;
> +}
> +static void zx_numa_turn_osq_waiting(struct optimistic_spin_queue *lock,
> +			struct _numa_lock *_numa_lock)
> +{
> +	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
> +	int lockaddr = ptrmask(lock);
> +	u64 s = 0;
> +	struct optimistic_spin_queue tail;
> +
> +	tail.numa_enable = NUMALOCKDYNAMIC;
> +	tail.index = lock->index;
> +	tail.tail16 = OSQ_LOCKED_VAL;
> +	while (1) {
> +		cpu_relax(); cpu_relax(); cpu_relax(); cpu_relax();
> +		s = atomic64_read((atomic64_t *) &numa_lock->tail);
> +		if ((s >> 32) != lockaddr)
> +			break;
> +		if ((s & LOW32MASK) == NUMA_LOCKED_VAL)
> +			break;
> +	}
> +	prefetchw(&lock->tail);
> +	if (atomic_cmpxchg(&lock->tail, tail.val, OSQ_UNLOCKED_VAL)
> +			== tail.val) {
> +		;
> +	}
> +
> +}
> +
> +static int _zx_node_osq_lock_internal(struct optimistic_spin_queue *qslock,
> +	struct optimistic_spin_node *node, struct optimistic_spin_node *prev,
> +	struct _numa_lock *node_lock, int cpu, int *cur_status)
> +{
> +	struct optimistic_spin_node *next = NULL;
> +
> +	for (;;) {
> +		struct optimistic_spin_node *node_prev = NULL;
> +
> +		if (prev->next == node &&
> +		    cmpxchg(&prev->next, node, NULL) == node) {
> +			break;
> +		}
> +		/*load locked first each time*/
> +		*cur_status = smp_load_acquire(&node->locked);
> +
> +		if (*cur_status != NODE_WAIT)
> +			return 0; //goto NODE_UNLOCK;
> +
> +		cpu_relax();
> +		node_prev = READ_ONCE(node->prev);
> +		if (node_prev != prev)
> +			prev = node_prev;
> +	}
> +
> +	next = zx_numa_osq_wait_next(node_lock, node, prev, cpu);
> +	if (!next)
> +		return -1;
> +
> +	WRITE_ONCE(next->prev, prev);
> +	WRITE_ONCE(prev->next, next);
> +
> +	return -1;
> +}
> +
> +static int _zx_node_osq_lock(struct optimistic_spin_queue *qslock,
> +					struct _numa_lock *_numa_lock)
> +{
> +	struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
> +	struct optimistic_spin_node *prev = NULL;
> +	int cpu = smp_processor_id();
> +	int curr = encode_cpu(cpu);
> +	int numa = cpu >> _numa_lock->shift;
> +	struct _numa_lock *node_lock = _numa_lock + numa;
> +	int cur_status = 0;
> +	int old = 0;
> +
> +	node->locked = NODE_WAIT;
> +	node->next = NULL;
> +	node->cpu = curr;
> +
> +	old = atomic64_cmpxchg_notequal(qslock, &node_lock->tail, curr);
> +
> +	if (old == NUMA_LOCKED_VAL) {
> +		bool s = true;
> +
> +		zx_numa_turn_osq_waiting(qslock, _numa_lock);
> +		s = osq_lock(qslock);
> +		if (s == true)
> +			return 1;
> +		else
> +			return -1;
> +	}
> +
> +	if (old == 0) {
> +		node->locked = COHORT_START;
> +		return ACQUIRE_NUMALOCK;
> +	}
> +
> +	prev = decode_curr(old);
> +	node->prev = prev;
> +
> +	smp_mb(); /* make sure node set before set pre->next */
> +
> +	WRITE_ONCE(prev->next, node);
> +
> +	while ((cur_status = READ_ONCE(node->locked)) == NODE_WAIT) {
> +		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) {
> +			int ddd = _zx_node_osq_lock_internal(qslock, node, prev,
> +						node_lock, cpu, &cur_status);
> +
> +			if (cur_status != NODE_WAIT)
> +				goto NODE_UNLOCK;
> +			if (ddd == -1)
> +				return -1;
> +		}
> +		cpu_relax();
> +	}
> +NODE_UNLOCK:
> +	if (cur_status == ACQUIRE_NUMALOCK)
> +		node->locked = COHORT_START;
> +	return cur_status;
> +}
> +static int _zx_numa_osq_lock(struct optimistic_spin_queue *qslock, int cpu,
> +				struct _numa_lock *_numa_lock)
> +{
> +	int numacpu = cpu >> _numa_lock->shift;
> +	int numacurr = encode_cpu(numacpu);
> +
> +	struct optimistic_spin_node *node = &(_numa_lock + numacpu)->osq_node;
> +	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
> +	struct optimistic_spin_node *prevnode = NULL;
> +	int prev = 0;
> +
> +	node->next = NULL;
> +	node->locked = LOCK_NUMALOCK;
> +	node->cpu = numacurr;
> +
> +	prev = atomic_xchg(&numa_lock->tail, numacurr);
> +	if (prev == 0) {
> +		node->locked = UNLOCK_NUMALOCK;
> +		return 0;
> +	}
> +
> +	prevnode = &(_numa_lock + prev - 1)->osq_node;
> +	node->prev = prevnode;
> +	smp_mb(); /*node->prev should be set before next*/
> +	WRITE_ONCE(prevnode->next, node);
> +
> +	while (READ_ONCE(node->locked) == LOCK_NUMALOCK) {
> +		cpu_relax();
> +		cpu_relax();
> +		cpu_relax();
> +		cpu_relax();
> +	}
> +	return 0;
> +}
> +inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
> +		struct _numa_lock *_numa_lock)
> +{
> +	struct _numa_lock *node_lock = NULL;
> +	int cpu = smp_processor_id();
> +	int numa = cpu >> _numa_lock->shift;
> +	int status = 0;
> +
> +	node_lock = _numa_lock + numa;
> +
> +	if (node_lock->stopping) {
> +		zx_numa_turn_osq_waiting(qslock, _numa_lock);
> +		return osq_lock(qslock);
> +	}
> +
> +	status = _zx_node_osq_lock(qslock, _numa_lock);
> +	if (status == ACQUIRE_NUMALOCK)
> +		status = _zx_numa_osq_lock(qslock, smp_processor_id(),
> +				_numa_lock);
> +
> +	if (status == -1)
> +		return false;
> +	return true;
> +}
> +
> +static int atomic64_checktail_osq(struct optimistic_spin_queue *qslock,
> +	struct _numa_lock *node_lock, int ctail)
> +{
> +	u64 addr = ((u64)ptrmask(qslock)) << 32;
> +	u64 addrtail = addr | ctail;
> +	u64 ss = 0;
> +	bool mark;
> +
> +	ss = atomic64_read((atomic64_t *) &node_lock->tail);
> +	if (node_lock->stopping == 0)
> +		mark = (ss == addrtail &&
> +			atomic64_cmpxchg_acquire(
> +				(atomic64_t *) &node_lock->tail,
> +				addrtail, addr|NUMA_UNLOCKED_VAL) == addrtail);
> +	else
> +		mark = (ss == addrtail &&
> +			atomic64_cmpxchg_acquire(
> +				(atomic64_t *) &node_lock->tail,
> +				addrtail, NUMA_LOCKED_VAL) == addrtail);
> +	return mark;
> +}
> +
> +static void node_lock_release(struct optimistic_spin_queue *qslock,
> +		struct _numa_lock *node_lock, struct optimistic_spin_node *node,
> +		int val, int cpu, int numa_end)
> +{
> +	struct optimistic_spin_node *next = NULL;
> +	int curr = encode_cpu(cpu);
> +
> +	while (1) {
> +		if (atomic64_checktail_osq(qslock, node_lock, curr)) {
> +			if (qslock->numa_enable == NUMALOCKDYNAMIC) {
> +				int index = qslock->index;
> +
> +				if (numa_end == OSQ_UNLOCKED_VAL &&
> +					zx_numa_entry[index].idle == 0) {
> +					cmpxchg(&zx_numa_entry[index].idle,
> +							0, 1);
> +				}
> +			}
> +			return;
> +		}
> +		if (node->next) {
> +			next = xchg(&node->next, NULL);
> +			if (next) {
> +				WRITE_ONCE(next->locked, val);
> +				return;
> +			}
> +		}
> +		cpu_relax();
> +	}
> +}
> +
> +static int numa_lock_release(struct optimistic_spin_queue *qslock,
> +		struct _numa_lock *numa_lock,
> +		struct optimistic_spin_node *node, int cpu)
> +{
> +	struct optimistic_spin_node *next = NULL;
> +	int curr = cpu >> numa_lock->shift;
> +	int numacurr = encode_cpu(curr);
> +
> +	while (1) {
> +		if (atomic_read(&numa_lock->tail) == numacurr &&
> +		    atomic_cmpxchg_acquire(&numa_lock->tail, numacurr,
> +					   OSQ_UNLOCKED_VAL) == numacurr) {
> +			return OSQ_UNLOCKED_VAL;
> +		}
> +
> +		if (node->next) {
> +			next = xchg(&node->next, NULL);
> +			if (next) {
> +				WRITE_ONCE(next->locked, UNLOCK_NUMALOCK);
> +				return 1;
> +			}
> +		}
> +		cpu_relax();
> +	}
> +}
> +
> +inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
> +		 struct _numa_lock *_numa_lock)
> +{
> +	u32 cpu =  smp_processor_id();
> +	struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
> +	int numa = cpu >> _numa_lock->shift;
> +	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
> +	struct _numa_lock *node_lock = _numa_lock + numa;
> +	struct optimistic_spin_node *numa_node =
> +						&(_numa_lock + numa)->osq_node;
> +	struct optimistic_spin_node *next = NULL;
> +	int cur_count = 0;
> +	int numa_end = 0;
> +
> +	cur_count = READ_ONCE(node->locked);
> +
> +	if (cur_count >= osq_node_max - 1) {
> +		numa_end = numa_lock_release(qslock,
> +				numa_lock, numa_node, cpu);
> +		node_lock_release(qslock, node_lock, node,
> +				ACQUIRE_NUMALOCK, cpu, numa_end);
> +		return;
> +	}
> +
> +	next = xchg(&node->next, NULL);
> +	if (next) {
> +		WRITE_ONCE(next->locked, cur_count + 1);
> +		return;
> +	}
> +
> +	numa_end = numa_lock_release(qslock, numa_lock, numa_node, cpu);
> +	node_lock_release(qslock, node_lock, node, ACQUIRE_NUMALOCK,
> +			cpu, numa_end);
> +}

Overall speaking, there are not enough comments to explain exactly what 
are you trying to achieve with these new functions. It makes them hard 
to review.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup.
  2024-09-14  8:53 ` [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup yongli-oc
@ 2024-09-14 17:21   ` Waiman Long
  2024-09-19  9:41     ` yongli-os
  2024-09-15 16:43   ` kernel test robot
  1 sibling, 1 reply; 19+ messages in thread
From: Waiman Long @ 2024-09-14 17:21 UTC (permalink / raw)
  To: yongli-oc, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang


On 9/14/24 04:53, yongli-oc wrote:
> The numa-aware lock kernel memory cache preparation, and a
> workqueue to turn numa-aware lock back to osq lock.
> The /proc interface. Enable dynamic switch by
> echo 1 > /proc/zx_numa_lock/dynamic_enable
>
> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
> ---
>   kernel/locking/zx_numa.c | 537 +++++++++++++++++++++++++++++++++++++++
>   1 file changed, 537 insertions(+)
>   create mode 100644 kernel/locking/zx_numa.c
>
> diff --git a/kernel/locking/zx_numa.c b/kernel/locking/zx_numa.c
> new file mode 100644
> index 000000000000..89df6670a024
> --- /dev/null
> +++ b/kernel/locking/zx_numa.c
> @@ -0,0 +1,537 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Dynamic numa-aware osq lock
> + * Crossing from numa-aware lock to osq_lock
> + * Numa lock memory initialize and /proc interface
> + * Author: LiYong <yongli-oc@zhaoxin.com>
> + *
> + */
> +#include <linux/cpumask.h>
> +#include <asm/byteorder.h>
> +#include <asm/kvm_para.h>
> +#include <linux/percpu.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/osq_lock.h>
> +#include <linux/module.h>
> +#include <linux/proc_fs.h>
> +#include <linux/seq_file.h>
> +#include <linux/uaccess.h>
> +#include <linux/reboot.h>
> +
> +#include "numa.h"
> +#include "numa_osq.h"
> +
> +int enable_zx_numa_osq_lock;
> +struct delayed_work zx_numa_start_work;
> +struct delayed_work zx_numa_cleanup_work;
> +
> +atomic_t numa_count;
> +struct _numa_buf *zx_numa_entry;
> +int zx_numa_lock_total = 256;
> +LIST_HEAD(_zx_numa_head);
> +LIST_HEAD(_zx_numa_lock_head);
> +
> +struct kmem_cache *zx_numa_entry_cachep;
> +struct kmem_cache *zx_numa_lock_cachep;
> +int NUMASHIFT;
> +int NUMACLUSTERS;
> +static atomic_t lockindex;
> +int dynamic_enable;
> +
> +static const struct numa_cpu_info numa_cpu_list[] = {
> +	/*feature1=1, a numa node includes two clusters*/
> +	//{1, 23, X86_VENDOR_AMD, 0, 1},
> +	{0x5b, 7, X86_VENDOR_CENTAUR, 0, 1},
> +	{0x5b, 7, X86_VENDOR_ZHAOXIN, 0, 1}
> +};

Why are this zx_*() code specifically for ZhaoXin and Centaur family of 
CPUs? Are there some special hardware features that are specific to 
these CPUs?

BTW, your patch series lacks performance data to justify the addition of 
quite a lot of complexity to the core locking code. We are unlikely to 
take this without sufficient justification.

Another question that I have is that the base osq_lock() can coexist 
with your xz_osq_lock(). A cpu can dynamically switch from using 
osq_lock() to xz_osq_lock() and vice versa. What happens if some CPUs 
use osq_lock() while others use xz_osq_lock()? Will that cause a 
problem? Have you fully test this scenario to make sure that nothing 
breaks?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup.
  2024-09-14  8:53 ` [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup yongli-oc
  2024-09-14 17:21   ` Waiman Long
@ 2024-09-15 16:43   ` kernel test robot
  1 sibling, 0 replies; 19+ messages in thread
From: kernel test robot @ 2024-09-15 16:43 UTC (permalink / raw)
  To: yongli-oc, mingo, will, longman, boqun.feng
  Cc: llvm, oe-kbuild-all, linux-kernel, yongli, louisqi, cobechen,
	jiangbowang

Hi yongli-oc,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on akpm-mm/mm-nonmm-unstable linus/master v6.11-rc7 next-20240913]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/yongli-oc/locking-osq_lock-The-Kconfig-for-dynamic-numa-aware-osq-lock/20240914-172336
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20240914085327.32912-5-yongli-oc%40zhaoxin.com
patch subject: [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup.
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20240916/202409160059.VIbC9G04-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240916/202409160059.VIbC9G04-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409160059.VIbC9G04-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/locking/zx_numa.c:250:10: warning: variable 'left' set but not used [-Wunused-but-set-variable]
     250 |                                         u32 left = 0;
         |                                             ^
>> kernel/locking/zx_numa.c:375:6: warning: variable 'err' set but not used [-Wunused-but-set-variable]
     375 |         int err = 0;
         |             ^
   2 warnings generated.


vim +/left +250 kernel/locking/zx_numa.c

   203	
   204	static void zx_numa_cleanup(struct work_struct *work)
   205	{
   206		int i = 0;
   207		int checktimes = 2;
   208	
   209		//reboot or power off state
   210		if (READ_ONCE(enable_zx_numa_osq_lock) == 0xf)
   211			return;
   212	
   213		if (atomic_read(&numa_count) == 0) {
   214			if (READ_ONCE(dynamic_enable) != 0)
   215				schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
   216			return;
   217		}
   218	
   219		for (i = 0; i < zx_numa_lock_total; i++) {
   220			int s = 0;
   221			u32 lockaddr = READ_ONCE(zx_numa_entry[i].lockaddr);
   222			u32 type = zx_numa_entry[i].type;
   223			struct _numa_lock *buf =  zx_numa_entry[i].numa_ptr;
   224			int nodes = 0;
   225	
   226			if (lockaddr == 0 || type == 3 || zx_numa_entry[i].idle == 0)
   227				continue;
   228			nodes = buf->numa_nodes;
   229			if (zx_numa_entry[i].idle < checktimes) {
   230	
   231				s = zx_check_numa_dynamic_locked(lockaddr, buf, 1);
   232				if (s != 0) {
   233					zx_numa_entry[i].idle = 1;
   234					continue;
   235				}
   236				zx_numa_entry[i].idle++;
   237			}
   238	
   239			if (zx_numa_entry[i].idle == checktimes) {
   240				zx_numa_lock_stopping(buf);
   241				zx_numa_entry[i].idle++;
   242	
   243			}
   244	
   245			if (zx_numa_entry[i].idle == checktimes+1) {
   246				while (1) {
   247					if (zx_numa_lock64_try_to_freeze(lockaddr, buf,
   248							i) == nodes + 1) {
   249						//all node has been locked
 > 250						u32 left = 0;
   251	
   252						left = atomic_dec_return(&numa_count);
   253						break;
   254					}
   255					cpu_relax(); cpu_relax();
   256					cpu_relax(); cpu_relax();
   257				}
   258			}
   259		}
   260		schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
   261	}
   262	
   263	static int create_numa_buffer_list(int clusters, int len)
   264	{
   265		int i = 0;
   266	
   267		for (i = 0; i < zx_numa_lock_total; i++) {
   268			struct _numa_lock *s = (struct _numa_lock *)kmem_cache_alloc(
   269					zx_numa_lock_cachep, GFP_KERNEL);
   270			if (!s) {
   271				while (i > 0) {
   272					kmem_cache_free(zx_numa_lock_cachep,
   273							zx_numa_entry[i-1].numa_ptr);
   274					i--;
   275				}
   276				return 0;
   277			}
   278			memset((char *)s, 0,
   279				len * L1_CACHE_BYTES * (clusters + NUMAEXPAND));
   280			numa_lock_init_data(s, clusters, NUMA_LOCKED_VAL, 0);
   281			zx_numa_entry[i].numa_ptr = s;
   282			zx_numa_entry[i].lockaddr = 0;
   283			zx_numa_entry[i].highaddr = 0;
   284			zx_numa_entry[i].idle = 0;
   285			zx_numa_entry[i].type = 0;
   286		}
   287	
   288		for (i = 0; i < zx_numa_lock_total; i++) {
   289			zx_numa_entry[i].index = i;
   290			list_add_tail(&(zx_numa_entry[i].list), &_zx_numa_lock_head);
   291		}
   292		return 1;
   293	}
   294	
   295	static int zx_numa_lock_init(int numa)
   296	{
   297		int align = max_t(int, L1_CACHE_BYTES, ARCH_MIN_TASKALIGN);
   298		int d = 0;
   299		int status = 0;
   300	
   301		atomic_set(&lockindex, 0);
   302		atomic_set(&numa_count, 0);
   303	
   304		if (sizeof(struct _numa_lock) & 0x3f)
   305			d = (int)((sizeof(struct _numa_lock) + L1_CACHE_BYTES) /
   306				  L1_CACHE_BYTES);
   307		else
   308			d = (int)(sizeof(struct _numa_lock) / L1_CACHE_BYTES);
   309	
   310		zx_numa_entry_cachep = kmem_cache_create(
   311			"zx_numa_entry",
   312			sizeof(struct _numa_buf) * zx_numa_lock_total, align,
   313			SLAB_PANIC | SLAB_ACCOUNT, NULL);
   314	
   315		zx_numa_lock_cachep = kmem_cache_create(
   316			"zx_numa_lock",
   317			d * L1_CACHE_BYTES * (numa + NUMAEXPAND), align,
   318			SLAB_PANIC | SLAB_ACCOUNT, NULL);
   319	
   320	
   321		if (zx_numa_entry_cachep && zx_numa_lock_cachep) {
   322			zx_numa_entry = (struct _numa_buf *)kmem_cache_alloc(
   323					zx_numa_entry_cachep, GFP_KERNEL);
   324			if (zx_numa_entry) {
   325				memset((char *)zx_numa_entry, 0,
   326					sizeof(struct _numa_buf) * zx_numa_lock_total);
   327				create_numa_buffer_list(numa, d);
   328				status = 1;
   329			}
   330		}
   331	
   332		pr_info("enable dynamic numa-aware osq_lock, clusters %d\n",
   333			numa);
   334		return status;
   335	}
   336	
   337	
   338	#define numa_lock_proc_dir "zx_numa_lock"
   339	#define zx_numa_enable_dir "dynamic_enable"
   340	#define numa_entry_total 8
   341	struct proc_dir_entry *numa_lock_proc;
   342	struct proc_dir_entry *numa_lock_enable;
   343	struct proc_dir_entry *numa_proc_entry[numa_entry_total];
   344	
   345	static ssize_t numa_lock_proc_read(struct file *file,
   346			char __user *usrbuf, size_t len, loff_t *off)
   347	{
   348		int id = (long) pde_data(file_inode(file));
   349		char kbuffer[128];
   350		ssize_t retval = 0;
   351		size_t n = 0;
   352	
   353		memset(kbuffer, 0, sizeof(kbuffer));
   354		if (id == 0)
   355			n = sprintf(kbuffer, "%d\n", READ_ONCE(dynamic_enable));
   356		else if (id == 1)
   357			n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_lock_depth));
   358		else if (id == 2)
   359			n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_keep_times));
   360		else if (id == 3)
   361			n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_node_max));
   362		else if (id == 4)
   363			n = sprintf(kbuffer, "%d\n", atomic_read(&numa_count));
   364		retval = simple_read_from_buffer(usrbuf, len, off, kbuffer, n);
   365	
   366		return retval;
   367	}
   368	
   369	static ssize_t numa_lock_proc_write(struct file *file,
   370			const char __user *buffer, size_t count, loff_t *f_pos)
   371	{
   372		int id = (long) pde_data(file_inode(file));
   373		char kbuffer[128];
   374		unsigned long new = 0;
 > 375		int err = 0;
   376	
   377		memset(kbuffer, 0, sizeof(kbuffer));
   378		if (copy_from_user(kbuffer, buffer, count))
   379			return count;
   380		kbuffer[count] = '\0';
   381		err = kstrtoul(kbuffer, 10, &new);
   382	
   383		if (id == 0) {
   384			int last = READ_ONCE(dynamic_enable);
   385	
   386			if (new < 0 || new >= 2 || last == new)
   387				return count;
   388	
   389			if (last == 0) {
   390				prefetchw(&enable_zx_numa_osq_lock);
   391				//enable to the 2-bytes-tail osq-lock
   392				prefetchw(&enable_zx_numa_osq_lock);
   393				WRITE_ONCE(enable_zx_numa_osq_lock, 2);
   394				schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
   395			}
   396			prefetchw(&dynamic_enable);
   397			WRITE_ONCE(dynamic_enable, new);
   398			return count;
   399		}
   400	
   401		if (READ_ONCE(dynamic_enable) != 0) {
   402			pr_info("dynamic %d: change setting should disable dynamic\n",
   403				dynamic_enable);
   404			return count;
   405		}
   406		if (id == 1 && new > 4 && new <= 32)
   407			WRITE_ONCE(osq_lock_depth, new);
   408		else if (id == 2 && new >= 16 && new <= 2048)
   409			WRITE_ONCE(osq_keep_times, new);
   410		else if (id == 3 && new > 4 && new <= 2048)
   411			WRITE_ONCE(osq_node_max, new);
   412		return count;
   413	}
   414	static int numa_lock_proc_show(struct seq_file *m, void *v)
   415	{
   416		return 0;
   417	}
   418	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock.
  2024-09-14  8:53 ` [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock yongli-oc
  2024-09-14 17:11   ` Waiman Long
@ 2024-09-15 18:44   ` Uros Bizjak
  2024-09-15 21:06     ` Waiman Long
  1 sibling, 1 reply; 19+ messages in thread
From: Uros Bizjak @ 2024-09-15 18:44 UTC (permalink / raw)
  To: yongli-oc, peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang



On 14. 09. 24 10:53, yongli-oc wrote:
> According to the contention level, switches from x_osq_lock to
> numa-aware osq_lock.
> The numa-aware lock is a two level osq_lock.
> The Makefile for dynamic numa-aware osq lock.
> 
> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
> ---
>   kernel/locking/Makefile      |   1 +
>   kernel/locking/numa.h        |  98 ++++++++
>   kernel/locking/numa_osq.h    |  32 +++
>   kernel/locking/x_osq_lock.c  | 332 +++++++++++++++++++++++++++
>   kernel/locking/zx_numa_osq.c | 433 +++++++++++++++++++++++++++++++++++
>   5 files changed, 896 insertions(+)
>   create mode 100644 kernel/locking/numa.h
>   create mode 100644 kernel/locking/numa_osq.h
>   create mode 100644 kernel/locking/x_osq_lock.c
>   create mode 100644 kernel/locking/zx_numa_osq.c

...

> +	if (lock->numa_enable == OSQLOCKSTOPPING && old == OSQ_UNLOCKED_VAL)
> +		old = OSQ_LOCKED_VAL;
> +
> +	for (;;) {
> +		if (READ_ONCE(lock->tail16) == curr &&
> +		    cmpxchg(&lock->tail16, curr, old) == curr) {

I would like to ask if there is any benefit to read the location two 
times? cmpxchg() reads the location and skips the update when curr is 
different from the value at the location by itself. Using try_cmpxchg() 
can produce even more optimized code, since on x86 arch CMPXCHG also 
sets ZF flag when operand 2 is equal to the value at location (and 
update happens), and this flag can be used in a conditional jump.

The above condition could read:

if (try_cmpxchg(&lock->tail16, &curr, old)) {

with an optional temporary variable instead of curr, because 
try_cmpxchg() updates its second argument:

+	for (;;) {
+		u16 tmp = curr;
+		if (try_cmpxchg(&lock->tail16, &tmp, old)) {

Please note that the above form could produce slightly more optimized 
code also on non-x86 targets.

Uros.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock.
  2024-09-15 18:44   ` Uros Bizjak
@ 2024-09-15 21:06     ` Waiman Long
  0 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2024-09-15 21:06 UTC (permalink / raw)
  To: Uros Bizjak, yongli-oc, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

On 9/15/24 14:44, Uros Bizjak wrote:
>
>
> On 14. 09. 24 10:53, yongli-oc wrote:
>> According to the contention level, switches from x_osq_lock to
>> numa-aware osq_lock.
>> The numa-aware lock is a two level osq_lock.
>> The Makefile for dynamic numa-aware osq lock.
>>
>> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
>> ---
>>   kernel/locking/Makefile      |   1 +
>>   kernel/locking/numa.h        |  98 ++++++++
>>   kernel/locking/numa_osq.h    |  32 +++
>>   kernel/locking/x_osq_lock.c  | 332 +++++++++++++++++++++++++++
>>   kernel/locking/zx_numa_osq.c | 433 +++++++++++++++++++++++++++++++++++
>>   5 files changed, 896 insertions(+)
>>   create mode 100644 kernel/locking/numa.h
>>   create mode 100644 kernel/locking/numa_osq.h
>>   create mode 100644 kernel/locking/x_osq_lock.c
>>   create mode 100644 kernel/locking/zx_numa_osq.c
>
> ...
>
>> +    if (lock->numa_enable == OSQLOCKSTOPPING && old == 
>> OSQ_UNLOCKED_VAL)
>> +        old = OSQ_LOCKED_VAL;
>> +
>> +    for (;;) {
>> +        if (READ_ONCE(lock->tail16) == curr &&
>> +            cmpxchg(&lock->tail16, curr, old) == curr) {
>
> I would like to ask if there is any benefit to read the location two 
> times? cmpxchg() reads the location and skips the update when curr is 
> different from the value at the location by itself. Using 
> try_cmpxchg() can produce even more optimized code, since on x86 arch 
> CMPXCHG also sets ZF flag when operand 2 is equal to the value at 
> location (and update happens), and this flag can be used in a 
> conditional jump.

The major reason is for doing a read first before cmpxchg() is to avoid 
the overhead of an atomic operation in case the current task isn't the 
tail. We usually optimize for the case with a lot of incoming lockers 
where the chance of a match isn't particularly high.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] locking/osq_lock: Turn to dynamic switch function from osq_lock/unlock.
  2024-09-14 16:06   ` Waiman Long
@ 2024-09-19  9:32     ` yongli-os
  0 siblings, 0 replies; 19+ messages in thread
From: yongli-os @ 2024-09-19  9:32 UTC (permalink / raw)
  To: Waiman Long, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang


On 2024/9/15 00:06, Waiman Long wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On 9/14/24 04:53, yongli-oc wrote:
>> To support numa-aware osq lock, the struct optimistic_spin_queue
>> is accessed as three members, numa_enable, index, tail16, by union.
>> The size of the struct is the same as before.
>> If dynamic numa-ware lock enable, turns to the crossing, x_osq_lock to
>> check contention level and starts dynamic switch.
>>
>> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
>> ---
>>   include/linux/osq_lock.h  | 33 ++++++++++++++++++++-
>>   kernel/locking/osq_lock.c | 62 +++++++++++++++++++++++++++++++++++++--
>>   2 files changed, 91 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
>> index ea8fb31379e3..37a7bc4ab530 100644
>> --- a/include/linux/osq_lock.h
>> +++ b/include/linux/osq_lock.h
>> @@ -12,14 +12,42 @@ struct optimistic_spin_queue {
>>        * Stores an encoded value of the CPU # of the tail node in the 
>> queue.
>>        * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
>>        */
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +     union {
>> +             atomic_t tail;
>> +             u32 val;
>> +#ifdef __LITTLE_ENDIAN
>> +             struct {
>> +                     u16 tail16;
>> +                     u8 index;
>> +                     u8 numa_enable;
>> +             };
>> +#else
>> +             struct {
>> +                     u8 numa_enable;
>> +                     u8 index;
>> +                     u16 tail16;
>> +             };
>> +#endif
>> +     };
>> +#else
>>       atomic_t tail;
>> +#endif
>>   };
>>
>>   #define OSQ_UNLOCKED_VAL (0)
>>
>>   /* Init macro and function. */
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +
>> +#define OSQ_LOCK_UNLOCKED { .tail = ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
>> +
>> +#else
>> +
>>   #define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
>>
>> +#endif
>> +
>>   static inline void osq_lock_init(struct optimistic_spin_queue *lock)
>>   {
>>       atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
>> @@ -28,9 +56,12 @@ static inline void osq_lock_init(struct 
>> optimistic_spin_queue *lock)
>>   extern bool osq_lock(struct optimistic_spin_queue *lock);
>>   extern void osq_unlock(struct optimistic_spin_queue *lock);
>>
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +extern bool osq_is_locked(struct optimistic_spin_queue *lock);
>> +#else
>>   static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
>>   {
>>       return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
>>   }
>> -
>> +#endif
>>   #endif
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 75a6f6133866..a7b516939e00 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -2,7 +2,10 @@
>>   #include <linux/percpu.h>
>>   #include <linux/sched.h>
>>   #include <linux/osq_lock.h>
>> -
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +#include "numa.h"
>> +#include "numa_osq.h"
>> +#endif
>
> These header files are defined in patch 3. You need to rethink about
> patch ordering in order not to break bisection.
I will move it to the patch 2.
>
>>   /*
>>    * An MCS like lock especially tailored for optimistic spinning for 
>> sleeping
>>    * lock implementations (mutex, rwsem, etc).
>> @@ -12,12 +15,34 @@
>>    * spinning.
>>    */
>>
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>> +/*
>> + * We use the value 0 to represent "no CPU", thus the encoded value
>> + * will be the CPU number incremented by 1.
>> + */
>> +inline int encode_cpu(int cpu_nr)
>> +{
>> +     return cpu_nr + 1;
>> +}
>> +
>> +inline int node_cpu(struct optimistic_spin_node *node)
>> +{
>> +     return node->cpu - 1;
>> +}
>> +
>> +inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>> +{
>> +     int cpu_nr = encoded_cpu_val - 1;
>> +
>> +     return per_cpu_ptr(&osq_node, cpu_nr);
>> +}
>> +#else
>>   struct optimistic_spin_node {
>>       struct optimistic_spin_node *next, *prev;
>>       int locked; /* 1 if lock acquired */
>>       int cpu; /* encoded CPU # + 1 value */
>>   };
>> -
>>   static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, 
>> osq_node);
>>
>>   /*
>> @@ -40,6 +65,7 @@ static inline struct optimistic_spin_node 
>> *decode_cpu(int encoded_cpu_val)
>>
>>       return per_cpu_ptr(&osq_node, cpu_nr);
>>   }
>> +#endif
>>
>>   /*
>>    * Get a stable @node->next pointer, either for unlock() or 
>> unqueue() purposes.
>> @@ -97,6 +123,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>       int curr = encode_cpu(smp_processor_id());
>>       int old;
>>
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +     if (unlikely(enable_zx_numa_osq_lock > 1)) {
>> +             node->numa = 1;
>> +             return x_osq_lock(lock);
>> +     }
>> +     node->numa = 0;
>> +#endif
>> +
>>       node->locked = 0;
>>       node->next = NULL;
>>       node->cpu = curr;
>> @@ -108,6 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>        * the lock tail.
>>        */
>>       old = atomic_xchg(&lock->tail, curr);
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +     if (enable_zx_numa_osq_lock > 0)
>> +     //enable means all cpu cores are less tan 65534.
>> +             old = old & 0xffff;
>> +#endif
>>       if (old == OSQ_UNLOCKED_VAL)
>>               return true;
>>
>> @@ -212,6 +251,14 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>>       struct optimistic_spin_node *node, *next;
>>       int curr = encode_cpu(smp_processor_id());
>>
>> +     node = this_cpu_ptr(&osq_node);
>> +
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +     if (unlikely(enable_zx_numa_osq_lock > 1 &&
>> +             node->numa == 1))
>> +             return x_osq_unlock(lock);
>> +#endif
>> +
>>       /*
>>        * Fast path for the uncontended case.
>>        */
>> @@ -222,7 +269,6 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>>       /*
>>        * Second most likely case.
>>        */
>> -     node = this_cpu_ptr(&osq_node);
>>       next = xchg(&node->next, NULL);
>>       if (next) {
>>               WRITE_ONCE(next->locked, 1);
>> @@ -233,3 +279,13 @@ void osq_unlock(struct optimistic_spin_queue *lock)
>>       if (next)
>>               WRITE_ONCE(next->locked, 1);
>>   }
>> +#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
>> +bool osq_is_locked(struct optimistic_spin_queue *lock)
>> +{
>> +     if (unlikely(enable_zx_numa_osq_lock > 1))
>> +             return x_osq_is_locked(lock);
>> +     return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
>> +}
>> +#endif
>> +
>> +
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock.
  2024-09-14 17:11   ` Waiman Long
@ 2024-09-19  9:35     ` yongli-os
  0 siblings, 0 replies; 19+ messages in thread
From: yongli-os @ 2024-09-19  9:35 UTC (permalink / raw)
  To: Waiman Long, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang


On 2024/9/15 01:11, Waiman Long wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On 9/14/24 04:53, yongli-oc wrote:
>> According to the contention level, switches from x_osq_lock to
>> numa-aware osq_lock.
>> The numa-aware lock is a two level osq_lock.
>> The Makefile for dynamic numa-aware osq lock.
>>
>> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
>> ---
>>   kernel/locking/Makefile      |   1 +
>>   kernel/locking/numa.h        |  98 ++++++++
>>   kernel/locking/numa_osq.h    |  32 +++
>>   kernel/locking/x_osq_lock.c  | 332 +++++++++++++++++++++++++++
>>   kernel/locking/zx_numa_osq.c | 433 +++++++++++++++++++++++++++++++++++
>>   5 files changed, 896 insertions(+)
>>   create mode 100644 kernel/locking/numa.h
>>   create mode 100644 kernel/locking/numa_osq.h
>>   create mode 100644 kernel/locking/x_osq_lock.c
>>   create mode 100644 kernel/locking/zx_numa_osq.c
>>
>> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
>> index 0db4093d17b8..00c1d5bb6eb9 100644
>> --- a/kernel/locking/Makefile
>> +++ b/kernel/locking/Makefile
>> @@ -22,6 +22,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
>>   endif
>>   obj-$(CONFIG_SMP) += spinlock.o
>>   obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
>> +obj-$(CONFIG_LOCK_SPIN_ON_OWNER_NUMA) += x_osq_lock.o zx_numa_osq.o 
>> zx_numa.o
>>   obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
>>   obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
>>   obj-$(CONFIG_RT_MUTEXES) += rtmutex_api.o
>> diff --git a/kernel/locking/numa.h b/kernel/locking/numa.h
>> new file mode 100644
>> index 000000000000..01e5aef3257a
>> --- /dev/null
>> +++ b/kernel/locking/numa.h
>> @@ -0,0 +1,98 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __LINUX_NUMA_LOCK_H
>> +#define __LINUX_NUMA_LOCK_H
>> +#include "mcs_spinlock.h"
>> +
>> +struct optimistic_spin_node {
>> +     struct optimistic_spin_node *next, *prev;
>> +     int numa;
>> +     int locked; /* 1 if lock acquired */
>> +     int cpu; /* encoded CPU # + 1 value */
>> +     u32 serial;
>> +};
>> +
>> +
>> +struct _numa_buf {
>> +     void *numa_ptr;
>> +     struct list_head list;
>> +     u32 lockaddr;
>> +     u32 highaddr;
>> +     u8 idle;
>> +     u8 type;
>> +     u16 index;
>> +};
>> +
>> +struct cache_padding {
>> +     char x[0];
>> +} ____cacheline_internodealigned_in_smp;
>> +#define CACHE_PADDING(name)  struct cache_padding name
> We have struct cacheline_padding and CACHELINE_PADDING() in
> include/linux/cache. What is the point of reinventing the same thing 
> here?
I will include the linux/cache.h header, and delete the definition above.
>> +
>> +struct _numa_lock {
>> +     atomic_t tail ____cacheline_aligned_in_smp;
>> +     atomic_t addr;
>> +     u8 shift;
>> +     u8 stopping;
>> +     u16 numa_nodes;
>> +     u32 accessed;
>> +     uint64_t totalaccessed;
>> +     u32 nodeswitched;
>> +     atomic_t initlock;
>> +     atomic_t pending;
>> +     union {
>> +             struct mcs_spinlock mcs_node;
>> +             struct optimistic_spin_node osq_node;
>> +     };
>> +     CACHE_PADDING(pad);
>> +};
>> +
>> +struct numa_cpu_info {
>> +     __u8    x86_model;
>> +     /* CPU family */
>> +     __u8    x86;
>> +     /* CPU vendor */
>> +     __u8    x86_vendor;
>> +     __u8    x86_reserved;
>> +     u32     feature1;
>> +};
>> +
>> +#define NUMAEXPAND 1
>> +
>> +#define COHORT_START 1
>> +#define ACQUIRE_NUMALOCK (UINT_MAX-1)
>> +#define NODE_WAIT UINT_MAX
>> +#define LOCK_NUMALOCK 1
>> +#define UNLOCK_NUMALOCK 0
>> +
>> +#define NUMALOCKDYNAMIC 0xff
>> +#define TURNTONUMAREADY 0xa5a5
>> +#define NUMATURNBACKREADY 0x5a5a
>> +
>> +#define NUMA_LOCKED_VAL 0xf5efef
> What are these special bit values for?

It is value for NUMA lock state, the value has no special mean,

any value is OK.  I will change it to some ordinary value.


>> +#define NUMA_UNLOCKED_VAL 0
>> +
>> +#define NUMASTEERMASK 0xf0000000
>> +#define HIGH32BITMASK 0xffffffff00000000
>> +#define LOW32MASK 0xffffffff
>> +
>> +extern int NUMASHIFT;
>> +extern int NUMACLUSTERS;
> In the kernel, uppercase names are used for macros. Variables should use
> all lowercase names to avoid confusion.
Yes, I will change to lowercase.
>> +extern int zx_numa_lock_total;
>> +extern struct _numa_buf *zx_numa_entry;
>> +extern atomic_t numa_count;
>> +extern int enable_zx_numa_osq_lock;
>> +extern u32 zx_numa_lock;
>> +extern int dynamic_enable;
>> +extern struct kmem_cache *zx_numa_lock_cachep;
>> +
>> +static inline u32 ptrmask(void *s)
>> +{
>> +     return (uint64_t)s & LOW32MASK;
>> +}
>> +inline void *get_numa_lock(int index);
>> +
>> +int zx_check_numa_dynamic_locked(u32 lockaddr, struct _numa_lock 
>> *_numa_lock,
>> +             int t);
>> +int zx_numa_lock_ptr_get(void *p);
>> +void numa_lock_init_data(struct _numa_lock *s, int clusters, u32 
>> lockval,
>> +             u32 lockaddr);
>> +#endif
>> diff --git a/kernel/locking/numa_osq.h b/kernel/locking/numa_osq.h
>> new file mode 100644
>> index 000000000000..4c99ad60c4e0
>> --- /dev/null
>> +++ b/kernel/locking/numa_osq.h
>> @@ -0,0 +1,32 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __LINUX_NUMA_OSQ_H
>> +#define __LINUX_NUMA_OSQ_H
>> +
>> +#include <linux/osq_lock.h>
>> +#include "mcs_spinlock.h"
>> +
>> +#define OSQLOCKINITED 0
>> +#define OSQTONUMADETECT 0x10
>> +#define OSQLOCKSTOPPING 0xfc
>> +#define OSQ_LOCKED_VAL 0xffff
>> +
>> +extern u16 osq_keep_times;
>> +extern u16 osq_lock_depth;
>> +extern int osq_node_max;
>> +
>> +inline int encode_cpu(int cpu_nr);
>> +inline int node_cpu(struct optimistic_spin_node *node);
>> +inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val);
>> +
>> +void zx_osq_lock_stopping(struct optimistic_spin_queue *lock);
>> +void zx_osq_numa_start(struct optimistic_spin_queue *lock);
>> +void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock);
>> +
>> +bool x_osq_lock(struct optimistic_spin_queue *lock);
>> +void x_osq_unlock(struct optimistic_spin_queue *lock);
>> +bool x_osq_is_locked(struct optimistic_spin_queue *lock);
>> +inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
>> +             struct _numa_lock *n);
>> +inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
>> +             struct _numa_lock *n);
>> +#endif
>> diff --git a/kernel/locking/x_osq_lock.c b/kernel/locking/x_osq_lock.c
>> new file mode 100644
>> index 000000000000..6da8905ae7d6
>> --- /dev/null
>> +++ b/kernel/locking/x_osq_lock.c
>> @@ -0,0 +1,332 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * crossing from osq_lock to numa-aware lock
>> + */
>> +#include <linux/percpu.h>
>> +#include <linux/sched.h>
>> +#include <linux/osq_lock.h>
>> +#include "numa.h"
>> +#include "numa_osq.h"
>> +
>> +u16 osq_lock_depth = 8;
>> +u16 osq_keep_times = 32;
>> +
>> +/*
>> + * An MCS like lock especially tailored for optimistic spinning for 
>> sleeping
>> + * lock implementations (mutex, rwsem, etc).
>> + *
>> + * Using a single mcs node per CPU is safe because sleeping locks 
>> should not be
>> + * called from interrupt context and we have preemption disabled while
>> + * spinning.
>> + */
>> +DECLARE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>> +
>> +/*
>> + * Get a stable @node->next pointer, either for unlock() or 
>> unqueue() purposes.
>> + * Can return NULL in case we were the last queued and we updated 
>> @lock instead.
>> + *
>> + * If osq_lock() is being cancelled there must be a previous node
>> + * and 'old_cpu' is its CPU #.
>> + * For osq_unlock() there is never a previous node and old_cpu is
>> + * set to OSQ_UNLOCKED_VAL.
>> + */
>> +static inline struct optimistic_spin_node *
>> +osq_wait_next_stop(struct optimistic_spin_queue *lock,
>> +           struct optimistic_spin_node *node,
>> +           int old_cpu)
>> +{
>> +     u16 curr = encode_cpu(smp_processor_id());
>> +     u16 old = old_cpu;
>> +
>> +     if (lock->numa_enable == OSQLOCKSTOPPING && old == 
>> OSQ_UNLOCKED_VAL)
>> +             old = OSQ_LOCKED_VAL;
>> +
>> +     for (;;) {
>> +             if (READ_ONCE(lock->tail16) == curr &&
>> +                 cmpxchg(&lock->tail16, curr, old) == curr) {
>> +
>> +                     /*
>> +                      * We were the last queued, we moved @lock 
>> back. @prev
>> +                      * will now observe @lock and will complete its
>> +                      * unlock()/unqueue().
>> +                      */
>> +                     return NULL;
>> +             }
>> +
>> +             /*
>> +              * We must xchg() the @node->next value, because if we 
>> were to
>> +              * leave it in, a concurrent unlock()/unqueue() from
>> +              * @node->next might complete Step-A and think its 
>> @prev is
>> +              * still valid.
>> +              *
>> +              * If the concurrent unlock()/unqueue() wins the race, 
>> we'll
>> +              * wait for either @lock to point to us, through its 
>> Step-B, or
>> +              * wait for a new @node->next from its Step-C.
>> +              */
>> +             if (node->next) {
>> +                     struct optimistic_spin_node *next;
>> +
>> +                     next = xchg(&node->next, NULL);
>> +                     if (next)
>> +                             return next;
>> +             }
>> +
>> +             cpu_relax();
>> +     }
>> +}
>> +
>> +bool x_osq_lock(struct optimistic_spin_queue *lock)
>> +{
>> +     struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>> +     struct optimistic_spin_node *prev, *next;
>> +     int cpu = smp_processor_id();
>> +     u16 curr = encode_cpu(cpu);
>> +     struct optimistic_spin_queue tail;
>> +     u16 old;
>> +
>> +     tail.val = READ_ONCE(lock->val);
>> +     if (unlikely(tail.numa_enable == OSQLOCKSTOPPING)) {
>> +             zx_osq_turn_numa_waiting(lock);
>> +             return x_osq_lock(lock);
>> +     }
>> +
>> +     if (unlikely(tail.numa_enable == NUMALOCKDYNAMIC)) {
>> +             struct _numa_lock *_numa_lock = NULL;
>> +             struct _numa_lock *node_lock = NULL;
>> +
>> +             _numa_lock = get_numa_lock(tail.index);
>> +             node_lock = (struct _numa_lock *) _numa_lock +
>> +                                     (cpu >> NUMASHIFT);
>> +
>> +             prefetch(node_lock);
>> +             return zx_numa_osq_lock(lock, _numa_lock);
>> +     }
>> +
>> +     node->locked = 0;
>> +     node->next = NULL;
>> +     node->cpu = curr;
>> +
>> +     /*
>> +      * We need both ACQUIRE (pairs with corresponding RELEASE in
>> +      * unlock() uncontended, or fastpath) and RELEASE (to publish
>> +      * the node fields we just initialised) semantics when updating
>> +      * the lock tail.
>> +      */
>> +
>> +     if (likely(tail.numa_enable >= OSQTONUMADETECT)) {
>> +             struct optimistic_spin_queue ss;
>> +
>> +             while (1) {
>> +                     ss.val = atomic_read(&lock->tail);
>> +                     if (ss.tail16 == OSQ_LOCKED_VAL) {
>> +                             zx_osq_turn_numa_waiting(lock);
>> +                             return x_osq_lock(lock);
>> +                     }
>> +                     if (cmpxchg(&lock->tail16, ss.tail16, curr)
>> +                                     == ss.tail16) {
>> +                             old = ss.tail16;
>> +                             break;
>> +                     }
>> +                     cpu_relax();
>> +             }
>> +     } else
>> +             old = xchg(&lock->tail16, curr);
>> +
>> +     if (old == OSQ_UNLOCKED_VAL) {
>> +             node->serial = 1;
>> +             return true;
>> +     }
>> +
>> +     prev = decode_cpu(old);
>> +     node->prev = prev;
>> +
>> +     node->serial = prev->serial + 1;
>> +     /*
>> +      * osq_lock()                   unqueue
>> +      *
>> +      * node->prev = prev            osq_wait_next()
>> +      * WMB                          MB
>> +      * prev->next = node            next->prev = prev // unqueue-C
>> +      *
>> +      * Here 'node->prev' and 'next->prev' are the same variable and 
>> we need
>> +      * to ensure these stores happen in-order to avoid corrupting 
>> the list.
>> +      */
>> +     smp_wmb();
>> +
>> +     WRITE_ONCE(prev->next, node);
>> +
>> +     /*
>> +      * Normally @prev is untouchable after the above store; because 
>> at that
>> +      * moment unlock can proceed and wipe the node element from stack.
>> +      *
>> +      * However, since our nodes are static per-cpu storage, we're
>> +      * guaranteed their existence -- this allows us to apply
>> +      * cmpxchg in an attempt to undo our queueing.
>> +      */
>> +
>> +     /*
>> +      * Wait to acquire the lock or cancellation. Note that 
>> need_resched()
>> +      * will come with an IPI, which will wake 
>> smp_cond_load_relaxed() if it
>> +      * is implemented with a monitor-wait. vcpu_is_preempted() 
>> relies on
>> +      * polling, be careful.
>> +      */
>> +     if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
>> + vcpu_is_preempted(node_cpu(node->prev))))
>> +             return true;
>> +
>> +     /* unqueue */
>> +     /*
>> +      * Step - A  -- stabilize @prev
>> +      *
>> +      * Undo our @prev->next assignment; this will make @prev's
>> +      * unlock()/unqueue() wait for a next pointer since @lock 
>> points to us
>> +      * (or later).
>> +      */
>> +
>> +     for (;;) {
>> +             /*
>> +              * cpu_relax() below implies a compiler barrier which 
>> would
>> +              * prevent this comparison being optimized away.
>> +              */
>> +             if (data_race(prev->next) == node &&
>> +                 cmpxchg(&prev->next, node, NULL) == node)
>> +                     break;
>> +
>> +             /*
>> +              * We can only fail the cmpxchg() racing against an 
>> unlock(),
>> +              * in which case we should observe @node->locked becoming
>> +              * true.
>> +              */
>> +             if (smp_load_acquire(&node->locked))
>> +                     return true;
>> +
>> +             cpu_relax();
>> +
>> +             /*
>> +              * Or we race against a concurrent unqueue()'s step-B, 
>> in which
>> +              * case its step-C will write us a new @node->prev 
>> pointer.
>> +              */
>> +             prev = READ_ONCE(node->prev);
>> +     }
>> +
>> +     /*
>> +      * Step - B -- stabilize @next
>> +      *
>> +      * Similar to unlock(), wait for @node->next or move @lock from 
>> @node
>> +      * back to @prev.
>> +      */
>> +
>> +     next = osq_wait_next_stop(lock, node, prev->cpu);
>> +     if (!next)
>> +             return false;
>> +
>> +     /*
>> +      * Step - C -- unlink
>> +      *
>> +      * @prev is stable because its still waiting for a new @prev->next
>> +      * pointer, @next is stable because our @node->next pointer is 
>> NULL and
>> +      * it will wait in Step-A.
>> +      */
>> +
>> +     WRITE_ONCE(next->prev, prev);
>> +     WRITE_ONCE(prev->next, next);
>> +
>> +     return false;
>> +}
>> +
>> +
>> +
>> +void x_osq_unlock(struct optimistic_spin_queue *lock)
>> +{
>> +     struct optimistic_spin_node *node, *next;
>> +     int threadshold = osq_lock_depth;
>> +     int cpu = smp_processor_id();
>> +     u16 curr = encode_cpu(cpu);
>> +     int depth = 0;
>> +     u32 count = 0;
>> +
>> +     if (unlikely(lock->numa_enable == NUMALOCKDYNAMIC)) {
>> +             struct _numa_lock *_numa_lock = 
>> get_numa_lock(lock->index);
>> +
>> +             prefetch((struct _numa_lock *) _numa_lock + (cpu >> 
>> NUMASHIFT));
>> +             return zx_numa_osq_unlock(lock, _numa_lock);
>> +     }
>> +     /*
>> +      * Fast path for the uncontended case.
>> +      */
>> +     if (unlikely(lock->numa_enable == OSQTONUMADETECT)) {
>> +             struct optimistic_spin_node *node_last = NULL;
>> +             u16 tail = 0;
>> +
>> +             tail = cmpxchg(&lock->tail16, curr, OSQ_UNLOCKED_VAL);
>> +             if (tail == curr)
>> +                     return;
>> +
>> +             node = this_cpu_ptr(&osq_node);
>> +             node_last = decode_cpu(tail);
>> +             depth = node_last->serial - node->serial;
>> +             count = READ_ONCE(node->locked);
>> +             if (count > osq_keep_times && (dynamic_enable & 0x1))
>> +                     zx_osq_lock_stopping(lock);
>> +     } else if (unlikely(lock->numa_enable == OSQLOCKSTOPPING)) {
>> +             if (cmpxchg(&lock->tail16, curr, OSQ_LOCKED_VAL)
>> +                                     == curr) {
>> +                     zx_osq_numa_start(lock);
>> +                     return;
>> +             }
>> +     } else {
>> +             struct optimistic_spin_queue t;
>> +
>> +             t.val = 0;
>> +             if (dynamic_enable & 0x1) {
>> +                     if (atomic_read(&numa_count) < zx_numa_lock_total)
>> +                             t.numa_enable = OSQTONUMADETECT;
>> +             }
>> +             if (t.numa_enable == OSQTONUMADETECT) {
>> +                     if (atomic_cmpxchg_release(&lock->tail, curr,
>> +                             (t.val | OSQ_UNLOCKED_VAL)) == curr)
>> +                             return;
>> +             } else if (cmpxchg(&lock->tail16, curr,
>> +                             OSQ_UNLOCKED_VAL) == curr)
>> +                     return;
>> +     }
>> +
>> +     /*
>> +      * Second most likely case.
>> +      */
>> +     node = this_cpu_ptr(&osq_node);
>> +     next = xchg(&node->next, NULL);
>> +     if (next) {
>> +             if (depth > threadshold)
>> +                     WRITE_ONCE(next->locked, count + 1);
>> +             else
>> +                     WRITE_ONCE(next->locked, 1);
>> +             return;
>> +     }
>> +
>> +     next = osq_wait_next_stop(lock, node, OSQ_UNLOCKED_VAL);
>> +     if (next) {
>> +             if (depth > threadshold)
>> +                     WRITE_ONCE(next->locked, count + 1);
>> +             else
>> +                     WRITE_ONCE(next->locked, 1);
>> +     }
>> +}
>> +
>> +bool x_osq_is_locked(struct optimistic_spin_queue *lock)
>> +{
>> +     struct optimistic_spin_queue val;
>> +
>> +     val.val = atomic_read(&lock->tail);
>> +     if (val.tail16 == OSQ_UNLOCKED_VAL)
>> +             return false;
>> +
>> +     if (val.tail16 == OSQ_LOCKED_VAL) {
>> +             if (val.numa_enable != NUMALOCKDYNAMIC)
>> +                     return true;
>> +             return zx_check_numa_dynamic_locked(ptrmask(lock),
>> +                                     get_numa_lock(val.index), 0);
>> +     }
>> +
>> +     return true;
>> +}
> These functions are mostly based on the  osq_* functions with some extra
> code stuffed in. Common code should be shared instead of duplicated.

It uses 16 bit tail, if merge to the osq_lock, it will make osq_lock 
changes too much.

>> diff --git a/kernel/locking/zx_numa_osq.c b/kernel/locking/zx_numa_osq.c
>> new file mode 100644
>> index 000000000000..f4c3dba208d3
>> --- /dev/null
>> +++ b/kernel/locking/zx_numa_osq.c
>> @@ -0,0 +1,433 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Dynamic numa-aware osq lock
>> + * Author: LiYong <yongli-oc@zhaoxin.com>
>> + *
>> + */
>> +#include <linux/cpumask.h>
>> +#include <asm/byteorder.h>
>> +#include <linux/percpu.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/osq_lock.h>
>> +#include "numa.h"
>> +#include "numa_osq.h"
>> +
>> +int osq_node_max = 256;
>> +
>> +/*
>> + * The pending bit spinning loop count.
>> + * This heuristic is used to limit the number of lockword accesses
>> + * made by atomic_cond_read_relaxed when waiting for the lock to
>> + * transition out of the "== _Q_PENDING_VAL" state. We don't spin
>> + * indefinitely because there's no guarantee that we'll make forward
>> + * progress.
>> + */
>> +
>> +static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, 
>> osq_cpu_node);
>> +
>> +/*
>> + * We use the value 0 to represent "no CPU", thus the encoded value
>> + * will be the CPU number incremented by 1.
>> + */
>> +static inline int decode(int cpu_nr)
>> +{
>> +     return cpu_nr - 1;
>> +}
>> +
>> +static inline struct optimistic_spin_node *decode_curr(int 
>> encoded_cpu_val)
>> +{
>> +     int cpu_nr = decode(encoded_cpu_val);
>> +
>> +     return per_cpu_ptr(&osq_cpu_node, cpu_nr);
>> +}
>> +
>> +static int atomic64_cmpxchg_notequal(void *qslock, atomic_t *tail, 
>> int curr)
>> +{
>> +     u64 ss = 0;
>> +     u32 addr = ptrmask(qslock);
>> +     u64 addrcurr = (((u64)addr) << 32) | curr;
>> +
>> +     while (1) {
>> +             ss = atomic64_read((atomic64_t *) tail);
>> +             if ((ss >> 32) != addr)
>> +                     return NUMA_LOCKED_VAL;
>> +             if ((ss & LOW32MASK) == NUMA_LOCKED_VAL)
>> +                     return NUMA_LOCKED_VAL;
>> +             if (atomic64_cmpxchg((atomic64_t *) tail, ss, addrcurr) 
>> == ss)
>> +                     return ss & LOW32MASK;
>> +             cpu_relax();
>> +     }
>> +}
>> +
>> +void zx_osq_lock_stopping(struct optimistic_spin_queue *lock)
>> +{
>> +     int s = 0;
>> +
>> +     s = zx_numa_lock_ptr_get(lock);
>> +     if (s < zx_numa_lock_total) {
>> +             numa_lock_init_data(zx_numa_entry[s].numa_ptr,
>> +                     NUMACLUSTERS, NUMA_UNLOCKED_VAL,
>> +                     ptrmask(lock));
>> +
>> +             WRITE_ONCE(lock->index, s);
>> +             zx_numa_entry[s].type = 1;
>> +             smp_mb();/*should set these before enable*/
>> +             prefetchw(&lock->numa_enable);
>> +             WRITE_ONCE(lock->numa_enable, OSQLOCKSTOPPING);
>> +     } else {
>> +             prefetchw(&lock->numa_enable);
>> +             WRITE_ONCE(lock->numa_enable, OSQLOCKINITED);
>> +     }
>> +}
>> +
>> +void zx_osq_numa_start(struct optimistic_spin_queue *lock)
>> +{
>> +     struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
>> +
>> +     prefetchw(&lock->numa_enable);
>> +     WRITE_ONCE(lock->numa_enable, NUMALOCKDYNAMIC);
>> +     smp_mb(); /*should keep lock->numa_enable modified first*/
>> +     atomic_set(&_numa_lock->initlock, TURNTONUMAREADY);
>> +}
>> +
>> +
>> +void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock)
>> +{
>> +     struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
>> +
>> +     atomic_inc(&_numa_lock->pending);
>> +     while (1) {
>> +             int s = atomic_read(&_numa_lock->initlock);
>> +
>> +             if (s == TURNTONUMAREADY)
>> +                     break;
>> +             cpu_relax();
>> +             cpu_relax();
>> +             cpu_relax();
>> +             cpu_relax();
> We don't usually call cpu_relax() multiple times like that as the actual
> delay is CPU dependent and it is hard to get it right for all. Can you
> explain why it is called exactly 4 times here?

The cpu_relax() is REP NOP. Form the intel instruction set,  the REP NOP 
is PAUSE.

The PAUSE instruction can save some power for the processor. So I often 
write more

  cpu_relax() when busy waiting.

I will change to one cpu_relax(), according to Linux kernel habits.

>> +
>> +     }
>> +     atomic_dec(&_numa_lock->pending);
>> +}
>> +
>> +
>> +
>> +
>> +static struct optimistic_spin_node *
>> +zx_numa_osq_wait_next(struct _numa_lock *lock,
>> +             struct optimistic_spin_node *node,
>> +             struct optimistic_spin_node *prev, int cpu)
>> +{
>> +     struct optimistic_spin_node *next = NULL;
>> +     int curr = encode_cpu(cpu);
>> +     int old;
>> +
>> +     old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
>> +     for (;;) {
>> +             if (atomic_read(&lock->tail) == curr &&
>> +                 atomic_cmpxchg_acquire(&lock->tail, curr, old) == 
>> curr) {
>> +
>> +                     break;
>> +             }
>> +             if (node->next) {
>> +                     next = xchg(&node->next, NULL);
>> +                     if (next)
>> +                             break;
>> +             }
>> +             cpu_relax();
>> +     }
>> +     return next;
>> +}
>> +static void zx_numa_turn_osq_waiting(struct optimistic_spin_queue 
>> *lock,
>> +                     struct _numa_lock *_numa_lock)
>> +{
>> +     struct _numa_lock *numa_lock = _numa_lock + 
>> _numa_lock->numa_nodes;
>> +     int lockaddr = ptrmask(lock);
>> +     u64 s = 0;
>> +     struct optimistic_spin_queue tail;
>> +
>> +     tail.numa_enable = NUMALOCKDYNAMIC;
>> +     tail.index = lock->index;
>> +     tail.tail16 = OSQ_LOCKED_VAL;
>> +     while (1) {
>> +             cpu_relax(); cpu_relax(); cpu_relax(); cpu_relax();
>> +             s = atomic64_read((atomic64_t *) &numa_lock->tail);
>> +             if ((s >> 32) != lockaddr)
>> +                     break;
>> +             if ((s & LOW32MASK) == NUMA_LOCKED_VAL)
>> +                     break;
>> +     }
>> +     prefetchw(&lock->tail);
>> +     if (atomic_cmpxchg(&lock->tail, tail.val, OSQ_UNLOCKED_VAL)
>> +                     == tail.val) {
>> +             ;
>> +     }
>> +
>> +}
>> +
>> +static int _zx_node_osq_lock_internal(struct optimistic_spin_queue 
>> *qslock,
>> +     struct optimistic_spin_node *node, struct optimistic_spin_node 
>> *prev,
>> +     struct _numa_lock *node_lock, int cpu, int *cur_status)
>> +{
>> +     struct optimistic_spin_node *next = NULL;
>> +
>> +     for (;;) {
>> +             struct optimistic_spin_node *node_prev = NULL;
>> +
>> +             if (prev->next == node &&
>> +                 cmpxchg(&prev->next, node, NULL) == node) {
>> +                     break;
>> +             }
>> +             /*load locked first each time*/
>> +             *cur_status = smp_load_acquire(&node->locked);
>> +
>> +             if (*cur_status != NODE_WAIT)
>> +                     return 0; //goto NODE_UNLOCK;
>> +
>> +             cpu_relax();
>> +             node_prev = READ_ONCE(node->prev);
>> +             if (node_prev != prev)
>> +                     prev = node_prev;
>> +     }
>> +
>> +     next = zx_numa_osq_wait_next(node_lock, node, prev, cpu);
>> +     if (!next)
>> +             return -1;
>> +
>> +     WRITE_ONCE(next->prev, prev);
>> +     WRITE_ONCE(prev->next, next);
>> +
>> +     return -1;
>> +}
>> +
>> +static int _zx_node_osq_lock(struct optimistic_spin_queue *qslock,
>> +                                     struct _numa_lock *_numa_lock)
>> +{
>> +     struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
>> +     struct optimistic_spin_node *prev = NULL;
>> +     int cpu = smp_processor_id();
>> +     int curr = encode_cpu(cpu);
>> +     int numa = cpu >> _numa_lock->shift;
>> +     struct _numa_lock *node_lock = _numa_lock + numa;
>> +     int cur_status = 0;
>> +     int old = 0;
>> +
>> +     node->locked = NODE_WAIT;
>> +     node->next = NULL;
>> +     node->cpu = curr;
>> +
>> +     old = atomic64_cmpxchg_notequal(qslock, &node_lock->tail, curr);
>> +
>> +     if (old == NUMA_LOCKED_VAL) {
>> +             bool s = true;
>> +
>> +             zx_numa_turn_osq_waiting(qslock, _numa_lock);
>> +             s = osq_lock(qslock);
>> +             if (s == true)
>> +                     return 1;
>> +             else
>> +                     return -1;
>> +     }
>> +
>> +     if (old == 0) {
>> +             node->locked = COHORT_START;
>> +             return ACQUIRE_NUMALOCK;
>> +     }
>> +
>> +     prev = decode_curr(old);
>> +     node->prev = prev;
>> +
>> +     smp_mb(); /* make sure node set before set pre->next */
>> +
>> +     WRITE_ONCE(prev->next, node);
>> +
>> +     while ((cur_status = READ_ONCE(node->locked)) == NODE_WAIT) {
>> +             if (need_resched() || 
>> vcpu_is_preempted(node_cpu(node->prev))) {
>> +                     int ddd = _zx_node_osq_lock_internal(qslock, 
>> node, prev,
>> +                                             node_lock, cpu, 
>> &cur_status);
>> +
>> +                     if (cur_status != NODE_WAIT)
>> +                             goto NODE_UNLOCK;
>> +                     if (ddd == -1)
>> +                             return -1;
>> +             }
>> +             cpu_relax();
>> +     }
>> +NODE_UNLOCK:
>> +     if (cur_status == ACQUIRE_NUMALOCK)
>> +             node->locked = COHORT_START;
>> +     return cur_status;
>> +}
>> +static int _zx_numa_osq_lock(struct optimistic_spin_queue *qslock, 
>> int cpu,
>> +                             struct _numa_lock *_numa_lock)
>> +{
>> +     int numacpu = cpu >> _numa_lock->shift;
>> +     int numacurr = encode_cpu(numacpu);
>> +
>> +     struct optimistic_spin_node *node = &(_numa_lock + 
>> numacpu)->osq_node;
>> +     struct _numa_lock *numa_lock = _numa_lock + 
>> _numa_lock->numa_nodes;
>> +     struct optimistic_spin_node *prevnode = NULL;
>> +     int prev = 0;
>> +
>> +     node->next = NULL;
>> +     node->locked = LOCK_NUMALOCK;
>> +     node->cpu = numacurr;
>> +
>> +     prev = atomic_xchg(&numa_lock->tail, numacurr);
>> +     if (prev == 0) {
>> +             node->locked = UNLOCK_NUMALOCK;
>> +             return 0;
>> +     }
>> +
>> +     prevnode = &(_numa_lock + prev - 1)->osq_node;
>> +     node->prev = prevnode;
>> +     smp_mb(); /*node->prev should be set before next*/
>> +     WRITE_ONCE(prevnode->next, node);
>> +
>> +     while (READ_ONCE(node->locked) == LOCK_NUMALOCK) {
>> +             cpu_relax();
>> +             cpu_relax();
>> +             cpu_relax();
>> +             cpu_relax();
>> +     }
>> +     return 0;
>> +}
>> +inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
>> +             struct _numa_lock *_numa_lock)
>> +{
>> +     struct _numa_lock *node_lock = NULL;
>> +     int cpu = smp_processor_id();
>> +     int numa = cpu >> _numa_lock->shift;
>> +     int status = 0;
>> +
>> +     node_lock = _numa_lock + numa;
>> +
>> +     if (node_lock->stopping) {
>> +             zx_numa_turn_osq_waiting(qslock, _numa_lock);
>> +             return osq_lock(qslock);
>> +     }
>> +
>> +     status = _zx_node_osq_lock(qslock, _numa_lock);
>> +     if (status == ACQUIRE_NUMALOCK)
>> +             status = _zx_numa_osq_lock(qslock, smp_processor_id(),
>> +                             _numa_lock);
>> +
>> +     if (status == -1)
>> +             return false;
>> +     return true;
>> +}
>> +
>> +static int atomic64_checktail_osq(struct optimistic_spin_queue *qslock,
>> +     struct _numa_lock *node_lock, int ctail)
>> +{
>> +     u64 addr = ((u64)ptrmask(qslock)) << 32;
>> +     u64 addrtail = addr | ctail;
>> +     u64 ss = 0;
>> +     bool mark;
>> +
>> +     ss = atomic64_read((atomic64_t *) &node_lock->tail);
>> +     if (node_lock->stopping == 0)
>> +             mark = (ss == addrtail &&
>> +                     atomic64_cmpxchg_acquire(
>> +                             (atomic64_t *) &node_lock->tail,
>> +                             addrtail, addr|NUMA_UNLOCKED_VAL) == 
>> addrtail);
>> +     else
>> +             mark = (ss == addrtail &&
>> +                     atomic64_cmpxchg_acquire(
>> +                             (atomic64_t *) &node_lock->tail,
>> +                             addrtail, NUMA_LOCKED_VAL) == addrtail);
>> +     return mark;
>> +}
>> +
>> +static void node_lock_release(struct optimistic_spin_queue *qslock,
>> +             struct _numa_lock *node_lock, struct 
>> optimistic_spin_node *node,
>> +             int val, int cpu, int numa_end)
>> +{
>> +     struct optimistic_spin_node *next = NULL;
>> +     int curr = encode_cpu(cpu);
>> +
>> +     while (1) {
>> +             if (atomic64_checktail_osq(qslock, node_lock, curr)) {
>> +                     if (qslock->numa_enable == NUMALOCKDYNAMIC) {
>> +                             int index = qslock->index;
>> +
>> +                             if (numa_end == OSQ_UNLOCKED_VAL &&
>> +                                     zx_numa_entry[index].idle == 0) {
>> + cmpxchg(&zx_numa_entry[index].idle,
>> +                                                     0, 1);
>> +                             }
>> +                     }
>> +                     return;
>> +             }
>> +             if (node->next) {
>> +                     next = xchg(&node->next, NULL);
>> +                     if (next) {
>> +                             WRITE_ONCE(next->locked, val);
>> +                             return;
>> +                     }
>> +             }
>> +             cpu_relax();
>> +     }
>> +}
>> +
>> +static int numa_lock_release(struct optimistic_spin_queue *qslock,
>> +             struct _numa_lock *numa_lock,
>> +             struct optimistic_spin_node *node, int cpu)
>> +{
>> +     struct optimistic_spin_node *next = NULL;
>> +     int curr = cpu >> numa_lock->shift;
>> +     int numacurr = encode_cpu(curr);
>> +
>> +     while (1) {
>> +             if (atomic_read(&numa_lock->tail) == numacurr &&
>> + atomic_cmpxchg_acquire(&numa_lock->tail, numacurr,
>> +                                        OSQ_UNLOCKED_VAL) == 
>> numacurr) {
>> +                     return OSQ_UNLOCKED_VAL;
>> +             }
>> +
>> +             if (node->next) {
>> +                     next = xchg(&node->next, NULL);
>> +                     if (next) {
>> +                             WRITE_ONCE(next->locked, UNLOCK_NUMALOCK);
>> +                             return 1;
>> +                     }
>> +             }
>> +             cpu_relax();
>> +     }
>> +}
>> +
>> +inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
>> +              struct _numa_lock *_numa_lock)
>> +{
>> +     u32 cpu =  smp_processor_id();
>> +     struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
>> +     int numa = cpu >> _numa_lock->shift;
>> +     struct _numa_lock *numa_lock = _numa_lock + 
>> _numa_lock->numa_nodes;
>> +     struct _numa_lock *node_lock = _numa_lock + numa;
>> +     struct optimistic_spin_node *numa_node =
>> +                                             &(_numa_lock + 
>> numa)->osq_node;
>> +     struct optimistic_spin_node *next = NULL;
>> +     int cur_count = 0;
>> +     int numa_end = 0;
>> +
>> +     cur_count = READ_ONCE(node->locked);
>> +
>> +     if (cur_count >= osq_node_max - 1) {
>> +             numa_end = numa_lock_release(qslock,
>> +                             numa_lock, numa_node, cpu);
>> +             node_lock_release(qslock, node_lock, node,
>> +                             ACQUIRE_NUMALOCK, cpu, numa_end);
>> +             return;
>> +     }
>> +
>> +     next = xchg(&node->next, NULL);
>> +     if (next) {
>> +             WRITE_ONCE(next->locked, cur_count + 1);
>> +             return;
>> +     }
>> +
>> +     numa_end = numa_lock_release(qslock, numa_lock, numa_node, cpu);
>> +     node_lock_release(qslock, node_lock, node, ACQUIRE_NUMALOCK,
>> +                     cpu, numa_end);
>> +}
>
> Overall speaking, there are not enough comments to explain exactly what
> are you trying to achieve with these new functions. It makes them hard
> to review.
>
Apologies for so less comments. I will write more in next commit.
> Cheers,
> Longman
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup.
  2024-09-14 17:21   ` Waiman Long
@ 2024-09-19  9:41     ` yongli-os
  2024-09-19 11:28       ` Waiman Long
  0 siblings, 1 reply; 19+ messages in thread
From: yongli-os @ 2024-09-19  9:41 UTC (permalink / raw)
  To: Waiman Long, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang


On 2024/9/15 01:21, Waiman Long wrote:
>
>
> [这封邮件来自外部发件人 谨防风险]
>
> On 9/14/24 04:53, yongli-oc wrote:
>> The numa-aware lock kernel memory cache preparation, and a
>> workqueue to turn numa-aware lock back to osq lock.
>> The /proc interface. Enable dynamic switch by
>> echo 1 > /proc/zx_numa_lock/dynamic_enable
>>
>> Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
>> ---
>>   kernel/locking/zx_numa.c | 537 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 537 insertions(+)
>>   create mode 100644 kernel/locking/zx_numa.c
>>
>> diff --git a/kernel/locking/zx_numa.c b/kernel/locking/zx_numa.c
>> new file mode 100644
>> index 000000000000..89df6670a024
>> --- /dev/null
>> +++ b/kernel/locking/zx_numa.c
>> @@ -0,0 +1,537 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Dynamic numa-aware osq lock
>> + * Crossing from numa-aware lock to osq_lock
>> + * Numa lock memory initialize and /proc interface
>> + * Author: LiYong <yongli-oc@zhaoxin.com>
>> + *
>> + */
>> +#include <linux/cpumask.h>
>> +#include <asm/byteorder.h>
>> +#include <asm/kvm_para.h>
>> +#include <linux/percpu.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/osq_lock.h>
>> +#include <linux/module.h>
>> +#include <linux/proc_fs.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/reboot.h>
>> +
>> +#include "numa.h"
>> +#include "numa_osq.h"
>> +
>> +int enable_zx_numa_osq_lock;
>> +struct delayed_work zx_numa_start_work;
>> +struct delayed_work zx_numa_cleanup_work;
>> +
>> +atomic_t numa_count;
>> +struct _numa_buf *zx_numa_entry;
>> +int zx_numa_lock_total = 256;
>> +LIST_HEAD(_zx_numa_head);
>> +LIST_HEAD(_zx_numa_lock_head);
>> +
>> +struct kmem_cache *zx_numa_entry_cachep;
>> +struct kmem_cache *zx_numa_lock_cachep;
>> +int NUMASHIFT;
>> +int NUMACLUSTERS;
>> +static atomic_t lockindex;
>> +int dynamic_enable;
>> +
>> +static const struct numa_cpu_info numa_cpu_list[] = {
>> +     /*feature1=1, a numa node includes two clusters*/
>> +     //{1, 23, X86_VENDOR_AMD, 0, 1},
>> +     {0x5b, 7, X86_VENDOR_CENTAUR, 0, 1},
>> +     {0x5b, 7, X86_VENDOR_ZHAOXIN, 0, 1}
>> +};
>
> Why are this zx_*() code specifically for ZhaoXin and Centaur family of
> CPUs? Are there some special hardware features that are specific to
> these CPUs?

> Zhaoxin cpu is a x86 architecture processor. The processor has no any

special hardware features about the dynamic numa-aware lock patch.

But since different processor always has different  NUMA architecture

features,  I listed Zhaoxin CPU only.

When I tested the patch, I found the AMD EPYC 7551 is something like

  the Zhaoxin CPU. Both one node has  two clusters,  unlock processes

  in one cluster is much faster than unlock them in NUMA node.

I am not sure if it is fit for AMD CPU or not. so I comment the code for

the AMD CPU.

BTW, your patch series lacks performance data to justify the addition of

> quite a lot of complexity to the core locking code. We are unlikely to
> take this without sufficient justification.
>
In the cover letter,  these is performance test result for AMD EPYC 7551 and

Zhaoxin KH40000. I listed the perf epoll, locktorture mutex, unixbench 
and fxmark.

What test do you think is important for the Lock performance?

I will do more test in next submission.


> Another question that I have is that the base osq_lock() can coexist
> with your xz_osq_lock(). A cpu can dynamically switch from using
> osq_lock() to xz_osq_lock() and vice versa. What happens if some CPUs
> use osq_lock() while others use xz_osq_lock()? Will that cause a
> problem? Have you fully test this scenario to make sure that nothing
> breaks?
> Cheers,
> Longman 

The x_osq_lock uses a 16 bits tail,  the program is the nearly the same as

osq_lock before turning to numa-aware lock. By my opinion, from Intel

instruction set,  the atomic_xchg 32bits and cmpxchg 16 bits, both have

LOCK prefix,  the cacheline for tail are all accessed exclusively.


After dynamic switch enable,  some processes will enter the

x_osq_lock/x_osq_unlock,  if these processes meet queue tail, it will

atomic set the numa_enable to OSQTONUMADETECT. If some processes

are still in osq_lock, the numa_enable will be cleaned by atomic_xchg and

old &= 0xffff;  it will be set again when x_osq_unlock meets queue tail

next time.

After the numa_enable is set to OSQTONUMADETECT, the x_osq_unlock

will start to record contention depth(the serial in queue tail 's

optimistic_spin_node minus it in current unlocked CPU's node). If the depth

is more than osq_lock_depth, it will start increase the locked variable

in struct optimistic_spin_node.  After the locked variable is more than

osq_keep_times, it starts to turn to numa-aware lock.

If some processes in osq_lock/osq_unlock, the locked variable is

always set to 1.

So when set numa_enable to OSQLOCKSTOPPING, start switching to numa-aware

lock, so many lock()/unlock() are finished, all the processes should 
read the

enable_zx_numa_osq_lock as 2, to execute the x_osq_lock().

Consider unnecessarily to enable/disable dynamic switch frequently,

I did not add stopping protection here.


I prefer to use x_osq_lock to replace the osq_lock when

CONFIG_LOCK_SPIN_ON_OWNER_NUMA=y.

As I know,  in x86_64,  with __LOCK prefix,   the performance of 32 bits 
operand

is nearly the same as its of 16 bits operand.  From the test result in 
cover letter,

one or two processes, the performance difference is very little. I do 
not know if it

  is the same for other platform?

Best regards.

Li Yong


>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup.
  2024-09-19  9:41     ` yongli-os
@ 2024-09-19 11:28       ` Waiman Long
  0 siblings, 0 replies; 19+ messages in thread
From: Waiman Long @ 2024-09-19 11:28 UTC (permalink / raw)
  To: yongli-os, peterz, mingo, will, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang


On 9/19/24 05:41, yongli-os wrote:
> BTW, your patch series lacks performance data to justify the addition of
>
>> quite a lot of complexity to the core locking code. We are unlikely to
>> take this without sufficient justification.
>>
> In the cover letter,  these is performance test result for AMD EPYC 
> 7551 and
>
> Zhaoxin KH40000. I listed the perf epoll, locktorture mutex, unixbench 
> and fxmark.
>
> What test do you think is important for the Lock performance?
>
> I will do more test in next submission.

Ah, I was not sent to/cc on the cover-letter. I only got your patches 
1-4. Yes, you did sent out a cover letter with some performance numbers 
after checking the LKML list. I will take a closer look at these 
performance numbers later as I am attending the LPC conference this week.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic
  2024-09-14  8:53 [PATCH 0/4] locking/osq_lock: Update osq_lock to dynamic yongli-oc
                   ` (3 preceding siblings ...)
  2024-09-14  8:53 ` [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup yongli-oc
@ 2024-09-30  7:23 ` yongli-oc
  2024-09-30  7:23   ` [PATCH v2 1/3] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
                     ` (2 more replies)
  4 siblings, 3 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-30  7:23 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

    The series patch changes osq lock to 2 bytes if the
    CONFIG_LOCK_SPIN_ON_OWNER=y, and change some coding problems,
    add more comments.
    Since the 2-byte and the 4-byte osq lock both access the same
    cacheline with LOCK# asserted, the speed should be essentially
    the same.
    To compare the performance for the two kinds of osq lock,
    I use locktorture and set cpu affinity for each mutex_lock
    write kthreads. The result is the an average of 9 times test.

    locktortue-SET CPU AFFINITY   AMD EPYC 7551 32-core, 2 sockets 
    Writers 6.6.28  6.6.28-osq-dynamic disable 6.6.28-osq-dynamic enable
    stress  4-byte          2-byte tail               2-byte tail    
         Average CV      Average   CV  Improve    Average   CV   Improve
     1  21047265 3.48%  21331993 12.16%  1.35%   21359519  6.76%   1.48%
     2  39186677 5.66%  40348197  6.44%  2.96%   39387961  4.18%   0.51%
     4  43467264 3.63%  44133849  4.95%  1.53%   38961218  7.01% -10.37%
     8  43780445 6.67%  48887433  3.31% 11.66%   41725007  5.29%  -4.69%
    16  41407176 4.19%  51042178  3.45% 23.27%   71381112  2.75%  72.39%
    32  46000746 6.63%  50060246 14.19%  8.82%   79361487  2.29%  72.52%
    48  44235011 5.20%  44988160  7.22%  1.70%   79779501  4.88%  80.35%
    64  59054128 4.02%  62233006  2.00%  5.38%  112695286  7.42%  90.83%

    The 2-byte osq lock in 1, 2, 4 threads, the performance is nearly
    the same as the 4-byte lock. 8, 16, 32 threads, the performance
    is better than the 4-byte lock, more threads, the
    performance tends to be the same. If turn on dynamic switching,
    2-byte locks in 4, 8 threads, the performance has a little
    degradation, may be related to the lock contention cumulative,
    but it is never satisfied with the switching conditions.
    If 16 threads or more, the performance improvements approaching
    80 percent.
    
    v1:
    The dynamic numa-aware osq_lock supports numa architecture based on
    the kernel in-box osq_lock.
    After enable by echo 1 > /proc/zx_numa_lock/dynamic_enable, the patch
    keeps checking how many processes are waiting for a osq lock. If more
    than a threshold, the patch stops these processes and switches to the
    numa-aware osq lock, then restarts. By a cleanup work queue,
    numa-aware osq lock turns back to osq_lock when all nodes unlocked,
    all the numa-aware lock memory returns to the pre-allocated Linux
    kernel memory cache.
    
    The struct optimistic_spin_queue of dynamic numa-aware lock is also
    4 bytes, the same as the in-box osq lock. If enable dynamic switch,
    it will be accessed as three members by union. The tail is tail16,
    2 bytes, supports 65534 cpu cores. The other two members are for lock
    switch state and numa memory index, each 1 byte.
    
    The serial are added to the struct optimistic_spin_node to
    know how many processes is waiting an osq lock. Each process applies
    an osq lock, the serial will add 1.
    
    We have done some performance evaluation for the dynamic numa-aware
    osq lock by perf, locktorture, unixbench, fxmark etc.
    fxmark: Filesystem Multicore Scalability Benchmark
    https://github.com/sslab-gatech/fxmark
    
    The following results are tested by Zhaoxin KH40000 32 cores processor
    or 32+32 cores, two sockets processor, and AMD EPYC 7551 32-core
    processor, two sockets. Since I do not know well about AMD CPU, the
    code to support AMD CPU is a sample only.
    
    The number under Average represents an average of test five times.
    The CV is the Coefficient of Variation.
    The kernel source code is 6.6.28 stable, compiled in the default
    configuration.
    The 6.6.28-osq-dynamic is the kernel 6.6.28 with the patch and enable
    dynamic switch.
    The OS is Ubuntu 22.04.02 LTS, gcc version 9.5.0.
    
    perf bench     Zhaoxin KH40000 32 cores
    kernel     6.6.28          6.6.28-osq-dynamic
    epoll  Average  CV       Average  CV    Improve
    ADD     25620  0.78%     64609  2.55%  152.18%
    WAIT     7134  1.77%     11098  0.52%   55.56%
    
    locktortue                Zhaoxin KH40000 32 cores
    kernel                 6.6.28            6.6.28-osq-dynamic
    lock torture        Average   CV     Average      CV    Improve
    mutex_lock Writes   7433503  1.59%   17979058   1.90%   141.87%
    
    unixbench             Zhaoxin KH40000 32+32 cores, run on ssd
    64 copys                          6.6.28           6.6.28-osq-dynamic
    System Benchmarks Partial   Average    CV     Average   CV     Improve
         Execl Throughput       1460.18   1.18%   1865.22  0.25%    27.74%
    File Copy 1024 bufsize 200   549.94   0.62%   1221.32  6.71%   122.08%
    File Copy 256 bufsize 500    339.62   2.20%    896.58  6.57%   164.00%
    File Copy 4096 bufsize 800  1173.68   1.88%   2089.7   5.20%    78.05%
         Pipe Throughput       52122.26   0.18%  53842.72  0.15%     3.30%
    Pipe-based Context Switchi 18340.38   0.92%  19874.66  0.80%     8.37%
         Process Creation       2325.12   0.18%   2178.16  0.21%    -6.32%
    Shell Scripts (1 concurren  7414.32   0.29%   8458.5   0.10%    14.08%
    Shell Scripts (16 concurre                                    
    Shell Scripts (8 concurren  7156.48   0.10%   8132.42  0.14%    13.64%
       System Call Overhead     1476.9    0.14%   1574.32  0.09%     6.60%
    System Benchmarks Index Sc  2982.64   0.33%   4008.66  0.94%    34.40%
    
    fxmark                Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
    parallel cores              32                           24
              6.6.28  vs   6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item  Improve      Average,CV:Average,CV         Improv      CV:CV
    DWAL  -0.17% (  455895, 0.14%:  455115, 0.37%)   -0.07% ( 0.42%: 0.44%)
    DWOL   1.10% (32166648, 2.64%:32521877, 2.06%)   -0.68% ( 2.54%: 3.04%)
    DWOM  51.63% (  496955, 4.34%:  753509, 8.32%)   45.93% ( 3.14%: 2.57%)
    DWSL   1.67% (   20229, 2.34%:   20566, 3.18%)   -1.74% ( 1.96%: 2.66%)
    MWRL  71.00% (  348097, 0.92%:  595241, 1.26%)   65.95% ( 0.65%: 2.27%)
    MWRM  63.06% (    6750, 3.33%:   11007, 4.31%)   60.18% ( 5.67%: 4.81%)
    MWCL  16.99% (  149628, 1.66%:  175054, 0.82%)   16.96% ( 2.57%: 0.51%)
    MWCM  80.97% (    9448, 4.66%:   17098, 0.96%)   73.79% ( 5.37%: 1.79%)
    MWUM  37.73% (   16858, 3.13%:   23220, 3.42%)   31.16% ( 3.59%: 1.62%)
    MWUL  12.83% (   45275, 3.90%:   51083, 3.25%)   19.94% ( 4.19%: 1.98%)
    DWTL  41.44% (   85255, 5.01%:  120583, 9.83%)   45.07% ( 6.42%: 6.11%)
    MRPL  -2.63% (11448731, 1.91%:11147179, 4.18%)   -0.56% ( 1.65%: 3.33%)
    MRPM   0.29% ( 5423233, 1.77%: 5438929, 2.59%)  -10.54% (15.85%:16.42%)
    MRPH  -0.49% (  688629, 2.84%:  685266, 2.88%)  -18.99% (15.00%:24.98%)
    MRDM   8.42% ( 3662627, 0.76%: 3971133, 0.45%)    4.53% ( 1.77%: 1.72%)
    MRDL   6.25% (  530518, 2.75%:  563671, 5.33%)   12.43% (25.88%:26.91%)
    DRBH   7.16% (  388144, 7.88%:  415933,17.87%)  -20.61% (29.12%:21.91%)
    DRBM  -4.34% (  381710, 5.51%:  365159, 3.15%)  -16.93% (27.15%:29.85%)
    DRBL  -0.17% (46227341, 2.50%:46147935, 2.89%)   -4.03% ( 4.01%: 5.30%)
    
    fxmark      Zhaoxin KH40000 32 cores, run on ssd (ssd, ext4)
    parallel cores      2                        1         
    	6.6.28 vs 6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item     Improve    CV:CV          Improve    CV: CV
    DWAL     1.78%  (0.31%:0.20%)      6.36%  (2.52%: 0.67%)
    DWOL     2.46%  (2.26%:2.53%)      1.83%  (2.69%: 3.07%)
    DWOM     2.70%  (2.58%:3.12%)      2.22%  (2.67%: 3.79%)
    DWSL     3.28%  (2.90%:3.38%)      4.41%  (1.32%: 1.36%)
    MWRL    -0.76%  (1.46%:1.94%)     -0.82%  (2.04%: 2.32%)
    MWRM     1.94%  (4.38%:0.89%)     -2.05%  (4.07%: 5.16%)
    MWCL    -0.07%  (1.36%:3.84%)     -2.17%  (1.58%: 3.04%)
    MWCM     1.85%  (2.95%:4.68%)      0.28%  (0.45%: 2.48%)
    MWUM    -2.85%  (1.48%:2.01%)     -3.06%  (1.47%: 1.97%)
    MWUL    -1.46%  (0.58%:2.27%)     -2.98%  (0.71%: 2.11%)
    DWTL     0.40%  (3.89%:4.35%)     -2.68%  (4.04%: 3.15%)
    MRPL     3.11%  (1.38%:0.35%)     -4.81%  (0.32%:16.52%)
    MRPM     2.99%  (0.29%:1.19%)      3.50%  (0.56%: 0.78%)
    MRPH     3.01%  (1.10%:1.42%)      5.06%  (1.18%: 1.73%)
    MRDM    -1.67%  (4.59%:5.58%)     -3.30%  (0.23%: 8.01%)
    MRDL     1.94%  (1.56%:4.39%)     -0.55%  (0.88%: 9.57%)
    DRBH     7.24%  (7.07%:7.10%)      3.36%  (3.30%: 2.95%)
    DRBM     4.40%  (5.11%:0.74%)     -2.55%  (0.46%: 3.28%)
    DRBL     5.50%  (5.58%:0.30%)     -1.00%  (0.71%: 5.21%)
    
    (some tests has more than 10% loss, CV is also more than 10%,
    the result is not stable)
    
    perf bench    AMD EPYC 7551 32-core, 2 sockets
    kernel     6.6.28          6.6.28-osq-dynamic
    epoll  Average  CV       Average  CV    Improve
    ADD     15258  2.30%      62160  2.40%  307.38%
    WAIT     3861  4.20%       6990 16.77%   81.03%
    
    locktortue            AMD EPYC 7551 32-core, 2 sockets
    kernel                 6.6.28            6.6.28-osq-dynamic
    lock torture        Average   CV     Average      CV    Improve
    mutex_lock Writes  10435064  3.14%   22627890   4.92%   116.84%
    
    unixbench     AMD EPYC 7551 32-core, 2 sockets. run on ramdisk
    64 copys                          6.6.28        6.6.28-osq-dynamic
    System Benchmarks Partial   Average   CV      Average   CV   Improve
         Execl Throughput       2677.18  0.90%    3451.76  0.22%  28.93%
    File Copy 1024 bufsize 200   815.2   0.59%    1999.54  0.36% 145.28%
    File Copy 256 bufsize 500    504.6   0.69%    1359.6   0.49% 169.44%
    File Copy 4096 bufsize 800  1842.76  1.24%    3236.48  1.40%  75.63%
         Pipe Throughput       57748.74  0.01%   57539.6   0.03%  -0.36%
    Pipe-based Context Switchi 20882.18  0.57%   20525.38  0.57%  -1.71%
         Process Creation       4523.98  0.20%    4784.98  0.10%   5.77%
    Shell Scripts (1 concurren 13136.54  0.06%   15883.6   0.35%  20.91%
    Shell Scripts (16 concurre                                   
    Shell Scripts (8 concurren 12883.82  0.14%   15640.32  0.20%  21.40%
       System Call Overhead     3533.74  0.04%    3544.16  0.02%   0.29%
    System Benchmarks Index Sc  4809.38  0.23%    6575.44  0.14%  36.72%
    
    fxmark  AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
    parallel cores         64                                32
                6.6.28 vs 6.6.28-osq-dynamic   6.6.28 vs 6.6.28-osq-dynamic
    item  Improve   Average,   CV :  Average,   CV    Improve  CV  :   CV
    DWAL  -0.22% ( 24091112, 0.31%: 24038426, 0.52%)  -0.26% (0.10%: 0.12%)
    DWOL   2.21% ( 86569869, 0.36%: 88479947, 0.27%)   1.99% (0.41%: 0.29%)
    DWOM 210.41% (   425986, 0.77%:  1322320, 0.28%) 128.86% (0.59%: 0.46%)
    DWSL   1.27% ( 70260252, 0.39%: 71149334, 0.37%)   1.19% (0.31%: 0.22%)
    MWRL   0.85% (   489865, 0.22%:   494045, 0.25%)   2.29% (0.12%: 0.33%)
    MWRM  96.28% (   149042, 0.45%:   292540, 3.55%)  60.10% (2.49%: 0.38%)
    MWCL  -5.44% (   772582, 2.92%:   730585, 0.80%)   0.32% (2.41%: 2.56%)
    MWCM  53.89% (   153857, 1.92%:   236774, 0.46%)  23.84% (0.72%: 0.50%)
    MWUM  88.20% (   214551, 3.90%:   403790, 0.41%)  62.81% (0.80%: 1.12%)
    MWUL  -8.26% (   970810, 1.63%:   890615, 1.63%)  -6.73% (3.01%: 1.61%)
    DWTL   5.90% (  5522297, 0.49%:  5847951, 0.18%)   5.03% (0.44%: 0.08%)
    MRPL  -1.10% ( 39707577, 0.07%: 39268812, 0.03%)  -1.30% (0.18%: 0.07%)
    MRPM  -0.63% ( 16446350, 0.47%: 16341936, 0.40%)   0.45% (0.15%: 0.45%)
    MRPH  -0.03% (  3805484, 0.50%:  3804248, 0.12%)   3.02% (1.54%: 0.36%)
    MRDM  49.41% ( 20178742, 1.89%: 30148449, 1.01%)  17.58% (1.19%: 0.85%)
    MRDL  -1.95% (227253170, 0.48%:222825409, 1.34%)  -1.80% (0.32%: 0.54%)
    DRBH   6.01% (  1045587, 1.91%:  1108467, 0.64%)   0.12% (0.13%: 0.30%)
    DRBM   0.65% (117702744, 0.31%:118473408, 0.87%)   1.12% (0.25%: 1.18%)
    DRBL   0.93% (121770444, 0.42%:122905957, 0.25%)   1.59% (0.31%: 0.40%)
    
    fxmark  AMD EPYC 7551 32-core, 2 sockets. run on ramdisk (mem,tmpfs)
    parallel cores      2                        1         
    	6.6.28 vs 6.6.28-osq-dynamic  6.6.28 vs 6.6.28-osq-dynamic
    item     Improve    CV :  CV           Improve    CV :  CV
    DWAL    -0.74%   (0.33%:  0.19%)      -1.02%   (0.19%:  0.34%)
    DWOL     1.50%   (0.36%:  0.44%)       1.89%   (0.30%:  0.36%)
    DWOM    -2.00%   (0.73%:  0.38%)       2.43%   (0.35%:  0.29%)
    DWSL     1.03%   (0.34%:  0.54%)       1.18%   (0.46%:  0.61%)
    MWRL     0.93%   (0.39%:  0.18%)       2.25%   (1.28%:  1.78%)
    MWRM    -0.30%   (0.60%:  0.47%)       0.17%   (0.58%:  0.47%)
    MWCL    -1.28%   (0.41%:  0.66%)      -0.38%   (0.19%:  0.44%)
    MWCM    -1.23%   (0.36%:  0.23%)      -1.42%   (0.41%:  0.54%)
    MWUM    -2.28%   (0.57%:  0.75%)      -1.11%   (0.82%:  0.21%)
    MWUL    -1.87%   (0.64%:  0.50%)      -1.75%   (0.58%:  0.65%)
    DWTL     0.36%   (0.09%:  0.12%)       0.19%   (0.09%:  0.09%)
    MRPL    -1.45%   (0.37%:  0.31%)      -1.35%   (0.12%:  0.54%)
    MRPM    -0.58%   (0.30%:  0.11%)      -1.04%   (0.18%:  0.31%)
    MRPH     0.79%   (3.92%:  0.48%)      -0.53%   (0.68%:  0.33%)
    MRDM    -0.55%   (0.93%:  0.44%)      -0.13%   (0.43%:  0.67%)
    MRDL    -0.11%   (0.56%:  0.19%)       0.68%   (0.71%:  0.49%)
    DRBH     0.09%   (1.31%:  0.87%)       2.75%   (0.68%:  0.45%)
    DRBM     1.09%   (0.19%:  1.05%)       1.60%   (0.15%:  0.72%)
    DRBL     3.26%   (1.00%:  0.56%)       2.34%   (0.36%:  0.23%)

    From the test result, when heavy contention, the performance of
    dynamic numa-aware lock is better than the performance of in-box
    osq_lock. If not too many processes apply a lock, the performance
    is nearly the same as the in-box osq_lock.

    ---
    Changes since v1 (based on Longman reviews)
    #1 Changes the bisection from v1 patchs.
    #2 Modify some code, such as the definition, special value in macro,
       cpu_relax().
    #3 Add some comments.  


yongli-oc (3):
  locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
  locking/osq_lock: Define osq by union to support dynamic numa-aware
    lock.
  locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa
    lock/unlock.

 include/linux/osq_lock.h     |  33 ++-
 kernel/Kconfig.numalocks     |  17 ++
 kernel/locking/Makefile      |   3 +
 kernel/locking/numa.h        |  90 ++++++
 kernel/locking/numa_osq.h    |  29 ++
 kernel/locking/x_osq_lock.c  | 371 ++++++++++++++++++++++++
 kernel/locking/zx_numa.c     | 540 +++++++++++++++++++++++++++++++++++
 kernel/locking/zx_numa_osq.c | 497 ++++++++++++++++++++++++++++++++
 lib/Kconfig.debug            |   1 +
 9 files changed, 1580 insertions(+), 1 deletion(-)
 create mode 100644 kernel/Kconfig.numalocks
 create mode 100644 kernel/locking/numa.h
 create mode 100644 kernel/locking/numa_osq.h
 create mode 100644 kernel/locking/x_osq_lock.c
 create mode 100644 kernel/locking/zx_numa.c
 create mode 100644 kernel/locking/zx_numa_osq.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/3] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock.
  2024-09-30  7:23 ` [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic yongli-oc
@ 2024-09-30  7:23   ` yongli-oc
  2024-09-30  7:23   ` [PATCH v2 2/3] locking/osq_lock: Define osq by union to support dynamic numa-aware lock yongli-oc
  2024-09-30  7:23   ` [PATCH v2 3/3] locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa lock/unlock yongli-oc
  2 siblings, 0 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-30  7:23 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

The make menu to choose if compile the dynamic numa-aware osq lock
or not, default is N.

Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
---
 kernel/Kconfig.numalocks | 17 +++++++++++++++++
 lib/Kconfig.debug        |  1 +
 2 files changed, 18 insertions(+)
 create mode 100644 kernel/Kconfig.numalocks

diff --git a/kernel/Kconfig.numalocks b/kernel/Kconfig.numalocks
new file mode 100644
index 000000000000..feb1751b637c
--- /dev/null
+++ b/kernel/Kconfig.numalocks
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "NUMA Lock Supporting (OSQ)"
+
+config LOCK_SPIN_ON_OWNER_NUMA
+	bool "Enable dynamic numa-aware osq lock"
+	depends on LOCK_SPIN_ON_OWNER && X86_64
+	default n
+	help
+	  According the cpu numa architecture, the numa-aware lock
+	  always unlocks the process waiting on the same numa node
+	  first. It is different from the kernel inbox osq_lock.
+	  The dynamic numa-aware osq lock switchies between osq_lock and
+	  numa-aware lock automatically, according contention level.
+	  Enable: echo 1 > /proc/zx_numa_lock/dynamic_enable.
+
+endmenu #  NUMA Lock Supporting
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 27539c2626bf..cf9344eb61a4 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1337,6 +1337,7 @@ config DEBUG_PREEMPT
 	  depending on workload as it triggers debugging routines for each
 	  this_cpu operation. It should only be used for debugging purposes.
 
+source "kernel/Kconfig.numalocks"
 menu "Lock Debugging (spinlocks, mutexes, etc...)"
 
 config LOCK_DEBUGGING_SUPPORT
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/3] locking/osq_lock: Define osq by union to support dynamic numa-aware lock.
  2024-09-30  7:23 ` [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic yongli-oc
  2024-09-30  7:23   ` [PATCH v2 1/3] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
@ 2024-09-30  7:23   ` yongli-oc
  2024-09-30  7:23   ` [PATCH v2 3/3] locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa lock/unlock yongli-oc
  2 siblings, 0 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-30  7:23 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

    To support numa-aware osq lock, the struct optimistic_spin_queue
    is accessed as three members, numa_enable, index, tail16, by union.
    The size of the struct is the same as before.

Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
---
 include/linux/osq_lock.h | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index ea8fb31379e3..37a7bc4ab530 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -12,14 +12,42 @@ struct optimistic_spin_queue {
 	 * Stores an encoded value of the CPU # of the tail node in the queue.
 	 * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
 	 */
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+	union {
+		atomic_t tail;
+		u32 val;
+#ifdef __LITTLE_ENDIAN
+		struct {
+			u16 tail16;
+			u8 index;
+			u8 numa_enable;
+		};
+#else
+		struct {
+			u8 numa_enable;
+			u8 index;
+			u16 tail16;
+		};
+#endif
+	};
+#else
 	atomic_t tail;
+#endif
 };
 
 #define OSQ_UNLOCKED_VAL (0)
 
 /* Init macro and function. */
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+
+#define OSQ_LOCK_UNLOCKED { .tail = ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
+
+#else
+
 #define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }
 
+#endif
+
 static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 {
 	atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
@@ -28,9 +56,12 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER_NUMA
+extern bool osq_is_locked(struct optimistic_spin_queue *lock);
+#else
 static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
 {
 	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
 }
-
+#endif
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/3] locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa lock/unlock.
  2024-09-30  7:23 ` [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic yongli-oc
  2024-09-30  7:23   ` [PATCH v2 1/3] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
  2024-09-30  7:23   ` [PATCH v2 2/3] locking/osq_lock: Define osq by union to support dynamic numa-aware lock yongli-oc
@ 2024-09-30  7:23   ` yongli-oc
  2 siblings, 0 replies; 19+ messages in thread
From: yongli-oc @ 2024-09-30  7:23 UTC (permalink / raw)
  To: peterz, mingo, will, longman, boqun.feng
  Cc: linux-kernel, yongli, louisqi, cobechen, jiangbowang

    According to the contention level, switches from osq_lock to
    numa-aware osq_lock.
    The numa.h is definition for numa-aware lock.
    The numa_osq.h is definition for osq to numa switching.
    The x_osq_lock.c is the crossing from two bytes osq lock to
    numa-aware lock.
    The zx_numa_osq.c lock is a two level osq_lock.
    zx_numa.c:
    The numa-aware lock kernel memory cache preparation, and a
    workqueue to turn numa-aware lock back to osq lock.
    The /proc interface. Enable dynamic switch by
    echo 1 > /proc/zx_numa_lock/dynamic_enable

    The new Makefile for dynamic numa-aware osq lock.

Signed-off-by: yongli-oc <yongli-oc@zhaoxin.com>
---
 kernel/locking/Makefile      |   3 +
 kernel/locking/numa.h        |  90 ++++++
 kernel/locking/numa_osq.h    |  29 ++
 kernel/locking/x_osq_lock.c  | 371 ++++++++++++++++++++++++
 kernel/locking/zx_numa.c     | 540 +++++++++++++++++++++++++++++++++++
 kernel/locking/zx_numa_osq.c | 497 ++++++++++++++++++++++++++++++++
 6 files changed, 1530 insertions(+)
 create mode 100644 kernel/locking/numa.h
 create mode 100644 kernel/locking/numa_osq.h
 create mode 100644 kernel/locking/x_osq_lock.c
 create mode 100644 kernel/locking/zx_numa.c
 create mode 100644 kernel/locking/zx_numa_osq.c

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 0db4093d17b8..297bfad88fda 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -21,7 +21,10 @@ ifeq ($(CONFIG_PROC_FS),y)
 obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
 endif
 obj-$(CONFIG_SMP) += spinlock.o
+obj-$(CONFIG_LOCK_SPIN_ON_OWNER_NUMA) += x_osq_lock.o zx_numa_osq.o zx_numa.o
+else
 obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
+endif
 obj-$(CONFIG_PROVE_LOCKING) += spinlock.o
 obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o
 obj-$(CONFIG_RT_MUTEXES) += rtmutex_api.o
diff --git a/kernel/locking/numa.h b/kernel/locking/numa.h
new file mode 100644
index 000000000000..790c27ed18e5
--- /dev/null
+++ b/kernel/locking/numa.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NUMA_LOCK_H
+#define __LINUX_NUMA_LOCK_H
+#include <linux/cache.h>
+#include "mcs_spinlock.h"
+
+struct optimistic_spin_node {
+	struct optimistic_spin_node *next, *prev;
+	int locked; /* 1 if lock acquired */
+	int cpu; /* encoded CPU # + 1 value */
+	u32 serial;
+};
+
+
+struct _numa_buf {
+	void *numa_ptr;
+	struct list_head list;
+	u32 lockaddr;
+	u32 highaddr;
+	u8 idle;
+	u8 type;
+	u16 index;
+};
+
+struct _numa_lock {
+	atomic_t tail ____cacheline_aligned_in_smp;
+	atomic_t addr;
+	u8 shift;
+	u8 stopping;
+	u16 numa_nodes;
+	u32 accessed;
+	uint64_t totalaccessed;
+	u32 nodeswitched;
+	atomic_t initlock;
+	atomic_t pending;
+	union {
+		struct mcs_spinlock mcs_node;
+		struct optimistic_spin_node osq_node;
+	};
+	CACHELINE_PADDING(pad);
+};
+
+struct numa_cpu_info {
+	__u8	x86_model;
+	/* CPU family */
+	__u8	x86;
+	/* CPU vendor */
+	__u8	x86_vendor;
+	__u8	x86_reserved;
+	u32	feature1;
+};
+
+#define NUMAEXPAND 1
+
+#define COHORT_START 1
+#define ACQUIRE_NUMALOCK (UINT_MAX-1)
+#define NODE_WAIT UINT_MAX
+#define LOCK_NUMALOCK 1
+#define UNLOCK_NUMALOCK 0
+
+#define NUMALOCKDYNAMIC 0xff
+#define TURNTONUMAREADY 0xa
+
+#define NUMA_LOCKED_VAL 0xffffff
+#define NUMA_UNLOCKED_VAL 0
+
+#define HIGH32BITMASK 0xffffffff00000000
+#define LOW32MASK 0xffffffff
+
+extern int numa_shift;
+extern int numa_clusters;
+extern int zx_numa_lock_total;
+extern struct _numa_buf *zx_numa_entry;
+extern atomic_t numa_count;
+extern int enable_zx_numa_osq_lock;
+extern u32 zx_numa_lock;
+extern int dynamic_enable;
+extern struct kmem_cache *zx_numa_lock_cachep;
+
+static inline u32 ptrmask(void *s)
+{
+	return (uint64_t)s & LOW32MASK;
+}
+inline void *get_numa_lock(int index);
+
+int zx_check_numa_dynamic_locked(u32 lockaddr, struct _numa_lock *_numa_lock);
+int zx_numa_lock_ptr_get(void *p);
+void numa_lock_init_data(struct _numa_lock *s, int clusters, u32 lockval,
+		u32 lockaddr);
+#endif
diff --git a/kernel/locking/numa_osq.h b/kernel/locking/numa_osq.h
new file mode 100644
index 000000000000..5c4675abc4fc
--- /dev/null
+++ b/kernel/locking/numa_osq.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NUMA_OSQ_H
+#define __LINUX_NUMA_OSQ_H
+
+#include <linux/osq_lock.h>
+#include "mcs_spinlock.h"
+
+#define OSQLOCKINITED 0
+#define OSQTONUMADETECT 0x10
+#define OSQLOCKSTOPPING 0xfc
+#define OSQ_LOCKED_VAL 0xffff
+
+extern u16 osq_keep_times;
+extern u16 osq_lock_depth;
+extern int osq_node_max;
+
+inline int encode_cpu(int cpu_nr);
+inline int node_cpu(struct optimistic_spin_node *node);
+inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val);
+
+void zx_osq_lock_stopping(struct optimistic_spin_queue *lock);
+void zx_osq_numa_start(struct optimistic_spin_queue *lock);
+void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock);
+
+inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *n);
+inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *n);
+#endif
diff --git a/kernel/locking/x_osq_lock.c b/kernel/locking/x_osq_lock.c
new file mode 100644
index 000000000000..82cde1f6355b
--- /dev/null
+++ b/kernel/locking/x_osq_lock.c
@@ -0,0 +1,371 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * crossing from osq_lock to numa-aware lock
+ */
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/osq_lock.h>
+#include "numa.h"
+#include "numa_osq.h"
+
+u16 osq_lock_depth = 8;
+u16 osq_keep_times = 32;
+
+/*
+ * An MCS like lock especially tailored for optimistic spinning for sleeping
+ * lock implementations (mutex, rwsem, etc).
+ *
+ * Using a single mcs node per CPU is safe because sleeping locks should not be
+ * called from interrupt context and we have preemption disabled while
+ * spinning.
+ */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
+/*
+ * We use the value 0 to represent "no CPU", thus the encoded value
+ * will be the CPU number incremented by 1.
+ */
+inline int encode_cpu(int cpu_nr)
+{
+	return cpu_nr + 1;
+}
+
+inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
+inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
+{
+	int cpu_nr = encoded_cpu_val - 1;
+
+	return per_cpu_ptr(&osq_node, cpu_nr);
+}
+
+/*
+ * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
+ * Can return NULL in case we were the last queued and we updated @lock instead.
+ *
+ * If osq_lock() is being cancelled there must be a previous node
+ * and 'old_cpu' is its CPU #.
+ * For osq_unlock() there is never a previous node and old_cpu is
+ * set to OSQ_UNLOCKED_VAL.
+ * If the osq is in OSQLOCKSTOPPING state,
+ * the tail will be set to OSQ_LOCKED_VAL to stop the queue.
+ */
+static inline struct optimistic_spin_node *
+osq_wait_next_stop(struct optimistic_spin_queue *lock,
+	      struct optimistic_spin_node *node,
+	      int old_cpu)
+{
+	u16 curr = encode_cpu(smp_processor_id());
+	u16 old = old_cpu;
+
+	if (lock->numa_enable == OSQLOCKSTOPPING && old == OSQ_UNLOCKED_VAL)
+		old = OSQ_LOCKED_VAL;
+
+	for (;;) {
+		if (READ_ONCE(lock->tail16) == curr &&
+		    cmpxchg(&lock->tail16, curr, old) == curr) {
+
+			/*
+			 * We were the last queued, we moved @lock back. @prev
+			 * will now observe @lock and will complete its
+			 * unlock()/unqueue().
+			 */
+			return NULL;
+		}
+
+		/*
+		 * We must xchg() the @node->next value, because if we were to
+		 * leave it in, a concurrent unlock()/unqueue() from
+		 * @node->next might complete Step-A and think its @prev is
+		 * still valid.
+		 *
+		 * If the concurrent unlock()/unqueue() wins the race, we'll
+		 * wait for either @lock to point to us, through its Step-B, or
+		 * wait for a new @node->next from its Step-C.
+		 */
+		if (node->next) {
+			struct optimistic_spin_node *next;
+
+			next = xchg(&node->next, NULL);
+			if (next)
+				return next;
+		}
+
+		cpu_relax();
+	}
+}
+
+bool osq_lock(struct optimistic_spin_queue *lock)
+{
+	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_node *prev, *next;
+	int cpu = smp_processor_id();
+	u16 curr = encode_cpu(cpu);
+	struct optimistic_spin_queue tail;
+	u16 old;
+
+	tail.val = READ_ONCE(lock->val);
+	if (unlikely(tail.numa_enable == OSQLOCKSTOPPING)) {
+		zx_osq_turn_numa_waiting(lock);
+		return osq_lock(lock);
+	}
+
+	if (unlikely(tail.numa_enable == NUMALOCKDYNAMIC)) {
+		struct _numa_lock *_numa_lock = NULL;
+		struct _numa_lock *node_lock = NULL;
+
+		_numa_lock = get_numa_lock(tail.index);
+		node_lock = (struct _numa_lock *) _numa_lock +
+					(cpu >> numa_shift);
+
+		prefetch(node_lock);
+		return zx_numa_osq_lock(lock, _numa_lock);
+	}
+
+	node->locked = 0;
+	node->next = NULL;
+	node->cpu = curr;
+	node->serial = 0;
+
+	/*
+	 * if ss.tail16 is OSQ_LOCKED_VAL, keeps the LOCKED state
+	 * and waiting numa-aware lock ready.
+	 */
+
+	if (likely(tail.numa_enable >= OSQTONUMADETECT)) {
+		struct optimistic_spin_queue ss;
+
+		while (1) {
+			ss.val = atomic_read(&lock->tail);
+			if (ss.tail16 == OSQ_LOCKED_VAL) {
+				zx_osq_turn_numa_waiting(lock);
+				return osq_lock(lock);
+			}
+			if (cmpxchg(&lock->tail16, ss.tail16, curr)
+					== ss.tail16) {
+				old = ss.tail16;
+				break;
+			}
+			cpu_relax();
+		}
+	} else
+		old = xchg(&lock->tail16, curr);
+
+	if (old == OSQ_UNLOCKED_VAL) {
+		node->serial = 1;
+		return true;
+	}
+
+	prev = decode_cpu(old);
+	node->prev = prev;
+
+	// Record osq serial number for the lock.
+	node->serial = prev->serial + 1;
+	/*
+	 * osq_lock()			unqueue
+	 *
+	 * node->prev = prev		osq_wait_next()
+	 * WMB				MB
+	 * prev->next = node		next->prev = prev // unqueue-C
+	 *
+	 * Here 'node->prev' and 'next->prev' are the same variable and we need
+	 * to ensure these stores happen in-order to avoid corrupting the list.
+	 */
+	smp_wmb();
+
+	WRITE_ONCE(prev->next, node);
+
+	/*
+	 * Normally @prev is untouchable after the above store; because at that
+	 * moment unlock can proceed and wipe the node element from stack.
+	 *
+	 * However, since our nodes are static per-cpu storage, we're
+	 * guaranteed their existence -- this allows us to apply
+	 * cmpxchg in an attempt to undo our queueing.
+	 */
+
+	/*
+	 * Wait to acquire the lock or cancellation. Note that need_resched()
+	 * will come with an IPI, which will wake smp_cond_load_relaxed() if it
+	 * is implemented with a monitor-wait. vcpu_is_preempted() relies on
+	 * polling, be careful.
+	 */
+	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+				  vcpu_is_preempted(node_cpu(node->prev))))
+		return true;
+
+	/* unqueue */
+	/*
+	 * Step - A  -- stabilize @prev
+	 *
+	 * Undo our @prev->next assignment; this will make @prev's
+	 * unlock()/unqueue() wait for a next pointer since @lock points to us
+	 * (or later).
+	 */
+
+	for (;;) {
+		/*
+		 * cpu_relax() below implies a compiler barrier which would
+		 * prevent this comparison being optimized away.
+		 */
+		if (data_race(prev->next) == node &&
+		    cmpxchg(&prev->next, node, NULL) == node)
+			break;
+
+		/*
+		 * We can only fail the cmpxchg() racing against an unlock(),
+		 * in which case we should observe @node->locked becoming
+		 * true.
+		 */
+		if (smp_load_acquire(&node->locked))
+			return true;
+
+		cpu_relax();
+
+		/*
+		 * Or we race against a concurrent unqueue()'s step-B, in which
+		 * case its step-C will write us a new @node->prev pointer.
+		 */
+		prev = READ_ONCE(node->prev);
+	}
+
+	/*
+	 * Step - B -- stabilize @next
+	 *
+	 * Similar to unlock(), wait for @node->next or move @lock from @node
+	 * back to @prev.
+	 */
+
+	next = osq_wait_next_stop(lock, node, prev->cpu);
+	if (!next)
+		return false;
+
+	/*
+	 * Step - C -- unlink
+	 *
+	 * @prev is stable because its still waiting for a new @prev->next
+	 * pointer, @next is stable because our @node->next pointer is NULL and
+	 * it will wait in Step-A.
+	 */
+
+	WRITE_ONCE(next->prev, prev);
+	WRITE_ONCE(prev->next, next);
+
+	return false;
+}
+
+/*
+ * In osq_unlock(), changes the osq state for switching, then unlock next.
+ * OSQTONUMADETECT: Starts to detect lock contention when first osq tail reached
+ *	after dynamic enable.
+ * OSQLOCKSTOPPING: If lock contention keeps more than osq_lock_depth
+ *	osq_keep_times, set the osq to STOPPING state and get
+ *	numa-aware lock memory entry.
+ * NUMALOCKDYNAMIC: numa-aware lock initialized and ready.
+ */
+
+void osq_unlock(struct optimistic_spin_queue *lock)
+{
+	struct optimistic_spin_node *node, *next;
+	int threadshold = osq_lock_depth;
+	int cpu = smp_processor_id();
+	u16 curr = encode_cpu(cpu);
+	int depth = 0;
+	u32 count = 0;
+
+	if (unlikely(lock->numa_enable == NUMALOCKDYNAMIC)) {
+		struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
+
+		prefetch((struct _numa_lock *) _numa_lock + (cpu >> numa_shift));
+		return zx_numa_osq_unlock(lock, _numa_lock);
+	}
+	/*
+	 * Fast path for the uncontended case.
+	 */
+	if (unlikely(lock->numa_enable == OSQTONUMADETECT)) {
+		struct optimistic_spin_node *node_last = NULL;
+		u16 tail = 0;
+
+		tail = cmpxchg(&lock->tail16, curr, OSQ_UNLOCKED_VAL);
+		if (tail == curr)
+			return;
+
+		node = this_cpu_ptr(&osq_node);
+		node_last = decode_cpu(tail);
+		//Get the contention level
+		depth = node_last->serial - node->serial;
+		count = READ_ONCE(node->locked);
+		if (count > osq_keep_times && (dynamic_enable & 0x1))
+			zx_osq_lock_stopping(lock);
+	} else if (unlikely(lock->numa_enable == OSQLOCKSTOPPING)) {
+		if (cmpxchg(&lock->tail16, curr, OSQ_LOCKED_VAL)
+					== curr) {
+			//All osq stopped, start to run as numa-aware lock.
+			zx_osq_numa_start(lock);
+			return;
+		}
+	} else {
+		struct optimistic_spin_queue t;
+
+		/*
+		 * After dynamic enable, when osq reaches tail, set the osq
+		 * to DETECT mode to detect the contention.
+		 */
+		t.val = 0;
+		if (dynamic_enable & 0x1) {
+			if (atomic_read(&numa_count) < zx_numa_lock_total)
+				t.numa_enable = OSQTONUMADETECT;
+		}
+		if (t.numa_enable == OSQTONUMADETECT) {
+			if (atomic_cmpxchg_release(&lock->tail, curr,
+				(t.val | OSQ_UNLOCKED_VAL)) == curr)
+				return;
+		} else if (cmpxchg(&lock->tail16, curr,
+				OSQ_UNLOCKED_VAL) == curr)
+			return;
+	}
+
+	/*
+	 * Second most likely case.
+	 */
+	node = this_cpu_ptr(&osq_node);
+	next = xchg(&node->next, NULL);
+	if (next) {
+		if (depth > threadshold)
+			WRITE_ONCE(next->locked, count + 1);
+		else
+			WRITE_ONCE(next->locked, 1);
+		return;
+	}
+
+	next = osq_wait_next_stop(lock, node, OSQ_UNLOCKED_VAL);
+	if (next) {
+		if (depth > threadshold)
+			WRITE_ONCE(next->locked, count + 1);
+		else
+			WRITE_ONCE(next->locked, 1);
+	}
+}
+
+bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+	struct optimistic_spin_queue val;
+
+	val.val = atomic_read(&lock->tail);
+	if (val.tail16 == OSQ_UNLOCKED_VAL)
+		return false;
+
+	if (val.tail16 == OSQ_LOCKED_VAL) {
+
+		//state changing
+		if (val.numa_enable != NUMALOCKDYNAMIC)
+			return true;
+
+		return zx_check_numa_dynamic_locked(ptrmask(lock),
+					get_numa_lock(val.index));
+	}
+
+	return true;
+}
diff --git a/kernel/locking/zx_numa.c b/kernel/locking/zx_numa.c
new file mode 100644
index 000000000000..31d247261474
--- /dev/null
+++ b/kernel/locking/zx_numa.c
@@ -0,0 +1,540 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Dynamic numa-aware osq lock
+ * Crossing from numa-aware lock to osq_lock
+ * Numa lock memory initialize and /proc interface
+ * Author: LiYong <yongli-oc@zhaoxin.com>
+ *
+ */
+#include <linux/cpumask.h>
+#include <asm/byteorder.h>
+#include <asm/kvm_para.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/osq_lock.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/uaccess.h>
+#include <linux/reboot.h>
+
+#include "numa.h"
+#include "numa_osq.h"
+
+int enable_zx_numa_osq_lock;
+struct delayed_work zx_numa_start_work;
+struct delayed_work zx_numa_cleanup_work;
+
+atomic_t numa_count;
+struct _numa_buf *zx_numa_entry;
+int zx_numa_lock_total = 256;
+LIST_HEAD(_zx_numa_head);
+LIST_HEAD(_zx_numa_lock_head);
+
+struct kmem_cache *zx_numa_entry_cachep;
+struct kmem_cache *zx_numa_lock_cachep;
+int numa_shift;
+int numa_clusters;
+static atomic_t lockindex;
+int dynamic_enable;
+
+static const struct numa_cpu_info numa_cpu_list[] = {
+	/*feature1=1, a numa node includes two clusters*/
+	//{1, 23, X86_VENDOR_AMD, 0, 1},
+	{0x5b, 7, X86_VENDOR_CENTAUR, 0, 1},
+	{0x5b, 7, X86_VENDOR_ZHAOXIN, 0, 1}
+};
+
+inline void *get_numa_lock(int index)
+{
+	if (index >= 0 && index < zx_numa_lock_total)
+		return zx_numa_entry[index].numa_ptr;
+	else
+		return NULL;
+}
+
+static int zx_get_numa_shift(int all_cpus, int clusters)
+{
+	int cpus = (int) all_cpus/clusters;
+	int count = 0;
+
+	while (cpus) {
+		cpus >>= 1;
+		count++;
+	}
+	return count-1;
+}
+
+void numa_lock_init_data(struct _numa_lock *s, int clusters,
+			u32 lockval, u32 lockaddr)
+{
+	int j = 0;
+
+	for (j = 0; j < clusters + NUMAEXPAND; j++) {
+		atomic_set(&(s + j)->tail, lockval);
+		atomic_set(&(s + j)->addr, lockaddr);
+		(s + j)->shift = numa_shift;
+		(s + j)->stopping = 0;
+		(s + j)->numa_nodes = clusters;
+		(s + j)->accessed = 0;
+		(s + j)->totalaccessed = 0;
+		(s + j)->nodeswitched = 0;
+		atomic_set(&(s + j)->initlock, 0);
+		atomic_set(&(s + j)->pending, 0);
+	}
+}
+/*
+ * The lockaddr of zx_numa_enry is key value to know which index is occupied.
+ */
+int zx_numa_lock_ptr_get(void *p)
+{
+	int i = 0;
+	int index = 0;
+
+	if (atomic_read(&numa_count) >= zx_numa_lock_total)
+		return zx_numa_lock_total;
+
+	index = atomic_inc_return(&lockindex);
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		if (index >= zx_numa_lock_total)
+			index = 0;
+		if (cmpxchg(&zx_numa_entry[index].lockaddr,
+					0, ptrmask(p)) == 0) {
+			while (1) {
+				struct _numa_lock *node_lock =
+					zx_numa_entry[index].numa_ptr;
+				struct _numa_lock *numa_lock = node_lock +
+						node_lock->numa_nodes;
+
+				if (atomic_read(&numa_lock->tail) ==
+								NUMA_LOCKED_VAL)
+					break;
+				cpu_relax();
+
+			}
+			atomic_inc(&numa_count);
+			zx_numa_entry[index].highaddr = ((u64)p) >> 32;
+			atomic_set(&lockindex, index);
+			return index;
+		}
+		index++;
+		if (atomic_read(&numa_count) >= zx_numa_lock_total)
+			break;
+	}
+	return zx_numa_lock_total;
+}
+
+int zx_check_numa_dynamic_locked(u32 lockaddr,
+		struct _numa_lock *_numa_lock)
+{
+	struct _numa_lock *node_lock = NULL;
+	u64 s = -1;
+	int i = 0;
+
+	//in switching if the pending is not 0
+	if (atomic_read(&_numa_lock->pending) != 0)
+		return 1;
+
+	for (i = 0; i < _numa_lock->numa_nodes + 1; i++) {
+		node_lock = _numa_lock + i;
+		cpu_relax();
+		s = atomic64_read((atomic64_t *) &node_lock->tail);
+		if ((s >> 32) != lockaddr)
+			continue;
+		if ((s & LOW32MASK) == NUMA_LOCKED_VAL
+				|| (s & LOW32MASK) == NUMA_UNLOCKED_VAL)
+			continue;
+		break;
+	}
+
+	if (i == _numa_lock->numa_nodes + 1)
+		return 0;
+	return i+1;
+}
+
+static int zx_numa_lock64_try_to_freeze(u32 lockaddr, struct _numa_lock *_numa_lock,
+			int index)
+{
+	struct _numa_lock *node_lock = NULL;
+	u64 addr = ((u64)lockaddr) << 32;
+	u64 s = 0;
+	u64 ff = 0;
+	int i = 0;
+
+	//check and set the tail to LOCKED from first node to last node,
+	//if, all node tail are LOCKED, try to set the NUMA node to LOCKED
+	for (i = 0; i < _numa_lock->numa_nodes+1; i++) {
+		node_lock = _numa_lock + i;
+		cpu_relax();
+
+		s = atomic64_read((atomic64_t *)&node_lock->tail);
+		if ((s & HIGH32BITMASK) != addr)
+			continue;
+
+		if ((s & LOW32MASK) == NUMA_LOCKED_VAL)
+			continue;
+
+		if ((s & LOW32MASK) == NUMA_UNLOCKED_VAL) {
+			ff = atomic64_cmpxchg((atomic64_t *)&node_lock->tail,
+				(addr|NUMA_UNLOCKED_VAL), NUMA_LOCKED_VAL);
+			if (ff == (addr|NUMA_UNLOCKED_VAL))
+				continue;
+		}
+		break;
+	}
+	//All node's tail and numa node's tail are LOCKED, set numa lock memory
+	//referenced by index to un-occupied.
+	if (i == _numa_lock->numa_nodes + 1) {
+		zx_numa_entry[index].idle = 0;
+		zx_numa_entry[index].type = 0;
+		zx_numa_entry[index].highaddr = 0;
+		xchg(&zx_numa_entry[index].lockaddr, 0);
+	}
+
+	return i;
+}
+
+static void zx_numa_lock_stopping(struct _numa_lock *_numa_lock)
+{
+	struct _numa_lock *node_lock = NULL;
+	int i = 0;
+
+	for (i = 0; i < _numa_lock->numa_nodes+1; i++) {
+		node_lock = _numa_lock + i;
+		WRITE_ONCE(node_lock->stopping, 1);
+	}
+}
+
+static void zx_numa_cleanup(struct work_struct *work)
+{
+	int i = 0;
+	int checktimes = 2;
+
+	//reboot or power off state
+	if (READ_ONCE(enable_zx_numa_osq_lock) == 0xf)
+		return;
+
+	//If dynamic enable and numa-aware lock count is 0, reschedule the
+	//workqueue.
+	if (atomic_read(&numa_count) == 0) {
+		if (READ_ONCE(dynamic_enable) != 0)
+			schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
+		return;
+	}
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		int s = 0;
+		u32 lockaddr = READ_ONCE(zx_numa_entry[i].lockaddr);
+		u32 type = zx_numa_entry[i].type;
+		struct _numa_lock *buf =  zx_numa_entry[i].numa_ptr;
+		int nodes = 0;
+
+		if (lockaddr == 0 || type == 3 || zx_numa_entry[i].idle == 0)
+			continue;
+		nodes = buf->numa_nodes;
+		if (zx_numa_entry[i].idle < checktimes) {
+			//check if all node is idle
+			s = zx_check_numa_dynamic_locked(lockaddr, buf);
+			if (s != 0) {
+				zx_numa_entry[i].idle = 1;
+				continue;
+			}
+			zx_numa_entry[i].idle++;
+		}
+
+		if (zx_numa_entry[i].idle == checktimes) {
+			//set each node to stopping mode
+			zx_numa_lock_stopping(buf);
+			zx_numa_entry[i].idle++;
+
+		}
+
+		if (zx_numa_entry[i].idle == checktimes+1) {
+			while (1) {
+				//try to freezed all nodes
+				if (zx_numa_lock64_try_to_freeze(lockaddr, buf,
+						i) == nodes + 1) {
+					//all node has been locked
+					atomic_dec(&numa_count);
+					break;
+				}
+				cpu_relax();
+			}
+		}
+	}
+	schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
+}
+
+static int create_numa_buffer_list(int clusters, int len)
+{
+	int i = 0;
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		struct _numa_lock *s = (struct _numa_lock *)kmem_cache_alloc(
+				zx_numa_lock_cachep, GFP_KERNEL);
+		if (!s) {
+			while (i > 0) {
+				kmem_cache_free(zx_numa_lock_cachep,
+						zx_numa_entry[i-1].numa_ptr);
+				i--;
+			}
+			return 0;
+		}
+		memset((char *)s, 0,
+			len * L1_CACHE_BYTES * (clusters + NUMAEXPAND));
+		numa_lock_init_data(s, clusters, NUMA_LOCKED_VAL, 0);
+		zx_numa_entry[i].numa_ptr = s;
+		zx_numa_entry[i].lockaddr = 0;
+		zx_numa_entry[i].highaddr = 0;
+		zx_numa_entry[i].idle = 0;
+		zx_numa_entry[i].type = 0;
+	}
+
+	for (i = 0; i < zx_numa_lock_total; i++) {
+		zx_numa_entry[i].index = i;
+		list_add_tail(&(zx_numa_entry[i].list), &_zx_numa_lock_head);
+	}
+	return 1;
+}
+
+static int zx_numa_lock_init(int numa)
+{
+	int align = max_t(int, L1_CACHE_BYTES, ARCH_MIN_TASKALIGN);
+	int d = 0;
+	int status = 0;
+
+	atomic_set(&lockindex, 0);
+	atomic_set(&numa_count, 0);
+
+	if (sizeof(struct _numa_lock) & 0x3f)
+		d = (int)((sizeof(struct _numa_lock) + L1_CACHE_BYTES) /
+			  L1_CACHE_BYTES);
+	else
+		d = (int)(sizeof(struct _numa_lock) / L1_CACHE_BYTES);
+
+	zx_numa_entry_cachep = kmem_cache_create(
+		"zx_numa_entry",
+		sizeof(struct _numa_buf) * zx_numa_lock_total, align,
+		SLAB_PANIC | SLAB_ACCOUNT, NULL);
+
+	zx_numa_lock_cachep = kmem_cache_create(
+		"zx_numa_lock",
+		d * L1_CACHE_BYTES * (numa + NUMAEXPAND), align,
+		SLAB_PANIC | SLAB_ACCOUNT, NULL);
+
+
+	if (zx_numa_entry_cachep && zx_numa_lock_cachep) {
+		zx_numa_entry = (struct _numa_buf *)kmem_cache_alloc(
+				zx_numa_entry_cachep, GFP_KERNEL);
+		if (zx_numa_entry) {
+			memset((char *)zx_numa_entry, 0,
+				sizeof(struct _numa_buf) * zx_numa_lock_total);
+			create_numa_buffer_list(numa, d);
+			status = 1;
+		}
+	}
+
+	pr_info("enable dynamic numa-aware osq_lock, clusters %d\n",
+		numa);
+	return status;
+}
+
+
+#define numa_lock_proc_dir "zx_numa_lock"
+#define numa_entry_total 8
+struct proc_dir_entry *numa_lock_proc;
+struct proc_dir_entry *numa_lock_enable;
+struct proc_dir_entry *numa_proc_entry[numa_entry_total];
+
+static ssize_t numa_lock_proc_read(struct file *file,
+		char __user *usrbuf, size_t len, loff_t *off)
+{
+	int id = (long) pde_data(file_inode(file));
+	char kbuffer[128];
+	ssize_t retval = 0;
+	size_t n = 0;
+
+	memset(kbuffer, 0, sizeof(kbuffer));
+	if (id == 0)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(dynamic_enable));
+	else if (id == 1)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_lock_depth));
+	else if (id == 2)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_keep_times));
+	else if (id == 3)
+		n = sprintf(kbuffer, "%d\n", READ_ONCE(osq_node_max));
+	else if (id == 4)
+		n = sprintf(kbuffer, "%d\n", atomic_read(&numa_count));
+	retval = simple_read_from_buffer(usrbuf, len, off, kbuffer, n);
+
+	return retval;
+}
+
+static ssize_t numa_lock_proc_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *f_pos)
+{
+	int id = (long) pde_data(file_inode(file));
+	char kbuffer[128];
+	unsigned long new = 0;
+	int err = 0;
+
+	memset(kbuffer, 0, sizeof(kbuffer));
+	if (copy_from_user(kbuffer, buffer, count))
+		return count;
+	kbuffer[count] = '\0';
+	err = kstrtoul(kbuffer, 10, &new);
+	if (err != 0)
+		return count;
+
+	if (id == 0) {
+		int last = READ_ONCE(dynamic_enable);
+
+		if (new < 0 || new >= 2 || last == new)
+			return count;
+
+		if (last == 0)
+			schedule_delayed_work(&zx_numa_cleanup_work, 60*HZ);
+		prefetchw(&dynamic_enable);
+		WRITE_ONCE(dynamic_enable, new);
+		return count;
+	}
+
+	if (READ_ONCE(dynamic_enable) != 0) {
+		pr_info("dynamic %d: change setting should disable dynamic\n",
+			dynamic_enable);
+		return count;
+	}
+	if (id == 1 && new > 4 && new <= 32)
+		WRITE_ONCE(osq_lock_depth, new);
+	else if (id == 2 && new >= 16 && new <= 2048)
+		WRITE_ONCE(osq_keep_times, new);
+	else if (id == 3 && new > 4 && new <= 2048)
+		WRITE_ONCE(osq_node_max, new);
+	return count;
+}
+static int numa_lock_proc_show(struct seq_file *m, void *v)
+{
+	return 0;
+}
+
+static int numa_lock_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, numa_lock_proc_show, NULL);
+}
+static const struct proc_ops numa_lock_proc_fops = {
+	.proc_open = numa_lock_proc_open,
+	.proc_read = numa_lock_proc_read,
+	.proc_write = numa_lock_proc_write
+};
+
+static int numalock_proc_init(void)
+{
+	int index = 0;
+	int i = 0;
+
+	numa_lock_proc = proc_mkdir(numa_lock_proc_dir, NULL);
+	if (numa_lock_proc == NULL) {
+		pr_info("%s proc create %s failed\n", __func__,
+				numa_lock_proc_dir);
+		return -EINVAL;
+	}
+
+	numa_lock_enable = proc_create_data("dynamic_enable", 0666,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	if (!numa_lock_enable) {
+		pr_info("%s proc_create_data %s failed!\n", __func__,
+				"dynamic_enable");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < numa_entry_total; i++)
+		numa_proc_entry[i] = NULL;
+
+	numa_proc_entry[0] =  proc_create_data("osq_lock_depth", 0664,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	numa_proc_entry[1] =  proc_create_data("osq_keep_times", 0664,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	numa_proc_entry[2] =  proc_create_data("osq_node_max", 0664,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	numa_proc_entry[3] =  proc_create_data("numa_osq_lock", 0444,
+		numa_lock_proc, &numa_lock_proc_fops, (void *)(long)index++);
+	return 0;
+}
+
+static void numalock_proc_exit(void)
+{
+	int i = 0;
+
+	for (i = 0; i < numa_entry_total; i++) {
+		if (numa_proc_entry[i])
+			proc_remove(numa_proc_entry[i]);
+	}
+	if (numa_lock_enable)
+		proc_remove(numa_lock_enable);
+	if (numa_lock_proc)
+		remove_proc_entry(numa_lock_proc_dir, NULL);
+
+}
+
+static int numalock_shutdown_notify(struct notifier_block *unused1,
+		unsigned long unused2, void *unused3)
+{
+	if (READ_ONCE(enable_zx_numa_osq_lock) == 1) {
+		WRITE_ONCE(dynamic_enable, 0);
+		WRITE_ONCE(enable_zx_numa_osq_lock, 0xf);
+	}
+	return NOTIFY_DONE;
+}
+static struct notifier_block numalock_shutdown_nb = {
+	.notifier_call = numalock_shutdown_notify,
+};
+static int __init zx_numa_base_init(void)
+{
+	int cpu = num_possible_cpus();
+	int i = 0;
+
+	WRITE_ONCE(enable_zx_numa_osq_lock, 0);
+	if (kvm_para_available())
+		return 0;
+	if (cpu >= 65534 || cpu < 16 || (cpu & 0x7) != 0)
+		return 0;
+
+	for (i = 0; i < ARRAY_SIZE(numa_cpu_list); i++) {
+		if (boot_cpu_data.x86_vendor == numa_cpu_list[i].x86_vendor &&
+			boot_cpu_data.x86 == numa_cpu_list[i].x86 &&
+			boot_cpu_data.x86_model == numa_cpu_list[i].x86_model) {
+
+			if (numa_cpu_list[i].feature1 == 1)
+				numa_clusters = nr_node_ids + nr_node_ids;
+			numa_shift = zx_get_numa_shift(num_possible_cpus(),
+					numa_clusters);
+
+			if (zx_numa_lock_init(numa_clusters) == 0)
+				return -ENOMEM;
+			register_reboot_notifier(&numalock_shutdown_nb);
+			numalock_proc_init();
+			INIT_DELAYED_WORK(&zx_numa_cleanup_work,
+				zx_numa_cleanup);
+			prefetchw(&enable_zx_numa_osq_lock);
+			WRITE_ONCE(enable_zx_numa_osq_lock, 1);
+			return 0;
+		}
+	}
+	return 0;
+}
+
+static void __exit zx_numa_lock_exit(void)
+{
+	numalock_proc_exit();
+	prefetchw(&dynamic_enable);
+	WRITE_ONCE(dynamic_enable, 0);
+}
+
+late_initcall(zx_numa_base_init);
+module_exit(zx_numa_lock_exit);
+MODULE_AUTHOR("LiYong <yongli-oc@zhaoxin.com>");
+MODULE_DESCRIPTION("zx dynamic numa-aware osq lock");
+MODULE_LICENSE("GPL");
+
diff --git a/kernel/locking/zx_numa_osq.c b/kernel/locking/zx_numa_osq.c
new file mode 100644
index 000000000000..9fc329f33c36
--- /dev/null
+++ b/kernel/locking/zx_numa_osq.c
@@ -0,0 +1,497 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Dynamic numa-aware osq lock
+ * Author: LiYong <yongli-oc@zhaoxin.com>
+ *
+ */
+#include <linux/cpumask.h>
+#include <asm/byteorder.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/osq_lock.h>
+#include "numa.h"
+#include "numa_osq.h"
+
+int osq_node_max = 256;
+
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_cpu_node);
+
+/*
+ * We use the value 0 to represent "no CPU", thus the encoded value
+ * will be the CPU number incremented by 1.
+ */
+static inline int decode(int cpu_nr)
+{
+	return cpu_nr - 1;
+}
+
+static inline struct optimistic_spin_node *decode_curr(int encoded_cpu_val)
+{
+	int cpu_nr = decode(encoded_cpu_val);
+
+	return per_cpu_ptr(&osq_cpu_node, cpu_nr);
+}
+/*
+ * Exchange the tail to get the last one in the queue.
+ * If the high 32 bits of tail is not equal to the osq lock addr, or
+ * the low 32 bits is NUMA_LOCKED_VAL, the queue is in LOCKED state.
+ */
+static int atomic64_cmpxchg_notequal(void *qslock, atomic_t *tail, int curr)
+{
+	u64 ss = 0;
+	u32 addr = ptrmask(qslock);
+	u64 addrcurr = (((u64)addr) << 32) | curr;
+
+	while (1) {
+		ss = atomic64_read((atomic64_t *) tail);
+		if ((ss >> 32) != addr)
+			return NUMA_LOCKED_VAL;
+		if ((ss & LOW32MASK) == NUMA_LOCKED_VAL)
+			return NUMA_LOCKED_VAL;
+		if (atomic64_cmpxchg((atomic64_t *) tail, ss, addrcurr) == ss)
+			return ss & LOW32MASK;
+		cpu_relax();
+	}
+}
+/*
+ * Get numa-aware lock memory and set the osq to STOPPING state.
+ * Look for the zx_numa_entry array, get a free one and set each node's addr
+ * to the address of the osq lock, then set the struct optimistic_spin_queue's
+ * index to s.
+ *
+ * From the index, lock()/unlock() gets the position of struct _numa_lock,
+ * atomic check the struct _numa_lock's addr with the  struct optimistic_spin_queue
+ * lock's address, equal means the link is valid. if the addr is not equal,
+ * the link is broken by the numa_cleanup workqueue since idle.
+ */
+void zx_osq_lock_stopping(struct optimistic_spin_queue *lock)
+{
+	int s = 0;
+
+	s = zx_numa_lock_ptr_get(lock);
+	if (s < zx_numa_lock_total) {
+		numa_lock_init_data(zx_numa_entry[s].numa_ptr,
+			numa_clusters, NUMA_UNLOCKED_VAL,
+			ptrmask(lock));
+
+		WRITE_ONCE(lock->index, s);
+		zx_numa_entry[s].type = 1;
+		smp_mb();/*should set these before enable*/
+		prefetchw(&lock->numa_enable);
+		WRITE_ONCE(lock->numa_enable, OSQLOCKSTOPPING);
+	} else {
+		prefetchw(&lock->numa_enable);
+		WRITE_ONCE(lock->numa_enable, OSQLOCKINITED);
+	}
+}
+/*
+ * Set numa_enable to NUMALOCKDYNAMIC to start by numa-aware lock,
+ * then set _numa_lock->initlock to TURNTONUMAREADY, breaks the
+ * loop of zx_osq_turn_numa_waiting.
+ */
+void zx_osq_numa_start(struct optimistic_spin_queue *lock)
+{
+	struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
+
+	prefetchw(&lock->numa_enable);
+	WRITE_ONCE(lock->numa_enable, NUMALOCKDYNAMIC);
+	smp_mb(); /*should keep lock->numa_enable modified first*/
+	atomic_set(&_numa_lock->initlock, TURNTONUMAREADY);
+}
+
+/*
+ * Waiting numa-aware lock ready
+ */
+void zx_osq_turn_numa_waiting(struct optimistic_spin_queue *lock)
+{
+	struct _numa_lock *_numa_lock = get_numa_lock(lock->index);
+
+	atomic_inc(&_numa_lock->pending);
+	while (1) {
+		int s = atomic_read(&_numa_lock->initlock);
+
+		if (s == TURNTONUMAREADY)
+			break;
+		cpu_relax();
+
+	}
+	atomic_dec(&_numa_lock->pending);
+}
+
+static struct optimistic_spin_node *
+zx_numa_osq_wait_next(struct _numa_lock *lock,
+		struct optimistic_spin_node *node,
+		struct optimistic_spin_node *prev, int cpu)
+{
+	struct optimistic_spin_node *next = NULL;
+	int curr = encode_cpu(cpu);
+	int old;
+
+	old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
+	for (;;) {
+		if (atomic_read(&lock->tail) == curr &&
+		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
+
+			break;
+		}
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next)
+				break;
+		}
+		cpu_relax();
+	}
+	return next;
+}
+/*
+ * If the numa_lock tail is in LOCKED state or the high 32 bits addr
+ * is not equal to the osq lock, the numa-aware lock is stopped, and
+ * break the loop to osq_lock()
+ * if not, exchange the tail to enqueue.
+ */
+static void zx_numa_turn_osq_waiting(struct optimistic_spin_queue *lock,
+			struct _numa_lock *_numa_lock)
+{
+	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
+	int lockaddr = ptrmask(lock);
+	u64 s = 0;
+	struct optimistic_spin_queue tail;
+
+	tail.numa_enable = NUMALOCKDYNAMIC;
+	tail.index = lock->index;
+	tail.tail16 = OSQ_LOCKED_VAL;
+	while (1) {
+		cpu_relax();
+		s = atomic64_read((atomic64_t *) &numa_lock->tail);
+		if ((s >> 32) != lockaddr)
+			break;
+		if ((s & LOW32MASK) == NUMA_LOCKED_VAL)
+			break;
+	}
+	prefetchw(&lock->tail);
+	if (atomic_cmpxchg(&lock->tail, tail.val, OSQ_UNLOCKED_VAL)
+			== tail.val) {
+		;
+	}
+
+}
+
+static int _zx_node_osq_lock_internal(struct optimistic_spin_queue *qslock,
+	struct optimistic_spin_node *node, struct optimistic_spin_node *prev,
+	struct _numa_lock *node_lock, int cpu, int *cur_status)
+{
+	struct optimistic_spin_node *next = NULL;
+
+	for (;;) {
+		struct optimistic_spin_node *node_prev = NULL;
+
+		/*
+		 * cpu_relax() below implies a compiler barrier which would
+		 * prevent this comparison being optimized away.
+		 */
+		if (data_race(prev->next) == node &&
+		    cmpxchg(&prev->next, node, NULL) == node) {
+			break;
+		}
+		/*load locked first each time*/
+		*cur_status = smp_load_acquire(&node->locked);
+
+		if (*cur_status != NODE_WAIT)
+			return 0; //goto NODE_UNLOCK;
+
+		cpu_relax();
+		/*
+		 * Or we race against a concurrent unqueue()'s step-B, in which
+		 * case its step-C will write us a new @node->prev pointer.
+		 */
+		node_prev = READ_ONCE(node->prev);
+		if (node_prev != prev)
+			prev = node_prev;
+	}
+
+	/*
+	 * Step - B -- stabilize @next
+	 *
+	 * Similar to unlock(), wait for @node->next or move @lock from @node
+	 * back to @prev.
+	 */
+	next = zx_numa_osq_wait_next(node_lock, node, prev, cpu);
+	if (!next)
+		return -1;
+
+	WRITE_ONCE(next->prev, prev);
+	WRITE_ONCE(prev->next, next);
+
+	return -1;
+}
+
+static int _zx_node_osq_lock(struct optimistic_spin_queue *qslock,
+					struct _numa_lock *_numa_lock)
+{
+	struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
+	struct optimistic_spin_node *prev = NULL;
+	int cpu = smp_processor_id();
+	int curr = encode_cpu(cpu);
+	int numa = cpu >> _numa_lock->shift;
+	struct _numa_lock *node_lock = _numa_lock + numa;
+	int cur_status = 0;
+	int old = 0;
+
+	node->locked = NODE_WAIT;
+	node->next = NULL;
+	node->cpu = curr;
+
+	old = atomic64_cmpxchg_notequal(qslock, &node_lock->tail, curr);
+
+	if (old == NUMA_LOCKED_VAL) {
+		bool s = true;
+
+		zx_numa_turn_osq_waiting(qslock, _numa_lock);
+		s = osq_lock(qslock);
+		if (s == true)
+			return 1;
+		else
+			return -1;
+	}
+
+	if (old == 0) {
+		node->locked = COHORT_START;
+		return ACQUIRE_NUMALOCK;
+	}
+
+	prev = decode_curr(old);
+	node->prev = prev;
+
+	smp_mb(); /* make sure node set before set pre->next */
+
+	WRITE_ONCE(prev->next, node);
+	/*
+	 * Normally @prev is untouchable after the above store; because at that
+	 * moment unlock can proceed and wipe the node element from stack.
+	 *
+	 * However, since our nodes are static per-cpu storage, we're
+	 * guaranteed their existence -- this allows us to apply
+	 * cmpxchg in an attempt to undo our queueing.
+	 */
+
+	/*
+	 * Wait to acquire the lock or cancellation. Note that need_resched()
+	 * will come with an IPI, which will wake smp_cond_load_relaxed() if it
+	 * is implemented with a monitor-wait. vcpu_is_preempted() relies on
+	 * polling, be careful.
+	 */
+	while ((cur_status = READ_ONCE(node->locked)) == NODE_WAIT) {
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) {
+			int ddd = _zx_node_osq_lock_internal(qslock, node, prev,
+						node_lock, cpu, &cur_status);
+
+			if (cur_status != NODE_WAIT)
+				goto NODE_UNLOCK;
+			if (ddd == -1)
+				return -1;
+		}
+		cpu_relax();
+	}
+NODE_UNLOCK:
+	if (cur_status == ACQUIRE_NUMALOCK)
+		node->locked = COHORT_START;
+	return cur_status;
+}
+static int _zx_numa_osq_lock(struct optimistic_spin_queue *qslock, int cpu,
+				struct _numa_lock *_numa_lock)
+{
+	int numacpu = cpu >> _numa_lock->shift;
+	int numacurr = encode_cpu(numacpu);
+
+	struct optimistic_spin_node *node = &(_numa_lock + numacpu)->osq_node;
+	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
+	struct optimistic_spin_node *prevnode = NULL;
+	int prev = 0;
+
+	node->next = NULL;
+	node->locked = LOCK_NUMALOCK;
+	node->cpu = numacurr;
+
+	prev = atomic_xchg(&numa_lock->tail, numacurr);
+	if (prev == 0) {
+		node->locked = UNLOCK_NUMALOCK;
+		return 0;
+	}
+
+	prevnode = &(_numa_lock + prev - 1)->osq_node;
+	node->prev = prevnode;
+	smp_mb(); /*node->prev should be set before next*/
+	WRITE_ONCE(prevnode->next, node);
+
+	while (READ_ONCE(node->locked) == LOCK_NUMALOCK)
+		cpu_relax();
+
+	return 0;
+}
+/*
+ * Two level osq_lock
+ * Check current node's tail then check the numa tail, the queue goes to
+ * run or wait according the two tail's state.
+ */
+
+inline bool zx_numa_osq_lock(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *_numa_lock)
+{
+	struct _numa_lock *node_lock = NULL;
+	int cpu = smp_processor_id();
+	int numa = cpu >> _numa_lock->shift;
+	int status = 0;
+
+	node_lock = _numa_lock + numa;
+
+	if (node_lock->stopping) {
+		zx_numa_turn_osq_waiting(qslock, _numa_lock);
+		return osq_lock(qslock);
+	}
+
+	status = _zx_node_osq_lock(qslock, _numa_lock);
+	if (status == ACQUIRE_NUMALOCK)
+		status = _zx_numa_osq_lock(qslock, smp_processor_id(),
+				_numa_lock);
+
+	if (status == -1)
+		return false;
+	return true;
+}
+/*
+ * Check the end of the queue on current node.
+ * Keep the addr of the node_lock when set the queue to UNLOCKED.
+ * If the node is stopping, the node_lock will be set LOCKED.
+ */
+static int atomic64_checktail_osq(struct optimistic_spin_queue *qslock,
+	struct _numa_lock *node_lock, int ctail)
+{
+	u64 addr = ((u64)ptrmask(qslock)) << 32;
+	u64 addrtail = addr | ctail;
+	u64 ss = 0;
+	bool mark;
+
+	ss = atomic64_read((atomic64_t *) &node_lock->tail);
+	if (node_lock->stopping == 0)
+		mark = (ss == addrtail &&
+			atomic64_cmpxchg_acquire(
+				(atomic64_t *) &node_lock->tail,
+				addrtail, addr|NUMA_UNLOCKED_VAL) == addrtail);
+	else
+		mark = (ss == addrtail &&
+			atomic64_cmpxchg_acquire(
+				(atomic64_t *) &node_lock->tail,
+				addrtail, NUMA_LOCKED_VAL) == addrtail);
+	return mark;
+}
+
+/*
+ * Set current node's tail to ACQUIRE_NUMALOCK to check the numa tail busy or
+ * UNLOCKED
+ */
+static void node_lock_release(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *node_lock, struct optimistic_spin_node *node,
+		int val, int cpu, int numa_end)
+{
+	struct optimistic_spin_node *next = NULL;
+	int curr = encode_cpu(cpu);
+
+	while (1) {
+		if (atomic64_checktail_osq(qslock, node_lock, curr)) {
+			if (qslock->numa_enable == NUMALOCKDYNAMIC) {
+				int index = qslock->index;
+				//starts numa_lock idle checking in cleanup workqueue.
+				if (numa_end == OSQ_UNLOCKED_VAL &&
+					zx_numa_entry[index].idle == 0) {
+					cmpxchg(&zx_numa_entry[index].idle,
+							0, 1);
+				}
+			}
+			return;
+		}
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next) {
+				WRITE_ONCE(next->locked, val);
+				return;
+			}
+		}
+		cpu_relax();
+	}
+}
+/*
+ * Unlocks the queue waiting on the next NUMA node
+ */
+static int numa_lock_release(struct optimistic_spin_queue *qslock,
+		struct _numa_lock *numa_lock,
+		struct optimistic_spin_node *node, int cpu)
+{
+	struct optimistic_spin_node *next = NULL;
+	int curr = cpu >> numa_lock->shift;
+	int numacurr = encode_cpu(curr);
+
+	while (1) {
+		if (atomic_read(&numa_lock->tail) == numacurr &&
+		    atomic_cmpxchg_acquire(&numa_lock->tail, numacurr,
+					   OSQ_UNLOCKED_VAL) == numacurr) {
+			return OSQ_UNLOCKED_VAL;
+		}
+
+		if (node->next) {
+			next = xchg(&node->next, NULL);
+			if (next) {
+				WRITE_ONCE(next->locked, UNLOCK_NUMALOCK);
+				return 1;
+			}
+		}
+		cpu_relax();
+	}
+}
+/*
+ * Two level osq_unlock.
+ */
+inline void zx_numa_osq_unlock(struct optimistic_spin_queue *qslock,
+		 struct _numa_lock *_numa_lock)
+{
+	u32 cpu =  smp_processor_id();
+	struct optimistic_spin_node *node = this_cpu_ptr(&osq_cpu_node);
+	int numa = cpu >> _numa_lock->shift;
+	struct _numa_lock *numa_lock = _numa_lock + _numa_lock->numa_nodes;
+	struct _numa_lock *node_lock = _numa_lock + numa;
+	struct optimistic_spin_node *numa_node =
+			&(_numa_lock + numa)->osq_node;
+	struct optimistic_spin_node *next = NULL;
+	int cur_count = 0;
+	int numa_end = 0;
+
+	cur_count = READ_ONCE(node->locked);
+
+	/*
+	 * Turns to the queue waiting on the next node if unlocked times more
+	 * than osq_node_max on current node.
+	 */
+	if (cur_count >= osq_node_max - 1) {
+		//Unlocks the queue on next node.
+		numa_end = numa_lock_release(qslock,
+				numa_lock, numa_node, cpu);
+		node_lock_release(qslock, node_lock, node,
+				ACQUIRE_NUMALOCK, cpu, numa_end);
+		return;
+	}
+	/*
+	 * Unlocks the next on current node.
+	 */
+	next = xchg(&node->next, NULL);
+	if (next) {
+		WRITE_ONCE(next->locked, cur_count + 1);
+		return;
+	}
+	/*
+	 * The queue on current node reaches end.
+	 */
+	numa_end = numa_lock_release(qslock, numa_lock, numa_node, cpu);
+	node_lock_release(qslock, node_lock, node, ACQUIRE_NUMALOCK,
+			cpu, numa_end);
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-09-30  7:24 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-14  8:53 [PATCH 0/4] locking/osq_lock: Update osq_lock to dynamic yongli-oc
2024-09-14  8:53 ` [PATCH 1/4] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
2024-09-14  8:53 ` [PATCH 2/4] locking/osq_lock: Turn to dynamic switch function from osq_lock/unlock yongli-oc
2024-09-14 16:06   ` Waiman Long
2024-09-19  9:32     ` yongli-os
2024-09-14  8:53 ` [PATCH 3/4] locking/osq_lock: From x_osq_lock/unlock to numa-aware lock/unlock yongli-oc
2024-09-14 17:11   ` Waiman Long
2024-09-19  9:35     ` yongli-os
2024-09-15 18:44   ` Uros Bizjak
2024-09-15 21:06     ` Waiman Long
2024-09-14  8:53 ` [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare, assign and cleanup yongli-oc
2024-09-14 17:21   ` Waiman Long
2024-09-19  9:41     ` yongli-os
2024-09-19 11:28       ` Waiman Long
2024-09-15 16:43   ` kernel test robot
2024-09-30  7:23 ` [PATCH v2 0/3] locking/osq_lock: Update osq_lock to dynamic yongli-oc
2024-09-30  7:23   ` [PATCH v2 1/3] locking/osq_lock: The Kconfig for dynamic numa-aware osq lock yongli-oc
2024-09-30  7:23   ` [PATCH v2 2/3] locking/osq_lock: Define osq by union to support dynamic numa-aware lock yongli-oc
2024-09-30  7:23   ` [PATCH v2 3/3] locking/osq_lock: Turn from 2-byte osq_lock/unlock to numa lock/unlock yongli-oc

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox