From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Muchun Song <songmuchun@bytedance.com>,
Peter Zijlstra <peterz@infradead.org>,
Waiman Long <longman@redhat.com>, Sasha Levin <sashal@kernel.org>,
mingo@redhat.com, will@kernel.org
Subject: [PATCH AUTOSEL 5.15 34/68] locking/rwsem: Optimize down_read_trylock() under highly contended case
Date: Tue, 30 Nov 2021 09:46:30 -0500 [thread overview]
Message-ID: <20211130144707.944580-34-sashal@kernel.org> (raw)
In-Reply-To: <20211130144707.944580-1-sashal@kernel.org>
From: Muchun Song <songmuchun@bytedance.com>
[ Upstream commit 14c24048841151548a3f4d9e218510c844c1b737 ]
We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.
```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <cstdio>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>
volatile int mutex;
void trigger(int cpu, char* ptr, std::size_t sz)
{
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);
while (mutex);
for (std::size_t i = 0; i < sz; i += 4096) {
*ptr = '\0';
ptr += 4096;
}
}
int main(int argc, char* argv[])
{
std::size_t sz = 100;
if (argc > 1)
sz = atoi(argv[1]);
auto nproc = std::thread::hardware_concurrency();
std::vector<std::thread> thr;
sz <<= 30;
auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_PRIVATE, -1, 0);
assert(ptr != MAP_FAILED);
char* cptr = static_cast<char*>(ptr);
auto run = sz / nproc;
run = (run >> 12) << 12;
mutex = 1;
for (auto i = 0U; i < nproc; ++i) {
thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
cptr += run;
}
rusage usage_start;
getrusage(RUSAGE_SELF, &usage_start);
auto start = std::chrono::system_clock::now();
mutex = 0;
for (auto& t : thr)
t.join();
rusage usage_end;
getrusage(RUSAGE_SELF, &usage_end);
auto end = std::chrono::system_clock::now();
timeval utime;
timeval stime;
timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
printf("real: %lu\n",
std::chrono::duration_cast<std::chrono::milliseconds>(end -
start).count());
return 0;
}
```
The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.
25.55% [kernel] [k] down_read_trylock
14.78% [kernel] [k] handle_mm_fault
13.45% [kernel] [k] up_read
8.61% [kernel] [k] clear_page_erms
3.89% [kernel] [k] __do_page_fault
The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.
91.89 │ lock cmpxchg %rdx,(%rdi)
Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3aed ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.
By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:
Before Patch After Patch
# of Threads real real reduced by
------------ ------ ------ ----------
1 65,373 65,206 ~0.0%
4 15,467 15,378 ~0.5%
40 6,214 5,528 ~11.0%
For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
kernel/locking/rwsem.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 29eea50a3e678..e1b5cb8212675 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1246,17 +1246,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
- /*
- * Optimize for the case when the rwsem is not locked at all.
- */
- tmp = RWSEM_UNLOCKED_VALUE;
- do {
+ tmp = atomic_long_read(&sem->count);
+ while (!(tmp & RWSEM_READ_FAILED_MASK)) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
- tmp + RWSEM_READER_BIAS)) {
+ tmp + RWSEM_READER_BIAS)) {
rwsem_set_reader_owned(sem);
return 1;
}
- } while (!(tmp & RWSEM_READ_FAILED_MASK));
+ }
return 0;
}
--
2.33.0
next prev parent reply other threads:[~2021-11-30 14:52 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-30 14:45 [PATCH AUTOSEL 5.15 01/68] ASoC: mediatek: mt8173-rt5650: Rename Speaker control to Ext Spk Sasha Levin
2021-11-30 14:45 ` [PATCH AUTOSEL 5.15 02/68] ASoC: Intel: sof_sdw: Add support for SKU 0AF3 product Sasha Levin
2021-11-30 14:45 ` [PATCH AUTOSEL 5.15 03/68] ASoC: Intel: soc-acpi: add SKU 0AF3 SoundWire configuration Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 04/68] ASoC: Intel: sof_sdw: Add support for SKU 0B00 and 0B01 products Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 05/68] ASoC: Intel: sof_sdw: Add support for SKU 0B11 product Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 06/68] ASoC: Intel: sof_sdw: Add support for SKU 0B13 product Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 07/68] ASoC: Intel: soc-acpi: add SKU 0B13 SoundWire configuration Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 08/68] ASoC: Intel: sof_sdw: Add support for SKU 0B29 product Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 09/68] ASoC: Intel: soc-acpi: add SKU 0B29 SoundWire configuration Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 10/68] ASoC: Intel: sof_sdw: Add support for SKU 0B12 product Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 11/68] ASoC: rt5682: Avoid the unexpected IRQ event during going to suspend Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 12/68] ASoC: rt5682: Re-detect the combo jack after resuming Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 13/68] ASoC: mediatek: mt8173: Fix debugfs registration for components Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 14/68] ASoC: qdsp6: q6adm: improve error reporting Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 15/68] ASoC: qdsp6: q6routing: validate port id before setting up route Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 16/68] xen/privcmd: make option visible in Kconfig Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 17/68] NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 18/68] scsi: ufs: ufshpb: Fix warning in ufshpb_set_hpb_read_to_upiu() Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 19/68] scsi: scsi_debug: Fix type in min_t to avoid stack OOB Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 20/68] scsi: ufs: ufs-mediatek: Add put_device() after of_find_device_by_node() Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 21/68] atlantic: fix double-free in aq_ring_tx_clean Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 22/68] stmmac_pci: Fix underflow size in stmmac_rx Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 23/68] HID: ft260: fix i2c probing for hwmon devices Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 24/68] HID: Ignore battery for Elan touchscreen on HP Envy X360 15-eu0xxx Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 25/68] HID: multitouch: Fix Iiyama ProLite T1931SAW (0eef:0001 again!) Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 26/68] parisc: Increase FRAME_WARN to 2048 bytes on parisc Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 27/68] parisc: Provide an extru_safe() macro to extract unsigned bits Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 28/68] parisc: Fix extraction of hash lock bits in syscall.S Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 29/68] parisc: Convert PTE lookup to use extru_safe() macro Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 30/68] selftests/tc-testing: match any qdisc type Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 31/68] selftests/tc-testings: Be compatible with newer tc output Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 32/68] block: avoid to touch unloaded module instance when opening bdev Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 33/68] scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select() Sasha Levin
2021-11-30 14:46 ` Sasha Levin [this message]
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 35/68] i2c: i801: Fix interrupt storm from SMB_ALERT signal Sasha Levin
2021-12-03 8:30 ` Jean Delvare
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 36/68] mmc: spi: Add device-tree SPI IDs Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 37/68] net: chelsio: cxgb4vf: Fix an error code in cxgb4vf_pci_probe() Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 38/68] cifs: populate server_hostname for extra channels Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 39/68] smb2: clarify rc initialization in smb2_reconnect Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 40/68] nvmet-tcp: fix a race condition between release_queue and io_work Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 41/68] nvmet-tcp: add an helper to free the cmd buffers Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 42/68] nvmet-tcp: fix memory leak when performing a controller reset Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 43/68] nvme-tcp: validate R2T PDU in nvme_tcp_handle_r2t() Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 44/68] nvme-tcp: fix memory leak when freeing a queue Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 45/68] nvme-pci: add NO APST quirk for Kioxia device Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 46/68] nvme-fabrics: ignore invalid fast_io_fail_tmo values Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 47/68] nvme: fix write zeroes pi Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 48/68] xen: add "not_essential" flag to struct xenbus_driver Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 49/68] xen: flag xen_drm_front to be not essential for system boot Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 50/68] xen: flag hvc_xen " Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 51/68] xen: flag pvcalls-front " Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 52/68] xen: flag xen_snd_front " Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 53/68] x86/boot: Mark prepare_command_line() __init Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 54/68] PM: hibernate: Fix snapshot partial write lengths Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 55/68] drm/amdgpu: Fix MMIO HDP flush on SRIOV Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 56/68] drm/amdgpu: Fix double free of dmabuf Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 57/68] drm/amd/display: Fixed DSC would not PG after removing DSC stream Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 58/68] drm/amdkfd: handle VMA remove race Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 59/68] drm/amdgpu: fix byteorder error in amdgpu discovery Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 60/68] drm/amd/display: update bios scratch when setting backlight Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 61/68] vhost-vdpa: clean irqs before reseting vdpa device Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 62/68] MIPS: boot/compressed/: add __ashldi3 to target for ZSTD compression Sasha Levin
2021-11-30 14:46 ` [PATCH AUTOSEL 5.15 63/68] nfc: virtual_ncidev: change default device permissions Sasha Levin
2021-11-30 14:47 ` [PATCH AUTOSEL 5.15 64/68] net: qed: fix the array may be out of bound Sasha Levin
2021-11-30 14:47 ` [PATCH AUTOSEL 5.15 65/68] net: mscc: ocelot: create a function that replaces an existing VCAP filter Sasha Levin
2021-12-04 14:46 ` Vladimir Oltean
2021-11-30 14:47 ` [PATCH AUTOSEL 5.15 66/68] net: ptp: add a definition for the UDP port for IEEE 1588 general messages Sasha Levin
2021-11-30 14:47 ` [PATCH AUTOSEL 5.15 67/68] io_uring: Fix undefined-behaviour in io_issue_sqe Sasha Levin
2021-11-30 14:47 ` [PATCH AUTOSEL 5.15 68/68] fs: ntfs: Limit NTFS_RW to page sizes smaller than 64k Sasha Levin
2021-11-30 15:16 ` [PATCH AUTOSEL 5.15 01/68] ASoC: mediatek: mt8173-rt5650: Rename Speaker control to Ext Spk Mark Brown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20211130144707.944580-34-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=longman@redhat.com \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=songmuchun@bytedance.com \
--cc=stable@vger.kernel.org \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox