public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Muchun Song <songmuchun@bytedance.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Waiman Long <longman@redhat.com>, Sasha Levin <sashal@kernel.org>,
	mingo@redhat.com, will@kernel.org
Subject: [PATCH AUTOSEL 5.4 11/25] locking/rwsem: Optimize down_read_trylock() under highly contended case
Date: Tue, 30 Nov 2021 09:51:41 -0500	[thread overview]
Message-ID: <20211130145156.946083-11-sashal@kernel.org> (raw)
In-Reply-To: <20211130145156.946083-1-sashal@kernel.org>

From: Muchun Song <songmuchun@bytedance.com>

[ Upstream commit 14c24048841151548a3f4d9e218510c844c1b737 ]

We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.

  ```c++
  #include <sys/mman.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sched.h>

  #include <cstdio>
  #include <cassert>
  #include <thread>
  #include <vector>
  #include <chrono>

  volatile int mutex;

  void trigger(int cpu, char* ptr, std::size_t sz)
  {
  	cpu_set_t set;
  	CPU_ZERO(&set);
  	CPU_SET(cpu, &set);
  	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

  	while (mutex);

  	for (std::size_t i = 0; i < sz; i += 4096) {
  		*ptr = '\0';
  		ptr += 4096;
  	}
  }

  int main(int argc, char* argv[])
  {
  	std::size_t sz = 100;

  	if (argc > 1)
  		sz = atoi(argv[1]);

  	auto nproc = std::thread::hardware_concurrency();
  	std::vector<std::thread> thr;
  	sz <<= 30;
  	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
  	assert(ptr != MAP_FAILED);
  	char* cptr = static_cast<char*>(ptr);
  	auto run = sz / nproc;
  	run = (run >> 12) << 12;

  	mutex = 1;

  	for (auto i = 0U; i < nproc; ++i) {
  		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
  		cptr += run;
  	}

  	rusage usage_start;
  	getrusage(RUSAGE_SELF, &usage_start);
  	auto start = std::chrono::system_clock::now();

  	mutex = 0;

  	for (auto& t : thr)
  		t.join();

  	rusage usage_end;
  	getrusage(RUSAGE_SELF, &usage_end);
  	auto end = std::chrono::system_clock::now();
  	timeval utime;
  	timeval stime;
  	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
  	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
  	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
  	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
  	printf("real: %lu\n",
  	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
  	       start).count());

  	return 0;
  }
  ```

The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.

  25.55%  [kernel]                  [k] down_read_trylock
  14.78%  [kernel]                  [k] handle_mm_fault
  13.45%  [kernel]                  [k] up_read
   8.61%  [kernel]                  [k] clear_page_erms
   3.89%  [kernel]                  [k] __do_page_fault

The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.

  91.89 │      lock   cmpxchg %rdx,(%rdi)

Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3aed ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.

By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:

                  Before Patch  After Patch
   # of Threads      real          real     reduced by
   ------------     ------        ------    ----------
         1          65,373        65,206       ~0.0%
         4          15,467        15,378       ~0.5%
        40           6,214         5,528      ~11.0%

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/locking/rwsem.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 5d54ff3179b80..db7b294b51dfe 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1378,17 +1378,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
 
-	/*
-	 * Optimize for the case when the rwsem is not locked at all.
-	 */
-	tmp = RWSEM_UNLOCKED_VALUE;
-	do {
+	tmp = atomic_long_read(&sem->count);
+	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					tmp + RWSEM_READER_BIAS)) {
+						    tmp + RWSEM_READER_BIAS)) {
 			rwsem_set_reader_owned(sem);
 			return 1;
 		}
-	} while (!(tmp & RWSEM_READ_FAILED_MASK));
+	}
 	return 0;
 }
 
-- 
2.33.0


  parent reply	other threads:[~2021-11-30 14:56 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-30 14:51 [PATCH AUTOSEL 5.4 01/25] ASoC: mediatek: mt8173-rt5650: Rename Speaker control to Ext Spk Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 02/25] ASoC: qdsp6: q6adm: improve error reporting Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 03/25] ASoC: qdsp6: q6routing: validate port id before setting up route Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 04/25] xen/privcmd: make option visible in Kconfig Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 05/25] NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 06/25] HID: multitouch: Fix Iiyama ProLite T1931SAW (0eef:0001 again!) Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 07/25] parisc: Provide an extru_safe() macro to extract unsigned bits Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 08/25] parisc: Convert PTE lookup to use extru_safe() macro Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 09/25] selftests/tc-testings: Be compatible with newer tc output Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 10/25] scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select() Sasha Levin
2021-11-30 14:51 ` Sasha Levin [this message]
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 12/25] i2c: i801: Fix interrupt storm from SMB_ALERT signal Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 13/25] mmc: spi: Add device-tree SPI IDs Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 14/25] net: chelsio: cxgb4vf: Fix an error code in cxgb4vf_pci_probe() Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 15/25] smb2: clarify rc initialization in smb2_reconnect Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 16/25] nvmet-tcp: fix a race condition between release_queue and io_work Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 17/25] nvmet-tcp: add an helper to free the cmd buffers Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 18/25] nvmet-tcp: fix memory leak when performing a controller reset Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 19/25] nvme-tcp: fix memory leak when freeing a queue Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 20/25] nvme-pci: add NO APST quirk for Kioxia device Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 21/25] nvme: fix write zeroes pi Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 22/25] PM: hibernate: Fix snapshot partial write lengths Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 23/25] net: qed: fix the array may be out of bound Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 24/25] net: ptp: add a definition for the UDP port for IEEE 1588 general messages Sasha Levin
2021-11-30 14:51 ` [PATCH AUTOSEL 5.4 25/25] fs: ntfs: Limit NTFS_RW to page sizes smaller than 64k Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211130145156.946083-11-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=songmuchun@bytedance.com \
    --cc=stable@vger.kernel.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox