From: Jisheng Zhang
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: [PATCH 0/3] riscv: word-at-a-time: improve find_zero()
Date: Tue, 13 Jan 2026 20:24:54 +0800
Message-ID: <20260113122457.27507-1-jszhang@kernel.org>
X-Mailer: git-send-email 2.51.0
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Currently, there are two problems with the riscv find_zero():

1. When !RISCV_ISA_ZBB, the generic fls64() generates non-optimal code.
   In the word-at-a-time case we don't have to take the fls64() code
   path; instead, we can fall back to the generic word-at-a-time
   implementation. What's more, fls64() performs unnecessary zero-bit
   counting on RV32, where fls() is enough.

2. Similar to 1, the generic fls64() also generates non-optimal code
   when RISCV_ISA_ZBB=y but the HW doesn't support Zbb.

So this series improves find_zero() by falling back to the generic
word-at-a-time implementation where necessary. This dramatically
reduces find_zero() from 33 instructions to 8! Testing with the
micro-benchmark in patch 1 also shows that performance improves by
about 1150%!

After that, we improve find_zero() for Zbb further by applying an
optimization similar to the one Linus made in commit f915a3e5b018
("arm64: word-at-a-time: improve byte count calculations for LE"), so
that we get similar improvements:

  "The difference between the old and the new implementation is that
  "count_zero()" ends up scheduling better because it is being done on
  a value that is available earlier (before the final mask).

  But more importantly, it can be implemented without the insane
  semantics of the standard bit finding helpers that have the
  off-by-one issue and have to special-case the zero mask situation."
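To make the two byte-index strategies discussed above concrete, here is a
minimal userspace sketch. It is not the code from this series: the names
has_zero_bits(), first_zero_generic() and first_zero_ctz() are hypothetical,
it assumes a 64-bit little-endian word, and it relies on the GCC/Clang builtin
__builtin_ctzll() where a Zbb-capable compiler could emit "ctz". The multiply
constant follows the generic little-endian byte-count helper as I understand
it.

```c
#include <assert.h>
#include <stdint.h>

#define ONE_BITS  0x0101010101010101ULL
#define HIGH_BITS 0x8080808080808080ULL

/*
 * Nonzero iff some byte of val is zero. Bits above the first zero byte
 * may be spurious (borrow propagation), but the lowest set bit always
 * sits in bit 7 of the first zero byte.
 */
static inline uint64_t has_zero_bits(uint64_t val)
{
	return (val - ONE_BITS) & ~val & HIGH_BITS;
}

/*
 * Generic style: turn the bits into a mask of 0xff bytes below the
 * first zero byte, then count those bytes with a multiply and shift.
 */
static inline unsigned int first_zero_generic(uint64_t bits)
{
	uint64_t mask = ((bits - 1) & ~bits) >> 7;

	return (unsigned int)((mask * 0x0001020304050608ULL) >> 56);
}

/*
 * ctz style: count trailing zeros directly on the earlier "has zero"
 * bits and convert the bit position to a byte index. ctz of 0 is
 * undefined, so the caller must have seen a nonzero result first.
 */
static inline unsigned int first_zero_ctz(uint64_t bits)
{
	assert(bits != 0);
	return (unsigned int)(__builtin_ctzll(bits) >> 3);
}
```

Both helpers take the value returned by has_zero_bits() and return the index
of the first zero byte; the ctz variant skips the intermediate byte mask
entirely, which is the scheduling advantage the quoted commit message
describes.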
On RV64 w/ Zbb, the new "find_zero()" ends up being just "ctz" plus the
shift right, which then ends up being subsumed by the "add to final
length". This reduces the total instruction count from 7 to 3!

But I have no HW platform that supports Zbb, so I can't provide
performance numbers for the last patch; it has only been built and
tested on QEMU.

Jisheng Zhang (3):
  riscv: word-at-a-time: improve find_zero() for !RISCV_ISA_ZBB
  riscv: word-at-a-time: improve find_zero() without Zbb
  riscv: word-at-a-time: improve find_zero() for Zbb

 arch/riscv/include/asm/word-at-a-time.h | 47 +++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 3 deletions(-)

-- 
2.51.0