* [PATCH] riscv: Optimize memset @ 2023-05-05 8:43 张飞 2023-05-05 11:54 ` Andrew Jones 0 siblings, 1 reply; 15+ messages in thread From: 张飞 @ 2023-05-05 8:43 UTC (permalink / raw) To: paul.walmsley, palmer, aou; +Cc: linux-riscv, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1 bytes --] [-- Attachment #2: 0001-riscv-Optimize-memset.patch --] [-- Type: application/octet-stream, Size: 1405 bytes --] From bca189fbfacf1b01975af45e49096a5718887a42 Mon Sep 17 00:00:00 2001 From: zhangfei <zhangfei@nj.iscas.ac.cn> Date: Fri, 5 May 2023 14:58:35 +0800 Subject: [PATCH] riscv: Optimize memset This patch has been optimized for memset data sizes less than 16 bytes. Compared to byte by byte storage, significant performance improvement has been achieved. Signed-off-by: Fei Zhang <zhangfei@nj.iscas.ac.cn> --- arch/riscv/lib/memset.S | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S index 34c5360c6705..0967bdf86bd6 100644 --- a/arch/riscv/lib/memset.S +++ b/arch/riscv/lib/memset.S @@ -105,9 +105,36 @@ WEAK(memset) beqz a2, 6f add a3, t0, a2 5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b + sb a1, 0(t0) + sb a1, -1(a3) + li a4, 2 + bgeu a4, a2, 6f + + sb a1, 1(t0) + sb a1, 2(t0) + sb a1, -2(a3) + sb a1, -3(a3) + li a4, 6 + bgeu a4, a2, 6f + + sb a1, 3(t0) + sb a1, -4(a3) + li a4, 8 + bgeu a4, a2, 6f + + sb a1, 4(t0) + sb a1, -5(a3) + li a4, 10 + bgeu a4, a2, 6f + + sb a1, 5(t0) + sb a1, 6(t0) + sb a1, -6(a3) + sb a1, -7(a3) + li a4, 14 + bgeu a4, a2, 6f + + sb a1, 7(t0) 6: ret END(__memset) -- 2.33.0 [-- Attachment #3: Type: text/plain, Size: 161 bytes --] _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] riscv: Optimize memset 2023-05-05 8:43 [PATCH] riscv: Optimize memset 张飞 @ 2023-05-05 11:54 ` Andrew Jones 2023-05-09 2:22 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei 0 siblings, 1 reply; 15+ messages in thread From: Andrew Jones @ 2023-05-05 11:54 UTC (permalink / raw) To: 张飞; +Cc: paul.walmsley, palmer, aou, linux-riscv, linux-kernel Please don't post patches as attachments. > From: zhangfei <zhangfei@nj.iscas.ac.cn> > Date: Fri, 5 May 2023 14:58:35 +0800 > Subject: [PATCH] riscv: Optimize memset > > This patch has been optimized for memset data sizes less than 16 bytes. > Compared to byte by byte storage, significant performance improvement has been achieved. > > Signed-off-by: Fei Zhang <zhangfei@nj.iscas.ac.cn> > --- > arch/riscv/lib/memset.S | 33 ++++++++++++++++++++++++++++++--- > 1 file changed, 30 insertions(+), 3 deletions(-) > > diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S > index 34c5360c6705..0967bdf86bd6 100644 > --- a/arch/riscv/lib/memset.S > +++ b/arch/riscv/lib/memset.S > @@ -105,9 +105,36 @@ WEAK(memset) > beqz a2, 6f > add a3, t0, a2 > 5: > - sb a1, 0(t0) > - addi t0, t0, 1 > - bltu t0, a3, 5b > + sb a1, 0(t0) > + sb a1, -1(a3) > + li a4, 2 > + bgeu a4, a2, 6f > + > + sb a1, 1(t0) > + sb a1, 2(t0) > + sb a1, -2(a3) > + sb a1, -3(a3) > + li a4, 6 > + bgeu a4, a2, 6f > + > + sb a1, 3(t0) > + sb a1, -4(a3) > + li a4, 8 > + bgeu a4, a2, 6f Why is this check here? > + > + sb a1, 4(t0) > + sb a1, -5(a3) > + li a4, 10 > + bgeu a4, a2, 6f And this one? After the check of a2 against 6 above we know that offsets 6(t0) and -7(a3) are safe. Are we trying to avoid too may redundant stores with these additional checks? > + > + sb a1, 5(t0) > + sb a1, 6(t0) > + sb a1, -6(a3) > + sb a1, -7(a3) > + li a4, 14 > + bgeu a4, a2, 6f > + > + sb a1, 7(t0) > 6: > ret > END(__memset) > -- > 2.33.0 The indent of the new code doesn't match the old. I'd prefer we cleanup the old first, though. 
Please repost [1] as the first patch of a two-patch series, where yours
is the second and matches the new formatting that [1] uses.

[1] https://lore.kernel.org/all/20221027130247.31634-8-ajones@ventanamicro.com/

Thanks,
Drew
* [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting 2023-05-05 11:54 ` Andrew Jones @ 2023-05-09 2:22 ` zhangfei 2023-05-09 2:22 ` [PATCH 2/2] riscv: Optimize memset zhangfei ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: zhangfei @ 2023-05-09 2:22 UTC (permalink / raw) To: ajones; +Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhangfei From: Andrew Jones <ajones@ventanamicro.com> Aligning the first operand of each instructions with a tab is a typical style which improves readability. Apply it to memset.S. While there, we also make a small grammar change to a comment. No functional change intended. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> --- arch/riscv/lib/memset.S | 143 ++++++++++++++++++++-------------------- 1 file changed, 72 insertions(+), 71 deletions(-) diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S index 34c5360c6705..e613c5c27998 100644 --- a/arch/riscv/lib/memset.S +++ b/arch/riscv/lib/memset.S @@ -3,111 +3,112 @@ * Copyright (C) 2013 Regents of the University of California */ - #include <linux/linkage.h> #include <asm/asm.h> /* void *memset(void *, int, size_t) */ ENTRY(__memset) WEAK(memset) - move t0, a0 /* Preserve return value */ + move t0, a0 /* Preserve return value */ /* Defer to byte-oriented fill for small sizes */ - sltiu a3, a2, 16 - bnez a3, 4f + sltiu a3, a2, 16 + bnez a3, 4f /* * Round to nearest XLEN-aligned address - * greater than or equal to start address + * greater than or equal to the start address. 
*/ - addi a3, t0, SZREG-1 - andi a3, a3, ~(SZREG-1) - beq a3, t0, 2f /* Skip if already aligned */ + addi a3, t0, SZREG-1 + andi a3, a3, ~(SZREG-1) + beq a3, t0, 2f /* Skip if already aligned */ + /* Handle initial misalignment */ - sub a4, a3, t0 + sub a4, a3, t0 1: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 1b - sub a2, a2, a4 /* Update count */ + sb a1, 0(t0) + addi t0, t0, 1 + bltu t0, a3, 1b + sub a2, a2, a4 /* Update count */ 2: /* Duff's device with 32 XLEN stores per iteration */ /* Broadcast value into all bytes */ - andi a1, a1, 0xff - slli a3, a1, 8 - or a1, a3, a1 - slli a3, a1, 16 - or a1, a3, a1 + andi a1, a1, 0xff + slli a3, a1, 8 + or a1, a3, a1 + slli a3, a1, 16 + or a1, a3, a1 #ifdef CONFIG_64BIT - slli a3, a1, 32 - or a1, a3, a1 + slli a3, a1, 32 + or a1, a3, a1 #endif /* Calculate end address */ - andi a4, a2, ~(SZREG-1) - add a3, t0, a4 + andi a4, a2, ~(SZREG-1) + add a3, t0, a4 - andi a4, a4, 31*SZREG /* Calculate remainder */ - beqz a4, 3f /* Shortcut if no remainder */ - neg a4, a4 - addi a4, a4, 32*SZREG /* Calculate initial offset */ + andi a4, a4, 31*SZREG /* Calculate remainder */ + beqz a4, 3f /* Shortcut if no remainder */ + neg a4, a4 + addi a4, a4, 32*SZREG /* Calculate initial offset */ /* Adjust start address with offset */ - sub t0, t0, a4 + sub t0, t0, a4 /* Jump into loop body */ /* Assumes 32-bit instruction lengths */ - la a5, 3f + la a5, 3f #ifdef CONFIG_64BIT - srli a4, a4, 1 + srli a4, a4, 1 #endif - add a5, a5, a4 - jr a5 + add a5, a5, a4 + jr a5 3: - REG_S a1, 0(t0) - REG_S a1, SZREG(t0) - REG_S a1, 2*SZREG(t0) - REG_S a1, 3*SZREG(t0) - REG_S a1, 4*SZREG(t0) - REG_S a1, 5*SZREG(t0) - REG_S a1, 6*SZREG(t0) - REG_S a1, 7*SZREG(t0) - REG_S a1, 8*SZREG(t0) - REG_S a1, 9*SZREG(t0) - REG_S a1, 10*SZREG(t0) - REG_S a1, 11*SZREG(t0) - REG_S a1, 12*SZREG(t0) - REG_S a1, 13*SZREG(t0) - REG_S a1, 14*SZREG(t0) - REG_S a1, 15*SZREG(t0) - REG_S a1, 16*SZREG(t0) - REG_S a1, 17*SZREG(t0) - REG_S a1, 18*SZREG(t0) - REG_S a1, 
19*SZREG(t0) - REG_S a1, 20*SZREG(t0) - REG_S a1, 21*SZREG(t0) - REG_S a1, 22*SZREG(t0) - REG_S a1, 23*SZREG(t0) - REG_S a1, 24*SZREG(t0) - REG_S a1, 25*SZREG(t0) - REG_S a1, 26*SZREG(t0) - REG_S a1, 27*SZREG(t0) - REG_S a1, 28*SZREG(t0) - REG_S a1, 29*SZREG(t0) - REG_S a1, 30*SZREG(t0) - REG_S a1, 31*SZREG(t0) - addi t0, t0, 32*SZREG - bltu t0, a3, 3b - andi a2, a2, SZREG-1 /* Update count */ + REG_S a1, 0(t0) + REG_S a1, SZREG(t0) + REG_S a1, 2*SZREG(t0) + REG_S a1, 3*SZREG(t0) + REG_S a1, 4*SZREG(t0) + REG_S a1, 5*SZREG(t0) + REG_S a1, 6*SZREG(t0) + REG_S a1, 7*SZREG(t0) + REG_S a1, 8*SZREG(t0) + REG_S a1, 9*SZREG(t0) + REG_S a1, 10*SZREG(t0) + REG_S a1, 11*SZREG(t0) + REG_S a1, 12*SZREG(t0) + REG_S a1, 13*SZREG(t0) + REG_S a1, 14*SZREG(t0) + REG_S a1, 15*SZREG(t0) + REG_S a1, 16*SZREG(t0) + REG_S a1, 17*SZREG(t0) + REG_S a1, 18*SZREG(t0) + REG_S a1, 19*SZREG(t0) + REG_S a1, 20*SZREG(t0) + REG_S a1, 21*SZREG(t0) + REG_S a1, 22*SZREG(t0) + REG_S a1, 23*SZREG(t0) + REG_S a1, 24*SZREG(t0) + REG_S a1, 25*SZREG(t0) + REG_S a1, 26*SZREG(t0) + REG_S a1, 27*SZREG(t0) + REG_S a1, 28*SZREG(t0) + REG_S a1, 29*SZREG(t0) + REG_S a1, 30*SZREG(t0) + REG_S a1, 31*SZREG(t0) + + addi t0, t0, 32*SZREG + bltu t0, a3, 3b + andi a2, a2, SZREG-1 /* Update count */ 4: /* Handle trailing misalignment */ - beqz a2, 6f - add a3, t0, a2 + beqz a2, 6f + add a3, t0, a2 5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b + sb a1, 0(t0) + addi t0, t0, 1 + bltu t0, a3, 5b 6: ret END(__memset) -- 2.33.0 _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 15+ messages in thread
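As an aside for readers following the diff: the "Broadcast value into all bytes" sequence reindented above can be modeled in C. This is an illustrative sketch of the andi/slli/or doubling on RV64, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Replicate the low byte of c into every byte of a 64-bit word,
 * mirroring the andi/slli/or sequence in memset.S for CONFIG_64BIT. */
static uint64_t broadcast_byte(uint64_t c)
{
	c &= 0xff;       /* andi a1, a1, 0xff */
	c |= c << 8;     /* slli a3, a1, 8;  or a1, a3, a1 */
	c |= c << 16;    /* slli a3, a1, 16; or a1, a3, a1 */
	c |= c << 32;    /* slli a3, a1, 32; or a1, a3, a1 */
	return c;
}
```

Each shift-or step doubles the number of populated bytes, so the XLEN-wide fill value is ready after three steps on RV64 (two on RV32).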
* [PATCH 2/2] riscv: Optimize memset 2023-05-09 2:22 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei @ 2023-05-09 2:22 ` zhangfei 2023-05-09 2:22 ` Re: [PATCH] " zhangfei 2023-05-09 9:48 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting Andrew Jones 2 siblings, 0 replies; 15+ messages in thread From: zhangfei @ 2023-05-09 2:22 UTC (permalink / raw) To: ajones; +Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhangfei From: zhangfei <zhangfei@nj.iscas.ac.cn> This patch has been optimized for memset data sizes less than 16 bytes. Compared to byte by byte storage, significant performance improvement has been achieved. Signed-off-by: Fei Zhang <zhangfei@nj.iscas.ac.cn> --- arch/riscv/lib/memset.S | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S index e613c5c27998..6113a2696e79 100644 --- a/arch/riscv/lib/memset.S +++ b/arch/riscv/lib/memset.S @@ -106,9 +106,36 @@ WEAK(memset) beqz a2, 6f add a3, t0, a2 5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b + sb a1, 0(t0) + sb a1, -1(a3) + li a4, 2 + bgeu a4, a2, 6f + + sb a1, 1(t0) + sb a1, 2(t0) + sb a1, -2(a3) + sb a1, -3(a3) + li a4, 6 + bgeu a4, a2, 6f + + sb a1, 3(t0) + sb a1, -4(a3) + li a4, 8 + bgeu a4, a2, 6f + + sb a1, 4(t0) + sb a1, -5(a3) + li a4, 10 + bgeu a4, a2, 6f + + sb a1, 5(t0) + sb a1, 6(t0) + sb a1, -6(a3) + sb a1, -7(a3) + li a4, 14 + bgeu a4, a2, 6f + + sb a1, 7(t0) 6: ret END(__memset) -- 2.33.0 _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Re: [PATCH] riscv: Optimize memset
  2023-05-09  2:22 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei
  2023-05-09  2:22 ` [PATCH 2/2] riscv: Optimize memset zhangfei
@ 2023-05-09  2:22 ` zhangfei
  2023-05-09  9:16   ` Andrew Jones
  2023-05-09  9:48 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting Andrew Jones
  2 siblings, 1 reply; 15+ messages in thread
From: zhangfei @ 2023-05-09  2:22 UTC (permalink / raw)
  To: ajones; +Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

> > 5:
> > -	sb a1, 0(t0)
> > -	addi t0, t0, 1
> > -	bltu t0, a3, 5b
> > +	sb a1, 0(t0)
> > +	sb a1, -1(a3)
> > +	li a4, 2
> > +	bgeu a4, a2, 6f
> > +
> > +	sb a1, 1(t0)
> > +	sb a1, 2(t0)
> > +	sb a1, -2(a3)
> > +	sb a1, -3(a3)
> > +	li a4, 6
> > +	bgeu a4, a2, 6f
> > +
> > +	sb a1, 3(t0)
> > +	sb a1, -4(a3)
> > +	li a4, 8
> > +	bgeu a4, a2, 6f
>
> Why is this check here?

Hi,

I filled head and tail with minimal branching. Each conditional ensures
that all the subsequently used offsets are well-defined and within the
dest region.

Although this approach may result in some redundant stores, compared to
byte-by-byte stores it allows the stores to execute in parallel and
reduces the number of branches.

I used the code linked below for performance testing, with the
Arm-specific memset calls commented out so that it runs properly on the
RISC-V platform.
[1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53

The testing platform is a RISC-V SiFive U74. The test data is as follows:

Before optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30

Medium memset (bytes/ns):
memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67

Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40

After optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31

Medium memset (bytes/ns):
memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67

Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40

From the results, it can be seen that memset performance improves
significantly for sizes around 8B, from 0.18 bytes/ns to 0.27 bytes/ns.

Thanks,
Fei Zhang
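The claim above — that each conditional ensures all subsequently used offsets stay inside the dest region — can be checked mechanically with a small C model of the patched tail sequence. The offsets and thresholds are transcribed from the diff; the function itself is an illustrative sketch, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of the patch's small-size path: fill dst[0..n) for n in [1, 15].
 * Forward stores are relative to dst (t0), backward stores relative to
 * end = dst + n (a3); the thresholds mirror the li/bgeu pairs. */
static void memset_tail_model(unsigned char *dst, unsigned char v, size_t n)
{
	unsigned char *end = dst + n;

	dst[0] = v; end[-1] = v;                    /* sb 0(t0); sb -1(a3) */
	if (n <= 2) return;                         /* li a4, 2; bgeu */
	dst[1] = v; dst[2] = v; end[-2] = v; end[-3] = v;
	if (n <= 6) return;
	dst[3] = v; end[-4] = v;
	if (n <= 8) return;
	dst[4] = v; end[-5] = v;
	if (n <= 10) return;
	dst[5] = v; dst[6] = v; end[-6] = v; end[-7] = v;
	if (n <= 14) return;
	dst[7] = v;                                 /* only reached for n == 15 */
}
```

Exhaustively checking n = 1..15 against a sentinel-guarded buffer confirms that every byte in [0, n) is written and no byte outside it is touched.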
* Re: [PATCH] riscv: Optimize memset 2023-05-09 2:22 ` Re: [PATCH] " zhangfei @ 2023-05-09 9:16 ` Andrew Jones 2023-05-10 3:52 ` [PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes zhangfei 0 siblings, 1 reply; 15+ messages in thread From: Andrew Jones @ 2023-05-09 9:16 UTC (permalink / raw) To: zhangfei; +Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhangfei On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote: > From: zhangfei <zhangfei@nj.iscas.ac.cn> > > > > 5: > > > - sb a1, 0(t0) > > > - addi t0, t0, 1 > > > - bltu t0, a3, 5b > > > + sb a1, 0(t0) > > > + sb a1, -1(a3) > > > + li a4, 2 > > > + bgeu a4, a2, 6f > > > + > > > + sb a1, 1(t0) > > > + sb a1, 2(t0) > > > + sb a1, -2(a3) > > > + sb a1, -3(a3) > > > + li a4, 6 > > > + bgeu a4, a2, 6f > > > + > > > + sb a1, 3(t0) > > > + sb a1, -4(a3) > > > + li a4, 8 > > > + bgeu a4, a2, 6f > > > > Why is this check here? > > Hi, > > I filled head and tail with minimal branching. Each conditional ensures that > all the subsequently used offsets are well-defined and in the dest region. I know. You trimmed my comment, so I'll quote myself, here """ After the check of a2 against 6 above we know that offsets 6(t0) and -7(a3) are safe. Are we trying to avoid too may redundant stores with these additional checks? """ So, again. Why the additional check against 8 above and, the one you trimmed, checking 10? > > Although this approach may result in redundant storage, compared to byte by > byte storage, it allows storage instructions to be executed in parallel and > reduces the number of jumps. I understood that when I read the code, but text like this should go in the commit message to avoid people having to think their way through stuff. > > I used the code linked below for performance testing and commented on the memset > that calls the arm architecture in the code to ensure it runs properly on the > risc-v platform. 
> > [1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53 > > The testing platform selected RISC-V SiFive U74.The test data is as follows: > > Before optimization > --------------------- > Random memset (bytes/ns): > memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30 > > Medium memset (bytes/ns): > memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67 > Large memset (bytes/ns): > memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40 > > After optimization > --------------------- > Random memset bytes/ns): > memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31 > Medium memset (bytes/ns ) > memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67 > Large memset (bytes/ns): > memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40 > > From the results, it can be seen that memset has significantly improved its performance with > a data volume of around 8B, from 0.18 bytes/ns to 0.27 bytes/ns. And these benchmark results belong in the cover letter, which this series is missing. Thanks, drew _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes
  2023-05-09  9:16 ` Andrew Jones
@ 2023-05-10  3:52 ` zhangfei
  2023-05-10  3:52 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei
  ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: zhangfei @ 2023-05-10  3:52 UTC (permalink / raw)
  To: ajones
  Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhang_fei_0403, zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

At present, the memset implementation falls back to byte-by-byte stores
when processing tail data or when the initial size is less than 16
bytes. This approach is not efficient. Therefore, I filled head and tail
with minimal branching. Each conditional ensures that all the
subsequently used offsets are well-defined and within the dest region.
Although this approach may result in some redundant stores, compared to
byte-by-byte stores it allows the stores to execute in parallel, reduces
the number of branches, and ultimately improves performance.

I used the code linked below for performance testing, with the
Arm-specific memset calls commented out so that it runs properly on the
RISC-V platform.
[1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53

The testing platform is a RISC-V SiFive U74. The test data is as follows:

Before optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30

Medium memset (bytes/ns):
memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67

Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40

After optimization
---------------------
Random memset (bytes/ns):
memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31

Medium memset (bytes/ns):
memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67

Large memset (bytes/ns):
memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40

From the results, it can be seen that memset performance improves
significantly for sizes around 8B, from 0.18 bytes/ns to 0.27 bytes/ns.

Thanks,
Fei Zhang

Andrew Jones (1):
  RISC-V: lib: Improve memset assembler formatting

 arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
 1 file changed, 72 insertions(+), 71 deletions(-)

zhangfei (1):
  riscv: Optimize memset

 arch/riscv/lib/memset.S | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)
* [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting 2023-05-10 3:52 ` [PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes zhangfei @ 2023-05-10 3:52 ` zhangfei 2023-05-10 7:13 ` Conor Dooley 2023-05-10 3:52 ` [PATCH 2/2] riscv: Optimize memset zhangfei 2023-05-10 3:52 ` [PATCH] " zhangfei 2 siblings, 1 reply; 15+ messages in thread From: zhangfei @ 2023-05-10 3:52 UTC (permalink / raw) To: ajones Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhang_fei_0403, zhangfei, Conor Dooley From: Andrew Jones <ajones@ventanamicro.com> Aligning the first operand of each instructions with a tab is a typical style which improves readability. Apply it to memset.S. While there, we also make a small grammar change to a comment. No functional change intended. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> --- arch/riscv/lib/memset.S | 143 ++++++++++++++++++++-------------------- 1 file changed, 72 insertions(+), 71 deletions(-) diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S index 34c5360c6705..e613c5c27998 100644 --- a/arch/riscv/lib/memset.S +++ b/arch/riscv/lib/memset.S @@ -3,111 +3,112 @@ * Copyright (C) 2013 Regents of the University of California */ - #include <linux/linkage.h> #include <asm/asm.h> /* void *memset(void *, int, size_t) */ ENTRY(__memset) WEAK(memset) - move t0, a0 /* Preserve return value */ + move t0, a0 /* Preserve return value */ /* Defer to byte-oriented fill for small sizes */ - sltiu a3, a2, 16 - bnez a3, 4f + sltiu a3, a2, 16 + bnez a3, 4f /* * Round to nearest XLEN-aligned address - * greater than or equal to start address + * greater than or equal to the start address. 
*/ - addi a3, t0, SZREG-1 - andi a3, a3, ~(SZREG-1) - beq a3, t0, 2f /* Skip if already aligned */ + addi a3, t0, SZREG-1 + andi a3, a3, ~(SZREG-1) + beq a3, t0, 2f /* Skip if already aligned */ + /* Handle initial misalignment */ - sub a4, a3, t0 + sub a4, a3, t0 1: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 1b - sub a2, a2, a4 /* Update count */ + sb a1, 0(t0) + addi t0, t0, 1 + bltu t0, a3, 1b + sub a2, a2, a4 /* Update count */ 2: /* Duff's device with 32 XLEN stores per iteration */ /* Broadcast value into all bytes */ - andi a1, a1, 0xff - slli a3, a1, 8 - or a1, a3, a1 - slli a3, a1, 16 - or a1, a3, a1 + andi a1, a1, 0xff + slli a3, a1, 8 + or a1, a3, a1 + slli a3, a1, 16 + or a1, a3, a1 #ifdef CONFIG_64BIT - slli a3, a1, 32 - or a1, a3, a1 + slli a3, a1, 32 + or a1, a3, a1 #endif /* Calculate end address */ - andi a4, a2, ~(SZREG-1) - add a3, t0, a4 + andi a4, a2, ~(SZREG-1) + add a3, t0, a4 - andi a4, a4, 31*SZREG /* Calculate remainder */ - beqz a4, 3f /* Shortcut if no remainder */ - neg a4, a4 - addi a4, a4, 32*SZREG /* Calculate initial offset */ + andi a4, a4, 31*SZREG /* Calculate remainder */ + beqz a4, 3f /* Shortcut if no remainder */ + neg a4, a4 + addi a4, a4, 32*SZREG /* Calculate initial offset */ /* Adjust start address with offset */ - sub t0, t0, a4 + sub t0, t0, a4 /* Jump into loop body */ /* Assumes 32-bit instruction lengths */ - la a5, 3f + la a5, 3f #ifdef CONFIG_64BIT - srli a4, a4, 1 + srli a4, a4, 1 #endif - add a5, a5, a4 - jr a5 + add a5, a5, a4 + jr a5 3: - REG_S a1, 0(t0) - REG_S a1, SZREG(t0) - REG_S a1, 2*SZREG(t0) - REG_S a1, 3*SZREG(t0) - REG_S a1, 4*SZREG(t0) - REG_S a1, 5*SZREG(t0) - REG_S a1, 6*SZREG(t0) - REG_S a1, 7*SZREG(t0) - REG_S a1, 8*SZREG(t0) - REG_S a1, 9*SZREG(t0) - REG_S a1, 10*SZREG(t0) - REG_S a1, 11*SZREG(t0) - REG_S a1, 12*SZREG(t0) - REG_S a1, 13*SZREG(t0) - REG_S a1, 14*SZREG(t0) - REG_S a1, 15*SZREG(t0) - REG_S a1, 16*SZREG(t0) - REG_S a1, 17*SZREG(t0) - REG_S a1, 18*SZREG(t0) - REG_S a1, 
19*SZREG(t0) - REG_S a1, 20*SZREG(t0) - REG_S a1, 21*SZREG(t0) - REG_S a1, 22*SZREG(t0) - REG_S a1, 23*SZREG(t0) - REG_S a1, 24*SZREG(t0) - REG_S a1, 25*SZREG(t0) - REG_S a1, 26*SZREG(t0) - REG_S a1, 27*SZREG(t0) - REG_S a1, 28*SZREG(t0) - REG_S a1, 29*SZREG(t0) - REG_S a1, 30*SZREG(t0) - REG_S a1, 31*SZREG(t0) - addi t0, t0, 32*SZREG - bltu t0, a3, 3b - andi a2, a2, SZREG-1 /* Update count */ + REG_S a1, 0(t0) + REG_S a1, SZREG(t0) + REG_S a1, 2*SZREG(t0) + REG_S a1, 3*SZREG(t0) + REG_S a1, 4*SZREG(t0) + REG_S a1, 5*SZREG(t0) + REG_S a1, 6*SZREG(t0) + REG_S a1, 7*SZREG(t0) + REG_S a1, 8*SZREG(t0) + REG_S a1, 9*SZREG(t0) + REG_S a1, 10*SZREG(t0) + REG_S a1, 11*SZREG(t0) + REG_S a1, 12*SZREG(t0) + REG_S a1, 13*SZREG(t0) + REG_S a1, 14*SZREG(t0) + REG_S a1, 15*SZREG(t0) + REG_S a1, 16*SZREG(t0) + REG_S a1, 17*SZREG(t0) + REG_S a1, 18*SZREG(t0) + REG_S a1, 19*SZREG(t0) + REG_S a1, 20*SZREG(t0) + REG_S a1, 21*SZREG(t0) + REG_S a1, 22*SZREG(t0) + REG_S a1, 23*SZREG(t0) + REG_S a1, 24*SZREG(t0) + REG_S a1, 25*SZREG(t0) + REG_S a1, 26*SZREG(t0) + REG_S a1, 27*SZREG(t0) + REG_S a1, 28*SZREG(t0) + REG_S a1, 29*SZREG(t0) + REG_S a1, 30*SZREG(t0) + REG_S a1, 31*SZREG(t0) + + addi t0, t0, 32*SZREG + bltu t0, a3, 3b + andi a2, a2, SZREG-1 /* Update count */ 4: /* Handle trailing misalignment */ - beqz a2, 6f - add a3, t0, a2 + beqz a2, 6f + add a3, t0, a2 5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b + sb a1, 0(t0) + addi t0, t0, 1 + bltu t0, a3, 5b 6: ret END(__memset) -- 2.33.0 _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
  2023-05-10  3:52 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei
@ 2023-05-10  7:13 ` Conor Dooley
  2023-05-11  1:55   ` zhangfei
  0 siblings, 1 reply; 15+ messages in thread
From: Conor Dooley @ 2023-05-10  7:13 UTC (permalink / raw)
  To: zhangfei
  Cc: ajones, aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhangfei

Hey Zhangfei,

On Wed, May 10, 2023 at 11:52:41AM +0800, zhangfei wrote:
> From: Andrew Jones <ajones@ventanamicro.com>
>
> Aligning the first operand of each instruction with a tab is a
> typical style which improves readability. Apply it to memset.S.
> While there, we also make a small grammar change to a comment.
>
> No functional change intended.
>
> Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
> Reviewed-by: Conor Dooley <conor.dooley@microchip.com>

Three things for you here:

Firstly, since you sent Andrew's work, you need to add your own
Signed-off-by to the patch.

Secondly & thirdly, please version your patchsets & do not send later
versions as a reply to the v1.
In my mail client I see:

May 10 zhangfei     (  58) ┌─>[PATCH 2/2] riscv: Optimize memset
May 10 Andrew Jones (  36) │ ┌─>
May 10 zhangfei     (  32) ├─>Re: [PATCH] riscv: Optimize memset
May 10 zhangfei     ( 205) ├─>[PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
May 10 zhangfei     (  56) ┌─>[PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes
May 09 Andrew Jones (  87) ┌─>Re: [PATCH] riscv: Optimize memset
May 09 zhangfei     (  67) ┌─>Re: Re: [PATCH] riscv: Optimize memset
May 09 Andrew Jones ( 211) ├─>Re: [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
May 09 zhangfei     (  57) ├─>[PATCH 2/2] riscv: Optimize memset
May 09 zhangfei     ( 204) ┌─>[PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
May 05 Andrew Jones (  91) ┌─>
May 05 张飞          (  38) [PATCH] riscv: Optimize memset

How am I supposed to know what is what there?

Thanks,
Conor.
* Re: [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
  2023-05-10  7:13 ` Conor Dooley
@ 2023-05-11  1:55 ` zhangfei
  0 siblings, 0 replies; 15+ messages in thread
From: zhangfei @ 2023-05-11  1:55 UTC (permalink / raw)
  To: conor.dooley
  Cc: ajones, aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhang_fei_0403, zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

Hi,

I am sorry for the confusing threading of my previous emails. I have
reorganized the series and sent it out as patch v2. Please reply under
the latest version.

Thanks,
Fei Zhang
* [PATCH 2/2] riscv: Optimize memset 2023-05-10 3:52 ` [PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes zhangfei 2023-05-10 3:52 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei @ 2023-05-10 3:52 ` zhangfei 2023-05-10 3:52 ` [PATCH] " zhangfei 2 siblings, 0 replies; 15+ messages in thread From: zhangfei @ 2023-05-10 3:52 UTC (permalink / raw) To: ajones Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhang_fei_0403, zhangfei From: zhangfei <zhangfei@nj.iscas.ac.cn> This patch has been optimized for memset data sizes less than 16 bytes. Compared to byte by byte storage, significant performance improvement has been achieved. It allows storage instructions to be executed in parallel and reduces the number of jumps. Signed-off-by: Fei Zhang <zhangfei@nj.iscas.ac.cn> --- arch/riscv/lib/memset.S | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S index e613c5c27998..6113a2696e79 100644 --- a/arch/riscv/lib/memset.S +++ b/arch/riscv/lib/memset.S @@ -106,9 +106,36 @@ WEAK(memset) beqz a2, 6f add a3, t0, a2 5: - sb a1, 0(t0) - addi t0, t0, 1 - bltu t0, a3, 5b + sb a1, 0(t0) + sb a1, -1(a3) + li a4, 2 + bgeu a4, a2, 6f + + sb a1, 1(t0) + sb a1, 2(t0) + sb a1, -2(a3) + sb a1, -3(a3) + li a4, 6 + bgeu a4, a2, 6f + + sb a1, 3(t0) + sb a1, -4(a3) + li a4, 8 + bgeu a4, a2, 6f + + sb a1, 4(t0) + sb a1, -5(a3) + li a4, 10 + bgeu a4, a2, 6f + + sb a1, 5(t0) + sb a1, 6(t0) + sb a1, -6(a3) + sb a1, -7(a3) + li a4, 14 + bgeu a4, a2, 6f + + sb a1, 7(t0) 6: ret END(__memset) -- 2.33.0 _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] riscv: Optimize memset
  2023-05-10  3:52 ` [PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes zhangfei
  2023-05-10  3:52 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei
  2023-05-10  3:52 ` [PATCH 2/2] riscv: Optimize memset zhangfei
@ 2023-05-10  3:52 ` [PATCH] " zhangfei
  2023-05-10  6:58   ` Andrew Jones
  2 siblings, 1 reply; 15+ messages in thread
From: zhangfei @ 2023-05-10  3:52 UTC (permalink / raw)
  To: ajones
  Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhang_fei_0403, zhangfei

From: zhangfei <zhangfei@nj.iscas.ac.cn>

On Tue, May 09, 2023 11:16:33AM +0200, Andrew Jones wrote:
> On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> >
> > Hi,
> >
> > I filled head and tail with minimal branching. Each conditional ensures that
> > all the subsequently used offsets are well-defined and in the dest region.
>
> I know. You trimmed my comment, so I'll quote myself, here
>
> """
> After the check of a2 against 6 above we know that offsets 6(t0)
> and -7(a3) are safe. Are we trying to avoid too many redundant
> stores with these additional checks?
> """
>
> So, again. Why the additional check against 8 above and, the one you
> trimmed, checking 10?

Hi,

These additional checks are to avoid too many redundant stores.

The check against 8 is there because, after falling out of loop
segment '3', the remaining bytes number fewer than 8, so it also
avoids redundant stores.

Thanks,
Fei Zhang
* Re: [PATCH] riscv: Optimize memset
  From: Andrew Jones <ajones@ventanamicro.com> @ 2023-05-10  6:58 UTC
  To: zhangfei
  Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhangfei

On Wed, May 10, 2023 at 11:52:43AM +0800, zhangfei wrote:
> From: zhangfei <zhangfei@nj.iscas.ac.cn>
>
> On Tue, May 09, 2023 11:16:33AM +0200, Andrew Jones wrote:
> > On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> > >
> > > Hi,
> > >
> > > I filled head and tail with minimal branching. Each conditional
> > > ensures that all the subsequently used offsets are well-defined
> > > and in the dest region.
> >
> > I know. You trimmed my comment, so I'll quote myself, here
> >
> > """
> > After the check of a2 against 6 above we know that offsets 6(t0)
> > and -7(a3) are safe. Are we trying to avoid too many redundant
> > stores with these additional checks?
> > """
> >
> > So, again. Why the additional check against 8 above and, the one
> > you trimmed, checking 10?
>
> Hi,
>
> These additional checks are to avoid too many redundant stores.
>
> Adding a check for more than 8 bytes is because after the loop
> segment '3' comes out, the remaining bytes are less than 8 bytes,
> which also avoids redundant stores.

So the benchmarks showed these additional checks were necessary to
avoid making memset worse? Please add comments to the code explaining
the purpose of the checks.

Thanks,
drew
* Re: [PATCH] riscv: Optimize memset
  From: zhangfei <zhangfei@nj.iscas.ac.cn> @ 2023-05-11  1:42 UTC
  To: ajones
  Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhang_fei_0403, zhangfei

On Wed, May 10, 2023 at 14:58:22PM +0200, Andrew Jones wrote:
> On Wed, May 10, 2023 at 11:52:43AM +0800, zhangfei wrote:
> > From: zhangfei <zhangfei@nj.iscas.ac.cn>
> >
> > On Tue, May 09, 2023 11:16:33AM +0200, Andrew Jones wrote:
> > > On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> > > >
> > > > Hi,
> > > >
> > > > I filled head and tail with minimal branching. Each conditional
> > > > ensures that all the subsequently used offsets are well-defined
> > > > and in the dest region.
> > >
> > > I know. You trimmed my comment, so I'll quote myself, here
> > >
> > > """
> > > After the check of a2 against 6 above we know that offsets 6(t0)
> > > and -7(a3) are safe. Are we trying to avoid too many redundant
> > > stores with these additional checks?
> > > """
> > >
> > > So, again. Why the additional check against 8 above and, the one
> > > you trimmed, checking 10?
> >
> > Hi,
> >
> > These additional checks are to avoid too many redundant stores.
> >
> > Adding a check for more than 8 bytes is because after the loop
> > segment '3' comes out, the remaining bytes are less than 8 bytes,
> > which also avoids redundant stores.
>
> So the benchmarks showed these additional checks were necessary to
> avoid making memset worse? Please add comments to the code explaining
> the purpose of the checks.

Hi,

As you mentioned, leaving out these additional checks can make memset
worse. When I removed the checks against 8 and 10, the benchmarks
showed memset dropping to 0.21 bytes/ns at 8B. That is still better
than storing byte by byte, but the additional checks bring a further
improvement, to 0.27 bytes/ns.

I am sorry for the garbled reply in my previous email. I have
reorganized the patch as v2 and sent it out; please reply under the
latest patch.

Thanks,
Fei Zhang
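[Editor's note: the redundant-store trade-off discussed above can be made concrete by counting stores. Below is a hypothetical C model of the branch structure, not benchmark code; both function names are invented for illustration. One function mirrors the posted patch (size checks after 2, 6, 8, 10 and 14 stores), the other drops the disputed checks against 8 and 10.]

```c
#include <assert.h>
#include <stddef.h>

/* Stores issued by the patched tail-fill for size n (1..15),
 * with all five size checks, as in the posted patch. */
static int stores_with_all_checks(size_t n)
{
	int s = 2;		/* 0(t0), -1(a3) */
	if (n <= 2) return s;
	s += 4;			/* 1(t0), 2(t0), -2(a3), -3(a3) */
	if (n <= 6) return s;
	s += 2;			/* 3(t0), -4(a3) */
	if (n <= 8) return s;
	s += 2;			/* 4(t0), -5(a3) */
	if (n <= 10) return s;
	s += 4;			/* 5(t0), 6(t0), -6(a3), -7(a3) */
	if (n <= 14) return s;
	return s + 1;		/* 7(t0) */
}

/* Same pattern, but with the checks against 8 and 10 removed:
 * the middle six stores are issued in one unchecked burst. */
static int stores_without_mid_checks(size_t n)
{
	int s = 2;
	if (n <= 2) return s;
	s += 4;
	if (n <= 6) return s;
	s += 8;			/* 3..6(t0) and -4..-7(a3) */
	if (n <= 14) return s;
	return s + 1;
}
```

At n = 8 the full set of checks issues exactly 8 stores, while the version without the mid checks issues 14 (6 redundant), which is consistent with the direction of the reported 0.21 vs 0.27 bytes/ns numbers.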
* Re: [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting
  From: Andrew Jones <ajones@ventanamicro.com> @ 2023-05-09  9:48 UTC
  To: zhangfei
  Cc: aou, linux-kernel, linux-riscv, palmer, paul.walmsley, zhangfei, conor.dooley

On Tue, May 09, 2023 at 10:22:05AM +0800, zhangfei wrote:
> From: Andrew Jones <ajones@ventanamicro.com>
>
> Aligning the first operand of each instruction with a tab is a
> typical style which improves readability. Apply it to memset.S.
> While there, we also make a small grammar change to a comment.
>
> No functional change intended.
>
> Signed-off-by: Andrew Jones <ajones@ventanamicro.com>

Please pick up Conor's r-b on this reposting.

Thanks,
drew

> ---
>  arch/riscv/lib/memset.S | 143 ++++++++++++++++++++--------------------
>  1 file changed, 72 insertions(+), 71 deletions(-)
>
> diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
> index 34c5360c6705..e613c5c27998 100644
> --- a/arch/riscv/lib/memset.S
> +++ b/arch/riscv/lib/memset.S
> @@ -3,111 +3,112 @@
>   * Copyright (C) 2013 Regents of the University of California
>   */
>
> -
>  #include <linux/linkage.h>
>  #include <asm/asm.h>
>
>  /* void *memset(void *, int, size_t) */
>  ENTRY(__memset)
>  WEAK(memset)
> -	move t0, a0 /* Preserve return value */
> +	move	t0, a0	/* Preserve return value */
>
>  	/* Defer to byte-oriented fill for small sizes */
> -	sltiu a3, a2, 16
> -	bnez a3, 4f
> +	sltiu	a3, a2, 16
> +	bnez	a3, 4f
>
>  	/*
>  	 * Round to nearest XLEN-aligned address
> -	 * greater than or equal to start address
> +	 * greater than or equal to the start address.
>  	 */
> -	addi a3, t0, SZREG-1
> -	andi a3, a3, ~(SZREG-1)
> -	beq a3, t0, 2f /* Skip if already aligned */
> +	addi	a3, t0, SZREG-1
> +	andi	a3, a3, ~(SZREG-1)
> +	beq	a3, t0, 2f	/* Skip if already aligned */
> +
>  	/* Handle initial misalignment */
> -	sub a4, a3, t0
> +	sub	a4, a3, t0
>  1:
> -	sb a1, 0(t0)
> -	addi t0, t0, 1
> -	bltu t0, a3, 1b
> -	sub a2, a2, a4 /* Update count */
> +	sb	a1, 0(t0)
> +	addi	t0, t0, 1
> +	bltu	t0, a3, 1b
> +	sub	a2, a2, a4	/* Update count */
>
>  2: /* Duff's device with 32 XLEN stores per iteration */
>  	/* Broadcast value into all bytes */
> -	andi a1, a1, 0xff
> -	slli a3, a1, 8
> -	or a1, a3, a1
> -	slli a3, a1, 16
> -	or a1, a3, a1
> +	andi	a1, a1, 0xff
> +	slli	a3, a1, 8
> +	or	a1, a3, a1
> +	slli	a3, a1, 16
> +	or	a1, a3, a1
>  #ifdef CONFIG_64BIT
> -	slli a3, a1, 32
> -	or a1, a3, a1
> +	slli	a3, a1, 32
> +	or	a1, a3, a1
>  #endif
>
>  	/* Calculate end address */
> -	andi a4, a2, ~(SZREG-1)
> -	add a3, t0, a4
> +	andi	a4, a2, ~(SZREG-1)
> +	add	a3, t0, a4
>
> -	andi a4, a4, 31*SZREG /* Calculate remainder */
> -	beqz a4, 3f /* Shortcut if no remainder */
> -	neg a4, a4
> -	addi a4, a4, 32*SZREG /* Calculate initial offset */
> +	andi	a4, a4, 31*SZREG	/* Calculate remainder */
> +	beqz	a4, 3f			/* Shortcut if no remainder */
> +	neg	a4, a4
> +	addi	a4, a4, 32*SZREG	/* Calculate initial offset */
>
>  	/* Adjust start address with offset */
> -	sub t0, t0, a4
> +	sub	t0, t0, a4
>
>  	/* Jump into loop body */
>  	/* Assumes 32-bit instruction lengths */
> -	la a5, 3f
> +	la	a5, 3f
>  #ifdef CONFIG_64BIT
> -	srli a4, a4, 1
> +	srli	a4, a4, 1
>  #endif
> -	add a5, a5, a4
> -	jr a5
> +	add	a5, a5, a4
> +	jr	a5
>  3:
> -	REG_S a1, 0(t0)
> -	REG_S a1, SZREG(t0)
> -	REG_S a1, 2*SZREG(t0)
> -	REG_S a1, 3*SZREG(t0)
> -	REG_S a1, 4*SZREG(t0)
> -	REG_S a1, 5*SZREG(t0)
> -	REG_S a1, 6*SZREG(t0)
> -	REG_S a1, 7*SZREG(t0)
> -	REG_S a1, 8*SZREG(t0)
> -	REG_S a1, 9*SZREG(t0)
> -	REG_S a1, 10*SZREG(t0)
> -	REG_S a1, 11*SZREG(t0)
> -	REG_S a1, 12*SZREG(t0)
> -	REG_S a1, 13*SZREG(t0)
> -	REG_S a1, 14*SZREG(t0)
> -	REG_S a1, 15*SZREG(t0)
> -	REG_S a1, 16*SZREG(t0)
> -	REG_S a1, 17*SZREG(t0)
> -	REG_S a1, 18*SZREG(t0)
> -	REG_S a1, 19*SZREG(t0)
> -	REG_S a1, 20*SZREG(t0)
> -	REG_S a1, 21*SZREG(t0)
> -	REG_S a1, 22*SZREG(t0)
> -	REG_S a1, 23*SZREG(t0)
> -	REG_S a1, 24*SZREG(t0)
> -	REG_S a1, 25*SZREG(t0)
> -	REG_S a1, 26*SZREG(t0)
> -	REG_S a1, 27*SZREG(t0)
> -	REG_S a1, 28*SZREG(t0)
> -	REG_S a1, 29*SZREG(t0)
> -	REG_S a1, 30*SZREG(t0)
> -	REG_S a1, 31*SZREG(t0)
> -	addi t0, t0, 32*SZREG
> -	bltu t0, a3, 3b
> -	andi a2, a2, SZREG-1 /* Update count */
> +	REG_S	a1, 0(t0)
> +	REG_S	a1, SZREG(t0)
> +	REG_S	a1, 2*SZREG(t0)
> +	REG_S	a1, 3*SZREG(t0)
> +	REG_S	a1, 4*SZREG(t0)
> +	REG_S	a1, 5*SZREG(t0)
> +	REG_S	a1, 6*SZREG(t0)
> +	REG_S	a1, 7*SZREG(t0)
> +	REG_S	a1, 8*SZREG(t0)
> +	REG_S	a1, 9*SZREG(t0)
> +	REG_S	a1, 10*SZREG(t0)
> +	REG_S	a1, 11*SZREG(t0)
> +	REG_S	a1, 12*SZREG(t0)
> +	REG_S	a1, 13*SZREG(t0)
> +	REG_S	a1, 14*SZREG(t0)
> +	REG_S	a1, 15*SZREG(t0)
> +	REG_S	a1, 16*SZREG(t0)
> +	REG_S	a1, 17*SZREG(t0)
> +	REG_S	a1, 18*SZREG(t0)
> +	REG_S	a1, 19*SZREG(t0)
> +	REG_S	a1, 20*SZREG(t0)
> +	REG_S	a1, 21*SZREG(t0)
> +	REG_S	a1, 22*SZREG(t0)
> +	REG_S	a1, 23*SZREG(t0)
> +	REG_S	a1, 24*SZREG(t0)
> +	REG_S	a1, 25*SZREG(t0)
> +	REG_S	a1, 26*SZREG(t0)
> +	REG_S	a1, 27*SZREG(t0)
> +	REG_S	a1, 28*SZREG(t0)
> +	REG_S	a1, 29*SZREG(t0)
> +	REG_S	a1, 30*SZREG(t0)
> +	REG_S	a1, 31*SZREG(t0)
> +
> +	addi	t0, t0, 32*SZREG
> +	bltu	t0, a3, 3b
> +	andi	a2, a2, SZREG-1	/* Update count */
>
>  4:
>  	/* Handle trailing misalignment */
> -	beqz a2, 6f
> -	add a3, t0, a2
> +	beqz	a2, 6f
> +	add	a3, t0, a2
>  5:
> -	sb a1, 0(t0)
> -	addi t0, t0, 1
> -	bltu t0, a3, 5b
> +	sb	a1, 0(t0)
> +	addi	t0, t0, 1
> +	bltu	t0, a3, 5b
>  6:
>  	ret
>  END(__memset)
> --
> 2.33.0
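[Editor's note: the "Duff's device" named in the comment above translates naturally into C, the language the idiom comes from. The sketch below is illustrative only, not the kernel code: it unrolls by 8 words rather than the 32 XLEN stores of memset.S, and `word_fill` is an invented name. The shift/or chain mirrors the byte-broadcast sequence, and the switch that jumps into the middle of the unrolled do-while plays the role of the computed `jr a5`.]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Duff's-device style word fill: broadcast c across a 64-bit word,
 * then jump into an 8-way unrolled store loop at the right entry
 * point so no separate remainder loop is needed. */
static void word_fill(uint64_t *p, unsigned char c, size_t nwords)
{
	/* Broadcast value into all bytes, like the slli/or chain */
	uint64_t v = c;
	v |= v << 8;
	v |= v << 16;
	v |= v << 32;

	if (nwords == 0)
		return;
	uint64_t *end = p + nwords;

	switch (nwords % 8) {	/* computed entry point, like `jr a5` */
		do {
	case 0:	*p++ = v;	/* each case falls through on purpose */
	case 7:	*p++ = v;
	case 6:	*p++ = v;
	case 5:	*p++ = v;
	case 4:	*p++ = v;
	case 3:	*p++ = v;
	case 2:	*p++ = v;
	case 1:	*p++ = v;
		} while (p < end);
	}
}
```

A remainder of r words enters r cases from the bottom of the block, then every later iteration of the do-while executes all 8 stores, just as the assembly jumps partway into its block of 32 `REG_S` instructions on the first pass.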
end of thread, other threads: [~2023-05-11  1:55 UTC | newest]

Thread overview: 15+ messages
2023-05-05  8:43 [PATCH] riscv: Optimize memset 张飞
2023-05-05 11:54 ` Andrew Jones
2023-05-09  2:22 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei
2023-05-09  2:22 ` [PATCH 2/2] riscv: Optimize memset zhangfei
2023-05-09  2:22 ` Re: [PATCH] " zhangfei
2023-05-09  9:16 ` Andrew Jones
2023-05-10  3:52 ` [PATCH 0/2] riscv: Optimize memset for data sizes less than 16 bytes zhangfei
2023-05-10  3:52 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting zhangfei
2023-05-10  7:13 ` Conor Dooley
2023-05-11  1:55 ` zhangfei
2023-05-10  3:52 ` [PATCH 2/2] riscv: Optimize memset zhangfei
2023-05-10  3:52 ` [PATCH] " zhangfei
2023-05-10  6:58 ` Andrew Jones
2023-05-11  1:42 ` zhangfei
2023-05-09  9:48 ` [PATCH 1/2] RISC-V: lib: Improve memset assembler formatting Andrew Jones