public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed
* Slowdown copying data between kernel versions 4.19 and 5.15
@ 2023-06-23 21:30 Havens, Austin
  2023-06-28 21:38 ` Havens, Austin
  0 siblings, 1 reply; 9+ messages in thread
From: Havens, Austin @ 2023-06-23 21:30 UTC (permalink / raw)
  To: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com
  Cc: Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org

Hi all,
In the process of updating our kernel from 4.19 to 5.15 we noticed a slowdown when copying data.  We are using  Zynqmp 9EG SoCs and basically following the Xilinx/AMD release branches (though a bit behind).  I did some sample based profiling with perf, and it showed that a lot of the time was in __arch_copy_from_user, and since the amount of data getting copied is the same, it seems like it is spending more time in each __arch_copy_from_user call. 

 I made  a test program to replicate the issue and here is what I see (i used the same binary on both versions to rule out differences from the compiler). 

root@smudge:/tmp# uname -a
Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy

 Performance counter stats for '/mnt/usrroot/test_copy':

          13202623      instructions              #    0.25  insn per cycle         
          52947780      cycles                                                      
          37588761      ld_dep_stall                                                
             16301      read_alloc                                                  
              1660      dTLB-load-misses                                            

       0.044990363 seconds time elapsed

       0.004092000 seconds user
       0.040920000 seconds sys

root@ahraptor:/tmp# uname -a
Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy

 Performance counter stats for '/mnt/usrroot/test_copy':

          11625888      instructions              #    0.14  insn per cycle         
          83135040      cycles                                                      
          69833562      ld_dep_stall                                                
             27948      read_alloc                                                  
              3367      dTLB-load-misses                                            

       0.070537894 seconds time elapsed

       0.004165000 seconds user
       0.066643000 seconds sys


After some investigation I am guessing the issue is either in the iovector iteration changes (around https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out of my depth so it is just speculation. 

Here is the C++ code for the test program (I compiled it with G++ -O3), note that in our products we have the FPGA writing to a dedicated memory carveout which is where I have the /dev/mem mmap, you would have to change that to somewhere else to run. 

#include <iostream>
#include <memory>
#include <fstream>
#include <vector>

#include <fcntl.h>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

using namespace std;

struct CaptureChunk
{
uint32_t partitionOffset, captureSizeInBytes;
};

static constexpr uint32_t MAX_WRITE_BYTES = (4096 * 256); // 1MiB
constexpr size_t copySize = 4096*1000;
constexpr size_t databufferSize =  copySize;

void updateTotalBytes(uint32_t bytesWritten)
{
 
}

void writeChunkToFile(const char* data, const CaptureChunk& chunk, const std::string& filePath)
{
    std::ofstream file;
    // The default buffer size seems to be very large and has to be written on close.
    // This would cause aborting the save to take several minutes (RAP-6926).
    // I don't thing we want it completely unbuffered either so we have to choose
    // a size. I think the page size is probably a pretty good bet for a good buffer
    // size. We could get it with sysconf(_SC_PAGESIZE); but I am just going to
    // use 4096 directly since that is almost always what it is, and that way we
    // won't have to change it on Windows which does not have sysconf.
    long sz = 4096;
    std::vector<char> buffer;
    buffer.resize(sz);
    file.rdbuf()->pubsetbuf(buffer.data(), sz);
    file.open(filePath.c_str(), std::ofstream::out | std::ofstream::binary);
    uint32_t readOffset = chunk.partitionOffset;
    uint32_t bytesRemainingToBeWritten = chunk.captureSizeInBytes;
        
    while(bytesRemainingToBeWritten > 0)
    {
        uint32_t bytesToWrite = std::min(MAX_WRITE_BYTES, bytesRemainingToBeWritten);
        if (readOffset + bytesToWrite > databufferSize)
        {
            bytesToWrite = databufferSize - readOffset;
        }
        file.write(data + readOffset, bytesToWrite);
        if (file.fail())
        {
            cout<< " failed to write " << filePath;
            break;
        }
        updateTotalBytes(bytesToWrite);
        bytesRemainingToBeWritten -= bytesToWrite;
        readOffset += bytesToWrite;
        if (readOffset == databufferSize)
        {
            readOffset = 0; //wrap around
        }
    }
    file.close();

}


void* getBuffer(int32_t& device)
{
    size_t size = databufferSize;
    device = ::open("/dev/mem", O_RDWR | O_SYNC);

    if (device < 0)
    {
        cout << "could not open dev/mem ";
    }
    // The databuffer driver should take care of getting the physical address
    void* buffer = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, device,  0x800000000);

    if (buffer== (void*)MAP_FAILED)
    {
        ::close(device);
        device = -1;
        buffer = nullptr;   
        cout << "could not mmap dev/mem ";     
    }
    return buffer;
}

int main()
{
    std::string fileName= "test_file.bin";
    CaptureChunk testChunk {.partitionOffset=0, .captureSizeInBytes=copySize};
    int32_t device;
    char* buffer= (char*)getBuffer(device);
    #ifdef copy_buffer
    char* copyBuffer = (char*)std::malloc(copySize);
    std::memcpy(copyBuffer, buffer, copySize);
    writeChunkToFile(copyBuffer, testChunk, fileName );
    #else
    writeChunkToFile(buffer, testChunk, fileName );
    #endif
    
    ::close(device);
    
    return 0;
}

Any help will be greatly appreciated.
-Austin



_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-23 21:30 Slowdown copying data between kernel versions 4.19 and 5.15 Havens, Austin
@ 2023-06-28 21:38 ` Havens, Austin
  2023-06-29 13:09   ` Robin Murphy
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Havens, Austin @ 2023-06-28 21:38 UTC (permalink / raw)
  To: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com
  Cc: Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org


<Friday, June 23, 2023 14:30 PM PDT, Austin Havens (Anritsu)> 
>Hi all,
>In the process of updating our kernel from 4.19 to 5.15 we noticed a slowdown when copying data.  We are using  Zynqmp 9EG SoCs and basically following the Xilinx/AMD release branches (though a bit behind).  I did some sample based profiling with perf, and it showed that a lot of the time was in __arch_copy_from_user, and since the amount of data getting copied is the same, it seems like it is spending more time in each __arch_copy_from_user call. 
>
 >I made  a test program to replicate the issue and here is what I see (i used the same binary on both versions to rule out differences from the compiler). 
>
>root@smudge:/tmp# uname -a
>Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
>root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>
> Performance counter stats for '/mnt/usrroot/test_copy':
>
>          13202623      instructions              #    0.25  insn per cycle         
>          52947780      cycles                                                      
>          37588761      ld_dep_stall                                                
>             16301      read_alloc                                                  
>              1660      dTLB-load-misses                                            
>
>       0.044990363 seconds time elapsed
>
>       0.004092000 seconds user
>       0.040920000 seconds sys
>
>root@ahraptor:/tmp# uname -a
>Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
>root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>
> Performance counter stats for '/mnt/usrroot/test_copy':
>
>          11625888      instructions              #    0.14  insn per cycle         
>          83135040      cycles                                                      
>          69833562      ld_dep_stall                                                
>             27948      read_alloc                                                  
>              3367      dTLB-load-misses                                            
>
>       0.070537894 seconds time elapsed
>
>       0.004165000 seconds user
>       0.066643000 seconds sys
>

>After some investigation I am guessing the issue is either in the iovector iteration changes (around https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out of my depth so it is just speculation. 
>

After comparing the dissassembly of __arch_copy_from_user on both kernels and going through commit logs, I figured out the slowdown was mostly due to to the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially the changes to uao_ldp. 

diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
index 2c26ca5b7bb0..2b5454fa0f24 100644
--- a/arch/arm64/include/asm/asm-uaccess.h
+++ b/arch/arm64/include/asm/asm-uaccess.h
@@ -59,62 +59,32 @@ alternative_else_nop_endif
 #endif
 
 /*
- * Generate the assembly for UAO alternatives with exception table entries.
+ * Generate the assembly for LDTR/STTR with exception table entries.
  * This is complicated as there is no post-increment or pair versions of the
  * unprivileged instructions, and USER() only works for single instructions.
  */
-#ifdef CONFIG_ARM64_UAO
        .macro uao_ldp l, reg1, reg2, addr, post_inc
-               alternative_if_not ARM64_HAS_UAO
-8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
-8889:                  nop;
-                       nop;
-               alternative_else
-                       ldtr    \reg1, [\addr];
-                       ldtr    \reg2, [\addr, #8];
-                       add     \addr, \addr, \post_inc;
-               alternative_endif
+8888:          ldtr    \reg1, [\addr];
+8889:          ldtr    \reg2, [\addr, #8];
+               add     \addr, \addr, \post_inc;
 
                _asm_extable    8888b,\l;
                _asm_extable    8889b,\l;
        .endm

I could not directly revert the changes to test since more names changed in other commits than I cared to figure out, but I hacked out that change, and saw that the performance of the test program was basically back to normal. 

diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
index ccedf548dac9..2ddf7eba46fd 100644
--- a/arch/arm64/include/asm/asm-uaccess.h
+++ b/arch/arm64/include/asm/asm-uaccess.h
@@ -64,9 +64,9 @@ alternative_else_nop_endif
  * unprivileged instructions, and USER() only works for single instructions.
  */
        .macro user_ldp l, reg1, reg2, addr, post_inc
-8888:          ldtr    \reg1, [\addr];
-8889:          ldtr    \reg2, [\addr, #8];
-               add     \addr, \addr, \post_inc;
+8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
+8889:          nop;
+               nop;


Profiling with the hacked __arch_copy_from_user 
root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy

 Performance counter stats for '/mnt/usrroot/test_copy':

          11822342      instructions              #    0.23  insn per cycle         
          50689594      cycles                                                      
          37627922      ld_dep_stall                                                
             17933      read_alloc                                                  
              3421      dTLB-load-misses                                            

       0.043440253 seconds time elapsed

       0.004382000 seconds user
       0.039442000 seconds sys

Unfortunately the hack crashes in other cases so it is not a viable solution for us. Also, on our actual workload there is still a small difference in performance remaining that I have not tracked down yet (I am guessing it has to do with the dTLB-load-misses remaining higher). 

Note, I think that the slow down is only noticeable in cases like ours where the data being copied from is not in cache (for us, because the FPGA writes it).

If anyone knows if this slowdown is expected, or if there are any workarounds, that would be helpful.
-Austin 
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-28 21:38 ` Havens, Austin
@ 2023-06-29 13:09   ` Robin Murphy
  2023-06-29 13:36   ` Catalin Marinas
  2023-06-29 14:24   ` Mark Rutland
  2 siblings, 0 replies; 9+ messages in thread
From: Robin Murphy @ 2023-06-29 13:09 UTC (permalink / raw)
  To: Havens, Austin, catalin.marinas@arm.com, will@kernel.org,
	michal.simek@amd.com
  Cc: Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org

On 2023-06-28 22:38, Havens, Austin wrote:
> 
> <Friday, June 23, 2023 14:30 PM PDT, Austin Havens (Anritsu)>
>> Hi all,
>> In the process of updating our kernel from 4.19 to 5.15 we noticed a slowdown when copying data.  We are using  Zynqmp 9EG SoCs and basically following the Xilinx/AMD release branches (though a bit behind).  I did some sample based profiling with perf, and it showed that a lot of the time was in __arch_copy_from_user, and since the amount of data getting copied is the same, it seems like it is spending more time in each __arch_copy_from_user call.
>>
>   >I made  a test program to replicate the issue and here is what I see (i used the same binary on both versions to rule out differences from the compiler).
>>
>> root@smudge:/tmp# uname -a
>> Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
>> root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>>
>>   Performance counter stats for '/mnt/usrroot/test_copy':
>>
>>            13202623      instructions              #    0.25  insn per cycle
>>            52947780      cycles
>>            37588761      ld_dep_stall
>>               16301      read_alloc
>>                1660      dTLB-load-misses
>>
>>         0.044990363 seconds time elapsed
>>
>>        0.004092000 seconds user
>>         0.040920000 seconds sys
>>
>> root@ahraptor:/tmp# uname -a
>> Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
>> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>>
>>   Performance counter stats for '/mnt/usrroot/test_copy':
>>
>>           11625888      instructions              #    0.14  insn per cycle
>>            83135040      cycles
>>            69833562      ld_dep_stall
>>               27948      read_alloc
>>                3367      dTLB-load-misses
>>
>>         0.070537894 seconds time elapsed
>>
>>         0.004165000 seconds user
>>         0.066643000 seconds sys
>>
> 
>> After some investigation I am guessing the issue is either in the iovector iteration changes (around https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out of my depth so it is just speculation.
>>
> 
> After comparing the dissassembly of __arch_copy_from_user on both kernels and going through commit logs, I figured out the slowdown was mostly due to to the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially the changes to uao_ldp.
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index 2c26ca5b7bb0..2b5454fa0f24 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>   #endif
>   
>   /*
> - * Generate the assembly for UAO alternatives with exception table entries.
> + * Generate the assembly for LDTR/STTR with exception table entries.
>    * This is complicated as there is no post-increment or pair versions of the
>    * unprivileged instructions, and USER() only works for single instructions.
>    */
> -#ifdef CONFIG_ARM64_UAO
>          .macro uao_ldp l, reg1, reg2, addr, post_inc
> -               alternative_if_not ARM64_HAS_UAO
> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
> -8889:                  nop;
> -                       nop;
> -               alternative_else
> -                       ldtr    \reg1, [\addr];
> -                       ldtr    \reg2, [\addr, #8];
> -                       add     \addr, \addr, \post_inc;
> -               alternative_endif
> +8888:          ldtr    \reg1, [\addr];
> +8889:          ldtr    \reg2, [\addr, #8];
> +               add     \addr, \addr, \post_inc;
>   
>                  _asm_extable    8888b,\l;
>                  _asm_extable    8889b,\l;
>          .endm
> 
> I could not directly revert the changes to test since more names changed in other commits than I cared to figure out, but I hacked out that change, and saw that the performance of the test program was basically back to normal.
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>    * unprivileged instructions, and USER() only works for single instructions.
>    */
>          .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:          ldtr    \reg1, [\addr];
> -8889:          ldtr    \reg2, [\addr, #8];
> -               add     \addr, \addr, \post_inc;
> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:          nop;
> +               nop;
> 
> 
> Profiling with the hacked __arch_copy_from_user
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> 
>   Performance counter stats for '/mnt/usrroot/test_copy':
> 
>            11822342      instructions              #    0.23  insn per cycle
>            50689594      cycles
>            37627922      ld_dep_stall
>               17933      read_alloc
>                3421      dTLB-load-misses
> 
>         0.043440253 seconds time elapsed
> 
>         0.004382000 seconds user
>         0.039442000 seconds sys
> 
> Unfortunately the hack crashes in other cases so it is not a viable solution for us. Also, on our actual workload there is still a small difference in performance remaining that I have not tracked down yet (I am guessing it has to do with the dTLB-load-misses remaining higher).
> 
> Note, I think that the slow down is only noticeable in cases like ours where the data being copied from is not in cache (for us, because the FPGA writes it).
> 
> If anyone knows if this slowdown is expected, or if there are any workarounds, that would be helpful.

Some slowdown is to be expected after the removal of set_fs() in 5.11 
means we always use the unprivileged load/store instructions for 
userspace access. There are no register-pair forms of LDTR/STTR, so the 
uaccess routines become inherently less efficient than a regular 
memcpy() could achieve (the doubling seen in those perf event counts 
represents the fact that the CPU is literally issuing twice as many load 
ops), but the tradeoff is that they are now significantly more robust 
and harder to exploit.

There is nominally a bit of performance still left on the table for 
smaller copies below 1-2KB, and particularly below 128 bytes or so, due 
to the current compromises made for fault handling. I had a go at 
improving this a while back with [1], but that series got parked since 
my copy_to_user() has an insidious bug which we couldn't easily figure 
out, and frankly all the fixup shenanigans is somewhat of a deranged 
nightmare that even I can no longer make sense of. Mark then tried a 
more rigorous approach of attempting to nail down the API semantics with 
tests first[2], but that ended up just leaving more open questions that 
nobody's found the time to think about further.

Thanks,
Robin.

[1] https://lore.kernel.org/r/cover.1664363162.git.robin.murphy@arm.com/
[2] https://lore.kernel.org/r/20230321122514.1743889-1-mark.rutland@arm.com/

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-28 21:38 ` Havens, Austin
  2023-06-29 13:09   ` Robin Murphy
@ 2023-06-29 13:36   ` Catalin Marinas
  2023-06-29 14:24   ` Mark Rutland
  2 siblings, 0 replies; 9+ messages in thread
From: Catalin Marinas @ 2023-06-29 13:36 UTC (permalink / raw)
  To: Havens, Austin
  Cc: will@kernel.org, michal.simek@amd.com, Suresh, Siddarth,
	Lui, Vincent, Mark Rutland, linux-arm-kernel@lists.infradead.org

Hi Austin,

On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> <Friday, June 23, 2023 14:30 PM PDT, Austin Havens (Anritsu)> 
> >In the process of updating our kernel from 4.19 to 5.15 we noticed a
> >slowdown when copying data.  We are using  Zynqmp 9EG SoCs and
> >basically following the Xilinx/AMD release branches (though a bit
> >behind).  I did some sample based profiling with perf, and it showed
> >that a lot of the time was in __arch_copy_from_user, and since the
> >amount of data getting copied is the same, it seems like it is
> >spending more time in each __arch_copy_from_user call. 

Thanks for digging into this. Which CPUs does this SoC have? Cortex-A53?

> >I made  a test program to replicate the issue and here is what I see
> >(i used the same binary on both versions to rule out differences from
> >the compiler). 
> >
> >root@smudge:/tmp# uname -a
> >Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> >          13202623      instructions              #    0.25  insn per cycle         
> >          52947780      cycles                                                      
> >          37588761      ld_dep_stall                                                
> >             16301      read_alloc                                                  
> >              1660      dTLB-load-misses                                            
> >
> >       0.044990363 seconds time elapsed
> >
> >       0.004092000 seconds user
> >       0.040920000 seconds sys
> >
> >root@ahraptor:/tmp# uname -a
> >Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> >          11625888      instructions              #    0.14  insn per cycle         
> >          83135040      cycles                                                      
> >          69833562      ld_dep_stall                                                
> >             27948      read_alloc                                                  
> >              3367      dTLB-load-misses                                            
> >
> >       0.070537894 seconds time elapsed
> >
> >       0.004165000 seconds user
> >       0.066643000 seconds sys

It is indeed a significant slowdown but does it show in real world
scenarios or mostly in microbenchmarks?

> After comparing the dissassembly of __arch_copy_from_user on both
> kernels and going through commit logs, I figured out the slowdown was
> mostly due to to the changes from commit
> c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially the changes to
> uao_ldp. 

Your commit is missing an 'f' in front, it should be fc703d80130b
("arm64: uaccess: split user/kernel routines"). This is indeed replacing
one LDP with two LDTR instructions. The reason for this is that we
wanted to only use the 'T' variants of the access instructions so that
they are executed with EL0 privileges (better for security).

> I could not directly revert the changes to test since more names
> changed in other commits than I cared to figure out, but I hacked out
> that change, and saw that the performance of the test program was
> basically back to normal. 
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
>         .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:          ldtr    \reg1, [\addr];
> -8889:          ldtr    \reg2, [\addr, #8];
> -               add     \addr, \addr, \post_inc;
> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:          nop;
> +               nop;

This won't work in all cases. For example on newer CPUs it fails if PAN
is enabled since LDP won't be able to access user space. While we could
disable PAN for these routines (and MTE tag checking), this also
overrides the execute-only user permissions since now an LDP/STP is
allowed to access them.

So, I'm not adding them back for a specific microarchitecture.

> Profiling with the hacked __arch_copy_from_user
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> 
>  Performance counter stats for '/mnt/usrroot/test_copy':
> 
>           11822342      instructions              #    0.23  insn per cycle
>           50689594      cycles
>           37627922      ld_dep_stall
>              17933      read_alloc
>               3421      dTLB-load-misses
> 
>        0.043440253 seconds time elapsed
> 
>        0.004382000 seconds user
>        0.039442000 seconds sys
> 
> Unfortunately the hack crashes in other cases so it is not a viable
> solution for us. Also, on our actual workload there is still a small
> difference in performance remaining that I have not tracked down yet
> (I am guessing it has to do with the dTLB-load-misses remaining
> higher). 
> 
> Note, I think that the slow down is only noticeable in cases like ours
> where the data being copied from is not in cache (for us, because the
> FPGA writes it).

I can see Robin already replied mentioning the usercopy refresh series,
maybe these do improve performance.

Since you mentioned the data is not cached, you could experiment with
some prefetching in copy loop, maybe it boosts the performance a bit.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-28 21:38 ` Havens, Austin
  2023-06-29 13:09   ` Robin Murphy
  2023-06-29 13:36   ` Catalin Marinas
@ 2023-06-29 14:24   ` Mark Rutland
  2023-06-29 18:33     ` Robin Murphy
  2023-06-29 19:33     ` Havens, Austin
  2 siblings, 2 replies; 9+ messages in thread
From: Mark Rutland @ 2023-06-29 14:24 UTC (permalink / raw)
  To: Havens, Austin
  Cc: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com,
	Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org

On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> >After some investigation I am guessing the issue is either in the iovector
> >iteration changes (around
> >https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
> >lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
> >of my depth so it is just speculation. 
> 
> After comparing the dissassembly of __arch_copy_from_user on both kernels and
> going through commit logs, I figured out the slowdown was mostly due to to
> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially
> the changes to uao_ldp. 

For the benefit of others, that's commit:

  fc703d80130b1c9d ("arm64: uaccess: split user/kernel routine")

> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index 2c26ca5b7bb0..2b5454fa0f24 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>  #endif
>  
>  /*
> - * Generate the assembly for UAO alternatives with exception table entries.
> + * Generate the assembly for LDTR/STTR with exception table entries.
>   * This is complicated as there is no post-increment or pair versions of the
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
> -#ifdef CONFIG_ARM64_UAO
>         .macro uao_ldp l, reg1, reg2, addr, post_inc
> -               alternative_if_not ARM64_HAS_UAO
> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
> -8889:                  nop;
> -                       nop;
> -               alternative_else
> -                       ldtr    \reg1, [\addr];
> -                       ldtr    \reg2, [\addr, #8];
> -                       add     \addr, \addr, \post_inc;
> -               alternative_endif
> +8888:          ldtr    \reg1, [\addr];
> +8889:          ldtr    \reg2, [\addr, #8];
> +               add     \addr, \addr, \post_inc;
>  
>                 _asm_extable    8888b,\l;
>                 _asm_extable    8889b,\l;
>         .endm
> 
> I could not directly revert the changes to test since more names changed in
> other commits than I cared to figure out, but I hacked out that change, and
> saw that the performance of the test program was basically back to normal. 
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
>         .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:          ldtr    \reg1, [\addr];
> -8889:          ldtr    \reg2, [\addr, #8];
> -               add     \addr, \addr, \post_inc;
> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:          nop;
> +               nop;

As Catalin noted, we can't make that change generally as it'd be broken for any
system with PAN, and in general we *really* want to use LDTR/STTR for user
accesses to catch any misuse with kernel pointers.

> Profiling with the hacked __arch_copy_from_user 
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> 
>  Performance counter stats for '/mnt/usrroot/test_copy':
> 
>           11822342      instructions              #    0.23  insn per cycle         
>           50689594      cycles                                                      
>           37627922      ld_dep_stall                                                
>              17933      read_alloc                                                  
>               3421      dTLB-load-misses                                            
> 
>        0.043440253 seconds time elapsed
> 
>        0.004382000 seconds user
>        0.039442000 seconds sys
> 
> Unfortunately the hack crashes in other cases so it is not a viable solution
> for us. Also, on our actual workload there is still a small difference in
> performance remaining that I have not tracked down yet (I am guessing it has
> to do with the dTLB-load-misses remaining higher). 
> 
> Note, I think that the slow down is only noticeable in cases like ours where
> the data being copied from is not in cache (for us, because the FPGA writes
> it).

When you say "is not in cache", what exactly do you mean? If this were just the
latency of filling a cache I wouldn't expect the size of the first access to
make a difference, so I'm assuming the source buffer is not mapped with
cacheable memory attributes, which we generally assume.

Which memory attribues are the source and destination buffers mapped with? Is
that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?

I'm assuming this is with some out-of-tree driver; if that's in a public tree
could you please provide a pointer to it?

Thanks,
Mark.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-29 14:24   ` Mark Rutland
@ 2023-06-29 18:33     ` Robin Murphy
  2023-06-29 19:33     ` Havens, Austin
  1 sibling, 0 replies; 9+ messages in thread
From: Robin Murphy @ 2023-06-29 18:33 UTC (permalink / raw)
  To: Mark Rutland, Havens, Austin
  Cc: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com,
	Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org

On 2023-06-29 15:24, Mark Rutland wrote:
> On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
>>> After some investigation I am guessing the issue is either in the iovector
>>> iteration changes (around
>>> https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
>>> lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
>>> of my depth so it is just speculation.
>>
>> After comparing the dissassembly of __arch_copy_from_user on both kernels and
>> going through commit logs, I figured out the slowdown was mostly due to to
>> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially
>> the changes to uao_ldp.
> 
> For the benefit of others, that's commit:
> 
>    fc703d80130b1c9d ("arm64: uaccess: split user/kernel routine")
> 
>>
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index 2c26ca5b7bb0..2b5454fa0f24 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>>   #endif
>>   
>>   /*
>> - * Generate the assembly for UAO alternatives with exception table entries.
>> + * Generate the assembly for LDTR/STTR with exception table entries.
>>    * This is complicated as there is no post-increment or pair versions of the
>>    * unprivileged instructions, and USER() only works for single instructions.
>>    */
>> -#ifdef CONFIG_ARM64_UAO
>>          .macro uao_ldp l, reg1, reg2, addr, post_inc
>> -               alternative_if_not ARM64_HAS_UAO
>> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
>> -8889:                  nop;
>> -                       nop;
>> -               alternative_else
>> -                       ldtr    \reg1, [\addr];
>> -                       ldtr    \reg2, [\addr, #8];
>> -                       add     \addr, \addr, \post_inc;
>> -               alternative_endif
>> +8888:          ldtr    \reg1, [\addr];
>> +8889:          ldtr    \reg2, [\addr, #8];
>> +               add     \addr, \addr, \post_inc;
>>   
>>                  _asm_extable    8888b,\l;
>>                  _asm_extable    8889b,\l;
>>          .endm
>>
>> I could not directly revert the changes to test since more names changed in
>> other commits than I cared to figure out, but I hacked out that change, and
>> saw that the performance of the test program was basically back to normal.
>>
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index ccedf548dac9..2ddf7eba46fd 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>>    * unprivileged instructions, and USER() only works for single instructions.
>>    */
>>          .macro user_ldp l, reg1, reg2, addr, post_inc
>> -8888:          ldtr    \reg1, [\addr];
>> -8889:          ldtr    \reg2, [\addr, #8];
>> -               add     \addr, \addr, \post_inc;
>> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
>> +8889:          nop;
>> +               nop;
> 
> As Catalin noted, we can't make that change generally as it'd be broken for any
> system with PAN, and in general we *really* want to use LDTR/STTR for user
> accesses to catch any misuse with kernel pointers.
> 
>> Profiling with the hacked __arch_copy_from_user
>> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>>
>>   Performance counter stats for '/mnt/usrroot/test_copy':
>>
>>            11822342      instructions              #    0.23  insn per cycle
>>            50689594      cycles
>>            37627922      ld_dep_stall
>>               17933      read_alloc
>>                3421      dTLB-load-misses
>>
>>         0.043440253 seconds time elapsed
>>
>>         0.004382000 seconds user
>>         0.039442000 seconds sys
>>
>> Unfortunately the hack crashes in other cases so it is not a viable solution
>> for us. Also, on our actual workload there is still a small difference in
>> performance remaining that I have not tracked down yet (I am guessing it has
>> to do with the dTLB-load-misses remaining higher).
>>
>> Note, I think that the slow down is only noticeable in cases like ours where
>> the data being copied from is not in cache (for us, because the FPGA writes
>> it).
> 
> When you say "is not in cache", what exactly do you mean? If this were just the
> latency of filling a cache I wouldn't expect the size of the first access to
> make a difference, so I'm assuming the source buffer is not mapped with
> cacheable memory attributes, which we generally assume.
> 
> Which memory attribues are the source and destination buffers mapped with? Is
> that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
> 
> I'm assuming this is with some out-of-tree driver; if that's in a public tree
> could you please provide a pointer to it?

Oh, re-reading the original mail, it looks like the copy is reading from 
a userspace mapping of /dev/mem, so it'll be Device - the ~50% 
performance drop did seem like more than I remember for Cortex-A53 in 
similar benchmarking (usercopy vs. optimal memcpy), but that was all 
cacheable, so doubling the number of actual memory system transactions 
does seem like it could plausibly account for the extra difference.

Cheers,
Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-29 14:24   ` Mark Rutland
  2023-06-29 18:33     ` Robin Murphy
@ 2023-06-29 19:33     ` Havens, Austin
  2023-06-30 11:15       ` Mark Rutland
  1 sibling, 1 reply; 9+ messages in thread
From: Havens, Austin @ 2023-06-29 19:33 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com,
	Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org

Hi Mark,
Thanks for the reply.

 On Thursday, June 29, 2023 7:74 AM Mark Rutland wrote:
> On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
>> >After some investigation I am guessing the issue is either in the iovector
>> >iteration changes (around
>> >https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
>> >lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
>> >of my depth so it is just speculation. 
>> 
>> After comparing the dissassembly of __arch_copy_from_user on both kernels and
>> going through commit logs, I figured out the slowdown was mostly due to to
>> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially
>> the changes to uao_ldp. 
>
> For the benefit of others, that's commit:
>
>   fc703d80130b1c9d ("arm64: uaccess: split user/kernel routine")
Sorry for the copy paste error.


>> 
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index 2c26ca5b7bb0..2b5454fa0f24 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>>  #endif
>>  
>>  /*
>> - * Generate the assembly for UAO alternatives with exception table entries.
>> + * Generate the assembly for LDTR/STTR with exception table entries.
>>   * This is complicated as there is no post-increment or pair versions of the
>>   * unprivileged instructions, and USER() only works for single instructions.
>>   */
>> -#ifdef CONFIG_ARM64_UAO
>>         .macro uao_ldp l, reg1, reg2, addr, post_inc
>> -               alternative_if_not ARM64_HAS_UAO
>> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
>> -8889:                  nop;
>> -                       nop;
>> -               alternative_else
>> -                       ldtr    \reg1, [\addr];
>> -                       ldtr    \reg2, [\addr, #8];
>> -                       add     \addr, \addr, \post_inc;
>> -               alternative_endif
>> +8888:          ldtr    \reg1, [\addr];
>> +8889:          ldtr    \reg2, [\addr, #8];
>> +               add     \addr, \addr, \post_inc;
>>  
>>                 _asm_extable    8888b,\l;
>>                 _asm_extable    8889b,\l;
>>         .endm
>> 
>> I could not directly revert the changes to test since more names changed in
>> other commits than I cared to figure out, but I hacked out that change, and
>> saw that the performance of the test program was basically back to normal. 
>> 
>> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
>> index ccedf548dac9..2ddf7eba46fd 100644
>> --- a/arch/arm64/include/asm/asm-uaccess.h
>> +++ b/arch/arm64/include/asm/asm-uaccess.h
>> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>>   * unprivileged instructions, and USER() only works for single instructions.
>>   */
>>         .macro user_ldp l, reg1, reg2, addr, post_inc
>> -8888:          ldtr    \reg1, [\addr];
>> -8889:          ldtr    \reg2, [\addr, #8];
>> -               add     \addr, \addr, \post_inc;
>> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
>> +8889:          nop;
>> +               nop;
>
> As Catalin noted, we can't make that change generally as it'd be broken for any
> system with PAN, and in general we *really* want to use LDTR/STTR for user
> accesses to catch any misuse with kernel pointers.

I was afraid that would be the case. It puts us in a bit of a tough spot, but
at least we know we should be looking for workarounds instead of a fix. 

>> Profiling with the hacked __arch_copy_from_user 
>> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>> 
>>  Performance counter stats for '/mnt/usrroot/test_copy':
>> 
>>           11822342      instructions              #    0.23  insn per cycle         
>>           50689594      cycles                                                      
>>           37627922      ld_dep_stall                                                
>>              17933      read_alloc                                                  
>>               3421      dTLB-load-misses                                            
>> 
>>        0.043440253 seconds time elapsed
>> 
>>        0.004382000 seconds user
>>        0.039442000 seconds sys
>> 
>> Unfortunately the hack crashes in other cases so it is not a viable solution
>> for us. Also, on our actual workload there is still a small difference in
>> performance remaining that I have not tracked down yet (I am guessing it has
>> to do with the dTLB-load-misses remaining higher). 
>> 
>> Note, I think that the slow down is only noticeable in cases like ours where
>> the data being copied from is not in cache (for us, because the FPGA writes
>> it).
>
> When you say "is not in cache", what exactly do you mean? If this were just the
> latency of filling a cache I wouldn't expect the size of the first access to
> make a difference, so I'm assuming the source buffer is not mapped with
> cacheable memory attributes, which we generally assume.
>
> Which memory attribues are the source and destination buffers mapped with? Is
> that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
>
> I'm assuming this is with some out-of-tree driver; if that's in a public tree
> could you please provide a pointer to it?
>
> Thanks,
> Mark.

I am actually not 100% clear on how the memory gets mapped. Currently we call 
ioremap in our driver, so I think that should map it as iomem. When I removed 
that or used /dev/mem, nothing changed, and looking at things now I think that 
is because the origional mapping is from drivers/of/of_reserved_mem.c

IIRC I mostly followed this wiki when setting things up
https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory

I think the relevant parts are from the dts (note we do it 2x, because we have some
 usages that also need to be accessed by other CPUs on the SoC which have adress 
 space restrictions)
 
reserved-memory {
#address-cells = <2>;
#size-cells = <2>;
ranges;

iq_capture: fpga_mem@1 {
    compatible = "shared-dma-pool";
    no-map;            
    reg = <0x0 0x70000000 0x0 0x10000000>;
};
big_iq_capture: fpga_mem@2 {
    compatible = "shared-dma-pool";
    no-map;
    reg = <0x8 0x0 0x0 0x80000000>;
};
};


anritsu-databuffer@0 {
 compatible = "anritsu,databuffer";
 memory-region = <&iq_capture>;
 device-name = "databuffer-device";
};
anritsu-databuffer@1 {
compatible = "anritsu,databuffer";
memory-region = <&big_iq_capture>;
device-name = "capturebuffer-device";
};

The databuffer driver is something we made and generally build out of tree,
but I put it in tree on our github if you want to look at it. I have not actually
tried to build it in-tree yet, so I could have made some mistakes with the Makefile
or something. Here is a link to where the ioremap is. 

https://github.com/Anritsu/linux-xlnx/blob/intree_databuffer_driver/drivers/char/databuffer_driver.c#L242

Despite doing my best to read the documentation, I was never really sure if I got the 
memory mapping right for our use case. 


If you are interested in context, the use case is in spectrum analyzers.
https://www.anritsu.com/en-us/test-measurement/products/ms2090a
The feature is IQ capture, which if you are not familiar with Spectrum Analyzers, 
is basically trying to take the data from an a high speed ADC and store it as fast 
as possible. Since the FPGA is writing the data is clocked to the ADC, the rates 
we can stream out without losing any data depend on how fast we can copy the 
data from memory to either the network or a file, which is why this performance
is important to us. I think we should probably be using scatter/gather for this,
but I could not convince the FPGA engineers to implement it (and it sounded hard
so I did not try very hard to convince them).  

Thanks for the help,
Austin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-29 19:33     ` Havens, Austin
@ 2023-06-30 11:15       ` Mark Rutland
  2023-07-06 19:12         ` Havens, Austin
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Rutland @ 2023-06-30 11:15 UTC (permalink / raw)
  To: Havens, Austin
  Cc: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com,
	Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org, Robin Murphy

On Thu, Jun 29, 2023 at 07:33:39PM +0000, Havens, Austin wrote:
> Hi Mark,
> Thanks for the reply.

No problem; thanks for the info here!

>  On Thursday, June 29, 2023 7:74 AM Mark Rutland wrote:
> > On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:

> >> Profiling with the hacked __arch_copy_from_user 
> >> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >> 
> >>  Performance counter stats for '/mnt/usrroot/test_copy':
> >> 
> >>           11822342      instructions              #    0.23  insn per cycle         
> >>           50689594      cycles                                                      
> >>           37627922      ld_dep_stall                                                
> >>              17933      read_alloc                                                  
> >>               3421      dTLB-load-misses                                            
> >> 
> >>        0.043440253 seconds time elapsed
> >> 
> >>        0.004382000 seconds user
> >>        0.039442000 seconds sys
> >> 
> >> Unfortunately the hack crashes in other cases so it is not a viable solution
> >> for us. Also, on our actual workload there is still a small difference in
> >> performance remaining that I have not tracked down yet (I am guessing it has
> >> to do with the dTLB-load-misses remaining higher). 
> >> 
> >> Note, I think that the slow down is only noticeable in cases like ours where
> >> the data being copied from is not in cache (for us, because the FPGA writes
> >> it).
> >
> > When you say "is not in cache", what exactly do you mean? If this were just the
> > latency of filling a cache I wouldn't expect the size of the first access to
> > make a difference, so I'm assuming the source buffer is not mapped with
> > cacheable memory attributes, which we generally assume.
> >
> > Which memory attribues are the source and destination buffers mapped with? Is
> > that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
> >
> > I'm assuming this is with some out-of-tree driver; if that's in a public tree
> > could you please provide a pointer to it?
> >
> > Thanks,
> > Mark.
> 
> I am actually not 100% clear on how the memory gets mapped. Currently we call 
> ioremap in our driver, so I think that should map it as iomem. When I removed 
> that or used /dev/mem, nothing changed, and looking at things now I think that 
> is because the origional mapping is from drivers/of/of_reserved_mem.c

A plain ioremap() will give you Device memory attributes, which
copy_{to,from}_user() aren't suppposed to be used with, and also forbids the
CPU from doing a bunch of things (e.g. gathering and prefetching) which makes
this slow.

If it's possible to use Normal Non-Cacheable instead, (e.g. by using
ioremap_wc()), that will likely be faster since it permits gathering and
prefetching, etc. Even that's a bit weird, and I'd generally expect to have a
kernel driver to manage non-coherent DMA like this (rather than userspace
having the buffer nad pointing the kernel at it).

Robin might have thoughts on how to handle the non-coherent DMA.

Thanks,
Mark.

> IIRC I mostly followed this wiki when setting things up
> https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory
> 
> I think the relevant parts are from the dts (note we do it 2x, because we have some
>  usages that also need to be accessed by other CPUs on the SoC which have adress 
>  space restrictions)
>  
> reserved-memory {
> #address-cells = <2>;
> #size-cells = <2>;
> ranges;
> 
> iq_capture: fpga_mem@1 {
>     compatible = "shared-dma-pool";
>     no-map;            
>     reg = <0x0 0x70000000 0x0 0x10000000>;
> };
> big_iq_capture: fpga_mem@2 {
>     compatible = "shared-dma-pool";
>     no-map;
>     reg = <0x8 0x0 0x0 0x80000000>;
> };
> };
> 
> 
> anritsu-databuffer@0 {
>  compatible = "anritsu,databuffer";
>  memory-region = <&iq_capture>;
>  device-name = "databuffer-device";
> };
> anritsu-databuffer@1 {
> compatible = "anritsu,databuffer";
> memory-region = <&big_iq_capture>;
> device-name = "capturebuffer-device";
> };
> 
> The databuffer driver is something we made and generally build out of tree,
> but I put it in tree on our github if you want to look at it. I have not actually
> tried to build it in-tree yet, so I could have made some mistakes with the Makefile
> or something. Here is a link to where the ioremap is. 
> 
> https://github.com/Anritsu/linux-xlnx/blob/intree_databuffer_driver/drivers/char/databuffer_driver.c#L242
> 
> Despite doing my best to read the documentation, I was never really sure if I got the 
> memory mapping right for our use case. 
> 
> 
> If you are interested in context, the use case is in spectrum analyzers.
> https://www.anritsu.com/en-us/test-measurement/products/ms2090a
> The feature is IQ capture, which if you are not familiar with Spectrum Analyzers, 
> is basically trying to take the data from an a high speed ADC and store it as fast 
> as possible. Since the FPGA is writing the data is clocked to the ADC, the rates 
> we can stream out without losing any data depend on how fast we can copy the 
> data from memory to either the network or a file, which is why this performance
> is important to us. I think we should probably be using scatter/gather for this,
> but I could not convince the FPGA engineers to implement it (and it sounded hard
> so I did not try very hard to convince them).  
> 
> Thanks for the help,
> Austin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Slowdown copying data between kernel versions 4.19 and 5.15
  2023-06-30 11:15       ` Mark Rutland
@ 2023-07-06 19:12         ` Havens, Austin
  0 siblings, 0 replies; 9+ messages in thread
From: Havens, Austin @ 2023-07-06 19:12 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com,
	Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org, Robin Murphy

On Friday Jun 30, 2023 at 4:15 AM PDT Mark Rutland wrote:
>On Thu, Jun 29, 2023 at 07:33:39PM +0000, Havens, Austin wrote:
>> Hi Mark,
>> Thanks for the reply.
>
>No problem; thanks for the info here!
>
>>  On Thursday, June 29, 2023 7:74 AM Mark Rutland wrote:
>> > On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
>
>> >> Profiling with the hacked __arch_copy_from_user 
>> >> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>> >> 
>> >>  Performance counter stats for '/mnt/usrroot/test_copy':
>> >> 
>> >>           11822342      instructions              #    0.23  insn per cycle         
>> >>           50689594      cycles                                                      
>> >>           37627922      ld_dep_stall                                                
>> >>              17933      read_alloc                                                  
>> >>               3421      dTLB-load-misses                                            
>> >> 
>> >>        0.043440253 seconds time elapsed
>> >> 
>> >>        0.004382000 seconds user
>> >>        0.039442000 seconds sys
>> >> 
>> >> Unfortunately the hack crashes in other cases so it is not a viable solution
>> >> for us. Also, on our actual workload there is still a small difference in
>> >> performance remaining that I have not tracked down yet (I am guessing it has
>> >> to do with the dTLB-load-misses remaining higher). 
>> >> 
>> >> Note, I think that the slow down is only noticeable in cases like ours where
>> >> the data being copied from is not in cache (for us, because the FPGA writes
>> >> it).
>> >
>> > When you say "is not in cache", what exactly do you mean? If this were just the
>> > latency of filling a cache I wouldn't expect the size of the first access to
>> > make a difference, so I'm assuming the source buffer is not mapped with
>> > cacheable memory attributes, which we generally assume.
>> >
>> > Which memory attribues are the source and destination buffers mapped with? Is
>> > that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
>> >
>> > I'm assuming this is with some out-of-tree driver; if that's in a public tree
>> > could you please provide a pointer to it?
>> >
>> > Thanks,
>> > Mark.
>> 
>> I am actually not 100% clear on how the memory gets mapped. Currently we call 
>> ioremap in our driver, so I think that should map it as iomem. When I removed 
>> that or used /dev/mem, nothing changed, and looking at things now I think that 
>> is because the origional mapping is from drivers/of/of_reserved_mem.c
>
>A plain ioremap() will give you Device memory attributes, which
>copy_{to,from}_user() aren't suppposed to be used with, and also forbids the
>CPU from doing a bunch of things (e.g. gathering and prefetching) which makes
>this slow.
>
>If it's possible to use Normal Non-Cacheable instead, (e.g. by using
>ioremap_wc()), that will likely be faster since it permits gathering and
>prefetching, etc.

I tried using Normal Non-Cacheable memory instead of iomem with
memremap(r.start,lp->size, MEMREMAP_WC);
and
vma->vm_page_prot =pgprot_writecombine(vma->vm_page_prot);
but it did not help.

I talked to our FPGA engineers and they said it would be possible to make it coherent, 
I think this would take advantage of the CCI400 on the SoC. Before haveing them implement
it I tried using the dma interfaces to see if there was a performance improvement, and 
there was not. 

For  the allocation I used 

dma_mask = dma_get_required_mask(dev);
dma_set_coherent_mask(dev, dma_mask);


lp->databuffer_vaddr = dmam_alloc_coherent(dev, lp->size, &lp->databuffer_paddr, GFP_KERNEL);

and for the mmap I used.
dma_mmap_coherent(lp->dev, vma,lp->databuffer_vaddr, lp->databuffer_paddr, size);

The performance in both cases was virtually identical to the iomem case. 

> Even that's a bit weird, and I'd generally expect to have a
>kernel driver to manage non-coherent DMA like this (rather than userspace
>having the buffer nad pointing the kernel at it).

I am not sure I follow exactly what you mean by this, but I am assuming it means not using mmap.
We have a few different use cases for the "IQ Capture", which generally fall into "block capture" 
and "streaming". For the block captures we can capture up to  2GiB at a very high rate (12.8 Gb/s).
Since the rates are so high, SW can't keep up with data rate so we have to have a huge buffer. And
since the buffer is so big we can't keep copies of it, so we jump through a bunch of hoops for "zero copy".

For the "streaming" use case we want to continuously write the data somewhere, either the network, or 
a file (E.G. on an external USB SSD). In this case, if the sample rate is higher than we can copy out 
we have to start dropping data. We specify the maximum sample rate we can handle without dropping data
which is why the copy rate is critical to us. For this use case I can see how the non-coherent dma interfaces
would be useful for this, but I did not want to have 2 different ways of using the memory, because I was not 
confident I could get it right. I was also unclear on how the full data path to files or network would look. 
Also, all the controls of the FPGA and RF hardware are done from a separate processor on the SoC (ARM r5) 
running bare metal code. Having a static mmap greatly simplifies our design. 

I also experimented with sendfile and copy_file_range system calls to avoid the user space copies altogether,
 but could not get anything working since they seem to only work with regular files and not device drivers. 
 I am not sure if this is a dead end or worth looking into further. 

>
>Robin might have thoughts on how to handle the non-coherent DMA.
>
>Thanks,
>Mark.
>
>> IIRC I mostly followed this wiki when setting things up
>> https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory
>> 
>> I think the relevant parts are from the dts (note we do it 2x, because we have some
>>  usages that also need to be accessed by other CPUs on the SoC which have adress 
>>  space restrictions)
>>  
>> reserved-memory {
>> #address-cells = <2>;
>> #size-cells = <2>;
>> ranges;
>> 
>> iq_capture: fpga_mem@1 {
>>     compatible = "shared-dma-pool";
>>     no-map;            
>>     reg = <0x0 0x70000000 0x0 0x10000000>;
>> };
>> big_iq_capture: fpga_mem@2 {
>>     compatible = "shared-dma-pool";
>>     no-map;
>>     reg = <0x8 0x0 0x0 0x80000000>;
>> };
>> };
>> 
>> 
>> anritsu-databuffer@0 {
>>  compatible = "anritsu,databuffer";
>>  memory-region = <&iq_capture>;
>>  device-name = "databuffer-device";
>> };
>> anritsu-databuffer@1 {
>> compatible = "anritsu,databuffer";
>> memory-region = <&big_iq_capture>;
>> device-name = "capturebuffer-device";
>> };
>> 
>> The databuffer driver is something we made and generally build out of tree,
>> but I put it in tree on our github if you want to look at it. I have not actually
>> tried to build it in-tree yet, so I could have made some mistakes with the Makefile
>> or something. Here is a link to where the ioremap is. 
>> 
>> https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2fgithub.com%2fAnritsu%2flinux%2dxlnx%2fblob%2fintree%5fdatabuffer%5fdriver%2fdrivers%2fchar%2fdatabuffer%5fdriver.c%23L242&umid=64811442-1737-11ee-8159-002248ea2eea&auth=1f42b9f4699cb4f8c50b50d41cd8c1d4cdca7970-201463e497370af4283fcf0ebc8377b03f82d2c9
>> 
>> Despite doing my best to read the documentation, I was never really sure if I got the 
>> memory mapping right for our use case. 
>> 
>> 
>> If you are interested in context, the use case is in spectrum analyzers.
>> https://www.anritsu.com/en-us/test-measurement/products/ms2090a
>> The feature is IQ capture, which if you are not familiar with Spectrum Analyzers, 
>> is basically trying to take the data from an a high speed ADC and store it as fast 
>> as possible. Since the FPGA is writing the data is clocked to the ADC, the rates 
>> we can stream out without losing any data depend on how fast we can copy the 
>> data from memory to either the network or a file, which is why this performance
>> is important to us. I think we should probably be using scatter/gather for this,
>> but I could not convince the FPGA engineers to implement it (and it sounded hard
>> so I did not try very hard to convince them).  
>> 
>> Thanks for the help,
>> Austin

Thanks again for the help,
Austin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-07-06 19:12 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-23 21:30 Slowdown copying data between kernel versions 4.19 and 5.15 Havens, Austin
2023-06-28 21:38 ` Havens, Austin
2023-06-29 13:09   ` Robin Murphy
2023-06-29 13:36   ` Catalin Marinas
2023-06-29 14:24   ` Mark Rutland
2023-06-29 18:33     ` Robin Murphy
2023-06-29 19:33     ` Havens, Austin
2023-06-30 11:15       ` Mark Rutland
2023-07-06 19:12         ` Havens, Austin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox