Slowdown copying data between kernel versions 4.19 and 5.15

From: Havens, Austin @ 2023-06-23 21:30 UTC
  To: catalin.marinas@arm.com, will@kernel.org, michal.simek@amd.com
  Cc: Suresh, Siddarth, Lui, Vincent,
	linux-arm-kernel@lists.infradead.org

Hi all,
While updating our kernel from 4.19 to 5.15, we noticed a slowdown when copying data. We are using ZynqMP 9EG SoCs and basically following the Xilinx/AMD release branches (though a bit behind). Sample-based profiling with perf showed a lot of the time going to __arch_copy_from_user. Since the amount of data being copied is the same on both kernels, it looks like each __arch_copy_from_user call is taking longer.

I wrote a test program to reproduce the issue, and here is what I see (I used the same binary on both kernels to rule out compiler differences).

root@smudge:/tmp# uname -a
Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy

 Performance counter stats for '/mnt/usrroot/test_copy':

          13202623      instructions              #    0.25  insn per cycle         
          52947780      cycles                                                      
          37588761      ld_dep_stall                                                
             16301      read_alloc                                                  
              1660      dTLB-load-misses                                            

       0.044990363 seconds time elapsed

       0.004092000 seconds user
       0.040920000 seconds sys

root@ahraptor:/tmp# uname -a
Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy

 Performance counter stats for '/mnt/usrroot/test_copy':

          11625888      instructions              #    0.14  insn per cycle         
          83135040      cycles                                                      
          69833562      ld_dep_stall                                                
             27948      read_alloc                                                  
              3367      dTLB-load-misses                                            

       0.070537894 seconds time elapsed

       0.004165000 seconds user
       0.066643000 seconds sys
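
For reference, the two runs above can be put side by side with a few lines of arithmetic. The sketch below is not part of the original test program; the struct and function names are made up for illustration, and it assumes that ld_dep_stall counts stalled cycles and that the whole elapsed time can be attributed to the 4096 * 1000 byte copy (it also includes process startup and the mmap). Under those assumptions the numbers work out to roughly 1.57x the cycles on 5.15, about 71% versus 84% of cycles stalled on loads, and about 91 MB/s versus 58 MB/s.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Counter values copied from the two perf stat runs above.
struct PerfRun {
    const char* kernel;
    std::uint64_t instructions;
    std::uint64_t cycles;
    std::uint64_t ldDepStall;
    double elapsedSeconds;
};

constexpr PerfRun kRun419{"4.19", 13202623, 52947780, 37588761, 0.044990363};
constexpr PerfRun kRun515{"5.15", 11625888, 83135040, 69833562, 0.070537894};

// Bytes written by the test program (copySize = 4096 * 1000).
constexpr double kCopyBytes = 4096.0 * 1000.0;

// Instructions retired per cycle.
double ipc(const PerfRun& r)
{
    assert(r.cycles > 0);
    return static_cast<double>(r.instructions) / static_cast<double>(r.cycles);
}

// Share of cycles attributed to load-dependency stalls
// (assumption: ld_dep_stall is counted in cycles).
double stallFraction(const PerfRun& r)
{
    assert(r.cycles > 0);
    return static_cast<double>(r.ldDepStall) / static_cast<double>(r.cycles);
}

// Rough end-to-end throughput in MB/s (1 MB = 10^6 bytes), using wall-clock time.
double throughputMBps(const PerfRun& r)
{
    assert(r.elapsedSeconds > 0.0);
    return kCopyBytes / r.elapsedSeconds / 1e6;
}

// How many times more cycles the second run needed than the first.
double cycleRatio(const PerfRun& before, const PerfRun& after)
{
    assert(before.cycles > 0);
    return static_cast<double>(after.cycles) / static_cast<double>(before.cycles);
}

// Print one summary line for a run.
void printSummary(const PerfRun& r)
{
    std::printf("%s: IPC %.3f, load stalls %.1f%% of cycles, ~%.1f MB/s\n",
                r.kernel, ipc(r), 100.0 * stallFraction(r), throughputMBps(r));
}
```

Calling printSummary(kRun419) and printSummary(kRun515) from a main() prints one line per kernel, and cycleRatio(kRun419, kRun515) gives the cycle increase. The instruction count is slightly lower on 5.15, so the extra time appears to come from stalls rather than from executing more instructions.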

After some investigation, my guess is that the issue is either in the iov_iter iteration changes (around https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or in the lower-level changes in arch/arm64/lib/copy_from_user.S, but I am pretty far out of my depth, so this is just speculation.

Here is the C++ code for the test program (compiled with g++ -O3). Note that in our products the FPGA writes to a dedicated memory carveout, which is what the /dev/mem mmap points at; to run it elsewhere you would have to change that mapping to some other address.

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <fcntl.h>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

using namespace std;

struct CaptureChunk
{
    uint32_t partitionOffset, captureSizeInBytes;
};

static constexpr uint32_t MAX_WRITE_BYTES = (4096 * 256); // 1MiB
constexpr size_t copySize = 4096*1000;
constexpr size_t databufferSize =  copySize;

void updateTotalBytes(uint32_t bytesWritten)
{
    // Empty stub in this test program.
}

void writeChunkToFile(const char* data, const CaptureChunk& chunk, const std::string& filePath)
{
    std::ofstream file;
    // The default buffer size seems to be very large and has to be written on close.
    // This would cause aborting the save to take several minutes (RAP-6926).
    // I don't think we want it completely unbuffered either so we have to choose
    // a size. I think the page size is probably a pretty good bet for a good buffer
    // size. We could get it with sysconf(_SC_PAGESIZE); but I am just going to
    // use 4096 directly since that is almost always what it is, and that way we
    // won't have to change it on Windows which does not have sysconf.
    long sz = 4096;
    std::vector<char> buffer;
    buffer.resize(sz);
    file.rdbuf()->pubsetbuf(buffer.data(), sz);
    file.open(filePath.c_str(), std::ofstream::out | std::ofstream::binary);
    uint32_t readOffset = chunk.partitionOffset;
    uint32_t bytesRemainingToBeWritten = chunk.captureSizeInBytes;

    while(bytesRemainingToBeWritten > 0)
    {
        uint32_t bytesToWrite = std::min(MAX_WRITE_BYTES, bytesRemainingToBeWritten);
        if (readOffset + bytesToWrite > databufferSize)
        {
            bytesToWrite = databufferSize - readOffset;
        }
        file.write(data + readOffset, bytesToWrite);
        if (file.fail())
        {
            cout << "failed to write " << filePath << '\n';
            break;
        }
        updateTotalBytes(bytesToWrite);
        bytesRemainingToBeWritten -= bytesToWrite;
        readOffset += bytesToWrite;
        if (readOffset == databufferSize)
        {
            readOffset = 0; //wrap around
        }
    }
    file.close();

}

void* getBuffer(int32_t& device)
{
    size_t size = databufferSize;
    device = ::open("/dev/mem", O_RDWR | O_SYNC);

    if (device < 0)
    {
        // Nothing to map without a valid descriptor.
        cout << "could not open /dev/mem\n";
        return nullptr;
    }
    // The databuffer driver should take care of getting the physical address
    void* buffer = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, device, 0x800000000);

    if (buffer == MAP_FAILED)
    {
        ::close(device);
        device = -1;
        buffer = nullptr;
        cout << "could not mmap /dev/mem\n";
    }
    return buffer;
}

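int main()
{
    std::string fileName = "test_file.bin";
    CaptureChunk testChunk {.partitionOffset=0, .captureSizeInBytes=copySize};
    int32_t device;
    char* buffer = (char*)getBuffer(device);
    if (buffer == nullptr)
    {
        // getBuffer() already reported the failure.
        return 1;
    }
    #ifdef copy_buffer
    // Optional variant: copy into a heap buffer first and write the file from that.
    char* copyBuffer = (char*)std::malloc(copySize);
    if (copyBuffer == nullptr)
    {
        ::close(device);
        return 1;
    }
    std::memcpy(copyBuffer, buffer, copySize);
    writeChunkToFile(copyBuffer, testChunk, fileName);
    std::free(copyBuffer);
    #else
    writeChunkToFile(buffer, testChunk, fileName);
    #endif

    ::close(device);

    return 0;
}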
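
One possible way to narrow this down further would be to take std::ofstream out of the picture and time plain write(2) calls directly. The following is only a sketch of that idea, not something that was run for the numbers above; timeChunkedWrite and timeWriteToPath are hypothetical helper names. Comparing a small chunk size against a large one, and the /dev/mem mapping against an ordinary heap buffer, might show whether the extra time scales with the number of calls or with the number of bytes copied.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstddef>

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: write `total` bytes from `src` to `fd` in pieces of at
// most `chunk` bytes. Returns the elapsed seconds, or -1.0 on error.
double timeChunkedWrite(int fd, const char* src, std::size_t total, std::size_t chunk)
{
    assert(src != nullptr);
    if (chunk == 0)
    {
        return -1.0;
    }
    const auto start = std::chrono::steady_clock::now();
    std::size_t done = 0;
    while (done < total)
    {
        const std::size_t want = std::min(chunk, total - done);
        const ssize_t n = ::write(fd, src + done, want);
        if (n < 0)
        {
            if (errno == EINTR)
            {
                continue; // interrupted before writing anything, retry
            }
            return -1.0;
        }
        if (n == 0)
        {
            return -1.0; // no progress, give up instead of spinning
        }
        done += static_cast<std::size_t>(n); // short writes are handled by looping
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical helper: open (or create and truncate) `path`, time the chunked
// write, and close the file again. Only the write loop itself is timed.
double timeWriteToPath(const char* path, const char* src, std::size_t total, std::size_t chunk)
{
    const int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
    {
        return -1.0;
    }
    const double seconds = timeChunkedWrite(fd, src, total, chunk);
    ::close(fd);
    return seconds;
}
```

In the test program this could be called as, for example, timeWriteToPath("test_file.bin", buffer, copySize, 4096) and again with MAX_WRITE_BYTES as the chunk size, once with the mmap'd buffer and once with the copyBuffer variant.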

Any help will be greatly appreciated.
-Austin

