* Question: Modifying kernel to handle all I/O requests without page cache
From: Jianshen Liu @ 2019-09-25 22:51 UTC
To: linux-fsdevel, linux-xfs

Hi,

I am working on a project that evaluates the performance of a workload running on a storage device. I don't want the benchmark result to depend on a specific platform (e.g., a platform with X GiB of physical memory), because that prevents people from reproducing the same result on a different platform configuration. Consider benchmarking a read-heavy workload: with data caching enabled, you may end up testing only the performance of the system memory.

Currently, I'm thinking about how to eliminate the caching effects created by the page cache. Direct I/O is a good option for testing a single application, but it is not good for testing unknown applications/workloads. It is not feasible to ask people to modify the application source code before running the benchmark. Making changes within the kernel may be the only option, because it is transparent to all user-space applications.

The problem is that I don't know how to modify the kernel so that it does not use the page cache for any I/O to a specific storage device. I have tried appending a fadvise64() call with POSIX_FADV_DONTNEED to the end of each read/write system call. The performance of this approach is far from that of Direct I/O. It is also unable to eliminate the caching effects under concurrent I/O. I'm looking for any advice that points me to an efficient way to remove the effects of the page cache.

Thanks,
Jianshen
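For reference, the fadvise approach described above can be approximated from user space. The following is a minimal Python sketch, not the kernel change itself: `read_and_drop` is a hypothetical helper that reads a file in chunks and, after every read, hints the kernel to drop the file's cached pages through `os.posix_fadvise`. The advice is only a hint, and pages that are still dirty are not freed by it, so a write path would typically need an `fsync` first.

```python
import os


def read_and_drop(path, chunk_size=1 << 20):
    """Read `path` sequentially and hint the kernel to drop the file's
    cached pages after each read. Returns the number of bytes read.

    Hypothetical helper for illustration; POSIX_FADV_DONTNEED is advisory.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            total += len(data)
            # offset=0, len=0 covers the whole file
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

Because the pages are populated on every read and only dropped afterwards, the data still passes through the page cache, and a second reader can find it there before the advice is issued. That fits the concurrency problem reported above.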
* Re: Question: Modifying kernel to handle all I/O requests without page cache 2019-09-25 22:51 Question: Modifying kernel to handle all I/O requests without page cache Jianshen Liu @ 2019-09-26 12:39 ` Carlos Maiolino 2019-09-27 1:42 ` Jianshen Liu 0 siblings, 1 reply; 5+ messages in thread From: Carlos Maiolino @ 2019-09-26 12:39 UTC (permalink / raw) To: Jianshen Liu; +Cc: linux-xfs Hi. I am removing linux-fsdevel from the CC, because it's not really the quorum for it. On Wed, Sep 25, 2019 at 03:51:27PM -0700, Jianshen Liu wrote: > Hi, > > I am working on a project trying to evaluate the performance of a > workload running on a storage device. I don't want the benchmark > result depends on a specific platform (e.g., a platform with X GiB of > physical memory). Well, this does not sound realistic to me. Memory is just only one of the variables in how IO throughput will perform. You bypass memory, then, what about IO Controller, Disks, Storage cache, etc etc etc? All of these are 'platform specific'. > Because it prevents people from reproducing the same > result on a different platform configuration. Every platform will behave different, memory is only one factor. > Think about you are > benchmarking a read-heavy workload, with data caching enabled you may > end up with just testing the performance of the system memory. > Not really true. It all depends on what kind of workload you are talking about. And what you are trying to measure. A read-heavy workload, may well use a lot of page cache, but it all depends on the IO patterns, and exactly what statistics you care about. Are you trying to measure how well a storage solution will perform on a random workload? On a sequential workload? Are you trying to measure how well an application will perform? If that's the case, removing the page cache from the equation really matters? I.e. will it give you realistic results? 
If you are benchmarking systems with 'random' kinds of workloads, there are several tools around there which can help, and you can configure to use DIO. > Currently, I'm thinking how to eliminate the cache effects created by > the page cache. Direct I/O is a good option for testing with a single > application but is not good for testing with unknown > applications/workloads. You can use many tools for that purpose, which can 'emulate' different workloads, without needing to modify a specific application. But if you are trying to create benchmarks for a specific application, if your benchmarks uses DIO or not, will depend on if the application uses DIO or not. > Therefore, it is not feasible to ask people to > modify the application source code before running the benchmark. Well, IMHO, your approach is wrong. First, if you are benchmarking how an application will perform, you need to use the same IO patterns the application is using, i.e. you won't need to modify it. If it does not use direct IO, benchmarking a system using direct IO will bring you something very wrong data. And the opposite is true, if the application uses direct IO, you don't want to benchmark a system by using the page cache, because one of the things you really want to measure is how well the application's cache is performing. Also, direct IO is also not a good option to use when you 'don't know how to issue I/O requests'. All I/O requests submitted using direct IO must be aligned. So, if the application does not issue aligned requests, the IO requests will fail. I remember some filesystems to had an option to 'open all files with O_DIRECT by default', and many problems being created because IO requests to such files were not all sector aligned. > > Making changes within the kernel may only be the option because it is > transparent to all user-space applications. I will hit the same point again :) and my question is: Why? :) Will you be using a custom kernel? With this modification? 
If not, you will not be gathering trustable data anyway. > The problem is I don't > know how to modify the kernel so that it does not use the page cache > for any IOs to a specific storage device. I have tried to append a > fadvise64() call with POSIX_FADV_DONTNEED to the end of each > read/write system calls. The performance of this approach is far from > using Direct I/O. It is also unable to eliminate the caching effects > under concurrent I/Os. I'm looking for any advice here to point me an > efficient way to remove the cache effects from the page cache. > > Thanks, > Jianshen Benchmarking systems is an 'art', and I am certainly not an expert on it, but at first, it looks like you are trying to create a 'generic benchmark' to some generic random system. And I will tell you, this is not going to work well. We have tons of cases and stories about people running benchmark X on system Z, and it performing 'well', but when running their real workload, everything starts to perform poorly, exactly because they did not use the correct benchmark at first. You have several layers in a storage stack, which starts from how the application handles its own IO requests. And each layer which will behave differently on each type of workload. Apologies to be repeating myself: If you are trying to measure only a storage solution, there are several tools around which can create different kinds of workload. If you are trying to measure an application performance on solution X, well, it is pointless to measure direct IO if the application does not use it or vice-versa, so, modifying an application, again, is not what you will want to do for benchmarking, for sure. Hope to have helped (and not created more questions :) Cheers -- Carlos ^ permalink raw reply [flat|nested] 5+ messages in thread
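The alignment requirement mentioned in this reply can be made concrete with a small check. This is a Python sketch under the assumption that the file offset, the request length, and the user buffer address must all be multiples of one block size. The exact granularity depends on the filesystem and device, and 512 is only a placeholder. `is_dio_aligned` is a hypothetical name, not a kernel or library function.

```python
def is_dio_aligned(offset, length, buf_addr, block_size=512):
    """Return True if offset, length and buffer address are all multiples
    of block_size.

    block_size=512 is an assumed placeholder; real requirements vary.
    """
    return all(v % block_size == 0 for v in (offset, length, buf_addr))
```

An application that reads 100 bytes at offset 10 would fail this check, which is why blindly forcing O_DIRECT onto an unmodified application tends to produce I/O errors.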
* Re: Question: Modifying kernel to handle all I/O requests without page cache 2019-09-26 12:39 ` Carlos Maiolino @ 2019-09-27 1:42 ` Jianshen Liu 2019-09-27 10:39 ` Carlos Maiolino 2019-09-27 22:17 ` Dave Chinner 0 siblings, 2 replies; 5+ messages in thread From: Jianshen Liu @ 2019-09-27 1:42 UTC (permalink / raw) To: Carlos Maiolino; +Cc: linux-xfs Hi Carlos, Thanks for your reply. On Thu, Sep 26, 2019 at 5:39 AM Carlos Maiolino <cmaiolino@redhat.com> wrote: > > On Wed, Sep 25, 2019 at 03:51:27PM -0700, Jianshen Liu wrote: > > Hi, > > > > I am working on a project trying to evaluate the performance of a > > workload running on a storage device. I don't want the benchmark > > result depends on a specific platform (e.g., a platform with X GiB of > > physical memory). > > Well, this does not sound realistic to me. Memory is just only one of the > variables in how IO throughput will perform. You bypass memory, then, what about > IO Controller, Disks, Storage cache, etc etc etc? All of these are 'platform > specific'. I apologize for any confusion because of my oversimplified project description. My final goal is to compare the efficiency of different platforms utilizing a specific storage device to run a given workload. Since the platforms can be heterogeneous (e.g., x86 vs arm), the comparison should be based on a reference unit that is relevant to the capability of the storage device but is irrelevant to a specific platform. With this reference unit, you can understand how much performance a platform can give over the capability of the specific storage device. Once you have this knowledge, you can consider whether you add/remove some CPUs, memory, the same model of storage devices, etc can improve the platform efficiency (e.g., cost/reference unit) with respect to the capability of the storage device under this workload. Moreover, you can answer questions like can you get the full unit of performance when you add one more device onto the platform. 
My question here is how to evaluate the platform-independent reference unit for the combination of a given workload and a specific storage device. Specifically, the reference unit should be a performance value of the workload under the capability of the storage device. In other words, this value should not be either enhanced or throttled by the testing platform. Yes, memory is one of the variables affecting the I/O performance, the CPU horse, network bandwidth, type of host interface, version of the software would be the other. But these are the variables I can easily control. For example, I can check whether the CPU and/or the network are the performance bottlenecks. The I/O controller, storage media, and the disk cache are encapsulated in the storage device, so these are not platform-specific variables as long as I keep using the same model of the storage device. The use of page cache, however, may enhance the performance value making the value become platform-dependent. > > Because it prevents people from reproducing the same > > result on a different platform configuration. > > Every platform will behave different, memory is only one factor. > > > Think about you are > > benchmarking a read-heavy workload, with data caching enabled you may > > end up with just testing the performance of the system memory. > > > > Not really true. It all depends on what kind of workload you are talking about. > And what you are trying to measure. > > A read-heavy workload, may well use a lot of page cache, but it all depends on > the IO patterns, and exactly what statistics you care about. Are you trying to > measure how well a storage solution will perform on a random workload? On a > sequential workload? > Are you trying to measure how well an application will perform? If that's the > case, removing the page cache from the equation really matters? I.e. will it > give you realistic results? 
> > If you are benchmarking systems with 'random' kinds of workloads, there are > several tools around there which can help, and you can configure to use DIO. > > > Currently, I'm thinking how to eliminate the cache effects created by > > the page cache. Direct I/O is a good option for testing with a single > > application but is not good for testing with unknown > > applications/workloads. > > You can use many tools for that purpose, which can 'emulate' different > workloads, without needing to modify a specific application. I don't want to emulate a workload. An emulated workload will most of the time be different from the source real-world workload. For example, replaying block I/O recording results generated by fio or blktrace will probably get different performance numbers from running the original workload. > But if you are trying to create benchmarks for a specific application, if your > benchmarks uses DIO or not, will depend on if the application uses DIO or not. This is my main question. I want running an application without involving page caching effects even when the application does not support DIO. > > Therefore, it is not feasible to ask people to > > modify the application source code before running the benchmark. > > Well, IMHO, your approach is wrong. First, if you are benchmarking how an application > will perform, you need to use the same IO patterns the application is using, > i.e. you won't need to modify it. If it does not use direct IO, benchmarking a system > using direct IO will bring you something very wrong data. And the opposite is true, > if the application uses direct IO, you don't want to benchmark a system by using > the page cache, because one of the things you really want to measure is how well the > application's cache is performing. > > Also, direct IO is also not a good option to use when you 'don't know how to > issue I/O requests'. > > All I/O requests submitted using direct IO must be aligned. 
So, if the > application does not issue aligned requests, the IO requests will fail. Yes, this is one of the difficulties in my problem. The application may not issue offset, length, buffer addressed aligned I/O. Thus, I cannot blindly convert application I/O to DIO within the kernel. > I remember some filesystems to had an option to 'open all files with O_DIRECT by > default', and many problems being created because IO requests to such files were > not all sector aligned. > > > > > Making changes within the kernel may only be the option because it is > > transparent to all user-space applications. > > I will hit the same point again :) and my question is: Why? :) Will you be using > a custom kernel? With this modification? If not, you will not be gathering > trustable data anyway. I created a loadable module to patch a vanilla kernel using the kernel livepatching mechanism. > > The problem is I don't > > know how to modify the kernel so that it does not use the page cache > > for any IOs to a specific storage device. I have tried to append a > > fadvise64() call with POSIX_FADV_DONTNEED to the end of each > > read/write system calls. The performance of this approach is far from > > using Direct I/O. It is also unable to eliminate the caching effects > > under concurrent I/Os. I'm looking for any advice here to point me an > > efficient way to remove the cache effects from the page cache. > > > > Thanks, > > Jianshen > > > Benchmarking systems is an 'art', and I am certainly not an expert on it, but at > first, it looks like you are trying to create a 'generic benchmark' to some > generic random system. And I will tell you, this is not going to work well. We > have tons of cases and stories about people running benchmark X on system Z, and > it performing 'well', but when running their real workload, everything starts to > perform poorly, exactly because they did not use the correct benchmark at first. I'm not trying to create a generic benchmark. 
I just want to create a benchmark methodology focusing on evaluating the efficiency of a platform for running a given workload on a specific storage device. > You have several layers in a storage stack, which starts from how the > application handles its own IO requests. And each layer which will behave > differently on each type of workload. My assumption is that we should run the same workload when comparing different platforms. > Apologies to be repeating myself: > > If you are trying to measure only a storage solution, there are several tools > around which can create different kinds of workload. I would like to know whether there is a tool that can create an identical workload as the source. But this still does not help to measure the reference unit that I mentioned. > If you are trying to measure an application performance on solution X, well, > it is pointless to measure direct IO if the application does not use it or > vice-versa, so, modifying an application, again, is not what you will want to do > for benchmarking, for sure. The point is that I'm not trying to measure the performance of an application on solution X. I'm trying to generate a platform-independent reference unit for the combination of a storage device and the application. I have researched different knobs provided by the kernel including drop_caches, cgroup, and vm subsystem, but none of them can help me to measure what I want. I would like to know whether there is a variable in the filesystem that defines the size of the page cache pool. Also, would it be possible to convert some of the application IOs to DIO when they are properly aligned? Are there any places in the kernel I can easily change to bypass the page cache? Thanks, Jianshen ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Question: Modifying kernel to handle all I/O requests without page cache 2019-09-27 1:42 ` Jianshen Liu @ 2019-09-27 10:39 ` Carlos Maiolino 2019-09-27 22:17 ` Dave Chinner 1 sibling, 0 replies; 5+ messages in thread From: Carlos Maiolino @ 2019-09-27 10:39 UTC (permalink / raw) To: Jianshen Liu; +Cc: linux-xfs Hi. I'm gonna move this question to the top, for a short answer: > > But if you are trying to create benchmarks for a specific application, if your > > benchmarks uses DIO or not, will depend on if the application uses DIO or not. > > This is my main question. I want running an application without > involving page caching effects even when the application does not > support DIO. You simply can't. Aligned IOs is a primitive of block devices (if I can use these words). If you don't submit aligned IOs, you can't access block devices directly. You can't modify the kernel to do that either, because that's exactly one of the goals of the buffer cache, other than improving performance of course. If you submit an unaligned IO, kernel will first read in the whole sectors from the block device, modify them accordingly to your unaligned IO and write the whole sectors back. For reads, the process is the same, kernel will read at least the whole sector, never just a part of it. Now, let me try a longer reply :P > On Thu, Sep 26, 2019 at 06:42:43PM -0700, Jianshen Liu wrote: > Hi Carlos, > > Thanks for your reply. > > On Thu, Sep 26, 2019 at 5:39 AM Carlos Maiolino <cmaiolino@redhat.com> wrote: > > > > On Wed, Sep 25, 2019 at 03:51:27PM -0700, Jianshen Liu wrote: > > > Hi, > > > > > > I am working on a project trying to evaluate the performance of a > > > workload running on a storage device. I don't want the benchmark > > > result depends on a specific platform (e.g., a platform with X GiB of > > > physical memory). > > > > Well, this does not sound realistic to me. Memory is just only one of the > > variables in how IO throughput will perform. 
You bypass memory, then, what about > > IO Controller, Disks, Storage cache, etc etc etc? All of these are 'platform > > specific'. > > I apologize for any confusion because of my oversimplified project > description. My final goal is to compare the efficiency of different > platforms utilizing a specific storage device to run a given workload. > Since the platforms can be heterogeneous (e.g., x86 vs arm), the > comparison should be based on a reference unit that is relevant to the > capability of the storage device but is irrelevant to a specific > platform. The storage vendors usually already provide you with the hardware limitations which you can use as the reference units you are looking for. Like maximum IOPS and Throughput such storage solution can support. These are all platform and application independent reference units you can use. > With this reference unit, you can understand how much > performance a platform can give over the capability of the specific > storage device. Again, you can use the numbers provided by the vendor. For example, XFS is designed to be a high-throughput filesystem, and the goal is to be as close as possible to the hardware limits, but of course, it all depends on everything else. > Once you have this knowledge, you can consider whether > you add/remove some CPUs, memory, the same model of storage devices, > etc can improve the platform efficiency (e.g., cost/reference unit) > with respect to the capability of the storage device under this > workload. Storage hardware limitations, vendor provided numbers also applies here. And you can't simply discard application's behavior here. Everything you mentioned here will be directly affected by the application you're using, so, modifying the application will give you nothing useful to work on. > Moreover, you can answer questions like can you get the full > unit of performance when you add one more device onto the platform. 
For "Full unit of performance", you can again, use vendor-provided numbers :) > My question here is how to evaluate the platform-independent reference > unit for the combination of a given workload and a specific storage > device. Use the application you are trying to evaluate, in different platforms, and measure it. > Specifically, the reference unit should be a performance value > of the workload under the capability of the storage device. In other > words, this value should not be either enhanced or throttled by the > testing platform. Yes, memory is one of the variables affecting the > I/O performance, the CPU horse, network bandwidth, type of host > interface, version of the software would be the other. But these are > the variables I can easily control. For example, I can check whether > the CPU and/or the network are the performance bottlenecks. The I/O > controller, storage media, and the disk cache are encapsulated in the > storage device, so these are not platform-specific variables as long > as I keep using the same model of the storage device. The use of page > cache, however, may enhance the performance value making the value > become platform-dependent. Again, everything you measure, will have no meaning if you don't use realistic data. You can't simply bypass the buffer cache if the application does not support it, and so, it is pointless to measure how an application will 'perform' in such scenario. > I don't want to emulate a workload. An emulated workload will most of > the time be different from the source real-world workload. For > example, replaying block I/O recording results generated by fio or > blktrace will probably get different performance numbers from running > the original workload. And I think this is the crux for your issue. You don't want an emulated workload, because it may not reproduce the real-world workload. 
Why then are you trying to find a way to bypass the page/buffer cache, on an application that will not support direct IO and won't be able to use it like that? You don't want to collect data using emulated workloads, but at the same time you want to use something that is simply totally out of the reality? Does not make any sense to me. fio can get different performance numbers? Sure, I agree, no performance measurement tool can beat the real workload of a specific application, but, what you are trying to do doesn't either, so, what's the difference? > > Benchmarking systems is an 'art', and I am certainly not an expert on it, but at > > first, it looks like you are trying to create a 'generic benchmark' to some > > generic random system. And I will tell you, this is not going to work well. We > > have tons of cases and stories about people running benchmark X on system Z, and > > it performing 'well', but when running their real workload, everything starts to > > perform poorly, exactly because they did not use the correct benchmark at first. > > I'm not trying to create a generic benchmark. I just want to create a > benchmark methodology focusing on evaluating the efficiency of a > platform for running a given workload on a specific storage device. Ok, so, you want to evaluate how platform X will behave with your application + storage. Why then you want to modify that original platform behavior? In this case, let's say, by bypassing Linux page/buffer cache. By platform you mean hardware? Well, then, use the same software stack. > > > You have several layers in a storage stack, which starts from how the > > application handles its own IO requests. And each layer which will behave > > differently on each type of workload. > > My assumption is that we should run the same workload when comparing > different platforms. Yes, and if you don't want to use emulated workloads, you should don't try to hack your software stack to behave in weird ways. 
If you want to compare platforms, ensure to use the same software stack. Including the same configuration. That's all. > > If you are trying to measure an application performance on solution X, well, > > it is pointless to measure direct IO if the application does not use it or > > vice-versa, so, modifying an application, again, is not what you will want to do > > for benchmarking, for sure. > > The point is that I'm not trying to measure the performance of an > application on solution X. I'm trying to generate a > platform-independent reference unit for the combination of a storage > device and the application. You simply can't. Get any enterprise application out there, you will see the application vendors usually certify certain combinations of hardware + software stack. There is a reason for that. There are many variables in the way, not only the page/buffer cache. You can't simply bypass the page/buffer cache, and think you'll get some realistic base reference unit you can work with. Specially if you are not sure how the application behaves. If you want to have base reference unit numbers for a storage solution, use the vendor's reference numbers. They are platform agnostic. Everything else above that will be totally interdependent. > I have researched different knobs provided by the kernel including > drop_caches, cgroup, and vm subsystem, but none of them can help me to > measure what I want. Because I honestly think what you are trying to measure is unrealistic :) > I would like to know whether there is a variable > in the filesystem that defines the size of the page cache pool. There is no such silver bullet :) > Also, > would it be possible to convert some of the application IOs to DIO > when they are properly aligned? Not that I know about, but well, I'm not really an expert in the DIO code, maybe there's a way to fall back to buffered io, although, I don't think so. > Are there any places in the kernel I > can easily change to bypass the page cache? No. 
--
Carlos
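The read-modify-write cycle described in the short answer of this message can be illustrated without touching a real device. In this Python sketch a `bytearray` stands in for the block device, `rmw_write` is a hypothetical function, and the 512-byte sector size is an assumption.

```python
SECTOR = 512  # assumed sector size for the sketch


def rmw_write(device, offset, data, sector=SECTOR):
    """Apply an unaligned write to `device` (a bytearray standing in for a
    block device) using whole-sector reads and writes only.

    Returns the (start, end) byte range of the sectors that were touched.
    """
    if not data:
        return (offset, offset)
    start = (offset // sector) * sector                  # round down
    end = -(-(offset + len(data)) // sector) * sector    # round up
    if end > len(device):
        raise ValueError("write extends past the end of the device")
    # 1. read the whole covering sectors
    window = bytearray(device[start:end])
    # 2. modify only the requested bytes in the in-memory copy
    rel = offset - start
    window[rel:rel + len(data)] = data
    # 3. write the whole sectors back
    device[start:end] = window
    return (start, end)
```

A 20-byte write at offset 500 straddles two sectors, so the sketch reads and rewrites 1024 bytes to change 20 of them. The in-memory `window` plays the role the page/buffer cache plays in the kernel.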
* Re: Question: Modifying kernel to handle all I/O requests without page cache
From: Dave Chinner @ 2019-09-27 22:17 UTC
To: Jianshen Liu; +Cc: Carlos Maiolino, linux-xfs

On Thu, Sep 26, 2019 at 06:42:43PM -0700, Jianshen Liu wrote:
> > But if you are trying to create benchmarks for a specific application, if your
> > benchmarks uses DIO or not, will depend on if the application uses DIO or not.
>
> This is my main question. I want running an application without
> involving page caching effects even when the application does not
> support DIO.

LD_PRELOAD wrapper for the open() syscall. Check that the target is a file, then add O_DIRECT to the open flags. That won't help you for mmap() access, though, which will always use the page cache, so things like executables will always use the page cache regardless of what tricks you try to play.

So, as Carlos has said, what you want to do is largely impossible to achieve.

> > All I/O requests submitted using direct IO must be aligned. So, if the
> > application does not issue aligned requests, the IO requests will fail.
>
> Yes, this is one of the difficulties in my problem. The application
> may not issue offset, length, buffer addressed aligned I/O. Thus, I
> cannot blindly convert application I/O to DIO within the kernel.

LD_PRELOAD wrapper to bounce buffer unaligned read/write() requests.

> > I will hit the same point again :) and my question is: Why? :) Will you be using
> > a custom kernel? With this modification? If not, you will not be gathering
> > trustable data anyway.
>
> I created a loadable module to patch a vanilla kernel using the kernel
> livepatching mechanism.

That's just asking for trouble. I wouldn't trust a kernel that has been modified in that way as far as I could throw it.
> > If you are trying to measure an application performance on solution X, well,
> > it is pointless to measure direct IO if the application does not use it or
> > vice-versa, so, modifying an application, again, is not what you will want to do
> > for benchmarking, for sure.
>
> The point is that I'm not trying to measure the performance of an
> application on solution X. I'm trying to generate a
> platform-independent reference unit for the combination of a storage
> device and the application.

Sounds like an exercise that has no practical use to me - the model will have to be so generic and full of compromises that it won't be relevant to real world situations....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
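The bounce-buffer idea suggested in this reply for unaligned read() requests can be sketched as follows. The suggestion itself is an LD_PRELOAD shim, which would normally be written in C and intercept libc calls. This Python version only illustrates the buffer handling and is a sketch under stated assumptions: `bounce_pread` is a hypothetical helper, `BLOCK = 4096` is an assumed alignment that happens to match the common page size, and an anonymous `mmap` is used because it yields a page-aligned buffer.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; real devices/filesystems may differ


def bounce_pread(fd, length, offset, block=BLOCK):
    """Serve an arbitrary (offset, length) read using one block-aligned
    read into a page-aligned bounce buffer, then copy out the wanted bytes.

    Returns fewer than `length` bytes if the request runs past end of file.
    """
    if length <= 0:
        return b""
    start = (offset // block) * block                 # round down
    end = -(-(offset + length) // block) * block      # round up
    buf = mmap.mmap(-1, end - start)  # anonymous mapping, page-aligned
    try:
        # aligned offset, aligned length, aligned buffer
        n = os.preadv(fd, [buf], start)
        rel = offset - start
        hi = min(rel + length, n)
        return buf[rel:hi] if hi > rel else b""
    finally:
        buf.close()
```

In the direct I/O case the descriptor would come from something like `os.open(path, os.O_RDONLY | os.O_DIRECT)`. The function itself does not care, which also makes it easy to try on a filesystem that rejects O_DIRECT. Unaligned writes would additionally need a read-modify-write step, and none of this covers mmap() access, as noted in the reply.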