* [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Artem Pisarenko @ 2018-09-06 10:24 UTC
To: qemu-devel@nongnu.org, qemu-block
Cc: Stefan Hajnoczi, Kevin Wolf, Michael S. Tsirkin, Greg Kurz

Hi all,

I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.

While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.

What is the scope of the term "(asynchronous) I/O" within qemu? Is it something related to the block device layer only, or a generic term covering the whole datapath between vCPU and backend?
If it relates to block devices only, does using VirtFS guarantee deterministic access, or does it still involve some asynchrony relative to the guest virtual clock?
Is it possible to force asynchronous I/O within qemu to be blocking by some external means (host OS configuration, hooks, etc.)? I know it may greatly slow down guest performance, but it's still better than nothing. Maybe some trivial patch can be made to the qemu code at the virtio, block backend or platform syscall level?
Maybe I/O automatically (and guaranteed) falls back to synchronous mode in some particular configurations, such as a block device whose image is located on tmpfs in RAM (either directly or via an overlay fs)? If so, that's great!
Or maybe some other solution exists?...

The main problem is to organize access from the guest Linux to some file system on the host (directory, mount point, image file... doesn't matter) in a deterministic manner.
The secondary problem is to optimize performance as much as possible by:
- avoiding unnecessary overheads (e.g. using the virtio infrastructure, preferring virtfs over a blk device, etc.);
- allowing some asynchrony within a defined quantum of time (e.g. 10ms), i.e. I/O order and speed are free to float within each quantum's borders, while the result seen by the guest at the end of the quantum is always the same.

Actually, what I'm trying to achieve is exactly what most people try to avoid, because synchronous I/O degrades performance in the vast majority of usage scenarios.

Does anyone have any thoughts on this?

Best regards,
Artem Pisarenko
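For illustration only, one concrete shape of the configuration being asked about (raw rootfs image kept on tmpfs, TCG with -icount, virtio-blk with writethrough caching and the thread-pool AIO backend) could look like the sketch below; the image path, sizes and choice of options are assumptions made up for the example, not something recommended or tested in this thread:

  # Hypothetical setup sketch (paths and sizes invented for the example).
  # Create a raw container-rootfs image directly in RAM-backed tmpfs:
  dd if=/dev/zero of=/dev/shm/container1-rootfs.img bs=1M count=512
  mkfs.ext4 -F /dev/shm/container1-rootfs.img
  # Expose it to the guest as a virtio-blk disk; guest kernel/rootfs options omitted.
  qemu-system-x86_64 \
      -machine accel=tcg -icount 1,sleep=off -rtc clock=vm -m 2048 \
      -drive file=/dev/shm/container1-rootfs.img,format=raw,if=virtio,cache=writethrough,aio=threads

Whether such a configuration actually removes the asynchrony relative to the guest virtual clock is exactly the open question of this thread.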
* Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Michael S. Tsirkin @ 2018-09-06 15:08 UTC
To: Artem Pisarenko
Cc: qemu-devel@nongnu.org, qemu-block, Stefan Hajnoczi, Kevin Wolf, Greg Kurz

On Thu, Sep 06, 2018 at 04:24:12PM +0600, Artem Pisarenko wrote:
> Hi all,
>
> I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.
>
> While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.

...

Just that you should realize that the issues are not limited to QEMU: to get real-time behaviour out of a Linux host you need a real-time kernel and real-time capable hardware/firmware. I'm not an expert on this at all, but see e.g. these old presentations: https://lwn.net/Articles/656807/

--
MST
* Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Artem Pisarenko @ 2018-09-07 8:15 UTC
To: Michael S. Tsirkin
Cc: qemu-devel@nongnu.org, qemu-block, Stefan Hajnoczi, Kevin Wolf, Greg Kurz

No. I don't need realtime behavior. Realtime implies determinism, but determinism doesn't imply realtime. Of course, I realize that other sources of non-determinism exist, but those are separate stories. Here I'm just trying to eliminate one of them - the asynchronous emulation of I/O inside qemu.

Realtime isn't the solution here.
Firstly, implementing realtime still leaves a dependency on the host machine (its performance, hardware configuration, etc.) and the number of containers running. Yes, it will be deterministic, but the results are tied to the given host and container count.
Secondly, it's simply overkill for the problem being solved. The problem area is bounded by the guest and the QEMU implementation. Using realtime requires fighting complexities on the host as well (the host kernel must be realtime, the system configuration must be tuned, all possible latencies must be carefully traced, etc.). I perfectly understand how complex it is to design a realtime system in general, and implementing it on Linux makes things even more complex.
Thirdly, it works only for KVM (and possibly other virtualization hypervisors). That's not my case, since my guest runs with TCG and -icount,sleep=off.

It seems you got me wrong. I'll try to explain the problem another way.
The guest virtual clock must run independently of the realtime (host) clock. They might be synchronized only in order to wait for some QEMU/host operation to complete, i.e. guest time is frozen by host performance bottlenecks, but this is transparent to the guest. This is how "-icount,sleep=off" works (or, at least, should work) in the time domain of CPU emulation. But I/O operations seem not to respect this "policy". When QEMU processes an I/O request from the guest, it allows virtual time to run freely until the backend completes the operation and the result is passed back to the guest. And this is what makes the guest "feel" the speed/latency of I/O. That is the core of the problem.

To explain the problem even better I've written a simple script (test_run_multiple_containers.sh), which emulates execution of multiple containers:

#!/bin/bash
N=$1
for i in $(seq 1 $N); do
    dd if=/dev/zero of=/tmp/testfile_$i bs=1K count=100000 2>&1 | sed -n 's/^.*, \(.*\)$/\1/p' &
done
wait
rm -f /tmp/testfile*

Here N is the number of containers running in parallel, and /tmp/testfile_$i is a file located in the $i-th container's rootfs (dedicated mount point, blk device or something else).
Running ./test_run_multiple_containers.sh 1 on a real machine should output a value which corresponds to the maximum write speed. Let's define it as "max_io_throughput".
Running this script on a real machine with different N values should give outputs with roughly identical values of about "max_io_throughput / N".
What I need is that running this script on the guest always gives identical and constant values, not depending on the value of N, the current host load or anything else external to the guest. (No magic. While the running emulation will cause at most "max_io_throughput" load on the host (in terms of real time), QEMU will throttle the guest virtual clock to be N times slower relative to the realtime clock.)

Also I forgot to mention that the containers' rootfs aren't required to be persistent and stay on the host during execution of the containers. They may be transferred to guest RAM before execution. They're just the source images of the rootfs.

On Thu, 6 Sep 2018 at 21:08, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Sep 06, 2018 at 04:24:12PM +0600, Artem Pisarenko wrote:
> > Hi all,
> >
> > I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.
> >
> > While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.
>
> ...
>
> Just that you should realize that the issues are not limited to QEMU: to get real-time behaviour out of a Linux host you need a real-time kernel and real-time capable hardware/firmware. I'm not an expert on this at all, but see e.g. these old presentations: https://lwn.net/Articles/656807/
>
> --
> MST

--
Best regards,
Artem Pisarenko
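To make the "no magic" remark above concrete, here is a small worked example of the intended virtual-clock throttling; the throughput and size figures are invented for illustration and are not measurements from this thread:

  #!/bin/bash
  # Hypothetical numbers, purely to illustrate the virtual-clock throttling idea.
  MAX_THROUGHPUT=500   # MB/s the host can sustain in real time ("max_io_throughput")
  SIZE=100             # MB written by each container
  for N in 1 2 5 10; do
      # real (host) time needed to complete all N writes
      real=$(awk "BEGIN { print $N * $SIZE / $MAX_THROUGHPUT }")
      # speed each dd observes today: the virtual clock keeps running while the backend works
      seen_now=$(awk "BEGIN { print $MAX_THROUGHPUT / $N }")
      # speed each dd would observe if the virtual clock were frozen during backend I/O:
      # every container sees the N=1 figure, independent of N
      echo "N=$N host_time=${real}s observed_now=${seen_now}MB/s observed_with_frozen_clock=${MAX_THROUGHPUT}MB/s"
  done

The point is only that determinism is paid for in host wall-clock time (the virtual clock has to run roughly N times slower relative to real time), not that any extra I/O capacity appears.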
* Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Artem Pisarenko @ 2018-09-10 15:06 UTC
To: qemu-devel@nongnu.org, qemu-block
Cc: Stefan Hajnoczi, Kevin Wolf, Michael S. Tsirkin, Greg Kurz, Paolo Bonzini

It looks like things are even worse. The guest demonstrates strange timings even without access to anything external to the machine. I've added Paolo Bonzini to CC, because the issue looks related to cpu/tcg/memory stuff.

I've written a simple test script running parallel 'dd' utility processes operating on files located in RAM, on a QEMU machine with multiple vCPUs. Moreover, the machine has a separate NUMA node for each vCPU.
In brief: the script accepts an argument with the desired process count; for each process it mounts a tmpfs bound to a node's memory and runs 'dd', bound to both that node's CPU and memory, which copies files located on that tmpfs.

It's expected that the overall execution time of N parallel processes (or the copying speed) should always be the same, independent of the value of N (provided, of course, that N <= nodes_count and 'dd' is single-threaded), because each process is just a simple loop of instructions loading and storing values in memory local to its CPU. No common resources should be involved - neither software (such as some target OS lock/mutex) nor hardware (such as a shared memory bus). It should parallelize almost ideally. But not only does it degrade when N increases, it even degrades proportionally!!!
The same test run on the host machine (just multicore, no NUMA) shows the expected results: there is degradation (because of the shared memory bus), but with a non-linear dependency on N.

Script ("test.sh"):

#!/bin/bash
N=$1

# Preparation...
if command -v numactl >/dev/null; then
    USE_NUMA_BIND=1
else
    USE_NUMA_BIND=0
fi
for i in $(seq 0 $((N - 1))); do
    mkdir -p /mnt/testmnt_$i
    if [[ "$USE_NUMA_BIND" == 1 ]] ; then TMPFS_EXTRA_OPT=",mpol=bind:$i"; fi
    mount -t tmpfs -o size=25M,noatime,nodiratime,norelatime$TMPFS_EXTRA_OPT tmpfs /mnt/testmnt_$i
    dd if=/dev/zero of=/mnt/testmnt_$i/testfile_r bs=10M count=1 >/dev/null 2>&1
done

# Running...
for i in $(seq 0 $((N - 1))); do
    if [[ "$USE_NUMA_BIND" == 1 ]] ; then PREFIX_RUN="numactl --cpunodebind=$i --membind=$i"; fi
    $PREFIX_RUN dd if=/mnt/testmnt_$i/testfile_r of=/mnt/testmnt_$i/testfile_w bs=100 count=100000 2>&1 | sed -n 's/^.*, \(.*\)$/\1/p' &
done

# Cleanup...
wait
for i in $(seq 0 $((N - 1))); do umount /mnt/testmnt_$i; done
rm -rf /mnt/testmnt_*

Corresponding QEMU command line fragment:
"-machine accel=tcg -m 2048 -icount 1,sleep=off -rtc clock=vm -smp 10 -cpu qemu64 -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node"
(Removing -icount or the numa nodes doesn't change the results.)

Example runs on my Intel Core i7-7700 host (adequate results):

artem@host:~$ sudo ./test.sh 1
117 MB/s
artem@host:~$ sudo ./test.sh 10
91,1 MB/s
89,3 MB/s
90,4 MB/s
85,0 MB/s
68,7 MB/s
63,1 MB/s
62,0 MB/s
55,9 MB/s
54,1 MB/s
56,0 MB/s

Example runs on my tiny Linux x86_64 guest (strange results):

root@guest:~# ./test.sh 1
17.5 MB/s
root@guest:~# ./test.sh 10
3.2 MB/s
2.7 MB/s
2.6 MB/s
2.0 MB/s
2.0 MB/s
1.9 MB/s
1.8 MB/s
1.8 MB/s
1.8 MB/s
1.8 MB/s

Please explain these results. Or maybe I'm wrong and this is normal?

On Thu, 6 Sep 2018 at 16:24, Artem Pisarenko <artem.k.pisarenko@gmail.com> wrote:
> Hi all,
>
> I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.
>
> While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.
>
> What is the scope of the term "(asynchronous) I/O" within qemu? Is it something related to the block device layer only, or a generic term covering the whole datapath between vCPU and backend?
> If it relates to block devices only, does using VirtFS guarantee deterministic access, or does it still involve some asynchrony relative to the guest virtual clock?
> Is it possible to force asynchronous I/O within qemu to be blocking by some external means (host OS configuration, hooks, etc.)? I know it may greatly slow down guest performance, but it's still better than nothing. Maybe some trivial patch can be made to the qemu code at the virtio, block backend or platform syscall level?
> Maybe I/O automatically (and guaranteed) falls back to synchronous mode in some particular configurations, such as a block device whose image is located on tmpfs in RAM (either directly or via an overlay fs)? If so, that's great!
> Or maybe some other solution exists?...
>
> The main problem is to organize access from the guest Linux to some file system on the host (directory, mount point, image file... doesn't matter) in a deterministic manner.
> The secondary problem is to optimize performance as much as possible by:
> - avoiding unnecessary overheads (e.g. using the virtio infrastructure, preferring virtfs over a blk device, etc.);
> - allowing some asynchrony within a defined quantum of time (e.g. 10ms), i.e. I/O order and speed are free to float within each quantum's borders, while the result seen by the guest at the end of the quantum is always the same.
>
> Actually, what I'm trying to achieve is exactly what most people try to avoid, because synchronous I/O degrades performance in the vast majority of usage scenarios.
>
> Does anyone have any thoughts on this?
>
> Best regards,
> Artem Pisarenko

--
Best regards,
Artem Pisarenko
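One small diagnostic that might help narrow this down - purely a hypothetical wrapper around the test.sh above, not something used in the thread - is to time the whole run with the guest's own clock (which under -icount should follow the virtual clock), so the per-dd MB/s figures can be checked against the total guest-visible time for the same amount of data:

  #!/bin/bash
  # Hypothetical wrapper: measure the whole run of test.sh with the guest's own clock,
  # to see whether the slowdown for larger N is actually charged to guest (virtual) time
  # or only shows up in the individual dd reports.
  N=$1
  START=$(date +%s.%N)
  ./test.sh "$N"
  END=$(date +%s.%N)
  awk -v n="$N" -v s="$START" -v e="$END" \
      'BEGIN { printf "guest-visible time for N=%d: %.2f s\n", n, e - s }'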