* [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Artem Pisarenko @ 2018-09-06 10:24 UTC
To: qemu-devel@nongnu.org, qemu-block
Cc: Stefan Hajnoczi, Kevin Wolf, Michael S. Tsirkin, Greg Kurz

Hi all,

I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.

While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.

What is the scope of the term "(asynchronous) I/O" within qemu? Is it something related to the block device layer only, or a generic term covering the whole datapath between vCPU and backend?
If it relates to block devices only, does using VirtFS guarantee deterministic access, or does it still involve some asynchrony relative to the guest virtual clock?
Is it possible to force asynchronous I/O within qemu to be blocking by some external means (host OS configuration, hooks, etc.)? I know it may greatly slow down guest performance, but it's still better than nothing. Maybe some trivial patch can be made to the qemu code at the virtio, block backend or platform syscall level?
Maybe I/O automatically (and guaranteed) falls back to synchronous mode in some particular configurations, such as a block device whose image is located on tmpfs in RAM (either directly or via an overlay fs)? If so, that's great!
Or maybe some other solution exists?...

The main problem is to organize access from the guest Linux to some file system on the host (directory, mount point, image file... doesn't matter) in a deterministic manner.
The secondary problem is to optimize performance as much as possible by:
- avoiding unnecessary overheads (e.g. using the virtio infrastructure, preferring virtfs over a blk device, etc.);
- allowing some asynchrony within a defined quantum of time (e.g. 10ms), i.e. I/O order and speed are free to float within each quantum's borders, while the result seen by the guest at the end of the quantum is always the same.

Actually, what I'm trying to achieve is exactly what most people try to avoid, because synchronous I/O degrades performance in the vast majority of usage scenarios.

Does anyone have any thoughts on this?

Best regards,
Artem Pisarenko
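For illustration only, one concrete shape of the configuration being asked about (raw rootfs image kept on tmpfs, TCG with -icount, virtio-blk with writethrough caching and the thread-pool AIO backend) could look like the sketch below; the image path, sizes and choice of options are assumptions made up for the example, not something recommended or tested in this thread:

  # Hypothetical setup sketch (paths and sizes invented for the example).
  # Create a raw container-rootfs image directly in RAM-backed tmpfs:
  dd if=/dev/zero of=/dev/shm/container1-rootfs.img bs=1M count=512
  mkfs.ext4 -F /dev/shm/container1-rootfs.img
  # Expose it to the guest as a virtio-blk disk; guest kernel/rootfs options omitted.
  qemu-system-x86_64 \
      -machine accel=tcg -icount 1,sleep=off -rtc clock=vm -m 2048 \
      -drive file=/dev/shm/container1-rootfs.img,format=raw,if=virtio,cache=writethrough,aio=threads

Whether such a configuration actually removes the asynchrony relative to the guest virtual clock is exactly the open question of this thread.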
* Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Michael S. Tsirkin @ 2018-09-06 15:08 UTC
To: Artem Pisarenko
Cc: qemu-devel@nongnu.org, qemu-block, Stefan Hajnoczi, Kevin Wolf, Greg Kurz

On Thu, Sep 06, 2018 at 04:24:12PM +0600, Artem Pisarenko wrote:
> Hi all,
>
> I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.
>
> While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.

...

Just that you should realize that the issues are not limited to QEMU: to get real-time behaviour out of a Linux host you need a real-time kernel and real-time capable hardware/firmware. I'm not an expert on this at all, but see e.g. these old presentations: https://lwn.net/Articles/656807/

--
MST
* Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Artem Pisarenko @ 2018-09-07 8:15 UTC
To: Michael S. Tsirkin
Cc: qemu-devel@nongnu.org, qemu-block, Stefan Hajnoczi, Kevin Wolf, Greg Kurz

No. I don't need realtime behavior. Realtime implies determinism, but determinism doesn't imply realtime. Of course, I realize that other sources of non-determinism exist, but those are separate stories. Here I'm just trying to eliminate one of them - the asynchronous emulation of I/O inside qemu.

Realtime isn't the solution here.
Firstly, implementing realtime still leaves a dependency on the host machine (its performance, hardware configuration, etc.) and the number of containers running. Yes, it will be deterministic, but the results are tied to the given host and container count.
Secondly, it's simply overkill for the problem being solved. The problem area is bounded by the guest and the QEMU implementation. Using realtime requires fighting complexities on the host as well (the host kernel must be realtime, the system configuration must be tuned, all possible latencies must be carefully traced, etc.). I perfectly understand how complex it is to design a realtime system in general, and implementing it on Linux makes things even more complex.
Thirdly, it works only for KVM (and possibly other virtualization hypervisors). That's not my case, since my guest runs with TCG and -icount,sleep=off.

It seems you got me wrong. I'll try to explain the problem another way.
The guest virtual clock must run independently of the realtime (host) clock. They might be synchronized only in order to wait for some QEMU/host operation to complete, i.e. guest time is frozen by host performance bottlenecks, but this is transparent to the guest. This is how "-icount,sleep=off" works (or, at least, should work) in the time domain of CPU emulation. But I/O operations seem not to respect this "policy". When QEMU processes an I/O request from the guest, it allows virtual time to run freely until the backend completes the operation and the result is passed back to the guest. And this is what makes the guest "feel" the speed/latency of I/O. That is the core of the problem.

To explain the problem even better I've written a simple script (test_run_multiple_containers.sh), which emulates execution of multiple containers:

#!/bin/bash
N=$1
for i in $(seq 1 $N); do
    dd if=/dev/zero of=/tmp/testfile_$i bs=1K count=100000 2>&1 | sed -n 's/^.*, \(.*\)$/\1/p' &
done
wait
rm -f /tmp/testfile*

Here N is the number of containers running in parallel, and /tmp/testfile_$i is a file located in the $i-th container's rootfs (dedicated mount point, blk device or something else).
Running ./test_run_multiple_containers.sh 1 on a real machine should output a value which corresponds to the maximum write speed. Let's define it as "max_io_throughput".
Running this script on a real machine with different N values should give outputs with roughly identical values of about "max_io_throughput / N".
What I need is that running this script on the guest always gives identical and constant values, not depending on the value of N, the current host load or anything else external to the guest. (No magic. While the running emulation will cause at most "max_io_throughput" load on the host (in terms of real time), QEMU will throttle the guest virtual clock to be N times slower relative to the realtime clock.)

Also I forgot to mention that the containers' rootfs aren't required to be persistent and stay on the host during execution of the containers. They may be transferred to guest RAM before execution. They're just the source images of the rootfs.

On Thu, 6 Sep 2018 at 21:08, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Sep 06, 2018 at 04:24:12PM +0600, Artem Pisarenko wrote:
> > Hi all,
> >
> > I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.
> >
> > While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.
>
> ...
>
> Just that you should realize that the issues are not limited to QEMU: to get real-time behaviour out of a Linux host you need a real-time kernel and real-time capable hardware/firmware. I'm not an expert on this at all, but see e.g. these old presentations: https://lwn.net/Articles/656807/
>
> --
> MST

--
Best regards,
Artem Pisarenko
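To make the "no magic" remark above concrete, here is a small worked example of the intended virtual-clock throttling; the throughput and size figures are invented for illustration and are not measurements from this thread:

  #!/bin/bash
  # Hypothetical numbers, purely to illustrate the virtual-clock throttling idea.
  MAX_THROUGHPUT=500   # MB/s the host can sustain in real time ("max_io_throughput")
  SIZE=100             # MB written by each container
  for N in 1 2 5 10; do
      # real (host) time needed to complete all N writes
      real=$(awk "BEGIN { print $N * $SIZE / $MAX_THROUGHPUT }")
      # speed each dd observes today: the virtual clock keeps running while the backend works
      seen_now=$(awk "BEGIN { print $MAX_THROUGHPUT / $N }")
      # speed each dd would observe if the virtual clock were frozen during backend I/O:
      # every container sees the N=1 figure, independent of N
      echo "N=$N host_time=${real}s observed_now=${seen_now}MB/s observed_with_frozen_clock=${MAX_THROUGHPUT}MB/s"
  done

The point is only that determinism is paid for in host wall-clock time (the virtual clock has to run roughly N times slower relative to real time), not that any extra I/O capacity appears.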
* Re: [Qemu-devel] backend for blk or fs with guaranteed blocking/synchronous I/O

From: Artem Pisarenko @ 2018-09-10 15:06 UTC
To: qemu-devel@nongnu.org, qemu-block
Cc: Stefan Hajnoczi, Kevin Wolf, Michael S. Tsirkin, Greg Kurz, Paolo Bonzini

It looks like things are even worse. The guest demonstrates strange timings even without access to anything external to the machine. I've added Paolo Bonzini to CC, because the issue looks related to cpu/tcg/memory stuff.

I've written a simple test script running parallel 'dd' utility processes operating on files located in RAM, on a QEMU machine with multiple vCPUs. Moreover, the machine has a separate NUMA node for each vCPU.
In brief: the script accepts an argument with the desired process count; for each process it mounts a tmpfs bound to a node's memory and runs 'dd', bound to both that node's CPU and memory, which copies files located on that tmpfs.

It's expected that the overall execution time of N parallel processes (or the copying speed) should always be the same, independent of the value of N (provided, of course, that N <= nodes_count and 'dd' is single-threaded), because each process is just a simple loop of instructions loading and storing values in memory local to its CPU. No common resources should be involved - neither software (such as some target OS lock/mutex) nor hardware (such as a shared memory bus). It should parallelize almost ideally. But not only does it degrade when N increases, it even degrades proportionally!!!
The same test run on the host machine (just multicore, no NUMA) shows the expected results: there is degradation (because of the shared memory bus), but with a non-linear dependency on N.

Script ("test.sh"):

#!/bin/bash
N=$1

# Preparation...
if command -v numactl >/dev/null; then
    USE_NUMA_BIND=1
else
    USE_NUMA_BIND=0
fi
for i in $(seq 0 $((N - 1))); do
    mkdir -p /mnt/testmnt_$i
    if [[ "$USE_NUMA_BIND" == 1 ]] ; then TMPFS_EXTRA_OPT=",mpol=bind:$i"; fi
    mount -t tmpfs -o size=25M,noatime,nodiratime,norelatime$TMPFS_EXTRA_OPT tmpfs /mnt/testmnt_$i
    dd if=/dev/zero of=/mnt/testmnt_$i/testfile_r bs=10M count=1 >/dev/null 2>&1
done

# Running...
for i in $(seq 0 $((N - 1))); do
    if [[ "$USE_NUMA_BIND" == 1 ]] ; then PREFIX_RUN="numactl --cpunodebind=$i --membind=$i"; fi
    $PREFIX_RUN dd if=/mnt/testmnt_$i/testfile_r of=/mnt/testmnt_$i/testfile_w bs=100 count=100000 2>&1 | sed -n 's/^.*, \(.*\)$/\1/p' &
done

# Cleanup...
wait
for i in $(seq 0 $((N - 1))); do umount /mnt/testmnt_$i; done
rm -rf /mnt/testmnt_*

Corresponding QEMU command line fragment:
"-machine accel=tcg -m 2048 -icount 1,sleep=off -rtc clock=vm -smp 10 -cpu qemu64 -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node -numa node"
(Removing -icount or the numa nodes doesn't change the results.)

Example runs on my Intel Core i7-7700 host (adequate results):

artem@host:~$ sudo ./test.sh 1
117 MB/s
artem@host:~$ sudo ./test.sh 10
91,1 MB/s
89,3 MB/s
90,4 MB/s
85,0 MB/s
68,7 MB/s
63,1 MB/s
62,0 MB/s
55,9 MB/s
54,1 MB/s
56,0 MB/s

Example runs on my tiny Linux x86_64 guest (strange results):

root@guest:~# ./test.sh 1
17.5 MB/s
root@guest:~# ./test.sh 10
3.2 MB/s
2.7 MB/s
2.6 MB/s
2.0 MB/s
2.0 MB/s
1.9 MB/s
1.8 MB/s
1.8 MB/s
1.8 MB/s
1.8 MB/s

Please explain these results. Or maybe I'm wrong and this is normal?

On Thu, 6 Sep 2018 at 16:24, Artem Pisarenko <artem.k.pisarenko@gmail.com> wrote:
> Hi all,
>
> I'm developing a paravirtualized target Linux system which runs multiple Linux containers (LXC) inside itself. (For those unfamiliar with LXC: simply put, it's an isolated group of userspace processes with their own rootfs.) Each container should be given access to its rootfs located on the host, and execution of the container should be deterministic. In particular, this means that container I/O operations must be synchronized within some predefined quantum of guest _virtual_ time, i.e. its I/O activity shouldn't be delayed by host performance or by activity of the host and other containers. In other words, the guest should see either infinite throughput and zero latency, or some predefined throughput/latency characteristics guaranteed for each rootfs.
>
> While other sources of non-determinism seem to be eliminated (using TCG, -icount, etc.), asynchronous I/O still introduces it.
>
> What is the scope of the term "(asynchronous) I/O" within qemu? Is it something related to the block device layer only, or a generic term covering the whole datapath between vCPU and backend?
> If it relates to block devices only, does using VirtFS guarantee deterministic access, or does it still involve some asynchrony relative to the guest virtual clock?
> Is it possible to force asynchronous I/O within qemu to be blocking by some external means (host OS configuration, hooks, etc.)? I know it may greatly slow down guest performance, but it's still better than nothing. Maybe some trivial patch can be made to the qemu code at the virtio, block backend or platform syscall level?
> Maybe I/O automatically (and guaranteed) falls back to synchronous mode in some particular configurations, such as a block device whose image is located on tmpfs in RAM (either directly or via an overlay fs)? If so, that's great!
> Or maybe some other solution exists?...
>
> The main problem is to organize access from the guest Linux to some file system on the host (directory, mount point, image file... doesn't matter) in a deterministic manner.
> The secondary problem is to optimize performance as much as possible by:
> - avoiding unnecessary overheads (e.g. using the virtio infrastructure, preferring virtfs over a blk device, etc.);
> - allowing some asynchrony within a defined quantum of time (e.g. 10ms), i.e. I/O order and speed are free to float within each quantum's borders, while the result seen by the guest at the end of the quantum is always the same.
>
> Actually, what I'm trying to achieve is exactly what most people try to avoid, because synchronous I/O degrades performance in the vast majority of usage scenarios.
>
> Does anyone have any thoughts on this?
>
> Best regards,
> Artem Pisarenko

--
Best regards,
Artem Pisarenko
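One small diagnostic that might help narrow this down - purely a hypothetical wrapper around the test.sh above, not something used in the thread - is to time the whole run with the guest's own clock (which under -icount should follow the virtual clock), so the per-dd MB/s figures can be checked against the total guest-visible time for the same amount of data:

  #!/bin/bash
  # Hypothetical wrapper: measure the whole run of test.sh with the guest's own clock,
  # to see whether the slowdown for larger N is actually charged to guest (virtual) time
  # or only shows up in the individual dd reports.
  N=$1
  START=$(date +%s.%N)
  ./test.sh "$N"
  END=$(date +%s.%N)
  awk -v n="$N" -v s="$START" -v e="$END" \
      'BEGIN { printf "guest-visible time for N=%d: %.2f s\n", n, e - s }'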