qemu-devel.nongnu.org archive mirror
* [Qemu-devel] Evaluating Disk IO and Snapshots
@ 2006-01-19 14:04 Juergen Pfennig
  2006-01-19 19:43 ` André Braga
  0 siblings, 1 reply; 3+ messages in thread
From: Juergen Pfennig @ 2006-01-19 14:04 UTC (permalink / raw)
  To: qemu-devel

Hi,
THE FOLLOWING IS IMPORTANT (COMMENTS ARE WELCOME):

I will modify monitor.c to implement a "save" command. That command
will do:

    stop
    commit
    (rename the old vmstate file)
    savevm (from where it got loaded)

The commands commit and savevm will be modified to give progress
report info, and the same goes for qemu-img. A future improvement
(transparent to the user) could be to avoid the commit altogether
and to use the new -snapshot file driver (see below) to remember
the vm state. qemu-img would then also have to be able to commit
the -snapshot file and to extract the savevm file for backward
compatibility.

The following is just for information ...

I am still evaluating qemu, currently the disk IO. I am aware of some
proposed patches concerning the disk controller that enable Windows to
use DMA and async IO. I am not going to interfere with these.

What I found is that qcow has poor performance. I wrote my own driver
(which is intended only for -snapshot) and see significant improvements.
A 300 MByte file copy (win2003 xcopy /e between two real drives) takes
90 instead of 135 seconds. I will send the patch to the list after it
has matured for a while. The code is Linux-only, since mmap() is used.

Background: Windows uses the swap file massively - all dirty memory
gets written to swap after 30s, whereas Linux is very lazy about
using its swap file. For Windows on qemu (best with -snapshot), swap
IO performance really matters.

There is also a new implementation to generate temporary file names.
The TMPDIR environment variable is taken into account. Reason (1):
programs must not assume that "/tmp" can/should be used. Some distros
(Debian) propose the use of pam_tempdir (or however it is called).
Reason (2): matured versions of my driver will use 2 tmp files per
disk to avoid sparseness. As qemu currently has no config file (which
is good for now), environment variables might be used to tweak the
config.

Another 1..2% IO speed improvement (here measured in CPU cycles) might
be possible by reorganizing the way data is copied between the port and
the disk. The required changes are moderate. I will try a timing
simulation outside qemu and will report the result.

Yours Jürgen

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] Evaluating Disk IO and Snapshots
  2006-01-19 14:04 [Qemu-devel] Evaluating Disk IO and Snapshots Juergen Pfennig
@ 2006-01-19 19:43 ` André Braga
  0 siblings, 0 replies; 3+ messages in thread
From: André Braga @ 2006-01-19 19:43 UTC (permalink / raw)
  To: qemu-devel

On 1/19/06, Juergen Pfennig <info@j-pfennig.de> wrote:
> What I found is that qcow has poor performance. I wrote my own driver
> (which is intended only for -snapshot) and see signifcant improvements.
> A 300 MByte file copy (win2003 xcopy /e between two real drives) takes
> 90 instead of 135 seconds. I will send the patch to the list after it
> has matured for a while. The thing is linux only for mmap() is used.

Hi,

While you are at it, have you considered using the LZO libraries
instead of zlib for compression/decompression speed? Sure, it won't
compress as much as zlib, but speed improvements should be noticeable.

I was thinking about doing this myself, but no doubt you now
understand the relevant source code at a level that will still take
me a few weeks to reach.

The source code for LZO is GPL'd, though. If I understand it
correctly, Mr. Oberhumer wouldn't mind making an exception for QEMU
and licensing it as LGPL in this particular case; in any case, he
would have to be contacted to clarify this issue.

Link: http://www.oberhumer.com/opensource/lzo/


> The thing is linux only for mmap() is used.

On the link below you'll find some practical source code showing the
differences between *nix mmap() and Windows' equivalents:

http://www-128.ibm.com/developerworks/eserver/library/es-MigratingWin32toLinux.html

Please consider using those so everyone can benefit ;)


Thank you!


--
"I decry the current tendency to seek patents on algorithms. There are
better ways to earn a living than to prevent other people from making
use of one's contributions to computer science."
Donald Knuth

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] Evaluating Disk IO and Snapshots
       [not found] <0MKpdM-1EzhTl05o2-0002QW@mx.kundenserver.de>
@ 2006-01-20 22:53 ` Juergen Pfennig
  0 siblings, 0 replies; 3+ messages in thread
From: Juergen Pfennig @ 2006-01-20 22:53 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4006 bytes --]

Hi Andre
you suggested ...

  While you are at it, have you considered using the LZO libraries
  instead of zlib for compression/decompression speed? Sure, it won't
  compress as much as zlib, but speed improvements should be noticeable.

... sorry. This is a misunderstanding. 

(1) I will not modify qcow and friends. Beware!
(2) The thing works only for the -snapshot file.
(3) The snapshot file uses no compression.
(4) Non-Linux/BSD hosts would fall back to qcow.
(5) Yes, a Windows implementation would be possible.

Here are more details:

The storage for temp data will not rely on sparse files. It will use
two memory mapped temp files, one for index info and one for real
data. I have implemented a simple version of it and am testing it
currently. Speed improvements (IO time) are significant (about 20%).

The zero-copy memory thing ...

There will be a new function for use by ne2000.c, ide.c and friends:

    ptr = bdrv_memory( ...disk..., sector, [read|write|cancel|commit])

In many situations the function can return a pointer into a
mem-mapped region (the windows swap file would be a good example).
This helps to avoid copying data around in user-space or between
user-space and kernel. The cancel/commit can be implemented via
aliasing. The code also helps to combine disk sectors back into pages
without extra cost (Windows usually writes 4k blocks or larger).

THE PROBLEM: avoiding read before write. I will have a look at the
kernel sources.

Whereas I expect only a 1% win from the zero-copy stuff, my tests for
another little thing promise a 4% improvement (measured in CPU
cycles). Or 12.5 ns per IO byte. This is how it works:

OLD CODE (vl.c):
  void *ioport_opaque[MAX_IOPORTS];
  IOPortWriteFunc *ioport_write_table[3][MAX_IOPORTS];
  IOPortReadFunc *ioport_read_table[3][MAX_IOPORTS];

  void cpu_outl(CPUState *env, int addr, int val)
  {   ioport_write_table[2][addr](ioport_opaque[addr], addr, val);
  }

OLD CODE (ide.c and even worse in ne2000.c):
  void writeFunction(void *opaque, unsigned int addr, unsigned int data)
  { IDEState *s = ((IDEState *)opaque)->curr;
     char *p;
     p = s->data_ptr;
     *(unsigned int *)p = data;
     p += 4;
     s->data_ptr = p;
     if (p >= s->data_end) s->end_function();
  }

As you can see repeated port IO produces a lot of overhead. 115 ns per
32-bit word (P4 2.4 GHz CPU).

New Code (vl.c):
  typedef struct PIOInfo {
    /* ... more fields ... */
    IOPortWriteFunc* write;
    void*            opaque;
    char*            data_ptr;
    char*            data_end;
  } PIOInfo;

  PIOInfo*    pio_info_table[MAX_IOPORTS];

  void cpu_outx(CPUState *env, int addr, int val)
  {
    PIOInfo *i = pio_info_table[addr];
    if(i->data_ptr >= i->data_end) // simple call
       i->write(i->opaque, addr, val);
    else {                         // copy call
        *(int *)(i->data_ptr) = val;
        i->data_ptr += 4;
    }
 }

The new code moves the data copying (from ide.c and ne2000.c) into
vl.c. This saves 60 ns per 32-bit word. Some memory is saved, and
cache locality is increased. Async IO implementation gets easier.

THE PROBLEMS:

(1) For a simple call there is a 7ns penalty compared to the
    current solution.
(2) Until now the ide.c and ne2000.c drivers are very closely
    modelled on the hardware. The C code looks a bit like a
    circuit diagram (1:1 relation). My proposal adds some
    abstraction. The ide.c driver would give up the "drive
    cache" memory and the ne2000.c driver would first fetch
    the (raw) data and then process it.

Disappointed?

Yes, it's a bit ugly. For modest speed enhancements a lot of code
is needed. But on the other hand: many small things taken together
can add up to big progress (Paul's code generator, dma, async IO...).

I have attached my timing test. Compile it with -O3 (-O4 makes no
sense unless you split the code into different files).

Yours Jürgen



[-- Attachment #2: test.c --]
[-- Type: text/x-csrc, Size: 5160 bytes --]

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define MAX_IOPORTS 4096
typedef void (IOPortWriteFunc)(void *opaque, unsigned int address, unsigned int data);

typedef struct IDEState
{
    void*           dummy;
    void*           curr;
    char*           data_ptr;
    char*           data_end;
} IDEState;

typedef struct PIOInfo {
    void*            dummy;
    IOPortWriteFunc* write;
    void*            opaque;
    char*            data_ptr;
    char*            data_end;
} PIOInfo;

typedef struct CPUState
{
    void*            dummy;
    PIOInfo*         info;
} CPUState;

void *ioport_opaque[MAX_IOPORTS];
IOPortWriteFunc *ioport_write_table[3][MAX_IOPORTS];
PIOInfo*    pio_info_table[MAX_IOPORTS];

unsigned int fake = 0;
int testIdx = 23;
int testCnt = 10;

void writeFake(void *opaque, unsigned int addr, unsigned int data)
{
    fake ^= data;
}

void writeLoop(void *opaque, unsigned int addr, unsigned int data)
{
    IDEState *s = ((IDEState *)opaque)->curr;
    char *p;

    p = s->data_ptr;
    *(unsigned int *)p = data;
    p += 4;
    s->data_ptr = p;
    if (p >= s->data_end)
        printf("oops\n");
}

void cpu_outl(CPUState *env, int addr, int val)
{
    // the if overhead is 7 ns (2.4 GHz P4) ...
    //if(ioport_opaque[addr] == 0)
    ioport_write_table[2][addr](ioport_opaque[addr], addr, val);
}

void cpu_outx(CPUState *env, int addr, int val)
{
    PIOInfo *i = pio_info_table[addr];
    if(i->data_ptr >= i->data_end)
       i->write(i->opaque, addr, val);
    else {
        *(int *)(i->data_ptr) = val;
        i->data_ptr += 4;
    }
}

int main(int argc, char** argv)
{
    struct timeval     tss, tse;
    CPUState    env;
    IDEState    ide;
    PIOInfo     pio;
    int         irun;
    char        buff[64];

    // TEST 1

    ioport_write_table[2][testIdx] = writeFake;
    ioport_opaque[testIdx] = 0;
    printf("start 1\n");
    gettimeofday(&tss, NULL);

    for(irun=0; irun < 1000*1000*testCnt; irun++) {
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);

        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
    }

    gettimeofday(&tse, NULL);
    {   /* elapsed ms; each testCnt unit is 10 million calls */
        double ms = (tse.tv_sec - tss.tv_sec) * 1000.0
                  + (tse.tv_usec - tss.tv_usec) / 1000.0;
        printf("done (%.6g ns/call)\n", ms / (10.0 * testCnt));
    }

    // TEST 2

    ioport_write_table[2][testIdx] = writeLoop;
    ioport_opaque[testIdx] = &ide;
    ide.curr = &ide;
    printf("start 2\n");
    gettimeofday(&tss, NULL);

    for(irun=0; irun < 1000*1000*testCnt; irun++) {
        ide.data_ptr = buff;
        ide.data_end = buff + sizeof(buff);

        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);

        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun);
    }

    gettimeofday(&tse, NULL);
    {   /* elapsed ms; each testCnt unit is 10 million calls */
        double ms = (tse.tv_sec - tss.tv_sec) * 1000.0
                  + (tse.tv_usec - tss.tv_usec) / 1000.0;
        printf("done (%.6g ns/call)\n", ms / (10.0 * testCnt));
    }

    // TEST 3

    pio_info_table[testIdx] = &pio;
    pio.write  = writeFake;
    pio.opaque = &ide;
    printf("start 3\n");
    gettimeofday(&tss, NULL);

    for(irun=0; irun < 1000*1000*testCnt; irun++) {
        pio.data_ptr = buff + sizeof(buff);
        pio.data_end = 0;               /* force the simple-call path */

        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);

        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
    }

    gettimeofday(&tse, NULL);
    {   /* elapsed ms; each testCnt unit is 10 million calls */
        double ms = (tse.tv_sec - tss.tv_sec) * 1000.0
                  + (tse.tv_usec - tss.tv_usec) / 1000.0;
        printf("done (%.6g ns/call)\n", ms / (10.0 * testCnt));
    }

    // TEST 4

    pio.write  = writeFake;
    pio.opaque = &ide;
    printf("start 4\n");
    gettimeofday(&tss, NULL);

    for(irun=0; irun < 1000*1000*testCnt; irun++) {
        pio.data_ptr = buff;
        pio.data_end = buff + sizeof(buff);

        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);

        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun);
    }

    gettimeofday(&tse, NULL);
    {   /* elapsed ms; each testCnt unit is 10 million calls */
        double ms = (tse.tv_sec - tss.tv_sec) * 1000.0
                  + (tse.tv_usec - tss.tv_usec) / 1000.0;
        printf("done (%.6g ns/call)\n", ms / (10.0 * testCnt));
    }

    return 0;
}

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2006-01-20 22:56 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-19 14:04 [Qemu-devel] Evaluating Disk IO and Snapshots Juergen Pfennig
2006-01-19 19:43 ` André Braga
     [not found] <0MKpdM-1EzhTl05o2-0002QW@mx.kundenserver.de>
2006-01-20 22:53 ` Juergen Pfennig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).