Linux virtualization list
 help / color / mirror / Atom feed
* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Alex Xu via Virtualization @ 2016-07-31  1:53 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux Crypto Mailing List, virtualization
In-Reply-To: <20160730220922.GA12853@thunk.org>

On Sat, 30 Jul 2016 18:09:22 -0400
Theodore Ts'o <tytso@mit.edu> wrote as excerpted:
> On Fri, Jul 29, 2016 at 01:31:14PM -0400, Alex Xu wrote:
> > When qemu is started with -object rng-random,filename=/dev/urandom,
> > and immediately (i.e. with no initrd and as the first thing in
> > init):
> > 
> > 1. the guest runs dd if=/dev/random, there is no blocking and tons
> > of data goes to the screen. the data appears to be random.
> > 
> > 2. the guest runs getrandom with any requested amount (tested 1 byte
> > and 16 bytes) and no flags, it blocks for 90-110 seconds while the
> > "non-blocking pool is initialized". the returned data appears to be
> > random.
> > 
> > 3. the guest runs getrandom with GRND_RANDOM with any requested
> > amount, it returns the desired amount or possibly less, but in my
> > experience at least 10 bytes. the returned data appears to be
> > random.
> > 
> > I believe that the difference between cases 1 and 2 is a bug, since
> > based on my previous statement, in this scenario, getrandom should
> > never block.  
> 
> This is correct; and it has been fixed in the patches in v4.8-rc1.
> The patch which fixes this has been marked for backporting to stable
> kernels:
> 
> commit 3371f3da08cff4b75c1f2dce742d460539d6566d
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Sun Jun 12 18:11:51 2016 -0400
> 
>     random: initialize the non-blocking pool via
> add_hwgenerator_randomness() 
>     If we have a hardware RNG and are using the in-kernel rngd, we
> should use this to initialize the non-blocking pool so that
> getrandom(2) doesn't block unnecessarily.
>     
>     Cc: stable@kernel.org
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> 
> Basically, the urandom pool (now CSRPNG) wasn't getting initialized
> from the hardware random number generator.  Most people didn't notice
> because very few people actually *use* hardware random number
> generators (although it's much more common in VM's, which is how
> you're using it), and use of getrandom(2) is still relatively rare,
> given that glibc hasn't yet seen fit to support it yet.
> 
> Cheers,
> 
> 					- Ted

Dammit, the one time I track down an actual kernel bug someone's already
fixed it. I'd even bothered to check 4.6 so I figured nobody'd gotten
around to it yet.

Thanks for the excellent explanations though. :)

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Theodore Ts'o @ 2016-07-30 22:09 UTC (permalink / raw)
  To: Alex Xu; +Cc: Stephan Mueller, Linux Crypto Mailing List, virtualization
In-Reply-To: <20160729133114.37ff14ef.alex_y_xu@yahoo.ca>

On Fri, Jul 29, 2016 at 01:31:14PM -0400, Alex Xu wrote:
> 
> My understanding was that all three methods of obtaining entropy from
> userspace all receive data from the CSPRNG in the kernel, and that the
> only difference is that /dev/random and getrandom may block depending
> on the kernel's estimate of the currently available entropy.

This is incorrect.

/dev/random is a legacy interface which dates back to a time when
people didn't have as much trust in the cryptographic primitives ---
when there was concerns that the NSA might have put a back-door into
SHA-1, for example.  (As it turns out; we were wrong.  NSA put the
back door into Dual EC DRBG.)  So it uses a strategy of an extremely
conservative entropy estimator, and will allow N bytes to be
/dev/random pool as the entropy estimator believes that it has
gathered at least N bytes of entropy from environmental noise.

/dev/urandom uses a different output pool from /dev/random (the random
and urandom pools both draw from an common input pool).  Originally
the /dev/urandom pool drew from the input pool as needed, but it
wouldn't block if there was insufficient entropy.  Over time, it now
has limits about how quickly it can draw from the input pool, and it
behaves more and more like a CSPRNG.  In fact, in the most recent set
of patches which Linus has accepted for v4.8-rc1, the urandom pool has
been replaced by an actual CSPRNG using the ChaCha-20 stream cipher.

The getrandom(2) system call uses the same output pool (4.7 and
earlier) or CSPRG (starting with v4.8-rc1) as /dev/urandom.  The big
difference is that it blocks until we know for sure that the output
pool or CSRPNG has been seeded with 128 bits of entropy.  We don't do
this with /dev/urandom for backwards compatibility reasons.  (For
example, if we did make /dev/urandom block until it was seeded, it
would break systemd, because systemd and progams run by systemd draws
from /dev/urandom before it has been initialized, and if /dev/urandom
were to block, the boot would hang, and with the system quiscient, we
wouldn't get much environmental noise, and the system would hang
hours.)

> When qemu is started with -object rng-random,filename=/dev/urandom, and
> immediately (i.e. with no initrd and as the first thing in init):
> 
> 1. the guest runs dd if=/dev/random, there is no blocking and tons of
> data goes to the screen. the data appears to be random.
> 
> 2. the guest runs getrandom with any requested amount (tested 1 byte
> and 16 bytes) and no flags, it blocks for 90-110 seconds while the
> "non-blocking pool is initialized". the returned data appears to be
> random.
> 
> 3. the guest runs getrandom with GRND_RANDOM with any requested amount,
> it returns the desired amount or possibly less, but in my experience at
> least 10 bytes. the returned data appears to be random.
> 
> I believe that the difference between cases 1 and 2 is a bug, since
> based on my previous statement, in this scenario, getrandom should
> never block.

This is correct; and it has been fixed in the patches in v4.8-rc1.
The patch which fixes this has been marked for backporting to stable
kernels:

commit 3371f3da08cff4b75c1f2dce742d460539d6566d
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sun Jun 12 18:11:51 2016 -0400

    random: initialize the non-blocking pool via add_hwgenerator_randomness()
    
    If we have a hardware RNG and are using the in-kernel rngd, we should
    use this to initialize the non-blocking pool so that getrandom(2)
    doesn't block unnecessarily.
    
    Cc: stable@kernel.org
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Basically, the urandom pool (now CSRPNG) wasn't getting initialized
from the hardware random number generator.  Most people didn't notice
because very few people actually *use* hardware random number
generators (although it's much more common in VM's, which is how
you're using it), and use of getrandom(2) is still relatively rare,
given that glibc hasn't yet seen fit to support it yet.

Cheers,

					- Ted

^ permalink raw reply

* Re: [Qemu-devel] [PATCH 6/7] qemu: Implement virtio-pstore device
From: Namhyung Kim @ 2016-07-30  8:57 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Tony Luck, Kees Cook, kvm, Radim Kr??m????, LKML,
	Michael S. Tsirkin, qemu-devel, Steven Rostedt, virtualization,
	Minchan Kim, Anton Vorontsov, Anthony Liguori, Colin Cross,
	Paolo Bonzini, Ingo Molnar
In-Reply-To: <20160728132239.GM22677@redhat.com>

On Thu, Jul 28, 2016 at 02:22:39PM +0100, Daniel P. Berrange wrote:
> On Thu, Jul 28, 2016 at 12:08:30AM +0900, Namhyung Kim wrote:
> > Add virtio pstore device to allow kernel log files saved on the host.
> > It will save the log files on the directory given by pstore device
> > option.
> > 
> >   $ qemu-system-x86_64 -device virtio-pstore,directory=dir-xx ...
> > 
> >   (guest) # echo c > /proc/sysrq-trigger
> > 
> >   $ ls dir-xx
> >   dmesg-1.enc.z  dmesg-2.enc.z
> > 
> > The log files are usually compressed using zlib.  Users can see the log
> > messages directly on the host or on the guest (using pstore filesystem).
> > 
> > The 'directory' property is required for virtio-pstore device to work.
> > It also adds 'bufsize' and 'console' (boolean) properties.
> > 
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Kr??m???? <rkrcmar@redhat.com>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Anthony Liguori <aliguori@amazon.com>
> > Cc: Anton Vorontsov <anton@enomsg.org>
> > Cc: Colin Cross <ccross@android.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Tony Luck <tony.luck@intel.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: kvm@vger.kernel.org
> > Cc: qemu-devel@nongnu.org
> > Cc: virtualization@lists.linux-foundation.org
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---

[SNIP]
> > +static void virtio_pstore_to_filename(VirtIOPstore *s, char *buf, size_t sz,
> > +                                      struct virtio_pstore_req *req)
> > +{
> > +    const char *basename;
> > +    unsigned long long id = 0;
> > +    unsigned int flags = le32_to_cpu(req->flags);
> > +
> > +    switch (le16_to_cpu(req->type)) {
> > +    case VIRTIO_PSTORE_TYPE_DMESG:
> > +        basename = "dmesg";
> > +        id = s->id++;
> > +        break;
> > +    case VIRTIO_PSTORE_TYPE_CONSOLE:
> > +        basename = "console";
> > +        if (s->console_id) {
> > +            id = s->console_id;
> > +        } else {
> > +            id = s->console_id = s->id++;
> > +        }
> > +        break;
> > +    default:
> > +        basename = "unknown";
> > +        break;
> > +    }
> > +
> > +    snprintf(buf, sz, "%s/%s-%llu%s", s->directory, basename, id,
> > +             flags & VIRTIO_PSTORE_FL_COMPRESSED ? ".enc.z" : "");
> 
> Please use g_strdup_printf() instead of splattering into a pre-allocated
> buffer than may or may not be large enough.

OK.

> 
> > +}
> > +
> > +static void virtio_pstore_from_filename(VirtIOPstore *s, char *name,
> > +                                        char *buf, size_t sz,
> > +                                        struct virtio_pstore_fileinfo *info)
> > +{
> > +    snprintf(buf, sz, "%s/%s", s->directory, name);
> > +
> > +    if (g_str_has_prefix(name, "dmesg-")) {
> > +        info->type = VIRTIO_PSTORE_TYPE_DMESG;
> > +        name += strlen("dmesg-");
> > +    } else if (g_str_has_prefix(name, "console-")) {
> > +        info->type = VIRTIO_PSTORE_TYPE_CONSOLE;
> > +        name += strlen("console-");
> > +    } else if (g_str_has_prefix(name, "unknown-")) {
> > +        info->type = VIRTIO_PSTORE_TYPE_UNKNOWN;
> > +        name += strlen("unknown-");
> > +    }
> > +
> > +    qemu_strtoull(name, NULL, 0, &info->id);
> > +
> > +    info->flags = 0;
> > +    if (g_str_has_suffix(name, ".enc.z")) {
> > +        info->flags |= VIRTIO_PSTORE_FL_COMPRESSED;
> > +    }
> > +}
> > +
> > +static ssize_t virtio_pstore_do_open(VirtIOPstore *s)
> > +{
> > +    s->dirp = opendir(s->directory);
> > +    if (s->dirp == NULL) {
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static ssize_t virtio_pstore_do_read(VirtIOPstore *s, struct iovec *in_sg,
> > +                                     unsigned int in_num,
> > +                                     struct virtio_pstore_res *res)
> > +{
> > +    char path[PATH_MAX];
> 
> Don't declare PATH_MAX sized variables

Will change to use g_strdup_printf() as you said.

> 
> > +    int fd;
> > +    ssize_t len;
> > +    struct stat stbuf;
> > +    struct dirent *dent;
> > +    int sg_num = in_num;
> > +    struct iovec sg[sg_num];
> 
> 'sg_num' is initialized from 'in_num' which comes from the
> guest, and I'm not seeing anything which is bounds-checking
> the 'in_num' value. So you've possibly got a security flaw here
> I think, if the guest can cause QEMU to allocate arbitrary stack
> memory & thus overflow by setting arbitrarily large in_num.

Will add a bound check then.

> 
> > +    struct virtio_pstore_fileinfo info;
> > +    size_t offset = sizeof(*res) + sizeof(info);
> > +
> > +    if (s->dirp == NULL) {
> > +        return -1;
> > +    }
> > +
> > +    dent = readdir(s->dirp);
> > +    while (dent) {
> > +        if (dent->d_name[0] != '.') {
> > +            break;
> > +        }
> > +        dent = readdir(s->dirp);
> > +    }
> > +
> > +    if (dent == NULL) {
> > +        return 0;
> > +    }
> 
> So this seems to just be picking the first filename reported by
> readdir that isn't starting with '.'. Surely this can't the right
> logic when your corresponding do_write method can pick several
> different filenames, its potluck which do_read will give back.

Do you mean that it'd be better to check a list of known filenames and
fail if not?

> 
> > +
> > +    /* skip res and fileinfo */
> > +    sg_num = iov_copy(sg, sg_num, in_sg, in_num, offset,
> > +                      iov_size(in_sg, in_num) - offset);
> > +
> > +    virtio_pstore_from_filename(s, dent->d_name, path, sizeof(path), &info);
> > +    fd = open(path, O_RDONLY);
> > +    if (fd < 0) {
> > +        error_report("cannot open %s", path);
> > +        return -1;
> > +    }
> > +
> > +    if (fstat(fd, &stbuf) < 0) {
> > +        len = -1;
> > +        goto out;
> > +    }
> > +
> > +    len = readv(fd, sg, sg_num);
> > +    if (len < 0) {
> > +        if (errno == EAGAIN) {
> > +            len = 0;
> > +        }
> > +        goto out;
> > +    }
> > +
> > +    info.id        = cpu_to_le64(info.id);
> > +    info.type      = cpu_to_le16(info.type);
> > +    info.flags     = cpu_to_le32(info.flags);
> > +    info.len       = cpu_to_le32(len);
> > +    info.time_sec  = cpu_to_le64(stbuf.st_ctim.tv_sec);
> > +    info.time_nsec = cpu_to_le32(stbuf.st_ctim.tv_nsec);
> > +
> > +    iov_from_buf(in_sg, in_num, sizeof(*res), &info, sizeof(info));
> > +    len += sizeof(info);
> > +
> > + out:
> > +    close(fd);
> > +    return len;
> > +}
> > +
> > +static ssize_t virtio_pstore_do_write(VirtIOPstore *s, struct iovec *out_sg,
> > +                                      unsigned int out_num,
> > +                                      struct virtio_pstore_req *req)
> > +{
> > +    char path[PATH_MAX];
> > +    int fd;
> > +    ssize_t len;
> > +    unsigned short type;
> > +    int flags = O_WRONLY | O_CREAT;
> > +
> > +    /* we already consume the req */
> > +    iov_discard_front(&out_sg, &out_num, sizeof(*req));
> > +
> > +    virtio_pstore_to_filename(s, path, sizeof(path), req);
> > +
> > +    type = le16_to_cpu(req->type);
> > +
> > +    if (type == VIRTIO_PSTORE_TYPE_DMESG) {
> > +        flags |= O_TRUNC;
> > +    } else if (type == VIRTIO_PSTORE_TYPE_CONSOLE) {
> > +        flags |= O_APPEND;
> 
> Using O_APPEND will cause the file to grow without bound on the
> host, which is highly undesirable, aka a security flaw.

Right.  The plan is to add size checking and to split if it exceeds
some limit.  And we can keep some number of recent files only.

Thanks,
Namhyung


> 
> > +    }
> > +
> > +    fd = open(path, flags, 0644);
> > +    if (fd < 0) {
> > +        error_report("cannot open %s", path);
> > +        return -1;
> > +    }
> > +    len = writev(fd, out_sg, out_num);
> > +    close(fd);
> > +
> > +    return len;
> > +}
> 
> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply

* Re: [Qemu-devel] [PATCH 6/7] qemu: Implement virtio-pstore device
From: Namhyung Kim @ 2016-07-30  8:38 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Tony Luck, Kees Cook, kvm, Radim Kr??m????, Anton Vorontsov, LKML,
	Steven Rostedt, qemu-devel, Minchan Kim, Michael S. Tsirkin,
	Anthony Liguori, Colin Cross, Paolo Bonzini, virtualization,
	Ingo Molnar
In-Reply-To: <20160728130841.GL22677@redhat.com>

Hello,

On Thu, Jul 28, 2016 at 02:08:41PM +0100, Daniel P. Berrange wrote:
> On Thu, Jul 28, 2016 at 01:56:07PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Jul 28, 2016 at 02:39:53PM +0900, Namhyung Kim wrote:
> > > On Thu, Jul 28, 2016 at 03:02:54AM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Jul 28, 2016 at 12:08:30AM +0900, Namhyung Kim wrote:
> > > > > +static ssize_t virtio_pstore_do_write(VirtIOPstore *s, struct iovec *out_sg,
> > > > > +                                      unsigned int out_num,
> > > > > +                                      struct virtio_pstore_req *req)
> > > > > +{
> > > > > +    char path[PATH_MAX];
> > > > > +    int fd;
> > > > > +    ssize_t len;
> > > > > +    unsigned short type;
> > > > > +    int flags = O_WRONLY | O_CREAT;
> > > > > +
> > > > > +    /* we already consume the req */
> > > > > +    iov_discard_front(&out_sg, &out_num, sizeof(*req));
> > > > > +
> > > > > +    virtio_pstore_to_filename(s, path, sizeof(path), req);
> > > > > +
> > > > > +    type = le16_to_cpu(req->type);
> > > > > +
> > > > > +    if (type == VIRTIO_PSTORE_TYPE_DMESG) {
> > > > > +        flags |= O_TRUNC;
> > > > > +    } else if (type == VIRTIO_PSTORE_TYPE_CONSOLE) {
> > > > > +        flags |= O_APPEND;
> > > > > +    }
> > > > > +
> > > > > +    fd = open(path, flags, 0644);
> > > > > +    if (fd < 0) {
> > > > > +        error_report("cannot open %s", path);
> > > > > +        return -1;
> > > > > +    }
> > > > > +    len = writev(fd, out_sg, out_num);
> > > > > +    close(fd);
> > > > > +
> > > > > +    return len;
> > > > 
> > > > All this is blocking VM until host io completes.
> > > 
> > > Hmm.. I don't know about the internals of qemu.  So does it make guest
> > > stop?  If so, that's what I want to do for _DMESG. :)  As it's called
> > > only on kernel oops I think it's admittable.  But for _CONSOLE, it
> > > needs to do asynchronously.  Maybe I can add a thread to do the work.
> > 
> > Please look at include/io/channel.h.  QEMU is event-driven and tends to
> > use asynchronous I/O instead of spawning threads.  The include/io/ APIs
> > allow you to do asynchronous I/O in the event loop.
> 
> That is true, except for I/O to/from plain files - the QIOChannelFile
> impl doesn't do anything special (yet) to make that work correctly in
> non-blocking mode. Of course that could be fixed...

Yep, I don't know how I can use the QIOChannelFile for async IO.
AFAICS it's just a wrapper for normal readv/writev.  Who is
responsible to do these IO when I use the IO channel API?  Also does
it guarantee that all IOs are processed in order?

Thanks,
Namhyung

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Dave Hansen @ 2016-07-29 19:48 UTC (permalink / raw)
  To: Michael S. Tsirkin, Li, Liang Z
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160729003759-mutt-send-email-mst@kernel.org>

On 07/28/2016 02:51 PM, Michael S. Tsirkin wrote:
>> > If 1MB is too big, how about 512K, or 256K?  32K seems too small.
>> > 
> It's only small because it makes you rescan the free list.
> So maybe you should do something else.
> I looked at it a bit. Instead of scanning the free list, how about
> scanning actual page structures? If page is unused, pass it to host.
> Solves the problem of rescanning multiple times, does it not?

FWIW, I think the new data structure needs some work.

Before, we had a potentially very long list of 4k areas.  Now, we've
just got a very large bitmap.  The bitmap might not even be very dense
if we are ballooning relatively few things.

Can I suggest an alternate scheme?  I think you actually need a hybrid
scheme that has bitmaps but also allows more flexibility in the pfn
ranges.  The payload could be a number of records each containing 3 things:

	pfn, page order, length of bitmap (maybe in powers of 2)

Each record is followed by the bitmap.  Or, if the bitmap length is 0,
immediately followed by another record.  A bitmap length of 0 implies a
bitmap with the least significant bit set.  Page order specifies how
many pages each bit represents.

This scheme could easily encode the new data structure you are proposing
by just setting pfn=0, order=0, and a very long bitmap length.  But, it
could handle sparse bitmaps much better *and* represent large pages much
more efficiently.

There's plenty of space to fit a whole record in 64 bits.

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Alex Xu via Virtualization @ 2016-07-29 17:31 UTC (permalink / raw)
  To: Stephan Mueller, virtualization; +Cc: Linux Crypto Mailing List
In-Reply-To: <2790164.RXkTBNoHIv@tauon.atsec.com>

On Fri, 29 Jul 2016 19:03:51 +0200
Stephan Mueller <smueller@chronox.de> wrote as excerpted:
> Am Freitag, 29. Juli 2016, 10:14:07 CEST schrieb Alex Xu:
> > I don't follow. Assuming you are correct and this is the issue, then
> > reading 128 bits (16 bytes) from /dev/random should "exhaust the
> > supply" and then both reads from /dev/random and calling getrandom
> > should block.  
> 
> You assume that getrandom works like /dev/random. This is not the
> case. It is a full deterministic RNG like /dev/urandom (which is
> seeded during its operation as entropy is available).

My understanding was that all three methods of obtaining entropy from
userspace all receive data from the CSPRNG in the kernel, and that the
only difference is that /dev/random and getrandom may block depending
on the kernel's estimate of the currently available entropy.

> getrandom *only* differs from /dev/*u*random in that it waits
> initially such that the system collected 128 bits of entropy.

I agree, this is the documented behavior of getrandom.

> But you point to a real issue: when /dev/random is pulled before
> getrandom (and yet insufficient entropy is present), then the
> getrandom call will be woken up when the input_pool received 128
> bits. But those 128 bits are fed from the input_pool to the
> blocking_pool based on the caller at the /dev/ random device. This
> implies that the reader for getrandom will NOT be able to obtain data
> from the input_pool and the nonblocking_pool because the transfer
> operation will not succeed. This implies that the nonblocking_pool
> remains unseeded and yet getrandom returns data to the caller.

I don't understand what this means. For my use case, hwrng is fed from
the host's urandom, so none of /dev/random, /dev/hwrng, /dev/urandom,
or getrandom with any flags in the guest should ever block except
possibly for very large amounts requested (megabytes at least).

> > That, however, is not the behavior I observed, which is that reading
> > any amount from /dev/random will never block (since it is fed
> > from /dev/urandom on the host side) whereas calling getrandom will
> > always block unless /dev/random is read from first.  
> 
> That is a different issue that I did not read from your initial
> explanation.
> 
> I need to look into it a bit deeper.

I have been trying to explain the same problem the entire time. Let me
be clear what the problem is as I see it:

When qemu is started with -object rng-random,filename=/dev/urandom, and
immediately (i.e. with no initrd and as the first thing in init):

1. the guest runs dd if=/dev/random, there is no blocking and tons of
data goes to the screen. the data appears to be random.

2. the guest runs getrandom with any requested amount (tested 1 byte
and 16 bytes) and no flags, it blocks for 90-110 seconds while the
"non-blocking pool is initialized". the returned data appears to be
random.

3. the guest runs getrandom with GRND_RANDOM with any requested amount,
it returns the desired amount or possibly less, but in my experience at
least 10 bytes. the returned data appears to be random.

I believe that the difference between cases 1 and 2 is a bug, since
based on my previous statement, in this scenario, getrandom should
never block.

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Stephan Mueller @ 2016-07-29 17:03 UTC (permalink / raw)
  To: Alex Xu; +Cc: Linux Crypto Mailing List, virtualization
In-Reply-To: <20160729101407.03123327.alex_y_xu@yahoo.ca>

Am Freitag, 29. Juli 2016, 10:14:07 CEST schrieb Alex Xu:

Hi Alex,

> On Fri, 29 Jul 2016 15:12:30 +0200
> 
> Stephan Mueller <smueller@chronox.de> wrote as excerpted:
> > Am Freitag, 29. Juli 2016, 09:03:45 CEST schrieb Alex Xu:
> > > In my opinion, assuming I am not doing something terribly wrong,
> > > this constitutes a bug in the kernel's handling of getrandom calls
> > > at boot, possibly only when the primary source of entropy is
> > > virtio.
> > 
> > Nope, I do not think that this is true:
> > 
> > - /dev/random returns one byte for one byte of entropy received, but
> > it has a lower limit of 64 bits
> > 
> > - getrandom behaves like /dev/urandom (i.e. nonblocking) except
> > during boot where it waits until the RNG has collected 128 bits
> > before operating like a DRNG that is seeded once in a while when
> > entropy comes in.
> > 
> > 
> > Ciao
> > Stephan
> 
> I don't follow. Assuming you are correct and this is the issue, then
> reading 128 bits (16 bytes) from /dev/random should "exhaust the
> supply" and then both reads from /dev/random and calling getrandom
> should block.

You assume that getrandom works like /dev/random. This is not the case. It is 
a full deterministic RNG like /dev/urandom (which is seeded during its 
operation as entropy is available).

getrandom *only* differs from /dev/*u*random in that it waits initially such 
that the system collected 128 bits of entropy.

But you point to a real issue: when /dev/random is pulled before getrandom 
(and yet insufficient entropy is present), then the getrandom call will be 
woken up when the input_pool received 128 bits. But those 128 bits are fed 
from the input_pool to the blocking_pool based on the caller at the /dev/
random device. This implies that the reader for getrandom will NOT be able to 
obtain data from the input_pool and the nonblocking_pool because the transfer 
operation will not succeed. This implies that the nonblocking_pool remains 
unseeded and yet getrandom returns data to the caller.
> 
> That, however, is not the behavior I observed, which is that reading
> any amount from /dev/random will never block (since it is fed
> from /dev/urandom on the host side) whereas calling getrandom will
> always block unless /dev/random is read from first.

That is a different issue that I did not read from your initial explanation.

I need to look into it a bit deeper.
> 
> Moreover, as long as virtio-rng is available (and fed
> from /dev/urandom), /proc/sys/kernel/random/entropy_avail is always 961
> immediately after booting, which is more than enough to satisfy a
> one-byte read. After reading 1 byte, the estimate decreases to 896 or
> 897, but after reading 29 more bytes it increases to 1106.
> 
> Again, these observations are consistent with the conjecture that the
> issue arises since virtio-rng is a "pull" source of entropy whereas
> most other methods (e.g. interrupt timing) are "push" sources. I
> suspect that a similar issue occurs if RDRAND is the only source of
> entropy.
> 
> I also tried running rngd in the guest which resolved the issue but
> seems entirely stupid to me, even moreso since
> http://rhelblog.redhat.com/2015/03/09/red-hat-enterprise-linux-virtual-machi
> nes-access-to-random-numbers-made-easy/ says that "The use of rngd is now
> not required and the guest kernel itself fetches entropy from the host when
> the available entropy falls below a specific threshold.".

right -- the kernel has now an in-kernel link that makes rngd superflowous in 
this case.



Ciao
Stephan

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Alex Xu via Virtualization @ 2016-07-29 14:14 UTC (permalink / raw)
  To: Stephan Mueller; +Cc: Linux Crypto Mailing List, virtualization
In-Reply-To: <2622345.NpnZjxROFX@tauon.atsec.com>

On Fri, 29 Jul 2016 15:12:30 +0200
Stephan Mueller <smueller@chronox.de> wrote as excerpted:
> Am Freitag, 29. Juli 2016, 09:03:45 CEST schrieb Alex Xu:
> > In my opinion, assuming I am not doing something terribly wrong,
> > this constitutes a bug in the kernel's handling of getrandom calls
> > at boot, possibly only when the primary source of entropy is
> > virtio.  
> 
> Nope, I do not think that this is true:
> 
> - /dev/random returns one byte for one byte of entropy received, but
> it has a lower limit of 64 bits
> 
> - getrandom behaves like /dev/urandom (i.e. nonblocking) except
> during boot where it waits until the RNG has collected 128 bits
> before operating like a DRNG that is seeded once in a while when
> entropy comes in.
> 
> 
> Ciao
> Stephan

I don't follow. Assuming you are correct and this is the issue, then
reading 128 bits (16 bytes) from /dev/random should "exhaust the
supply" and then both reads from /dev/random and calling getrandom
should block.

That, however, is not the behavior I observed, which is that reading
any amount from /dev/random will never block (since it is fed
from /dev/urandom on the host side) whereas calling getrandom will
always block unless /dev/random is read from first.

Moreover, as long as virtio-rng is available (and fed
from /dev/urandom), /proc/sys/kernel/random/entropy_avail is always 961
immediately after booting, which is more than enough to satisfy a
one-byte read. After reading 1 byte, the estimate decreases to 896 or
897, but after reading 29 more bytes it increases to 1106.

Again, these observations are consistent with the conjecture that the
issue arises since virtio-rng is a "pull" source of entropy whereas
most other methods (e.g. interrupt timing) are "push" sources. I
suspect that a similar issue occurs if RDRAND is the only source of
entropy.

I also tried running rngd in the guest which resolved the issue but
seems entirely stupid to me, even moreso since
http://rhelblog.redhat.com/2015/03/09/red-hat-enterprise-linux-virtual-machines-access-to-random-numbers-made-easy/
says that "The use of rngd is now not required and the guest kernel
itself fetches entropy from the host when the available entropy falls
below a specific threshold.".

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Stephan Mueller @ 2016-07-29 13:12 UTC (permalink / raw)
  To: Alex Xu; +Cc: Linux Crypto Mailing List, Nikos Mavrogiannopoulos,
	virtualization
In-Reply-To: <20160729090345.798c3e6f.alex_y_xu@yahoo.ca>

Am Freitag, 29. Juli 2016, 09:03:45 CEST schrieb Alex Xu:

Hi Alex,

> On Fri, 29 Jul 2016 12:24:27 +0200
> 
> Nikos Mavrogiannopoulos <nmav@gnutls.org> wrote:
> > On Fri, Jul 29, 2016 at 7:40 AM, Stephan Mueller
> > 
> > <smueller@chronox.de> wrote:
> > > And finally, you have a coding error that is very very common but
> > > fatal when reading from /dev/random: you do not account for short
> > > reads which implies that your loop continues even in the case of
> > > short reads.
> > > 
> > > Fix your code with something like the following:
> > > int read_random(char *buf, size_t buflen)
> > > {
> > > 
> > >         int fd = 0;
> > >         ssize_t ret = 0;
> > >         size_t len = 0;
> > >         
> > >         fd = open("/dev/random", O_RDONLY|O_CLOEXEC);
> > >         if(0 > fd)
> > >         
> > >                 return fd;
> > >         
> > >         do {
> > >         
> > >                 ret = read(fd, (buf + len), (buflen - len));
> > >                 if (0 < ret)
> > >                 
> > >                         len += ret;
> > >         
> > >         } while ((0 < ret || EINTR == errno || ERESTART == errno)
> > >         
> > >                  && buflen > len);
> > 
> > Unless there is a documentation error, the same is required when using
> > getrandom(). It can also return short as well as to be interrupted.
> > 
> > regards,
> > Nikos
> 
> I am aware that (according to the documentation) both random(4) and
> getrandom(2) may not return the full size of the read. However, that is
> (as far as I know) not relevant to the point that I am making.
> 
> What I am saying is that based on my understanding of random(4) and
> getrandom(2), at boot, given the same buffer size, reading
> from /dev/random should have the same behavior as calling getrandom
> passing no flags.

/dev/random can return after at least 64 bits received in the input_pool 
whereas getrandom waits for 128 bits.
> 
> The buffer size can also be set to 1 with similar results, but the
> iteration number for success must be then increased to a large number.
> IME 30 worked consistently while 29 hung; your results may vary.
> 
> The interesting thing is though, if GRND_RANDOM is passed to getrandom,
> then it does not hang and returns 1 byte immediately (whether or not
> GRND_NONBLOCK is set).

Sure, because there is one byte in the input_pool at the time user space 
boots. Note again, /dev/random waits until having 64 bits.

> 
> 1, 2..29: reads all return 1 byte, getrandom pauses for 90-110 secs then
> returns 1 byte
> 30+: reads all return 1 byte, getrandom immediately returns 1 byte
> -r 0: getrandom immediately returns 1 byte
> -r 1, -r 2, -r 128, -r 256: reads all return 1 byte, getrandom
> immediately returns 1 byte
> 
I would say that this is expected.

> Moving the open and close calls outside of the loop produces the same
> results. Writing 4096 bytes to /dev/urandom also has no effect.

Sure, it does not update the input_pool. Only the IOCTL can update the 
input_pool from user space.
> 
> In my opinion, assuming I am not doing something terribly wrong, this
> constitutes a bug in the kernel's handling of getrandom calls at boot,
> possibly only when the primary source of entropy is virtio.

Nope, I do not think that this is true:

- /dev/random returns one byte for one byte of entropy received, but it has a 
lower limit of 64 bits

- getrandom behaves like /dev/urandom (i.e. nonblocking) except during boot 
where it waits until the RNG has collected 128 bits before operating like a 
DRNG that is seeded once in a while when entropy comes in.


Ciao
Stephan

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Alex Xu via Virtualization @ 2016-07-29 13:03 UTC (permalink / raw)
  To: Nikos Mavrogiannopoulos
  Cc: Stephan Mueller, Linux Crypto Mailing List, virtualization
In-Reply-To: <CAJU7zaL8G28chcwEEYAquApm2ncPaBjKky4UPaWVy=6B+-rsCA@mail.gmail.com>

On Fri, 29 Jul 2016 12:24:27 +0200
Nikos Mavrogiannopoulos <nmav@gnutls.org> wrote:

> On Fri, Jul 29, 2016 at 7:40 AM, Stephan Mueller
> <smueller@chronox.de> wrote:
> > And finally, you have a coding error that is very very common but
> > fatal when reading from /dev/random: you do not account for short
> > reads which implies that your loop continues even in the case of
> > short reads.
> >
> > Fix your code with something like the following:
> > int read_random(char *buf, size_t buflen)
> > {
> >         int fd = 0;
> >         ssize_t ret = 0;
> >         size_t len = 0;
> >
> >         fd = open("/dev/random", O_RDONLY|O_CLOEXEC);
> >         if(0 > fd)
> >                 return fd;
> >         do {
> >                 ret = read(fd, (buf + len), (buflen - len));
> >                 if (0 < ret)
> >                         len += ret;
> >         } while ((0 < ret || EINTR == errno || ERESTART == errno)
> >                  && buflen > len);  
> 
> Unless there is a documentation error, the same is required when using
> getrandom(). It can also return short as well as to be interrupted.
> 
> regards,
> Nikos

I am aware that (according to the documentation) both random(4) and
getrandom(2) may not return the full size of the read. However, that is
(as far as I know) not relevant to the point that I am making.

What I am saying is that based on my understanding of random(4) and
getrandom(2), at boot, given the same buffer size, reading
from /dev/random should have the same behavior as calling getrandom
passing no flags.

The buffer size can also be set to 1 with similar results, but the
iteration number for success must be then increased to a large number.
IME 30 worked consistently while 29 hung; your results may vary.

The interesting thing is though, if GRND_RANDOM is passed to getrandom,
then it does not hang and returns 1 byte immediately (whether or not
GRND_NONBLOCK is set).

The following revised program demonstrates this:

#include <fcntl.h>
#include <linux/random.h>
#include <stdlib.h>
#include <string.h>
#include <syscall.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    char buf[1];
    int gr_flags;
    const char *iters;

    if (!strcmp(argv[1], "-r")) {
        gr_flags = GRND_RANDOM;
        iters = argv[2];
    } else {
        gr_flags = 0;
        iters = argv[1];
    }

    for (int i = 0; i < atoi(iters); i++) {
        int fd;
        if ((fd = open("/dev/random", O_RDONLY)) == -1)
            return 2;

        if (read(fd, buf, 1) != 1)
            return 3;

        if (close(fd) == -1)
            return 4;
    }

    if (syscall(SYS_getrandom, buf, 1, gr_flags) != 1)
        return 5;

    return 0;
}

Again, making the buffer size 1 resolves the complaint regarding short
reads.

With the same command line as my original email, running this in QEMU
results in:

1, 2..29: reads all return 1 byte, getrandom pauses for 90-110 secs then
returns 1 byte
30+: reads all return 1 byte, getrandom immediately returns 1 byte
-r 0: getrandom immediately returns 1 byte
-r 1, -r 2, -r 128, -r 256: reads all return 1 byte, getrandom
immediately returns 1 byte

Moving the open and close calls outside of the loop produces the same
results. Writing 4096 bytes to /dev/urandom also has no effect.

In my opinion, assuming I am not doing something terribly wrong, this
constitutes a bug in the kernel's handling of getrandom calls at boot,
possibly only when the primary source of entropy is virtio.

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Nikos Mavrogiannopoulos @ 2016-07-29 10:24 UTC (permalink / raw)
  To: Stephan Mueller; +Cc: Linux Crypto Mailing List, Alex Xu, virtualization
In-Reply-To: <2481163.nONN48TG9I@tauon.atsec.com>

On Fri, Jul 29, 2016 at 7:40 AM, Stephan Mueller <smueller@chronox.de> wrote:
> And finally, you have a coding error that is very very common but fatal when
> reading from /dev/random: you do not account for short reads which implies
> that your loop continues even in the case of short reads.
>
> Fix your code with something like the following:
> int read_random(char *buf, size_t buflen)
> {
>         int fd = 0;
>         ssize_t ret = 0;
>         size_t len = 0;
>
>         fd = open("/dev/random", O_RDONLY|O_CLOEXEC);
>         if(0 > fd)
>                 return fd;
>         do {
>                 ret = read(fd, (buf + len), (buflen - len));
>                 if (0 < ret)
>                         len += ret;
>         } while ((0 < ret || EINTR == errno || ERESTART == errno)
>                  && buflen > len);

Unless there is a documentation error, the same is required when using
getrandom(). It can also return short as well as to be interrupted.

regards,
Nikos

^ permalink raw reply

* Re: [PATCH v3] virtio_blk: Fix a slient kernel panic
From: Stefan Hajnoczi @ 2016-07-29  8:26 UTC (permalink / raw)
  To: Minfei Huang
  Cc: Minfei Huang, fanc.fnst, linux-kernel, Linux Virtualization,
	Michael S. Tsirkin, Minfei Huang
In-Reply-To: <1468902762-43268-1-git-send-email-mnfhuang@gmail.com>

On Tue, Jul 19, 2016 at 5:32 AM, Minfei Huang <mnfhuang@gmail.com> wrote:
> From: Minfei Huang <mnghuan@gmail.com>
>
> We do a lot of memory allocation in function init_vq, and don't handle
> the allocation failure properly. Then this function will return 0,
> although initialization fails due to lacking memory. At that moment,
> kernel will panic in guest machine, if virtio is used to drive disk.
>
> To fix this bug, we should take care of allocation failure, and return
> correct value to let caller know what happen.
>
> Tested-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
> Signed-off-by: Minfei Huang <minfei.hmf@alibaba-inc.com>
> Signed-off-by: Minfei Huang <mnghuan@gmail.com>
> ---
> v2:
> - Remove useless initialisation to NULL
> v1:
> - Refactor the patch to make code more readable
> ---
>  drivers/block/virtio_blk.c | 26 ++++++++------------------
>  1 file changed, 8 insertions(+), 18 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

^ permalink raw reply

* Re: getrandom waits for a long time when /dev/random is insufficiently read from
From: Stephan Mueller @ 2016-07-29  5:40 UTC (permalink / raw)
  To: Alex Xu; +Cc: linux-crypto, virtualization
In-Reply-To: <20160728180732.12d38880@alex-desktop>

Am Donnerstag, 28. Juli 2016, 18:07:32 CEST schrieb Alex Xu:

Hi Alex,

> Linux 4.6, also tried 4.7, qemu 2.6, using this C program:

I am not sure what problem you are referring to, but that is an expected 
behavior.

You get partial reads when reading from /dev/random with a minimum of 64 
bits. On the other hand getrandom(2) is woken up after the input_pool 
received 128 bits of entropy.

In you strace you see that after reading 16 bytes from /dev/random, the 
getrandom unblocks and starts delivering.

Note, in virtualized environments the current Linux /dev/random 
implementation collects massively less entropy compared to a bare-metal 
system. Hence the long wait time of your 90 to 100 secs until getrandom 
unblocks.

Besides, even without reading from /dev/random, your getrandom will wait that 
long.

And finally, you have a coding error that is very very common but fatal when 
reading from /dev/random: you do not account for short reads which implies 
that your loop continues even in the case of short reads.

Fix your code with something like the following:

int read_random(char *buf, size_t buflen)
{
        int fd = 0;
        ssize_t ret = 0;
        size_t len = 0;

        fd = open("/dev/random", O_RDONLY|O_CLOEXEC);
        if(0 > fd)
                return fd;
        do {
                ret = read(fd, (buf + len), (buflen - len));
                if (0 < ret)
                        len += ret;
        } while ((0 < ret || EINTR == errno || ERESTART == errno)
                 && buflen > len);

...

Ciao
Stephan

^ permalink raw reply

* [PATCH v5] virtio: new feature to detect IOMMU device quirk
From: Michael S. Tsirkin @ 2016-07-29  3:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Christoph Hellwig, virtualization

The interaction between virtio and IOMMUs is messy.

On most systems with virtio, physical addresses match bus addresses,
and it doesn't particularly matter which one we use to program
the device.

On some systems, including Xen and any system with a physical device
that speaks virtio behind a physical IOMMU, we must program the IOMMU
for virtio DMA to work at all.

On other systems, including SPARC and PPC64, virtio-pci devices are
enumerated as though they are behind an IOMMU, but the virtio host
ignores the IOMMU, so we must either pretend that the IOMMU isn't
there or somehow map everything as the identity.

Add a feature bit to detect that quirk: VIRTIO_F_IOMMU_PLATFORM.

Any device with this feature bit set to 0 needs a quirk and has to be
passed physical addresses (as opposed to bus addresses) even though
the device is behind an IOMMU.

Note: it has to be a per-device quirk because for example, there could
be a mix of passed-through and virtual virtio devices. As another
example, some devices could be implemented by an out of process
hypervisor backend (in case of qemu vhost, or vhost-user) and so support
for an IOMMU needs to be coded up separately.

It would be cleanest to handle this in IOMMU core code, but that needs
per-device DMA ops. While we are waiting for that to be implemented, use
a work-around in virtio core.

Note: a "noiommu" feature is a quirk - add a wrapper to make
that clear.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Changes from v4:
        Clarify that iommu quirk is the feature, even though we use
	register value 0 to detect this (for compatibility).

Changes from v3:
        In response to Christoph's comments, add TODO so we remember to
        move this into DMA core.  Definitely something we want to do but doesn't
        seem practical for this release cycle.


 include/linux/virtio_config.h      | 13 +++++++++++++
 include/uapi/linux/virtio_config.h | 10 +++++++++-
 drivers/virtio/virtio_ring.c       | 15 ++++++++++++++-
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 6e6cb0c..26c155b 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -149,6 +149,19 @@ static inline bool virtio_has_feature(const struct virtio_device *vdev,
 	return __virtio_test_bit(vdev, fbit);
 }
 
+/**
+ * virtio_has_iommu_quirk - determine whether this device has the iommu quirk
+ * @vdev: the device
+ */
+static inline bool virtio_has_iommu_quirk(const struct virtio_device *vdev)
+{
+	/*
+	 * Note the reverse polarity of the quirk feature (compared to most
+	 * other features), this is for compatibility with legacy systems.
+	 */
+	return !virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM);
+}
+
 static inline
 struct virtqueue *virtio_find_single_vq(struct virtio_device *vdev,
 					vq_callback_t *c, const char *n)
diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
index 4cb65bb..308e209 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		33
+#define VIRTIO_TRANSPORT_F_END		34
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -63,4 +63,12 @@
 /* v1.0 compliant. */
 #define VIRTIO_F_VERSION_1		32
 
+/*
+ * If clear - device has the IOMMU bypass quirk feature.
+ * If set - use platform tools to detect the IOMMU.
+ *
+ * Note the reverse polarity (compared to most other features),
+ * this is for compatibility with legacy systems.
+ */
+#define VIRTIO_F_IOMMU_PLATFORM		33
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index ca6bfdd..114a0c8 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -117,7 +117,10 @@ struct vring_virtqueue {
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
 /*
- * The interaction between virtio and a possible IOMMU is a mess.
+ * Modern virtio devices have feature bits to specify whether they need a
+ * quirk and bypass the IOMMU. If not there, just use the DMA API.
+ *
+ * If there, the interaction between virtio and DMA API is messy.
  *
  * On most systems with virtio, physical addresses match bus addresses,
  * and it doesn't particularly matter whether we use the DMA API.
@@ -133,10 +136,18 @@ struct vring_virtqueue {
  *
  * For the time being, we preserve historic behavior and bypass the DMA
  * API.
+ *
+ * TODO: install a per-device DMA ops structure that does the right thing
+ * taking into account all the above quirks, and use the DMA API
+ * unconditionally on data path.
  */
 
 static bool vring_use_dma_api(struct virtio_device *vdev)
 {
+	if (!virtio_has_iommu_quirk(vdev))
+		return true;
+
+	/* Otherwise, we are left to guess. */
 	/*
 	 * In theory, it's possible to have a buggy QEMU-supposed
 	 * emulated Q35 IOMMU and Xen enabled at the same time.  On
@@ -1099,6 +1110,8 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_F_VERSION_1:
 			break;
+		case VIRTIO_F_IOMMU_PLATFORM:
+			break;
 		default:
 			/* We don't understand this bit. */
 			__virtio_clear_bit(vdev, i);
-- 
MST

^ permalink raw reply related

* RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-29  1:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	Hansen, Dave, qemu-devel@nongnu.org, Amit Shah,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Vlastimil Babka, Paolo Bonzini, Andrew Morton, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160728221533.GA789@redhat.com>

> > > On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > > > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > > > +	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > > > +	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > > > +	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > > > +		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > > > +	hdr_len = sizeof(struct balloon_bmap_hdr);
> > > > > +	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len,
> GFP_KERNEL);
> > > >
> > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > How big was the pfn buffer before?
> > >
> > >
> > > Yes I would limit this to 1G memory in a go, will result in a 32KByte bitmap.
> > >
> > > --
> > > MST
> >
> > Limit to 1G is bad for the performance, I sent you the test result several
> weeks ago.
> >
> > Paste it bellow:
> > ----------------------------------------------------------------------
> > --------------------------------------------------
> > About the size of page bitmap, I have test the performance of filling
> > the balloon to 15GB with a  16GB RAM VM.
> >
> > ===============================
> > 32K Byte (cover 1GB of RAM)
> >
> > Time spends on inflating: 2031ms
> > ---------------------------------------------
> > 64K Byte (cover 2GB of RAM)
> >
> > Time spends on inflating: 1507ms
> > --------------------------------------------
> > 512K Byte (cover 16GB of RAM)
> >
> > Time spends on inflating: 1237ms
> > ================================
> >
> > If possible, a big bitmap is better for performance.
> >
> > Liang
> 
> Earlier you said:
> a. allocating pages (6.5%)
> b. sending PFNs to host (68.3%)
> c. address translation (6.1%)
> d. madvise (19%)
> 
> Here sending PFNs to host with 512K Byte map should be almost free.
> 
> So is something else taking up the time?
> 
I just want to show you the benefits of using a big bitmap. :)
I did not measure the time spend on each stage after optimization(I will do it later),
but I have tried to allocate the page with big chunk and found it can make things faster.
Without allocating big chunk page, the performance improvement is about 85%, and with
 allocating big  chunk page, the improvement is about 94%.

Liang

> 
> --
> MST

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-29  0:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	Hansen, Dave, qemu-devel@nongnu.org, Amit Shah,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Vlastimil Babka, Paolo Bonzini, Andrew Morton, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160729003759-mutt-send-email-mst@kernel.org>

> On Thu, Jul 28, 2016 at 06:36:18AM +0000, Li, Liang Z wrote:
> > > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > > How big was the pfn buffer before?
> > > >
> > > > Yes, it is if the max pfn is more than 32GB.
> > > > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's
> > > > too small, and it's the main reason for bad performance.
> > > > Use the max 1MB kmalloc is a balance between performance and
> > > > flexibility, a large page bitmap covers the range of all the
> > > > memory is no good for a system with huge amount of memory. If the
> > > > bitmap is too small, it means we have to traverse a long list for
> > > > many times, and it's bad
> > > for performance.
> > > >
> > > > Thanks!
> > > > Liang
> > >
> > > There are all your implementation decisions though.
> > >
> > > If guest memory is so fragmented that you only have order 0 4k
> > > pages, then allocating a huge 1M contigious chunk is very problematic in
> and of itself.
> > >
> >
> > The memory is allocated in the probe stage. This will not happen if
> > the driver is  loaded when booting the guest.
> >
> > > Most people rarely migrate and do not care how fast that happens.
> > > Wasting a large chunk of memory (and it's zeroed for no good reason,
> > > so you actually request host memory for it) for everyone to speed it
> > > up when it does happen is not really an option.
> > >
> > If people don't plan to do inflating/deflating, they should not enable
> > the virtio-balloon at the beginning, once they decide to use it, the
> > driver should provide better performance as much as possible.
> 
> The reason people inflate/deflate is so they can overcommit memory.
> Do they need to overcommit very quickly? I don't see why.
> So let's get what we can for free but I don't really believe people would want
> to pay for it.
> 
> > 1MB is a very small portion for a VM with more than 32GB memory and
> > it's the *worst case*, for VM with less than 32GB memory, the amount
> > of RAM depends on VM's memory size and will be less than 1MB.
> 
> It's guest memmory so might all be in swap and never touched, your memset
> at probe time will fault it in and make hypervisor actually pay for it.
> 
> > If 1MB is too big, how about 512K, or 256K?  32K seems too small.
> >
> > Liang
> 
> It's only small because it makes you rescan the free list.
> So maybe you should do something else.
> I looked at it a bit. Instead of scanning the free list, how about scanning actual
> page structures? If page is unused, pass it to host.
> Solves the problem of rescanning multiple times, does it not?
> 

Yes, agree.
> 
> Another idea: allocate a small bitmap at probe time (e.g. for deflate), allocate
> a bunch more on each request. Use something like GFP_ATOMIC and a
> scatter/gather, if that fails use the smaller bitmap.
> 

So, the aim of v3 is to use a smaller bitmap without too heavy performance penalty.
Thanks a lot!

Liang

^ permalink raw reply

* RE: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Li, Liang Z @ 2016-07-29  0:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <20160729011553-mutt-send-email-mst@kernel.org>

> On Thu, Jul 28, 2016 at 03:06:37AM +0000, Li, Liang Z wrote:
> > > > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page
> > > > +bitmap
> > > > + * to prevent a very large page bitmap, there are two reasons for this:
> > > > + * 1) to save memory.
> > > > + * 2) allocate a large bitmap may fail.
> > > > + *
> > > > + * The actual limit of pfn is determined by:
> > > > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > > > + *
> > > > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we
> > > > +will scan
> > > > + * the page list and send the PFNs with several times. To reduce
> > > > +the
> > > > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > > > +should
> > > > + * be set with a value which can cover most cases.
> > >
> > > So what if it covers 1/32 of the memory? We'll do 32 exits and not
> > > 1, still not a big deal for a big guest.
> > >
> >
> > The issue here is the overhead is too high for scanning the page list for 32
> times.
> > Limit the page bitmap size to a fixed value is better for a big guest?
> >
> 
> I'd say avoid scanning free lists completely. Scan pages themselves and check
> the refcount to see whether they are free.
> This way each page needs to be tested once.
> 
> And skip the whole optimization if less than e.g. 10% is free.

That's better than rescanning the free list. Will change in next version.

Thanks!

Liang

^ permalink raw reply

* Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Michael S. Tsirkin @ 2016-07-28 22:17 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E04213E1D@shsmsx102.ccr.corp.intel.com>

On Thu, Jul 28, 2016 at 03:06:37AM +0000, Li, Liang Z wrote:
> > > + * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
> > > + * to prevent a very large page bitmap, there are two reasons for this:
> > > + * 1) to save memory.
> > > + * 2) allocate a large bitmap may fail.
> > > + *
> > > + * The actual limit of pfn is determined by:
> > > + * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
> > > + *
> > > + * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will
> > > +scan
> > > + * the page list and send the PFNs with several times. To reduce the
> > > + * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT
> > > +should
> > > + * be set with a value which can cover most cases.
> > 
> > So what if it covers 1/32 of the memory? We'll do 32 exits and not 1, still not a
> > big deal for a big guest.
> > 
> 
> The issue here is the overhead is too high for scanning the page list for 32 times.
> Limit the page bitmap size to a fixed value is better for a big guest?
> 

I'd say avoid scanning free lists completely. Scan pages themselves and
check the refcount to see whether they are free.
This way each page needs to be tested once.

And skip the whole optimization if less than e.g. 10% is free.

> > > + */
> > > +#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >>
> > PAGE_SHIFT)
> > > +/* 32GB */
> > 
> > I already said this with a smaller limit.
> > 
> > 	2<< 30  is 2G but that is not a useful comment.
> > 	pls explain what is the reason for this selection.
> > 
> > Still applies here.
> > 
> 
> I will add the comment for this.
> 
> > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > +	if (virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_PAGE_BITMAP)) {
> > > +		struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
> > > +		unsigned long bmap_len;
> > > +
> > > +		/* cmd and req_id are not used here, set them to 0 */
> > > +		hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
> > > +		hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > > +		hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
> > > +		hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
> > 
> > no need to swap 0, just fill it in. in fact you allocated all 0s so no need to touch
> > these fields at all.
> > 
> 
> Will change in v3.
> 
> > > @@ -489,7 +612,7 @@ static int virtballoon_migratepage(struct
> > > balloon_dev_info *vb_dev_info,  static int virtballoon_probe(struct
> > > virtio_device *vdev)  {
> > >  	struct virtio_balloon *vb;
> > > -	int err;
> > > +	int err, hdr_len;
> > >
> > >  	if (!vdev->config->get) {
> > >  		dev_err(&vdev->dev, "%s failure: config access disabled\n",
> > @@
> > > -508,6 +631,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > >  	spin_lock_init(&vb->stop_update_lock);
> > >  	vb->stop_update = false;
> > >  	vb->num_pages = 0;
> > > +	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > +	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > +	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > +		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > 
> > What are these 2 longs in aid of?
> > 
> The rounddown(vb->start_pfn,  BITS_PER_LONG) and roundup(vb->end_pfn, BITS_PER_LONG) 
> may cause (vb->end_pfn - vb->start_pfn) > vb->pfn_limit, so we need extra space to save the
> bitmap for this case. 2 longs are enough.
> 
> > > +	hdr_len = sizeof(struct balloon_bmap_hdr);
> > > +	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> > 
> > So it can go up to 1MByte but adding header size etc you need a higher order
> > allocation. This is a waste, there is no need to have a power of two allocation.
> > Start from the other side. Say "I want to allocate 32KBytes for the bitmap".
> > Subtract the header and you get bitmap size.
> > Calculate the pfn limit from there.
> > 
> 
> Indeed, will change. Thanks a lot!
> 
> Liang

^ permalink raw reply

* Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Michael S. Tsirkin @ 2016-07-28 22:15 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	Hansen, Dave, qemu-devel@nongnu.org, Amit Shah,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Vlastimil Babka, Paolo Bonzini, Andrew Morton, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E04213E5D@shsmsx102.ccr.corp.intel.com>

On Thu, Jul 28, 2016 at 03:30:09AM +0000, Li, Liang Z wrote:
> > Subject: Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate
> > process
> > 
> > On Wed, Jul 27, 2016 at 09:03:21AM -0700, Dave Hansen wrote:
> > > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > > +	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
> > > > +	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
> > > > +	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
> > > > +		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
> > > > +	hdr_len = sizeof(struct balloon_bmap_hdr);
> > > > +	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
> > >
> > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > How big was the pfn buffer before?
> > 
> > 
> > Yes I would limit this to 1G memory in a go, will result in a 32KByte bitmap.
> > 
> > --
> > MST
> 
> Limit to 1G is bad for the performance, I sent you the test result several weeks ago.
> 
> Paste it bellow:
> ------------------------------------------------------------------------------------------------------------------------
> About the size of page bitmap, I have test the performance of filling the balloon to 15GB with a
>  16GB RAM VM.
> 
> ===============================
> 32K Byte (cover 1GB of RAM)
> 
> Time spends on inflating: 2031ms
> ---------------------------------------------
> 64K Byte (cover 2GB of RAM)
> 
> Time spends on inflating: 1507ms
> --------------------------------------------
> 512K Byte (cover 16GB of RAM)
> 
> Time spends on inflating: 1237ms
> ================================
> 
> If possible, a big bitmap is better for performance.
> 
> Liang

Earlier you said:
a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

Here sending PFNs to host with 512K Byte map
should be almost free.

So is something else taking up the time?


-- 
MST

^ permalink raw reply

* getrandom waits for a long time when /dev/random is insufficiently read from
From: Alex Xu via Virtualization @ 2016-07-28 22:07 UTC (permalink / raw)
  To: linux-crypto, virtualization

Linux 4.6, also tried 4.7, qemu 2.6, using this C program:

#include <fcntl.h>
#include <stdlib.h>
#include <syscall.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    char buf[16];
    int fd;

    if (argc != 2)
        return 1;

    for (int i = 0; i < atoi(argv[1]); i++) {
        sleep(1);

        if ((fd = open("/dev/random", O_RDONLY)) == -1)
            return 2;

        if (read(fd, buf, sizeof(buf)) < 1)
            return 3;

        if (close(fd) == -1)
            return 4;
    }

    sleep(2);

    if (syscall(SYS_getrandom, buf, sizeof(buf), 0) == -1)
        return 5;

    return 0;
}

$ qemu-system-x86_64 -nodefaults -machine q35,accel=kvm -nographic -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 -kernel linux-4.7/arch/x86/boot/bzImage -fsdev local,path="$PWD/root",security_model=none,id=root -device virtio-9p-pci,fsdev=root,mount_tag=/dev/root -device virtio-serial -chardev stdio,id=stdio -device virtconsole,chardev=stdio -monitor none -append "root=/dev/root rw rootfstype=9p rootflags=trans=virtio console=hvc0 init=/strace /test 2"
execve("/test", ["/test", "2"], [/* 2 vars */]) = 0
arch_prctl(ARCH_SET_FS, 0x601098)       = 0
set_tid_address(0x6010d0)               = 29
nanosleep({1, 0}, 0x7ffcdb7ea6b0)       = 0
open("/dev/random", O_RDONLY)           = 3
read(3, "P'\333\362\352\247\212\272\357E?\343", 16) = 12
close(3)                                = 0
nanosleep({1, 0}, 0x7ffcdb7ea6b0)       = 0
open("/dev/random", O_RDONLY)           = 3
read(3, ">>9\252]\332T\322dL\203\231C\255\303\376", 16) = 16
close(3)                                = 0
nanosleep({2, 0}, 0x7ffcdb7ea6e0)       = 0
getrandom(<some time later>[   89.166661] random: nonblocking pool is initialized
"\217\0\206\220\36t\3\353\t\227\377\356\315\320\2452", 16, 0) = 16
exit_group(0)                           = ?
+++ exited with 0 +++

Identical command but replaced 2 iterations with 3:

$ qemu-system-x86_64 -nodefaults -machine q35,accel=kvm -nographic -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 -kernel linux-4.7/arch/x86/boot/bzImage -fsdev local,path="$PWD/root",security_model=none,id=root -device virtio-9p-pci,fsdev=root,mount_tag=/dev/root -device virtio-serial -chardev stdio,id=stdio -device virtconsole,chardev=stdio -monitor none -append "root=/dev/root rw rootfstype=9p rootflags=trans=virtio console=hvc0 init=/strace /test 3"
execve("/test", ["/test", "3"], [/* 2 vars */]) = 0
arch_prctl(ARCH_SET_FS, 0x601098)       = 0
set_tid_address(0x6010d0)               = 29
nanosleep({1, 0}, 0x7ffc9e13fb70)       = 0
open("/dev/random", O_RDONLY)           = 3
read(3, ">\202\264\350\226\364\364\320'-\200\16", 16) = 12
close(3)                                = 0
nanosleep({1, 0}, 0x7ffc9e13fb70)       = 0
open("/dev/random", O_RDONLY)           = 3
read(3, "\377:\2076\213q0E\307\377\\\234\217\"g\254", 16) = 16
close(3)                                = 0
nanosleep({1, 0}, 0x7ffc9e13fb70)       = 0
open("/dev/random", O_RDONLY)           = 3
read(3, [    3.312266] random: nonblocking pool is initialized
"O\2112g\375\25]\270\347\v\34XP", 16) = 13
close(3)                                = 0
nanosleep({2, 0}, 0x7ffc9e13fba0)       = 0
getrandom("\215\317\207/\324\6\300\216\332zN\351a\323\231\36", 16, 0) = 16
exit_group(0)                           = ?
+++ exited with 0 +++

(irrelevant kernel messages have been removed for clarity)

Removing the calls to "sleep" produces similar results except without
sleeping or the corresponding strace output. Running both commands
repeatedly also produces similar results; the timing of the getrandom
return and "random: nonblocking pool is initialized" message
is different for each run, but it always takes 90-100 seconds.

Sorry if these aren't the right lists or if this is a known issue.

Please CC me on replies.

^ permalink raw reply

* Re: [virtio-dev] Re: [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Michael S. Tsirkin @ 2016-07-28 21:51 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	Hansen, Dave, qemu-devel@nongnu.org, Amit Shah,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	Vlastimil Babka, Paolo Bonzini, Andrew Morton, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E04214103@shsmsx102.ccr.corp.intel.com>

On Thu, Jul 28, 2016 at 06:36:18AM +0000, Li, Liang Z wrote:
> > > > This ends up doing a 1MB kmalloc() right?  That seems a _bit_ big.
> > > > How big was the pfn buffer before?
> > >
> > > Yes, it is if the max pfn is more than 32GB.
> > > The size of the pfn buffer use before is 256*4 = 1024 Bytes, it's too
> > > small, and it's the main reason for bad performance.
> > > Use the max 1MB kmalloc is a balance between performance and
> > > flexibility, a large page bitmap covers the range of all the memory is
> > > no good for a system with huge amount of memory. If the bitmap is too
> > > small, it means we have to traverse a long list for many times, and it's bad
> > for performance.
> > >
> > > Thanks!
> > > Liang
> > 
> > There are all your implementation decisions though.
> > 
> > If guest memory is so fragmented that you only have order 0 4k pages, then
> > allocating a huge 1M contigious chunk is very problematic in and of itself.
> > 
> 
> The memory is allocated in the probe stage. This will not happen if the driver is
>  loaded when booting the guest.
> 
> > Most people rarely migrate and do not care how fast that happens.
> > Wasting a large chunk of memory (and it's zeroed for no good reason, so you
> > actually request host memory for it) for everyone to speed it up when it
> > does happen is not really an option.
> > 
> If people don't plan to do inflating/deflating, they should not enable the virtio-balloon
> at the beginning, once they decide to use it, the driver should provide better performance
> as much as possible.

The reason people inflate/deflate is so they can overcommit memory.
Do they need to overcommit very quickly? I don't see why.
So let's get what we can for free but I don't really believe
people would want to pay for it.

> 1MB is a very small portion for a VM with more than 32GB memory and it's the *worst case*, 
> for VM with less than 32GB memory, the amount of RAM depends on VM's memory size
> and will be less than 1MB.

It's guest memmory so might all be in swap and never touched,
your memset at probe time will fault it in and make hypervisor
actually pay for it.

> If 1MB is too big, how about 512K, or 256K?  32K seems too small.
> 
> Liang

It's only small because it makes you rescan the free list.
So maybe you should do something else.
I looked at it a bit. Instead of scanning the free list, how about
scanning actual page structures? If page is unused, pass it to host.
Solves the problem of rescanning multiple times, does it not?


Another idea: allocate a small bitmap at probe time (e.g. for deflate),
allocate a bunch more on each request. Use something like
GFP_ATOMIC and a scatter/gather, if that fails use the smaller bitmap.



> > --
> > MST
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply

* Re: [PATCH v2 repost 7/7] virtio-balloon: tell host vm's free page info
From: Michael S. Tsirkin @ 2016-07-28 21:37 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org, Amit Shah,
	qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Vlastimil Babka, Paolo Bonzini, Andrew Morton,
	virtualization@lists.linux-foundation.org, Mel Gorman,
	dgilbert@redhat.com
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E042141EE@shsmsx102.ccr.corp.intel.com>

On Thu, Jul 28, 2016 at 07:50:52AM +0000, Li, Liang Z wrote:
> > >  }
> > >
> > > +static void update_free_pages_stats(struct virtio_balloon *vb,
> > 
> > why _stats?
> 
> Will change.
> 
> > > +	max_pfn = get_max_pfn();
> > > +	mutex_lock(&vb->balloon_lock);
> > > +	while (pfn < max_pfn) {
> > > +		memset(vb->page_bitmap, 0, vb->bmap_len);
> > > +		ret = get_free_pages(pfn, pfn + vb->pfn_limit,
> > > +			vb->page_bitmap, vb->bmap_len * BITS_PER_BYTE);
> > > +		hdr->cmd = cpu_to_virtio16(vb->vdev,
> > BALLOON_GET_FREE_PAGES);
> > > +		hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
> > > +		hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
> > > +		hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
> > > +		bmap_len = vb->pfn_limit / BITS_PER_BYTE;
> > > +		if (!ret) {
> > > +			hdr->flag = cpu_to_virtio16(vb->vdev,
> > > +
> > 	BALLOON_FLAG_DONE);
> > > +			if (pfn + vb->pfn_limit > max_pfn)
> > > +				bmap_len = (max_pfn - pfn) /
> > BITS_PER_BYTE;
> > > +		} else
> > > +			hdr->flag = cpu_to_virtio16(vb->vdev,
> > > +
> > 	BALLOON_FLAG_CONT);
> > > +		hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
> > > +		sg_init_one(&sg_out, hdr,
> > > +			 sizeof(struct balloon_bmap_hdr) + bmap_len);
> > 
> > Wait a second. This adds the same buffer multiple times in a loop.
> > We will overwrite the buffer without waiting for hypervisor to process it.
> > What did I miss?
> 
> I am no quite sure about this part, I though the virtqueue_kick(vq) will prevent
> the buffer from overwrite, I realized it's wrong.
> 
> > > +
> > > +		virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
> > 
> > this can fail. you want to maybe make sure vq has enough space before you
> > use it or check error and wait.
> > 
> > > +		virtqueue_kick(vq);
> > 
> > why kick here within loop? wait until done. in fact kick outside lock is better
> > for smp.
> 
> I will change this part in v3.
> 
> > 
> > > +		pfn += vb->pfn_limit;
> > > +	static const char * const names[] = { "inflate", "deflate", "stats",
> > > +						 "misc" };
> > >  	int err, nvqs;
> > >
> > >  	/*
> > >  	 * We expect two virtqueues: inflate and deflate, and
> > >  	 * optionally stat.
> > >  	 */
> > > -	nvqs = virtio_has_feature(vb->vdev,
> > VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
> > > +		nvqs = 4;
> > 
> > Does misc vq depend on stats vq feature then? if yes please validate that.
> 
> Yes, what's you mean by 'validate' that?

Either handle misc vq without a stats vq, or
clear VIRTIO_BALLOON_F_MISC_VQ if stats vq is off.

> > 
> > 
> > > +	else
> > > +		nvqs = virtio_has_feature(vb->vdev,
> > > +					  VIRTIO_BALLOON_F_STATS_VQ) ? 3 :
> > 2;
> > 
> > Replace that ? with else too pls.
> 
> Will change.
> 
> Thanks!
> Liang

^ permalink raw reply

* Re: [PATCH v4] virtio: new feature to detect IOMMU device quirk
From: Michael S. Tsirkin @ 2016-07-28 21:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, virtualization
In-Reply-To: <20160728065918.GA22392@infradead.org>

On Wed, Jul 27, 2016 at 11:59:18PM -0700, Christoph Hellwig wrote:
> Again, this is still the wrong way around.  A "noiommu" feature is a
> quirk and should not be the default.

Christoph, I'm not sure what you mean by the default here.

We read a register from the device (bit 33 in the feature qword)
and act on it.

The specific register value is 0 on noiommu quirky devices
(it happened to be that way in the past),
and 1 on clean iommu devices.

 static bool vring_use_dma_api(struct virtio_device *vdev)
 {
+       if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
+               return true;
+
+       /* Otherwise, we are left to guess. */

As a hypothesis, do you object to use of virtio_has_feature?

Yes this might be confusing but in fact that
is just testing a cached register bit: at init time we read it:

        device_features = dev->config->get_features(dev);

        ....
        vdev->features = device_features

and later

            return vdev->features & BIT_ULL(fbit);

I'll add a comment clarifying that in the next version.

-- 
MST

^ permalink raw reply

* Re: [PATCH 6/7] qemu: Implement virtio-pstore device
From: Steven Rostedt @ 2016-07-28 14:38 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Anton Vorontsov, Kees Cook, kvm, Michael S. Tsirkin, qemu-devel,
	Radim Krčmář, LKML, virtualization, Minchan Kim,
	Tony Luck, Anthony Liguori, Colin Cross, Paolo Bonzini,
	Ingo Molnar
In-Reply-To: <20160728053953.GB4371@sejong>

On Thu, 28 Jul 2016 14:39:53 +0900
Namhyung Kim <namhyung@kernel.org> wrote:

> Well, I dont' know.  As you know, the kernel oops dump is already sent
> to serial device but it's rather slow.  As I wrote in the cover
> letter, enabling ftrace_dump_on_oops makes it even worse..  Also
> pstore saves the (compressed) binary data so I thought it'd be better
> to have a dedicated IO channel.

BTW, I agree with this. It is better to have a quick way to grab the
ftrace buffers when a crash happens, as serial is excruciatingly slow.
Although, currently I still use kexec/kdump, but as Namhyung said, it
depends on crash being up to date. I tend to be sending in updates
every time I have to use it.

-- Steve

^ permalink raw reply

* [RFC v6 6/6] VSOCK: Add Makefile and Kconfig
From: Stefan Hajnoczi @ 2016-07-28 14:36 UTC (permalink / raw)
  To: kvm
  Cc: marius vlad, Stefan Hajnoczi, Michael S. Tsirkin, netdev,
	Ian Campbell, Claudio Imbrenda, ggarcia, Matt Benjamin, Asias He,
	Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <1469716595-13591-1-git-send-email-stefanha@redhat.com>

From: Asias He <asias@redhat.com>

Enable virtio-vsock and vhost-vsock.

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v6:
 * Rename to virtio-vsock kernel modules to vmw_vsock_virtio_transport*
   instead of just virtio_transport to make the name clearer [Ian
   Campbell]
v4:
 * Make checkpatch.pl happy with longer option description
 * Clarify dependency on virtio rather than QEMU as suggested by Alex
   Bennee
v3:
 * Don't put vhost vsock driver into staging
 * Add missing Kconfig dependencies (Arnd Bergmann <arnd@arndb.de>)
---
 drivers/vhost/Kconfig  | 15 +++++++++++++++
 drivers/vhost/Makefile |  4 ++++
 net/vmw_vsock/Kconfig  | 20 ++++++++++++++++++++
 net/vmw_vsock/Makefile |  6 ++++++
 4 files changed, 45 insertions(+)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 533eaf0..d7aae9e 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -21,6 +21,21 @@ config VHOST_SCSI
 	Say M here to enable the vhost_scsi TCM fabric module
 	for use with virtio-scsi guests
 
+config VHOST_VSOCK
+	tristate "vhost virtio-vsock driver"
+	depends on VSOCKETS && EVENTFD
+	select VIRTIO_VSOCKETS_COMMON
+	select VHOST
+	select VHOST_RING
+	default n
+	---help---
+	This kernel module can be loaded in the host kernel to provide AF_VSOCK
+	sockets for communicating with guests.  The guests must have the
+	virtio_transport.ko driver loaded to use the virtio-vsock device.
+
+	To compile this driver as a module, choose M here: the module will be called
+	vhost_vsock.
+
 config VHOST_RING
 	tristate
 	---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index e0441c3..6b012b9 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -4,5 +4,9 @@ vhost_net-y := net.o
 obj-$(CONFIG_VHOST_SCSI) += vhost_scsi.o
 vhost_scsi-y := scsi.o
 
+obj-$(CONFIG_VHOST_VSOCK) += vhost_vsock.o
+vhost_vsock-y := vsock.o
+
 obj-$(CONFIG_VHOST_RING) += vringh.o
+
 obj-$(CONFIG_VHOST)	+= vhost.o
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index 14810ab..8831e7c 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -26,3 +26,23 @@ config VMWARE_VMCI_VSOCKETS
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called vmw_vsock_vmci_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS
+	tristate "virtio transport for Virtual Sockets"
+	depends on VSOCKETS && VIRTIO
+	select VIRTIO_VSOCKETS_COMMON
+	help
+	  This module implements a virtio transport for Virtual Sockets.
+
+	  Enable this transport if your Virtual Machine host supports Virtual
+	  Sockets over virtio.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called vmw_vsock_virtio_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS_COMMON
+	tristate
+	help
+	  This option is selected by any driver which needs to access
+	  the virtio_vsock.  The module will be called
+	  vmw_vsock_virtio_transport_common.
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 2ce52d7..bc27c70 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,7 +1,13 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
 
 vsock-y += af_vsock.o vsock_addr.o
 
 vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
 	vmci_transport_notify_qstate.o
+
+vmw_vsock_virtio_transport-y += virtio_transport.o
+
+vmw_vsock_virtio_transport_common-y += virtio_transport_common.o
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox