* [PATCH v5 6/6] hwmon: use smp_call_on_cpu() for dell-smm i8k
From: Juergen Gross @ 2016-04-06 14:17 UTC (permalink / raw)
To: linux-kernel, xen-devel
Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459952266-3687-1-git-send-email-jgross@suse.com>
Use the smp_call_on_cpu() function to call system management
mode on cpu 0.
Make call secure by adding get_online_cpus() to avoid e.g. suspend
resume cycles in between.
Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4: add call to get_online_cpus()
---
drivers/hwmon/dell-smm-hwmon.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/drivers/hwmon/dell-smm-hwmon.c b/drivers/hwmon/dell-smm-hwmon.c
index c43318d..c91bf35 100644
--- a/drivers/hwmon/dell-smm-hwmon.c
+++ b/drivers/hwmon/dell-smm-hwmon.c
@@ -21,6 +21,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/module.h>
#include <linux/types.h>
@@ -35,6 +36,7 @@
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/sched.h>
+#include <linux/smp.h>
#include <linux/i8k.h>
@@ -130,23 +132,15 @@ static inline const char *i8k_get_dmi_data(int field)
/*
* Call the System Management Mode BIOS. Code provided by Jonathan Buzzard.
*/
-static int i8k_smm(struct smm_regs *regs)
+static int i8k_smm_func(void *par)
{
int rc;
+ struct smm_regs *regs = par;
int eax = regs->eax;
- cpumask_var_t old_mask;
/* SMM requires CPU 0 */
- if (!alloc_cpumask_var(&old_mask, GFP_KERNEL))
- return -ENOMEM;
- cpumask_copy(old_mask, ¤t->cpus_allowed);
- rc = set_cpus_allowed_ptr(current, cpumask_of(0));
- if (rc)
- goto out;
- if (smp_processor_id() != 0) {
- rc = -EBUSY;
- goto out;
- }
+ if (smp_processor_id() != 0)
+ return -EBUSY;
#if defined(CONFIG_X86_64)
asm volatile("pushq %%rax\n\t"
@@ -204,13 +198,24 @@ static int i8k_smm(struct smm_regs *regs)
if (rc != 0 || (regs->eax & 0xffff) == 0xffff || regs->eax == eax)
rc = -EINVAL;
-out:
- set_cpus_allowed_ptr(current, old_mask);
- free_cpumask_var(old_mask);
return rc;
}
/*
+ * Call the System Management Mode BIOS.
+ */
+static int i8k_smm(struct smm_regs *regs)
+{
+ int ret;
+
+ get_online_cpus();
+ ret = smp_call_on_cpu(0, i8k_smm_func, regs, true);
+ put_online_cpus();
+
+ return ret;
+}
+
+/*
* Read the fan status.
*/
static int i8k_get_fan_status(int fan)
--
2.6.6
^ permalink raw reply related
* Re: [PATCH v5 3/6] smp: add function to execute a function synchronously on a cpu
From: Peter Zijlstra @ 2016-04-06 18:04 UTC (permalink / raw)
To: Juergen Gross
Cc: x86, jeremy, jdelvare, hpa, akataria, linux-kernel,
virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <1459952266-3687-4-git-send-email-jgross@suse.com>
On Wed, Apr 06, 2016 at 04:17:43PM +0200, Juergen Gross wrote:
> On some hardware models (e.g. Dell Studio 1555 laptop) some hardware
> related functions (e.g. SMIs) are to be executed on physical cpu 0
> only. Instead of open coding such a functionality multiple times in
> the kernel add a service function for this purpose. This will enable
> the possibility to take special measures in virtualized environments
> like Xen, too.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
Thanks!
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
^ permalink raw reply
* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
From: Minchan Kim @ 2016-04-07 2:27 UTC (permalink / raw)
To: Vlastimil Babka
Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
Sergey Senozhatsky, Chan Gyun Jeong, Jonathan Corbet,
Hugh Dickins, linux-kernel, virtualization, bfields, linux-mm,
Gioh Kim, Mel Gorman, Sangseok Lee, Andrew Morton, Joonsoo Kim,
koct9i, Al Viro
In-Reply-To: <57026782.3020201@suse.cz>
On Mon, Apr 04, 2016 at 03:09:22PM +0200, Vlastimil Babka wrote:
> On 04/04/2016 04:25 AM, Minchan Kim wrote:
> >>
> >>Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.
> >>
> >>In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned
> >>about doing this in general under page_lock and suggested that each
> >>user handles concurrent calls to isolate_page() internally. Might be
> >>more generic that way, even if all current implementers will
> >>actually use the page lock.
> >
> >We need PG_lock for two reasons.
> >
> >Firstly, it guarantees page's flags operation(i.e., PG_movable, PG_isolated)
> >atomicity. Another thing is for stability for page->mapping->a_ops.
> >
> >For example,
> >
> >isolate_migratepages_block
> > if (PageMovable(page))
> > isolate_movable_page
> > get_page_unless_zero <--- 1
> > trylock_page
> > page->mapping->a_ops->isolate_page <--- 2
> >
> >Between 1 and 2, driver can nullify page->mapping so we need PG_lock
>
> Hmm I see, that really doesn't seem easily solvable without page_lock.
> My idea is that compaction code would just check PageMovable() and
> PageIsolated() to find a candidate.
> page->mapping->a_ops->isolate_page would do the driver-specific
> necessary locking, revalidate if the page state and succeed
> isolation, or fail. It would need to handle the possibility that the
So you mean that VM can try to isolate false-positive page of the driver?
I don't think it's a good idea. For handling that, every driver should
keep some logics to handle such false-positive which needs each own
data structure or something to remember the page passed from VM
is valid or not. It makes driver's logic more complicated and need
more codes to handle it. It's not a good deal.
> page already doesn't belong to the mapping, which is probably not a
> problem. But what if the driver is a module that was already
> unloaded, and even though we did NULL-check each part from page to
> isolate_page, it points to a function that's already gone? That
> would need some extra handling to prevent that, hm...
Yes, driver should clean up pages is is using. For it, we need some lock.
I think page_lock is good for it because we are migrating *page* and page_lock
have been used it for a long time in migration path.
^ permalink raw reply
* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
From: Minchan Kim @ 2016-04-07 2:35 UTC (permalink / raw)
To: Vlastimil Babka
Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins, linux-kernel,
dri-devel, virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim,
Mel Gorman, Sangseok Lee, Andrew Morton, Joonsoo Kim, koct9i,
Al Viro
In-Reply-To: <57026B12.6060000@suse.cz>
On Mon, Apr 04, 2016 at 03:24:34PM +0200, Vlastimil Babka wrote:
> On 04/04/2016 07:12 AM, Minchan Kim wrote:
> >On Fri, Apr 01, 2016 at 11:29:14PM +0200, Vlastimil Babka wrote:
> >>Might have been better as a separate migration patch and then a
> >>compaction patch. It's prefixed mm/compaction, but most changed are
> >>in mm/migrate.c
> >
> >Indeed. The title is rather misleading but not sure it's a good idea
> >to separate compaction and migration part.
>
> Guess it's better to see the new functions together with its user
> after all, OK.
>
> >I will just resend to change the tile from "mm/compaction" to
> >"mm/migration".
>
> OK!
>
> >>Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
> >>page->mapping->a_ops->isolate_page exists for PageMovable() pages.
> >>What if it's a false positive on a PG_reclaim page? Can we rely on
> >>PG_reclaim always (and without races) implying PageLRU() so that we
> >>don't even attempt isolate_movable_page()?
> >
> >For now, we shouldn't have such a false positive because PageMovable
> >checks page->_mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE as well as PG_movable
> >under PG_lock.
> >
> >But I read your question about user-mapped drvier pages so we cannot
> >use _mapcount anymore so I will find another thing. A option is this.
> >
> >static inline int PageMovable(struct page *page)
> >{
> > int ret = 0;
> > struct address_space *mapping;
> > struct address_space_operations *a_op;
> >
> > if (!test_bit(PG_movable, &(page->flags))
> > goto out;
> >
> > mapping = page->mapping;
> > if (!mapping)
> > goto out;
> >
> > a_op = mapping->a_op;
> > if (!aop)
> > goto out;
> > if (a_op->isolate_page)
> > ret = 1;
> >out:
> > return ret;
> >
> >}
> >
> >It works under PG_lock but with this, we need trylock_page to peek
> >whether it's movable non-lru or not for scanning pfn.
>
> Hm I hoped that with READ_ONCE() we could do the peek safely without
> trylock_page, if we use it only as a heuristic. But I guess it would
> require at least RCU-level protection of the
> page->mapping->a_op->isolate_page chain.
>
> >For avoiding that, we need another function to peek which just checks
> >PG_movable bit instead of all things.
> >
> >
> >/*
> > * If @page_locked is false, we cannot guarantee page->mapping's stability
> > * so just the function checks with PG_movable which could be false positive
> > * so caller should check it again under PG_lock to check a_ops->isolate_page.
> > */
> >static inline int PageMovable(struct page *page, bool page_locked)
> >{
> > int ret = 0;
> > struct address_space *mapping;
> > struct address_space_operations *a_op;
> >
> > if (!test_bit(PG_movable, &(page->flags))
> > goto out;
> >
> > if (!page_locked) {
> > ret = 1;
> > goto out;
> > }
> >
> > mapping = page->mapping;
> > if (!mapping)
> > goto out;
> >
> > a_op = mapping->a_op;
> > if (!aop)
> > goto out;
> > if (a_op->isolate_page)
> > ret = 1;
> >out:
> > return ret;
> >}
>
> I wouldn't put everything into single function, but create something
> like __PageMovable() just for the unlocked peek. Unlike the
> zone->lru_lock, we don't keep page_lock() across iterations in
> isolate_migratepages_block(), as obviously each page has different
> lock.
> So the page_locked parameter would be always passed as constant, and
> at that point it's better to have separate functions.
Agree.
>
> So I guess the question is how many false positives from overlap
> with PG_reclaim the scanner will hit if we give up on
> PAGE_MOVABLE_MAPCOUNT_VALUE, as that will increase number of page
> locks just to realize that it's not actual PageMovable() page...
I don't think it's too many because PG_reclaim bit is set to only
LRU pages at the moment and we can check PageMovable after !PageLRU
check.
Thanks.
^ permalink raw reply
* [PULL] virtio/qemu: fixes for 4.6
From: Michael S. Tsirkin @ 2016-04-07 17:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: kvm, mst, netdev, linux-kernel, stable, virtualization, stefanha,
gregkh, somlo
The following changes since commit 9735a22799b9214d17d3c231fe377fc852f042e9:
Linux 4.6-rc2 (2016-04-03 09:09:40 -0500)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to c00bbcf8628969e103d4a7b351a53746f1025576:
virtio: add VIRTIO_CONFIG_S_NEEDS_RESET device status bit (2016-04-07 15:16:41 +0300)
----------------------------------------------------------------
virtio/qemu: fixes for 4.6
A couple of fixes for virtio and for the new QEMU fw cfg driver.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Gabriel Somlo (1):
firmware: qemu_fw_cfg.c: hold ACPI global lock during device access
Michael S. Tsirkin (3):
qemu_fw_cfg: don't leak kobj on init error
virtio: virtio 1.0 cs04 spec compliance for reset
MAINTAINERS: add entry for QEMU
Stefan Hajnoczi (1):
virtio: add VIRTIO_CONFIG_S_NEEDS_RESET device status bit
include/uapi/linux/virtio_config.h | 2 ++
drivers/firmware/qemu_fw_cfg.c | 24 +++++++++++++++++++++++-
drivers/virtio/virtio_pci_modern.c | 11 ++++++++---
MAINTAINERS | 7 +++++++
4 files changed, 40 insertions(+), 4 deletions(-)
^ permalink raw reply
* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Ian Campbell @ 2016-04-08 15:35 UTC (permalink / raw)
To: Stefan Hajnoczi, kvm
Cc: marius vlad, Michael S. Tsirkin, netdev, virtualization,
Claudio Imbrenda, Matt Benjamin, Greg Kurz, Christoffer Dall
In-Reply-To: <1459520587-12337-1-git-send-email-stefanha@redhat.com>
On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
>
> I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> first I want to share the latest version of the code. Several people are
> playing with vsock now so sharing the latest code should avoid duplicate work.
Thanks for this, I've been using it in my project and it mostly seems
fine.
One wrinkle I came across, which I'm not sure if it is by design or a
problem is that I can see this sequence coming from the guest (with
other activity in between):
1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
I orignally had my backend close things down at #2, however this meant
that when #3 arrived it was for a non-existent socket (or, worse, an
active one if the ports got reused). I checked v5 of the spec
proposal[0] which says:
If these bits are set and there are no more virtqueue buffers
pending the socket is disconnected.
but I'm not entirely sure if this behaviour contradicts this or not
(the bits have both been set at #2, but not at the same time).
BTW, how does one tell if there are no more virtqueue buffers pending
or not while processing the op?
Another thing I noticed, which is really more to do with the generic
AF_VSOCK bits than anything to do with your patches is that there is no
limitations on which vsock ports a non-privileged user can bind to and
relatedly that there is no netns support so e.g. users in unproivileged
containers can bind to any vsock port and talk to the host, which might
be undesirable. For my use for now I just went with the big hammer
approach of denying access from anything other than init_net
namespace[1] while I consider what the right answer is.
Ian.
[0] http://thread.gmane.org/gmane.comp.emulators.virtio.devel/1092
[1]
From 366c9c42afb9bd54f92f72518470c09e46f12e88 Mon Sep 17 00:00:00 2001
From: Ian Campbell <ian.campbell@docker.com>
Date: Mon, 4 Apr 2016 14:50:10 +0100
Subject: [PATCH] VSOCK: Only allow host network namespace to use AF_VSOCK.
The VSOCK addressing schema does not really lend itself to simply creating an
alternative end point address within a namespace.
Signed-off-by: Ian Campbell <ian.campbell@docker.com>
---
net/vmw_vsock/af_vsock.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 1e5f5ed..cdb3dd3 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1840,6 +1840,9 @@ static const struct proto_ops vsock_stream_ops = {
static int vsock_create(struct net *net, struct socket *sock,
int protocol, int kern)
{
+ if (!net_eq(net, &init_net))
+ return -EAFNOSUPPORT;
+
if (!sock)
return -EINVAL;
--
2.8.0.rc3
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply related
* [PATCH] drivers/virtio/virtio_ring.c: Deinline virtqueue_add, save 1016 bytes
From: Denys Vlasenko @ 2016-04-08 18:58 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Denys Vlasenko, linux-kernel, virtualization
In-Reply-To: <1460141926-13069-1-git-send-email-dvlasenk@redhat.com>
This function compiles to 839 bytes of machine code.
In C, it is ~150 lines long.
This function has 3 callsites.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: virtualization@lists.linux-foundation.org
CC: linux-kernel@vger.kernel.org
---
drivers/virtio/virtio_ring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index e12e385..77a4771 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -126,7 +126,7 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
return desc;
}
-static inline int virtqueue_add(struct virtqueue *_vq,
+static int virtqueue_add(struct virtqueue *_vq,
struct scatterlist *sgs[],
unsigned int total_sg,
unsigned int out_sgs,
--
2.1.0
^ permalink raw reply related
* Re: [PATCH] drivers/virtio/virtio_ring.c: Deinline virtqueue_add, save 1016 bytes
From: Michael S. Tsirkin @ 2016-04-09 20:14 UTC (permalink / raw)
To: Denys Vlasenko; +Cc: linux-kernel, virtualization
In-Reply-To: <1460141926-13069-3-git-send-email-dvlasenk@redhat.com>
On Fri, Apr 08, 2016 at 08:58:44PM +0200, Denys Vlasenko wrote:
> This function compiles to 839 bytes of machine code.
> In C, it is ~150 lines long.
>
> This function has 3 callsites.
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: virtualization@lists.linux-foundation.org
> CC: linux-kernel@vger.kernel.org
This function is one of the most performance critical ones in the driver, a
bunch of tuning went into it, making this inline intentionally. I'd
have to see some numbers showing making it non-inline is a worth-while
tradeoff.
> ---
> drivers/virtio/virtio_ring.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index e12e385..77a4771 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -126,7 +126,7 @@ static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
> return desc;
> }
>
> -static inline int virtqueue_add(struct virtqueue *_vq,
> +static int virtqueue_add(struct virtqueue *_vq,
> struct scatterlist *sgs[],
> unsigned int total_sg,
> unsigned int out_sgs,
> --
> 2.1.0
^ permalink raw reply
* Re: [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon
From: Minchan Kim @ 2016-04-11 4:29 UTC (permalink / raw)
To: Vlastimil Babka
Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins, linux-kernel,
virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim, Mel Gorman,
Sangseok Lee, Andrew Morton, Joonsoo Kim, koct9i, Al Viro
In-Reply-To: <5703A979.402@suse.cz>
On Tue, Apr 05, 2016 at 02:03:05PM +0200, Vlastimil Babka wrote:
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >Now, VM has a feature to migrate non-lru movable pages so
> >balloon doesn't need custom migration hooks in migrate.c
> >and compact.c. Instead, this patch implements page->mapping
> >->{isolate|migrate|putback} functions.
> >
> >With that, we could remove hooks for ballooning in general
> >migration functions and make balloon compaction simple.
> >
> >Cc: virtualization@lists.linux-foundation.org
> >Cc: Rafael Aquini <aquini@redhat.com>
> >Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> >Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
>
> I'm not familiar with the inode and pseudofs stuff, so just some
> things I noticed:
>
> >-#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> >+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-256)
> >+#define PAGE_BALLOON_MAPCOUNT_VALUE PAGE_MOVABLE_MAPCOUNT_VALUE
> >
> > static inline int PageMovable(struct page *page)
> > {
> >- return ((test_bit(PG_movable, &(page)->flags) &&
> >- atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> >- || PageBalloon(page));
> >+ return (test_bit(PG_movable, &(page)->flags) &&
> >+ atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE);
> > }
> >
> > /* Caller should hold a PG_lock */
> >@@ -645,6 +626,35 @@ static inline void __ClearPageMovable(struct page *page)
> >
> > PAGEFLAG(Isolated, isolated, PF_ANY);
> >
> >+static inline int PageBalloon(struct page *page)
> >+{
> >+ return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE
> >+ && PagePrivate2(page);
> >+}
>
> Hmm so you are now using PG_private_2 flag here, but it's not
> documented. Also the only caller of PageBalloon() seems to be
> stable_page_flags(). Which will now report all movable pages with
> PG_private_2 as KPF_BALOON. Seems like an overkill and also not
> reliable. Could it test e.g. page->mapping instead?
Thanks for pointing out.
I will not use page->_mapcount in next version so it should be okay.
>
> Or maybe if we manage to get rid of PAGE_MOVABLE_MAPCOUNT_VALUE, we
> can keep PAGE_BALLOON_MAPCOUNT_VALUE to simply distinguish balloon
> pages for stable_page_flags().
Yeb.
>
> >@@ -1033,7 +1019,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> > out:
> > /* If migration is successful, move newpage to right list */
> > if (rc == MIGRATEPAGE_SUCCESS) {
> >- if (unlikely(__is_movable_balloon_page(newpage)))
> >+ if (unlikely(PageMovable(newpage)))
> > put_page(newpage);
> > else
> > putback_lru_page(newpage);
>
> Hmm shouldn't the condition have been changed to
>
> if (unlikely(__is_movable_balloon_page(newpage)) || PageMovable(newpage)
>
> by patch 02/16? And this patch should be just removing the
> balloon-specific check? Otherwise it seems like between patches 02
> and 04, other kinds of PageMovable pages were unnecessarily/wrongly
> routed through putback_lru_page()?
Fixed.
>
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index d82196244340..c7696a2e11c7 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> >
> > list_for_each_entry_safe(page, next, page_list, lru) {
> > if (page_is_file_cache(page) && !PageDirty(page) &&
> >- !isolated_balloon_page(page)) {
> >+ !PageIsolated(page)) {
> > ClearPageActive(page);
> > list_move(&page->lru, &clean_pages);
> > }
>
> This looks like the same comment as above at first glance. But
> looking closer, it's even weirder. isolated_balloon_page() was
> simply PageBalloon() after d6d86c0a7f8dd... weird already. You
> replace it with check for !PageIsolated() which looks like a more
> correct check, so ok. Except the potential false positive with
> PG_owner_priv_1.
I will change it next version so it shouldn't be a problem.
Thanks for the review!
^ permalink raw reply
* Re: [PATCH v3 00/16] Support non-lru page migration
From: Minchan Kim @ 2016-04-11 4:35 UTC (permalink / raw)
To: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
YiPing Xu
In-Reply-To: <20160404131718.GA18963@e106921-lin.trondheim.arm.com>
On Mon, Apr 04, 2016 at 03:17:18PM +0200, John Einar Reitan wrote:
> On Wed, Mar 30, 2016 at 04:11:59PM +0900, Minchan Kim wrote:
> > Recently, I got many reports about perfermance degradation
> > in embedded system(Android mobile phone, webOS TV and so on)
> > and failed to fork easily.
> >
> > The problem was fragmentation caused by zram and GPU driver
> > pages. Their pages cannot be migrated so compaction cannot
> > work well, either so reclaimer ends up shrinking all of working
> > set pages. It made system very slow and even to fail to fork
> > easily.
> >
> > Other pain point is that they cannot work with CMA.
> > Most of CMA memory space could be idle(ie, it could be used
> > for movable pages unless driver is using) but if driver(i.e.,
> > zram) cannot migrate his page, that memory space could be
> > wasted. In our product which has big CMA memory, it reclaims
> > zones too exccessively although there are lots of free space
> > in CMA so system was very slow easily.
> >
> > To solve these problem, this patch try to add facility to
> > migrate non-lru pages via introducing new friend functions
> > of migratepage in address_space_operation and new page flags.
> >
> > (isolate_page, putback_page)
> > (PG_movable, PG_isolated)
> >
> > For details, please read description in
> > "mm/compaction: support non-lru movable page migration".
>
> Thanks, this mirrors what we see with the ARM Mali GPU drivers too.
>
> One thing with the current design which worries me is the potential
> for multiple calls due to many separated pages being migrated.
> On GPUs (or any other device) which has an IOMMU and L2 cache, which
> isn't coherent with the CPU, we must do L2 cache flush & invalidation
> per page. I guess batching pages isn't easily possible?
>
Hmm, I think it seems to cause many code stirring but surely worth
to work. So, IMMO, it would be better to add such feature after soft
landing of current work.
Anyway, I will Cc'ed you in next revision.
Thanks.
^ permalink raw reply
* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Stefan Hajnoczi @ 2016-04-11 10:45 UTC (permalink / raw)
To: Ian Campbell
Cc: marius vlad, Stefan Hajnoczi, kvm, Michael S. Tsirkin, netdev,
virtualization, Claudio Imbrenda, Matt Benjamin, Greg Kurz,
Christoffer Dall
In-Reply-To: <1460129705.1749.25.camel@docker.com>
[-- Attachment #1.1: Type: text/plain, Size: 2801 bytes --]
On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> >
> > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > first I want to share the latest version of the code. Several people are
> > playing with vsock now so sharing the latest code should avoid duplicate work.
>
> Thanks for this, I've been using it in my project and it mostly seems
> fine.
>
> One wrinkle I came across, which I'm not sure if it is by design or a
> problem is that I can see this sequence coming from the guest (with
> other activity in between):
>
> 1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> 2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> 3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
>
> I orignally had my backend close things down at #2, however this meant
> that when #3 arrived it was for a non-existent socket (or, worse, an
> active one if the ports got reused). I checked v5 of the spec
> proposal[0] which says:
> If these bits are set and there are no more virtqueue buffers
> pending the socket is disconnected.
>
> but I'm not entirely sure if this behaviour contradicts this or not
> (the bits have both been set at #2, but not at the same time).
>
> BTW, how does one tell if there are no more virtqueue buffers pending
> or not while processing the op?
#2 is odd. The shutdown bits are sticky so they cannot be cleared once
set. I would have expected just #1 and #3. The behavior you observe
look like a bug.
The spec text does not convey the meaning of OP_SHUTDOWN well.
OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
for this connection. "there are no more virtqueue buffers pending the
socket" really means that this isn't an immediate close from the
perspective of the application. If the application still has unread rx
buffers then the socket stays readable until the rx data has been fully
read.
> Another thing I noticed, which is really more to do with the generic
> AF_VSOCK bits than anything to do with your patches is that there is no
> limitations on which vsock ports a non-privileged user can bind to and
> relatedly that there is no netns support so e.g. users in unproivileged
> containers can bind to any vsock port and talk to the host, which might
> be undesirable. For my use for now I just went with the big hammer
> approach of denying access from anything other than init_net
> namespace[1] while I consider what the right answer is.
From the vhost point of view each netns should have its own AF_VSOCK
namespace. This way two containers could act as "the host" (CID 2) for
their respective guests.
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH 0/2] virtio-ccw: new status accessor in device
From: Cornelia Huck @ 2016-04-11 11:23 UTC (permalink / raw)
To: virtio-dev, virtualization, qemu-devel
These patches implement the new read status command in qemu and bump
the revision to 2 (as this ccw is currently the only thing new with
that revision); see "[PATCH v2 1/1] ccw: add CCW_CMD_READ_STATUS"
for details.
Note that we'll need to add some compat machine handling when this
is added (not done here, as the machine level is not clear yet).
Changes from the RFC (back in September):
- removed stray ';' in patch 1
- rebased
Pierre Morel (2):
s390x/virtio-ccw: respond to READ_STATUS command
s390x/virtio-ccw: set revision 2 as maximum revision number
hw/s390x/virtio-ccw.c | 20 ++++++++++++++++++++
hw/s390x/virtio-ccw.h | 3 ++-
2 files changed, 22 insertions(+), 1 deletion(-)
--
2.6.6
^ permalink raw reply
* [PATCH 1/2] s390x/virtio-ccw: respond to READ_STATUS command
From: Cornelia Huck @ 2016-04-11 11:23 UTC (permalink / raw)
To: virtio-dev, virtualization, qemu-devel; +Cc: Pierre Morel
In-Reply-To: <1460373804-43477-1-git-send-email-cornelia.huck@de.ibm.com>
From: Pierre Morel <pmorel@linux.vnet.ibm.com>
This patch adds the response to the READ_STATUS CCW command.
Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
hw/s390x/virtio-ccw.c | 20 ++++++++++++++++++++
hw/s390x/virtio-ccw.h | 1 +
2 files changed, 21 insertions(+)
diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
index d51642d..569ab26 100644
--- a/hw/s390x/virtio-ccw.c
+++ b/hw/s390x/virtio-ccw.c
@@ -525,6 +525,26 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
}
}
break;
+ case CCW_CMD_READ_STATUS:
+ if (check_len) {
+ if (ccw.count != sizeof(status)) {
+ ret = -EINVAL;
+ break;
+ }
+ } else if (ccw.count < sizeof(status)) {
+ /* Can't execute command. */
+ ret = -EINVAL;
+ break;
+ }
+ if (!ccw.cda) {
+ ret = -EFAULT;
+ } else {
+ address_space_stb(&address_space_memory, ccw.cda, vdev->status,
+ MEMTXATTRS_UNSPECIFIED, NULL);
+ sch->curr_status.scsw.count = ccw.count - sizeof(vdev->status);
+ ret = 0;
+ }
+ break;
case CCW_CMD_WRITE_STATUS:
if (check_len) {
if (ccw.count != sizeof(status)) {
diff --git a/hw/s390x/virtio-ccw.h b/hw/s390x/virtio-ccw.h
index 66c831b..6bc14ee 100644
--- a/hw/s390x/virtio-ccw.h
+++ b/hw/s390x/virtio-ccw.h
@@ -41,6 +41,7 @@
#define CCW_CMD_SET_IND 0x43
#define CCW_CMD_SET_CONF_IND 0x53
#define CCW_CMD_READ_VQ_CONF 0x32
+#define CCW_CMD_READ_STATUS 0x72
#define CCW_CMD_SET_IND_ADAPTER 0x73
#define CCW_CMD_SET_VIRTIO_REV 0x83
--
2.6.6
^ permalink raw reply related
* [PATCH 2/2] s390x/virtio-ccw: set revision 2 as maximum revision number
From: Cornelia Huck @ 2016-04-11 11:23 UTC (permalink / raw)
To: virtio-dev, virtualization, qemu-devel; +Cc: Pierre Morel
In-Reply-To: <1460373804-43477-1-git-send-email-cornelia.huck@de.ibm.com>
From: Pierre Morel <pmorel@linux.vnet.ibm.com>
We have everything needed for virtio-ccw revision 2 wired up now. Bump
the maximum supported revision reported to the guest so they can make
use of it.
Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
hw/s390x/virtio-ccw.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/s390x/virtio-ccw.h b/hw/s390x/virtio-ccw.h
index 6bc14ee..46b5ee6 100644
--- a/hw/s390x/virtio-ccw.h
+++ b/hw/s390x/virtio-ccw.h
@@ -97,7 +97,7 @@ struct VirtioCcwDevice {
};
/* The maximum virtio revision we support. */
-#define VIRTIO_CCW_MAX_REV 1
+#define VIRTIO_CCW_MAX_REV 2
static inline int virtio_ccw_rev_max(VirtioCcwDevice *dev)
{
return dev->max_rev;
--
2.6.6
^ permalink raw reply related
* [PATCH 0/1] virtio_ccw: new status accessor in driver
From: Cornelia Huck @ 2016-04-11 11:23 UTC (permalink / raw)
To: virtio-dev, virtualization, qemu-devel
This patch implements the new status accessor in the ccw device,
as laid out in "[PATCH v2 1/1] ccw: add CCW_CMD_READ_STATUS".
Changes from the RFC (back in September):
- rebased
Pierre Morel (1):
virtio/s390: support READ_STATUS command for virtio-ccw
drivers/s390/virtio/virtio_ccw.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
--
2.6.6
^ permalink raw reply
* [PATCH 1/1] virtio/s390: support READ_STATUS command for virtio-ccw
From: Cornelia Huck @ 2016-04-11 11:23 UTC (permalink / raw)
To: virtio-dev, virtualization, qemu-devel; +Cc: Pierre Morel
In-Reply-To: <1460373817-43522-1-git-send-email-cornelia.huck@de.ibm.com>
From: Pierre Morel <pmorel@linux.vnet.ibm.com>
As virtio-1 introduced the possibility of the device manipulating the
status byte, revision 2 of the virtio-ccw transport introduced a means
of getting the status byte from the device via READ_STATUS. Let's wire
it up for revisions >= 2 and fall back to returning the stored status
byte if not supported.
Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
---
drivers/s390/virtio/virtio_ccw.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 8688ad4..d3a98d3 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -145,6 +145,7 @@ static struct airq_info *airq_areas[MAX_AIRQ_AREAS];
#define CCW_CMD_WRITE_CONF 0x21
#define CCW_CMD_WRITE_STATUS 0x31
#define CCW_CMD_READ_VQ_CONF 0x32
+#define CCW_CMD_READ_STATUS 0x72
#define CCW_CMD_SET_IND_ADAPTER 0x73
#define CCW_CMD_SET_VIRTIO_REV 0x83
@@ -160,6 +161,7 @@ static struct airq_info *airq_areas[MAX_AIRQ_AREAS];
#define VIRTIO_CCW_DOING_SET_CONF_IND 0x04000000
#define VIRTIO_CCW_DOING_SET_IND_ADAPTER 0x08000000
#define VIRTIO_CCW_DOING_SET_VIRTIO_REV 0x10000000
+#define VIRTIO_CCW_DOING_READ_STATUS 0x20000000
#define VIRTIO_CCW_INTPARM_MASK 0xffff0000
static struct virtio_ccw_device *to_vc_device(struct virtio_device *vdev)
@@ -902,6 +904,28 @@ out_free:
static u8 virtio_ccw_get_status(struct virtio_device *vdev)
{
struct virtio_ccw_device *vcdev = to_vc_device(vdev);
+ u8 old_status = *vcdev->status;
+ struct ccw1 *ccw;
+
+ if (vcdev->revision < 1)
+ return *vcdev->status;
+
+ ccw = kzalloc(sizeof(*ccw), GFP_DMA | GFP_KERNEL);
+ if (!ccw)
+ return old_status;
+
+ ccw->cmd_code = CCW_CMD_READ_STATUS;
+ ccw->flags = 0;
+ ccw->count = sizeof(*vcdev->status);
+ ccw->cda = (__u32)(unsigned long)vcdev->status;
+ ccw_io_helper(vcdev, ccw, VIRTIO_CCW_DOING_READ_STATUS);
+/*
+ * If the channel program failed (should only happen if the device
+ * was hotunplugged, and then we clean up via the machine check
+ * handler anyway), vcdev->status was not overwritten and we just
+ * return the old status, which is fine.
+*/
+ kfree(ccw);
return *vcdev->status;
}
@@ -997,6 +1021,7 @@ static void virtio_ccw_check_activity(struct virtio_ccw_device *vcdev,
case VIRTIO_CCW_DOING_READ_CONFIG:
case VIRTIO_CCW_DOING_WRITE_CONFIG:
case VIRTIO_CCW_DOING_WRITE_STATUS:
+ case VIRTIO_CCW_DOING_READ_STATUS:
case VIRTIO_CCW_DOING_SET_VQ:
case VIRTIO_CCW_DOING_SET_IND:
case VIRTIO_CCW_DOING_SET_CONF_IND:
--
2.6.6
^ permalink raw reply related
* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Michael S. Tsirkin @ 2016-04-11 12:54 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: marius vlad, Stefan Hajnoczi, kvm, netdev, Ian Campbell,
Claudio Imbrenda, Matt Benjamin, Greg Kurz, virtualization,
Christoffer Dall
In-Reply-To: <20160411104548.GA12826@stefanha-x1.localdomain>
On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > >
> > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > first I want to share the latest version of the code. Several people are
> > > playing with vsock now so sharing the latest code should avoid duplicate work.
> >
> > Thanks for this, I've been using it in my project and it mostly seems
> > fine.
> >
> > One wrinkle I came across, which I'm not sure if it is by design or a
> > problem is that I can see this sequence coming from the guest (with
> > other activity in between):
> >
> > 1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > 2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > 3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
> >
> > I orignally had my backend close things down at #2, however this meant
> > that when #3 arrived it was for a non-existent socket (or, worse, an
> > active one if the ports got reused). I checked v5 of the spec
> > proposal[0] which says:
> > If these bits are set and there are no more virtqueue buffers
> > pending the socket is disconnected.
> >
> > but I'm not entirely sure if this behaviour contradicts this or not
> > (the bits have both been set at #2, but not at the same time).
> >
> > BTW, how does one tell if there are no more virtqueue buffers pending
> > or not while processing the op?
>
> #2 is odd. The shutdown bits are sticky so they cannot be cleared once
> set. I would have expected just #1 and #3. The behavior you observe
> look like a bug.
>
> The spec text does not convey the meaning of OP_SHUTDOWN well.
> OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> for this connection. "there are no more virtqueue buffers pending the
> socket" really means that this isn't an immediate close from the
> perspective of the application. If the application still has unread rx
> buffers then the socket stays readable until the rx data has been fully
> read.
Yes but you also wrote:
If these bits are set and there are no more virtqueue buffers
pending the socket is disconnected.
how does remote know that there are no buffers pending and so it's safe
to reuse the same source/destination address now? Maybe destination
should send RST at that point?
> > Another thing I noticed, which is really more to do with the generic
> > AF_VSOCK bits than anything to do with your patches is that there is no
> > limitations on which vsock ports a non-privileged user can bind to and
> > relatedly that there is no netns support so e.g. users in unproivileged
> > containers can bind to any vsock port and talk to the host, which might
> > be undesirable. For my use for now I just went with the big hammer
> > approach of denying access from anything other than init_net
> > namespace[1] while I consider what the right answer is.
>
> From the vhost point of view each netns should have its own AF_VSOCK
> namespace. This way two containers could act as "the host" (CID 2) for
> their respective guests.
I wonder how this interacts with the disconnect on migration
idea that you discussed. Specifically, socket has to stay connected
^ permalink raw reply
* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Stefan Hajnoczi @ 2016-04-12 13:59 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: marius vlad, kvm, Ian Campbell, Claudio Imbrenda, Matt Benjamin,
netdev, Greg Kurz, virtualization, Christoffer Dall
In-Reply-To: <20160411154517-mutt-send-email-mst@redhat.com>
[-- Attachment #1.1: Type: text/plain, Size: 3633 bytes --]
On Mon, Apr 11, 2016 at 03:54:08PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> > On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > > >
> > > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > > first I want to share the latest version of the code. Several people are
> > > > playing with vsock now so sharing the latest code should avoid duplicate work.
> > >
> > > Thanks for this, I've been using it in my project and it mostly seems
> > > fine.
> > >
> > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > problem is that I can see this sequence coming from the guest (with
> > > other activity in between):
> > >
> > > 1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > > 2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > > 3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
How did you trigger this sequence? I'd like to reproduce it.
> > > I orignally had my backend close things down at #2, however this meant
> > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > active one if the ports got reused). I checked v5 of the spec
> > > proposal[0] which says:
> > > If these bits are set and there are no more virtqueue buffers
> > > pending the socket is disconnected.
> > >
> > > but I'm not entirely sure if this behaviour contradicts this or not
> > > (the bits have both been set at #2, but not at the same time).
> > >
> > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > or not while processing the op?
> >
> > #2 is odd. The shutdown bits are sticky so they cannot be cleared once
> > set. I would have expected just #1 and #3. The behavior you observe
> > look like a bug.
> >
> > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > for this connection. "there are no more virtqueue buffers pending the
> > socket" really means that this isn't an immediate close from the
> > perspective of the application. If the application still has unread rx
> > buffers then the socket stays readable until the rx data has been fully
> > read.
>
> Yes but you also wrote:
> If these bits are set and there are no more virtqueue buffers
> pending the socket is disconnected.
>
> how does remote know that there are no buffers pending and so it's safe
> to reuse the same source/destination address now? Maybe destination
> should send RST at that point?
You are right, the source/destination address could be reused while the
remote still has the connection in their table. Connection
establishment would fail with a RST reply.
I can think of two solutions:
1. Implementations must remove connections from their table as soon as
SHUTDOWN_TX|SHUTDOWN_RX is received. This way the source/destination
address tuple can be reused immediately, i.e. new connections with
the same source/destination would be possible while an application is
still draining the receive buffers of an old connection.
2. Extend the connection lifecycle so that an A->B
SHUTDOWN_TX|SHUTDOWN_RX must be followed by a by a B->A RST to close
a connection. This way the source/destination address is only in use
once at a time.
Option #2 seems safer because there is no overlap in source/destination
address usage.
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Ian Campbell @ 2016-04-12 16:07 UTC (permalink / raw)
To: Michael S. Tsirkin, Stefan Hajnoczi
Cc: marius vlad, Stefan Hajnoczi, kvm, netdev, virtualization,
Claudio Imbrenda, Matt Benjamin, Greg Kurz, Christoffer Dall
In-Reply-To: <20160411154517-mutt-send-email-mst@redhat.com>
Some how Stefan's reply disapeared from my INBOX (although I did see
it) so replying here.
On Mon, 2016-04-11 at 15:54 +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> >
> > On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > >
> > > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > >
> > > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > > >
> > > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > > first I want to share the latest version of the code. Several people are
> > > > playing with vsock now so sharing the latest code should avoid duplicate work.
> > > Thanks for this, I've been using it in my project and it mostly seems
> > > fine.
> > >
> > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > problem is that I can see this sequence coming from the guest (with
> > > other activity in between):
> > >
> > > 1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > > 2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > > 3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
> > >
> > > I orignally had my backend close things down at #2, however this meant
> > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > active one if the ports got reused). I checked v5 of the spec
> > > proposal[0] which says:
> > > If these bits are set and there are no more virtqueue buffers
> > > pending the socket is disconnected.
> > >
> > > but I'm not entirely sure if this behaviour contradicts this or not
> > > (the bits have both been set at #2, but not at the same time).
> > >
> > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > or not while processing the op?
> > #2 is odd. The shutdown bits are sticky so they cannot be cleared once
> > set. I would have expected just #1 and #3. The behavior you observe
> > look like a bug.
> >
> > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > for this connection. "there are no more virtqueue buffers pending the
> > socket" really means that this isn't an immediate close from the
> > perspective of the application. If the application still has unread rx
> > buffers then the socket stays readable until the rx data has been fully
> > read.
Thanks, distinguishing the local buffer to the application from the
vring would make that clearer. Perhaps by not talking about "virtqueue
buffers" since they sound like a vring thing.
However, as Michael observes I'm not sure that's the whole story.
> Yes but you also wrote:
> If these bits are set and there are no more virtqueue buffers
> pending the socket is disconnected.
>
> how does remote know that there are no buffers pending and so it's safe
> to reuse the same source/destination address now?
Indeed this is one of the things I struggled with. e.g. If I send a
SHUTDOWN_RX to my peer am I supposed to wait for that buffer to come
back (so I know the peer has seen it) and then wait for an entire
"cycle" of the TX ring to know there is nothing still in flight? That's
some tricky book-keeping.
> Maybe destination
> should send RST at that point?
i.e. upon receipt of SHUTDOWN_RX|SHUTDOWN_TX from the peer you are
expected to send a RST. When the peer observes that then they know
there is no further data in that connection on the ring?
That sounds like it would be helpful.
> > > Another thing I noticed, which is really more to do with the generic
> > > AF_VSOCK bits than anything to do with your patches is that there is no
> > > limitations on which vsock ports a non-privileged user can bind to and
> > > relatedly that there is no netns support so e.g. users in unproivileged
> > > containers can bind to any vsock port and talk to the host, which might
> > > be undesirable. For my use for now I just went with the big hammer
> > > approach of denying access from anything other than init_net
> > > namespace[1] while I consider what the right answer is.
> > From the vhost point of view each netns should have its own AF_VSOCK
> > namespace. This way two containers could act as "the host" (CID 2) for
> > their respective guests.
When you say "should" you mean that's the intended design as opposed to
what the current code is actually doing, right?
Ian.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Ian Campbell @ 2016-04-12 16:37 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: marius vlad, kvm, Michael S. Tsirkin, ijc, virtualization,
Claudio Imbrenda, Matt Benjamin, netdev, Greg Kurz,
Christoffer Dall
In-Reply-To: <20160412135957.GB29405@stefanha-x1.localdomain>
[-- Attachment #1.1: Type: text/plain, Size: 6431 bytes --]
For some reason your mails in this thread only appear in the gmail web UI
and not on the IMAP version of my mailbox (my own and Michael's mails are
fine).
So I'm replying via the web interface, sorry for the inevitable formatting
mess :-/
I've CCd another mailbox in the hopes of getting your mails in that IMAP
folder instead/aswell so I can avoid this next time.
On 12 April 2016 at 14:59, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> > > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > > problem is that I can see this sequence coming from the guest (with
> > > > other activity in between):
> > > >
> > > > 1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > > > 2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > > > 3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
>
> How did you trigger this sequence? I'd like to reproduce it.
>
Nothing magic. I've written some logging into my backend and captured the
result for a simple backend initiated connection.
In the log "TX" and "RX" indicate the thread doing the processing (with
"TX" being the one which processes the guest's TX ring, i.e. data coming
from the guest to the host). "<=" indicates a buffer going from guest to
host and "=>" is from host to guest. NB that guest to host replies are
queued synchronously by the TX thread onto the RX ring which is why the
somewhat odd looking "TX: =>" combination can occur. A host initiated
connection also happens from the TX thread in the same way.
The trace is of a simple request response (which both fit in one buffer in
each direction), the lines without an "?X:" prefix are my
annotations/guesses as to what is going on:
TX: =>SRC:00000002.00010002 DST:00000003.00000948
TX: LEN:00000000 TYPE:0001 OP:1=REQUEST
TX: FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:2=RESPONSE
TX: FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:00000000
REQUEST + RESPONSE == Channel open successfully
RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX: LEN:0000005e TYPE:0001 OP:5=RW
RX: FLAGS:00000000 BUF_ALLOC:00008000 FWD_CNT:00000000
Host sends a request to the guest
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:6=CREDIT_UPDATE
TX: FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest replies with a credit update
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000091 TYPE:0001 OP:5=RW
TX: FLAGS:00000000 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest replies with the answer to the request
RX: =>SRC:00000002.00010002 DST:00000003.00000948
RX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
RX: FLAGS:00000002 BUF_ALLOC:00008000 FWD_CNT:00000091
Host has sent its only request, so host app must have done
shutdown(SHUT_WR) I suppose and host therefore sends SHUTDOWN_TX.
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX: FLAGS:00000001 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest SHUTDOWN_RX. I'm not sure if this is a direct kernel response to the
SHUTDOWN_TX or if the application inside the guest saw an EOF when reading
the socket and did the corresponding shutdown(SHUT_RD).
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX: FLAGS:00000002 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest SHUTDOWN_TX, I presume that having sent the only response it is going
to it then does shutdown(SHUT_WR).
TX: <=SRC:00000003.00000948 DST:00000002.00010002
TX: LEN:00000000 TYPE:0001 OP:4=SHUTDOWN
TX: FLAGS:00000003 BUF_ALLOC:00040000 FWD_CNT:0000005e
Guest shuts down both directions.
Perhaps the guest end is turning shutdown(foo) directly into a vsock
message without or-ing in the current state?
> > > I orignally had my backend close things down at #2, however this meant
> > > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > > active one if the ports got reused). I checked v5 of the spec
> > > > proposal[0] which says:
> > > > If these bits are set and there are no more virtqueue buffers
> > > > pending the socket is disconnected.
> > > >
> > > > but I'm not entirely sure if this behaviour contradicts this or not
> > > > (the bits have both been set at #2, but not at the same time).
> > > >
> > > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > > or not while processing the op?
> > >
> > > #2 is odd. The shutdown bits are sticky so they cannot be cleared once
> > > set. I would have expected just #1 and #3. The behavior you observe
> > > look like a bug.
> > >
> > > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > > for this connection. "there are no more virtqueue buffers pending the
> > > socket" really means that this isn't an immediate close from the
> > > perspective of the application. If the application still has unread rx
> > > buffers then the socket stays readable until the rx data has been fully
> > > read.
> >
> > Yes but you also wrote:
> > If these bits are set and there are no more virtqueue buffers
> > pending the socket is disconnected.
> >
> > how does remote know that there are no buffers pending and so it's safe
> > to reuse the same source/destination address now? Maybe destination
> > should send RST at that point?
>
> You are right, the source/destination address could be reused while the
> remote still has the connection in their table. Connection
> establishment would fail with a RST reply.
>
> I can think of two solutions:
>
> 1. Implementations must remove connections from their table as soon as
> SHUTDOWN_TX|SHUTDOWN_RX is received. This way the source/destination
> address tuple can be reused immediately, i.e. new connections with
> the same source/destination would be possible while an application is
> still draining the receive buffers of an old connection.
>
> 2. Extend the connection lifecycle so that an A->B
> SHUTDOWN_TX|SHUTDOWN_RX must be followed by a by a B->A RST to close
> a connection. This way the source/destination address is only in use
> once at a time.
>
> Option #2 seems safer because there is no overlap in source/destination
> address usage.
>
I agree (I hadn't seen this when I replied to Michael similar suggestion).
Ian. (sorry again for the formatting)
[-- Attachment #1.2: Type: text/html, Size: 8558 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v5 0/6] Support calling functions on dedicated physical cpu
From: Juergen Gross @ 2016-04-13 8:49 UTC (permalink / raw)
To: linux-kernel, xen-devel
Cc: jeremy, jdelvare, peterz, hpa, akataria, x86, virtualization,
chrisw, mingo, david.vrabel, Douglas_Warzecha, pali.rohar,
boris.ostrovsky, tglx, linux
In-Reply-To: <1459952266-3687-1-git-send-email-jgross@suse.com>
On 06/04/16 16:17, Juergen Gross wrote:
> Some hardware (e.g. Dell Studio laptops) require special functions to
> be called on physical cpu 0 in order to avoid occasional hangs. When
> running as dom0 under Xen this could be achieved only via special boot
> parameters (vcpu pinning) limiting the hypervisor in it's scheduling
> decisions.
>
> This patch series is adding a generic function to be able to temporarily
> pin a (virtual) cpu to a dedicated physical cpu for executing above
> mentioned functions on that specific cpu. The drivers (dcdbas and i8k)
> requiring this functionality are modified accordingly.
>
> Changes in V5:
> - patch 3: rename and reshuffle parameters of smp_call_on_cpu() as requested
> by Peter Zijlstra
> - patch 3: test target cpu to be online as requested by Peter Zijlstra
> - patch 4: less wordy messages as requested by David Vrabel
>
> Changes in V4:
> - move patches 5 and 6 further up in the series
> - patch 2 (was 5): WARN_ONCE in case platform doesn't support pinning
> as requested by Peter Zijlstra
> - patch 3 (was 2): change return value in case of illegal cpu as
> requested by Peter Zijlstra
> - patch 3 (was 2): make pinning of vcpu an option as suggested by
> Peter Zijlstra
> - patches 5 and 6 (were 3 and 4): add call to get_online_cpus()
>
> Changes in V3:
> - use get_cpu()/put_cpu() as suggested by David Vrabel
>
> Changes in V2:
> - instead of manipulating the allowed set of cpus use cpu specific
> workqueue as requested by Peter Zijlstra
> - add include/linux/hypervisor.h to hide architecture specific stuff
> from generic kernel code
>
> Juergen Gross (6):
> xen: sync xen header
> virt, sched: add generic vcpu pinning support
> smp: add function to execute a function synchronously on a cpu
> xen: add xen_pin_vcpu() to support calling functions on a dedicated
> pcpu
> dcdbas: make use of smp_call_on_cpu()
> hwmon: use smp_call_on_cpu() for dell-smm i8k
>
> MAINTAINERS | 1 +
> arch/x86/include/asm/hypervisor.h | 4 ++
> arch/x86/kernel/cpu/hypervisor.c | 11 +++++
> arch/x86/xen/enlighten.c | 40 +++++++++++++++
> drivers/firmware/dcdbas.c | 51 +++++++++----------
> drivers/hwmon/dell-smm-hwmon.c | 35 +++++++------
> include/linux/hypervisor.h | 17 +++++++
> include/linux/smp.h | 3 ++
> include/xen/interface/sched.h | 100 +++++++++++++++++++++++++++++++-------
> kernel/smp.c | 51 +++++++++++++++++++
> kernel/up.c | 18 +++++++
> 11 files changed, 273 insertions(+), 58 deletions(-)
> create mode 100644 include/linux/hypervisor.h
>
So patches 1, 3 and 4 are acked. Any comments regarding patches 2, 5
and 6?
Juergen
^ permalink raw reply
* Re: [RFC v5 0/5] Add virtio transport for AF_VSOCK
From: Stefan Hajnoczi @ 2016-04-13 13:38 UTC (permalink / raw)
To: Ian Campbell
Cc: marius vlad, kvm, Michael S. Tsirkin, ijc, virtualization,
Claudio Imbrenda, Matt Benjamin, netdev, Greg Kurz,
Christoffer Dall
In-Reply-To: <CAOc2ZU1Kaes-UVfFrKrE9QyLMWnAQ77+hkSAGJgD0gOuM8o_UQ@mail.gmail.com>
[-- Attachment #1.1: Type: text/plain, Size: 513 bytes --]
On Tue, Apr 12, 2016 at 05:37:54PM +0100, Ian Campbell wrote:
> Perhaps the guest end is turning shutdown(foo) directly into a vsock
> message without or-ing in the current state?
Yes, you are right:
lock_sock(sk);
sk->sk_shutdown |= mode;
sk->sk_state_change(sk);
release_sock(sk);
if (sk->sk_type == SOCK_STREAM) {
sock_reset_flag(sk, SOCK_DONE);
vsock_send_shutdown(sk, mode);
Although sk_shutdown is ORed correctly, vsock_send_shutdown() is called with
just the shutdown() argument.
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH 13/14] drm/virtio: use drm_crtc_send_vblank_event()
From: Gustavo Padovan @ 2016-04-14 17:48 UTC (permalink / raw)
To: dri-devel
Cc: David Airlie, open list, Gustavo Padovan,
open list:VIRTIO GPU DRIVER
In-Reply-To: <1460656118-16766-1-git-send-email-gustavo@padovan.org>
From: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Replace the legacy drm_send_vblank_event() with the new helper function.
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
---
drivers/gpu/drm/virtio/virtgpu_display.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index 4854dac..66be450 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -163,7 +163,7 @@ static int virtio_gpu_page_flip(struct drm_crtc *crtc,
if (event) {
spin_lock_irqsave(&crtc->dev->event_lock, irqflags);
- drm_send_vblank_event(crtc->dev, -1, event);
+ drm_crtc_send_vblank_event(crtc, event);
spin_unlock_irqrestore(&crtc->dev->event_lock, irqflags);
}
--
2.5.5
^ permalink raw reply related
* [PATCH RESEND] drm/virtio: send vblank event after crtc updates
From: Gustavo Padovan @ 2016-04-14 17:58 UTC (permalink / raw)
To: David Airlie
Cc: open list, Gustavo Padovan, open list:VIRTIO GPU DRIVER,
open list:VIRTIO GPU DRIVER
From: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
virtio_gpu was failing to send vblank events when using the atomic IOCTL
with the DRM_MODE_PAGE_FLIP_EVENT flag set. This patch fixes each and
enables atomic pageflips updates.
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
---
drivers/gpu/drm/virtio/virtgpu_display.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index 4854dac..5fd1fd0 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -267,11 +267,23 @@ static int virtio_gpu_crtc_atomic_check(struct drm_crtc *crtc,
return 0;
}
+static void virtio_gpu_crtc_atomic_flush(struct drm_crtc *crtc,
+ struct drm_crtc_state *old_state)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&crtc->dev->event_lock, flags);
+ if (crtc->state->event)
+ drm_crtc_send_vblank_event(crtc, crtc->state->event);
+ spin_unlock_irqrestore(&crtc->dev->event_lock, flags);
+}
+
static const struct drm_crtc_helper_funcs virtio_gpu_crtc_helper_funcs = {
.enable = virtio_gpu_crtc_enable,
.disable = virtio_gpu_crtc_disable,
.mode_set_nofb = virtio_gpu_crtc_mode_set_nofb,
.atomic_check = virtio_gpu_crtc_atomic_check,
+ .atomic_flush = virtio_gpu_crtc_atomic_flush,
};
static void virtio_gpu_enc_mode_set(struct drm_encoder *encoder,
--
2.5.5
^ permalink raw reply related
* Re: [PATCH] documentation: Add disclaimer
From: Paul E. McKenney @ 2016-04-14 21:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-mips, linux-ia64, Michael S. Tsirkin, Will Deacon,
virtualization, H. Peter Anvin, sparclinux, Ingo Molnar,
linux-arch, linux-s390, Russell King - ARM Linux,
user-mode-linux-devel, linux-sh, Michael Ellerman, x86, xen-devel,
Ingo Molnar, linux-xtensa, james.hogan, Arnd Bergmann,
Stefano Stabellini, adi-buildroot-devel, Leonid Yegoshin,
ddaney.cavm, Thomas Gleixner, linux-metag
In-Reply-To: <20160127083546.GJ6357@twins.programming.kicks-ass.net>
On Wed, Jan 27, 2016 at 09:35:46AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 12:11:43PM -0800, Paul E. McKenney wrote:
> > So Peter, would you like to update your patch to include yourself
> > and Will as authors?
>
> Sure, here goes.
>
> ---
> Subject: documentation: Add disclaimer
>
> It appears people are reading this document as a requirements list for
> building hardware. This is not the intent of this document. Nor is it
> particularly suited for this purpose.
>
> The primary purpose of this document is our collective attempt to define
> a set of primitives that (hopefully) allow us to write correct code on
> the myriad of SMP platforms Linux supports.
>
> Its a definite work in progress as our understanding of these platforms,
> and memory ordering in general, progresses.
>
> Nor does being mentioned in this document mean we think its a
> particularly good idea; the data dependency barrier required by Alpha
> being a prime example. Yes we have it, no you're insane to require it
> when building new hardware.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Rather belatedly queued and pushed to -rcu, apologies for the delay.
One minor edit noted below.
Thanx, Paul
> ---
> Documentation/memory-barriers.txt | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a61be39c7b51..98626125f484 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -4,8 +4,24 @@
>
> By: David Howells <dhowells@redhat.com>
> Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> + Will Deacon <will.deacon@arm.com>
> + Peter Zijlstra <peterz@infradead.org>
>
> -Contents:
> +==========
> +DISCLAIMER
> +==========
> +
> +This document is not a specification; it is intentionally (for the sake of
> +brevity) and unintentionally (due to being human) incomplete. This document is
> +meant as a guide to using the various memory barriers provided by Linux, but
> +in case of any doubt (and there are many) please ask.
> +
> +I repeat, this document is not a specification of what Linux expects from
s/I/To/ because there is more than one author.
> +hardware.
> +
> +========
> +CONTENTS
> +========
>
> (*) Abstract memory access model.
>
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox