All of lore.kernel.org

* Re: [RFC v4 15/16] vfio/type1: Check MSI remapping at irq domain level
From: Diana Madalina Craciun @ 2016-12-22 12:41 UTC (permalink / raw)
  To: Eric Auger,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	christoffer.dall-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
	marc.zyngier-5wv7dgnIgG8@public.gmane.org,
	robin.murphy-5wv7dgnIgG8@public.gmane.org,
	alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	will.deacon-5wv7dgnIgG8@public.gmane.org,
	joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org,
	tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org,
	jason-NLaQJdtUoK4Be96aLqz0jA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
  Cc: drjones-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	punit.agrawal-5wv7dgnIgG8@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	pranav.sawargaonkar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	shankerd-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org,
	gpkulkarni-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
In-Reply-To: <1481661034-3088-16-git-send-email-eric.auger@redhat.com>

Hi Eric,

On 12/13/2016 10:32 PM, Eric Auger wrote:
> In case the IOMMU does not bypass MSI transactions (typical
> case on ARM), we check all MSI controllers are IRQ remapping
> capable. If not the IRQ assignment may be unsafe.
>
> At this stage the arm-smmu-(v3) still advertise the
> IOMMU_CAP_INTR_REMAP capability at IOMMU level. This will be
> removed in subsequent patches.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d07fe73..a05648b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -37,6 +37,7 @@
>  #include <linux/vfio.h>
>  #include <linux/workqueue.h>
>  #include <linux/dma-iommu.h>
> +#include <linux/irqdomain.h>
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -765,7 +766,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> -	bool resv_msi;
> +	bool resv_msi, msi_remap;
>  	phys_addr_t resv_msi_base;
>  
>  	mutex_lock(&iommu->lock);
> @@ -818,8 +819,10 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	INIT_LIST_HEAD(&domain->group_list);
>  	list_add(&group->next, &domain->group_list);
>  
> -	if (!allow_unsafe_interrupts &&
> -	    !iommu_capable(bus, IOMMU_CAP_INTR_REMAP)) {
> +	msi_remap = resv_msi ? irq_domain_check_msi_remap() :
> +			       iommu_capable(bus, IOMMU_CAP_INTR_REMAP);
> +
> +	if (!allow_unsafe_interrupts && !msi_remap) {
>  		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
>  		       __func__);
>  		ret = -EPERM;

I tested your v4.9-reserved-v4 branch on ITS-capable hardware (NXP
LS2080), so I did not set allow_unsafe_interrupts. The attach fails here,
complaining that there is no interrupt remapping support. The
irq_domain_check_msi_remap() function returns false because none of the
checked domains has the IRQ_DOMAIN_FLAG_MSI_REMAP flag set. I think the
reason is that the flags are not propagated through the domain hierarchy
when the domain is created.
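
A toy model of the suspected failure mode may help make this concrete. It
is not kernel code: the Domain class, flat_check() and hierarchical_check()
are hypothetical names invented for this sketch, the flag value is
arbitrary, and it assumes the flag is set only on the ITS-level domain
while the child domain stacked on top of it was created without it.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for IRQ_DOMAIN_FLAG_MSI_REMAP; the bit value here is arbitrary.
FLAG_MSI_REMAP = 1 << 0


@dataclass
class Domain:
    """Toy irq domain: a name, a flags word, and an optional parent."""
    name: str
    flags: int = 0
    parent: Optional["Domain"] = None


def flat_check(domain: Domain) -> bool:
    """Look only at the domain's own flags (the suspected failure mode)."""
    return bool(domain.flags & FLAG_MSI_REMAP)


def hierarchical_check(domain: Domain) -> bool:
    """Walk up the parent chain until some ancestor carries the flag."""
    d: Optional[Domain] = domain
    while d is not None:
        if d.flags & FLAG_MSI_REMAP:
            return True
        d = d.parent
    return False


# Hypothetical two-level hierarchy: only the ITS-level parent has the flag.
its = Domain("its", flags=FLAG_MSI_REMAP)
pci_msi = Domain("pci-msi", parent=its)

print(flat_check(pci_msi))          # False
print(hierarchical_check(pci_msi))  # True
```

In this model, either propagating the flag to the child when it is created
or walking the parent chain in the check would make the test pass; which
of the two fits the series better is a design decision for the patch.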

Thanks,

Diana

* Rewrite restorecon python method
From: Petr Lautrbach @ 2016-12-22 12:43 UTC (permalink / raw)
  To: selinux

Hi.

selinux.restorecon(path, recursive=True) uses matchpathcon() to get a
label for a file, and when the label is defined as <<None>>, it throws a
backtrace with the error:

"OSError: [Errno 2] No such file or directory"

This creates a problem for scripts which try to relabel a whole directory
tree when there is a subdirectory with a specification like this:

/var/lib/nfs/rpc_pipefs(/.*)?      all files   <<None>>

>>> selinux.restorecon('/var/lib', recursive=True)
Traceback (most recent call last):
  File "/usr/lib64/python3.5/site-packages/selinux/__init__.py", line 114, in restorecon
    status, context = matchpathcon(path, mode)
FileNotFoundError: [Errno 2] No such file or directory

At the same time, there's an RFE to rewrite restorecon() to use
selinux_restorecon(), which uses selabel_lookup() instead of the
deprecated matchpathcon() - [1]

The following 2 patches try to address the described problem using the RFE.

The first patch exports selinux_restorecon() to the SWIG bindings.
The second one rewrites the Python implementation of restorecon() to use it.

[1] https://github.com/SELinuxProject/selinux/issues/29
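
The approach boils down to mapping the Python keyword arguments onto
selinux_restorecon() flag bits. A minimal standalone sketch of that
mapping follows. The helper name build_restorecon_args() is hypothetical,
and the numeric flag values are placeholders so the snippet runs without
the bindings; the real constants come from selinux/restorecon.h.

```python
import os

# Placeholder bit values for illustration only; the real constants are
# defined in selinux/restorecon.h and exported by the SWIG bindings.
SELINUX_RESTORECON_IGNORE_DIGEST = 0x01
SELINUX_RESTORECON_RECURSE = 0x02
SELINUX_RESTORECON_VERBOSE = 0x04
SELINUX_RESTORECON_REALPATH = 0x08


def build_restorecon_args(path, recursive=False, verbose=False):
    """Return the (path, flags) pair a wrapper would hand to selinux_restorecon()."""
    # Always skip the digest check and resolve the path to its real location.
    flags = SELINUX_RESTORECON_IGNORE_DIGEST | SELINUX_RESTORECON_REALPATH
    if recursive:
        flags |= SELINUX_RESTORECON_RECURSE
    if verbose:
        flags |= SELINUX_RESTORECON_VERBOSE
    # Expand a leading "~" so callers can pass home-relative paths.
    return os.path.expanduser(path), flags


path, flags = build_restorecon_args('/var/lib', recursive=True)
print(path, bool(flags & SELINUX_RESTORECON_RECURSE))
```

Keeping the flag computation separate from the call makes it easy to
check which bits a given combination of keyword arguments produces.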

Petr

* [PATCH 1/2] libselinux: Generate SWIG wrappers for selinux_restorecon()
From: Petr Lautrbach @ 2016-12-22 12:43 UTC (permalink / raw)
  To: selinux
In-Reply-To: <20161222124309.27686-1-plautrba@redhat.com>

Signed-off-by: Petr Lautrbach <plautrba@redhat.com>
---
 libselinux/src/selinuxswig.i | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/libselinux/src/selinuxswig.i b/libselinux/src/selinuxswig.i
index c1e4ef7..687c43b 100644
--- a/libselinux/src/selinuxswig.i
+++ b/libselinux/src/selinuxswig.i
@@ -9,6 +9,7 @@
 	#include "../include/selinux/get_context_list.h"
 	#include "../include/selinux/get_default_type.h"
 	#include "../include/selinux/label.h"
+	#include "../include/selinux/restorecon.h"
 	#include "../include/selinux/selinux.h"
 %}
 %apply int *OUTPUT { int *enforce };
@@ -61,4 +62,5 @@
 %include "../include/selinux/get_context_list.h"
 %include "../include/selinux/get_default_type.h"
 %include "../include/selinux/label.h"
+%include "../include/selinux/restorecon.h"
 %include "../include/selinux/selinux.h"
-- 
2.9.3

* [PATCH 2/2] libselinux: Rewrite restorecon() python method
From: Petr Lautrbach @ 2016-12-22 12:43 UTC (permalink / raw)
  To: selinux
In-Reply-To: <20161222124309.27686-1-plautrba@redhat.com>

When the restorecon method was added to the libselinux SWIG Python
bindings, there was no libselinux restorecon implementation, so it
had to call matchpathcon(), which is deprecated in favor of
selabel_lookup().

The new restorecon method uses the selinux_restorecon() function from
libselinux, which is exported by the previous commit.

https://github.com/SELinuxProject/selinux/issues/29

Fixes:
>>> selinux.restorecon('/var/lib', recursive=True)
Traceback (most recent call last):
  File "/usr/lib64/python3.5/site-packages/selinux/__init__.py", line 114, in restorecon
    status, context = matchpathcon(path, mode)
FileNotFoundError: [Errno 2] No such file or directory

Signed-off-by: Petr Lautrbach <plautrba@redhat.com>
---
 libselinux/src/selinuxswig_python.i | 42 +++++++++++++++----------------------
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/libselinux/src/selinuxswig_python.i b/libselinux/src/selinuxswig_python.i
index a239f30..be17cef 100644
--- a/libselinux/src/selinuxswig_python.i
+++ b/libselinux/src/selinuxswig_python.i
@@ -19,31 +19,23 @@ DISABLED = -1
 PERMISSIVE = 0
 ENFORCING = 1
 
-def restorecon(path, recursive=False):
-    """ Restore SELinux context on a given path """
-
-    try:
-        mode = os.lstat(path)[stat.ST_MODE]
-        status, context = matchpathcon(path, mode)
-    except OSError:
-        path = os.path.realpath(os.path.expanduser(path))
-        mode = os.lstat(path)[stat.ST_MODE]
-        status, context = matchpathcon(path, mode)
-
-    if status == 0:
-        try:
-            status, oldcontext = lgetfilecon(path)
-        except OSError as e:
-            if e.errno != errno.ENODATA:
-                raise
-            oldcontext = None
-        if context != oldcontext:
-            lsetfilecon(path, context)
-
-        if recursive:
-            for root, dirs, files in os.walk(path):
-                for name in files + dirs:
-                   restorecon(os.path.join(root, name))
+def restorecon(path, recursive=False, verbose=False):
+    """ Restore SELinux context on a given path
+
+    Arguments:
+    path -- The pathname for the file or directory to be relabeled.
+
+    Keyword arguments:
+    recursive -- Change files and directories file labels recursively (default False)
+    verbose -- Show changes in file labels (default False)
+    """
+
+    restorecon_flags = SELINUX_RESTORECON_IGNORE_DIGEST | SELINUX_RESTORECON_REALPATH
+    if recursive:
+        restorecon_flags |= SELINUX_RESTORECON_RECURSE
+    if verbose:
+        restorecon_flags |= SELINUX_RESTORECON_VERBOSE
+    selinux_restorecon(os.path.expanduser(path), restorecon_flags)
 
 def chcon(path, context, recursive=False):
     """ Set the SELinux context on a given path """
-- 
2.9.3

* RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read
From: Dashi DS1 Cao @ 2016-12-22 12:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Peter Zijlstra
In-Reply-To: <23B7B563BA4E9446B962B142C86EF24ADBEBB6@CNMAILEX03.lenovo.com>

Value of anon_vma:

print *((struct anon_vma *)0xffff8820833ed940)
$2 = {
  root = 0x0, 
  rwsem = {
    count = 0, 
    wait_lock = {
      raw_lock = {
        {
          head_tail = 0, 
          tickets = {
            head = 0, 
            tail = 0
          }
        }
      }
    }, 
    wait_list = {
      next = 0x0, 
      prev = 0x0
    }
  }, 
  refcount = {
    counter = 0
  }, 
  rb_root = {
    rb_node = 0x0
  }
}
crash>

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Dashi DS1 Cao
Sent: Thursday, December 22, 2016 7:53 PM
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Peter Zijlstra <peterz@infradead.org>
Subject: RE: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

I've used another dump with a similar backtrace.

PID: 246    TASK: ffff881fd27df300  CPU: 0   COMMAND: "kswapd0"
 #0 [ffff881fcfb23748] machine_kexec at ffffffff81051e9b
 #1 [ffff881fcfb237a8] crash_kexec at ffffffff810f27e2
 #2 [ffff881fcfb23878] oops_end at ffffffff8163f448
 #3 [ffff881fcfb238a0] no_context at ffffffff8162f561
 #4 [ffff881fcfb238f0] __bad_area_nosemaphore at ffffffff8162f5f7
 #5 [ffff881fcfb23938] bad_area_nosemaphore at ffffffff8162f761
 #6 [ffff881fcfb23948] __do_page_fault at ffffffff816421ce
 #7 [ffff881fcfb239a8] do_page_fault at ffffffff81642363
 #8 [ffff881fcfb239d0] page_fault at ffffffff8163e648
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff810aa9f9  RSP: ffff881fcfb23a88  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff8820833ed940  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000008
    RBP: ffff881fcfb23a88   R8: ffffea00779b3a60   R9: ffff883fd0d7b070
    R10: 000000000000000e  R11: ffffea00049e9580  R12: ffff8820833ed941
    R13: ffffea00779b3a40  R14: 0000000000000008  R15: ffffea00779b3a40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff881fcfb23a90] page_lock_anon_vma_read at ffffffff811a3365
#10 [ffff881fcfb23ac0] page_referenced at ffffffff811a35e7
#11 [ffff881fcfb23b38] shrink_active_list at ffffffff8117e8cc
#12 [ffff881fcfb23bf0] shrink_lruvec at ffffffff8117ef8d
#13 [ffff881fcfb23cf0] shrink_zone at ffffffff8117f2a6
#14 [ffff881fcfb23d48] balance_pgdat at ffffffff8118054c
#15 [ffff881fcfb23e20] kswapd at ffffffff81180813
#16 [ffff881fcfb23ec8] kthread at ffffffff810a5b8f
#17 [ffff881fcfb23f50] ret_from_fork at ffffffff81646a98
crash> print *((struct page *)0xffffea00779b3a40)
$1 = {
  flags = 13510794587668552,
  mapping = 0xffff8820833ed941,
  {
    {
      index = 34194823743, 
      freelist = 0x7f62b9a3f, 
      pfmemalloc = 63, 
      pmd_huge_pte = 0x7f62b9a3f
    }, 
    {
      counters = 8589934592, 
      {
        {
          _mapcount = {
            counter = 0
          }, 
          {
            inuse = 0, 
            objects = 0, 
            frozen = 0
          }, 
          units = 0
        }, 
        _count = {
          counter = 2
         }
      }
    }
  },
  {
    lru = {
      next = 0xdead000000100100, 
      prev = 0xdead000000200200
    }, 
    {
      next = 0xdead000000100100, 
      pages = 2097664, 
      pobjects = -559087616
    }, 
    list = {
      next = 0xdead000000100100, 
      prev = 0xdead000000200200
    }, 
    slab_page = 0xdead000000100100
  },
  {
    private = 0, 
    ptl = {
      {
        rlock = {
          raw_lock = {
             {
              head_tail = 0, 
              tickets = {
                head = 0, 
                tail = 0
              }
            }
          }
        }
      }
    }, 
    slab_cache = 0x0, 
    first_page = 0x0
  }
}
crash>  disassemble page_lock_anon_vma_read
Dump of assembler code for function page_lock_anon_vma_read:
   0xffffffff811a3310 <+0>:     nopl   0x0(%rax,%rax,1)
   0xffffffff811a3315 <+5>:     push   %rbp
   0xffffffff811a3316 <+6>:     mov    %rsp,%rbp
   0xffffffff811a3319 <+9>:     push   %r14
   0xffffffff811a331b <+11>:    push   %r13
   0xffffffff811a331d <+13>:    mov    %rdi,%r13
   0xffffffff811a3320 <+16>:    push   %r12
   0xffffffff811a3322 <+18>:    push   %rbx
   0xffffffff811a3323 <+19>:    mov    0x8(%rdi),%r12
   0xffffffff811a3327 <+23>:    mov    %r12,%rax
   0xffffffff811a332a <+26>:    and    $0x3,%eax
   0xffffffff811a332d <+29>:    cmp    $0x1,%rax
   0xffffffff811a3331 <+33>:    je     0xffffffff811a3348 <page_lock_anon_vma_read+56>
   0xffffffff811a3333 <+35>:    xor    %ebx,%ebx
   0xffffffff811a3335 <+37>:    mov    %rbx,%rax
   0xffffffff811a3338 <+40>:    pop    %rbx
   0xffffffff811a3339 <+41>:    pop    %r12
   0xffffffff811a333b <+43>:    pop    %r13
   0xffffffff811a333d <+45>:    pop    %r14
   0xffffffff811a333f <+47>:    pop    %rbp
   0xffffffff811a3340 <+48>:    retq   
   0xffffffff811a3341 <+49>:    nopl   0x0(%rax)
   0xffffffff811a3348 <+56>:    mov    0x18(%rdi),%eax
   0xffffffff811a334b <+59>:    test   %eax,%eax
   0xffffffff811a334d <+61>:    js     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a334f <+63>:    mov    -0x1(%r12),%r14
   0xffffffff811a3354 <+68>:    lea    -0x1(%r12),%rbx
   0xffffffff811a3359 <+73>:    add    $0x8,%r14
   0xffffffff811a335d <+77>:    mov    %r14,%rdi
   0xffffffff811a3360 <+80>:    callq  0xffffffff810aa9f0 <down_read_trylock>
   0xffffffff811a3365 <+85>:    test   %eax,%eax
   0xffffffff811a3367 <+87>:    je     0xffffffff811a3380 <page_lock_anon_vma_read+112>
   0xffffffff811a3369 <+89>:    mov    0x18(%r13),%eax
   0xffffffff811a336d <+93>:    test   %eax,%eax
   0xffffffff811a336f <+95>:    jns    0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a3371 <+97>:    mov    %r14,%rdi
   0xffffffff811a3374 <+100>:   xor    %ebx,%ebx
   0xffffffff811a3376 <+102>:   callq  0xffffffff810aaa50 <up_read>
   0xffffffff811a337b <+107>:   jmp    0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a337d <+109>:   nopl   (%rax)
   0xffffffff811a3380 <+112>:   mov    0x28(%rbx),%edx
   0xffffffff811a3383 <+115>:   test   %edx,%edx
   0xffffffff811a3385 <+117>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a3387 <+119>:   lea    0x1(%rdx),%ecx
   0xffffffff811a338a <+122>:   lea    0x27(%r12),%rsi
   0xffffffff811a338f <+127>:   mov    %edx,%eax
   0xffffffff811a3391 <+129>:   lock cmpxchg %ecx,0x27(%r12)
   0xffffffff811a3398 <+136>:   cmp    %edx,%eax
   0xffffffff811a339a <+138>:   mov    %eax,%ecx
   0xffffffff811a339c <+140>:   jne    0xffffffff811a3402 <page_lock_anon_vma_read+242>
   0xffffffff811a339e <+142>:   mov    0x18(%r13),%eax
   0xffffffff811a33a2 <+146>:   test   %eax,%eax
   0xffffffff811a33a4 <+148>:   js     0xffffffff811a33e2 <page_lock_anon_vma_read+210>
   0xffffffff811a33a6 <+150>:   mov    -0x1(%r12),%rax
   0xffffffff811a33ab <+155>:   lea    0x8(%rax),%rdi
   0xffffffff811a33af <+159>:   callq  0xffffffff8163ad30 <down_read>
   0xffffffff811a33b4 <+164>:   lock decl 0x27(%r12)
   0xffffffff811a33ba <+170>:   sete   %al
   0xffffffff811a33bd <+173>:   test   %al,%al
   0xffffffff811a33bf <+175>:   je     0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a33c5 <+181>:   mov    -0x1(%r12),%rdi
   0xffffffff811a33ca <+186>:   add    $0x8,%rdi
   0xffffffff811a33ce <+190>:   callq  0xffffffff810aaa50 <up_read>
   0xffffffff811a33d3 <+195>:   mov    %rbx,%rdi
   0xffffffff811a33d6 <+198>:   xor    %ebx,%ebx
   0xffffffff811a33d8 <+200>:   callq  0xffffffff811a2dd0 <__put_anon_vma>
   0xffffffff811a33dd <+205>:   jmpq   0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a33e2 <+210>:   lock decl 0x27(%r12)
   0xffffffff811a33e8 <+216>:   sete   %al
   0xffffffff811a33eb <+219>:   test   %al,%al
   0xffffffff811a33ed <+221>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a33f3 <+227>:   mov    %rbx,%rdi
   0xffffffff811a33f6 <+230>:   xor    %ebx,%ebx
   0xffffffff811a33f8 <+232>:   callq  0xffffffff811a2dd0 <__put_anon_vma>
   0xffffffff811a33fd <+237>:   jmpq   0xffffffff811a3335 <page_lock_anon_vma_read+37>
   0xffffffff811a3402 <+242>:   test   %ecx,%ecx
   0xffffffff811a3404 <+244>:   je     0xffffffff811a3333 <page_lock_anon_vma_read+35>
   0xffffffff811a340a <+250>:   lea    0x1(%rcx),%edx
   0xffffffff811a340d <+253>:   mov    %ecx,%eax
   0xffffffff811a340f <+255>:   lock cmpxchg %edx,(%rsi)
   0xffffffff811a3413 <+259>:   cmp    %eax,%ecx
   0xffffffff811a3415 <+261>:   je     0xffffffff811a339e <page_lock_anon_vma_read+142>
   0xffffffff811a3417 <+263>:   mov    %eax,%ecx
   0xffffffff811a3419 <+265>:   jmp    0xffffffff811a3402 <page_lock_anon_vma_read+242>
End of assembler dump.
crash>  
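
The register values can be cross-checked against the two dumps with a
little arithmetic. This is only a sketch of one reading of the data: the
variable names are made up for the example, and the 0x8 offset of rwsem
within struct anon_vma is taken from the "add $0x8,%r14" instruction
above rather than from the structure definition.

```python
PAGE_MAPPING_ANON = 0x1   # low bit tagging page->mapping as an anon_vma pointer
RWSEM_OFFSET = 0x8        # from "add $0x8,%r14" in the disassembly

page_mapping = 0xffff8820833ed941   # page->mapping in the struct page dump, also R12
anon_vma_root = 0x0                 # anon_vma->root as printed in the anon_vma dump

# lea -0x1(%r12),%rbx : strip the tag bit to get the anon_vma pointer
anon_vma = page_mapping - PAGE_MAPPING_ANON

# mov -0x1(%r12),%r14 ; add $0x8,%r14 : address of root->rwsem
rwsem_addr = anon_vma_root + RWSEM_OFFSET

print(hex(anon_vma))     # 0xffff8820833ed940, matches RBX
print(hex(rwsem_addr))   # 0x8, matches RDI and R14
```

So in this dump down_read_trylock() was handed the address 0x8, that is,
&root->rwsem computed from a NULL root, which is consistent with the page
fault at down_read_trylock+9.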

Dashi Cao
-----Original Message-----
From: Michal Hocko [mailto:mhocko@kernel.org]
Sent: Wednesday, December 21, 2016 10:44 PM
To: Dashi DS1 Cao <caods1@lenovo.com>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; Peter Zijlstra <peterz@infradead.org>
Subject: Re: A small window for a race condition in mm/rmap.c:page_lock_anon_vma_read

anon_vma locking is clever^Wsubtle as hell. CC Peter...

On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> I've collected four crash dumps with a similar backtrace.
> 
> PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
>  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
>  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
>  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
>  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
>  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
>     [exception RIP: down_read_trylock+9]
>     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
>     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8

rdi is obviously a mess - smells like a string. So either somebody has overwritten root_anon_vma or this is really a use after free...

>     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
>     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
>     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
>  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
>  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
>  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> 
> I suspect my customer is hitting a small window of a race condition in mm/rmap.c:page_lock_anon_vma_read().
> struct anon_vma *page_lock_anon_vma_read(struct page *page) {
>         struct anon_vma *anon_vma = NULL;
>         struct anon_vma *root_anon_vma;
>         unsigned long anon_mapping;
> 
>         rcu_read_lock();
>         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         if (!page_mapped(page))
>                 goto out;
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>         root_anon_vma = READ_ONCE(anon_vma->root);

Could you dump the anon_vma and struct page as well?

>         if (down_read_trylock(&root_anon_vma->rwsem)) {
>                 /*
>                  * If the page is still mapped, then this anon_vma is still
>                  * its anon_vma, and holding the mutex ensures that it will
>                  * not go away, see anon_vma_free().
>                  */
>                 if (!page_mapped(page)) {
>                         up_read(&root_anon_vma->rwsem);
>                         anon_vma = NULL;
>                 }
>                 goto out;
>         }
> ...
> }
> 
> Between the two "page_mapped(page)" checks, the address
> (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However, it seems that
> anon_vma->root can still be read, but the value is wild. So the
> kernel crashes in down_read_trylock(). But it's weird that in all four
> crashes the "struct page" member "_mapcount" still has the value 0,
> not -1.

--
Michal Hocko
SUSE Labs



On Tue 20-12-16 09:32:27, Dashi DS1 Cao wrote:
> I've collected four crash dumps with similar backtrace. 
> 
> PID: 247    TASK: ffff881fcfad8000  CPU: 14  COMMAND: "kswapd1"
>  #0 [ffff881fcfad7978] machine_kexec at ffffffff81051e9b
>  #1 [ffff881fcfad79d8] crash_kexec at ffffffff810f27e2
>  #2 [ffff881fcfad7aa8] oops_end at ffffffff8163f448
>  #3 [ffff881fcfad7ad0] die at ffffffff8101859b
>  #4 [ffff881fcfad7b00] do_general_protection at ffffffff8163ed3e
>  #5 [ffff881fcfad7b30] general_protection at ffffffff8163e5e8
>     [exception RIP: down_read_trylock+9]
>     RIP: ffffffff810aa9f9  RSP: ffff881fcfad7be0  RFLAGS: 00010286
>     RAX: 0000000000000000  RBX: ffff882b47ddadc0  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 91550b2b32f5a3e8

rdi is obviously a mess - smells like a string. So either somebody has overwritten root_anon_vma or this is really a use-after-free...
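
As a sanity check on the register dump, the addresses can be decoded by hand. The sketch below is only an illustration under stated assumptions: PAGE_MAPPING_ANON is 1 and PAGE_MAPPING_FLAGS is 3 (as in mm.h of kernels from that period), rwsem sits at offset 8 in struct anon_vma (taken from the "add $0x8,%r14" in the disassembly posted in this thread), and virtual addresses are 48 bits wide. The helper names are made up for this example.

```c
#include <stdint.h>

/* Assumed values, see the lead-in; not read from the crashed kernel. */
#define PAGE_MAPPING_ANON   0x1ULL
#define PAGE_MAPPING_FLAGS  0x3ULL
#define ANON_VMA_RWSEM_OFF  8ULL

/* With 48-bit virtual addresses, bits 63..47 must be all 0 or all 1. */
static int is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}

/* Mirror of the pointer arithmetic in page_lock_anon_vma_read(). */
static uint64_t anon_vma_from_mapping(uint64_t mapping)
{
	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return 0;
	return mapping - PAGE_MAPPING_ANON;
}

/* The rwsem pointer passed to down_read_trylock() is root + offset. */
static uint64_t root_from_rwsem(uint64_t rwsem)
{
	return rwsem - ANON_VMA_RWSEM_OFF;
}
```

With R12 = ffff882b47ddadc1 this gives anon_vma = ffff882b47ddadc0, which matches RBX and is canonical. R14 = 91550b2b32f5a3e8 gives root = 91550b2b32f5a3e0, which is not canonical. A non-canonical access raises a general protection fault rather than a page fault, which fits the backtrace.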

>     RBP: ffff881fcfad7be0   R8: ffffea00ecc28860   R9: ffff883fcffeae28
>     R10: ffffffff81a691a0  R11: 0000000000000001  R12: ffff882b47ddadc1
>     R13: ffffea00ecc28840  R14: 91550b2b32f5a3e8  R15: ffffea00ecc28840
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>  #6 [ffff881fcfad7be8] page_lock_anon_vma_read at ffffffff811a3365
>  #7 [ffff881fcfad7c18] page_referenced at ffffffff811a35e7
>  #8 [ffff881fcfad7c90] shrink_active_list at ffffffff8117e8cc
>  #9 [ffff881fcfad7d48] balance_pgdat at ffffffff81180288
> #10 [ffff881fcfad7e20] kswapd at ffffffff81180813
> #11 [ffff881fcfad7ec8] kthread at ffffffff810a5b8f
> #12 [ffff881fcfad7f50] ret_from_fork at ffffffff81646a98
> 
> I suspect my customer hits into a small window of a race condition in mm/rmap.c: page_lock_anon_vma_read.
> struct anon_vma *page_lock_anon_vma_read(struct page *page) {
>         struct anon_vma *anon_vma = NULL;
>         struct anon_vma *root_anon_vma;
>         unsigned long anon_mapping;
> 
>         rcu_read_lock();
>         anon_mapping = (unsigned long)READ_ONCE(page->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
>         if (!page_mapped(page))
>                 goto out;
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>         root_anon_vma = READ_ONCE(anon_vma->root);

Could you dump the anon_vma and struct page as well?

>         if (down_read_trylock(&root_anon_vma->rwsem)) {
>                 /*
>                  * If the page is still mapped, then this anon_vma is still
>                  * its anon_vma, and holding the mutex ensures that it will
>                  * not go away, see anon_vma_free().
>                  */
>                 if (!page_mapped(page)) {
>                         up_read(&root_anon_vma->rwsem);
>                         anon_vma = NULL;
>                 }
>                 goto out;
>         }
> ...
> }
> 
> Between the time the two "page_mapped(page)" are checked, the address 
> (anon_mapping - PAGE_MAPPING_ANON) is unmapped! However it seems that 
> anon_vma->root could still be read in but the value is wild. So the 
> kernel crashes in down_read_trylock. But it's weird that all the 
> "struct page" has its member "_mapcount" still with value 0, not -1, 
> in the four crashes.

--
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 2/3] gummiboot: Remove old gummiboot recipe, related class and wks file
From: Burton, Ross @ 2016-12-22 12:44 UTC (permalink / raw)
  To: Alejandro Hernandez; +Cc: OE-core
In-Reply-To: <73f28b0aa183538fd3471df4e5d748488a5c30d2.1482339995.git.alejandro.hernandez@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 1360 bytes --]

On 21 December 2016 at 17:08, Alejandro Hernandez <
alejandro.hernandez@linux.intel.com> wrote:

>  delete mode 100644 scripts/lib/wic/canned-wks/mkgummidisk.wks
>

I'm guessing this is why selftest fails:

FAIL: test_mkgummidisk (oeqa.selftest.wic.Wic)
Test creation of mkgummidisk image
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pokybuild/yocto-autobuilder/yocto-worker/nightly-oe-selftest/build/meta/lib/oeqa/utils/decorators.py", line 109, in wrapped_f
    return func(*args, **kwargs)
  File "/home/pokybuild/yocto-autobuilder/yocto-worker/nightly-oe-selftest/build/meta/lib/oeqa/selftest/wic.py", line 169, in test_mkgummidisk
    self.assertEqual(0, runCmd(cmd).status)
  File "/home/pokybuild/yocto-autobuilder/yocto-worker/nightly-oe-selftest/build/meta/lib/oeqa/utils/commands.py", line 121, in runCmd
    raise AssertionError("Command '%s' returned non-zero exit status %d:\n%s" % (command, result.status, result.output))
AssertionError: Command 'wic create mkgummidisk --image-name core-image-minimal' returned non-zero exit status 1:
Checking basic build environment...
Done.

No image named mkgummidisk found, exiting.  (Use 'wic list images' to list
available images, or specify a fully-qualified OE kickstart (.wks) filename)

Ross

[-- Attachment #2: Type: text/html, Size: 2029 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] x86/emul: Pass shadow register state to the vmfunc() hook
From: Andrew Cooper @ 2016-12-22 12:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, Xen-devel
In-Reply-To: <585B9C9C020000780012B8A9@prv-mh.provo.novell.com>

On 22/12/16 08:27, Jan Beulich wrote:
>>>> On 21.12.16 at 17:32, <andrew.cooper3@citrix.com> wrote:
>> vmfunc can in principle modify register state, so should operate on the shadow
>> register state rather than the starting state of emulation.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> While in principle this is fine, I'd rather see the register state
> constified for now, to demonstrate it is not being modified. I'll
> submit my two remaining follow-up patches in a minute, and
> we can then discuss which of the two to take.

The question here is how likely it is that new functionality for VMFUNC 
will be defined, which starts mutating the values.

I am not aware of anything new, so let's go with the const version for
now (as it takes one fewer parameter).  If this changes in the future, we
can easily switch back to passing the shadow register block.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply

* Q: nvme_rdma and reconnect
From: Hannes Reinecke @ 2016-12-22 12:46 UTC (permalink / raw)

In-Reply-To: <9576f8bf-e2df-d00f-edb6-8f17ae188980@grimberg.me>

On 12/22/2016 01:08 PM, Sagi Grimberg wrote:
>
>>> Sagi, Christoph,
>>>
>>> Can you explain what the difference is between the "reset" path and the
>>> "error/reconnect" path is in the rdma driver.  From my point of view, it
>>> would seem both, relative to the fabric-side of the transport, are
>>> terminating the controller and reconnecting to a new controller to
>>> recover.
>>> So why wouldn't they be the same (single) reset flow ?
>>
>> They should use the same flow.  A couple month ago I had a prototype
>> for that but never got it to work fully.
>
> One more distinction is that reconnect failures will retry periodically
> while reset failure will remove the device (aligns with the pci driver
> behavior).
>
> We can go via the same flow and condition on the state for the
> differences, but I'm not sure its easier to understand than two
> distinct routines (although that share a lot of code).
>
And keeping in mind that the reset path will be a killer for any 
prospective multipath scenario; if you need to remove the device to 
reset you are guaranteed to _never_ get it back under memory pressure.

So please do not enforce a reset for all cases.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare at suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH v2] ethdev: cleanup device ops struct whitespace
From: Thomas Monjalon @ 2016-12-22 12:46 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev
In-Reply-To: <20161222115330.7164-1-ferruh.yigit@intel.com>

2016-12-22 11:53, Ferruh Yigit:
> To make it easy to comment to latest struct, copy-paste here:
> [With some extra notes]
> 
> struct eth_dev_ops {
> 	eth_dev_configure_t        dev_configure; /**< Configure device. */
> 	eth_dev_start_t            dev_start;     /**< Start device. */
> 	eth_dev_stop_t             dev_stop;      /**< Stop device. */
> 	eth_dev_set_link_up_t      dev_set_link_up;   /**< Device link up. */
> 	eth_dev_set_link_down_t    dev_set_link_down; /**< Device link down. */
> 	eth_dev_close_t            dev_close;     /**< Close device. */
> 	eth_promiscuous_enable_t   promiscuous_enable; /**< Promiscuous ON. */
> 	eth_promiscuous_disable_t  promiscuous_disable;/**< Promiscuous OFF. */
> 	eth_allmulticast_enable_t  allmulticast_enable;/**< RX multicast ON. */
> 	eth_allmulticast_disable_t allmulticast_disable;/**< RX multicast OFF. */
> 	eth_link_update_t          link_update;   /**< Get device link state. */
> 
> 	eth_stats_get_t            stats_get;     /**< Get generic device statistics. */
> 	eth_stats_reset_t          stats_reset;   /**< Reset generic device statistics. */
> 	eth_xstats_get_t           xstats_get;    /**< Get extended device statistics. */
> 	eth_xstats_reset_t         xstats_reset;  /**< Reset extended device statistics. */
> 	eth_xstats_get_names_t     xstats_get_names;
> 	/**< Get names of extended statistics. */
> 	eth_queue_stats_mapping_set_t queue_stats_mapping_set;
> 	/**< Configure per queue stat counter mapping. */
> 
> 	eth_dev_infos_get_t        dev_infos_get; /**< Get device info. */
> 	eth_dev_supported_ptypes_get_t dev_supported_ptypes_get;
> 	/**< Get packet types supported and identified by device. */
> 
> 	mtu_set_t                  mtu_set;       /**< Set MTU. */
> 
> 	vlan_filter_set_t          vlan_filter_set; /**< Filter VLAN Setup. */
> 	vlan_tpid_set_t            vlan_tpid_set; /**< Outer/Inner VLAN TPID Setup. */
> 	vlan_strip_queue_set_t     vlan_strip_queue_set; /**< VLAN Stripping on queue. */
> 	vlan_offload_set_t         vlan_offload_set; /**< Set VLAN Offload. */
> 	vlan_pvid_set_t            vlan_pvid_set; /**< Set port based TX VLAN insertion. */
> 
> 	eth_queue_start_t          rx_queue_start;/**< Start RX for a queue. */
> 	eth_queue_stop_t           rx_queue_stop; /**< Stop RX for a queue. */
> 	eth_queue_start_t          tx_queue_start;/**< Start TX for a queue. */
> 	eth_queue_stop_t           tx_queue_stop; /**< Stop TX for a queue. */
> 	eth_rx_queue_setup_t       rx_queue_setup;/**< Set up device RX queue. */
> 	eth_queue_release_t        rx_queue_release; /**< Release RX queue. */
> 	eth_rx_queue_count_t       rx_queue_count;/**< Get Rx queue count. */
> 	eth_rx_descriptor_done_t   rx_descriptor_done; /**< Check rxd DD bit. */
> 	eth_rx_enable_intr_t       rx_queue_intr_enable;  /**< Enable Rx queue interrupt. */
> 	eth_rx_disable_intr_t      rx_queue_intr_disable; /**< Disable Rx queue interrupt. */
> 	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device TX queue. */
> 	eth_queue_release_t        tx_queue_release; /**< Release TX queue. */
> 
> 	eth_dev_led_on_t           dev_led_on;    /**< Turn on LED. */	[Really need these comments?]
> 	eth_dev_led_off_t          dev_led_off;   /**< Turn off LED. */
> 
> 	flow_ctrl_get_t            flow_ctrl_get; /**< Get flow control. */
> 	flow_ctrl_set_t            flow_ctrl_set; /**< Setup flow control. */
> 	priority_flow_ctrl_set_t   priority_flow_ctrl_set; /**< Setup priority flow control. */
> 
> 	eth_mac_addr_remove_t      mac_addr_remove; /**< Remove MAC address. */
> 	eth_mac_addr_add_t         mac_addr_add;  /**< Add a MAC address. */
> 	eth_mac_addr_set_t         mac_addr_set;  /**< Set a MAC address. */
> 	eth_set_mc_addr_list_t     set_mc_addr_list; /**< set list of mcast addrs. */

Could we group the MAC functions with promiscuous and allmulticast?

> 	eth_uc_hash_table_set_t    uc_hash_table_set; /**< Set Unicast Table Array. */
> 	eth_uc_all_hash_table_set_t uc_all_hash_table_set; /**< Set Unicast hash bitmap. */
> 
> 	eth_mirror_rule_set_t	   mirror_rule_set; /**< Add a traffic mirror rule. */
> 	eth_mirror_rule_reset_t	   mirror_rule_reset; /**< reset a traffic mirror rule. */
> 
> 	[Following already removed from next-net]
> 	eth_set_vf_rx_mode_t       set_vf_rx_mode;/**< Set VF RX mode. */
> 	eth_set_vf_rx_t            set_vf_rx;     /**< enable/disable a VF receive. */
> 	eth_set_vf_tx_t            set_vf_tx;     /**< enable/disable a VF transmit. */
> 	eth_set_vf_vlan_filter_t   set_vf_vlan_filter; /**< Set VF VLAN filter. */
> 	eth_set_vf_rate_limit_t    set_vf_rate_limit; /**< Set VF rate limit. */
> 
> 	eth_udp_tunnel_port_add_t  udp_tunnel_port_add; /** Add UDP tunnel port. */
> 	eth_udp_tunnel_port_del_t  udp_tunnel_port_del; /** Del UDP tunnel port. */
> 
> 	eth_set_queue_rate_limit_t set_queue_rate_limit; /**< Set queue rate limit. */
> 
> 	rss_hash_update_t          rss_hash_update; /** Configure RSS hash protocols. */
> 	rss_hash_conf_get_t        rss_hash_conf_get; /** Get current RSS hash configuration. */
> 	reta_update_t              reta_update;   /** Update redirection table. */
> 	reta_query_t               reta_query;    /** Query redirection table. */
> 
> 	eth_get_reg_t              get_reg;           /**< Get registers. */
> 	eth_get_eeprom_length_t    get_eeprom_length; /**< Get eeprom length. */
> 	eth_get_eeprom_t           get_eeprom;        /**< Get eeprom data. */
> 	eth_set_eeprom_t           set_eeprom;        /**< Set eeprom. */
> 
> 	/* bypass control */
> 	bypass_init_t              bypass_init;
> 	bypass_state_set_t         bypass_state_set;
> 	bypass_state_show_t        bypass_state_show;
> 	bypass_event_set_t         bypass_event_set;
> 	bypass_event_show_t        bypass_event_show;
> 	bypass_wd_timeout_set_t    bypass_wd_timeout_set;
> 	bypass_wd_timeout_show_t   bypass_wd_timeout_show;
> 	bypass_ver_show_t          bypass_ver_show;
> 	bypass_wd_reset_t          bypass_wd_reset;
> 
> 	eth_filter_ctrl_t          filter_ctrl; /**< common filter control. */
> 
> 	eth_rxq_info_get_t         rxq_info_get; /**< retrieve RX queue information. */
> 	eth_txq_info_get_t         txq_info_get; /**< retrieve TX queue information. */

It can be grouped with dev_infos_get

> 	eth_get_dcb_info           get_dcb_info; /** Get DCB information. */
> 
> 	eth_timesync_enable_t      timesync_enable;
> 	/** Turn IEEE1588/802.1AS timestamping on. */
> 	eth_timesync_disable_t     timesync_disable;
> 	/** Turn IEEE1588/802.1AS timestamping off. */
> 	eth_timesync_read_rx_timestamp_t timesync_read_rx_timestamp;
> 	/** Read the IEEE1588/802.1AS RX timestamp. */
> 	eth_timesync_read_tx_timestamp_t timesync_read_tx_timestamp;
> 	/** Read the IEEE1588/802.1AS TX timestamp. */
> 	eth_timesync_adjust_time   timesync_adjust_time; /** Adjust the device clock. */
> 	eth_timesync_read_time     timesync_read_time; /** Get the device clock time. */
> 	eth_timesync_write_time    timesync_write_time; /** Set the device clock time. */
> 
> 	eth_l2_tunnel_eth_type_conf_t l2_tunnel_eth_type_conf;
> 	/** Config ether type of l2 tunnel. */
> 	eth_l2_tunnel_offload_set_t   l2_tunnel_offload_set;
> 	/** Enable/disable l2 tunnel offload functions. */

Could this be grouped with the other tunnel functions?

> };

^ permalink raw reply

* [PATCH v4 06/12] dt: bindings: Add bindings for Marvell Xenon SD Host Controller
From: Ziji Hu @ 2016-12-22 12:47 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161222111802.GX14217@n2100.armlinux.org.uk>

Hi Russell,

On 2016/12/22 19:18, Russell King - ARM Linux wrote:
> On Tue, Dec 13, 2016 at 06:48:35PM +0100, Gregory CLEMENT wrote:
>> +Optional Properties:
>> +- mmc-card:
>> +  mmc-card child node must be provided when current SDHC is for eMMC.
>> +  Xenon SDHC often can support both SD and eMMC. This child node indicates that
>> +  current SDHC is for eMMC card. Thus Xenon eMMC specific configuration and
>> +  operations can be enabled prior to eMMC init sequence.
>> +  Please refer to Documentation/devicetree/bindings/mmc/mmc-card.txt.
>> +  This child node should not be set if current Xenon SDHC is for SD/SDIO.
> 
> This looks like a typo - shouldn't it be "mmccard" and not "mmc-card"?
> Your examples below use "mmccard" as does the documentation you point
> towards.
>

    Thanks a lot for the review.

    I mixed it up with the compatible string "mmc-card".
    I will change it to the sub-node name in the next version.

    Thank you.

Best regards,
Hu Ziji

^ permalink raw reply

* Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Hannes Frederic Sowa @ 2016-12-22 12:47 UTC (permalink / raw)
  To: Theodore Ts'o, kernel-hardening
  Cc: Andy Lutomirski, Netdev, LKML, Linux Crypto Mailing List,
	David Laight, Eric Dumazet, Linus Torvalds, Eric Biggers,
	Tom Herbert, Andi Kleen, David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <20161222054125.lzxhd6ctovm3wk4p@thunk.org>

Hi Ted,

On Thu, 2016-12-22 at 00:41 -0500, Theodore Ts'o wrote:
> On Thu, Dec 22, 2016 at 03:49:39AM +0100, Jason A. Donenfeld wrote:
> > 
> > Funny -- while you guys were sending this back & forth, I was writing
> > my reply to Andy which essentially arrives at the same conclusion.
> > Given that we're all arriving to the same thing, and that Ted shot in
> > this direction long before we all did, I'm leaning toward abandoning
> > SipHash for the de-MD5-ification of get_random_int/long, and working
> > on polishing Ted's idea into something shiny for this patchset.
> 
> here are my numbers comparing siphash (using the first three patches
> of the v7 siphash patches) with my batched chacha20 implementation.
> The results are taken by running get_random_* 10000 times, and then
> dividing the numbers by 10000 to get the average number of cycles for
> the call.  I compiled 32-bit and 64-bit kernels, and ran the results
> using kvm:
> 
>                    siphash                        batched chacha20
>          get_random_int  get_random_long   get_random_int   get_random_long   
> 
> 32-bit    270                  278             114            146
> 64-bit     75                   75             106            186
> 
> > I did have two objections to this. The first was that my SipHash
> > construction is faster.
> 
> Well, it's faster on everything except 32-bit x86.  :-P
> 
> > The second, and the more
> > important one, was that batching entropy up like this means that 32
> > calls will be really fast, and then the 33rd will be slow, since it
> > has to do a whole ChaCha round, because get_random_bytes must be
> > called to refill the batch.
> 
> ... and this will take 2121 cycles on 64-bit x86, and 2315 cycles on a
> 32-bit x86.  Which on a 2.3 GHz processor, is just under a
> microsecond.  As far as being inconsistent on process startup, I very
> much doubt a microsecond is really going to be visible to the user.  :-)
> 
> The bottom line is that I think we're really "pixel peeping" at this
> point --- which is what obsessed digital photographers will do when
> debating the quality of a Canon vs Nikon DSLR by blowing up a photo by
> a thousand times, and then trying to claim that this is visible to the
> human eye.  Or people who obsessing over the frequency response curves
> of TH-X00 headphones with Mahogony vs Purpleheart wood, when it's
> likely that in a blind head-to-head comparison, most people wouldn't
> be able to tell the difference....
> 
> I think the main argument for using the batched getrandom approach is
> that it, I would argue, simpler than introducing siphash into the
> picture.  On 64-bit platforms it is faster and more consistent, so
> it's basically that versus complexity of having to adding siphash to
> the things that people have to analyze when considering random number
> security on Linux.   But it's a close call either way, I think.

following up on what appears to be a random subject: ;)

IIRC, ext4 code by default still uses half_md4 for hashing of filenames
in the htree. siphash seems to fit this use case pretty well.

xfs could also use an update, as they don't seed the directory hash
tables at all (though I am not sure whether they are vulnerable). I
should improve [1] a bit.
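
To make the seeding point concrete, here is a minimal sketch of a keyed directory-name hash built on SipHash-2-4. It is written from the published SipHash description, not taken from the patches in this thread. dir_name_hash() and the way the two key words are passed in are assumptions for illustration only, and nothing here reflects how ext4 or xfs actually hash names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint64_t rotl64(uint64_t x, int b)
{
	return (x << b) | (x >> (64 - b));
}

/* Read 8 bytes as a little-endian 64-bit word. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

#define SIPROUND do { \
	v0 += v1; v1 = rotl64(v1, 13); v1 ^= v0; v0 = rotl64(v0, 32); \
	v2 += v3; v3 = rotl64(v3, 16); v3 ^= v2; \
	v0 += v3; v3 = rotl64(v3, 21); v3 ^= v0; \
	v2 += v1; v1 = rotl64(v1, 17); v1 ^= v2; v2 = rotl64(v2, 32); \
} while (0)

/* SipHash-2-4: 2 compression rounds per word, 4 finalization rounds. */
static uint64_t siphash24(const uint8_t *in, size_t len,
			  uint64_t k0, uint64_t k1)
{
	uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
	uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
	uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
	uint64_t v3 = k1 ^ 0x7465646279746573ULL;
	uint64_t b = (uint64_t)len << 56;	/* length goes in the top byte */
	size_t full = len & ~(size_t)7;

	for (size_t i = 0; i < full; i += 8) {
		uint64_t m = load_le64(in + i);

		v3 ^= m;
		SIPROUND; SIPROUND;
		v0 ^= m;
	}
	/* Fold the 0..7 trailing bytes into the final word. */
	for (size_t j = 0; j < (len & 7); j++)
		b |= (uint64_t)in[full + j] << (8 * j);

	v3 ^= b;
	SIPROUND; SIPROUND;
	v0 ^= b;
	v2 ^= 0xff;
	SIPROUND; SIPROUND; SIPROUND; SIPROUND;
	return v0 ^ v1 ^ v2 ^ v3;
}

/* Hypothetical helper: hash a file name under a per-filesystem seed. */
static uint64_t dir_name_hash(const char *name, uint64_t seed0, uint64_t seed1)
{
	return siphash24((const uint8_t *)name, strlen(name), seed0, seed1);
}
```

The same name hashed under two different seeds ends up in unrelated places, so someone who does not know the seed should not be able to precompute colliding names.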

[1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=blob;f=src/dirhash_collide.c;h=55cec872d5061ac2ca0f56d1f11e6bf349d5bb97;hb=HEAD

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH v2 1/5] lib: distributor performance enhancements
From: Jerin Jacob @ 2016-12-22 12:47 UTC (permalink / raw)
  To: David Hunt; +Cc: dev, bruce.richardson
In-Reply-To: <1482381428-148094-2-git-send-email-david.hunt@intel.com>

On Thu, Dec 22, 2016 at 04:37:04AM +0000, David Hunt wrote:
> Now sends bursts of up to 8 mbufs to each worker, and tracks
> the in-flight flow-ids (atomic scheduling)
> 
> New file with a new api, similar to the old API except with _burst
> at the end of the function names
> 
> Signed-off-by: David Hunt <david.hunt@intel.com>
> +
> +int
> +rte_distributor_get_pkt_burst(struct rte_distributor_burst *d,
> +		unsigned int worker_id, struct rte_mbuf **pkts,
> +		struct rte_mbuf **oldpkt, unsigned int return_count)
> +{
> +	unsigned int count;
> +	uint64_t retries = 0;
> +
> +	rte_distributor_request_pkt_burst(d, worker_id, oldpkt, return_count);
> +
> +	count = rte_distributor_poll_pkt_burst(d, worker_id, pkts);
> +	while (count == 0) {
> +		rte_pause();
> +		retries++;
> +		if (retries > 1000) {
> +			retries = 0;

This write to retries has no effect, since it comes immediately before
the return.

> +			return 0;
> +		}
> +		uint64_t t = __rdtsc()+100;

Use the rte_ version of __rdtsc, i.e. rte_rdtsc().
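
Roughly what I have in mind for the wait loop with both comments applied. This is only a sketch: to keep it compilable outside a DPDK tree, the cycle counter and the poll call are passed in as callbacks. The typedefs, the macro names, poll_with_backoff() and the two stubs are invented for this example and do not exist in DPDK; in the real code the callbacks would simply be rte_rdtsc() and rte_distributor_poll_pkt_burst().

```c
#include <stdint.h>

/* Hypothetical stand-ins for rte_rdtsc() and the distributor poll call. */
typedef uint64_t (*cycles_fn)(void);
typedef unsigned int (*poll_fn)(void *ctx);

#define POLL_MAX_RETRIES    1000	/* same bound as the quoted patch */
#define POLL_BACKOFF_CYCLES 100

static unsigned int
poll_with_backoff(poll_fn poll, void *ctx, cycles_fn cycles)
{
	unsigned int count = poll(ctx);
	unsigned int retries = 0;

	while (count == 0) {
		if (++retries > POLL_MAX_RETRIES)
			return 0;	/* give up; no dead store to retries */

		uint64_t deadline = cycles() + POLL_BACKOFF_CYCLES;

		while (cycles() < deadline)
			;	/* rte_pause() would go here in DPDK */

		count = poll(ctx);
	}
	return count;
}

/* Stub clock for trying the loop out: advances 10 "cycles" per read. */
static uint64_t fake_cycles(void)
{
	static uint64_t now;

	return now += 10;
}

/* Stub poll: returns 0 for *ctx calls, then a burst of 8. */
static unsigned int fake_poll(void *ctx)
{
	int *empty_polls = ctx;

	if (*empty_polls > 0) {
		(*empty_polls)--;
		return 0;
	}
	return 8;
}
```

Calling poll_with_backoff(fake_poll, &n, fake_cycles) with a small n returns the burst once the stub stops reporting empty polls; with a large n it gives up after the retry bound.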

> +
> +		while (__rdtsc() < t)
> +			rte_pause();
> +
> +		count = rte_distributor_poll_pkt_burst(d, worker_id, pkts);
> +	}
> +	return count;
> +}
> +
> +int
> +rte_distributor_return_pkt_burst(struct rte_distributor_burst *d,
> +		unsigned int worker_id, struct rte_mbuf **oldpkt, int num)
> +{
> +	struct rte_distributor_buffer_burst *buf = &d->bufs[worker_id];
> +	unsigned int i;
> +
> +	for (i = 0; i < RTE_DIST_BURST_SIZE; i++)
> +		/* Switch off the return bit first */
> +		buf->retptr64[i] &= ~RTE_DISTRIB_RETURN_BUF;
> +
> +	for (i = num; i-- > 0; )
> +		buf->retptr64[i] = (((int64_t)(uintptr_t)oldpkt[i]) <<
> +			RTE_DISTRIB_FLAG_BITS) | RTE_DISTRIB_RETURN_BUF;
> +
> +	/* set the GET_BUF but even if we got no returns */
> +	buf->retptr64[0] |= RTE_DISTRIB_GET_BUF;
> +
> +	return 0;
> +}
> +
> +#if RTE_MACHINE_CPUFLAG_SSE2
> +static inline void

Move the SSE version of the code to a separate file so that other SIMD
arch-specific versions, such as NEON, can be added later.
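
For reference when splitting this out, a scalar version of the same matching step could look like the sketch below. It is written under my own assumptions: struct worker_tags, BURST_SIZE, tag_in_set() and find_match_ref() are invented names, not the patch's data structures. Like the cmpestrm version, it ORs worker id + 1 into every output position whose flow id is found in that worker's in-flight or backlog tags, and it does not special-case zero tags.

```c
#include <stdint.h>

#define BURST_SIZE 8	/* hypothetical macro; the quoted patch hard-codes 8 */

/* Hypothetical, simplified per-worker state: the flow tags currently
 * being processed and those queued in the backlog (0 = empty slot). */
struct worker_tags {
	uint16_t inflight[BURST_SIZE];
	uint16_t backlog[BURST_SIZE];
};

/* Return 1 if tag occurs anywhere in set[0..BURST_SIZE-1]. */
static int
tag_in_set(uint16_t tag, const uint16_t *set)
{
	for (unsigned int k = 0; k < BURST_SIZE; k++)
		if (set[k] == tag)
			return 1;
	return 0;
}

/* For each incoming flow id, record which worker (id + 1) already holds it. */
static void
find_match_ref(const struct worker_tags *workers, unsigned int num_workers,
	       const uint16_t *flows, uint16_t *matches)
{
	for (unsigned int j = 0; j < BURST_SIZE; j++)
		matches[j] = 0;

	for (unsigned int i = 0; i < num_workers; i++)
		for (unsigned int j = 0; j < BURST_SIZE; j++)
			if (tag_in_set(flows[j], workers[i].inflight) ||
			    tag_in_set(flows[j], workers[i].backlog))
				matches[j] |= (uint16_t)(i + 1);
}
```

Feeding it the example from the quoted comment (incoming ids 1 to 8, worker 0 in flight with 3, 5 and 7) marks positions 2, 4 and 6 with worker 1, which is the same pattern as the mask shown there.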

> +find_match_sse2(struct rte_distributor_burst *d,
> +			uint16_t *data_ptr,
> +			uint16_t *output_ptr)
> +{
> +	/* Setup */
> +	__m128i incoming_fids;
> +	__m128i inflight_fids;
> +	__m128i preflight_fids;
> +	__m128i wkr;
> +	__m128i mask1;
> +	__m128i mask2;
> +	__m128i output;
> +	struct rte_distributor_backlog *bl;
> +
> +	/*
> +	 * Function overview:
> +	 * 2. Loop through all worker ID's
> +	 *  2a. Load the current inflights for that worker into an xmm reg
> +	 *  2b. Load the current backlog for that worker into an xmm reg
> +	 *  2c. use cmpestrm to intersect flow_ids with backlog and inflights
> +	 *  2d. Add any matches to the output
> +	 * 3. Write the output xmm (matching worker ids).
> +	 */
> +
> +
> +	output = _mm_set1_epi16(0);
> +	incoming_fids = _mm_load_si128((__m128i *)data_ptr);
> +
> +	for (uint16_t i = 0; i < d->num_workers; i++) {
> +		bl = &d->backlog[i];
> +
> +		inflight_fids =
> +			_mm_load_si128((__m128i *)&(d->in_flight_tags[i]));
> +		preflight_fids =
> +			_mm_load_si128((__m128i *)(bl->tags));
> +
> +		/*
> +		 * Any incoming_fid that exists anywhere in inflight_fids will
> +		 * have 0xffff in same position of the mask as the incoming fid
> +		 * Example (shortened to bytes for brevity):
> +		 * incoming_fids   0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08
> +		 * inflight_fids   0x03 0x05 0x07 0x00 0x00 0x00 0x00 0x00
> +		 * mask            0x00 0x00 0xff 0x00 0xff 0x00 0xff 0x00
> +		 */
> +
> +		mask1 = _mm_cmpestrm(inflight_fids, 8, incoming_fids, 8,
> +			_SIDD_UWORD_OPS |
> +			_SIDD_CMP_EQUAL_ANY |
> +			_SIDD_UNIT_MASK);
> +		mask2 = _mm_cmpestrm(preflight_fids, 8, incoming_fids, 8,
> +			_SIDD_UWORD_OPS |
> +			_SIDD_CMP_EQUAL_ANY |
> +			_SIDD_UNIT_MASK);
> +
> +		mask1 = _mm_or_si128(mask1, mask2);
> +		/*
> +		 * Now mask contains 0xffff where there's a match.
> +		 * Next we need to store the worker_id in the relevant position
> +		 * in the output.
> +		 */
> +
> +		wkr = _mm_set1_epi16(i+1);
> +		mask1 = _mm_and_si128(mask1, wkr);
> +		output = _mm_or_si128(mask1, output);
> +	}
> +
> +/* process a set of packets to distribute them to workers */
> +int
> +rte_distributor_process_burst(struct rte_distributor_burst *d,
> +		struct rte_mbuf **mbufs, unsigned int num_mbufs)
> +{
> +	unsigned int next_idx = 0;
> +	static unsigned int wkr;
> +	struct rte_mbuf *next_mb = NULL;
> +	int64_t next_value = 0;
> +	uint16_t new_tag = 0;
> +	uint16_t flows[8] __rte_cache_aligned;

The constant 8 is also used further down in this function. Please replace it
with a macro.

> +	//static int iter=0;

Please remove the test-code with // across the patch.

> +
> +	if (unlikely(num_mbufs == 0)) {
> +		/* Flush out all non-full cache-lines to workers. */
> +		for (unsigned int wid = 0 ; wid < d->num_workers; wid++) {
> +			if ((d->bufs[wid].bufptr64[0] & RTE_DISTRIB_GET_BUF)) {
> +				release(d, wid);
> +				handle_returns(d, wid);
> +			}
> +		}
> +		return 0;
> +	}
> +
> +	while (next_idx < num_mbufs) {
> +		uint16_t matches[8];
> +		int pkts;
> +
> +		if (d->bufs[wkr].bufptr64[0] & RTE_DISTRIB_GET_BUF)
> +			d->bufs[wkr].count = 0;
> +
> +		for (unsigned int i = 0; i < RTE_DIST_BURST_SIZE; i++) {
> +			if (mbufs[next_idx + i]) {
> +				/* flows have to be non-zero */
> +				flows[i] = mbufs[next_idx + i]->hash.usr | 1;
> +			} else
> +				flows[i] = 0;
> +		}
> +
> +		switch (d->dist_match_fn) {
> +#ifdef RTE_MACHINE_CPUFLAG_SSE2

Is this conditional compilation flag really required? I.e.
RTE_DIST_MATCH_SSE will not be enabled in the non-SSE case.

> +		case RTE_DIST_MATCH_SSE:
> +			find_match_sse2(d, &flows[0], &matches[0]);
> +			break;
> +#endif
> +		default:
> +			find_match_scalar(d, &flows[0], &matches[0]);
> +		}
> +
> +		/*
> +		 * Matches array now contain the intended worker ID (+1) of
> +		 * the incoming packets. Any zeroes need to be assigned
> +		 * workers.
> +		 */
> +
> +		if ((num_mbufs - next_idx) < RTE_DIST_BURST_SIZE)
> +			pkts = num_mbufs - next_idx;
> +		else
> +			pkts = RTE_DIST_BURST_SIZE;
> +
> +		for (int j = 0; j < pkts; j++) {
> +
> +			next_mb = mbufs[next_idx++];
> +			next_value = (((int64_t)(uintptr_t)next_mb) <<
> +					RTE_DISTRIB_FLAG_BITS);
> +			/*
> +			 * User is advocated to set tag vaue for each
> +			 * mbuf before calling rte_distributor_process.
> +			 * User defined tags are used to identify flows,
> +			 * or sessions.
> +			 */
> +			/* flows MUST be non-zero */
> +			new_tag = (uint16_t)(next_mb->hash.usr) | 1;
> +
> +			/*
> +			 * Using the next line will cause the find_match
> +			 * function to be optimised out, making this function
> +			 * do parallel (non-atomic) distribution
> +			 */
> +			//matches[j] = 0;

test code with //
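
For reference while reviewing the SIMD path, the matching logic of the quoted
find_match_sse2() can be written as a plain scalar loop. This is only a sketch
of my reading of the patch, not the patch's own find_match_scalar(): struct
worker_tags, find_match_ref() and BURST are made-up names, BURST stands in for
the literal 8 mentioned above, and skipping zero flow IDs is an assumption
based on the "flows have to be non-zero" comment.

```c
#include <stdint.h>

#define BURST 8	/* stand-in for the literal 8 used in the patch */

/* Hypothetical per-worker state: tags in flight and tags in the backlog. */
struct worker_tags {
	uint16_t in_flight[BURST];
	uint16_t backlog[BURST];
};

/*
 * For each incoming flow ID, record (worker index + 1) if that worker
 * already holds the flow in flight or in its backlog; 0 means no match.
 */
static void
find_match_ref(const struct worker_tags *w, unsigned int num_workers,
	       const uint16_t *flows, uint16_t *matches)
{
	unsigned int i, j, k;

	for (j = 0; j < BURST; j++)
		matches[j] = 0;

	for (i = 0; i < num_workers; i++) {
		for (j = 0; j < BURST; j++) {
			if (flows[j] == 0)	/* empty slot (assumption) */
				continue;
			for (k = 0; k < BURST; k++) {
				if (flows[j] == w[i].in_flight[k] ||
				    flows[j] == w[i].backlog[k]) {
					/* OR into the result, as the vector code does */
					matches[j] |= (uint16_t)(i + 1);
					break;
				}
			}
		}
	}
}
```

A reference like this could also serve as the check for any later NEON
variant, since every implementation has to produce the same matches array.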

* Re: [Qemu-devel] [PATCH for-2.9 V4 2/2] Add a new qmp command to do checkpoint, get replication error
From: addr_cc @ 2016-12-22 12:48 UTC (permalink / raw)
  To: Zhang Chen, qemu devel, Jason Wang
  Cc: Li Zhijian, zhanghailiang, eddie . dong, Bian Naimeng,
	Changlong Xie, Wen Congyang
In-Reply-To: <ed8e4268-e515-470c-9e8f-a0194e43100b@cn.fujitsu.com>

On 12/22/2016 12:08 AM, Zhang Chen wrote:
>>> Make sense, this command trying to collect status on whether
>>> an error has occurred, and the "replication_get_error_all(errp)"
>>> is always succeeds. So, Can you suggest to me the right name?
>> If replication_get_error_all() always succeeds, then what failure is
>> possible to be checking for?
> 
> We can read the errp to check the last error.

But turning around and reporting an error to the caller is not nice.
The caller can't distinguish between "I called the command correctly,
and it is telling me the system has encountered a replication error" and
"I called the command incorrectly, and it is telling me my usage is
wrong even though the system has never encountered a replication error".
 Passing information through errp is NOT the right way to successfully
report status.

> 
>>
>> Maybe the problem is deeper, in that replication_get_error_all() has an
>> unusual signature, and needs to be fixed first.  I don't know, and
>> haven't looked; I'm only coming at this from the user interface
>> perspective.  But it makes no sense to have a command that queries
>> whether an error occurred, but where an error having occurred is fatal
>> (you want the command to successfully report that an error has occurred,
>> not error out with a second error because a first error was present).
> 
> Do you means we should fix "void replication_get_error_all()"
> to "int replication_get_error_all()" first for get the return value?

Quite possibly yes. But maybe you don't have to do that, and can come up
with a scheme where only the QMP command wrapper has to be careful.
Perhaps something like this would work:

>> Then you probably want a query style interface:
>>
>> { 'command': 'query-xen-replication-status',
>>    'returns': 'SomeStruct' }
>>
>> where SomeStruct contains details such as status (perhaps an enum that
>> reports 'normal' or 'error'), and where you are free to add additional
>> pieces of information that may prove useful later (rather than having to
>> invent yet more commands that give only a boolean result of success or
>> failure based on whether the state is normal or in error).

SomeStruct *qmp_query_xen_replication_status(Error **errp)
{
    Error *err = NULL;
    SomeStruct *result = g_new0(SomeStruct, 1);

    replication_get_error_all(&err);
    /* result is a pointer, so use -> to set the status field */
    result->state = err ? SOME_ENUM_ERRORED : SOME_ENUM_NORMAL;
    error_free(err);
    /* ... and now you can add additional status items to the API,
       as needed. errp remains unset, because the command succeeds */
    return result;
}

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

* Re: [PATCH v2 00/25] Generic flow API (rte_flow)
From: Adrien Mazarguil @ 2016-12-22 12:48 UTC (permalink / raw)
  To: Simon Horman; +Cc: dev
In-Reply-To: <20161221161914.GA14515@penelope.horms.nl>

On Wed, Dec 21, 2016 at 05:19:16PM +0100, Simon Horman wrote:
> On Fri, Dec 16, 2016 at 05:24:57PM +0100, Adrien Mazarguil wrote:
> > As previously discussed in RFC v1 [1], RFC v2 [2], with changes
> > described in [3] (also pasted below), here is the first non-draft series
> > for this new API.
> > 
> > Its capabilities are so generic that its name had to be vague, it may be
> > called "Generic flow API", "Generic flow interface" (possibly shortened
> > as "GFI") to refer to the name of the new filter type, or "rte_flow" from
> > the prefix used for its public symbols. I personally favor the latter.
> > 
> > While it is currently meant to supersede existing filter types in order for
> > all PMDs to expose a common filtering/classification interface, it may
> > eventually evolve to cover the following ideas as well:
> > 
> > - Rx/Tx offloads configuration through automatic offloads for specific
> >   packets, e.g. performing checksum on TCP packets could be expressed with
> >   an egress rule with a TCP pattern and a kind of checksum action.
> > 
> > - RSS configuration (already defined actually). Could be global or per rule
> >   depending on hardware capabilities.
> > 
> > - Switching configuration for devices with many physical ports; rules doing
> >   both ingress and egress could even be used to completely bypass software
> >   if supported by hardware.

Hi Simon,

> Hi Adrien,
> 
> thanks for this valuable work.
> 
> I would like to ask some high level questions on the proposal.
> I apologise in advance if any of these questions are based on a
> misunderstanding on my part.
> 
> * I am wondering about provisions for actions to modify packet data or
>   metadata.  I do see support for marking packets. Is the implication of
>   this that the main focus is to provide a mechanism for classification
>   with the assumption that any actions - other than drop and variants of
>   output - would be performed elsewhere?

I'm not sure I understand what you mean by "elsewhere" here. Packet marking
as currently defined is a purely ingress action, i.e. HW matches some packet
and returns a user-defined tag in related meta-data that the PMD copies to
the appropriate mbuf structure field before returning it to the application.

There is provision for egress rules and I wrote down a few ideas describing
how they could be useful (as above), however they remain to be defined.

>   If so I would observe that this seems somewhat limiting in the case of
>   hardware that can perform a richer set of actions. And seems particularly
>   limiting on egress as there doesn't seem anywhere else that other actions
>   could be performed after classification is performed by this API.

A single flow rule may contain any number of distinct actions. For egress,
it means you could wrap matching packets in VLAN and VXLAN at once.

If you wanted to perform the same action twice on matching packets, you'd
have to provide two rules with defined priorities and use a non-terminating
action for the first one:

- Rule with priority 0: match UDP -> add VLAN 42, passthrough
- Rule with priority 1: match UDP -> add VLAN 64, terminating

This is how automatic QinQ would be defined for outgoing UDP packets.

> * I am curious to know what considerations have been given to
>   supporting tunnelling (encapsulation and decapsulation of e.g. VXLAN),
>   tagging (pushing and popping e.g. VLANs), and labels (pushing or popping
>   e.g. MPLS).
> 
>   Such features seem would useful for application of this work in a variety
>   of situations including overlay networks and VNFs.

This is also what I had in mind and we'd only have to define specific
ingress/egress actions for these. Currently rte_flow only implements a basic
set of existing features from the legacy filtering framework, but is meant
to be extended.

> * I am wondering if any thought has gone into supporting matching on the
>   n-th instance of a field that may appear more than once: e.g. VLAN tag.

Sure, please see the latest documentation [1] and testpmd examples [2].
Pattern items being stacked in the same order as protocol layers, matching
specific QinQ traffic and redirecting it to some queue could be expressed
with something like:

 testpmd> flow create 0 ingress pattern eth / vlan vid is 64 / vlan vid is 42 / end 
    actions queue 6 / end

Such a rule is translated as-is to rte_flow pattern items and action
structures.
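
To illustrate what "stacked in the same order as protocol layers" means for
repeated fields, here is a small self-contained model. The types below are
simplified stand-ins made up for this mail, not the real rte_flow structures
(see [1] for those); they only show how an ordered item list tells the outer
VLAN tag from the inner one.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins, NOT the real rte_flow definitions. */
enum item_type { ITEM_END, ITEM_ETH, ITEM_VLAN };

struct item {
	enum item_type type;
	uint16_t vid;		/* only meaningful for ITEM_VLAN */
};

/* Hypothetical parsed packet: VLAN IDs listed outermost first. */
struct pkt {
	const uint16_t *vids;
	size_t num_vids;
};

/* Return 1 if the packet's layers match the pattern items in order. */
static int
pattern_match(const struct item *pattern, const struct pkt *p)
{
	size_t next_vlan = 0;

	for (; pattern->type != ITEM_END; pattern++) {
		switch (pattern->type) {
		case ITEM_ETH:
			break;	/* every packet in this model has Ethernet */
		case ITEM_VLAN:
			/* the n-th VLAN item is compared with the n-th tag */
			if (next_vlan >= p->num_vids ||
			    p->vids[next_vlan] != pattern->vid)
				return 0;
			next_vlan++;
			break;
		default:
			return 0;
		}
	}
	return 1;
}

/* eth / vlan vid is 64 / vlan vid is 42 / end */
static const struct item qinq_pattern[] = {
	{ ITEM_ETH, 0 },
	{ ITEM_VLAN, 64 },
	{ ITEM_VLAN, 42 },
	{ ITEM_END, 0 },
};
```

In this model a packet tagged 64 (outer) then 42 (inner) matches, while the
same tags in the opposite order do not, which is the n-th instance behaviour
asked about.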

> With the above questions in mind I am curious to know what use-cases
> the proposal is targeted at.

Well, it should be easier to answer if you have a specific use-case in mind
you would like to support but that cannot be expressed with the API as
defined in [1], in which case please share it with the community.

[1] http://dpdk.org/ml/archives/dev/2016-December/052954.html
[2] http://dpdk.org/ml/archives/dev/2016-December/052975.html

-- 
Adrien Mazarguil
6WIND

* Re: Why IP_PIPELINE is faster than L2FWD
From: Royce Niu @ 2016-12-22 12:48 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: Royce Niu, dev
In-Reply-To: <20161222111528.GA11104@bricha3-MOBL3.ger.corp.intel.com>

But actually, L3FWD of IP_PIPELINE is also faster than stock L2FWD, and it
also modifies the MAC address. How can this be explained?

I want to know why IP_PIPELINE is so much faster, so that I can learn
from it and build our own program.

But the documentation for it is not detailed enough. If possible,
could you tell me what the key to the speedup is? Thanks!

On Thu, Dec 22, 2016 at 7:15 PM, Bruce Richardson <
bruce.richardson@intel.com> wrote:

> On Thu, Dec 22, 2016 at 12:18:12AM +0800, Royce Niu wrote:
> > Hi all,
> >
> > I tested default L2FWD and IP_PIPELINE (pass-through). The throughput of
> > IP_PIPELINE is higher immensely.
> >
> > There are only two virtual NICs in KVM. The experiment is just moving
> > packet from vNIC0  to vNIC1. I think the function is so simple. Why L2FWD
> > is much slower?
> >
> > How can I improve L2FWD, to make L2FWD faster?
> >
> Is IP_PIPELINE in passthrough mode modifying the packets? L2FWD swaps
> the mac addresses on each packet as it processes them, which can slow it
> down. L2FWD is also more an example of how the APIs work than anything
> else. For fastest possible port-to-port forwarding, testpmd should give
> the highest performance.
>
> /Bruce
>



-- 
Regards,

Royce

* Re: Xenstore watch interface in the kernel
From: Sander Eikelenboom @ 2016-12-22 12:49 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Jan Beulich,
	xen-devel, Boris Ostrovsky
In-Reply-To: <22619.50542.426820.372250@mariner.uk.xensource.com>


Thursday, December 22, 2016, 1:22:06 PM, you wrote:

> Juergen Gross writes ("Xenstore watch interface in the kernel"):
>> While working on the Linux xenbus kernel driver I stumbled over a rather
>> strange interface: a Xenstore watch event is delivered via a callback
>> defined as:
>> 
>> void (*callback)(struct xenbus_watch *,
>>                          const char **vec, unsigned int len);
>> 
>> vec is an array of strings and len the number of strings in that
>> array.
>> 
>> Looking at the Xenstore interface I don't see how there could ever be
>> an array with another len than 2 be presented (the first string being
>> the modified path, the second the token specified when registering
>> the watch).

> Yes, this is an anomaly.

> IIRC (from the last time I looked at this) a long time ago in a galaxy
> far far away someone thought it might be a good idea to introduce some
> kind of payload to watch events, so that watches could be explicitly
> fired with a payload.

> However, this wasn't in any deployed implementation.

Something I did run into while trying to use xenstore was that the callbacks
don't give back the previous and current value.
So you don't really know *how* the state changed, unless you keep track of
all changes locally as well.
I have circumvented it the dirty way, by setting the token to the current
value, but it isn't very pretty and all setters must adhere to that, so it
doesn't work for the general Xen entries and is therefore only useful for my
own entries.

Any idea as to why the callback doesn't return the current and previous value
directly ?
--
Sander

>> I'd like to modify the callback's prototype to:
>> 
>> void (*callback)(struct xenbus_watch *,
>>                          const char *path, const char *token);

> I think this would be a fine idea.

>> Is there any reason not to change the interface in the kernel?

> No.

>> BTW: The handling of more than 2 strings as watch event parameters is
>> even repeated in the interface to libxenstore. I looked through the Xen
>> sources and could find no use of the number of strings returned in case
>> of a watch event. While we can't change the interface of libxenstore
>> I don't think we have to be prepared for an arbitrary number of strings
>> for a watch event at the kernel interface.

> Yes.

> Ian.
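
For illustration, the proposed prototype change amounts to something like the
following. This is a standalone sketch, not kernel code: struct watch,
new_cb_t, deliver_watch_event(), record_event() and the index constants are
stand-ins invented here, and it assumes the path is element 0 and the token is
element 1 of the vector, as described in the quoted text.

```c
/* Stand-ins for the real kernel types; all names are hypothetical. */
struct watch;

enum { WATCH_PATH = 0, WATCH_TOKEN = 1 };

/* New-style callback: path and token passed directly. */
typedef void (*new_cb_t)(struct watch *, const char *path, const char *token);

struct watch {
	const char *node;
	new_cb_t callback;
};

/*
 * Old-style entry point: receives the string vector and unpacks the only
 * two elements a watch event carries before calling the new-style callback.
 * Returns 0 on success, -1 if the vector is too short.
 */
static int
deliver_watch_event(struct watch *w, const char **vec, unsigned int len)
{
	if (len < 2)
		return -1;
	w->callback(w, vec[WATCH_PATH], vec[WATCH_TOKEN]);
	return 0;
}

/* Example callback: just remembers what it was given. */
static const char *last_path, *last_token;

static void
record_event(struct watch *w, const char *path, const char *token)
{
	(void)w;
	last_path = path;
	last_token = token;
}
```

With the new prototype, callers no longer index into vec themselves, so the
length check lives in one place.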

* Q: nvme_rdma and reconnect
From: Christoph Hellwig @ 2016-12-22 12:50 UTC (permalink / raw)

In-Reply-To: <0dcc2dab-378e-aae0-742d-a688928249a1@suse.de>

On Thu, Dec 22, 2016 at 01:46:10PM +0100, Hannes Reinecke wrote:
> And keeping in mind that the reset path will be a killer for any prospective
> multipath scenario; if you need to remove the device to reset you are
> guaranteed to _never_ get it back under memory pressure.

No one talked about removing the device, just shutting down the
instance of the controller.  We need a clean slate to reconnect as all
our RDMA QPs are toast once an error happens by design of IB Verbs.

* [PATCH 1/2] arm64: setup: introduce kaslr_offset()
From: Alexander Popov @ 2016-12-22 12:51 UTC (permalink / raw)
  To: linux-arm-kernel
In-Reply-To: <20161222061857.GA26502@yury-N73SV>

On 22.12.2016 09:18, Yury Norov wrote:
> On Sun, Dec 11, 2016 at 03:50:55AM +0300, Alexander Popov wrote:
>> Introduce kaslr_offset() similarly to x86_64 for fixing kcov.

[...]

> Hi Alexander,
> 
> I found today's linux-next master broken:

[...]

> It looks like you declare kaslr_offset() twice - in this patch, and in 7ede8665f
> (arm64: setup: introduce kaslr_offset()). 

Hello Yury,

There was a race while applying this patch, so linux-next currently has two versions of it.

The first one is 1a339a14b1f2c7a0dfdd6db79eee1e55d3cec357, which is the original.
The second one is 7ede8665f27cde7da69e8b2fbeaa1ed0664879c5, updated by Will Deacon and
applied to the mainline.

I'm sorry about that. The first one should definitely be dropped.

Best regards,
Alexander

* Re: [PATCH v2 01/13] be2iscsi: Fix use of invalidate command table req
From: Hannes Reinecke @ 2016-12-22 12:51 UTC (permalink / raw)
  To: Jitendra Bhivare, cleech, lduncan; +Cc: linux-scsi
In-Reply-To: <1481624766-13846-2-git-send-email-jitendra.bhivare@broadcom.com>

On 12/13/2016 11:25 AM, Jitendra Bhivare wrote:
> Remove shared structure inv_tbl in phba for all sessions to post
> invalidation IOCTL.
> Always allocate and then free the table after use in reset handler.
> Abort handler needs just one instance so define it on stack.
> Add checks for BE_INVLDT_CMD_TBL_SZ to not exceed invalidation
> command table size in IOCTL.
>
> Signed-off-by: Jitendra Bhivare <jitendra.bhivare@broadcom.com>
> ---
>  drivers/scsi/be2iscsi/be_main.c | 85 ++++++++++++++++++++++++-----------------
>  drivers/scsi/be2iscsi/be_main.h | 16 ++++----
>  drivers/scsi/be2iscsi/be_mgmt.c | 12 +++---
>  drivers/scsi/be2iscsi/be_mgmt.h | 40 +++++++++----------
>  4 files changed, 83 insertions(+), 70 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

* Re: [mlmmj] Allowing specific addresses with subonlypost
From: Morten Shearman Kirkegaard @ 2016-12-22 12:51 UTC (permalink / raw)
  To: mlmmj
In-Reply-To: <13b7eecf-62da-b77b-4685-ddae6cc0ee39@gmail.com>

On 2016-12-22, at 13:24:10 +0100, David Demelier wrote:
> However to stop spam I've enabled the subonlypost option to only allow
> subscribers to post mails. The problem is that I also would like to
> add specific addresses so my own mail server can also send mails to
> people subscribed.

The easiest way to allow that is to add them in nomailsubs.d/.

// Moki

