* Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
From: Jason Wang @ 2018-04-24 2:29 UTC (permalink / raw)
To: Björn Töpel, magnus.karlsson, alexander.h.duyck,
alexander.duyck, john.fastabend, ast, brouer,
willemdebruijn.kernel, daniel, mst, netdev
Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
anjali.singhai, qi.z.zhang
In-Reply-To: <20180423135619.7179-1-bjorn.topel@gmail.com>
On 2018年04月23日 21:56, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is call a UMEM. It consists of a
> number of equally size frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user-space.
>
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
>
> How is then packets distributed between these two XSK? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
>
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> together with the generic XDP support and copies out the data to user
> space. A fallback mode that works for any network device. On the other
> hand, if the driver has support for XDP, it will be used by the AF_XDP
> code to provide better performance, but there is still a copy of the
> data into user space.
>
> There is a xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
> for this:
>
> ethtool -N p3p2 rx-flow-hash udp4 fn
> ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
> action 16
>
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
>
> samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> AF_XDP performance 64 byte packets. Results from RFC V2 in parenthesis.
> Benchmark XDP_SKB XDP_DRV
> rxdrop 2.9(3.0) 9.4(9.3)
> txpush 2.5(2.2) NA*
> l2fwd 1.9(1.7) 2.4(2.4) (TX using XDP_SKB in both cases)
This number looks not very exciting. I can get ~3Mpps when using testpmd
in a guest with xdp_redirect.sh on host between ixgbe and TAP/vhost. I
believe we can even better performance without virt. It would be
interesting to compare this performance with e.g testpmd +
virito_user(vhost_kernel) + XDP.
>
> AF_XDP performance 1500 byte packets:
> Benchmark XDP_SKB XDP_DRV
> rxdrop 2.1(2.2) 3.3(3.1)
> l2fwd 1.4(1.1) 1.8(1.7) (TX using XDP_SKB in both cases)
>
> * NA since we have no support for TX using the XDP_DRV infrastructure
> in this RFC. This is for a future patch set since it involves
> changes to the XDP NDOs. Some of this has been upstreamed by Jesper
> Dangaard Brouer.
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats CPU pps issue-pps
> XDP-RX CPU 16 32,921,521 0
>
> 1500 byte packets:
> XDP stats CPU pps issue-pps
> XDP-RX CPU 16 3,289,491 0
>
> Changes from RFC V2:
>
> * Optimizations and simplifications to the ring structures inspired by
> ptr_ring.h
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
> consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
>
> The structure of the patch set is as follows:
>
> Patches 1-2: Basic socket and umem plumbing
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
>
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
>
> Questions:
>
> * How to deal with cache alignment for uapi when different
> architectures can have different cache line sizes? We have just
> aligned it to 64 bytes for now, which works for many popular
> architectures, but not all. Please advise.
>
> To do:
>
> * Optimize performance
>
> * Kernel selftest
>
> Post-series plan:
>
> * Kernel load module support of AF_XDP would be nice. Unclear how to
> achieve this though since our XDP code depends on net/core.
>
> * Support for AF_XDP sockets without an XPD program loaded. In this
> case all the traffic on a queue should go up to the user space socket.
I think we probably need this in the case of TUN XDP for virt guest too.
Thanks
>
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
> XDP_PASS" for a tcpdump-like functionality.
>
> * And of course getting to zero-copy support in small increments.
>
> Thanks: Björn and Magnus
>
> Björn Töpel (8):
> net: initial AF_XDP skeleton
> xsk: add user memory registration support sockopt
> xsk: add Rx queue setup and mmap support
> xdp: introduce xdp_return_buff API
> xsk: add Rx receive functions and poll support
> bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
> xsk: wire up XDP_DRV side of AF_XDP
> xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (7):
> xsk: add umem fill queue support and mmap
> xsk: add support for bind for Rx
> xsk: add umem completion queue support and mmap
> xsk: add Tx queue setup and mmap support
> xsk: support for Tx
> xsk: statistics support
> samples/bpf: sample application for AF_XDP sockets
>
> MAINTAINERS | 8 +
> include/linux/bpf.h | 26 +
> include/linux/bpf_types.h | 3 +
> include/linux/filter.h | 2 +-
> include/linux/socket.h | 5 +-
> include/net/xdp.h | 1 +
> include/net/xdp_sock.h | 46 ++
> include/uapi/linux/bpf.h | 1 +
> include/uapi/linux/if_xdp.h | 87 ++++
> kernel/bpf/Makefile | 3 +
> kernel/bpf/verifier.c | 8 +-
> kernel/bpf/xskmap.c | 286 +++++++++++
> net/Kconfig | 1 +
> net/Makefile | 1 +
> net/core/dev.c | 34 +-
> net/core/filter.c | 40 +-
> net/core/sock.c | 12 +-
> net/core/xdp.c | 15 +-
> net/xdp/Kconfig | 7 +
> net/xdp/Makefile | 2 +
> net/xdp/xdp_umem.c | 256 ++++++++++
> net/xdp/xdp_umem.h | 65 +++
> net/xdp/xdp_umem_props.h | 23 +
> net/xdp/xsk.c | 704 +++++++++++++++++++++++++++
> net/xdp/xsk_queue.c | 73 +++
> net/xdp/xsk_queue.h | 245 ++++++++++
> samples/bpf/Makefile | 4 +
> samples/bpf/xdpsock.h | 11 +
> samples/bpf/xdpsock_kern.c | 56 +++
> samples/bpf/xdpsock_user.c | 947 ++++++++++++++++++++++++++++++++++++
> security/selinux/hooks.c | 4 +-
> security/selinux/include/classmap.h | 4 +-
> 32 files changed, 2945 insertions(+), 35 deletions(-)
> create mode 100644 include/net/xdp_sock.h
> create mode 100644 include/uapi/linux/if_xdp.h
> create mode 100644 kernel/bpf/xskmap.c
> create mode 100644 net/xdp/Kconfig
> create mode 100644 net/xdp/Makefile
> create mode 100644 net/xdp/xdp_umem.c
> create mode 100644 net/xdp/xdp_umem.h
> create mode 100644 net/xdp/xdp_umem_props.h
> create mode 100644 net/xdp/xsk.c
> create mode 100644 net/xdp/xsk_queue.c
> create mode 100644 net/xdp/xsk_queue.h
> create mode 100644 samples/bpf/xdpsock.h
> create mode 100644 samples/bpf/xdpsock_kern.c
> create mode 100644 samples/bpf/xdpsock_user.c
>
^ permalink raw reply
* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Andy Lutomirski @ 2018-04-24 2:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andy Lutomirski, Eric Dumazet, David S . Miller, netdev,
linux-kernel, Soheil Hassas Yeganeh, linux-mm, Linux API
In-Reply-To: <d5b4dc70-6f17-d2be-a519-5ebc3f812f57@gmail.com>
On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Hi Andy
>
> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
>> I would suggest that you rework the interface a bit. First a user would call mmap() on a TCP socket, which would create an empty VMA. (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine. Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or setsockopt() function and pass something like:
>
>
>>
>> struct tcp_zerocopy_receive {
>> void *address;
>> size_t length;
>> };
>>
>> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
>
> I have no idea what is the proper API for that.
> Where the TCP VMA(s) would be stored ?
> In TCP socket, or MM layer ?
MM layer. I haven't tested this at all, and the error handling is
totally wrong, but I think you'd do something like:
len = get_user(...);
down_read(¤t->mm->mmap_sem);
vma = find_vma(mm, start);
if (!vma || vma->vm_start > start)
return -EFAULT;
/* This is buggy. You also need to check that the file is a socket.
This is probably trivial. */
if (vma->vm_file->private_data != sock)
return -EINVAL;
if (len > vma->vm_end - start)
return -EFAULT; /* too big a request. */
and now you'd do the vm_insert_page() dance, except that you don't
have to abort the whole procedure if you discover that something isn't
aligned right. Instead you'd just stop and tell the caller that you
didn't map the full requested size. You might also need to add some
code to charge the caller for the pages that get pinned, but that's an
orthogonal issue.
You also need to provide some way for user programs to signal that
they're done with the page in question. MADV_DONTNEED might be
sufficient.
In the mmap() helper, you might want to restrict the mapped size to
something reasonable. And it might be nice to hook mremap() to
prevent user code from causing too much trouble.
With my x86-writer-of-TLB-code hat on, I expect the major performance
costs to be the generic costs of mmap() and munmap() (which only
happen once per socket instead of once per read if you like my idea),
the cost of a TLB miss when the data gets read (really not so bad on
modern hardware), and the cost of the TLB invalidation when user code
is done with the buffers. The latter is awful, especially in
multithreaded programs. In fact, it's so bad that it might be worth
mentioning in the documentation for this code that it just shouldn't
be used in multithreaded processes. (Also, on non-PCID hardware,
there's an annoying situation in which a recently-migrated thread that
removes a mapping sends an IPI to the CPU that the thread used to be
on. I thought I had a clever idea to get rid of that IPI once, but it
turned out to be wrong.)
Architectures like ARM that have superior TLB handling primitives will
not be hurt as badly if this is used my a multithreaded program.
>
>
> And I am not sure why the error handling would be better (point 4), unless we can return smaller @length than requested maybe ?
Exactly. If I request 10MB mapped and only the first 9MB are aligned
right, I still want the first 9 MB.
>
> Also how the VMA space would be accounted (point 3) when creating an empty VMA (no pages in there yet)
There's nothing to account. It's the same as mapping /dev/null or
similar -- the mm core should take care of it for you.
^ permalink raw reply
* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Richard Guy Briggs @ 2018-04-24 2:02 UTC (permalink / raw)
To: Paul Moore
Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <CAHC9VhQkJBU-f-AuEnGF1BA2QW6nCJ_yr_EqBR02-1y9+XQZ5A@mail.gmail.com>
On 2018-04-23 19:15, Paul Moore wrote:
> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2018-04-18 19:47, Paul Moore wrote:
> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> >> > Implement the proc fs write to set the audit container ID of a process,
> >> > emitting an AUDIT_CONTAINER record to document the event.
> >> >
> >> > This is a write from the container orchestrator task to a proc entry of
> >> > the form /proc/PID/containerid where PID is the process ID of the newly
> >> > created task that is to become the first task in a container, or an
> >> > additional task added to a container.
> >> >
> >> > The write expects up to a u64 value (unset: 18446744073709551615).
> >> >
> >> > This will produce a record such as this:
> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> >> >
> >> > The "op" field indicates an initial set. The "pid" to "ses" fields are
> >> > the orchestrator while the "opid" field is the object's PID, the process
> >> > being "contained". Old and new container ID values are given in the
> >> > "contid" fields, while res indicates its success.
> >> >
> >> > It is not permitted to self-set, unset or re-set the container ID. A
> >> > child inherits its parent's container ID, but then can be set only once
> >> > after.
> >> >
> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >> >
> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> >> > ---
> >> > fs/proc/base.c | 37 ++++++++++++++++++++
> >> > include/linux/audit.h | 16 +++++++++
> >> > include/linux/init_task.h | 4 ++-
> >> > include/linux/sched.h | 1 +
> >> > include/uapi/linux/audit.h | 2 ++
> >> > kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
> >> > 6 files changed, 143 insertions(+), 1 deletion(-)
>
> ...
>
> >> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> > index d258826..1b82191 100644
> >> > --- a/include/linux/sched.h
> >> > +++ b/include/linux/sched.h
> >> > @@ -796,6 +796,7 @@ struct task_struct {
> >> > #ifdef CONFIG_AUDITSYSCALL
> >> > kuid_t loginuid;
> >> > unsigned int sessionid;
> >> > + u64 containerid;
> >>
> >> This one line addition to the task_struct scares me the most of
> >> anything in this patchset. Why? It's a field named "containerid" in
> >> a perhaps one of the most widely used core kernel structures; the
> >> possibilities for abuse are endless, and it's foolish to think we
> >> would ever be able to adequately police this.
> >
> > Fair enough.
> >
> >> Unfortunately, we can't add the field to audit_context as things
> >> currently stand because we don't always allocate an audit_context,
> >> it's dependent on the system's configuration, and we need to track the
> >> audit container ID for a given process, regardless of the audit
> >> configuration. Pretty much the same reason why loginuid and sessionid
> >> are located directly in task_struct now. As I stressed during the
> >> design phase, I really want to keep this as an *audit* container ID
> >> and not a general purpose kernel wide container ID. If the kernel
> >> ever grows a general purpose container ID token, I'll be the first in
> >> line to convert the audit code, but I don't want audit to be that
> >> general purpose mechanism ... audit is hated enough as-is ;)
> >
> > When would we need an audit container ID when audit is not enabled
> > enough to have an audit_context?
>
> I'm thinking of the audit_alloc() case where audit_filter_task()
> returns AUDIT_DISABLED.
Ok, so a task could be marked as filtered but its children would still
be auditable and inheriting its parent containerid (as well at its
loginuid and sessionid)...
> I believe this is the same reason why loginuid and sessionid live
> directly in the task_struct and not in the audit_context; they need to
> persist for the lifetime of the task.
Yes, probably.
> > If it is only used for audit, and audit is the only consumer, and audit
> > can only use it when it is enabled, then we can just return success to
> > any write to the proc filehandle, or not even present it. Nothing will
> > be able to know that value wasn't used.
> >
> > When are loginuid and sessionid used now when audit is not enabled (or
> > should I say, explicitly disabled)?
>
> See above. I think that should answer these questions.
Ok.
> >> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> >> > index 4e61a9e..921a71f 100644
> >> > --- a/include/uapi/linux/audit.h
> >> > +++ b/include/uapi/linux/audit.h
> >> > @@ -71,6 +71,7 @@
> >> > #define AUDIT_TTY_SET 1017 /* Set TTY auditing status */
> >> > #define AUDIT_SET_FEATURE 1018 /* Turn an audit feature on or off */
> >> > #define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
> >> > +#define AUDIT_CONTAINER 1020 /* Define the container id and information */
> >> >
> >> > #define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
> >> > #define AUDIT_USER_AVC 1107 /* We filter this differently */
> >> > @@ -465,6 +466,7 @@ struct audit_tty_status {
> >> > };
> >> >
> >> > #define AUDIT_UID_UNSET (unsigned int)-1
> >> > +#define AUDIT_CID_UNSET ((u64)-1)
> >>
> >> I think we need to decide if we want to distinguish between the "host"
> >> (e.g. init ns) and "unset". Looking at this patch (I've only quickly
> >> skimmed the others so far) it would appear that you don't think we
> >> need to worry about this distinction; that's fine, but let's make it
> >> explicit with a comment in the code that AUDIT_CID_UNSET means "unset"
> >> as well as "host".
> >
> > I don't see any reason to distinguish between "host" and "unset". Since
> > a container doesn't have a concrete definition based in namespaces, the
> > initial namespace set is meaningless here.
>
> Okay, that sounds reasonable.
>
> > Is there value in having a container orchestrator process have a
> > reserved container ID that has a policy distinct from any other
> > container?
>
> I'm open to arguments for this idea, but I don't see a point to it right now.
>
> > If so, then I could see the value in making the distinction.
> > For example, I've heard of interest in systemd acting as a container
> > orchestrator, so if it took on that role as PID 1, then every process in
> > the system would inherit that ID and none would be unset.
> >
> > I can't picture how having seperate "host" and "unset" values helps us.
>
> I don't have a strong feeling either way, I just wanted to ask the question.
>
> >> > /* audit_rule_data supports filter rules with both integer and string
> >> > * fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >> > index 4e0a4ac..29c8482 100644
> >> > --- a/kernel/auditsc.c
> >> > +++ b/kernel/auditsc.c
> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
> >> > return rc;
> >> > }
> >> >
> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> >> > +{
> >> > + struct task_struct *parent;
> >> > + u64 pcontainerid, ccontainerid;
> >> > +
> >> > + /* Don't allow to set our own containerid */
> >> > + if (current == task)
> >> > + return -EPERM;
> >>
> >> Why not? Is there some obvious security concern that I missing?
> >
> > We then lose the distinction in the AUDIT_CONTAINER record between the
> > initiating PID and the target PID. This was outlined in the proposal.
>
> I just went back and reread the v3 proposal and I still don't see a
> good explanation of this. Why is this bad? What's the security
> concern?
I don't remember, specifically. Maybe this has been addressed by the
check for children/threads or identical parent container ID. So, I'm
reluctantly willing to remove that check for now.
> > Having said that, I'm still not sure we have protected sufficiently from
> > a child turning around and setting it's parent's as yet unset or
> > inherited audit container ID.
>
> Yes, I believe we only want to let a task set the audit container for
> it's children (or itself/threads if we decide to allow that, see
> above). There *has* to be a function to check to see if a task if a
> child of a given task ... right? ... although this is likely to be a
> pointer traversal and locking nightmare ... hmmm.
Isn't that just (struct task_struct)parent == (struct
task_struct)child->parent (or ->real_parent)?
And now that I say that, it is covered by the following patch's child
check, so as long as we keep that, we should be fine.
> >> I ask because I suppose it might be possible for some container
> >> runtime to do a fork, setup some of the environment and them exec the
> >> container (before you answer the obvious "namespaces!" please remember
> >> we're not trying to define containers).
> >
> > I don't think namespaces have any bearing on this concern since none are
> > required.
> >
> >> > + /* Don't allow the containerid to be unset */
> >> > + if (!cid_valid(containerid))
> >> > + return -EINVAL;
> >> > + /* if we don't have caps, reject */
> >> > + if (!capable(CAP_AUDIT_CONTROL))
> >> > + return -EPERM;
> >> > + /* if containerid is unset, allow */
> >> > + if (!audit_containerid_set(task))
> >> > + return 0;
> >> > + /* it is already set, and not inherited from the parent, reject */
> >> > + ccontainerid = audit_get_containerid(task);
> >> > + rcu_read_lock();
> >> > + parent = rcu_dereference(task->real_parent);
> >> > + rcu_read_unlock();
> >> > + task_lock(parent);
> >> > + pcontainerid = audit_get_containerid(parent);
> >> > + task_unlock(parent);
> >> > + if (ccontainerid != pcontainerid)
> >> > + return -EPERM;
> >> > + return 0;
> >> > +}
> >> > +
> >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
> >> > + u64 containerid, int rc)
> >> > +{
> >> > + struct audit_buffer *ab;
> >> > + uid_t uid;
> >> > + struct tty_struct *tty;
> >> > +
> >> > + if (!audit_enabled)
> >> > + return;
> >> > +
> >> > + ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
> >> > + if (!ab)
> >> > + return;
> >> > +
> >> > + uid = from_kuid(&init_user_ns, task_uid(current));
> >> > + tty = audit_get_tty(current);
> >> > +
> >> > + audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
> >> > + audit_log_task_context(ab);
> >> > + audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
> >> > + from_kuid(&init_user_ns, audit_get_loginuid(current)),
> >> > + tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
> >> > + task_tgid_nr(task), oldcontainerid, containerid, !rc);
> >> > +
> >> > + audit_put_tty(tty);
> >> > + audit_log_end(ab);
> >> > +}
> >> > +
> >> > +/**
> >> > + * audit_set_containerid - set current task's audit_context containerid
> >> > + * @containerid: containerid value
> >> > + *
> >> > + * Returns 0 on success, -EPERM on permission failure.
> >> > + *
> >> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
> >> > + */
> >> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
> >> > +{
> >> > + u64 oldcontainerid;
> >> > + int rc;
> >> > +
> >> > + oldcontainerid = audit_get_containerid(task);
> >> > +
> >> > + rc = audit_set_containerid_perm(task, containerid);
> >> > + if (!rc) {
> >> > + task_lock(task);
> >> > + task->containerid = containerid;
> >> > + task_unlock(task);
> >> > + }
> >> > +
> >> > + audit_log_set_containerid(task, oldcontainerid, containerid, rc);
> >> > + return rc;
> >>
> >> Why are audit_set_containerid_perm() and audit_log_containerid()
> >> separate functions?
> >
> > (I assume you mean audit_log_set_containerid()?)
>
> Yep. My fingers got tired typing in that function name and decided a
> shortcut was necessary.
>
> > It seemed clearer that all the permission checking was in one function
> > and its return code could be used to report the outcome when logging the
> > (attempted) action. This is the same structure as audit_set_loginuid()
> > and it made sense.
>
> When possible I really like it when the permission checks are in the
> same function as the code which does the work; it's less likely to get
> abused that way (you have to willfully bypass the access checks). The
> exceptions might be if you wanted to reuse the access control code, or
> insert a modular access mechanism (e.g. LSMs).
I don't follow how it could be abused. The return code from the perm
check gates setting the value and is used in the success field in the
log.
> I'm less concerned about audit_log_set_containerid(), but the usual
> idea of avoiding single-use function within the same scope applies
> here.
>
> > This would be the time to connect it to a syscall if that seems like a
> > good idea and remove pid, uid, auid, tty, ses fields.
>
> Ah yes, I missed that. You know my stance on connecting records by
> now (hint: yes, connect them) so I think that would be a good thing to
> do for the next round.
Ok...
> paul moore
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-24 1:49 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180424044250-mutt-send-email-mst@kernel.org>
On Tue, Apr 24, 2018 at 04:43:22AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > > >
> > > > > >
> > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > > > Hello everyone,
> > > > > > > > >
> > > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > > >
> > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > > > >
> > > > > > > > > TODO:
> > > > > > > > > - Refinements and bug fixes;
> > > > > > > > > - Split into small patches;
> > > > > > > > > - Test indirect descriptor support;
> > > > > > > > > - Test/fix event suppression support;
> > > > > > > > > - Test devices other than net;
> > > > > > > > >
> > > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > > ---
> > > > > > > > > drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
> > > > > > > > > include/linux/virtio_ring.h | 8 +-
> > > > > > > > > include/uapi/linux/virtio_config.h | 12 +-
> > > > > > > > > include/uapi/linux/virtio_ring.h | 61 ++
> > > > > > > > > 4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > + if (vq->indirect) {
> > > > > > > > > + u32 len;
> > > > > > > > > +
> > > > > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > > > > + /* Free the indirect table, if any, now that it's unmapped. */
> > > > > > > > > + if (!desc)
> > > > > > > > > + goto out;
> > > > > > > > > +
> > > > > > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > > > + vq->vring_packed.desc[head].len);
> > > > > > > > > +
> > > > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > > > + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > > > > can safely remove this BUG_ON() here.
> > > > > > > >
> > > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > > > > remove this BUG_ON() too.
> > > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > > > And I think something related to this in the spec isn't very
> > > > > > > clear currently.
> > > > > > >
> > > > > > > In the spec, there are below words:
> > > > > > >
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > > > """
> > > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > > > is reserved and is ignored by the device.
> > > > > > > """
> > > > > > >
> > > > > > > So when device writes back an used descriptor in this case,
> > > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > > > is reserved and should be ignored.
> > > > > > >
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > > > """
> > > > > > > Element Length is reserved for used descriptors without the
> > > > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > > > """
> > > > > > >
> > > > > > > And this is the way how driver ignores the `len` in an used
> > > > > > > descriptor.
> > > > > > >
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > > > """
> > > > > > > To increase ring capacity the driver can store a (read-only
> > > > > > > by the device) table of indirect descriptors anywhere in memory,
> > > > > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > > > > containing this indirect descriptor table;
> > > > > > > """
> > > > > > >
> > > > > > > So the indirect descriptors in the table are read-only by
> > > > > > > the device. And the only descriptor which is writeable by
> > > > > > > the device is the descriptor in the main virtqueue (with
> > > > > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > > > > `len` in this descriptor, we won't be able to get the
> > > > > > > length of the data written by the device.
> > > > > > >
> > > > > > > So I think the `len` in this descriptor will carry the
> > > > > > > length of the data written by the device (if the buffers
> > > > > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > > > > isn't set by the device. How do you think?
> > > > > >
> > > > > > Yes I think so. But we'd better need clarification from Michael.
> > > > >
> > > > > I think if you use a descriptor, and you want to supply len
> > > > > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > > > > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > > > > If that's a problem we could look at relaxing that last requirement -
> > > > > does driver want INDIRECT in used descriptor to match
> > > > > the value in the avail descriptor for some reason?
> > > >
> > > > For indirect, driver needs some way to get the length
> > > > of the data written by the driver. And the descriptors
> > > > in the indirect table is read-only, so the only place
> > > > device could put this value is the descriptor with the
> > > > VIRTQ_DESC_F_INDIRECT flag set.
> > >
> > > when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
> > > (and clear VIRTQ_DESC_F_INDIRECT).
> >
> > So the spec allows device to set VIRTQ_DESC_F_WRITE bit
> > when writing out an used descriptor even if the corresponding
> > descriptors it just parsed don't have the VIRTQ_DESC_F_WRITE
> > bit set?
> >
> > Best regards,
> > Tiwei Bie
>
> I think so. In a used descriptor, VIRTQ_DESC_F_WRITE just means length
> is valid.
Got it. It's very neat. Thanks! :)
Best regards,
Tiwei Bie
>
> --
> MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-24 1:43 UTC (permalink / raw)
To: Tiwei Bie
Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424013747.hmr4ytqe7gdqfhdt@debian>
On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > >
> > > > >
> > > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > > Hello everyone,
> > > > > > > >
> > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > >
> > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > > >
> > > > > > > > TODO:
> > > > > > > > - Refinements and bug fixes;
> > > > > > > > - Split into small patches;
> > > > > > > > - Test indirect descriptor support;
> > > > > > > > - Test/fix event suppression support;
> > > > > > > > - Test devices other than net;
> > > > > > > >
> > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > > - Some other refinements and bug fixes;
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > ---
> > > > > > > > drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
> > > > > > > > include/linux/virtio_ring.h | 8 +-
> > > > > > > > include/uapi/linux/virtio_config.h | 12 +-
> > > > > > > > include/uapi/linux/virtio_ring.h | 61 ++
> > > > > > > > 4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > [...]
> > > > > > >
> > > > > > > > +
> > > > > > > > + if (vq->indirect) {
> > > > > > > > + u32 len;
> > > > > > > > +
> > > > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > > > + /* Free the indirect table, if any, now that it's unmapped. */
> > > > > > > > + if (!desc)
> > > > > > > > + goto out;
> > > > > > > > +
> > > > > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > > + vq->vring_packed.desc[head].len);
> > > > > > > > +
> > > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > > + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > > > can safely remove this BUG_ON() here.
> > > > > > >
> > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > > > remove this BUG_ON() too.
> > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > > And I think something related to this in the spec isn't very
> > > > > > clear currently.
> > > > > >
> > > > > > In the spec, there are below words:
> > > > > >
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > > """
> > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > > is reserved and is ignored by the device.
> > > > > > """
> > > > > >
> > > > > > So when device writes back an used descriptor in this case,
> > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > > is reserved and should be ignored.
> > > > > >
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > > """
> > > > > > Element Length is reserved for used descriptors without the
> > > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > > """
> > > > > >
> > > > > > And this is the way how driver ignores the `len` in an used
> > > > > > descriptor.
> > > > > >
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > > """
> > > > > > To increase ring capacity the driver can store a (read-only
> > > > > > by the device) table of indirect descriptors anywhere in memory,
> > > > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > > > containing this indirect descriptor table;
> > > > > > """
> > > > > >
> > > > > > So the indirect descriptors in the table are read-only by
> > > > > > the device. And the only descriptor which is writeable by
> > > > > > the device is the descriptor in the main virtqueue (with
> > > > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > > > `len` in this descriptor, we won't be able to get the
> > > > > > length of the data written by the device.
> > > > > >
> > > > > > So I think the `len` in this descriptor will carry the
> > > > > > length of the data written by the device (if the buffers
> > > > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > > > isn't set by the device. How do you think?
> > > > >
> > > > > Yes I think so. But we'd better need clarification from Michael.
> > > >
> > > > I think if you use a descriptor, and you want to supply len
> > > > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > > > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > > > If that's a problem we could look at relaxing that last requirement -
> > > > does driver want INDIRECT in used descriptor to match
> > > > the value in the avail descriptor for some reason?
> > >
> > > For indirect, driver needs some way to get the length
> > > of the data written by the driver. And the descriptors
> > > in the indirect table is read-only, so the only place
> > > device could put this value is the descriptor with the
> > > VIRTQ_DESC_F_INDIRECT flag set.
> >
> > when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
> > (and clear VIRTQ_DESC_F_INDIRECT).
>
> So the spec allows device to set VIRTQ_DESC_F_WRITE bit
> when writing out an used descriptor even if the corresponding
> descriptors it just parsed don't have the VIRTQ_DESC_F_WRITE
> bit set?
>
> Best regards,
> Tiwei Bie
I think so. In a used descriptor, VIRTQ_DESC_F_WRITE just means length
is valid.
--
MST
^ permalink raw reply
* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-24 1:42 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Siwei Liu, Jiri Pirko, Sridhar Samudrala, David Miller, Netdev,
virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
Jakub Kicinski, Jason Wang
In-Reply-To: <20180423182503.353180c9@xeon-e3>
On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote:
> On Mon, 23 Apr 2018 12:44:39 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
> > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >>
> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> > >> > > > >
> > >> > > > >I will NAK patches to change to common code for netvsc especially the
> > >> > > > >three device model. MS worked hard with distro vendors to support transparent
> > >> > > > >mode, ans we really can't have a new model; or do backport.
> > >> > > > >
> > >> > > > >Plus, DPDK is now dependent on existing model.
> > >> > > >
> > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> > >> > >
> > >> > > The network device model is a userspace API, and DPDK is a userspace application.
> > >> >
> > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > >> > AFAIK it's normally banging device registers directly.
> > >> >
> > >> > > You can't go breaking userspace even if you don't like the application.
> > >> >
> > >> > Could you please explain how is the proposed patchset breaking
> > >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> > >> > API at all.
> > >> >
> > >>
> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> > >> to look for Linux netvsc device and the paired VF device and setup the
> > >> DPDK environment. This setup creates a DPDK failsafe (bondingish) instance
> > >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> > >> VF device.
> > >>
> > >> So it depends on existing 2 device model. You can't go to a 3 device model
> > >> or start hiding devices from userspace.
> > >
> > > Okay so how does the existing patch break that? IIUC does not go to
> > > a 3 device model since netvsc calls failover_register directly.
> > >
> > >> Also, I am working on associating netvsc and VF device based on serial number
> > >> rather than MAC address. The serial number is how Windows works now, and it makes
> > >> sense for Linux and Windows to use the same mechanism if possible.
> > >
> > > Maybe we should support same for virtio ...
> > > Which serial do you mean? From vpd?
> > >
> > > I guess you will want to keep supporting MAC for old hypervisors?
>
> The serial number has always been in the hypervisor since original support of SR-IOV
> in WS2008. So no backward compatibility special cases would be needed.
Is that a serial from real hardware or a hypervisor thing?
--
MST
^ permalink raw reply
* [PATCH 1/1] Revert "rds: ib: add error handle"
From: Zhu Yanjun @ 2018-04-24 1:39 UTC (permalink / raw)
To: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel
This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.
After long time discussion and investigations, it seems that there
is no mem leak. So this patch is reverted.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
net/rds/ib_cm.c | 47 +++++++++++------------------------------------
1 file changed, 11 insertions(+), 36 deletions(-)
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index eea1d86..d64bfaf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -443,7 +443,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_send_cq = NULL;
ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
rdsdebug("ib_create_cq send failed: %d\n", ret);
- goto rds_ibdev_out;
+ goto out;
}
ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
@@ -457,19 +457,19 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_recv_cq = NULL;
ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
rdsdebug("ib_create_cq recv failed: %d\n", ret);
- goto send_cq_out;
+ goto out;
}
ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
if (ret) {
rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
- goto recv_cq_out;
+ goto out;
}
ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
if (ret) {
rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
- goto recv_cq_out;
+ goto out;
}
/* XXX negotiate max send/recv with remote? */
@@ -495,7 +495,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr);
if (ret) {
rdsdebug("rdma_create_qp failed: %d\n", ret);
- goto recv_cq_out;
+ goto out;
}
ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
@@ -505,7 +505,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_send_hdrs) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent send failed\n");
- goto qp_out;
+ goto out;
}
ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
@@ -515,7 +515,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_recv_hdrs) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent recv failed\n");
- goto send_hdrs_dma_out;
+ goto out;
}
ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
@@ -523,7 +523,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_ack) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent ack failed\n");
- goto recv_hdrs_dma_out;
+ goto out;
}
ic->i_sends = vzalloc_node(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work),
@@ -531,7 +531,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_sends) {
ret = -ENOMEM;
rdsdebug("send allocation failed\n");
- goto ack_dma_out;
+ goto out;
}
ic->i_recvs = vzalloc_node(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work),
@@ -539,7 +539,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_recvs) {
ret = -ENOMEM;
rdsdebug("recv allocation failed\n");
- goto sends_out;
+ goto out;
}
rds_ib_recv_init_ack(ic);
@@ -547,33 +547,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
ic->i_send_cq, ic->i_recv_cq);
- return ret;
-
-sends_out:
- vfree(ic->i_sends);
-ack_dma_out:
- ib_dma_free_coherent(dev, sizeof(struct rds_header),
- ic->i_ack, ic->i_ack_dma);
-recv_hdrs_dma_out:
- ib_dma_free_coherent(dev, ic->i_recv_ring.w_nr *
- sizeof(struct rds_header),
- ic->i_recv_hdrs, ic->i_recv_hdrs_dma);
-send_hdrs_dma_out:
- ib_dma_free_coherent(dev, ic->i_send_ring.w_nr *
- sizeof(struct rds_header),
- ic->i_send_hdrs, ic->i_send_hdrs_dma);
-qp_out:
- rdma_destroy_qp(ic->i_cm_id);
-recv_cq_out:
- if (!ib_destroy_cq(ic->i_recv_cq))
- ic->i_recv_cq = NULL;
-send_cq_out:
- if (!ib_destroy_cq(ic->i_send_cq))
- ic->i_send_cq = NULL;
-rds_ibdev_out:
- rds_ib_remove_conn(rds_ibdev, conn);
+out:
rds_ib_dev_put(rds_ibdev);
-
return ret;
}
--
2.7.4
^ permalink raw reply related
* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-24 1:37 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424042913-mutt-send-email-mst@kernel.org>
On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > >
> > > >
> > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > Hello everyone,
> > > > > > >
> > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > >
> > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > >
> > > > > > > TODO:
> > > > > > > - Refinements and bug fixes;
> > > > > > > - Split into small patches;
> > > > > > > - Test indirect descriptor support;
> > > > > > > - Test/fix event suppression support;
> > > > > > > - Test devices other than net;
> > > > > > >
> > > > > > > RFC v1 -> RFC v2:
> > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > - Some other refinements and bug fixes;
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > ---
> > > > > > > drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
> > > > > > > include/linux/virtio_ring.h | 8 +-
> > > > > > > include/uapi/linux/virtio_config.h | 12 +-
> > > > > > > include/uapi/linux/virtio_ring.h | 61 ++
> > > > > > > 4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -58,14 +58,15 @@
> > > > > > [...]
> > > > > >
> > > > > > > +
> > > > > > > + if (vq->indirect) {
> > > > > > > + u32 len;
> > > > > > > +
> > > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > > + /* Free the indirect table, if any, now that it's unmapped. */
> > > > > > > + if (!desc)
> > > > > > > + goto out;
> > > > > > > +
> > > > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > + vq->vring_packed.desc[head].len);
> > > > > > > +
> > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > > can safely remove this BUG_ON() here.
> > > > > >
> > > > > > > + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > > remove this BUG_ON() too.
> > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > And I think something related to this in the spec isn't very
> > > > > clear currently.
> > > > >
> > > > > In the spec, there are below words:
> > > > >
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > """
> > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > is reserved and is ignored by the device.
> > > > > """
> > > > >
> > > > > So when device writes back an used descriptor in this case,
> > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > is reserved and should be ignored.
> > > > >
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > """
> > > > > Element Length is reserved for used descriptors without the
> > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > """
> > > > >
> > > > > And this is the way how driver ignores the `len` in an used
> > > > > descriptor.
> > > > >
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > """
> > > > > To increase ring capacity the driver can store a (read-only
> > > > > by the device) table of indirect descriptors anywhere in memory,
> > > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > > containing this indirect descriptor table;
> > > > > """
> > > > >
> > > > > So the indirect descriptors in the table are read-only by
> > > > > the device. And the only descriptor which is writeable by
> > > > > the device is the descriptor in the main virtqueue (with
> > > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > > `len` in this descriptor, we won't be able to get the
> > > > > length of the data written by the device.
> > > > >
> > > > > So I think the `len` in this descriptor will carry the
> > > > > length of the data written by the device (if the buffers
> > > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > > isn't set by the device. How do you think?
> > > >
> > > > Yes I think so. But we'd better need clarification from Michael.
> > >
> > > I think if you use a descriptor, and you want to supply len
> > > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > > If that's a problem we could look at relaxing that last requirement -
> > > does driver want INDIRECT in used descriptor to match
> > > the value in the avail descriptor for some reason?
> >
> > For indirect, driver needs some way to get the length
> > of the data written by the driver. And the descriptors
> > in the indirect table is read-only, so the only place
> > device could put this value is the descriptor with the
> > VIRTQ_DESC_F_INDIRECT flag set.
>
> when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
> (and clear VIRTQ_DESC_F_INDIRECT).
So the spec allows device to set VIRTQ_DESC_F_WRITE bit
when writing out an used descriptor even if the corresponding
descriptors it just parsed don't have the VIRTQ_DESC_F_WRITE
bit set?
Best regards,
Tiwei Bie
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-24 1:29 UTC (permalink / raw)
To: Tiwei Bie
Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424011638.vf6f22dnl7ufnnij@debian>
On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > >
> > >
> > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > Hello everyone,
> > > > > >
> > > > > > This RFC implements packed ring support for virtio driver.
> > > > > >
> > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > >
> > > > > > TODO:
> > > > > > - Refinements and bug fixes;
> > > > > > - Split into small patches;
> > > > > > - Test indirect descriptor support;
> > > > > > - Test/fix event suppression support;
> > > > > > - Test devices other than net;
> > > > > >
> > > > > > RFC v1 -> RFC v2:
> > > > > > - Add indirect descriptor support - compile test only;
> > > > > > - Add event suppression supprt - compile test only;
> > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > - Avoid using '%' operator (Jason);
> > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > - Some other refinements and bug fixes;
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > ---
> > > > > > drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
> > > > > > include/linux/virtio_ring.h | 8 +-
> > > > > > include/uapi/linux/virtio_config.h | 12 +-
> > > > > > include/uapi/linux/virtio_ring.h | 61 ++
> > > > > > 4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -58,14 +58,15 @@
> > > > > [...]
> > > > >
> > > > > > +
> > > > > > + if (vq->indirect) {
> > > > > > + u32 len;
> > > > > > +
> > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > + /* Free the indirect table, if any, now that it's unmapped. */
> > > > > > + if (!desc)
> > > > > > + goto out;
> > > > > > +
> > > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > + vq->vring_packed.desc[head].len);
> > > > > > +
> > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > > can safely remove this BUG_ON() here.
> > > > >
> > > > > > + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > > remove this BUG_ON() too.
> > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > And I think something related to this in the spec isn't very
> > > > clear currently.
> > > >
> > > > In the spec, there are below words:
> > > >
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > """
> > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > is reserved and is ignored by the device.
> > > > """
> > > >
> > > > So when device writes back an used descriptor in this case,
> > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > is reserved and should be ignored.
> > > >
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > """
> > > > Element Length is reserved for used descriptors without the
> > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > """
> > > >
> > > > And this is the way how driver ignores the `len` in an used
> > > > descriptor.
> > > >
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > """
> > > > To increase ring capacity the driver can store a (read-only
> > > > by the device) table of indirect descriptors anywhere in memory,
> > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > containing this indirect descriptor table;
> > > > """
> > > >
> > > > So the indirect descriptors in the table are read-only by
> > > > the device. And the only descriptor which is writeable by
> > > > the device is the descriptor in the main virtqueue (with
> > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > > `len` in this descriptor, we won't be able to get the
> > > > length of the data written by the device.
> > > >
> > > > So I think the `len` in this descriptor will carry the
> > > > length of the data written by the device (if the buffers
> > > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > > isn't set by the device. How do you think?
> > >
> > > Yes I think so. But we'd better need clarification from Michael.
> >
> > I think if you use a descriptor, and you want to supply len
> > to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> > Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> > If that's a problem we could look at relaxing that last requirement -
> > does driver want INDIRECT in used descriptor to match
> > the value in the avail descriptor for some reason?
>
> For indirect, driver needs some way to get the length
> of the data written by the driver. And the descriptors
> in the indirect table is read-only, so the only place
> device could put this value is the descriptor with the
> VIRTQ_DESC_F_INDIRECT flag set.
when writing out used descriptor, device should set VIRTQ_DESC_F_WRITE
(and clear VIRTQ_DESC_F_INDIRECT).
> >
> > > >
> > > >
> > > > > The reason is we don't touch descriptor ring in the case of split, so
> > > > > BUG_ON()s may help there.
> > > > >
> > > > > > +
> > > > > > + for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
> > > > > > + vring_unmap_one_packed(vq, &desc[j]);
> > > > > > +
> > > > > > + kfree(desc);
> > > > > > + vq->desc_state[head].indir_desc = NULL;
> > > > > > + } else if (ctx) {
> > > > > > + *ctx = vq->desc_state[head].indir_desc;
> > > > > > + }
> > > > > > +
> > > > > > +out:
> > > > > > + return vq->desc_state[head].num;
> > > > > > +}
> > > > > > +
> > > > > > +static inline bool more_used_split(const struct vring_virtqueue *vq)
> > > > > > {
> > > > > > return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> > > > > > }
> > > > > > +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > > > > > +{
> > > > > > + u16 last_used, flags;
> > > > > > + bool avail, used;
> > > > > > +
> > > > > > + if (vq->vq.num_free == vq->vring_packed.num)
> > > > > > + return false;
> > > > > > +
> > > > > > + last_used = vq->last_used_idx;
> > > > > > + flags = virtio16_to_cpu(vq->vq.vdev,
> > > > > > + vq->vring_packed.desc[last_used].flags);
> > > > > > + avail = flags & VRING_DESC_F_AVAIL(1);
> > > > > > + used = flags & VRING_DESC_F_USED(1);
> > > > > > +
> > > > > > + return avail == used;
> > > > > > +}
> > > > > This looks interesting, spec said:
> > > > >
> > > > > "
> > > > > Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
> > > > > available descriptor and
> > > > > equal for a used descriptor.
> > > > > Note that this observation is mostly useful for sanity-checking as these are
> > > > > necessary but not sufficient
> > > > > conditions - for example, all descriptors are zero-initialized. To detect
> > > > > used and available descriptors it is
> > > > > possible for drivers and devices to keep track of the last observed value of
> > > > > VIRTQ_DESC_F_USED/VIRTQ_-
> > > > > DESC_F_AVAIL. Other techniques to detect
> > > > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > > > might also be possible.
> > > > > "
> > > > >
> > > > > So it looks to me it was not sufficient, looking at the example codes in
> > > > > spec, do we need to track last seen used_wrap_counter here?
> > > > I don't think we have to track used_wrap_counter in
> > > > driver. There was a discussion on this:
> > > >
> > > > https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
> > > >
> > > > And after that, below sentence was added (it's also
> > > > in the above words you quoted):
> > > >
> > > > """
> > > > Other techniques to detect
> > > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > > might also be possible.
> > > > """
> > > >
> > > > Best regards,
> > > > Tiwei Bie
> > >
> > > I see, the extra condition "if (vq->vq.num_free == vq->vring_packed.num)"
> > > help in this case.
> > >
> > > Thanks
> >
> > I still think tracking a wrap counter is better.
>
> >From my understanding, wrap counter is only needed when
> one side just want to update parts of the status bit(s),
> it's something like the "report status" or "write back"
> feature [1] in the hardware NIC. And in the driver, all
> the status must be updated, and that's why I don't want
> to track the usedwrap counter.
>
> [1] https://github.com/btw616/dpdk-virtio1.1/commit/ca837865bd10
>
> Best regards,
> Tiwei Bie
>
> >
> > > >
> > > > > Thanks
^ permalink raw reply
* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-24 1:28 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Siwei Liu, Jiri Pirko, Sridhar Samudrala, David Miller, Netdev,
virtualization, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
Jakub Kicinski, Jason Wang
In-Reply-To: <20180423230037-mutt-send-email-mst@kernel.org>
On Mon, 23 Apr 2018 23:06:55 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >>
> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> > >> > > > >
> > >> > > > >I will NAK patches to change to common code for netvsc especially the
> > >> > > > >three device model. MS worked hard with distro vendors to support transparent
> > >> > > > >mode, ans we really can't have a new model; or do backport.
> > >> > > > >
> > >> > > > >Plus, DPDK is now dependent on existing model.
> > >> > > >
> > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> > >> > >
> > >> > > The network device model is a userspace API, and DPDK is a userspace application.
> > >> >
> > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > >> > AFAIK it's normally banging device registers directly.
> > >> >
> > >> > > You can't go breaking userspace even if you don't like the application.
> > >> >
> > >> > Could you please explain how is the proposed patchset breaking
> > >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> > >> > API at all.
> > >> >
> > >>
> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> > >> to look for Linux netvsc device and the paired VF device and setup the
> > >> DPDK environment. This setup creates a DPDK failsafe (bondingish) instance
> > >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> > >> VF device.
> > >>
> > >> So it depends on existing 2 device model. You can't go to a 3 device model
> > >> or start hiding devices from userspace.
> > >
> > > Okay so how does the existing patch break that? IIUC does not go to
> > > a 3 device model since netvsc calls failover_register directly.
> > >
> > >> Also, I am working on associating netvsc and VF device based on serial number
> > >> rather than MAC address. The serial number is how Windows works now, and it makes
> > >> sense for Linux and Windows to use the same mechanism if possible.
> > >
> > > Maybe we should support same for virtio ...
> > > Which serial do you mean? From vpd?
> > >
> > > I guess you will want to keep supporting MAC for old hypervisors?
> > >
> > > It all seems like a reasonable thing to support in the generic core.
> >
> > That's the reason why I chose explicit identifier rather than rely on
> > MAC address to bind/pair a device. MAC address can change. Even if it
> > can't, malicious guest user can fake MAC address to skip binding.
> >
> > -Siwei
>
> Address should be sampled at device creation to prevent this
> kind of hack. Not that it buys the malicious user much:
> if you can poke at MAC addresses you probably already can
> break networking.
On Hyper-V guest can't really change MAC address if SR-IOV is enabled.
Also, Linux has permanent ether address in netdev which is what should
be used to avoid user messing with device.
^ permalink raw reply
* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-24 1:25 UTC (permalink / raw)
To: Siwei Liu
Cc: Michael S. Tsirkin, Jiri Pirko, Sridhar Samudrala, David Miller,
Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <CADGSJ20ge75T+ddxtUBT4d9m1i3=HLOAHJEoS7Cg0bqnXrutwA@mail.gmail.com>
On Mon, 23 Apr 2018 12:44:39 -0700
Siwei Liu <loseweigh@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> >> > > > >
> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> > > > >three device model. MS worked hard with distro vendors to support transparent
> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> > > > >
> >> > > > >Plus, DPDK is now dependent on existing model.
> >> > > >
> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> >> > >
> >> > > The network device model is a userspace API, and DPDK is a userspace application.
> >> >
> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> > AFAIK it's normally banging device registers directly.
> >> >
> >> > > You can't go breaking userspace even if you don't like the application.
> >> >
> >> > Could you please explain how is the proposed patchset breaking
> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> > API at all.
> >> >
> >>
> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> to look for Linux netvsc device and the paired VF device and setup the
> >> DPDK environment. This setup creates a DPDK failsafe (bondingish) instance
> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> VF device.
> >>
> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> or start hiding devices from userspace.
> >
> > Okay so how does the existing patch break that? IIUC does not go to
> > a 3 device model since netvsc calls failover_register directly.
> >
> >> Also, I am working on associating netvsc and VF device based on serial number
> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> sense for Linux and Windows to use the same mechanism if possible.
> >
> > Maybe we should support same for virtio ...
> > Which serial do you mean? From vpd?
> >
> > I guess you will want to keep supporting MAC for old hypervisors?
The serial number has always been in the hypervisor since original support of SR-IOV
in WS2008. So no backward compatibility special cases would be needed.
^ permalink raw reply
* Re: [PATCH net 0/3] amd-xgbe: AMD XGBE driver fixes 2018-04-23
From: David Miller @ 2018-04-24 1:24 UTC (permalink / raw)
To: thomas.lendacky; +Cc: netdev
In-Reply-To: <20180423164258.18740.98574.stgit@tlendack-t1.amdoffice.net>
From: Tom Lendacky <thomas.lendacky@amd.com>
Date: Mon, 23 Apr 2018 11:42:58 -0500
> This patch series addresses some issues in the AMD XGBE driver.
>
> The following fixes are included in this driver update series:
>
> - Improve KR auto-negotiation and training (2 patches)
> - Add pre and post auto-negotiation hooks
> - Use the pre and post auto-negotiation hooks to disable CDR tracking
> during auto-negotiation page exchange in KR mode
> - Check for SFP tranceiver signal support and only use the signal if the
> SFP indicates that it is supported
>
> This patch series is based on net.
>
> ---
>
> Please queue this patch series up to stable releases 4.14 and above.
Series applied and queued up for -stable, thanks.
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-24 1:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424040258-mutt-send-email-mst@kernel.org>
On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> >
> >
> > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > Hello everyone,
> > > > >
> > > > > This RFC implements packed ring support for virtio driver.
> > > > >
> > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > >
> > > > > TODO:
> > > > > - Refinements and bug fixes;
> > > > > - Split into small patches;
> > > > > - Test indirect descriptor support;
> > > > > - Test/fix event suppression support;
> > > > > - Test devices other than net;
> > > > >
> > > > > RFC v1 -> RFC v2:
> > > > > - Add indirect descriptor support - compile test only;
> > > > > - Add event suppression supprt - compile test only;
> > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > - Avoid using '%' operator (Jason);
> > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > - Some other refinements and bug fixes;
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > ---
> > > > > drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
> > > > > include/linux/virtio_ring.h | 8 +-
> > > > > include/uapi/linux/virtio_config.h | 12 +-
> > > > > include/uapi/linux/virtio_ring.h | 61 ++
> > > > > 4 files changed, 980 insertions(+), 195 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -58,14 +58,15 @@
> > > > [...]
> > > >
> > > > > +
> > > > > + if (vq->indirect) {
> > > > > + u32 len;
> > > > > +
> > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > + /* Free the indirect table, if any, now that it's unmapped. */
> > > > > + if (!desc)
> > > > > + goto out;
> > > > > +
> > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > + vq->vring_packed.desc[head].len);
> > > > > +
> > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > > can safely remove this BUG_ON() here.
> > > >
> > > > > + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > > Len could be ignored for used descriptor according to the spec, so we need
> > > > remove this BUG_ON() too.
> > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > And I think something related to this in the spec isn't very
> > > clear currently.
> > >
> > > In the spec, there are below words:
> > >
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > """
> > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > is reserved and is ignored by the device.
> > > """
> > >
> > > So when device writes back an used descriptor in this case,
> > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > is reserved and should be ignored.
> > >
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > """
> > > Element Length is reserved for used descriptors without the
> > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > """
> > >
> > > And this is the way how driver ignores the `len` in an used
> > > descriptor.
> > >
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > """
> > > To increase ring capacity the driver can store a (read-only
> > > by the device) table of indirect descriptors anywhere in memory,
> > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > containing this indirect descriptor table;
> > > """
> > >
> > > So the indirect descriptors in the table are read-only by
> > > the device. And the only descriptor which is writeable by
> > > the device is the descriptor in the main virtqueue (with
> > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > `len` in this descriptor, we won't be able to get the
> > > length of the data written by the device.
> > >
> > > So I think the `len` in this descriptor will carry the
> > > length of the data written by the device (if the buffers
> > > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > > isn't set by the device. How do you think?
> >
> > Yes I think so. But we'd better need clarification from Michael.
>
> I think if you use a descriptor, and you want to supply len
> to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> If that's a problem we could look at relaxing that last requirement -
> does driver want INDIRECT in used descriptor to match
> the value in the avail descriptor for some reason?
For indirect, driver needs some way to get the length
of the data written by the driver. And the descriptors
in the indirect table is read-only, so the only place
device could put this value is the descriptor with the
VIRTQ_DESC_F_INDIRECT flag set.
>
> > >
> > >
> > > > The reason is we don't touch descriptor ring in the case of split, so
> > > > BUG_ON()s may help there.
> > > >
> > > > > +
> > > > > + for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
> > > > > + vring_unmap_one_packed(vq, &desc[j]);
> > > > > +
> > > > > + kfree(desc);
> > > > > + vq->desc_state[head].indir_desc = NULL;
> > > > > + } else if (ctx) {
> > > > > + *ctx = vq->desc_state[head].indir_desc;
> > > > > + }
> > > > > +
> > > > > +out:
> > > > > + return vq->desc_state[head].num;
> > > > > +}
> > > > > +
> > > > > +static inline bool more_used_split(const struct vring_virtqueue *vq)
> > > > > {
> > > > > return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> > > > > }
> > > > > +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > > > > +{
> > > > > + u16 last_used, flags;
> > > > > + bool avail, used;
> > > > > +
> > > > > + if (vq->vq.num_free == vq->vring_packed.num)
> > > > > + return false;
> > > > > +
> > > > > + last_used = vq->last_used_idx;
> > > > > + flags = virtio16_to_cpu(vq->vq.vdev,
> > > > > + vq->vring_packed.desc[last_used].flags);
> > > > > + avail = flags & VRING_DESC_F_AVAIL(1);
> > > > > + used = flags & VRING_DESC_F_USED(1);
> > > > > +
> > > > > + return avail == used;
> > > > > +}
> > > > This looks interesting, spec said:
> > > >
> > > > "
> > > > Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
> > > > available descriptor and
> > > > equal for a used descriptor.
> > > > Note that this observation is mostly useful for sanity-checking as these are
> > > > necessary but not sufficient
> > > > conditions - for example, all descriptors are zero-initialized. To detect
> > > > used and available descriptors it is
> > > > possible for drivers and devices to keep track of the last observed value of
> > > > VIRTQ_DESC_F_USED/VIRTQ_-
> > > > DESC_F_AVAIL. Other techniques to detect
> > > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > > might also be possible.
> > > > "
> > > >
> > > > So it looks to me it was not sufficient, looking at the example codes in
> > > > spec, do we need to track last seen used_wrap_counter here?
> > > I don't think we have to track used_wrap_counter in
> > > driver. There was a discussion on this:
> > >
> > > https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
> > >
> > > And after that, below sentence was added (it's also
> > > in the above words you quoted):
> > >
> > > """
> > > Other techniques to detect
> > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > might also be possible.
> > > """
> > >
> > > Best regards,
> > > Tiwei Bie
> >
> > I see, the extra condition "if (vq->vq.num_free == vq->vring_packed.num)"
> > help in this case.
> >
> > Thanks
>
> I still think tracking a wrap counter is better.
>From my understanding, wrap counter is only needed when
one side just want to update parts of the status bit(s),
it's something like the "report status" or "write back"
feature [1] in the hardware NIC. And in the driver, all
the status must be updated, and that's why I don't want
to track the usedwrap counter.
[1] https://github.com/btw616/dpdk-virtio1.1/commit/ca837865bd10
Best regards,
Tiwei Bie
>
> > >
> > > > Thanks
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Jason Wang @ 2018-04-24 1:14 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Tiwei Bie, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180424040258-mutt-send-email-mst@kernel.org>
On 2018年04月24日 09:05, Michael S. Tsirkin wrote:
>>>>> + if (vq->indirect) {
>>>>> + u32 len;
>>>>> +
>>>>> + desc = vq->desc_state[head].indir_desc;
>>>>> + /* Free the indirect table, if any, now that it's unmapped. */
>>>>> + if (!desc)
>>>>> + goto out;
>>>>> +
>>>>> + len = virtio32_to_cpu(vq->vq.vdev,
>>>>> + vq->vring_packed.desc[head].len);
>>>>> +
>>>>> + BUG_ON(!(vq->vring_packed.desc[head].flags &
>>>>> + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
>>>> It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
>>>> can safely remove this BUG_ON() here.
>>>>
>>>>> + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
>>>> Len could be ignored for used descriptor according to the spec, so we need
>>>> remove this BUG_ON() too.
>>> Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
>>> And I think something related to this in the spec isn't very
>>> clear currently.
>>>
>>> In the spec, there are below words:
>>>
>>> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
>>> """
>>> In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
>>> is reserved and is ignored by the device.
>>> """
>>>
>>> So when device writes back an used descriptor in this case,
>>> device may not set the VIRTQ_DESC_F_WRITE flag as the flag
>>> is reserved and should be ignored.
>>>
>>> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
>>> """
>>> Element Length is reserved for used descriptors without the
>>> VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
>>> """
>>>
>>> And this is the way how driver ignores the `len` in an used
>>> descriptor.
>>>
>>> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
>>> """
>>> To increase ring capacity the driver can store a (read-only
>>> by the device) table of indirect descriptors anywhere in memory,
>>> and insert a descriptor in the main virtqueue (with \field{Flags}
>>> bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
>>> containing this indirect descriptor table;
>>> """
>>>
>>> So the indirect descriptors in the table are read-only by
>>> the device. And the only descriptor which is writeable by
>>> the device is the descriptor in the main virtqueue (with
>>> Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
>>> `len` in this descriptor, we won't be able to get the
>>> length of the data written by the device.
>>>
>>> So I think the `len` in this descriptor will carry the
>>> length of the data written by the device (if the buffers
>>> are writable to the device) even if the VIRTQ_DESC_F_WRITE
>>> isn't set by the device. How do you think?
>> Yes I think so. But we'd better need clarification from Michael.
> I think if you use a descriptor, and you want to supply len
> to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
> Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
> If that's a problem we could look at relaxing that last requirement -
> does driver want INDIRECT in used descriptor to match
> the value in the avail descriptor for some reason?
>
Looks not, so what I get it:
- device and set VIRTQ_DESC_F_WRITE flag for used descriptor when needed
- there no need to keep INDIRECT flag in used descriptor
So for the above case, we can just have a used descriptor with _F_WRITE
but without INDIRECT flag.
Thanks
^ permalink raw reply
* Re: [PATCH net] pppoe: check sockaddr length in pppoe_connect()
From: David Miller @ 2018-04-24 1:12 UTC (permalink / raw)
To: g.nault; +Cc: netdev, mostrows
In-Reply-To: <387ca48810af36f2626049008a795d1adc375cb8.1524494257.git.g.nault@alphalink.fr>
From: Guillaume Nault <g.nault@alphalink.fr>
Date: Mon, 23 Apr 2018 16:38:27 +0200
> We must validate sockaddr_len, otherwise userspace can pass fewer data
> than we expect and we end up accessing invalid data.
>
> Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers")
> Reported-by: syzbot+4f03bdf92fdf9ef5ddab@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Applied and queued up for -stable, thank you.
^ permalink raw reply
* Re: [PATCH net] l2tp: check sockaddr length in pppol2tp_connect()
From: David Miller @ 2018-04-24 1:11 UTC (permalink / raw)
To: g.nault; +Cc: netdev, jchapman
In-Reply-To: <e76f65e9826fe7591dc8bba561461c993e5643e0.1524492877.git.g.nault@alphalink.fr>
From: Guillaume Nault <g.nault@alphalink.fr>
Date: Mon, 23 Apr 2018 16:15:14 +0200
> Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that
> it actually points to valid data.
>
> Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
> Reported-by: syzbot+a70ac890b23b1bf29f5c@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Applied and queued up for -stable.
I guess you can completely remove the "bad socket address" -EINVAL else
clause later in the function as a cleanup in net-next.
^ permalink raw reply
* Re: [PATCH] selftests: net: update .gitignore with missing test
From: David Miller @ 2018-04-24 1:07 UTC (permalink / raw)
To: anders.roxell; +Cc: shuah, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20180423140050.4508-1-anders.roxell@linaro.org>
From: Anders Roxell <anders.roxell@linaro.org>
Date: Mon, 23 Apr 2018 16:00:50 +0200
> Fixes: 192dc405f308 ("selftests: net: add tcp_mmap program")
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Applied, thanks.
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-24 1:05 UTC (permalink / raw)
To: Jason Wang
Cc: Tiwei Bie, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <e08c55df-6ea2-cf7f-d7c9-91e55913ab16@redhat.com>
On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
>
>
> On 2018年04月23日 17:29, Tiwei Bie wrote:
> > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > Hello everyone,
> > > >
> > > > This RFC implements packed ring support for virtio driver.
> > > >
> > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > >
> > > > TODO:
> > > > - Refinements and bug fixes;
> > > > - Split into small patches;
> > > > - Test indirect descriptor support;
> > > > - Test/fix event suppression support;
> > > > - Test devices other than net;
> > > >
> > > > RFC v1 -> RFC v2:
> > > > - Add indirect descriptor support - compile test only;
> > > > - Add event suppression supprt - compile test only;
> > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > - Avoid using '%' operator (Jason);
> > > > - Rename free_head -> next_avail_idx (Jason);
> > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > - Some other refinements and bug fixes;
> > > >
> > > > Thanks!
> > > >
> > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > ---
> > > > drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
> > > > include/linux/virtio_ring.h | 8 +-
> > > > include/uapi/linux/virtio_config.h | 12 +-
> > > > include/uapi/linux/virtio_ring.h | 61 ++
> > > > 4 files changed, 980 insertions(+), 195 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 71458f493cf8..0515dca34d77 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -58,14 +58,15 @@
> > > [...]
> > >
> > > > +
> > > > + if (vq->indirect) {
> > > > + u32 len;
> > > > +
> > > > + desc = vq->desc_state[head].indir_desc;
> > > > + /* Free the indirect table, if any, now that it's unmapped. */
> > > > + if (!desc)
> > > > + goto out;
> > > > +
> > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > + vq->vring_packed.desc[head].len);
> > > > +
> > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
> > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
> > > can safely remove this BUG_ON() here.
> > >
> > > > + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
> > > Len could be ignored for used descriptor according to the spec, so we need
> > > remove this BUG_ON() too.
> > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > And I think something related to this in the spec isn't very
> > clear currently.
> >
> > In the spec, there are below words:
> >
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > """
> > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > is reserved and is ignored by the device.
> > """
> >
> > So when device writes back an used descriptor in this case,
> > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > is reserved and should be ignored.
> >
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > """
> > Element Length is reserved for used descriptors without the
> > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > """
> >
> > And this is the way how driver ignores the `len` in an used
> > descriptor.
> >
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > """
> > To increase ring capacity the driver can store a (read-only
> > by the device) table of indirect descriptors anywhere in memory,
> > and insert a descriptor in the main virtqueue (with \field{Flags}
> > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > containing this indirect descriptor table;
> > """
> >
> > So the indirect descriptors in the table are read-only by
> > the device. And the only descriptor which is writeable by
> > the device is the descriptor in the main virtqueue (with
> > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > `len` in this descriptor, we won't be able to get the
> > length of the data written by the device.
> >
> > So I think the `len` in this descriptor will carry the
> > length of the data written by the device (if the buffers
> > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > isn't set by the device. How do you think?
>
> Yes I think so. But we'd better need clarification from Michael.
I think if you use a descriptor, and you want to supply len
to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
If that's a problem we could look at relaxing that last requirement -
does driver want INDIRECT in used descriptor to match
the value in the avail descriptor for some reason?
> >
> >
> > > The reason is we don't touch descriptor ring in the case of split, so
> > > BUG_ON()s may help there.
> > >
> > > > +
> > > > + for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
> > > > + vring_unmap_one_packed(vq, &desc[j]);
> > > > +
> > > > + kfree(desc);
> > > > + vq->desc_state[head].indir_desc = NULL;
> > > > + } else if (ctx) {
> > > > + *ctx = vq->desc_state[head].indir_desc;
> > > > + }
> > > > +
> > > > +out:
> > > > + return vq->desc_state[head].num;
> > > > +}
> > > > +
> > > > +static inline bool more_used_split(const struct vring_virtqueue *vq)
> > > > {
> > > > return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
> > > > }
> > > > +static inline bool more_used_packed(const struct vring_virtqueue *vq)
> > > > +{
> > > > + u16 last_used, flags;
> > > > + bool avail, used;
> > > > +
> > > > + if (vq->vq.num_free == vq->vring_packed.num)
> > > > + return false;
> > > > +
> > > > + last_used = vq->last_used_idx;
> > > > + flags = virtio16_to_cpu(vq->vq.vdev,
> > > > + vq->vring_packed.desc[last_used].flags);
> > > > + avail = flags & VRING_DESC_F_AVAIL(1);
> > > > + used = flags & VRING_DESC_F_USED(1);
> > > > +
> > > > + return avail == used;
> > > > +}
> > > This looks interesting, spec said:
> > >
> > > "
> > > Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
> > > available descriptor and
> > > equal for a used descriptor.
> > > Note that this observation is mostly useful for sanity-checking as these are
> > > necessary but not sufficient
> > > conditions - for example, all descriptors are zero-initialized. To detect
> > > used and available descriptors it is
> > > possible for drivers and devices to keep track of the last observed value of
> > > VIRTQ_DESC_F_USED/VIRTQ_-
> > > DESC_F_AVAIL. Other techniques to detect
> > > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > > might also be possible.
> > > "
> > >
> > > So it looks to me it was not sufficient, looking at the example codes in
> > > spec, do we need to track last seen used_wrap_counter here?
> > I don't think we have to track used_wrap_counter in
> > driver. There was a discussion on this:
> >
> > https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
> >
> > And after that, below sentence was added (it's also
> > in the above words you quoted):
> >
> > """
> > Other techniques to detect
> > VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> > might also be possible.
> > """
> >
> > Best regards,
> > Tiwei Bie
>
> I see, the extra condition "if (vq->vq.num_free == vq->vring_packed.num)"
> help in this case.
>
> Thanks
I still think tracking a wrap counter is better.
> >
> > > Thanks
^ permalink raw reply
* Re: [RFC V3 PATCH 0/8] Packed ring for vhost
From: Jason Wang @ 2018-04-24 1:04 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, Michael S. Tsirkin
Cc: kvm, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423201151.GD5215@char.us.oracle.com>
On 2018年04月24日 04:11, Konrad Rzeszutek Wilk wrote:
> On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote:
>> On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
>>>> Hi all:
>>>>
>>>> This RFC implement packed ring layout. The code were tested with
>>>> Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
>>>> tweaks were needed on top of Tiwei's code to make it run. TCP stream
>>>> and pktgen does not show obvious difference compared with split ring.
>>> I have to ask then - what is the benefit of this?
>> You can use this with dpdk within guest.
>> The DPDK version did see a performance improvement so hopefully with
> Is there a link to this performance improvement document?
>
You probably want to this:
https://www.mail-archive.com/dev@dpdk.org/msg97735.html
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] dca: make function dca_common_get_tag static
From: David Miller @ 2018-04-24 1:03 UTC (permalink / raw)
To: colin.king; +Cc: netdev, kernel-janitors, linux-kernel
In-Reply-To: <20180423124938.26070-1-colin.king@canonical.com>
From: Colin King <colin.king@canonical.com>
Date: Mon, 23 Apr 2018 13:49:38 +0100
> From: Colin Ian King <colin.king@canonical.com>
>
> Function dca_common_get_tag is local to the source and does not need to be
> in global scope, so make it static.
>
> Cleans up sparse warning:
> drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was
> not declared. Should it be static?
>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
Applied to net-next, thanks.
^ permalink raw reply
* Re: [PATCH V6 net-next 08/15] net/tls: Support TLS device offload with IPv6
From: David Miller @ 2018-04-24 0:56 UTC (permalink / raw)
To: borisp; +Cc: netdev, saeedm, davejwatson, ktkhai, ilyal
In-Reply-To: <1524410397-108425-9-git-send-email-borisp@mellanox.com>
From: Boris Pismenny <borisp@mellanox.com>
Date: Sun, 22 Apr 2018 18:19:50 +0300
> @@ -97,13 +102,57 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
> spin_unlock_irqrestore(&tls_device_lock, flags);
> }
>
> +#if IS_ENABLED(CONFIG_IPV6)
> +static struct net_device *ipv6_get_netdev(struct sock *sk)
> +{
> + struct net_device *dev = NULL;
> + struct inet_sock *inet = inet_sk(sk);
> + struct ipv6_pinfo *np = inet6_sk(sk);
> + struct flowi6 _fl6, *fl6 = &_fl6;
> + struct dst_entry *dst;
Ugh, please use sk->sk_dst_cache->dev and avoid all of the unnecessary
work.
Thank you.
^ permalink raw reply
* Re: [PATCH V6 net-next 07/15] net/tls: Add generic NIC offload infrastructure
From: David Miller @ 2018-04-24 0:55 UTC (permalink / raw)
To: borisp; +Cc: netdev, saeedm, davejwatson, ktkhai, ilyal, aviadye
In-Reply-To: <1524410397-108425-8-git-send-email-borisp@mellanox.com>
From: Boris Pismenny <borisp@mellanox.com>
Date: Sun, 22 Apr 2018 18:19:49 +0300
> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> + struct inet_sock *inet = inet_sk(sk);
> + struct net_device *netdev = NULL;
> +
> + netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> + return netdev;
> +}
Please do this more directly by looking at sk->sk_dst_cache and taking
the device from dst->dev if non-NULL.
That avoids the dev_get_by_index() demux work, and also if
sk->sk_dst_cache is non-NULL then the socket is indeed connected.
Thanks.
^ permalink raw reply
* Re: [RFC v2] virtio: support packed ring
From: Jason Wang @ 2018-04-24 0:54 UTC (permalink / raw)
To: Tiwei Bie; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180423092908.77rii3gi7dcaf7o6@debian>
On 2018年04月23日 17:29, Tiwei Bie wrote:
> On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
>> On 2018年04月01日 22:12, Tiwei Bie wrote:
>>> Hello everyone,
>>>
>>> This RFC implements packed ring support for virtio driver.
>>>
>>> The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
>>> by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
>>> Minor changes are needed for the vhost code, e.g. to kick the guest.
>>>
>>> TODO:
>>> - Refinements and bug fixes;
>>> - Split into small patches;
>>> - Test indirect descriptor support;
>>> - Test/fix event suppression support;
>>> - Test devices other than net;
>>>
>>> RFC v1 -> RFC v2:
>>> - Add indirect descriptor support - compile test only;
>>> - Add event suppression supprt - compile test only;
>>> - Move vring_packed_init() out of uapi (Jason, MST);
>>> - Merge two loops into one in virtqueue_add_packed() (Jason);
>>> - Split vring_unmap_one() for packed ring and split ring (Jason);
>>> - Avoid using '%' operator (Jason);
>>> - Rename free_head -> next_avail_idx (Jason);
>>> - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
>>> - Some other refinements and bug fixes;
>>>
>>> Thanks!
>>>
>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
>>> ---
>>> drivers/virtio/virtio_ring.c | 1094 +++++++++++++++++++++++++++++-------
>>> include/linux/virtio_ring.h | 8 +-
>>> include/uapi/linux/virtio_config.h | 12 +-
>>> include/uapi/linux/virtio_ring.h | 61 ++
>>> 4 files changed, 980 insertions(+), 195 deletions(-)
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 71458f493cf8..0515dca34d77 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -58,14 +58,15 @@
>> [...]
>>
>>> +
>>> + if (vq->indirect) {
>>> + u32 len;
>>> +
>>> + desc = vq->desc_state[head].indir_desc;
>>> + /* Free the indirect table, if any, now that it's unmapped. */
>>> + if (!desc)
>>> + goto out;
>>> +
>>> + len = virtio32_to_cpu(vq->vq.vdev,
>>> + vq->vring_packed.desc[head].len);
>>> +
>>> + BUG_ON(!(vq->vring_packed.desc[head].flags &
>>> + cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));
>> It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
>> can safely remove this BUG_ON() here.
>>
>>> + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));
>> Len could be ignored for used descriptor according to the spec, so we need
>> remove this BUG_ON() too.
> Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> And I think something related to this in the spec isn't very
> clear currently.
>
> In the spec, there are below words:
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> """
> In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> is reserved and is ignored by the device.
> """
>
> So when device writes back an used descriptor in this case,
> device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> is reserved and should be ignored.
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> """
> Element Length is reserved for used descriptors without the
> VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> """
>
> And this is the way how driver ignores the `len` in an used
> descriptor.
>
> https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> """
> To increase ring capacity the driver can store a (read-only
> by the device) table of indirect descriptors anywhere in memory,
> and insert a descriptor in the main virtqueue (with \field{Flags}
> bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> containing this indirect descriptor table;
> """
>
> So the indirect descriptors in the table are read-only by
> the device. And the only descriptor which is writeable by
> the device is the descriptor in the main virtqueue (with
> Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> `len` in this descriptor, we won't be able to get the
> length of the data written by the device.
>
> So I think the `len` in this descriptor will carry the
> length of the data written by the device (if the buffers
> are writable to the device) even if the VIRTQ_DESC_F_WRITE
> isn't set by the device. How do you think?
Yes I think so. But we'd better need clarification from Michael.
>
>
>> The reason is we don't touch descriptor ring in the case of split, so
>> BUG_ON()s may help there.
>>
>>> +
>>> + for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
>>> + vring_unmap_one_packed(vq, &desc[j]);
>>> +
>>> + kfree(desc);
>>> + vq->desc_state[head].indir_desc = NULL;
>>> + } else if (ctx) {
>>> + *ctx = vq->desc_state[head].indir_desc;
>>> + }
>>> +
>>> +out:
>>> + return vq->desc_state[head].num;
>>> +}
>>> +
>>> +static inline bool more_used_split(const struct vring_virtqueue *vq)
>>> {
>>> return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx);
>>> }
>>> +static inline bool more_used_packed(const struct vring_virtqueue *vq)
>>> +{
>>> + u16 last_used, flags;
>>> + bool avail, used;
>>> +
>>> + if (vq->vq.num_free == vq->vring_packed.num)
>>> + return false;
>>> +
>>> + last_used = vq->last_used_idx;
>>> + flags = virtio16_to_cpu(vq->vq.vdev,
>>> + vq->vring_packed.desc[last_used].flags);
>>> + avail = flags & VRING_DESC_F_AVAIL(1);
>>> + used = flags & VRING_DESC_F_USED(1);
>>> +
>>> + return avail == used;
>>> +}
>> This looks interesting, spec said:
>>
>> "
>> Thus VIRTQ_DESC_F_AVAIL and VIRTQ_DESC_F_USED bits are different for an
>> available descriptor and
>> equal for a used descriptor.
>> Note that this observation is mostly useful for sanity-checking as these are
>> necessary but not sufficient
>> conditions - for example, all descriptors are zero-initialized. To detect
>> used and available descriptors it is
>> possible for drivers and devices to keep track of the last observed value of
>> VIRTQ_DESC_F_USED/VIRTQ_-
>> DESC_F_AVAIL. Other techniques to detect
>> VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
>> might also be possible.
>> "
>>
>> So it looks to me it was not sufficient, looking at the example codes in
>> spec, do we need to track last seen used_wrap_counter here?
> I don't think we have to track used_wrap_counter in
> driver. There was a discussion on this:
>
> https://lists.oasis-open.org/archives/virtio-dev/201802/msg00177.html
>
> And after that, below sentence was added (it's also
> in the above words you quoted):
>
> """
> Other techniques to detect
> VIRTQ_DESC_F_AVAIL/VIRTQ_DESC_F_USED bit changes
> might also be possible.
> """
>
> Best regards,
> Tiwei Bie
I see, the extra condition "if (vq->vq.num_free ==
vq->vring_packed.num)" help in this case.
Thanks
>
>> Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Mikulas Patocka @ 2018-04-24 0:25 UTC (permalink / raw)
To: Michal Hocko
Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
eric.dumazet, edumazet, netdev, linux-kernel, mst, jasowang,
virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <20180423151545.GU17484@dhcp22.suse.cz>
On Mon, 23 Apr 2018, Michal Hocko wrote:
> On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
>
> > > > He didn't want to fix vmalloc(GFP_NOIO)
> > >
> > > I don't remember that conversation, so I don't know whether I agree with
> > > his reasoning or not. But we are supposed to be moving away from GFP_NOIO
> > > towards marking regions with memalloc_noio_save() / restore. If you do
> > > that, you won't need vmalloc(GFP_NOIO).
> >
> > He said the same thing a year ago. And there was small progress. 6 out of
> > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in
> > infiniband and 1 in btrfs. (the whole discussion is here
> > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
>
> Well this is not that easy. It requires a cooperation from maintainers.
> I can only do as much. I've posted patches in the past and actively
> bringing up this topic at LSFMM last two years...
You're right - but you have chosen the uneasy path. Fixing __vmalloc code
is easy and it doesn't require cooperation with maintainers.
> > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4
> > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why
> > does he have veto over this part of the code? I'd much rather argue with
> > people who have constructive comments about fixing bugs than with him.
>
> I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> I would be much more willing to change my mind if you would back your
> patch by a real bug report. Hacks are acceptable when we have a real
> issue in hands. But if we want to fix potential issue then better make
> it properly.
Developers should fix bugs in advance, not to wait until a crash hapens,
is analyzed and reported.
What's the problem with 15-line hack? Is the problem that kernel
developers would feel depressed when looking the source code? Other than
harming developers' feelings, I don't see what kind of damange could that
piece of code do.
Mikulas
^ permalink raw reply
* Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Yonghong Song @ 2018-04-24 0:06 UTC (permalink / raw)
To: Peter Zijlstra, Song Liu
Cc: netdev, ast, daniel, kernel-team, hannes, qinteng
In-Reply-To: <20180312220132.GH4043@hirez.programming.kicks-ass.net>
Hi, Peter,
I have a question regarding to one of your comments below.
On 3/12/18 3:01 PM, Peter Zijlstra wrote:
> On Mon, Mar 12, 2018 at 01:39:56PM -0700, Song Liu wrote:
>> +static void stack_map_get_build_id_offset(struct bpf_map *map,
>> + struct stack_map_bucket *bucket,
>> + u64 *ips, u32 trace_nr)
>> +{
>> + int i;
>> + struct vm_area_struct *vma;
>> + struct bpf_stack_build_id *id_offs;
>> +
>> + bucket->nr = trace_nr;
>> + id_offs = (struct bpf_stack_build_id *)bucket->data;
>> +
>> + if (!current || !current->mm ||
>> + down_read_trylock(¤t->mm->mmap_sem) == 0) {
>
> You probably want an in_nmi() before the down_read_trylock(). Doing
> up_read() is an absolute no-no from NMI context.
The below is the final code from Song:
/*
* We cannot do up_read() in nmi context, so build_id lookup is
* only supported for non-nmi events. If at some point, it is
* possible to run find_vma() without taking the semaphore, we
* would like to allow build_id lookup in nmi context.
*
* Same fallback is used for kernel stack (!user) on a stackmap
* with build_id.
*/
if (!user || !current || !current->mm || in_nmi() ||
down_read_trylock(¤t->mm->mmap_sem) == 0) {
/* cannot access current->mm, fall back to ips */
for (i = 0; i < trace_nr; i++) {
id_offs[i].status = BPF_STACK_BUILD_ID_IP;
id_offs[i].ip = ips[i];
}
return;
}
....
>
> And IIUC its 'trivial' to use this stuff with hardware counters.
Here, you mentioned that it was 'trivial' to use buildid thing with
hardware counters, if I interpreted correctly. However, the hardware
counter overflow will trigger NMI. Based on the above logic,
it will default to old IP only behavior.
Could you explain a little more how to get buildid with hardware
counter overflow events?
Thanks!
>
>> + /* cannot access current->mm, fall back to ips */
>> + for (i = 0; i < trace_nr; i++) {
>> + id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>> + id_offs[i].ip = ips[i];
>> + }
>> + return;
>> + }
>> +
>> + for (i = 0; i < trace_nr; i++) {
>> + vma = find_vma(current->mm, ips[i]);
>> + if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
>> + /* per entry fall back to ips */
>> + id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>> + id_offs[i].ip = ips[i];
>> + continue;
>> + }
>> + id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
>> + - vma->vm_start;
>> + id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
>> + }
>> + up_read(¤t->mm->mmap_sem);
>> +}
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox