Netdev List
* [RFC PATCH net] tcp: allow to use TCP Fastopen with MSG_ZEROCOPY
From: Alexey Kodanev @ 2018-04-03 12:43 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn, Eric Dumazet, David Miller, Alexey Kodanev

With TCP Fastopen we can have the following cases, which could also
use MSG_ZEROCOPY flag with send() and sendto():

* sendto() + MSG_FASTOPEN flag, sk state can be in TCP_CLOSE at
  the start of tcp_sendmsg()

* set socket option TCP_FASTOPEN_CONNECT, then connect()
  and send(), sk state in TCP_SYN_SENT

Currently, in both cases tcp_sendmsg() with the MSG_ZEROCOPY flag results
in an EINVAL error, because of the check for the TCP_ESTABLISHED sk state
at the beginning of tcp_sendmsg().

Supporting both cases would require two more checks there:
!tp->fastopen_connect and !(flags & MSG_FASTOPEN). Since this is an
unlikely event, it looks like we could instead remove the original check
altogether. That way tcp_sendmsg() without TFO should fail with EPIPE in
sk_stream_wait_connect(), as it did before MSG_ZEROCOPY was introduced
there, and the TFO cases should work smoothly.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---

Is there something that I've overlooked and we can't use it here, and
we should handle this type of error, while using sendto() + TFO,
in userspace?

 net/ipv4/tcp.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9225610..768f02c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1193,11 +1193,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	flags = msg->msg_flags;
 
 	if (flags & MSG_ZEROCOPY && size) {
-		if (sk->sk_state != TCP_ESTABLISHED) {
-			err = -EINVAL;
-			goto out_err;
-		}
-
 		skb = tcp_write_queue_tail(sk);
 		uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
 		if (!uarg) {
-- 
1.8.3.1
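
To make the two options in the commit message concrete, here is a small
userspace model (not kernel code) of the state check. It compares the
original check with the variant that adds the two TFO conditions. The
MODEL_* names and both function names are hypothetical stand-ins for
sk->sk_state, tp->fastopen_connect and MSG_FASTOPEN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel socket states and send flag. */
enum model_state { MODEL_ESTABLISHED, MODEL_SYN_SENT, MODEL_CLOSE };
#define MODEL_MSG_FASTOPEN 0x1u

/* Original check: every non-established socket is rejected. */
int zc_check_original(enum model_state state)
{
	assert(state >= MODEL_ESTABLISHED && state <= MODEL_CLOSE);
	return state == MODEL_ESTABLISHED ? 0 : -EINVAL;
}

/* Variant with the two extra conditions: TFO senders are let through. */
int zc_check_with_tfo(enum model_state state, bool fastopen_connect,
		      unsigned int flags)
{
	assert(state >= MODEL_ESTABLISHED && state <= MODEL_CLOSE);
	if (state != MODEL_ESTABLISHED && !fastopen_connect &&
	    !(flags & MODEL_MSG_FASTOPEN))
		return -EINVAL;
	return 0;
}
```

The patch itself takes the third option and drops the check, so in this
model neither function would be called at all.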


* Re: [PATCH net-next v2 1/2] fs/crashdd: add API to collect hardware dump in second kernel
From: Andrew Lunn @ 2018-04-03 12:35 UTC (permalink / raw)
  To: Alex Vesker
  Cc: Jiri Pirko, Rahul Lakkireddy, netdev, linux-fsdevel, kexec,
	linux-kernel, davem, viro, ebiederm, stephen, akpm, torvalds,
	ganeshgr, nirranjan, indranil
In-Reply-To: <0d106ece-4669-389d-da30-63a630ca625c@mellanox.com>

On Tue, Apr 03, 2018 at 08:43:27AM +0300, Alex Vesker wrote:
> 
> 
> On 4/2/2018 12:12 PM, Jiri Pirko wrote:
> >Fri, Mar 30, 2018 at 05:11:29PM CEST, andrew@lunn.ch wrote:
> >>>Please see:
> >>>http://patchwork.ozlabs.org/project/netdev/list/?series=36524
> >>>
> >>>I believe that the solution in the patchset could be used for
> >>>your use case too.
> >>Hi Jiri
> >>
> >>https://lkml.org/lkml/2018/3/20/436
> >>
> >>How well does this API work for a 2Gbyte snapshot?
> >Ccing Alex who did the tests.
> 
> I didn't check the performance for such a large snapshot.
> From my measurement it takes 0.09s for 1 MB of data, which means
> about 3 minutes for 2 GB.

I was not really thinking about performance. More about how well does
the system work when you ask the kernel for 2GB of RAM to put a
snapshot into? And given your current design, you need another 2GB
buffer for the driver to use before calling this new API.

So I'm asking, how well does this API scale?

I think you need to remove the need for a second buffer in the
driver. Either the driver allocates the buffer and hands it over, or
your core code allocates the buffer and gives it to the driver to
fill. Maybe look at what makes most sense for the crash dump code?

      Andrew
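
The second option above (the core allocates the buffer and gives it to
the driver to fill) could look roughly like the userspace sketch below.
It is only an illustration of the ownership model, with a single
allocation and no staging copy in the driver. All names (dump_collect,
dump_fill_fn, example_fill) are hypothetical and not part of the
patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver callback: write len bytes of dump data into buf. */
typedef int (*dump_fill_fn)(void *buf, size_t len, void *priv);

struct dump {
	void *buf;
	size_t len;
};

/* The core allocates the only buffer; the driver fills it in place. */
int dump_collect(struct dump *d, size_t len, dump_fill_fn fill, void *priv)
{
	int err;

	assert(d && fill && len > 0);
	d->buf = malloc(len);
	if (!d->buf)
		return -1;
	d->len = len;

	err = fill(d->buf, len, priv);
	if (err) {
		/* On failure the core releases the buffer again. */
		free(d->buf);
		d->buf = NULL;
		d->len = 0;
	}
	return err;
}

/* Example driver callback: fill the buffer with one pattern byte. */
int example_fill(void *buf, size_t len, void *priv)
{
	memset(buf, *(unsigned char *)priv, len);
	return 0;
}
```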


* Re: linux-next: build failure after merge of the tip tree
From: David Howells @ 2018-04-03 12:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Stephen Rothwell, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, David Miller, Networking, Linux-Next Mailing List,
	Linux Kernel Mailing List
In-Reply-To: <20180403093030.GB4082@hirez.programming.kicks-ass.net>

Peter Zijlstra <peterz@infradead.org> wrote:

> I figured that since there were only a handful of users it wasn't a
> popular API, also David very much knew of those patches changing it so
> could easily have pulled in the special tip/sched/wait branch :/

I'm not sure I could, since I have to base on net-next.  I'm not sure what
DaveM's policy on that is.

Also, it might've been better not to simply erase the atomic_t wait API
immediately, but substitute wrappers for it to be removed one iteration hence.

David


* Re: linux-next: build failure after merge of the tip tree
From: David Howells @ 2018-04-03 12:41 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: dhowells, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, David Miller, Networking, Linux-Next Mailing List,
	Linux Kernel Mailing List
In-Reply-To: <20180403154122.00d76d61@canb.auug.org.au>

Stephen Rothwell <sfr@canb.auug.org.au> wrote:

> +	wait_var_event(&rxnet->nr_calls, !atomic_read(&rxnet->nr_calls));

I would prefer == 0 to ! as it's not really a true/false value.

But apart from that, it looks okay and you can add my Reviewed-by.

David


* Re: linux-next: build failure after merge of the tip tree
From: Peter Zijlstra @ 2018-04-03 12:42 UTC (permalink / raw)
  To: David Howells
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	David Miller, Networking, Linux-Next Mailing List,
	Linux Kernel Mailing List
In-Reply-To: <29149.1522759148@warthog.procyon.org.uk>

On Tue, Apr 03, 2018 at 01:39:08PM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > I figured that since there were only a handful of users it wasn't a
> > popular API, also David very much knew of those patches changing it so
> > could easily have pulled in the special tip/sched/wait branch :/
> 
> I'm not sure I could, since I have to base on net-next.  I'm not sure what
> DaveM's policy on that is.
> 
> Also, it might've been better not to simply erase the atomic_t wait API
> immediately, but substitute wrappers for it to be removed one iteration hence.

Yeah, I know, but I really wasn't expecting new users of this thing, it
seemed like quite an exotic API with very limited users.

Ah well..


* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Andrew Lunn @ 2018-04-03 13:04 UTC (permalink / raw)
  To: Razvan Stefanescu
  Cc: Arnd Bergmann, gregkh, Laurentiu Tudor, Linux Kernel Mailing List,
	Stuart Yoder, Ruxandra Ioana Ciocoi Radulescu, Roy Pledge,
	Networking, Ioana Ciornei
In-Reply-To: <AM3PR04MB07439B79DD857CD2F8033623E6A50@AM3PR04MB0743.eurprd04.prod.outlook.com>

On Tue, Apr 03, 2018 at 11:12:52AM +0000, Razvan Stefanescu wrote:
> DPAA2 offers several object-based abstractions for modeling network
> related devices (interfaces, L2 Ethernet switch) or accelerators
> (DPSECI - crypto and DPDCEI - compression), the latter not up-streamed yet.
> They are modeled using various low-level resources (e.g. queues,
> classification tables, physical ports) and have multiple configuration and
> interconnectivity options, managed by the Management Complex. 
> Resources are limited and they are only used when needed by the objects,
> to accommodate more configurations and usage scenarios.
>  
> Some of the objects have a 1-to-1 correspondence to physical resources
> (e.g. DPMACs to physical ports), while others (like DPNIs and DPSW)
> can be seen as a collection of the mentioned resources. The types and 
> number of such objects are not predetermined.
> 
> When the board boots up, none of them exist yet. Restool allows a user to
> define the system topology, by providing a way to dynamically create, destroy
> and interconnect these objects.

Hi Razvan

The core concept with Linux networking and offload is that the
hardware is there to accelerate what Linux can already do. Since Linux
can already do it, I don't need any additional tools.

You have new hardware. It might offer features which we currently
don't have offload support for. But all that means is you need to
extend the core networking code which implements the software version
of that feature to offload to the hardware.

The board knows how many physical ports it has. switchdev can then
set up the plumbing to create the objects needed to represent the
ports. Restool is not needed for that.

> In the latter case, the two DPNIs will not be connected to any physical
> port, but can be used as a point-to-point connection between two virtual
> machines for instance.
 
Can Linux already do this? Isn't that what PCI Virtual Functions are
all about? You need to find the current Linux concept for this, and
extend it to offload the functionality to hardware. If Linux can do
it, it already has the tools to configure it. Restool is not needed
for that.

> So, it is not possible to connect a DPNI to a DPSW after it was
> connected to a DPMAC. The DPNI-DPMAC pair would have to be
> disconnected and DPMAC will be reconnected to the switch. DPNI
> interface that is no longer connected to a DPMAC will be destroyed
> and any new addition/deletion of a DPNI/DPMAC interface to the
> switch port will trigger the entire switch re-configuration.

Switches and ports connected to switches are dynamic. They come and
go. You don't expect it to happen very often, but Linux has no
restrictions on this. You need to figure out how best to offload this
to your hardware. Maybe when you create the switch object you make a
guess as to how many ports you need. Leave some of the ports not
connected to anything. You can then add ports to the switch using the
free ports. If you run out of ports, you have no choice but to destroy
the switch object and create a new one. Hopefully that does not take
too long. Restool is not needed for this, it all happens within the
switchdev driver.

	  Andrew
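
A minimal sketch of the "guess the port count and keep spare ports"
idea, written as a userspace model. The struct and function names are
hypothetical, and the real logic would live in the switchdev driver. A
return of -1 from sw_add_port() models the case where the switch object
has to be destroyed and recreated with more ports.

```c
#include <assert.h>
#include <stdbool.h>

#define SW_MAX_PORTS 16

struct sw {
	int nports;			/* ports reserved when the switch was created */
	bool used[SW_MAX_PORTS];	/* true if the port is attached to something */
};

/* Create the switch object with a guessed number of ports, all free. */
void sw_create(struct sw *s, int nports)
{
	assert(s && nports > 0 && nports <= SW_MAX_PORTS);
	s->nports = nports;
	for (int i = 0; i < SW_MAX_PORTS; i++)
		s->used[i] = false;
}

/* Attach to the first free port; -1 means the switch must be recreated. */
int sw_add_port(struct sw *s)
{
	for (int i = 0; i < s->nports; i++) {
		if (!s->used[i]) {
			s->used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Detach a port; the slot becomes available again without a rebuild. */
void sw_del_port(struct sw *s, int port)
{
	assert(port >= 0 && port < s->nports);
	s->used[port] = false;
}
```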


* [PATCH iproute2-next 1/1] tc: jsonify connmark action
From: Roman Mashak @ 2018-04-03 13:09 UTC (permalink / raw)
  To: dsahern; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 tc/m_connmark.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/tc/m_connmark.c b/tc/m_connmark.c
index bcce41391398..45e2d05f1a91 100644
--- a/tc/m_connmark.c
+++ b/tc/m_connmark.c
@@ -114,16 +114,20 @@ static int print_connmark(struct action_util *au, FILE *f, struct rtattr *arg)
 
 	parse_rtattr_nested(tb, TCA_CONNMARK_MAX, arg);
 	if (tb[TCA_CONNMARK_PARMS] == NULL) {
-		fprintf(f, "[NULL connmark parameters]");
+		print_string(PRINT_FP, NULL, "%s", "[NULL connmark parameters]");
 		return -1;
 	}
 
 	ci = RTA_DATA(tb[TCA_CONNMARK_PARMS]);
 
-	fprintf(f, " connmark zone %d", ci->zone);
-	print_action_control(f, " ", ci->action, "\n");
-	fprintf(f, "\t index %u ref %d bind %d", ci->index,
-		ci->refcnt, ci->bindcnt);
+	print_string(PRINT_ANY, "kind", "%s ", "connmark");
+	print_uint(PRINT_ANY, "zone", "zone %u", ci->zone);
+	print_action_control(f, " ", ci->action, "");
+
+	print_string(PRINT_FP, NULL, "%s", _SL_);
+	print_uint(PRINT_ANY, "index", "\t index %u", ci->index);
+	print_int(PRINT_ANY, "ref", " ref %d", ci->refcnt);
+	print_int(PRINT_ANY, "bind", " bind %d", ci->bindcnt);
 
 	if (show_stats) {
 		if (tb[TCA_CONNMARK_TM]) {
@@ -132,7 +136,7 @@ static int print_connmark(struct action_util *au, FILE *f, struct rtattr *arg)
 			print_tm(f, tm);
 		}
 	}
-	fprintf(f, "\n");
+	print_string(PRINT_FP, NULL, "%s", _SL_);
 
 	return 0;
 }
-- 
2.7.4


* Re: [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Michael S. Tsirkin @ 2018-04-03 13:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, alexander.h.duyck, linux-pci, virtio-dev, kvm, netdev,
	dan.daly, linux-kernel, linux-nvme, keith.busch, netanel, ddutile,
	mheyne, liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180315184132.3102.90947.stgit@localhost.localdomain>

On Thu, Mar 15, 2018 at 11:42:41AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> Hardware-realized virtio_pci devices can implement SR-IOV, so this
> patch enables its use. The device in question is an upcoming Intel
> NIC that implements both a virtio_net PF and virtio_net VFs. These
> are hardware realizations of what has up to now been a software
> interface.
> 
> The device in question has the following 4-part PCI IDs:
> 
> PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
> VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
> 
> The patch currently needs no check for device ID, because the callback
> will never be made for devices that do not assert the capability or
> when run on a platform incapable of SR-IOV.
> 
> One reason for this patch is because the hardware requires the
> vendor ID of a VF to be the same as the vendor ID of the PF that
> created it. So it seemed logical to simply have a fully-functioning
> virtio_net PF create the VFs. This patch makes that possible.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

I thought hard about this, and I think we need a feature
bit for this. This way host can detect support,
and we can also change our minds later if we need
to modify the interface and manage VFs after all.

It seems PCI specific so non pci transports would disable the feature
for now.

> ---
> 
> v4: Dropped call to pci_disable_sriov in virtio_pci_remove function
> v5: Replaced call to pci_sriov_configure_unmanaged with
>         pci_sriov_configure_simple
> v6: Dropped "#ifdef" checks for IOV wrapping sriov_configure definition
> v7: No code change, added Reviewed-by
> 
>  drivers/virtio/virtio_pci_common.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> index 48d4d1cf1cb6..67a227fd7aa0 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -596,6 +596,7 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
>  #ifdef CONFIG_PM_SLEEP
>  	.driver.pm	= &virtio_pci_pm_ops,
>  #endif
> +	.sriov_configure = pci_sriov_configure_simple,
>  };
>  
>  module_pci_driver(virtio_pci_driver);
> 
> 


* Re: [virtio-dev] [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Michael S. Tsirkin @ 2018-04-03 13:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Daly, Dan, Bjorn Helgaas, Duyck, Alexander H, linux-pci,
	virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch, netanel,
	Don Dutile, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <CAKgT0UfgZ2gAdiAbCe4MTOr3o9cqq1q-mykCQsw68p2F-QEC8g@mail.gmail.com>

On Fri, Mar 16, 2018 at 09:40:34AM -0700, Alexander Duyck wrote:
> On Fri, Mar 16, 2018 at 9:34 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, Mar 15, 2018 at 11:42:41AM -0700, Alexander Duyck wrote:
> >> From: Alexander Duyck <alexander.h.duyck@intel.com>
> >>
> >> Hardware-realized virtio_pci devices can implement SR-IOV, so this
> >> patch enables its use. The device in question is an upcoming Intel
> >> NIC that implements both a virtio_net PF and virtio_net VFs. These
> >> are hardware realizations of what has up to now been a software
> >> interface.
> >>
> >> The device in question has the following 4-part PCI IDs:
> >>
> >> PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
> >> VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
> >>
> >> The patch currently needs no check for device ID, because the callback
> >> will never be made for devices that do not assert the capability or
> >> when run on a platform incapable of SR-IOV.
> >>
> >> One reason for this patch is because the hardware requires the
> >> vendor ID of a VF to be the same as the vendor ID of the PF that
> >> created it. So it seemed logical to simply have a fully-functioning
> >> virtio_net PF create the VFs. This patch makes that possible.
> >>
> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> >
> > So if and when virtio PFs can manage the VFs, then we can
> > add a feature bit for that?
> > Seems reasonable.
> 
> Yes. If nothing else you may not even need a feature bit depending on
> how things go.

OTOH if the interface is changed in an incompatible way,
an old Linux will attempt to drive the new device
since there is no check.

I think we should add a feature bit right away.
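
A rough model of why a feature bit helps here: the SR-IOV path is only
used when both the device offers the bit and the driver acknowledges
it, so an old driver that does not know the bit never drives a device
with a changed interface in that mode. The function name and the bit
number used in the usage note are hypothetical and not taken from the
virtio specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A feature is negotiated only if the device offers it and the driver
 * acknowledges it. 'bit' is the feature bit number (0..63).
 */
bool feature_negotiated(uint64_t device_features, uint64_t driver_features,
			unsigned int bit)
{
	uint64_t mask;

	assert(bit < 64);
	mask = UINT64_C(1) << bit;
	return (device_features & driver_features & mask) != 0;
}
```

With a hypothetical SR-IOV bit 40, feature_negotiated(dev, drv, 40) is
false whenever either side leaves the bit clear, which is the check
that is missing when no feature bit exists.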


> One of the reasons why Mark called out the
> subvendor/subdevice was because that might be able to be used to
> identify the specific hardware that is providing the SR-IOV feature so
> in the future if it is added to virtio itself then you could exclude
> devices like this by just limiting things based on subvendor/subdevice
> IDs.
> 
> > Also, I am guessing that hardware implementations will want
> > to add things like strong memory barriers - I guess we
> > will add new feature bits for that too down the road?
> 
> That piece I don't have visibility into at this time. Perhaps Dan
> might have more visibility into future plans on what this might need.
> 
> Thanks.
> 
> - Alex


* Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
From: Andrew Lunn @ 2018-04-03 13:13 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Florian Fainelli, netdev, devicetree, David Miller, Mark Rutland,
	Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <20180403035527.lgcm6gql3qx4rpuv@localhost>

On Mon, Apr 02, 2018 at 08:55:27PM -0700, Richard Cochran wrote:
> On Sun, Mar 25, 2018 at 04:01:49PM -0700, Florian Fainelli wrote:
> > The best that I can think of, and it is still a hack in some way, is
> > to have your time stamping driver create a proxy mii_bus whose
> > purpose is just to hook to mdio/phy_device events (such as link changes)
> > in order to do what is necessary, or at least, this would indicate its
> > transparent nature towards the MDIO/MDC lines...
> 
> That won't work at all, AFAICT.  There is only one mii_bus per netdev,
> namely the one that is attached to the phydev.


Hi Richard

Have you tried implementing it using a phandle in the phy node,
pointing to the time stamping device?

I think it makes a much better architecture.

  Andrew


* Re: [PATCH] vhost-net: add limitation of sent packets for tx polling
From: Michael S. Tsirkin @ 2018-04-03 13:26 UTC (permalink / raw)
  To: haibinzhang(张海斌)
  Cc: Jason Wang, kvm@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	lidongchen(陈立东),
	yunfangtai(台运方)
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC465B@EXMBX-SZMAIL011.tencent.com>

On Tue, Apr 03, 2018 at 12:29:47PM +0000, haibinzhang(张海斌) wrote:
> 
> >On Tue, Apr 03, 2018 at 08:08:26AM +0000, haibinzhang wrote:
> >> handle_tx will delay rx for a long time when tx busy polling udp packets
> >> with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT
> >> takes into account only sent-bytes but no single packet length.
> >> 
> >> Tests were done between two Virtual Machines using netperf(UDP_STREAM, len=1),
> >> then another machine pinged the client. The results are as follows:
> >> 
> >> Packet#       Ping-Latency(ms)
> >>               min     avg     max
> >> Origin      3.319  18.489  57.503
> >> 64          1.643   2.021   2.552
> >> 128         1.825   2.600   3.224
> >> 256         1.997   2.710   4.295
> >> 512*        1.860   3.171   4.631
> >> 1024        2.002   4.173   9.056
> >> 2048        2.257   5.650   9.688
> >> 4096        2.093   8.508  15.943
> >> 
> >> 512 is selected, which is a multiple of VRING_SIZE
> >
> >There's no guarantee vring size is 256.
> >
> >Could you pls try with a different tx ring size?
> >
> >I suspect we want:
> >
> >#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> >
> >
> >> and close to VHOST_NET_WEIGHT/MTU.
> >
> >Puzzled by this part.  Does tweaking MTU change anything?
> 
> The MTU of ethernet is 1500, so VHOST_NET_WEIGHT/MTU equals 0x80000/1500=350.

We should include the 12 byte header so it's a bit lower.

> Then sent-bytes cannot reach VHOST_NET_WEIGHT in one handle_tx even with 1500-bytes 
> frame if packet# is less than 350. So packet# must be bigger than 350.
> 512 meets this condition

What you seem to say is this:

	imagine MTU sized buffers. With these we stop after 350
	packets. Thus adding another limit > 350 will not
	slow us down.

	Fair enough but won't apply with smaller packet
	sizes, will it?

	I still think a simpler argument carries more weight:

	ring size is a hint from device about a burst size
	it can tolerate. Based on benchmarks, we tweak
	the limit to 2 * vq size as that seems to
	perform a bit better, and is still safer
	than no limit on # of packets as is done now.

	but this needs testing with another ring size.
	Could you try that please?
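
To illustrate the numbers in this exchange, here is a userspace model
of the handle_tx() loop exit with both limits in place: the byte limit
(0x80000, as VHOST_NET_WEIGHT) and a packet limit of 2 * ring size, as
suggested above. It is a simplified sketch and not the vhost code. The
function name is hypothetical, every packet is assumed to have the same
length, and the 12 byte header is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BYTE_WEIGHT 0x80000u	/* same value as VHOST_NET_WEIGHT */

/*
 * Number of packets sent in one pass before the job is requeued,
 * with a byte limit and a packet limit of 2 * ring_size.
 */
size_t packets_before_requeue(size_t pkt_len, unsigned int ring_size)
{
	size_t total_len = 0, sent = 0;
	size_t pkt_limit = 2 * (size_t)ring_size;

	assert(ring_size > 0);
	for (;;) {
		total_len += pkt_len;
		sent++;
		if (total_len >= MODEL_BYTE_WEIGHT || sent >= pkt_limit)
			break;
	}
	return sent;
}
```

In this model, 1500 byte packets with a 256 entry ring stop at the byte
limit after 350 packets, so the packet limit of 512 is never reached.
1 byte packets stop at 512 packets instead of running up to the byte
limit, and with a 1024 entry ring the packet limit becomes 2048.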

	
> and is also DEFAULT VRING_SIZE aligned.

Neither Linux nor virtio has a default vring size. It's a historical
construct that exists in qemu for qemu compatibility
reasons.

> >
> >> To evaluate this change, further tests were done using netperf(RR, TX) between
> >> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. The results below
> >> do not show obvious changes:
> >> 
> >> TCP_RR
> >> 
> >> size/sessions/+thu%/+normalize%
> >>    1/       1/  -7%/        -2%
> >>    1/       4/  +1%/         0%
> >>    1/       8/  +1%/        -2%
> >>   64/       1/  -6%/         0%
> >>   64/       4/   0%/        +2%
> >>   64/       8/   0%/         0%
> >>  256/       1/  -3%/        -4%
> >>  256/       4/  +3%/        +4%
> >>  256/       8/  +2%/         0%
> >> 
> >> UDP_RR
> >> 
> >> size/sessions/+thu%/+normalize%
> >>    1/       1/  -5%/        +1%
> >>    1/       4/  +4%/        +1%
> >>    1/       8/  -1%/        -1%
> >>   64/       1/  -2%/        -3%
> >>   64/       4/  -5%/        -1%
> >>   64/       8/   0%/        -1%
> >>  256/       1/  +7%/        +1%
> >>  256/       4/  +1%/        +1%
> >>  256/       8/  +2%/        +2%
> >> 
> >> TCP_STREAM
> >> 
> >> size/sessions/+thu%/+normalize%
> >>   64/       1/   0%/        -3%
> >>   64/       4/  +3%/        -1%
> >>   64/       8/  +9%/        -4%
> >>  256/       1/  +1%/        -4%
> >>  256/       4/  -1%/        -1%
> >>  256/       8/  +7%/        +5%
> >>  512/       1/  +1%/         0%
> >>  512/       4/  +1%/        -1%
> >>  512/       8/  +7%/        -5%
> >> 1024/       1/   0%/        -1%
> >> 1024/       4/  +3%/         0%
> >> 1024/       8/  +8%/        +5%
> >> 2048/       1/  +2%/        +2%
> >> 2048/       4/  +1%/         0%
> >> 2048/       8/  -2%/         0%
> >> 4096/       1/  -2%/         0%
> >> 4096/       4/  +2%/         0%
> >> 4096/       8/  +9%/        -2%
> >> 
> >> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> >> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> >> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >> ---
> >>  drivers/vhost/net.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 8139bc70ad7d..13a23f3f3ea4 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> >>   * Using this limit prevents one virtqueue from starving others. */
> >>  #define VHOST_NET_WEIGHT 0x80000
> >>  
> >> +/* Max number of packets transferred before requeueing the job.
> >> + * Using this limit prevents one virtqueue from starving rx. */
> >> +#define VHOST_NET_PKT_WEIGHT 512
> >> +
> >>  /* MAX number of TX used buffers for outstanding zerocopy */
> >>  #define VHOST_MAX_PEND 128
> >>  #define VHOST_GOODCOPY_LEN 256
> >> @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
> >>  	struct socket *sock;
> >>  	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> >>  	bool zcopy, zcopy_used;
> >> +	int sent_pkts = 0;
> >>  
> >>  	mutex_lock(&vq->mutex);
> >>  	sock = vq->private_data;
> >> @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
> >>  		else
> >>  			vhost_zerocopy_signal_used(net, vq);
> >>  		vhost_net_tx_packet(net);
> >> -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >> +		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> >> +		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
> >>  			vhost_poll_queue(&vq->poll);
> >>  			break;
> >>  		}
> >> -- 
> >> 2.12.3
> >> 
> 


* RE: [PATCH iproute2 rdma: Ignore unknown netlink attributes
From: Steve Wise @ 2018-04-03 13:32 UTC (permalink / raw)
  To: 'Leon Romanovsky', 'Stephen Hemminger'
  Cc: 'Leon Romanovsky', 'netdev',
	'RDMA mailing list', 'David Ahern'
In-Reply-To: <20180403072842.32153-1-leon@kernel.org>



> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, April 3, 2018 2:29 AM
> To: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Leon Romanovsky <leonro@mellanox.com>; netdev
> <netdev@vger.kernel.org>; RDMA mailing list <linux-
> rdma@vger.kernel.org>; David Ahern <dsahern@gmail.com>; Steve Wise
> <swise@opengridcomputing.com>
> Subject: [PATCH iproute2 rdma: Ignore unknown netlink attributes
> 
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> The check whether the supplied netlink attributes exceed the maximum
> supported is too strict and may lead to backward compatibility issues
> when an old application runs on a newer kernel that supports new attributes.
> 
> CC: Steve Wise <swise@opengridcomputing.com>
> Fixes: 74bd75c2b68d ("rdma: Add basic infrastructure for RDMA tool")
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  rdma/utils.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/rdma/utils.c b/rdma/utils.c
> index a2e08e91..5c1e736a 100644
> --- a/rdma/utils.c
> +++ b/rdma/utils.c
> @@ -399,7 +399,8 @@ int rd_attr_cb(const struct nlattr *attr, void *data)
>  	int type;
> 
>  	if (mnl_attr_type_valid(attr, RDMA_NLDEV_ATTR_MAX) < 0)
> -		return MNL_CB_ERROR;
> +		/* We received unknown attribute */
> +		return MNL_CB_OK;
> 
>  	type = mnl_attr_get_type(attr);
> 

Hey Leon,

So the resource parsing functions correctly ignore the unknown attrs and
print everything else?

Looks good.

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
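
For readers who want to see the behaviour in isolation, below is a
small userspace model of an attribute callback that stores the
attributes it knows and skips the ones it does not, instead of
aborting the parse. It does not use libmnl. The MODEL_* constants and
the function name are hypothetical and only mirror the shape of
rd_attr_cb().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ATTR_MAX 3	/* hypothetical highest attribute type the tool knows */

enum { MODEL_CB_ERROR = -1, MODEL_CB_OK = 1 };

/*
 * Store known attributes in tb[] (which must have MODEL_ATTR_MAX + 1
 * entries) and silently skip anything newer than the tool.
 */
int model_attr_cb(int type, const void *payload, const void *tb[])
{
	assert(type >= 0 && tb);
	if (type > MODEL_ATTR_MAX)
		return MODEL_CB_OK;	/* unknown attribute from a newer kernel */

	tb[type] = payload;
	return MODEL_CB_OK;
}
```

Returning MODEL_CB_ERROR for the unknown type, as the code did before
the patch, would stop the whole message from being parsed.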


* Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
From: Edward Cree @ 2018-04-03 13:39 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Daniel Borkmann, netdev
In-Reply-To: <20180403010802.jkqffxw4m75oioj7@ast-mbp>

On 03/04/18 02:08, Alexei Starovoitov wrote:
> I like patch 3 and am going to play with it.
> How did you test it ?
Just test_verifier and test_progs (the latter has a failure
 "test_bpf_obj_id:FAIL:get-prog-info(fd) err 0 errno 2 i 0 type 1(1) info_len 80(40) jit_enabled 0 jited_prog_len 0 xlated_prog_len 72"
 but that was already there before my patch).

> Do you have processed_insn numbers for
> cilium or selftests programs before and after?
> There should be no difference, right?
That's a good check, I'll do that.

> As far as patch 1 goes, it was very difficult to review, since several logically
> different things are clumped together. So breaking it apart:
> - converting two arrays of subprog_starts and subprog_stack_depth into
>   single array of struct bpf_subprog_info is a good thing
> - tsort is interesting, but not sure it's correct. more below
> - but main change of combining subprog logic with do_check is no good.
<snip>
> There will be no 'do_check() across whole program' walk.
> Combining subprog pass with do_check is going into opposite direction
> of this long term work. Divide and conquer. Combining more things into
> do_check is the opposite of this programming principle.
The main object of my change here was to change the data structures, to
 store a subprogno in each insn aux rather than using bsearch() on the
 subprog_starts.  I have now figured out the algorithm to do this in its
 own pass (previously I thought this needed a recursive walk which is why
 I wanted to roll it into do_check() - doing more than one whole-program
 recursive walk seems like a bad idea.)

> My short term plan is to split basic instruction correctness checks
> out of do_check() loop into separate pass and run it early on.
I agree with that short term plan, sounds like a good idea.
I'm still not sure I understand the long-term plan, though; since most
 insns' input registers will still need to be checked (I'm assuming
 majority of most real ebpf programs consists of computing and
 dereferencing pointers), the data flow analysis will have to end up
 doing all the same register updates current do_check() does (though
 potentially in a different order), e.g. if a function is called three
 times it will have to analyse it with three sets of input registers.
Unless you have some way of specifying function preconditions, I don't
 see how it works.  In particular something like
    char *f(char *p)
    {
        *p++ = 0;
        return p;
    }
    int main(void)
    {
        char *p = "abc"; /* represents getting a ptr from ctx or map */
        p = f(p);
        p = f(p);
        p = f(p);
        return 0;
    }
 seems as though it would be difficult to analyse in any way more
 scalable than the current full recursive walk.  Unless you somehow
 propagate register state _backwards_ as constraints when analysing a
 function...?  In any case it seems like there are likely to be things
 which current verifier accepts which require 'whole-program' analysis
 to determine the dataflow (e.g. what if there were some conditional
 branches in f(), and the preconditions on p depended on the value of
 some other arg in r2?)

> As far as tsort approach for determining max stack depth...
> It's an interesting idea, but implementation is suffering from the same
> 'combine everything into one loop' coding issue.
> I think updating total_stack_depth math should be separate from sorting,
> since the same function can be part of different stacks with different max.
> I don't see how updating global subprog_info[i].total_stack_depth can
> be correct. It has to be different for every stack and the longest
> stack is not necessarily the deepest. Maybe I'm missing something,
> but combined sort and stack_depth math didn't make it easy to review.
> I find existing check_max_stack_depth() algo much easier to understand.
The sort _is_ the way to compute total stack depth.  The sort order isn't
 being stored anywhere; it's being done just so that each subprog gets
 looked at after all its callers have been considered.  So when it gets
 selected as a 'frontier node', its maximum stack depth is known, and can
 thus be used to update its callees (note that we do a max_t() with each
 callee's existing total_stack_depth, thus getting the deepest stack of
 all call chains to the function).
It may help to imagine drawing the call graph and labelling each node with
 a stack depth as it is visited; sadly that's difficult to show in an email
 (or a code comment).  But I can try to explain it a bit better than
 "/* Update callee total stack depth */".
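
As a rough illustration of the idea described above, here is a
standalone userspace sketch, not the verifier code from the patch. The
struct layout and names are hypothetical, and total_stack_depth here
includes the subprog's own frame. Subprogs are visited in topological
order of the call graph, so each one is handled only after all its
callers, and each callee's total is updated with a max over all call
chains.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUBPROGS 8

struct subprog {
	int stack_depth;		/* this subprog's own frame size */
	int callees[MAX_SUBPROGS];	/* indices of called subprogs */
	int ncallees;
	int ncallers;			/* computed: callers not yet visited */
	int total_stack_depth;		/* computed: deepest stack including own frame */
};

/* Returns the deepest total stack, or -1 if the call graph has a cycle. */
int max_stack_depth(struct subprog *sp, int n)
{
	int frontier[MAX_SUBPROGS], head = 0, tail = 0;
	int visited = 0, deepest = 0;

	assert(sp && n > 0 && n <= MAX_SUBPROGS);
	for (int i = 0; i < n; i++) {
		sp[i].total_stack_depth = sp[i].stack_depth;
		sp[i].ncallers = 0;
	}
	for (int i = 0; i < n; i++)
		for (int j = 0; j < sp[i].ncallees; j++)
			sp[sp[i].callees[j]].ncallers++;

	/* Subprogs without callers form the initial frontier. */
	for (int i = 0; i < n; i++)
		if (sp[i].ncallers == 0)
			frontier[tail++] = i;

	while (head < tail) {
		struct subprog *cur = &sp[frontier[head++]];

		/* All callers were visited, so cur's total is final. */
		visited++;
		if (cur->total_stack_depth > deepest)
			deepest = cur->total_stack_depth;

		for (int j = 0; j < cur->ncallees; j++) {
			struct subprog *callee = &sp[cur->callees[j]];
			int cand = cur->total_stack_depth + callee->stack_depth;

			/* Keep the deepest of all call chains to the callee. */
			if (cand > callee->total_stack_depth)
				callee->total_stack_depth = cand;
			if (--callee->ncallers == 0)
				frontier[tail++] = (int)(callee - sp);
		}
	}
	return visited == n ? deepest : -1;
}
```

For a graph where main (32 bytes) calls A (64) and B (16), and both
call C (8), C ends up with max(96 + 8, 48 + 8) = 104, which is also the
overall maximum.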

I will also try to split up patch #1 into more pieces.  I mistakenly thought
 that existing check_max_stack_depth() depended on some invariants that I was
 removing, but I guess that was only true while I had non-contiguous subprogs.

-Ed


* Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action
From: David Lebrun @ 2018-04-03 13:40 UTC (permalink / raw)
  To: Mathieu Xhonneux, Alexei Starovoitov
  Cc: netdev, David Lebrun, Daniel Borkmann
In-Reply-To: <CAKSCvkRKS42K8rCfCBYgtfFf7MdCi3iM8O3-YOSa=ezkOZv=cw@mail.gmail.com>

On 04/03/2018 12:16 PM, Mathieu Xhonneux wrote:
> 
>> In patch 2 I was a bit concerned that:
>> +       struct seg6_bpf_srh_state *srh_state = (struct seg6_bpf_srh_state *)
>> +                                              &skb->cb;
>> would not collide with other users of skb->cb, but it seems the way
>> the hook is placed such usage should always be valid.
>> Would be good to add a comment describing the situation.
> Yes, it's indeed a little hack, but this should be OK since the IPv6 layer does
> not use the cb field. Another solution would be to create a new field in
> __sk_buff but it's more cumbersome.
> I will add a comment.

Good point. The IPv6 layer *does* use the cb field through the IP6CB() 
macro. It is first filled in ipv6_rcv() for ingress packets and used, 
among others, in the input path by extension headers processing 
functions to store EH offsets.

Given that input_action_end_bpf is called in the forwarding path and
terminates with a call to dst_input(), IP6CB() will be then reset by 
ipv6_rcv(), and the use of skb->cb here indeed should not collide with 
other users.

> 
>> Looks like somewhat odd 'End.BPF' name comes from similar names in SRv6 draft.
>> Do you plan to disclose such End.BPF action in the draft as well?
> This is something I've discussed with David Lebrun (the author of the Segment
> Routing implementation). There's no plan to disclose an End.BPF action as-is
> in the draft, since eBPF is really specific to Linux, and David doesn't mind not
> having a 1:1 mapping between the actions of the draft and the implemented
> ones. Writing "End.BPF" instead of just "bpf" is important to indicate that the
> action will advance to the next segment by itself, like all other End actions.
> One could imagine adding later a T.BPF action (a transit action), whose SID
> wouldn't have to be a segment, but that could still e.g. add/edit/delete TLVs.
> 

To clarify, I don't see why we shouldn't support "experimental" features 
that are not defined in draft-6man-segment-routing-header. However, we 
could create a separate draft describing the End.BPF feature, but that's 
perhaps best left for after the ongoing draft's last call.

David

^ permalink raw reply

* Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth
From: Bjorn Helgaas @ 2018-04-03 14:05 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Tal Gilboa, Tariq Toukan, Jacob Keller, Ariel Elior,
	Ganesh Goudar, Jeff Kirsher, everest-linux-l2, intel-wired-lan,
	netdev, linux-kernel, linux-pci
In-Reply-To: <CA+P7+xr1N+X5DyPwNpWUtfqr9U4pLL9bMoB1wkBdf2K9n6cxKw@mail.gmail.com>

On Mon, Apr 02, 2018 at 05:30:54PM -0700, Jacob Keller wrote:
> On Mon, Apr 2, 2018 at 7:05 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > +/* PCIe speed to Mb/s reduced by encoding overhead */
> > +#define PCIE_SPEED2MBS_ENC(speed) \
> > +       ((speed) == PCIE_SPEED_16_0GT ? (16000*(128/130)) : \
> > +        (speed) == PCIE_SPEED_8_0GT  ?  (8000*(128/130)) : \
> > +        (speed) == PCIE_SPEED_5_0GT  ?  (5000*(8/10)) : \
> > +        (speed) == PCIE_SPEED_2_5GT  ?  (2500*(8/10)) : \
> > +        0)
> > +
> 
> Should this be "(speed * x ) / y" instead? wouldn't they calculate
> 128/130 and truncate that to zero before multiplying by the speed? Or
> are compilers smart enough to do this the other way to avoid the
> losses?

Yep, thanks for saving me yet more embarrassment.
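
For the record, the compiler cannot help here: (128/130) is an integer constant expression and C defines its value as 0, so the multiplication sees 0 whatever the speed is. A small sketch of the two evaluation orders follows; the helper names are made up for illustration and are not the macro from the patch.

```c
#include <assert.h>

/*
 * Hypothetical helpers that only contrast the two evaluation orders
 * for integer operands.
 */

/* Divides first: 128 / 130 is integer division and yields 0. */
static unsigned int mbs_divide_first(unsigned int mts)
{
	return mts * (128 / 130);
}

/* Multiplies first, then divides, so only the final result is truncated. */
static unsigned int mbs_multiply_first(unsigned int mts,
				       unsigned int num, unsigned int den)
{
	assert(den != 0);
	return mts * num / den;
}
```

The multiply-first form only truncates the final quotient, so 16000 MT/s with 128b/130b encoding comes out as 15753 rather than 0. The intermediate product stays far below the 32-bit limit for these inputs.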

^ permalink raw reply

* Re: [PATCH] net: improve ipv4 performances
From: Douglas Caetano dos Santos @ 2018-04-03 14:18 UTC (permalink / raw)
  To: Anton Gary Ceph, netdev, linux-kernel
In-Reply-To: <20180401183121.13022-1-agaceph@gmail.com>

Hi Anton, everyone,

On 04/01/18 15:31, Anton Gary Ceph wrote:
> As the Linux networking stack is growing, more and more protocols are
> added, increasing the complexity of stack itself.
> Modern processors, contrary to common belief, are very bad in branch
> prediction, so it's our task to give hints to the compiler when possible.
> 
> After a few profiling and analysis, turned out that the ethertype field
> of the packets has the following distribution:
> 
>     92.1% ETH_P_IP
>      3.2% ETH_P_ARP
>      2.7% ETH_P_8021Q
>      1.4% ETH_P_PPP_SES
>      0.6% don't know/no opinion
> 
> From a projection on statistics collected by Google about IPv6 adoption[1],
> IPv6 should peak at 25% usage at the beginning of 2030. Hence, we should
> give proper hints to the compiler about the low IPv6 usage.

My two cents on the matter:

You should not favor some parts of the code to the detriment of others just because of one use case. Your patch considers a server that handles IPv4 and IPv6 connections simultaneously, in the proportion seen on the Internet, but it completely disregards servers that could serve, for example, only IPv6. What about those, just let them slow down?

What I think about such hints and optimizations - someone correct me if I'm wrong - is that they should be done not with specific use cases in mind, but according to the code flow in general. For example, it could be a good idea to slow down ARP requests, because AFAIK there is no server that handles only ARP (not that I'm advocating for it, just using it as an example). But slowing down IPv6, as Eric already said, is utter nonsense.

Again, "low IPv6 usage" doesn't mean the code is barely touched, with an IPv6-only server being the obvious example.

-- 
Douglas

^ permalink raw reply

* Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action
From: David Lebrun @ 2018-04-03 14:25 UTC (permalink / raw)
  To: Mathieu Xhonneux, Alexei Starovoitov
  Cc: netdev, David Lebrun, Daniel Borkmann
In-Reply-To: <e8cef615-04e7-aa38-ee29-9e8d81f67f20@gmail.com>

On 04/03/2018 02:40 PM, David Lebrun wrote:
> On 04/03/2018 12:16 PM, Mathieu Xhonneux wrote:
>>
>>> In patch 2 I was a bit concerned that:
>>> +       struct seg6_bpf_srh_state *srh_state = (struct 
>>> seg6_bpf_srh_state *)
>>> +                                              &skb->cb;
>>> would not collide with other users of skb->cb, but it seems the way
>>> the hook is placed such usage should always be valid.
>>> Would be good to add a comment describing the situation.
>> Yes, it's indeed a little hack, but this should be OK since the IPv6 
>> layer does
>> not use the cb field. Another solution would be to create a new field in
>> __sk_buff but it's more cumbersome.
>> I will add a comment.
> 
> Good point. The IPv6 layer *does* use the cb field through the IP6CB() 
> macro. It is first filled in ipv6_rcv() for ingress packets and used, 
> among others, in the input path by extension headers processing 
> functions to store EH offsets.
> 
> Given that input_action_end_bpf is called in the forwarding path
> and terminates with a call to dst_input(), IP6CB() will be then reset by 
> ipv6_rcv(), and the use of skb->cb here indeed should not collide with 
> other users.

Actually I'm wrong here. dst_input() will call either ip6_input() or 
ip6_forward(), not ipv6_rcv(). Both functions expect IP6CB() to be set,
so using skb->cb here will interfere with them.

What about saving and restoring the IPv6 CB, similarly to what TCP does 
with tcp_v6_restore_cb() ?
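
Roughly what I have in mind, as a userspace mock rather than kernel code: the types, the example hook and run_hook_preserving_cb() below are all hypothetical, and the sketch simply copies the whole control block aside and back, which is not necessarily how tcp_v6_restore_cb() is implemented.

```c
#include <assert.h>
#include <string.h>

#define CB_SIZE 48	/* same size as skb->cb in the kernel */

/* Mock types for illustration only; not the kernel's definitions. */
struct mock_skb { char cb[CB_SIZE]; };
struct mock_ip6cb { int iif; int nhoff; };
struct mock_srh_state { int valid; int hdrlen; };

/* Example hook that uses the control block as private scratch space. */
static int example_hook(struct mock_skb *skb)
{
	struct mock_srh_state state = { 1, 24 };

	memcpy(skb->cb, &state, sizeof(state));	/* clobbers the IPv6 CB */
	return state.valid;
}

/*
 * Runs a hook that may overwrite skb->cb, then puts the original
 * control block back so later input/forward code sees it intact.
 */
static int run_hook_preserving_cb(struct mock_skb *skb,
				  int (*hook)(struct mock_skb *skb))
{
	char saved[CB_SIZE];
	int ret;

	assert(skb && hook);
	memcpy(saved, skb->cb, sizeof(saved));
	ret = hook(skb);
	memcpy(skb->cb, saved, sizeof(saved));
	return ret;
}
```

The cost is two copies of the control block per packet on this path; whether that is acceptable compared to adding a new field elsewhere is open for discussion.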

David

^ permalink raw reply

* my donation to you
From: Mrs Nelma @ 2018-04-03 14:29 UTC (permalink / raw)
  To: Recipients

Hello dear, I have a donation of 4,600,000.00 euros that I would like to give you to help the poor and orphans in your community ... Please reply for further details on how to receive my donation

Regards

Nelma Ruaan

^ permalink raw reply

* Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration
From: David Ahern @ 2018-04-03 14:33 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Ido Schimmel, netdev, davem, jiri, petrm, mlxsw
In-Reply-To: <20180403073212.GI3313@nanopsycho>

On 4/3/18 1:32 AM, Jiri Pirko wrote:
> Fri, Mar 30, 2018 at 04:45:50PM CEST, dsahern@gmail.com wrote:
>> On 3/29/18 2:33 PM, Ido Schimmel wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
>>> This resolves race during initialization where the resources with
>>> ops are registered before driver and the structures used by occ_get
>>> op is initialized. So keep occ_get callbacks registered only when
>>> all structs are initialized.
>>
>> Why can't the occ_get handler look at some flag in an mlxsw struct to
>> know if the system has initialized?
>>
>> Separate registration here is awkward. You register a resource and then
>> register its op later.
> 
> The separation is exactly why this patch is made. Note that devlink
> resouce is registered by core way before the initialization is done and
> the driver is actually able to perform the op. Also consider "reload"

That's how you have chosen to code it. I hit this problem adding devlink
to netdevsim; the solution was to fix the init order.

> case, when the resource is still registered and the driver unloads and
> loads again. For that makes perfect sense to have that separated.
> Flag would just make things odd. Also, the priv could not be used in
> that case.
> 

I am not aware of any other API where you invoke the register function
at point A and then later add the operations at point B. In every API
that comes to mind the ops are part of the register.

I am sure there are options for you to fix the init order of mlxsw
without making the devlink API awkward.

^ permalink raw reply

* Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action
From: Eric Dumazet @ 2018-04-03 14:51 UTC (permalink / raw)
  To: David Lebrun, Mathieu Xhonneux, Alexei Starovoitov
  Cc: netdev, David Lebrun, Daniel Borkmann
In-Reply-To: <c4e374e8-d385-1f86-cafe-85d983f6c45e@gmail.com>



On 04/03/2018 07:25 AM, David Lebrun wrote:
> 
> What about saving and restoring the IPv6 CB, similarly to what TCP does with tcp_v6_restore_cb() ?

Note that TCP only moves IPCB around in skb->cb[] for cache locality gains.

Now that we have switched to an rb-tree for the out-of-order queue, these gains might be marginal.

^ permalink raw reply

* Re: [net-next V9 PATCH 00/16] XDP redirect memory return API
From: David Miller @ 2018-04-03 14:54 UTC (permalink / raw)
  To: brouer
  Cc: netdev, bjorn.topel, magnus.karlsson, eugenia, jasowang,
	john.fastabend, eranbe, saeedm, galp, borkmann,
	alexei.starovoitov, tariqt
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 03 Apr 2018 13:07:36 +0200

> This is V9, but it's worth mentioning that V8 was send against
> net-next, because i40e got XDP_REDIRECT support in-between V6, and it
> doesn't exist in bpf-next yet.  Most significant change in V8 was that
> page_pool only gets compiled into the kernel when a drivers Kconfig
> 'select' the feature.

Jesper, this series now looks good to me, however the net-next tree is
closed at this point.

Don't worry, just resubmit when net-next opens back up.

Thanks!

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Ahern @ 2018-04-03 14:57 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Si-Wei Liu, Michael S. Tsirkin, Jiri Pirko, Stephen Hemminger,
	Alexander Duyck, David Miller, Brandeburg, Jesse, Jakub Kicinski,
	Jason Wang, Samudrala, Sridhar, Netdev, virtualization,
	virtio-dev
In-Reply-To: <CADGSJ23tSSo8WmGCmhWPFapXpW=c_A0_qpYH4r6fb9NEnzsnsQ@mail.gmail.com>

On 4/3/18 1:40 AM, Siwei Liu wrote:
>> There are other use cases that want to hide a device from userspace.
> 
> Can you elaborate your case in more details? Looking at the links
> below I realize that the purpose of hiding devices in your case is
> quite different from the our migration case. Particularly, I don't

Some kernel drivers create "control" netdevs. They are not intended for
users to manipulate and doing so may actually break networking.

> like the part of elaborately allowing user to manipulate the link's
> visibility - things fall apart easily while live migration is on
> going. And, why doing additional check for invisible links in every
> for_each_netdev() and its friends. This is effectively changing
> semantics of internal APIs that exist for decades.

Read the patch again: there are 40 references to for_each_netdev and
that patch touches 2 of them -- link dumps via rtnetlink and link dumps
via ioctl.

>> one that includes an API for users to list all devices -- even ones
> 
> What kind of API you would like to query for hidden devices?
> rtnetlink? a private socket API? or something else?

There are existing, established APIs for dumping links. No new API is
needed. As suggested in the 2 patches I referenced the hidden /
invisibility cloak is an attribute of the device. When a link dump is
requested if the attribute is set, the device is skipped and not
included in the dump. However, if the user knows the device name the
GETLINK / SETLINK / DELLINK apis all work as normal. This allows the
device to be hidden from apps like snmpd, lldpd, etc, yet still usable.
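
As a rough standalone illustration of those semantics (not the code from the referenced commits; struct mock_netdev, dump_links() and get_link() are hypothetical names), the attribute only affects enumeration and never lookup by name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Mock device record for illustration; "hidden" stands in for the
 * per-device attribute described above.
 */
struct mock_netdev {
	const char *name;
	int hidden;
};

/*
 * Link dump: copies pointers to the visible devices into out[] (which
 * must have room for n entries) and returns how many were written.
 */
static size_t dump_links(const struct mock_netdev *devs, size_t n,
			 const struct mock_netdev **out)
{
	size_t i, count = 0;

	assert(devs && out);
	for (i = 0; i < n; i++) {
		if (devs[i].hidden)
			continue;	/* skipped in dumps only */
		out[count++] = &devs[i];
	}
	return count;
}

/*
 * Lookup by name: ignores the hidden attribute, so a caller that
 * already knows the name can still get, set or delete the link.
 */
static const struct mock_netdev *get_link(const struct mock_netdev *devs,
					  size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(devs[i].name, name) == 0)
			return &devs[i];
	return NULL;
}
```

An app that walks the dump never sees the hidden device, while a tool that is told the name explicitly gets it back from the lookup as usual.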

> 
> For our case, the sysfs interface is what we need and is sufficient,
> since udev is the main target we'd like to support to make the naming
> of virtio_bypass consistent and compatible.

You are not hiding a device if it is visible in 1 API (/sysfs) and not
visible by another API (rtnetlink). That only creates confusion.

> 
>> hidden by default.
>>
>> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>> Also, why are you suggesting that the device should still be visible via
>> /sysfs? That leads to inconsistent views of networking state - /sys
>> shows a device but a link dump does not.
> 
> See my clarifications above. I don't mind kernel-only netdevs being
> visible via sysfs, as that way we get a good trade-off between
> backwards compatibility and visibility. There's still kobject created
> there right. Bottom line is that all kernel devices and its life-cycle
> uevents are made invisible to userpace network utilities, and I think
> it simply gets to the goal of not breaking existing apps while being
> able to add new features.

^ permalink raw reply

* Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
From: Richard Cochran @ 2018-04-03 15:02 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, netdev, devicetree, David Miller, Mark Rutland,
	Miroslav Lichvar, Rob Herring, Willem de Bruijn
In-Reply-To: <20180403131319.GD31740@lunn.ch>

On Tue, Apr 03, 2018 at 03:13:19PM +0200, Andrew Lunn wrote:
> Have you tried implementing it using a phandle in the phy node,
> pointing to the time stamping device?

Not yet, but I'll take this up for V2, after the merge window...

Thinking about MII, it really is a 1:1 connection between the MAC and
the PHY.  It has no representation in the current code, at least not
yet.  It is too bad about the naming of mii_bus, oh well.  While
hanging this thing off of the PHY isn't really great modeling (it
isn't a sub-device of the PHY in any sense), still this will work well
enough to enable the new functionality.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
From: David Ahern @ 2018-04-03 15:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Md. Islam, netdev, David Miller, stephen, agaceph,
	Pavel Emelyanov, Eric Dumazet, brouer
In-Reply-To: <20180402181602.jpdb25ytmffg2gei@ast-mbp.dhcp.thefacebook.com>

On 4/2/18 12:16 PM, Alexei Starovoitov wrote:
> On Mon, Apr 02, 2018 at 12:09:44PM -0600, David Ahern wrote:
>> On 4/2/18 12:03 PM, John Fastabend wrote:
>>>
>>> Can the above be a normal BPF helper that returns an
>>> ifindex? Then something roughly like this patter would
>>> work for all drivers with redirect support,
>>>
>>>
>>>      route_ifindex = ip_route_lookup(__daddr, ....)
>>>      if (!route_ifindex)
>>>            return do_foo()
>>>      return xdp_redirect(route_ifindex);
>>>      
>>> So my suggestion is,
>>>
>>>   1. enable veth xdp (including redirect support)
>>>   2. add a helper to lookup route from routing table
>>>
>>> Alternatively you can skip step (2) and encode the routing
>>> table in BPF directly. Maybe we need a more efficient data
>>> structure but that should also work.
>>>
>>
>> That's what I have here:
>>
>> https://github.com/dsahern/linux/commit/bab42f158c0925339f7519df7fb2cde8eac33aa8
> 
> was wondering what's up with the delay and when are you going to
> submit them officially...
> The use case came up several times.
> 

I need to find time to come back to that set. As I recall there are a
number of outstanding issues:

1. you and Daniel had comments about the bpf_func_proto declarations

2. Jesper had concerns about xdp redirect to any netdev. e.g., How does
the lookup know the egress netdev supports xdp? Right now you can try
and the packet is dropped if it is not supported.

3. VLAN devices. I suspect these will affect the final bpf function
prototype. It would be awkward to have 1 forwarding API for non-vlan
devices and a second for vlan devices, hence the need to resolve this
before it goes in.

4. What about other stacked devices - bonds and bridges - will those
just work with the bpf helper? VRF is already handled of course. ;-)

^ permalink raw reply

* Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
From: Ulf Hansson @ 2018-04-03 15:08 UTC (permalink / raw)
  To: Robert Jarzmik
  Cc: alsa-devel, Jaroslav Kysela, linux-ide, netdev, linux-mtd,
	driverdevel, Boris Brezillon, Vinod Koul, Richard Weinberger,
	Takashi Iwai, Marek Vasut, Ezequiel Garcia, linux-media,
	Samuel Ortiz, Arnd Bergmann, Bartlomiej Zolnierkiewicz,
	Haojian Zhuang, dmaengine, Mark Brown, Mauro Carvalho Chehab,
	Linux ARM, Nicolas Pitre, Greg Kroah-Hartman,
	"linux-mmc@vger.ke
In-Reply-To: <20180402142656.26815-1-robert.jarzmik@free.fr>

On 2 April 2018 at 16:26, Robert Jarzmik <robert.jarzmik@free.fr> wrote:
> Hi,
>
> This serie is aimed at removing the dmaengine slave compat use, and transfer
> knowledge of the DMA requestors into architecture code.
>
> This was discussed/advised by Arnd a couple of years back, it's almost time.
>
> The serie is divided in 3 phasees :
>  - phase 1 : patch 1/15 and patch 2/15
>    => this is the preparation work
>  - phase 2 : patches 3/15 .. 10/15
>    => this is the switch of all the drivers
>    => this one will require either an Ack of the maintainers or be taken by them
>       once phase 1 is merged
>  - phase 3 : patches 11/15
>    => this is the last part, cleanup and removal of export of the DMA filter
>       function
>
> As this looks like a patch bomb, each maintainer expressing for his tree either
> an Ack or "I want to take through my tree" will be spared in the next iterations
> of this serie.

Perhaps an option is to send this whole series as a PR for 3.17 rc1; that
would remove some churn and make this faster/easier? Well, if you
receive the needed acks of course.

For the mmc change:

Acked-by: Ulf Hansson <ulf.hansson@linaro.org>

Kind regards
Uffe

>
> Several of these changes have been tested on actual hardware, including :
>  - pxamci
>  - pxa_camera
>  - smc*
>  - ASoC and SSP
>
> Happy review.
>
> Robert Jarzmik (15):
>   dmaengine: pxa: use a dma slave map
>   ARM: pxa: add dma slave map
>   mmc: pxamci: remove the dmaengine compat need
>   media: pxa_camera: remove the dmaengine compat need
>   mtd: nand: pxa3xx: remove the dmaengine compat need
>   net: smc911x: remove the dmaengine compat need
>   net: smc91x: remove the dmaengine compat need
>   ASoC: pxa: remove the dmaengine compat need
>   net: irda: pxaficp_ir: remove the dmaengine compat need
>   ata: pata_pxa: remove the dmaengine compat need
>   dmaengine: pxa: document pxad_param
>   dmaengine: pxa: make the filter function internal
>   ARM: pxa: remove the DMA IO resources
>   ARM: pxa: change SSP devices allocation
>   ARM: pxa: change SSP DMA channels allocation
>
>  arch/arm/mach-pxa/devices.c               | 269 ++++++++++++++----------------
>  arch/arm/mach-pxa/devices.h               |  14 +-
>  arch/arm/mach-pxa/include/mach/audio.h    |  12 ++
>  arch/arm/mach-pxa/pxa25x.c                |   4 +-
>  arch/arm/mach-pxa/pxa27x.c                |   4 +-
>  arch/arm/mach-pxa/pxa3xx.c                |   5 +-
>  arch/arm/plat-pxa/ssp.c                   |  50 +-----
>  drivers/ata/pata_pxa.c                    |  10 +-
>  drivers/dma/pxa_dma.c                     |  13 +-
>  drivers/media/platform/pxa_camera.c       |  22 +--
>  drivers/mmc/host/pxamci.c                 |  29 +---
>  drivers/mtd/nand/pxa3xx_nand.c            |  10 +-
>  drivers/net/ethernet/smsc/smc911x.c       |  16 +-
>  drivers/net/ethernet/smsc/smc91x.c        |  12 +-
>  drivers/net/ethernet/smsc/smc91x.h        |   1 -
>  drivers/staging/irda/drivers/pxaficp_ir.c |  14 +-
>  include/linux/dma/pxa-dma.h               |  20 +--
>  include/linux/platform_data/mmp_dma.h     |   4 +
>  include/linux/pxa2xx_ssp.h                |   4 +-
>  sound/arm/pxa2xx-ac97.c                   |  14 +-
>  sound/arm/pxa2xx-pcm-lib.c                |   6 +-
>  sound/soc/pxa/pxa-ssp.c                   |   5 +-
>  sound/soc/pxa/pxa2xx-ac97.c               |  32 +---
>  23 files changed, 196 insertions(+), 374 deletions(-)
>
> --
> 2.11.0
>

^ permalink raw reply

