* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: levipearson @ 2017-09-20 1:59 UTC (permalink / raw)
To: vinicius.gomes
Cc: netdev, intel-wired-lan, andre.guedes, ivan.briano,
jesus.sanchez-palencia, boon.leong.ong, jhs, xiyou.wangcong, jiri,
richardcochran, henrik
In-Reply-To: <20170901012625.14838-1-vinicius.gomes@intel.com>
On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
> Hi,
>
> This patchset is an RFC on a proposal of how the Traffic Control subsystem can
> be used to offload the configuration of traffic shapers into network devices
> that provide support for them in HW. Our goal here is to start upstreaming
> support for features related to the Time-Sensitive Networking (TSN) set of
> standards into the kernel.
I'm very excited to see these features moving into the kernel! I am one of the
maintainers of the OpenAvnu project and I've been involved in building AVB/TSN
systems and working on the standards for around 10 years, so the support that's
been slowly making it into more silicon and now Linux drivers is very
encouraging.
My team at Harman is working on endpoint code based on what's in the OpenAvnu
project and a few Linux-based platforms. The Qav interface you've proposed will
fit nicely with our traffic shaper management daemon, which already uses mqprio
as a base but uses the htb shaper to approximate the Qav credit-based shaper on
platforms where launch time scheduling isn't available.
I've applied your patches and plan on testing them in conjunction with our
shaper manager to see if we run into any hitches, but I don't expect any
problems.
> As part of this work, we've assessed previous public discussions related to TSN
> enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
> at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
> the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).
>
> Please note that the patches provided as part of this RFC are implementing what
> is needed only for 802.1Qav (FQTSS) only, but we'd like to take advantage of
> this discussion and share our WIP ideas for the 802.1Qbv and 802.1Qbu interfaces
> as well. The current patches are only providing support for HW offload of the
> configs.
>
>
> Overview
> ========
>
> Time-sensitive Networking (TSN) is a set of standards that aim to address
> resources availability for providing bandwidth reservation and bounded latency
> on Ethernet based LANs. The proposal described here aims to cover mainly what is
> needed to enable the following standards: 802.1Qat, 802.1Qav, 802.1Qbv and
> 802.1Qbu.
>
> The initial target of this work is the Intel i210 NIC, but other controllers'
> datasheet were also taken into account, like the Renesas RZ/A1H RZ/A1M group and
> the Synopsis DesignWare Ethernet QoS controller.
Recent SoCs from NXP (the i.MX 6 SoloX, and all the i.MX 7 and 8 parts) support
Qav shaping as well as scheduled launch functionality; these are the parts I
have been mostly working with. Marvell silicon (some subset of Armada processors
and Link Street DSA switches) generally supports traffic shaping as well.
I think a lack of an interface like this has probably slowed upstream driver
support for this functionality where it exists; most vendors have an out-of-
tree version of their driver with TSN functionality enabled via non-standard
interfaces. Hopefully making it available will encourage vendors to upstream
their driver support!
> Proposal
> ========
>
> Feature-wise, what is covered here are configuration interfaces for HW
> implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
> (802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
> Qbv and Qbu must be configured per port, with the configuration covering all
> queues. Given that these features are related to traffic shaping, and that the
> traffic control subsystem already provides a queueing discipline that offloads
> config into the device driver (i.e. mqprio), designing new qdiscs for the
> specific purpose of offloading the config for each shaper seemed like a good
> fit.
This makes sense to me too. The 802.1Q standards are all based on the sort of
mappings between priority, traffic class, and hardware queues that the existing
tc infrastructure seems to be modeling. I believe the mqprio module's mapping
scheme is flexible enough to meet any TSN needs in conjunction with the other
parts of the kernel qdisc system.
> For steering traffic into the correct queues, we use the socket option
> SO_PRIORITY and then a mechanism to map priority to traffic classes / Txqueues.
> The qdisc mqprio is currently used in our tests.
>
> As for the shapers config interface:
>
> * CBS (802.1Qav)
>
> This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
> $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
> idleslope I
>
> Note that the parameters for this qdisc are the ones defined by the
> 802.1Q-2014 spec, so no hardware specific functionality is exposed here.
These parameters look good to me as a baseline; some additional optional
parameters may be useful for software-based implementations--such as setting an
interval at which to recalculate queues--but those can be discussed later.
> * Time-aware shaper (802.1Qbv):
I haven't come across any specific NIC or SoC MAC that does Qbv, but I have
been experimenting with an EspressoBin board, which has a "Topaz" DSA switch
in it that has some features intended for Qbv support, although they were done
with a draft version in mind.
I haven't looked at the interaction between the qdisc subsystem and DSA yet,
but this mechanism might be useful to configure Qbv on the slave ports in
that context. I've got both the board and the documentation, so I might be
able to work on an implementation at some point.
If some endpoint device shows up with direct Qbv support, this interface would
probably work well there too, although a talker would need to be able to
schedule its transmits pretty precisely to achieve the lowest possible latency.
> The idea we are currently exploring is to add a "time-aware", priority based
> qdisc, that also exposes the Tx queues available and provides a mechanism for
> mapping priority <-> traffic class <-> Tx queues in a similar fashion as
> mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
>
> $ $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
> map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
> queues 0 1 2 3 \
> sched-file gates.sched [base-time <interval>] \
> [cycle-time <interval>] [extension-time <interval>]
One concern here is calling the base-time parameter an interval; it's really
an absolute time with respect to the PTP timescale. Good documentation will
be important to this one, since the specification discusses some subtleties
regarding the impact of different time values chosen here.
The format for specifying the actual intervals such as cycle-time could prove
to be an important detail as well; Qbv specifies cycle-time as a ratio of two
integers expressed in seconds, while extension-time is specified as an integer
number of nanoseconds.
Precision with the cycle-time is especially important, since base-time can be
almost arbitrarily far in the past or future, and any given cycle start should
be calculable from the base-time plus/minus some integer multiple of cycle-
time.
> <file> is multi-line, with each line being of the following format:
> <cmd> <gate mask> <interval in nanoseconds>
>
> Qbv only defines one <cmd>: "S" for 'SetGates'
>
> For example:
>
> S 0x01 300
> S 0x03 500
>
> This means that there are two intervals, the first will have the gate
> for traffic class 0 open for 300 nanoseconds, the second will have
> both traffic classes open for 500 nanoseconds.
>
> Additionally, an option to set just one entry of the gate control list will
> also be provided by 'taprio':
>
> $ tc qdisc (...) \
> sched-row <row number> <cmd> <gate mask> <interval> \
> [base-time <interval>] [cycle-time <interval>] \
> [extension-time <interval>]
If I understand correctly, 'sched-row' is meant to be usable multiple times in
a single command and the 'sched-file' option is just a shorthand version for
large tables? Or is it meant to update an existing schedule table? It doesn't
seem very useful if it can only be specified once when the whole taprio intance
is being established.
> * Frame Preemption (802.1Qbu):
>
> To control even further the latency, it may prove useful to signal which
> traffic classes are marked as preemptable. For that, 'taprio' provides the
> preemption command so you set each traffic class as preemptable or not:
>
> $ tc qdisc (...) \
> preemption 0 1 1 1
>
>
> * Time-aware shaper + Preemption:
>
> As an example of how Qbv and Qbu can be used together, we may specify
> both the schedule and the preempt-mask, and this way we may also
> specify the Set-Gates-and-Hold and Set-Gates-and-Release commands as
> specified in the Qbu spec:
>
> $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
> map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
> queues 0 1 2 3 \
> preemption 0 1 1 1 \
> sched-file preempt_gates.sched
>
> <file> is multi-line, with each line being of the following format:
> <cmd> <gate mask> <interval in nanoseconds>
>
> For this case, two new commands are introduced:
>
> "H" for 'set gates and hold'
> "R" for 'set gates and release'
>
> H 0x01 300
> R 0x03 500
>
The new Hold and Release gate commands look right, but I'm not sure about the
preemption flags. Qbu describes a preemption parameter table indexed by
*priority* rather than traffic class or queue. These select which of two MAC
service interfaces is used by the frame at the ISS layer, either express or
preemptable, at the time the frame is selected for transmit. If my
understanding is correct, it's possible to map a preemptable priority as well
as an express priority to the same queue, so flagging preemptability at the
queue level is not correct.
I'm not aware of any endpoint interfaces that support Qbu either, nor do I
know of any switches that support it that someone could experiment with right
now, so there's no pressure on getting that interface nailed down yet.
Hopefully you find this feedback useful, and I appreciate the effort taken to
get the RFC posted here!
Levi
^ permalink raw reply
* RE: [PATCH v4 2/3] net: fec: remove unused interrupt FEC_ENET_TS_TIMER
From: Andy Duan @ 2017-09-20 2:14 UTC (permalink / raw)
To: Troy Kisky, netdev@vger.kernel.org, davem@davemloft.net
Cc: Fabio Estevam, lznuaa@gmail.com
In-Reply-To: <1505867589-11784-2-git-send-email-troy.kisky@boundarydevices.com>
From: Troy Kisky <troy.kisky@boundarydevices.com> Sent: Wednesday, September 20, 2017 8:33 AM
>FEC_ENET_TS_TIMER is not checked in the interrupt routine so there is no
>need to enable it.
>
>Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
>Acked-by: Fugang Duan <fugang.duan@nxp.com>
>
>---
>v4: Added Acked-by
>
>Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
>---
Troy, thank you for submitting the version.
The patch series had been already acked by me.
Thanks again.
> drivers/net/ethernet/freescale/fec.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/ethernet/freescale/fec.h
>b/drivers/net/ethernet/freescale/fec.h
>index 38c7b21..ede1876 100644
>--- a/drivers/net/ethernet/freescale/fec.h
>+++ b/drivers/net/ethernet/freescale/fec.h
>@@ -374,8 +374,8 @@ struct bufdesc_ex {
> #define FEC_ENET_TS_AVAIL ((uint)0x00010000)
> #define FEC_ENET_TS_TIMER ((uint)0x00008000)
>
>-#define FEC_DEFAULT_IMASK (FEC_ENET_TXF | FEC_ENET_RXF |
>FEC_ENET_MII | FEC_ENET_TS_TIMER)
>-#define FEC_NAPI_IMASK (FEC_ENET_MII | FEC_ENET_TS_TIMER)
>+#define FEC_DEFAULT_IMASK (FEC_ENET_TXF | FEC_ENET_RXF |
>FEC_ENET_MII)
>+#define FEC_NAPI_IMASK FEC_ENET_MII
> #define FEC_RX_DISABLED_IMASK (FEC_DEFAULT_IMASK &
>(~FEC_ENET_RXF))
>
> /* ENET interrupt coalescing macro define */
>--
>2.7.4
^ permalink raw reply
* RE: [PATCH RFC v1 0/3] Support for tap user-space access with veth interfaces
From: Grandhi, Sainath @ 2017-09-20 2:38 UTC (permalink / raw)
To: netdev@vger.kernel.org; +Cc: davem@davemloft.net
In-Reply-To: <1504744467-79590-1-git-send-email-sainath.grandhi@intel.com>
Just a reminder for feedback.
> -----Original Message-----
> From: Grandhi, Sainath
> Sent: Wednesday, September 06, 2017 5:34 PM
> To: netdev@vger.kernel.org
> Cc: davem@davemloft.net; Grandhi, Sainath <sainath.grandhi@intel.com>
> Subject: [PATCH RFC v1 0/3] Support for tap user-space access with veth
> interfaces
>
> From: Sainath Grandhi <sainath.grandhi@intel.com>
>
> This patchset adds a tap device driver for veth virtual network interface.
> With this implementation, tap character interface can be added only to the peer
> veth interface. Adding tap interface to veth is for usecases that forwards
> packets between host and VMs. This eliminates the need for an additional
> software bridge. This can be extended to create both the peer interfaces as tap
> interfaces. These patches are a step in that direction.
>
> Sainath Grandhi (3):
> net: Adding API to parse IFLA_LINKINFO attribute
> net: Abstracting out common routines from veth for use by vethtap
> vethtap: veth based tap driver
>
> drivers/net/Kconfig | 1 +
> drivers/net/Makefile | 2 +
> drivers/net/{veth.c => veth_main.c} | 80 ++++++++++---
> drivers/net/vethtap.c | 216 ++++++++++++++++++++++++++++++++++++
> include/linux/if_veth.h | 13 +++
> include/net/rtnetlink.h | 3 +
> net/core/rtnetlink.c | 8 ++
> 7 files changed, 308 insertions(+), 15 deletions(-) rename drivers/net/{veth.c =>
> veth_main.c} (89%) create mode 100644 drivers/net/vethtap.c create mode
> 100644 include/linux/if_veth.h
>
> --
> 2.7.4
^ permalink raw reply
* Re: [PATCH v5 05/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY
From: Rob Herring @ 2017-09-20 2:49 UTC (permalink / raw)
To: Andrew Lunn
Cc: Corentin Labbe, Mark Rutland, Florian Fainelli, Alexandre Torgue,
devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Catalin Marinas, Will Deacon, Russell King,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Chen-Yu Tsai, netdev, Giuseppe CAVALLARO, Maxime Ripard,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
In-Reply-To: <20170914191949.GA3796-g2DYL2Zd6BY@public.gmane.org>
On Thu, Sep 14, 2017 at 2:19 PM, Andrew Lunn <andrew-g2DYL2Zd6BY@public.gmane.org> wrote:
>> > Is the MDIO controller "allwinner,sun8i-h3-emac" or "snps,dwmac-mdio"?
>> > If the latter, then I think the node is fine, but then the mux should be
>> > a child node of it. IOW, the child of an MDIO controller should either
>> > be a mux node or slave devices.
>
> Hi Rob
>
> Up until now, children of an MDIO bus have been MDIO devices. Those
> MDIO devices are either Ethernet PHYs, Ethernet Switches, or the
> oddball devices that Broadcom iProc has, like generic PHYs.
>
> We have never had MDIO-muxes as MDIO children. A Mux is not an MDIO
> device, and does not have the properties of an MDIO device. It is not
> addressable on the MDIO bus. The current MUXes are addressed via GPIOs
> or MMIO.
The DT parent/child relationship defines the bus topology. We describe
MDIO buses in that way and if a mux is sitting between the controller
and the devices, then the DT hierarchy should reflect that. Now
sometimes we have 2 options for what interface has the parent/child
relationship (e.g. an I2C controlled USB hub chip), but in this case
we don't.
> There other similar cases. i2c-mux-gpio is not a child of an i2c bus,
> nor i2c-mux-reg or gpio-mux. nxp,pca9548 is however a child of the i2c
> bus, because it is an i2c device itself...
Some are i2c controlled mux devices, but some can be GPIO controlled.
>
> If the MDIO mux was an MDIO device, i would agree with you. Bit it is
> not, so lets not make it a child.
>
> Andrew
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH,net-next,1/2] tun: enable NAPI for TUN/TAP driver
From: Eric Dumazet @ 2017-09-20 2:51 UTC (permalink / raw)
To: Petar Penkov
Cc: netdev, Eric Dumazet, Mahesh Bandewar, Willem de Bruijn, davem,
ppenkov
In-Reply-To: <20170919073402.2292-2-peterpenkov96@gmail.com>
On Tue, 2017-09-19 at 00:34 -0700, Petar Penkov wrote:
> Changes TUN driver to use napi_gro_receive() upon receiving packets
> rather than netif_rx_ni(). Adds flag CONFIG_TUN_NAPI that enables
> these changes and operation is not affected if the flag is disabled.
> SKBs are constructed upon packet arrival and are queued to be
> processed later.
>
>
> Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: davem@davemloft.net
> Cc: ppenkov@stanford.edu
> ---
Very nice, thanks a lot Petar.
Acked-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply
* Re: [PATCH,net-next,2/2] tun: enable napi_gro_frags() for TUN/TAP driver
From: Eric Dumazet @ 2017-09-20 3:16 UTC (permalink / raw)
To: Petar Penkov
Cc: netdev, Eric Dumazet, Mahesh Bandewar, Willem de Bruijn, davem,
ppenkov
In-Reply-To: <20170919073402.2292-3-peterpenkov96@gmail.com>
On Tue, 2017-09-19 at 00:34 -0700, Petar Penkov wrote:
> Add a TUN/TAP receive mode that exercises the napi_gro_frags()
> interface. This mode is available only in TAP mode, as the interface
> expects packets with Ethernet headers.
>
> Furthermore, packets follow the layout of the iovec_iter that was
> received. The first iovec is the linear data, and every one after the
> first is a fragment. If there are more fragments than the max number,
> drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
> dissector code and to verify that the header resides in the linear data.
>
> The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
> This is imposed because this mode is intended for testing via tools like
> syzkaller and packetdrill, and the increased flexibility it provides can
> introduce security vulnerabilities.
>
> Signed-off-by: Petar Penkov <peterpenkov96@gmail.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Mahesh Bandewar <maheshb@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: davem@davemloft.net
> Cc: ppenkov@stanford.edu
> ---
Again, very nice, thanks a lot Petar.
Acked-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply
* Re: Latest net-next from GIT panic
From: Eric Dumazet @ 2017-09-20 3:24 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Kernel Network Developers
In-Reply-To: <65e2195b-3bd1-c0b4-b474-e07dd08f71b9@itcare.pl>
On Wed, 2017-09-20 at 02:06 +0200, Paweł Staszewski wrote:
> Just checked kernel 4.13.2 and same problem
>
> Just after start all 6 bgp sessions - and kernel starts to learn routes
> it panic.
>
> https://bugzilla.kernel.org/attachment.cgi?id=258509
>
Unfortunately we have not enough information from these traces.
Can you get a full stack trace ?
Alternatively, can you bisect ?
Thanks.
^ permalink raw reply
* [PATCH net] net/ncsi: Don't assume last available channel exists
From: Samuel Mendoza-Jonas @ 2017-09-20 4:12 UTC (permalink / raw)
To: netdev; +Cc: Samuel Mendoza-Jonas, David S . Miller, linux-kernel
When handling new VLAN tags in NCSI we check the maximum allowed number
of filters on the last active ("hot") channel. However if the 'add'
callback is called before NCSI has configured a channel, this causes a
NULL dereference.
Check that we actually have a hot channel, and warn if it is missing.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
---
net/ncsi/ncsi-manage.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 3fd3c39e6278..fc800f934beb 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1420,7 +1420,10 @@ int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid)
}
ndp = TO_NCSI_DEV_PRIV(nd);
- ncf = ndp->hot_channel->filters[NCSI_FILTER_VLAN];
+ if (!ndp) {
+ netdev_warn(dev, "ncsi: No ncsi_dev_priv?\n");
+ return 0;
+ }
/* Add the VLAN id to our internal list */
list_for_each_entry_rcu(vlan, &ndp->vlan_vids, list) {
@@ -1432,11 +1435,17 @@ int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid)
}
}
- if (n_vids >= ncf->total) {
- netdev_info(dev,
- "NCSI Channel supports up to %u VLAN tags but %u are already set\n",
- ncf->total, n_vids);
- return -EINVAL;
+ if (!ndp->hot_channel) {
+ netdev_warn(dev,
+ "ncsi: no available filter to check maximum\n");
+ } else {
+ ncf = ndp->hot_channel->filters[NCSI_FILTER_VLAN];
+ if (n_vids >= ncf->total) {
+ netdev_info(dev,
+ "NCSI Channel supports up to %u VLAN tags but %u are already set\n",
+ ncf->total, n_vids);
+ return -EINVAL;
+ }
}
vlan = kzalloc(sizeof(*vlan), GFP_KERNEL);
--
2.14.1
^ permalink raw reply related
* Re: [PATCH,net-next,0/2] Improve code coverage of syzkaller
From: Petar Penkov @ 2017-09-20 4:26 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20170919.160134.534837308168786965.davem@davemloft.net>
On Tue, Sep 19, 2017 at 4:01 PM, David Miller <davem@davemloft.net> wrote:
> From: Petar Penkov <peterpenkov96@gmail.com>
> Date: Tue, 19 Sep 2017 00:34:00 -0700
>
>> The following patches address this by providing the user(syzkaller)
>> with the ability to send via napi_gro_receive() and napi_gro_frags().
>> Additionally, syzkaller can specify how many fragments there are and
>> how much data per fragment there is. This is done by exploiting the
>> convenient structure of iovecs. Finally, this patch series adds
>> support for exercising the flow dissector during fuzzing.
>>
>> The code path including napi_gro_receive() can be enabled via the
>> CONFIG_TUN_NAPI compile-time flag, and can be used by users other than
>> syzkaller. The remainder of the changes in this patch series give the
>> user significantly more control over packets entering the kernel. To
>> avoid potential security vulnerabilities, hide the ability to send
>> custom skbs and the flow dissector code paths behind a run-time flag
>> IFF_NAPI_FRAGS that is advertised and accepted only if CONFIG_TUN_NAPI
>> is enabled.
>>
>> The patch series will be followed with changes to packetdrill, where
>> these additions to the TUN driver are exercised and demonstrated.
>> This will give the ability to write regression tests for specific
>> parts of the early networking stack.
>>
>> Patch 1/ Add NAPI struct per receive queue, enable NAPI, and use
>> napi_gro_receive()
>> Patch 2/ Use NAPI skb and napi_gro_frags(), exercise flow
>> dissector, and allow custom skbs.
>
> I'm happy with everything except the TUN_NAPI Kconfig knob
> requirement.
>
> Rebuilding something just to test things isn't going to fly very well.
>
> Please make it secure somehow, enable this stuff by default.
>
> Thanks.
Without a compile-time option, the TUN/TAP driver will have a
code-path that allows
user control over kernel memory allocation, and specifically over the
SKBs that enter
the kernel. That path might be hard to exploit as it requires some
user privileges,
but it does exist and increases attack surface of the kernel. While
the flag certainly
inconveniences testing, I think the layer of security it adds
outweighs its disadvantages.
Furthermore, in a way testing already requires specific kernel configuration.
In this particular example, syzkaller prefers synchronous operation
and therefore needs
4KSTACKS disabled. Other features that require rebuilding are KASAN
and dbx. From
this point of view, I still think that having the TUN_NAPI flag has value.
^ permalink raw reply
* Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.
From: Eric Dumazet @ 2017-09-20 5:13 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Tom Herbert, Alexander Duyck, Linux Kernel Network Developers
In-Reply-To: <fe565f14-156e-d703-c91d-d67136a0a0c0@intel.com>
On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
> On 9/19/2017 5:48 PM, Tom Herbert wrote:
> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
> > <sridhar.samudrala@intel.com> wrote:
> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:
> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
> > > > <sridhar.samudrala@intel.com> wrote:
> > > > >
> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:
> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:
> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> > > > > > > >
> > > > > > > > > Two ints in sock_common for this purpose is quite expensive and the
> > > > > > > > > use case for this is limited-- even if a RX->TX queue mapping were
> > > > > > > > > introduced to eliminate the queue pair assumption this still won't
> > > > > > > > > help if the receive and transmit interfaces are different for the
> > > > > > > > > connection. I think we really need to see some very compelling
> > > > > > > > > results
> > > > > > > > > to be able to justify this.
> > > > > > > Will try to collect and post some perf data with symmetric queue
> > > > > > > configuration.
> > >
> > > Here is some performance data i collected with memcached workload over
> > > ixgbe 10Gb NIC with mcblaster benchmark.
> > > ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very
> > > low
> > > interrupt rate.
> > > ethtool -L p1p1 combined 16
> > > ethtool -C p1p1 rx-usecs 1000
> > > and busy poll is set to 1000usecs
> > > sysctl net.core.busy_poll = 1000
> > >
> > > 16 threads 800K requests/sec
> > > =============================
> > > rtt(min/avg/max)usecs intr/sec contextswitch/sec
> > > -----------------------------------------------------------------------
> > > Default 2/182/10641 23391 61163
> > > Symmetric Queues 2/50/6311 20457 32843
> > >
> > > 32 threads 800K requests/sec
> > > =============================
> > > rtt(min/avg/max)usecs intr/sec contextswitch/sec
> > > ------------------------------------------------------------------------
> > > Default 2/162/6390 32168 69450
> > > Symmetric Queues 2/50/3853 35044 35847
> > >
> > No idea what "Default" configuration is. Please report how xps_cpus is
> > being set, how many RSS queues there are, and what the mapping is
> > between RSS queues and CPUs and shared caches. Also, whether and
> > threads are pinned.
> Default is linux 4.13 with the settings i listed above.
> ethtool -L p1p1 combined 16
> ethtool -C p1p1 rx-usecs 1000
> sysctl net.core.busy_poll = 1000
>
> # ethtool -x p1p1
> RX flow hash indirection table for p1p1 with 16 RX ring(s):
> 0: 0 1 2 3 4 5 6 7
> 8: 8 9 10 11 12 13 14 15
> 16: 0 1 2 3 4 5 6 7
> 24: 8 9 10 11 12 13 14 15
> 32: 0 1 2 3 4 5 6 7
> 40: 8 9 10 11 12 13 14 15
> 48: 0 1 2 3 4 5 6 7
> 56: 8 9 10 11 12 13 14 15
> 64: 0 1 2 3 4 5 6 7
> 72: 8 9 10 11 12 13 14 15
> 80: 0 1 2 3 4 5 6 7
> 88: 8 9 10 11 12 13 14 15
> 96: 0 1 2 3 4 5 6 7
> 104: 8 9 10 11 12 13 14 15
> 112: 0 1 2 3 4 5 6 7
> 120: 8 9 10 11 12 13 14 15
>
> smp_affinity for the 16 queuepairs
> 141 p1p1-TxRx-0 0000,00000001
> 142 p1p1-TxRx-1 0000,00000002
> 143 p1p1-TxRx-2 0000,00000004
> 144 p1p1-TxRx-3 0000,00000008
> 145 p1p1-TxRx-4 0000,00000010
> 146 p1p1-TxRx-5 0000,00000020
> 147 p1p1-TxRx-6 0000,00000040
> 148 p1p1-TxRx-7 0000,00000080
> 149 p1p1-TxRx-8 0000,00000100
> 150 p1p1-TxRx-9 0000,00000200
> 151 p1p1-TxRx-10 0000,00000400
> 152 p1p1-TxRx-11 0000,00000800
> 153 p1p1-TxRx-12 0000,00001000
> 154 p1p1-TxRx-13 0000,00002000
> 155 p1p1-TxRx-14 0000,00004000
> 156 p1p1-TxRx-15 0000,00008000
> xps_cpus for the 16 Tx queues
> 0000,00000001
> 0000,00000002
> 0000,00000004
> 0000,00000008
> 0000,00000010
> 0000,00000020
> 0000,00000040
> 0000,00000080
> 0000,00000100
> 0000,00000200
> 0000,00000400
> 0000,00000800
> 0000,00001000
> 0000,00002000
> 0000,00004000
> 0000,00008000
> memcached threads are not pinned.
>
...
I urge you to take the time to properly tune this host.
linux kernel does not do automagic configuration. This is user policy.
Documentation/networking/scaling.txt has everything you need.
^ permalink raw reply
* TSN Scorecard, was Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: levipearson @ 2017-09-20 5:17 UTC (permalink / raw)
To: richardcochran
Cc: netdev, intel-wired-lan, vinicius.gomes, andre.guedes,
ivan.briano, jesus.sanchez-palencia, boon.leong.ong, jhs,
xiyou.wangcong, jiri, henrik
In-Reply-To: <20170918081248.txhars53icbqsvld@localhost>
On Mon, Sep 18, 2017, Richard Cochran wrote:
> Just for the record, here is my score card showing the current status
> of TSN support in Linux. Comments and corrections are more welcome.
>
> Thanks,
> Richard
>
>
> | FEATURE | STANDARD | STATUS |
> |------------------------------------------------+---------------------+------------------------------|
> | Synchronization | 802.1AS-2011 | Implemented in |
> | | | - Linux kernel PHC subsystem |
> | | | - linuxptp (userspace) |
> |------------------------------------------------+---------------------+------------------------------|
An alternate implementation of the userspace portion of gPTP is also available at [1]
> | Forwarding and Queuing Enhancements | 802.1Q-2014 sec. 34 | RFC posted (this thread) |
> | for Time-Sensitive Streams (FQTSS) | | |
> |------------------------------------------------+---------------------+------------------------------|
> | Stream Reservation Protocol (SRP) | 802.1Q-2014 sec. 35 | in Open-AVB [1] |
> |------------------------------------------------+---------------------+------------------------------|
> | Audio Video Transport Protocol (AVTP) | IEEE 1722-2011 | DNE |
> |------------------------------------------------+---------------------+------------------------------|
> | Audio/Video Device Discovery, Enumeration, | IEEE 1722.1-2013 | jdksavdecc-c [2] |
> | Connection Management and Control (AVDECC) | | |
> | AVDECC Connection Management Protocol (ACMP) | | |
> | AVDECC Enumeration and Control Protocol (AECP) | | |
> | MAC Address Acquisition Protocol (MAAP) | | in Open-AVB |
> |------------------------------------------------+---------------------+------------------------------|
All of the above are available to some degree in the AVTP Pipeline part of [1], specifically at this
location: https://github.com/AVnu/OpenAvnu/tree/master/lib/avtp_pipeline
The code is very modular and configurable, although some parts are in better shape than others. The AVTP
portion can use the custom userspace driver for the i210, which can be configured to use launch scheduling,
or it can use standard kernel interfaces via sendmsg or PACKET_MMAP. It runs as-is when configured for
standard interfaces with any network hardware that supports gPTP. I previously implemented a CMSG-based
launch time scheduling mechanism like the one you have proposed, and I have a socket backend for it that
could easily be ported to your proposal. It is not part of the repository yet since there's no kernel
support for it outside of my prototype and your RFC.
It is currently tied to the OpenAvnu gPTP daemon rather than linuxptp, as it uses a shared memory interface
to get the current rate-ratio and offset information between the various clocks. There may be better ways
to do this, but that's how the initial port of the codebase was done. It would be nice to get it working
with linuxptp's userspace tools at some point as well, though.
The libraries under avtp_pipeline are designed to be used separately, but a simple integrated application
is provided and is built by the CI system.
In addition to OpenAvnu, Renesas has a number of github repositories with what looks like a fairly
complete media streaming system:
https://github.com/renesas-rcar/avb-mse
https://github.com/renesas-rcar/avb-streaming
https://github.com/renesas-rcar/avb-applications
I haven't examined them in great detail yet, though.
> | Frame Preemption | P802.1Qbu | DNE |
> | Scheduled Traffic | P802.1Qbv | RFC posted (SO_TXTIME) |
> | SRP Enhancements and Performance Improvements | P802.1Qcc | DNE |
>
> DNE = Does Not Exist (to my knowledge)
Although your SO_TXTIME proposal could certainly form the basis of an endpoint's implementation of Qbv, I
think it is a stretch to consider it a Qbv implementation in itself, if that's what you mean by this.
I have been working with colleagues on some experiments relating to a Linux-controlled DSN switch
(a Marvell Topaz) that are a part of this effort in TSN:
http://ieee802.org/1/files/public/docs2017/tsn-cgunther-802-3cg-multidrop-0917-v01.pdf
The proper interfaces for the Qbv configuration and managing of switch-level PTP timestamps are not yet
in place, so there's nothing even at RFC stage to present yet, but Qbv-capable Linux-managed switch
hardware is available and we hope to get some reusable code published even if it's not yet ready to be
integrated in the kernel.
>
> 1. https://github.com/Avnu/OpenAvnu
>
> (DISCLAIMER from the website:)
>
> It is planned to eventually include the various packet encapsulation types,
> protocol discovery daemons, libraries to convert media clocks to AVB clocks
> and vice versa, and drivers.
>
> This repository does not include all components required to build a full
> production AVB/TSN system (e.g. a turnkey solution to stream stored or live audio
> or video content). Some simple example applications are provided which
> illustrate the flow - but a professional Audio/Video system requires a full media stack
> - including audio and video inputs and outputs, media processing elements, and
> various graphical user interfaces. Various companies provide such integrated
> solutions.
A bit of progress has been made since that was written, although it is true that it's still not
quite complete and certainly not turnkey. The most glaring absence at the moment is the media
clock recovery portion of AVTP, but I am actively working on this.
>
> 2. https://github.com/jdkoftinoff/jdksavdecc-c
This is pulled in as a dependency of the AVDECC code in OpenAvnu; it's used in the command line driven
controller, but not in the avtp_pipeline code that implements the endpoint AVDECC behavior. I don't think
either are complete by any means, but they are complete enough to be mostly compliant and usable in the
subset of behavior they support.
The bulk of the command line controller is a clone of: https://github.com/audioscience/avdecc-lib
Things are maybe a bit farther along than they seemed, but there is still important kernel work to be
done to reduce the need for out-of-tree drivers and to get everyone on the same interfaces. I plan
to be an active participant going forward.
Levi
^ permalink raw reply
* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Richard Cochran @ 2017-09-20 5:25 UTC (permalink / raw)
To: Vinicius Costa Gomes
Cc: netdev, jhs, xiyou.wangcong, jiri, intel-wired-lan, andre.guedes,
ivan.briano, jesus.sanchez-palencia, boon.leong.ong
In-Reply-To: <87wp4ufchl.fsf@intel.com>
On Tue, Sep 19, 2017 at 05:19:18PM -0700, Vinicius Costa Gomes wrote:
> One of the problems with OpenAVNU is that it's too coupled with the i210
> NIC. One of the things we want is to decouple OpenAVNU from the
> controller.
Yes, I want that, too.
> The way we thought best was to propose interfaces (that
> would work along side to the Linux networking stack) as close as
> possible to what the current standards define, that means the IEEE
> 802.1Q family of specifications, in the hope that network controller
> vendors would also look at the specifications when designing their
> controllers.
These standard define the *behavior*, not the programming APIs. Our
task as kernel developers is to invent the best interfaces for
supporting 802.1Q and other standards, the hardware capabilities, and
the widest range of applications (not jut AVB).
> Our objective with the Qdiscs we are proposing (both cbs and taprio) is
> to provide a sane way to configure controllers that support TSN features
> (we were looking specifically at the IEEE specs).
I can see how your proposed Qdiscs are inspired by the IEEE standards.
However, in the case of time based transmission, I think there is a
better way to do it, namely with SO_TXTIME (which BTW was originally
proposed by Eric Mann).
> After we have some rough consensus on the interfaces to use, then we can
> start working on OpenAVNU.
Did you see my table in the other mail? Any comments?
> (Sorry if I am being annoying here, but the idea of an opaque schedule
> is not ours, that comes from the people who wrote the Qbv specification)
The schedule is easy to implement using SO_TXTIME.
> I have a question, what about a controller that doesn't provide a way to
> set a per-packet transmission time, but it supports Qbv/Qbu. What would
> be your proposal to configure it?
SO_TXTIME will have a generic SW fallback.
BTW, regarding the i210, there is no sensible way to configure both
CBS and time based transmission at the same time. The card performs a
logical AND to make the launch decision. The effect of this is that
each and every packet needs a LaunchTime, and the driver would be
forced to guess the time for a packet before entering it into its
queue.
So if we end up merging CBS and SO_TXTIME, then we'll have to make
them exclusive of each other (in the case of the i210) and manage the
i210 queue configurations correctly.
> (I think LaunchTime is something specific to the i210, right?)
To my knowledge yes. However, if TSN does take hold, then other MAC
vendors will copy it.
Thanks,
Richard
^ permalink raw reply
* Re: [PATCH net-next 3/4] qed: Fix maximum number of CQs for iWARP
From: Kalderon, Michal @ 2017-09-20 5:46 UTC (permalink / raw)
To: Leon Romanovsky
Cc: davem@davemloft.net, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, dledford@redhat.com, Elior, Ariel
In-Reply-To: <20170919174610.GK5788@mtr-leonro.local>
From: Leon Romanovsky <leon@kernel.org>
Sent: Tuesday, September 19, 2017 8:46 PM
On Tue, Sep 19, 2017 at 08:26:18PM +0300, Michal Kalderon wrote:
>> The maximum number of CQs supported is bound to the number
>> of connections supported, which differs between RoCE and iWARP.
>>
>> This fixes a crash that occurred in iWARP when running 1000 sessions
>> using perftest.
>>
>> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
>> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
>> ---
>
>It is worth to add Fixes line.
>
>Thanks
The original code was there before we had iWARP support, so this doesn't
exactly fix an older commit, but fixes iWARP code in general.
^ permalink raw reply
* Re: TSN Scorecard, was Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Richard Cochran @ 2017-09-20 5:49 UTC (permalink / raw)
To: levipearson
Cc: netdev, intel-wired-lan, vinicius.gomes, andre.guedes,
ivan.briano, jesus.sanchez-palencia, boon.leong.ong, jhs,
xiyou.wangcong, jiri, henrik
In-Reply-To: <20170920051754.21745-1-levipearson@gmail.com>
On Tue, Sep 19, 2017 at 11:17:54PM -0600, levipearson@gmail.com wrote:
> In addition to OpenAvnu, Renesas has a number of github repositories with what looks like a fairly
> complete media streaming system:
Is it a generic stack or a set of hacks for their HW?
> Although your SO_TXTIME proposal could certainly form the basis of an endpoint's implementation of Qbv, I
> think it is a stretch to consider it a Qbv implementation in itself, if that's what you mean by this.
No, that is not what I meant. We need some minimal additional kernel
support in order to fully implement the TSN family of standards. Of
course, the bulk will have to be done in user space. It would be a
mistake to cram the stuff that belongs in userland into the kernel.
Looking at the table, and reading your descriptions of the state of
OpenAVB, I remained convinced that the kernel needs only three
additions:
1. SO_TXTIME
2. CBS Qdisc
3. ALSA support for DAC clock control (but that is another story)
> The proper interfaces for the Qbv configuration and managing of switch-level PTP timestamps are not yet
> in place, so there's nothing even at RFC stage to present yet, but Qbv-capable Linux-managed switch
> hardware is available and we hope to get some reusable code published even if it's not yet ready to be
> integrated in the kernel.
Right, configuring Qbv in an attached DSA switch needs its own
interface.
Regarding PHC support for DSA switches, I have something in the works
to be published soon.
> A bit of progress has been made since that was written, although it is true that it's still not
> quite complete and certainly not turnkey.
So OpenAVB is neither complete nor turnkey. That was my impression,
too.
> Things are maybe a bit farther along than they seemed, but there is still important kernel work to be
> done to reduce the need for out-of-tree drivers and to get everyone on the same interfaces. I plan
> to be an active participant going forward.
You mentioned a couple of different kernel things you implemented.
I would encourage you to post the work already done.
Thanks,
Richard
^ permalink raw reply
* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Richard Cochran @ 2017-09-20 5:56 UTC (permalink / raw)
To: levipearson
Cc: vinicius.gomes, netdev, intel-wired-lan, andre.guedes,
ivan.briano, jesus.sanchez-palencia, boon.leong.ong, jhs,
xiyou.wangcong, jiri, henrik
In-Reply-To: <20170920015911.18999-1-levipearson@gmail.com>
On Tue, Sep 19, 2017 at 07:59:11PM -0600, levipearson@gmail.com wrote:
> If some endpoint device shows up with direct Qbv support, this interface would
> probably work well there too, although a talker would need to be able to
> schedule its transmits pretty precisely to achieve the lowest possible latency.
This is an argument for SO_TXTIME.
> One concern here is calling the base-time parameter an interval; it's really
> an absolute time with respect to the PTP timescale. Good documentation will
> be important to this one, since the specification discusses some subtleties
> regarding the impact of different time values chosen here.
>
> The format for specifying the actual intervals such as cycle-time could prove
> to be an important detail as well; Qbv specifies cycle-time as a ratio of two
> integers expressed in seconds, while extension-time is specified as an integer
> number of nanoseconds.
>
> Precision with the cycle-time is especially important, since base-time can be
> almost arbitrarily far in the past or future, and any given cycle start should
> be calculable from the base-time plus/minus some integer multiple of cycle-
> time.
The above three points also.
Thanks,
Richard
^ permalink raw reply
* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Richard Cochran @ 2017-09-20 5:58 UTC (permalink / raw)
To: Vinicius Costa Gomes
Cc: netdev, jhs, xiyou.wangcong, jiri, intel-wired-lan, andre.guedes,
ivan.briano, jesus.sanchez-palencia, boon.leong.ong
In-Reply-To: <87wp4ufchl.fsf@intel.com>
On Tue, Sep 19, 2017 at 05:19:18PM -0700, Vinicius Costa Gomes wrote:
> (I think LaunchTime is something specific to the i210, right?)
Levi just told us:
Recent SoCs from NXP (the i.MX 6 SoloX, and all the i.MX 7 and 8
parts) support Qav shaping as well as scheduled launch
functionality;
Thanks,
Richard
^ permalink raw reply
* [PATCH net-next] cxgb4: add new T5 pci device id's
From: Ganesh Goudar @ 2017-09-20 6:02 UTC (permalink / raw)
To: netdev, davem; +Cc: nirranjan, indranil, venkatesh, Ganesh Goudar
Add 0x50a5, 0x50a6, 0x50a7, 0x50a8 and 0x50a9 T5 device
id's.
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
index aa28299..37d90d6 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
@@ -176,6 +176,11 @@ CH_PCI_DEVICE_ID_TABLE_DEFINE_BEGIN
CH_PCI_ID_TABLE_FENTRY(0x50a2), /* Custom T540-KR4 */
CH_PCI_ID_TABLE_FENTRY(0x50a3), /* Custom T580-KR4 */
CH_PCI_ID_TABLE_FENTRY(0x50a4), /* Custom 2x T540-CR */
+ CH_PCI_ID_TABLE_FENTRY(0x50a5), /* Custom T522-BT */
+ CH_PCI_ID_TABLE_FENTRY(0x50a6), /* Custom T522-BT-SO */
+ CH_PCI_ID_TABLE_FENTRY(0x50a7), /* Custom T580-CR */
+ CH_PCI_ID_TABLE_FENTRY(0x50a8), /* Custom T580-KR */
+ CH_PCI_ID_TABLE_FENTRY(0x50a9), /* Custom T580-KR */
/* T6 adapters:
*/
--
2.1.0
^ permalink raw reply related
* Re: [PATCH net-next 2/4] qed: Add iWARP out of order support
From: Kalderon, Michal @ 2017-09-20 6:03 UTC (permalink / raw)
To: Leon Romanovsky
Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, Elior, Ariel
In-Reply-To: <20170919174511.GJ5788-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
From: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Sent: Tuesday, September 19, 2017 8:45 PM
On Tue, Sep 19, 2017 at 08:26:17PM +0300, Michal Kalderon wrote:
>> iWARP requires OOO support which is already provided by the ll2
>> interface (until now was used only for iSCSI offload).
>> The changes mostly include opening a ll2 dedicated connection for
>> OOO and notifiying the FW about the handle id.
>>
>> Signed-off-by: Michal Kalderon <Michal.Kalderon-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Ariel Elior <Ariel.Elior-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org>
>> ---
>> drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 44 +++++++++++++++++++++++++++++
>> drivers/net/ethernet/qlogic/qed/qed_iwarp.h | 11 +++++++-
>> drivers/net/ethernet/qlogic/qed/qed_rdma.c | 7 +++--
>> 3 files changed, 59 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> index 9d989c9..568e985 100644
>> --- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> +++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> @@ -41,6 +41,7 @@
>> #include "qed_rdma.h"
>> #include "qed_reg_addr.h"
>> #include "qed_sp.h"
>> +#include "qed_ooo.h"
>>
>> #define QED_IWARP_ORD_DEFAULT 32
>> #define QED_IWARP_IRD_DEFAULT 32
>> @@ -119,6 +120,13 @@ static void qed_iwarp_cid_cleaned(struct qed_hwfn *p_hwfn, u32 cid)
>> spin_unlock_bh(&p_hwfn->p_rdma_info->lock);
>> }
>>
>> +void qed_iwarp_init_fw_ramrod(struct qed_hwfn *p_hwfn,
>> + struct iwarp_init_func_params *p_ramrod)
>> +{
>> + p_ramrod->ll2_ooo_q_index = RESC_START(p_hwfn, QED_LL2_QUEUE) +
>> + p_hwfn->p_rdma_info->iwarp.ll2_ooo_handle;
>> +}
>> +
>> static int qed_iwarp_alloc_cid(struct qed_hwfn *p_hwfn, u32 *cid)
>> {
>> int rc;
>> @@ -1876,6 +1884,16 @@ static int qed_iwarp_ll2_stop(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
>> iwarp_info->ll2_syn_handle = QED_IWARP_HANDLE_INVAL;
>> }
>>
>> + if (iwarp_info->ll2_ooo_handle != QED_IWARP_HANDLE_INVAL) {
>> + rc = qed_ll2_terminate_connection(p_hwfn,
>> + iwarp_info->ll2_ooo_handle);
>> + if (rc)
>> + DP_INFO(p_hwfn, "Failed to terminate ooo connection\n");
>
>What exactly will you do with this knowledge? Anyway you are not
>interested in return values of qed_ll2_terminate_connection function in
>this place and other places too.
>
>Why don't you handle EAGAIN returned from the qed_ll2_terminate_connection()?
>
>Thanks
Thanks for pointing this out, you're right we could have ignored the return code, as there's
not much we can do at this point if it failed. But I still feel failures are worth knowing about,
and could help in analysis if they unexpectedly lead to another issue.
As for EAGAIN, it is very unlikely that we'll get this return code. Will consider adding generic
handling for this as a separate patch, as this currently isn't handled in any of the ll2 flows.
thanks,--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH,net-next,0/2] Improve code coverage of syzkaller
From: David Miller @ 2017-09-20 6:08 UTC (permalink / raw)
To: peterpenkov96; +Cc: netdev
In-Reply-To: <CA+DcSEjDe=h5Kk2Bg0vCOatQj2Zs8wTAyVH+z5curg4O4c3=Hw@mail.gmail.com>
From: Petar Penkov <peterpenkov96@gmail.com>
Date: Tue, 19 Sep 2017 21:26:14 -0700
> Furthermore, in a way testing already requires specific kernel
> configuration. In this particular example, syzkaller prefers
> synchronous operation and therefore needs 4KSTACKS disabled. Other
> features that require rebuilding are KASAN and dbx. From this point
> of view, I still think that having the TUN_NAPI flag has value.
Then I think this path could be enabled/disabled with a runtime flag
just as easily, no?
^ permalink raw reply
* [PATCH net-next v5 0/4] bpf: add two helpers to read perf event enabled/running time
From: Yonghong Song @ 2017-09-20 6:09 UTC (permalink / raw)
To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.
Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
normalized_num_samples = num_samples * time_enabled / time_running
normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.
This patch set implements two helper functions.
The helper bpf_perf_event_read_value reads counter/time_enabled/time_running
for perf event array map. The helper bpf_perf_prog_read_value read
counter/time_enabled/time_running for bpf prog with type BPF_PROG_TYPE_PERF_EVENT.
Changelogs:
v4->v5:
. fix some coding style issues
. memset the input buffer in case of error for ARG_PTR_TO_UNINIT_MEM
type of argument.
v3->v4:
. fix a build failure
v2->v3:
. counters should be read in order to read enabled/running time. This is to
prevent that counters and enabled/running time may be read separately.
v1->v2:
. reading enabled/running time should be together with reading counters
which contains the logic to ensure the result is valid.
Yonghong Song (4):
bpf: add helper bpf_perf_event_read_value for perf event array map
bpf: add a test case for helper bpf_perf_event_read_value
bpf: add helper bpf_perf_prog_read_value
bpf: add a test case for helper bpf_perf_prog_read_value
include/linux/perf_event.h | 7 ++-
include/uapi/linux/bpf.h | 27 ++++++++++-
kernel/bpf/arraymap.c | 2 +-
kernel/bpf/verifier.c | 4 +-
kernel/events/core.c | 16 +++++--
kernel/trace/bpf_trace.c | 74 ++++++++++++++++++++++++++++---
samples/bpf/trace_event_kern.c | 10 +++++
samples/bpf/trace_event_user.c | 13 +++---
samples/bpf/tracex6_kern.c | 26 +++++++++++
samples/bpf/tracex6_user.c | 13 +++++-
tools/include/uapi/linux/bpf.h | 4 +-
tools/testing/selftests/bpf/bpf_helpers.h | 6 +++
12 files changed, 182 insertions(+), 20 deletions(-)
--
2.9.5
^ permalink raw reply
* [PATCH net-next v5 3/4] bpf: add helper bpf_perf_prog_read_value
From: Yonghong Song @ 2017-09-20 6:09 UTC (permalink / raw)
To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20170920060935.1102268-1-yhs@fb.com>
This patch adds helper bpf_perf_prog_read_cvalue for perf event based bpf
programs, to read event counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
The typical use case for perf event based bpf program is to attach itself
to a single event. In such cases, if it is desirable to get scaling factor
between two bpf invocations, users can can save the time values in a map,
and use the value from the map and the current value to calculate
the scaling factor.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
include/linux/perf_event.h | 1 +
include/uapi/linux/bpf.h | 8 ++++++++
kernel/events/core.c | 1 +
kernel/trace/bpf_trace.c | 28 ++++++++++++++++++++++++++++
4 files changed, 38 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 21d8c12..79b18a2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -806,6 +806,7 @@ struct perf_output_handle {
struct bpf_perf_event_data_kern {
struct pt_regs *regs;
struct perf_sample_data *data;
+ struct perf_event *event;
};
#ifdef CONFIG_CGROUP_PERF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ccfe1b1..f3eeae2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -590,6 +590,13 @@ union bpf_attr {
* @buf: buf to fill
* @buf_size: size of the buf
* Return: 0 on success or negative error code
+ *
+ * int bpf_perf_prog_read_value(ctx, buf, buf_size)
+ * read perf prog attached perf event counter and enabled/running time
+ * @ctx: pointer to ctx
+ * @buf: buf to fill
+ * @buf_size: size of the buf
+ * Return : 0 on success or negative error code
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -647,6 +654,7 @@ union bpf_attr {
FN(sk_redirect_map), \
FN(sock_map_update), \
FN(perf_event_read_value), \
+ FN(perf_prog_read_value), \
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2d5bbe5..d039086 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8081,6 +8081,7 @@ static void bpf_overflow_handler(struct perf_event *event,
struct bpf_perf_event_data_kern ctx = {
.data = data,
.regs = regs,
+ .event = event,
};
int ret = 0;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 686dfa1..c4d617a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -612,6 +612,32 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
.arg3_type = ARG_ANYTHING,
};
+BPF_CALL_3(bpf_perf_prog_read_value_tp, struct bpf_perf_event_data_kern *, ctx,
+ struct bpf_perf_event_value *, buf, u32, size)
+{
+ int err;
+
+ if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+ return -EINVAL;
+
+ err = perf_event_read_local(ctx->event, &buf->counter, &buf->enabled,
+ &buf->running);
+ if (unlikely(err)) {
+ memset(buf, 0, size);
+ return err;
+ }
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_prog_read_value_proto_tp = {
+ .func = bpf_perf_prog_read_value_tp,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg3_type = ARG_CONST_SIZE,
+};
+
static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -619,6 +645,8 @@ static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
return &bpf_perf_event_output_proto_tp;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto_tp;
+ case BPF_FUNC_perf_prog_read_value:
+ return &bpf_perf_prog_read_value_proto_tp;
default:
return tracing_func_proto(func_id);
}
--
2.9.5
^ permalink raw reply related
* [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map
From: Yonghong Song @ 2017-09-20 6:09 UTC (permalink / raw)
To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20170920060935.1102268-1-yhs@fb.com>
Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.
Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
normalized_num_samples = num_samples * time_enabled / time_running
normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.
This patch adds helper bpf_perf_event_read_value for kprobed based perf
event array map, to read perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To achieve scaling factor between two bpf invocations, users
can can use cpu_id as the key (which is typical for perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
include/linux/perf_event.h | 6 ++++--
include/uapi/linux/bpf.h | 19 ++++++++++++++++++-
kernel/bpf/arraymap.c | 2 +-
kernel/bpf/verifier.c | 4 +++-
kernel/events/core.c | 15 ++++++++++++---
kernel/trace/bpf_trace.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
6 files changed, 79 insertions(+), 13 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..21d8c12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -884,7 +884,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
void *context);
extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
-int perf_event_read_local(struct perf_event *event, u64 *value);
+int perf_event_read_local(struct perf_event *event, u64 *value,
+ u64 *enabled, u64 *running);
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);
@@ -1286,7 +1287,8 @@ static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *
{
return ERR_PTR(-EINVAL);
}
-static inline int perf_event_read_local(struct perf_event *event, u64 *value)
+static inline int perf_event_read_local(struct perf_event *event, u64 *value,
+ u64 *enabled, u64 *running)
{
return -EINVAL;
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 43ab5c4..ccfe1b1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -582,6 +582,14 @@ union bpf_attr {
* @map: pointer to sockmap to update
* @key: key to insert/update sock in map
* @flags: same flags as map update elem
+ *
+ * int bpf_perf_event_read_value(map, flags, buf, buf_size)
+ * read perf event counter value and perf event enabled/running time
+ * @map: pointer to perf_event_array map
+ * @flags: index of event in the map or bitmask flags
+ * @buf: buf to fill
+ * @buf_size: size of the buf
+ * Return: 0 on success or negative error code
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -638,6 +646,7 @@ union bpf_attr {
FN(redirect_map), \
FN(sk_redirect_map), \
FN(sock_map_update), \
+ FN(perf_event_read_value), \
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -681,7 +690,9 @@ enum bpf_func_id {
#define BPF_F_ZERO_CSUM_TX (1ULL << 1)
#define BPF_F_DONT_FRAGMENT (1ULL << 2)
-/* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
+/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
+ * BPF_FUNC_perf_event_read_value flags.
+ */
#define BPF_F_INDEX_MASK 0xffffffffULL
#define BPF_F_CURRENT_CPU BPF_F_INDEX_MASK
/* BPF_FUNC_perf_event_output for sk_buff input context. */
@@ -864,4 +875,10 @@ enum {
#define TCP_BPF_IW 1001 /* Set TCP initial congestion window */
#define TCP_BPF_SNDCWND_CLAMP 1002 /* Set sndcwnd_clamp */
+struct bpf_perf_event_value {
+ __u64 counter;
+ __u64 enabled;
+ __u64 running;
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 98c0f00..68d8666 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -492,7 +492,7 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
ee = ERR_PTR(-EOPNOTSUPP);
event = perf_file->private_data;
- if (perf_event_read_local(event, &value) == -EOPNOTSUPP)
+ if (perf_event_read_local(event, &value, NULL, NULL) == -EOPNOTSUPP)
goto err_out;
ee = bpf_event_entry_gen(perf_file, map_file);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 799b245..1bf9d7b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1494,7 +1494,8 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
break;
case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
if (func_id != BPF_FUNC_perf_event_read &&
- func_id != BPF_FUNC_perf_event_output)
+ func_id != BPF_FUNC_perf_event_output &&
+ func_id != BPF_FUNC_perf_event_read_value)
goto error;
break;
case BPF_MAP_TYPE_STACK_TRACE:
@@ -1537,6 +1538,7 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
break;
case BPF_FUNC_perf_event_read:
case BPF_FUNC_perf_event_output:
+ case BPF_FUNC_perf_event_read_value:
if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
goto error;
break;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3e691b7..2d5bbe5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event *event)
* will not be local and we cannot read them atomically
* - must not have a pmu::count method
*/
-int perf_event_read_local(struct perf_event *event, u64 *value)
+int perf_event_read_local(struct perf_event *event, u64 *value,
+ u64 *enabled, u64 *running)
{
unsigned long flags;
int ret = 0;
+ u64 now;
/*
* Disabling interrupts avoids all counter scheduling (context
@@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
goto out;
}
+ now = event->shadow_ctx_time + perf_clock();
+ if (enabled)
+ *enabled = now - event->tstamp_enabled;
/*
* If the event is currently on this CPU, its either a per-task event,
* or local to this CPU. Furthermore it means its ACTIVE (otherwise
* oncpu == -1).
*/
- if (event->oncpu == smp_processor_id())
+ if (event->oncpu == smp_processor_id()) {
event->pmu->read(event);
-
+ if (running)
+ *running = now - event->tstamp_running;
+ } else if (running) {
+ *running = event->total_time_running;
+ }
*value = local64_read(&event->count);
out:
local_irq_restore(flags);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index dc498b6..686dfa1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -255,14 +255,13 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
return &bpf_trace_printk_proto;
}
-BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
-{
+static __always_inline int
+get_map_perf_counter(struct bpf_map *map, u64 flags,
+ u64 *value, u64 *enabled, u64 *running) {
struct bpf_array *array = container_of(map, struct bpf_array, map);
unsigned int cpu = smp_processor_id();
u64 index = flags & BPF_F_INDEX_MASK;
struct bpf_event_entry *ee;
- u64 value = 0;
- int err;
if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
return -EINVAL;
@@ -275,7 +274,15 @@ BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
if (!ee)
return -ENOENT;
- err = perf_event_read_local(ee->event, &value);
+ return perf_event_read_local(ee->event, value, enabled, running);
+}
+
+BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
+{
+ u64 value = 0;
+ int err;
+
+ err = get_map_perf_counter(map, flags, &value, NULL, NULL);
/*
* this api is ugly since we miss [-22..-2] range of valid
* counter values, but that's uapi
@@ -293,6 +300,33 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = {
.arg2_type = ARG_ANYTHING,
};
+BPF_CALL_4(bpf_perf_event_read_value, struct bpf_map *, map, u64, flags,
+ struct bpf_perf_event_value *, buf, u32, size)
+{
+ int err;
+
+ if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+ return -EINVAL;
+
+ err = get_map_perf_counter(map, flags, &buf->counter, &buf->enabled,
+ &buf->running);
+ if (unlikely(err)) {
+ memset(buf, 0, size);
+ return err;
+ }
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_event_read_value_proto = {
+ .func = bpf_perf_event_read_value,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg4_type = ARG_CONST_SIZE,
+};
+
static DEFINE_PER_CPU(struct perf_sample_data, bpf_sd);
static __always_inline u64
@@ -499,6 +533,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_perf_event_output_proto;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto;
+ case BPF_FUNC_perf_event_read_value:
+ return &bpf_perf_event_read_value_proto;
default:
return tracing_func_proto(func_id);
}
--
2.9.5
^ permalink raw reply related
* [PATCH net-next v5 4/4] bpf: add a test case for helper bpf_perf_prog_read_value
From: Yonghong Song @ 2017-09-20 6:09 UTC (permalink / raw)
To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20170920060935.1102268-1-yhs@fb.com>
The bpf sample program trace_event is enhanced to use the new
helper to print out enabled/running time.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
samples/bpf/trace_event_kern.c | 10 ++++++++++
samples/bpf/trace_event_user.c | 13 ++++++++-----
tools/include/uapi/linux/bpf.h | 3 ++-
tools/testing/selftests/bpf/bpf_helpers.h | 3 +++
4 files changed, 23 insertions(+), 6 deletions(-)
diff --git a/samples/bpf/trace_event_kern.c b/samples/bpf/trace_event_kern.c
index 41b6115..a77a583d 100644
--- a/samples/bpf/trace_event_kern.c
+++ b/samples/bpf/trace_event_kern.c
@@ -37,10 +37,14 @@ struct bpf_map_def SEC("maps") stackmap = {
SEC("perf_event")
int bpf_prog1(struct bpf_perf_event_data *ctx)
{
+ char time_fmt1[] = "Time Enabled: %llu, Time Running: %llu";
+ char time_fmt2[] = "Get Time Failed, ErrCode: %d";
char fmt[] = "CPU-%d period %lld ip %llx";
u32 cpu = bpf_get_smp_processor_id();
+ struct bpf_perf_event_value value_buf;
struct key_t key;
u64 *val, one = 1;
+ int ret;
if (ctx->sample_period < 10000)
/* ignore warmup */
@@ -54,6 +58,12 @@ int bpf_prog1(struct bpf_perf_event_data *ctx)
return 0;
}
+ ret = bpf_perf_prog_read_value(ctx, (void *)&value_buf, sizeof(struct bpf_perf_event_value));
+ if (!ret)
+ bpf_trace_printk(time_fmt1, sizeof(time_fmt1), value_buf.enabled, value_buf.running);
+ else
+ bpf_trace_printk(time_fmt2, sizeof(time_fmt2), ret);
+
val = bpf_map_lookup_elem(&counts, &key);
if (val)
(*val)++;
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 7bd827b..bf4f1b6 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -127,6 +127,9 @@ static void test_perf_event_all_cpu(struct perf_event_attr *attr)
int *pmu_fd = malloc(nr_cpus * sizeof(int));
int i, error = 0;
+ /* system wide perf event, no need to inherit */
+ attr->inherit = 0;
+
/* open perf_event on all cpus */
for (i = 0; i < nr_cpus; i++) {
pmu_fd[i] = sys_perf_event_open(attr, -1, i, -1, 0);
@@ -154,6 +157,11 @@ static void test_perf_event_task(struct perf_event_attr *attr)
{
int pmu_fd;
+ /* per task perf event, enable inherit so the "dd ..." command can be traced properly.
+ * Enabling inherit will cause bpf_perf_prog_read_time helper failure.
+ */
+ attr->inherit = 1;
+
/* open task bound event */
pmu_fd = sys_perf_event_open(attr, 0, -1, -1, 0);
if (pmu_fd < 0) {
@@ -175,14 +183,12 @@ static void test_bpf_perf_event(void)
.freq = 1,
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
- .inherit = 1,
};
struct perf_event_attr attr_type_sw = {
.sample_freq = SAMPLE_FREQ,
.freq = 1,
.type = PERF_TYPE_SOFTWARE,
.config = PERF_COUNT_SW_CPU_CLOCK,
- .inherit = 1,
};
struct perf_event_attr attr_hw_cache_l1d = {
.sample_freq = SAMPLE_FREQ,
@@ -192,7 +198,6 @@ static void test_bpf_perf_event(void)
PERF_COUNT_HW_CACHE_L1D |
(PERF_COUNT_HW_CACHE_OP_READ << 8) |
(PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16),
- .inherit = 1,
};
struct perf_event_attr attr_hw_cache_branch_miss = {
.sample_freq = SAMPLE_FREQ,
@@ -202,7 +207,6 @@ static void test_bpf_perf_event(void)
PERF_COUNT_HW_CACHE_BPU |
(PERF_COUNT_HW_CACHE_OP_READ << 8) |
(PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
- .inherit = 1,
};
struct perf_event_attr attr_type_raw = {
.sample_freq = SAMPLE_FREQ,
@@ -210,7 +214,6 @@ static void test_bpf_perf_event(void)
.type = PERF_TYPE_RAW,
/* Intel Instruction Retired */
.config = 0xc0,
- .inherit = 1,
};
printf("Test HW_CPU_CYCLES\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 79eb529..50d2bcd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -633,7 +633,8 @@ union bpf_attr {
FN(redirect_map), \
FN(sk_redirect_map), \
FN(sock_map_update), \
- FN(perf_event_read_value),
+ FN(perf_event_read_value), \
+ FN(perf_prog_read_value),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 08e6f8c..1d3dcd4 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -73,6 +73,9 @@ static int (*bpf_sock_map_update)(void *map, void *key, void *value,
static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
void *buf, unsigned int buf_size) =
(void *) BPF_FUNC_perf_event_read_value;
+static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
+ unsigned int buf_size) =
+ (void *) BPF_FUNC_perf_prog_read_value;
/* llvm builtin functions that eBPF C program may use to
--
2.9.5
^ permalink raw reply related
* [PATCH net-next v5 2/4] bpf: add a test case for helper bpf_perf_event_read_value
From: Yonghong Song @ 2017-09-20 6:09 UTC (permalink / raw)
To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20170920060935.1102268-1-yhs@fb.com>
The bpf sample program tracex6 is enhanced to use the new
helper to read enabled/running time as well.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
samples/bpf/tracex6_kern.c | 26 ++++++++++++++++++++++++++
samples/bpf/tracex6_user.c | 13 ++++++++++++-
tools/include/uapi/linux/bpf.h | 3 ++-
tools/testing/selftests/bpf/bpf_helpers.h | 3 +++
4 files changed, 43 insertions(+), 2 deletions(-)
diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
index e7d1803..46c557a 100644
--- a/samples/bpf/tracex6_kern.c
+++ b/samples/bpf/tracex6_kern.c
@@ -15,6 +15,12 @@ struct bpf_map_def SEC("maps") values = {
.value_size = sizeof(u64),
.max_entries = 64,
};
+struct bpf_map_def SEC("maps") values2 = {
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(int),
+ .value_size = sizeof(struct bpf_perf_event_value),
+ .max_entries = 64,
+};
SEC("kprobe/htab_map_get_next_key")
int bpf_prog1(struct pt_regs *ctx)
@@ -37,5 +43,25 @@ int bpf_prog1(struct pt_regs *ctx)
return 0;
}
+SEC("kprobe/htab_map_lookup_elem")
+int bpf_prog2(struct pt_regs *ctx)
+{
+ u32 key = bpf_get_smp_processor_id();
+ struct bpf_perf_event_value *val, buf;
+ int error;
+
+ error = bpf_perf_event_read_value(&counters, key, &buf, sizeof(buf));
+ if (error)
+ return 0;
+
+ val = bpf_map_lookup_elem(&values2, &key);
+ if (val)
+ *val = buf;
+ else
+ bpf_map_update_elem(&values2, &key, &buf, BPF_NOEXIST);
+
+ return 0;
+}
+
char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
index a05a99a..3341a96 100644
--- a/samples/bpf/tracex6_user.c
+++ b/samples/bpf/tracex6_user.c
@@ -22,6 +22,7 @@
static void check_on_cpu(int cpu, struct perf_event_attr *attr)
{
+ struct bpf_perf_event_value value2;
int pmu_fd, error = 0;
cpu_set_t set;
__u64 value;
@@ -46,8 +47,18 @@ static void check_on_cpu(int cpu, struct perf_event_attr *attr)
fprintf(stderr, "Value missing for CPU %d\n", cpu);
error = 1;
goto on_exit;
+ } else {
+ fprintf(stderr, "CPU %d: %llu\n", cpu, value);
+ }
+ /* The above bpf_map_lookup_elem should trigger the second kprobe */
+ if (bpf_map_lookup_elem(map_fd[2], &cpu, &value2)) {
+ fprintf(stderr, "Value2 missing for CPU %d\n", cpu);
+ error = 1;
+ goto on_exit;
+ } else {
+ fprintf(stderr, "CPU %d: counter: %llu, enabled: %llu, running: %llu\n", cpu,
+ value2.counter, value2.enabled, value2.running);
}
- fprintf(stderr, "CPU %d: %llu\n", cpu, value);
on_exit:
assert(bpf_map_delete_elem(map_fd[0], &cpu) == 0 || error);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 461811e..79eb529 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -632,7 +632,8 @@ union bpf_attr {
FN(skb_adjust_room), \
FN(redirect_map), \
FN(sk_redirect_map), \
- FN(sock_map_update),
+ FN(sock_map_update), \
+ FN(perf_event_read_value),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 36fb916..08e6f8c 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -70,6 +70,9 @@ static int (*bpf_sk_redirect_map)(void *map, int key, int flags) =
static int (*bpf_sock_map_update)(void *map, void *key, void *value,
unsigned long long flags) =
(void *) BPF_FUNC_sock_map_update;
+static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
+ void *buf, unsigned int buf_size) =
+ (void *) BPF_FUNC_perf_event_read_value;
/* llvm builtin functions that eBPF C program may use to
--
2.9.5
^ permalink raw reply related
* Re: Regression in throughput between kvm guests over virtual bridge
From: Jason Wang @ 2017-09-20 6:27 UTC (permalink / raw)
To: Matthew Rosato, netdev; +Cc: davem, mst
In-Reply-To: <7d444584-3854-ace2-008d-0fdef1c9cef4@linux.vnet.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 4826 bytes --]
On 2017年09月19日 02:11, Matthew Rosato wrote:
> On 09/18/2017 03:36 AM, Jason Wang wrote:
>>
>> On 2017年09月18日 11:13, Jason Wang wrote:
>>>
>>> On 2017年09月16日 03:19, Matthew Rosato wrote:
>>>>> It looks like vhost is slowed down for some reason which leads to more
>>>>> idle time on 4.13+VHOST_RX_BATCH=1. Appreciated if you can collect the
>>>>> perf.diff on host, one for rx and one for tx.
>>>>>
>>>> perf data below for the associated vhost threads, baseline=4.12,
>>>> delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1
>>>>
>>>> Client vhost:
>>>>
>>>> 60.12% -11.11% -12.34% [kernel.vmlinux] [k] raw_copy_from_user
>>>> 13.76% -1.28% -0.74% [kernel.vmlinux] [k] get_page_from_freelist
>>>> 2.00% +3.69% +3.54% [kernel.vmlinux] [k] __wake_up_sync_key
>>>> 1.19% +0.60% +0.66% [kernel.vmlinux] [k] __alloc_pages_nodemask
>>>> 1.12% +0.76% +0.86% [kernel.vmlinux] [k] copy_page_from_iter
>>>> 1.09% +0.28% +0.35% [vhost] [k] vhost_get_vq_desc
>>>> 1.07% +0.31% +0.26% [kernel.vmlinux] [k] alloc_skb_with_frags
>>>> 0.94% +0.42% +0.65% [kernel.vmlinux] [k] alloc_pages_current
>>>> 0.91% -0.19% -0.18% [kernel.vmlinux] [k] memcpy
>>>> 0.88% +0.26% +0.30% [kernel.vmlinux] [k] __next_zones_zonelist
>>>> 0.85% +0.05% +0.12% [kernel.vmlinux] [k] iov_iter_advance
>>>> 0.79% +0.09% +0.19% [vhost] [k] __vhost_add_used_n
>>>> 0.74% [kernel.vmlinux] [k] get_task_policy.part.7
>>>> 0.74% -0.01% -0.05% [kernel.vmlinux] [k] tun_net_xmit
>>>> 0.60% +0.17% +0.33% [kernel.vmlinux] [k] policy_nodemask
>>>> 0.58% -0.15% -0.12% [ebtables] [k] ebt_do_table
>>>> 0.52% -0.25% -0.22% [kernel.vmlinux] [k] __alloc_skb
>>>> ...
>>>> 0.42% +0.58% +0.59% [kernel.vmlinux] [k] eventfd_signal
>>>> ...
>>>> 0.32% +0.96% +0.93% [kernel.vmlinux] [k] finish_task_switch
>>>> ...
>>>> +1.50% +1.16% [kernel.vmlinux] [k] get_task_policy.part.9
>>>> +0.40% +0.42% [kernel.vmlinux] [k] __skb_get_hash_symmetr
>>>> +0.39% +0.40% [kernel.vmlinux] [k] _copy_from_iter_full
>>>> +0.24% +0.23% [vhost_net] [k] vhost_net_buf_peek
>>>>
>>>> Server vhost:
>>>>
>>>> 61.93% -10.72% -10.91% [kernel.vmlinux] [k] raw_copy_to_user
>>>> 9.25% +0.47% +0.86% [kernel.vmlinux] [k] free_hot_cold_page
>>>> 5.16% +1.41% +1.57% [vhost] [k] vhost_get_vq_desc
>>>> 5.12% -3.81% -3.78% [kernel.vmlinux] [k] skb_release_data
>>>> 3.30% +0.42% +0.55% [kernel.vmlinux] [k] raw_copy_from_user
>>>> 1.29% +2.20% +2.28% [kernel.vmlinux] [k] copy_page_to_iter
>>>> 1.24% +1.65% +0.45% [vhost_net] [k] handle_rx
>>>> 1.08% +3.03% +2.85% [kernel.vmlinux] [k] __wake_up_sync_key
>>>> 0.96% +0.70% +1.10% [vhost] [k] translate_desc
>>>> 0.69% -0.20% -0.22% [kernel.vmlinux] [k] tun_do_read.part.10
>>>> 0.69% [kernel.vmlinux] [k] tun_peek_len
>>>> 0.67% +0.75% +0.78% [kernel.vmlinux] [k] eventfd_signal
>>>> 0.52% +0.96% +0.98% [kernel.vmlinux] [k] finish_task_switch
>>>> 0.50% +0.05% +0.09% [vhost] [k] vhost_add_used_n
>>>> ...
>>>> +0.63% +0.58% [vhost_net] [k] vhost_net_buf_peek
>>>> +0.32% +0.32% [kernel.vmlinux] [k] _copy_to_iter
>>>> +0.19% +0.19% [kernel.vmlinux] [k] __skb_get_hash_symmetr
>>>> +0.11% +0.21% [vhost] [k] vhost_umem_interval_tr
>>>>
>>> Looks like for some unknown reason which leads more wakeups.
>>>
>>> Could you please try to attached patch to see if it solves or mitigate
>>> the issue?
>>>
>>> Thanks
>> My bad, please try this.
>>
>> Thanks
> Thanks Jason. Built 4.13 + supplied patch, I see some decrease in
> wakeups, but there's still quite a bit more compared to 4.12
> (baseline=4.12, delta1=4.13, delta2=4.13+patch):
>
> client:
> 2.00% +3.69% +2.55% [kernel.vmlinux] [k] __wake_up_sync_key
>
> server:
> 1.08% +3.03% +1.85% [kernel.vmlinux] [k] __wake_up_sync_key
>
>
> Throughput was roughly equivalent to base 4.13 (so, still seeing the
> regression w/ this patch applied).
>
Seems to make some progress on wakeup mitigation. Previous patch tries
to reduce the unnecessary traversal of waitqueue during rx. Attached
patch goes even further which disables rx polling during processing tx.
Please try it to see if it has any difference.
And two questions:
- Is the issue existed if you do uperf between 2VMs (instead of 4VMs)
- Can enable batching in the tap of sending VM improve the performance
(ethtool -C $tap rx-frames 64)
Thanks
[-- Attachment #2: 0001-vhost_net-avoid-unnecessary-wakeups-during-tx.patch --]
[-- Type: text/x-patch, Size: 1938 bytes --]
>From d57ad96083fc57205336af1b5ea777e5185f1581 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Wed, 20 Sep 2017 11:44:49 +0800
Subject: [PATCH] vhost_net: avoid unnecessary wakeups during tx
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
drivers/vhost/net.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ed476fa..e7349cf 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -444,8 +444,11 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
{
+ struct vhost_net_virtqueue *rx_nvq = &net->vqs[VHOST_NET_VQ_RX];
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
+ struct vhost_virtqueue *rx_vq = &rx_nvq->vq;
+
unsigned out, in;
int head;
struct msghdr msg = {
@@ -462,6 +465,10 @@ static void handle_tx(struct vhost_net *net)
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
+ mutex_lock(&rx_vq->mutex);
+ vhost_net_disable_vq(net, rx_vq);
+ mutex_unlock(&rx_vq->mutex);
+
mutex_lock(&vq->mutex);
sock = vq->private_data;
if (!sock)
@@ -574,13 +581,21 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
- if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
- vhost_poll_queue(&vq->poll);
+ if (unlikely(total_len >= VHOST_NET_WEIGHT))
break;
- }
}
out:
mutex_unlock(&vq->mutex);
+
+ mutex_lock(&rx_vq->mutex);
+ vhost_net_enable_vq(net, rx_vq);
+ mutex_unlock(&rx_vq->mutex);
+
+ if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+ mutex_lock(&vq->mutex);
+ vhost_poll_queue(&vq->poll);
+ mutex_unlock(&vq->mutex);
+ }
}
static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
--
1.8.3.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox