Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Guedes, Andre @ 2017-10-02 23:06 UTC (permalink / raw)
  To: levipearson@gmail.com, rodney.cummings@ni.com
  Cc: Sanchez-Palencia, Jesus, netdev@vger.kernel.org, Gomes, Vinicius,
	Briano, Ivan, richardcochran@gmail.com, henrik@austad.us
In-Reply-To: <CAEYbN3STY+7ZOodQLaP4MfbvwCovWsav52PXBmjHND7QOO=srg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 10530 bytes --]

Hi all,

On Mon, 2017-10-02 at 12:45 -0600, Levi Pearson wrote:
> Hi Rodney,
> 
> Some archives seem to have threaded it, but I have CC'd the
> participants I saw in the original discussion thread since they may
> not otherwise notice it amongst the normal traffic.
> 
> On Fri, Sep 29, 2017 at 2:44 PM, Rodney Cummings <rodney.cummings@ni.com>
> wrote:

[...]

> > 1. Question: From an 802.1 perspective, is this RFC intended to support
> > end-station (e.g. NIC in host), bridges (i.e. DSA), or both?
> > 
> > This is very important to clarify, because the usage of this interface
> > will be very different for one or the other.
> > 
> > For a bridge, the user code typically represents a remote management
> > protocol (e.g. SNMP, NETCONF, RESTCONF), and this interface is
> > expected to align with the specifications of 802.1Q clause 12,
> > which serves as the information model for management. Historically,
> > a standard kernel interface for management hasn't been viewed as
> > essential, but I suppose it wouldn't hurt.
> 
> I don't think the proposal was meant to cover the case of non-local
> switch hardware, but in addition to dsa and switchdev switch ICs
> managed by embedded Linux-running SoCs, there are SoCs with embedded
> small port count switches or even plain multiple NICs with software
> bridging. Many of these embedded small port count switches have FQTSS
> hardware that could potentially be configured by the proposed cbs
> qdisc. This blurs the line somewhat between what is a "bridge" and
> what is an "end-station" in 802.1Q terminology, but nevertheless these
> devices exist, sometimes acting as an endpoint + a real bridge and
> sometimes as just a system with multiple network interfaces.

During the development of this proposal, we were most focused on end-station
use-cases. We considered some bridge use-cases as well just to verify that the
proposed design wouldn't be an issue if someone else goes for it.

We agree that the line between end-station and bridge can be a bit blurred (in
this case). Even though we designed this interface with end-station use-cases
in mind, if the proposed infrastructure could be used as is in bridge use-
cases, good.

> > For an end station, the user code can be an implementation of SRP
> > (802.1Q clause 35), or it can be an application-specific
> > protocol (e.g. industrial fieldbus) that exchanges data according
> > to P802.1Qcc clause 46. Either way, the top-level user interface
> > is designed for individual streams, not queues and shapers. That
> > implies some translation code between that top-level interface
> > and this sort of kernel interface.

Yes, you're right. Our understanding is that the top-level interfaces should be
implemented at user space as well as any stream management functionality. The
idea here is to keep the kernel-side as simple as possible. The kernel handles
hardware configuration (via Traffic Control interface) while the user space
handles TSN streams i.e. the kernel provides the mechanism and the user space
provides the policy.

> > As a specific end-station example, for CBS, 802.1Q-2014 subclause
> > 34.6.1 requires "per-stream queues" in the Talker end-station.
> > I don't see 34.6.1 represented in the proposed RFC, but that's
> > okay... maybe per-stream queues are implemented in user code.
> > Nevertheless, if that is the assumption, I think we need to
> > clarify, especially in examples.
> 
> You're correct that the FQTSS credit-based shaping algorithm requires
> per-stream shaping by Talker endpoints as well, but this is in
> addition to the per-class shaping provided by most hardware shaping
> implementations that I'm aware of in endpoint network hardware. I
> agree that we need to document the need to provide this, but it can
> definitely be built on top of the current proposal.
> 
> I believe the per-stream shaping could be managed either by a user
> space application that manages all use of a streaming traffic class,
> or through an additional qdisc module that performs per-stream
> management on top of the proposed cbs qdisc, ensuring that the
> frames-per-observation interval aspect of each stream's reservation is
> obeyed. This becomes a fairly simple qdisc to implement on top of a
> per-traffic class shaper, and could even be implemented with the help
> of the timestamp that the SO_TXTIME proposal adds to skbuffs, but I
> think keeping the layers separate provides more flexibility to
> implementations and keeps management of various kinds of hardware
> offload support simpler as well.

Indeed, 'per-stream queue' is not covered in this RFC. For now, we expect it to
be implemented in user code. We believe the proposed CBS qdisc could be
extended to support a full software-based implementation which would be used to
implement 'per-stream queue' support. This functionality should be addressed by
a separated series.

Anyways, we're about to send the v3 patchset implementing this proposal and
we'll make it clear.

> > 2. Suggestion: Do not assume that a time-aware (i.e. scheduled)
> > end-station will always use 802.1Qbv.
> > 
> > For those who are subscribed to the 802.1 mailing list,
> > I'd suggest a read of draft P802.1Qcc/D1.6, subclause U.1
> > of Annex U. Subclause U.1 assumes that bridges in the network use
> > 802.1Qbv, and then it poses the question of what an end-station
> > Talker should do. If the end-station also uses 802.1Qbv,
> > and that end-station transmits multiple streams, 802.1Qbv is
> > a bad implementation. The reason is that the scheduling
> > (i.e. order in time) of each stream cannot be controlled, which
> > in turn means that the CNC (network manager) cannot optimize
> > the 802.1Qbv schedules in bridges. The preferred technique
> > is to use "per-stream scheduling" in each Talker, so that
> > the CNC can create an optimal schedules (i.e. best determinism).
> > 
> > I'm aware of a small number of proprietary CNC implementations for
> > 802.1Qbv in bridges, and they are generally assuming per-stream
> > scheduling in end-stations (Talkers).
> > 
> > The i210 NIC's LaunchTime can be used to implement per-stream
> > scheduling. I haven't looked at SO_TXTIME in detail, but it sounds
> > like per-stream scheduling. If so, then we already have the
> > fundamental building blocks for a complete implementation
> > of a time-aware end-station.
> > 
> > If we answer the preceding question #1 as "end-station only",
> > I would recommend avoiding 802.1Qbv in this interface. There
> > isn't really anything wrong with it per-se, but it would lead
> > developers down the wrong path.
> 
> In some situations, such as device nodes that each incorporate a small
> port count switch for the purpose of daisy-chaining a segment of the
> network, "end stations" must do a limited subset of local bridge
> management as well. I'm not sure how common this is going to be for
> industrial control applications, but I know there are audio and
> automotive applications built this way.
> 
> One particular device I am working with now provides all network
> access through a DSA switch chip with hardware Qbv support in addtion
> to hardware Qav support. The SoC attached to it has no hardware timed
> launch (SO_TXTIME) support. In this case, although the proposed
> interface for Qbv is not *sufficient* to make a working time-aware end
> station, it does provide a usable building block to provide one. As
> with the credit-based shaping system, Talkers must provide an
> additional level of per-stream shaping as well, but this is largely
> (absent the jitter calculations, which are sort of a middle-level
> concern) independent of what sort of hardware offload of the
> scheduling is provided.
> 
> Both Qbv windows and timed launch support do roughly the same thing;
> they *delay* the launch of a hardware-queued frame so it can egress at
> a precisely specified time, and at least with the i210 and Qbv, ensure
> that no other traffic will be in-progress when that time arrives. For
> either to be used effectively, the application still has to prepare
> the frame slightly ahead-of-time and thus must have the same level of
> time-awareness. This is, again, largely independent of what kind of
> hardware offloading support is provided and is also largely
> independent of the network stack itself. Neither queue window
> management nor SO_TXTIME help the application present its
> time-sensitive traffic at the right time; that's a matter to be worked
> out with the application taking advantage of PTP and the OS scheduler.
> Whether you rely on managed windows or hardware launch time to provide
> the precisely correct amount of delay beyond that is immaterial to the
> application. In the absence of SO_TXTIME offloading (or even with it,
> and in the presence of sufficient OS scheduling jitter), an additional
> layer may need to be provided to ensure different applications' frames
> are queued in the correct order for egress during the window. Again,
> this could be a purely user-space application multiplexer or a
> separate qdisc module.
> 
> I wholeheartedly agree with you and Richard that we ought to
> eventually provide application-level APIs that don't require users to
> have deep knowledge of various 802.1Q intricacies. But I believe that
> the hardware offloading capability being provided now, and the variety
> of the way things are hooked up in real hardware, suggests that we
> ought to also build the support for the underlying protocols in layers
> so that we don't create unnecessary mismatches between offloading
> capability (which can be essential to overall network performance) and
> APIs, such that one configuration of offload support is privileged
> above others even when comparable scheduling accuracy could be
> provided by either.
> 
> In any case, only the cbs qdisc has been included in the post-RFC
> patch cover page for its last couple of iterations, so there is plenty
> of time to discuss how time-aware shaping, preemption, etc. management
> should occur beyond the cbs and SO_TXTIME proposals.

Yes, based on the previous feedback about the Qbv offloading interface
('taprio'), we've decided to postpone its proposal until we have NICs
supporting Qbv and more realistic use-cases. The current proposal covers only
FQTSS.

Thanks for your feedback!

Best regards,

Andre

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 3262 bytes --]

^ permalink raw reply

* Re: [PATCH] fsl/fman: remove of_node
From: David Miller @ 2017-10-02 23:04 UTC (permalink / raw)
  To: madalin.bucur; +Cc: netdev, f.fainelli, linux-kernel
In-Reply-To: <1506940297-20442-1-git-send-email-madalin.bucur@nxp.com>

From: Madalin Bucur <madalin.bucur@nxp.com>
Date: Mon, 2 Oct 2017 13:31:37 +0300

> The FMan MAC driver allocates a platform device for the Ethernet
> driver to probe on. Setting pdev->dev.of_node with the MAC node
> triggers the MAC driver probing of the new platform device. While
> this fails quickly and does not affect the functionality of the
> drivers, it is incorrect and must be removed. This was added to
> address a report that DSA code using of_find_net_device_by_node()
> is unable to use the DPAA interfaces. Error message seen before
> this fix:
> 
> fsl_mac dpaa-ethernet.0: __devm_request_mem_region(mac) failed
> fsl_mac: probe of dpaa-ethernet.0 failed with error -16
> 
> Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>

Is the DSA issue no longer something we need to be concerned
about?  If not, why?  You have to explain this.

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Improve output for unsupported stats
From: David Miller @ 2017-10-02 23:03 UTC (permalink / raw)
  To: andrew; +Cc: jallen, netdev, tlfalcon
In-Reply-To: <20171002215723.GN17713@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 2 Oct 2017 23:57:23 +0200

> On Mon, Oct 02, 2017 at 03:31:39PM -0500, John Allen wrote:
>> The vnic server can report -1 in the event that a given statistic is not
>> supported. Currently, the -1 value is implicitly cast to an unsigned
>> integer and appears through the ethtool -S output as a very large number.
>> This patch improves this behavior by reporting 0 in the event that a
>> given statistic is not supported.
> 
> Hi John
> 
> If it does not exist, why not skip it altogether?
> ibmvnic_get_sset_count() could walk the list and count the valid ones.
> ibmvnic_get_strings() could only return the name of the valid onces.

Agreed, skipping would be preferable to printing zero values.

^ permalink raw reply

* RE: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Rodney Cummings @ 2017-10-02 22:52 UTC (permalink / raw)
  To: Levi Pearson
  Cc: Linux Kernel Network Developers, Vinicius Costa Gomes,
	Henrik Austad, richardcochran@gmail.com,
	jesus.sanchez-palencia@intel.com, andre.guedes@intel.com
In-Reply-To: <CAEYbN3TWwPjoHW-wQ9DC=ubXELoZ4jbMUTX_+D3Rn_h=6LHvDA@mail.gmail.com>

Thanks Levi,

Great discussion. It sounds like we're aligned.

Qbv is essential in switches. My concern was the end-station side only.

You're absolutely right that i210 LaunchTime doesn't do everything that is needed 
for scheduling. It still requires quite a bit of frame-juggling to ensure that
transmit descriptors are ordered by time, and that the control system's cyclic
traffic is repeated as expected. Like you, I'm not sure where that code is best
located in the long run.

For a hardware design that connects an FPGA to one of the internal ports of a
Qbv-capable switch, the preferable scheduling design can be done there.
The FPGA IP essentially implements a cyclic schedule of transmit descriptors,
each with its own buffer that can be written atomically by application code.
For a control system (i.e. industrial field-level or automotive control),
each buffer (stream) can hold a single frame. If you take a look at the
"Network Interface" specs for FlexRay in automotive, they went so far as to
mandate that design for a NIC. The IP for that sort of thing
doesn't take alot of gates or memory, so I can see why FlexRay went that way.

That sort of design wasn't destined to happen as part of Qbv, because the switch
vendors are worried about things like 64-port switches. Also, if we treat Qbv
as applying to switches only, FlexRay-style IP isn't needed. For the switch,
we only need it to "get out of the way", and Qbv is fine for that.

I may be naive and idealistic, but in the long run, I'm still hopeful that an
Ethernet NIC vendor will implement Flexray-style scheduling for end-station. 
We need a NIC trend-setter to do it as a differentiator, and the rest will 
probably follow along. That completely eliminates code in Linux for the 
frame-juggling. We'll need to provide a way to open a socket for a specific 
stream, but that doesn't seem too challenging.

Rodney




> -----Original Message-----
> From: Levi Pearson [mailto:levipearson@gmail.com]
> Sent: Monday, October 2, 2017 4:49 PM
> To: Rodney Cummings <rodney.cummings@ni.com>
> Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>; Vinicius
> Costa Gomes <vinicius.gomes@intel.com>; Henrik Austad <henrik@austad.us>;
> richardcochran@gmail.com; jesus.sanchez-palencia@intel.com;
> andre.guedes@intel.com
> Subject: Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for
> traffic shapers
> 
> Hi Rodney,
> 
> On Mon, Oct 2, 2017 at 1:40 PM, Rodney Cummings <rodney.cummings@ni.com>
> wrote:
> 
> > It's a shame that someone built such hardware. Speaking as a
> manufacturer
> > of daisy-chainable products for industrial/automotive applications, I
> > wouldn't use that hardware in my products. The whole point of scheduling
> > is to obtain close-to-optimal network determinism. If the hardware
> > doesn't schedule properly, the product is probably better off using CBS.
> 
> I should note that the hardware I'm using was not designed to be a TSN
> endpoint. It just happens that the hardware, designed for other
> reasons, happens to have hardware support for Qbv and thus makes it an
> interesting and inexpensive platform on which to do Qbv experiments.
> Nevertheless, I believe the right combination of driver support for
> the hardware shaping and appropriate software in the network stack
> would allow it to schedule things fairly precisely in the
> application-centric manner you've described. I will ultimately defer
> to your judgement on that, though, and spend some time with the Qcc
> draft to see where I may be wrong.
> 
> > It would be great if all of the user-level application code was
> scheduled
> > as tightly as the network, but the reality is that we aren't there yet.
> > Also, even if the end-station has a separately timed RT Linux thread for
> > each stream, the order of those threads can vary. Therefore, when the
> > end-station (SoC) has more than one talker, the frames for each talker
> > can be written into the Qbv queue at any time, in any order.
> > There is no scheduling of frames (i.e. streams)... only the queue.
> 
> Frames are presented to *the kernel* in any order. The qdisc
> infrastructure provides mechanisms by which other scheduling policy
> can be added between presentation to the kernel and presentation to
> hardware queues.
> 
> Even with i210-style LaunchTime functionality, this must also be
> provided. The hardware cannot re-order time-scheduled frames, and
> reviewers of the SO_TXTIME proposal rejected the idea of having the
> driver try to re-order scheduled frames already submitted to the
> hardware-visible descriptor chains, and rightly so I think.
> 
> So far, I have thought of 3 places this ordering constraint might be
> provided:
> 
> 1. Built into each driver with offload support for SO_TXTIME
> 2. A qdisc module that will re-order scheduled skbuffs within a certain
> window
> 3. Left to userspace to coordinate
> 
> I think 1 is particularly bad due to duplication of code. 3 is the
> easy default from the point of view of the kernel, but not attractive
> from the point of view of precise scheduling and low latency to
> egress. Unless someone has a better idea, I think 2 would be the most
> appropriate choice. I can think of a number of ways one might be
> designed, but I have not thought them through fully yet.
> 
> > From the perspective of the CNC, that means that a Qbv-only end-station
> is
> > sloppy. In contrast, an end-station using per-stream scheduling
> > (e.g. i210 with LaunchTime) is precise, because each frame has a known
> time.
> > It's true that P802.1Qcc supports both, so the CNC might support both.
> > Nevertheless, the CNC is likely to compute a network-wide schedule
> > that optimizes for the per-stream end-stations, and locates the windows
> > for the Qbv-only end-stations in a slower interval.
> 
> Because out-of-order submission of frames to LaunchTime hardware would
> result in the frame that is enqueued later but scheduled earlier only
> transmitting once the later-scheduled frame completes, enqueuing
> frames immediately with LaunchTime information results in even *worse*
> sloppiness if two applications don't always submit their frames in the
> correct order to the kernel during a cycle. If your queue is simply
> gated at the time of the first scheduled launch, the frame that should
> have been first will only be late by the length of the queued-first
> frame. If they both have precise launch times, the frame that should
> have been first will have to wait until the normal launch time of the
> later-scheduled frame plus the time it takes for it to egress!
> 
> In either case, re-ordering is required if it is not controlled for in
> userspace. In other words, I am not disagreeing with you about the
> insufficiency of a Qbv shaper for an endpoint! I am just pointing out
> that neither hardware LaunchTime support nor hardware Qbv window
> support are sufficient by themselves. Nor, for that case, is the CBS
> offload support sufficient to meet multi-stream requirements from SRP.
> All of these provide a low-layer mechanism by which appropriate
> scheduling of a multi-stream or multi-application endpoint can be
> enhanced, but none of them provide the complete story by themselves.
> 
> > I realize that we are only doing CBS for now. I guess my main point is
> that
> > if/when we add scheduling, it cannot be limited to Qbv. Per-stream
> scheduling
> > is essential, because that is the assumption for most CNC designs that
> > I'm aware of.
> 
> I understand and completely agree. I have believed that per-stream
> shaping/scheduling in some form is essential from the beginning.
> Likewise, CBS is not sufficient in itself for multi-stream systems in
> which multiple applications provide streams scheduled for the same
> traffic class. But CBS and SO_TXTIME are good first steps; they
> provides hooks for drivers to finally support the mechanisms provided
> by hardware, and also in the case of SO_TXTIME a hook on which further
> time-sensitive enhancements to the network stack can be hung, such as
> qdiscs that can store and release frames based on scheduled launch
> times. When these are in place, I think stream-based scheduling via
> qdisc would be a reasonable next step; perhaps a common design could
> be found to work both for media streams over CBS and industrial
> streams scheduled by launch time.
> 
> 
> Levi

^ permalink raw reply

* [PATCH net-next v6 0/4] bpf: add two helpers to read perf event enabled/running time
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team

Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.

This patch set implements two helper functions.
The helper bpf_perf_event_read_value reads counter/time_enabled/time_running
for perf event array map. The helper bpf_perf_prog_read_value read
counter/time_enabled/time_running for bpf prog with type BPF_PROG_TYPE_PERF_EVENT.

[Dave, Peter,

 Previous communcation shows that this patch may potentially have
 merge conflict with upcoming tip changes in the next merge window.

 Could you advise how this patch should proceed?

 Thanks!
]

Changelogs:
v5->v6:
  . rebase on top of net-next
v4->v5:
  . fix some coding style issues
  . memset the input buffer in case of error for ARG_PTR_TO_UNINIT_MEM
    type of argument.
v3->v4:
  . fix a build failure
v2->v3:
  . counters should be read in order to read enabled/running time. This is to
    prevent that counters and enabled/running time may be read separately.
v1->v2:
  . reading enabled/running time should be together with reading counters
    which contains the logic to ensure the result is valid.

Yonghong Song (4):
  bpf: add helper bpf_perf_event_read_value for perf event array map
  bpf: add a test case for helper bpf_perf_event_read_value
  bpf: add helper bpf_perf_prog_read_value
  bpf: add a test case for helper bpf_perf_prog_read_value

 include/linux/perf_event.h                |  7 ++-
 include/uapi/linux/bpf.h                  | 28 +++++++++++-
 kernel/bpf/arraymap.c                     |  2 +-
 kernel/bpf/verifier.c                     |  4 +-
 kernel/events/core.c                      | 16 +++++--
 kernel/trace/bpf_trace.c                  | 74 ++++++++++++++++++++++++++++---
 samples/bpf/trace_event_kern.c            | 10 +++++
 samples/bpf/trace_event_user.c            | 13 +++---
 samples/bpf/tracex6_kern.c                | 26 +++++++++++
 samples/bpf/tracex6_user.c                | 13 +++++-
 tools/include/uapi/linux/bpf.h            |  4 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  6 +++
 12 files changed, 182 insertions(+), 21 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH net-next v6 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.

This patch adds helper bpf_perf_event_read_value for kprobed based perf
event array map, to read perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To achieve scaling factor between two bpf invocations, users
can can use cpu_id as the key (which is typical for perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/perf_event.h |  6 ++++--
 include/uapi/linux/bpf.h   | 20 ++++++++++++++++++--
 kernel/bpf/arraymap.c      |  2 +-
 kernel/bpf/verifier.c      |  4 +++-
 kernel/events/core.c       | 15 ++++++++++++---
 kernel/trace/bpf_trace.c   | 46 +++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..21d8c12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -884,7 +884,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 				void *context);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
 				int src_cpu, int dst_cpu);
-int perf_event_read_local(struct perf_event *event, u64 *value);
+int perf_event_read_local(struct perf_event *event, u64 *value,
+			  u64 *enabled, u64 *running);
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
@@ -1286,7 +1287,8 @@ static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline int perf_event_read_local(struct perf_event *event, u64 *value)
+static inline int perf_event_read_local(struct perf_event *event, u64 *value,
+					u64 *enabled, u64 *running)
 {
 	return -EINVAL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6d2137b..4bdd4b2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -592,6 +592,13 @@ union bpf_attr {
  *     @xdp_md: pointer to xdp_md
  *     @delta: An positive/negative integer to be added to xdp_md.data_meta
  *     Return: 0 on success or negative on error
+ * int bpf_perf_event_read_value(map, flags, buf, buf_size)
+ *     read perf event counter value and perf event enabled/running time
+ *     @map: pointer to perf_event_array map
+ *     @flags: index of event in the map or bitmask flags
+ *     @buf: buf to fill
+ *     @buf_size: size of the buf
+ *     Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -648,7 +655,8 @@ union bpf_attr {
 	FN(redirect_map),		\
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
-	FN(xdp_adjust_meta),
+	FN(xdp_adjust_meta),		\
+	FN(perf_event_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -692,7 +700,9 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX		(1ULL << 1)
 #define BPF_F_DONT_FRAGMENT		(1ULL << 2)
 
-/* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
+/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
+ * BPF_FUNC_perf_event_read_value flags.
+ */
 #define BPF_F_INDEX_MASK		0xffffffffULL
 #define BPF_F_CURRENT_CPU		BPF_F_INDEX_MASK
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
@@ -885,4 +895,10 @@ enum {
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
 
+struct bpf_perf_event_value {
+	__u64 counter;
+	__u64 enabled;
+	__u64 running;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 98c0f00..68d8666 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -492,7 +492,7 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map,
 
 	ee = ERR_PTR(-EOPNOTSUPP);
 	event = perf_file->private_data;
-	if (perf_event_read_local(event, &value) == -EOPNOTSUPP)
+	if (perf_event_read_local(event, &value, NULL, NULL) == -EOPNOTSUPP)
 		goto err_out;
 
 	ee = bpf_event_entry_gen(perf_file, map_file);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4cf9b72..b0cbc2d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1552,7 +1552,8 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		break;
 	case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
 		if (func_id != BPF_FUNC_perf_event_read &&
-		    func_id != BPF_FUNC_perf_event_output)
+		    func_id != BPF_FUNC_perf_event_output &&
+		    func_id != BPF_FUNC_perf_event_read_value)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_STACK_TRACE:
@@ -1595,6 +1596,7 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		break;
 	case BPF_FUNC_perf_event_read:
 	case BPF_FUNC_perf_event_output:
+	case BPF_FUNC_perf_event_read_value:
 		if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
 			goto error;
 		break;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6bc21e2..c761ef4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event *event)
  *     will not be local and we cannot read them atomically
  *   - must not have a pmu::count method
  */
-int perf_event_read_local(struct perf_event *event, u64 *value)
+int perf_event_read_local(struct perf_event *event, u64 *value,
+			  u64 *enabled, u64 *running)
 {
 	unsigned long flags;
 	int ret = 0;
+	u64 now;
 
 	/*
 	 * Disabling interrupts avoids all counter scheduling (context
@@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
 		goto out;
 	}
 
+	now = event->shadow_ctx_time + perf_clock();
+	if (enabled)
+		*enabled = now - event->tstamp_enabled;
 	/*
 	 * If the event is currently on this CPU, its either a per-task event,
 	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
 	 * oncpu == -1).
 	 */
-	if (event->oncpu == smp_processor_id())
+	if (event->oncpu == smp_processor_id()) {
 		event->pmu->read(event);
-
+		if (running)
+			*running = now - event->tstamp_running;
+	} else if (running) {
+		*running = event->total_time_running;
+	}
 	*value = local64_read(&event->count);
 out:
 	local_irq_restore(flags);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index dc498b6..686dfa1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -255,14 +255,13 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
 	return &bpf_trace_printk_proto;
 }
 
-BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
-{
+static __always_inline int
+get_map_perf_counter(struct bpf_map *map, u64 flags,
+		     u64 *value, u64 *enabled, u64 *running) {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	unsigned int cpu = smp_processor_id();
 	u64 index = flags & BPF_F_INDEX_MASK;
 	struct bpf_event_entry *ee;
-	u64 value = 0;
-	int err;
 
 	if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
 		return -EINVAL;
@@ -275,7 +274,15 @@ BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
 	if (!ee)
 		return -ENOENT;
 
-	err = perf_event_read_local(ee->event, &value);
+	return perf_event_read_local(ee->event, value, enabled, running);
+}
+
+BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags)
+{
+	u64 value = 0;
+	int err;
+
+	err = get_map_perf_counter(map, flags, &value, NULL, NULL);
 	/*
 	 * this api is ugly since we miss [-22..-2] range of valid
 	 * counter values, but that's uapi
@@ -293,6 +300,33 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_perf_event_read_value, struct bpf_map *, map, u64, flags,
+	   struct bpf_perf_event_value *, buf, u32, size)
+{
+	int err;
+
+	if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+		return -EINVAL;
+
+	err = get_map_perf_counter(map, flags, &buf->counter, &buf->enabled,
+				   &buf->running);
+	if (unlikely(err)) {
+		memset(buf, 0, size);
+		return err;
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_event_read_value_proto = {
+	.func		= bpf_perf_event_read_value,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
 static DEFINE_PER_CPU(struct perf_sample_data, bpf_sd);
 
 static __always_inline u64
@@ -499,6 +533,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return &bpf_perf_event_output_proto;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto;
+	case BPF_FUNC_perf_event_read_value:
+		return &bpf_perf_event_read_value_proto;
 	default:
 		return tracing_func_proto(func_id);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next v6 2/4] bpf: add a test case for helper bpf_perf_event_read_value
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

The bpf sample program tracex6 is enhanced to use the new
helper to read enabled/running time as well.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/tracex6_kern.c                | 26 ++++++++++++++++++++++++++
 samples/bpf/tracex6_user.c                | 13 ++++++++++++-
 tools/include/uapi/linux/bpf.h            |  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
index e7d1803..46c557a 100644
--- a/samples/bpf/tracex6_kern.c
+++ b/samples/bpf/tracex6_kern.c
@@ -15,6 +15,12 @@ struct bpf_map_def SEC("maps") values = {
 	.value_size = sizeof(u64),
 	.max_entries = 64,
 };
+struct bpf_map_def SEC("maps") values2 = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct bpf_perf_event_value),
+	.max_entries = 64,
+};
 
 SEC("kprobe/htab_map_get_next_key")
 int bpf_prog1(struct pt_regs *ctx)
@@ -37,5 +43,25 @@ int bpf_prog1(struct pt_regs *ctx)
 	return 0;
 }
 
+SEC("kprobe/htab_map_lookup_elem")
+int bpf_prog2(struct pt_regs *ctx)
+{
+	u32 key = bpf_get_smp_processor_id();
+	struct bpf_perf_event_value *val, buf;
+	int error;
+
+	error = bpf_perf_event_read_value(&counters, key, &buf, sizeof(buf));
+	if (error)
+		return 0;
+
+	val = bpf_map_lookup_elem(&values2, &key);
+	if (val)
+		*val = buf;
+	else
+		bpf_map_update_elem(&values2, &key, &buf, BPF_NOEXIST);
+
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
 u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
index a05a99a..3341a96 100644
--- a/samples/bpf/tracex6_user.c
+++ b/samples/bpf/tracex6_user.c
@@ -22,6 +22,7 @@
 
 static void check_on_cpu(int cpu, struct perf_event_attr *attr)
 {
+	struct bpf_perf_event_value value2;
 	int pmu_fd, error = 0;
 	cpu_set_t set;
 	__u64 value;
@@ -46,8 +47,18 @@ static void check_on_cpu(int cpu, struct perf_event_attr *attr)
 		fprintf(stderr, "Value missing for CPU %d\n", cpu);
 		error = 1;
 		goto on_exit;
+	} else {
+		fprintf(stderr, "CPU %d: %llu\n", cpu, value);
+	}
+	/* The above bpf_map_lookup_elem should trigger the second kprobe */
+	if (bpf_map_lookup_elem(map_fd[2], &cpu, &value2)) {
+		fprintf(stderr, "Value2 missing for CPU %d\n", cpu);
+		error = 1;
+		goto on_exit;
+	} else {
+		fprintf(stderr, "CPU %d: counter: %llu, enabled: %llu, running: %llu\n", cpu,
+			value2.counter, value2.enabled, value2.running);
 	}
-	fprintf(stderr, "CPU %d: %llu\n", cpu, value);
 
 on_exit:
 	assert(bpf_map_delete_elem(map_fd[0], &cpu) == 0 || error);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6d2137b..db51cbc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -648,7 +648,8 @@ union bpf_attr {
 	FN(redirect_map),		\
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
-	FN(xdp_adjust_meta),
+	FN(xdp_adjust_meta),		\
+	FN(perf_event_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index a56053d..c15ca83 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -72,6 +72,9 @@ static int (*bpf_sk_redirect_map)(void *map, int key, int flags) =
 static int (*bpf_sock_map_update)(void *map, void *key, void *value,
 				  unsigned long long flags) =
 	(void *) BPF_FUNC_sock_map_update;
+static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
+					void *buf, unsigned int buf_size) =
+	(void *) BPF_FUNC_perf_event_read_value;
 
 
 /* llvm builtin functions that eBPF C program may use to
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next v6 4/4] bpf: add a test case for helper bpf_perf_prog_read_value
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

The bpf sample program trace_event is enhanced to use the new
helper to print out enabled/running time.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 samples/bpf/trace_event_kern.c            | 10 ++++++++++
 samples/bpf/trace_event_user.c            | 13 ++++++++-----
 tools/include/uapi/linux/bpf.h            |  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/samples/bpf/trace_event_kern.c b/samples/bpf/trace_event_kern.c
index 41b6115..a77a583d 100644
--- a/samples/bpf/trace_event_kern.c
+++ b/samples/bpf/trace_event_kern.c
@@ -37,10 +37,14 @@ struct bpf_map_def SEC("maps") stackmap = {
 SEC("perf_event")
 int bpf_prog1(struct bpf_perf_event_data *ctx)
 {
+	char time_fmt1[] = "Time Enabled: %llu, Time Running: %llu";
+	char time_fmt2[] = "Get Time Failed, ErrCode: %d";
 	char fmt[] = "CPU-%d period %lld ip %llx";
 	u32 cpu = bpf_get_smp_processor_id();
+	struct bpf_perf_event_value value_buf;
 	struct key_t key;
 	u64 *val, one = 1;
+	int ret;
 
 	if (ctx->sample_period < 10000)
 		/* ignore warmup */
@@ -54,6 +58,12 @@ int bpf_prog1(struct bpf_perf_event_data *ctx)
 		return 0;
 	}
 
+	ret = bpf_perf_prog_read_value(ctx, (void *)&value_buf, sizeof(struct bpf_perf_event_value));
+	if (!ret)
+	  bpf_trace_printk(time_fmt1, sizeof(time_fmt1), value_buf.enabled, value_buf.running);
+	else
+	  bpf_trace_printk(time_fmt2, sizeof(time_fmt2), ret);
+
 	val = bpf_map_lookup_elem(&counts, &key);
 	if (val)
 		(*val)++;
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 7bd827b..bf4f1b6 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -127,6 +127,9 @@ static void test_perf_event_all_cpu(struct perf_event_attr *attr)
 	int *pmu_fd = malloc(nr_cpus * sizeof(int));
 	int i, error = 0;
 
+	/* system wide perf event, no need to inherit */
+	attr->inherit = 0;
+
 	/* open perf_event on all cpus */
 	for (i = 0; i < nr_cpus; i++) {
 		pmu_fd[i] = sys_perf_event_open(attr, -1, i, -1, 0);
@@ -154,6 +157,11 @@ static void test_perf_event_task(struct perf_event_attr *attr)
 {
 	int pmu_fd;
 
+	/* per task perf event, enable inherit so the "dd ..." command can be traced properly.
+	 * Enabling inherit will cause bpf_perf_prog_read_time helper failure.
+	 */
+	attr->inherit = 1;
+
 	/* open task bound event */
 	pmu_fd = sys_perf_event_open(attr, 0, -1, -1, 0);
 	if (pmu_fd < 0) {
@@ -175,14 +183,12 @@ static void test_bpf_perf_event(void)
 		.freq = 1,
 		.type = PERF_TYPE_HARDWARE,
 		.config = PERF_COUNT_HW_CPU_CYCLES,
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_type_sw = {
 		.sample_freq = SAMPLE_FREQ,
 		.freq = 1,
 		.type = PERF_TYPE_SOFTWARE,
 		.config = PERF_COUNT_SW_CPU_CLOCK,
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_hw_cache_l1d = {
 		.sample_freq = SAMPLE_FREQ,
@@ -192,7 +198,6 @@ static void test_bpf_perf_event(void)
 			PERF_COUNT_HW_CACHE_L1D |
 			(PERF_COUNT_HW_CACHE_OP_READ << 8) |
 			(PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16),
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_hw_cache_branch_miss = {
 		.sample_freq = SAMPLE_FREQ,
@@ -202,7 +207,6 @@ static void test_bpf_perf_event(void)
 			PERF_COUNT_HW_CACHE_BPU |
 			(PERF_COUNT_HW_CACHE_OP_READ << 8) |
 			(PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
-		.inherit = 1,
 	};
 	struct perf_event_attr attr_type_raw = {
 		.sample_freq = SAMPLE_FREQ,
@@ -210,7 +214,6 @@ static void test_bpf_perf_event(void)
 		.type = PERF_TYPE_RAW,
 		/* Intel Instruction Retired */
 		.config = 0xc0,
-		.inherit = 1,
 	};
 
 	printf("Test HW_CPU_CYCLES\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db51cbc..cb7bec9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -649,7 +649,8 @@ union bpf_attr {
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
 	FN(xdp_adjust_meta),		\
-	FN(perf_event_read_value),
+	FN(perf_event_read_value),	\
+	FN(perf_prog_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index c15ca83..e25dbf6 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -75,6 +75,9 @@ static int (*bpf_sock_map_update)(void *map, void *key, void *value,
 static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
 					void *buf, unsigned int buf_size) =
 	(void *) BPF_FUNC_perf_event_read_value;
+static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
+				       unsigned int buf_size) =
+	(void *) BPF_FUNC_perf_prog_read_value;
 
 
 /* llvm builtin functions that eBPF C program may use to
-- 
2.9.5

^ permalink raw reply related

* [PATCH net-next v6 3/4] bpf: add helper bpf_perf_prog_read_value
From: Yonghong Song @ 2017-10-02 22:42 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

This patch adds helper bpf_perf_prog_read_cvalue for perf event based bpf
programs, to read event counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.

The typical use case for perf event based bpf program is to attach itself
to a single event. In such cases, if it is desirable to get scaling factor
between two bpf invocations, users can can save the time values in a map,
and use the value from the map and the current value to calculate
the scaling factor.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/perf_event.h |  1 +
 include/uapi/linux/bpf.h   | 10 +++++++++-
 kernel/events/core.c       |  1 +
 kernel/trace/bpf_trace.c   | 28 ++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 21d8c12..79b18a2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -806,6 +806,7 @@ struct perf_output_handle {
 struct bpf_perf_event_data_kern {
 	struct pt_regs *regs;
 	struct perf_sample_data *data;
+	struct perf_event *event;
 };
 
 #ifdef CONFIG_CGROUP_PERF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4bdd4b2..6f6a236 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -599,6 +599,13 @@ union bpf_attr {
  *     @buf: buf to fill
  *     @buf_size: size of the buf
  *     Return: 0 on success or negative error code
+ *
+ * int bpf_perf_prog_read_value(ctx, buf, buf_size)
+ *     read perf prog attached perf event counter and enabled/running time
+ *     @ctx: pointer to ctx
+ *     @buf: buf to fill
+ *     @buf_size: size of the buf
+ *     Return : 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -656,7 +663,8 @@ union bpf_attr {
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
 	FN(xdp_adjust_meta),		\
-	FN(perf_event_read_value),
+	FN(perf_event_read_value),	\
+	FN(perf_prog_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c761ef4..d1efac2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8081,6 +8081,7 @@ static void bpf_overflow_handler(struct perf_event *event,
 	struct bpf_perf_event_data_kern ctx = {
 		.data = data,
 		.regs = regs,
+		.event = event,
 	};
 	int ret = 0;
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 686dfa1..c4d617a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -612,6 +612,32 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_perf_prog_read_value_tp, struct bpf_perf_event_data_kern *, ctx,
+	   struct bpf_perf_event_value *, buf, u32, size)
+{
+	int err;
+
+	if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+		return -EINVAL;
+
+	err = perf_event_read_local(ctx->event, &buf->counter, &buf->enabled,
+				    &buf->running);
+	if (unlikely(err)) {
+		memset(buf, 0, size);
+		return err;
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_prog_read_value_proto_tp = {
+         .func           = bpf_perf_prog_read_value_tp,
+         .gpl_only       = true,
+         .ret_type       = RET_INTEGER,
+         .arg1_type      = ARG_PTR_TO_CTX,
+         .arg2_type      = ARG_PTR_TO_UNINIT_MEM,
+         .arg3_type      = ARG_CONST_SIZE,
+};
+
 static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -619,6 +645,8 @@ static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
 		return &bpf_perf_event_output_proto_tp;
 	case BPF_FUNC_get_stackid:
 		return &bpf_get_stackid_proto_tp;
+	case BPF_FUNC_perf_prog_read_value:
+		return &bpf_perf_prog_read_value_proto_tp;
 	default:
 		return tracing_func_proto(func_id);
 	}
-- 
2.9.5

^ permalink raw reply related

* Re: [net-next 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2017-10-02
From: David Miller @ 2017-10-02 22:17 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20171002194852.71970-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon,  2 Oct 2017 12:48:37 -0700

> This series contains updates to i40e and i40evf.

Pulled, thanks Jeff.

I saw the feedback about "odd parenthesis" in patch #11 from Yuval
Mintz, but I simply don't see it at all.

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Improve output for unsupported stats
From: Andrew Lunn @ 2017-10-02 21:57 UTC (permalink / raw)
  To: John Allen; +Cc: netdev, Thomas Falcon
In-Reply-To: <6ce14d1a-11cc-37be-e6ca-b9ec4afb01a8@linux.vnet.ibm.com>

On Mon, Oct 02, 2017 at 03:31:39PM -0500, John Allen wrote:
> The vnic server can report -1 in the event that a given statistic is not
> supported. Currently, the -1 value is implicitly cast to an unsigned
> integer and appears through the ethtool -S output as a very large number.
> This patch improves this behavior by reporting 0 in the event that a
> given statistic is not supported.

Hi John

If it does not exist, why not skip it altogether?
ibmvnic_get_sset_count() could walk the list and count the valid ones.
ibmvnic_get_strings() could only return the name of the valid onces.

      Andrew

^ permalink raw reply

* [PATCH v3 net-next] net: core: decouple ifalias get/set from rtnl lock
From: Florian Westphal @ 2017-10-02 21:50 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, Florian Westphal

Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Based on suggestion from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v2:
  - use rcu_swap_protected
 Change since v1:
  - don't use a sequence lock, instead use kmalloc+kfree_rcu
  - add comment for dev_get_alias
  - add comment why we use kfree and not kfree_rcu
  to free dev->alias during netdevice destruction

 include/linux/netdevice.h |  8 +++++++-
 net/core/dev.c            | 52 +++++++++++++++++++++++++++++++++++------------
 net/core/net-sysfs.c      | 17 ++++++++--------
 net/core/rtnetlink.c      | 13 ++++++++++--
 4 files changed, 65 insertions(+), 25 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e1d6ef130611..d04424cfffba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,11 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct dev_ifalias {
+	struct rcu_head rcuhead;
+	char ifalias[];
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1632,7 +1637,7 @@ enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
-	char 			*ifalias;
+	struct dev_ifalias	__rcu *ifalias;
 	/*
 	 *	I/O specific fields
 	 *	FIXME: Merge these and struct ifmap into one
@@ -3275,6 +3280,7 @@ void __dev_notify_flags(struct net_device *, unsigned int old_flags,
 			unsigned int gchanges);
 int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
+int dev_get_alias(const struct net_device *, char *, size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..1770097cfd86 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -188,6 +188,8 @@ static struct napi_struct *napi_by_id(unsigned int napi_id);
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+static DEFINE_MUTEX(ifalias_mutex);
+
 /* protects napi_hash addition/deletion and napi_gen_id */
 static DEFINE_SPINLOCK(napi_hash_lock);
 
@@ -1265,29 +1267,53 @@ int dev_change_name(struct net_device *dev, const char *newname)
  */
 int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
 {
-	char *new_ifalias;
-
-	ASSERT_RTNL();
+	struct dev_ifalias *new_alias = NULL;
 
 	if (len >= IFALIASZ)
 		return -EINVAL;
 
-	if (!len) {
-		kfree(dev->ifalias);
-		dev->ifalias = NULL;
-		return 0;
+	if (len) {
+		new_alias = kmalloc(sizeof(*new_alias) + len + 1, GFP_KERNEL);
+		if (!new_alias)
+			return -ENOMEM;
+
+		memcpy(new_alias->ifalias, alias, len);
+		new_alias->ifalias[len] = 0;
 	}
 
-	new_ifalias = krealloc(dev->ifalias, len + 1, GFP_KERNEL);
-	if (!new_ifalias)
-		return -ENOMEM;
-	dev->ifalias = new_ifalias;
-	memcpy(dev->ifalias, alias, len);
-	dev->ifalias[len] = 0;
+	mutex_lock(&ifalias_mutex);
+	rcu_swap_protected(dev->ifalias, new_alias,
+			   mutex_is_locked(&ifalias_mutex));
+	mutex_unlock(&ifalias_mutex);
+
+	if (new_alias)
+		kfree_rcu(new_alias, rcuhead);
 
 	return len;
 }
 
+/**
+ *	dev_get_alias - get ifalias of a device
+ *	@dev: device
+ *	@alias: buffer to store name of ifalias
+ *	@len: size of buffer
+ *
+ *	get ifalias for a device.  Caller must make sure dev cannot go
+ *	away,  e.g. rcu read lock or own a reference count to device.
+ */
+int dev_get_alias(const struct net_device *dev, char *name, size_t len)
+{
+	const struct dev_ifalias *alias;
+	int ret = 0;
+
+	rcu_read_lock();
+	alias = rcu_dereference(dev->ifalias);
+	if (alias)
+		ret = snprintf(name, len, "%s", alias->ifalias);
+	rcu_read_unlock();
+
+	return ret;
+}
 
 /**
  *	netdev_features_change - device changes features
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..51d5836d8fb9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -391,10 +391,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
 	ret = dev_set_alias(netdev, buf, count);
-	rtnl_unlock();
 
 	return ret < 0 ? ret : len;
 }
@@ -403,13 +400,12 @@ static ssize_t ifalias_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
 	const struct net_device *netdev = to_net_dev(dev);
+	char tmp[IFALIASZ];
 	ssize_t ret = 0;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-	if (netdev->ifalias)
-		ret = sprintf(buf, "%s\n", netdev->ifalias);
-	rtnl_unlock();
+	ret = dev_get_alias(netdev, tmp, sizeof(tmp));
+	if (ret > 0)
+		ret = sprintf(buf, "%s\n", tmp);
 	return ret;
 }
 static DEVICE_ATTR_RW(ifalias);
@@ -1488,7 +1484,10 @@ static void netdev_release(struct device *d)
 
 	BUG_ON(dev->reg_state != NETREG_RELEASED);
 
-	kfree(dev->ifalias);
+	/* no need to wait for rcu grace period:
+	 * device is dead and about to be freed.
+	 */
+	kfree(rcu_access_pointer(dev->ifalias));
 	netdev_freemem(dev);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6955da0d58d..3961f87cdc76 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1366,6 +1366,16 @@ static int nla_put_iflink(struct sk_buff *skb, const struct net_device *dev)
 	return nla_put_u32(skb, IFLA_LINK, ifindex);
 }
 
+static noinline_for_stack int nla_put_ifalias(struct sk_buff *skb,
+					      struct net_device *dev)
+{
+	char buf[IFALIASZ];
+	int ret;
+
+	ret = dev_get_alias(dev, buf, sizeof(buf));
+	return ret > 0 ? nla_put_string(skb, IFLA_IFALIAS, buf) : 0;
+}
+
 static int rtnl_fill_link_netnsid(struct sk_buff *skb,
 				  const struct net_device *dev)
 {
@@ -1425,8 +1435,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
 	    (dev->qdisc &&
 	     nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
-	    (dev->ifalias &&
-	     nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
+	    nla_put_ifalias(skb, dev) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
-- 
2.13.6

^ permalink raw reply related

* Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Levi Pearson @ 2017-10-02 21:48 UTC (permalink / raw)
  To: Rodney Cummings
  Cc: Linux Kernel Network Developers, Vinicius Costa Gomes,
	Henrik Austad, richardcochran@gmail.com,
	jesus.sanchez-palencia@intel.com, andre.guedes@intel.com
In-Reply-To: <DM2PR0401MB138982C6B22DF8154A82DB39927D0@DM2PR0401MB1389.namprd04.prod.outlook.com>

Hi Rodney,

On Mon, Oct 2, 2017 at 1:40 PM, Rodney Cummings <rodney.cummings@ni.com> wrote:

> It's a shame that someone built such hardware. Speaking as a manufacturer
> of daisy-chainable products for industrial/automotive applications, I
> wouldn't use that hardware in my products. The whole point of scheduling
> is to obtain close-to-optimal network determinism. If the hardware
> doesn't schedule properly, the product is probably better off using CBS.

I should note that the hardware I'm using was not designed to be a TSN
endpoint. It just happens that the hardware, designed for other
reasons, happens to have hardware support for Qbv and thus makes it an
interesting and inexpensive platform on which to do Qbv experiments.
Nevertheless, I believe the right combination of driver support for
the hardware shaping and appropriate software in the network stack
would allow it to schedule things fairly precisely in the
application-centric manner you've described. I will ultimately defer
to your judgement on that, though, and spend some time with the Qcc
draft to see where I may be wrong.

> It would be great if all of the user-level application code was scheduled
> as tightly as the network, but the reality is that we aren't there yet.
> Also, even if the end-station has a separately timed RT Linux thread for
> each stream, the order of those threads can vary. Therefore, when the
> end-station (SoC) has more than one talker, the frames for each talker
> can be written into the Qbv queue at any time, in any order.
> There is no scheduling of frames (i.e. streams)... only the queue.

Frames are presented to *the kernel* in any order. The qdisc
infrastructure provides mechanisms by which other scheduling policy
can be added between presentation to the kernel and presentation to
hardware queues.

Even with i210-style LaunchTime functionality, this must also be
provided. The hardware cannot re-order time-scheduled frames, and
reviewers of the SO_TXTIME proposal rejected the idea of having the
driver try to re-order scheduled frames already submitted to the
hardware-visible descriptor chains, and rightly so I think.

So far, I have thought of 3 places this ordering constraint might be provided:

1. Built into each driver with offload support for SO_TXTIME
2. A qdisc module that will re-order scheduled skbuffs within a certain window
3. Left to userspace to coordinate

I think 1 is particularly bad due to duplication of code. 3 is the
easy default from the point of view of the kernel, but not attractive
from the point of view of precise scheduling and low latency to
egress. Unless someone has a better idea, I think 2 would be the most
appropriate choice. I can think of a number of ways one might be
designed, but I have not thought them through fully yet.

> From the perspective of the CNC, that means that a Qbv-only end-station is
> sloppy. In contrast, an end-station using per-stream scheduling
> (e.g. i210 with LaunchTime) is precise, because each frame has a known time.
> It's true that P802.1Qcc supports both, so the CNC might support both.
> Nevertheless, the CNC is likely to compute a network-wide schedule
> that optimizes for the per-stream end-stations, and locates the windows
> for the Qbv-only end-stations in a slower interval.

Because out-of-order submission of frames to LaunchTime hardware would
result in the frame that is enqueued later but scheduled earlier only
transmitting once the later-scheduled frame completes, enqueuing
frames immediately with LaunchTime information results in even *worse*
sloppiness if two applications don't always submit their frames in the
correct order to the kernel during a cycle. If your queue is simply
gated at the time of the first scheduled launch, the frame that should
have been first will only be late by the length of the queued-first
frame. If they both have precise launch times, the frame that should
have been first will have to wait until the normal launch time of the
later-scheduled frame plus the time it takes for it to egress!

In either case, re-ordering is required if it is not controlled for in
userspace. In other words, I am not disagreeing with you about the
insufficiency of a Qbv shaper for an endpoint! I am just pointing out
that neither hardware LaunchTime support nor hardware Qbv window
support are sufficient by themselves. Nor, for that case, is the CBS
offload support sufficient to meet multi-stream requirements from SRP.
All of these provide a low-layer mechanism by which appropriate
scheduling of a multi-stream or multi-application endpoint can be
enhanced, but none of them provide the complete story by themselves.

> I realize that we are only doing CBS for now. I guess my main point is that
> if/when we add scheduling, it cannot be limited to Qbv. Per-stream scheduling
> is essential, because that is the assumption for most CNC designs that
> I'm aware of.

I understand and completely agree. I have believed that per-stream
shaping/scheduling in some form is essential from the beginning.
Likewise, CBS is not sufficient in itself for multi-stream systems in
which multiple applications provide streams scheduled for the same
traffic class. But CBS and SO_TXTIME are good first steps; they
provides hooks for drivers to finally support the mechanisms provided
by hardware, and also in the case of SO_TXTIME a hook on which further
time-sensitive enhancements to the network stack can be hung, such as
qdiscs that can store and release frames based on scheduled launch
times. When these are in place, I think stream-based scheduling via
qdisc would be a reasonable next step; perhaps a common design could
be found to work both for media streams over CBS and industrial
streams scheduled by launch time.

Levi

^ permalink raw reply

* Re: [PATCH v2 net-next] net: core: decouple ifalias get/set from rtnl lock
From: Florian Westphal @ 2017-10-02 21:47 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev
In-Reply-To: <20171002212942.14521-1-fw@strlen.de>

Florian Westphal <fw@strlen.de> wrote:
> +	mutex_lock(&ifalias_mutex);
> +
> +	old = rcu_dereference_protected(dev->ifalias,
> +					mutex_is_locked(&ifalias_mutex));
> +	if (len) {
> +		memcpy(new_alias->ifalias, alias, len);
> +		new_alias->ifalias[len] = 0;
> +	}

This can be done outside of the lock, so this can be
> +
> +	rcu_assign_pointer(dev->ifalias, new_alias);

rcu_swap_protected().

I'll send a v3.

^ permalink raw reply

* RE: [net-next 11/15] i40evf: Enable VF to request an alternate queue allocation
From: Brady, Alan @ 2017-10-02 21:35 UTC (permalink / raw)
  To: Yuval Mintz, Kirsher, Jeffrey T, davem@davemloft.net
  Cc: netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
	jogreene@redhat.com
In-Reply-To: <AM0PR0502MB368318470E6010AB9D6C07E1BF7D0@AM0PR0502MB3683.eurprd05.prod.outlook.com>

> -----Original Message-----
> From: Yuval Mintz [mailto:yuvalm@mellanox.com]
> Sent: Monday, October 2, 2017 1:51 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net
> Cc: Brady, Alan <alan.brady@intel.com>; netdev@vger.kernel.org;
> nhorman@redhat.com; sassmann@redhat.com; jogreene@redhat.com
> Subject: RE: [net-next 11/15] i40evf: Enable VF to request an alternate queue
> allocation
> 
> > +	case VIRTCHNL_OP_REQUEST_QUEUES: {
> > +		struct virtchnl_vf_res_request *vfres =
> > +			(struct virtchnl_vf_res_request *)msg;
> > +		if (vfres->num_queue_pairs == adapter->num_req_queues)
> > {
> > +			adapter->flags |=
> > I40EVF_FLAG_REINIT_ITR_NEEDED;
> > +			i40evf_schedule_reset(adapter);
> > +		} else {
> > +			dev_info(&adapter->pdev->dev,
> > +				 "Requested %d queues, PF can support
> > %d\n",
> > +				 adapter->num_req_queues,
> > +				 vfres->num_queue_pairs);
> > +			adapter->num_req_queues = 0;
> > +		}
> > +		}
> > +		break;
> 
> Something is odd about your parenthesis.
> 
> >  	default:
> >  		if (adapter->current_op && (v_opcode != adapter-
> > >current_op))
> >  			dev_warn(&adapter->pdev->dev, "Expected
> response %d from PF,
> > received %d\n",
> > --
> > 2.14.2

I'm not sure which parentheses you are referring to.  Perhaps you mean the double curly braces?  The outer set of braces are creating a block in the case statement so we can declare a variable specific to the case statement.  The inner brace is an if/else statement block.  They are both consistent with brace usage/alignment in the rest of the file and adding a new line between them generates a GCC warning.

-Alan

^ permalink raw reply

* Re: [PATCH net-next] vhost_net: do not stall on zerocopy depletion
From: Willem de Bruijn @ 2017-10-02 21:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Network Development, David Miller, Jason Wang, Koichiro Den,
	virtualization, Willem de Bruijn
In-Reply-To: <CAF=yD-KotdpHs96GomMKR-BqG3Gyrvo+to0sk2=a6E5BKjgpkg@mail.gmail.com>

On Fri, Sep 29, 2017 at 9:25 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Fri, Sep 29, 2017 at 3:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Wed, Sep 27, 2017 at 08:25:56PM -0400, Willem de Bruijn wrote:
>>> From: Willem de Bruijn <willemb@google.com>
>>>
>>> Vhost-net has a hard limit on the number of zerocopy skbs in flight.
>>> When reached, transmission stalls. Stalls cause latency, as well as
>>> head-of-line blocking of other flows that do not use zerocopy.
>>>
>>> Instead of stalling, revert to copy-based transmission.
>>>
>>> Tested by sending two udp flows from guest to host, one with payload
>>> of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
>>> large flow is redirected to a netem instance with 1MBps rate limit
>>> and deep 1000 entry queue.
>>>
>>>   modprobe ifb
>>>   ip link set dev ifb0 up
>>>   tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
>>>
>>>   tc qdisc add dev tap0 ingress
>>>   tc filter add dev tap0 parent ffff: protocol ip \
>>>       u32 match ip dport 8000 0xffff \
>>>       action mirred egress redirect dev ifb0
>>>
>>> Before the delay, both flows process around 80K pps. With the delay,
>>> before this patch, both process around 400. After this patch, the
>>> large flow is still rate limited, while the small reverts to its
>>> original rate. See also discussion in the first link, below.
>>>
>>> The limit in vhost_exceeds_maxpend must be carefully chosen. When
>>> vq->num >> 1, the flows remain correlated. This value happens to
>>> correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
>>> fractions and ensure correctness also for much smaller values of
>>> vq->num, by testing the min() of both explicitly. See also the
>>> discussion in the second link below.
>>>
>>> Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
>>> Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com
>>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>>
>> I'd like to see the effect on the non rate limited case though.
>> If guest is quick won't we have lots of copies then?
>
> Yes, but not significantly more than without this patch.
>
> I ran 1, 10 and 100 flow tcp_stream throughput tests from a sender
> in the guest to a receiver in the host.
>
> To answer the other benchmark question first, I did not see anything
> noteworthy when increasing vq->num from 256 to 1024.
>
> With 1 and 10 flows without this patch all packets use zerocopy.
> With the patch, less than 1% eschews zerocopy.
>
> With 100 flows, even without this patch, 90+% of packets are copied.
> Some zerocopy packets from vhost_net fail this test in tun.c
>
>     if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
>
> Generating packets with up to 21 frags. I'm not sure yet why or
> what the fraction of these packets is.

This seems to be a mix of page alignment and compound pages.
The iov_len is always well below the maximum, but frags exceed
page size and can start high in the initial page.

 tun_get_user: num_pages=21 max=17 iov_len=6 len=65226
   0: p_off=3264 len=6888
   1: p_off=1960 len=16384
   2: p_off=1960 len=6232
   3: p_off=0 len=10152
   4: p_off=1960 len=16384
   5: p_off=1960 len=9120

> But this in turn can
> disable zcopy_used in vhost_net_tx_select_zcopy for a
> larger share of packets:
>
>         return !net->tx_flush &&
>                 net->tx_packets / 64 >= net->tx_zcopy_err;
>

Testing iov_iter_npages() in handle_tx to inform zcopy_used allows
skipping these without turning off zerocopy for all other packets.

After implementing that, tx_zcopy_err drops to zero, but only around
40% of packets use zerocopy.

^ permalink raw reply

* Question about "prevent dst uses after free" and WARNING in nf_xfrm_me_harder / refcnt / 4.13.3
From: Denys Fedoryshchenko @ 2017-10-02 21:33 UTC (permalink / raw)
  To: Eric Dumazet, Linux Kernel Network Developers

Hi,

I'm running now 4.13.3, is this patch required for 4.13 as well?
(it doesnt apply cleanly, as in 4.13 tcp_prequeue use 
skb_dst_force_safe, so i just renamed it there to skb_dst_force )

This is what i get on PPPoE BRAS on this kernel, patch applied
(no idea if its related to patch, but just mentioning i applied it, as 
it's not vanilla 4.13.3)

[ 7858.579600] ------------[ cut here ]------------
[ 7858.579818] WARNING: CPU: 2 PID: 0 at ./include/net/dst.h:254 
nf_xfrm_me_harder+0x61/0xec [nf_nat]
[ 7858.580160] Modules linked in: cls_fw act_police cls_u32 sch_ingress 
sch_htb pppoe pppox ppp_generic slhc netconsole configfs coretemp 
nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre 
tun xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT 
nf_reject_ipv4 xt_set ts_bm xt_string xt_connmark xt_DSCP xt_mark 
xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle 
iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[ 7858.581255] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 
4.13.3-build-0133 #27
[ 7858.581456] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 
04/02/2015
[ 7858.581659] task: ffff880434e6a700 task.stack: ffffc90001904000
[ 7858.581862] RIP: 0010:nf_xfrm_me_harder+0x61/0xec [nf_nat]
[ 7858.582061] RSP: 0018:ffff880436483bc0 EFLAGS: 00010246
[ 7858.582259] RAX: 0000000000000000 RBX: ffffffff822df000 RCX: 
ffff8803ee9028ce
[ 7858.582461] RDX: 0000000000000014 RSI: ffff88041cd82900 RDI: 
ffff880436483bf8
[ 7858.582661] RBP: ffff880436483c20 R08: ffffffff81e0b400 R09: 
00000000b9160000
[ 7858.582865] R10: ffff8803ee9028e8 R11: 0000000000000000 R12: 
ffff880401e92100
[ 7858.583068] R13: 0000000000000001 R14: ffffffff822df000 R15: 
ffff88042e280078
[ 7858.583269] FS:  0000000000000000(0000) GS:ffff880436480000(0000) 
knlGS:0000000000000000
[ 7858.583608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7858.583809] CR2: 00007f9b2886fc9c CR3: 0000000429223000 CR4: 
00000000001406e0
[ 7858.584013] Call Trace:
[ 7858.584209]  <IRQ>
[ 7858.584408]  ? nf_nat_ipv4_fn+0x12e/0x189 [nf_nat_ipv4]
[ 7858.584605]  nf_nat_ipv4_out+0xb6/0xd3 [nf_nat_ipv4]
[ 7858.584807]  iptable_nat_ipv4_out+0x15/0x17 [iptable_nat]
[ 7858.585010]  nf_hook_slow+0x2a/0x9a
[ 7858.585209]  ip_output+0x96/0xb4
[ 7858.585410]  ? ip_fragment.constprop.5+0x7c/0x7c
[ 7858.585610]  ip_forward_finish+0x5b/0x60
[ 7858.585811]  ip_forward+0x36d/0x37a
[ 7858.586010]  ? ip_frag_mem+0x11/0x11
[ 7858.586207]  ip_rcv_finish+0x2f9/0x304
[ 7858.586406]  ip_rcv+0x32a/0x337
[ 7858.586604]  ? ip_local_deliver_finish+0x1bb/0x1bb
[ 7858.586808]  __netif_receive_skb_core+0x4f0/0x847
[ 7858.587009]  __netif_receive_skb+0x18/0x5a
[ 7858.587208]  ? __netif_receive_skb+0x18/0x5a
[ 7858.587407]  process_backlog+0xa4/0x127
[ 7858.587606]  net_rx_action+0x11e/0x2d8
[ 7858.587811]  ? sched_clock_cpu+0x15/0x9b
[ 7858.588013]  __do_softirq+0xe7/0x23a
[ 7858.588210]  irq_exit+0x52/0x93
[ 7858.588408]  smp_call_function_single_interrupt+0x33/0x35
[ 7858.588610]  call_function_single_interrupt+0x83/0x90
[ 7858.588811] RIP: 0010:mwait_idle+0x93/0x13c
[ 7858.589007] RSP: 0018:ffffc90001907eb0 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffff04
[ 7858.589347] RAX: 0000000000000000 RBX: ffff880434e6a700 RCX: 
0000000000000000
[ 7858.589548] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[ 7858.589750] RBP: ffffc90001907ec0 R08: 0000000000000000 R09: 
0000000000000001
[ 7858.589952] R10: ffffc90001907e58 R11: 000000000000024d R12: 
0000000000000002
[ 7858.590149] R13: 0000000000000000 R14: ffff880434e6a700 R15: 
ffff880434e6a700
[ 7858.590347]  </IRQ>
[ 7858.590541]  arch_cpu_idle+0xf/0x11
[ 7858.590738]  default_idle_call+0x25/0x27
[ 7858.590938]  do_idle+0xb8/0x150
[ 7858.591133]  cpu_startup_entry+0x1f/0x21
[ 7858.591332]  start_secondary+0xe8/0xeb
[ 7858.591531]  secondary_startup_64+0x9f/0x9f
[ 7858.591729] Code: 83 7e 48 00 74 07 48 8b b6 80 01 00 00 8b 86 80 00 
00 00 85 c0 74 14 8d 50 01 f0 0f b1 96 80 00 00 00 0f 94 c2 84 d2 75 04 
eb e8 <0f> ff 49 8b 4c 24 18 48 8d 55 a0 45 31 c0 48 89 df e8 d9 de 95
[ 7858.592239] ---[ end trace c089174999ff4fc3 ]---
[ 7858.592448] dst_release: dst:ffff88041cd82900 refcnt:-1
[ 8139.130003] igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Down
[ 8139.130309] igb 0000:07:00.0 eth0: Reset adapter
[ 8164.431523] igb 0000:07:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps 
Full Duplex, Flow Control: RX/TX
[ 9149.190518] perf: interrupt took too long (3132 > 3128), lowering 
kernel.perf_event_max_sample_rate to 63000
[17205.528640] ------------[ cut here ]------------
[17205.528855] WARNING: CPU: 0 PID: 0 at ./include/net/dst.h:254 
nf_xfrm_me_harder+0x61/0xec [nf_nat]
[17205.529197] Modules linked in: cls_fw act_police cls_u32 sch_ingress 
sch_htb pppoe pppox ppp_generic slhc netconsole configfs coretemp 
nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre 
tun xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT 
nf_reject_ipv4 xt_set ts_bm xt_string xt_connmark xt_DSCP xt_mark 
xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle 
iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[17205.530294] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       
4.13.3-build-0133 #27
[17205.530632] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 
04/02/2015
[17205.530834] task: ffffffff8220e480 task.stack: ffffffff82200000
[17205.531033] RIP: 0010:nf_xfrm_me_harder+0x61/0xec [nf_nat]
[17205.531232] RSP: 0018:ffff880436403bc0 EFLAGS: 00010246
[17205.531434] RAX: 0000000000000000 RBX: ffffffff822df000 RCX: 
ffff8803f5fba0ce
[17205.531636] RDX: 0000000000000014 RSI: ffff8804041ae100 RDI: 
ffff880436403bf8
[17205.531836] RBP: ffff880436403c20 R08: ffffffff81e0b400 R09: 
0000000033d10000
[17205.532035] R10: ffff8803f5fba0e8 R11: 0000000000000000 R12: 
ffff88041e7a3500
[17205.532235] R13: 0000000000000001 R14: ffffffff822df000 R15: 
ffff88042e280078
[17205.532435] FS:  0000000000000000(0000) GS:ffff880436400000(0000) 
knlGS:0000000000000000
[17205.532775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17205.532974] CR2: 00007f9b2c6b52b8 CR3: 0000000429223000 CR4: 
00000000001406f0
[17205.533170] Call Trace:
[17205.533361]  <IRQ>
[17205.533555]  ? nf_nat_ipv4_fn+0x12e/0x189 [nf_nat_ipv4]
[17205.533754]  nf_nat_ipv4_out+0xb6/0xd3 [nf_nat_ipv4]
[17205.533953]  iptable_nat_ipv4_out+0x15/0x17 [iptable_nat]
[17205.534151]  nf_hook_slow+0x2a/0x9a
[17205.534344]  ip_output+0x96/0xb4
[17205.534539]  ? ip_fragment.constprop.5+0x7c/0x7c
[17205.534738]  ip_forward_finish+0x5b/0x60
[17205.534939]  ip_forward+0x36d/0x37a
[17205.535137]  ? ip_frag_mem+0x11/0x11
[17205.535337]  ip_rcv_finish+0x2f9/0x304
[17205.535537]  ip_rcv+0x32a/0x337
[17205.535732]  ? ip_local_deliver_finish+0x1bb/0x1bb
[17205.535935]  __netif_receive_skb_core+0x4f0/0x847
[17205.536135]  __netif_receive_skb+0x18/0x5a
[17205.536332]  ? __netif_receive_skb+0x18/0x5a
[17205.536533]  process_backlog+0xa4/0x127
[17205.536731]  net_rx_action+0x11e/0x2d8
[17205.536934]  ? sched_clock_cpu+0x15/0x9b
[17205.537134]  __do_softirq+0xe7/0x23a
[17205.537331]  irq_exit+0x52/0x93
[17205.537530]  smp_call_function_single_interrupt+0x33/0x35
[17205.537730]  call_function_single_interrupt+0x83/0x90
[17205.537934] RIP: 0010:mwait_idle+0x93/0x13c
[17205.538131] RSP: 0018:ffffffff82203e28 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffff04
[17205.538469] RAX: 0000000000000000 RBX: ffffffff8220e480 RCX: 
0000000000000000
[17205.538668] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
0000000000000000
[17205.538871] RBP: ffffffff82203e38 R08: 0000000000000000 R09: 
0000000000000001
[17205.539071] R10: ffffffff82203dd0 R11: 000000000000002a R12: 
0000000000000000
[17205.539271] R13: 0000000000000000 R14: ffffffff8220e480 R15: 
ffffffff8220e480
[17205.539472]  </IRQ>
[17205.539670]  arch_cpu_idle+0xf/0x11
[17205.539869]  default_idle_call+0x25/0x27
[17205.540068]  do_idle+0xb8/0x150
[17205.540266]  cpu_startup_entry+0x1f/0x21
[17205.540465]  rest_init+0xb5/0xb7
[17205.540665]  start_kernel+0x3b0/0x3bd
[17205.540864]  x86_64_start_reservations+0x2a/0x2c
[17205.541063]  x86_64_start_kernel+0x16a/0x178
[17205.541262]  secondary_startup_64+0x9f/0x9f
[17205.541458] Code: 83 7e 48 00 74 07 48 8b b6 80 01 00 00 8b 86 80 00 
00 00 85 c0 74 14 8d 50 01 f0 0f b1 96 80 00 00 00 0f 94 c2 84 d2 75 04 
eb e8 <0f> ff 49 8b 4c 24 18 48 8d 55 a0 45 31 c0 48 89 df e8 d9 de 95
[17205.541964] ---[ end trace c089174999ff4fc4 ]---
[17205.542165] dst_release: dst:ffff8804041ae100 refcnt:-1

^ permalink raw reply

* [PATCH v2 net-next] net: core: decouple ifalias get/set from rtnl lock
From: Florian Westphal @ 2017-10-02 21:29 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Based on suggestion from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Change since v1:
  - don't use a sequence lock, instead use kmalloc+kfree_rcu.

 include/linux/netdevice.h |  8 ++++++-
 net/core/dev.c            | 58 ++++++++++++++++++++++++++++++++++++-----------
 net/core/net-sysfs.c      | 17 +++++++-------
 net/core/rtnetlink.c      | 13 +++++++++--
 4 files changed, 71 insertions(+), 25 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e1d6ef130611..d04424cfffba 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,11 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct dev_ifalias {
+	struct rcu_head rcuhead;
+	char ifalias[];
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1632,7 +1637,7 @@ enum netdev_priv_flags {
 struct net_device {
 	char			name[IFNAMSIZ];
 	struct hlist_node	name_hlist;
-	char 			*ifalias;
+	struct dev_ifalias	__rcu *ifalias;
 	/*
 	 *	I/O specific fields
 	 *	FIXME: Merge these and struct ifmap into one
@@ -3275,6 +3280,7 @@ void __dev_notify_flags(struct net_device *, unsigned int old_flags,
 			unsigned int gchanges);
 int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
+int dev_get_alias(const struct net_device *, char *, size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..cadd7218a6ae 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -188,6 +188,8 @@ static struct napi_struct *napi_by_id(unsigned int napi_id);
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+static DEFINE_MUTEX(ifalias_mutex);
+
 /* protects napi_hash addition/deletion and napi_gen_id */
 static DEFINE_SPINLOCK(napi_hash_lock);
 
@@ -1265,29 +1267,59 @@ int dev_change_name(struct net_device *dev, const char *newname)
  */
 int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
 {
-	char *new_ifalias;
-
-	ASSERT_RTNL();
+	struct dev_ifalias *new_alias, *old;
 
 	if (len >= IFALIASZ)
 		return -EINVAL;
 
-	if (!len) {
-		kfree(dev->ifalias);
-		dev->ifalias = NULL;
-		return 0;
+	if (len) {
+		new_alias = kmalloc(sizeof(*new_alias) + len + 1, GFP_KERNEL);
+		if (!new_alias)
+			return -ENOMEM;
+	} else {
+		new_alias = NULL;
 	}
 
-	new_ifalias = krealloc(dev->ifalias, len + 1, GFP_KERNEL);
-	if (!new_ifalias)
-		return -ENOMEM;
-	dev->ifalias = new_ifalias;
-	memcpy(dev->ifalias, alias, len);
-	dev->ifalias[len] = 0;
+	mutex_lock(&ifalias_mutex);
+
+	old = rcu_dereference_protected(dev->ifalias,
+					mutex_is_locked(&ifalias_mutex));
+	if (len) {
+		memcpy(new_alias->ifalias, alias, len);
+		new_alias->ifalias[len] = 0;
+	}
+
+	rcu_assign_pointer(dev->ifalias, new_alias);
+	mutex_unlock(&ifalias_mutex);
+
+	if (old)
+		kfree_rcu(old, rcuhead);
 
 	return len;
 }
 
+/**
+ *	dev_get_alias - get ifalias of a device
+ *	@dev: device
+ *	@alias: buffer to store name of ifalias
+ *	@len: size of buffer
+ *
+ *	get ifalias for a device.  Caller must make sure dev cannot go
+ *	away,  e.g. rcu read lock or own a reference count to device.
+ */
+int dev_get_alias(const struct net_device *dev, char *name, size_t len)
+{
+	const struct dev_ifalias *alias;
+	int ret = 0;
+
+	rcu_read_lock();
+	alias = rcu_dereference(dev->ifalias);
+	if (alias)
+		ret = snprintf(name, len, "%s", alias->ifalias);
+	rcu_read_unlock();
+
+	return ret;
+}
 
 /**
  *	netdev_features_change - device changes features
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 927a6dcbad96..51d5836d8fb9 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -391,10 +391,7 @@ static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 	if (len >  0 && buf[len - 1] == '\n')
 		--count;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
 	ret = dev_set_alias(netdev, buf, count);
-	rtnl_unlock();
 
 	return ret < 0 ? ret : len;
 }
@@ -403,13 +400,12 @@ static ssize_t ifalias_show(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
 	const struct net_device *netdev = to_net_dev(dev);
+	char tmp[IFALIASZ];
 	ssize_t ret = 0;
 
-	if (!rtnl_trylock())
-		return restart_syscall();
-	if (netdev->ifalias)
-		ret = sprintf(buf, "%s\n", netdev->ifalias);
-	rtnl_unlock();
+	ret = dev_get_alias(netdev, tmp, sizeof(tmp));
+	if (ret > 0)
+		ret = sprintf(buf, "%s\n", tmp);
 	return ret;
 }
 static DEVICE_ATTR_RW(ifalias);
@@ -1488,7 +1484,10 @@ static void netdev_release(struct device *d)
 
 	BUG_ON(dev->reg_state != NETREG_RELEASED);
 
-	kfree(dev->ifalias);
+	/* no need to wait for rcu grace period:
+	 * device is dead and about to be freed.
+	 */
+	kfree(rcu_access_pointer(dev->ifalias));
 	netdev_freemem(dev);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e6955da0d58d..3961f87cdc76 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1366,6 +1366,16 @@ static int nla_put_iflink(struct sk_buff *skb, const struct net_device *dev)
 	return nla_put_u32(skb, IFLA_LINK, ifindex);
 }
 
+static noinline_for_stack int nla_put_ifalias(struct sk_buff *skb,
+					      struct net_device *dev)
+{
+	char buf[IFALIASZ];
+	int ret;
+
+	ret = dev_get_alias(dev, buf, sizeof(buf));
+	return ret > 0 ? nla_put_string(skb, IFLA_IFALIAS, buf) : 0;
+}
+
 static int rtnl_fill_link_netnsid(struct sk_buff *skb,
 				  const struct net_device *dev)
 {
@@ -1425,8 +1435,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
 	    (dev->qdisc &&
 	     nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
-	    (dev->ifalias &&
-	     nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
+	    nla_put_ifalias(skb, dev) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
-- 
2.13.6

^ permalink raw reply related

* RE: [net-next 11/15] i40evf: Enable VF to request an alternate queue allocation
From: Yuval Mintz @ 2017-10-02 20:50 UTC (permalink / raw)
  To: Jeff Kirsher, davem@davemloft.net
  Cc: Alan Brady, netdev@vger.kernel.org, nhorman@redhat.com,
	sassmann@redhat.com, jogreene@redhat.com
In-Reply-To: <20171002194852.71970-12-jeffrey.t.kirsher@intel.com>

> +	case VIRTCHNL_OP_REQUEST_QUEUES: {
> +		struct virtchnl_vf_res_request *vfres =
> +			(struct virtchnl_vf_res_request *)msg;
> +		if (vfres->num_queue_pairs == adapter->num_req_queues)
> {
> +			adapter->flags |=
> I40EVF_FLAG_REINIT_ITR_NEEDED;
> +			i40evf_schedule_reset(adapter);
> +		} else {
> +			dev_info(&adapter->pdev->dev,
> +				 "Requested %d queues, PF can support
> %d\n",
> +				 adapter->num_req_queues,
> +				 vfres->num_queue_pairs);
> +			adapter->num_req_queues = 0;
> +		}
> +		}
> +		break;

Something is odd about your parenthesis.

>  	default:
>  		if (adapter->current_op && (v_opcode != adapter-
> >current_op))
>  			dev_warn(&adapter->pdev->dev, "Expected
> response %d from PF, received %d\n",
> --
> 2.14.2

^ permalink raw reply

* RE: [PATCH net-next 01/10] net/smc: add missing dev_put
From: Parav Pandit @ 2017-10-02 20:39 UTC (permalink / raw)
  To: Parav Pandit, Ursula Braun, davem@davemloft.net
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-s390@vger.kernel.org, jwi@linux.vnet.ibm.com,
	schwidefsky@de.ibm.com, heiko.carstens@de.ibm.com,
	raspl@linux.vnet.ibm.com
In-Reply-To: <VI1PR0502MB30089D06B9C95D682E21511BD17D0@VI1PR0502MB3008.eurprd05.prod.outlook.com>

> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Parav Pandit
> Sent: Monday, October 02, 2017 3:36 PM
> To: Ursula Braun <ubraun@linux.vnet.ibm.com>; davem@davemloft.net
> Cc: netdev@vger.kernel.org; linux-rdma@vger.kernel.org; linux-
> s390@vger.kernel.org; jwi@linux.vnet.ibm.com; schwidefsky@de.ibm.com;
> heiko.carstens@de.ibm.com; raspl@linux.vnet.ibm.com
> Subject: RE: [PATCH net-next 01/10] net/smc: add missing dev_put
> 
> Hi Ursula, Dave, Hans,
> 
> > -----Original Message-----
> > From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> > owner@vger.kernel.org] On Behalf Of Ursula Braun
> > Sent: Wednesday, September 20, 2017 6:58 AM
> > To: davem@davemloft.net
> > Cc: netdev@vger.kernel.org; linux-rdma@vger.kernel.org; linux-
> > s390@vger.kernel.org; jwi@linux.vnet.ibm.com; schwidefsky@de.ibm.com;
> > heiko.carstens@de.ibm.com; raspl@linux.vnet.ibm.com;
> > ubraun@linux.vnet.ibm.com
> > Subject: [PATCH net-next 01/10] net/smc: add missing dev_put
> >
> > From: Hans Wippel <hwippel@linux.vnet.ibm.com>
> >
> > In the infiniband part, SMC currently uses get_netdev which calls
> > dev_hold on the returned net device. However, the SMC code never calls
> > dev_put on that net device resulting in a wrong reference count.
> >
> > This patch adds a dev_put after the usage of the net device to fix the issue.
> >
> > Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
> > Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
> > ---
> >  net/smc/smc_ib.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c index
> > 547e0e113b17..0b5852299158 100644
> > --- a/net/smc/smc_ib.c
> > +++ b/net/smc/smc_ib.c
> > @@ -380,6 +380,7 @@ static int smc_ib_fill_gid_and_mac(struct
> > smc_ib_device *smcibdev, u8 ibport)
> >  	ndev = smcibdev->ibdev->get_netdev(smcibdev->ibdev, ibport);
> >  	if (ndev) {
> >  		memcpy(&smcibdev->mac, ndev->dev_addr, ETH_ALEN);
> > +		dev_put(ndev);
> 
> I am sorry for providing late comments. smc_ib_fill_gid_and_mac() is not coded
> correctly.
> Few fixes are needed.
> 1. ULP such as SMC should not open code/deference any function pointer like
> get_netdev() of the IB device.
> 2. Replace ib_query_gid(..., NULL)
> With
> ib_query_gid(..., gid_attr);
> 
> Use gid_attr.ndev to get the MAC address.
> Do dev_put(gid_attr.ndev);
> 
> Code should look like below,
> 
> struct ib_gid_attr gid_attr;
> 
> rc = ib_query_gid(..., &gid_attr);
> if (rc || !gid_addr.ndev)
> 	return -ENODEV;
> else
> 	memcpy(smcibdev->mac, ndev->dev_addr, ETH_ALEN);
> dev_put(gid_addr.ndev);
> --

Also,
smc_link_determine_gid() doesn't do dev_put(gattr.ndev) in for loop.
Please fix it as well.

^ permalink raw reply

* Re: [PATCH net-next 2/2] flow_dissector: dissect tunnel info
From: Tom Herbert @ 2017-10-02 20:37 UTC (permalink / raw)
  To: Simon Horman
  Cc: David Miller, Jiri Pirko, Jamal Hadi Salim, Cong Wang,
	Linux Kernel Network Developers, oss-drivers
In-Reply-To: <1506933676-20121-3-git-send-email-simon.horman@netronome.com>

On Mon, Oct 2, 2017 at 1:41 AM, Simon Horman <simon.horman@netronome.com> wrote:
> Move dissection of tunnel info from the flower classifier to the flow
> dissector where all other dissection occurs.  This should not have any
> behavioural affect on other users of the flow dissector.
>
> Signed-off-by: Simon Horman <simon.horman@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  net/core/flow_dissector.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++
>  net/sched/cls_flower.c    |  25 ------------
>  2 files changed, 100 insertions(+), 25 deletions(-)
>
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 0a977373d003..1f5caafb4492 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -5,6 +5,7 @@
>  #include <linux/ipv6.h>
>  #include <linux/if_vlan.h>
>  #include <net/dsa.h>
> +#include <net/dst_metadata.h>
>  #include <net/ip.h>
>  #include <net/ipv6.h>
>  #include <net/gre.h>
> @@ -115,6 +116,102 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
>  }
>  EXPORT_SYMBOL(__skb_flow_get_ports);
>
> +static void
> +skb_flow_dissect_set_enc_addr_type(enum flow_dissector_key_id type,
> +                                  struct flow_dissector *flow_dissector,
> +                                  void *target_container)
> +{
> +       struct flow_dissector_key_control *ctrl;
> +
> +       if (!dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL))
> +               return;
> +
> +       ctrl = skb_flow_dissector_target(flow_dissector,
> +                                        FLOW_DISSECTOR_KEY_ENC_CONTROL,
> +                                        target_container);
> +       ctrl->addr_type = type;
> +}
> +
> +static void
> +__skb_flow_dissect_tunnel_info(const struct sk_buff *skb,
> +                              struct flow_dissector *flow_dissector,
> +                              void *target_container)
> +{
> +       struct ip_tunnel_info *info;
> +       struct ip_tunnel_key *key;
> +
> +       /* A quick check to see if there might be something to do. */
> +       if (!dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_KEYID) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_CONTROL) &&
> +           !dissector_uses_key(flow_dissector,
> +                               FLOW_DISSECTOR_KEY_ENC_PORTS))
> +               return;
> +
> +       info = skb_tunnel_info(skb);
> +       if (!info)
> +               return;
> +
> +       key = &info->key;
> +
> +       switch (ip_tunnel_info_af(info)) {
> +       case AF_INET:
> +               skb_flow_dissect_set_enc_addr_type(FLOW_DISSECTOR_KEY_IPV4_ADDRS,
> +                                                  flow_dissector,
> +                                                  target_container);
> +               if (dissector_uses_key(flow_dissector,
> +                                      FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS)) {
> +                       struct flow_dissector_key_ipv4_addrs *ipv4;
> +
> +                       ipv4 = skb_flow_dissector_target(flow_dissector,
> +                                                        FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS,
> +                                                        target_container);
> +                       ipv4->src = key->u.ipv4.src;
> +                       ipv4->dst = key->u.ipv4.dst;
> +               }
> +               break;
> +       case AF_INET6:
> +               skb_flow_dissect_set_enc_addr_type(FLOW_DISSECTOR_KEY_IPV6_ADDRS,
> +                                                  flow_dissector,
> +                                                  target_container);
> +               if (dissector_uses_key(flow_dissector,
> +                                      FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS)) {
> +                       struct flow_dissector_key_ipv6_addrs *ipv6;
> +
> +                       ipv6 = skb_flow_dissector_target(flow_dissector,
> +                                                        FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS,
> +                                                        target_container);
> +                       ipv6->src = key->u.ipv6.src;
> +                       ipv6->dst = key->u.ipv6.dst;

Simon,

I think I'm missing something fundamental here. This code is
populating flow dissector keys not based on the contents of the packet
like rest of the flow dissector, but on external meta data related to
the packet which I believe is constant during the whole flow
dissection. Why can't this be handled by the caller? Also, if I read
this correctly, this code could be called multiple times and it seems
like it does the exact same thing in each call.

Tom

> +               }
> +               break;
> +       }
> +
> +       if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_KEYID)) {
> +               struct flow_dissector_key_keyid *keyid;
> +
> +               keyid = skb_flow_dissector_target(flow_dissector,
> +                                                 FLOW_DISSECTOR_KEY_ENC_KEYID,
> +                                                 target_container);
> +               keyid->keyid = tunnel_id_to_key32(key->tun_id);
> +       }
> +
> +       if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ENC_PORTS)) {
> +               struct flow_dissector_key_ports *tp;
> +
> +               tp = skb_flow_dissector_target(flow_dissector,
> +                                              FLOW_DISSECTOR_KEY_ENC_PORTS,
> +                                              target_container);
> +               tp->src = key->tp_src;
> +               tp->dst = key->tp_dst;
> +       }
> +}
> +
>  static enum flow_dissect_ret
>  __skb_flow_dissect_mpls(const struct sk_buff *skb,
>                         struct flow_dissector *flow_dissector,
> @@ -478,6 +575,9 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>                                               FLOW_DISSECTOR_KEY_BASIC,
>                                               target_container);
>
> +       __skb_flow_dissect_tunnel_info(skb, flow_dissector,
> +                                      target_container);
> +
>         if (dissector_uses_key(flow_dissector,
>                                FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>                 struct ethhdr *eth = eth_hdr(skb);
> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index d230cb4c8094..db831ac708f6 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -152,37 +152,12 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>         struct cls_fl_filter *f;
>         struct fl_flow_key skb_key;
>         struct fl_flow_key skb_mkey;
> -       struct ip_tunnel_info *info;
>
>         if (!atomic_read(&head->ht.nelems))
>                 return -1;
>
>         fl_clear_masked_range(&skb_key, &head->mask);
>
> -       info = skb_tunnel_info(skb);
> -       if (info) {
> -               struct ip_tunnel_key *key = &info->key;
> -
> -               switch (ip_tunnel_info_af(info)) {
> -               case AF_INET:
> -                       skb_key.enc_control.addr_type =
> -                               FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> -                       skb_key.enc_ipv4.src = key->u.ipv4.src;
> -                       skb_key.enc_ipv4.dst = key->u.ipv4.dst;
> -                       break;
> -               case AF_INET6:
> -                       skb_key.enc_control.addr_type =
> -                               FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> -                       skb_key.enc_ipv6.src = key->u.ipv6.src;
> -                       skb_key.enc_ipv6.dst = key->u.ipv6.dst;
> -                       break;
> -               }
> -
> -               skb_key.enc_key_id.keyid = tunnel_id_to_key32(key->tun_id);
> -               skb_key.enc_tp.src = key->tp_src;
> -               skb_key.enc_tp.dst = key->tp_dst;
> -       }
> -
>         skb_key.indev_ifindex = skb->skb_iif;
>         /* skb_flow_dissect() does not set n_proto in case an unknown protocol,
>          * so do it rather here.
> --
> 2.1.4
>

^ permalink raw reply

* RE: [PATCH net-next 01/10] net/smc: add missing dev_put
From: Parav Pandit @ 2017-10-02 20:36 UTC (permalink / raw)
  To: Ursula Braun, davem@davemloft.net
  Cc: netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-s390@vger.kernel.org, jwi@linux.vnet.ibm.com,
	schwidefsky@de.ibm.com, heiko.carstens@de.ibm.com,
	raspl@linux.vnet.ibm.com
In-Reply-To: <20170920115813.63745-2-ubraun@linux.vnet.ibm.com>

Hi Ursula, Dave, Hans,

> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Ursula Braun
> Sent: Wednesday, September 20, 2017 6:58 AM
> To: davem@davemloft.net
> Cc: netdev@vger.kernel.org; linux-rdma@vger.kernel.org; linux-
> s390@vger.kernel.org; jwi@linux.vnet.ibm.com; schwidefsky@de.ibm.com;
> heiko.carstens@de.ibm.com; raspl@linux.vnet.ibm.com;
> ubraun@linux.vnet.ibm.com
> Subject: [PATCH net-next 01/10] net/smc: add missing dev_put
> 
> From: Hans Wippel <hwippel@linux.vnet.ibm.com>
> 
> In the infiniband part, SMC currently uses get_netdev which calls dev_hold on
> the returned net device. However, the SMC code never calls dev_put on that net
> device resulting in a wrong reference count.
> 
> This patch adds a dev_put after the usage of the net device to fix the issue.
> 
> Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
> ---
>  net/smc/smc_ib.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c index
> 547e0e113b17..0b5852299158 100644
> --- a/net/smc/smc_ib.c
> +++ b/net/smc/smc_ib.c
> @@ -380,6 +380,7 @@ static int smc_ib_fill_gid_and_mac(struct smc_ib_device
> *smcibdev, u8 ibport)
>  	ndev = smcibdev->ibdev->get_netdev(smcibdev->ibdev, ibport);
>  	if (ndev) {
>  		memcpy(&smcibdev->mac, ndev->dev_addr, ETH_ALEN);
> +		dev_put(ndev);

I am sorry for providing late comments. smc_ib_fill_gid_and_mac() is not coded correctly.
Few fixes are needed.
1. ULP such as SMC should not open code/deference any function pointer like get_netdev() of the IB device.
2. Replace ib_query_gid(..., NULL)
With
ib_query_gid(..., gid_attr);

Use gid_attr.ndev to get the MAC address.
Do dev_put(gid_attr.ndev);

Code should look like below,

struct ib_gid_attr gid_attr;

rc = ib_query_gid(..., &gid_attr);
if (rc || !gid_addr.ndev)
	return -ENODEV;
else
	memcpy(smcibdev->mac, ndev->dev_addr, ETH_ALEN);
dev_put(gid_addr.ndev);

^ permalink raw reply

* [PATCH net-next] ibmvnic: Improve output for unsupported stats
From: John Allen @ 2017-10-02 20:31 UTC (permalink / raw)
  To: netdev, Thomas Falcon

The vnic server can report -1 in the event that a given statistic is not
supported. Currently, the -1 value is implicitly cast to an unsigned
integer and appears through the ethtool -S output as a very large number.
This patch improves this behavior by reporting 0 in the event that a
given statistic is not supported.

Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index cb8182f..b8ad2db 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1862,6 +1862,7 @@ static void ibmvnic_get_ethtool_stats(struct net_device *dev,
 {
 	struct ibmvnic_adapter *adapter = netdev_priv(dev);
 	union ibmvnic_crq crq;
+	u64 stat;
 	int i, j;

 	memset(&crq, 0, sizeof(crq));
@@ -1876,9 +1877,11 @@ static void ibmvnic_get_ethtool_stats(struct net_device *dev,
 	ibmvnic_send_crq(adapter, &crq);
 	wait_for_completion(&adapter->stats_done);

-	for (i = 0; i < ARRAY_SIZE(ibmvnic_stats); i++)
-		data[i] = be64_to_cpu(IBMVNIC_GET_STAT(adapter,
-						ibmvnic_stats[i].offset));
+	for (i = 0; i < ARRAY_SIZE(ibmvnic_stats); i++) {
+		stat = be64_to_cpu(IBMVNIC_GET_STAT(adapter,
+						    ibmvnic_stats[i].offset));
+		data[i] = stat == -1 ? 0 : stat;
+	}

 	for (j = 0; j < adapter->req_tx_queues; j++) {
 		data[i] = adapter->tx_stats_buffers[j].packets;

^ permalink raw reply related

* Re: [PATCH net] RDS: IB: Limit the scope of has_fr/has_fmr variables
From: Santosh Shilimkar @ 2017-10-02 20:30 UTC (permalink / raw)
  To: David Miller, avinash.repaka; +Cc: netdev, linux-rdma, rds-devel, linux-kernel
In-Reply-To: <20171001.225655.1261950191787562705.davem@davemloft.net>

On 10/1/2017 10:56 PM, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 01 Oct 2017 22:54:19 -0700 (PDT)
> 
>> From: Avinash Repaka <avinash.repaka@oracle.com>
>> Date: Fri, 29 Sep 2017 18:13:50 -0700
>>
>>> This patch fixes the scope of has_fr and has_fmr variables as they are
>>> needed only in rds_ib_add_one().
>>>
>>> Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
>>
>> Applied.
> 
> Actually, reverted, this breaks the build.
> 
> net/rds/rdma_transport.c:38:10: fatal error: ib.h: No such file or directory
>   #include "ib.h"
> 
> Although I can't see how in the world this patch is causing such
> an error.
> 
Weird indeed. Will sort this out with Avinash. Thanks Dave.

Regards,
Santosh

^ permalink raw reply

* Re: [PATCH iproute2] iproute: build more easily on Android
From: enh @ 2017-10-02 20:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Lorenzo Colitti, netdev
In-Reply-To: <20171002103611.17ccb01f@xeon-e3>

On Mon, Oct 2, 2017 at 10:36 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue,  3 Oct 2017 02:03:37 +0900
> Lorenzo Colitti <lorenzo@google.com> wrote:
>
>> iproute2 contains a bunch of kernel headers, including uapi ones.
>> Android's libc uses uapi headers almost directly, and uses a
>> script to fix kernel types that don't match what userspace
>> expects.
>>
>> For example: https://issuetracker.google.com/36987220 reports
>> that our struct ip_mreq_source contains "__be32 imr_multiaddr"
>> rather than "struct in_addr imr_multiaddr". The script addresses
>> this by replacing the uapi struct definition with a #include
>> <bits/ip_mreq.h> which contains the traditional userspace
>> definition.
>>
>> Unfortunately, when we compile iproute2, this definition
>> conflicts with the one in iproute2's linux/in.h.
>>
>> Historically we've just solved this problem by running "git rm"
>> on all the iproute2 include/linux headers that break Android's
>> libc.  However, deleting the files in this way makes it harder to
>> keep up with upstream, because every upstream change to
>> an include file causes a merge conflict with the delete.
>>
>> This patch fixes the problem by moving the iproute2 linux headers
>> from include/linux to include/uapi/linux.
>>
>> Tested: compiles on ubuntu trusty (glibc)
>>
>> Signed-off-by: Elliott Hughes <enh@google.com>
>> Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
>
> Rather than moving everything, why not make kernel headers directory
> configurable as part of the configure script setup process.

the problem is that C libraries with their our own uapi headers still
need your app-specific headers. to build iproute2 we need to put
iproute2's include/ on our include path, but then the fact that your
different uapi headers are *under* that directory causes the conflict.
separating your headers from your copy of the uapi headers is a
necessary part of any fix.

-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox