* [PATCH net-next 0/7] selftests: Add tests for mirroring to gretap
From: Petr Machata @ 2018-04-26 23:17 UTC
To: netdev; +Cc: davem, linux-mlxsw
This suite tests GRE-encapsulated mirroring. The general topology that
most of the tests use is as follows, but each test defines details of
the topology based on its needs, and some tests actually use a somewhat
different topology.
+---------------------+                      +---------------------+
| H1                  |                      |                  H2 |
|     + $h1           |                      |           $h2 +     |
+-----|---------------+                      +---------------|-----+
      |                                                      |
+-----|------------------------------------------------------|-----+
| SW  o---> mirror                                           |     |
| +---|------------------------------------------------------|---+ |
| |   + $swp1                    BR                    $swp2 +   | |
| +--------------------------------------------------------------+ |
|                                                                  |
|     + $swp3          + gt6 (ip6gretap)    + gt4 (gretap)         |
+-----|----------------:--------------------:----------------------+
      |                :                    :
+-----|----------------:--------------------:----------------------+
|     + $h3            + h3-gt6(ip6gretap)  + h3-gt4 (gretap)      |
|                                H3                                |
+------------------------------------------------------------------+
The following axes of configuration space are tested:
- ingress and egress mirroring
- mirroring triggered by matchall and flower
- mirroring to ipgretap and ip6gretap
- remote tunnel reachable directly or through a next-hop route
- skip_sw as well as skip_hw configurations
Apart from basic tests with the above mentioned features, the following
tests are included:
- handling of changes to neighbors pertinent to routing decisions in
mirrored underlay
- handling of configuration changes at the mirrored-to tunnel (endpoint
addresses, upness)
A suite of mlxsw-specific tests will be part of a separate submission
through linux-mlxsw patch queue.
Petr Machata (7):
selftests: forwarding: Add libs for gretap mirror testing
selftests: forwarding: Add test for mirror to gretap
selftests: forwarding: Test gretap mirror with next-hop remote
selftests: forwarding: Test mirror to gretap w/ bound dev
selftests: forwarding: Test flower mirror to gretap
selftests: forwarding: Test neighbor updates when mirroring to gretap
selftests: forwarding: Test changes in mirror-to-gretap
tools/testing/selftests/net/forwarding/lib.sh | 96 ++++++++++
.../testing/selftests/net/forwarding/mirror_gre.sh | 139 ++++++++++++++
.../selftests/net/forwarding/mirror_gre_bound.sh | 213 +++++++++++++++++++++
.../selftests/net/forwarding/mirror_gre_changes.sh | 194 +++++++++++++++++++
.../selftests/net/forwarding/mirror_gre_flower.sh | 116 +++++++++++
.../selftests/net/forwarding/mirror_gre_lib.sh | 85 ++++++++
.../selftests/net/forwarding/mirror_gre_neigh.sh | 101 ++++++++++
.../selftests/net/forwarding/mirror_gre_nh.sh | 117 +++++++++++
.../net/forwarding/mirror_gre_topo_lib.sh | 129 +++++++++++++
.../testing/selftests/net/forwarding/mirror_lib.sh | 40 ++++
10 files changed, 1230 insertions(+)
create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre.sh
create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_bound.sh
create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_flower.sh
create mode 100644 tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_neigh.sh
create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_nh.sh
create mode 100644 tools/testing/selftests/net/forwarding/mirror_gre_topo_lib.sh
create mode 100644 tools/testing/selftests/net/forwarding/mirror_lib.sh
--
2.4.11
* Re: [PATCH net-next] net/mlx4_en: optimizes get_fixed_ipv6_csum()
From: Saeed Mahameed @ 2018-04-26 22:56 UTC
To: davem@davemloft.net, edumazet@google.com
Cc: netdev@vger.kernel.org, eric.dumazet@gmail.com, Tariq Toukan
In-Reply-To: <20180419154929.25718-1-edumazet@google.com>
On Thu, 2018-04-19 at 08:49 -0700, Eric Dumazet wrote:
> While trying to support CHECKSUM_COMPLETE for IPV6 fragments,
> I had to experiment with various hacks in get_fixed_ipv6_csum().
> I must admit I could not find how to implement this :/
>
> However, get_fixed_ipv6_csum() does a lot of redundant operations,
> calling csum_partial() twice.
>
> First csum_partial() computes the checksum of saddr and daddr,
> put in @csum_pseudo_hdr. Undone later in the second csum_partial()
> computed on whole ipv6 header.
>
> Then nexthdr is added once, added a second time, then subtracted.
>
> payload_len is added once, then subtracted.
>
> Really all this can be reduced to two csum_add() calls, to add back 6 bytes
> that were removed by mlx4 when providing hw_checksum in RX
> descriptor.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
> Note: This patch, like other mlx4 patches can definitely wait
> Tariq approval, thanks !
>
LGTM,
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 21 ++++++++-------------
> 1 file changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 5c613c6663da51a4ae792eeb4d8956b54655786b..38c56fb6e5f5970f245dd56c38e1fc63a9349a07 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -593,30 +593,25 @@ static int get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
> }
>
> #if IS_ENABLED(CONFIG_IPV6)
> -/* In IPv6 packets, besides subtracting the pseudo header checksum,
> - * we also compute/add the IP header checksum which
> - * is not added by the HW.
> +/* In IPv6 packets, hw_checksum lacks 6 bytes from IPv6 header:
> + * 4 first bytes : priority, version, flow_lbl
> + * and 2 additional bytes : nexthdr, hop_limit.
> */
> static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
> struct ipv6hdr *ipv6h)
> {
> __u8 nexthdr = ipv6h->nexthdr;
> - __wsum csum_pseudo_hdr = 0;
> + __wsum temp;
>
> if (unlikely(nexthdr == IPPROTO_FRAGMENT ||
> nexthdr == IPPROTO_HOPOPTS ||
> nexthdr == IPPROTO_SCTP))
> return -1;
> - hw_checksum = csum_add(hw_checksum, (__force __wsum)htons(nexthdr));
>
> - csum_pseudo_hdr = csum_partial(&ipv6h->saddr,
> - sizeof(ipv6h->saddr) + sizeof(ipv6h->daddr), 0);
> - csum_pseudo_hdr = csum_add(csum_pseudo_hdr, (__force __wsum)ipv6h->payload_len);
> - csum_pseudo_hdr = csum_add(csum_pseudo_hdr,
> - (__force __wsum)htons(nexthdr));
> -
> - skb->csum = csum_sub(hw_checksum, csum_pseudo_hdr);
> - skb->csum = csum_add(skb->csum, csum_partial(ipv6h, sizeof(struct ipv6hdr), 0));
> + /* priority, version, flow_lbl */
> + temp = csum_add(hw_checksum, *(__wsum *)ipv6h);
> + /* nexthdr and hop_limit */
> + skb->csum = csum_add(temp, (__force __wsum)*(__be16 *)&ipv6h->nexthdr);
> return 0;
> }
> #endif
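The simplification quoted above rests on ones'-complement addition being commutative and associative: a sum that lacks a few header words can be repaired by adding just those words back. A minimal userspace sketch of that property follows. It assumes simplified stand-ins: csum_add32(), csum_words() and csum_fold16() are hypothetical names, not the kernel's csum_add()/csum_fold(), and byte order is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; these are not the kernel helpers. */

/* 32-bit ones'-complement add: wrap the carry back into the sum. */
uint32_t csum_add32(uint32_t sum, uint32_t addend)
{
	uint32_t res = sum + addend;

	return res + (res < addend);
}

/* Accumulate n 16-bit words into a 32-bit ones'-complement sum. */
uint32_t csum_words(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum = csum_add32(sum, words[i]);
	return sum;
}

/* Fold a 32-bit accumulator down to 16 bits (no final inversion). */
uint16_t csum_fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Given a partial sum that skips three 16-bit words, csum_add32() of that partial sum and the sum of the skipped words folds to the same value as a sum over all the words, which is the shape of the two csum_add() calls in the patch.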
* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 22:52 UTC
To: Michael S. Tsirkin
Cc: Andrew Morton, eric.dumazet, linux-mm, edumazet, netdev, Randy Dunlap,
John Stoffel, linux-kernel, Matthew Wilcox, Michal Hocko,
James Bottomley, dm-devel, David Rientjes,
virtualization, David Miller, Vlastimil Babka
In-Reply-To: <20180427005213-mutt-send-email-mst@kernel.org>
On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
> On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> > How is the user or developer supposed to learn about this option, if
> > he gets no crash at all?
>
> Look in /sys/kernel/debug/fail* ? That actually lets you
> filter by module, process etc.
>
> I think this patch conflates two things:
>
> 1. Make kvmalloc use the vmalloc path.
> This seems a bit narrow.
> What is special about kvmalloc? IMHO nothing - it's yet another user
> of __GFP_NORETRY or __GFP_RETRY_MAYFAIL.
__GFP_RETRY_MAYFAIL makes the allocator retry the costly_order allocations
> As any such user, it either recovers correctly or not.
> So IMHO it's just a case of
> making __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or both
> fail once in a while.
> Seems like a better extension to me than focusing on vmalloc.
> I think you will find more bugs this way.
If the array is <= PAGE_SIZE, vmalloc will not use __GFP_NORETRY. So it
still hides some bugs - such as, if a structure grows above 4k, it would
start randomly crashing due to memory fragmentation.
> 2. Ability to control this from a separate config
> option.
>
> It's still not that clear to me why is this such a
> hard requirement. If a distro wants to force specific
> boot time options, why isn't CONFIG_CMDLINE sufficient?
There are 489 kernel options declared with the __setup keyword. Hardly any
kernel developer notices that a new one was added and selects it when
testing his code.
> But assuming it's important for this kind of
> fault injection to be controlled from
> a dedicated menuconfig option, why not the rest of
> faults?
The injected faults cause damage to the user, so there's no point to
enable them by default. vmalloc fallback should not cause any damage
(assuming that the code is correctly written).
> IMHO if you split 1/2 up, and generalize, the path upstream
> will be much smoother.
This seems like a lost cause. So, let's not care about code correctness and
let's solve crashes only after they are reported. If the upstream wants to
work this way, there's nothing that can be done about it.
I'm wondering if I can still push it to RHEL or not.
> Hope this helps.
>
> --
> MST
Mikulas
* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Paul Moore @ 2018-04-26 22:47 UTC
To: Richard Guy Briggs
Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <20180425004031.zutsno6hvmpq3crd@madcap2.tricolour.ca>
On Tue, Apr 24, 2018 at 8:40 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-04-24 15:01, Paul Moore wrote:
>> On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> > On 2018-04-23 19:15, Paul Moore wrote:
>> >> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > On 2018-04-18 19:47, Paul Moore wrote:
>> >> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> >> > Implement the proc fs write to set the audit container ID of a process,
>> >> >> > emitting an AUDIT_CONTAINER record to document the event.
>> >> >> >
>> >> >> > This is a write from the container orchestrator task to a proc entry of
>> >> >> > the form /proc/PID/containerid where PID is the process ID of the newly
>> >> >> > created task that is to become the first task in a container, or an
>> >> >> > additional task added to a container.
>> >> >> >
>> >> >> > The write expects up to a u64 value (unset: 18446744073709551615).
>> >> >> >
>> >> >> > This will produce a record such as this:
>> >> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>> >> >> >
>> >> >> > The "op" field indicates an initial set. The "pid" to "ses" fields are
>> >> >> > the orchestrator while the "opid" field is the object's PID, the process
>> >> >> > being "contained". Old and new container ID values are given in the
>> >> >> > "contid" fields, while res indicates its success.
>> >> >> >
>> >> >> > It is not permitted to self-set, unset or re-set the container ID. A
>> >> >> > child inherits its parent's container ID, but then can be set only once
>> >> >> > after.
>> >> >> >
>> >> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> >> >> >
>> >> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
>> >> >> > ---
>> >> >> > fs/proc/base.c | 37 ++++++++++++++++++++
>> >> >> > include/linux/audit.h | 16 +++++++++
>> >> >> > include/linux/init_task.h | 4 ++-
>> >> >> > include/linux/sched.h | 1 +
>> >> >> > include/uapi/linux/audit.h | 2 ++
>> >> >> > kernel/auditsc.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> >> > 6 files changed, 143 insertions(+), 1 deletion(-)
>>
>> ...
>>
>> >> >> > /* audit_rule_data supports filter rules with both integer and string
>> >> >> > * fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>> >> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> >> >> > index 4e0a4ac..29c8482 100644
>> >> >> > --- a/kernel/auditsc.c
>> >> >> > +++ b/kernel/auditsc.c
>> >> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
>> >> >> > return rc;
>> >> >> > }
>> >> >> >
>> >> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>> >> >> > +{
>> >> >> > + struct task_struct *parent;
>> >> >> > + u64 pcontainerid, ccontainerid;
>> >> >> > +
>> >> >> > + /* Don't allow to set our own containerid */
>> >> >> > + if (current == task)
>> >> >> > + return -EPERM;
>> >> >>
>> >> >> Why not? Is there some obvious security concern that I am missing?
>> >> >
>> >> > We then lose the distinction in the AUDIT_CONTAINER record between the
>> >> > initiating PID and the target PID. This was outlined in the proposal.
>> >>
>> >> I just went back and reread the v3 proposal and I still don't see a
>> >> good explanation of this. Why is this bad? What's the security
>> >> concern?
>> >
>> > I don't remember, specifically. Maybe this has been addressed by the
>> > check for children/threads or identical parent container ID. So, I'm
>> > reluctantly willing to remove that check for now.
>>
>> Okay. For the record, if someone can explain to me why this
>> restriction saves us from some terrible situation I'm all for leaving
>> it. I'm just opposed to restrictions without solid reasoning behind
>> them.
>>
>> >> > Having said that, I'm still not sure we have protected sufficiently from
>> >> > a child turning around and setting its parent's as yet unset or
>> >> > inherited audit container ID.
>> >>
>> >> Yes, I believe we only want to let a task set the audit container for
>> >> its children (or itself/threads if we decide to allow that, see
>> >> above). There *has* to be a function to check to see if a task is a
>> >> child of a given task ... right? ... although this is likely to be a
>> >> pointer traversal and locking nightmare ... hmmm.
>> >
>> > Isn't that just (struct task_struct)parent == (struct
>> > task_struct)child->parent (or ->real_parent)?
>> >
>> > And now that I say that, it is covered by the following patch's child
>> > check, so as long as we keep that, we should be fine.
>>
>> I was thinking of checking not just current's immediate children, but
>> any of its descendants as I believe that is what we want to limit,
>> yes? I just worry that it isn't really practical to perform that
>> check.
>
> The child check I'm talking about prevents setting a task's audit
> container ID if it *has* any children or threads, so if it has children
> it is automatically disqualified and its grandchildren are irrelevant.
>
>> >> >> I ask because I suppose it might be possible for some container
>> >> >> runtime to do a fork, setup some of the environment and then exec the
>> >> >> container (before you answer the obvious "namespaces!" please remember
>> >> >> we're not trying to define containers).
>> >> >
>> >> > I don't think namespaces have any bearing on this concern since none are
>> >> > required.
>> >> >
>> >> >> > + /* Don't allow the containerid to be unset */
>> >> >> > + if (!cid_valid(containerid))
>> >> >> > + return -EINVAL;
>> >> >> > + /* if we don't have caps, reject */
>> >> >> > + if (!capable(CAP_AUDIT_CONTROL))
>> >> >> > + return -EPERM;
>> >> >> > + /* if containerid is unset, allow */
>> >> >> > + if (!audit_containerid_set(task))
>> >> >> > + return 0;
>> >> >> > + /* it is already set, and not inherited from the parent, reject */
>> >> >> > + ccontainerid = audit_get_containerid(task);
>> >> >> > + rcu_read_lock();
>> >> >> > + parent = rcu_dereference(task->real_parent);
>> >> >> > + rcu_read_unlock();
>> >> >> > + task_lock(parent);
>> >> >> > + pcontainerid = audit_get_containerid(parent);
>> >> >> > + task_unlock(parent);
>> >> >> > + if (ccontainerid != pcontainerid)
>> >> >> > + return -EPERM;
>> >> >> > + return 0;
>>
>> I'm looking at the parent checks again and I wonder if the logic above
>> is what we really want. Maybe it is, but I'm not sure.
>>
>> Things I'm wondering about:
>>
>> * "ccontainerid" and "containerid" are too close in name, I kept
>> confusing myself when looking at this code. Please change one. Bonus
>> points if it is shorter.
>
> Would c_containerid and p_containerid be ok? child_cid and parent_cid?
Either would be an improvement over ccontainerid/containerid. I would
give a slight nod to child_cid/parent_cid just for length reasons.
> I'd really like it to have the same root as the parameter handed in so
> the code is easier to follow. It would be nice to have that across
> caller to local, but that's challenging.
That's fine, but you have to admit that ccontainerid/containerid is
awkward and not easy to quickly differentiate :)
> I've been tempted to use contid or even cid everywhere instead of
> containerid. Perhaps the longer name doesn't bother me because I
> like its uniqueness and I learned touch-typing in grade 9 and I like
> 100+ character wide terminals? ;-)
I would definitely appreciate contid/cid or similar, but I don't care
too much either way. As far as terminal width is concerned, please
make sure your code fits in 80 char terminals.
>> * What if the orchestrator wants to move the task to a new container?
>> Right now it looks like you can only do that once, then the
>> task's audit container ID will no longer be the same as real_parent
>
> A task's audit container ID can be unset or inherited, and then set
> only once. After that, if you want it moved to a new container you
> can't and your only option is to spawn another peer to that task or a
> child of it and set that new task's audit container ID.
Okay. We've had so many discussions about this both on and off list
that I lose track of where we stand for certain things. I think
preventing task movement is fine for the initial effort so long as we
don't prevent adding it in the future; I don't see anything (other
than the permission checks under discussion, which is fine) preventing
this.
> Currently, the method of detecting if its audit container ID has been
> set (rather than inherited) was to check its parent's audit container
> ID.
Yeah ... those are two different things. I've been wondering if we
should introduce a set/inherited flag as simply checking the parent
task's audit container ID isn't quite the same; although it may be
"close enough" that it doesn't matter in practice. However, I'm
beginning to think this parent/child relationship isn't really
important beyond the inheritance issue ... more on this below.
> The only reason to change this might be if the audit container ID
> were not inheritable, but then we lose the accountability of a task
> spawning another process and being able to leave its child's audit
> container ID unset and unaccountable to any existing container. I think
> the relationship to the parent is crucial, and if something wants to
> change audit container ID it can, by spawning children and leaving a
> trail of container IDs in its parent processes. (So what if a parent
> dies?)
The audit container ID *must* be inherited, I don't really think
anyone is questioning that. What I'm wondering about is what we
accomplish by comparing the child's and parent's audit container ID?
I've thought about this a bit more and I think we are making this way
too complicated right now. We basically have three rules for the
audit container ID which we need to follow:
1. Children inherit their parent's audit container ID; this includes
the magic "unset" audit container ID.
2. You can't change the audit container ID once set.
3. In order to set the audit container ID of a process you must have
CAP_AUDIT_CONTROL.
With that in mind, I think the permission checks would be something like this:
[SIDE NOTE: Audit Container ID in acronym form works out to "acid" ;) ]
int perm(task, acid)
{
if (!task || !valid(acid))
return -EINVAL;
if (!capable(CAP_AUDIT_CONTROL))
return -EPERM;
if (task->acid != UNSET)
return -EPERM;
return 0;
}
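The three rules and the check above can be exercised in a small userspace model. This is only a sketch under stated assumptions: struct task_model, ACID_UNSET, task_model_fork() and acid_set() are hypothetical names, the capability test is reduced to a boolean argument standing in for capable(CAP_AUDIT_CONTROL), and locking and audit logging are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not kernel definitions. */
#define ACID_UNSET UINT64_MAX	/* the "unset" value, 18446744073709551615 */

struct task_model {
	uint64_t acid;		/* audit container ID */
};

/* Rule 1: a child inherits its parent's ID, including "unset". */
void task_model_fork(const struct task_model *parent,
		     struct task_model *child)
{
	child->acid = parent->acid;
}

/*
 * Rules 2 and 3, with the permission checks in the same function as
 * the assignment. has_cap models capable(CAP_AUDIT_CONTROL).
 */
int acid_set(struct task_model *task, uint64_t acid, bool has_cap)
{
	if (!task || acid == ACID_UNSET)
		return -EINVAL;
	if (!has_cap)
		return -EPERM;
	if (task->acid != ACID_UNSET)	/* already set: no changes */
		return -EPERM;
	task->acid = acid;
	return 0;
}
```

In this model a forked child of an unset parent starts unset, a first acid_set() with the capability succeeds, and any later attempt returns -EPERM, with no comparison against the parent's ID needed.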
>> ... or does the orchestrator change that? *Can* the orchestrator
>> change real_parent (I suspect the answer is "no")?
>
> I don't think the orchestrator is able to change real_parent.
I didn't think so either, but I didn't do an exhaustive check.
> I've forgotten why there is a ->parent and ->real_parent and how they can
> change. One is for the wait signal. I don't remember the purpose of
> the other.
I know ptrace makes use of real_parent when re-parenting the process
being ptrace'd.
> If the parent dies before the child, the child will be re-parented on
> its grandparent if the parent doesn't hang around zombified, if I
> understand correctly. If anything, a parent dying would likely further
> restrict the ability to set a task's audit container ID because a parent
> with an identical ID could vanish.
All the more reason to go with the simplified approach above. I think
the parent/child relationship is a bit of a distraction and a
complexity that isn't important (except for the inheritance of
course).
>> * I think the key is the relationship between current and task, not
>> between task and task->real_parent. I believe what we really care
>> about is that task is a descendant of current. We might also want to
>> allow current to change the audit container ID if it holds
>> CAP_AUDIT_CONTROL, regardless of its relationship with task.
>
> Currently, a process with CAP_AUDIT_CONTROL can set the audit container
> ID of any task that hasn't got children or threads, isn't itself, and
> its audit container ID is inherited or unset. This was to try to
> prevent games with parents and children scratching each other's backs.
>
> I would feel more comfortable if only descendants were settable, so
> adding that restriction sounds like a good idea to me other than the
> tree-climbing exercise and overhead involved.
>
>> >> >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
>> >> >> > + u64 containerid, int rc)
>> >> >> > +{
>> >> >> > + struct audit_buffer *ab;
>> >> >> > + uid_t uid;
>> >> >> > + struct tty_struct *tty;
>> >> >> > +
>> >> >> > + if (!audit_enabled)
>> >> >> > + return;
>> >> >> > +
>> >> >> > + ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
>> >> >> > + if (!ab)
>> >> >> > + return;
>> >> >> > +
>> >> >> > + uid = from_kuid(&init_user_ns, task_uid(current));
>> >> >> > + tty = audit_get_tty(current);
>> >> >> > +
>> >> >> > + audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
>> >> >> > + audit_log_task_context(ab);
>> >> >> > + audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
>> >> >> > + from_kuid(&init_user_ns, audit_get_loginuid(current)),
>> >> >> > + tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
>> >> >> > + task_tgid_nr(task), oldcontainerid, containerid, !rc);
>> >> >> > +
>> >> >> > + audit_put_tty(tty);
>> >> >> > + audit_log_end(ab);
>> >> >> > +}
>> >> >> > +
>> >> >> > +/**
>> >> >> > + * audit_set_containerid - set current task's audit_context containerid
>> >> >> > + * @containerid: containerid value
>> >> >> > + *
>> >> >> > + * Returns 0 on success, -EPERM on permission failure.
>> >> >> > + *
>> >> >> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
>> >> >> > + */
>> >> >> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
>> >> >> > +{
>> >> >> > + u64 oldcontainerid;
>> >> >> > + int rc;
>> >> >> > +
>> >> >> > + oldcontainerid = audit_get_containerid(task);
>> >> >> > +
>> >> >> > + rc = audit_set_containerid_perm(task, containerid);
>> >> >> > + if (!rc) {
>> >> >> > + task_lock(task);
>> >> >> > + task->containerid = containerid;
>> >> >> > + task_unlock(task);
>> >> >> > + }
>> >> >> > +
>> >> >> > + audit_log_set_containerid(task, oldcontainerid, containerid, rc);
>> >> >> > + return rc;
>> >> >>
>> >> >> Why are audit_set_containerid_perm() and audit_log_containerid()
>> >> >> separate functions?
>> >> >
>> >> > (I assume you mean audit_log_set_containerid()?)
>> >>
>> >> Yep. My fingers got tired typing in that function name and decided a
>> >> shortcut was necessary.
>> >>
>> >> > It seemed clearer that all the permission checking was in one function
>> >> > and its return code could be used to report the outcome when logging the
>> >> > (attempted) action. This is the same structure as audit_set_loginuid()
>> >> > and it made sense.
>> >>
>> >> When possible I really like it when the permission checks are in the
>> >> same function as the code which does the work; it's less likely to get
>> >> abused that way (you have to willfully bypass the access checks). The
>> >> exceptions might be if you wanted to reuse the access control code, or
>> >> insert a modular access mechanism (e.g. LSMs).
>> >
>> > I don't follow how it could be abused. The return code from the perm
>> > check gates setting the value and is used in the success field in the
>> > log.
>>
>> If the permission checks are in the same function body as the code
>> which does the work you have to either split the function, or rewrite
>> it, if you want to bypass the permission checks. It may be more of a
>> style issue than an actual safety issue, but the comments about
>> single-use functions in the same scope is the tie breaker.
>
> Perhaps I'm just being quite dense, but I just don't follow what the
> problem is and how you suggest fixing it. A bunch of gotos to a label
> such as "out:" to log the refused action? That seems messy and
> unstructured.
Fold audit_set_containerid_perm() and audit_log_set_containerid() into
their only caller, audit_set_containerid().
--
paul moore
www.paul-moore.com
* Re: [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
From: Oleg Babin @ 2018-04-26 22:45 UTC
To: Marcelo Ricardo Leitner
Cc: netdev, linux-sctp, David S. Miller, Vlad Yasevich, Neil Horman,
Xin Long, Andrey Ryabinin
In-Reply-To: <20180426222814.GA10301@localhost.localdomain>
On 04/27/2018 01:28 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Apr 27, 2018 at 01:14:56AM +0300, Oleg Babin wrote:
>> Hi Marcelo,
>>
>> On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
>>> Hi,
>>>
>>> On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
>>>> Each SCTP association can have up to 65535 input and output streams.
>>>> For each stream type an array of sctp_stream_in or sctp_stream_out
>>>> structures is allocated using kmalloc_array() function. This function
>>>> allocates physically contiguous memory regions, so this can lead
>>>> to allocation of memory regions of very high order, i.e.:
>>>>
>>>> sizeof(struct sctp_stream_out) == 24,
>>>> ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
>>>> which means 9th memory order.
>>>>
>>>> This can lead to memory allocation failures on systems
>>>> under memory stress.
>>>
>>> Did you do performance tests while actually using these 65k streams
>>> and with 256 (so it gets 2 pages)?
>>>
>>> This will introduce another deref on each access to an element, but
>>> I'm not expecting any impact due to it.
>>>
>>
>> No, I didn't do such tests. Could you please tell me what methodology
>> do you usually use to measure performance properly?
>>
>> I'm trying to do measurements with iperf3 on unmodified kernel and get
>> very strange results like this:
> ...
>
> I've been trying to fight this fluctuation for some time now but
> couldn't really fix it yet. One thing that usually helps (quite a lot)
> is increasing the socket buffer sizes and/or using smaller messages,
> so there is more cushion in the buffers.
>
> What I have seen in my tests is that when it floats like this, it is
> because the socket buffers float between 0 and full and don't get into a
> steady state. I believe this is because the socket buffer size is used
> for limiting the amount of memory used by the socket, instead of being
> the amount of payload that the buffer can hold. This causes some
> discrepancy, especially because in SCTP we don't defrag the buffer (as
> TCP does, it's the collapse operation), and the announced rwnd may
> turn up being a lie in the end, which triggers rx drops, then tx cwnd
> reduction, and so on. SCTP min_rto of 1s also doesn't help much on
> this situation.
>
> On netperf, you may use -S 200000,200000 -s 200000,200000. That should
> help it.
>
Thank you very much! I'll try this and get back with results later.
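For reference, the page and order arithmetic from the patch description can be checked with a short sketch. The helper names pages_needed() and alloc_order() are hypothetical and this is plain userspace C, not the kernel's get_order(). It rounds the page count up, so it reports 384 pages rather than the truncated 383; the order comes out as 9 either way.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only. */

/* Pages needed for an array of nmemb elements of the given size. */
size_t pages_needed(size_t nmemb, size_t size, size_t page_size)
{
	return (nmemb * size + page_size - 1) / page_size;
}

/* Smallest order such that (1 << order) pages cover the request. */
unsigned int alloc_order(size_t pages)
{
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

With 65535 streams of 24 bytes and 4096-byte pages this gives 384 pages and order 9, while 256 streams need 2 pages, which is order 1.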
--
Best regards,
Oleg
* Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
From: Qing Huang @ 2018-04-26 22:37 UTC
To: Saeed Mahameed, Eran Ben Elisha
Cc: linux-kernel, RDMA mailing list, Linux Netdev List,
Leon Romanovsky, Matan Barak, Saeed Mahameed
In-Reply-To: <CALzJLG8=dnyLGRXamLR5-eYJ9Q8WCwKe44sHHZ57=1h_x0RCRA@mail.gmail.com>
On 04/26/2018 02:50 PM, Saeed Mahameed wrote:
> On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang <qing.huang@oracle.com> wrote:
>> Current stats collecting scheme in mlx5 driver is to periodically fetch
>> aggregated stats from all the active mlx5 software channels associated
>> with the device. However when a mlx5 interface is brought down(ifdown),
>> all the channels will be deactivated and closed. A new set of channels
>> will be created when next ifup command or a similar command is called.
>> Unfortunately the new channels will have all stats reset to 0. So you
>> lose the accumulated stats information. This behavior is different from
>> other netdev drivers including the mlx4 driver. In order to fix it, we
>> now save prior mlx5 software stats into netdev stats fields, so all the
>> accumulated stats will survive multiple runs of ifdown/ifup commands and
>> be shown correctly.
>>
>> Orabug: 27548610
>>
>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>> ---
> Hi Qing,
>
> I am adding Eran since he is currently working on a similar patch,
> He is also taking care of all cores/rings stats to make them
> persistent, so you won't have discrepancy between
> ethtool and ifconfig stats.
>
> I am ok with this patch, but this means Eran has to work his way around it.
>
> so we have two options:
>
> 1. Temporarily accept this patch, and change it later with Eran's work.
> 2. Wait for Eran's work.
>
> I am ok with either one of them, please let me know.
>
> Thanks !
Hi Saeed,
Any idea on a rough ETA for Eran's stats work to land upstream? If it will
be available soon, I think we can wait a bit. If it will take a while to
redesign the whole stats scheme (for both ethtool and ifconfig), maybe we
can go with this incremental fix first?
Thanks!
>
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++++++++++++++++++----
>> 1 file changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index f1fe490..5d50e69 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
>> netdev_set_tc_queue(netdev, tc, nch, 0);
>> }
>>
>> +static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
>> +{
>> + struct net_device *netdev = priv->netdev;
>> +
>> + netdev->stats.rx_packets += priv->stats.sw.rx_packets;
>> + netdev->stats.rx_bytes += priv->stats.sw.rx_bytes;
>> + netdev->stats.tx_packets += priv->stats.sw.tx_packets;
>> + netdev->stats.tx_bytes += priv->stats.sw.tx_bytes;
>> + netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
>> +
>> + priv->stats.sw.rx_packets = 0;
>> + priv->stats.sw.rx_bytes = 0;
>> + priv->stats.sw.tx_packets = 0;
>> + priv->stats.sw.tx_bytes = 0;
>> + priv->stats.sw.tx_queue_dropped = 0;
>> +}
>> +
> This means that we are now explicitly clearing channel stats on
> ifconfig down or switch_channels,
> and after ifconfig down, ethtool will always show 0; before this
> patch it didn't.
> Anyway, the update sw stats function will always override them with the new
> channels' stats the next time we load new channels,
> so it is not that big of a deal.
>
>
>> static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
>> {
>> struct mlx5e_channel *c;
>> @@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
>> netif_set_real_num_tx_queues(netdev, new_num_txqs);
>>
>> mlx5e_deactivate_priv_channels(priv);
>> + mlx5e_netdev_save_stats(priv);
>> mlx5e_close_channels(&priv->channels);
>>
>> priv->channels = *new_chs;
>> @@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
>>
>> netif_carrier_off(priv->netdev);
>> mlx5e_deactivate_priv_channels(priv);
>> + mlx5e_netdev_save_stats(priv);
>> mlx5e_close_channels(&priv->channels);
>>
>> return 0;
>> @@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
>> stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
>> stats->tx_bytes = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
>> } else {
>> - stats->rx_packets = sstats->rx_packets;
>> - stats->rx_bytes = sstats->rx_bytes;
>> - stats->tx_packets = sstats->tx_packets;
>> - stats->tx_bytes = sstats->tx_bytes;
>> - stats->tx_dropped = sstats->tx_queue_dropped;
>> + stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
>> + stats->rx_bytes = sstats->rx_bytes + dev->stats.rx_bytes;
>> + stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
>> + stats->tx_bytes = sstats->tx_bytes + dev->stats.tx_bytes;
>> + stats->tx_dropped = sstats->tx_queue_dropped +
>> + dev->stats.tx_dropped;
>> }
>>
>> stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
>> --
>> 1.8.3.1
>>
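For readers following the approach in the quoted patch, here is a minimal
user-space sketch of the save-and-sum idea. It is not the driver code:
struct dev_model, save_stats() and report_stats() are hypothetical names,
with "saved" standing in for netdev->stats and "live" for priv->stats.sw,
and only two counters are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down counter set (the patch also tracks bytes and drops). */
struct counters {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

struct dev_model {
	struct counters saved; /* survives channel teardown */
	struct counters live;  /* belongs to the currently open channels */
};

/* Called just before the channels are closed: fold live counters into saved. */
void save_stats(struct dev_model *d)
{
	assert(d);
	d->saved.rx_packets += d->live.rx_packets;
	d->saved.tx_packets += d->live.tx_packets;
	d->live.rx_packets = 0;
	d->live.tx_packets = 0;
}

/* What a stats query reports: history plus the current channels. */
struct counters report_stats(const struct dev_model *d)
{
	struct counters out;

	assert(d);
	out.rx_packets = d->saved.rx_packets + d->live.rx_packets;
	out.tx_packets = d->saved.tx_packets + d->live.tx_packets;
	return out;
}
```

Calling save_stats() twice in a row is harmless in this model because the
live counters are zeroed after the first call, which is the same reason the
patch clears priv->stats.sw after saving.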
^ permalink raw reply
* Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-26 22:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Sridhar Samudrala, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh, jiri, aaron.f.brown
In-Reply-To: <20180426052908-mutt-send-email-mst@kernel.org>
On Thu, 26 Apr 2018 05:30:05 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote:
> > On Wed, 25 Apr 2018 16:59:28 -0700
> > Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >
> > > Use the registration/notification framework supported by the generic
> > > failover infrastructure.
> > >
> > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >
> > NAK unless you prove this works on legacy distributions and with DPDK 18.05
> > without modification.
>
> It looks like it should work. What kind of proof are you looking for?
>
I tried this with working Ubuntu 17 on WS2016.
It boots if the failover driver is configured in (as module).
But if the configuration has:
$ grep FAILOVER .config
# CONFIG_NET_FAILOVER is not set
CONFIG_MAY_USE_NET_FAILOVER=y
The netvsc driver fails on boot with:
[ 0.826447] hv_vmbus: registering driver hv_netvsc
[ 0.829616] scsi 0:0:0:0: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 5
[ 0.836291] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input1
[ 0.839139] hid-generic 0006:045E:0621.0001: input: <UNKNOWN> HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on
[ 0.964897] hv_vmbus: probe failed for device 849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95)
[ 0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed with error -95
[ 1.112877] hv_vmbus: probe failed for device 53557f8e-057d-425b-9265-01c0fd7e273e (-95)
[ 1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed with error -95
The system has two virtual networks. eth0 is on vswitch for management.
eth1 is on vswitch with SR-IOV for performance tests.
You probably need to just put the failover part in net/core and select it.
It is trivial to get an evaluation version of Windows Server 2016 and set up a Linux VM.
Please try it.
^ permalink raw reply
* Re: [PATCH v2] net: qrtr: Expose tunneling endpoint to user space
From: Chris Lew @ 2018-04-26 22:29 UTC (permalink / raw)
To: Bjorn Andersson, David S. Miller; +Cc: linux-kernel, netdev, linux-arm-msm
In-Reply-To: <20180423214653.10016-1-bjorn.andersson@linaro.org>
On 4/23/2018 2:46 PM, Bjorn Andersson wrote:
> This implements a misc character device named "qrtr-tun" for the purpose
> of allowing user space applications to implement endpoints in the qrtr
> network.
>
> This allows more advanced (and dynamic) testing of the qrtr code as well
> as opens up the ability of tunneling qrtr over a network or USB link.
>
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Chris Lew <clew@codeaurora.org>
> +static ssize_t qrtr_tun_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + struct file *filp = iocb->ki_filp;
> + struct qrtr_tun *tun = filp->private_data;
> + struct sk_buff *skb;
> + int count;
> +
> + while (!(skb = skb_dequeue(&tun->queue))) {
> + if (filp->f_flags & O_NONBLOCK)
> + return -EAGAIN;
> +
> + /* Wait until we get data or the endpoint goes away */
> + if (wait_event_interruptible(tun->readq,
> + !skb_queue_empty(&tun->queue)))
> + return -ERESTARTSYS;
> + }
> +
> + count = min_t(size_t, iov_iter_count(to), skb->len);
> + if (copy_to_iter(skb->data, count, to) != count)
> + count = -EFAULT;
> +
> + kfree_skb(skb);
Is it better to use consume_skb() since this is the expected behavior path?
> +
> + return count;
> +}
> +
Thanks,
Chris
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
^ permalink raw reply
* Re: [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
From: Marcelo Ricardo Leitner @ 2018-04-26 22:28 UTC (permalink / raw)
To: Oleg Babin
Cc: netdev, linux-sctp, David S. Miller, Vlad Yasevich, Neil Horman,
Xin Long, Andrey Ryabinin
In-Reply-To: <27adcf09-830d-48cb-34ab-aaabffa2b202@virtuozzo.com>
On Fri, Apr 27, 2018 at 01:14:56AM +0300, Oleg Babin wrote:
> Hi Marcelo,
>
> On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
> > Hi,
> >
> > On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
> >> Each SCTP association can have up to 65535 input and output streams.
> >> For each stream type an array of sctp_stream_in or sctp_stream_out
> >> structures is allocated using kmalloc_array() function. This function
> >> allocates physically contiguous memory regions, so this can lead
> >> to allocation of memory regions of very high order, i.e.:
> >>
> >> sizeof(struct sctp_stream_out) == 24,
> >> ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
> >> which means 9th memory order.
> >>
> >> This can lead to memory allocation failures on systems
> >> under memory stress.
> >
> > Did you do performance tests while actually using these 65k streams
> > and with 256 (so it gets 2 pages)?
> >
> > This will introduce another deref on each access to an element, but
> > I'm not expecting any impact due to it.
> >
>
> No, I didn't do such tests. Could you please tell me what methodology
> you usually use to measure performance properly?
>
> I'm trying to do measurements with iperf3 on unmodified kernel and get
> very strange results like this:
...
I've been trying to fight this fluctuation for some time now but
couldn't really fix it yet. One thing that usually helps (quite a lot)
is increasing the socket buffer sizes and/or using smaller messages,
so there is more cushion in the buffers.
What I have seen in my tests is that when it fluctuates like this, it is
because the socket buffers swing between empty and full and don't reach a
steady state. I believe this is because the socket buffer size is used
for limiting the amount of memory used by the socket, instead of being
the amount of payload that the buffer can hold. This causes some
discrepancy, especially because in SCTP we don't defrag the buffer (as
TCP does; it's the collapse operation), and the announced rwnd may
turn out to be a lie in the end, which triggers rx drops, then tx cwnd
reduction, and so on. The SCTP min_rto of 1s also doesn't help much in
this situation.
On netperf, you may use -S 200000,200000 -s 200000,200000. That should
help it.
Cheers,
Marcelo
^ permalink raw reply
* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michael S. Tsirkin @ 2018-04-26 22:21 UTC (permalink / raw)
To: Mikulas Patocka
Cc: John Stoffel, James Bottomley, Michal, eric.dumazet, netdev,
jasowang, Randy Dunlap, linux-kernel, Matthew Wilcox, Hocko,
linux-mm, dm-devel, Vlastimil Babka, Andrew, David Rientjes,
Morton, virtualization, David Miller, edumazet
In-Reply-To: <alpine.LRH.2.02.1804261726540.13401@file01.intranet.prod.int.rdu2.redhat.com>
On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> How is the user or developer supposed to learn about this option, if
> he gets no crash at all?
Look in /sys/kernel/debug/fail* ? That actually lets you
filter by module, process etc.
I think this patch conflates two things:
1. Make kvmalloc use the vmalloc path.
This seems a bit narrow.
What is special about kvmalloc? IMHO nothing - it's yet another user
of __GFP_NORETRY or __GFP_RETRY_MAYFAIL. As any such
user, it either recovers correctly or not.
So IMHO it's just a case of making __GFP_NORETRY, __GFP_RETRY_MAYFAIL,
or both fail once in a while.
Seems like a better extension to me than focusing on vmalloc.
I think you will find more bugs this way.
2. Ability to control this from a separate config
option.
It's still not that clear to me why this is such a
hard requirement. If a distro wants to force specific
boot time options, why isn't CONFIG_CMDLINE sufficient?
But assuming it's important for this kind of
fault injection to be controlled from
a dedicated menuconfig option, why not the rest of
the faults?
IMHO if you split 1/2 up, and generalize, the path upstream
will be much smoother.
Hope this helps.
--
MST
^ permalink raw reply
* Re: [Patch nf] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()
From: Pablo Neira Ayuso @ 2018-04-26 22:21 UTC (permalink / raw)
To: Simon Horman
Cc: Julian Anastasov, Cong Wang, netdev, lvs-devel, netfilter-devel
In-Reply-To: <20180426121436.fscassmhtiju2wwi@verge.net.au>
On Thu, Apr 26, 2018 at 02:14:36PM +0200, Simon Horman wrote:
> On Tue, Apr 24, 2018 at 08:17:06AM +0300, Julian Anastasov wrote:
> >
> > Hello,
> >
> > On Mon, 23 Apr 2018, Cong Wang wrote:
> >
> > > Similarly, tbl->entries is not initialized after kmalloc(),
> > > therefore causes an uninit-value warning in ip_vs_lblc_check_expire(),
> > > as reported by syzbot.
> > >
> > > Reported-by: <syzbot+3e9695f147fb529aa9bc@syzkaller.appspotmail.com>
> > > Cc: Simon Horman <horms@verge.net.au>
> > > Cc: Julian Anastasov <ja@ssi.bg>
> > > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> >
> > Thanks!
> >
> > Acked-by: Julian Anastasov <ja@ssi.bg>
>
> Thanks.
>
> Pablo, could you take this into nf?
>
> Acked-by: Simon Horman <horms@verge.net.au>
Done, thanks.
^ permalink raw reply
* Re: [Patch nf] ipvs: initialize tbl->entries after allocation
From: Pablo Neira Ayuso @ 2018-04-26 22:21 UTC (permalink / raw)
To: Simon Horman
Cc: Julian Anastasov, Cong Wang, netdev, lvs-devel, netfilter-devel
In-Reply-To: <20180426121423.5c7iy2ddhjy2clzf@verge.net.au>
On Thu, Apr 26, 2018 at 02:14:25PM +0200, Simon Horman wrote:
> On Tue, Apr 24, 2018 at 08:16:14AM +0300, Julian Anastasov wrote:
> >
> > Hello,
> >
> > On Mon, 23 Apr 2018, Cong Wang wrote:
> >
> > > tbl->entries is not initialized after kmalloc(), therefore
> > > causes an uninit-value warning in ip_vs_lblc_check_expire()
> > > as reported by syzbot.
> > >
> > > Reported-by: <syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com>
> > > Cc: Simon Horman <horms@verge.net.au>
> > > Cc: Julian Anastasov <ja@ssi.bg>
> > > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> >
> > Thanks!
> >
> > Acked-by: Julian Anastasov <ja@ssi.bg>
>
> Thanks.
>
> Pablo, could you take this into nf?
>
> Acked-by: Simon Horman <horms@verge.net.au>
Done, thanks Simon.
^ permalink raw reply
* Re: [GIT PULL 0/5] IPVS Updates for v4.18
From: Pablo Neira Ayuso @ 2018-04-26 22:19 UTC (permalink / raw)
To: Simon Horman
Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
Julian Anastasov
In-Reply-To: <20180419085614.7437-1-horms@verge.net.au>
On Thu, Apr 19, 2018 at 10:56:09AM +0200, Simon Horman wrote:
> Hi Pablo,
>
> please consider these IPVS enhancements for v4.18.
>
> * Whitespace cleanup
>
> * Add Maglev hashing algorithm as an IPVS scheduler
>
> Inju Song says "Implements the Google's Maglev hashing algorithm as a
> IPVS scheduler. Basically it provides consistent hashing but offers some
> special features about disruption and load balancing.
>
> 1) minimal disruption: when the set of destinations changes,
> a connection will likely be sent to the same destination
> as it was before.
>
> 2) load balancing: each destination will receive an almost
> equal number of connections.
>
> See also: [3.4 Consistent Hashing] in
> https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf
> "
>
> * Fix to correct implementation of Knuth's multiplicative hashing
> which is used in sh/dh/lblc/lblcr algorithms. Instead the
> implementation provided by the hash_32() macro is used.
Pulled, thanks Simon.
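For context on the last item: as far as I recall from include/linux/hash.h,
hash_32() multiplies the value by a 32-bit golden-ratio constant and keeps
the top "bits" bits of the product. Below is a user-space sketch of that
idea; hash_32_sketch() is a hypothetical name, and the constant is the
value I believe the kernel uses for GOLDEN_RATIO_32, so treat both as
assumptions rather than a copy of the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the kernel's GOLDEN_RATIO_32. */
#define GOLDEN_RATIO_32 0x61C88647u

/*
 * Multiplicative hashing: multiply modulo 2^32, then keep the high bits,
 * which are the best-mixed bits of the product.
 */
uint32_t hash_32_sketch(uint32_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}
```

Taking the high bits is the point of the fix mentioned above: the low bits
of a multiplicative hash mix poorly, so masking them off to index a table
gives a weaker distribution than shifting down from the top.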
^ permalink raw reply
* Re: [PATCH net-next 1/2] net/sctp: Make wrappers for accessing in/out streams
From: Oleg Babin @ 2018-04-26 22:19 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: netdev, linux-sctp, David S. Miller, Vlad Yasevich, Neil Horman,
Xin Long, Andrey Ryabinin
In-Reply-To: <20180423213331.GH3711@localhost.localdomain>
On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
> On Mon, Apr 23, 2018 at 09:41:05PM +0300, Oleg Babin wrote:
>> This patch introduces wrappers for accessing in/out streams indirectly.
>> This will make it possible to replace physically contiguous memory arrays
>> of streams with flexible arrays (or maybe any other appropriate
>> mechanism) which do memory allocation on a per-page basis.
>>
>> Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
>> ---
>> include/net/sctp/structs.h | 30 +++++++-----
>> net/sctp/chunk.c | 6 ++-
>> net/sctp/outqueue.c | 11 +++--
>> net/sctp/socket.c | 4 +-
>> net/sctp/stream.c | 107 +++++++++++++++++++++++++------------------
>> net/sctp/stream_interleave.c | 2 +-
>> net/sctp/stream_sched.c | 13 +++---
>> net/sctp/stream_sched_prio.c | 22 ++++-----
>> net/sctp/stream_sched_rr.c | 8 ++--
>> 9 files changed, 116 insertions(+), 87 deletions(-)
>>
>> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
>> index a0ec462..578bb40 100644
>> --- a/include/net/sctp/structs.h
>> +++ b/include/net/sctp/structs.h
>> @@ -394,37 +394,37 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
>>
>> /* What is the current SSN number for this stream? */
>> #define sctp_ssn_peek(stream, type, sid) \
>> - ((stream)->type[sid].ssn)
>> + (sctp_stream_##type##_ptr((stream), (sid))->ssn)
>>
>> /* Return the next SSN number for this stream. */
>> #define sctp_ssn_next(stream, type, sid) \
>> - ((stream)->type[sid].ssn++)
>> + (sctp_stream_##type##_ptr((stream), (sid))->ssn++)
>>
>> /* Skip over this ssn and all below. */
>> #define sctp_ssn_skip(stream, type, sid, ssn) \
>> - ((stream)->type[sid].ssn = ssn + 1)
>> + (sctp_stream_##type##_ptr((stream), (sid))->ssn = ssn + 1)
>>
>> /* What is the current MID number for this stream? */
>> #define sctp_mid_peek(stream, type, sid) \
>> - ((stream)->type[sid].mid)
>> + (sctp_stream_##type##_ptr((stream), (sid))->mid)
>>
>> /* Return the next MID number for this stream. */
>> #define sctp_mid_next(stream, type, sid) \
>> - ((stream)->type[sid].mid++)
>> + (sctp_stream_##type##_ptr((stream), (sid))->mid++)
>>
>> /* Skip over this mid and all below. */
>> #define sctp_mid_skip(stream, type, sid, mid) \
>> - ((stream)->type[sid].mid = mid + 1)
>> + (sctp_stream_##type##_ptr((stream), (sid))->mid = mid + 1)
>>
>> -#define sctp_stream_in(asoc, sid) (&(asoc)->stream.in[sid])
>> +#define sctp_stream_in(asoc, sid) sctp_stream_in_ptr(&(asoc)->stream, (sid))
>
> This will get confusing:
> - sctp_stream_in(asoc, sid)
> - sctp_stream_in_ptr(stream, sid)
>
> Considering all usages of sctp_stream_in(), it seems you can just update
> them to do the ->stream deref and keep only the latter implementation,
> which then doesn't need the _ptr suffix.
Ok, I'll change that in the next patch version.
--
Best regards,
Oleg Babin
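The macro change discussed above relies on token pasting to pick an accessor
by direction. A stand-alone sketch of that pattern follows, using
hypothetical simplified types (struct stream with fixed four-entry arrays)
rather than the real struct sctp_stream, and hypothetical names
stream_out_ptr(), stream_in_ptr(), ssn_peek() and ssn_next().

```c
#include <assert.h>
#include <stdint.h>

#define NSTREAMS 4 /* arbitrary size for the sketch */

struct stream_out { uint16_t ssn; };
struct stream_in  { uint16_t ssn; };

struct stream {
	struct stream_out out[NSTREAMS];
	struct stream_in  in[NSTREAMS];
};

/*
 * All element access goes through these helpers, so the backing storage
 * could later be swapped for a per-page structure without touching callers.
 */
struct stream_out *stream_out_ptr(struct stream *s, uint16_t sid)
{
	assert(sid < NSTREAMS);
	return &s->out[sid];
}

struct stream_in *stream_in_ptr(struct stream *s, uint16_t sid)
{
	assert(sid < NSTREAMS);
	return &s->in[sid];
}

/* "type" is the bare token in or out; ## pastes it into the helper name. */
#define ssn_peek(stream, type, sid) \
	(stream_##type##_ptr((stream), (sid))->ssn)
#define ssn_next(stream, type, sid) \
	(stream_##type##_ptr((stream), (sid))->ssn++)
```

With this shape, ssn_next(&s, out, 2) expands to
stream_out_ptr((&s), (2))->ssn++, which is the same indirection the patch
introduces via sctp_stream_##type##_ptr().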
^ permalink raw reply
* Re: [PATCH] netfilter: fix nf_tables filter chain type build
From: Pablo Neira Ayuso @ 2018-04-26 22:15 UTC (permalink / raw)
To: Randy Dunlap
Cc: netdev@vger.kernel.org, LKML, coreteam, netfilter-devel,
Florian Westphal, Jozsef Kadlecsik, kbuild test robot
In-Reply-To: <c8e42e35-ccff-471c-8c45-61f44855e21a@infradead.org>
On Sat, Apr 21, 2018 at 09:10:09PM -0700, Randy Dunlap wrote:
> From: Randy Dunlap <rdunlap@infradead.org>
>
> Fix build errors due to a missing Kconfig dependency term.
> Fixes these build errors:
>
> net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain':
> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain'
> net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit':
> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type'
> net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init':
> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type'
Thanks for sending a patch for this, Randy.
We already have a patch that should fix this:
https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git/commit/?id=39f2ff0816e5421476c2bc538b68b4bb0708a78e
Thanks!
^ permalink raw reply
* Re: [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
From: Oleg Babin @ 2018-04-26 22:14 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: netdev, linux-sctp, David S. Miller, Vlad Yasevich, Neil Horman,
Xin Long, Andrey Ryabinin
In-Reply-To: <20180423213314.GG3711@localhost.localdomain>
Hi Marcelo,
On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
> Hi,
>
> On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
>> Each SCTP association can have up to 65535 input and output streams.
>> For each stream type an array of sctp_stream_in or sctp_stream_out
>> structures is allocated using kmalloc_array() function. This function
>> allocates physically contiguous memory regions, so this can lead
>> to allocation of memory regions of very high order, i.e.:
>>
>> sizeof(struct sctp_stream_out) == 24,
>> ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
>> which means 9th memory order.
>>
>> This can lead to memory allocation failures on systems
>> under memory stress.
>
> Did you do performance tests while actually using these 65k streams
> and with 256 (so it gets 2 pages)?
>
> This will introduce another deref on each access to an element, but
> I'm not expecting any impact due to it.
>
No, I didn't do such tests. Could you please tell me what methodology
you usually use to measure performance properly?
I'm trying to do measurements with iperf3 on unmodified kernel and get
very strange results like this:
ovbabin@ovbabin-laptop:~$ ~/programs/iperf/bin/iperf3 -c 169.254.11.150 --sctp
Connecting to host 169.254.11.150, port 5201
[ 5] local 169.254.11.150 port 46330 connected to 169.254.11.150 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 9.88 MBytes 82.8 Mbits/sec
[ 5] 1.00-2.00 sec 226 MBytes 1.90 Gbits/sec
[ 5] 2.00-3.00 sec 832 KBytes 6.82 Mbits/sec
[ 5] 3.00-4.00 sec 640 KBytes 5.24 Mbits/sec
[ 5] 4.00-5.00 sec 756 MBytes 6.34 Gbits/sec
[ 5] 5.00-6.00 sec 522 MBytes 4.38 Gbits/sec
[ 5] 6.00-7.00 sec 896 KBytes 7.34 Mbits/sec
[ 5] 7.00-8.00 sec 519 MBytes 4.35 Gbits/sec
[ 5] 8.00-9.00 sec 504 MBytes 4.23 Gbits/sec
[ 5] 9.00-10.00 sec 475 MBytes 3.98 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 2.94 GBytes 2.53 Gbits/sec sender
[ 5] 0.00-10.04 sec 2.94 GBytes 2.52 Gbits/sec receiver
iperf Done.
The values are spread enormously, from hundreds of kilobits to gigabits.
I get similar results with netperf. This particular result was obtained
with client and server running on the same machine. I also tried this
on different machines with different kernel versions, and the situation was
similar. I compiled the latest versions of iperf and netperf from source.
Could it possibly be that I am missing something very obvious?
Thanks!
--
Best regards,
Oleg
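The order arithmetic quoted above can be checked with a small user-space
sketch. pages_needed() and alloc_order() are hypothetical helper names, and
the 4096-byte page size and 24-byte element size are taken from the quoted
message; rounding up gives 384 pages rather than the truncated 383, and
both land in order 9.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u /* page size assumed in the quoted message */

/* Number of whole pages needed to hold nbytes, rounded up. */
size_t pages_needed(size_t nbytes)
{
	return (nbytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/* Smallest order n such that 2^n contiguous pages cover nbytes. */
unsigned int alloc_order(size_t nbytes)
{
	size_t pages = pages_needed(nbytes);
	unsigned int order = 0;

	assert(nbytes > 0);
	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

For 65535 streams of 24 bytes this yields order 9, i.e. a 512-page
contiguous region, while the 256-stream case from the quoted question fits
in 2 pages (order 1).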
^ permalink raw reply
* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Siwei Liu @ 2018-04-26 22:14 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <20180426050934-mutt-send-email-mst@kernel.org>
On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >> >>
>> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> >> > > > >
>> >> >> >> > > > >I will NAK patches to change to common code for netvsc especially the
>> >> >> >> > > > >three device model. MS worked hard with distro vendors to support transparent
>> >> >> >> > > > >mode, and we really can't have a new model; or do backport.
>> >> >> >> > > > >
>> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> > > >
>> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> >> >> >> > >
>> >> >> >> > > The network device model is a userspace API, and DPDK is a userspace application.
>> >> >> >> >
>> >> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >
>> >> >> >> > > You can't go breaking userspace even if you don't like the application.
>> >> >> >> >
>> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> >> >> >> > API at all.
>> >> >> >> >
>> >> >> >>
>> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
>> >> >> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> >> >> DPDK environment. This setup creates a DPDK failsafe (bondingish) instance
>> >> >> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> >> >> >> VF device.
>> >> >> >>
>> >> >> >> So it depends on existing 2 device model. You can't go to a 3 device model
>> >> >> >> or start hiding devices from userspace.
>> >> >> >
>> >> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >> >
>> >> >> >> Also, I am working on associating netvsc and VF device based on serial number
>> >> >> >> rather than MAC address. The serial number is how Windows works now, and it makes
>> >> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >> >
>> >> >> > Maybe we should support same for virtio ...
>> >> >> > Which serial do you mean? From vpd?
>> >> >> >
>> >> >> > I guess you will want to keep supporting MAC for old hypervisors?
>> >> >> >
>> >> >> > It all seems like a reasonable thing to support in the generic core.
>> >> >>
>> >> >> That's the reason why I chose explicit identifier rather than rely on
>> >> >> MAC address to bind/pair a device. MAC address can change. Even if it
>> >> >> can't, malicious guest user can fake MAC address to skip binding.
>> >> >>
>> >> >> -Siwei
>> >> >
>> >> > Address should be sampled at device creation to prevent this
>> >> > kind of hack. Not that it buys the malicious user much:
>> >> > if you can poke at MAC addresses you probably already can
>> >> > break networking.
>> >>
>> >> I don't understand why poking at MAC address may potentially break
>> >> networking.
>> >
>> > Set a MAC address to match another device on the same LAN,
>> > packets will stop reaching that MAC.
>>
>> What I meant was guest users may create a virtual link, say veth that
>> has exactly the same MAC address as that for the VF, which can easily
>> get around of the binding procedure.
>
> This patchset limits binding to PCI devices so it won't be affected
> by any hacks around virtual devices.
Wait, I vaguely recall you seemed to want to generalize this feature
to non-PCI devices. But now you're saying it should stick to PCI. It's
not that I'm reluctant to stick to PCI. The fact is that I don't
think we can go ahead with the implementation until the semantics of the
so-called _F_STANDBY feature are clearly defined in the spec.
Previously the boundary of using MAC address as the identifier for
bonding was quite confusing to me. And now PCI adds to the matrix.
However, it still does not guarantee uniqueness, I think. It was almost
incorrect to choose the MAC address as the ID in the beginning, since
that has the implication of breaking existing configs. I don't think
libvirt or QEMU today restricts the MAC address to be unique per VM
instance. Nor does the virtio spec mention that.
In addition, the fact that it's difficult to fake a PCI device on Linux
does not mean the same applies to other OSes that are going to implement
this VirtIO feature. It's a fragile assumption IMHO.
>
>> There's no explicit flag to
>> identify a VF or pass-through device AFAIK. And sometimes this happens
>> maybe due to user misconfiguring the link. This process should be
>> hardened to avoid any potential configuration errors.
>
> They are still PCI devices though.
>
>> >
>> >> Unlike VF, passthrough PCI endpoint device has its freedom
>> >> to change the MAC address. Even on a VF setup it's not necessarily
>> >> always safe to assume the VF's MAC address cannot or shouldn't be
>> >> changed. That depends on the specific need whether the host admin
>> >> wants to restrict guest from changing the MAC address, although in
>> >> most cases it's true.
>> >>
>> >> I understand we can use the perm_addr to distinguish. But as said,
>> >> this will pose limitation of flexible configuration where one can
>> >> assign VFs with identical MAC address at all while each VF belongs to
>> >> different PF and/or different subnet for e.g. load balancing.
>> >> And
>> >> furthermore, the QEMU device model never uses MAC address to be
>> >> interpreted as an identifier, which requires to be unique per VM
>> >> instance. Why we're introducing this inconsistency?
>> >>
>> >> -Siwei
>> >
>> > Because it addresses most of the issues and is simple. That's already
>> > much better than what we have now which is nothing unless guest
>> > configures things manually.
>>
>> Did you see my QEMU patch for using BDF as the grouping identifier?
>
> Yes. And I don't think it can work because bus numbers are
> guest specified.
I know it's not ideal but perhaps it's the best one can do in the KVM
world without adding complex config, e.g. a PCI bridge. Even if the bus
number is guest specified, it's readily available in the guest and
recognizable by any OS, while on the QEMU configuration users specify
an id instead of the bus number. Unlike Hyper-V PCI bus, I don't think
there exists a para-virtual PCI bus in QEMU backend to expose VPD
capability to a passthrough device.
>
>> And there can be others like what you suggested, but the point is that
>> it's required to support an explicit grouping mechanism from day one,
>> before the backup property is cast in stone.
>
> Let's start with addressing simple configs with just two NICs.
>
> Down the road I can see possible extensions that can work: for example,
> require that devices are on the same pci bridge. Or we could even make
> the virtio device actually include a pci bridge (as part of same
> or a child function), the PT would have to be
> behind it.
>
> As long as we are not breaking anything, adding more flags to fix
> non-working configurations is always fair game.
While it may work, the PCI bridge has NUMA and IOMMU implications that
would restrict the current flexibility to group devices. I'm not sure
if vIOMMU would have to be introduced inadvertently for
isolation/protection of devices under the PCI bridge which may cause
negative performance impact on the VF.
>
>> This is orthogonal to
>> device model being proposed, be it 1-netdev or not. Delaying it would
>> just mean support and compatibility burden, appearing more like a
>> design flaw rather than a feature to add later on.
>
> Well it's mostly myself who gets to support it, and I see the device
> model as much more fundamental as userspace will come to depend
> on it. So I'm not too worried, let's take this one step at a time.
>
>> >
>> > I think ideally the infrastructure should support flexible matching of
>> > NICs - netvsc is already reported to be moving to some kind of serial
>> > address.
>> >
>> As Stephen said, Hyper-V supports the serial UUID thing from day-one.
>> It's just the Linux netvsc guest driver itself does not leverage that
>> ID from the very beginning.
>>
>> Regards,
>> -Siwei
>
> We could add something like this, too. For example,
> we could add a virtual VPD capability with a UUID.
I'm not an expert on that and wonder how you could do this (add a
virtual VPD capability with a UUID to passthrough device) with
existing QEMU emulation model and native PCI bus.
>
> Do you know how exactly does hyperv pass the UUID for NICs?
Stephen might know more about it and can correct me. But my personal
interpretation is that the SN is a host-generated 32-bit sequence
number which is unique per VM instance and gets propagated to the guest
via the para-virtual Hyper-V PCI bus.
Regards,
-Siwei
>
>> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> > --
>> >> >> > MST
^ permalink raw reply
* Re: [RFC/PATCH] net: ethernet: nixge: Use of_get_mac_address()
From: Moritz Fischer @ 2018-04-26 22:04 UTC (permalink / raw)
To: Moritz Fischer
Cc: linux-kernel, devicetree, netdev, davem, robh+dt, mark.rutland
In-Reply-To: <20180426215742.18966-1-mdf@kernel.org>
On Thu, Apr 26, 2018 at 02:57:42PM -0700, Moritz Fischer wrote:
> Make nixge driver work with 'mac-address' property instead of
> 'address' property. There are currently no in-tree users and
> the only users of this driver are devices that use overlays
> we control to instantiate the device together with the corresponding
> FPGA images.
>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
>
> Hi David, Rob,
>
> with Mike's change that enable the generic 'mac-address'
> binding that I barely missed with the submission of this
> driver I was wondering if we can still change the binding.
>
> I'm aware that this generally is a nonono case, since the binding
> is considered API, but since there are no users outside of our
> devicetree overlays that we ship with our devices I thought I'd ask.
>
> If you don't think that's a good idea do you think supporting both
> would be worthwhile?
>
> Thanks,
>
> Moritz
>
> ---
> .../devicetree/bindings/net/nixge.txt | 4 ++--
> drivers/net/ethernet/ni/nixge.c | 20 ++-----------------
> 2 files changed, 4 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/nixge.txt b/Documentation/devicetree/bindings/net/nixge.txt
> index e55af7f0881a..9bc1ecfb6762 100644
> --- a/Documentation/devicetree/bindings/net/nixge.txt
> +++ b/Documentation/devicetree/bindings/net/nixge.txt
> @@ -8,7 +8,7 @@ Required properties:
> - phy-mode: See ethernet.txt file in the same directory.
> - phy-handle: See ethernet.txt file in the same directory.
> - nvmem-cells: Phandle of nvmem cell containing the MAC address
> -- nvmem-cell-names: Should be "address"
> +- nvmem-cell-names: Should be "mac-address"
>
> Examples (10G generic PHY):
> nixge0: ethernet@40000000 {
> @@ -16,7 +16,7 @@ Examples (10G generic PHY):
> reg = <0x40000000 0x6000>;
>
> nvmem-cells = <&eth1_addr>;
> - nvmem-cell-names = "address";
> + nvmem-cell-names = "mac-address";
>
> interrupts = <0 29 IRQ_TYPE_LEVEL_HIGH>, <0 30 IRQ_TYPE_LEVEL_HIGH>;
> interrupt-names = "rx", "tx";
> diff --git a/drivers/net/ethernet/ni/nixge.c b/drivers/net/ethernet/ni/nixge.c
> index 27364b7572fc..7918c7b7273b 100644
> --- a/drivers/net/ethernet/ni/nixge.c
> +++ b/drivers/net/ethernet/ni/nixge.c
> @@ -1162,22 +1162,6 @@ static int nixge_mdio_setup(struct nixge_priv *priv, struct device_node *np)
> return of_mdiobus_register(bus, np);
> }
>
> -static void *nixge_get_nvmem_address(struct device *dev)
> -{
> - struct nvmem_cell *cell;
> - size_t cell_size;
> - char *mac;
> -
> - cell = nvmem_cell_get(dev, "address");
> - if (IS_ERR(cell))
> - return cell;
> -
> - mac = nvmem_cell_read(cell, &cell_size);
> - nvmem_cell_put(cell);
> -
> - return mac;
> -}
> -
> static int nixge_probe(struct platform_device *pdev)
> {
> struct nixge_priv *priv;
> @@ -1201,8 +1185,8 @@ static int nixge_probe(struct platform_device *pdev)
> ndev->min_mtu = 64;
> ndev->max_mtu = NIXGE_JUMBO_MTU;
>
> - mac_addr = nixge_get_nvmem_address(&pdev->dev);
> - if (mac_addr && is_valid_ether_addr(mac_addr))
> + mac_addr = of_get_mac_address(np);
Sorry, that should be pdev->dev.of_node here ... I'll resubmit if the
general idea is ok.
> + if (mac_addr)
> ether_addr_copy(ndev->dev_addr, mac_addr);
> else
> eth_hw_addr_random(ndev);
> --
> 2.17.0
>
- Moritz
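For readers following the probe logic above, here is a minimal userspace sketch of the "use the provided MAC address if it is valid, otherwise pick a random one" fallback. It is not the driver code: mac_is_valid(), mac_random() and mac_choose() are hypothetical names, and rand() is only a placeholder for a proper random source. The validity rule (not all-zero, not multicast) and the random-address rule (clear the multicast bit, set the locally administered bit) are the usual Ethernet conventions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAC_LEN 6

/* Valid means: not all-zero and not multicast (low bit of octet 0 clear). */
static bool mac_is_valid(const uint8_t *mac)
{
	static const uint8_t zero[MAC_LEN];

	if (memcmp(mac, zero, MAC_LEN) == 0)
		return false;
	return (mac[0] & 0x01) == 0;
}

/* Fill in a random unicast, locally administered address.
 * rand() is a placeholder for illustration only. */
static void mac_random(uint8_t *mac)
{
	for (int i = 0; i < MAC_LEN; i++)
		mac[i] = (uint8_t)(rand() & 0xff);
	mac[0] &= 0xfe;	/* clear the multicast bit */
	mac[0] |= 0x02;	/* set the locally administered bit */
}

/* Use the provided address when present and valid, else fall back
 * to a random one. provided may be NULL. */
static void mac_choose(uint8_t *dst, const uint8_t *provided)
{
	if (provided && mac_is_valid(provided))
		memcpy(dst, provided, MAC_LEN);
	else
		mac_random(dst);
}
```

Calling mac_choose(out, NULL) always leaves a valid address in out, because the locally administered bit guarantees a non-zero first octet.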
* [RFC/PATCH] net: ethernet: nixge: Use of_get_mac_address()
From: Moritz Fischer @ 2018-04-26 21:57 UTC (permalink / raw)
To: linux-kernel
Cc: devicetree, netdev, davem, robh+dt, mark.rutland, Moritz Fischer
Make nixge driver work with 'mac-address' property instead of
'address' property. There are currently no in-tree users and
the only users of this driver are devices that use overlays
we control to instantiate the device together with the corresponding
FPGA images.
Signed-off-by: Moritz Fischer <mdf@kernel.org>
---
Hi David, Rob,
with Mike's change that enable the generic 'mac-address'
binding that I barely missed with the submission of this
driver I was wondering if we can still change the binding.
I'm aware that this generally is a nonono case, since the binding
is considered API, but since there are no users outside of our
devicetree overlays that we ship with our devices I thought I'd ask.
If you don't think that's a good idea do you think supporting both
would be worthwhile?
Thanks,
Moritz
---
.../devicetree/bindings/net/nixge.txt | 4 ++--
drivers/net/ethernet/ni/nixge.c | 20 ++-----------------
2 files changed, 4 insertions(+), 20 deletions(-)
diff --git a/Documentation/devicetree/bindings/net/nixge.txt b/Documentation/devicetree/bindings/net/nixge.txt
index e55af7f0881a..9bc1ecfb6762 100644
--- a/Documentation/devicetree/bindings/net/nixge.txt
+++ b/Documentation/devicetree/bindings/net/nixge.txt
@@ -8,7 +8,7 @@ Required properties:
- phy-mode: See ethernet.txt file in the same directory.
- phy-handle: See ethernet.txt file in the same directory.
- nvmem-cells: Phandle of nvmem cell containing the MAC address
-- nvmem-cell-names: Should be "address"
+- nvmem-cell-names: Should be "mac-address"
Examples (10G generic PHY):
nixge0: ethernet@40000000 {
@@ -16,7 +16,7 @@ Examples (10G generic PHY):
reg = <0x40000000 0x6000>;
nvmem-cells = <&eth1_addr>;
- nvmem-cell-names = "address";
+ nvmem-cell-names = "mac-address";
interrupts = <0 29 IRQ_TYPE_LEVEL_HIGH>, <0 30 IRQ_TYPE_LEVEL_HIGH>;
interrupt-names = "rx", "tx";
diff --git a/drivers/net/ethernet/ni/nixge.c b/drivers/net/ethernet/ni/nixge.c
index 27364b7572fc..7918c7b7273b 100644
--- a/drivers/net/ethernet/ni/nixge.c
+++ b/drivers/net/ethernet/ni/nixge.c
@@ -1162,22 +1162,6 @@ static int nixge_mdio_setup(struct nixge_priv *priv, struct device_node *np)
return of_mdiobus_register(bus, np);
}
-static void *nixge_get_nvmem_address(struct device *dev)
-{
- struct nvmem_cell *cell;
- size_t cell_size;
- char *mac;
-
- cell = nvmem_cell_get(dev, "address");
- if (IS_ERR(cell))
- return cell;
-
- mac = nvmem_cell_read(cell, &cell_size);
- nvmem_cell_put(cell);
-
- return mac;
-}
-
static int nixge_probe(struct platform_device *pdev)
{
struct nixge_priv *priv;
@@ -1201,8 +1185,8 @@ static int nixge_probe(struct platform_device *pdev)
ndev->min_mtu = 64;
ndev->max_mtu = NIXGE_JUMBO_MTU;
- mac_addr = nixge_get_nvmem_address(&pdev->dev);
- if (mac_addr && is_valid_ether_addr(mac_addr))
+ mac_addr = of_get_mac_address(np);
+ if (mac_addr)
ether_addr_copy(ndev->dev_addr, mac_addr);
else
eth_hw_addr_random(ndev);
--
2.17.0
* Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
From: Saeed Mahameed @ 2018-04-26 21:50 UTC (permalink / raw)
To: Qing Huang, Eran Ben Elisha
Cc: linux-kernel, RDMA mailing list, Linux Netdev List,
Leon Romanovsky, Matan Barak, Saeed Mahameed
In-Reply-To: <1524775057-8012-1-git-send-email-qing.huang@oracle.com>
On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang <qing.huang@oracle.com> wrote:
> Current stats collecting scheme in mlx5 driver is to periodically fetch
> aggregated stats from all the active mlx5 software channels associated
> with the device. However when a mlx5 interface is brought down(ifdown),
> all the channels will be deactivated and closed. A new set of channels
> will be created when next ifup command or a similar command is called.
> Unfortunately the new channels will have all stats reset to 0. So you
> lose the accumulated stats information. This behavior is different from
> other netdev drivers including the mlx4 driver. In order to fix it, we
> now save prior mlx5 software stats into netdev stats fields, so all the
> accumulated stats will survive multiple runs of ifdown/ifup commands and
> be shown correctly.
>
> Orabug: 27548610
>
> Signed-off-by: Qing Huang <qing.huang@oracle.com>
> ---
Hi Qing,
I am adding Eran since he is currently working on a similar patch,
He is also taking care of all cores/rings stats to make them
persistent, so you won't have discrepancy between
ethtool and ifconfig stats.
I am ok with this patch, but this means Eran has to work his way around it.
so we have two options:
1. Temporarily accept this patch, and change it later with Eran's work.
2. Wait for Eran's work.
I am ok with either one of them, please let me know.
Thanks !
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++++++++++++++++++----
> 1 file changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index f1fe490..5d50e69 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
> netdev_set_tc_queue(netdev, tc, nch, 0);
> }
>
> +static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
> +{
> + struct net_device *netdev = priv->netdev;
> +
> + netdev->stats.rx_packets += priv->stats.sw.rx_packets;
> + netdev->stats.rx_bytes += priv->stats.sw.rx_bytes;
> + netdev->stats.tx_packets += priv->stats.sw.tx_packets;
> + netdev->stats.tx_bytes += priv->stats.sw.tx_bytes;
> + netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
> +
> + priv->stats.sw.rx_packets = 0;
> + priv->stats.sw.rx_bytes = 0;
> + priv->stats.sw.tx_packets = 0;
> + priv->stats.sw.tx_bytes = 0;
> + priv->stats.sw.tx_queue_dropped = 0;
> +}
> +
This means that we are now explicitly clearing channels stats on
ifconfig down or switch_channels.
and now after ifconfig down, ethtool will always show 0, before this
patch it didn't.
Anyway update sw stats function will always override them with the new
channels stats next time we load new channels.
so it is not that big of a deal.
> static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
> {
> struct mlx5e_channel *c;
> @@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
> netif_set_real_num_tx_queues(netdev, new_num_txqs);
>
> mlx5e_deactivate_priv_channels(priv);
> + mlx5e_netdev_save_stats(priv);
> mlx5e_close_channels(&priv->channels);
>
> priv->channels = *new_chs;
> @@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
>
> netif_carrier_off(priv->netdev);
> mlx5e_deactivate_priv_channels(priv);
> + mlx5e_netdev_save_stats(priv);
> mlx5e_close_channels(&priv->channels);
>
> return 0;
> @@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
> stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
> stats->tx_bytes = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
> } else {
> - stats->rx_packets = sstats->rx_packets;
> - stats->rx_bytes = sstats->rx_bytes;
> - stats->tx_packets = sstats->tx_packets;
> - stats->tx_bytes = sstats->tx_bytes;
> - stats->tx_dropped = sstats->tx_queue_dropped;
> + stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
> + stats->rx_bytes = sstats->rx_bytes + dev->stats.rx_bytes;
> + stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
> + stats->tx_bytes = sstats->tx_bytes + dev->stats.tx_bytes;
> + stats->tx_dropped = sstats->tx_queue_dropped +
> + dev->stats.tx_dropped;
> }
>
> stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
> --
> 1.8.3.1
>
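The accumulate-on-close idea discussed in this thread can be sketched in plain C as follows. This is a simplified userspace model, not the mlx5 code: struct sw_stats, struct fake_netdev, save_stats() and report_stats() are hypothetical stand-ins, and only two counters are tracked. It shows why the live counters are zeroed after being folded into the saved totals: otherwise they would be counted twice on the next query.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct sw_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

struct fake_netdev {
	struct sw_stats saved;	/* totals from channels already closed */
	struct sw_stats live;	/* counters of the currently open channels */
};

/* Fold the live counters into the saved totals before the channels
 * are closed, then zero them so they are not counted twice. */
static void save_stats(struct fake_netdev *nd)
{
	nd->saved.rx_packets += nd->live.rx_packets;
	nd->saved.tx_packets += nd->live.tx_packets;
	nd->live.rx_packets = 0;
	nd->live.tx_packets = 0;
}

/* What a stats query reports: saved totals plus the live counters. */
static struct sw_stats report_stats(const struct fake_netdev *nd)
{
	struct sw_stats out = {
		.rx_packets = nd->saved.rx_packets + nd->live.rx_packets,
		.tx_packets = nd->saved.tx_packets + nd->live.tx_packets,
	};
	return out;
}
```

In this model the reported totals keep growing across any number of close/open cycles, while a query that reads only the live counters (the ethtool case mentioned above) sees zero right after a close.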
* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 21:50 UTC (permalink / raw)
To: John Stoffel
Cc: Andrew, eric.dumazet, mst, edumazet, netdev, Randy Dunlap,
linux-kernel, Matthew Wilcox, Hocko, James Bottomley, Michal,
dm-devel, David Miller, David Rientjes, Morton, virtualization,
linux-mm, Vlastimil Babka
In-Reply-To: <23266.8532.619051.784274@quad.stoffel.home>
On Thu, 26 Apr 2018, John Stoffel wrote:
> >>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>
> James> I may be an atypical developer but I'd rather have a root canal
> James> than browse through menuconfig options. The way to get people
> James> to learn about new debugging options is to blog about it (or
> James> write an lwn.net article) which google will find the next time
> James> I ask it how I debug XXX. Google (probably as a service to
> James> humanity) rarely turns up Kconfig options in response to a
> James> query.
>
> I agree with James here. Looking at the SLAB vs SLUB Kconfig entries
> tells me *nothing* about why I should pick one or the other, as an
> example.
>
> John
I see your point - and I think the misunderstanding is this.
This patch is not really helping people to debug existing crashes. It is
not like "you get a crash" - "you google for some keywords" - "you get a
page that suggests to turn this option on" - "you turn it on and solve the
crash".
What this patch really does is that - it makes the kernel deliberately
crash in a situation when the code violates the specification, but it
would not crash otherwise or it would crash very rarely. It helps to
detect specification violations.
If the kernel developer (or tester) doesn't use this option, his buggy
code won't crash - and if it won't crash, he won't fix the bug or report
it. How is the user or developer supposed to learn about this option, if
he gets no crash at all?
Mikulas
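To make the argument above concrete, here is a toy userspace model of a forced-fallback debug switch. It is only an illustration under stated assumptions, not the kernel's kvmalloc or the proposed patch: force_fallback, struct buf, toy_alloc() and consumer_ok() are all hypothetical names, and the "contiguous" flag merely stands in for a property that only the primary allocation path guarantees.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical debug switch: when set, always take the fallback path. */
static bool force_fallback;

struct buf {
	void *ptr;
	bool contiguous;	/* stands in for "came from the primary path" */
};

/* Toy two-path allocator: without the switch, the fallback path is
 * taken only for large sizes, so small requests rarely exercise it. */
static struct buf toy_alloc(size_t size)
{
	struct buf b;

	b.ptr = malloc(size);
	b.contiguous = !force_fallback && size <= 4096;
	return b;
}

/* A buggy consumer that is only correct for contiguous buffers. */
static bool consumer_ok(const struct buf *b)
{
	return b->ptr != NULL && b->contiguous;
}
```

With the switch off, a 64-byte request never reaches the fallback path and the buggy consumer appears to work. With the switch on, the same request fails every time, which is what turns a rare violation into one that a developer or tester actually sees.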
* Re: [PATCH iproute2] ipaddress: strengthen check on 'label' input
From: Stephen Hemminger @ 2018-04-26 21:45 UTC (permalink / raw)
To: Patrick Talbert; +Cc: netdev
In-Reply-To: <1524578901-28278-1-git-send-email-ptalbert@redhat.com>
On Tue, 24 Apr 2018 16:08:21 +0200
Patrick Talbert <ptalbert@redhat.com> wrote:
> As mentioned in the ip-address man page, an address label must
> be equal to the device name or prefixed by the device name
> followed by a colon. Currently the only check on this input is
> to see if the device name appears at the beginning of the label
> string.
>
> This commit adds an additional check to ensure label == dev or
> continues with a colon.
>
> Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
> ---
> ip/ipaddress.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index aecc9a1..edcf821 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -2168,9 +2168,14 @@ static int ipaddr_modify(int cmd, int flags, int argc, char **argv)
> fprintf(stderr, "Not enough information: \"dev\" argument is required.\n");
> return -1;
> }
> - if (l && matches(d, l) != 0) {
> - fprintf(stderr, "\"dev\" (%s) must match \"label\" (%s).\n", d, l);
> - return -1;
> + if (l) {
> + size_t d_len = strlen(d);
> +
> + if (!(matches(d, l) == 0 && (l[d_len] == '\0' || l[d_len] == ':'))) {
matches is not what you want here. matches does prefix match (ie matches("eth0", "eth") == 0).
Also, what if label is shorter than the device, you would end up dereferencing past
the end of the string!
I think you want something like:
static bool is_valid_label(const char *dev, const char *label)
{
const char *sep;
sep = strchr(label, ':');
if (sep)
return strncmp(dev, label, sep - label) == 0;
else
return strcmp(dev, label) == 0;
}
> + fprintf(stderr, "\"label\" (%s) must match \"dev\" (%s) or be prefixed by"
> + " \"dev\" with a colon.\n", l, d);
> + return -1;
> + }
> }
>
> if (peer_len == 0 && local_len) {
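As a self-contained illustration of the rule from the man page (the label must equal the device name, or be the device name followed by a colon), here is one possible sketch. It is not the code that went into iproute2, and label_matches_dev() is a hypothetical name. It compares the full device name first, so that a label such as "eth:1" is rejected for device "eth0"; a comparison limited to the characters before the colon would accept that case.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: accept a label only if it equals dev, or is
 * dev followed by ':'.
 */
static bool label_matches_dev(const char *dev, const char *label)
{
	size_t dev_len = strlen(dev);

	/* The label must start with the full device name. strncmp stops
	 * at the label's terminating NUL, so a short label is not overread. */
	if (strncmp(label, dev, dev_len) != 0)
		return false;

	/* Safe: the first dev_len bytes matched, so label[dev_len] exists. */
	return label[dev_len] == '\0' || label[dev_len] == ':';
}
```

For example, with dev "eth0" the labels "eth0" and "eth0:1" are accepted, while "eth", "eth01" and "eth:1" are rejected.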
* [PATCH net-next v2 14/14] bnxt_en: Reserve rings at driver open if none was reserved at probe time.
From: Michael Chan @ 2018-04-26 21:44 UTC (permalink / raw)
To: davem; +Cc: netdev
In-Reply-To: <1524779084-4016-1-git-send-email-michael.chan@broadcom.com>
Add logic to reserve default rings at driver open time if none was
reserved during probe time. This will happen when the PF driver did
not provision minimum rings to the VF, due to more limited resources.
Driver open will only succeed if some minimum rings can be reserved.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index fee1c0d..efe5c72 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6844,6 +6844,8 @@ static void bnxt_preset_reg_win(struct bnxt *bp)
}
}
+static int bnxt_init_dflt_ring_mode(struct bnxt *bp);
+
static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
{
int rc = 0;
@@ -6851,6 +6853,12 @@ static int __bnxt_open_nic(struct bnxt *bp, bool irq_re_init, bool link_re_init)
bnxt_preset_reg_win(bp);
netif_carrier_off(bp->dev);
if (irq_re_init) {
+ /* Reserve rings now if none were reserved at driver probe. */
+ rc = bnxt_init_dflt_ring_mode(bp);
+ if (rc) {
+ netdev_err(bp->dev, "Failed to reserve default rings at open\n");
+ return rc;
+ }
rc = bnxt_reserve_rings(bp);
if (rc)
return rc;
@@ -8600,6 +8608,29 @@ static int bnxt_set_dflt_rings(struct bnxt *bp, bool sh)
return rc;
}
+static int bnxt_init_dflt_ring_mode(struct bnxt *bp)
+{
+ int rc;
+
+ if (bp->tx_nr_rings)
+ return 0;
+
+ rc = bnxt_set_dflt_rings(bp, true);
+ if (rc) {
+ netdev_err(bp->dev, "Not enough rings available.\n");
+ return rc;
+ }
+ rc = bnxt_init_int_mode(bp);
+ if (rc)
+ return rc;
+ bp->tx_nr_rings_per_tc = bp->tx_nr_rings;
+ if (bnxt_rfs_supported(bp) && bnxt_rfs_capable(bp)) {
+ bp->flags |= BNXT_FLAG_RFS;
+ bp->dev->features |= NETIF_F_NTUPLE;
+ }
+ return 0;
+}
+
int bnxt_restore_pf_fw_resources(struct bnxt *bp)
{
int rc;
--
1.8.3.1
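The reserve-at-open behaviour described in the commit message can be modelled in a few lines of userspace C. This is a simplified sketch, not the bnxt_en code: struct toy_nic and toy_reserve_at_open() are hypothetical, firmware is reduced to a counter of available rings, and the caller is assumed to pass want >= 1.

```c
#include <errno.h>

/* Hypothetical, simplified device state. */
struct toy_nic {
	int tx_rings;		/* 0 means nothing was reserved at probe time */
	int avail_rings;	/* rings that can still be handed out */
};

/* Reserve default rings at open time only if probe reserved none.
 * Assumes want >= 1. Returns 0 on success or a negative errno. */
static int toy_reserve_at_open(struct toy_nic *nic, int want)
{
	if (nic->tx_rings)
		return 0;	/* already reserved, nothing to do */
	if (nic->avail_rings < 1)
		return -ENOMEM;	/* open fails without at least one ring */
	if (want > nic->avail_rings)
		want = nic->avail_rings;
	nic->tx_rings = want;
	nic->avail_rings -= want;
	return 0;
}
```

The early return makes the call idempotent, so it can run on every open without disturbing rings that were reserved earlier.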
* [PATCH net-next v2 13/14] bnxt_en: Reserve RSS and L2 contexts for VF.
From: Michael Chan @ 2018-04-26 21:44 UTC (permalink / raw)
To: davem; +Cc: netdev
In-Reply-To: <1524779084-4016-1-git-send-email-michael.chan@broadcom.com>
For completeness and correctness, the VF driver needs to reserve these
RSS and L2 contexts.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 4 ++++
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 10 +++++-----
drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h | 5 +++++
3 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 0884e49..fee1c0d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4713,6 +4713,10 @@ int __bnxt_hwrm_get_tx_rings(struct bnxt *bp, u16 fid, int *tx_rings)
__bnxt_hwrm_reserve_vf_rings(bp, &req, tx_rings, rx_rings, ring_grps,
cp_rings, vnics);
+ req.enables |= cpu_to_le32(FUNC_VF_CFG_REQ_ENABLES_NUM_RSSCOS_CTXS |
+ FUNC_VF_CFG_REQ_ENABLES_NUM_L2_CTXS);
+ req.num_rsscos_ctxs = cpu_to_le16(BNXT_VF_MAX_RSS_CTX);
+ req.num_l2_ctxs = cpu_to_le16(BNXT_VF_MAX_L2_CTX);
rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
if (rc)
return -ENOMEM;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
index 18ee471..cc21d87 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
@@ -462,13 +462,13 @@ static int bnxt_hwrm_func_vf_resc_cfg(struct bnxt *bp, int num_vfs)
vf_vnics = hw_resc->max_vnics - bp->nr_vnics;
vf_vnics = min_t(u16, vf_vnics, vf_rx_rings);
- req.min_rsscos_ctx = cpu_to_le16(1);
- req.max_rsscos_ctx = cpu_to_le16(1);
+ req.min_rsscos_ctx = cpu_to_le16(BNXT_VF_MIN_RSS_CTX);
+ req.max_rsscos_ctx = cpu_to_le16(BNXT_VF_MAX_RSS_CTX);
if (pf->vf_resv_strategy == BNXT_VF_RESV_STRATEGY_MINIMAL) {
req.min_cmpl_rings = cpu_to_le16(1);
req.min_tx_rings = cpu_to_le16(1);
req.min_rx_rings = cpu_to_le16(1);
- req.min_l2_ctxs = cpu_to_le16(1);
+ req.min_l2_ctxs = cpu_to_le16(BNXT_VF_MIN_L2_CTX);
req.min_vnics = cpu_to_le16(1);
req.min_stat_ctx = cpu_to_le16(1);
req.min_hw_ring_grps = cpu_to_le16(1);
@@ -483,7 +483,7 @@ static int bnxt_hwrm_func_vf_resc_cfg(struct bnxt *bp, int num_vfs)
req.min_cmpl_rings = cpu_to_le16(vf_cp_rings);
req.min_tx_rings = cpu_to_le16(vf_tx_rings);
req.min_rx_rings = cpu_to_le16(vf_rx_rings);
- req.min_l2_ctxs = cpu_to_le16(4);
+ req.min_l2_ctxs = cpu_to_le16(BNXT_VF_MAX_L2_CTX);
req.min_vnics = cpu_to_le16(vf_vnics);
req.min_stat_ctx = cpu_to_le16(vf_stat_ctx);
req.min_hw_ring_grps = cpu_to_le16(vf_ring_grps);
@@ -491,7 +491,7 @@ static int bnxt_hwrm_func_vf_resc_cfg(struct bnxt *bp, int num_vfs)
req.max_cmpl_rings = cpu_to_le16(vf_cp_rings);
req.max_tx_rings = cpu_to_le16(vf_tx_rings);
req.max_rx_rings = cpu_to_le16(vf_rx_rings);
- req.max_l2_ctxs = cpu_to_le16(4);
+ req.max_l2_ctxs = cpu_to_le16(BNXT_VF_MAX_L2_CTX);
req.max_vnics = cpu_to_le16(vf_vnics);
req.max_stat_ctx = cpu_to_le16(vf_stat_ctx);
req.max_hw_ring_grps = cpu_to_le16(vf_ring_grps);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
index 6f6d850..e9b20cd 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
@@ -23,6 +23,11 @@
((offsetof(struct hwrm_reject_fwd_resp_input, encap_request) + n) >\
offsetof(struct hwrm_reject_fwd_resp_input, encap_resp_target_id))
+#define BNXT_VF_MIN_RSS_CTX 1
+#define BNXT_VF_MAX_RSS_CTX 1
+#define BNXT_VF_MIN_L2_CTX 1
+#define BNXT_VF_MAX_L2_CTX 4
+
int bnxt_get_vf_config(struct net_device *, int, struct ifla_vf_info *);
int bnxt_set_vf_mac(struct net_device *, int, u8 *);
int bnxt_set_vf_vlan(struct net_device *, int, u16, u8, __be16);
--
1.8.3.1