* Re: [PATCH net-next 1/4] netlink: fix test alignment in nla_align_64bit()
From: Nicolas Dichtel @ 2016-04-20 9:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, davem, roopa, tgraf, jhs
In-Reply-To: <1461144802.10638.249.camel@edumazet-glaptop3.roam.corp.google.com>
On 20/04/2016 11:33, Eric Dumazet wrote:
[snip]
> How have you tested your patch exactly ?
As stated in the cover letter, I didn't test it.
>
> I guess David should have copied his original comment here.
>
> - * The nlattr header is 4 bytes in size, that's why we test
> - * if the skb->data _is_ aligned. This NOP attribute, plus
> - * nlattr header for IFLA_STATS64, will make nla_data() 8-byte
> - * aligned.
>
>
I knew I was missing something, thanks for the explanation.
All other patches of this series need to be updated, I will do it if there is
no other comment.
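The arithmetic behind the quoted comment can be checked outside the kernel. Below is a minimal user-space sketch, assuming the 4-byte nlattr header mentioned in the comment and a zero-length pad attribute; payload_offset_mod8() and needs_pad() are hypothetical helpers for illustration, not kernel functions.

```c
#include <assert.h>

#define NLA_HDRLEN 4u /* size of the nlattr header, per the quoted comment */

/* Hypothetical helper: given the offset of skb->data and whether a
 * zero-length pad attribute is emitted first, return the offset (modulo 8)
 * at which the payload of the following 64-bit attribute would start.
 */
static unsigned int payload_offset_mod8(unsigned int data_off, int add_pad)
{
	unsigned int off = data_off;

	assert(data_off % 4 == 0);  /* netlink attributes are 4-byte aligned */
	if (add_pad)
		off += NLA_HDRLEN;  /* pad attribute: header only, no payload */
	off += NLA_HDRLEN;          /* header of the real attribute */
	return off % 8;
}

/* Mirrors the test discussed in the thread: pad only when data IS aligned. */
static int needs_pad(unsigned int data_off)
{
	return data_off % 8 == 0;
}
```

With data at an 8-byte boundary and no pad, the payload would land at offset 4; with the pad it lands on an 8-byte boundary again. When data is only 4-byte aligned, the attribute header alone already restores the alignment.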
* drop all fragments inside tx queue if one gets dropped
From: Alexander Aring @ 2016-04-20 9:52 UTC (permalink / raw)
To: netdev; +Cc: linux-wpan
Hi,
On linux-wpan we had a discussion about setting the right tx_queue_len
and ran into some issues in 802.15.4 6LoWPAN networks.
Our hardware parameters are:
- Bandwidth: 250kb/s
- One frame buffer on the hardware side for transmitting a frame.
- MTU: 127 bytes (without MAC headers)
To provide 6LoWPAN (IPv6) on such an interface, we have two interfaces:
one wpan interface (which works on the 802.15.4 layer and has a queue) and
one lowpan interface (which gets IPv6 packets and queues 6LoWPAN frames
into the wpan interface; it is a virtual interface and has no queue).
If an IPv6 packet needs fragmentation, which is mostly the case when the
payload exceeds 127 bytes, we have the following situation:
- 6lowpan interface gets IPv6 packet:
- generate 6LoWPAN fragments
- dev_queue_xmit(wpan_dev, frag1)
- dev_queue_xmit(wpan_dev, frag2)
- dev_queue_xmit(wpan_dev, frag3)
- dev_queue_xmit(wpan_dev, ...)
Then a lot of fragments lie inside the tx_queue and wait to be
transferred to the transceiver, which has only one frame buffer to transmit
one frame and waits for tx completion before transferring the next one.
My question is: if the qdisc drops some fragment because the queue is full
or for some other reason, is there a way to remove all related fragments
from the queue? If one fragment is dropped and all related ones are still
inside the queue, then we send mostly garbage.
I want to add a behaviour which drops all related fragments, for
6LoWPAN fragmentation at first. If the payload is above 1280 bytes, then
we also have IPv6 fragmentation on top of it. In the future I would also
like to remove all 6LoWPAN fragments that are related through the IPv6
fragmentation.
- Alex
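To make the requested behaviour concrete, here is a small user-space sketch of "drop every queued fragment of the same datagram". It assumes the fragments can be matched by the 6LoWPAN datagram_tag; struct frag and purge_datagram() are hypothetical names, and a plain array stands in for the qdisc queue, so this says nothing about how a real qdisc would do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue entry, reduced to what the matching needs. */
struct frag {
	uint16_t tag;   /* datagram_tag shared by all fragments of one packet */
	uint8_t index;  /* position of the fragment inside its datagram */
};

/* Remove every queued fragment carrying `tag`, keeping the order of the
 * remaining entries. Returns the new queue length.
 */
static size_t purge_datagram(struct frag *q, size_t n, uint16_t tag)
{
	size_t keep = 0;

	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (q[i].tag != tag)
			q[keep++] = q[i];  /* compact survivors in place */
	}
	return keep;
}
```

The idea would be to call something like this with the tag of the fragment that was just dropped, so the remaining fragments of that datagram never reach the transceiver.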
* Re: [PATCH net-next 0/2] act_bpf, cls_bpf: send eBPF bytecode through
From: Daniel Borkmann @ 2016-04-20 9:55 UTC (permalink / raw)
To: Quentin Monnet; +Cc: Alexei Starovoitov, netdev
In-Reply-To: <57172ED3.30101@6wind.com>
Hi Quentin,
On 04/20/2016 09:25 AM, Quentin Monnet wrote:
> 2016-04-15 (11:44 UTC-0700) ~ Alexei Starovoitov:
>> On Fri, Apr 15, 2016 at 12:41:05PM +0200, Daniel Borkmann wrote:
>>> On 04/15/2016 12:07 PM, Quentin Monnet wrote:
>>>> When a new BPF traffic control filter or action is set up with tc, the
>>>> bytecode is sent back to userspace through a netlink socket for cBPF, but
>>>> not for eBPF (the file descriptor pointing to the object file containing
>>>> the bytecode is sent instead).
>>>>
>>>> This patch makes cls_bpf and act_bpf modules send the bytecode for eBPF as
>>>> well (in addition to the file descriptor).
>>>>
> […]
>>>
>>> Thanks for working on this, but it's unfortunately not that easy. Let
>>> me ask, what would be the intended use-case to dump the insns?
>>
>> +1
>>
>>> I'm asking because if you dump them as-is, then a reinject at a later
>>> time of that bytecode back into the kernel will most likely be rejected
>>> by the verifier.
>>>
>>> This is because on load time, verifier does rewrites/expansion on some
>>> of the insns (f.e. map pointers, helper functions, ctx access etc, see
>>> also appendix in [1]), so the code as seen in the kernel would need to
>>> be sanitized first.
>>
>> +1
>> we had similar discussion about this in seccomp context and decided that
>> the only sensible way is to keep original instructions, but it's wasteful
>> to do unconditionally and snapshotting of maps is not possible,
>> so there was no use for such dumping facility other than debugging.
> >> Is that what the patch is after?
>> We need to discuss it in the proper context.
>
> I am experimenting with BPF, and so far I was just trying to dump the
> bytecode sent from tc to the kernel. I had not realized that the
> verifier would bring some changes to the instructions. And I agree that
> a more comprehensive debugging solution could be obtained if I can find
> some way to get a snapshot of the maps.
>
>>> Also, how would you make sense/transform maps into a meaningful
>>> representation (probably possible to find a scheme when they are pinned)?
>>>
>>> Another possibility is that such programs need to be pinned (can be done
>>> easily by tc in the background) and then implement a CRIU facility into
>>> the bpf(2) syscall to retrieve them. tc could make use of this w/o too
>>> much effort, and at the same time it would help CRIU folks, too. It
>>> also seems cleaner to have only one central api (bpf(2)) to dump them,
>>> but needs a bit of thought.
>>
>> +1
>> any debugging or criu needs to be done in a centralized way via syscall
>> and/or bpffs.
>
> Maintaining a central API around bpf() makes sense to me. I have been
> looking at the BPF filesystem to see what information I can obtain from
> it, but I did not understand it well. I read the logs of Daniel's commit
> b2197755b263 (“bpf: add support for persistent maps/progs”), but I am
> unsure how I could use it in order to gather data about the maps and
> programs (if this is possible at all). I tried to set up some BPF
Currently, there's not yet much information to extract. F.e. if you look at
the tc source code, we do bpf_map_selfcheck_pinned() from fdinfo to check if
the map fd that we got from the pinned one matches the one from the object
file. But obviously more work is needed for extraction of bytecode as in your
case.
Haven't thought much about it yet, but one idea could be that tc also pins
programs, then sends some kind of annotation down to cls_bpf where on filter
dump tc could retrieve the path to the pinned program again, then uses bpf(2)
with BPF_OBJ_GET to get the fd, and a new command e.g. BPF_PROG_DUMP to extract
bytecode/map info from the running program and dumps it to the user in a way
where some sense can be made out of it from admin/user perspective (in other
words, not just raw opcodes I mean).
BPF_PROG_DUMP could have auxiliary information with map specs, similar to
how we retrieve them as relo entries from the object file in
the loader, and in addition some information on where to retrieve the maps in
case they were pinned. This still doesn't give you an entire snapshot of the
map, but would at least allow you to iterate over the pinned ones
via bpf(2) with BPF_MAP_GET_NEXT_KEY, plus in general it would allow you to
reload the program.
There's still the issue with the additional memory overhead to keep original
insns around as Alexei mentioned. Two things that come to mind, one being
that when JITing was successful, we could actually try to shrink struct bpf_prog
again since we work on a different image, but it doesn't address the case
where JIT is not used. Other one being to perhaps only keep a 'diff' around
in orig_prog where we can patch insns back to original, probably possible,
but needs a bit of work though.
> filters working with maps, but I could not find any file under
> /sys/fs/bpf/tc.
There are some getting started examples under examples/bpf/ in the iproute2
repo, f.e. bpf_shared.c is one.
> Would you have a pointer to some documentation about this filesystem? Or
> is there only the kernel code?
Yeah, b2197755b263 and 42984d7c1e56, and in my netdev1.1 paper I tried to put
more extensive information, but seems the proceedings haven't been published
yet. I can send you a private copy until they are officially released I guess.
Thanks,
Daniel
* Re: [PATCH net-next 1/4] netlink: fix test alignment in nla_align_64bit()
From: Eric Dumazet @ 2016-04-20 9:57 UTC (permalink / raw)
To: nicolas.dichtel; +Cc: netdev, davem, roopa, tgraf, jhs
In-Reply-To: <57174F8E.6050201@6wind.com>
On Wed, 2016-04-20 at 11:44 +0200, Nicolas Dichtel wrote:
> On 20/04/2016 11:33, Eric Dumazet wrote:
> [snip]
> > How have you tested your patch exactly ?
> As stated in the cover letter, I didn't test it.
You certainly can test this, by tweaking HAVE_EFFICIENT_UNALIGNED_ACCESS
and adding another assertion in the code.
By testing it you would have caught a real bug, since David incorrectly
used HAVE_EFFICIENT_UNALIGNED_ACCESS instead of
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
;)
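Why the missing prefix matters: Kconfig options reach C code as CONFIG_* macros, so the un-prefixed name is never defined and an #ifndef on it is always true. A standalone sketch follows; the CONFIG_ macro is defined by hand here to stand in for an architecture that selects the option, and the function names are made up for the demonstration.

```c
#include <assert.h>

/* Stand-in for what the kernel build would provide on an architecture that
 * selects the option (an assumption for this demo only).
 */
#define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS 1

/* Buggy test: the un-prefixed name is never defined, so this returns 1
 * on every architecture.
 */
static int pad_needed_buggy(void)
{
#ifndef HAVE_EFFICIENT_UNALIGNED_ACCESS
	return 1;
#else
	return 0;
#endif
}

/* Fixed test: the padding path is compiled out when the option is set. */
static int pad_needed_fixed(void)
{
#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	return 1;
#else
	return 0;
#endif
}

/* With the option set, the two tests disagree, which is the bug. */
static void check_tests_disagree(void)
{
	assert(pad_needed_buggy() == 1);
	assert(pad_needed_fixed() == 0);
}
```

The compiler gives no diagnostic for the buggy variant, which is why only a run-time test (or a careful reader) would catch it.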
diff --git a/include/net/netlink.h b/include/net/netlink.h
index e644b3489acf..ea6872633a92 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1244,7 +1244,7 @@ static inline int nla_validate_nested(const struct nlattr *start, int maxtype,
*/
static inline int nla_align_64bit(struct sk_buff *skb, int padattr)
{
-#ifndef HAVE_EFFICIENT_UNALIGNED_ACCESS
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
if (IS_ALIGNED((unsigned long)skb->data, 8)) {
struct nlattr *attr = nla_reserve(skb, padattr, 0);
if (!attr)
@@ -1261,7 +1261,7 @@ static inline int nla_align_64bit(struct sk_buff *skb, int padattr)
static inline int nla_total_size_64bit(int payload)
{
return NLA_ALIGN(nla_attr_size(payload))
-#ifndef HAVE_EFFICIENT_UNALIGNED_ACCESS
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ NLA_ALIGN(nla_attr_size(0))
#endif
;
* Re: [PATCH net-next 1/4] netlink: fix test alignment in nla_align_64bit()
From: Nicolas Dichtel @ 2016-04-20 10:14 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, davem, roopa, tgraf, jhs
In-Reply-To: <1461146278.10638.253.camel@edumazet-glaptop3.roam.corp.google.com>
On 20/04/2016 11:57, Eric Dumazet wrote:
> On Wed, 2016-04-20 at 11:44 +0200, Nicolas Dichtel wrote:
>> On 20/04/2016 11:33, Eric Dumazet wrote:
>> [snip]
>>> How have you tested your patch exactly ?
>> As stated in the cover letter, I didn't test it.
>
>
> You certainly can test this, by tweaking HAVE_EFFICIENT_UNALIGNED_ACCESS
> and adding another assertion in the code.
>
> By testing it you would have caught a real bug, since David incorrectly
> used HAVE_EFFICIENT_UNALIGNED_ACCESS instead of
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>
> ;)
Hehe, good catch :)
* [PATCH net-next 0/2] pskb_extract() helper function.
From: Sowmini Varadhan @ 2016-04-20 10:17 UTC (permalink / raw)
To: netdev, rds-devel, santosh.shilimkar, davem
Cc: sowmini.varadhan, eric.dumazet, marcelo.leitner
This patchset follows up on the discussion in
https://www.mail-archive.com/netdev@vger.kernel.org/msg105090.html
For RDS-TCP, we have to deal with the full gamut of
nonlinear sk_buffs, including all the frag_list variants.
Also, the parent skb has to remain unchanged, while the clone
is queued for Rx on the PF_RDS socket.
Patch 1 of this patchset adds a pskb_extract() function that
does all this without the redundant memcpy's in pskb_expand_head()
and __pskb_pull_tail().
Sowmini Varadhan (2):
Add pskb_extract() helper function
Call pskb_extract() helper function
include/linux/skbuff.h | 2 +
net/core/skbuff.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++
net/rds/tcp_recv.c | 14 +--
3 files changed, 253 insertions(+), 11 deletions(-)
* [PATCH net-next 1/2] skbuff: Add pskb_extract() helper function
From: Sowmini Varadhan @ 2016-04-20 10:17 UTC (permalink / raw)
To: netdev, rds-devel, santosh.shilimkar, davem
Cc: sowmini.varadhan, eric.dumazet, marcelo.leitner
In-Reply-To: <cover.1461086306.git.sowmini.varadhan@oracle.com>
A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.
The existing code uses the sequence
clone = skb_clone(..);
pskb_pull(clone, off, ..);
pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().
To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.
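The difference between the two sequences can be illustrated with a rough user-space model. This is only a sketch under strong assumptions: a flat byte array stands in for the skb (no frags, no frag_list), and struct view, extract_old() and extract_new() are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the extracted data: a pointer and a length. */
struct view {
	const unsigned char *data;
	size_t len;
};

/* Model of clone + pull + trim: the whole source is copied into buf first,
 * and the leading `off` bytes are skipped afterwards.
 * Returns the number of bytes copied.
 */
static size_t extract_old(const unsigned char *src, size_t len, size_t off,
			  size_t to_copy, unsigned char *buf, struct view *out)
{
	assert(off <= len && to_copy <= len - off);
	memcpy(buf, src, len);  /* includes bytes that are thrown away */
	out->data = buf + off;  /* "pull": skip the first off bytes */
	out->len = to_copy;     /* "trim": keep only to_copy bytes */
	return len;
}

/* Model of the extract idea: copy only the requested range.
 * Returns the number of bytes copied.
 */
static size_t extract_new(const unsigned char *src, size_t len, size_t off,
			  size_t to_copy, unsigned char *buf, struct view *out)
{
	assert(off <= len && to_copy <= len - off);
	memcpy(buf, src + off, to_copy);
	out->data = buf;
	out->len = to_copy;
	return to_copy;
}
```

Both variants yield the same bytes; in this model the second one simply copies fewer of them, which is the saving the commit message describes.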
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
include/linux/skbuff.h | 2 +
net/core/skbuff.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 250 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index da0ace3..a1ce639 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2986,6 +2986,8 @@ struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
int skb_ensure_writable(struct sk_buff *skb, int write_len);
int skb_vlan_pop(struct sk_buff *skb);
int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci);
+struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
+ gfp_t gfp);
static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
{
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4cc594c..e8b6d20 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4619,3 +4619,251 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
return NULL;
}
EXPORT_SYMBOL(alloc_skb_with_frags);
+
+/* carve out the first off bytes from skb when off < headlen */
+static int pskb_carve_inside_header(struct sk_buff *skb, const u32 off,
+ const int headlen, gfp_t gfp_mask)
+{
+ int i;
+ int size = skb_end_offset(skb);
+ int new_hlen = headlen - off;
+ u8 *data;
+ int doff = 0;
+
+ size = SKB_DATA_ALIGN(size);
+
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+ data = kmalloc_reserve(size +
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
+ gfp_mask, NUMA_NO_NODE, NULL);
+ if (!data)
+ return -ENOMEM;
+
+ size = SKB_WITH_OVERHEAD(ksize(data));
+
+ /* Copy real data, and all frags */
+ skb_copy_from_linear_data_offset(skb, off, data, new_hlen);
+ skb->len -= off;
+
+ memcpy((struct skb_shared_info *)(data + size),
+ skb_shinfo(skb),
+ offsetof(struct skb_shared_info,
+ frags[skb_shinfo(skb)->nr_frags]));
+ if (skb_cloned(skb)) {
+ /* drop the old head gracefully */
+ if (skb_orphan_frags(skb, gfp_mask)) {
+ kfree(data);
+ return -ENOMEM;
+ }
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+ skb_frag_ref(skb, i);
+ if (skb_has_frag_list(skb))
+ skb_clone_fraglist(skb);
+ skb_release_data(skb);
+ } else {
+ /* we can reuse existing refcount - all we did was
+ * relocate values
+ */
+ skb_free_head(skb);
+ }
+
+ doff = (data - skb->head);
+ skb->head = data;
+ skb->data = data;
+ skb->head_frag = 0;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+ skb->end = size;
+ doff = 0;
+#else
+ skb->end = skb->head + size;
+#endif
+ skb_set_tail_pointer(skb, skb_headlen(skb));
+ skb_headers_offset_update(skb, 0);
+ skb->cloned = 0;
+ skb->hdr_len = 0;
+ skb->nohdr = 0;
+ atomic_set(&skb_shinfo(skb)->dataref, 1);
+
+ return 0;
+}
+
+static int pskb_carve(struct sk_buff *skb, const u32 off, gfp_t gfp);
+
+/* carve out the first eat bytes from skb's frag_list. May recurse into
+ * pskb_carve()
+ */
+static int pskb_carve_frag_list(struct sk_buff *skb,
+ struct skb_shared_info *shinfo, int eat,
+ gfp_t gfp_mask)
+{
+ struct sk_buff *list = shinfo->frag_list;
+ struct sk_buff *clone = NULL;
+ struct sk_buff *insp = NULL;
+
+ do {
+ if (!list) {
+ pr_err("Not enough bytes to eat. Want %d\n", eat);
+ return -EFAULT;
+ }
+ if (list->len <= eat) {
+ /* Eaten as whole. */
+ eat -= list->len;
+ list = list->next;
+ insp = list;
+ } else {
+ /* Eaten partially. */
+ if (skb_shared(list)) {
+ clone = skb_clone(list, gfp_mask);
+ if (!clone)
+ return -ENOMEM;
+ insp = list->next;
+ list = clone;
+ } else {
+ /* This may be pulled without problems. */
+ insp = list;
+ }
+ if (pskb_carve(list, eat, gfp_mask) < 0) {
+ kfree_skb(clone);
+ return -ENOMEM;
+ }
+ break;
+ }
+ } while (eat);
+
+ /* Free pulled out fragments. */
+ while ((list = shinfo->frag_list) != insp) {
+ shinfo->frag_list = list->next;
+ kfree_skb(list);
+ }
+ /* And insert new clone at head. */
+ if (clone) {
+ clone->next = list;
+ shinfo->frag_list = clone;
+ }
+ return 0;
+}
+
+/* carve off first len bytes from skb. Split line (off) is in the
+ * non-linear part of skb
+ */
+static int pskb_carve_inside_nonlinear(struct sk_buff *skb, const u32 off,
+ int pos, gfp_t gfp_mask)
+{
+ int i, k = 0;
+ int size = skb_end_offset(skb);
+ u8 *data;
+ const int nfrags = skb_shinfo(skb)->nr_frags;
+ struct skb_shared_info *shinfo;
+ int doff = 0;
+
+ size = SKB_DATA_ALIGN(size);
+
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+ data = kmalloc_reserve(size +
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
+ gfp_mask, NUMA_NO_NODE, NULL);
+ if (!data)
+ return -ENOMEM;
+
+ size = SKB_WITH_OVERHEAD(ksize(data));
+
+ memcpy((struct skb_shared_info *)(data + size),
+ skb_shinfo(skb), offsetof(struct skb_shared_info,
+ frags[skb_shinfo(skb)->nr_frags]));
+ if (skb_orphan_frags(skb, gfp_mask)) {
+ kfree(data);
+ return -ENOMEM;
+ }
+ shinfo = (struct skb_shared_info *)(data + size);
+ for (i = 0; i < nfrags; i++) {
+ int fsize = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+
+ if (pos + fsize > off) {
+ shinfo->frags[k] = skb_shinfo(skb)->frags[i];
+
+ if (pos < off) {
+ /* Split frag.
+ * We have two variants in this case:
+ * 1. Move all the frag to the second
+ * part, if it is possible. F.e.
+ * this approach is mandatory for TUX,
+ * where splitting is expensive.
+ * 2. Split is accurately. We make this.
+ */
+ shinfo->frags[0].page_offset += off - pos;
+ skb_frag_size_sub(&shinfo->frags[0], off - pos);
+ }
+ skb_frag_ref(skb, i);
+ k++;
+ }
+ pos += fsize;
+ }
+ shinfo->nr_frags = k;
+ if (skb_has_frag_list(skb))
+ skb_clone_fraglist(skb);
+
+ if (k == 0) {
+ /* split line is in frag list */
+ pskb_carve_frag_list(skb, shinfo, off - pos, gfp_mask);
+ }
+ skb_release_data(skb);
+
+ doff = (data - skb->head);
+ skb->head = data;
+ skb->head_frag = 0;
+ skb->data = data;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+ skb->end = size;
+ doff = 0;
+#else
+ skb->end = skb->head + size;
+#endif
+ skb_reset_tail_pointer(skb);
+ skb_headers_offset_update(skb, 0);
+ skb->cloned = 0;
+ skb->hdr_len = 0;
+ skb->nohdr = 0;
+ skb->len -= off;
+ skb->data_len = skb->len;
+ atomic_set(&skb_shinfo(skb)->dataref, 1);
+ return 0;
+}
+
+/* remove len bytes from the beginning of the skb */
+static int pskb_carve(struct sk_buff *skb, const u32 len, gfp_t gfp)
+{
+ int headlen = skb_headlen(skb);
+
+ if (len < headlen)
+ return pskb_carve_inside_header(skb, len, headlen, gfp);
+ else
+ return pskb_carve_inside_nonlinear(skb, len, headlen, gfp);
+}
+
+/* Extract to_copy bytes starting at off from skb, and return this in
+ * a new skb
+ */
+struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
+ int to_copy, gfp_t gfp)
+{
+ struct sk_buff *clone = skb_clone(skb, gfp);
+
+ if (!clone)
+ return NULL;
+
+ if (pskb_carve(clone, off, gfp) < 0) {
+ pr_warn("pskb_carve failed\n");
+ kfree_skb(clone);
+ return NULL;
+ }
+
+ if (pskb_trim(clone, to_copy)) {
+ pr_warn("pskb_trim failed\n");
+ kfree_skb(clone);
+ return NULL;
+ }
+ return clone;
+}
+EXPORT_SYMBOL(pskb_extract);
--
1.7.1
* [PATCH net-next 2/2] RDS: TCP: Call pskb_extract() helper function
From: Sowmini Varadhan @ 2016-04-20 10:17 UTC (permalink / raw)
To: netdev, rds-devel, santosh.shilimkar, davem
Cc: sowmini.varadhan, eric.dumazet, marcelo.leitner
In-Reply-To: <cover.1461086306.git.sowmini.varadhan@oracle.com>
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvement when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
net/rds/tcp_recv.c | 14 +++-----------
1 files changed, 3 insertions(+), 11 deletions(-)
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 27a9921..d75d8b5 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -207,22 +207,14 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, struct sk_buff *skb,
}
if (left && tc->t_tinc_data_rem) {
- clone = skb_clone(skb, arg->gfp);
+ to_copy = min(tc->t_tinc_data_rem, left);
+
+ clone = pskb_extract(skb, offset, to_copy, arg->gfp);
if (!clone) {
desc->error = -ENOMEM;
goto out;
}
- to_copy = min(tc->t_tinc_data_rem, left);
- if (!pskb_pull(clone, offset) ||
- pskb_trim(clone, to_copy)) {
- pr_warn("rds_tcp_data_recv: pull/trim failed "
- "left %zu data_rem %zu skb_len %d\n",
- left, tc->t_tinc_data_rem, skb->len);
- kfree_skb(clone);
- desc->error = -ENOMEM;
- goto out;
- }
skb_queue_tail(&tinc->ti_skb_list, clone);
rdsdebug("skb %p data %p len %d off %u to_copy %zu -> "
--
1.7.1
* skb_at_tc_ingress helper breaks compilation of oot modules
From: Ingo Saitz @ 2016-04-20 10:21 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 819 bytes --]
In Linux 4.5, when CONFIG_NET_CLS_ACT is defined, compilation of out of
tree modules breaks with undeclared functions/constants. The culprit is:
commit fdc5432a7b44ab7de17141beec19d946b9344e91
Author: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu Jan 7 15:50:22 2016 +0100
net, sched: add skb_at_tc_ingress helper
which uses G_TC_AT and AT_INGRESS but only includes linux/pkt_cls.h,
which does not include these #defines for oot builds. Unfortunately I'm
not sure what the correct fix is; maybe the uapi folks could help, but I
attached a simple test case and build log (the Makefile is straight from
kernelnewbies).
Ingo
--
╭─╮ Kennedy's Lemma:
╭│───╮ If you can parse Perl, you can solve the Halting Problem.
│╰─│─╯
╰──╯ http://www.perlmonks.org/?node_id=663393
[-- Attachment #2: fail.c --]
[-- Type: text/x-csrc, Size: 22 bytes --]
#include <net/ipv6.h>
[-- Attachment #3: Makefile --]
[-- Type: text/plain, Size: 154 bytes --]
obj-m := fail.o
all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
[-- Attachment #4: make.log --]
[-- Type: text/plain, Size: 1551 bytes --]
make -C /lib/modules/4.5.0-pinguin20160314/build M=/home/ingo/src/linux/pkt_cls-bug modules
make[1]: Entering directory '/usr/src/linux-headers-4.5.0-pinguin20160314'
CC [M] /home/ingo/src/linux/pkt_cls-bug/fail.o
In file included from include/linux/filter.h:16:0,
from include/net/sock.h:64,
from include/linux/tcp.h:22,
from include/linux/ipv6.h:72,
from include/net/ipv6.h:16,
from /home/ingo/src/linux/pkt_cls-bug/fail.c:1:
include/net/sch_generic.h: In function ‘skb_at_tc_ingress’:
include/net/sch_generic.h:413:9: error: implicit declaration of function ‘G_TC_AT’ [-Werror=implicit-function-declaration]
return G_TC_AT(skb->tc_verd) & AT_INGRESS;
^
include/net/sch_generic.h:413:33: error: ‘AT_INGRESS’ undeclared (first use in this function)
return G_TC_AT(skb->tc_verd) & AT_INGRESS;
^
include/net/sch_generic.h:413:33: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors
scripts/Makefile.build:264: recipe for target '/home/ingo/src/linux/pkt_cls-bug/fail.o' failed
make[2]: *** [/home/ingo/src/linux/pkt_cls-bug/fail.o] Error 1
Makefile:1391: recipe for target '_module_/home/ingo/src/linux/pkt_cls-bug' failed
make[1]: *** [_module_/home/ingo/src/linux/pkt_cls-bug] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-4.5.0-pinguin20160314'
Makefile:4: recipe for target 'all' failed
make: *** [all] Error 2
* Re: skb_at_tc_ingress helper breaks compilation of oot modules
From: Daniel Borkmann @ 2016-04-20 10:38 UTC (permalink / raw)
To: Ingo Saitz; +Cc: netdev
In-Reply-To: <20160420102148.GA18615@echse.zoo>
On 04/20/2016 12:21 PM, Ingo Saitz wrote:
> In Linux 4.5, when CONFIG_NET_CLS_ACT is defined, compilation of out of
> tree modules breaks with undeclared functions/constants. The culprit is:
>
> commit fdc5432a7b44ab7de17141beec19d946b9344e91
> Author: Daniel Borkmann <daniel@iogearbox.net>
> Date: Thu Jan 7 15:50:22 2016 +0100
>
> net, sched: add skb_at_tc_ingress helper
>
> which uses G_TC_AT and AT_INGRESS but only includes linux/pkt_cls.h,
> which does not include these #defines for oot builds. Unfortunately I'm
> not sure what the correct fix is; maybe the uapi folks could help, but I
> attached a simple test case and build log (the Makefile is straight from
> kernelnewbies).
Hmm, your fail.c test case only contains '#include <net/ipv6.h>'?
Note, upstream kernel never cared about out-of-tree modules, only
in-tree code. ;) Did you run into an issue with any in-tree code?
Thanks,
Daniel
* Re: skb_at_tc_ingress helper breaks compilation of oot modules
From: Ingo Saitz @ 2016-04-20 10:53 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: Ingo Saitz, netdev
In-Reply-To: <57175C13.8080109@iogearbox.net>
On Wed, Apr 20, 2016 at 12:38:11PM +0200, Daniel Borkmann wrote:
> Hmm, your fail.c test case only contains '#include <net/ipv6.h>'?
No, only when building oot modules (virtualbox, and I found batman-adv
having the same issue), so I reduced it to the most simple test case. This
actually builds a fail.ko on 4.4.7 with CONFIG_NET_CLS_ACT=y without
errors.
> Note, upstream kernel never cared about out-of-tree modules, only
> in-tree code. ;) Did you run into an issue with any in-tree code?
A current fix for oot modules would be to add:
#include <uapi/linux/pkt_cls.h>
in front of that include for kernel >= 4.5…
Ingo
--
Kennedy's Lemma:
If you can parse Perl, you can solve the Halting Problem.
http://www.perlmonks.org/?node_id=663393
* Re: [PATCH V2] net: ethernet: mellanox: correct page conversion
From: Eran Ben Elisha @ 2016-04-20 11:08 UTC (permalink / raw)
To: Sinan Kaya
Cc: Christoph Hellwig, linux-rdma, timur, cov, Yishai Hadas,
Linux Netdev List, linux-kernel
In-Reply-To: <57167AF6.9090507@codeaurora.org>
Hi Sinan,
We at Mellanox are working on a solution which removes the vmap call
and allocates contiguous memory (using dma_alloc_coherent).
Thanks,
Eran
On Tue, Apr 19, 2016 at 9:37 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> On 4/19/2016 2:22 PM, Christoph Hellwig wrote:
>> What I think we need is something like the patch below. In the long
>> run we should also kill the mlx4_buf structure which now is pretty
>> pointless.
>
> Maybe; this could be the correct approach if we can guarantee that the
> architecture can allocate the requested amount of memory with
> dma_alloc_coherent.
>
> When I brought up this issue a year ago, the objection was that my code
> doesn't compile on intel (dma_to_phys) and also that some arches run out
> of DMA memory with the existing customer base.
>
> If there is a real need to maintain compatibility with the existing
> architectures due to limited amount of DMA memory, we need to correct this
> code to make proper use of vmap via alloc_pages and also insert the
> dma_sync in proper places for DMA API conformance.
>
> Also, the tx descriptors always have to be allocated from a single DMA region
> or the tx code needs to be corrected to support page_list.
>
> If we can live with just using dma_alloc_coherent, your solution is
> better. I was trying to put this support for 64bit arches only while
> maintaining support for the existing code base.
>
>>
>> ---
>> From a493881d2a6c90152d3daabb7b6b3afd1d254d78 Mon Sep 17 00:00:00 2001
>> From: Christoph Hellwig <hch@lst.de>
>> Date: Tue, 19 Apr 2016 14:12:14 -0400
>> Subject: mlx4_en: don't try to split and vmap dma coherent allocations
>>
>> The memory returned by dma_alloc_coherent is not suitable for calling
>> virt_to_page on it, as it might for example come from vmap allocator.
>>
>> Remove the code that calls virt_to_page and vmap on dma coherent
>> allocations from the mlx4 drivers, and replace them with a single
>> high-order dma_alloc_coherent call.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> Reported-by: Sinan Kaya <okaya@codeaurora.org>
>
>
> --
> Sinan Kaya
> Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
* Re: [Intel-gfx] [PATCH 4/4] drm/i915: Move ioremap_wc tracking onto VMA
From: Daniel Vetter @ 2016-04-20 11:17 UTC (permalink / raw)
To: Luis R. Rodriguez
Cc: David Hildenbrand, netdev, intel-gfx, linux-kernel, dri-devel,
Peter Zijlstra (Intel), Daniel Vetter, Dan Williams, Yishai Hadas,
Ingo Molnar, linux-rdma
In-Reply-To: <20160420091054.GL1990@wotan.suse.de>
On Wed, Apr 20, 2016 at 11:10:54AM +0200, Luis R. Rodriguez wrote:
> Reason I ask is since I noticed a while ago a lot of drivers
> were using info->fix.smem_start and info->fix.smem_len consistently
> for their ioremap'd areas it might make sense instead to let the
> internal framebuffer (register_framebuffer()) optionally manage the
> ioremap_wc() for drivers, given that this is pretty generic stuff.
All that legacy fbdev stuff is just for legacy support, and I prefer to
have that as dumb as possible. There's been some discussion even around
lifting the "kick out firmware fb driver" out of fbdev, since we'd need it
to have a simple drm driver for e.g. uefi.
But I definitely don't want a legacy horror show like fbdev to
automagically take care of device mappings for drivers.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* [PATCH] MAINTAINERS: net: add entry for TI Ethernet Switch drivers
From: Grygorii Strashko @ 2016-04-20 11:25 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: Sekhar Nori, Tony Lindgren, linux-omap, Grygorii Strashko,
David S. Miller, Mugunthan V N, Richard Cochran
Add record for TI Ethernet Switch Driver CPSW/CPDMA/MDIO HW
(am33/am43/am57/dr7/davinci) to ensure that related patches
will go through dedicated linux-omap list.
Also add Mugunthan as maintainer and myself as the reviewer.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
MAINTAINERS | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 1d5b4be..aca864d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11071,6 +11071,14 @@ S: Maintained
F: drivers/clk/ti/
F: include/linux/clk/ti.h
+TI ETHERNET SWITCH DRIVER (CPSW)
+M: Mugunthan V N <mugunthanvnm@ti.com>
+R: Grygorii Strashko <grygorii.strashko@ti.com>
+L: linux-omap@vger.kernel.org
+S: Maintained
+F: drivers/net/ethernet/ti/cpsw*
+F: drivers/net/ethernet/ti/davinci*
+
TI FLASH MEDIA INTERFACE DRIVER
M: Alex Dubov <oakad@yahoo.com>
S: Maintained
--
2.8.1
* [PATCH] net: phy: spi_ks8895: Don't leak references to SPI devices
From: Mark Brown @ 2016-04-20 11:54 UTC (permalink / raw)
To: Florian Fainelli; +Cc: netdev, Mark Brown
The ks8995 driver uses spi_dev_get(), apparently just to take a copy
of the SPI device used to instantiate it, but never calls spi_dev_put()
to release the reference. Since the device is guaranteed to exist between
probe() and remove(), the driver has no need to take an extra
reference to it, so fix the leak by using a plain assignment.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
drivers/net/phy/spi_ks8995.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/phy/spi_ks8995.c b/drivers/net/phy/spi_ks8995.c
index b5d50d458728..93ffedfa2994 100644
--- a/drivers/net/phy/spi_ks8995.c
+++ b/drivers/net/phy/spi_ks8995.c
@@ -441,7 +441,7 @@ static int ks8995_probe(struct spi_device *spi)
return -ENOMEM;
mutex_init(&ks->lock);
- ks->spi = spi_dev_get(spi);
+ ks->spi = spi;
ks->chip = &ks8995_chip[variant];
if (ks->spi->dev.of_node) {
--
2.8.0.rc3
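The leak described in this patch can be illustrated with a small user-space model. This is only a sketch: `toy_dev`, `toy_dev_get()` and the probe/remove helpers are hypothetical stand-ins, not the kernel SPI API. It assumes the bus core owns one reference for the whole probe()/remove() window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted device; not the kernel's struct. */
struct toy_dev {
	int refcount;
};

/* Models spi_dev_get(): take an extra reference and return the device. */
struct toy_dev *toy_dev_get(struct toy_dev *d)
{
	if (d)
		d->refcount++;
	return d;
}

struct toy_drv {
	struct toy_dev *dev;
};

/* Old probe: takes a reference that remove() never drops. */
void toy_probe_leaky(struct toy_drv *drv, struct toy_dev *d)
{
	drv->dev = toy_dev_get(d);
}

/* Fixed probe: plain assignment, the core already holds the device. */
void toy_probe_fixed(struct toy_drv *drv, struct toy_dev *d)
{
	drv->dev = d;
}

/* remove() in both variants: no put, just forget the pointer. */
void toy_remove(struct toy_drv *drv)
{
	drv->dev = NULL;
}

/* Returns how many references are left over after one probe/remove cycle. */
int toy_leaked_refs(int use_get)
{
	struct toy_dev d = { .refcount = 1 };	/* reference owned by the core */
	struct toy_drv drv = { NULL };

	if (use_get)
		toy_probe_leaky(&drv, &d);
	else
		toy_probe_fixed(&drv, &d);
	toy_remove(&drv);

	assert(d.refcount >= 1);
	return d.refcount - 1;
}
```

With the extra get, one reference stays behind per probe/remove cycle; with the plain assignment, none does.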
^ permalink raw reply related
* Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Jiri Benc @ 2016-04-20 12:48 UTC (permalink / raw)
To: Johannes Berg
Cc: David Ahern, David Miller, eric.dumazet, roopa, netdev, jhs,
tgraf, nicolas.dichtel, egrumbach
In-Reply-To: <1461137540.2176.5.camel@sipsolutions.net>
On Wed, 20 Apr 2016 09:32:20 +0200, Johannes Berg wrote:
> 2) Use the new attribute flag with some required attribute for
> existing commands, so that older kernel will not find the required
> attribute and will reject the operation entirely.
> May or may not fall back to trying the operation again without the
> flag.
This is basically what I submitted half a year ago. See:
http://thread.gmane.org/gmane.linux.network/382850
Jiri
^ permalink raw reply
* [PATCH net 0/4] Mellanox 40G driver fixes for 4.6-rc
From: Or Gerlitz @ 2016-04-20 13:01 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Yishai Hadas, Or Gerlitz
Hi Dave,
With the fix for the ARM bug still in the works, here are a
few other mlx4 fixes that are ready to go.
Eran addressed the wrong reporting of dropped packets, Daniel
fixed some matters related to PPC EEH, and Jenny's patch makes sure
VFs can't change the port's pause settings.
Or.
Daniel Jurgens (2):
net/mlx4_core: Implement pci_resume callback
net/mlx4_core: Avoid repeated calls to pci enable/disable
Eran Ben Elisha (1):
net/mlx4_en: Split SW RX dropped counter per RX ring
Eugenia Emantayev (1):
net/mlx4_core: Don't allow to VF change global pause settings
drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 5 +-
drivers/net/ethernet/mellanox/mlx4/en_port.c | 5 +-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/main.c | 76 ++++++++++++++++++-------
drivers/net/ethernet/mellanox/mlx4/mlx4.h | 2 +
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 1 +
drivers/net/ethernet/mellanox/mlx4/port.c | 13 +++++
include/linux/mlx4/device.h | 7 +++
8 files changed, 89 insertions(+), 22 deletions(-)
--
2.3.7
^ permalink raw reply
* [PATCH net 2/4] net/mlx4_core: Avoid repeated calls to pci enable/disable
From: Or Gerlitz @ 2016-04-20 13:01 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Eran Ben Elisha, Yishai Hadas, Daniel Jurgens, Or Gerlitz
In-Reply-To: <1461157278-18528-1-git-send-email-ogerlitz@mellanox.com>
From: Daniel Jurgens <danielj@mellanox.com>
Maintain the PCI status and provide wrappers for enabling and disabling
the PCI device. Performing either action more than once without its
opposite in between results in warning logs.
This occurred when EEH hotplugged the device, causing a warning for
disabling an already disabled device.
Fixes: 2ba5fbd62b25 ('net/mlx4_core: Handle AER flow properly')
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/main.c | 39 +++++++++++++++++++++++++++----
include/linux/mlx4/device.h | 7 ++++++
2 files changed, 41 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 5d45aa3..12c77a7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3172,6 +3172,34 @@ static int mlx4_check_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap
return 0;
}
+static int mlx4_pci_enable_device(struct mlx4_dev *dev)
+{
+ struct pci_dev *pdev = dev->persist->pdev;
+ int err = 0;
+
+ mutex_lock(&dev->persist->pci_status_mutex);
+ if (dev->persist->pci_status == MLX4_PCI_STATUS_DISABLED) {
+ err = pci_enable_device(pdev);
+ if (!err)
+ dev->persist->pci_status = MLX4_PCI_STATUS_ENABLED;
+ }
+ mutex_unlock(&dev->persist->pci_status_mutex);
+
+ return err;
+}
+
+static void mlx4_pci_disable_device(struct mlx4_dev *dev)
+{
+ struct pci_dev *pdev = dev->persist->pdev;
+
+ mutex_lock(&dev->persist->pci_status_mutex);
+ if (dev->persist->pci_status == MLX4_PCI_STATUS_ENABLED) {
+ pci_disable_device(pdev);
+ dev->persist->pci_status = MLX4_PCI_STATUS_DISABLED;
+ }
+ mutex_unlock(&dev->persist->pci_status_mutex);
+}
+
static int mlx4_load_one(struct pci_dev *pdev, int pci_dev_data,
int total_vfs, int *nvfs, struct mlx4_priv *priv,
int reset_flow)
@@ -3582,7 +3610,7 @@ static int __mlx4_init_one(struct pci_dev *pdev, int pci_dev_data,
pr_info(DRV_NAME ": Initializing %s\n", pci_name(pdev));
- err = pci_enable_device(pdev);
+ err = mlx4_pci_enable_device(&priv->dev);
if (err) {
dev_err(&pdev->dev, "Cannot enable PCI device, aborting\n");
return err;
@@ -3715,7 +3743,7 @@ err_release_regions:
pci_release_regions(pdev);
err_disable_pdev:
- pci_disable_device(pdev);
+ mlx4_pci_disable_device(&priv->dev);
pci_set_drvdata(pdev, NULL);
return err;
}
@@ -3775,6 +3803,7 @@ static int mlx4_init_one(struct pci_dev *pdev, const struct pci_device_id *id)
priv->pci_dev_data = id->driver_data;
mutex_init(&dev->persist->device_state_mutex);
mutex_init(&dev->persist->interface_state_mutex);
+ mutex_init(&dev->persist->pci_status_mutex);
ret = devlink_register(devlink, &pdev->dev);
if (ret)
@@ -3923,7 +3952,7 @@ static void mlx4_remove_one(struct pci_dev *pdev)
}
pci_release_regions(pdev);
- pci_disable_device(pdev);
+ mlx4_pci_disable_device(dev);
devlink_unregister(devlink);
kfree(dev->persist);
devlink_free(devlink);
@@ -4042,7 +4071,7 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
if (state == pci_channel_io_perm_failure)
return PCI_ERS_RESULT_DISCONNECT;
- pci_disable_device(pdev);
+ mlx4_pci_disable_device(persist->dev);
return PCI_ERS_RESULT_NEED_RESET;
}
@@ -4053,7 +4082,7 @@ static pci_ers_result_t mlx4_pci_slot_reset(struct pci_dev *pdev)
int err;
mlx4_err(dev, "mlx4_pci_slot_reset was called\n");
- err = pci_enable_device(pdev);
+ err = mlx4_pci_enable_device(dev);
if (err) {
mlx4_err(dev, "Can not re-enable device, err=%d\n", err);
return PCI_ERS_RESULT_DISCONNECT;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 8541a91..d1f904c 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -828,6 +828,11 @@ struct mlx4_vf_dev {
u8 n_ports;
};
+enum mlx4_pci_status {
+ MLX4_PCI_STATUS_DISABLED,
+ MLX4_PCI_STATUS_ENABLED,
+};
+
struct mlx4_dev_persistent {
struct pci_dev *pdev;
struct mlx4_dev *dev;
@@ -841,6 +846,8 @@ struct mlx4_dev_persistent {
u8 state;
struct mutex interface_state_mutex; /* protect SW state */
u8 interface_state;
+ struct mutex pci_status_mutex; /* sync pci state */
+ enum mlx4_pci_status pci_status;
};
struct mlx4_dev {
--
2.3.7
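The status-tracking idea behind the wrappers in this patch can be sketched in user space. The names below (`toy_persist`, `toy_pci_enable()`, `toy_pci_disable()`) are hypothetical, the counters stand in for the real pci_enable_device()/pci_disable_device() calls, and the pci_status_mutex that the patch takes around each transition is left out to keep the sketch single-threaded.

```c
#include <assert.h>

enum toy_pci_status {
	TOY_PCI_STATUS_DISABLED,
	TOY_PCI_STATUS_ENABLED,
};

struct toy_persist {
	enum toy_pci_status pci_status;
	int hw_enable_calls;	/* stands in for pci_enable_device() */
	int hw_disable_calls;	/* stands in for pci_disable_device() */
};

/* Enable only when currently disabled; repeated calls are no-ops. */
int toy_pci_enable(struct toy_persist *p)
{
	assert(p);
	if (p->pci_status == TOY_PCI_STATUS_DISABLED) {
		p->hw_enable_calls++;
		p->pci_status = TOY_PCI_STATUS_ENABLED;
	}
	return 0;
}

/* Disable only when currently enabled; repeated calls are no-ops. */
void toy_pci_disable(struct toy_persist *p)
{
	assert(p);
	if (p->pci_status == TOY_PCI_STATUS_ENABLED) {
		p->hw_disable_calls++;
		p->pci_status = TOY_PCI_STATUS_DISABLED;
	}
}
```

Calling either helper twice in a row reaches the underlying operation only once, which is the property that avoids the "already disabled" warning described above.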
^ permalink raw reply related
* [PATCH net 3/4] net/mlx4_core: Don't allow to VF change global pause settings
From: Or Gerlitz @ 2016-04-20 13:01 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Eran Ben Elisha, Yishai Hadas, Eugenia Emantayev,
Saeed Mahameed, Or Gerlitz
In-Reply-To: <1461157278-18528-1-git-send-email-ogerlitz@mellanox.com>
From: Eugenia Emantayev <eugenia@mellanox.com>
Currently changing global pause settings is done via SET_PORT
command with input modifier GENERAL. This command is allowed
for each VF since MTU setting is done via the same command.
Change the above to the following scheme: before passing the
request to the FW, the PF will check whether it was issued
by a slave. If yes, don't change global pause and warn,
otherwise change to the requested value and store for
further reference.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/mlx4.h | 2 ++
drivers/net/ethernet/mellanox/mlx4/port.c | 13 +++++++++++++
2 files changed, 15 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index ef96831..c9d7fc51 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -586,6 +586,8 @@ struct mlx4_mfunc_master_ctx {
struct mlx4_master_qp0_state qp0_state[MLX4_MAX_PORTS + 1];
int init_port_ref[MLX4_MAX_PORTS + 1];
u16 max_mtu[MLX4_MAX_PORTS + 1];
+ u8 pptx;
+ u8 pprx;
int disable_mcast_ref[MLX4_MAX_PORTS + 1];
struct mlx4_resource_tracker res_tracker;
struct workqueue_struct *comm_wq;
diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index 211c650..087b23b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -1317,6 +1317,19 @@ static int mlx4_common_set_port(struct mlx4_dev *dev, int slave, u32 in_mod,
}
gen_context->mtu = cpu_to_be16(master->max_mtu[port]);
+ /* Slave cannot change Global Pause configuration */
+ if (slave != mlx4_master_func_num(dev) &&
+ ((gen_context->pptx != master->pptx) ||
+ (gen_context->pprx != master->pprx))) {
+ gen_context->pptx = master->pptx;
+ gen_context->pprx = master->pprx;
+ mlx4_warn(dev,
+ "denying Global Pause change for slave:%d\n",
+ slave);
+ } else {
+ master->pptx = gen_context->pptx;
+ master->pprx = gen_context->pprx;
+ }
break;
case MLX4_SET_PORT_GID_TABLE:
/* change to MULTIPLE entries: number of guest's gids
--
2.3.7
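The scheme in this patch can be modelled outside the driver as follows. This is a minimal sketch under stated assumptions: `toy_master`, `toy_req` and `toy_apply_pause()` are hypothetical names, and `master_func` plays the role of the value returned by mlx4_master_func_num().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pause settings last accepted by the PF (hypothetical layout). */
struct toy_master {
	uint8_t pptx;
	uint8_t pprx;
};

/* Pause fields of one SET_PORT GENERAL request (hypothetical layout). */
struct toy_req {
	uint8_t pptx;
	uint8_t pprx;
};

/*
 * Returns true when a slave tried to change global pause and the request
 * was overridden with the stored values. Otherwise the stored values are
 * updated from the request and false is returned.
 */
bool toy_apply_pause(struct toy_master *m, struct toy_req *r,
		     int slave, int master_func)
{
	assert(m && r);
	if (slave != master_func &&
	    (r->pptx != m->pptx || r->pprx != m->pprx)) {
		r->pptx = m->pptx;
		r->pprx = m->pprx;
		return true;
	}
	m->pptx = r->pptx;
	m->pprx = r->pprx;
	return false;
}
```

A slave request that already matches the stored values falls through to the second branch, where the store is a no-op, so MTU changes from VFs keep working.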
^ permalink raw reply related
* [PATCH net 1/4] net/mlx4_core: Implement pci_resume callback
From: Or Gerlitz @ 2016-04-20 13:01 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Eran Ben Elisha, Yishai Hadas, Daniel Jurgens, Or Gerlitz
In-Reply-To: <1461157278-18528-1-git-send-email-ogerlitz@mellanox.com>
From: Daniel Jurgens <danielj@mellanox.com>
Move resume related activities to a new pci_resume function instead of
performing them in mlx4_pci_slot_reset. This change is needed to avoid
a hotplug during EEH recovery due to commit f2da4ccf8bd4 ("powerpc/eeh:
More relaxed hotplug criterion").
Fixes: 2ba5fbd62b25 ('net/mlx4_core: Handle AER flow properly')
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/main.c | 39 +++++++++++++++++++------------
1 file changed, 24 insertions(+), 15 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 358f723..5d45aa3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -4050,45 +4050,53 @@ static pci_ers_result_t mlx4_pci_slot_reset(struct pci_dev *pdev)
{
struct mlx4_dev_persistent *persist = pci_get_drvdata(pdev);
struct mlx4_dev *dev = persist->dev;
- struct mlx4_priv *priv = mlx4_priv(dev);
- int ret;
- int nvfs[MLX4_MAX_PORTS + 1] = {0, 0, 0};
- int total_vfs;
+ int err;
mlx4_err(dev, "mlx4_pci_slot_reset was called\n");
- ret = pci_enable_device(pdev);
- if (ret) {
- mlx4_err(dev, "Can not re-enable device, ret=%d\n", ret);
+ err = pci_enable_device(pdev);
+ if (err) {
+ mlx4_err(dev, "Can not re-enable device, err=%d\n", err);
return PCI_ERS_RESULT_DISCONNECT;
}
pci_set_master(pdev);
pci_restore_state(pdev);
pci_save_state(pdev);
+ return PCI_ERS_RESULT_RECOVERED;
+}
+
+static void mlx4_pci_resume(struct pci_dev *pdev)
+{
+ struct mlx4_dev_persistent *persist = pci_get_drvdata(pdev);
+ struct mlx4_dev *dev = persist->dev;
+ struct mlx4_priv *priv = mlx4_priv(dev);
+ int nvfs[MLX4_MAX_PORTS + 1] = {0, 0, 0};
+ int total_vfs;
+ int err;
+ mlx4_err(dev, "%s was called\n", __func__);
total_vfs = dev->persist->num_vfs;
memcpy(nvfs, dev->persist->nvfs, sizeof(dev->persist->nvfs));
mutex_lock(&persist->interface_state_mutex);
if (!(persist->interface_state & MLX4_INTERFACE_STATE_UP)) {
- ret = mlx4_load_one(pdev, priv->pci_dev_data, total_vfs, nvfs,
+ err = mlx4_load_one(pdev, priv->pci_dev_data, total_vfs, nvfs,
priv, 1);
- if (ret) {
- mlx4_err(dev, "%s: mlx4_load_one failed, ret=%d\n",
- __func__, ret);
+ if (err) {
+ mlx4_err(dev, "%s: mlx4_load_one failed, err=%d\n",
+ __func__, err);
goto end;
}
- ret = restore_current_port_types(dev, dev->persist->
+ err = restore_current_port_types(dev, dev->persist->
curr_port_type, dev->persist->
curr_port_poss_type);
- if (ret)
- mlx4_err(dev, "could not restore original port types (%d)\n", ret);
+ if (err)
+ mlx4_err(dev, "could not restore original port types (%d)\n", err);
}
end:
mutex_unlock(&persist->interface_state_mutex);
- return ret ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
}
static void mlx4_shutdown(struct pci_dev *pdev)
@@ -4105,6 +4113,7 @@ static void mlx4_shutdown(struct pci_dev *pdev)
static const struct pci_error_handlers mlx4_err_handler = {
.error_detected = mlx4_pci_err_detected,
.slot_reset = mlx4_pci_slot_reset,
+ .resume = mlx4_pci_resume,
};
static struct pci_driver mlx4_driver = {
--
2.3.7
^ permalink raw reply related
* [PATCH net 4/4] net/mlx4_en: Split SW RX dropped counter per RX ring
From: Or Gerlitz @ 2016-04-20 13:01 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Eran Ben Elisha, Yishai Hadas, Saeed Mahameed, Or Gerlitz
In-Reply-To: <1461157278-18528-1-git-send-email-ogerlitz@mellanox.com>
From: Eran Ben Elisha <eranbe@mellanox.com>
Count SW packet drops per RX ring instead of a global counter. This
will allow monitoring the number of rx drops per ring.
In addition, the SW rx_dropped counter was overwritten by the HW rx_dropped
counter; sum both of them instead to show the accurate value.
Fixes: a3333b35da16 ('net/mlx4_en: Moderate ethtool callback to [...] ')
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 5 ++++-
drivers/net/ethernet/mellanox/mlx4/en_port.c | 5 ++++-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 1 +
4 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index f69584a..c761194 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -337,7 +337,7 @@ static int mlx4_en_get_sset_count(struct net_device *dev, int sset)
case ETH_SS_STATS:
return bitmap_iterator_count(&it) +
(priv->tx_ring_num * 2) +
- (priv->rx_ring_num * 2);
+ (priv->rx_ring_num * 3);
case ETH_SS_TEST:
return MLX4_EN_NUM_SELF_TEST - !(priv->mdev->dev->caps.flags
& MLX4_DEV_CAP_FLAG_UC_LOOPBACK) * 2;
@@ -404,6 +404,7 @@ static void mlx4_en_get_ethtool_stats(struct net_device *dev,
for (i = 0; i < priv->rx_ring_num; i++) {
data[index++] = priv->rx_ring[i]->packets;
data[index++] = priv->rx_ring[i]->bytes;
+ data[index++] = priv->rx_ring[i]->dropped;
}
spin_unlock_bh(&priv->stats_lock);
@@ -477,6 +478,8 @@ static void mlx4_en_get_strings(struct net_device *dev,
"rx%d_packets", i);
sprintf(data + (index++) * ETH_GSTRING_LEN,
"rx%d_bytes", i);
+ sprintf(data + (index++) * ETH_GSTRING_LEN,
+ "rx%d_dropped", i);
}
break;
case ETH_SS_PRIV_FLAGS:
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index 3904b5f..20b6c2e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -158,6 +158,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
u64 in_mod = reset << 8 | port;
int err;
int i, counter_index;
+ unsigned long sw_rx_dropped = 0;
mailbox = mlx4_alloc_cmd_mailbox(mdev->dev);
if (IS_ERR(mailbox))
@@ -180,6 +181,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
for (i = 0; i < priv->rx_ring_num; i++) {
stats->rx_packets += priv->rx_ring[i]->packets;
stats->rx_bytes += priv->rx_ring[i]->bytes;
+ sw_rx_dropped += priv->rx_ring[i]->dropped;
priv->port_stats.rx_chksum_good += priv->rx_ring[i]->csum_ok;
priv->port_stats.rx_chksum_none += priv->rx_ring[i]->csum_none;
priv->port_stats.rx_chksum_complete += priv->rx_ring[i]->csum_complete;
@@ -236,7 +238,8 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
&mlx4_en_stats->MCAST_prio_1,
NUM_PRIORITIES);
stats->collisions = 0;
- stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
+ stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP) +
+ sw_rx_dropped;
stats->rx_length_errors = be32_to_cpu(mlx4_en_stats->RdropLength);
stats->rx_over_errors = 0;
stats->rx_crc_errors = be32_to_cpu(mlx4_en_stats->RCRC);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 86bcfe5..91abc13 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -939,7 +939,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
/* GRO not possible, complete processing here */
skb = mlx4_en_rx_skb(priv, rx_desc, frags, length);
if (!skb) {
- priv->stats.rx_dropped++;
+ ring->dropped++;
goto next;
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d12ab6a..63b1aea 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -323,6 +323,7 @@ struct mlx4_en_rx_ring {
unsigned long csum_ok;
unsigned long csum_none;
unsigned long csum_complete;
+ unsigned long dropped;
int hwtstamp_rx_filter;
cpumask_var_t affinity_mask;
};
--
2.3.7
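The accounting change in this patch amounts to summing the per-ring software drops and adding them to the hardware drop count instead of letting one overwrite the other. A small sketch, assuming a hypothetical `toy_rx_ring` with only the fields needed here and a `hw_rdrop` value already converted to CPU byte order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-ring counters, modelled on the fields in the patch. */
struct toy_rx_ring {
	unsigned long packets;
	unsigned long bytes;
	unsigned long dropped;	/* SW drops counted on this ring */
};

/* Total rx_dropped = HW drop counter + sum of SW drops over all rings. */
unsigned long toy_total_rx_dropped(const struct toy_rx_ring *rings,
				   size_t nrings, uint32_t hw_rdrop)
{
	unsigned long sw_rx_dropped = 0;
	size_t i;

	assert(rings || nrings == 0);
	for (i = 0; i < nrings; i++)
		sw_rx_dropped += rings[i].dropped;

	return (unsigned long)hw_rdrop + sw_rx_dropped;
}
```

Keeping the counter per ring also lets ethtool expose one rx%d_dropped value per ring, which is why the stats count grows from two to three entries per RX ring.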
^ permalink raw reply related
* Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Johannes Berg @ 2016-04-20 13:17 UTC (permalink / raw)
To: Jiri Benc
Cc: David Ahern, David Miller, eric.dumazet, roopa, netdev, jhs,
tgraf, nicolas.dichtel, egrumbach
In-Reply-To: <20160420144828.5537dce7@griffin>
On Wed, 2016-04-20 at 14:48 +0200, Jiri Benc wrote:
> On Wed, 20 Apr 2016 09:32:20 +0200, Johannes Berg wrote:
> >
> > 2) Use the new attribute flag with some required attribute for
> > existing commands, so that older kernel will not find the
> > required
> > attribute and will reject the operation entirely.
> > May or may not fall back to trying the operation again without
> > the
> > flag.
> This is basically what I submitted half a year ago. See:
> http://thread.gmane.org/gmane.linux.network/382850
>
That looks like a *huge* patchset though - whereas my proposal really
required only what Emmanuel sent in this thread. It did make some
assumptions, for example that any attribute lower than the "maxtype"
argument to nla_parse() was understood. [1]
Looks like you have this on a per-message basis. I thought it was
better on an attribute basis because that's really where the issue is.
You can still detect it with the per-attribute flag approach as I
described in (2) - if, for your lwtunnel example, you could specify the
flag on the RTA_ENCAP attribute, without which no lwtunnel can be
created (if I understand the code correctly.)
johannes
[1] for example, if I have three attributes:
enum attrs {__unused, A, B, C};
and the policy
policy = {
[A] = { .type = NLA_U32 },
[C] = { .type = NLA_U8 },
}
and then do
nla_parse(tb, 3, msg, msg_len, &policy)
it would assume that "B" is valid. Since this policy is equivalent to
the policy with
[B] = { .type = NLA_BINARY }
(minimum length 0) we could also reject anything that has type=len=0 in
the policy, if the NLA_F_NET_MUST_PARSE flag is set in the nla_type.
This would likely be the right approach for most netlink families,
since they usually don't have holes that they actually care about -
I've yet to see any attribute that's not specified at all in the policy
but used anyway, normally you want some level of checking, and indicate
that by using { .type = NLA_BINARY } - but other things are possible.
johannes
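The rejection rule from footnote [1] can be sketched as a stand-alone check. Everything here is hypothetical: the `TOY_*` names, the bit chosen for the must-parse flag, and the decision to reject out-of-range attributes when the flag is set are assumptions for illustration, not the kernel's netlink API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_type { TOY_UNSPEC = 0, TOY_U8, TOY_U32, TOY_BINARY };

struct toy_policy {
	enum toy_type type;
	uint16_t len;
};

/* Hypothetical flag bit and type mask; the real values are not settled. */
#define TOY_F_MUST_PARSE	(1u << 13)
#define TOY_TYPE_MASK		(TOY_F_MUST_PARSE - 1)

enum { TOY_ATTR_UNUSED, TOY_ATTR_A, TOY_ATTR_B, TOY_ATTR_C,
       TOY_ATTR_MAX = TOY_ATTR_C };

/* Same shape as the example: A and C have a policy, B is a hole. */
static const struct toy_policy toy_policy[TOY_ATTR_MAX + 1] = {
	[TOY_ATTR_A] = { .type = TOY_U32 },
	[TOY_ATTR_C] = { .type = TOY_U8 },
};

/* Returns true if an attribute with this nla_type would be accepted. */
bool toy_attr_accepted(uint16_t nla_type)
{
	uint16_t type = nla_type & TOY_TYPE_MASK;
	bool must_parse = (nla_type & TOY_F_MUST_PARSE) != 0;

	/* Out of range: silently ignored today, rejected if flagged. */
	if (type == 0 || type > TOY_ATTR_MAX)
		return !must_parse;

	/* A hole in the policy (type = len = 0) is rejected if flagged. */
	if (must_parse && toy_policy[type].type == TOY_UNSPEC &&
	    toy_policy[type].len == 0)
		return false;

	return true;
}
```

Without the flag, B is still accepted as before; with the flag set, only attributes that have an explicit policy entry get through.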
^ permalink raw reply
* Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Jiri Benc @ 2016-04-20 13:34 UTC (permalink / raw)
To: Johannes Berg
Cc: David Ahern, David Miller, eric.dumazet, roopa, netdev, jhs,
tgraf, nicolas.dichtel, egrumbach
In-Reply-To: <1461158228.2176.18.camel@sipsolutions.net>
On Wed, 20 Apr 2016 15:17:08 +0200, Johannes Berg wrote:
> Looks like you have this on a per-message basis. I thought it was
> better on an attribute basis because that's really where the issue is.
No problem. I'm not that happy with my patchset myself. Just wanted to
point it out in case it's useful.
Jiri
^ permalink raw reply
* Re: [PATCH V2] net: ethernet: mellanox: correct page conversion
From: Sinan Kaya @ 2016-04-20 13:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, timur-sgV2jX0FEOL9JmXXK+q4OQ,
cov-sgV2jX0FEOL9JmXXK+q4OQ, Yishai Hadas,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20160419182212.GA8441-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
On 4/19/2016 2:22 PM, Christoph Hellwig wrote:
> What I think we need is something like the patch below. In the long
> run we should also kill the mlx4_buf structure which now is pretty
> pointless.
>
It has been 1.5 years since I reported the problem. We came up with three
different solutions this week. I'd like to see one version of the solution
merged until Mellanox comes up with a better solution in another
patch. My proposal is to use this one.
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -588,7 +588,7 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
{
dma_addr_t t;
- if (size <= max_direct) {
+ if ((size <= max_direct) || (BITS_PER_LONG == 64)){
buf->nbufs = 1;
buf->npages = 1;
buf->page_shift = get_order(size) + PAGE_SHIFT;
Of course, this is assuming that you are not ready to submit your patch yet. If you
are, feel free to post.
--
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH V2] net: ethernet: mellanox: correct page conversion
From: Sinan Kaya @ 2016-04-20 13:38 UTC (permalink / raw)
To: eranlinuxmellanox-Re5JQEeQqe8AvxtiuMwx3w
Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
timur-sgV2jX0FEOL9JmXXK+q4OQ, cov-sgV2jX0FEOL9JmXXK+q4OQ,
Yishai Hadas, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <571785A5.5040306-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
Apologies,
Replied to an older post by mistake. I was trying to reply to Eran.
>Hi Sinan,
>
>We are working in Mellanox for a solution which
>removes the vmap call and allocate contiguous memory (using dma_alloc_coherent).
>
>Thanks,
>Eran
>
>
>On 4/20/2016 9:35 AM, Sinan Kaya wrote:
> On 4/19/2016 2:22 PM, Christoph Hellwig wrote:
>> What I think we need is something like the patch below. In the long
>> run we should also kill the mlx4_buf structure which now is pretty
>> pointless.
>>
>
> It has been 1.5 years since I reported the problem. We came up with three
> different solutions this week. I'd like to see one version of the solution
> merged until Mellanox comes up with a better solution in another
> patch. My proposal is to use this one.
>
> --- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
> @@ -588,7 +588,7 @@ int mlx4_buf_alloc(struct mlx4_dev *dev, int size, int max_direct,
> {
> dma_addr_t t;
>
> - if (size <= max_direct) {
> + if ((size <= max_direct) || (BITS_PER_LONG == 64)){
> buf->nbufs = 1;
> buf->npages = 1;
> buf->page_shift = get_order(size) + PAGE_SHIFT;
>
> Of course, this is assuming that you are not ready to submit your patch yet. If you
> are, feel free to post.
>
--
Sinan Kaya
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
^ permalink raw reply