Netdev List

Netdev List
 help / color / mirror / Atom feed

* llvm-objdump...
From: David Miller @ 2017-04-25 17:13 UTC (permalink / raw)
  To: ast; +Cc: daniel, netdev


I think there are some endianness issues ;-)

davem@patience:~/src/GIT/net-next/tools/testing/selftests/bpf$ llvm-objdump -S x.o

x.o:    file format ELF64-BPF

Disassembly of section test1:
process:
       0:       b7 00 00 00 00 00 00 02         r0 = 33554432
       1:       61 21 00 50 00 00 00 00         r1 = *(u32 *)(r2 + 20480)

That first instruction should be "r0 = 2"

^ permalink raw reply

* [PATCH v2] macsec: dynamically allocate space for sglist
From: Jason A. Donenfeld @ 2017-04-25 17:08 UTC (permalink / raw)
  To: Netdev, LKML, David Miller, stable, security, Sabrina Dubroca
  Cc: Jason A. Donenfeld
In-Reply-To: <20170425163602.GA17973@bistromath.localdomain>

We call skb_cow_data, which is good anyway to ensure we can actually
modify the skb as such (another error from prior). Now that we have the
number of fragments required, we can safely allocate exactly that amount
of memory.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: security@kernel.org
Cc: stable@vger.kernel.org
---
Changes v1 -> v2:
  - sg_init_table now takes the correct argument.

 drivers/net/macsec.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index dbab05afcdbe..49ce4e9f4a0f 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -617,7 +617,8 @@ static void macsec_encrypt_done(struct crypto_async_request *base, int err)
 
 static struct aead_request *macsec_alloc_req(struct crypto_aead *tfm,
 					     unsigned char **iv,
-					     struct scatterlist **sg)
+					     struct scatterlist **sg,
+					     int num_frags)
 {
 	size_t size, iv_offset, sg_offset;
 	struct aead_request *req;
@@ -629,7 +630,7 @@ static struct aead_request *macsec_alloc_req(struct crypto_aead *tfm,
 
 	size = ALIGN(size, __alignof__(struct scatterlist));
 	sg_offset = size;
-	size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
+	size += sizeof(struct scatterlist) * num_frags;
 
 	tmp = kmalloc(size, GFP_ATOMIC);
 	if (!tmp)
@@ -649,6 +650,7 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 {
 	int ret;
 	struct scatterlist *sg;
+	struct sk_buff *trailer;
 	unsigned char *iv;
 	struct ethhdr *eth;
 	struct macsec_eth_header *hh;
@@ -723,7 +725,14 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 		return ERR_PTR(-EINVAL);
 	}
 
-	req = macsec_alloc_req(tx_sa->key.tfm, &iv, &sg);
+	ret = skb_cow_data(skb, 0, &trailer);
+	if (unlikely(ret < 0)) {
+		macsec_txsa_put(tx_sa);
+		kfree_skb(skb);
+		return ERR_PTR(ret);
+	}
+
+	req = macsec_alloc_req(tx_sa->key.tfm, &iv, &sg, ret);
 	if (!req) {
 		macsec_txsa_put(tx_sa);
 		kfree_skb(skb);
@@ -732,7 +741,7 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 
 	macsec_fill_iv(iv, secy->sci, pn);
 
-	sg_init_table(sg, MAX_SKB_FRAGS + 1);
+	sg_init_table(sg, ret);
 	skb_to_sgvec(skb, sg, 0, skb->len);
 
 	if (tx_sc->encrypt) {
@@ -917,6 +926,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 {
 	int ret;
 	struct scatterlist *sg;
+	struct sk_buff *trailer;
 	unsigned char *iv;
 	struct aead_request *req;
 	struct macsec_eth_header *hdr;
@@ -927,7 +937,12 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
-	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg);
+	ret = skb_cow_data(skb, 0, &trailer);
+	if (unlikely(ret < 0)) {
+		kfree_skb(skb);
+		return ERR_PTR(ret);
+	}
+	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg, ret);
 	if (!req) {
 		kfree_skb(skb);
 		return ERR_PTR(-ENOMEM);
@@ -936,7 +951,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 	hdr = (struct macsec_eth_header *)skb->data;
 	macsec_fill_iv(iv, sci, ntohl(hdr->packet_number));
 
-	sg_init_table(sg, MAX_SKB_FRAGS + 1);
+	sg_init_table(sg, ret);
 	skb_to_sgvec(skb, sg, 0, skb->len);
 
 	if (hdr->tci_an & MACSEC_TCI_E) {
@@ -2716,7 +2731,7 @@ static netdev_tx_t macsec_start_xmit(struct sk_buff *skb,
 }
 
 #define MACSEC_FEATURES \
-	(NETIF_F_SG | NETIF_F_HIGHDMA)
+	(NETIF_F_SG | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST)
 static struct lock_class_key macsec_netdev_addr_lock_key;
 
 static int macsec_dev_init(struct net_device *dev)
-- 
2.12.2

^ permalink raw reply related

* Re: [PATCHv2 net] bridge: move bridge multicast cleanup to ndo_uninit
From: Nikolay Aleksandrov @ 2017-04-25 17:01 UTC (permalink / raw)
  To: Xin Long, network dev; +Cc: davem, stephen
In-Reply-To: <6e156e72cb1a5e279da8ac53bdb601eee5d654fe.1493132317.git.lucien.xin@gmail.com>

On 25/04/17 17:58, Xin Long wrote:
> During removing a bridge device, if the bridge is still up, a new mdb entry
> still can be added in br_multicast_add_group() after all mdb entries are
> removed in br_multicast_dev_del(). Like the path:
> 
>   mld_ifc_timer_expire ->
>     mld_sendpack -> ...
>       br_multicast_rcv ->
>         br_multicast_add_group
> 
> The new mp's timer will be set up. If the timer expires after the bridge
> is freed, it may cause use-after-free panic in br_multicast_group_expired.
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
> IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
> Call Trace:
>  <IRQ>
>  [<ffffffff81094536>] call_timer_fn+0x36/0x110
>  [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
>  [<ffffffff81096967>] run_timer_softirq+0x237/0x340
>  [<ffffffff8108dcbf>] __do_softirq+0xef/0x280
>  [<ffffffff8169889c>] call_softirq+0x1c/0x30
>  [<ffffffff8102c275>] do_softirq+0x65/0xa0
>  [<ffffffff8108e055>] irq_exit+0x115/0x120
>  [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
>  [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80
> 
> Nikolay also found it would cause a memory leak - the mdb hash is
> reallocated and not freed due to the mdb rehash.
> 
> unreferenced object 0xffff8800540ba800 (size 2048):
>   backtrace:
>     [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
>     [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
>     [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
>     [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
>     [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
>     [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
>     [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
>     [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
>     [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
>     [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
>     [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
>     [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
>     [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
>     [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
>     [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
>     [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
> 
> This could happen when ip link remove a bridge or destroy a netns with a
> bridge device inside.
> 
> With Nikolay's suggestion, this patch is to clean up bridge multicast in
> ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
> that netif_running check in br_multicast_add_group can avoid this issue.
> 
> v1->v2:
>   - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
>     of calling dev_close in br_dev_delete.
> 
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_device.c | 1 +
>  net/bridge/br_if.c     | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)

Thank you for modifying the fix to use ndo_uninit(). Important note -
this fix is dependent on Ido's earlier ndo_uninit() patch:
b6fe0440c637 ("bridge: implement missing ndo_uninit()")

Fixes: e10177abf842 ("bridge: multicast: fix handling of temp and perm
entries")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it
From: Doug Ledford @ 2017-04-25 17:00 UTC (permalink / raw)
  To: Christoph Hellwig, Byczkowski, Jakub
  Cc: Bjorn Helgaas, Cabiddu, Giovanni, Benedetto, Salvatore,
	Marciniszyn, Mike, Dalessandro, Dennis, Derek Chickles,
	Satanand Burla, Felix Manlunas, Raghu Vatsavayi,
	Kirsher, Jeffrey T,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, qat-linux,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, "li
In-Reply-To: <20170424143507.GA28812-jcswGhMUV9g@public.gmane.org>

On Mon, 2017-04-24 at 16:35 +0200, Christoph Hellwig wrote:
> On Mon, Apr 24, 2017 at 02:16:31PM +0000, Byczkowski, Jakub wrote:
> > 
> > Tested-by: Jakub Byczkowski <jakub.byczkowski-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> 
> Are you (and Doug) ok with queueing this up in the PCI tree?

I'm fine with that.  Feel free to add my Acked-by to the hfi1 patch.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC 0/4] xdp: use netlink extended ACK reporting
From: Jesper Dangaard Brouer @ 2017-04-25 16:54 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: brouer, Jakub Kicinski, netdev, davem, johannes, dsa,
	alexei.starovoitov, bblanco, john.fastabend, kubakici,
	oss-drivers
In-Reply-To: <58FF1157.8030309@iogearbox.net>

On Tue, 25 Apr 2017 11:05:27 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> > Where the message is coming directly from the driver.  There could
> > still be a bit of a leap for a complete novice from the message
> > above to the right settings.  I wonder if it would be worthwhile  
> 
> But still 100x better than the current situation. ;) I really
> like the series, thanks for working on this!

+1 thanks for working on this! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* more test_progs...
From: David Miller @ 2017-04-25 16:52 UTC (permalink / raw)
  To: ast; +Cc: daniel, netdev


I'm trying to get test_progs working in net-next on sparc, it can't
even load the first BPF program.  It's triggering this verifier
check:

static int check_pkt_ptr_alignment(const struct bpf_reg_state *reg,
                                   int off, int size)
{
        if (reg->id && size != 1) {
                verbose("Unknown alignment. Only byte-sized access allowed in packet access.\n");

Specifically (this is test_pkt_access.o of course):

libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf: 
0: (b7) r0 = 2
1: (61) r2 = *(u32 *)(r1 +80)

r2 = skb->data_end

2: (61) r1 = *(u32 *)(r1 +76)

r1 = skb->data

3: (bf) r4 = r1

r4 = skb->data

4: (07) r4 += 14

r4 += sizeof(struct ethhdr);

5: (2d) if r4 > r2 goto pc+37

if out of range, exit

 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=14) R2=pkt_end R4=pkt(id=0,off=14,r=14) R10=fp

R1/R2/R4 analysis seems correct

6: (71) r5 = *(u8 *)(r1 +13)
7: (71) r3 = *(u8 *)(r1 +12)
8: (67) r3 <<= 8
9: (4f) r3 |= r5

Load eth->h_proto

10: (15) if r3 == 0xdd86 goto pc+9
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=14) R2=pkt_end R3=inv R4=pkt(id=0,off=14,r=14) R5=inv56 R10=fp

Hmmm, endianness looks wrong here.  "-target bpf" defaults to the
endianness of whatever cpu that llvm was built for, right?

Maybe I need to make even more adjustments to selftests/bpf/Makefile
handling of clang options...

Anyways this is checking for ipv6, if so goto insn 25

11: (55) if r3 != 0x8 goto pc+29
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=14) R2=pkt_end R3=inv,min_value=8,max_value=8 R4=pkt(id=0,off=14,r=14) R5=inv56 R10=fp

IPV4:

12: (bf) r3 = r1
13: (07) r3 += 34
14: (2d) if r3 > r2 goto pc+28

Make sure pkt + 34 is in range

 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=pkt(id=0,off=34,r=34) R4=pkt(id=0,off=14,r=34) R5=inv56 R10=fp
15: (b7) r3 = 3
16: (71) r5 = *(u8 *)(r4 +0)
17: (67) r5 <<= 2
18: (57) r5 &= 60
19: (05) goto pc+5

IPV6:

25: (bf) r4 = r1
26: (0f) r4 += r5
27: (07) r4 += 14
28: (15) if r4 == 0x0 goto pc+12
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=14,r=0) R5=inv58,min_value=0,max_value=60 R10=fp
29: (bf) r5 = r4
30: (07) r5 += 20
31: (2d) if r5 > r2 goto pc+11
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=14,r=34) R5=pkt(id=1,off=34,r=34) R10=fp
32: (0f) r1 += r3
33: (71) r1 = *(u8 *)(r1 +20)

Load ipv6 nexthdr

34: (57) r1 &= 255
35: (55) if r1 != 0x6 goto pc+7

if proto not TCP skip

 R0=imm2,min_value=2,max_value=2 R1=inv56,min_value=6,max_value=6 R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=14,r=34) R5=pkt(id=1,off=34,r=34) R10=fp
36: (07) r4 += 18
37: (2d) if r4 > r2 goto pc+5
 R0=imm2,min_value=2,max_value=2 R1=inv56,min_value=6,max_value=6 R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=32,r=34) R5=pkt(id=1,off=34,r=34) R10=fp
38: (b7) r0 = 0
39: (69) r1 = *(u16 *)(r4 +0)
Unknown alignment. Only byte-sized access allowed in packet access.

And this seems to load the urgent pointer as a u16 which is what the verifier rejects.

Oh I see, this is guarded by CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

How in the world is this supposed to work?

^ permalink raw reply

* [PATCH] net/packet: check length in getsockopt() called with PACKET_HDRLEN
From: Alexander Potapenko @ 2017-04-25 16:51 UTC (permalink / raw)
  To: dvyukov, kcc, edumazet, davem, kuznet; +Cc: linux-kernel, netdev

In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
|val| remains uninitialized and the syscall may behave differently
depending on its value, and even copy garbage to userspace on certain
architectures. To fix this we now return -EINVAL if optlen is too small.

This bug has been detected with KMSAN.

Signed-off-by: Alexander Potapenko <glider@google.com>

---
The previous versions of this patch were called "net/packet: initialize
val in packet_getsockopt()"

v3: - change patch summary, return -EINVAL for optlen < sizeof(int)
v2: - if len < sizeof(int), make it 0

---
 net/packet/af_packet.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8489beff5c25..ea81ccf3c7d6 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3836,6 +3836,8 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_HDRLEN:
 		if (len > sizeof(int))
 			len = sizeof(int);
+		if (len < sizeof(int))
+			return -EINVAL;
 		if (copy_from_user(&val, optval, len))
 			return -EFAULT;
 		switch (val) {
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* Re: [PATCHv2 net] bridge: move bridge multicast cleanup to ndo_uninit
From: Stephen Hemminger @ 2017-04-25 16:41 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem, nikolay
In-Reply-To: <6e156e72cb1a5e279da8ac53bdb601eee5d654fe.1493132317.git.lucien.xin@gmail.com>

On Tue, 25 Apr 2017 22:58:37 +0800
Xin Long <lucien.xin@gmail.com> wrote:

> During removing a bridge device, if the bridge is still up, a new mdb entry
> still can be added in br_multicast_add_group() after all mdb entries are
> removed in br_multicast_dev_del(). Like the path:
> 
>   mld_ifc_timer_expire ->
>     mld_sendpack -> ...
>       br_multicast_rcv ->
>         br_multicast_add_group
> 
> The new mp's timer will be set up. If the timer expires after the bridge
> is freed, it may cause use-after-free panic in br_multicast_group_expired.
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
> IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
> Call Trace:
>  <IRQ>
>  [<ffffffff81094536>] call_timer_fn+0x36/0x110
>  [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
>  [<ffffffff81096967>] run_timer_softirq+0x237/0x340
>  [<ffffffff8108dcbf>] __do_softirq+0xef/0x280
>  [<ffffffff8169889c>] call_softirq+0x1c/0x30
>  [<ffffffff8102c275>] do_softirq+0x65/0xa0
>  [<ffffffff8108e055>] irq_exit+0x115/0x120
>  [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
>  [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80
> 
> Nikolay also found it would cause a memory leak - the mdb hash is
> reallocated and not freed due to the mdb rehash.
> 
> unreferenced object 0xffff8800540ba800 (size 2048):
>   backtrace:
>     [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
>     [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
>     [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
>     [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
>     [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
>     [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
>     [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
>     [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
>     [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
>     [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
>     [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
>     [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
>     [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
>     [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
>     [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
>     [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
> 
> This could happen when ip link remove a bridge or destroy a netns with a
> bridge device inside.
> 
> With Nikolay's suggestion, this patch is to clean up bridge multicast in
> ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
> that netif_running check in br_multicast_add_group can avoid this issue.
> 
> v1->v2:
>   - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
>     of calling dev_close in br_dev_delete.
> 
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Makes sense.

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply

* Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone
From: David Ahern @ 2017-04-25 16:40 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Dmitry Vyukov, Mahesh Bandewar, Eric Dumazet, David Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Cong Wang, syzkaller
In-Reply-To: <CAAeHK+wnFMYUXBCQDBKR142uEVXuGoOvo2SO1b04Ya3i8XSr5g@mail.gmail.com>

On 4/25/17 10:38 AM, Andrey Konovalov wrote:
> I'll keep fuzzing in the meantime to make sure.
> Maybe I'll be able to collect more reports or even another reproducer.

start a new email thread for each stack trace. I'll write a debug patch
for the trace you hit today.

^ permalink raw reply

* Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone
From: Andrey Konovalov @ 2017-04-25 16:38 UTC (permalink / raw)
  To: David Ahern
  Cc: Dmitry Vyukov, Mahesh Bandewar, Eric Dumazet, David Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Cong Wang, syzkaller
In-Reply-To: <CAAeHK+zVkJ1t1xMhMXO2wUyMfJrbXsARHyQ6A5hh2O9pqPRxiw@mail.gmail.com>

On Tue, Apr 25, 2017 at 6:36 PM, Andrey Konovalov <andreyknvl@google.com> wrote:
> On Tue, Apr 25, 2017 at 5:56 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>> On 3/4/17 11:57 AM, Dmitry Vyukov wrote:
>>> ==================================================================
>>> BUG: KASAN: slab-out-of-bounds in rt6_dump_route+0x293/0x2f0
>>> net/ipv6/route.c:3551 at addr ffff88007e523694
>>> Read of size 4 by task syz-executor3/24426
>>> CPU: 2 PID: 24426 Comm: syz-executor3 Not tainted 4.10.0+ #293
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>  dump_stack+0x2fb/0x3fd lib/dump_stack.c:52
>>>  kasan_object_err+0x1c/0x90 mm/kasan/report.c:166
>>>  print_address_description mm/kasan/report.c:208 [inline]
>>>  kasan_report_error mm/kasan/report.c:292 [inline]
>>>  kasan_report.part.2+0x1b0/0x460 mm/kasan/report.c:314
>>>  kasan_report mm/kasan/report.c:334 [inline]
>>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:334
>>>  rt6_dump_route+0x293/0x2f0 net/ipv6/route.c:3551
>>>  fib6_dump_node+0x101/0x1a0 net/ipv6/ip6_fib.c:315
>>>  fib6_walk_continue+0x4b3/0x620 net/ipv6/ip6_fib.c:1576
>>>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1621
>>>  fib6_dump_table net/ipv6/ip6_fib.c:374 [inline]
>>>  inet6_dump_fib+0x832/0xea0 net/ipv6/ip6_fib.c:447
>>
>> My expectation is that this one is fixed with the ipv6 patch I have sent
>> (not yet committed). Are you seeing this backtrace with that patch?
>
> Before applying your patch I was hitting reports related to fib6 all the time.
> I've stopped seeing them for some time after I applied your patch.
> However today another one was triggered:

I'll keep fuzzing in the meantime to make sure.
Maybe I'll be able to collect more reports or even another reproducer.

>
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in find_rr_leaf net/ipv6/route.c:722
> [inline] at addr ffff880033dddaa8
> BUG: KASAN: slab-out-of-bounds in rt6_select net/ipv6/route.c:758
> [inline] at addr ffff880033dddaa8
> BUG: KASAN: slab-out-of-bounds in ip6_pol_route+0x1b1e/0x1ba0
> net/ipv6/route.c:1091 at addr ffff880033dddaa8
> Read of size 4 by task syz-executor7/9808
> CPU: 2 PID: 9808 Comm: syz-executor7 Not tainted 4.11.0-rc8+ #268
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x192/0x22d lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>  print_address_description mm/kasan/report.c:202 [inline]
>  kasan_report_error mm/kasan/report.c:291 [inline]
>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
>  find_rr_leaf net/ipv6/route.c:722 [inline]
>  rt6_select net/ipv6/route.c:758 [inline]
>  ip6_pol_route+0x1b1e/0x1ba0 net/ipv6/route.c:1091
>  ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
>  fib6_rule_action+0x261/0x8a0 net/ipv6/fib6_rules.c:100
>  fib_rules_lookup+0x34b/0xae0 net/core/fib_rules.c:265
>  fib6_rule_lookup+0x175/0x360 net/ipv6/fib6_rules.c:44
>  ip6_route_output_flags+0x260/0x2f0 net/ipv6/route.c:1240
>  ip6_route_output include/net/ip6_route.h:79 [inline]
>  ip6_dst_lookup_tail+0xe59/0x1640 net/ipv6/ip6_output.c:959
>  ip6_dst_lookup_flow+0xb1/0x260 net/ipv6/ip6_output.c:1082
>  ip6_sk_dst_lookup_flow+0x2c6/0x7f0 net/ipv6/ip6_output.c:1113
>  udpv6_sendmsg+0x2350/0x3310 net/ipv6/udp.c:1219
>  inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  SYSC_sendto+0x660/0x810 net/socket.c:1696
>  SyS_sendto+0x40/0x50 net/socket.c:1664
>  entry_SYSCALL_64_fastpath+0x1a/0xa9
> RIP: 0033:0x4458d9
> RSP: 002b:00007ff3a5343b58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
> RAX: ffffffffffffffda RBX: 0000000000708000 RCX: 00000000004458d9
> RDX: 0000000000000fa8 RSI: 0000000020000000 RDI: 0000000000000005
> RBP: 0000000000004300 R08: 0000000020006000 R09: 000000000000001c
> R10: 0000000000040000 R11: 0000000000000282 R12: 00000000006e33c0
> R13: 0000000000000005 R14: 0000000000000029 R15: 0000000000000023
> Object at ffff880033ddd940, in cache ip_dst_cache size: 224
> Allocated:
> PID = 9717
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
>  slab_post_alloc_hook mm/slab.h:456 [inline]
>  slab_alloc_node mm/slub.c:2718 [inline]
>  slab_alloc mm/slub.c:2726 [inline]
>  kmem_cache_alloc+0xa8/0x160 mm/slub.c:2731
>  dst_alloc+0x11b/0x1a0 net/core/dst.c:209
>  rt_dst_alloc+0xf0/0x5d0 net/ipv4/route.c:1497
>  __mkroute_output net/ipv4/route.c:2181 [inline]
>  __ip_route_output_key_hash+0xdc4/0x2930 net/ipv4/route.c:2391
>  __ip_route_output_key include/net/route.h:123 [inline]
>  ip_route_output_flow+0x29/0xa0 net/ipv4/route.c:2477
>  ip_route_output_key include/net/route.h:133 [inline]
>  sctp_v4_get_dst+0x606/0x1420 net/sctp/protocol.c:458
>  sctp_transport_route+0xa8/0x420 net/sctp/transport.c:287
>  sctp_assoc_add_peer+0x5a5/0x1470 net/sctp/associola.c:656
>  sctp_sendmsg+0x18a8/0x3b50 net/sctp/socket.c:1871
>  inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  SYSC_sendto+0x660/0x810 net/socket.c:1696
>  SyS_sendto+0x40/0x50 net/socket.c:1664
>  entry_SYSCALL_64_fastpath+0x1a/0xa9
> Freed:
> PID = 868
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>  slab_free_hook mm/slub.c:1357 [inline]
>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>  slab_free mm/slub.c:2961 [inline]
>  kmem_cache_free+0x72/0x190 mm/slub.c:2983
>  dst_destroy+0x24c/0x390 net/core/dst.c:269
>  dst_free include/net/dst.h:428 [inline]
>  rt_fibinfo_free net/ipv4/fib_semantics.c:155 [inline]
>  free_fib_info_rcu+0x852/0xa10 net/ipv4/fib_semantics.c:214
>  __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
>  rcu_do_batch.isra.65+0x6de/0xbd0 kernel/rcu/tree.c:2879
>  invoke_rcu_callbacks kernel/rcu/tree.c:3142 [inline]
>  __rcu_process_callbacks kernel/rcu/tree.c:3109 [inline]
>  rcu_process_callbacks+0x23f/0x810 kernel/rcu/tree.c:3126
>  __do_softirq+0x253/0x78b kernel/softirq.c:284
> Memory state around the buggy address:
>  ffff880033ddd980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff880033ddda00: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>>ffff880033ddda80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>                                   ^
>  ffff880033dddb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff880033dddb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ==================================================================.
>
>>
>> --
>> You received this message because you are subscribed to the Google Groups "syzkaller" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH v2] net/packet: initialize val in packet_getsockopt()
From: Alexander Potapenko @ 2017-04-25 16:38 UTC (permalink / raw)
  To: David Miller
  Cc: Dmitriy Vyukov, Kostya Serebryany, Eric Dumazet, Alexey Kuznetsov,
	LKML, Networking
In-Reply-To: <20170425.123201.1910437101082999848.davem@davemloft.net>

On Tue, Apr 25, 2017 at 6:32 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Potapenko <glider@google.com>
> Date: Tue, 25 Apr 2017 18:27:04 +0200
>
>> On Tue, Apr 25, 2017 at 5:44 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Alexander Potapenko <glider@google.com>
>>> Date: Mon, 24 Apr 2017 14:59:14 +0200
>>>
>>>> In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
>>>> |val| remains uninitialized and the syscall may behave differently
>>>> depending on its value. This doesn't have security consequences (as the
>>>> uninit bytes aren't copied back), but it's still cleaner to initialize
>>>> |val| and ensure optlen is not less than sizeof(int).
>>>>
>>>> This bug has been detected with KMSAN.
>>>>
>>>> Signed-off-by: Alexander Potapenko <glider@google.com>
>>>> ---
>>>> v2: - if len < sizeof(int), make it 0
>>>
>>> No, you should signal an error if the len is too small.
>> According to manpages, only setsockopt() may return EINVAL.
>> Is it ok to change the behavior of getsockopt() to return EINVAL in
>> this case? (I.e. won't we break existing users that don't expect it?)
>
> They are currently getting corrupt data depending upon the endianness,
> so -EINVAL is a serious improvement.
On a second glance getsockopt() already returns -EINVAL in some cases,
so man is already imprecise.


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply

* Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone
From: Andrey Konovalov @ 2017-04-25 16:36 UTC (permalink / raw)
  To: David Ahern
  Cc: Dmitry Vyukov, Mahesh Bandewar, Eric Dumazet, David Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Cong Wang, syzkaller
In-Reply-To: <381a3e2d-dbf8-5e26-e6b2-4dc291d49dbe@cumulusnetworks.com>

On Tue, Apr 25, 2017 at 5:56 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 3/4/17 11:57 AM, Dmitry Vyukov wrote:
>> ==================================================================
>> BUG: KASAN: slab-out-of-bounds in rt6_dump_route+0x293/0x2f0
>> net/ipv6/route.c:3551 at addr ffff88007e523694
>> Read of size 4 by task syz-executor3/24426
>> CPU: 2 PID: 24426 Comm: syz-executor3 Not tainted 4.10.0+ #293
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:16 [inline]
>>  dump_stack+0x2fb/0x3fd lib/dump_stack.c:52
>>  kasan_object_err+0x1c/0x90 mm/kasan/report.c:166
>>  print_address_description mm/kasan/report.c:208 [inline]
>>  kasan_report_error mm/kasan/report.c:292 [inline]
>>  kasan_report.part.2+0x1b0/0x460 mm/kasan/report.c:314
>>  kasan_report mm/kasan/report.c:334 [inline]
>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:334
>>  rt6_dump_route+0x293/0x2f0 net/ipv6/route.c:3551
>>  fib6_dump_node+0x101/0x1a0 net/ipv6/ip6_fib.c:315
>>  fib6_walk_continue+0x4b3/0x620 net/ipv6/ip6_fib.c:1576
>>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1621
>>  fib6_dump_table net/ipv6/ip6_fib.c:374 [inline]
>>  inet6_dump_fib+0x832/0xea0 net/ipv6/ip6_fib.c:447
>
> My expectation is that this one is fixed with the ipv6 patch I have sent
> (not yet committed). Are you seeing this backtrace with that patch?

Before applying your patch I was hitting reports related to fib6 all the time.
I've stopped seeing them for some time after I applied your patch.
However today another one was triggered:

==================================================================
BUG: KASAN: slab-out-of-bounds in find_rr_leaf net/ipv6/route.c:722
[inline] at addr ffff880033dddaa8
BUG: KASAN: slab-out-of-bounds in rt6_select net/ipv6/route.c:758
[inline] at addr ffff880033dddaa8
BUG: KASAN: slab-out-of-bounds in ip6_pol_route+0x1b1e/0x1ba0
net/ipv6/route.c:1091 at addr ffff880033dddaa8
Read of size 4 by task syz-executor7/9808
CPU: 2 PID: 9808 Comm: syz-executor7 Not tainted 4.11.0-rc8+ #268
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x192/0x22d lib/dump_stack.c:52
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
 print_address_description mm/kasan/report.c:202 [inline]
 kasan_report_error mm/kasan/report.c:291 [inline]
 kasan_report+0x252/0x510 mm/kasan/report.c:347
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
 find_rr_leaf net/ipv6/route.c:722 [inline]
 rt6_select net/ipv6/route.c:758 [inline]
 ip6_pol_route+0x1b1e/0x1ba0 net/ipv6/route.c:1091
 ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
 fib6_rule_action+0x261/0x8a0 net/ipv6/fib6_rules.c:100
 fib_rules_lookup+0x34b/0xae0 net/core/fib_rules.c:265
 fib6_rule_lookup+0x175/0x360 net/ipv6/fib6_rules.c:44
 ip6_route_output_flags+0x260/0x2f0 net/ipv6/route.c:1240
 ip6_route_output include/net/ip6_route.h:79 [inline]
 ip6_dst_lookup_tail+0xe59/0x1640 net/ipv6/ip6_output.c:959
 ip6_dst_lookup_flow+0xb1/0x260 net/ipv6/ip6_output.c:1082
 ip6_sk_dst_lookup_flow+0x2c6/0x7f0 net/ipv6/ip6_output.c:1113
 udpv6_sendmsg+0x2350/0x3310 net/ipv6/udp.c:1219
 inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1a/0xa9
RIP: 0033:0x4458d9
RSP: 002b:00007ff3a5343b58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000708000 RCX: 00000000004458d9
RDX: 0000000000000fa8 RSI: 0000000020000000 RDI: 0000000000000005
RBP: 0000000000004300 R08: 0000000020006000 R09: 000000000000001c
R10: 0000000000040000 R11: 0000000000000282 R12: 00000000006e33c0
R13: 0000000000000005 R14: 0000000000000029 R15: 0000000000000023
Object at ffff880033ddd940, in cache ip_dst_cache size: 224
Allocated:
PID = 9717
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
 slab_post_alloc_hook mm/slab.h:456 [inline]
 slab_alloc_node mm/slub.c:2718 [inline]
 slab_alloc mm/slub.c:2726 [inline]
 kmem_cache_alloc+0xa8/0x160 mm/slub.c:2731
 dst_alloc+0x11b/0x1a0 net/core/dst.c:209
 rt_dst_alloc+0xf0/0x5d0 net/ipv4/route.c:1497
 __mkroute_output net/ipv4/route.c:2181 [inline]
 __ip_route_output_key_hash+0xdc4/0x2930 net/ipv4/route.c:2391
 __ip_route_output_key include/net/route.h:123 [inline]
 ip_route_output_flow+0x29/0xa0 net/ipv4/route.c:2477
 ip_route_output_key include/net/route.h:133 [inline]
 sctp_v4_get_dst+0x606/0x1420 net/sctp/protocol.c:458
 sctp_transport_route+0xa8/0x420 net/sctp/transport.c:287
 sctp_assoc_add_peer+0x5a5/0x1470 net/sctp/associola.c:656
 sctp_sendmsg+0x18a8/0x3b50 net/sctp/socket.c:1871
 inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1a/0xa9
Freed:
PID = 868
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kmem_cache_free+0x72/0x190 mm/slub.c:2983
 dst_destroy+0x24c/0x390 net/core/dst.c:269
 dst_free include/net/dst.h:428 [inline]
 rt_fibinfo_free net/ipv4/fib_semantics.c:155 [inline]
 free_fib_info_rcu+0x852/0xa10 net/ipv4/fib_semantics.c:214
 __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
 rcu_do_batch.isra.65+0x6de/0xbd0 kernel/rcu/tree.c:2879
 invoke_rcu_callbacks kernel/rcu/tree.c:3142 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:3109 [inline]
 rcu_process_callbacks+0x23f/0x810 kernel/rcu/tree.c:3126
 __do_softirq+0x253/0x78b kernel/softirq.c:284
Memory state around the buggy address:
 ffff880033ddd980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880033ddda00: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>ffff880033ddda80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                  ^
 ffff880033dddb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff880033dddb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
==================================================================.

>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH] macsec: dynamically allocate space for sglist
From: Sabrina Dubroca @ 2017-04-25 16:36 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Netdev, LKML, David Miller, stable, security
In-Reply-To: <20170425152300.3830-1-Jason@zx2c4.com>

2017-04-25, 17:23:00 +0200, Jason A. Donenfeld wrote:
> We call skb_cow_data, which is good anyway to ensure we can actually
> modify the skb as such (another error from prior). Now that we have the
> number of fragments required, we can safely allocate exactly that amount
> of memory.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Sabrina Dubroca <sd@queasysnail.net>
> Cc: security@kernel.org
> Cc: stable@vger.kernel.org
> ---
>  drivers/net/macsec.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
> index dbab05afcdbe..56dafdee4c9c 100644
> --- a/drivers/net/macsec.c
> +++ b/drivers/net/macsec.c
[...]
> @@ -917,6 +926,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
>  {
>  	int ret;
>  	struct scatterlist *sg;
> +	struct sk_buff *trailer;
>  	unsigned char *iv;
>  	struct aead_request *req;
>  	struct macsec_eth_header *hdr;
> @@ -927,7 +937,12 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
>  	if (!skb)
>  		return ERR_PTR(-ENOMEM);
>  
> -	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg);
> +	ret = skb_cow_data(skb, 0, &trailer);
> +	if (unlikely(ret < 0)) {
> +		kfree_skb(skb);
> +		return ERR_PTR(ret);
> +	}
> +	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg, ret);
>  	if (!req) {
>  		kfree_skb(skb);
>  		return ERR_PTR(-ENOMEM);

There's a problem here (and in macsec_encrypt): you need to update the
call to sg_init_table, like I did in my patch.  Otherwise,
sg_init_table() is going to access sg[MAX_SKB_FRAGS], which may be
past what you allocated.

How did you test this? ;)

-- 
Sabrina

^ permalink raw reply

* Re: [PATCH v3 net] net: ipv6: regenerate host route if moved to gc list
From: Andrey Konovalov @ 2017-04-25 16:35 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, Dmitry Vyukov, mmanning, Martin KaFai Lau
In-Reply-To: <c86e9056-151c-6b9d-9475-99158de0874b@cumulusnetworks.com>

On Tue, Apr 25, 2017 at 5:54 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 4/25/17 6:50 AM, Andrey Konovalov wrote:
>> I've been running syzkaller with your patch and got another report
>> from ip6_pol_route.
>
> In general the existing patch cleans up all of the ipv6 fib kasan and
> WARN_ON traces that were seen?

Before applying your patch I was hitting reports related to fib6 all the time.
I've stopped seeing them for some time after I applied your patch.
However today another one was triggered (the one I sent above).

>
>
>>
>> It happened only once so far and I couldn't reproduce it.
>
> Some similarity in the sense of a ipv4 route in the ipv6 fib. That's
> only going to happen if it hits the dst gc list and then back.
>
> This duplicates what Dmitry reported on March 3rd.

^ permalink raw reply

* Re: [PATCH v2] net/packet: initialize val in packet_getsockopt()
From: David Miller @ 2017-04-25 16:32 UTC (permalink / raw)
  To: glider; +Cc: dvyukov, kcc, edumazet, kuznet, linux-kernel, netdev
In-Reply-To: <CAG_fn=URNuXxc7wEaXd_p7B46RRCYYMTv5rOVc_U1CdvervzPQ@mail.gmail.com>

From: Alexander Potapenko <glider@google.com>
Date: Tue, 25 Apr 2017 18:27:04 +0200

> On Tue, Apr 25, 2017 at 5:44 PM, David Miller <davem@davemloft.net> wrote:
>> From: Alexander Potapenko <glider@google.com>
>> Date: Mon, 24 Apr 2017 14:59:14 +0200
>>
>>> In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
>>> |val| remains uninitialized and the syscall may behave differently
>>> depending on its value. This doesn't have security consequences (as the
>>> uninit bytes aren't copied back), but it's still cleaner to initialize
>>> |val| and ensure optlen is not less than sizeof(int).
>>>
>>> This bug has been detected with KMSAN.
>>>
>>> Signed-off-by: Alexander Potapenko <glider@google.com>
>>> ---
>>> v2: - if len < sizeof(int), make it 0
>>
>> No, you should signal an error if the len is too small.
> According to manpages, only setsockopt() may return EINVAL.
> Is it ok to change the behavior of getsockopt() to return EINVAL in
> this case? (I.e. won't we break existing users that don't expect it?)

They are currently getting corrupt data depending upon the endianness,
so -EINVAL is a serious improvement.

^ permalink raw reply

* Re: [PATCH v4 net-next] mdio_bus: Issue GPIO RESET to PHYs.
From: Florian Fainelli @ 2017-04-25 16:31 UTC (permalink / raw)
  To: Lars-Peter Clausen, Roger Quadros, Andrew Lunn
  Cc: davem, tony, nsekhar, jsarha, netdev, linux-omap, linux-kernel
In-Reply-To: <5954130c-c08c-1555-4d34-83307ed68d92@metafoo.de>

On 04/25/2017 09:22 AM, Lars-Peter Clausen wrote:
> On 04/24/2017 11:04 AM, Roger Quadros wrote:
>> On 24/04/17 02:35, Andrew Lunn wrote:
>>> On Fri, Apr 21, 2017 at 03:31:09PM +0200, Lars-Peter Clausen wrote:
>>>> On 04/21/2017 03:15 PM, Roger Quadros wrote:
>>>>> diff --git a/Documentation/devicetree/bindings/net/mdio.txt b/Documentation/devicetree/bindings/net/mdio.txt
>>>>> new file mode 100644
>>>>> index 0000000..4ffbbac
>>>>> --- /dev/null
>>>>> +++ b/Documentation/devicetree/bindings/net/mdio.txt
>>>>> @@ -0,0 +1,33 @@
>>>>> +Common MDIO bus properties.
>>>>> +
>>>>> +These are generic properties that can apply to any MDIO bus.
>>>>> +
>>>>> +Optional properties:
>>>>> +- reset-gpios: List of one or more GPIOs that control the RESET lines
>>>>> +  of the PHYs on that MDIO bus.
>>>>> +- reset-delay-us: RESET pulse width in microseconds as per PHY datasheet.
>>>>> +
>>>>> +A list of child nodes, one per device on the bus is expected. These
>>>>> +should follow the generic phy.txt, or a device specific binding document.
>>>>> +
>>>>> +Example :
>>>>> +This example shows these optional properties, plus other properties
>>>>> +required for the TI Davinci MDIO driver.
>>>>> +
>>>>> +	davinci_mdio: ethernet@0x5c030000 {
>>>>> +		compatible = "ti,davinci_mdio";
>>>>> +		reg = <0x5c030000 0x1000>;
>>>>> +		#address-cells = <1>;
>>>>> +		#size-cells = <0>;
>>>>> +
>>>>> +		reset-gpios = <&gpio2 5 GPIO_ACTIVE_LOW>;
>>>>> +		reset-delay-us = <2>;   /* PHY datasheet states 1us min */
>>>>
>>>> If this is the reset line of the PHY shouldn't it be a property of the PHY
>>>> node rather than of the MDIO controller node (which might have a reset on
>>>> its own)?
>>>>> +
>>>>> +		ethphy0: ethernet-phy@1 {
>>>>> +			reg = <1>;
>>>>> +		};
>>>>> +
>>>>> +		ethphy1: ethernet-phy@3 {
>>>>> +			reg = <3>;
>>>>> +		};
>>>
>>> Hi Lars-Peter
>>>
>>> We discussed this when the first proposal was made. There are two
>>> cases, to consider.
>>>
>>> 1) Here, one GPIO line resets all PHYs on the same MDIO bus. In this
>>> example, two PHYs.
>>>
>>> 2) There is one GPIO line per PHY. That is a separate case, and as you
>>> say, the reset line should probably be considered a PHY property, not
>>> an MDIO property. However, it can be messy, since in order to probe
>>> the MDIO bus, you probably need to take the PHY out of reset.
>>>
> 
> But the DT binding documentation says something else "List of one or more
> GPIOs that control the RESET lines of the PHYs on that MDIO bus".

I agree, it should be defined more strictly as:

"One GPIO that controls the reset line of *all* PHYs populated on that
MDIO bus"

If there are separate lines, these automatically become properties of
the PHY nodes.

> 
>>> Anyway, this patch addresses the first case, so should be accepted. If
>>> anybody wants to address the second case, they are free to do so.
> 
> I think we all know that that's not going to happen. Once there is a working
> kludge there is no incentive to do a proper implementation anymore.
> 
> 
>> Thanks for the explanation Andrew.
>>
>> For the second case, even if the RESET GPIO property is specified
>> in the PHY node, the RESET *will* have to be done by the MDIO bus driver
>> else the PHY might not be probed at all.
> 
> I'm not arguing with that, just that the hardware description should be
> truthful to the hardware topology and not to the software topology, i.e. the
> implementation details of the Linux kernel in this case. Reset GPIOs are not
> the only resource that is connected to the PHY that needs to be enabled
> before they can be enumerated. E.g. clocks and regulators fall into the same
> realm. And while you might argue that with a on-SoC phy controller node
> there wont be any conflicts in regard to the reset-gpios property, this not
> so very true for the clocks property.

Agreed, but with the exception of the unfortunate choice of words here
(single vs. multiple) there is not a really a divergence in how the
shared reset line is represented compared to other similar control
busses, is there?

> 
> And MDIO is not really special in this regard, other discoverable buses
> (like USB, SDIO, ULPI) have the very same issue. Having a standardized
> binding approach where the resources are declared as part as the child child
> is preferable in my opinion.
> 
>>
>> Whether we need additional code to just to make the DT look prettier is
>> questionable and if required can come as a separate patch.
> 
> Unfortunately not, once it is merged it can't be changed anymore.

There are no in tree users yet, so let's get the different things fixed
right now.
-- 
Florian

^ permalink raw reply

* Re: [PATCH v2] net/packet: initialize val in packet_getsockopt()
From: Alexander Potapenko @ 2017-04-25 16:27 UTC (permalink / raw)
  To: David Miller
  Cc: Dmitriy Vyukov, Kostya Serebryany, Eric Dumazet, Alexey Kuznetsov,
	LKML, Networking
In-Reply-To: <20170425.114433.143144279134920277.davem@davemloft.net>

On Tue, Apr 25, 2017 at 5:44 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Potapenko <glider@google.com>
> Date: Mon, 24 Apr 2017 14:59:14 +0200
>
>> In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
>> |val| remains uninitialized and the syscall may behave differently
>> depending on its value. This doesn't have security consequences (as the
>> uninit bytes aren't copied back), but it's still cleaner to initialize
>> |val| and ensure optlen is not less than sizeof(int).
>>
>> This bug has been detected with KMSAN.
>>
>> Signed-off-by: Alexander Potapenko <glider@google.com>
>> ---
>> v2: - if len < sizeof(int), make it 0
>
> No, you should signal an error if the len is too small.
According to manpages, only setsockopt() may return EINVAL.
Is it ok to change the behavior of getsockopt() to return EINVAL in
this case? (I.e. won't we break existing users that don't expect it?)
> Returning zero bytes to userspace silently makes the user think that
> he got the data he asked for.



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply

* Re: Network cooling device and how to control NIC speed on thermal condition
From: Florian Fainelli @ 2017-04-25 16:23 UTC (permalink / raw)
  To: Waldemar Rymarkiewicz, netdev; +Cc: linux-kernel
In-Reply-To: <CAHKzcEMQPQzg6PYCSuv7Ufad+4AZ2gnqMgz3-VH5NTeb_pv4zQ@mail.gmail.com>

Hello,

On 04/25/2017 01:36 AM, Waldemar Rymarkiewicz wrote:
> Hi,
> 
> I am not much aware of linux networking architecture so I'd like to
> ask first before will start to dig into the code. Appreciate any
> feedback.
> 
> I am looking on Linux thermal framework and on how to cool down the
> system effectively when it hits thermal condition. Already existing
> cooling methods cpu_cooling and clock_cooling are good. However, I
> wanted to go further and dynamically control also a switch ports'
> speed based on thermal condition. Lowering speed means less power,
> less power means lower temp.
> 
> Is there any in-kernel interface to configure switch port/NIC from other driver?

Well, there is mostly under the form of notifiers though. For instance
there are lots of devices that do converged FCoE/RoCE/Ethernet that have
a two headed set of drivers, one for normal ethernet, and another one
for RDMA/IB for instance. To some extent stacked devices (VLAN, bond,
team, etc.) also call back down into their lower device, but in an
abstracted way, at the net_device level of course (layering).

> 
> Is there any mechanism to power save, when port/interface is not
> really used (not much or low data traffic), embedded in networking
> stack  or is it a task for NIC driver itself ?

The thing we did (currently out of tree) in the Starfighter 2 switch
driver (drivers/net/dsa/bcm_sf2.c) is that any time a port is brought
up/down (a port = a network device) we recalculate the switch core
clock, and we also resize the buffers and that yields to a little bit of
power savings  here and there. I don't recall the numbers from the top
of my head, but it was significant enough our HW designers convinced me
into doing it ;)

> 
> I was thinking to create net_cooling device similarly to cpu_cooling
> device which cool down the system scaling down cpu freq.  net_cooling
> could lower down interface speed (or tune more parameters to achieve
> ).  Do you thing could this work form networking stack perspective?

This sounds like a good idea, but it could be very tricky to get right,
because even if you can somehow throttle your transmit activity (since
the host is in control), you can't do that without being disruptive to
the receive path (or not as effectively).

Unlike any kind of host driven activity: CPU run queue, block devices,
USB etc. (SPI, I2C and so on when no using slave driven interrupts) you
cannot simply apply a "duty cycle" pattern where you turn on your HW
just enough of time that is needed for you to set it up for transfer,
signal transfer completion and go back to sleep. Networking needs to be
able to asynchronously receive packets in a way that is usually not
predictable although it could be for very specific workloads though.

Another thing is that there is still a fair amount of energy that needs
to be spent in maintaining the link, and the HW design may be entirely
clocked based on the link speed. Depending on the HW architecture (store
and forward, cut through etc.) there would still be a cost associated
with maintaining RAMs in a state where they are operational and so on.

You could imagine writing a queuing discipline driver that would
throttle transmission based on temperature sensors present in your NIC,
you could definitively do this in a way that is completely device driver
agnostic by using Linux's thermal framework trip point and temperature
notifications.

For reception, if you are okay with dropping some packets, you could
implement something similar, but chances are that your NIC would still
need to receive packets, be able to fully process them before SW drops
them, at which point, you have a myriad of solutions about how not to
process incoming traffic.

Hope this helps

> 
> Any pointers  to the code or a doc highly appreciated.
> 
> Thanks,
> /Waldek
> 

-- 
Florian

^ permalink raw reply

* Re: [PATCH v4 net-next] mdio_bus: Issue GPIO RESET to PHYs.
From: Lars-Peter Clausen @ 2017-04-25 16:22 UTC (permalink / raw)
  To: Roger Quadros, Andrew Lunn
  Cc: davem, Florian Fainelli, tony, nsekhar, jsarha, netdev,
	linux-omap, linux-kernel
In-Reply-To: <551ee7a0-d153-3a36-e025-2c1f70f866b7@ti.com>

On 04/24/2017 11:04 AM, Roger Quadros wrote:
> On 24/04/17 02:35, Andrew Lunn wrote:
>> On Fri, Apr 21, 2017 at 03:31:09PM +0200, Lars-Peter Clausen wrote:
>>> On 04/21/2017 03:15 PM, Roger Quadros wrote:
>>>> diff --git a/Documentation/devicetree/bindings/net/mdio.txt b/Documentation/devicetree/bindings/net/mdio.txt
>>>> new file mode 100644
>>>> index 0000000..4ffbbac
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/net/mdio.txt
>>>> @@ -0,0 +1,33 @@
>>>> +Common MDIO bus properties.
>>>> +
>>>> +These are generic properties that can apply to any MDIO bus.
>>>> +
>>>> +Optional properties:
>>>> +- reset-gpios: List of one or more GPIOs that control the RESET lines
>>>> +  of the PHYs on that MDIO bus.
>>>> +- reset-delay-us: RESET pulse width in microseconds as per PHY datasheet.
>>>> +
>>>> +A list of child nodes, one per device on the bus is expected. These
>>>> +should follow the generic phy.txt, or a device specific binding document.
>>>> +
>>>> +Example :
>>>> +This example shows these optional properties, plus other properties
>>>> +required for the TI Davinci MDIO driver.
>>>> +
>>>> +	davinci_mdio: ethernet@0x5c030000 {
>>>> +		compatible = "ti,davinci_mdio";
>>>> +		reg = <0x5c030000 0x1000>;
>>>> +		#address-cells = <1>;
>>>> +		#size-cells = <0>;
>>>> +
>>>> +		reset-gpios = <&gpio2 5 GPIO_ACTIVE_LOW>;
>>>> +		reset-delay-us = <2>;   /* PHY datasheet states 1us min */
>>>
>>> If this is the reset line of the PHY shouldn't it be a property of the PHY
>>> node rather than of the MDIO controller node (which might have a reset on
>>> its own)?
>>>> +
>>>> +		ethphy0: ethernet-phy@1 {
>>>> +			reg = <1>;
>>>> +		};
>>>> +
>>>> +		ethphy1: ethernet-phy@3 {
>>>> +			reg = <3>;
>>>> +		};
>>
>> Hi Lars-Peter
>>
>> We discussed this when the first proposal was made. There are two
>> cases, to consider.
>>
>> 1) Here, one GPIO line resets all PHYs on the same MDIO bus. In this
>> example, two PHYs.
>>
>> 2) There is one GPIO line per PHY. That is a separate case, and as you
>> say, the reset line should probably be considered a PHY property, not
>> an MDIO property. However, it can be messy, since in order to probe
>> the MDIO bus, you probably need to take the PHY out of reset.
>>

But the DT binding documentation says something else "List of one or more
GPIOs that control the RESET lines of the PHYs on that MDIO bus".

>> Anyway, this patch addresses the first case, so should be accepted. If
>> anybody wants to address the second case, they are free to do so.

I think we all know that that's not going to happen. Once there is a working
kludge there is no incentive to do a proper implementation anymore.


> Thanks for the explanation Andrew.
> 
> For the second case, even if the RESET GPIO property is specified
> in the PHY node, the RESET *will* have to be done by the MDIO bus driver
> else the PHY might not be probed at all.

I'm not arguing with that, just that the hardware description should be
truthful to the hardware topology and not to the software topology, i.e. the
implementation details of the Linux kernel in this case. Reset GPIOs are not
the only resource that is connected to the PHY that needs to be enabled
before they can be enumerated. E.g. clocks and regulators fall into the same
realm. And while you might argue that with a on-SoC phy controller node
there wont be any conflicts in regard to the reset-gpios property, this not
so very true for the clocks property.

And MDIO is not really special in this regard, other discoverable buses
(like USB, SDIO, ULPI) have the very same issue. Having a standardized
binding approach where the resources are declared as part as the child child
is preferable in my opinion.

> 
> Whether we need additional code to just to make the DT look prettier is
> questionable and if required can come as a separate patch.

Unfortunately not, once it is merged it can't be changed anymore.

^ permalink raw reply

* [PATCH v4 net] net: ipv6: regenerate host route if moved to gc list
From: David Ahern @ 2017-04-25 16:17 UTC (permalink / raw)
  To: netdev; +Cc: dvyukov, andreyknvl, mmanning, kafai, eric.dumazet, David Ahern

Taking down the loopback device wreaks havoc on IPv6 routing. By
extension, taking down a VRF device wreaks havoc on its table.

Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
FIB code while running syzkaller fuzzer. The root cause is a dead dst
that is on the garbage list gets reinserted into the IPv6 FIB. While on
the gc (or perhaps when it gets added to the gc list) the dst->next is
set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
out-of-bounds access.

Andrey's reproducer was the key to getting to the bottom of this.

With IPv6, host routes for an address have the dst->dev set to the
loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
a walk of the fib evicting routes with the 'lo' device which means all
host routes are removed. That process moves the dst which is attached to
an inet6_ifaddr to the gc list and marks it as dead.

The recent change to keep global IPv6 addresses added a new function,
fixup_permanent_addr, that is called on admin up. That function restarts
dad for an inet6_ifaddr and when it completes the host route attached
to it is inserted into the fib. Since the route was marked dead and
moved to the gc list, re-inserting the route causes the reported
out-of-bounds accesses. If the device with the address is taken down
or the address is removed, the WARN_ON in fib6_del is triggered.

All of those faults are fixed by regenerating the host route if the
existing one has been moved to the gc list, something that can be
determined by checking if the rt6i_ref counter is 0.

Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v4
- move 'prev = ifp->rt;' under spinlock as requested by Eric

v3
- removed 'if (prev)' and just call ip6_rt_put; added comment about spinlock

v2
- change ifp->rt under spinlock vs cmpxchg
- add comment about rt6i_ref == 0

 net/ipv6/addrconf.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 80ce478c4851..0ea96c4d334d 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3271,14 +3271,24 @@ static void addrconf_gre_config(struct net_device *dev)
 static int fixup_permanent_addr(struct inet6_dev *idev,
 				struct inet6_ifaddr *ifp)
 {
-	if (!ifp->rt) {
-		struct rt6_info *rt;
+	/* rt6i_ref == 0 means the host route was removed from the
+	 * FIB, for example, if 'lo' device is taken down. In that
+	 * case regenerate the host route.
+	 */
+	if (!ifp->rt || !atomic_read(&ifp->rt->rt6i_ref)) {
+		struct rt6_info *rt, *prev;

 		rt = addrconf_dst_alloc(idev, &ifp->addr, false);
 		if (unlikely(IS_ERR(rt)))
 			return PTR_ERR(rt);

+		/* ifp->rt can be accessed outside of rtnl */
+		spin_lock(&ifp->lock);
+		prev = ifp->rt;
 		ifp->rt = rt;
+		spin_unlock(&ifp->lock);
+
+		ip6_rt_put(prev);
 	}

 	if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
-- 
2.1.4

^ permalink raw reply related

* Re: IGMP on IPv6
From: Murali Karicheri @ 2017-04-25 16:16 UTC (permalink / raw)
  To: Cong Wang; +Cc: Hangbin Liu, open list:TI NETCP ETHERNET DRIVER, David Miller
In-Reply-To: <CAM_iQpWJ_oPSWV-PBeBGj5LoddMHTmrimLbViXLJ-VKR21e4mQ@mail.gmail.com>

On 04/18/2017 06:37 PM, Cong Wang wrote:
> On Tue, Apr 18, 2017 at 10:20 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
>> On 04/18/2017 01:12 PM, Murali Karicheri wrote:
>>> On 04/17/2017 05:38 PM, Cong Wang wrote:
>>>> Hello,
>>>>
>>>> On Thu, Apr 13, 2017 at 9:36 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
>>>>> On 03/22/2017 11:04 AM, Murali Karicheri wrote:
>>>>>> This is going directly to the slave Ethernet interface.
>>>>>>
>>>>>> When I put a WARN_ONCE, I found this is coming directly from
>>>>>> mld_ifc_timer_expire() -> mld_sendpack() -> ip6_output()
>>>>>>
>>>>>> Do you think this is fixed in latest kernel at master? If so, could
>>>>>> you point me to some commits.
>>>>>>
>>>>>>
>>>>> Ping... I see this behavior is also seen on v4.9.x Kernel. Any clue if
>>>>> this is fixed by some commit or I need to debug? I see IGMPv6 has some
>>>>> fixes on the list to make it similar to IGMPv4. So can someone clarify this is
>>>>> is a bug at IGMPv6 code or I need to look into the HSR driver code?
>>>>> Since IGMPv4 is going over the HSR interface I am assuming this is a
>>>>> bug in the IGMPv6 code. But since I have not experience with this code
>>>>> can some expert comment please?
>>>>>
>>>>
>>>> How did you configure your network interfaces and IPv4/IPv6 multicast?
>>>> IOW, how did you reproduce this? For example, did you change your
>>>> HSR setup when this happened since you mentioned
>>>> NETDEV_CHANGEUPPER?
>>>>
>>> Thanks for responding! Really appreciate.
>>>
>>> I didn't set up anything explicitly for IPv4/IPv6 multicast. As part of
>>> my testing, I dump the packets going through the slave interfaces attached
>>> to the hsr interface (for example my Ethernet interfaces eth2 and eth3
>>> are attached to the hsr interface and I dump the packets at the egress
>>> of eth2 and eth3 in my driver along with that at hsr xmit function). As
>>> soon as I create the hsr interface, I see a bunch of packets going directly
>>> through the lower interface, not through the upper one (i.e hsr interface)
>>> and these are of eth_type = 86 dd. Please ignore my reference to
>>> NETDEV_CHANGEUPPER for now as it was wild guess.
> 
> OK. Note: I know nothing about HSR, I assume it is similar to bonding
> in your case?
> 

Similar in the sense it glues two standard Ethernet interfaces and run
HSR protocol frames over it to support redundancy.

>>>
>>> I have not done any debugging, but the WARN_ONCE which I have placed
>>> in the lower level driver looking for eth_type = 86 dd provided the
>>> above trace.
>>>
>> Here is the command I have used to create the hsr interface...
>>
>> ip link add name hsr0 type hsr slave1 eth2 slave2 eth3 supervision 45 version 1
> 
> Did you assign IPv4 and IPv6 addresses to the HSR master device?

No. I just used IPv4. From the trace mld_ifc_timer_expire() -> mld_sendpack() -> ip6_output()
do you know what is it trying to do? Is it some neighbor discovery message or something
going over the lower interface instead of the hsr interface?

Murali

> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: qed*: debug infrastructures
From: Florian Fainelli @ 2017-04-25 16:13 UTC (permalink / raw)
  To: Elior, Ariel, David Miller
  Cc: netdev@vger.kernel.org, Mintz, Yuval, Tayar, Tomer, Dupuis, Chad
In-Reply-To: <CY1PR0701MB1337FA26F1503FC9928B0340901F0@CY1PR0701MB1337.namprd07.prod.outlook.com>

On 04/24/2017 10:38 AM, Elior, Ariel wrote:
> Hi Dave,
> 
> According to the recent messages on the list indicating debugfs is not the way
> to go, I am looking for some guidance on what is. dpipe approach was
> mentioned as favorable, but I wanted to make sure molding our debug features to
> this infrastructure will result in something acceptable. A few points:
> 
> 1.
> One of our HW debug features is a signal recording feature. HW is configured to
> output specific signals, which are then continuously dumped into a cyclic
> buffer on host. There are ~8000 signals, which can be logically divided to ~100
> groups. I believe this can be modeled in dpipe (or similar tool) as a set of
> ~100 tables with ~80 entries each, and user would be able to see them all and
> choose what they like. The output data itself is binary, and meaningless to
> "the user". The amount of data is basically as large a contiguous buffer as
> driver can allocate, i.e. usually 4MB. When user selects the signals, and sets
> meta data regarding to mode of operations, some device configuration will have
> to take place. Does that sound reasonable?
> This debug feature already exists out of tree for bnx2x and qed* drivers and is
> *very* effective in field deployments. I would very much like to see this as an
> in-tree feature via some infrastructure or another.

By signals do you mean actual electrical signals coming from your chip's
pins/balls? FWIW, there is kind of something similar existing with ARM
Ltd's Embedded Trace Module which allows capturing a SoC's CPU activity
(loads/stores/pc progression etc.) using a high speed interface. You may
want to look at drivers/hwtracing/{coresight,stm}/ for examples on how
this interfaces with Linux' perf events subsystems. There are existing
trace buffers infrastructures and it sounds like your use case fits
nicely within the perf events framework here where you may want the
entire set or subset of these 8000 signals to be recorded in a buffer.


> 
> 2.
> Sometimes we want to debug the probe or removal flow of the driver. ethtool has
> the disadvantage of only being available once network device is available. If a
> bug stops the load flow before the ethtool debug paths are available, there is
> no way to collect a dump. Similarly, removal flows which hit a bug but do remove
> the network device, can't be debugged from ethtool. Does dpipe suffer from the
> same problem? qed* (like mlx*) has a common-functionality module. This allows
> creating debugfs nodes even before the network drivers are probed, providing a
> solution for this (debug nodes are also available after network driver removal).
> If dpipe does hold an answer here (e.g. provide preconfiguration which would
> kick in when network device registers) then we might want to port all of our
> register dump logic over there for this advantage. Does that sound reasonable?

Can you consider using tracing for that particular purpose with trace
filters that are specific to the device of interest? That should allow
you to get events down from the Linux device driver model all the way to
when your driver actually gets initialized. Is that an option?
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v8 2/3] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
From: Jiri Pirko @ 2017-04-25 16:04 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: davem, xiyou.wangcong, eric.dumazet, netdev
In-Reply-To: <5e54edd8-3943-6f09-490f-ff04b83077f6@mojatatu.com>

Tue, Apr 25, 2017 at 03:01:22PM CEST, jhs@mojatatu.com wrote:
>On 17-04-25 08:13 AM, Jiri Pirko wrote:
>> Tue, Apr 25, 2017 at 01:54:06PM CEST, jhs@mojatatu.com wrote:
>
>
>[..]
>
>> > -#define TCAA_MAX 1
>> > +/* tcamsg flags stored in attribute TCA_ROOT_FLAGS
>> > + *
>> > + * TCA_FLAG_LARGE_DUMP_ON user->kernel to request for larger than TCA_ACT_MAX_PRIO
>> > + * actions in a dump. All dump responses will contain the number of actions
>> > + * being dumped stored in for user app's consumption in TCA_ROOT_COUNT
>> > + *
>> > + */
>> > +#define TCA_FLAG_LARGE_DUMP_ON		(1 << 0)
>> 
>> BIT (I think I mentioned this before)
>> 
>
>You did - but i took it out about two submissions back (per cover
>letter) because it is no part of UAPI today. I noticed devlink was
>using it but they defined  their own variant.
>So if i added this, iproute2 doesnt compile. I could fix iproute2
>to move it somewhere to a common header then restore this.

So fix iproute2. It is always first kernel, then iproute2.


>
>> > +#define VALID_TCA_FLAGS TCA_FLAG_LARGE_DUMP_ON
>> 
>> Again, namespace please. "TCA_ROOT_FLAGS_VALID"
>
>Good point.
>
>> Why is this UAPI?
>> 
>
>Should not be. I will fix in next update.
>
>
>> 
>> > 
>> > /* New extended info filters for IFLA_EXT_MASK */
>> > #define RTEXT_FILTER_VF		(1 << 0)
>> > diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>> > index 9ce22b7..2756213 100644
>> > --- a/net/sched/act_api.c
>> > +++ b/net/sched/act_api.c
>> > @@ -83,6 +83,7 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
>> > 			   struct netlink_callback *cb)
>> > {
>> > 	int err = 0, index = -1, i = 0, s_i = 0, n_i = 0;
>> > +	u32 act_flags = cb->args[2];
>> > 	struct nlattr *nest;
>> > 
>> > 	spin_lock_bh(&hinfo->lock);
>> > @@ -111,14 +112,18 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
>> > 			}
>> > 			nla_nest_end(skb, nest);
>> > 			n_i++;
>> > -			if (n_i >= TCA_ACT_MAX_PRIO)
>> > +			if (!(act_flags & TCA_FLAG_LARGE_DUMP_ON) &&
>> > +			    n_i >= TCA_ACT_MAX_PRIO)
>> > 				goto done;
>> > 		}
>> > 	}
>> > done:
>> > 	spin_unlock_bh(&hinfo->lock);
>> > -	if (n_i)
>> > +	if (n_i) {
>> > 		cb->args[0] += n_i;
>> > +		if (act_flags & TCA_FLAG_LARGE_DUMP_ON)
>> > +			cb->args[1] = n_i;
>> > +	}
>> > 	return n_i;
>> > 
>> > nla_put_failure:
>> > @@ -993,11 +998,15 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
>> > 	return tcf_add_notify(net, n, &actions, portid);
>> > }
>> > 
>> > +static const struct nla_policy tcaa_policy[TCA_ROOT_MAX + 1] = {
>> > +	[TCA_ROOT_FLAGS]      = { .type = NLA_U32 },
>> > +};
>> > +
>> > static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
>> > 			 struct netlink_ext_ack *extack)
>> > {
>> > 	struct net *net = sock_net(skb->sk);
>> > -	struct nlattr *tca[TCAA_MAX + 1];
>> > +	struct nlattr *tca[TCA_ROOT_MAX + 1];
>> > 	u32 portid = skb ? NETLINK_CB(skb).portid : 0;
>> > 	int ret = 0, ovr = 0;
>> > 
>> > @@ -1005,7 +1014,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
>> > 	    !netlink_capable(skb, CAP_NET_ADMIN))
>> > 		return -EPERM;
>> > 
>> > -	ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCAA_MAX, NULL,
>> > +	ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCA_ROOT_MAX, NULL,
>> > 			  extack);
>> > 	if (ret < 0)
>> > 		return ret;
>> > @@ -1046,16 +1055,12 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
>> > 	return ret;
>> > }
>> > 
>> > -static struct nlattr *find_dump_kind(const struct nlmsghdr *n)
>> > +static struct nlattr *find_dump_kind(struct nlattr **nla)
>> > {
>> > 	struct nlattr *tb1, *tb2[TCA_ACT_MAX + 1];
>> > 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
>> > -	struct nlattr *nla[TCAA_MAX + 1];
>> > 	struct nlattr *kind;
>> > 
>> > -	if (nlmsg_parse(n, sizeof(struct tcamsg), nla, TCAA_MAX,
>> > -			NULL, NULL) < 0)
>> > -		return NULL;
>> > 	tb1 = nla[TCA_ACT_TAB];
>> > 	if (tb1 == NULL)
>> > 		return NULL;
>> > @@ -1073,6 +1078,16 @@ static struct nlattr *find_dump_kind(const struct nlmsghdr *n)
>> > 	return kind;
>> > }
>> > 
>> > +static inline bool tca_flags_valid(u32 act_flags)
>> > +{
>> > +	u32 invalid_flags_mask  = ~VALID_TCA_FLAGS;
>> > +
>> > +	if (act_flags & invalid_flags_mask)
>> > +		return false;
>> 
>> I don't see how this resolves anything. VALID_TCA_FLAGS is set in stone
>> not going to change anytime in future, right?
>
>Every time a new bit gets added VALID_TCA_FLAGS changes.

You mean flag that user can set? If that is the case, you are breaking
UAPI for newer app running on older kernel.


>
>> Then the TCA_ROOT_FLAGS
>> attr will always contain only one flag, right?
>
>Not true - forever. Just in this patch discussion:
>I have added 2 other flags since removed. So I cant predict the
>future. You keep coming back to this assumption there will always
>ever be this one flag. I am not following how you reach this
>conclusion.

I read this paragraph 5 times, still I don't get what you say :/


>
>> Then, I don't see why do we need this dance... u8 flag attr resolves
>> this in elegant way. I know I sound like a broken record. This is the
>> last time I'm commenting on this. I give up.
>> 
>
>Why is a u8 better than u32 Jiri?
>The TLV space consumed is still 64 bits in both cases. If I define u8,
>u16, u32 - it is still 64 bits being alloced. If i use a u8 then i have
>24 bits which are pads - given current discussions I can never ever use
>again!

I don't care, use u8 or u32. Just 1 attr - 1 flag.


>
>If i keep 32 bits I am free to use those without ever changing the
>data structure (except for the restrictions to now make sure nothing
>gets set).

what datastructure? I'm confused.

>
>cheers,
>jamal

^ permalink raw reply

* Re: [PATCH net-next] qed: fix invalid use of sizeof in qed_alloc_qm_data()
From: David Miller @ 2017-04-25 16:00 UTC (permalink / raw)
  To: weiyj.lk; +Cc: Yuval.Mintz, Ariel.Elior, weiyongjun1, everest-linux-l2, netdev
In-Reply-To: <20170425070718.14790-1-weiyj.lk@gmail.com>

From: Wei Yongjun <weiyj.lk@gmail.com>
Date: Tue, 25 Apr 2017 07:07:18 +0000

> From: Wei Yongjun <weiyongjun1@huawei.com>
> 
> sizeof() when applied to a pointer typed expression gives the
> size of the pointer, not that of the pointed data.
> 
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>

Applied with Yuval's Fixes: tag added.

^ permalink raw reply

* Re: [PATCH net] macsec: avoid heap overflow in skb_to_sgvec on receive
From: Jason A. Donenfeld @ 2017-04-25 15:59 UTC (permalink / raw)
  To: Sabrina Dubroca; +Cc: Netdev
In-Reply-To: <20170425155140.GA13414@bistromath.localdomain>

On Tue, Apr 25, 2017 at 5:51 PM, Sabrina Dubroca <sd@queasysnail.net> wrote:
> 2017-04-25, 17:39:09 +0200, Jason A. Donenfeld wrote:
>> Hi Sabrina,
>>
>> I think I may have beaten you to the punch here by a few minutes. :)
>
> I said I was going to post a patch.
> Mail headers seem to disagree with you ;)

Oh, whoops, sorry -- thought you wanted me to. Misread, but happy to
have done it anyway.

> Unless I missed something, encrypt was already handling fragments
> correctly. An skb with ->frag_list should have no skb_tailroom, so it
> will be linearized skb_copy_expand().

You're right. In that case, you might as well add back the FRAGLIST
annotation to the FEATURES, so that packets with frag_lists aren't
copied twice -- once upon linearization into xmit and again when
calling copy_expand in case they need more head or tail space. In
general, too, using the precise number of frags saves memory. But
above all, it seems to me the important thing is that it's very
_clear_ there isn't going to be an overflow, that the code doesn't
rely on so many odd particulars that then get violated later in its
life. Using the explicit length makes things a lot more clear. So
maybe better to go with mine?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox