Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2] kcm: remove any offset before parsing messages
From: Tom Herbert @ 2019-02-15  2:48 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: David Miller, doronrk, Tom Herbert, Dave Watson,
	Linux Kernel Network Developers, LKML
In-Reply-To: <20190215015705.GA17974@nautica>

On Thu, Feb 14, 2019 at 5:57 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Thu, Feb 14, 2019:
> > > The best alternative I see is adding a proper helper to get
> > > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> > > lost as I have been; I'm not quite sure how/where to add such a helper
> > > though as I've barely looked at the bpf code until now, but should we go
> > > for that?
> >
> > I'd rather not complicate the bpf code for this. Can we just always do
> > an pskb_pull after skb_clone?
>
> Which skb_clone are you thinking of?
> If you're referring to the one in strparser, that was my very first
> approach here[1], but Dordon shot it down saying that this is not an
> strparser bug but a kcm bug since there are ways for users to properly
> get the offset and use it -- and the ktls code does it right.
>
> Frankly, my opinion still is that it'd be better in strparser because
> there also is some bad use in bpf sockmap (they never look at the offset
> either, while kcm uses it for rcv but not for parse), but looking at it
> from an optimal performance point of view I agree the user can make
> better decision than strparser so I understand where he comes from.
>
>
> This second patch[2] (the current thread) now does an extra clone if
> there is an offset, but the problem really isn't in the clone but the
> pull itself that can fail and return NULL when there is memory pressure.
> For some reason I hadn't been able to reproduce that behaviour with
> strparser doing the pull, but I assume it would also be possible to hit
> in extreme situations, I'm not sure...
>
This option looks the best to me, at least as a way to fix the issue
without requiring a change to the API. If the pull fails, doesn't that
just mean that the parser fails? Is there some behavior with this
patch that is not being handled gracefully?

Thanks,
Tom

> So anyway, we basically have three choices that I can see:
>  - push harder on strparser and go back to my first patch ; it's simple
> and makes using strparser easier/safer but has a small overhead for
> ktls, which uses the current strparser implementation correctly.
> I hadn't been able to get this to fail with KASAN last time I tried but
> I assume it should still be possible somehow.
>
>  - the current patch, that I could only get to fail with KASAN, but does
> complexify kcm a bit; this also does not fix bpf sockmap at all.
> Would still require to fix the hang, so make strparser retry a few times
> if strp->cb.parse_msg failed maybe? Or at least get the error back to
> userspace somehow.
>
>  - my last suggestion to document the kcm behaviour, and if possible add
> a bpf helper to make proper usage easier ; this will make kcm harder to
> use on end users but it's better than not being documented and seeing
> random unappropriate lengths being parsed.
>
>
>
> [1] first strparser patch
> http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmadeus@codewreck.org
> [2] current thread's patch
> http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmadeus@codewreck.org
>
>
> Thanks,
> --
> Dominique

^ permalink raw reply

* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: David Chang @ 2019-02-15  2:51 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Realtek linux nic maintainers, netdev, Martti Laaksonen
In-Reply-To: <47a0819f-5ec3-6d73-210e-235d6bbcaab1@gmail.com>

Hi Heiner,

On Feb 14, 2019 at 07:17:44 +0100, Heiner Kallweit wrote:
> Hi David,
> 
> On 14.02.2019 03:45, David Chang wrote:
> > Hi Heiner,
> > 
> > On Feb 05, 2019 at 19:50:30 +0100, Heiner Kallweit wrote:
> >> Hi David,
> >>
> >> meanwhile there's the following bug report matching what reported.
> >> It's even the same chip version (RTL8168h).
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1671958
> >>
> >> Symptom there is also a significant number of rx_missed packets.
> >> Could you try what I mentioned there last:
> >> Try building a kernel with the call to rtl_hw_aspm_clkreq_enable(tp, true) at the
> >> end of rtl_hw_start_8168h_1() being disabled.
> > 
> > After disabled the aspm function that you mentioned, we finally got the
> > positive testing result. And the rx_missed error was gone. If without
> > the patch, the receive side get back to bad performance.
> > 
> Good to know, thanks. I also checked with Realtek, they confirmed that their Windows
> driver uses some heuristics to disable ASPM under high load. So it seems like there
> is some hw issue. Open so far is whether this affects certain chip versions only.
> Let's see whether they can provide more information.

Ok!

> Disabling ASPM in general would hurt notebook users because based on some past
> measurements we know ASPM can significantly save energy.

I understand, thanks!

regards,
David
> 
> > kernel: r8169: loading out-of-tree module taints kernel.
> > kernel: r8169: module verification failed: signature and/or required key missing - tainting kernel
> > kernel: libphy: r8169: probed
> > kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, ec:8e:b5:5a:2c:f5, XID 54100880, IRQ 128
> > kernel: r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
> > kernel: r8169 0000:01:00.0 enp1s0: renamed from eth0
> > kernel: Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
> > kernel: r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control off
> > 
> > NIC statistics:
> >      tx_packets: 1653804
> >      rx_packets: 1555966
> >      tx_errors: 0
> >      rx_errors: 0
> >      rx_missed: 0
> >      align_errors: 0
> >      tx_single_collisions: 0
> >      tx_multi_collisions: 0
> >      unicast: 1555884
> >      broadcast: 78
> >      multicast: 4
> >      tx_aborted: 0
> >      tx_underrun: 0
> > 
> > iperf receive:
> > -----------------------------------------------------------
> > Server listening on 5201
> > -----------------------------------------------------------
> > Accepted connection from 10.x.x.x, port 55516
> > [  5] local 10.x.x.x port 5201 connected to 10.x.x.x port 58172
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec   108 MBytes   906 Mbits/sec
> > [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec
> > [  5]   2.00-3.00   sec   112 MBytes   940 Mbits/sec
> > [  5]   3.00-4.00   sec   112 MBytes   941 Mbits/sec
> > [  5]   4.00-5.00   sec   112 MBytes   941 Mbits/sec
> > [  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec
> > [  5]   6.00-7.00   sec   112 MBytes   939 Mbits/sec
> > [  5]   7.00-8.00   sec   112 MBytes   941 Mbits/sec
> > [  5]   8.00-9.00   sec   112 MBytes   938 Mbits/sec
> > [  5]   9.00-10.00  sec   112 MBytes   941 Mbits/sec
> > [  5]  10.00-11.00  sec   112 MBytes   941 Mbits/sec
> > [...]
> > [  5]  50.00-51.00  sec   112 MBytes   941 Mbits/sec
> > [  5]  51.00-52.00  sec   112 MBytes   941 Mbits/sec
> > [  5]  52.00-53.00  sec   112 MBytes   942 Mbits/sec
> > [  5]  53.00-54.00  sec   112 MBytes   941 Mbits/sec
> > [  5]  54.00-55.00  sec   111 MBytes   934 Mbits/sec
> > [  5]  55.00-56.00  sec   112 MBytes   942 Mbits/sec
> > [  5]  56.00-57.00  sec   112 MBytes   937 Mbits/sec
> > [  5]  57.00-58.00  sec   112 MBytes   941 Mbits/sec
> > [  5]  58.00-59.00  sec   111 MBytes   932 Mbits/sec
> > [  5]  59.00-60.00  sec   112 MBytes   942 Mbits/sec
> > [  5]  60.00-60.04  sec  4.06 MBytes   939 Mbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-60.04  sec  6.57 GBytes   940 Mbits/sec                  receiver
> > 
> > regards,
> > David
> > 
> Heiner
> 

^ permalink raw reply

* Re: Question on ptr_ring linux header
From: fei phung @ 2019-02-15  3:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: feiphung, netdev
In-Reply-To: <20190201102519-mutt-send-email-mst@kernel.org>

Hi Michael,

> If I had to guess I'd say the way you play with indices is probably racy
> so you are producing an invalid index.

You are probably right.

I am suspecting item_recv_push_index and item_send_push_index in
https://gist.github.com/promach/7716ee8addcaa33fda140d74d1ad94d6#file-riffa_driver_ptr_ring-c-L254-L404
causing data race (illegal index) for consume() function

note that this interrupt handler could be interrupted again by another
hardware interrupt

I do not think we can use lock within interrupt handler, right ?


Code:
~~~

struct item item_recv_push[CIRC_BUFF_SIZE];
struct item item_send_push[CIRC_BUFF_SIZE];

unsigned int item_send_push_index;
unsigned int item_recv_push_index;

///////////////////////////////////////////////////////
// INTERRUPT HANDLER
///////////////////////////////////////////////////////

/**
* Reads the interrupt vector and processes it. If processing VECT0, off will
* be 0. If processing VECT1, off will be 6.
*/
static inline void process_intr_vector(struct fpga_state * sc, int off,
unsigned int vect)
{
// VECT_0/VECT_1 are organized from right to left (LSB to MSB) as:
// [ 0] TX_TXN for channel 0 in VECT_0, channel 6 in VECT_1
// [ 1] TX_SG_BUF_RECVD for channel 0 in VECT_0, channel 6 in VECT_1
// [ 2] TX_TXN_DONE for channel 0 in VECT_0, channel 6 in VECT_1
// [ 3] RX_SG_BUF_RECVD for channel 0 in VECT_0, channel 6 in VECT_1
// [ 4] RX_TXN_DONE for channel 0 in VECT_0, channel 6 in VECT_1
// ...
// [25] TX_TXN for channel 5 in VECT_0, channel 11 in VECT_1
// [26] TX_SG_BUF_RECVD for channel 5 in VECT_0, channel 11 in VECT_1
// [27] TX_TXN_DONE for channel 5 in VECT_0, channel 11 in VECT_1
// [28] RX_SG_BUF_RECVD for channel 5 in VECT_0, channel 11 in VECT_1
// [29] RX_TXN_DONE for channel 5 in VECT_0, channel 11 in VECT_1
// Positions 30 - 31 in both VECT_0 and VECT_1 are zero.

unsigned int offlast;
unsigned int len;
int recv;
int send;
int chnl;
int i;

offlast = 0;
len = 0;

//printk(KERN_INFO "riffa: intrpt_handler received:%08x\n", vect);
if (vect & 0xC0000000) {
printk(KERN_ERR "riffa: fpga:%d, received bad interrupt
vector:%08x\n", sc->id, vect);
return;
}

for (i = 0; i < 6 && (i+off) < sc->num_chnls; ++i) {
chnl = i + off;
recv = 0;
send = 0;

// TX (PC receive) scatter gather buffer is read.
if (vect & (1<<((5*i)+1))) {
recv = 1;

item_recv_push[item_recv_push_index].val1 = EVENT_SG_BUF_READ;
item_recv_push[item_recv_push_index].val2 = 0;

// Keep track so the thread can handle this.
if (ptr_ring_produce_any(sc->recv[chnl]->msgs,
&item_recv_push[item_recv_push_index])) {
printk(KERN_ERR "riffa: fpga:%d chnl:%d, recv sg buf read msg queue
full\n", sc->id, chnl);
}
DEBUG_MSG(KERN_INFO "riffa: fpga:%d chnl:%d, recv sg buf read\n", sc->id, chnl);

item_recv_push_index++;
}

// TX (PC receive) transaction done.
if (vect & (1<<((5*i)+2))) {
recv = 1;

item_recv_push[item_recv_push_index].val1 = EVENT_TXN_DONE;
item_recv_push[item_recv_push_index].val2 = len;

// Read the transferred amount.
len = read_reg(sc, CHNL_REG(chnl, TX_TNFR_LEN_REG_OFF));
// Notify the thread.
if (ptr_ring_produce_any(sc->recv[chnl]->msgs,
&item_recv_push[item_recv_push_index])) {
printk(KERN_ERR "riffa: fpga:%d chnl:%d, recv txn done msg queue
full\n", sc->id, chnl);
}
DEBUG_MSG(KERN_INFO "riffa: fpga:%d chnl:%d, recv txn done\n", sc->id, chnl);

item_recv_push_index++;
}

// New TX (PC receive) transaction.
if (vect & (1<<((5*i)+0))) {
recv = 1;
recv_sg_buf_populated = 0; // resets for new transaction

// Read the offset/last and length
offlast = read_reg(sc, CHNL_REG(chnl, TX_OFFLAST_REG_OFF));
tx_len = read_reg(sc, CHNL_REG(chnl, TX_LEN_REG_OFF));

item_recv_push[item_recv_push_index].val1 = EVENT_TXN_OFFLAST;
item_recv_push[item_recv_push_index].val2 = offlast;

// Keep track of this transaction
if (ptr_ring_produce_any(sc->recv[chnl]->msgs,
&item_recv_push[item_recv_push_index])) {
printk(KERN_ERR "riffa: fpga:%d chnl:%d, recv txn offlast msg queue
full\n", sc->id, chnl);
}
/*if (push_circ_queue(sc->recv[chnl]->msgs, EVENT_TXN_LEN, len)) {
printk(KERN_ERR "riffa: fpga:%d chnl:%d, recv txn len msg queue
full\n", sc->id, chnl);
}*/
DEBUG_MSG(KERN_INFO "riffa: fpga:%d chnl:%d, recv txn (len:%d off:%d
last:%d)\n", sc->id, chnl, tx_len, (offlast>>1), (offlast & 0x1));

item_recv_push_index++;
}

// RX (PC send) scatter gather buffer is read.
if (vect & (1<<((5*i)+3))) {
send = 1;

item_send_push[item_send_push_index].val1 = EVENT_SG_BUF_READ;
item_send_push[item_send_push_index].val2 = 0;

// Keep track so the thread can handle this.
if (ptr_ring_produce_any(sc->send[chnl]->msgs,
&item_send_push[item_send_push_index])) {
printk(KERN_ERR "riffa: fpga:%d chnl:%d, send sg buf read msg queue
full\n", sc->id, chnl);
}
DEBUG_MSG(KERN_INFO "riffa: fpga:%d chnl:%d, send sg buf read\n", sc->id, chnl);

item_send_push_index++;
}

// RX (PC send) transaction done.
if (vect & (1<<((5*i)+4))) {
send = 1;

item_send_push[item_send_push_index].val1 = EVENT_TXN_DONE;
item_send_push[item_send_push_index].val2 = len;

// Read the transferred amount.
len = read_reg(sc, CHNL_REG(chnl, RX_TNFR_LEN_REG_OFF));
// Notify the thread.
if (ptr_ring_produce_any(sc->send[chnl]->msgs,
&item_send_push[item_send_push_index])) {
printk(KERN_ERR "riffa: fpga:%d chnl:%d, send txn done msg queue
full\n", sc->id, chnl);
}
DEBUG_MSG(KERN_INFO "riffa: fpga:%d chnl:%d, send txn done\n", sc->id, chnl);

item_send_push_index++;
}

// Wake up the thread?
if (recv)
wake_up(&sc->recv[chnl]->waitq);
if (send)
wake_up(&sc->send[chnl]->waitq);
}
}

~~~


Regards,
Cheng Fei

^ permalink raw reply

* Re: bpf: test_tunnel.sh: BUG: unable to handle kernel NULL pointer dereference
From: Alexei Starovoitov @ 2019-02-15  3:08 UTC (permalink / raw)
  To: Alan Maguire
  Cc: Naresh Kamboju, Netdev, David S. Miller, Valdis Kletnieks,
	Song Liu, Rafael Tinoco, ast, Daniel Borkmann
In-Reply-To: <alpine.LRH.2.20.1902111627170.7885@dhcp-10-175-202-251.vpn.oracle.com>

On Mon, Feb 11, 2019 at 04:39:50PM +0000, Alan Maguire wrote:
> On Fri, 1 Feb 2019, Naresh Kamboju wrote:
> 
> > Kernel panic while running bpf: test_tunnel.sh on linux -next
> > 5.0.0-rc4-next-20190129.
> > 
> > selftests: bpf: test_tunnel.sh
> > [  273.997647] IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
> > [  274.054068] ip (11458) used greatest stack depth: 11448 bytes left
> > [  274.120445] BUG: unable to handle kernel NULL pointer dereference
> > at 0000000000000000
> > [  274.128285] #PF error: [INSTR]
> > [  274.131351] PGD 8000000414a0e067 P4D 8000000414a0e067 PUD 3b6334067 PMD 0
> > [  274.138241] Oops: 0010 [#1] SMP PTI
> > [  274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted
> > 5.0.0-rc4-next-20190129 #1
> > [  274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > 2.0b 07/27/2017
> > [  274.156526] RIP: 0010:          (null)
> > [  274.160280] Code: Bad RIP value.
> > [  274.163509] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
> > [  274.168726] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
> > [  274.175851] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
> > [  274.182974] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
> > [  274.190098] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
> > [  274.197222] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
> > [  274.204346] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
> > knlGS:0000000000000000
> > [  274.212424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  274.218162] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
> > [  274.225292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  274.232416] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  274.239541] Call Trace:
> > [  274.241988]  ? tnl_update_pmtu+0x296/0x3b0
> > [  274.246085]  ip_md_tunnel_xmit+0x1bc/0x520
> > [  274.250176]  gre_fb_xmit+0x330/0x390
> > [  274.253754]  gre_tap_xmit+0x128/0x180
> > [  274.257414]  dev_hard_start_xmit+0xb7/0x300
> > [  274.261598]  sch_direct_xmit+0xf6/0x290
> > [  274.265430]  __qdisc_run+0x15d/0x5e0
> > [  274.269007]  __dev_queue_xmit+0x2c5/0xc00
> > [  274.273011]  ? dev_queue_xmit+0x10/0x20
> > [  274.276842]  ? eth_header+0x2b/0xc0
> > [  274.280326]  dev_queue_xmit+0x10/0x20
> > [  274.283984]  ? dev_queue_xmit+0x10/0x20
> > [  274.287813]  arp_xmit+0x1a/0xf0
> > [  274.290952]  arp_send_dst.part.19+0x46/0x60
> > [  274.295138]  arp_solicit+0x177/0x6b0
> > [  274.298708]  ? mod_timer+0x18e/0x440
> > [  274.302281]  neigh_probe+0x57/0x70
> > [  274.305684]  __neigh_event_send+0x197/0x2d0
> > [  274.309862]  neigh_resolve_output+0x18c/0x210
> > [  274.314212]  ip_finish_output2+0x257/0x690
> > [  274.318304]  ip_finish_output+0x219/0x340
> > [  274.322314]  ? ip_finish_output+0x219/0x340
> > [  274.326493]  ip_output+0x76/0x240
> > [  274.329805]  ? ip_fragment.constprop.53+0x80/0x80
> > [  274.334510]  ip_local_out+0x3f/0x70
> > [  274.337992]  ip_send_skb+0x19/0x40
> > [  274.341391]  ip_push_pending_frames+0x33/0x40
> > [  274.345740]  raw_sendmsg+0xc15/0x11d0
> > [  274.349403]  ? __might_fault+0x85/0x90
> > [  274.353151]  ? _copy_from_user+0x6b/0xa0
> > [  274.357070]  ? rw_copy_check_uvector+0x54/0x130
> > [  274.361604]  inet_sendmsg+0x42/0x1c0
> > [  274.365179]  ? inet_sendmsg+0x42/0x1c0
> > [  274.368937]  sock_sendmsg+0x3e/0x50
> > [  274.372460]  ___sys_sendmsg+0x26f/0x2d0
> > [  274.376293]  ? lock_acquire+0x95/0x190
> > [  274.380043]  ? __handle_mm_fault+0x7ce/0xb70
> > [  274.384307]  ? lock_acquire+0x95/0x190
> > [  274.388053]  ? __audit_syscall_entry+0xdd/0x130
> > [  274.392586]  ? ktime_get_coarse_real_ts64+0x64/0xc0
> > [  274.397461]  ? __audit_syscall_entry+0xdd/0x130
> > [  274.401989]  ? trace_hardirqs_on+0x4c/0x100
> > [  274.406173]  __sys_sendmsg+0x63/0xa0
> > [  274.409744]  ? __sys_sendmsg+0x63/0xa0
> > [  274.413488]  __x64_sys_sendmsg+0x1f/0x30
> > [  274.417405]  do_syscall_64+0x55/0x190
> > [  274.421064]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > [  274.426113] RIP: 0033:0x7ff4ae0e6e87
> > [  274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00
> > 00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00
> > 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c
> > 24 08
> > [  274.448422] RSP: 002b:00007ffcd9b76db8 EFLAGS: 00000246 ORIG_RAX:
> > 000000000000002e
> > [  274.455978] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007ff4ae0e6e87
> > [  274.463104] RDX: 0000000000000000 RSI: 00000000006092e0 RDI: 0000000000000003
> > [  274.470228] RBP: 0000000000000000 R08: 00007ffcd9bc40a0 R09: 00007ffcd9bc4080
> > [  274.477349] R10: 000000000000060a R11: 0000000000000246 R12: 0000000000000003
> > [  274.484475] R13: 0000000000000016 R14: 00007ffcd9b77fa0 R15: 00007ffcd9b78da4
> > [  274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter
> > ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded:
> > test_bpf]
> > [  274.504634] CR2: 0000000000000000
> > [  274.507976] ---[ end trace 196d18386545eae1 ]---
> > [  274.512588] RIP: 0010:          (null)
> > [  274.516334] Code: Bad RIP value.
> > [  274.519557] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
> > [  274.524775] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
> > [  274.531921] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
> > [  274.539082] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
> > [  274.546205] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
> > [  274.553329] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
> > [  274.560456] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
> > knlGS:0000000000000000
> > [  274.568541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  274.574277] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
> > [  274.581403] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  274.588535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  274.595658] Kernel panic - not syncing: Fatal exception in interrupt
> > [  274.602046] Kernel Offset: 0x14400000 from 0xffffffff81000000
> > (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [  274.612827] ---[ end Kernel panic - not syncing: Fatal exception in
> > interrupt ]---
> > [  274.620387] ------------[ cut here ]------------
> > 
> > The above log is from x86_64 device.
> > 
> 
>  
> I'm also seeing the same panic during test_tunnel.sh execution
> on x86_64.
> 
> From poking around it looks like the skb's dst entry is being used
> to calculate the mtu in:
> 
> mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
> 
> ...but because that dst_entry  has an "ops" value set to md_dst_ops,
> the various ops (including mtu) are not set:
> 
> crash> struct sk_buff._skb_refdst ffff928f87447700 -x
>       _skb_refdst = 0xffffcd6fbf5ea590
> crash> struct dst_entry.ops 0xffffcd6fbf5ea590
>   ops = 0xffffffffa0193800
> crash> struct dst_ops.mtu 0xffffffffa0193800
>   mtu = 0x0
> crash>
> 
> I confirmed that the dst entry also has dst->input set to
> dst_md_discard, so it looks like it's an entry that's been
> initialized via __metadata_dst_init alright.
> 
> I think the fix here is to use skb_valid_dst(skb) - it checks
> for  DST_METADATA also, and with that fix in place, the
> problem - which was previously 100% reproducible - disappears.

The fix looks good to me.
Could you please resend the patch officially?

> However what we should do in terms of path MTU setting for
> such cases (use ip*_update_pmtu() perhaps?), I'm not sure.
> 
> The below patch resolves the panic and all tunnel tests pass
> without incident.
> 
> Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
> 
> Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
> ---
>  net/ipv4/ip_tunnel.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> index 893f013..5dcf50c 100644
> --- a/net/ipv4/ip_tunnel.c
> +++ b/net/ipv4/ip_tunnel.c
> @@ -515,9 +515,10 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
>  		mtu = dst_mtu(&rt->dst) - dev->hard_header_len
>  					- sizeof(struct iphdr) - tunnel_hlen;
>  	else
> -		mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
> +		mtu = skb_valid_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
>  
> -	skb_dst_update_pmtu(skb, mtu);
> +	if (skb_valid_dst(skb))
> +		skb_dst_update_pmtu(skb, mtu);
>  
>  	if (skb->protocol == htons(ETH_P_IP)) {
>  		if (!skb_is_gso(skb) &&
> @@ -530,9 +531,11 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
>  	}
>  #if IS_ENABLED(CONFIG_IPV6)
>  	else if (skb->protocol == htons(ETH_P_IPV6)) {
> -		struct rt6_info *rt6 = (struct rt6_info *)skb_dst(skb);
> +		struct rt6_info *rt6;
>  		__be32 daddr;
>  
> +		rt6 = skb_valid_dst(skb) ? (struct rt6_info *)skb_dst(skb) :
> +					   NULL;
>  		daddr = md ? dst : tunnel->parms.iph.daddr;
>  
>  		if (rt6 && mtu < dst_mtu(skb_dst(skb)) &&
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 1/2] Provide in-kernel headers for making it easy to extend the kernel
From: Alexei Starovoitov @ 2019-02-15  3:19 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: netdev, linux-kernel, Andrew Morton, ast, atishp04, dancol,
	Dan Williams, gregkh, Jonathan Corbet, karim.yaghmour, Kees Cook,
	kernel-team, linux-doc, linux-kselftest, Manoj Rao,
	Masahiro Yamada, paulmck, Peter Zijlstra (Intel), rdunlap,
	rostedt, Shuah Khan, Thomas Gleixner, yhs
In-Reply-To: <20190211143600.15021-1-joel@joelfernandes.org>

On Mon, Feb 11, 2019 at 09:35:59AM -0500, Joel Fernandes (Google) wrote:
> Introduce in-kernel headers and other artifacts which are made available
> as an archive through proc (/proc/kheaders.txz file). This archive makes
> it possible to build kernel modules, run eBPF programs, and other
> tracing programs that need to extend the kernel for tracing purposes
> without any dependency on the file system having headers and build
> artifacts.
> 
> On Android and embedded systems, it is common to switch kernels but not
> have kernel headers available on the file system. Raw kernel headers
> also cannot be copied into the filesystem like they can be on other
> distros, due to licensing and other issues. There's no linux-headers
> package on Android. Further once a different kernel is booted, any
> headers stored on the file system will no longer be useful. By storing
> the headers as a compressed archive within the kernel, we can avoid these
> issues that have been a hindrance for a long time.

The set looks good to me and since the main use case is building bpf progs
I can route it via bpf-next tree if there are no objections.
Masahiro, could you please ack it?


^ permalink raw reply

* Re: [PATCH v2] kcm: remove any offset before parsing messages
From: Dominique Martinet @ 2019-02-15  3:31 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, doronrk, Tom Herbert, Dave Watson,
	Linux Kernel Network Developers, LKML
In-Reply-To: <CALx6S37MZadJ=PaAd+SSv9hxSX9kFTmTUtijPGA39JCx3PYq1Q@mail.gmail.com>

Tom Herbert wrote on Thu, Feb 14, 2019:
> > This second patch[2] (the current thread) now does an extra clone if
> > there is an offset, but the problem really isn't in the clone but the
> > pull itself that can fail and return NULL when there is memory pressure.
> > For some reason I hadn't been able to reproduce that behaviour with
> > strparser doing the pull, but I assume it would also be possible to hit
> > in extreme situations, I'm not sure...
>
> This option looks the best to me, at least as a way to fix the issue
> without requiring a change to the API. If the pull fails, doesn't that
> just mean that the parser fails? Is there some behavior with this
> patch that is not being handled gracefully?

Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
all: from a user point of view, the connection just hangs (recvmsg never
returns), without so much as a message in dmesg either.

It took me a while to figure out what failed exactly as I did indeed
expect strparser/kcm to handle that better, but ultimately as things
stand if the parser fails it calls strp_parser_err() with the error
which ends up in strp_abort_strp that should call
sk->sk_error_report(sk) but in kcm case sk is the csk and I guess
failing csk does not make a pending recv on the kcm sock to fail...

I'm not sure whether propagating the error to the upper socket is the
right thing to do, kcm is meant to be able to work with multiple
underlying sockets so I feel we must be cautious about that, but
strparser might be able to retry somehow.
This is what I said below:
> > [,,,]
> >  - the current patch, that I could only get to fail with KASAN, but does
> > complexify kcm a bit; this also does not fix bpf sockmap at all.
> > Would still require to fix the hang, so make strparser retry a few times
> > if strp->cb.parse_msg failed maybe? Or at least get the error back to
> > userspace somehow.

Thanks,
-- 
Dominique

^ permalink raw reply

* Re: [PATCH v2 1/2] Provide in-kernel headers for making it easy to extend the kernel
From: Joel Fernandes @ 2019-02-15  3:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, linux-kernel, Andrew Morton, ast, atishp04, dancol,
	Dan Williams, gregkh, Jonathan Corbet, karim.yaghmour, Kees Cook,
	kernel-team, linux-doc, linux-kselftest, Manoj Rao,
	Masahiro Yamada, paulmck, Peter Zijlstra (Intel), rdunlap,
	rostedt, Shuah Khan, Thomas Gleixner, yhs
In-Reply-To: <20190215031926.ljzluy2cfxp64u6o@ast-mbp>

On Thu, Feb 14, 2019 at 07:19:29PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 11, 2019 at 09:35:59AM -0500, Joel Fernandes (Google) wrote:
> > Introduce in-kernel headers and other artifacts which are made available
> > as an archive through proc (/proc/kheaders.txz file). This archive makes
> > it possible to build kernel modules, run eBPF programs, and other
> > tracing programs that need to extend the kernel for tracing purposes
> > without any dependency on the file system having headers and build
> > artifacts.
> > 
> > On Android and embedded systems, it is common to switch kernels but not
> > have kernel headers available on the file system. Raw kernel headers
> > also cannot be copied into the filesystem like they can be on other
> > distros, due to licensing and other issues. There's no linux-headers
> > package on Android. Further once a different kernel is booted, any
> > headers stored on the file system will no longer be useful. By storing
> > the headers as a compressed archive within the kernel, we can avoid these
> > issues that have been a hindrance for a long time.
> 
> The set looks good to me and since the main use case is building bpf progs
> I can route it via bpf-next tree if there are no objections.
> Masahiro, could you please ack it?
> 

Yes, eBPF is one of the usecases. After this, I am also planning to send
patches to BCC so that it can use this feature when compiling C to eBPF.

Thanks!

 - Joel


^ permalink raw reply

* Re: dead code in vhost.c
From: Jason Wang @ 2019-02-15  3:52 UTC (permalink / raw)
  To: Stephen Hemminger, mst; +Cc: netdev
In-Reply-To: <20190214080346.352c6941@shemminger-XPS-13-9360>


On 2019/2/15 上午12:03, Stephen Hemminger wrote:
> Coverity found this obvious bug
>
> ________________________________________________________________________________________________________
> *** CID 1442593:  Control flow issues  (DEADCODE)
> /drivers/vhost/vhost.c: 1795 in log_used()
> 1789     	ret = translate_desc(vq, (uintptr_t)vq->used + used_offset,
> 1790     			     len, iov, 64, VHOST_ACCESS_WO);
> 1791     	if (ret)
> 1792     		return ret;
> 1793
> 1794     	for (i = 0; i < ret; i++) {
>>>>      CID 1442593:  Control flow issues  (DEADCODE)
>>>>      Execution cannot reach this statement: "ret = log_write_hva(vq, (ui...".
> 1795     		ret = log_write_hva(vq,	(uintptr_t)iov[i].iov_base,
> 1796     				    iov[i].iov_len);
> 1797     		if (ret)
> 1798     			return ret;
> 1799     	}
> 1800


My bad, need check ret < 0 instead.

Will post a fix.

Thanks


^ permalink raw reply

* Re: [PATCH v2] kcm: remove any offset before parsing messages
From: Tom Herbert @ 2019-02-15  4:01 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Tom Herbert, David Miller, Doron Roberts-Kedes, Dave Watson,
	Linux Kernel Network Developers, LKML
In-Reply-To: <20190215033102.GA3099@nautica>

On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Thu, Feb 14, 2019:
> > > This second patch[2] (the current thread) now does an extra clone if
> > > there is an offset, but the problem really isn't in the clone but the
> > > pull itself that can fail and return NULL when there is memory pressure.
> > > For some reason I hadn't been able to reproduce that behaviour with
> > > strparser doing the pull, but I assume it would also be possible to hit
> > > in extreme situations, I'm not sure...
> >
> > This option looks the best to me, at least as a way to fix the issue
> > without requiring a change to the API. If the pull fails, doesn't that
> > just mean that the parser fails? Is there some behavior with this
> > patch that is not being handled gracefully?
>
> Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
> all: from a user point of view, the connection just hangs (recvmsg never
> returns), without so much as a message in dmesg either.
>
Dominique,

That's by design. Here is the description in kcm.txt:

"When a TCP socket is attached to a KCM multiplexor data ready
(POLLIN) and write space available (POLLOUT) events are handled by the
multiplexor. If there is a state change (disconnection) or other error
on a TCP socket, an error is posted on the TCP socket so that a
POLLERR event happens and KCM discontinues using the socket. When the
application gets the error notification for a TCP socket, it should
unattach the socket from KCM and then handle the error condition (the
typical response is to close the socket and create a new connection if
necessary)."

> It took me a while to figure out what failed exactly as I did indeed
> expect strparser/kcm to handle that better, but ultimately as things
> stand if the parser fails it calls strp_parser_err() with the error
> which ends up in strp_abort_strp that should call
> sk->sk_error_report(sk) but in kcm case sk is the csk and I guess
> failing csk does not make a pending recv on the kcm sock to fail...
>
> I'm not sure whether propagating the error to the upper socket is the
> right thing to do, kcm is meant to be able to work with multiple
> underlying sockets so I feel we must be cautious about that, but

Right, that's the motivation for the design.

> strparser might be able to retry somehow.

We could retry on -ENOMEM since it is potentially a transient error,
but generally for errors (like connection is simply broken) it seems
like it's working as intended. I suppose we could add a socket option
to fail the KCM socket when all attached lower sockets have failed,
but I not sure that would be a significant benefit for additional
complexity.

> This is what I said below:
> > > [,,,]
> > >  - the current patch, that I could only get to fail with KASAN, but does
> > > complexify kcm a bit; this also does not fix bpf sockmap at all.
> > > Would still require to fix the hang, so make strparser retry a few times
> > > if strp->cb.parse_msg failed maybe? Or at least get the error back to
> > > userspace somehow.

The error should be getting to userspace via the TCP socket.

Tom

>
> Thanks,
> --
> Dominique

^ permalink raw reply

* Re: [PATCH v2] kcm: remove any offset before parsing messages
From: Dominique Martinet @ 2019-02-15  4:52 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Tom Herbert, David Miller, Doron Roberts-Kedes, Dave Watson,
	Linux Kernel Network Developers, LKML
In-Reply-To: <CAPDqMeoJ7CCo1eGNBp_-crkxfVt_4f=XQqhEo7kmyCN-hf_EWQ@mail.gmail.com>

Tom Herbert wrote on Thu, Feb 14, 2019:
> On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet
> <asmadeus@codewreck.org> wrote:
> > Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
> > all: from a user point of view, the connection just hangs (recvmsg never
> > returns), without so much as a message in dmesg either.
>
> That's by design. Here is the description in kcm.txt:
> 
> "When a TCP socket is attached to a KCM multiplexor data ready
> (POLLIN) and write space available (POLLOUT) events are handled by the
> multiplexor. If there is a state change (disconnection) or other error
> on a TCP socket, an error is posted on the TCP socket so that a
> POLLERR event happens and KCM discontinues using the socket. When the
> application gets the error notification for a TCP socket, it should
> unattach the socket from KCM and then handle the error condition (the
> typical response is to close the socket and create a new connection if
> necessary)."

Sigh, that's what I get for relying on pieces of code found on the
internet.

It does make "trivial" kcm sockets difficult to use though, the old
ganesha code I adapted to kcm was the horrible (naive?) kind spawning
one thread per connection and just waiting on read(), so making it just
create a kcm socket from the tcp one and wait on recvmsg() until an
error pops up does not seem an option.

That being said I'm a bit confused, I thought part of the point of kcm
was the multiplexing so a simple server could just keep attaching tcp
sockets to a single kcm socket and only have a single trivial loop
reading from the kcm socket ; but I guess there's no harm in also
looking for POLLERR on the tcp sockets... It would still need to close
them once clients disconnect I guess, this really only affects my naive
server.

> > strparser might be able to retry somehow.
> 
> We could retry on -ENOMEM since it is potentially a transient error,

Yes, I think we should aim to retry on -ENOMEM; I agree all errors are
not meant to be retried on but there already are different cases based
on what the parse cb returned; but that can be a different patch.

> but generally for errors (like connection is simply broken) it seems
> like it's working as intended. I suppose we could add a socket option
> to fail the KCM socket when all attached lower sockets have failed,
> but I not sure that would be a significant benefit for additional
> complexity.

Right, I agree it's probably not worth doing, now I'm aware of this I
can probably motivate myself to change this old code to use epoll
properly.

I think it could make sense to have this option for simpler programs,
but as things stand I guess we can require kcm users to do this much,
thanks for the explanation, and sorry for having failed to notice it.



With all that said I guess my patch should work correctly then, I'll try
to find some time to check the error does come back up the tcp socket in
my reproducer but I have no reason to believe it doesn't.

I'd like to see some retry on ENOMEM before this is merged though, so
while I'm there I'll resend this with a second patch doing that
retry,.. I think just not setting strp->interrupted and not reporting
the error up might be enough? Will have to try either way.



Thanks for the feedback,
-- 
Dominique

^ permalink raw reply

* Re: KASAN: use-after-free Read in lock_sock_nested
From: syzbot @ 2019-02-15  5:25 UTC (permalink / raw)
  To: davem, linux-hams, linux-kernel, netdev, ralf, syzkaller-bugs
In-Reply-To: <0000000000007a5aad057e7748c9@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    b3418f8bddf4 Add linux-next specific files for 20190214
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=14f63047400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8a3a37525a677c71
dashboard link: https://syzkaller.appspot.com/bug?extid=500c69d1e21d970e461b
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14b08da7400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+500c69d1e21d970e461b@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x3150/0x4710  
kernel/locking/lockdep.c:3200
Read of size 8 at addr ffff8880195faa60 by task syz-executor.4/7495

CPU: 1 PID: 7495 Comm: syz-executor.4 Not tainted 5.0.0-rc6-next-20190214  
#35
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
  kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
  __lock_acquire+0x3150/0x4710 kernel/locking/lockdep.c:3200
  lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3833
  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
  _raw_spin_lock_bh+0x33/0x50 kernel/locking/spinlock.c:168
  spin_lock_bh include/linux/spinlock.h:334 [inline]
  lock_sock_nested+0x41/0x120 net/core/sock.c:2878
  lock_sock include/net/sock.h:1507 [inline]
  nr_accept+0x200/0x790 net/netrom/af_netrom.c:808
  __sys_accept4+0x350/0x6a0 net/socket.c:1610
  __do_sys_accept net/socket.c:1651 [inline]
  __se_sys_accept net/socket.c:1648 [inline]
  __x64_sys_accept+0x75/0xb0 net/socket.c:1648
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e29
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f16cb51ec78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e29
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 000000000073bfa0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f16cb51f6d4
R13: 00000000004bdbf0 R14: 00000000004cde80 R15: 00000000ffffffff

Allocated by task 7492:
  save_stack+0x45/0xd0 mm/kasan/common.c:75
  set_track mm/kasan/common.c:87 [inline]
  __kasan_kmalloc mm/kasan/common.c:497 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
  kasan_kmalloc+0x9/0x10 mm/kasan/common.c:511
  __do_kmalloc mm/slab.c:3721 [inline]
  __kmalloc+0x15c/0x740 mm/slab.c:3730
  kmalloc include/linux/slab.h:553 [inline]
  sk_prot_alloc+0x19c/0x2e0 net/core/sock.c:1573
  sk_alloc+0x39/0xf70 net/core/sock.c:1627
  nr_create+0xb9/0x5e0 net/netrom/af_netrom.c:436
  __sock_create+0x3e6/0x750 net/socket.c:1297
  sock_create net/socket.c:1337 [inline]
  __sys_socket+0x103/0x220 net/socket.c:1367
  __do_sys_socket net/socket.c:1376 [inline]
  __se_sys_socket net/socket.c:1374 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1374
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 7491:
  save_stack+0x45/0xd0 mm/kasan/common.c:75
  set_track mm/kasan/common.c:87 [inline]
  __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
  kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
  __cache_free mm/slab.c:3491 [inline]
  kfree+0xcf/0x230 mm/slab.c:3816
  sk_prot_free net/core/sock.c:1610 [inline]
  __sk_destruct+0x4f1/0x6d0 net/core/sock.c:1692
  sk_destruct+0x7b/0x90 net/core/sock.c:1700
  __sk_free+0xce/0x300 net/core/sock.c:1711
  sk_free+0x42/0x50 net/core/sock.c:1722
  sock_put include/net/sock.h:1708 [inline]
  nr_release+0x337/0x3c0 net/netrom/af_netrom.c:557
  __sock_release+0xd3/0x250 net/socket.c:579
  sock_close+0x1b/0x30 net/socket.c:1161
  __fput+0x2e5/0x8d0 fs/file_table.c:278
  ____fput+0x16/0x20 fs/file_table.c:309
  task_work_run+0x14a/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:166
  prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
  do_syscall_64+0x52d/0x610 arch/x86/entry/common.c:293
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880195fa9c0
  which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 160 bytes inside of
  2048-byte region [ffff8880195fa9c0, ffff8880195fb1c0)
The buggy address belongs to the page:
page:ffffea0000657e80 count:1 mapcount:0 mapping:ffff88812c3f0c40 index:0x0  
compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00026f8a88 ffffea000220c888 ffff88812c3f0c40
raw: 0000000000000000 ffff8880195fa140 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8880195fa900: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  ffff8880195fa980: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff8880195faa00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                        ^
  ffff8880195faa80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8880195fab00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


^ permalink raw reply

* Re: [PATCH bpf-next 2/4] libbpf: Support 32-bit static data loads
From: Y Song @ 2019-02-15  5:38 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, netdev, Daniel Borkmann, Alexei Starovoitov
In-Reply-To: <20190212004729.535-3-joe@wand.net.nz>

On Mon, Feb 11, 2019 at 4:48 PM Joe Stringer <joe@wand.net.nz> wrote:
>
> Support loads of static 32-bit data when BPF writers make use of
> convenience macros for accessing static global data variables. A later
> patch in this series will demonstrate its usage in a selftest.
>
> As of LLVM-7, this technique only works with 32-bit data, as LLVM will
> complain if this technique is attempted with data of other sizes:
>
>     LLVM ERROR: Unsupported relocation: try to compile with -O2 or above,
>     or check your static variable usage

A little bit clarification from compiler side.
The above compiler error is to prevent people use static variables since current
kernel/libbpf does not handle this. The compiler only warns if .bss or
.data section
has more than one definitions. The first definition always has section offset 0
and the compiler did not warn.

The restriction is a little strange. To only work with 32-bit data is
not a right
statement. The following are some examples.

The following static variable definitions will succeed:
static int a; /* one in .bss */
static long b = 2;  /* one in .data */

The following definitions will fail as both in .bss.
static int a;
static int b;

The following definitions will fail as both in .data:
static char a = 2;
static int b = 3;

Using global variables can prevent compiler errors.
maps are defined as globals and the compiler does not
check whether a particular global variable is defining a map or not.

If you just use static variable like below
static int a = 2;
without potential assignment to a, the compiler will replace variable
a with 2 at compile time.
An alternative is to define like below
static volatile int a = 2;
You can get a "load" for variable "a" in the bpf load even if there is
no assignment to a.

Maybe now is the time to remove the compiler assertions as
libbpf/kernel starts to
handle static variables?

>
> Based on the proof of concept by Daniel Borkmann (presented at LPC 2018).
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
>  tools/lib/bpf/libbpf.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 1ec28d5154dc..da35d5559b22 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -140,11 +140,13 @@ struct bpf_program {
>                 enum {
>                         RELO_LD64,
>                         RELO_CALL,
> +                       RELO_DATA,
>                 } type;
>                 int insn_idx;
>                 union {
>                         int map_idx;
>                         int text_off;
> +                       uint32_t data;
>                 };
>         } *reloc_desc;
>         int nr_reloc;
> @@ -210,6 +212,7 @@ struct bpf_object {
>                 Elf *elf;
>                 GElf_Ehdr ehdr;
>                 Elf_Data *symbols;
> +               Elf_Data *global_data;
>                 size_t strtabidx;
>                 struct {
>                         GElf_Shdr shdr;
> @@ -218,6 +221,7 @@ struct bpf_object {
>                 int nr_reloc;
>                 int maps_shndx;
>                 int text_shndx;
> +               int data_shndx;
>         } efile;
>         /*
>          * All loaded bpf_object is linked in a list, which is
> @@ -476,6 +480,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj)
>                 obj->efile.elf = NULL;
>         }
>         obj->efile.symbols = NULL;
> +       obj->efile.global_data = NULL;
>
>         zfree(&obj->efile.reloc);
>         obj->efile.nr_reloc = 0;
> @@ -866,6 +871,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj, int flags)
>                                         pr_warning("failed to alloc program %s (%s): %s",
>                                                    name, obj->path, cp);
>                                 }
> +                       } else if (strcmp(name, ".data") == 0) {
> +                               obj->efile.global_data = data;
> +                               obj->efile.data_shndx = idx;
>                         }
>                 } else if (sh.sh_type == SHT_REL) {
>                         void *reloc = obj->efile.reloc;
> @@ -962,6 +970,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
>         Elf_Data *symbols = obj->efile.symbols;
>         int text_shndx = obj->efile.text_shndx;
>         int maps_shndx = obj->efile.maps_shndx;
> +       int data_shndx = obj->efile.data_shndx;
>         struct bpf_map *maps = obj->maps;
>         size_t nr_maps = obj->nr_maps;
>         int i, nrels;
> @@ -1000,8 +1009,9 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
>                          (long long) (rel.r_info >> 32),
>                          (long long) sym.st_value, sym.st_name);
>
> -               if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx) {
> -                       pr_warning("Program '%s' contains non-map related relo data pointing to section %u\n",
> +               if (sym.st_shndx != maps_shndx && sym.st_shndx != text_shndx &&
> +                   sym.st_shndx != data_shndx) {
> +                       pr_warning("Program '%s' contains unrecognized relo data pointing to section %u\n",
>                                    prog->section_name, sym.st_shndx);
>                         return -LIBBPF_ERRNO__RELOC;
>                 }
> @@ -1046,6 +1056,20 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr,
>                         prog->reloc_desc[i].type = RELO_LD64;
>                         prog->reloc_desc[i].insn_idx = insn_idx;
>                         prog->reloc_desc[i].map_idx = map_idx;
> +               } else if (sym.st_shndx == data_shndx) {
> +                       Elf_Data *global_data = obj->efile.global_data;
> +                       uint32_t *static_data;
> +
> +                       if (sym.st_value + sizeof(uint32_t) > (int)global_data->d_size) {
> +                               pr_warning("bpf relocation: static data load beyond data size %lu\n",
> +                                          global_data->d_size);
> +                               return -LIBBPF_ERRNO__RELOC;
> +                       }
> +
> +                       static_data = global_data->d_buf + sym.st_value;
> +                       prog->reloc_desc[i].type = RELO_DATA;
> +                       prog->reloc_desc[i].insn_idx = insn_idx;
> +                       prog->reloc_desc[i].data = *static_data;
>                 }
>         }
>         return 0;
> @@ -1399,6 +1423,12 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
>                                                       &prog->reloc_desc[i]);
>                         if (err)
>                                 return err;
> +               } else if (prog->reloc_desc[i].type == RELO_DATA) {
> +                       struct bpf_insn *insns = prog->insns;
> +                       int insn_idx;
> +
> +                       insn_idx = prog->reloc_desc[i].insn_idx;
> +                       insns[insn_idx].imm = prog->reloc_desc[i].data;
>                 }
>         }
>
> --
> 2.19.1
>

^ permalink raw reply

* Re: general protection fault in prepare_to_wait
From: syzbot @ 2019-02-15  5:43 UTC (permalink / raw)
  To: davem, linux-hams, linux-kernel, netdev, ralf, syzkaller-bugs,
	xiyou.wangcong
In-Reply-To: <000000000000fa6a2c057e8b7064@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    23e93c9b2cde Revert "gfs2: read journal in large chunks to..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=14e94014c00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=ee434566c893c7b1
dashboard link: https://syzkaller.appspot.com/bug?extid=55f9d3e51d49e20b2ce5
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=109c886cc00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11ed20d0c00000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+55f9d3e51d49e20b2ce5@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_CHANGE): veth0_to_hsr: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): hsr_slave_0: link becomes ready
kasan: CONFIG_KASAN_INLINE enabled
8021q: adding VLAN 0 to HW filter on device batadv0
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 7785 Comm: syz-executor295 Not tainted 5.0.0-rc6+ #72
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:__lock_acquire+0x8df/0x4700 kernel/locking/lockdep.c:3215
Code: 28 00 00 00 0f 85 35 27 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f  
5d c3 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f  
85 dc 27 00 00 49 81 3c 24 20 25 9a 89 0f 84 03 f8
RSP: 0018:ffff888094c6f970 EFLAGS: 00010006
kobject: 'vlan0' (00000000d30c60ff): kobject_add_internal: parent: 'mesh',  
set: '<NULL>'
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000018
RBP: ffff888094c6fb40 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000018
R13: 0000000000000001 R14: 0000000000000000 R15: ffff88808c0f00c0
FS:  00007f9abeeec700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
8021q: adding VLAN 0 to HW filter on device batadv0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kobject: 'vlan0' (00000000a54e32cc): kobject_add_internal: parent: 'mesh',  
set: '<NULL>'
CR2: 00000000004d0910 CR3: 00000000906d2000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
8021q: adding VLAN 0 to HW filter on device batadv0
Call Trace:
kobject: 'vlan0' (0000000016be0e34): kobject_add_internal: parent: 'mesh',  
set: '<NULL>'
  lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3841
  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
  _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:152
  prepare_to_wait+0x7c/0x300 kernel/sched/wait.c:230
  nr_accept+0x239/0x790 net/netrom/af_netrom.c:796
  __sys_accept4+0x350/0x6a0 net/socket.c:1588
  __do_sys_accept net/socket.c:1629 [inline]
  __se_sys_accept net/socket.c:1626 [inline]
  __x64_sys_accept+0x75/0xb0 net/socket.c:1626
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x449489
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f9abeeebcc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
RAX: ffffffffffffffda RBX: 00000000006dfc68 RCX: 0000000000449489
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00000000006dfc60 R08: 00007f9abeeec700 R09: 0000000000000000
R10: 00007f9abeeec700 R11: 0000000000000246 R12: 00000000006dfc6c
R13: 00007ffe8979730f R14: 00007f9abeeec9c0 R15: 0000000000000002
Modules linked in:
---[ end trace 91ccd60fc619e2e6 ]---
RIP: 0010:__lock_acquire+0x8df/0x4700 kernel/locking/lockdep.c:3215
Code: 28 00 00 00 0f 85 35 27 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f  
5d c3 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f  
85 dc 27 00 00 49 81 3c 24 20 25 9a 89 0f 84 03 f8
RSP: 0018:ffff888094c6f970 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000018
RBP: ffff888094c6fb40 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000018
R13: 0000000000000001 R14: 0000000000000000 R15: ffff88808c0f00c0
FS:  00007f9abeeec700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004d0910 CR3: 00000000906d2000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


^ permalink raw reply

* Re: [PATCH bpf-next 2/4] libbpf: Support 32-bit static data loads
From: Joe Stringer @ 2019-02-15  7:16 UTC (permalink / raw)
  To: Y Song; +Cc: Joe Stringer, bpf, netdev, Daniel Borkmann, Alexei Starovoitov
In-Reply-To: <CAH3MdRVU5ayEm6eb9Fz53Q5gjA60vadyJ+0_meCkWBo+t+XXYA@mail.gmail.com>

On Thu, 14 Feb 2019 at 21:39, Y Song <ys114321@gmail.com> wrote:
>
> On Mon, Feb 11, 2019 at 4:48 PM Joe Stringer <joe@wand.net.nz> wrote:
> >
> > Support loads of static 32-bit data when BPF writers make use of
> > convenience macros for accessing static global data variables. A later
> > patch in this series will demonstrate its usage in a selftest.
> >
> > As of LLVM-7, this technique only works with 32-bit data, as LLVM will
> > complain if this technique is attempted with data of other sizes:
> >
> >     LLVM ERROR: Unsupported relocation: try to compile with -O2 or above,
> >     or check your static variable usage
>
> A little bit clarification from compiler side.
> The above compiler error is to prevent people use static variables since current
> kernel/libbpf does not handle this. The compiler only warns if .bss or
> .data section
> has more than one definitions. The first definition always has section offset 0
> and the compiler did not warn.

Ah, interesting. I observed that warning when I attempted to define
global variables of multiple sizes, and I thought also with sizes
other than 32-bit. This clarifies things a bit, thanks.

For the .bss my observation was that if you had a definition like:

static int a = 0;

Then this will be placed into .bss, hence why I looked into the
approach from this patch for patch 3 as well.

> The restriction is a little strange. To only work with 32-bit data is
> not a right
> statement. The following are some examples.
>
> The following static variable definitions will succeed:
> static int a; /* one in .bss */
> static long b = 2;  /* one in .data */
>
> The following definitions will fail as both in .bss.
> static int a;
> static int b;
>
> The following definitions will fail as both in .data:
> static char a = 2;
> static int b = 3;

Are there type restrictions or something? I've been defining multiple
static uint32_t and using them per the approach in this patch series
without hitting this compiler assertion.

> Using global variables can prevent compiler errors.
> maps are defined as globals and the compiler does not
> check whether a particular global variable is defining a map or not.
>
> If you just use static variable like below
> static int a = 2;
> without potential assignment to a, the compiler will replace variable
> a with 2 at compile time.
> An alternative is to define like below
> static volatile int a = 2;
> You can get a "load" for variable "a" in the bpf load even if there is
> no assignment to a.

I'll take a closer look at this too.

> Maybe now is the time to remove the compiler assertions as
> libbpf/kernel starts to
> handle static variables?

I don't understand why those assertions exists in this form. It
already allows code which will not load with libbpf (ie generate any
.data/.bss), does it help prevent unexpected situations for
developers?

^ permalink raw reply

* Re: [PATCH 5/9] perf, bpf: save bpf_prog_info in a rbtree in perf_env
From: Jiri Olsa @ 2019-02-15  7:41 UTC (permalink / raw)
  To: Song Liu
  Cc: Netdev, linux-kernel, ast@kernel.org, daniel@iogearbox.net,
	Kernel Team, peterz@infradead.org, acme@redhat.com
In-Reply-To: <2446DEFA-3CD3-4D3A-BE73-C29037F00C5C@fb.com>

On Thu, Feb 14, 2019 at 05:03:03PM +0000, Song Liu wrote:
> 
> 
> > On Feb 14, 2019, at 4:33 AM, Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > On Fri, Feb 08, 2019 at 05:17:01PM -0800, Song Liu wrote:
> >> bpf_prog_info contains information necessary to annotate bpf programs.
> >> This patch saves bpf_prog_info for bpf programs loaded in the system.
> >> 
> >> perf-record saves bpf_prog_info information as headers to perf.data.
> >> A new header type HEADER_BPF_PROG_INFO is introduced for this data.
> > 
> > please move those 2 changes into separate patches then
> 
> Do you mean one patch to save data in rbtree, then a separate patch 
> to save data in perf.data file?

yes

jirka

^ permalink raw reply

* [PATCH net] vhost: correctly check the return value of translate_desc() in log_used()
From: Jason Wang @ 2019-02-15  7:53 UTC (permalink / raw)
  To: mst, jasowang; +Cc: kvm, virtualization, netdev, linux-kernel, stephen

When fail, translate_desc() returns negative value, otherwise the
number of iovs. So we should fail when the return value is negative
instead of a blindly check against zero.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: cc5e71075947 ("vhost: log dirty page correctly")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 24a129fcdd61..a2e5dc7716e2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1788,7 +1788,7 @@ static int log_used(struct vhost_virtqueue *vq, u64 used_offset, u64 len)
 
 	ret = translate_desc(vq, (uintptr_t)vq->used + used_offset,
 			     len, iov, 64, VHOST_ACCESS_WO);
-	if (ret)
+	if (ret < 0)
 		return ret;
 
 	for (i = 0; i < ret; i++) {
-- 
2.17.1


^ permalink raw reply related

* [PATCH] net: dsa: Implement flow_dissect callback for tag_dsa.
From: Rundong Ge @ 2019-02-15  8:22 UTC (permalink / raw)
  To: andrew; +Cc: vivien.didelot, f.fainelli, davem, netdev, linux-kernel, rdong.ge

RPS not work for DSA devices since the 'skb_get_hash'
will always get the invalid hash for dsa tagged packets.

"[PATCH] tag_mtk: add flow_dissect callback to the ops struct"
introduced the flow_dissect callback to get the right hash for
MTK tagged packet. And tag_dsa also needs to implement
the callback.

Signed-off-by: Rundong Ge <rdong.ge@gmail.com>
---
 net/dsa/tag_dsa.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index 8b2f92e..67ff3fa 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -146,8 +146,17 @@ static struct sk_buff *dsa_rcv(struct sk_buff *skb, struct net_device *dev,
 	return skb;
 }
 
+static int dsa_tag_flow_dissect(const struct sk_buff *skb, __be16 *proto,
+				int *offset)
+{
+	*offset = 4;
+	*proto = ((__be16 *)skb->data)[1];
+	return 0;
+}
+
 const struct dsa_device_ops dsa_netdev_ops = {
 	.xmit	= dsa_xmit,
 	.rcv	= dsa_rcv,
+	.flow_dissect   = dsa_tag_flow_dissect,
 	.overhead = DSA_HLEN,
 };
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next] net: phy: marvell10g: Don't explicitly set Pause and Asym_Pause
From: Maxime Chevallier @ 2019-02-15  8:33 UTC (permalink / raw)
  To: davem
  Cc: Maxime Chevallier, netdev, linux-kernel, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, Russell King, linux-arm-kernel,
	Antoine Tenart, thomas.petazzoni, gregory.clement, miquel.raynal

The PHY core expects PHY drivers not to set Pause and Asym_Pause bits,
unless the driver only wants to specify one of them due to HW
limitation. In the case of the Marvell10g driver, we don't need to set
them.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/marvell10g.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/phy/marvell10g.c b/drivers/net/phy/marvell10g.c
index 496805c0ddfe..c04fe5a75129 100644
--- a/drivers/net/phy/marvell10g.c
+++ b/drivers/net/phy/marvell10g.c
@@ -242,9 +242,6 @@ static int mv3310_config_init(struct phy_device *phydev)
 	    phydev->interface != PHY_INTERFACE_MODE_10GKR)
 		return -ENODEV;
 
-	__set_bit(ETHTOOL_LINK_MODE_Pause_BIT, phydev->supported);
-	__set_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, phydev->supported);
-
 	if (phydev->c45_ids.devices_in_package & MDIO_DEVS_AN) {
 		val = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_STAT1);
 		if (val < 0)
-- 
2.19.2


^ permalink raw reply related

* Re: [PATCH net-next 2/3] ethtool: add compat for flash update
From: Michal Kubecek @ 2019-02-15  8:53 UTC (permalink / raw)
  To: netdev; +Cc: Jakub Kicinski, davem, jiri, oss-drivers, andrew
In-Reply-To: <20190214214046.19182-3-jakub.kicinski@netronome.com>

On Thu, Feb 14, 2019 at 01:40:45PM -0800, Jakub Kicinski wrote:
> If driver does not support ethtool flash update operation
> call into devlink.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
...
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index bd507e13bb7b..d169b5426d3d 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -6435,6 +6435,36 @@ void devlink_compat_running_version(struct net_device *dev,
>  	mutex_unlock(&devlink_mutex);
>  }
>  
> +int devlink_compat_flash_update(struct net_device *dev, const char *file_name)
> +{
> +	struct devlink_port *devlink_port;
> +	struct devlink *devlink;
> +
> +	mutex_lock(&devlink_mutex);
> +	list_for_each_entry(devlink, &devlink_list, list) {
> +		mutex_lock(&devlink->lock);
> +		list_for_each_entry(devlink_port, &devlink->port_list, list) {
> +			int ret = -EOPNOTSUPP;
> +
> +			if (devlink_port->type != DEVLINK_PORT_TYPE_ETH ||
> +			    devlink_port->type_dev != dev)
> +				continue;
> +
> +			mutex_unlock(&devlink_mutex);
> +			if (devlink->ops->flash_update)
> +				ret = devlink->ops->flash_update(devlink,
> +								 file_name,
> +								 NULL, NULL);
> +			mutex_unlock(&devlink->lock);
> +			return ret;
> +		}
> +		mutex_unlock(&devlink->lock);
> +	}
> +	mutex_unlock(&devlink_mutex);
> +
> +	return -EOPNOTSUPP;
> +}
> +
>  static int __init devlink_module_init(void)
>  {
>  	return genl_register_family(&devlink_nl_family);

We already have similar lookup in devlink_compat_running_version() (the
only difference seems to be that we keep holding devlink_mutex until the
end there) and it's likely the net_device -> devlink lookup will be
needed in more places in the future. How about having a helper for it?

I also wonder how does the lookup scale. But I don't have clear idea
how long the lists can become in real life and the ethtool operations
are not really time critical.

Michal Kubecek

^ permalink raw reply

* Re: [net-next] net: stmmac: handle endianness in dwmac4_get_timestamp
From: Alexandre Torgue @ 2019-02-15  9:09 UTC (permalink / raw)
  To: Florian Fainelli, Giuseppe Cavallaro, Jose Abreu, davem
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel
In-Reply-To: <59a381e6-02eb-3c97-bc63-5a8f511a1901@gmail.com>



On 2/14/19 7:09 PM, Florian Fainelli wrote:
> On 2/14/19 9:26 AM, Alexandre Torgue wrote:
>> GMAC IP is little-endian and used on several kind of CPU (big or little
>> endian). Main callbacks functions of the stmmac drivers take care about
>> it. It was not the case for dwmac4_get_timestamp function.
> 
> This is clearly a bugfix, so you should be targeting the 'net' tree and
> provide a Fixes: tag so this can be backported to relevant stable kernels.
> 

Ok Florian, I just have to find the right commit to fix :)

>>
>> Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
>> index 20299f6..736e296 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
>> @@ -241,15 +241,18 @@ static inline void dwmac4_get_timestamp(void *desc, u32 ats, u64 *ts)
>>   static int dwmac4_rx_check_timestamp(void *desc)
>>   {
>>   	struct dma_desc *p = (struct dma_desc *)desc;
>> +	unsigned int rdes0 = le32_to_cpu(p->des0);
>> +	unsigned int rdes1 = le32_to_cpu(p->des1);
>> +	unsigned int rdes3 = le32_to_cpu(p->des3);
>>   	u32 own, ctxt;
>>   	int ret = 1;
>>   
>> -	own = p->des3 & RDES3_OWN;
>> -	ctxt = ((p->des3 & RDES3_CONTEXT_DESCRIPTOR)
>> +	own = rdes3 & RDES3_OWN;
>> +	ctxt = ((rdes3 & RDES3_CONTEXT_DESCRIPTOR)
>>   		>> RDES3_CONTEXT_DESCRIPTOR_SHIFT);
>>   
>>   	if (likely(!own && ctxt)) {
>> -		if ((p->des0 == 0xffffffff) && (p->des1 == 0xffffffff))
>> +		if ((rdes0 == 0xffffffff) && (rdes1 == 0xffffffff))
>>   			/* Corrupted value */
>>   			ret = -EINVAL;
>>   		else
>>
> 
> 

^ permalink raw reply

* [PATCH net-next] iptunnel: NULL pointer deref for ip_md_tunnel_xmit
From: Alan Maguire @ 2019-02-15  9:38 UTC (permalink / raw)
  To: davem
  Cc: naresh.kamboju, kuznet, yoshfuji, ast, daniel, kafai,
	songliubraving, yhs, netdev

Naresh Kamboju noted the following oops during execution of selftest
tools/testing/selftests/bpf/test_tunnel.sh on x86_64:

[  274.120445] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000000
[  274.128285] #PF error: [INSTR]
[  274.131351] PGD 8000000414a0e067 P4D 8000000414a0e067 PUD 3b6334067 PMD 0
[  274.138241] Oops: 0010 [#1] SMP PTI
[  274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted
5.0.0-rc4-next-20190129 #1
[  274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[  274.156526] RIP: 0010:          (null)
[  274.160280] Code: Bad RIP value.
[  274.163509] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
[  274.168726] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
[  274.175851] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
[  274.182974] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
[  274.190098] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
[  274.197222] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
[  274.204346] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
knlGS:0000000000000000
[  274.212424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  274.218162] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
[  274.225292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  274.232416] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  274.239541] Call Trace:
[  274.241988]  ? tnl_update_pmtu+0x296/0x3b0
[  274.246085]  ip_md_tunnel_xmit+0x1bc/0x520
[  274.250176]  gre_fb_xmit+0x330/0x390
[  274.253754]  gre_tap_xmit+0x128/0x180
[  274.257414]  dev_hard_start_xmit+0xb7/0x300
[  274.261598]  sch_direct_xmit+0xf6/0x290
[  274.265430]  __qdisc_run+0x15d/0x5e0
[  274.269007]  __dev_queue_xmit+0x2c5/0xc00
[  274.273011]  ? dev_queue_xmit+0x10/0x20
[  274.276842]  ? eth_header+0x2b/0xc0
[  274.280326]  dev_queue_xmit+0x10/0x20
[  274.283984]  ? dev_queue_xmit+0x10/0x20
[  274.287813]  arp_xmit+0x1a/0xf0
[  274.290952]  arp_send_dst.part.19+0x46/0x60
[  274.295138]  arp_solicit+0x177/0x6b0
[  274.298708]  ? mod_timer+0x18e/0x440
[  274.302281]  neigh_probe+0x57/0x70
[  274.305684]  __neigh_event_send+0x197/0x2d0
[  274.309862]  neigh_resolve_output+0x18c/0x210
[  274.314212]  ip_finish_output2+0x257/0x690
[  274.318304]  ip_finish_output+0x219/0x340
[  274.322314]  ? ip_finish_output+0x219/0x340
[  274.326493]  ip_output+0x76/0x240
[  274.329805]  ? ip_fragment.constprop.53+0x80/0x80
[  274.334510]  ip_local_out+0x3f/0x70
[  274.337992]  ip_send_skb+0x19/0x40
[  274.341391]  ip_push_pending_frames+0x33/0x40
[  274.345740]  raw_sendmsg+0xc15/0x11d0
[  274.349403]  ? __might_fault+0x85/0x90
[  274.353151]  ? _copy_from_user+0x6b/0xa0
[  274.357070]  ? rw_copy_check_uvector+0x54/0x130
[  274.361604]  inet_sendmsg+0x42/0x1c0
[  274.365179]  ? inet_sendmsg+0x42/0x1c0
[  274.368937]  sock_sendmsg+0x3e/0x50
[  274.372460]  ___sys_sendmsg+0x26f/0x2d0
[  274.376293]  ? lock_acquire+0x95/0x190
[  274.380043]  ? __handle_mm_fault+0x7ce/0xb70
[  274.384307]  ? lock_acquire+0x95/0x190
[  274.388053]  ? __audit_syscall_entry+0xdd/0x130
[  274.392586]  ? ktime_get_coarse_real_ts64+0x64/0xc0
[  274.397461]  ? __audit_syscall_entry+0xdd/0x130
[  274.401989]  ? trace_hardirqs_on+0x4c/0x100
[  274.406173]  __sys_sendmsg+0x63/0xa0
[  274.409744]  ? __sys_sendmsg+0x63/0xa0
[  274.413488]  __x64_sys_sendmsg+0x1f/0x30
[  274.417405]  do_syscall_64+0x55/0x190
[  274.421064]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  274.426113] RIP: 0033:0x7ff4ae0e6e87
[  274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00
00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c
24 08
[  274.448422] RSP: 002b:00007ffcd9b76db8 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[  274.455978] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007ff4ae0e6e87
[  274.463104] RDX: 0000000000000000 RSI: 00000000006092e0 RDI: 0000000000000003
[  274.470228] RBP: 0000000000000000 R08: 00007ffcd9bc40a0 R09: 00007ffcd9bc4080
[  274.477349] R10: 000000000000060a R11: 0000000000000246 R12: 0000000000000003
[  274.484475] R13: 0000000000000016 R14: 00007ffcd9b77fa0 R15: 00007ffcd9b78da4
[  274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter
ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded:
test_bpf]
[  274.504634] CR2: 0000000000000000
[  274.507976] ---[ end trace 196d18386545eae1 ]---
[  274.512588] RIP: 0010:          (null)
[  274.516334] Code: Bad RIP value.
[  274.519557] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
[  274.524775] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
[  274.531921] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
[  274.539082] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
[  274.546205] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
[  274.553329] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
[  274.560456] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
knlGS:0000000000000000
[  274.568541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  274.574277] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
[  274.581403] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  274.588535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  274.595658] Kernel panic - not syncing: Fatal exception in interrupt
[  274.602046] Kernel Offset: 0x14400000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  274.612827] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---
[  274.620387] ------------[ cut here ]------------

I'm also seeing the same failure on x86_64, and it reproduces
consistently.

From poking around it looks like the skb's dst entry is being used
to calculate the mtu in:

mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;

...but because that dst_entry  has an "ops" value set to md_dst_ops,
the various ops (including mtu) are not set:

crash> struct sk_buff._skb_refdst ffff928f87447700 -x
      _skb_refdst = 0xffffcd6fbf5ea590
crash> struct dst_entry.ops 0xffffcd6fbf5ea590
  ops = 0xffffffffa0193800
crash> struct dst_ops.mtu 0xffffffffa0193800
  mtu = 0x0
crash>

I confirmed that the dst entry also has dst->input set to
dst_md_discard, so it looks like it's an entry that's been
initialized via __metadata_dst_init alright.

I think the fix here is to use skb_valid_dst(skb) - it checks
for  DST_METADATA also, and with that fix in place, the
problem - which was previously 100% reproducible - disappears.

The below patch resolves the panic and all bpf tunnel tests pass
without incident.

Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
---
 net/ipv4/ip_tunnel.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 893f013..5dcf50c 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -515,9 +515,10 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
 		mtu = dst_mtu(&rt->dst) - dev->hard_header_len
 					- sizeof(struct iphdr) - tunnel_hlen;
 	else
-		mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
+		mtu = skb_valid_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
 
-	skb_dst_update_pmtu(skb, mtu);
+	if (skb_valid_dst(skb))
+		skb_dst_update_pmtu(skb, mtu);
 
 	if (skb->protocol == htons(ETH_P_IP)) {
 		if (!skb_is_gso(skb) &&
@@ -530,9 +531,11 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
 	}
 #if IS_ENABLED(CONFIG_IPV6)
 	else if (skb->protocol == htons(ETH_P_IPV6)) {
-		struct rt6_info *rt6 = (struct rt6_info *)skb_dst(skb);
+		struct rt6_info *rt6;
 		__be32 daddr;
 
+		rt6 = skb_valid_dst(skb) ? (struct rt6_info *)skb_dst(skb) :
+					   NULL;
 		daddr = md ? dst : tunnel->parms.iph.daddr;
 
 		if (rt6 && mtu < dst_mtu(skb_dst(skb)) &&
-- 
1.8.3.1


^ permalink raw reply related

* [net, PATCH v2] net: stmmac: handle endianness in dwmac4_get_timestamp
From: Alexandre Torgue @ 2019-02-15  9:49 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Jose Abreu, davem
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Alexandre Torgue, f.fainelli

GMAC IP is little-endian and used on several kind of CPU (big or little
endian). Main callbacks functions of the stmmac drivers take care about
it. It was not the case for dwmac4_get_timestamp function.

Fixes: ba1ffd74df74 ("stmmac: fix PTP support for GMAC4")

Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>

---

changes since V1:
 -consider this patch as a fix
 -change targeted tree from net-next to net

regards
Alex


diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
index 20299f6..736e296 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c
@@ -241,15 +241,18 @@ static inline void dwmac4_get_timestamp(void *desc, u32 ats, u64 *ts)
 static int dwmac4_rx_check_timestamp(void *desc)
 {
 	struct dma_desc *p = (struct dma_desc *)desc;
+	unsigned int rdes0 = le32_to_cpu(p->des0);
+	unsigned int rdes1 = le32_to_cpu(p->des1);
+	unsigned int rdes3 = le32_to_cpu(p->des3);
 	u32 own, ctxt;
 	int ret = 1;
 
-	own = p->des3 & RDES3_OWN;
-	ctxt = ((p->des3 & RDES3_CONTEXT_DESCRIPTOR)
+	own = rdes3 & RDES3_OWN;
+	ctxt = ((rdes3 & RDES3_CONTEXT_DESCRIPTOR)
 		>> RDES3_CONTEXT_DESCRIPTOR_SHIFT);
 
 	if (likely(!own && ctxt)) {
-		if ((p->des0 == 0xffffffff) && (p->des1 == 0xffffffff))
+		if ((rdes0 == 0xffffffff) && (rdes1 == 0xffffffff))
 			/* Corrupted value */
 			ret = -EINVAL;
 		else
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH] net: phy: at803x: disable delay only for RGMII mode
From: Vinod Koul @ 2019-02-15  9:58 UTC (permalink / raw)
  To: David Miller
  Cc: linux-arm-msm, bjorn.andersson, netdev, niklas.cassel, andrew,
	f.fainelli, nsekhar, peter.ujfalusi, marc.w.gonzalez
In-Reply-To: <20190214.083828.206479765039661735.davem@davemloft.net>

On 14-02-19, 08:38, David Miller wrote:
> From: Vinod Koul <vkoul@kernel.org>
> Date: Tue, 12 Feb 2019 19:49:22 +0530
> 
> > diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> > index 8ff12938ab47..7b54b54e3316 100644
> > --- a/drivers/net/phy/at803x.c
> > +++ b/drivers/net/phy/at803x.c
> > @@ -110,6 +110,18 @@ static int at803x_debug_reg_mask(struct phy_device *phydev, u16 reg,
> >  	return phy_write(phydev, AT803X_DEBUG_DATA, val);
> >  }
> >  
> > +static inline int at803x_enable_rx_delay(struct phy_device *phydev)
> > +{
> > +	return at803x_debug_reg_mask(phydev, AT803X_DEBUG_REG_0, 0,
> > +				     AT803X_DEBUG_RX_CLK_DLY_EN);
> > +}
> > +
> > +static inline int at803x_enable_tx_delay(struct phy_device *phydev)
> > +{
> > +	return at803x_debug_reg_mask(phydev, AT803X_DEBUG_REG_5, 0,
> > +				     AT803X_DEBUG_TX_CLK_DLY_EN);
> > +}
> > +
> 
> Please do not use the inline directive in foo.c files, let the compiler
> decide.

Sure, this was in existing code, will fix up current occurrences as well.

Thanks
-- 
~Vinod

^ permalink raw reply

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
From: Vlad Buslov @ 2019-02-15 10:02 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev@vger.kernel.org, jhs@mojatatu.com,
	xiyou.wangcong@gmail.com, jiri@resnulli.us, davem@davemloft.net,
	ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190214182442.GA19269@splinter>


On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
> On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> when accessing filter_chain list, instead of relying on rtnl lock.
>> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> verify that all users of chain_list have the lock taken.
>>
>> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> all necessary code while holding chain lock in order to prevent
>> invalidation of chain_info structure by potential concurrent change. This
>> also serializes calls to tcf_chain0_head_change(), which allows head change
>> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> mutex.
>>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> Acked-by: Jiri Pirko <jiri@mellanox.com>
>
> With this sequence [1] I get the following trace [2]. Bisected it to
> this patch. Note that second filter is rejected by the device driver
> (that's the intention). I guess it is not properly removed from the
> filter chain in the error path?
>
> Thanks
>
> [1]
> # tc qdisc add dev swp3 clsact
> # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
> 	action mirred egress mirror dev swp7
> # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
> 	action mirred egress mirror dev swp7
> RTNETLINK answers: File exists
> We have an error talking to the kernel, -1
> # tc qdisc del dev swp3 clsact
>
> [2]
> [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
> [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
> [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
> [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
> [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
> [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
> e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
> [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
> [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
> [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
> [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
> [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
> [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
> [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
> [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
> [   70.675633] Call Trace:
> [   70.693046]  tcf_block_playback_offloads+0x94/0x230
> [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
> [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
> [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
> [   70.748898]  __tcf_block_put+0x203/0x630
> [   70.769700]  tcf_block_put_ext+0x2f/0x40
> [   70.774098]  clsact_destroy+0x7a/0xb0
> [   70.782604]  qdisc_destroy+0x11a/0x5c0
> [   70.786810]  qdisc_put+0x5a/0x70
> [   70.790435]  notify_and_destroy+0x8e/0xa0
> [   70.794933]  qdisc_graft+0xbb7/0xef0
> [   70.809009]  tc_get_qdisc+0x518/0xa30
> [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
> [   70.835510]  netlink_rcv_skb+0x132/0x380
> [   70.848419]  netlink_unicast+0x4c0/0x690
> [   70.866019]  netlink_sendmsg+0x929/0xe10
> [   70.882134]  sock_sendmsg+0xc8/0x110
> [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
> [   70.924064]  __sys_sendmsg+0xf7/0x250
> [   70.955727]  do_syscall_64+0x14d/0x610

Hi Ido,

Thanks for reporting this.

I looked at the code and problem seems to be matchall classifier
specific. My implementation of unlocked cls API assumes that concurrent
insertions are possible and checks for it when deleting "empty" tp.
Since classifiers don't expose number of elements, the only way to test
this is to do tp->walk() on them and assume that walk callback is called
once per filter on every classifier. In your example new tp is created
for second filter, filter insertion fails, number of elements on newly
created tp is checked with tp->walk() before deleting it. However,
matchall classifier always calls the tp->walk() callback once, even when
it doesn't have a valid filter (in this case with NULL filter pointer).

Possible ways to fix this:

1) Check for NULL filter pointer in tp->walk() callback and ignore it
when counting filters on tp - will work but I don't like it because I
don't think it is ever correct to call tp->walk() callback with NULL
filter pointer.

2) Fix matchall to not call tp->walk() callback with NULL filter
pointers - my preferred simple solution.

3) Extend tp API to have direct way to count filters by implementing
tp->count - requires change to every classifier. Or maybe add it as
optional API that only unlocked classifiers are required to implement
and just delete rtnl lock dependent tp's without checking for concurrent
insertions.

What do you think?

Regards,
Vlad

^ permalink raw reply

* [PATCH net-next v2 1/3] arm64: dts: fsl: ls1028a: Add PCI IERC node and ENETC endpoints
From: Claudiu Manoil @ 2019-02-15 10:10 UTC (permalink / raw)
  To: Shawn Guo, Li Yang, David S . Miller
  Cc: alexandru.marginean, linux-arm-kernel, devicetree, netdev,
	linux-kernel
In-Reply-To: <1550225414-12125-1-git-send-email-claudiu.manoil@nxp.com>

The LS1028A SoC features a PCI Integrated Endpoint Root Complex
(IERC) defining several integrated PCI devices, including the ENETC
ethernet controller integrated endpoints (IEPs). The IERC implements
ECAM (Enhanced Configuration Access Mechanism) to provide access
to the PCIe config space of the IEPs. This means the the IEPs
(including ENETC) do not support the standard PCIe BARs, instead
the Enhanced Allocation (EA) capability structures in the ECAM space
are used to fix the base addresses in the system, and the PCI
subsystem uses these structures for device enumeration and discovery.
The "ranges" entries contain basic information from these EA capabily
structures required by the kernel for device enumeration.

The current patch also enables the first 2 ENETC PFs (Physiscal
Functions) and the associated VFs (Virtual Functions), 2 VFs for
each PF.  Each of these ENETC PFs has an external ethernet port
on the LS1028A SoC.

Signed-off-by: Alex Marginean <alexandru.marginean@nxp.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
---
v2 - none

 arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi | 33 ++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
index a8cf92a..7f5a8e6 100644
--- a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
+++ b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
@@ -335,5 +335,38 @@
 				     <GIC_SPI 206 IRQ_TYPE_LEVEL_HIGH>, <GIC_SPI 207 IRQ_TYPE_LEVEL_HIGH>,
 				     <GIC_SPI 208 IRQ_TYPE_LEVEL_HIGH>, <GIC_SPI 209 IRQ_TYPE_LEVEL_HIGH>;
 		};
+
+		pcie@1f0000000 { /* Integrated Endpoint Root Complex */
+			compatible = "pci-host-ecam-generic";
+			reg = <0x01 0xf0000000 0x0 0x100000>;
+			#address-cells = <3>;
+			#size-cells = <2>;
+			#interrupt-cells = <1>;
+			msi-parent = <&its>;
+			device_type = "pci";
+			bus-range = <0x0 0x0>;
+			dma-coherent;
+			msi-map = <0 &its 0x17 0xe>;
+			iommu-map = <0 &smmu 0x17 0xe>;
+				  /* PF0-6 BAR0 - non-prefetchable memory */
+			ranges = <0x82000000 0x0 0x00000000  0x1 0xf8000000  0x0 0x160000
+				  /* PF0-6 BAR2 - prefetchable memory */
+				  0xc2000000 0x0 0x00000000  0x1 0xf8160000  0x0 0x070000
+				  /* PF0: VF0-1 BAR0 - non-prefetchable memory */
+				  0x82000000 0x0 0x00000000  0x1 0xf81d0000  0x0 0x020000
+				  /* PF0: VF0-1 BAR2 - prefetchable memory */
+				  0xc2000000 0x0 0x00000000  0x1 0xf81f0000  0x0 0x020000
+				  /* PF1: VF0-1 BAR0 - non-prefetchable memory */
+				  0x82000000 0x0 0x00000000  0x1 0xf8210000  0x0 0x020000
+				  /* PF1: VF0-1 BAR2 - prefetchable memory */
+				  0xc2000000 0x0 0x00000000  0x1 0xf8230000  0x0 0x020000>;
+
+			enetc_port0: pci@0,0 {
+				reg = <0x000000 0 0 0 0>;
+			};
+			enetc_port1: pci@0,1 {
+				reg = <0x000100 0 0 0 0>;
+			};
+		};
 	};
 };
-- 
2.7.4


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox