Netdev List
 help / color / mirror / Atom feed
* [PATCH net] net: ipmr: fix unresolved entry dumps
From: Nikolay Aleksandrov @ 2018-10-17 19:34 UTC (permalink / raw)
  To: netdev; +Cc: davem, Nikolay Aleksandrov
In-Reply-To: <7e1a20a0-9ab2-5680-bde9-185c72242680@cumulusnetworks.com>

If the skb space ends in an unresolved entry while dumping we'll miss
some unresolved entries. The reason is due to zeroing the entry counter
between dumping resolved and unresolved mfc entries. We should just
keep counting until the whole table is dumped and zero when we move to
the next as we have a separate table counter.

Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: 8fb472c09b9d ("ipmr: improve hash scalability")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
Dropped Yuval's mail because it bounces.

 net/ipv4/ipmr_base.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 1ad9aa62a97b..eab8cd5ec2f5 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -296,8 +296,6 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 next_entry:
 			e++;
 		}
-		e = 0;
-		s_e = 0;
 
 		spin_lock_bh(lock);
 		list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
-- 
2.17.2

^ permalink raw reply related

* Re: [PATCH v3] virtio_net: avoid using netif_tx_disable() for serializing tx routine
From: ake @ 2018-10-18  3:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, David S. Miller, virtualization, netdev, linux-kernel
In-Reply-To: <20181017110741-mutt-send-email-mst@kernel.org>



On 2018/10/18 0:09, Michael S. Tsirkin wrote:
> On Wed, Oct 17, 2018 at 07:44:12PM +0900, Ake Koomsin wrote:
>> Commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
>> introduces netif_tx_disable() after netif_device_detach() in order to
>> avoid use-after-free of tx queues. However, there are two issues.
>>
>> 1) Its operation is redundant with netif_device_detach() in case the
>>    interface is running.
>> 2) In case of the interface is not running before suspending and
>>    resuming, the tx does not get resumed by netif_device_attach().
>>    This results in losing network connectivity.
>>
>> It is better to use netif_tx_lock_bh()/netif_tx_unlock_bh() instead for
>> serializing tx routine during reset. This also preserves the symmetry
>> of netif_device_detach() and netif_device_attach().
>>
>> Fixes commit 713a98d90c5e ("virtio-net: serialize tx routine during reset")
>> Signed-off-by: Ake Koomsin <ake@igel.co.jp>
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Thanks a lot for debugging!
> Seems like stable material to me, right?

Yes. With this patch, we can avoid network connectivity lost
because of tx not get re-enabled under some situation. Plus, it avoids
redundant operation between netif_device_detach() and
netif_tx_disable().

I tested the patch on Linux net-next and QEMU master by suspending/
resuming the virtual machine repeatedly. The network looks no problem
and has no connectivity lost so far. I tested with both user-mode
networking and tap interface.

Best Regards

>> ---
>>  drivers/net/virtio_net.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 3f5aa59c37b7..3e2c041d76ac 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2267,8 +2267,9 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
>>  	/* Make sure no work handler is accessing the device */
>>  	flush_work(&vi->config_work);
>>  
>> +	netif_tx_lock_bh(vi->dev);
>>  	netif_device_detach(vi->dev);
>> -	netif_tx_disable(vi->dev);
>> +	netif_tx_unlock_bh(vi->dev);
>>  	cancel_delayed_work_sync(&vi->refill);
>>  
>>  	if (netif_running(vi->dev)) {
>> @@ -2304,7 +2305,9 @@ static int virtnet_restore_up(struct virtio_device *vdev)
>>  		}
>>  	}
>>  
>> +	netif_tx_lock_bh(vi->dev);
>>  	netif_device_attach(vi->dev);
>> +	netif_tx_unlock_bh(vi->dev);
>>  	return err;
>>  }
>>  
>> -- 
>> 2.19.1

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-17 19:27 UTC (permalink / raw)
  To: Holger Hoffstätte, David Miller,
	Realtek linux nic maintainers
  Cc: netdev@vger.kernel.org
In-Reply-To: <e7607b0f-86ed-6784-638c-37d68d910fb7@applied-asynchrony.com>

On 17.10.2018 21:11, Holger Hoffstätte wrote:
> On 10/17/18 20:12, Heiner Kallweit wrote:
>> On 16.10.2018 23:17, Holger Hoffstätte wrote:
>>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>>>> in the interrupt status register. Under high load NAPI may not be
>>>> able to process all data (work_done == budget) and it will schedule
>>>> subsequent calls to the poll callback.
>>>> rtl_ack_events() however resets the bits in the interrupt status
>>>> register, therefore subsequent calls to rtl8169_poll() won't call
>>>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>>
>>> Very interesting! Could this be the reason for the mysterious
>>> hangs & resets we experienced when enabling BQL for r8169?
>>> They happened more often with TSO/GSO enabled and several people
>>> attempted to fix those hangs unsuccessfully; it was later reverted
>>> and has been since then (#87cda7cb43).
>>> If this bug has been there "forever" it might be tempting to
>>> re-apply BQL and see what happens. Any chance you could give that
>>> a try? I'll gladly test patches, just like I'll run this one.
>>>
>> After reading through the old mail threads regarding BQL on r8169
>> I don't think the fix here is related.
>> It seems that BQL on r8169 worked fine for most people, just one
>> had problems on one of his systems. I assume the issue was specific
> 
> I continued to use the BQL patch in my private tree after it was reverted
> and also had occasional timeouts, but *only* after I started playing
> with ethtool to change offload settings. Without offloads or the BQL patch
> everything has been rock-solid since then.
> The other weird problem was that timeouts would occur on an otherwise
> *completely idle* system. Since that occasionally borked my NFS server
> over night I ultimately removed BQL as well. Rock-solid since then.
> 
>> I will apply the old BQL patch and see how it's on my system
>> (with GRO and SG enabled).
> 
> I don't think it still applies cleanly, but if you cook up an updated
> version I'll gladly test it.
> 
> Thanks! :)
> Holger
> 

Good to know. What's your kernel version and RTL8168 chip version?
Regarding the chip version the dmesg line with the XID would be relevant.

Below is the slightly modified original BQL patch, I just moved the call
to netdev_reset_queue(). This patch applies at least to latest linux-next.

My test system:
- RTL8168evl
- latest linux-next
- BQL patch applied
- SG/GRO enabled:

rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on

I briefly tested normal operation and did some tests with iperf3.
Everything looks good so far.

---
 drivers/net/ethernet/realtek/r8169.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0d8070adc..e236b46b8 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5852,6 +5852,7 @@ static void rtl8169_tx_clear(struct rtl8169_private *tp)
 {
 	rtl8169_tx_clear_range(tp, tp->dirty_tx, NUM_TX_DESC);
 	tp->cur_tx = tp->dirty_tx = 0;
+	netdev_reset_queue(tp->dev);
 }
 
 static void rtl_reset_work(struct rtl8169_private *tp)
@@ -6154,6 +6155,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	txd->opts2 = cpu_to_le32(opts[1]);
 
+	netdev_sent_queue(dev, skb->len);
+
 	skb_tx_timestamp(skb);
 
 	/* Force memory writes to complete before releasing descriptor */
@@ -6252,7 +6255,7 @@ static void rtl8169_pcierr_interrupt(struct net_device *dev)
 
 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 {
-	unsigned int dirty_tx, tx_left;
+	unsigned int dirty_tx, tx_left, bytes_compl = 0, pkts_compl = 0;
 
 	dirty_tx = tp->dirty_tx;
 	smp_rmb();
@@ -6276,10 +6279,8 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		rtl8169_unmap_tx_skb(tp_to_dev(tp), tx_skb,
 				     tp->TxDescArray + entry);
 		if (status & LastFrag) {
-			u64_stats_update_begin(&tp->tx_stats.syncp);
-			tp->tx_stats.packets++;
-			tp->tx_stats.bytes += tx_skb->skb->len;
-			u64_stats_update_end(&tp->tx_stats.syncp);
+			pkts_compl++;
+			bytes_compl += tx_skb->skb->len;
 			dev_consume_skb_any(tx_skb->skb);
 			tx_skb->skb = NULL;
 		}
@@ -6288,6 +6289,13 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 	}
 
 	if (tp->dirty_tx != dirty_tx) {
+		netdev_completed_queue(dev, pkts_compl, bytes_compl);
+
+		u64_stats_update_begin(&tp->tx_stats.syncp);
+		tp->tx_stats.packets += pkts_compl;
+		tp->tx_stats.bytes += bytes_compl;
+		u64_stats_update_end(&tp->tx_stats.syncp);
+
 		tp->dirty_tx = dirty_tx;
 		/* Sync with rtl8169_start_xmit:
 		 * - publish dirty_tx ring index (write barrier)
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH] isdn: hfc_{pci,sx}: Avoid empty body if statements and use proper register accessors
From: Masahiro Yamada @ 2018-10-18  3:23 UTC (permalink / raw)
  To: Nathan Chancellor; +Cc: isdn, Networking, Linux Kernel Mailing List
In-Reply-To: <20181017180657.9410-1-natechancellor@gmail.com>

Hi Nathan,

On Thu, Oct 18, 2018 at 3:09 AM Nathan Chancellor
<natechancellor@gmail.com> wrote:
>
> Clang warns:
>
> drivers/isdn/hisax/hfc_pci.c:131:34: error: if statement has empty body
> [-Werror,-Wempty-body]
>         if (Read_hfc(cs, HFCPCI_INT_S1));
>                                         ^
> drivers/isdn/hisax/hfc_pci.c:131:34: note: put the semicolon on a
> separate line to silence this warning
>
> Use the format found in drivers/isdn/hardware/mISDN/hfcpci.c of casting
> the return of Read_hfc to void, instead of using an empty if statement.
>
> While we're at it, Masahiro Yamada pointed out that {Read,Write}_hfc
> should be using a standard access method in hfc_pci.h. Use the one found
> in drivers/isdn/hardware/mISDN/hfc_pci.h.
>
> Link: https://github.com/ClangBuiltLinux/linux/issues/66
> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> ---
>  drivers/isdn/hisax/hfc_pci.c | 6 +++---
>  drivers/isdn/hisax/hfc_pci.h | 4 ++--
>  drivers/isdn/hisax/hfc_sx.c  | 6 +++---
>  3 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/isdn/hisax/hfc_pci.c b/drivers/isdn/hisax/hfc_pci.c
> index 8e5b03161b2f..a63b9155b697 100644
> --- a/drivers/isdn/hisax/hfc_pci.c
> +++ b/drivers/isdn/hisax/hfc_pci.c
> @@ -128,7 +128,7 @@ reset_hfcpci(struct IsdnCardState *cs)
>         Write_hfc(cs, HFCPCI_INT_M1, cs->hw.hfcpci.int_m1);
>
>         /* Clear already pending ints */
> -       if (Read_hfc(cs, HFCPCI_INT_S1));
> +       (void) Read_hfc(cs, HFCPCI_INT_S1);



Why is this '(void)' necessary?

I see no warning without it.



-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Holger Hoffstätte @ 2018-10-17 19:11 UTC (permalink / raw)
  To: Heiner Kallweit, David Miller, Realtek linux nic maintainers
  Cc: netdev@vger.kernel.org
In-Reply-To: <62974f0f-1938-3635-69d4-204ed8c587b3@gmail.com>

On 10/17/18 20:12, Heiner Kallweit wrote:
> On 16.10.2018 23:17, Holger Hoffstätte wrote:
>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>>> in the interrupt status register. Under high load NAPI may not be
>>> able to process all data (work_done == budget) and it will schedule
>>> subsequent calls to the poll callback.
>>> rtl_ack_events() however resets the bits in the interrupt status
>>> register, therefore subsequent calls to rtl8169_poll() won't call
>>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>
>> Very interesting! Could this be the reason for the mysterious
>> hangs & resets we experienced when enabling BQL for r8169?
>> They happened more often with TSO/GSO enabled and several people
>> attempted to fix those hangs unsuccessfully; it was later reverted
>> and has been since then (#87cda7cb43).
>> If this bug has been there "forever" it might be tempting to
>> re-apply BQL and see what happens. Any chance you could give that
>> a try? I'll gladly test patches, just like I'll run this one.
>>
> After reading through the old mail threads regarding BQL on r8169
> I don't think the fix here is related.
> It seems that BQL on r8169 worked fine for most people, just one
> had problems on one of his systems. I assume the issue was specific

I continued to use the BQL patch in my private tree after it was reverted
and also had occasional timeouts, but *only* after I started playing
with ethtool to change offload settings. Without offloads or the BQL patch
everything has been rock-solid since then.
The other weird problem was that timeouts would occur on an otherwise
*completely idle* system. Since that occasionally borked my NFS server
over night I ultimately removed BQL as well. Rock-solid since then.

> I will apply the old BQL patch and see how it's on my system
> (with GRO and SG enabled).

I don't think it still applies cleanly, but if you cook up an updated
version I'll gladly test it.

Thanks! :)
Holger

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: Alexei Starovoitov @ 2018-10-17 19:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking,
	Kernel Team
In-Reply-To: <20181017185347.GB4589@kernel.org>

On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu:
>> On 10/17/18 8:09 AM, David Ahern wrote:
>>> On 10/16/18 11:43 PM, Song Liu wrote:
>>>> I agree that processing events while recording has significant overhead.
>>>> In this case, perf user space need to know details about the the jited BPF
>>>> program. It is impossible to pass all these details to user space through
>>>> the relatively stable ring_buffer API. Therefore, some processing of the
>>>> data is necessary (get bpf prog_id from ring buffer, and then fetch program
>>>> details via BPF_OBJ_GET_INFO_BY_FD.
>>>>
>>>> I have some idea on processing important data with relatively low overhead.
>>>> Let me try implement it.
>>>>
>>>
>>> As I understand it, you want this series:
>>>
>>>  kernel: add event to perf buffer on bpf prog load
>>>
>>>  userspace: perf reads the event and grabs information about the program
>>> from the fd
>>>
>>> Is that correct?
>>>
>>> Userpsace is not awakened immediately when an event is added the the
>>> ring. It is awakened once the number of events crosses a watermark. That
>>> means there is an unknown - and potentially long - time window where the
>>> program can be unloaded before perf reads the event.
>
>>> So no matter what you do expecting perf record to be able to process the
>>> event quickly is an unreasonable expectation.
>
>> yes... unless we go with threaded model as Arnaldo suggested and use
>> single event as a watermark to wakeup our perf thread.
>> In such case there is still a race window between user space waking up
>> and doing _single_ bpf_get_fd_from_id() call to hold that prog
>> and some other process trying to instantly unload the prog it
>> just loaded.
>> I think such race window is extremely tiny and if perf misses
>> those load/unload events it's a good thing, since there is no chance
>> that normal pmu event samples would be happening during prog execution.
>
>> The alternative approach with no race window at all is to burden kernel
>> RECORD_* events with _all_ information about bpf prog. Which is jited
>> addresses, jited image itself, info about all subprogs, info about line
>> info, all BTF data, etc. As I said earlier I'm strongly against such
>> RECORD_* bloating.
>> Instead we need to find a way to process new RECORD_BPF events with
>> single prog_id field in perf user space with minimal race
>> and threaded approach sounds like a win to me.
>
> There is another alternative, I think: put just a content based hash,
> like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
> validator does the jit, etc, it stashes the content that
> BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
> the kernel right after getting stuff from sys_bpf() and preparing it for
> use, then we know that in (start, end) we have blob foo with content id,
> that we will use to retrieve information that augments what we know with
> just (start, end, id) and allows annotation, etc.
>
> That stash space for jitted stuff gets garbage collected from time to
> time or is even completely disabled if the user is not interested in
> such augmentation, just like one can do disabling perf's ~/.debug/
> directory hashed by build-id.
>
> I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
> is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
> cookie and we solve the other race we already have with kernel modules,
> DSOs, etc.
>
> I have mentioned this before, there were objections, perhaps this time I
> formulated in a different way that makes it more interesting?

that 'content based hash' we already have. It's called program tag.
and we already taught iovisor/bcc to stash that stuff into
/var/tmp/bcc/bpf_prog_TAG/ directory.
Unfortunately that approach didn't work.
JITed image only exists in the kernel. It's there only when
program is loaded and it's the one that matter the most for performance
analysis, since sample IPs are pointing into it.
Also the line info mapping that user space knows is not helping much
either, since verifier optimizes the instructions and then JIT
does more. The debug_info <-> JIT relationship must be preserved
by the kernel and returned to user space.

The only other non-race way is to keep all that info in the kernel after
program is unloaded, but that is even worse than bloating RECORD*s

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Song Liu @ 2018-10-17 19:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, netdev@vger.kernel.org, ast@kernel.org,
	daniel@iogearbox.net, Kernel Team
In-Reply-To: <7c6474c0-27a2-bd6c-cd0f-c835856224cb@fb.com>



> On Oct 17, 2018, at 12:02 PM, Alexei Starovoitov <ast@fb.com> wrote:
> 
> On 10/17/18 10:26 AM, Alexei Starovoitov wrote:
>> On Tue, Oct 16, 2018 at 10:56:05PM -0700, Song Liu wrote:
>>> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
>>> skb. This patch enables direct access of skb for these programs.
>> 
>> The lack of direct packet access in CGROUP_SKB progs was
>> an unpleasant surprise to me, so thank you for fixing it,
>> but there are few issues with the patch. See below.
>> 
>>> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
>>> to compute proper data_end for the BPF program.
>>> 
>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>> ---
>>> kernel/bpf/cgroup.c |  4 ++++
>>> net/core/filter.c   | 26 +++++++++++++++++++++++++-
>>> 2 files changed, 29 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>>> index 00f6ed2e4f9a..340d496f35bd 100644
>>> --- a/kernel/bpf/cgroup.c
>>> +++ b/kernel/bpf/cgroup.c
>>> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>> 	save_sk = skb->sk;
>>> 	skb->sk = sk;
>>> 	__skb_push(skb, offset);
>>> +
>>> +	/* compute pointers for the bpf prog */
>>> +	bpf_compute_data_pointers(skb);
>>> +
>>> 	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
>>> 				 bpf_prog_run_save_cb);
>>> 	__skb_pull(skb, offset);
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 1a3ac6c46873..8b5a502e241f 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -5346,6 +5346,30 @@ static bool sk_filter_is_valid_access(int off, int size,
>>> 	return bpf_skb_is_valid_access(off, size, type, prog, info);
>>> }
>>> 
>>> +static bool cg_skb_is_valid_access(int off, int size,
>>> +				   enum bpf_access_type type,
>>> +				   const struct bpf_prog *prog,
>>> +				   struct bpf_insn_access_aux *info)
>>> +{
>>> +	if (type == BPF_WRITE)
>>> +		return false;
>> 
>> this disables writes into cb[0..4] that were allowed for cgroup_inet_* before.
>> One can argue that this may break existing progs,
>> but looking at the place where BPF_CGROUP_RUN_PROG_INET_INGRESS is called
>> it seems it's actually not correct in all cases to access cb there.
>> Just few lines down we call bpf_prog_run_save_cb() which save/restores
>> these 24 bytes.
>> So we have two option either add save/restore for INET_INGRESS only
>> or disable read and write access to cb[0..4] for CGROUP_SKB progs.
>> I prefer the former.
>> 
>>> +
>>> +	switch (off) {
>>> +	case bpf_ctx_range(struct __sk_buff, len):
>>> +		break;
>>> +	case bpf_ctx_range(struct __sk_buff, data):
>>> +		info->reg_type = PTR_TO_PACKET;
>>> +		break;
>>> +	case bpf_ctx_range(struct __sk_buff, data_end):
>>> +		info->reg_type = PTR_TO_PACKET_END;
>>> +		break;
>>> +	default:
>>> +		return false;
>>> +	}
>> 
>> this also enables access to a range of fields family..local_port.
>> It's ok to do for egress, but not for ingress unless we
>> add code similar to the bottom of sk_filter_trim_cap() that
>> inits skb->sk.
>> 
>> above change also allows access to data_meta and flow_keys
>> which is not correct.
>> 
>> Considering all that I'm proposing to fix INET_INGRESS call site
>> similar to code below it in sk_filter_trim_cap().
>> In particular to do:
>> struct sock *save_sk = skb->sk;
>> skb->sk = sk;
>> save and clear cb
>> BPF_CGROUP_RUN_PROG_INET_INGRESS
>> restore cb
>> skb->sk = save_sk;
>> 
>> all of above can probaby be inside BPF_CGROUP_RUN_PROG_INET_INGRESS macro.
>> Then in this cg_skb_is_valid_access() allow access to data/data_end
>> and family..local_port range as well.
>> while disallowing access to flow_keys and data_meta.
>> 
>> In patch 2 we gotta have tests for all these fields.
>> 
>> Thoughts?
> 
> chatted with Song offline.
> I completely misread 'return false' in the above as 'break'.
> The patch actually disables access to pkt_type, mark, queue_mapping
> and so on. Which is not correct either.
> Since tests were not failing we really need to improve this aspect
> of test coverage in test_verifier.c
> 
> Also I missed that __cgroup_bpf_run_filter_skb() already
> does save_sk = skb->sk; skb->sk = sk;
> and bpf_prog_run_save_cb()
> So no issue in the existing code. That was false alarm.
> Revising the proposal...
> I think cg_skb_is_valid_access() can be made similar to
> lwt_is_valid_access().
> Allowing writes into mark, priority, cb[0..4]
> and read of data/data_end.
> In addition it's also ok to allow family..local_port range
> (unlike lwt where sk may not be present).
> and no access to data_meta and flow_keys.

Thanks Alexei! I will send v2 shortly. 

Song

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Alexei Starovoitov @ 2018-10-17 19:02 UTC (permalink / raw)
  To: Alexei Starovoitov, Song Liu
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	Kernel Team
In-Reply-To: <20181017172636.57adi6yv7znuaqg5@ast-mbp.dhcp.thefacebook.com>

On 10/17/18 10:26 AM, Alexei Starovoitov wrote:
> On Tue, Oct 16, 2018 at 10:56:05PM -0700, Song Liu wrote:
>> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
>> skb. This patch enables direct access of skb for these programs.
>
> The lack of direct packet access in CGROUP_SKB progs was
> an unpleasant surprise to me, so thank you for fixing it,
> but there are few issues with the patch. See below.
>
>> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
>> to compute proper data_end for the BPF program.
>>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>>  kernel/bpf/cgroup.c |  4 ++++
>>  net/core/filter.c   | 26 +++++++++++++++++++++++++-
>>  2 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 00f6ed2e4f9a..340d496f35bd 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>  	save_sk = skb->sk;
>>  	skb->sk = sk;
>>  	__skb_push(skb, offset);
>> +
>> +	/* compute pointers for the bpf prog */
>> +	bpf_compute_data_pointers(skb);
>> +
>>  	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
>>  				 bpf_prog_run_save_cb);
>>  	__skb_pull(skb, offset);
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 1a3ac6c46873..8b5a502e241f 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -5346,6 +5346,30 @@ static bool sk_filter_is_valid_access(int off, int size,
>>  	return bpf_skb_is_valid_access(off, size, type, prog, info);
>>  }
>>
>> +static bool cg_skb_is_valid_access(int off, int size,
>> +				   enum bpf_access_type type,
>> +				   const struct bpf_prog *prog,
>> +				   struct bpf_insn_access_aux *info)
>> +{
>> +	if (type == BPF_WRITE)
>> +		return false;
>
> this disables writes into cb[0..4] that were allowed for cgroup_inet_* before.
> One can argue that this may break existing progs,
> but looking at the place where BPF_CGROUP_RUN_PROG_INET_INGRESS is called
> it seems it's actually not correct in all cases to access cb there.
> Just few lines down we call bpf_prog_run_save_cb() which save/restores
> these 24 bytes.
> So we have two option either add save/restore for INET_INGRESS only
> or disable read and write access to cb[0..4] for CGROUP_SKB progs.
> I prefer the former.
>
>> +
>> +	switch (off) {
>> +	case bpf_ctx_range(struct __sk_buff, len):
>> +		break;
>> +	case bpf_ctx_range(struct __sk_buff, data):
>> +		info->reg_type = PTR_TO_PACKET;
>> +		break;
>> +	case bpf_ctx_range(struct __sk_buff, data_end):
>> +		info->reg_type = PTR_TO_PACKET_END;
>> +		break;
>> +	default:
>> +		return false;
>> +	}
>
> this also enables access to a range of fields family..local_port.
> It's ok to do for egress, but not for ingress unless we
> add code similar to the bottom of sk_filter_trim_cap() that
> inits skb->sk.
>
> above change also allows access to data_meta and flow_keys
> which is not correct.
>
> Considering all that I'm proposing to fix INET_INGRESS call site
> similar to code below it in sk_filter_trim_cap().
> In particular to do:
> struct sock *save_sk = skb->sk;
> skb->sk = sk;
> save and clear cb
> BPF_CGROUP_RUN_PROG_INET_INGRESS
> restore cb
> skb->sk = save_sk;
>
> all of above can probaby be inside BPF_CGROUP_RUN_PROG_INET_INGRESS macro.
> Then in this cg_skb_is_valid_access() allow access to data/data_end
> and family..local_port range as well.
> while disallowing access to flow_keys and data_meta.
>
> In patch 2 we gotta have tests for all these fields.
>
> Thoughts?

chatted with Song offline.
I completely misread 'return false' in the above as 'break'.
The patch actually disables access to pkt_type, mark, queue_mapping
and so on. Which is not correct either.
Since tests were not failing we really need to improve this aspect
of test coverage in test_verifier.c

Also I missed that __cgroup_bpf_run_filter_skb() already
does save_sk = skb->sk; skb->sk = sk;
and bpf_prog_run_save_cb()
So no issue in the existing code. That was false alarm.
Revising the proposal...
I think cg_skb_is_valid_access() can be made similar to
lwt_is_valid_access().
Allowing writes into mark, priority, cb[0..4]
and read of data/data_end.
In addition it's also ok to allow family..local_port range
(unlike lwt where sk may not be present).
and no access to data_meta and flow_keys.

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: Arnaldo Carvalho de Melo @ 2018-10-17 18:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking,
	Kernel Team
In-Reply-To: <97707bf2-ace3-18a5-3621-f69122dd93df@fb.com>

Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu:
> On 10/17/18 8:09 AM, David Ahern wrote:
> > On 10/16/18 11:43 PM, Song Liu wrote:
> >> I agree that processing events while recording has significant overhead.
> >> In this case, perf user space need to know details about the the jited BPF
> >> program. It is impossible to pass all these details to user space through
> >> the relatively stable ring_buffer API. Therefore, some processing of the
> >> data is necessary (get bpf prog_id from ring buffer, and then fetch program
> >> details via BPF_OBJ_GET_INFO_BY_FD.
> >>
> >> I have some idea on processing important data with relatively low overhead.
> >> Let me try implement it.
> >>
> >
> > As I understand it, you want this series:
> >
> >  kernel: add event to perf buffer on bpf prog load
> >
> >  userspace: perf reads the event and grabs information about the program
> > from the fd
> >
> > Is that correct?
> >
> > Userpsace is not awakened immediately when an event is added the the
> > ring. It is awakened once the number of events crosses a watermark. That
> > means there is an unknown - and potentially long - time window where the
> > program can be unloaded before perf reads the event.

> > So no matter what you do expecting perf record to be able to process the
> > event quickly is an unreasonable expectation.
 
> yes... unless we go with threaded model as Arnaldo suggested and use
> single event as a watermark to wakeup our perf thread.
> In such case there is still a race window between user space waking up
> and doing _single_ bpf_get_fd_from_id() call to hold that prog
> and some other process trying to instantly unload the prog it
> just loaded.
> I think such race window is extremely tiny and if perf misses
> those load/unload events it's a good thing, since there is no chance
> that normal pmu event samples would be happening during prog execution.
 
> The alternative approach with no race window at all is to burden kernel
> RECORD_* events with _all_ information about bpf prog. Which is jited
> addresses, jited image itself, info about all subprogs, info about line
> info, all BTF data, etc. As I said earlier I'm strongly against such
> RECORD_* bloating.
> Instead we need to find a way to process new RECORD_BPF events with
> single prog_id field in perf user space with minimal race
> and threaded approach sounds like a win to me.

There is another alternative, I think: put just a content based hash,
like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
validator does the jit, etc, it stashes the content that
BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
the kernel right after getting stuff from sys_bpf() and preparing it for
use, then we know that in (start, end) we have blob foo with content id,
that we will use to retrieve information that augments what we know with
just (start, end, id) and allows annotation, etc.

That stash space for jitted stuff gets garbage collected from time to
time or is even completely disabled if the user is not interested in
such augmentation, just like one can do disabling perf's ~/.debug/
directory hashed by build-id.

I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
cookie and we solve the other race we already have with kernel modules,
DSOs, etc.

I have mentioned this before, there were objections, perhaps this time I
formulated in a different way that makes it more interesting?

- Arnaldo

^ permalink raw reply

* Re: Form sk_buff from DMA page
From: David Miller @ 2018-10-17 18:49 UTC (permalink / raw)
  To: keyurp; +Cc: netdev
In-Reply-To: <BN6PR02MB2641F9CF498E60FB3AD98DE6B7FF0@BN6PR02MB2641.namprd02.prod.outlook.com>

From: Keyur Amrutbhai Patel <keyurp@xilinx.com>
Date: Wed, 17 Oct 2018 17:32:33 +0000

> Can anyone help me on how to form sk_buff from DMA page? Basically I
> get complete packet from DMA as single page.

Read any driver that calls build_skb().

^ permalink raw reply

* Re: ipmr, ip6mr: Unite dumproute flows
From: Nikolay Aleksandrov @ 2018-10-17 18:49 UTC (permalink / raw)
  To: Colin Ian King, Yuval Mintz
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	netdev@vger.kernel.org
In-Reply-To: <c3d958d0-9916-48d9-ad8d-076cda75c93f@canonical.com>

On 17/10/2018 21:22, Colin Ian King wrote:
> Hi,
> 
> Static analysis on linux-next has picked up a potential bug with commit:
> 
> commit 7b0db85737db3f4d76b2a412e4f19eae59b8b494
> Author: Yuval Mintz <yuvalm@mellanox.com>
> Date:   Wed Feb 28 23:29:39 2018 +0200
> 
>     ipmr, ip6mr: Unite dumproute flows
> 
> in function mr_table_dump(), s_e is and never seems to change, so the
> check if (e < s_e) is never true:  See code below, as annotated by
> CoverityScan:
> 
> 317        }
>    assignment: Assigning: e = 0U.
> 
> 318        e = 0;
>    assignment: Assigning: s_e = 0U.
> 
> 319        s_e = 0;
> 320
> 321        spin_lock_bh(lock);
> 322        list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
>    at_least: At condition e < s_e, the value of e must be at least 0.
>    const: At condition e < s_e, the value of s_e must be equal to 0.
>    dead_error_condition: The condition e < s_e cannot be true.
> 
> 323                if (e < s_e)
> 
>    CID 1474511 (#1 of 1): Logically dead code (DEADCODE)
> dead_error_line: Execution cannot reach this statement: goto next_entry2;.
> 
> 324                        goto next_entry2;
> 325                if (filter->dev &&
> 326                    !mr_mfc_uses_dev(mrt, mfc, filter->dev))
> 327                        goto next_entry2;
> 328
> 329                err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
> 330                           cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
> 331                if (err < 0) {
> 332                        spin_unlock_bh(lock);
> 333                        goto out;
> 334                }
> 335next_entry2:
> 
>    incr: Incrementing e. The value of e is now 1.
>    incr: Incrementing e. The value of e is now 2.
> 
> 336                e++;
> 
> 
> Note that the zero'ing of e and s_e for ipmr and ip6mr was added in by
> earlier commits:
> 
> commit 8fb472c09b9df478a062eacc7841448e40fc3c17
> Author: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Date:   Thu Jan 12 15:53:33 2017 +0100
> 
>     ipmr: improve hash scalability
> 
> commit 87c418bf1323d57140f4b448715f64de3fbb7e91
> Author: Yuval Mintz <yuvalm@mellanox.com>
> Date:   Wed Feb 28 23:29:31 2018 +0200
> 
>     ip6mr: Align hash implementation to ipmr
> 
> Anyhow, this looks incorrect, but I'm not familiar with this code to
> suggest the correct fix.
> 
> Colin
> 

Indeed, there's a case where we might miss an unresolved entry due to the zeroing.
I'll send a fix shortly after testing.

Thanks,
 Nik

^ permalink raw reply

* Fwd: Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-17 18:48 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: netdev@vger.kernel.org, Holger Hoffstätte
In-Reply-To: <62974f0f-1938-3635-69d4-204ed8c587b3@gmail.com>

Tomas,

more than three years back you reported network problems after BQL
was added to the r8169 driver. Due to this the change was reverted.
Now the discussion to add BQL popped up again.
You mentioned that the issue exists on one of your systems only.
Therefore it could be an issue specific to a particular chip version.

I'd be interested in the chip version of the affected system.
You linked to another similar report, there the chip version was:
r8169 0000:10:00.0 eth0: RTL8168c/8111c at 0xf8130000, 00:e0:4c:68:48:d2, XID 1c4000c0 IRQ 29

In case you still have the affected system or at least the old dmesg
logs, I'd appreciate if you could let me know the quoted line from
dmesg output.


Thanks a lot,
Heiner

-------- Forwarded Message --------
Subject: Re: [PATCH net] r8169: fix NAPI handling under high load
Date: Wed, 17 Oct 2018 20:12:48 +0200
From: Heiner Kallweit <hkallweit1@gmail.com>
To: Holger Hoffstätte <holger@applied-asynchrony.com>, David Miller <davem@davemloft.net>, Realtek linux nic maintainers <nic_swsd@realtek.com>
CC: netdev@vger.kernel.org <netdev@vger.kernel.org>

On 16.10.2018 23:17, Holger Hoffstätte wrote:
> On 10/16/18 22:37, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>> in the interrupt status register. Under high load NAPI may not be
>> able to process all data (work_done == budget) and it will schedule
>> subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status
>> register, therefore subsequent calls to rtl8169_poll() won't call
>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
> 
> Very interesting! Could this be the reason for the mysterious
> hangs & resets we experienced when enabling BQL for r8169?
> They happened more often with TSO/GSO enabled and several people
> attempted to fix those hangs unsuccessfully; it was later reverted
> and has been since then (#87cda7cb43).
> If this bug has been there "forever" it might be tempting to
> re-apply BQL and see what happens. Any chance you could give that
> a try? I'll gladly test patches, just like I'll run this one.
> 
After reading through the old mail threads regarding BQL on r8169
I don't think the fix here is related.
It seems that BQL on r8169 worked fine for most people, just one
had problems on one of his systems. I assume the issue was specific
to one chip version. From the ~ 50 chip versions supported by
r8169 more or less each one requires its own quirks.
If we're lucky the chip-version-specific issue has been fixed in
the meantime and we can simply apply the old BQL patch again.

If it turns out that certain chip versions can't be used with BQL,
then we can disable the feature for these chip versions instead
of removing the feature completely.

I will apply the old BQL patch and see how it's on my system
(with GRO and SG enabled).

> cheers
> Holger
> 
Heiner

^ permalink raw reply

* Re: [PATCH net-next V2 6/8] vhost: packed ring support
From: Jason Wang @ 2018-10-18  2:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, netdev, linux-kernel, virtualization, maxime.coquelin, wexu
In-Reply-To: <20181015062241-mutt-send-email-mst@kernel.org>


On 2018/10/15 下午6:25, Michael S. Tsirkin wrote:
> On Mon, Oct 15, 2018 at 10:51:06AM +0800, Jason Wang wrote:
>> On 2018年10月15日 10:43, Michael S. Tsirkin wrote:
>>> On Mon, Oct 15, 2018 at 10:22:33AM +0800, Jason Wang wrote:
>>>> On 2018年10月13日 01:23, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 12, 2018 at 10:32:44PM +0800, Tiwei Bie wrote:
>>>>>> On Mon, Jul 16, 2018 at 11:28:09AM +0800, Jason Wang wrote:
>>>>>> [...]
>>>>>>> @@ -1367,10 +1397,48 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>>>>>>     		vq->last_avail_idx = s.num;
>>>>>>>     		/* Forget the cached index value. */
>>>>>>>     		vq->avail_idx = vq->last_avail_idx;
>>>>>>> +		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>>> +			vq->last_avail_wrap_counter = wrap_counter;
>>>>>>> +			vq->avail_wrap_counter = vq->last_avail_wrap_counter;
>>>>>>> +		}
>>>>>>>     		break;
>>>>>>>     	case VHOST_GET_VRING_BASE:
>>>>>>>     		s.index = idx;
>>>>>>>     		s.num = vq->last_avail_idx;
>>>>>>> +		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>>> +			s.num |= vq->last_avail_wrap_counter << 31;
>>>>>>> +		if (copy_to_user(argp, &s, sizeof(s)))
>>>>>>> +			r = -EFAULT;
>>>>>>> +		break;
>>>>>>> +	case VHOST_SET_VRING_USED_BASE:
>>>>>>> +		/* Moving base with an active backend?
>>>>>>> +		 * You don't want to do that.
>>>>>>> +		 */
>>>>>>> +		if (vq->private_data) {
>>>>>>> +			r = -EBUSY;
>>>>>>> +			break;
>>>>>>> +		}
>>>>>>> +		if (copy_from_user(&s, argp, sizeof(s))) {
>>>>>>> +			r = -EFAULT;
>>>>>>> +			break;
>>>>>>> +		}
>>>>>>> +		if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>>> +			wrap_counter = s.num >> 31;
>>>>>>> +			s.num &= ~(1 << 31);
>>>>>>> +		}
>>>>>>> +		if (s.num > 0xffff) {
>>>>>>> +			r = -EINVAL;
>>>>>>> +			break;
>>>>>>> +		}
>>>>>> Do we want to put wrap_counter at bit 15?
>>>>> I think I second that - seems to be consistent with
>>>>> e.g. event suppression structure and the proposed
>>>>> extension to driver notifications.
>>>> Ok, I assumes packed virtqueue support 64K but looks not. I can change it to
>>>> bit 15 and GET_VRING_BASE need to be changed as well.
>>>>
>>>>>> If put wrap_counter at bit 31, the check (s.num > 0xffff)
>>>>>> won't be able to catch the illegal index 0x8000~0xffff for
>>>>>> packed ring.
>>>>>>
>>>> Do we need to clarify this in the spec?
>>> Isn't this all internal vhost stuff?
>> I meant the illegal index 0x8000-0xffff.
> It does say packed virtqueues support up to 2 15 entries each.
>
> But yes we can add a requirement that devices do not expose
> larger rings. Split does not support 2**16 either, right?
> With 2**16 enties avail index becomes 0 and ring looks empty.
>

Yes, so it's better to clarify this in the spec.

Thanks


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* RE: Form sk_buff from DMA page
From: Keyur Amrutbhai Patel @ 2018-10-17 18:45 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev@vger.kernel.org
In-Reply-To: <20181017105050.695eab3b@shemminger-XPS-13-9360>

Hi Stephen,

Thanks you for answering my query.
I am looking into build_skb it seems addressing to my query. 

Appreciated..

Any good article on sk_buff which explains everything about it?

Regards,
Keyur 

-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org> 
Sent: Wednesday, October 17, 2018 11:21 PM
To: Keyur Amrutbhai Patel <keyurp@xilinx.com>
Cc: netdev@vger.kernel.org
Subject: Re: Form sk_buff from DMA page

EXTERNAL EMAIL

On Wed, 17 Oct 2018 17:32:33 +0000
Keyur Amrutbhai Patel <keyurp@xilinx.com> wrote:

> Hi,
>
> Can anyone help me on how to form sk_buff from DMA page? Basically I get complete packet from DMA as single page.
>
> Regards,
> Keyur
> This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.

Did you look at build_skb?

^ permalink raw reply

* Re: ipmr, ip6mr: Unite dumproute flows
From: Colin Ian King @ 2018-10-17 18:22 UTC (permalink / raw)
  To: Yuval Mintz, Nikolay Aleksandrov
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	netdev@vger.kernel.org

Hi,

Static analysis on linux-next has picked up a potential bug with commit:

commit 7b0db85737db3f4d76b2a412e4f19eae59b8b494
Author: Yuval Mintz <yuvalm@mellanox.com>
Date:   Wed Feb 28 23:29:39 2018 +0200

    ipmr, ip6mr: Unite dumproute flows

in function mr_table_dump(), s_e is and never seems to change, so the
check if (e < s_e) is never true:  See code below, as annotated by
CoverityScan:

317        }
   assignment: Assigning: e = 0U.

318        e = 0;
   assignment: Assigning: s_e = 0U.

319        s_e = 0;
320
321        spin_lock_bh(lock);
322        list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
   at_least: At condition e < s_e, the value of e must be at least 0.
   const: At condition e < s_e, the value of s_e must be equal to 0.
   dead_error_condition: The condition e < s_e cannot be true.

323                if (e < s_e)

   CID 1474511 (#1 of 1): Logically dead code (DEADCODE)
dead_error_line: Execution cannot reach this statement: goto next_entry2;.

324                        goto next_entry2;
325                if (filter->dev &&
326                    !mr_mfc_uses_dev(mrt, mfc, filter->dev))
327                        goto next_entry2;
328
329                err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
330                           cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
331                if (err < 0) {
332                        spin_unlock_bh(lock);
333                        goto out;
334                }
335next_entry2:

   incr: Incrementing e. The value of e is now 1.
   incr: Incrementing e. The value of e is now 2.

336                e++;


Note that the zero'ing of e and s_e for ipmr and ip6mr was added in by
earlier commits:

commit 8fb472c09b9df478a062eacc7841448e40fc3c17
Author: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date:   Thu Jan 12 15:53:33 2017 +0100

    ipmr: improve hash scalability

commit 87c418bf1323d57140f4b448715f64de3fbb7e91
Author: Yuval Mintz <yuvalm@mellanox.com>
Date:   Wed Feb 28 23:29:31 2018 +0200

    ip6mr: Align hash implementation to ipmr

Anyhow, this looks incorrect, but I'm not familiar with this code to
suggest the correct fix.

Colin

^ permalink raw reply

* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-17 18:12 UTC (permalink / raw)
  To: Holger Hoffstätte, David Miller,
	Realtek linux nic maintainers
  Cc: netdev@vger.kernel.org
In-Reply-To: <ada39488-8e9c-d6a8-461e-672642497db2@applied-asynchrony.com>

On 16.10.2018 23:17, Holger Hoffstätte wrote:
> On 10/16/18 22:37, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>> in the interrupt status register. Under high load NAPI may not be
>> able to process all data (work_done == budget) and it will schedule
>> subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status
>> register, therefore subsequent calls to rtl8169_poll() won't call
>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
> 
> Very interesting! Could this be the reason for the mysterious
> hangs & resets we experienced when enabling BQL for r8169?
> They happened more often with TSO/GSO enabled and several people
> attempted to fix those hangs unsuccessfully; it was later reverted
> and has been since then (#87cda7cb43).
> If this bug has been there "forever" it might be tempting to
> re-apply BQL and see what happens. Any chance you could give that
> a try? I'll gladly test patches, just like I'll run this one.
> 
After reading through the old mail threads regarding BQL on r8169
I don't think the fix here is related.
It seems that BQL on r8169 worked fine for most people, just one
had problems on one of his systems. I assume the issue was specific
to one chip version. From the ~ 50 chip versions supported by
r8169 more or less each one requires its own quirks.
If we're lucky the chip-version-specific issue has been fixed in
the meantime and we can simply apply the old BQL patch again.

If it turns out that certain chip versions can't be used with BQL,
then we can disable the feature for these chip versions instead
of removing the feature completely.

I will apply the old BQL patch and see how it's on my system
(with GRO and SG enabled).

> cheers
> Holger
> 
Heiner

^ permalink raw reply

* Re: [PATCH net] sctp: fix the data size calculation in sctp_data_size
From: Marcelo Ricardo Leitner @ 2018-10-17 18:00 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <c407b682a424f0441d9b9a29ef6296c8e9824b64.1539781887.git.lucien.xin@gmail.com>

On Wed, Oct 17, 2018 at 09:11:27PM +0800, Xin Long wrote:
> sctp data size should be calculated by subtracting data chunk header's
> length from chunk_hdr->length, not just data header.
> 
> Fixes: 668c9beb9020 ("sctp: implement assign_number for sctp_stream_interleave")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
>  include/net/sctp/sm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
> index 5ef1bad..9e3d327 100644
> --- a/include/net/sctp/sm.h
> +++ b/include/net/sctp/sm.h
> @@ -347,7 +347,7 @@ static inline __u16 sctp_data_size(struct sctp_chunk *chunk)
>  	__u16 size;
>  
>  	size = ntohs(chunk->chunk_hdr->length);
> -	size -= sctp_datahdr_len(&chunk->asoc->stream);
> +	size -= sctp_datachk_len(&chunk->asoc->stream);
>  
>  	return size;
>  }
> -- 
> 2.1.0
> 

^ permalink raw reply

* Re: Form sk_buff from DMA page
From: Stephen Hemminger @ 2018-10-17 17:50 UTC (permalink / raw)
  To: Keyur Amrutbhai Patel; +Cc: netdev@vger.kernel.org
In-Reply-To: <BN6PR02MB2641F9CF498E60FB3AD98DE6B7FF0@BN6PR02MB2641.namprd02.prod.outlook.com>

On Wed, 17 Oct 2018 17:32:33 +0000
Keyur Amrutbhai Patel <keyurp@xilinx.com> wrote:

> Hi,
> 
> Can anyone help me on how to form sk_buff from DMA page? Basically I get complete packet from DMA as single page.
> 
> Regards,
> Keyur
> This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.

Did you look at build_skb?

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Martin Lau @ 2018-10-17 17:50 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Edward Cree, Alexei Starovoitov, daniel@iogearbox.net,
	netdev@vger.kernel.org, Kernel Team
In-Reply-To: <0d0534e8-d05a-5541-2380-58a4ea36551b@fb.com>

On Wed, Oct 17, 2018 at 10:25:21AM -0700, Yonghong Song wrote:
> 
> 
> On 10/17/18 9:13 AM, Edward Cree wrote:
> > On 17/10/18 08:23, Yonghong Song wrote:
> >> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
> >> support to the type section. BTF_KIND_FUNC_PROTO is used
> >> to specify the type of a function pointer. With this,
> >> BTF has a complete set of C types (except float).
> >>
> >> BTF_KIND_FUNC is used to specify the signature of a
> >> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
> >> by another type, e.g., a pointer type, and BTF_KIND_FUNC
> >> type cannot be referenced by another type.
> > Why are distinct kinds created for these?  A function body is
> >   a value of function type, and since there's no way (in C) to
> >   declare a variable of function type (only pointer-to-
> >   function), any declaration of function type must necessarily
> >   be a BTF_KIND_FUNC, whereas any other reference to a function
> >   type (e.g. a declaration of type pointer to function type)
> >   must, as you state above, be a BTF_KIND_FUNC_PROTO.
> > In fact, you can tell the difference just from name_off, since
> >   a (C-legal) BTF_KIND_FUNC_PROTO will always be anonymous (as
> >   the pointee of a pointer type), while a BTF_KIND_FUNC will
> >   have the name of the subprogram.
> 
> What you stated is true, BTF_KIND_FUNC_PROTO corresponds to
> dwarf subroutine tag which has no name while BTF_KIND_FUNC
> must have a valid name. The original design is to have both
> since they are corresponding to different dwarf constructs.
> 
> Martin, what do you think?
I prefer to have separate kinds.  We need a way to distinguish them.
For example, the BTF verifier is checking it.  Having two kinds is
cleaner instead of resorting to other hints from 'struct btf_type'.
We don't lack of bits for kind.


> 
> > 
> > -Ed
> > 

^ permalink raw reply

* For your photos 25
From: Jenny @ 2018-10-17  9:05 UTC (permalink / raw)
  To: netdev

We provide photoshop services to some of the companies from around the
world.

Some online stores use our services for retouching portraits, jewelry,
apparels, furnitures etc.

Here are the details of what we provide:
Clipping path
Deep etching
Image masking
Portrait retouching
Jewelry retouching
Fashion retouching

Please reply back for further info.
We can provide testing for your photos if needed.

Thanks,
Jenny

^ permalink raw reply

* Re: [PATCH bpf-next v2 4/8] tls: convert to generic sk_msg interface
From: Daniel Borkmann @ 2018-10-17 17:32 UTC (permalink / raw)
  To: Eric Dumazet, alexei.starovoitov; +Cc: john.fastabend, davejwatson, netdev
In-Reply-To: <4b403534-2273-5931-753b-c16a6e2624a0@gmail.com>

On 10/17/2018 04:55 PM, Eric Dumazet wrote:
> On 10/12/2018 05:45 PM, Daniel Borkmann wrote:
>> Convert kTLS over to make use of sk_msg interface for plaintext and
>> encrypted scattergather data, so it reuses all the sk_msg helpers
>> and data structure which later on in a second step enables to glue
>> this to BPF.
>>
>> This also allows to remove quite a bit of open coded helpers which
>> are covered by the sk_msg API. Recent changes in kTLs 80ece6a03aaf
>> ("tls: Remove redundant vars from tls record structure") and
>> 4e6d47206c32 ("tls: Add support for inplace records encryption")
>> changed the data path handling a bit; while we've kept the latter
>> optimization intact, we had to undo the former change to better
>> fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
>> been brought back and are linked into the sk_msg sgs. Now the kTLS
>> record contains a msg_plaintext and msg_encrypted sk_msg each.
>>
>> In the original code, the zerocopy_from_iter() has been used out
>> of TX but also RX path. For the strparser skb-based RX path,
>> we've left the zerocopy_from_iter() in decrypt_internal() mostly
>> untouched, meaning it has been moved into tls_setup_from_iter()
>> with charging logic removed (as not used from RX). Given RX path
>> is not based on sk_msg objects, we haven't pursued setting up a
>> dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
>> could be an option to prusue in a later step.
>>
>> Joint work with John.
> 
> Something looks broken and needs this fix :
> 
> (I have to run, I might submit this formally later if needed)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 3c0e44cb811a4bed2e02aa107854d74eb3ae7358..3e55dedc038b22132b5b6d042bfedda9e4f3157e 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -175,12 +175,12 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
>                 }
>         }
>  
> +       if (!sk_has_psock(sk)) {
> +               ret = -EBUSY;
> +               goto out_progs;
> +       }
>         psock = sk_psock_get(sk);

Thanks for reporting! Yep, so this is almost correct, as-is above this would
always let sock_map_link() bail out. Issue is we take the ref inside sk_psock_get()
before making sure it's a psock, so if the sk user data is present but not a
psock we need to bail out with EBUSY, but if it's NULL or already a valid psock
we may proceed; we'll see to cook something up this evening, thanks!

>         if (psock) {
> -               if (!sk_has_psock(sk)) {
> -                       ret = -EBUSY;
> -                       goto out_progs;
> -               }
>                 if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
>                     (skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
>                         sk_psock_put(sk, psock);
> 
> 
> BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
> BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
> Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
> 
> CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
>  print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>  check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
>  kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
>  atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
>  refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
>  sk_psock_get include/linux/skmsg.h:379 [inline]
>  sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
>  sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
>  sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
>  map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
>  __do_sys_bpf kernel/bpf/syscall.c:2400 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:2371 [inline]
>  __x64_sys_bpf+0x32d/0x510 kernel/bpf/syscall.c:2371
>  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x457569
> Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007f71507e0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457569
> RDX: 0000000000000020 RSI: 0000000020000180 RDI: 0000000000000002
> RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f71507e16d4
> R13: 00000000004bd926 R14: 00000000004cc2b0 R15: 00000000ffffffff
> 
> Allocated by task 22387:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>  set_track mm/kasan/kasan.c:460 [inline]
>  kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
>  kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554
>  kmem_cache_zalloc include/linux/slab.h:697 [inline]
>  kcm_attach net/kcm/kcmsock.c:1405 [inline]
>  kcm_attach_ioctl net/kcm/kcmsock.c:1490 [inline]
>  kcm_ioctl+0xca7/0x18c0 net/kcm/kcmsock.c:1696
>  sock_do_ioctl+0xeb/0x420 net/socket.c:950
>  sock_ioctl+0x313/0x690 net/socket.c:1074
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  file_ioctl fs/ioctl.c:501 [inline]
>  do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
>  ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
>  __do_sys_ioctl fs/ioctl.c:709 [inline]
>  __se_sys_ioctl fs/ioctl.c:707 [inline]
>  __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
>  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Freed by task 0:
> (stack is not available)
> 
> The buggy address belongs to the object at ffff88019548bc00
>  which belongs to the cache kcm_psock_cache of size 544
> The buggy address is located 56 bytes to the right of
>  544-byte region [ffff88019548bc00, ffff88019548be20)
> The buggy address belongs to the page:
> page:ffffea0006552280 count:1 mapcount:0 mapping:ffff8801cb789b40 index:0x0 compound_mapcount: 0
> flags: 0x2fffc0000008100(slab|head)
> raw: 02fffc0000008100 ffff8801cb78c748 ffff8801cb78c748 ffff8801cb789b40
> kobject: 'loop2' (0000000086df2c8c): kobject_uevent_env
> raw: 0000000000000000 ffff88019548a080 000000010000000b 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff88019548bd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff88019548bd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ffff88019548be00: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>                                                     ^
>  ffff88019548be80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88019548bf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
> 

^ permalink raw reply

* Form sk_buff from DMA page
From: Keyur Amrutbhai Patel @ 2018-10-17 17:32 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

Can anyone help me on how to form sk_buff from DMA page? Basically I get complete packet from DMA as single page.

Regards,
Keyur
This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.

^ permalink raw reply

* Re: [PATCH bpf-next v2 13/13] tools/bpf: bpftool: add support for jited func types
From: Yonghong Song @ 2018-10-17 17:28 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
	netdev@vger.kernel.org
  Cc: Kernel Team
In-Reply-To: <0b60d21b-dcbf-a0df-7b2c-b0506463af1f@solarflare.com>



On 10/17/18 4:11 AM, Edward Cree wrote:
> On 17/10/18 08:24, Yonghong Song wrote:
>> This patch added support to print function signature
>> if btf func_info is available. Note that ksym
>> now uses function name instead of prog_name as
>> prog_name has a limit of 16 bytes including
>> ending '\0'.
>>
>> The following is a sample output for selftests
>> test_btf with file test_btf_haskv.o:
>>
>>    $ bpftool prog dump jited id 1
>>    int _dummy_tracepoint(struct dummy_tracepoint_args * ):
>>    bpf_prog_b07ccb89267cf242__dummy_tracepoint:
>>       0:   push   %rbp
>>       1:   mov    %rsp,%rbp
>>      ......
>>      3c:   add    $0x28,%rbp
>>      40:   leaveq
>>      41:   retq
>>
>>    int test_long_fname_1(struct dummy_tracepoint_args * ):
>>    bpf_prog_2dcecc18072623fc_test_long_fname_1:
>>       0:   push   %rbp
>>       1:   mov    %rsp,%rbp
>>      ......
>>      3a:   add    $0x28,%rbp
>>      3e:   leaveq
>>      3f:   retq
>>
>>    int test_long_fname_2(struct dummy_tracepoint_args * ):
>>    bpf_prog_89d64e4abf0f0126_test_long_fname_2:
>>       0:   push   %rbp
>>       1:   mov    %rsp,%rbp
>>      ......
>>      80:   add    $0x28,%rbp
>>      84:   leaveq
>>      85:   retq
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   tools/bpf/bpftool/btf_dumper.c | 96 ++++++++++++++++++++++++++++++++++
>>   tools/bpf/bpftool/main.h       |  2 +
>>   tools/bpf/bpftool/prog.c       | 54 +++++++++++++++++++
>>   3 files changed, 152 insertions(+)
>>
>> diff --git a/tools/bpf/bpftool/btf_dumper.c b/tools/bpf/bpftool/btf_dumper.c
>> index 55bc512a1831..a31df4202335 100644
>> --- a/tools/bpf/bpftool/btf_dumper.c
>> +++ b/tools/bpf/bpftool/btf_dumper.c
>> @@ -249,3 +249,99 @@ int btf_dumper_type(const struct btf_dumper *d, __u32 type_id,
>>   {
>>   	return btf_dumper_do_type(d, type_id, 0, data);
>>   }
>> +
>> +#define BTF_PRINT_STRING(str)						\
>> +	{								\
>> +		pos += snprintf(func_sig + pos, size - pos, str);	\
>> +		if (pos >= size)					\
>> +			return -1;					\
>> +	}
> Usual kernel practice for this sort of macro is to use
>      do { \
>      } while(0)
>   to ensure correct behaviour if the macro is used within another control
>   flow statement, e.g.
>      if (x)
>          BTF_PRINT_STRING(x);
>      else
>          do_something_else();
>   will not compile with the bare braces as the else will be detached.

Thanks for the review! Will change to use the "do while" format
as you suggested.

>> +#define BTF_PRINT_ONE_ARG(fmt, arg)					\
>> +	{								\
>> +		pos += snprintf(func_sig + pos, size - pos, fmt, arg);	\
>> +		if (pos >= size)					\
>> +			return -1;					\
>> +	}
> Any reason for not just using a variadic macro?

No particular reason. I will try to use it in the next revision.

>> +#define BTF_PRINT_TYPE_ONLY(type)					\
>> +	{								\
>> +		pos = __btf_dumper_type_only(btf, type, func_sig,	\
>> +					     pos, size);		\
>> +		if (pos == -1)						\
>> +			return -1;					\
>> +	}
>> +
>> +static int __btf_dumper_type_only(struct btf *btf, __u32 type_id,
>> +				  char *func_sig, int pos, int size)
>> +{
>> +	const struct btf_type *t = btf__type_by_id(btf, type_id);
>> +	const struct btf_array *array;
>> +	int i, vlen;
>> +
>> +	switch (BTF_INFO_KIND(t->info)) {
>> +	case BTF_KIND_INT:
>> +		BTF_PRINT_ONE_ARG("%s ",
>> +				  btf__name_by_offset(btf, t->name_off));
>> +		break;
>> +	case BTF_KIND_STRUCT:
>> +		BTF_PRINT_ONE_ARG("struct %s ",
>> +				  btf__name_by_offset(btf, t->name_off));
>> +		break;
>> +	case BTF_KIND_UNION:
>> +		BTF_PRINT_ONE_ARG("union %s ",
>> +				  btf__name_by_offset(btf, t->name_off));
>> +		break;
>> +	case BTF_KIND_ENUM:
>> +		BTF_PRINT_ONE_ARG("enum %s ",
>> +				  btf__name_by_offset(btf, t->name_off));
>> +		break;
>> +	case BTF_KIND_ARRAY:
>> +		array = (struct btf_array *)(t + 1);
>> +		BTF_PRINT_TYPE_ONLY(array->type);
>> +		BTF_PRINT_ONE_ARG("[%d]", array->nelems);
>> +		break;
>> +	case BTF_KIND_PTR:
>> +		BTF_PRINT_TYPE_ONLY(t->type);
>> +		BTF_PRINT_STRING("* ");
>> +		break;
>> +	case BTF_KIND_UNKN:
>> +	case BTF_KIND_FWD:
>> +	case BTF_KIND_TYPEDEF:
>> +		return -1;
>> +	case BTF_KIND_VOLATILE:
>> +		BTF_PRINT_STRING("volatile ");
>> +		BTF_PRINT_TYPE_ONLY(t->type);
>> +		break;
>> +	case BTF_KIND_CONST:
>> +		BTF_PRINT_STRING("const ");
>> +		BTF_PRINT_TYPE_ONLY(t->type);
>> +		break;
>> +	case BTF_KIND_RESTRICT:
>> +		BTF_PRINT_STRING("restrict ");
>> +		BTF_PRINT_TYPE_ONLY(t->type);
>> +		break;
>> +	case BTF_KIND_FUNC:
>> +	case BTF_KIND_FUNC_PROTO:
>> +		BTF_PRINT_TYPE_ONLY(t->type);
>> +		BTF_PRINT_ONE_ARG("%s(", btf__name_by_offset(btf, t->name_off));
>> +		vlen = BTF_INFO_VLEN(t->info);
>> +		for (i = 0; i < vlen; i++) {
>> +			__u32 arg_type = ((__u32 *)(t + 1))[i];
>> +
>> +			BTF_PRINT_TYPE_ONLY(arg_type);
>> +			if (i != (vlen - 1))
>> +				BTF_PRINT_STRING(", ");
>> +		}
> In this kind of loop I find it cleaner to print the comma before the item;
>   that way the test becomes i != 0.  Thus:
>      for (i = 0; i < vlen; i++) {
>          __u32 arg_type = ((__u32 *)(t + 1))[i];
> 
>          if (i)
>              BTF_PRINT_STRING(", ");
>          BTF_PRINT_TYPE_ONLY(arg_type);
>      }

Good suggestion. Will make change in the next revision.

> 
> -Ed
> 

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Alexei Starovoitov @ 2018-10-17 17:26 UTC (permalink / raw)
  To: Song Liu; +Cc: netdev, ast, daniel, kernel-team
In-Reply-To: <20181017055606.353449-2-songliubraving@fb.com>

On Tue, Oct 16, 2018 at 10:56:05PM -0700, Song Liu wrote:
> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
> skb. This patch enables direct access of skb for these programs.

The lack of direct packet access in CGROUP_SKB progs was
an unpleasant surprise to me, so thank you for fixing it,
but there are few issues with the patch. See below.

> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
> to compute proper data_end for the BPF program.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  kernel/bpf/cgroup.c |  4 ++++
>  net/core/filter.c   | 26 +++++++++++++++++++++++++-
>  2 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 00f6ed2e4f9a..340d496f35bd 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>  	save_sk = skb->sk;
>  	skb->sk = sk;
>  	__skb_push(skb, offset);
> +
> +	/* compute pointers for the bpf prog */
> +	bpf_compute_data_pointers(skb);
> +
>  	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
>  				 bpf_prog_run_save_cb);
>  	__skb_pull(skb, offset);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1a3ac6c46873..8b5a502e241f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5346,6 +5346,30 @@ static bool sk_filter_is_valid_access(int off, int size,
>  	return bpf_skb_is_valid_access(off, size, type, prog, info);
>  }
>  
> +static bool cg_skb_is_valid_access(int off, int size,
> +				   enum bpf_access_type type,
> +				   const struct bpf_prog *prog,
> +				   struct bpf_insn_access_aux *info)
> +{
> +	if (type == BPF_WRITE)
> +		return false;

this disables writes into cb[0..4] that were allowed for cgroup_inet_* before.
One can argue that this may break existing progs,
but looking at the place where BPF_CGROUP_RUN_PROG_INET_INGRESS is called
it seems it's actually not correct in all cases to access cb there.
Just few lines down we call bpf_prog_run_save_cb() which save/restores
these 24 bytes.
So we have two option either add save/restore for INET_INGRESS only
or disable read and write access to cb[0..4] for CGROUP_SKB progs.
I prefer the former.

> +
> +	switch (off) {
> +	case bpf_ctx_range(struct __sk_buff, len):
> +		break;
> +	case bpf_ctx_range(struct __sk_buff, data):
> +		info->reg_type = PTR_TO_PACKET;
> +		break;
> +	case bpf_ctx_range(struct __sk_buff, data_end):
> +		info->reg_type = PTR_TO_PACKET_END;
> +		break;
> +	default:
> +		return false;
> +	}

this also enables access to a range of fields family..local_port.
It's ok to do for egress, but not for ingress unless we
add code similar to the bottom of sk_filter_trim_cap() that
inits skb->sk.

above change also allows access to data_meta and flow_keys
which is not correct.

Considering all that I'm proposing to fix INET_INGRESS call site
similar to code below it in sk_filter_trim_cap().
In particular to do:
struct sock *save_sk = skb->sk;
skb->sk = sk;
save and clear cb
BPF_CGROUP_RUN_PROG_INET_INGRESS
restore cb
skb->sk = save_sk;

all of above can probaby be inside BPF_CGROUP_RUN_PROG_INET_INGRESS macro.
Then in this cg_skb_is_valid_access() allow access to data/data_end
and family..local_port range as well.
while disallowing access to flow_keys and data_meta.

In patch 2 we gotta have tests for all these fields.

Thoughts?

> +
> +	return bpf_skb_is_valid_access(off, size, type, prog, info);
> +}
> +
>  static bool lwt_is_valid_access(int off, int size,
>  				enum bpf_access_type type,
>  				const struct bpf_prog *prog,
> @@ -7038,7 +7062,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
>  
>  const struct bpf_verifier_ops cg_skb_verifier_ops = {
>  	.get_func_proto		= cg_skb_func_proto,
> -	.is_valid_access	= sk_filter_is_valid_access,
> +	.is_valid_access	= cg_skb_is_valid_access,
>  	.convert_ctx_access	= bpf_convert_ctx_access,
>  };
>  
> -- 
> 2.17.1
> 

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Yonghong Song @ 2018-10-17 17:25 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
	netdev@vger.kernel.org
  Cc: Kernel Team
In-Reply-To: <0a2fa960-285d-dc32-bf76-5ba27da89dba@solarflare.com>



On 10/17/18 9:13 AM, Edward Cree wrote:
> On 17/10/18 08:23, Yonghong Song wrote:
>> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
>> support to the type section. BTF_KIND_FUNC_PROTO is used
>> to specify the type of a function pointer. With this,
>> BTF has a complete set of C types (except float).
>>
>> BTF_KIND_FUNC is used to specify the signature of a
>> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
>> by another type, e.g., a pointer type, and BTF_KIND_FUNC
>> type cannot be referenced by another type.
> Why are distinct kinds created for these?  A function body is
>   a value of function type, and since there's no way (in C) to
>   declare a variable of function type (only pointer-to-
>   function), any declaration of function type must necessarily
>   be a BTF_KIND_FUNC, whereas any other reference to a function
>   type (e.g. a declaration of type pointer to function type)
>   must, as you state above, be a BTF_KIND_FUNC_PROTO.
> In fact, you can tell the difference just from name_off, since
>   a (C-legal) BTF_KIND_FUNC_PROTO will always be anonymous (as
>   the pointee of a pointer type), while a BTF_KIND_FUNC will
>   have the name of the subprogram.

What you stated is true, BTF_KIND_FUNC_PROTO corresponds to
dwarf subroutine tag which has no name while BTF_KIND_FUNC
must have a valid name. The original design is to have both
since they are corresponding to different dwarf constructs.

Martin, what do you think?

> 
> -Ed
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox