* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-17 19:27 UTC (permalink / raw)
To: Holger Hoffstätte, David Miller,
Realtek linux nic maintainers
Cc: netdev@vger.kernel.org
In-Reply-To: <e7607b0f-86ed-6784-638c-37d68d910fb7@applied-asynchrony.com>
On 17.10.2018 21:11, Holger Hoffstätte wrote:
> On 10/17/18 20:12, Heiner Kallweit wrote:
>> On 16.10.2018 23:17, Holger Hoffstätte wrote:
>>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>>>> in the interrupt status register. Under high load NAPI may not be
>>>> able to process all data (work_done == budget) and it will schedule
>>>> subsequent calls to the poll callback.
>>>> rtl_ack_events() however resets the bits in the interrupt status
>>>> register, therefore subsequent calls to rtl8169_poll() won't call
>>>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>>
>>> Very interesting! Could this be the reason for the mysterious
>>> hangs & resets we experienced when enabling BQL for r8169?
>>> They happened more often with TSO/GSO enabled and several people
>>> attempted to fix those hangs unsuccessfully; it was later reverted
>>> and has been since then (#87cda7cb43).
>>> If this bug has been there "forever" it might be tempting to
>>> re-apply BQL and see what happens. Any chance you could give that
>>> a try? I'll gladly test patches, just like I'll run this one.
>>>
>> After reading through the old mail threads regarding BQL on r8169
>> I don't think the fix here is related.
>> It seems that BQL on r8169 worked fine for most people, just one
>> had problems on one of his systems. I assume the issue was specific
>
> I continued to use the BQL patch in my private tree after it was reverted
> and also had occasional timeouts, but *only* after I started playing
> with ethtool to change offload settings. Without offloads or the BQL patch
> everything has been rock-solid since then.
> The other weird problem was that timeouts would occur on an otherwise
> *completely idle* system. Since that occasionally borked my NFS server
> over night I ultimately removed BQL as well. Rock-solid since then.
>
>> I will apply the old BQL patch and see how it's on my system
>> (with GRO and SG enabled).
>
> I don't think it still applies cleanly, but if you cook up an updated
> version I'll gladly test it.
>
> Thanks! :)
> Holger
>
Good to know. What's your kernel version and RTL8168 chip version?
Regarding the chip version the dmesg line with the XID would be relevant.
Below is the slightly modified original BQL patch, I just moved the call
to netdev_reset_queue(). This patch applies at least to latest linux-next.
My test system:
- RTL8168evl
- latest linux-next
- BQL patch applied
- SG/GRO enabled:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
I briefly tested normal operation and did some tests with iperf3.
Everything looks good so far.
---
drivers/net/ethernet/realtek/r8169.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0d8070adc..e236b46b8 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5852,6 +5852,7 @@ static void rtl8169_tx_clear(struct rtl8169_private *tp)
{
rtl8169_tx_clear_range(tp, tp->dirty_tx, NUM_TX_DESC);
tp->cur_tx = tp->dirty_tx = 0;
+ netdev_reset_queue(tp->dev);
}
static void rtl_reset_work(struct rtl8169_private *tp)
@@ -6154,6 +6155,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
txd->opts2 = cpu_to_le32(opts[1]);
+ netdev_sent_queue(dev, skb->len);
+
skb_tx_timestamp(skb);
/* Force memory writes to complete before releasing descriptor */
@@ -6252,7 +6255,7 @@ static void rtl8169_pcierr_interrupt(struct net_device *dev)
static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
{
- unsigned int dirty_tx, tx_left;
+ unsigned int dirty_tx, tx_left, bytes_compl = 0, pkts_compl = 0;
dirty_tx = tp->dirty_tx;
smp_rmb();
@@ -6276,10 +6279,8 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
rtl8169_unmap_tx_skb(tp_to_dev(tp), tx_skb,
tp->TxDescArray + entry);
if (status & LastFrag) {
- u64_stats_update_begin(&tp->tx_stats.syncp);
- tp->tx_stats.packets++;
- tp->tx_stats.bytes += tx_skb->skb->len;
- u64_stats_update_end(&tp->tx_stats.syncp);
+ pkts_compl++;
+ bytes_compl += tx_skb->skb->len;
dev_consume_skb_any(tx_skb->skb);
tx_skb->skb = NULL;
}
@@ -6288,6 +6289,13 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
}
if (tp->dirty_tx != dirty_tx) {
+ netdev_completed_queue(dev, pkts_compl, bytes_compl);
+
+ u64_stats_update_begin(&tp->tx_stats.syncp);
+ tp->tx_stats.packets += pkts_compl;
+ tp->tx_stats.bytes += bytes_compl;
+ u64_stats_update_end(&tp->tx_stats.syncp);
+
tp->dirty_tx = dirty_tx;
/* Sync with rtl8169_start_xmit:
* - publish dirty_tx ring index (write barrier)
--
2.19.1
^ permalink raw reply related
* Re: [PATCH] isdn: hfc_{pci,sx}: Avoid empty body if statements and use proper register accessors
From: Masahiro Yamada @ 2018-10-18 3:23 UTC (permalink / raw)
To: Nathan Chancellor; +Cc: isdn, Networking, Linux Kernel Mailing List
In-Reply-To: <20181017180657.9410-1-natechancellor@gmail.com>
Hi Nathan,
On Thu, Oct 18, 2018 at 3:09 AM Nathan Chancellor
<natechancellor@gmail.com> wrote:
>
> Clang warns:
>
> drivers/isdn/hisax/hfc_pci.c:131:34: error: if statement has empty body
> [-Werror,-Wempty-body]
> if (Read_hfc(cs, HFCPCI_INT_S1));
> ^
> drivers/isdn/hisax/hfc_pci.c:131:34: note: put the semicolon on a
> separate line to silence this warning
>
> Use the format found in drivers/isdn/hardware/mISDN/hfcpci.c of casting
> the return of Read_hfc to void, instead of using an empty if statement.
>
> While we're at it, Masahiro Yamada pointed out that {Read,Write}_hfc
> should be using a standard access method in hfc_pci.h. Use the one found
> in drivers/isdn/hardware/mISDN/hfc_pci.h.
>
> Link: https://github.com/ClangBuiltLinux/linux/issues/66
> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> ---
> drivers/isdn/hisax/hfc_pci.c | 6 +++---
> drivers/isdn/hisax/hfc_pci.h | 4 ++--
> drivers/isdn/hisax/hfc_sx.c | 6 +++---
> 3 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/isdn/hisax/hfc_pci.c b/drivers/isdn/hisax/hfc_pci.c
> index 8e5b03161b2f..a63b9155b697 100644
> --- a/drivers/isdn/hisax/hfc_pci.c
> +++ b/drivers/isdn/hisax/hfc_pci.c
> @@ -128,7 +128,7 @@ reset_hfcpci(struct IsdnCardState *cs)
> Write_hfc(cs, HFCPCI_INT_M1, cs->hw.hfcpci.int_m1);
>
> /* Clear already pending ints */
> - if (Read_hfc(cs, HFCPCI_INT_S1));
> + (void) Read_hfc(cs, HFCPCI_INT_S1);
Why is this '(void)' necessary?
I see no warning without it.
--
Best Regards
Masahiro Yamada
^ permalink raw reply
* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Holger Hoffstätte @ 2018-10-17 19:11 UTC (permalink / raw)
To: Heiner Kallweit, David Miller, Realtek linux nic maintainers
Cc: netdev@vger.kernel.org
In-Reply-To: <62974f0f-1938-3635-69d4-204ed8c587b3@gmail.com>
On 10/17/18 20:12, Heiner Kallweit wrote:
> On 16.10.2018 23:17, Holger Hoffstätte wrote:
>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>>> in the interrupt status register. Under high load NAPI may not be
>>> able to process all data (work_done == budget) and it will schedule
>>> subsequent calls to the poll callback.
>>> rtl_ack_events() however resets the bits in the interrupt status
>>> register, therefore subsequent calls to rtl8169_poll() won't call
>>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>
>> Very interesting! Could this be the reason for the mysterious
>> hangs & resets we experienced when enabling BQL for r8169?
>> They happened more often with TSO/GSO enabled and several people
>> attempted to fix those hangs unsuccessfully; it was later reverted
>> and has been since then (#87cda7cb43).
>> If this bug has been there "forever" it might be tempting to
>> re-apply BQL and see what happens. Any chance you could give that
>> a try? I'll gladly test patches, just like I'll run this one.
>>
> After reading through the old mail threads regarding BQL on r8169
> I don't think the fix here is related.
> It seems that BQL on r8169 worked fine for most people, just one
> had problems on one of his systems. I assume the issue was specific
I continued to use the BQL patch in my private tree after it was reverted
and also had occasional timeouts, but *only* after I started playing
with ethtool to change offload settings. Without offloads or the BQL patch
everything has been rock-solid since then.
The other weird problem was that timeouts would occur on an otherwise
*completely idle* system. Since that occasionally borked my NFS server
over night I ultimately removed BQL as well. Rock-solid since then.
> I will apply the old BQL patch and see how it's on my system
> (with GRO and SG enabled).
I don't think it still applies cleanly, but if you cook up an updated
version I'll gladly test it.
Thanks! :)
Holger
^ permalink raw reply
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: Alexei Starovoitov @ 2018-10-17 19:08 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking,
Kernel Team
In-Reply-To: <20181017185347.GB4589@kernel.org>
On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu:
>> On 10/17/18 8:09 AM, David Ahern wrote:
>>> On 10/16/18 11:43 PM, Song Liu wrote:
>>>> I agree that processing events while recording has significant overhead.
>>>> In this case, perf user space need to know details about the the jited BPF
>>>> program. It is impossible to pass all these details to user space through
>>>> the relatively stable ring_buffer API. Therefore, some processing of the
>>>> data is necessary (get bpf prog_id from ring buffer, and then fetch program
>>>> details via BPF_OBJ_GET_INFO_BY_FD.
>>>>
>>>> I have some idea on processing important data with relatively low overhead.
>>>> Let me try implement it.
>>>>
>>>
>>> As I understand it, you want this series:
>>>
>>> kernel: add event to perf buffer on bpf prog load
>>>
>>> userspace: perf reads the event and grabs information about the program
>>> from the fd
>>>
>>> Is that correct?
>>>
>>> Userpsace is not awakened immediately when an event is added the the
>>> ring. It is awakened once the number of events crosses a watermark. That
>>> means there is an unknown - and potentially long - time window where the
>>> program can be unloaded before perf reads the event.
>
>>> So no matter what you do expecting perf record to be able to process the
>>> event quickly is an unreasonable expectation.
>
>> yes... unless we go with threaded model as Arnaldo suggested and use
>> single event as a watermark to wakeup our perf thread.
>> In such case there is still a race window between user space waking up
>> and doing _single_ bpf_get_fd_from_id() call to hold that prog
>> and some other process trying to instantly unload the prog it
>> just loaded.
>> I think such race window is extremely tiny and if perf misses
>> those load/unload events it's a good thing, since there is no chance
>> that normal pmu event samples would be happening during prog execution.
>
>> The alternative approach with no race window at all is to burden kernel
>> RECORD_* events with _all_ information about bpf prog. Which is jited
>> addresses, jited image itself, info about all subprogs, info about line
>> info, all BTF data, etc. As I said earlier I'm strongly against such
>> RECORD_* bloating.
>> Instead we need to find a way to process new RECORD_BPF events with
>> single prog_id field in perf user space with minimal race
>> and threaded approach sounds like a win to me.
>
> There is another alternative, I think: put just a content based hash,
> like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
> validator does the jit, etc, it stashes the content that
> BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
> the kernel right after getting stuff from sys_bpf() and preparing it for
> use, then we know that in (start, end) we have blob foo with content id,
> that we will use to retrieve information that augments what we know with
> just (start, end, id) and allows annotation, etc.
>
> That stash space for jitted stuff gets garbage collected from time to
> time or is even completely disabled if the user is not interested in
> such augmentation, just like one can do disabling perf's ~/.debug/
> directory hashed by build-id.
>
> I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
> is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
> cookie and we solve the other race we already have with kernel modules,
> DSOs, etc.
>
> I have mentioned this before, there were objections, perhaps this time I
> formulated in a different way that makes it more interesting?
that 'content based hash' we already have. It's called program tag.
and we already taught iovisor/bcc to stash that stuff into
/var/tmp/bcc/bpf_prog_TAG/ directory.
Unfortunately that approach didn't work.
JITed image only exists in the kernel. It's there only when
program is loaded and it's the one that matter the most for performance
analysis, since sample IPs are pointing into it.
Also the line info mapping that user space knows is not helping much
either, since verifier optimizes the instructions and then JIT
does more. The debug_info <-> JIT relationship must be preserved
by the kernel and returned to user space.
The only other non-race way is to keep all that info in the kernel after
program is unloaded, but that is even worse than bloating RECORD*s
^ permalink raw reply
* Re: [PATCH bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Song Liu @ 2018-10-17 19:07 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, netdev@vger.kernel.org, ast@kernel.org,
daniel@iogearbox.net, Kernel Team
In-Reply-To: <7c6474c0-27a2-bd6c-cd0f-c835856224cb@fb.com>
> On Oct 17, 2018, at 12:02 PM, Alexei Starovoitov <ast@fb.com> wrote:
>
> On 10/17/18 10:26 AM, Alexei Starovoitov wrote:
>> On Tue, Oct 16, 2018 at 10:56:05PM -0700, Song Liu wrote:
>>> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
>>> skb. This patch enables direct access of skb for these programs.
>>
>> The lack of direct packet access in CGROUP_SKB progs was
>> an unpleasant surprise to me, so thank you for fixing it,
>> but there are few issues with the patch. See below.
>>
>>> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
>>> to compute proper data_end for the BPF program.
>>>
>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>> ---
>>> kernel/bpf/cgroup.c | 4 ++++
>>> net/core/filter.c | 26 +++++++++++++++++++++++++-
>>> 2 files changed, 29 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>>> index 00f6ed2e4f9a..340d496f35bd 100644
>>> --- a/kernel/bpf/cgroup.c
>>> +++ b/kernel/bpf/cgroup.c
>>> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>>> save_sk = skb->sk;
>>> skb->sk = sk;
>>> __skb_push(skb, offset);
>>> +
>>> + /* compute pointers for the bpf prog */
>>> + bpf_compute_data_pointers(skb);
>>> +
>>> ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
>>> bpf_prog_run_save_cb);
>>> __skb_pull(skb, offset);
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 1a3ac6c46873..8b5a502e241f 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -5346,6 +5346,30 @@ static bool sk_filter_is_valid_access(int off, int size,
>>> return bpf_skb_is_valid_access(off, size, type, prog, info);
>>> }
>>>
>>> +static bool cg_skb_is_valid_access(int off, int size,
>>> + enum bpf_access_type type,
>>> + const struct bpf_prog *prog,
>>> + struct bpf_insn_access_aux *info)
>>> +{
>>> + if (type == BPF_WRITE)
>>> + return false;
>>
>> this disables writes into cb[0..4] that were allowed for cgroup_inet_* before.
>> One can argue that this may break existing progs,
>> but looking at the place where BPF_CGROUP_RUN_PROG_INET_INGRESS is called
>> it seems it's actually not correct in all cases to access cb there.
>> Just few lines down we call bpf_prog_run_save_cb() which save/restores
>> these 24 bytes.
>> So we have two option either add save/restore for INET_INGRESS only
>> or disable read and write access to cb[0..4] for CGROUP_SKB progs.
>> I prefer the former.
>>
>>> +
>>> + switch (off) {
>>> + case bpf_ctx_range(struct __sk_buff, len):
>>> + break;
>>> + case bpf_ctx_range(struct __sk_buff, data):
>>> + info->reg_type = PTR_TO_PACKET;
>>> + break;
>>> + case bpf_ctx_range(struct __sk_buff, data_end):
>>> + info->reg_type = PTR_TO_PACKET_END;
>>> + break;
>>> + default:
>>> + return false;
>>> + }
>>
>> this also enables access to a range of fields family..local_port.
>> It's ok to do for egress, but not for ingress unless we
>> add code similar to the bottom of sk_filter_trim_cap() that
>> inits skb->sk.
>>
>> above change also allows access to data_meta and flow_keys
>> which is not correct.
>>
>> Considering all that I'm proposing to fix INET_INGRESS call site
>> similar to code below it in sk_filter_trim_cap().
>> In particular to do:
>> struct sock *save_sk = skb->sk;
>> skb->sk = sk;
>> save and clear cb
>> BPF_CGROUP_RUN_PROG_INET_INGRESS
>> restore cb
>> skb->sk = save_sk;
>>
>> all of above can probaby be inside BPF_CGROUP_RUN_PROG_INET_INGRESS macro.
>> Then in this cg_skb_is_valid_access() allow access to data/data_end
>> and family..local_port range as well.
>> while disallowing access to flow_keys and data_meta.
>>
>> In patch 2 we gotta have tests for all these fields.
>>
>> Thoughts?
>
> chatted with Song offline.
> I completely misread 'return false' in the above as 'break'.
> The patch actually disables access to pkt_type, mark, queue_mapping
> and so on. Which is not correct either.
> Since tests were not failing we really need to improve this aspect
> of test coverage in test_verifier.c
>
> Also I missed that __cgroup_bpf_run_filter_skb() already
> does save_sk = skb->sk; skb->sk = sk;
> and bpf_prog_run_save_cb()
> So no issue in the existing code. That was false alarm.
> Revising the proposal...
> I think cg_skb_is_valid_access() can be made similar to
> lwt_is_valid_access().
> Allowing writes into mark, priority, cb[0..4]
> and read of data/data_end.
> In addition it's also ok to allow family..local_port range
> (unlike lwt where sk may not be present).
> and no access to data_meta and flow_keys.
Thanks Alexei! I will send v2 shortly.
Song
^ permalink raw reply
* Re: [PATCH bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Alexei Starovoitov @ 2018-10-17 19:02 UTC (permalink / raw)
To: Alexei Starovoitov, Song Liu
Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
Kernel Team
In-Reply-To: <20181017172636.57adi6yv7znuaqg5@ast-mbp.dhcp.thefacebook.com>
On 10/17/18 10:26 AM, Alexei Starovoitov wrote:
> On Tue, Oct 16, 2018 at 10:56:05PM -0700, Song Liu wrote:
>> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
>> skb. This patch enables direct access of skb for these programs.
>
> The lack of direct packet access in CGROUP_SKB progs was
> an unpleasant surprise to me, so thank you for fixing it,
> but there are few issues with the patch. See below.
>
>> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
>> to compute proper data_end for the BPF program.
>>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> kernel/bpf/cgroup.c | 4 ++++
>> net/core/filter.c | 26 +++++++++++++++++++++++++-
>> 2 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 00f6ed2e4f9a..340d496f35bd 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
>> save_sk = skb->sk;
>> skb->sk = sk;
>> __skb_push(skb, offset);
>> +
>> + /* compute pointers for the bpf prog */
>> + bpf_compute_data_pointers(skb);
>> +
>> ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
>> bpf_prog_run_save_cb);
>> __skb_pull(skb, offset);
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 1a3ac6c46873..8b5a502e241f 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -5346,6 +5346,30 @@ static bool sk_filter_is_valid_access(int off, int size,
>> return bpf_skb_is_valid_access(off, size, type, prog, info);
>> }
>>
>> +static bool cg_skb_is_valid_access(int off, int size,
>> + enum bpf_access_type type,
>> + const struct bpf_prog *prog,
>> + struct bpf_insn_access_aux *info)
>> +{
>> + if (type == BPF_WRITE)
>> + return false;
>
> this disables writes into cb[0..4] that were allowed for cgroup_inet_* before.
> One can argue that this may break existing progs,
> but looking at the place where BPF_CGROUP_RUN_PROG_INET_INGRESS is called
> it seems it's actually not correct in all cases to access cb there.
> Just few lines down we call bpf_prog_run_save_cb() which save/restores
> these 24 bytes.
> So we have two option either add save/restore for INET_INGRESS only
> or disable read and write access to cb[0..4] for CGROUP_SKB progs.
> I prefer the former.
>
>> +
>> + switch (off) {
>> + case bpf_ctx_range(struct __sk_buff, len):
>> + break;
>> + case bpf_ctx_range(struct __sk_buff, data):
>> + info->reg_type = PTR_TO_PACKET;
>> + break;
>> + case bpf_ctx_range(struct __sk_buff, data_end):
>> + info->reg_type = PTR_TO_PACKET_END;
>> + break;
>> + default:
>> + return false;
>> + }
>
> this also enables access to a range of fields family..local_port.
> It's ok to do for egress, but not for ingress unless we
> add code similar to the bottom of sk_filter_trim_cap() that
> inits skb->sk.
>
> above change also allows access to data_meta and flow_keys
> which is not correct.
>
> Considering all that I'm proposing to fix INET_INGRESS call site
> similar to code below it in sk_filter_trim_cap().
> In particular to do:
> struct sock *save_sk = skb->sk;
> skb->sk = sk;
> save and clear cb
> BPF_CGROUP_RUN_PROG_INET_INGRESS
> restore cb
> skb->sk = save_sk;
>
> all of above can probaby be inside BPF_CGROUP_RUN_PROG_INET_INGRESS macro.
> Then in this cg_skb_is_valid_access() allow access to data/data_end
> and family..local_port range as well.
> while disallowing access to flow_keys and data_meta.
>
> In patch 2 we gotta have tests for all these fields.
>
> Thoughts?
chatted with Song offline.
I completely misread 'return false' in the above as 'break'.
The patch actually disables access to pkt_type, mark, queue_mapping
and so on. Which is not correct either.
Since tests were not failing we really need to improve this aspect
of test coverage in test_verifier.c
Also I missed that __cgroup_bpf_run_filter_skb() already
does save_sk = skb->sk; skb->sk = sk;
and bpf_prog_run_save_cb()
So no issue in the existing code. That was false alarm.
Revising the proposal...
I think cg_skb_is_valid_access() can be made similar to
lwt_is_valid_access().
Allowing writes into mark, priority, cb[0..4]
and read of data/data_end.
In addition it's also ok to allow family..local_port range
(unlike lwt where sk may not be present).
and no access to data_meta and flow_keys.
^ permalink raw reply
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: Arnaldo Carvalho de Melo @ 2018-10-17 18:53 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David Ahern, Song Liu, Alexei Starovoitov, Peter Zijlstra,
Alexei Starovoitov, David S . Miller, Daniel Borkmann, Networking,
Kernel Team
In-Reply-To: <97707bf2-ace3-18a5-3621-f69122dd93df@fb.com>
Em Wed, Oct 17, 2018 at 04:36:08PM +0000, Alexei Starovoitov escreveu:
> On 10/17/18 8:09 AM, David Ahern wrote:
> > On 10/16/18 11:43 PM, Song Liu wrote:
> >> I agree that processing events while recording has significant overhead.
> >> In this case, perf user space need to know details about the the jited BPF
> >> program. It is impossible to pass all these details to user space through
> >> the relatively stable ring_buffer API. Therefore, some processing of the
> >> data is necessary (get bpf prog_id from ring buffer, and then fetch program
> >> details via BPF_OBJ_GET_INFO_BY_FD.
> >>
> >> I have some idea on processing important data with relatively low overhead.
> >> Let me try implement it.
> >>
> >
> > As I understand it, you want this series:
> >
> > kernel: add event to perf buffer on bpf prog load
> >
> > userspace: perf reads the event and grabs information about the program
> > from the fd
> >
> > Is that correct?
> >
> > Userpsace is not awakened immediately when an event is added the the
> > ring. It is awakened once the number of events crosses a watermark. That
> > means there is an unknown - and potentially long - time window where the
> > program can be unloaded before perf reads the event.
> > So no matter what you do expecting perf record to be able to process the
> > event quickly is an unreasonable expectation.
> yes... unless we go with threaded model as Arnaldo suggested and use
> single event as a watermark to wakeup our perf thread.
> In such case there is still a race window between user space waking up
> and doing _single_ bpf_get_fd_from_id() call to hold that prog
> and some other process trying to instantly unload the prog it
> just loaded.
> I think such race window is extremely tiny and if perf misses
> those load/unload events it's a good thing, since there is no chance
> that normal pmu event samples would be happening during prog execution.
> The alternative approach with no race window at all is to burden kernel
> RECORD_* events with _all_ information about bpf prog. Which is jited
> addresses, jited image itself, info about all subprogs, info about line
> info, all BTF data, etc. As I said earlier I'm strongly against such
> RECORD_* bloating.
> Instead we need to find a way to process new RECORD_BPF events with
> single prog_id field in perf user space with minimal race
> and threaded approach sounds like a win to me.
There is another alternative, I think: put just a content based hash,
like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
validator does the jit, etc, it stashes the content that
BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
the kernel right after getting stuff from sys_bpf() and preparing it for
use, then we know that in (start, end) we have blob foo with content id,
that we will use to retrieve information that augments what we know with
just (start, end, id) and allows annotation, etc.
That stash space for jitted stuff gets garbage collected from time to
time or is even completely disabled if the user is not interested in
such augmentation, just like one can do disabling perf's ~/.debug/
directory hashed by build-id.
I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
cookie and we solve the other race we already have with kernel modules,
DSOs, etc.
I have mentioned this before, there were objections, perhaps this time I
formulated in a different way that makes it more interesting?
- Arnaldo
^ permalink raw reply
* Re: Form sk_buff from DMA page
From: David Miller @ 2018-10-17 18:49 UTC (permalink / raw)
To: keyurp; +Cc: netdev
In-Reply-To: <BN6PR02MB2641F9CF498E60FB3AD98DE6B7FF0@BN6PR02MB2641.namprd02.prod.outlook.com>
From: Keyur Amrutbhai Patel <keyurp@xilinx.com>
Date: Wed, 17 Oct 2018 17:32:33 +0000
> Can anyone help me on how to form sk_buff from DMA page? Basically I
> get complete packet from DMA as single page.
Read any driver that calls build_skb().
^ permalink raw reply
* Re: ipmr, ip6mr: Unite dumproute flows
From: Nikolay Aleksandrov @ 2018-10-17 18:49 UTC (permalink / raw)
To: Colin Ian King, Yuval Mintz
Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
netdev@vger.kernel.org
In-Reply-To: <c3d958d0-9916-48d9-ad8d-076cda75c93f@canonical.com>
On 17/10/2018 21:22, Colin Ian King wrote:
> Hi,
>
> Static analysis on linux-next has picked up a potential bug with commit:
>
> commit 7b0db85737db3f4d76b2a412e4f19eae59b8b494
> Author: Yuval Mintz <yuvalm@mellanox.com>
> Date: Wed Feb 28 23:29:39 2018 +0200
>
> ipmr, ip6mr: Unite dumproute flows
>
> in function mr_table_dump(), s_e is and never seems to change, so the
> check if (e < s_e) is never true: See code below, as annotated by
> CoverityScan:
>
> 317 }
> assignment: Assigning: e = 0U.
>
> 318 e = 0;
> assignment: Assigning: s_e = 0U.
>
> 319 s_e = 0;
> 320
> 321 spin_lock_bh(lock);
> 322 list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
> at_least: At condition e < s_e, the value of e must be at least 0.
> const: At condition e < s_e, the value of s_e must be equal to 0.
> dead_error_condition: The condition e < s_e cannot be true.
>
> 323 if (e < s_e)
>
> CID 1474511 (#1 of 1): Logically dead code (DEADCODE)
> dead_error_line: Execution cannot reach this statement: goto next_entry2;.
>
> 324 goto next_entry2;
> 325 if (filter->dev &&
> 326 !mr_mfc_uses_dev(mrt, mfc, filter->dev))
> 327 goto next_entry2;
> 328
> 329 err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
> 330 cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
> 331 if (err < 0) {
> 332 spin_unlock_bh(lock);
> 333 goto out;
> 334 }
> 335next_entry2:
>
> incr: Incrementing e. The value of e is now 1.
> incr: Incrementing e. The value of e is now 2.
>
> 336 e++;
>
>
> Note that the zero'ing of e and s_e for ipmr and ip6mr was added in by
> earlier commits:
>
> commit 8fb472c09b9df478a062eacc7841448e40fc3c17
> Author: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Date: Thu Jan 12 15:53:33 2017 +0100
>
> ipmr: improve hash scalability
>
> commit 87c418bf1323d57140f4b448715f64de3fbb7e91
> Author: Yuval Mintz <yuvalm@mellanox.com>
> Date: Wed Feb 28 23:29:31 2018 +0200
>
> ip6mr: Align hash implementation to ipmr
>
> Anyhow, this looks incorrect, but I'm not familiar with this code to
> suggest the correct fix.
>
> Colin
>
Indeed, there's a case where we might miss an unresolved entry due to the zeroing.
I'll send a fix shortly after testing.
Thanks,
Nik
^ permalink raw reply
* Fwd: Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-17 18:48 UTC (permalink / raw)
To: Tomas Szepe; +Cc: netdev@vger.kernel.org, Holger Hoffstätte
In-Reply-To: <62974f0f-1938-3635-69d4-204ed8c587b3@gmail.com>
Tomas,
more than three years back you reported network problems after BQL
was added to the r8169 driver. Due to this the change was reverted.
Now the discussion to add BQL popped up again.
You mentioned that the issue exists on one of your systems only.
Therefore it could be an issue specific to a particular chip version.
I'd be interested in the chip version of the affected system.
You linked to another similar report, there the chip version was:
r8169 0000:10:00.0 eth0: RTL8168c/8111c at 0xf8130000, 00:e0:4c:68:48:d2, XID 1c4000c0 IRQ 29
In case you still have the affected system or at least the old dmesg
logs, I'd appreciate if you could let me know the quoted line from
dmesg output.
Thanks a lot,
Heiner
-------- Forwarded Message --------
Subject: Re: [PATCH net] r8169: fix NAPI handling under high load
Date: Wed, 17 Oct 2018 20:12:48 +0200
From: Heiner Kallweit <hkallweit1@gmail.com>
To: Holger Hoffstätte <holger@applied-asynchrony.com>, David Miller <davem@davemloft.net>, Realtek linux nic maintainers <nic_swsd@realtek.com>
CC: netdev@vger.kernel.org <netdev@vger.kernel.org>
On 16.10.2018 23:17, Holger Hoffstätte wrote:
> On 10/16/18 22:37, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>> in the interrupt status register. Under high load NAPI may not be
>> able to process all data (work_done == budget) and it will schedule
>> subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status
>> register, therefore subsequent calls to rtl8169_poll() won't call
>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>
> Very interesting! Could this be the reason for the mysterious
> hangs & resets we experienced when enabling BQL for r8169?
> They happened more often with TSO/GSO enabled and several people
> attempted to fix those hangs unsuccessfully; it was later reverted
> and has been since then (#87cda7cb43).
> If this bug has been there "forever" it might be tempting to
> re-apply BQL and see what happens. Any chance you could give that
> a try? I'll gladly test patches, just like I'll run this one.
>
After reading through the old mail threads regarding BQL on r8169
I don't think the fix here is related.
It seems that BQL on r8169 worked fine for most people, just one
had problems on one of his systems. I assume the issue was specific
to one chip version. From the ~ 50 chip versions supported by
r8169 more or less each one requires its own quirks.
If we're lucky the chip-version-specific issue has been fixed in
the meantime and we can simply apply the old BQL patch again.
If it turns out that certain chip versions can't be used with BQL,
then we can disable the feature for these chip versions instead
of removing the feature completely.
I will apply the old BQL patch and see how it's on my system
(with GRO and SG enabled).
> cheers
> Holger
>
Heiner
^ permalink raw reply
* Re: [PATCH net-next V2 6/8] vhost: packed ring support
From: Jason Wang @ 2018-10-18 2:44 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm, netdev, linux-kernel, virtualization, maxime.coquelin, wexu
In-Reply-To: <20181015062241-mutt-send-email-mst@kernel.org>
On 2018/10/15 下午6:25, Michael S. Tsirkin wrote:
> On Mon, Oct 15, 2018 at 10:51:06AM +0800, Jason Wang wrote:
>> On 2018年10月15日 10:43, Michael S. Tsirkin wrote:
>>> On Mon, Oct 15, 2018 at 10:22:33AM +0800, Jason Wang wrote:
>>>> On 2018年10月13日 01:23, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 12, 2018 at 10:32:44PM +0800, Tiwei Bie wrote:
>>>>>> On Mon, Jul 16, 2018 at 11:28:09AM +0800, Jason Wang wrote:
>>>>>> [...]
>>>>>>> @@ -1367,10 +1397,48 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>>>>>>> vq->last_avail_idx = s.num;
>>>>>>> /* Forget the cached index value. */
>>>>>>> vq->avail_idx = vq->last_avail_idx;
>>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>>> + vq->last_avail_wrap_counter = wrap_counter;
>>>>>>> + vq->avail_wrap_counter = vq->last_avail_wrap_counter;
>>>>>>> + }
>>>>>>> break;
>>>>>>> case VHOST_GET_VRING_BASE:
>>>>>>> s.index = idx;
>>>>>>> s.num = vq->last_avail_idx;
>>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
>>>>>>> + s.num |= vq->last_avail_wrap_counter << 31;
>>>>>>> + if (copy_to_user(argp, &s, sizeof(s)))
>>>>>>> + r = -EFAULT;
>>>>>>> + break;
>>>>>>> + case VHOST_SET_VRING_USED_BASE:
>>>>>>> + /* Moving base with an active backend?
>>>>>>> + * You don't want to do that.
>>>>>>> + */
>>>>>>> + if (vq->private_data) {
>>>>>>> + r = -EBUSY;
>>>>>>> + break;
>>>>>>> + }
>>>>>>> + if (copy_from_user(&s, argp, sizeof(s))) {
>>>>>>> + r = -EFAULT;
>>>>>>> + break;
>>>>>>> + }
>>>>>>> + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
>>>>>>> + wrap_counter = s.num >> 31;
>>>>>>> + s.num &= ~(1 << 31);
>>>>>>> + }
>>>>>>> + if (s.num > 0xffff) {
>>>>>>> + r = -EINVAL;
>>>>>>> + break;
>>>>>>> + }
>>>>>> Do we want to put wrap_counter at bit 15?
>>>>> I think I second that - seems to be consistent with
>>>>> e.g. event suppression structure and the proposed
>>>>> extension to driver notifications.
>>>> Ok, I assumes packed virtqueue support 64K but looks not. I can change it to
>>>> bit 15 and GET_VRING_BASE need to be changed as well.
>>>>
>>>>>> If put wrap_counter at bit 31, the check (s.num > 0xffff)
>>>>>> won't be able to catch the illegal index 0x8000~0xffff for
>>>>>> packed ring.
>>>>>>
>>>> Do we need to clarify this in the spec?
>>> Isn't this all internal vhost stuff?
>> I meant the illegal index 0x8000-0xffff.
> It does say packed virtqueues support up to 2 15 entries each.
>
> But yes we can add a requirement that devices do not expose
> larger rings. Split does not support 2**16 either, right?
> With 2**16 enties avail index becomes 0 and ring looks empty.
>
Yes, so it's better to clarify this in the spec.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* RE: Form sk_buff from DMA page
From: Keyur Amrutbhai Patel @ 2018-10-17 18:45 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev@vger.kernel.org
In-Reply-To: <20181017105050.695eab3b@shemminger-XPS-13-9360>
Hi Stephen,
Thanks you for answering my query.
I am looking into build_skb it seems addressing to my query.
Appreciated..
Any good article on sk_buff which explains everything about it?
Regards,
Keyur
-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Wednesday, October 17, 2018 11:21 PM
To: Keyur Amrutbhai Patel <keyurp@xilinx.com>
Cc: netdev@vger.kernel.org
Subject: Re: Form sk_buff from DMA page
EXTERNAL EMAIL
On Wed, 17 Oct 2018 17:32:33 +0000
Keyur Amrutbhai Patel <keyurp@xilinx.com> wrote:
> Hi,
>
> Can anyone help me on how to form sk_buff from DMA page? Basically I get complete packet from DMA as single page.
>
> Regards,
> Keyur
> This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.
Did you look at build_skb?
^ permalink raw reply
* Re: ipmr, ip6mr: Unite dumproute flows
From: Colin Ian King @ 2018-10-17 18:22 UTC (permalink / raw)
To: Yuval Mintz, Nikolay Aleksandrov
Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
netdev@vger.kernel.org
Hi,
Static analysis on linux-next has picked up a potential bug with commit:
commit 7b0db85737db3f4d76b2a412e4f19eae59b8b494
Author: Yuval Mintz <yuvalm@mellanox.com>
Date: Wed Feb 28 23:29:39 2018 +0200
ipmr, ip6mr: Unite dumproute flows
in function mr_table_dump(), s_e is and never seems to change, so the
check if (e < s_e) is never true: See code below, as annotated by
CoverityScan:
317 }
assignment: Assigning: e = 0U.
318 e = 0;
assignment: Assigning: s_e = 0U.
319 s_e = 0;
320
321 spin_lock_bh(lock);
322 list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
at_least: At condition e < s_e, the value of e must be at least 0.
const: At condition e < s_e, the value of s_e must be equal to 0.
dead_error_condition: The condition e < s_e cannot be true.
323 if (e < s_e)
CID 1474511 (#1 of 1): Logically dead code (DEADCODE)
dead_error_line: Execution cannot reach this statement: goto next_entry2;.
324 goto next_entry2;
325 if (filter->dev &&
326 !mr_mfc_uses_dev(mrt, mfc, filter->dev))
327 goto next_entry2;
328
329 err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
330 cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
331 if (err < 0) {
332 spin_unlock_bh(lock);
333 goto out;
334 }
335next_entry2:
incr: Incrementing e. The value of e is now 1.
incr: Incrementing e. The value of e is now 2.
336 e++;
Note that the zero'ing of e and s_e for ipmr and ip6mr was added in by
earlier commits:
commit 8fb472c09b9df478a062eacc7841448e40fc3c17
Author: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Thu Jan 12 15:53:33 2017 +0100
ipmr: improve hash scalability
commit 87c418bf1323d57140f4b448715f64de3fbb7e91
Author: Yuval Mintz <yuvalm@mellanox.com>
Date: Wed Feb 28 23:29:31 2018 +0200
ip6mr: Align hash implementation to ipmr
Anyhow, this looks incorrect, but I'm not familiar with this code to
suggest the correct fix.
Colin
^ permalink raw reply
* Re: [PATCH net] r8169: fix NAPI handling under high load
From: Heiner Kallweit @ 2018-10-17 18:12 UTC (permalink / raw)
To: Holger Hoffstätte, David Miller,
Realtek linux nic maintainers
Cc: netdev@vger.kernel.org
In-Reply-To: <ada39488-8e9c-d6a8-461e-672642497db2@applied-asynchrony.com>
On 16.10.2018 23:17, Holger Hoffstätte wrote:
> On 10/16/18 22:37, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set
>> in the interrupt status register. Under high load NAPI may not be
>> able to process all data (work_done == budget) and it will schedule
>> subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status
>> register, therefore subsequent calls to rtl8169_poll() won't call
>> rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>
> Very interesting! Could this be the reason for the mysterious
> hangs & resets we experienced when enabling BQL for r8169?
> They happened more often with TSO/GSO enabled and several people
> attempted to fix those hangs unsuccessfully; it was later reverted
> and has been since then (#87cda7cb43).
> If this bug has been there "forever" it might be tempting to
> re-apply BQL and see what happens. Any chance you could give that
> a try? I'll gladly test patches, just like I'll run this one.
>
After reading through the old mail threads regarding BQL on r8169
I don't think the fix here is related.
It seems that BQL on r8169 worked fine for most people, just one
had problems on one of his systems. I assume the issue was specific
to one chip version. From the ~ 50 chip versions supported by
r8169 more or less each one requires its own quirks.
If we're lucky the chip-version-specific issue has been fixed in
the meantime and we can simply apply the old BQL patch again.
If it turns out that certain chip versions can't be used with BQL,
then we can disable the feature for these chip versions instead
of removing the feature completely.
I will apply the old BQL patch and see how it's on my system
(with GRO and SG enabled).
> cheers
> Holger
>
Heiner
^ permalink raw reply
* Re: [PATCH net] sctp: fix the data size calculation in sctp_data_size
From: Marcelo Ricardo Leitner @ 2018-10-17 18:00 UTC (permalink / raw)
To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <c407b682a424f0441d9b9a29ef6296c8e9824b64.1539781887.git.lucien.xin@gmail.com>
On Wed, Oct 17, 2018 at 09:11:27PM +0800, Xin Long wrote:
> sctp data size should be calculated by subtracting data chunk header's
> length from chunk_hdr->length, not just data header.
>
> Fixes: 668c9beb9020 ("sctp: implement assign_number for sctp_stream_interleave")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> include/net/sctp/sm.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
> index 5ef1bad..9e3d327 100644
> --- a/include/net/sctp/sm.h
> +++ b/include/net/sctp/sm.h
> @@ -347,7 +347,7 @@ static inline __u16 sctp_data_size(struct sctp_chunk *chunk)
> __u16 size;
>
> size = ntohs(chunk->chunk_hdr->length);
> - size -= sctp_datahdr_len(&chunk->asoc->stream);
> + size -= sctp_datachk_len(&chunk->asoc->stream);
>
> return size;
> }
> --
> 2.1.0
>
^ permalink raw reply
* Re: Form sk_buff from DMA page
From: Stephen Hemminger @ 2018-10-17 17:50 UTC (permalink / raw)
To: Keyur Amrutbhai Patel; +Cc: netdev@vger.kernel.org
In-Reply-To: <BN6PR02MB2641F9CF498E60FB3AD98DE6B7FF0@BN6PR02MB2641.namprd02.prod.outlook.com>
On Wed, 17 Oct 2018 17:32:33 +0000
Keyur Amrutbhai Patel <keyurp@xilinx.com> wrote:
> Hi,
>
> Can anyone help me on how to form sk_buff from DMA page? Basically I get complete packet from DMA as single page.
>
> Regards,
> Keyur
> This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.
Did you look at build_skb?
^ permalink raw reply
* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Martin Lau @ 2018-10-17 17:50 UTC (permalink / raw)
To: Yonghong Song
Cc: Edward Cree, Alexei Starovoitov, daniel@iogearbox.net,
netdev@vger.kernel.org, Kernel Team
In-Reply-To: <0d0534e8-d05a-5541-2380-58a4ea36551b@fb.com>
On Wed, Oct 17, 2018 at 10:25:21AM -0700, Yonghong Song wrote:
>
>
> On 10/17/18 9:13 AM, Edward Cree wrote:
> > On 17/10/18 08:23, Yonghong Song wrote:
> >> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
> >> support to the type section. BTF_KIND_FUNC_PROTO is used
> >> to specify the type of a function pointer. With this,
> >> BTF has a complete set of C types (except float).
> >>
> >> BTF_KIND_FUNC is used to specify the signature of a
> >> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
> >> by another type, e.g., a pointer type, and BTF_KIND_FUNC
> >> type cannot be referenced by another type.
> > Why are distinct kinds created for these? A function body is
> > a value of function type, and since there's no way (in C) to
> > declare a variable of function type (only pointer-to-
> > function), any declaration of function type must necessarily
> > be a BTF_KIND_FUNC, whereas any other reference to a function
> > type (e.g. a declaration of type pointer to function type)
> > must, as you state above, be a BTF_KIND_FUNC_PROTO.
> > In fact, you can tell the difference just from name_off, since
> > a (C-legal) BTF_KIND_FUNC_PROTO will always be anonymous (as
> > the pointee of a pointer type), while a BTF_KIND_FUNC will
> > have the name of the subprogram.
>
> What you stated is true, BTF_KIND_FUNC_PROTO corresponds to
> dwarf subroutine tag which has no name while BTF_KIND_FUNC
> must have a valid name. The original design is to have both
> since they are corresponding to different dwarf constructs.
>
> Martin, what do you think?
I prefer to have separate kinds. We need a way to distinguish them.
For example, the BTF verifier is checking it. Having two kinds is
cleaner instead of resorting to other hints from 'struct btf_type'.
We don't lack of bits for kind.
>
> >
> > -Ed
> >
^ permalink raw reply
* For your photos 25
From: Jenny @ 2018-10-17 9:05 UTC (permalink / raw)
To: netdev
We provide photoshop services to some of the companies from around the
world.
Some online stores use our services for retouching portraits, jewelry,
apparels, furnitures etc.
Here are the details of what we provide:
Clipping path
Deep etching
Image masking
Portrait retouching
Jewelry retouching
Fashion retouching
Please reply back for further info.
We can provide testing for your photos if needed.
Thanks,
Jenny
^ permalink raw reply
* Re: [PATCH bpf-next v2 4/8] tls: convert to generic sk_msg interface
From: Daniel Borkmann @ 2018-10-17 17:32 UTC (permalink / raw)
To: Eric Dumazet, alexei.starovoitov; +Cc: john.fastabend, davejwatson, netdev
In-Reply-To: <4b403534-2273-5931-753b-c16a6e2624a0@gmail.com>
On 10/17/2018 04:55 PM, Eric Dumazet wrote:
> On 10/12/2018 05:45 PM, Daniel Borkmann wrote:
>> Convert kTLS over to make use of sk_msg interface for plaintext and
>> encrypted scattergather data, so it reuses all the sk_msg helpers
>> and data structure which later on in a second step enables to glue
>> this to BPF.
>>
>> This also allows to remove quite a bit of open coded helpers which
>> are covered by the sk_msg API. Recent changes in kTLs 80ece6a03aaf
>> ("tls: Remove redundant vars from tls record structure") and
>> 4e6d47206c32 ("tls: Add support for inplace records encryption")
>> changed the data path handling a bit; while we've kept the latter
>> optimization intact, we had to undo the former change to better
>> fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
>> been brought back and are linked into the sk_msg sgs. Now the kTLS
>> record contains a msg_plaintext and msg_encrypted sk_msg each.
>>
>> In the original code, the zerocopy_from_iter() has been used out
>> of TX but also RX path. For the strparser skb-based RX path,
>> we've left the zerocopy_from_iter() in decrypt_internal() mostly
>> untouched, meaning it has been moved into tls_setup_from_iter()
>> with charging logic removed (as not used from RX). Given RX path
>> is not based on sk_msg objects, we haven't pursued setting up a
>> dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
>> could be an option to prusue in a later step.
>>
>> Joint work with John.
>
> Something looks broken and needs this fix :
>
> (I have to run, I might submit this formally later if needed)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 3c0e44cb811a4bed2e02aa107854d74eb3ae7358..3e55dedc038b22132b5b6d042bfedda9e4f3157e 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -175,12 +175,12 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
> }
> }
>
> + if (!sk_has_psock(sk)) {
> + ret = -EBUSY;
> + goto out_progs;
> + }
> psock = sk_psock_get(sk);
Thanks for reporting! Yep, so this is almost correct, as-is above this would
always let sock_map_link() bail out. Issue is we take the ref inside sk_psock_get()
before making sure it's a psock, so if the sk user data is present but not a
psock we need to bail out with EBUSY, but if it's NULL or already a valid psock
we may proceed; we'll see to cook something up this evening, thanks!
> if (psock) {
> - if (!sk_has_psock(sk)) {
> - ret = -EBUSY;
> - goto out_progs;
> - }
> if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
> (skb_progs && READ_ONCE(psock->progs.skb_parser))) {
> sk_psock_put(sk, psock);
>
>
> BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
> BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
> Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
>
> CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
> print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
> kasan_report_error mm/kasan/report.c:354 [inline]
> kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
> check_memory_region_inline mm/kasan/kasan.c:260 [inline]
> check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
> kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
> atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
> refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
> sk_psock_get include/linux/skmsg.h:379 [inline]
> sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
> sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
> sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
> map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
> __do_sys_bpf kernel/bpf/syscall.c:2400 [inline]
> __se_sys_bpf kernel/bpf/syscall.c:2371 [inline]
> __x64_sys_bpf+0x32d/0x510 kernel/bpf/syscall.c:2371
> do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x457569
> Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007f71507e0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457569
> RDX: 0000000000000020 RSI: 0000000020000180 RDI: 0000000000000002
> RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f71507e16d4
> R13: 00000000004bd926 R14: 00000000004cc2b0 R15: 00000000ffffffff
>
> Allocated by task 22387:
> save_stack+0x43/0xd0 mm/kasan/kasan.c:448
> set_track mm/kasan/kasan.c:460 [inline]
> kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
> kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
> kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554
> kmem_cache_zalloc include/linux/slab.h:697 [inline]
> kcm_attach net/kcm/kcmsock.c:1405 [inline]
> kcm_attach_ioctl net/kcm/kcmsock.c:1490 [inline]
> kcm_ioctl+0xca7/0x18c0 net/kcm/kcmsock.c:1696
> sock_do_ioctl+0xeb/0x420 net/socket.c:950
> sock_ioctl+0x313/0x690 net/socket.c:1074
> vfs_ioctl fs/ioctl.c:46 [inline]
> file_ioctl fs/ioctl.c:501 [inline]
> do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
> ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
> __do_sys_ioctl fs/ioctl.c:709 [inline]
> __se_sys_ioctl fs/ioctl.c:707 [inline]
> __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
> do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Freed by task 0:
> (stack is not available)
>
> The buggy address belongs to the object at ffff88019548bc00
> which belongs to the cache kcm_psock_cache of size 544
> The buggy address is located 56 bytes to the right of
> 544-byte region [ffff88019548bc00, ffff88019548be20)
> The buggy address belongs to the page:
> page:ffffea0006552280 count:1 mapcount:0 mapping:ffff8801cb789b40 index:0x0 compound_mapcount: 0
> flags: 0x2fffc0000008100(slab|head)
> raw: 02fffc0000008100 ffff8801cb78c748 ffff8801cb78c748 ffff8801cb789b40
> kobject: 'loop2' (0000000086df2c8c): kobject_uevent_env
> raw: 0000000000000000 ffff88019548a080 000000010000000b 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff88019548bd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ffff88019548bd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ffff88019548be00: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
> ^
> ffff88019548be80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff88019548bf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
>
^ permalink raw reply
* Form sk_buff from DMA page
From: Keyur Amrutbhai Patel @ 2018-10-17 17:32 UTC (permalink / raw)
To: netdev@vger.kernel.org
Hi,
Can anyone help me on how to form sk_buff from DMA page? Basically I get complete packet from DMA as single page.
Regards,
Keyur
This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.
^ permalink raw reply
* Re: [PATCH bpf-next v2 13/13] tools/bpf: bpftool: add support for jited func types
From: Yonghong Song @ 2018-10-17 17:28 UTC (permalink / raw)
To: Edward Cree, Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
netdev@vger.kernel.org
Cc: Kernel Team
In-Reply-To: <0b60d21b-dcbf-a0df-7b2c-b0506463af1f@solarflare.com>
On 10/17/18 4:11 AM, Edward Cree wrote:
> On 17/10/18 08:24, Yonghong Song wrote:
>> This patch added support to print function signature
>> if btf func_info is available. Note that ksym
>> now uses function name instead of prog_name as
>> prog_name has a limit of 16 bytes including
>> ending '\0'.
>>
>> The following is a sample output for selftests
>> test_btf with file test_btf_haskv.o:
>>
>> $ bpftool prog dump jited id 1
>> int _dummy_tracepoint(struct dummy_tracepoint_args * ):
>> bpf_prog_b07ccb89267cf242__dummy_tracepoint:
>> 0: push %rbp
>> 1: mov %rsp,%rbp
>> ......
>> 3c: add $0x28,%rbp
>> 40: leaveq
>> 41: retq
>>
>> int test_long_fname_1(struct dummy_tracepoint_args * ):
>> bpf_prog_2dcecc18072623fc_test_long_fname_1:
>> 0: push %rbp
>> 1: mov %rsp,%rbp
>> ......
>> 3a: add $0x28,%rbp
>> 3e: leaveq
>> 3f: retq
>>
>> int test_long_fname_2(struct dummy_tracepoint_args * ):
>> bpf_prog_89d64e4abf0f0126_test_long_fname_2:
>> 0: push %rbp
>> 1: mov %rsp,%rbp
>> ......
>> 80: add $0x28,%rbp
>> 84: leaveq
>> 85: retq
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> tools/bpf/bpftool/btf_dumper.c | 96 ++++++++++++++++++++++++++++++++++
>> tools/bpf/bpftool/main.h | 2 +
>> tools/bpf/bpftool/prog.c | 54 +++++++++++++++++++
>> 3 files changed, 152 insertions(+)
>>
>> diff --git a/tools/bpf/bpftool/btf_dumper.c b/tools/bpf/bpftool/btf_dumper.c
>> index 55bc512a1831..a31df4202335 100644
>> --- a/tools/bpf/bpftool/btf_dumper.c
>> +++ b/tools/bpf/bpftool/btf_dumper.c
>> @@ -249,3 +249,99 @@ int btf_dumper_type(const struct btf_dumper *d, __u32 type_id,
>> {
>> return btf_dumper_do_type(d, type_id, 0, data);
>> }
>> +
>> +#define BTF_PRINT_STRING(str) \
>> + { \
>> + pos += snprintf(func_sig + pos, size - pos, str); \
>> + if (pos >= size) \
>> + return -1; \
>> + }
> Usual kernel practice for this sort of macro is to use
> do { \
> } while(0)
> to ensure correct behaviour if the macro is used within another control
> flow statement, e.g.
> if (x)
> BTF_PRINT_STRING(x);
> else
> do_something_else();
> will not compile with the bare braces as the else will be detached.
Thanks for the review! Will change to use the "do while" format
as you suggested.
>> +#define BTF_PRINT_ONE_ARG(fmt, arg) \
>> + { \
>> + pos += snprintf(func_sig + pos, size - pos, fmt, arg); \
>> + if (pos >= size) \
>> + return -1; \
>> + }
> Any reason for not just using a variadic macro?
No particular reason. I will try to use it in the next revision.
>> +#define BTF_PRINT_TYPE_ONLY(type) \
>> + { \
>> + pos = __btf_dumper_type_only(btf, type, func_sig, \
>> + pos, size); \
>> + if (pos == -1) \
>> + return -1; \
>> + }
>> +
>> +static int __btf_dumper_type_only(struct btf *btf, __u32 type_id,
>> + char *func_sig, int pos, int size)
>> +{
>> + const struct btf_type *t = btf__type_by_id(btf, type_id);
>> + const struct btf_array *array;
>> + int i, vlen;
>> +
>> + switch (BTF_INFO_KIND(t->info)) {
>> + case BTF_KIND_INT:
>> + BTF_PRINT_ONE_ARG("%s ",
>> + btf__name_by_offset(btf, t->name_off));
>> + break;
>> + case BTF_KIND_STRUCT:
>> + BTF_PRINT_ONE_ARG("struct %s ",
>> + btf__name_by_offset(btf, t->name_off));
>> + break;
>> + case BTF_KIND_UNION:
>> + BTF_PRINT_ONE_ARG("union %s ",
>> + btf__name_by_offset(btf, t->name_off));
>> + break;
>> + case BTF_KIND_ENUM:
>> + BTF_PRINT_ONE_ARG("enum %s ",
>> + btf__name_by_offset(btf, t->name_off));
>> + break;
>> + case BTF_KIND_ARRAY:
>> + array = (struct btf_array *)(t + 1);
>> + BTF_PRINT_TYPE_ONLY(array->type);
>> + BTF_PRINT_ONE_ARG("[%d]", array->nelems);
>> + break;
>> + case BTF_KIND_PTR:
>> + BTF_PRINT_TYPE_ONLY(t->type);
>> + BTF_PRINT_STRING("* ");
>> + break;
>> + case BTF_KIND_UNKN:
>> + case BTF_KIND_FWD:
>> + case BTF_KIND_TYPEDEF:
>> + return -1;
>> + case BTF_KIND_VOLATILE:
>> + BTF_PRINT_STRING("volatile ");
>> + BTF_PRINT_TYPE_ONLY(t->type);
>> + break;
>> + case BTF_KIND_CONST:
>> + BTF_PRINT_STRING("const ");
>> + BTF_PRINT_TYPE_ONLY(t->type);
>> + break;
>> + case BTF_KIND_RESTRICT:
>> + BTF_PRINT_STRING("restrict ");
>> + BTF_PRINT_TYPE_ONLY(t->type);
>> + break;
>> + case BTF_KIND_FUNC:
>> + case BTF_KIND_FUNC_PROTO:
>> + BTF_PRINT_TYPE_ONLY(t->type);
>> + BTF_PRINT_ONE_ARG("%s(", btf__name_by_offset(btf, t->name_off));
>> + vlen = BTF_INFO_VLEN(t->info);
>> + for (i = 0; i < vlen; i++) {
>> + __u32 arg_type = ((__u32 *)(t + 1))[i];
>> +
>> + BTF_PRINT_TYPE_ONLY(arg_type);
>> + if (i != (vlen - 1))
>> + BTF_PRINT_STRING(", ");
>> + }
> In this kind of loop I find it cleaner to print the comma before the item;
> that way the test becomes i != 0. Thus:
> for (i = 0; i < vlen; i++) {
> __u32 arg_type = ((__u32 *)(t + 1))[i];
>
> if (i)
> BTF_PRINT_STRING(", ");
> BTF_PRINT_TYPE_ONLY(arg_type);
> }
Good suggestion. Will make change in the next revision.
>
> -Ed
>
^ permalink raw reply
* Re: [PATCH bpf-next 1/2] bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
From: Alexei Starovoitov @ 2018-10-17 17:26 UTC (permalink / raw)
To: Song Liu; +Cc: netdev, ast, daniel, kernel-team
In-Reply-To: <20181017055606.353449-2-songliubraving@fb.com>
On Tue, Oct 16, 2018 at 10:56:05PM -0700, Song Liu wrote:
> BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
> skb. This patch enables direct access of skb for these programs.
The lack of direct packet access in CGROUP_SKB progs was
an unpleasant surprise to me, so thank you for fixing it,
but there are few issues with the patch. See below.
> In __cgroup_bpf_run_filter_skb(), bpf_compute_data_pointers() is called
> to compute proper data_end for the BPF program.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> kernel/bpf/cgroup.c | 4 ++++
> net/core/filter.c | 26 +++++++++++++++++++++++++-
> 2 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 00f6ed2e4f9a..340d496f35bd 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -566,6 +566,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
> save_sk = skb->sk;
> skb->sk = sk;
> __skb_push(skb, offset);
> +
> + /* compute pointers for the bpf prog */
> + bpf_compute_data_pointers(skb);
> +
> ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
> bpf_prog_run_save_cb);
> __skb_pull(skb, offset);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1a3ac6c46873..8b5a502e241f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5346,6 +5346,30 @@ static bool sk_filter_is_valid_access(int off, int size,
> return bpf_skb_is_valid_access(off, size, type, prog, info);
> }
>
> +static bool cg_skb_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + if (type == BPF_WRITE)
> + return false;
this disables writes into cb[0..4] that were allowed for cgroup_inet_* before.
One can argue that this may break existing progs,
but looking at the place where BPF_CGROUP_RUN_PROG_INET_INGRESS is called
it seems it's actually not correct in all cases to access cb there.
Just few lines down we call bpf_prog_run_save_cb() which save/restores
these 24 bytes.
So we have two option either add save/restore for INET_INGRESS only
or disable read and write access to cb[0..4] for CGROUP_SKB progs.
I prefer the former.
> +
> + switch (off) {
> + case bpf_ctx_range(struct __sk_buff, len):
> + break;
> + case bpf_ctx_range(struct __sk_buff, data):
> + info->reg_type = PTR_TO_PACKET;
> + break;
> + case bpf_ctx_range(struct __sk_buff, data_end):
> + info->reg_type = PTR_TO_PACKET_END;
> + break;
> + default:
> + return false;
> + }
this also enables access to a range of fields family..local_port.
It's ok to do for egress, but not for ingress unless we
add code similar to the bottom of sk_filter_trim_cap() that
inits skb->sk.
above change also allows access to data_meta and flow_keys
which is not correct.
Considering all that I'm proposing to fix INET_INGRESS call site
similar to code below it in sk_filter_trim_cap().
In particular to do:
struct sock *save_sk = skb->sk;
skb->sk = sk;
save and clear cb
BPF_CGROUP_RUN_PROG_INET_INGRESS
restore cb
skb->sk = save_sk;
all of above can probaby be inside BPF_CGROUP_RUN_PROG_INET_INGRESS macro.
Then in this cg_skb_is_valid_access() allow access to data/data_end
and family..local_port range as well.
while disallowing access to flow_keys and data_meta.
In patch 2 we gotta have tests for all these fields.
Thoughts?
> +
> + return bpf_skb_is_valid_access(off, size, type, prog, info);
> +}
> +
> static bool lwt_is_valid_access(int off, int size,
> enum bpf_access_type type,
> const struct bpf_prog *prog,
> @@ -7038,7 +7062,7 @@ const struct bpf_prog_ops xdp_prog_ops = {
>
> const struct bpf_verifier_ops cg_skb_verifier_ops = {
> .get_func_proto = cg_skb_func_proto,
> - .is_valid_access = sk_filter_is_valid_access,
> + .is_valid_access = cg_skb_is_valid_access,
> .convert_ctx_access = bpf_convert_ctx_access,
> };
>
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Yonghong Song @ 2018-10-17 17:25 UTC (permalink / raw)
To: Edward Cree, Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
netdev@vger.kernel.org
Cc: Kernel Team
In-Reply-To: <0a2fa960-285d-dc32-bf76-5ba27da89dba@solarflare.com>
On 10/17/18 9:13 AM, Edward Cree wrote:
> On 17/10/18 08:23, Yonghong Song wrote:
>> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
>> support to the type section. BTF_KIND_FUNC_PROTO is used
>> to specify the type of a function pointer. With this,
>> BTF has a complete set of C types (except float).
>>
>> BTF_KIND_FUNC is used to specify the signature of a
>> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
>> by another type, e.g., a pointer type, and BTF_KIND_FUNC
>> type cannot be referenced by another type.
> Why are distinct kinds created for these? A function body is
> a value of function type, and since there's no way (in C) to
> declare a variable of function type (only pointer-to-
> function), any declaration of function type must necessarily
> be a BTF_KIND_FUNC, whereas any other reference to a function
> type (e.g. a declaration of type pointer to function type)
> must, as you state above, be a BTF_KIND_FUNC_PROTO.
> In fact, you can tell the difference just from name_off, since
> a (C-legal) BTF_KIND_FUNC_PROTO will always be anonymous (as
> the pointee of a pointer type), while a BTF_KIND_FUNC will
> have the name of the subprogram.
What you stated is true, BTF_KIND_FUNC_PROTO corresponds to
dwarf subroutine tag which has no name while BTF_KIND_FUNC
must have a valid name. The original design is to have both
since they are corresponding to different dwarf constructs.
Martin, what do you think?
>
> -Ed
>
^ permalink raw reply
* Re: [RFC] VSOCK: The performance problem of vhost_vsock.
From: jiangyiwen @ 2018-10-18 1:22 UTC (permalink / raw)
To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <70e7a894-ad12-9909-a0a8-c8772c7c232f@redhat.com>
On 2018/10/17 20:31, Jason Wang wrote:
>
> On 2018/10/17 下午7:41, jiangyiwen wrote:
>> On 2018/10/17 17:51, Jason Wang wrote:
>>> On 2018/10/17 下午5:39, Jason Wang wrote:
>>>>> Hi Jason and Stefan,
>>>>>
>>>>> Maybe I find the reason of bad performance.
>>>>>
>>>>> I found pkt_len is limited to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE(4K),
>>>>> it will cause the bandwidth is limited to 500~600MB/s. And once I
>>>>> increase to 64k, it can improve about 3 times(~1500MB/s).
>>>>
>>>> Looks like the value was chosen for a balance between rx buffer size and performance. Allocating 64K always even for small packet is kind of waste and stress for guest memory. Virito-net try to avoid this by inventing the merge able rx buffer which allows big packet to be scattered in into different buffers. We can reuse this idea or revisit the idea of using virtio-net/vhost-net as a transport of vsock.
>>>>
>>>> What interesting is the performance is still behind vhost-net.
>>>>
>>>> Thanks
>>>>
>>>>> By the way, I send to 64K in application once, and I don't use
>>>>> sg_init_one and rewrite function to packet sg list because pkt_len
>>>>> include multiple pages.
>>>>>
>>>>> Thanks,
>>>>> Yiwen.
>>>
>>> Btw, if you're using vsock for transferring large files, maybe it's more efficient to implement sendpage() for vsock to allow sendfile()/splice() work.
>>>
>>> Thanks
>>>
>> I can't agree more.
>>
>> why vhost_vsock is still behind vhost_net?
>> Because I use sendfile() to test performance at first, and then
>> I found vsock don't implement sendpage() and cause the bandwidth
>> can't be increased. So I use read() and send() to replace sendfile(),
>> it will increase some switch between kernel and user mode, and sendfile()
>> can support zero copy. I think this is main reason.
>>
>> Thanks.
>
>
> Want to post patches for this then :) ?
>
> Thanks
>
I may not post patches at the moment because there are other tasks.
After a period of time, I will consider implement the feature.
Thanks.
>
>>
>>> .
>>>
>>
>
> .
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH bpf-next 05/13] bpf: get better bpf_prog ksyms based on btf func type_id
From: Martin Lau @ 2018-10-17 17:14 UTC (permalink / raw)
To: Yonghong Song
Cc: Alexei Starovoitov, daniel@iogearbox.net, netdev@vger.kernel.org,
Kernel Team
In-Reply-To: <9de47216-e2e2-9e21-3171-dbcf9a61648d@fb.com>
On Tue, Oct 16, 2018 at 05:32:13PM -0700, Yonghong Song wrote:
>
>
> On 10/15/18 4:12 PM, Martin Lau wrote:
> > On Fri, Oct 12, 2018 at 11:54:42AM -0700, Yonghong Song wrote:
> >> This patch added interface to load a program with the following
> >> additional information:
> >> . prog_btf_fd
> >> . func_info and func_info_len
> >> where func_info will provides function range and type_id
> >> corresponding to each function.
> >>
> >> If verifier agrees with function range provided by the user,
> >> the bpf_prog ksym for each function will use the func name
> >> provided in the type_id, which is supposed to provide better
> >> encoding as it is not limited by 16 bytes program name
> >> limitation and this is better for bpf program which contains
> >> multiple subprograms.
> >>
> >> The bpf_prog_info interface is also extended to
> >> return btf_id and jited_func_types, so user spaces can
> >> print out the function prototype for each jited function.
> > Some nits.
> >
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >> include/linux/bpf.h | 2 +
> >> include/linux/bpf_verifier.h | 1 +
> >> include/linux/btf.h | 2 +
> >> include/uapi/linux/bpf.h | 11 +++++
> >> kernel/bpf/btf.c | 16 +++++++
> >> kernel/bpf/core.c | 9 ++++
> >> kernel/bpf/syscall.c | 86 +++++++++++++++++++++++++++++++++++-
> >> kernel/bpf/verifier.c | 50 +++++++++++++++++++++
> >> 8 files changed, 176 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 9b558713447f..e9c63ffa01af 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -308,6 +308,8 @@ struct bpf_prog_aux {
> >> void *security;
> >> #endif
> >> struct bpf_prog_offload *offload;
> >> + struct btf *btf;
> >> + u32 type_id; /* type id for this prog/func */
> >> union {
> >> struct work_struct work;
> >> struct rcu_head rcu;
> >> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> >> index 9e8056ec20fa..e84782ec50ac 100644
> >> --- a/include/linux/bpf_verifier.h
> >> +++ b/include/linux/bpf_verifier.h
> >> @@ -201,6 +201,7 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
> >> struct bpf_subprog_info {
> >> u32 start; /* insn idx of function entry point */
> >> u16 stack_depth; /* max. stack depth used by this function */
> >> + u32 type_id; /* btf type_id for this subprog */
> >> };
> >>
> >> /* single container for all structs
> >> diff --git a/include/linux/btf.h b/include/linux/btf.h
> >> index e076c4697049..90e91b52aa90 100644
> >> --- a/include/linux/btf.h
> >> +++ b/include/linux/btf.h
> >> @@ -46,5 +46,7 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj,
> >> struct seq_file *m);
> >> int btf_get_fd_by_id(u32 id);
> >> u32 btf_id(const struct btf *btf);
> >> +bool is_btf_func_type(const struct btf *btf, u32 type_id);
> >> +const char *btf_get_name_by_id(const struct btf *btf, u32 type_id);
> >>
> >> #endif
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index f9187b41dff6..7ebbf4f06a65 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -332,6 +332,9 @@ union bpf_attr {
> >> * (context accesses, allowed helpers, etc).
> >> */
> >> __u32 expected_attach_type;
> >> + __u32 prog_btf_fd; /* fd pointing to BTF type data */
> >> + __u32 func_info_len; /* func_info length */
> >> + __aligned_u64 func_info; /* func type info */
> >> };
> >>
> >> struct { /* anonymous struct used by BPF_OBJ_* commands */
> >> @@ -2585,6 +2588,9 @@ struct bpf_prog_info {
> >> __u32 nr_jited_func_lens;
> >> __aligned_u64 jited_ksyms;
> >> __aligned_u64 jited_func_lens;
> >> + __u32 btf_id;
> >> + __u32 nr_jited_func_types;
> >> + __aligned_u64 jited_func_types;
> >> } __attribute__((aligned(8)));
> >>
> >> struct bpf_map_info {
> >> @@ -2896,4 +2902,9 @@ struct bpf_flow_keys {
> >> };
> >> };
> >>
> >> +struct bpf_func_info {
> >> + __u32 insn_offset;
> >> + __u32 type_id;
> >> +};
> >> +
> >> #endif /* _UAPI__LINUX_BPF_H__ */
> >> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> >> index 794a185f11bf..85b8eeccddbd 100644
> >> --- a/kernel/bpf/btf.c
> >> +++ b/kernel/bpf/btf.c
> >> @@ -486,6 +486,15 @@ static const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id)
> >> return btf->types[type_id];
> >> }
> >>
> >> +bool is_btf_func_type(const struct btf *btf, u32 type_id)
> >> +{
> >> + const struct btf_type *type = btf_type_by_id(btf, type_id);
> >> +
> >> + if (!type || BTF_INFO_KIND(type->info) != BTF_KIND_FUNC)
> >> + return false;
> >> + return true;
> >> +}
> > Can btf_type_is_func() (from patch 2) be reused?
> > The btf_type_by_id() can be done by the caller.
> > I don't think it worths to add a similar helper
> > for just one user for now.
>
> Currently, btf_type_by_id() is not exposed.
>
> bpf/btf.c:
> static const struct btf_type *btf_type_by_id(const struct btf *btf, u32
> type_id)
>
> Do you want to expose this function as global one?
> We cannot put the whole definition in the header as it touches
> btf internals.
btf_type_by_id() does not have to be inline. Not a major issue to block
the v2 though. Just in case there is a v3.
>
> >
> > The !type check can be added to btf_type_is_func() if
> > it is needed.
> >
> >> +
> >> /*
> >> * Regular int is not a bit field and it must be either
> >> * u8/u16/u32/u64.
> >> @@ -2579,3 +2588,10 @@ u32 btf_id(const struct btf *btf)
> >> {
> >> return btf->id;
> >> }
> >> +
> >> +const char *btf_get_name_by_id(const struct btf *btf, u32 type_id)
> >> +{
> >> + const struct btf_type *t = btf_type_by_id(btf, type_id);
> >> +
> >> + return btf_name_by_offset(btf, t->name_off);
> >> +}
> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> >> index 3f5bf1af0826..add3866a120e 100644
> >> --- a/kernel/bpf/core.c
> >> +++ b/kernel/bpf/core.c
> >> @@ -27,6 +27,7 @@
> >> #include <linux/random.h>
> >> #include <linux/moduleloader.h>
> >> #include <linux/bpf.h>
> >> +#include <linux/btf.h>
> >> #include <linux/frame.h>
> >> #include <linux/rbtree_latch.h>
> >> #include <linux/kallsyms.h>
> >> @@ -387,6 +388,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
> >> static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> >> {
> >> const char *end = sym + KSYM_NAME_LEN;
> >> + const char *func_name;
> >>
> >> BUILD_BUG_ON(sizeof("bpf_prog_") +
> >> sizeof(prog->tag) * 2 +
> >> @@ -401,6 +403,13 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> >>
> >> sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
> >> sym = bin2hex(sym, prog->tag, sizeof(prog->tag));
> >> +
> >> + if (prog->aux->btf) {
> >> + func_name = btf_get_name_by_id(prog->aux->btf, prog->aux->type_id);
> >> + snprintf(sym, (size_t)(end - sym), "_%s", func_name);
> >> + return;
> >> + }
> >> +
> >> if (prog->aux->name[0])
> >> snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> >> else
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index 4f416234251f..aa4688a1a137 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -1120,6 +1120,7 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
> >> /* bpf_prog_free_id() must be called first */
> >> bpf_prog_free_id(prog, do_idr_lock);
> >> bpf_prog_kallsyms_del_all(prog);
> >> + btf_put(prog->aux->btf);
> >>
> >> call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
> >> }
> >> @@ -1343,8 +1344,45 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> >> }
> >> }
> >>
> >> +static int prog_check_btf(const struct bpf_prog *prog, const struct btf *btf,
> >> + union bpf_attr *attr)
> >> +{
> >> + struct bpf_func_info __user *uinfo, info;
> >> + int i, nfuncs, usize;
> >> + u32 prev_offset;
> >> +
> >> + usize = sizeof(struct bpf_func_info);
> >> + if (attr->func_info_len % usize)
> >> + return -EINVAL;
> >> +
> >> + /* func_info section should have increasing and valid insn_offset
> >> + * and type should be BTF_KIND_FUNC.
> >> + */
> >> + nfuncs = attr->func_info_len / usize;
> >> + uinfo = u64_to_user_ptr(attr->func_info);
> >> + for (i = 0; i < nfuncs; i++) {
> >> + if (copy_from_user(&info, uinfo, usize))
> >> + return -EFAULT;
> >> +
> >> + if (!is_btf_func_type(btf, info.type_id))
> >> + return -EINVAL;
> >> +
> >> + if (i == 0) {
> >> + if (info.insn_offset)
> >> + return -EINVAL;
> >> + prev_offset = 0;
> >> + } else if (info.insn_offset < prev_offset) {
> >> + return -EINVAL;
> >> + }
> >> +
> >> + prev_offset = info.insn_offset;
> >> + }
> >> +
> >> + return 0;
> >> +}
> >> +
> >> /* last field in 'union bpf_attr' used by this command */
> >> -#define BPF_PROG_LOAD_LAST_FIELD expected_attach_type
> >> +#define BPF_PROG_LOAD_LAST_FIELD func_info
> >>
> >> static int bpf_prog_load(union bpf_attr *attr)
> >> {
> >> @@ -1431,6 +1469,27 @@ static int bpf_prog_load(union bpf_attr *attr)
> >> if (err)
> >> goto free_prog;
> >>
> >> + /* copy func_info before verifier which may make
> >> + * some adjustment.
> >> + */
> > Is it a left over comment? I don't see the intention of
> > copying func_info to avoid verifier modification from below.
> > I could be missing something.
> >
> > or should the comments be moved to the new "check_btf_func()" below?
> >
> >> + if (attr->func_info_len) {
> >> + struct btf *btf;
> >> +
> >> + btf = btf_get_by_fd(attr->prog_btf_fd);
> >> + if (IS_ERR(btf)) {
> >> + err = PTR_ERR(btf);
> >> + goto free_prog;
> >> + }
> >> +
> >> + err = prog_check_btf(prog, btf, attr);
> >> + if (err) {
> >> + btf_put(btf);
> >> + goto free_prog;
> >> + }
> >> +
> >> + prog->aux->btf = btf;
> >> + }
> >> +
> >> /* run eBPF verifier */
> >> err = bpf_check(&prog, attr);
> >> if (err < 0)
> >> @@ -1463,6 +1522,7 @@ static int bpf_prog_load(union bpf_attr *attr)
> >> bpf_prog_kallsyms_del_subprogs(prog);
> >> free_used_maps(prog->aux);
> >> free_prog:
> >> + btf_put(prog->aux->btf);
> >> bpf_prog_uncharge_memlock(prog);
> >> free_prog_sec:
> >> security_bpf_prog_free(prog->aux);
> >> @@ -2108,6 +2168,30 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
> >> }
> >> }
> >>
> >> + if (prog->aux->btf) {
> >> + info.btf_id = btf_id(prog->aux->btf);
> >> +
> >> + ulen = info.nr_jited_func_types;
> >> + info.nr_jited_func_types = prog->aux->func_cnt;
> >> + if (info.nr_jited_func_types && ulen) {
> >> + if (bpf_dump_raw_ok()) {
> >> + u32 __user *user_types;
> >> + u32 func_type, i;
> >> +
> >> + ulen = min_t(u32, info.nr_jited_func_types,
> >> + ulen);
> >> + user_types = u64_to_user_ptr(info.jited_func_types);
> >> + for (i = 0; i < ulen; i++) {
> >> + func_type = prog->aux->func[i]->aux->type_id;
> >> + if (put_user(func_type, &user_types[i]))
> >> + return -EFAULT;
> >> + }
> >> + } else {
> >> + info.jited_func_types = 0;
> >> + }
> >> + }
> >> + }
> >> +
> >> done:
> >> if (copy_to_user(uinfo, &info, info_len) ||
> >> put_user(info_len, &uattr->info.info_len))
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 3f93a548a642..97c408e84322 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -4589,6 +4589,50 @@ static int check_cfg(struct bpf_verifier_env *env)
> >> return ret;
> >> }
> >>
> >> +static int check_btf_func(struct bpf_prog *prog, struct bpf_verifier_env *env,
> >> + union bpf_attr *attr)
> >> +{
> >> + struct bpf_func_info *data;
> >> + int i, nfuncs, ret = 0;
> >> +
> >> + if (!attr->func_info_len)
> >> + return 0;
> >> +
> >> + nfuncs = attr->func_info_len / sizeof(struct bpf_func_info);
> >> + if (env->subprog_cnt != nfuncs) {
> >> + verbose(env, "number of funcs in func_info does not match verifier\n");
> >> + return -EINVAL;
> >> + }
> >> +
> >> + data = kvmalloc(attr->func_info_len, GFP_KERNEL | __GFP_NOWARN);
> >> + if (!data) {
> >> + verbose(env, "no memory to allocate attr func_info\n");
> >> + return -ENOMEM;
> >> + }
> >> +
> >> + if (copy_from_user(data, u64_to_user_ptr(attr->func_info),
> >> + attr->func_info_len)) {
> >> + verbose(env, "memory copy error for attr func_info\n");
> >> + ret = -EFAULT;
> >> + goto cleanup;
> >> + }
> > Extra indentation.
> >
> >> +
> >> + for (i = 0; i < nfuncs; i++) {
> >> + if (env->subprog_info[i].start != data[i].insn_offset) {
> >> + verbose(env, "func_info subprog start (%d) does not match verifier (%d)\n",
> >> + env->subprog_info[i].start, data[i].insn_offset);
> > The printing args are swapped? env->subprog_info[i].start should
> > go to the "verifier (%d)"?
> >
> > and s/%d/%u/
> >
> >> + ret = -EINVAL;
> >> + goto cleanup;
> >> + }
> >> + env->subprog_info[i].type_id = data[i].type_id;
> >> + }
> >> +
> >> + prog->aux->type_id = data[0].type_id;
> >> +cleanup:
> >> + kvfree(data);
> >> + return ret;
> >> +}
> >> +
> >> /* check %cur's range satisfies %old's */
> >> static bool range_within(struct bpf_reg_state *old,
> >> struct bpf_reg_state *cur)
> >> @@ -5873,6 +5917,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> >> func[i]->aux->name[0] = 'F';
> >> func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
> >> func[i]->jit_requested = 1;
> >> + func[i]->aux->btf = prog->aux->btf;
> >> + func[i]->aux->type_id = env->subprog_info[i].type_id;
> >> func[i] = bpf_int_jit_compile(func[i]);
> >> if (!func[i]->jited) {
> >> err = -ENOTSUPP;
> >> @@ -6307,6 +6353,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
> >> if (ret < 0)
> >> goto skip_full_check;
> >>
> >> + ret = check_btf_func(env->prog, env, attr);
> >> + if (ret < 0)
> >> + goto skip_full_check;
> >> +
> >> ret = do_check(env);
> >> if (env->cur_state) {
> >> free_verifier_state(env->cur_state, true);
> >> --
> >> 2.17.1
> >>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox