* [PATCH bpf-next 0/5] Add support for SKIP_BPF flag for AF_XDP sockets
From: Sridhar Samudrala @ 2019-08-15 3:46 UTC (permalink / raw)
To: magnus.karlsson, bjorn.topel, netdev, bpf, sridhar.samudrala,
intel-wired-lan, maciej.fijalkowski, tom.herbert
This patch series introduces XDP_SKIP_BPF flag that can be specified
during the bind() call of an AF_XDP socket to skip calling the BPF
program in the receive path and pass the buffer directly to the socket.
When a single AF_XDP socket is associated with a queue and a HW
filter is used to redirect the packets and the app is interested in
receiving all the packets on that queue, we don't need an additional
BPF program to do further filtering or lookup/redirect to a socket.
Here are some performance numbers collected on
- 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
- Intel 40Gb Ethernet NIC (i40e)
All tests use 2 cores and the results are in Mpps.
turbo on (default)
---------------------------------------------
no-skip-bpf skip-bpf
---------------------------------------------
rxdrop zerocopy 21.9 38.5
l2fwd zerocopy 17.0 20.5
rxdrop copy 11.1 13.3
l2fwd copy 1.9 2.0
no turbo : echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
---------------------------------------------
no-skip-bpf skip-bpf
---------------------------------------------
rxdrop zerocopy 15.4 29.0
l2fwd zerocopy 11.8 18.2
rxdrop copy 8.2 10.5
l2fwd copy 1.7 1.7
---------------------------------------------
Sridhar Samudrala (5):
xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
xsk: Introduce XDP_SKIP_BPF bind option
i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
xdpsock_user: Add skip_bpf option
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 22 +++++++++-
drivers/net/ethernet/intel/i40e/i40e_xsk.c | 6 +++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 ++++++++-
drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 16 ++++++-
include/net/xdp_sock.h | 21 ++++++++-
include/uapi/linux/if_xdp.h | 1 +
include/uapi/linux/xdp_diag.h | 1 +
net/xdp/xdp_umem.c | 9 ++--
net/xdp/xsk.c | 43 ++++++++++++++++---
net/xdp/xsk_diag.c | 5 ++-
samples/bpf/xdpsock_user.c | 8 ++++
11 files changed, 135 insertions(+), 17 deletions(-)
--
2.20.1
^ permalink raw reply
* Re: [PATCH V5 0/9] Fixes for vhost metadata acceleration
From: Jason Wang @ 2019-08-15 3:28 UTC (permalink / raw)
To: Christoph Hellwig, Jason Gunthorpe
Cc: Michael S. Tsirkin, kvm, virtualization, netdev, linux-kernel,
linux-mm
In-Reply-To: <20190813164105.GD22640@infradead.org>
On 2019/8/14 上午12:41, Christoph Hellwig wrote:
> On Tue, Aug 13, 2019 at 08:57:07AM -0300, Jason Gunthorpe wrote:
>> On Tue, Aug 13, 2019 at 04:31:07PM +0800, Jason Wang wrote:
>>
>>> What kind of issues do you see? Spinlock is to synchronize GUP with MMU
>>> notifier in this series.
>> A GUP that can't sleep can't pagefault which makes it a really weird
>> pattern
> get_user_pages/get_user_pages_fast must not be called under a spinlock.
> We have the somewhat misnamed __get_user_page_fast that just does a
> lookup for existing pages and never faults for a few places that need
> to do that lookup from contexts where we can't sleep.
Yes, I do use __get_user_pages_fast() in the code.
Thanks
^ permalink raw reply
* Re: [PATCH V5 0/9] Fixes for vhost metadata acceleration
From: Jason Wang @ 2019-08-15 3:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michael S. Tsirkin, kvm, virtualization, netdev, linux-kernel,
linux-mm
In-Reply-To: <20190813115707.GC29508@ziepe.ca>
On 2019/8/13 下午7:57, Jason Gunthorpe wrote:
> On Tue, Aug 13, 2019 at 04:31:07PM +0800, Jason Wang wrote:
>
>> What kind of issues do you see? Spinlock is to synchronize GUP with MMU
>> notifier in this series.
> A GUP that can't sleep can't pagefault which makes it a really weird
> pattern
My understanding is __get_user_pages_fast() assumes caller can fail or
have fallback. And we have graceful fallback to copy_{to|from}_user().
>
>> Btw, back to the original question. May I know why synchronize_rcu() is not
>> suitable? Consider:
> We already went over this. You'd need to determine it doesn't somehow
> deadlock the mm on reclaim paths. Maybe it is OK, the rcq_gq_wq is
> marked WQ_MEM_RECLAIM at least..
Yes, will take a look at this.
>
> I also think Michael was concerned about the latency spikes a long RCU
> delay would cause.
I don't think it's a real problem consider MMU notifier could be
preempted or blocked.
Thanks
>
> Jason
^ permalink raw reply
* [net] tipc: fix false detection of retransmit failures
From: Tuong Lien @ 2019-08-15 3:24 UTC (permalink / raw)
To: davem, jon.maloy, maloy, ying.xue, netdev; +Cc: tipc-discussion
This commit eliminates the use of the link 'stale_limit' & 'prev_from'
(besides the already removed - 'stale_cnt') variables in the detection
of repeated retransmit failures as there is no proper way to initialize
them to avoid a false detection, i.e. it is not really a retransmission
failure but due to a garbage values in the variables.
Instead, a jiffies variable will be added to individual skbs (like the
way we restrict the skb retransmissions) in order to mark the first skb
retransmit time. Later on, at the next retransmissions, the timestamp
will be checked to see if the skb in the link transmq is "too stale",
that is, the link tolerance time has passed, so that a link reset will
be ordered. Note, just checking on the first skb in the queue is fine
enough since it must be the oldest one.
A counter is also added to keep track the actual skb retransmissions'
number for later checking when the failure happens.
The downside of this approach is that the skb->cb[] buffer is about to
be exhausted, however it is always able to allocate another memory area
and keep a reference to it when needed.
Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria")
Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
---
net/tipc/link.c | 92 ++++++++++++++++++++++++++++++++-------------------------
net/tipc/msg.h | 8 +++--
2 files changed, 57 insertions(+), 43 deletions(-)
diff --git a/net/tipc/link.c b/net/tipc/link.c
index dd3155b14654..01d76bf16e9d 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -106,8 +106,6 @@ struct tipc_stats {
* @transmitq: queue for sent, non-acked messages
* @backlogq: queue for messages waiting to be sent
* @snt_nxt: next sequence number to use for outbound messages
- * @prev_from: sequence number of most previous retransmission request
- * @stale_limit: time when repeated identical retransmits must force link reset
* @ackers: # of peers that needs to ack each packet before it can be released
* @acked: # last packet acked by a certain peer. Used for broadcast.
* @rcv_nxt: next sequence number to expect for inbound messages
@@ -164,9 +162,7 @@ struct tipc_link {
u16 limit;
} backlog[5];
u16 snd_nxt;
- u16 prev_from;
u16 window;
- unsigned long stale_limit;
/* Reception */
u16 rcv_nxt;
@@ -1063,47 +1059,53 @@ static void tipc_link_advance_backlog(struct tipc_link *l,
* link_retransmit_failure() - Detect repeated retransmit failures
* @l: tipc link sender
* @r: tipc link receiver (= l in case of unicast)
- * @from: seqno of the 1st packet in retransmit request
* @rc: returned code
*
* Return: true if the repeated retransmit failures happens, otherwise
* false
*/
static bool link_retransmit_failure(struct tipc_link *l, struct tipc_link *r,
- u16 from, int *rc)
+ int *rc)
{
struct sk_buff *skb = skb_peek(&l->transmq);
struct tipc_msg *hdr;
if (!skb)
return false;
- hdr = buf_msg(skb);
- /* Detect repeated retransmit failures on same packet */
- if (r->prev_from != from) {
- r->prev_from = from;
- r->stale_limit = jiffies + msecs_to_jiffies(r->tolerance);
- } else if (time_after(jiffies, r->stale_limit)) {
- pr_warn("Retransmission failure on link <%s>\n", l->name);
- link_print(l, "State of link ");
- pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n",
- msg_user(hdr), msg_type(hdr), msg_size(hdr),
- msg_errcode(hdr));
- pr_info("sqno %u, prev: %x, src: %x\n",
- msg_seqno(hdr), msg_prevnode(hdr), msg_orignode(hdr));
-
- trace_tipc_list_dump(&l->transmq, true, "retrans failure!");
- trace_tipc_link_dump(l, TIPC_DUMP_NONE, "retrans failure!");
- trace_tipc_link_dump(r, TIPC_DUMP_NONE, "retrans failure!");
+ if (!TIPC_SKB_CB(skb)->retr_cnt)
+ return false;
- if (link_is_bc_sndlink(l))
- *rc = TIPC_LINK_DOWN_EVT;
+ if (!time_after(jiffies, TIPC_SKB_CB(skb)->retr_stamp +
+ msecs_to_jiffies(r->tolerance)))
+ return false;
+
+ hdr = buf_msg(skb);
+ if (link_is_bc_sndlink(l) && !less(r->acked, msg_seqno(hdr)))
+ return false;
+ pr_warn("Retransmission failure on link <%s>\n", l->name);
+ link_print(l, "State of link ");
+ pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n",
+ msg_user(hdr), msg_type(hdr), msg_size(hdr), msg_errcode(hdr));
+ pr_info("sqno %u, prev: %x, dest: %x\n",
+ msg_seqno(hdr), msg_prevnode(hdr), msg_destnode(hdr));
+ pr_info("retr_stamp %d, retr_cnt %d\n",
+ jiffies_to_msecs(TIPC_SKB_CB(skb)->retr_stamp),
+ TIPC_SKB_CB(skb)->retr_cnt);
+
+ trace_tipc_list_dump(&l->transmq, true, "retrans failure!");
+ trace_tipc_link_dump(l, TIPC_DUMP_NONE, "retrans failure!");
+ trace_tipc_link_dump(r, TIPC_DUMP_NONE, "retrans failure!");
+
+ if (link_is_bc_sndlink(l)) {
+ r->state = LINK_RESET;
+ *rc = TIPC_LINK_DOWN_EVT;
+ } else {
*rc = tipc_link_fsm_evt(l, LINK_FAILURE_EVT);
- return true;
}
- return false;
+ return true;
}
/* tipc_link_bc_retrans() - retransmit zero or more packets
@@ -1129,7 +1131,7 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r,
trace_tipc_link_retrans(r, from, to, &l->transmq);
- if (link_retransmit_failure(l, r, from, &rc))
+ if (link_retransmit_failure(l, r, &rc))
return rc;
skb_queue_walk(&l->transmq, skb) {
@@ -1138,11 +1140,10 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r,
continue;
if (more(msg_seqno(hdr), to))
break;
- if (link_is_bc_sndlink(l)) {
- if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr))
- continue;
- TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
- }
+
+ if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr))
+ continue;
+ TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
_skb = __pskb_copy(skb, LL_MAX_HEADER + MIN_H_SIZE, GFP_ATOMIC);
if (!_skb)
return 0;
@@ -1152,6 +1153,10 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r,
_skb->priority = TC_PRIO_CONTROL;
__skb_queue_tail(xmitq, _skb);
l->stats.retransmitted++;
+
+ /* Increase actual retrans counter & mark first time */
+ if (!TIPC_SKB_CB(skb)->retr_cnt++)
+ TIPC_SKB_CB(skb)->retr_stamp = jiffies;
}
return 0;
}
@@ -1393,12 +1398,10 @@ static int tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 gap,
struct tipc_msg *hdr;
u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1;
u16 ack = l->rcv_nxt - 1;
+ bool passed = false;
u16 seqno, n = 0;
int rc = 0;
- if (gap && link_retransmit_failure(l, l, acked + 1, &rc))
- return rc;
-
skb_queue_walk_safe(&l->transmq, skb, tmp) {
seqno = buf_seqno(skb);
@@ -1408,12 +1411,17 @@ static int tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 gap,
__skb_unlink(skb, &l->transmq);
kfree_skb(skb);
} else if (less_eq(seqno, acked + gap)) {
- /* retransmit skb */
+ /* First, check if repeated retrans failures occurs? */
+ if (!passed && link_retransmit_failure(l, l, &rc))
+ return rc;
+ passed = true;
+
+ /* retransmit skb if unrestricted*/
if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr))
continue;
TIPC_SKB_CB(skb)->nxt_retr = TIPC_UC_RETR_TIME;
-
- _skb = __pskb_copy(skb, MIN_H_SIZE, GFP_ATOMIC);
+ _skb = __pskb_copy(skb, LL_MAX_HEADER + MIN_H_SIZE,
+ GFP_ATOMIC);
if (!_skb)
continue;
hdr = buf_msg(_skb);
@@ -1422,6 +1430,10 @@ static int tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 gap,
_skb->priority = TC_PRIO_CONTROL;
__skb_queue_tail(xmitq, _skb);
l->stats.retransmitted++;
+
+ /* Increase actual retrans counter & mark first time */
+ if (!TIPC_SKB_CB(skb)->retr_cnt++)
+ TIPC_SKB_CB(skb)->retr_stamp = jiffies;
} else {
/* retry with Gap ACK blocks if any */
if (!ga || n >= ga->gack_cnt)
@@ -2681,7 +2693,7 @@ int tipc_link_dump(struct tipc_link *l, u16 dqueues, char *buf)
i += scnprintf(buf + i, sz - i, " %x", l->peer_caps);
i += scnprintf(buf + i, sz - i, " %u", l->silent_intv_cnt);
i += scnprintf(buf + i, sz - i, " %u", l->rst_cnt);
- i += scnprintf(buf + i, sz - i, " %u", l->prev_from);
+ i += scnprintf(buf + i, sz - i, " %u", 0);
i += scnprintf(buf + i, sz - i, " %u", 0);
i += scnprintf(buf + i, sz - i, " %u", l->acked);
diff --git a/net/tipc/msg.h b/net/tipc/msg.h
index 1c8c8dd32a4e..0daa6f04ca81 100644
--- a/net/tipc/msg.h
+++ b/net/tipc/msg.h
@@ -102,13 +102,15 @@ struct plist;
#define TIPC_MEDIA_INFO_OFFSET 5
struct tipc_skb_cb {
- u32 bytes_read;
- u32 orig_member;
struct sk_buff *tail;
unsigned long nxt_retr;
- bool validated;
+ unsigned long retr_stamp;
+ u32 bytes_read;
+ u32 orig_member;
u16 chain_imp;
u16 ackers;
+ u16 retr_cnt;
+ bool validated;
};
#define TIPC_SKB_CB(__skb) ((struct tipc_skb_cb *)&((__skb)->cb[0]))
--
2.13.7
^ permalink raw reply related
* Re: [PATCH] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15 3:17 UTC (permalink / raw)
To: 冉 jiang, mst@redhat.com
Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
jakub.kicinski@netronome.com, hawk@kernel.org,
john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
yhs@fb.com, virtualization@lists.linux-foundation.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB3205B734E554EACEEE337ADDA6AC0@BYAPR14MB3205.namprd14.prod.outlook.com>
On 2019/8/15 上午11:11, 冉 jiang wrote:
> On 2019/8/15 11:01, Jason Wang wrote:
>> On 2019/8/14 上午10:06, ? jiang wrote:
>>> This change lowers ring buffer reclaim threshold from 1/2*queue to
>>> budget
>>> for better performance. According to our test with qemu + dpdk, packet
>>> dropping happens when the guest is not able to provide free buffer in
>>> avail ring timely with default 1/2*queue. The value in the patch has
>>> been
>>> tested and does show better performance.
>>
>> Please add your tests setup and result here.
>>
>> Thanks
>>
>>
>>> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
>>> ---
>>> drivers/net/virtio_net.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 0d4115c9e20b..bc08be7925eb 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue
>>> *rq, int budget,
>>> }
>>> }
>>> - if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
>>> + if (rq->vq->num_free > min((unsigned int)budget,
>>> virtqueue_get_vring_size(rq->vq)) / 2) {
>>> if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>>> schedule_delayed_work(&vi->refill, 0);
>>> }
> Sure, here are the details:
Thanks for the details, but I meant it's better if you could summarize
you test result in the commit log in a compact way.
Btw, some comments, see below:
>
>
> Test setup & result:
>
> ----------------------------------------------------
>
> Below is the snippet from our test result. Test1 was done with default
> driver with the value of 1/2 * queue, while test2 is with my patch. We
> can see average
> drop packets do decrease a lot in test2.
>
> test1Time avgDropPackets test2Time avgDropPackets pps
>
> 16:21.0 12.295 56:50.4 0 300k
> 17:19.1 15.244 56:50.4 0 300k
> 18:17.5 18.789 56:50.4 0 300k
> 19:15.1 14.208 56:50.4 0 300k
> 20:13.2 20.818 56:50.4 0.267 300k
> 21:11.2 12.397 56:50.4 0 300k
> 22:09.3 12.599 56:50.4 0 300k
> 23:07.3 15.531 57:48.4 0 300k
> 24:05.5 13.664 58:46.5 0 300k
> 25:03.7 13.158 59:44.5 4.73 300k
> 26:01.1 2.486 00:42.6 0 300k
> 26:59.1 11.241 01:40.6 0 300k
> 27:57.2 20.521 02:38.6 0 300k
> 28:55.2 30.094 03:36.7 0 300k
> 29:53.3 16.828 04:34.7 0.963 300k
> 30:51.3 46.916 05:32.8 0 400k
> 31:49.3 56.214 05:32.8 0 400k
> 32:47.3 58.69 05:32.8 0 400k
> 33:45.3 61.486 05:32.8 0 400k
> 34:43.3 72.175 05:32.8 0.598 400k
> 35:41.3 56.699 05:32.8 0 400k
> 36:39.3 61.071 05:32.8 0 400k
> 37:37.3 43.355 06:30.8 0 400k
> 38:35.4 44.644 06:30.8 0 400k
> 39:33.4 72.336 06:30.8 0 400k
> 40:31.4 70.676 06:30.8 0 400k
> 41:29.4 108.009 06:30.8 0 400k
> 42:27.4 65.216 06:30.8 0 400k
Why there're difference in test time? Could you summarize them like:
Test setup: e.g testpmd or pktgen to generate packets to guest
avg packets drop before: XXX
avg packets drop after: YYY(-ZZZ%)
Thanks
>
>
> Data to prove why the patch helps:
>
> ----------------------------------------------------
>
> We did have completed several rounds of test with setting the value to
> budget (64 as the default value). It does improve a lot with pps is
> below 400pps for a single stream. We are confident that it runs out of free
> buffer in avail ring when packet dropping happens with below systemtap:
>
> Just a snippet:
>
> probe module("virtio_ring").function("virtqueue_get_buf")
> {
> x = (@cast($_vq, "vring_virtqueue")->vring->used->idx)-
> (@cast($_vq, "vring_virtqueue")->last_used_idx) ---> we use this one
> to verify if the queue is full, which means guest is not able to take
> buffer from the queue timely
>
> if (x<0 && (x+65535)<4096)
> x = x+65535
>
> if((x==1024) && @cast($_vq, "vring_virtqueue")->vq->callback ==
> callback_addr)
> netrxcount[x] <<< gettimeofday_s()
> }
>
>
> probe module("virtio_ring").function("virtqueue_add_inbuf")
> {
> y = (@cast($vq, "vring_virtqueue")->vring->avail->idx)-
> (@cast($vq, "vring_virtqueue")->vring->used->idx) ---> we use this one
> to verify if we run out of free buffer in avail ring
> if (y<0 && (y+65535)<4096)
> y = y+65535
>
> if(@2=="debugon")
> {
> if(y==0 && @cast($vq, "vring_virtqueue")->vq->callback ==
> callback_addr)
> {
> netrxfreecount[y] <<< gettimeofday_s()
>
> printf("no avail ring left seen, printing most recent 5
> num free, vq: %lx, current index: %d\n", $vq, recentfreecount)
> for(i=recentfreecount; i!=((recentfreecount+4) % 5);
> i=((i+1) % 5))
> {
> printf("index: %d, num free: %d\n", i, recentfree[$vq,
> i])
> }
>
> printf("index: %d, num free: %d\n", i, recentfree[$vq, i])
> //exit()
> }
> }
> }
>
>
> probe
> module("virtio_net").statement("virtnet_receive@drivers/net/virtio_net.c:732")
>
> {
> recentfreecount++
> recentfreecount = recentfreecount % 5
> recentfree[$rq->vq, recentfreecount] = $rq->vq->num_free --->
> record the num_free for the last 5 calls to virtnet_receive, so we can
> see if lowering the bar helps.
> }
>
>
> Here is the result:
>
> no avail ring left seen, printing most recent 5 num free, vq:
> ffff9c13c1200000, current index: 1
> index: 1, num free: 561
> index: 2, num free: 305
> index: 3, num free: 369
> index: 4, num free: 433
> index: 0, num free: 497
> no avail ring left seen, printing most recent 5 num free, vq:
> ffff9c13c1200000, current index: 1
> index: 1, num free: 543
> index: 2, num free: 463
> index: 3, num free: 469
> index: 4, num free: 476
> index: 0, num free: 479
> no avail ring left seen, printing most recent 5 num free, vq:
> ffff9c13c1200000, current index: 2
> index: 2, num free: 555
> index: 3, num free: 414
> index: 4, num free: 420
> index: 0, num free: 427
> index: 1, num free: 491
>
> We can see in the last 4 calls to virtnet_receive before we run out
> of free buffer and start to relaim, num_free is quite high. So if we
> can do the reclaim earlier, it will certainly help.
>
> Jiang
Right, but I think there's no need to put those thing in the commit log.
Thanks
^ permalink raw reply
* Re: [PATCH] cxgb4: fix a memory leak bug
From: David Miller @ 2019-08-15 3:03 UTC (permalink / raw)
To: wenwen; +Cc: vishal, netdev, linux-kernel
In-Reply-To: <1565687932-2870-1-git-send-email-wenwen@cs.uga.edu>
From: Wenwen Wang <wenwen@cs.uga.edu>
Date: Tue, 13 Aug 2019 04:18:52 -0500
> In blocked_fl_write(), 't' is not deallocated if bitmap_parse_user() fails,
> leading to a memory leak bug. To fix this issue, free t before returning
> the error.
>
> Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Applied.
^ permalink raw reply
* Re: [PATCH net-next] net/mvpp2: Replace tasklet with softirq hrtimer
From: David Miller @ 2019-08-15 3:02 UTC (permalink / raw)
To: bigeasy; +Cc: netdev, tglx, maxime.chevallier
In-Reply-To: <20190813080025.2j3cj6gfyzzxek7x@linutronix.de>
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Tue, 13 Aug 2019 10:00:25 +0200
> From: Thomas Gleixner <tglx@linutronix.de>
>
> The tx_done_tasklet tasklet is used in invoke the hrtimer
> (mvpp2_hr_timer_cb) in softirq context. This can be also achieved without
> the tasklet but with HRTIMER_MODE_SOFT as hrtimer mode.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Applied.
^ permalink raw reply
* Re: [PATCH] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15 3:01 UTC (permalink / raw)
To: ? jiang, mst@redhat.com
Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
jakub.kicinski@netronome.com, hawk@kernel.org,
john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
yhs@fb.com, virtualization@lists.linux-foundation.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB3205E4E194942B0A1A91A222A6AD0@BYAPR14MB3205.namprd14.prod.outlook.com>
On 2019/8/14 上午10:06, ? jiang wrote:
> This change lowers ring buffer reclaim threshold from 1/2*queue to budget
> for better performance. According to our test with qemu + dpdk, packet
> dropping happens when the guest is not able to provide free buffer in
> avail ring timely with default 1/2*queue. The value in the patch has been
> tested and does show better performance.
Please add your tests setup and result here.
Thanks
>
> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
> ---
> drivers/net/virtio_net.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0d4115c9e20b..bc08be7925eb 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
> }
> }
>
> - if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
> + if (rq->vq->num_free > min((unsigned int)budget, virtqueue_get_vring_size(rq->vq)) / 2) {
> if (!try_fill_recv(vi, rq, GFP_ATOMIC))
> schedule_delayed_work(&vi->refill, 0);
> }
^ permalink raw reply
* Re: [PATCH 0/2] Netfilter updates for net-next
From: David Miller @ 2019-08-15 2:59 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20190814214347.4940-1-pablo@netfilter.org>
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Wed, 14 Aug 2019 23:43:45 +0200
> The following patchset contains Netfilter updates for net-next.
> This round addresses fallout from previous pull request:
>
> 1) Remove #warning from ipt_LOG.h and ip6t_LOG.h headers,
> from Jeremy Sowden.
>
> 2) Incorrect parens in memcmp() in nft_bitwise, from Nathan Chancellor.
>
> You can pull these changes from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git
Pulled, thanks.
^ permalink raw reply
* Re: WARNING in cgroup_rstat_updated
From: syzbot @ 2019-08-15 1:26 UTC (permalink / raw)
To: ast, daniel, hdanton, john.fastabend, linux-kernel, linux-mm,
netdev, syzkaller-bugs
In-Reply-To: <000000000000b851cb058f7e637f@google.com>
syzbot has bisected this bug to:
commit e9db4ef6bf4ca9894bb324c76e01b8f1a16b2650
Author: John Fastabend <john.fastabend@gmail.com>
Date: Sat Jun 30 13:17:47 2018 +0000
bpf: sockhash fix omitted bucket lock in sock_close
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=143286e2600000
start commit: 31cc088a Merge tag 'drm-next-2019-07-19' of git://anongit...
git tree: net-next
final crash: https://syzkaller.appspot.com/x/report.txt?x=163286e2600000
console output: https://syzkaller.appspot.com/x/log.txt?x=123286e2600000
kernel config: https://syzkaller.appspot.com/x/.config?x=4dba67bf8b8c9ad7
dashboard link: https://syzkaller.appspot.com/bug?extid=370e4739fa489334a4ef
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16dd57dc600000
Reported-by: syzbot+370e4739fa489334a4ef@syzkaller.appspotmail.com
Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
^ permalink raw reply
* Re: [PATCH net-next 1/5] RDS: Re-add pf/sol access via sysctl
From: David Miller @ 2019-08-15 1:25 UTC (permalink / raw)
To: gerd.rausch; +Cc: santosh.shilimkar, dledford, netdev, linux-rdma, rds-devel
In-Reply-To: <1c6d1f04-96d5-94e5-3140-d3da194e14f3@oracle.com>
From: Gerd Rausch <gerd.rausch@oracle.com>
Date: Wed, 14 Aug 2019 14:45:21 -0700
> a) It is utterly inappropriate to have Oracle applications
> rely on a /proc/sys API that was kept out of Upstream-Linux
> for this long
Yes.
>
> b) It is utterly inappropriate to include such a /proc/sys API
> that Oracle applications have depended on this late
Also yes.
^ permalink raw reply
* Re: [patch net-next v3 1/2] netdevsim: implement support for devlink region and snapshots
From: Jakub Kicinski @ 2019-08-15 1:14 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190814153735.6923-2-jiri@resnulli.us>
On Wed, 14 Aug 2019 17:37:34 +0200, Jiri Pirko wrote:
> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> index 08ca59fc189b..125a0358bc04 100644
> --- a/drivers/net/netdevsim/dev.c
> +++ b/drivers/net/netdevsim/dev.c
> @@ -27,6 +27,41 @@
>
> static struct dentry *nsim_dev_ddir;
>
> +#define NSIM_DEV_DUMMY_REGION_SIZE (1024 * 32)
> +
> +static ssize_t nsim_dev_take_snapshot_write(struct file *file,
> + const char __user *data,
> + size_t count, loff_t *ppos)
> +{
> + struct nsim_dev *nsim_dev = file->private_data;
> + void *dummy_data;
> + u32 id;
> + int err;
If you have to rebase on top of Ido's changes and repost, please do
reverse xmas tree.
^ permalink raw reply
* Re: [patch net-next v2 2/2] selftests: netdevsim: add devlink params tests
From: Jakub Kicinski @ 2019-08-15 1:09 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190814152604.6385-3-jiri@resnulli.us>
On Wed, 14 Aug 2019 17:26:04 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Test recently added netdevsim devlink param implementation.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> ---
> v1->v2:
> -using cmd_jq helper
Still failing here :(
# ./devlink.sh
TEST: fw flash test [ OK ]
TEST: params test [FAIL]
Failed to get test1 param value
TEST: regions test [ OK ]
# jq --version
jq-1.5-1-a5b5cbe
# echo '{ "a" : false }' | jq -e -r '.[]'
false
# echo $?
1
On another machine:
$ echo '{ "a" : false }' | jq -e -r '.[]'
false
$ echo $?
1
Did you mean to drop the -e ?
^ permalink raw reply
* Re: [PATCH net-next v2 13/14] selftests: devlink_trap: Add test cases for devlink-trap
From: Jakub Kicinski @ 2019-08-15 0:42 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, nhorman, jiri, toke, dsahern, roopa, nikolay, andy,
f.fainelli, andrew, vivien.didelot, mlxsw, Ido Schimmel
In-Reply-To: <20190813075400.11841-14-idosch@idosch.org>
On Tue, 13 Aug 2019 10:53:59 +0300, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@mellanox.com>
>
> Add test cases for devlink-trap on top of the netdevsim implementation.
>
> The tests focus on the devlink-trap core infrastructure and user space
> API. They test both good and bad flows and also dismantle of the netdev
> and devlink device used to report trapped packets.
>
> This allows device drivers to focus their tests on device-specific
> functionality.
>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
Thanks for the test!
Should it perhaps live in:
tools/testing/selftests/drivers/net/netdevsim/
?
That's where Jiri puts his devlink tests..
Also the test seems to require netdevsim to be loaded, otherwise:
# ./devlink_trap.sh
SKIP: No netdevsim support
Is that expected?
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15 0:36 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <317422C3-ACE3-42A7-A287-7B8FEE12E33A@amacapital.net>
On Wed, Aug 14, 2019 at 04:59:18PM -0700, Andy Lutomirski wrote:
>
>
> > On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> >> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
> >>
> >>
> >>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>
> >>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >>>>
> >>>> If eBPF is genuinely not usable by programs that are not fully trusted
> >>>> by the admin, then no kernel changes at all are needed. Programs that
> >>>> want to reduce their own privileges can easily fork() a privileged
> >>>> subprocess or run a little helper to which they delegate BPF
> >>>> operations. This is far more flexible than anything that will ever be
> >>>> in the kernel because it allows the helper to verify that the rest of
> >>>> the program is doing exactly what it's supposed to and restrict eBPF
> >>>> operations to exactly the subset that is needed. So a container
> >>>> manager or network manager that drops some provilege could have a
> >>>> little bpf-helper that manages its BPF XDP, firewalling, etc
> >>>> configuration. The two processes would talk over a socketpair.
> >>>
> >>> there were three projects that tried to delegate bpf operations.
> >>> All of them failed.
> >>> bpf operational workflow is much more complex than you're imagining.
> >>> fork() also doesn't work for all cases.
> >>> I gave this example before: consider multiple systemd-like deamons
> >>> that need to do bpf operations that want to pass this 'bpf capability'
> >>> to other deamons written by other teams. Some of them will start
> >>> non-root, but still need to do bpf. They will be rpm installed
> >>> and live upgraded while running.
> >>> We considered to make systemd such centralized bpf delegation
> >>> authority too. It didn't work. bpf in kernel grows quickly.
> >>> libbpf part grows independently. llvm keeps evolving.
> >>> All of them are being changed while system overall has to stay
> >>> operational. Centralized approach breaks apart.
> >>>
> >>>> The interesting cases you're talking about really *do* involved
> >>>> unprivileged or less privileged eBPF, though. Let's see:
> >>>>
> >>>> systemd --user: systemd --user *is not privileged at all*. There's no
> >>>> issue of reducing privilege, since systemd --user doesn't have any
> >>>> privilege to begin with. But systemd supports some eBPF features, and
> >>>> presumably it would like to support them in the systemd --user case.
> >>>> This is unprivileged eBPF.
> >>>
> >>> Let's disambiguate the terminology.
> >>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> >>> I think that was a mistake.
> >>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> >>> This is not unprivileged.
> >>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >>>
> >>> There is a huge difference between the two.
> >>> I'm against extending 'unprivileged bpf' even a bit more than what it is
> >>> today for many reasons mentioned earlier.
> >>> The /dev/bpf is about 'less privileged'.
> >>> Less privileged than root. We need to split part of full root capability
> >>> into bpf capability. So that most of the root can be dropped.
> >>> This is very similar to what cap_net_admin does.
> >>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> >>> cap_net_admin is very much privileged. Just 'less privileged' than root.
> >>> Same thing for cap_bpf.
> >>
> >> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
> >
> > Initially I agreed that it's probably too broad, but then realized
> > that they're perfect as-is. There is no need to partition further.
> >
> >>> May be we should do both cap_bpf and /dev/bpf to make it clear that
> >>> this is the same thing. Two interfaces to achieve the same result.
> >>
> >> What for? If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
> >
> > Indeed, ambient capabilities should work for all cases.
> >
> >> No, I’m not. I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task. Changing *all* of the capable checks is needlessly broad.
> >
> > There are not that many bits left. I prefer to consume single CAP_BPF bit.
> > All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> > This is no-brainer.
> >
> > The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> > should be extended to CAP_BPF or not.
> > imo devmap and xskmap can stay CAP_NET_ADMIN,
> > but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> > Initially cgroup-bpf hooks were limited to networking.
> > It's no longer the case. Requiring NET_ADMIN there make little sense now.
> >
>
> Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process. Unsafe pointers are similar.
'never to less trusted process' ? why do you think so?
I don't see a problem adding /dev/bpf/foo in the future and make things
more granular. There is no such use case today. Hence I don't want to
spend time and design something without clear use case in mind.
> Do new programs really need the by_id calls?
yes. Lorenz gave an example earlier. map-in-map returns map_id.
To operate on that map by_id is needed.
^ permalink raw reply
* [PATCH] ipvlan: set hw_enc_features like macvlan
From: Bill Sommerfeld @ 2019-08-15 0:10 UTC (permalink / raw)
To: David S. Miller, Petr Machata
Cc: Jiri Pirko, Ido Schimmel, Daniel Borkmann, YueHaibing,
Thomas Gleixner, Miaohe Lin, Eric Dumazet, Mahesh Bandewar,
netdev, linux-kernel, Bill Sommerfeld
Allow encapsulated packets sent to tunnels layered over ipvlan to use
offloads rather than forcing SW fallbacks.
Since commit f21e5077010acda73a60 ("macvlan: add offload features for
encapsulation"), macvlan has set dev->hw_enc_features to include
everything in dev->features; do likewise in ipvlan.
Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com>
---
drivers/net/ipvlan/ipvlan_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 1c96bed5a7c4..887bbba4631e 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -126,6 +126,7 @@ static int ipvlan_init(struct net_device *dev)
(phy_dev->state & IPVLAN_STATE_MASK);
dev->features = phy_dev->features & IPVLAN_FEATURES;
dev->features |= NETIF_F_LLTX | NETIF_F_VLAN_CHALLENGED;
+ dev->hw_enc_features |= dev->features;
dev->gso_max_size = phy_dev->gso_max_size;
dev->gso_max_segs = phy_dev->gso_max_segs;
dev->hard_header_len = phy_dev->hard_header_len;
--
2.23.0.rc1.153.gdeed80330f-goog
^ permalink raw reply related
* [PATCH bpf-next] tools: libbpf: update extended attributes version of bpf_object__open()
From: Anton Protopopov @ 2019-08-15 0:03 UTC (permalink / raw)
To: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
Martin KaFai Lau, Song Liu, Yonghong Song, netdev, bpf,
linux-kernel
Cc: Anton Protopopov
Update the bpf_object_open_attr structure and corresponding code so that the
bpf_object__open_xattr function could be used to open objects from buffers as
well as from files. The reason for this change is that the existing
bpf_object__open_buffer function doesn't provide a way to specify neither the
needs_kver nor flags parameters to the internal call to __bpf_object__open
which makes it inconvenient for loading BPF objects which doesn't require a
kernel version.
Two new fields, obj_buf and obj_buf_sz, were added to the structure, and the
file field was union'ed with obj_name so that one can open an object like this:
struct bpf_object_open_attr attr = {
.obj_name = name,
.obj_buf = obj_buf,
.obj_buf_sz = obj_buf_sz,
.prog_type = BPF_PROG_TYPE_UNSPEC,
};
return bpf_object__open_xattr(&attr);
while still being able to use the file semantics:
struct bpf_object_open_attr attr = {
.file = path,
.prog_type = BPF_PROG_TYPE_UNSPEC,
};
return bpf_object__open_xattr(&attr);
Another thing to note is that since the commit c034a177d3c8 ("bpf: bpftool, add
flag to allow non-compat map definitions") which introduced a MAPS_RELAX_COMPAT
flag to load objects with non-compat map definitions, bpf_object__open_buffer
was called with this flag enabled (it was passed as the boolean true value in
flags argument to __bpf_object__open). The default behaviour for all open
functions is to clear this flag and this patch changes bpf_object__open_buffer
to clears this flag. It can be enabled, if needed, by opening an object from
buffer using __bpf_object__open_xattr.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
---
tools/lib/bpf/libbpf.c | 45 ++++++++++++++++++++++++++----------------
tools/lib/bpf/libbpf.h | 7 ++++++-
2 files changed, 34 insertions(+), 18 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2233f919dd88..7c8054afd901 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -3630,13 +3630,31 @@ __bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz,
struct bpf_object *__bpf_object__open_xattr(struct bpf_object_open_attr *attr,
int flags)
{
+ char tmp_name[64];
+
/* param validation */
- if (!attr->file)
+ if (!attr)
return NULL;
- pr_debug("loading %s\n", attr->file);
+ if (attr->obj_buf) {
+ if (attr->obj_buf_sz <= 0)
+ return NULL;
+ if (!attr->file) {
+ snprintf(tmp_name, sizeof(tmp_name), "%lx-%lx",
+ (unsigned long)attr->obj_buf,
+ (unsigned long)attr->obj_buf_sz);
+ attr->obj_name = tmp_name;
+ }
+ pr_debug("loading object '%s' from buffer\n", attr->obj_name);
+ } else if (!attr->file) {
+ return NULL;
+ } else {
+ attr->obj_buf_sz = 0;
- return __bpf_object__open(attr->file, NULL, 0,
+ pr_debug("loading object file '%s'\n", attr->file);
+ }
+
+ return __bpf_object__open(attr->file, attr->obj_buf, attr->obj_buf_sz,
bpf_prog_type__needs_kver(attr->prog_type),
flags);
}
@@ -3660,21 +3678,14 @@ struct bpf_object *bpf_object__open_buffer(void *obj_buf,
size_t obj_buf_sz,
const char *name)
{
- char tmp_name[64];
-
- /* param validation */
- if (!obj_buf || obj_buf_sz <= 0)
- return NULL;
-
- if (!name) {
- snprintf(tmp_name, sizeof(tmp_name), "%lx-%lx",
- (unsigned long)obj_buf,
- (unsigned long)obj_buf_sz);
- name = tmp_name;
- }
- pr_debug("loading object '%s' from buffer\n", name);
+ struct bpf_object_open_attr attr = {
+ .obj_name = name,
+ .obj_buf = obj_buf,
+ .obj_buf_sz = obj_buf_sz,
+ .prog_type = BPF_PROG_TYPE_UNSPEC,
+ };
- return __bpf_object__open(name, obj_buf, obj_buf_sz, true, true);
+ return bpf_object__open_xattr(&attr);
}
int bpf_object__unload(struct bpf_object *obj)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e8f70977d137..634f278578dd 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -63,8 +63,13 @@ LIBBPF_API libbpf_print_fn_t libbpf_set_print(libbpf_print_fn_t fn);
struct bpf_object;
struct bpf_object_open_attr {
- const char *file;
+ union {
+ const char *file;
+ const char *obj_name;
+ };
enum bpf_prog_type prog_type;
+ void *obj_buf;
+ size_t obj_buf_sz;
};
LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
--
2.19.1
^ permalink raw reply related
* Re: [PATCH net-next v2 11/14] netdevsim: Add devlink-trap support
From: Jakub Kicinski @ 2019-08-14 23:59 UTC (permalink / raw)
To: Ido Schimmel
Cc: netdev, davem, nhorman, jiri, toke, dsahern, roopa, nikolay, andy,
f.fainelli, andrew, vivien.didelot, mlxsw, Ido Schimmel
In-Reply-To: <20190813075400.11841-12-idosch@idosch.org>
On Tue, 13 Aug 2019 10:53:57 +0300, Ido Schimmel wrote:
> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> index 08ca59fc189b..2758d95c8d18 100644
> --- a/drivers/net/netdevsim/dev.c
> +++ b/drivers/net/netdevsim/dev.c
> @@ -17,11 +17,21 @@
>
> #include <linux/debugfs.h>
> #include <linux/device.h>
> +#include <linux/etherdevice.h>
> +#include <linux/inet.h>
> +#include <linux/jiffies.h>
> +#include <linux/kernel.h>
> #include <linux/list.h>
> #include <linux/mutex.h>
> #include <linux/random.h>
> +#include <linux/workqueue.h>
> +#include <linux/random.h>
> #include <linux/rtnetlink.h>
> #include <net/devlink.h>
> +#include <net/ip.h>
> +#include <uapi/linux/devlink.h>
> +#include <uapi/linux/ip.h>
> +#include <uapi/linux/udp.h>
Please keep includes ordered alphabetically. You're adding
linux/random.h second time.
> #include "netdevsim.h"
> +static void nsim_dev_trap_report(struct nsim_dev_port *nsim_dev_port)
> +{
> + struct nsim_dev *nsim_dev = nsim_dev_port->ns->nsim_dev;
> + struct nsim_trap_data *nsim_trap_data = nsim_dev->trap_data;
> + struct devlink *devlink = priv_to_devlink(nsim_dev);
> + int i;
reverse christmas tree, please
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-14 23:59 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814233335.37t4zfsiswrpd4d6@ast-mbp.dhcp.thefacebook.com>
> On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
>>
>>
>>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>>>>
>>>> If eBPF is genuinely not usable by programs that are not fully trusted
>>>> by the admin, then no kernel changes at all are needed. Programs that
>>>> want to reduce their own privileges can easily fork() a privileged
>>>> subprocess or run a little helper to which they delegate BPF
>>>> operations. This is far more flexible than anything that will ever be
>>>> in the kernel because it allows the helper to verify that the rest of
>>>> the program is doing exactly what it's supposed to and restrict eBPF
>>>> operations to exactly the subset that is needed. So a container
>>>> manager or network manager that drops some provilege could have a
>>>> little bpf-helper that manages its BPF XDP, firewalling, etc
>>>> configuration. The two processes would talk over a socketpair.
>>>
>>> there were three projects that tried to delegate bpf operations.
>>> All of them failed.
>>> bpf operational workflow is much more complex than you're imagining.
>>> fork() also doesn't work for all cases.
>>> I gave this example before: consider multiple systemd-like deamons
>>> that need to do bpf operations that want to pass this 'bpf capability'
>>> to other deamons written by other teams. Some of them will start
>>> non-root, but still need to do bpf. They will be rpm installed
>>> and live upgraded while running.
>>> We considered to make systemd such centralized bpf delegation
>>> authority too. It didn't work. bpf in kernel grows quickly.
>>> libbpf part grows independently. llvm keeps evolving.
>>> All of them are being changed while system overall has to stay
>>> operational. Centralized approach breaks apart.
>>>
>>>> The interesting cases you're talking about really *do* involved
>>>> unprivileged or less privileged eBPF, though. Let's see:
>>>>
>>>> systemd --user: systemd --user *is not privileged at all*. There's no
>>>> issue of reducing privilege, since systemd --user doesn't have any
>>>> privilege to begin with. But systemd supports some eBPF features, and
>>>> presumably it would like to support them in the systemd --user case.
>>>> This is unprivileged eBPF.
>>>
>>> Let's disambiguate the terminology.
>>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
>>> I think that was a mistake.
>>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
>>> This is not unprivileged.
>>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>>>
>>> There is a huge difference between the two.
>>> I'm against extending 'unprivileged bpf' even a bit more than what it is
>>> today for many reasons mentioned earlier.
>>> The /dev/bpf is about 'less privileged'.
>>> Less privileged than root. We need to split part of full root capability
>>> into bpf capability. So that most of the root can be dropped.
>>> This is very similar to what cap_net_admin does.
>>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
>>> cap_net_admin is very much privileged. Just 'less privileged' than root.
>>> Same thing for cap_bpf.
>>
>> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
>
> Initially I agreed that it's probably too broad, but then realized
> that they're perfect as-is. There is no need to partition further.
>
>>> May be we should do both cap_bpf and /dev/bpf to make it clear that
>>> this is the same thing. Two interfaces to achieve the same result.
>>
>> What for? If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
>
> Indeed, ambient capabilities should work for all cases.
>
>> No, I’m not. I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task. Changing *all* of the capable checks is needlessly broad.
>
> There are not that many bits left. I prefer to consume single CAP_BPF bit.
> All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> This is no-brainer.
>
> The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> should be extended to CAP_BPF or not.
> imo devmap and xskmap can stay CAP_NET_ADMIN,
> but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> Initially cgroup-bpf hooks were limited to networking.
> It's no longer the case. Requiring NET_ADMIN there make little sense now.
>
Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process. Unsafe pointers are similar. The rest could plausibly be hardened in the future, although the by_id stuff may be tricky too.
Do new programs really need the by_id calls? It could make sense to leave those unchanged and to have new programs use persistent maps instead.
^ permalink raw reply
* RE: [PATCH] selftests: net: tcp_fastopen_backup_key.sh: fix shellcheck issue
From: Tim.Bird @ 2019-08-14 23:34 UTC (permalink / raw)
To: anders.roxell, davem, shuah; +Cc: netdev, linux-kselftest, linux-kernel
In-Reply-To: <20190814214948.5571-1-anders.roxell@linaro.org>
> -----Original Message-----
> From: Anders Roxell
>
> When running tcp_fastopen_backup_key.sh the following issue was seen in
> a busybox environment.
> ./tcp_fastopen_backup_key.sh: line 33: [: -ne: unary operator expected
>
> Shellcheck showed the following issue.
> $ shellcheck tools/testing/selftests/net/tcp_fastopen_backup_key.sh
>
> In tools/testing/selftests/net/tcp_fastopen_backup_key.sh line 33:
> if [ $val -ne 0 ]; then
> ^-- SC2086: Double quote to prevent globbing and word splitting.
>
> Rework to add double quotes around the variable 'val' that shellcheck
> recommends.
>
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
> ---
> tools/testing/selftests/net/tcp_fastopen_backup_key.sh | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> b/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> index 41476399e184..ba5ec3eb314e 100755
> --- a/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> +++ b/tools/testing/selftests/net/tcp_fastopen_backup_key.sh
> @@ -30,7 +30,7 @@ do_test() {
> ip netns exec "${NETNS}" ./tcp_fastopen_backup_key "$1"
> val=$(ip netns exec "${NETNS}" nstat -az | \
> grep TcpExtTCPFastOpenPassiveFail | awk '{print $2}')
> - if [ $val -ne 0 ]; then
> + if [ "$val" -ne 0 ]; then
Did you test this in the failing environment?
With a busybox shell, I get:
$ [ "" -ne 0 ]
sh: bad number
You might need to explicitly check for empty string here, or switch to a string comparison instead:
if [ "$val" != 0 ]; then
-- Tim
> echo "FAIL: TcpExtTCPFastOpenPassiveFail non-zero"
> return 1
> fi
> --
> 2.20.1
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-14 23:33 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <AD211133-EA60-4B91-AB1B-201713F50AB2@amacapital.net>
On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
>
>
> > On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> >> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >>
> >> If eBPF is genuinely not usable by programs that are not fully trusted
> >> by the admin, then no kernel changes at all are needed. Programs that
> >> want to reduce their own privileges can easily fork() a privileged
> >> subprocess or run a little helper to which they delegate BPF
> >> operations. This is far more flexible than anything that will ever be
> >> in the kernel because it allows the helper to verify that the rest of
> >> the program is doing exactly what it's supposed to and restrict eBPF
> >> operations to exactly the subset that is needed. So a container
> >> manager or network manager that drops some provilege could have a
> >> little bpf-helper that manages its BPF XDP, firewalling, etc
> >> configuration. The two processes would talk over a socketpair.
> >
> > there were three projects that tried to delegate bpf operations.
> > All of them failed.
> > bpf operational workflow is much more complex than you're imagining.
> > fork() also doesn't work for all cases.
> > I gave this example before: consider multiple systemd-like deamons
> > that need to do bpf operations that want to pass this 'bpf capability'
> > to other deamons written by other teams. Some of them will start
> > non-root, but still need to do bpf. They will be rpm installed
> > and live upgraded while running.
> > We considered to make systemd such centralized bpf delegation
> > authority too. It didn't work. bpf in kernel grows quickly.
> > libbpf part grows independently. llvm keeps evolving.
> > All of them are being changed while system overall has to stay
> > operational. Centralized approach breaks apart.
> >
> >> The interesting cases you're talking about really *do* involved
> >> unprivileged or less privileged eBPF, though. Let's see:
> >>
> >> systemd --user: systemd --user *is not privileged at all*. There's no
> >> issue of reducing privilege, since systemd --user doesn't have any
> >> privilege to begin with. But systemd supports some eBPF features, and
> >> presumably it would like to support them in the systemd --user case.
> >> This is unprivileged eBPF.
> >
> > Let's disambiguate the terminology.
> > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > I think that was a mistake.
> > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > This is not unprivileged.
> > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >
> > There is a huge difference between the two.
> > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > today for many reasons mentioned earlier.
> > The /dev/bpf is about 'less privileged'.
> > Less privileged than root. We need to split part of full root capability
> > into bpf capability. So that most of the root can be dropped.
> > This is very similar to what cap_net_admin does.
> > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > Same thing for cap_bpf.
>
> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
Initially I agreed that it's probably too broad, but then realized
that they're perfect as-is. There is no need to partition further.
> > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > this is the same thing. Two interfaces to achieve the same result.
>
> What for? If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
Indeed, ambient capabilities should work for all cases.
> No, I’m not. I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task. Changing *all* of the capable checks is needlessly broad.
There are not that many bits left. I prefer to consume single CAP_BPF bit.
All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
This is no-brainer.
The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
should be extended to CAP_BPF or not.
imo devmap and xskmap can stay CAP_NET_ADMIN,
but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
Initially cgroup-bpf hooks were limited to networking.
It's no longer the case. Requiring NET_ADMIN there make little sense now.
^ permalink raw reply
* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: Florian Fainelli @ 2019-08-14 23:28 UTC (permalink / raw)
To: Andrew Lunn, Igor Russkikh
Cc: Antoine Tenart, davem@davemloft.net, sd@queasysnail.net,
hkallweit1@gmail.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, thomas.petazzoni@bootlin.com,
alexandre.belloni@bootlin.com, allan.nielsen@microchip.com,
camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <20190813162823.GH15047@lunn.ch>
On 8/13/19 9:28 AM, Andrew Lunn wrote:
>> 1) With current implementation it's impossible to install SW macsec engine onto
>> the device which supports HW offload. That could be a strong limitation in
>> cases when user sees HW macsec offload is broken or work differently, and he/she
>> wants to replace it with SW one.
>> MACSec is a complex feature, and it may happen something is missing in HW.
>> Trivial example is 256bit encryption, which is not always a musthave in HW
>> implementations.
>
> Ideally, we want the driver to return EOPNOTSUPP if it does not
> support something and the software implement should be used.
>
> If the offload is broken, we want a bug report! And if it works
> differently, it suggests there is also a bug we need to fix, or the
> standard is ambiguous.
>
> It would also be nice to add extra information to the netlink API to
> indicate if HW or SW is being used. In other places where we offload
> to accelerators we have such additional information.
Igor's point is entirely valid in that you should allow both offload to
HW for what is possible and offload to a software implementation for
what is not supported in HW.
Let's an example that is hopefully more familiar to the people in this
thread. Consider a NIC that can do single VLAN tag offload on xmit (or
receive, does not matter), and you find yourself using a double VLAN tag
configuration. You would create a first VLAN stacked network device
which is fully offloaded onto the underlying NIC, and a second VLAN
stacked network device on top of the first once which is not offloaded.
The way I would imagine a MACsec offload is kind of similar here, in
that it would be a macsec virtual network device on top of an underlying
physical device. If no offload is possible, the virtual network device's
xmit/receive path is used. If the NIC driver can offload it, it does
not. How it does it, whether at the MAC/DMA level, or at the PHY level
can be a check added at the appropriate level.
--
Florian
^ permalink raw reply
* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
From: Daniel Borkmann @ 2019-08-14 23:17 UTC (permalink / raw)
To: Björn Töpel
Cc: Alexei Starovoitov, Netdev, Björn Töpel,
Karlsson, Magnus, Bruce Richardson, Song Liu, bpf
In-Reply-To: <CAJ+HfNhO+xSs25aPat9WjC75W6_Kgfq=GU+YCEcoZw-GCjZdEg@mail.gmail.com>
On 8/12/19 7:25 PM, Björn Töpel wrote:
> On Mon, 12 Aug 2019 at 14:28, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
> [...]
>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>> index 59b57d708697..c3447bad608a 100644
>>> --- a/net/xdp/xsk.c
>>> +++ b/net/xdp/xsk.c
>>> @@ -362,6 +362,50 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>>> dev_put(dev);
>>> }
>>>
>>> +static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
>>> + struct xdp_sock ***map_entry)
>>> +{
>>> + struct xsk_map *map = NULL;
>>> + struct xsk_map_node *node;
>>> +
>>> + *map_entry = NULL;
>>> +
>>> + spin_lock_bh(&xs->map_list_lock);
>>> + node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
>>> + node);
>>> + if (node) {
>>> + WARN_ON(xsk_map_inc(node->map));
>>
>> Can you elaborate on the refcount usage here and against what scenario it is protecting?
>
> Thanks for having a look!
>
> First we access the map_list (under the lock) and pull out the map
> which we intend to clean. In order to clear the map entry, we need to
> a reference to the map. However, when the map_list_lock is released,
> there's a window where the map entry can be cleared and the map can be
> destroyed, and making the "map", which is used in
> xsk_delete_from_maps, stale. To guarantee existence the additional
> refinc is required. Makes sense?
Seems reasonable to me, and inc as opposed to inc_not_zero is also fine
here since at this point in time we're still holding one reference to
the map. But I think there's a catch with the current code that still
needs fixing:
Imagine you do a xsk_map_update_elem() where we have a situation where
xs == old_xs. There, we first do the xsk_map_sock_add() to add the new
xsk map node at the tail of the socket's xs->map_list. We do the xchg()
and then xsk_map_sock_delete() for old_xs which then walks xs->map_list
again and purges all entries including the just newly created one. This
means we'll end up with an xs socket at the given map slot, but the xs
socket has empty xs->map_list. This means we could release the xs sock
and the xsk_delete_from_maps() won't need to clean up anything anymore
but yet the xs is still in the map slot, so if you redirect to that
socket, it would be use-after-free, no?
>> Do we pretend it never fails on the bpf_map_inc() wrt the WARN_ON(),
>> why that (what makes it different from the xsk_map_node_alloc() inc
>> above where we do error out)?
>
> Hmm, given that we're in a cleanup (socket release), we can't really
> return any error. What would be a more robust way? Retrying? AFAIK the
> release ops return an int, but it's not checked/used.
>
>>> + map = node->map;
>>> + *map_entry = node->map_entry;
>>> + }
>>> + spin_unlock_bh(&xs->map_list_lock);
>>> + return map;
>>> +}
>>> +
>>> +static void xsk_delete_from_maps(struct xdp_sock *xs)
>>> +{
>>> + /* This function removes the current XDP socket from all the
>>> + * maps it resides in. We need to take extra care here, due to
>>> + * the two locks involved. Each map has a lock synchronizing
>>> + * updates to the entries, and each socket has a lock that
>>> + * synchronizes access to the list of maps (map_list). For
>>> + * deadlock avoidance the locks need to be taken in the order
>>> + * "map lock"->"socket map list lock". We start off by
>>> + * accessing the socket map list, and take a reference to the
>>> + * map to guarantee existence. Then we ask the map to remove
>>> + * the socket, which tries to remove the socket from the
>>> + * map. Note that there might be updates to the map between
>>> + * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
>>> + */
>
> I tried to clarify here, but I obviously need to do a better job. :-)
>
>
> Björn
>
^ permalink raw reply
* Re: [PATCH net-next v2 5/9] net: phy: add MACsec ops in phy_device
From: Florian Fainelli @ 2019-08-14 23:15 UTC (permalink / raw)
To: Antoine Tenart, davem, sd, andrew, hkallweit1
Cc: netdev, linux-kernel, thomas.petazzoni, alexandre.belloni,
allan.nielsen, camelia.groza, Simon.Edelhaus
In-Reply-To: <20190808140600.21477-6-antoine.tenart@bootlin.com>
On 8/8/19 7:05 AM, Antoine Tenart wrote:
> This patch adds a reference to MACsec ops in the phy_device, to allow
> PHYs to support offloading MACsec operations. The phydev lock will be
> held while calling those helpers.
>
> Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
> ---
> include/linux/phy.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 462b90b73f93..6947a19587e4 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -22,6 +22,10 @@
> #include <linux/workqueue.h>
> #include <linux/mod_devicetable.h>
>
> +#ifdef CONFIG_MACSEC
> +#include <net/macsec.h>
> +#endif
#if IS_ENABLED(CONFIG_MACSEC)
> +
> #include <linux/atomic.h>
>
> #define PHY_DEFAULT_FEATURES (SUPPORTED_Autoneg | \
> @@ -345,6 +349,7 @@ struct phy_c45_device_ids {
> * attached_dev: The attached enet driver's device instance ptr
> * adjust_link: Callback for the enet controller to respond to
> * changes in the link state.
> + * macsec_ops: MACsec offloading ops.
> *
> * speed, duplex, pause, supported, advertising, lp_advertising,
> * and autoneg are used like in mii_if_info
> @@ -438,6 +443,11 @@ struct phy_device {
>
> void (*phy_link_change)(struct phy_device *, bool up, bool do_carrier);
> void (*adjust_link)(struct net_device *dev);
> +
> +#if defined(CONFIG_MACSEC)
> + /* MACsec management functions */
> + const struct macsec_ops *macsec_ops;
> +#endif
#if IS_ENABLED(CONFIG_MACSEC)
likewise.
--
Florian
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-14 22:30 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814220545.co5pucyo5jk3weiv@ast-mbp.dhcp.thefacebook.com>
> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>>
>> If eBPF is genuinely not usable by programs that are not fully trusted
>> by the admin, then no kernel changes at all are needed. Programs that
>> want to reduce their own privileges can easily fork() a privileged
>> subprocess or run a little helper to which they delegate BPF
>> operations. This is far more flexible than anything that will ever be
>> in the kernel because it allows the helper to verify that the rest of
>> the program is doing exactly what it's supposed to and restrict eBPF
>> operations to exactly the subset that is needed. So a container
>> manager or network manager that drops some provilege could have a
>> little bpf-helper that manages its BPF XDP, firewalling, etc
>> configuration. The two processes would talk over a socketpair.
>
> there were three projects that tried to delegate bpf operations.
> All of them failed.
> bpf operational workflow is much more complex than you're imagining.
> fork() also doesn't work for all cases.
> I gave this example before: consider multiple systemd-like deamons
> that need to do bpf operations that want to pass this 'bpf capability'
> to other deamons written by other teams. Some of them will start
> non-root, but still need to do bpf. They will be rpm installed
> and live upgraded while running.
> We considered to make systemd such centralized bpf delegation
> authority too. It didn't work. bpf in kernel grows quickly.
> libbpf part grows independently. llvm keeps evolving.
> All of them are being changed while system overall has to stay
> operational. Centralized approach breaks apart.
>
>> The interesting cases you're talking about really *do* involved
>> unprivileged or less privileged eBPF, though. Let's see:
>>
>> systemd --user: systemd --user *is not privileged at all*. There's no
>> issue of reducing privilege, since systemd --user doesn't have any
>> privilege to begin with. But systemd supports some eBPF features, and
>> presumably it would like to support them in the systemd --user case.
>> This is unprivileged eBPF.
>
> Let's disambiguate the terminology.
> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> I think that was a mistake.
> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> This is not unprivileged.
> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>
> There is a huge difference between the two.
> I'm against extending 'unprivileged bpf' even a bit more than what it is
> today for many reasons mentioned earlier.
> The /dev/bpf is about 'less privileged'.
> Less privileged than root. We need to split part of full root capability
> into bpf capability. So that most of the root can be dropped.
> This is very similar to what cap_net_admin does.
> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> cap_net_admin is very much privileged. Just 'less privileged' than root.
> Same thing for cap_bpf.
The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
>
> May be we should do both cap_bpf and /dev/bpf to make it clear that
> this is the same thing. Two interfaces to achieve the same result.
What for? If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
>
>> Seccomp. Seccomp already uses cBPF, which is a form of BPF although
>> it doesn't involve the bpf() syscall. There are some seccomp
>> proposals in the works that will want some stuff from eBPF. In
>
> I'm afraid these proposals won't go anywhere.
Can you explain why?
>
>> So it's a bit of a chicken-and-egg situation. There aren't major
>> unprivileged eBPF users because the kernel support isn't there.
>
> As I said before there are zero known use cases of 'unprivileged bpf'.
>
> If I understand you correctly you're refusing to accept that
> 'less privileged bpf' is a valid use case while pushing for extending
> scope of 'unprivileged'.
No, I’m not. I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task. Changing *all* of the capable checks is needlessly broad.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox