* [PATCH V2 net-next 1/3] net: hns3: Fix for VF mailbox cannot receiving PF response
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
To: davem
Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>
From: Xi Wang <wangxi11@huawei.com>
When the VF frequently switches the CMDQ interrupt, if the CMDQ_SRC is not
cleared, the VF will not receive the new PF response after the interrupt
is re-enabled, the corresponding log is as follows:
[ 317.482222] hns3 0000:00:03.0: VF could not get mbx resp(=0) from PF
in 500 tries
[ 317.483137] hns3 0000:00:03.0: VF request to get tqp info from PF
failed -5
This patch fixes this problem by clearing CMDQ_SRC before enabling
interrupt and syncing pending IRQ handlers after disabling interrupt.
Fixes: e2cb1dec9779 ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index dd8e8e6..d55ee9c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1576,6 +1576,8 @@ static int hclgevf_misc_irq_init(struct hclgevf_dev *hdev)
return ret;
}
+ hclgevf_clear_event_cause(hdev, 0);
+
/* enable misc. vector(vector 0) */
hclgevf_enable_vector(&hdev->misc_vector, true);
@@ -1586,6 +1588,7 @@ static void hclgevf_misc_irq_uninit(struct hclgevf_dev *hdev)
{
/* disable misc vector(vector 0) */
hclgevf_enable_vector(&hdev->misc_vector, false);
+ synchronize_irq(hdev->misc_vector.vector_irq);
free_irq(hdev->misc_vector.vector_irq, hdev);
hclgevf_free_vector(hdev, 0);
}
--
2.7.4
^ permalink raw reply related
* [PATCH V2 net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
To: davem
Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>
From: Xi Wang <wangxi11@huawei.com>
Before the firmware updates the crq's tail pointer, if the VF driver
reads the data in the crq, the data may be incomplete at this time,
which will lead to the driver read an unknown message.
This patch fixes it by checking if crq is empty before reading the
message.
Fixes: b11a0bb231f3 ("net: hns3: Add mailbox support to VF driver")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
Patch V2: Fixes the compilation break reported David Miller & Kbuild
Link: https://lkml.org/lkml/2018/6/5/866
https://lkml.org/lkml/2018/6/6/147
Patch V1: Initial Submit
---
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c | 23 +++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
index a286184..b598c06 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
@@ -126,6 +126,13 @@ int hclgevf_send_mbx_msg(struct hclgevf_dev *hdev, u16 code, u16 subcode,
return status;
}
+static bool hclgevf_cmd_crq_empty(struct hclgevf_hw *hw)
+{
+ u32 tail = hclgevf_read_dev(hw, HCLGEVF_NIC_CRQ_TAIL_REG);
+
+ return tail == hw->cmq.crq.next_to_use;
+}
+
void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
{
struct hclgevf_mbx_resp_status *resp;
@@ -140,11 +147,22 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
resp = &hdev->mbx_resp;
crq = &hdev->hw.cmq.crq;
- flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
- while (hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B)) {
+ while (!hclgevf_cmd_crq_empty(&hdev->hw)) {
desc = &crq->desc[crq->next_to_use];
req = (struct hclge_mbx_pf_to_vf_cmd *)desc->data;
+ flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
+ if (unlikely(!hnae_get_bit(flag, HCLGEVF_CMDQ_RX_OUTVLD_B))) {
+ dev_warn(&hdev->pdev->dev,
+ "dropped invalid mailbox message, code = %d\n",
+ req->msg[0]);
+
+ /* dropping/not processing this invalid message */
+ crq->desc[crq->next_to_use].flag = 0;
+ hclge_mbx_ring_ptr_move_crq(crq);
+ continue;
+ }
+
/* synchronous messages are time critical and need preferential
* treatment. Therefore, we need to acknowledge all the sync
* responses as quickly as possible so that waiting tasks do not
@@ -205,7 +223,6 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
}
crq->desc[crq->next_to_use].flag = 0;
hclge_mbx_ring_ptr_move_crq(crq);
- flag = le16_to_cpu(crq->desc[crq->next_to_use].flag);
}
/* Write back CMDQ_RQ header pointer, M7 need this pointer */
--
2.7.4
^ permalink raw reply related
* [PATCH V2 net-next 3/3] net: hns3: Optimize PF CMDQ interrupt switching process
From: Salil Mehta @ 2018-06-06 13:07 UTC (permalink / raw)
To: davem
Cc: salil.mehta, yisen.zhuang, lipeng321, mehta.salil, netdev,
linux-kernel, linuxarm, Xi Wang
In-Reply-To: <20180606130753.54428-1-salil.mehta@huawei.com>
From: Xi Wang <wangxi11@huawei.com>
When the PF frequently switches the CMDQ interrupt, if the CMDQ_SRC is
not cleared before the hardware interrupt is generated, the new interrupt
will not be reported.
This patch optimizes this problem by clearing CMDQ_SRC and RESET_STS
before enabling interrupt and syncing pending IRQ handlers after disabling
interrupt.
Fixes: 466b0c00391b ("net: hns3: Add support for misc interrupt")
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 2a80134..d318d35 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2557,6 +2557,15 @@ static void hclge_clear_event_cause(struct hclge_dev *hdev, u32 event_type,
}
}
+static void hclge_clear_all_event_cause(struct hclge_dev *hdev)
+{
+ hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_RST,
+ BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B) |
+ BIT(HCLGE_VECTOR0_CORERESET_INT_B) |
+ BIT(HCLGE_VECTOR0_IMPRESET_INT_B));
+ hclge_clear_event_cause(hdev, HCLGE_VECTOR0_EVENT_MBX, 0);
+}
+
static void hclge_enable_vector(struct hclge_misc_vector *vector, bool enable)
{
writel(enable ? 1 : 0, vector->addr);
@@ -5688,6 +5697,8 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
INIT_WORK(&hdev->rst_service_task, hclge_reset_service_task);
INIT_WORK(&hdev->mbx_service_task, hclge_mailbox_service_task);
+ hclge_clear_all_event_cause(hdev);
+
/* Enable MISC vector(vector0) */
hclge_enable_vector(&hdev->misc_vector, true);
@@ -5817,6 +5828,8 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev *ae_dev)
/* Disable MISC vector(vector0) */
hclge_enable_vector(&hdev->misc_vector, false);
+ synchronize_irq(hdev->misc_vector.vector_irq);
+
hclge_destroy_cmd_queue(&hdev->hw);
hclge_misc_irq_uninit(hdev);
hclge_pci_uninit(hdev);
--
2.7.4
^ permalink raw reply related
* RE: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox receiving unknown message
From: Salil Mehta @ 2018-06-06 13:10 UTC (permalink / raw)
To: David Miller
Cc: Zhuangyuzeng (Yisen), lipeng (Y), mehta.salil@opnsrc.net,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
wangxi (M)
In-Reply-To: <20180605.144046.766561700879427040.davem@davemloft.net>
Hi Dave,
Sorry about this. I have fixed this in the V2 patch floated just now.
Best regards
Salil.
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, June 05, 2018 7:41 PM
> To: Salil Mehta <salil.mehta@huawei.com>
> Cc: Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; lipeng (Y)
> <lipeng321@huawei.com>; mehta.salil@opnsrc.net; netdev@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm <linuxarm@huawei.com>; wangxi
> (M) <wangxi11@huawei.com>
> Subject: Re: [PATCH net-next 2/3] net: hns3: Fix for VF mailbox
> receiving unknown message
>
> From: Salil Mehta <salil.mehta@huawei.com>
> Date: Tue, 5 Jun 2018 12:42:00 +0100
>
> > + if (unlikely(!hnae3_get_bit(flag,
> HCLGEVF_CMDQ_RX_OUTVLD_B))) {
>
> This breaks the build, there is no such symbol named hnae3_get_bit().
^ permalink raw reply
* [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:16 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Tom Herbert, Kirill Tkhai
KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
without acquiring any lock. Moreover, in R() when the MSG_PEEK
flag is not present, the skb is peeked and dequeued with two
separate, non-atomic, calls.
The above create room for races, which SYZBOT has been able to
exploit, causing list corruption and kernel oops:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
FS: 0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
__sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
__do_sys_recvmmsg net/socket.c:2485 [inline]
__se_sys_recvmmsg net/socket.c:2481 [inline]
__x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4417a9
RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
To fix the above, we need to use the locked version of the socket dequeue
helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
the available skb when not peeking.
RFC -> v1:
- use skb_dequeue(), as suggested by Tom
- explicitly close the race between skb_peek and skb_unlink
Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This is an RFC, since I'm really new to this area, anyway the syzport
reported success in testing the proposed fix.
This is very likely a scenario where the upcoming skb->prev,next -> list_head
conversion would have helped a lot, thanks to list poisoning and list debug
---
net/kcm/kcmsock.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index d3601d421571..dd2d02bb35ae 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
struct sk_buff *skb;
struct kcm_sock *kcm;
- while ((skb = __skb_dequeue(head))) {
+ while ((skb = skb_dequeue(head))) {
/* Reset destructor to avoid calling kcm_rcv_ready */
skb->destructor = sock_rfree;
skb_orphan(skb);
@@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
return err;
}
-static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
+static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
long timeo, int *err)
{
struct sk_buff *skb;
- while (!(skb = skb_peek(&sk->sk_receive_queue))) {
+ for (;; ) {
+ skb = peek ? skb_peek(&sk->sk_receive_queue) :
+ skb_dequeue(&sk->sk_receive_queue);
+ if (skb)
+ break;
+
if (sk->sk_err) {
*err = sock_error(sk);
return NULL;
@@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
{
struct sock *sk = sock->sk;
struct kcm_sock *kcm = kcm_sk(sk);
+ bool peek = flags & MSG_PEEK;
int err = 0;
long timeo;
struct strp_msg *stm;
@@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
lock_sock(sk);
- skb = kcm_wait_data(sk, flags, timeo, &err);
+ skb = kcm_wait_data(sk, flags, peek, timeo, &err);
if (!skb)
goto out;
@@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
goto out;
copied = len;
- if (likely(!(flags & MSG_PEEK))) {
+ if (likely(!peek)) {
KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
if (copied < stm->full_len) {
if (sock->type == SOCK_DGRAM) {
@@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
/* Finished with message */
msg->msg_flags |= MSG_EOR;
KCM_STATS_INCR(kcm->stats.rx_msgs);
- skb_unlink(skb, &sk->sk_receive_queue);
kfree_skb(skb);
}
}
@@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
lock_sock(sk);
- skb = kcm_wait_data(sk, flags, timeo, &err);
+ skb = kcm_wait_data(sk, flags, true, timeo, &err);
if (!skb)
goto err_out;
--
2.17.1
^ permalink raw reply related
* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Kirill Tkhai @ 2018-06-06 13:28 UTC (permalink / raw)
To: Paolo Abeni, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <628e0398546aefabd68669450621909d269e1ba8.1528289745.git.pabeni@redhat.com>
On 06.06.2018 16:16, Paolo Abeni wrote:
> KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
>
> without acquiring any lock. Moreover, in R() when the MSG_PEEK
> flag is not present, the skb is peeked and dequeued with two
> separate, non-atomic, calls.
>
> The above create room for races, which SYZBOT has been able to
> exploit, causing list corruption and kernel oops:
>
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> FS: 0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
> sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
> ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
> __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
> do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
> __do_sys_recvmmsg net/socket.c:2485 [inline]
> __se_sys_recvmmsg net/socket.c:2481 [inline]
> __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
> do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4417a9
> RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
>
> To fix the above, we need to use the locked version of the socket dequeue
> helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> the available skb when not peeking.
>
> RFC -> v1:
> - use skb_dequeue(), as suggested by Tom
> - explicitly close the race between skb_peek and skb_unlink
>
> Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This is an RFC, since I'm really new to this area, anyway the syzport
> reported success in testing the proposed fix.
> This is very likely a scenario where the upcoming skb->prev,next -> list_head
> conversion would have helped a lot, thanks to list poisoning and list debug
> ---
> net/kcm/kcmsock.c | 19 ++++++++++++-------
> 1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index d3601d421571..dd2d02bb35ae 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
> struct sk_buff *skb;
> struct kcm_sock *kcm;
>
> - while ((skb = __skb_dequeue(head))) {
> + while ((skb = skb_dequeue(head))) {
I try to find how the patch protects against the following race:
requeue_rx_msgs() kcm_recvmsg()
skb = skb_dequeue() skb = kcm_wait_data(peek = true)
... ...
free skb ...
... skb_copy_datagram_msg(skb) <--- Use after free?
Isn't there possible a use-after-free?
Thanks,
Kirill
> /* Reset destructor to avoid calling kcm_rcv_ready */
> skb->destructor = sock_rfree;
> skb_orphan(skb);
> @@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
> return err;
> }
>
> -static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
> +static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, bool peek,
> long timeo, int *err)
> {
> struct sk_buff *skb;
>
> - while (!(skb = skb_peek(&sk->sk_receive_queue))) {
> + for (;; ) {
> + skb = peek ? skb_peek(&sk->sk_receive_queue) :
> + skb_dequeue(&sk->sk_receive_queue);
> + if (skb)
> + break;
> +
> if (sk->sk_err) {
> *err = sock_error(sk);
> return NULL;
> @@ -1116,6 +1121,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
> {
> struct sock *sk = sock->sk;
> struct kcm_sock *kcm = kcm_sk(sk);
> + bool peek = flags & MSG_PEEK;
> int err = 0;
> long timeo;
> struct strp_msg *stm;
> @@ -1126,7 +1132,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
>
> lock_sock(sk);
>
> - skb = kcm_wait_data(sk, flags, timeo, &err);
> + skb = kcm_wait_data(sk, flags, peek, timeo, &err);
> if (!skb)
> goto out;
>
> @@ -1142,7 +1148,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
> goto out;
>
> copied = len;
> - if (likely(!(flags & MSG_PEEK))) {
> + if (likely(!peek)) {
> KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
> if (copied < stm->full_len) {
> if (sock->type == SOCK_DGRAM) {
> @@ -1157,7 +1163,6 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
> /* Finished with message */
> msg->msg_flags |= MSG_EOR;
> KCM_STATS_INCR(kcm->stats.rx_msgs);
> - skb_unlink(skb, &sk->sk_receive_queue);
> kfree_skb(skb);
> }
> }
> @@ -1186,7 +1191,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
>
> lock_sock(sk);
>
> - skb = kcm_wait_data(sk, flags, timeo, &err);
> + skb = kcm_wait_data(sk, flags, true, timeo, &err);
> if (!skb)
> goto err_out;
>
>
^ permalink raw reply
* WARNING: refcount bug in smc_tcp_listen_work
From: syzbot @ 2018-06-06 13:32 UTC (permalink / raw)
To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun
Hello,
syzbot found the following crash on:
HEAD commit: 75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1632a6f7800000
kernel config: https://syzkaller.appspot.com/x/.config?x=7b76ec81cd4aad4b
dashboard link: https://syzkaller.appspot.com/bug?extid=9e60d2428a42049a592a
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9e60d2428a42049a592a@syzkaller.appspotmail.com
A link change request failed with some changes committed already. Interface
bridge0 may have been left with an inconsistent configuration, please check.
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 0 PID: 23870 at lib/refcount.c:187
refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 23870 Comm: kworker/0:4 Not tainted 4.17.0-rc7+ #79
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: events smc_tcp_listen_work
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:refcount_sub_and_test+0x2d3/0x330 lib/refcount.c:187
RSP: 0018:ffff880194aaf4f8 EFLAGS: 00010286
RAX: 0000000000000026 RBX: 0000000000000000 RCX: ffffffff8160b8ad
RDX: 0000000000000000 RSI: ffffffff81610561 RDI: ffff880194aaf058
RBP: ffff880194aaf5e0 R08: ffff88014148e500 R09: 0000000000000006
R10: ffff88014148e500 R11: 0000000000000000 R12: 00000000ffffffff
R13: ffff880194aaf5b8 R14: 0000000000000001 R15: ffff880194aaf728
refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
sock_put include/net/sock.h:1668 [inline]
smc_tcp_listen_work+0xb94/0xec0 net/smc/af_smc.c:1073
process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
^ permalink raw reply
* Re: [PATCH] netfilter: provide udp*_lib_lookup for nf_tproxy
From: Pablo Neira Ayuso @ 2018-06-06 13:33 UTC (permalink / raw)
To: David Miller
Cc: arnd, kuznet, yoshfuji, ecklm94, pabeni, willemb, edumazet,
dsahern, kafai, netdev, linux-kernel
In-Reply-To: <20180605.105453.339908802413146875.davem@davemloft.net>
On Tue, Jun 05, 2018 at 10:54:53AM -0400, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Tue, 5 Jun 2018 13:40:34 +0200
>
> > It is now possible to enable the libified nf_tproxy modules without
> > also enabling NETFILTER_XT_TARGET_TPROXY, which throws off the
> > ifdef logic in the udp core code:
> >
> > net/ipv6/netfilter/nf_tproxy_ipv6.o: In function `nf_tproxy_get_sock_v6':
> > nf_tproxy_ipv6.c:(.text+0x1a8): undefined reference to `udp6_lib_lookup'
> > net/ipv4/netfilter/nf_tproxy_ipv4.o: In function `nf_tproxy_get_sock_v4':
> > nf_tproxy_ipv4.c:(.text+0x3d0): undefined reference to `udp4_lib_lookup'
> >
> > We can actually simplify the conditions now to provide the two functions
> > exactly when they are needed.
> >
> > Fixes: 45ca4e0cf273 ("netfilter: Libify xt_TPROXY")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> Pablo, I'm going to apply this directly to fix the link failure.
BTW, could you also pick this fix for net-next?
http://patchwork.ozlabs.org/patch/924706/
Thanks.
^ permalink raw reply
* Re: [PATCH net] kcm: fix races on sk_receive_queue
From: Paolo Abeni @ 2018-06-06 13:46 UTC (permalink / raw)
To: Kirill Tkhai, netdev; +Cc: David S. Miller, Tom Herbert
In-Reply-To: <f79f8b6f-c137-1eb7-1161-2264c270c025@virtuozzo.com>
On Wed, 2018-06-06 at 16:28 +0300, Kirill Tkhai wrote:
> On 06.06.2018 16:16, Paolo Abeni wrote:
> > KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
> >
> > without acquiring any lock. Moreover, in R() when the MSG_PEEK
> > flag is not present, the skb is peeked and dequeued with two
> > separate, non-atomic, calls.
> >
> > The above create room for races, which SYZBOT has been able to
> > exploit, causing list corruption and kernel oops:
> >
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault: 0000 [#1] SMP KASAN
> > Dumping ftrace buffer:
> > (ftrace buffer empty)
> > Modules linked in:
> > CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> > RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> > RSP: 0018:ffff8801d012f6f0 EFLAGS: 00010002
> > RAX: 0000000000000286 RBX: ffff8801d6e073c0 RCX: 0000000000000001
> > RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000008
> > RBP: ffff8801d012f718 R08: ffffed0038bb3b6d R09: ffffed0038bb3b6c
> > R10: ffffed0038bb3b6c R11: ffff8801c5d9db63 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801c5d9db60 R15: ffff8801d012fce0
> > FS: 0000000000ab7880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020e5b000 CR3: 00000001c31fb000 CR4: 00000000001406f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
> > sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
> > ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
> > __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
> > do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
> > __do_sys_recvmmsg net/socket.c:2485 [inline]
> > __se_sys_recvmmsg net/socket.c:2481 [inline]
> > __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
> > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x4417a9
> > RSP: 002b:00007ffe27282838 EFLAGS: 00000206 ORIG_RAX: 000000000000012b
> > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004417a9
> > RDX: 00000000040000f7 RSI: 00000000200002c0 RDI: 0000000000000006
> > RBP: 0000000000000000 R08: 0000000020000200 R09: 00007ffe272829f8
> > R10: 0000000000000060 R11: 0000000000000206 R12: 00000000000001f3
> > R13: 000000000001f871 R14: 0000000000000000 R15: 0000000000000000
> > Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> > 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> > 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> > RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: ffff8801d012f6f0
> > RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: ffff8801d012f6f0
> >
> > To fix the above, we need to use the locked version of the socket dequeue
> > helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> > the available skb when not peeking.
> >
> > RFC -> v1:
> > - use skb_dequeue(), as suggested by Tom
> > - explicitly close the race between skb_peek and skb_unlink
> >
> > Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> > Reported-and-tested-by: syzbot+278279efdd2730dd14bf@syzkaller.appspotmail.com
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > This is an RFC, since I'm really new to this area, anyway the syzport
> > reported success in testing the proposed fix.
> > This is very likely a scenario where the upcoming skb->prev,next -> list_head
> > conversion would have helped a lot, thanks to list poisoning and list debug
> > ---
> > net/kcm/kcmsock.c | 19 ++++++++++++-------
> > 1 file changed, 12 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> > index d3601d421571..dd2d02bb35ae 100644
> > --- a/net/kcm/kcmsock.c
> > +++ b/net/kcm/kcmsock.c
> > @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
> > struct sk_buff *skb;
> > struct kcm_sock *kcm;
> >
> > - while ((skb = __skb_dequeue(head))) {
> > + while ((skb = skb_dequeue(head))) {
>
> I try to find how the patch protects against the following race:
>
> requeue_rx_msgs() kcm_recvmsg()
> skb = skb_dequeue() skb = kcm_wait_data(peek = true)
> ... ...
> free skb ...
> ... skb_copy_datagram_msg(skb) <--- Use after free?
>
> Isn't there possible a use-after-free?
You are right, this patch does not fix the above race: is addressing a
different one, when recvmsg() is not peeking.
The race itself is not introduced by this code, and I think a separate
patch for the the above would be better (we probably need to increment
the skb reference count while peeking and consume the skb after the
copy)
Cheers,
Paolo
^ permalink raw reply
* Make CONFIG_NET and CONFIG_SECCOMP_FILTER independent of CONFIG_NET
From: Norbert Manthey @ 2018-06-06 13:52 UTC (permalink / raw)
To: linux-kernel, kvm, netdev, x86
Dear all,
currently, KVM and SECCOMP rely on functionality of CONFIG_NET, and hence the
latter has to be enabled when building the kernel for the first two
configurations. However, there exists scenarios where the system does not need
networking, but KVM and SECCOMP filters. To reduce the kernel image size for
these scenarios, and to be able to drop active code, this commit series allows
to enable CONFIG_KVM and CONFIG_SECCOMP_FILTER without using CONFIG_NET.
The functionality that is required for seccomp filters is kept in the same
files and - after reordering the source code - is guarded with a single ifdef
per file.
I hope these changes are useful for other scenarios than the one I currently
face.
Best,
Norbert
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
^ permalink raw reply
* [less-CONFIG_NET 1/7] net: reorder filter code
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, netdev, linux-kernel
In-Reply-To: <1528293127-23825-1-git-send-email-nmanthey@amazon.de>
This commit reorders the definition of functions and struct in the
file filter.c, such that in the next step we can easily cut the file
into a commonly used part, as well as a part that is only required in
case CONFIG_NET is actually set.
This is part of the effort to split CONFIG_SECCOMP_FILTER and
CONFIG_NET.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
net/core/filter.c | 330 +++++++++++++++++++++++++++---------------------------
1 file changed, 165 insertions(+), 165 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 201ff36b..0d980e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -59,58 +59,6 @@
#include <net/tcp.h>
#include <linux/bpf_trace.h>
-/**
- * sk_filter_trim_cap - run a packet through a socket filter
- * @sk: sock associated with &sk_buff
- * @skb: buffer to filter
- * @cap: limit on how short the eBPF program may trim the packet
- *
- * Run the eBPF program and then cut skb->data to correct size returned by
- * the program. If pkt_len is 0 we toss packet. If skb->len is smaller
- * than pkt_len we keep whole skb->data. This is the socket level
- * wrapper to BPF_PROG_RUN. It returns 0 if the packet should
- * be accepted or -EPERM if the packet should be tossed.
- *
- */
-int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
-{
- int err;
- struct sk_filter *filter;
-
- /*
- * If the skb was allocated from pfmemalloc reserves, only
- * allow SOCK_MEMALLOC sockets to use it as this socket is
- * helping free memory
- */
- if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
- NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
- return -ENOMEM;
- }
- err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
- if (err)
- return err;
-
- err = security_sock_rcv_skb(sk, skb);
- if (err)
- return err;
-
- rcu_read_lock();
- filter = rcu_dereference(sk->sk_filter);
- if (filter) {
- struct sock *save_sk = skb->sk;
- unsigned int pkt_len;
-
- skb->sk = sk;
- pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
- skb->sk = save_sk;
- err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
- }
- rcu_read_unlock();
-
- return err;
-}
-EXPORT_SYMBOL(sk_filter_trim_cap);
-
BPF_CALL_1(__skb_get_pay_offset, struct sk_buff *, skb)
{
return skb_get_poff(skb);
@@ -165,12 +113,6 @@ BPF_CALL_0(__get_raw_cpu_id)
return raw_smp_processor_id();
}
-static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
- .func = __get_raw_cpu_id,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
-};
-
static u32 convert_skb_access(int skb_field, int dst_reg, int src_reg,
struct bpf_insn *insn_buf)
{
@@ -954,71 +896,6 @@ static void __bpf_prog_release(struct bpf_prog *prog)
}
}
-static void __sk_filter_release(struct sk_filter *fp)
-{
- __bpf_prog_release(fp->prog);
- kfree(fp);
-}
-
-/**
- * sk_filter_release_rcu - Release a socket filter by rcu_head
- * @rcu: rcu_head that contains the sk_filter to free
- */
-static void sk_filter_release_rcu(struct rcu_head *rcu)
-{
- struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
-
- __sk_filter_release(fp);
-}
-
-/**
- * sk_filter_release - release a socket filter
- * @fp: filter to remove
- *
- * Remove a filter from a socket and release its resources.
- */
-static void sk_filter_release(struct sk_filter *fp)
-{
- if (refcount_dec_and_test(&fp->refcnt))
- call_rcu(&fp->rcu, sk_filter_release_rcu);
-}
-
-void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
-{
- u32 filter_size = bpf_prog_size(fp->prog->len);
-
- atomic_sub(filter_size, &sk->sk_omem_alloc);
- sk_filter_release(fp);
-}
-
-/* try to charge the socket memory if there is space available
- * return true on success
- */
-static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
-{
- u32 filter_size = bpf_prog_size(fp->prog->len);
-
- /* same check as in sock_kmalloc() */
- if (filter_size <= sysctl_optmem_max &&
- atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
- atomic_add(filter_size, &sk->sk_omem_alloc);
- return true;
- }
- return false;
-}
-
-bool sk_filter_charge(struct sock *sk, struct sk_filter *fp)
-{
- if (!refcount_inc_not_zero(&fp->refcnt))
- return false;
-
- if (!__sk_filter_charge(sk, fp)) {
- sk_filter_release(fp);
- return false;
- }
- return true;
-}
-
static struct bpf_prog *bpf_migrate_filter(struct bpf_prog *fp)
{
struct sock_filter *old_prog;
@@ -1127,19 +1004,22 @@ static struct bpf_prog *bpf_prepare_filter(struct bpf_prog *fp,
}
/**
- * bpf_prog_create - create an unattached filter
+ * bpf_prog_create_from_user - create an unattached filter from user buffer
* @pfp: the unattached filter that is created
* @fprog: the filter program
+ * @trans: post-classic verifier transformation handler
+ * @save_orig: save classic BPF program
*
- * Create a filter independent of any socket. We first run some
- * sanity checks on it to make sure it does not explode on us later.
- * If an error occurs or there is insufficient memory for the filter
- * a negative errno code is returned. On success the return is zero.
+ * This function effectively does the same as bpf_prog_create(), only
+ * that it builds up its insns buffer from user space provided buffer.
+ * It also allows for passing a bpf_aux_classic_check_t handler.
*/
-int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
+int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
+ bpf_aux_classic_check_t trans, bool save_orig)
{
unsigned int fsize = bpf_classic_proglen(fprog);
struct bpf_prog *fp;
+ int err;
/* Make sure new filter is there and in the right amounts. */
if (!bpf_check_basics_ok(fprog->filter, fprog->len))
@@ -1149,44 +1029,177 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
if (!fp)
return -ENOMEM;
- memcpy(fp->insns, fprog->filter, fsize);
+ if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+ __bpf_prog_free(fp);
+ return -EFAULT;
+ }
fp->len = fprog->len;
- /* Since unattached filters are not copied back to user
- * space through sk_get_filter(), we do not need to hold
- * a copy here, and can spare us the work.
- */
fp->orig_prog = NULL;
+ if (save_orig) {
+ err = bpf_prog_store_orig_filter(fp, fprog);
+ if (err) {
+ __bpf_prog_free(fp);
+ return -ENOMEM;
+ }
+ }
+
/* bpf_prepare_filter() already takes care of freeing
* memory in case something goes wrong.
*/
- fp = bpf_prepare_filter(fp, NULL);
+ fp = bpf_prepare_filter(fp, trans);
if (IS_ERR(fp))
return PTR_ERR(fp);
*pfp = fp;
return 0;
}
-EXPORT_SYMBOL_GPL(bpf_prog_create);
+EXPORT_SYMBOL_GPL(bpf_prog_create_from_user);
+
+void bpf_prog_destroy(struct bpf_prog *fp)
+{
+ __bpf_prog_release(fp);
+}
+EXPORT_SYMBOL_GPL(bpf_prog_destroy);
/**
- * bpf_prog_create_from_user - create an unattached filter from user buffer
+ * sk_filter_trim_cap - run a packet through a socket filter
+ * @sk: sock associated with &sk_buff
+ * @skb: buffer to filter
+ * @cap: limit on how short the eBPF program may trim the packet
+ *
+ * Run the eBPF program and then cut skb->data to correct size returned by
+ * the program. If pkt_len is 0 we toss packet. If skb->len is smaller
+ * than pkt_len we keep whole skb->data. This is the socket level
+ * wrapper to BPF_PROG_RUN. It returns 0 if the packet should
+ * be accepted or -EPERM if the packet should be tossed.
+ *
+ */
+int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
+{
+ int err;
+ struct sk_filter *filter;
+
+ /*
+ * If the skb was allocated from pfmemalloc reserves, only
+ * allow SOCK_MEMALLOC sockets to use it as this socket is
+ * helping free memory
+ */
+ if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
+ NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
+ return -ENOMEM;
+ }
+ err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
+ if (err)
+ return err;
+
+ err = security_sock_rcv_skb(sk, skb);
+ if (err)
+ return err;
+
+ rcu_read_lock();
+ filter = rcu_dereference(sk->sk_filter);
+ if (filter) {
+ struct sock *save_sk = skb->sk;
+ unsigned int pkt_len;
+
+ skb->sk = sk;
+ pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
+ skb->sk = save_sk;
+ err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
+ }
+ rcu_read_unlock();
+
+ return err;
+}
+EXPORT_SYMBOL(sk_filter_trim_cap);
+
+static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = {
+ .func = __get_raw_cpu_id,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+};
+
+static void __sk_filter_release(struct sk_filter *fp)
+{
+ __bpf_prog_release(fp->prog);
+ kfree(fp);
+}
+
+/**
+ * sk_filter_release_rcu - Release a socket filter by rcu_head
+ * @rcu: rcu_head that contains the sk_filter to free
+ */
+static void sk_filter_release_rcu(struct rcu_head *rcu)
+{
+ struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
+
+ __sk_filter_release(fp);
+}
+
+/**
+ * sk_filter_release - release a socket filter
+ * @fp: filter to remove
+ *
+ * Remove a filter from a socket and release its resources.
+ */
+static void sk_filter_release(struct sk_filter *fp)
+{
+ if (refcount_dec_and_test(&fp->refcnt))
+ call_rcu(&fp->rcu, sk_filter_release_rcu);
+}
+
+void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
+{
+ u32 filter_size = bpf_prog_size(fp->prog->len);
+
+ atomic_sub(filter_size, &sk->sk_omem_alloc);
+ sk_filter_release(fp);
+}
+
+/* try to charge the socket memory if there is space available
+ * return true on success
+ */
+static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp)
+{
+ u32 filter_size = bpf_prog_size(fp->prog->len);
+
+ /* same check as in sock_kmalloc() */
+ if (filter_size <= sysctl_optmem_max &&
+ atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) {
+ atomic_add(filter_size, &sk->sk_omem_alloc);
+ return true;
+ }
+ return false;
+}
+
+bool sk_filter_charge(struct sock *sk, struct sk_filter *fp)
+{
+ if (!refcount_inc_not_zero(&fp->refcnt))
+ return false;
+
+ if (!__sk_filter_charge(sk, fp)) {
+ sk_filter_release(fp);
+ return false;
+ }
+ return true;
+}
+
+/**
+ * bpf_prog_create - create an unattached filter
* @pfp: the unattached filter that is created
* @fprog: the filter program
- * @trans: post-classic verifier transformation handler
- * @save_orig: save classic BPF program
*
- * This function effectively does the same as bpf_prog_create(), only
- * that it builds up its insns buffer from user space provided buffer.
- * It also allows for passing a bpf_aux_classic_check_t handler.
+ * Create a filter independent of any socket. We first run some
+ * sanity checks on it to make sure it does not explode on us later.
+ * If an error occurs or there is insufficient memory for the filter
+ * a negative errno code is returned. On success the return is zero.
*/
-int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
- bpf_aux_classic_check_t trans, bool save_orig)
+int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
{
unsigned int fsize = bpf_classic_proglen(fprog);
struct bpf_prog *fp;
- int err;
/* Make sure new filter is there and in the right amounts. */
if (!bpf_check_basics_ok(fprog->filter, fprog->len))
@@ -1196,39 +1209,26 @@ int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
if (!fp)
return -ENOMEM;
- if (copy_from_user(fp->insns, fprog->filter, fsize)) {
- __bpf_prog_free(fp);
- return -EFAULT;
- }
+ memcpy(fp->insns, fprog->filter, fsize);
fp->len = fprog->len;
+ /* Since unattached filters are not copied back to user
+ * space through sk_get_filter(), we do not need to hold
+ * a copy here, and can spare us the work.
+ */
fp->orig_prog = NULL;
- if (save_orig) {
- err = bpf_prog_store_orig_filter(fp, fprog);
- if (err) {
- __bpf_prog_free(fp);
- return -ENOMEM;
- }
- }
-
/* bpf_prepare_filter() already takes care of freeing
* memory in case something goes wrong.
*/
- fp = bpf_prepare_filter(fp, trans);
+ fp = bpf_prepare_filter(fp, NULL);
if (IS_ERR(fp))
return PTR_ERR(fp);
*pfp = fp;
return 0;
}
-EXPORT_SYMBOL_GPL(bpf_prog_create_from_user);
-
-void bpf_prog_destroy(struct bpf_prog *fp)
-{
- __bpf_prog_release(fp);
-}
-EXPORT_SYMBOL_GPL(bpf_prog_destroy);
+EXPORT_SYMBOL_GPL(bpf_prog_create);
static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
{
--
2.7.4
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
^ permalink raw reply related
* [less-CONFIG_NET 2/7] net: reorder flow_dissector
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
Cc: Norbert Manthey, David S. Miller, Simon Horman, Andrew Lunn,
Jakub Kicinski, Tom Herbert, John Crispin, Eric Dumazet,
Sven Eckelmann, WANG Cong, David Ahern, Jon Maloy, netdev,
linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>
This commit reorders the definitions, such that in the next step we
can easily cut the file into a commonly used part, as well as a part
that is only required in case CONFIG_NET is used.
This is part of the effort to split CONFIG_SECCOMP_FILTER and
CONFIG_NET.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
net/core/flow_dissector.c | 206 +++++++++++++++++++++++-----------------------
1 file changed, 103 insertions(+), 103 deletions(-)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index d29f09b..70e0679 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1085,36 +1085,6 @@ static inline size_t flow_keys_hash_length(const struct flow_keys *flow)
return (sizeof(*flow) - diff) / sizeof(u32);
}
-__be32 flow_get_u32_src(const struct flow_keys *flow)
-{
- switch (flow->control.addr_type) {
- case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
- return flow->addrs.v4addrs.src;
- case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
- return (__force __be32)ipv6_addr_hash(
- &flow->addrs.v6addrs.src);
- case FLOW_DISSECTOR_KEY_TIPC:
- return flow->addrs.tipckey.key;
- default:
- return 0;
- }
-}
-EXPORT_SYMBOL(flow_get_u32_src);
-
-__be32 flow_get_u32_dst(const struct flow_keys *flow)
-{
- switch (flow->control.addr_type) {
- case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
- return flow->addrs.v4addrs.dst;
- case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
- return (__force __be32)ipv6_addr_hash(
- &flow->addrs.v6addrs.dst);
- default:
- return 0;
- }
-}
-EXPORT_SYMBOL(flow_get_u32_dst);
-
static inline void __flow_hash_consistentify(struct flow_keys *keys)
{
int addr_diff, i;
@@ -1162,49 +1132,6 @@ static inline u32 __flow_hash_from_keys(struct flow_keys *keys, u32 keyval)
return hash;
}
-u32 flow_hash_from_keys(struct flow_keys *keys)
-{
- __flow_hash_secret_init();
- return __flow_hash_from_keys(keys, hashrnd);
-}
-EXPORT_SYMBOL(flow_hash_from_keys);
-
-static inline u32 ___skb_get_hash(const struct sk_buff *skb,
- struct flow_keys *keys, u32 keyval)
-{
- skb_flow_dissect_flow_keys(skb, keys,
- FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
-
- return __flow_hash_from_keys(keys, keyval);
-}
-
-struct _flow_keys_digest_data {
- __be16 n_proto;
- u8 ip_proto;
- u8 padding;
- __be32 ports;
- __be32 src;
- __be32 dst;
-};
-
-void make_flow_keys_digest(struct flow_keys_digest *digest,
- const struct flow_keys *flow)
-{
- struct _flow_keys_digest_data *data =
- (struct _flow_keys_digest_data *)digest;
-
- BUILD_BUG_ON(sizeof(*data) > sizeof(*digest));
-
- memset(digest, 0, sizeof(*digest));
-
- data->n_proto = flow->basic.n_proto;
- data->ip_proto = flow->basic.ip_proto;
- data->ports = flow->ports.ports;
- data->src = flow->addrs.v4addrs.src;
- data->dst = flow->addrs.v4addrs.dst;
-}
-EXPORT_SYMBOL(make_flow_keys_digest);
-
static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
@@ -1222,36 +1149,6 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
}
EXPORT_SYMBOL_GPL(__skb_get_hash_symmetric);
-/**
- * __skb_get_hash: calculate a flow hash
- * @skb: sk_buff to calculate flow hash from
- *
- * This function calculates a flow hash based on src/dst addresses
- * and src/dst port numbers. Sets hash in skb to non-zero hash value
- * on success, zero indicates no valid hash. Also, sets l4_hash in skb
- * if hash is a canonical 4-tuple hash over transport ports.
- */
-void __skb_get_hash(struct sk_buff *skb)
-{
- struct flow_keys keys;
- u32 hash;
-
- __flow_hash_secret_init();
-
- hash = ___skb_get_hash(skb, &keys, hashrnd);
-
- __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
-}
-EXPORT_SYMBOL(__skb_get_hash);
-
-__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
-{
- struct flow_keys keys;
-
- return ___skb_get_hash(skb, &keys, perturb);
-}
-EXPORT_SYMBOL(skb_get_hash_perturb);
-
u32 __skb_get_poff(const struct sk_buff *skb, void *data,
const struct flow_keys *keys, int hlen)
{
@@ -1322,6 +1219,109 @@ u32 skb_get_poff(const struct sk_buff *skb)
return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
}
+__be32 flow_get_u32_src(const struct flow_keys *flow)
+{
+ switch (flow->control.addr_type) {
+ case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+ return flow->addrs.v4addrs.src;
+ case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+ return (__force __be32)ipv6_addr_hash(
+ &flow->addrs.v6addrs.src);
+ case FLOW_DISSECTOR_KEY_TIPC:
+ return flow->addrs.tipckey.key;
+ default:
+ return 0;
+ }
+}
+EXPORT_SYMBOL(flow_get_u32_src);
+
+__be32 flow_get_u32_dst(const struct flow_keys *flow)
+{
+ switch (flow->control.addr_type) {
+ case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+ return flow->addrs.v4addrs.dst;
+ case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+ return (__force __be32)ipv6_addr_hash(
+ &flow->addrs.v6addrs.dst);
+ default:
+ return 0;
+ }
+}
+EXPORT_SYMBOL(flow_get_u32_dst);
+
+u32 flow_hash_from_keys(struct flow_keys *keys)
+{
+ __flow_hash_secret_init();
+ return __flow_hash_from_keys(keys, hashrnd);
+}
+EXPORT_SYMBOL(flow_hash_from_keys);
+
+static inline u32 ___skb_get_hash(const struct sk_buff *skb,
+ struct flow_keys *keys, u32 keyval)
+{
+ skb_flow_dissect_flow_keys(skb, keys,
+ FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+
+ return __flow_hash_from_keys(keys, keyval);
+}
+
+struct _flow_keys_digest_data {
+ __be16 n_proto;
+ u8 ip_proto;
+ u8 padding;
+ __be32 ports;
+ __be32 src;
+ __be32 dst;
+};
+
+void make_flow_keys_digest(struct flow_keys_digest *digest,
+ const struct flow_keys *flow)
+{
+ struct _flow_keys_digest_data *data =
+ (struct _flow_keys_digest_data *)digest;
+
+ BUILD_BUG_ON(sizeof(*data) > sizeof(*digest));
+
+ memset(digest, 0, sizeof(*digest));
+
+ data->n_proto = flow->basic.n_proto;
+ data->ip_proto = flow->basic.ip_proto;
+ data->ports = flow->ports.ports;
+ data->src = flow->addrs.v4addrs.src;
+ data->dst = flow->addrs.v4addrs.dst;
+}
+EXPORT_SYMBOL(make_flow_keys_digest);
+
+/**
+ * __skb_get_hash: calculate a flow hash
+ * @skb: sk_buff to calculate flow hash from
+ *
+ * This function calculates a flow hash based on src/dst addresses
+ * and src/dst port numbers. Sets hash in skb to non-zero hash value
+ * on success, zero indicates no valid hash. Also, sets l4_hash in skb
+ * if hash is a canonical 4-tuple hash over transport ports.
+ */
+void __skb_get_hash(struct sk_buff *skb)
+{
+ struct flow_keys keys;
+ u32 hash;
+
+ __flow_hash_secret_init();
+
+ hash = ___skb_get_hash(skb, &keys, hashrnd);
+
+ __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
+}
+EXPORT_SYMBOL(__skb_get_hash);
+
+__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb)
+{
+ struct flow_keys keys;
+
+ return ___skb_get_hash(skb, &keys, perturb);
+}
+EXPORT_SYMBOL(skb_get_hash_perturb);
+
__u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys)
{
memset(keys, 0, sizeof(*keys));
--
2.7.4
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
^ permalink raw reply related
* [less-CONFIG_NET 3/7] seccomp: include net and bpf files
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, netdev, linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>
When we want to use CONFIG_SECCOMP_FILTER without CONFIG_NET, we have
to ensure that the required files that would be pulled in via
CONFIG_NET are compiled when dropping CONFIG_NET.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
kernel/bpf/Makefile | 3 ++-
net/Makefile | 5 +++++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index a713fd2..5d13269 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -4,7 +4,8 @@ obj-y := core.o
obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o
-ifeq ($(CONFIG_NET),y)
+
+ifneq ($(filter y,$(CONFIG_NET) $(CONFIG_SECCOMP_FILTER)),)
obj-$(CONFIG_BPF_SYSCALL) += devmap.o
obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
obj-$(CONFIG_BPF_SYSCALL) += offload.o
diff --git a/net/Makefile b/net/Makefile
index a6147c6..08f1875 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -11,6 +11,11 @@ obj-$(CONFIG_NET) := socket.o core/
tmp-$(CONFIG_COMPAT) := compat.o
obj-$(CONFIG_NET) += $(tmp-y)
+ifneq ($(CONFIG_NET),y)
+obj-$(CONFIG_SECCOMP_FILTER) += core/filter.o
+obj-$(CONFIG_SECCOMP_FILTER) += core/flow_dissector.o
+endif
+
# LLC has to be linked before the files in net/802/
obj-$(CONFIG_LLC) += llc/
obj-$(CONFIG_NET) += ethernet/ 802/ sched/ netlink/ bpf/
--
2.7.4
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
^ permalink raw reply related
* [less-CONFIG_NET 5/7] seccomp: cut off functions not required
From: Norbert Manthey @ 2018-06-06 13:53 UTC (permalink / raw)
Cc: Norbert Manthey, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, John Crispin, Simon Horman, Jakub Kicinski,
Tom Herbert, Eric Dumazet, Sven Eckelmann, WANG Cong, David Ahern,
Jon Maloy, netdev, linux-kernel
In-Reply-To: <1528293206-24298-1-git-send-email-nmanthey@amazon.de>
When using CONFIG_SECCOMP_FILTER, not all functions of filter.c and
flow_dissector.c are required. To not pull in more dependencies,
guard the functions that are not required with CONFIG_NET defines.
This way, these functions are enabled in case the file is compiled
because of CONFIG_NET, but they are not present when the file is
compiled because of other configurations.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
---
net/core/filter.c | 2 ++
net/core/flow_dissector.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 0d980e9..4ddacb7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1063,6 +1063,7 @@ void bpf_prog_destroy(struct bpf_prog *fp)
}
EXPORT_SYMBOL_GPL(bpf_prog_destroy);
+#if defined(CONFIG_NET)
/**
* sk_filter_trim_cap - run a packet through a socket filter
* @sk: sock associated with &sk_buff
@@ -5657,3 +5658,4 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
release_sock(sk);
return ret;
}
+#endif // CONFIG_NET
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 70e0679..0903444 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1219,6 +1219,7 @@ u32 skb_get_poff(const struct sk_buff *skb)
return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
}
+#if defined(CONFIG_NET)
__be32 flow_get_u32_src(const struct flow_keys *flow)
{
switch (flow->control.addr_type) {
@@ -1340,6 +1341,7 @@ __u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys)
return flow_hash_from_keys(keys);
}
EXPORT_SYMBOL(__get_hash_from_flowi6);
+#endif // CONFIG_NET
static const struct flow_dissector_key flow_keys_dissector_keys[] = {
{
--
2.7.4
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
^ permalink raw reply related
* [PATCH net] rxrpc: Fix terminal retransmission connection ID to include the channel
From: David Howells @ 2018-06-06 13:59 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
When retransmitting the final ACK or ABORT packet for a call, the cid field
in the packet header is set to the connection's cid, but this is incorrect
as it also needs to include the channel number on that connection that the
call was made on.
Fix this by OR'ing in the channel number.
Note that this fixes the bug that:
commit 1a025028d400b23477341aa7ec2ce55f8b39b554
rxrpc: Fix handling of call quietly cancelled out on server
works around. I'm not intending to revert that as it will help protect
against problems that might occur on the server.
Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client call")
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/rxrpc/conn_event.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index 1350f1be8037..8229a52c2acd 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -70,7 +70,7 @@ static void rxrpc_conn_retransmit_call(struct rxrpc_connection *conn,
iov[2].iov_len = sizeof(ack_info);
pkt.whdr.epoch = htonl(conn->proto.epoch);
- pkt.whdr.cid = htonl(conn->proto.cid);
+ pkt.whdr.cid = htonl(conn->proto.cid | channel);
pkt.whdr.callNumber = htonl(call_id);
pkt.whdr.seq = 0;
pkt.whdr.type = chan->last_type;
^ permalink raw reply related
* Re: [PATCH net] ipv4: igmp: hold wakelock to prevent delayed reports
From: Tejaswi Tanikella @ 2018-06-06 14:38 UTC (permalink / raw)
To: David Miller; +Cc: netdev, f.fainelli, andrew, edumazet
In-Reply-To: <20180604.110640.184022152967598302.davem@davemloft.net>
On Mon, Jun 04, 2018 at 11:06:40AM -0400, David Miller wrote:
> From: Tejaswi Tanikella <tejaswit@codeaurora.org>
> Date: Fri, 1 Jun 2018 19:35:41 +0530
>
> > On receiving a IGMPv2/v3 query, based on max_delay set in the header a
> > timer is started to send out a response after a random time within
> > max_delay. If the system then moves into suspend state, Report is
> > delayed until system wakes up.
> >
> > In one reported scenario, on arm64 devices, max_delay was set to 10s,
> > Reports were consistantly delayed if the timer is scheduled after 5 plus
> > seconds.
> >
> > Hold wakelock while starting the timer to prevent moving into suspend
> > state.
> >
> > Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
>
> As Florian stated, this won't be the only networking facility to hit
> a problem like this. So, if we go down this route, we probably want
> to generically solve this.
>
> But I have a deeper concern.
>
> Do we really want every timer based querying mechanism to prevent a
> system from being able to suspend?
>
> We get to the point where external entities can generate traffic which
> can prevent a remote system from entering suspend state.
Like you suggested maybe aquiring a wakelock will unnecessarily stop the
system from suspending.
Would using alarmtimer be better? It seems similar to a hrtimer but can
wake the system up from suspend.
This should allow the system to suspend till the timer expires. But might
cause repeated wake-ups and suspends.
^ permalink raw reply
* Re: [PATCH net-next 1/6] net: hippi: use pci_zalloc_consistent
From: kbuild test robot @ 2018-06-06 15:04 UTC (permalink / raw)
To: YueHaibing
Cc: kbuild-all, davem, netdev, linux-kernel, jcliburn, chris.snook,
benve, jdmason, chessman, jes, rahul.verma, YueHaibing
In-Reply-To: <20180605122851.23912-2-yuehaibing@huawei.com>
[-- Attachment #1: Type: text/plain, Size: 5410 bytes --]
Hi YueHaibing,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on net-next/master]
url: https://github.com/0day-ci/linux/commits/YueHaibing/use-pci_zalloc_consistent/20180606-205531
config: x86_64-allyesdebian (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
drivers/net/hippi/rrunner.c: In function 'rr_init_one':
>> drivers/net/hippi/rrunner.c:176:7: error: 'trrpriv' undeclared (first use in this function); did you mean 'rrpriv'?
if (!trrpriv->evt_ring) {
^~~~~~~
rrpriv
drivers/net/hippi/rrunner.c:176:7: note: each undeclared identifier is reported only once for each function it appears in
vim +176 drivers/net/hippi/rrunner.c
74
75 /*
76 * Implementation notes:
77 *
78 * The DMA engine only allows for DMA within physical 64KB chunks of
79 * memory. The current approach of the driver (and stack) is to use
80 * linear blocks of memory for the skbuffs. However, as the data block
81 * is always the first part of the skb and skbs are 2^n aligned so we
82 * are guarantted to get the whole block within one 64KB align 64KB
83 * chunk.
84 *
85 * On the long term, relying on being able to allocate 64KB linear
86 * chunks of memory is not feasible and the skb handling code and the
87 * stack will need to know about I/O vectors or something similar.
88 */
89
90 static int rr_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
91 {
92 struct net_device *dev;
93 static int version_disp;
94 u8 pci_latency;
95 struct rr_private *rrpriv;
96 dma_addr_t ring_dma;
97 int ret = -ENOMEM;
98
99 dev = alloc_hippi_dev(sizeof(struct rr_private));
100 if (!dev)
101 goto out3;
102
103 ret = pci_enable_device(pdev);
104 if (ret) {
105 ret = -ENODEV;
106 goto out2;
107 }
108
109 rrpriv = netdev_priv(dev);
110
111 SET_NETDEV_DEV(dev, &pdev->dev);
112
113 ret = pci_request_regions(pdev, "rrunner");
114 if (ret < 0)
115 goto out;
116
117 pci_set_drvdata(pdev, dev);
118
119 rrpriv->pci_dev = pdev;
120
121 spin_lock_init(&rrpriv->lock);
122
123 dev->netdev_ops = &rr_netdev_ops;
124
125 /* display version info if adapter is found */
126 if (!version_disp) {
127 /* set display flag to TRUE so that */
128 /* we only display this string ONCE */
129 version_disp = 1;
130 printk(version);
131 }
132
133 pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &pci_latency);
134 if (pci_latency <= 0x58){
135 pci_latency = 0x58;
136 pci_write_config_byte(pdev, PCI_LATENCY_TIMER, pci_latency);
137 }
138
139 pci_set_master(pdev);
140
141 printk(KERN_INFO "%s: Essential RoadRunner serial HIPPI "
142 "at 0x%llx, irq %i, PCI latency %i\n", dev->name,
143 (unsigned long long)pci_resource_start(pdev, 0),
144 pdev->irq, pci_latency);
145
146 /*
147 * Remap the MMIO regs into kernel space.
148 */
149 rrpriv->regs = pci_iomap(pdev, 0, 0x1000);
150 if (!rrpriv->regs) {
151 printk(KERN_ERR "%s: Unable to map I/O register, "
152 "RoadRunner will be disabled.\n", dev->name);
153 ret = -EIO;
154 goto out;
155 }
156
157 rrpriv->tx_ring = pci_alloc_consistent(pdev, TX_TOTAL_SIZE, &ring_dma);
158 rrpriv->tx_ring_dma = ring_dma;
159
160 if (!rrpriv->tx_ring) {
161 ret = -ENOMEM;
162 goto out;
163 }
164
165 rrpriv->rx_ring = pci_alloc_consistent(pdev, RX_TOTAL_SIZE, &ring_dma);
166 rrpriv->rx_ring_dma = ring_dma;
167
168 if (!rrpriv->rx_ring) {
169 ret = -ENOMEM;
170 goto out;
171 }
172
173 rrpriv->evt_ring = pci_alloc_consistent(pdev, EVT_RING_SIZE, &ring_dma);
174 rrpriv->evt_ring_dma = ring_dma;
175
> 176 if (!trrpriv->evt_ring) {
177 ret = -ENOMEM;
178 goto out;
179 }
180
181 /*
182 * Don't access any register before this point!
183 */
184 #ifdef __BIG_ENDIAN
185 writel(readl(&rrpriv->regs->HostCtrl) | NO_SWAP,
186 &rrpriv->regs->HostCtrl);
187 #endif
188 /*
189 * Need to add a case for little-endian 64-bit hosts here.
190 */
191
192 rr_init(dev);
193
194 ret = register_netdev(dev);
195 if (ret)
196 goto out;
197 return 0;
198
199 out:
200 if (rrpriv->evt_ring)
201 pci_free_consistent(pdev, EVT_RING_SIZE, rrpriv->evt_ring,
202 rrpriv->evt_ring_dma);
203 if (rrpriv->rx_ring)
204 pci_free_consistent(pdev, RX_TOTAL_SIZE, rrpriv->rx_ring,
205 rrpriv->rx_ring_dma);
206 if (rrpriv->tx_ring)
207 pci_free_consistent(pdev, TX_TOTAL_SIZE, rrpriv->tx_ring,
208 rrpriv->tx_ring_dma);
209 if (rrpriv->regs)
210 pci_iounmap(pdev, rrpriv->regs);
211 if (pdev)
212 pci_release_regions(pdev);
213 out2:
214 free_netdev(dev);
215 out3:
216 return ret;
217 }
218
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39429 bytes --]
^ permalink raw reply
* [PATCH net] net: in virtio_net_hdr only add VLAN_HLEN to csum_start if payload holds vlan
From: Willem de Bruijn @ 2018-06-06 15:23 UTC (permalink / raw)
To: netdev; +Cc: davem, rppt, herbert, anton.ivanov, Willem de Bruijn
From: Willem de Bruijn <willemb@google.com>
Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
to communicate packet metadata to userspace.
For skbuffs with vlan, the first two return the packet as it may have
existed on the wire, inserting the VLAN tag in the user buffer. Then
virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.
Commit f09e2249c4f5 ("macvtap: restore vlan header on user read")
added this feature to macvtap. Commit 3ce9b20f1971 ("macvtap: Fix
csum_start when VLAN tags are present") then fixed up csum_start.
Virtio, packet and uml do not insert the vlan header in the user
buffer.
When introducing virtio_net_hdr_from_skb to deduplicate filling in
the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
applied uniformly, breaking csum offset for packets with vlan on
virtio and packet.
Make insertion of VLAN_HLEN optional. Convert the callers to pass it
when needed.
Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Fixes: 1276f24eeef2 ("packet: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
arch/um/drivers/vector_transports.c | 3 ++-
drivers/net/tap.c | 5 ++++-
drivers/net/tun.c | 3 ++-
drivers/net/virtio_net.c | 3 ++-
include/linux/virtio_net.h | 11 ++++-------
net/packet/af_packet.c | 4 ++--
6 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c
index 9065047f844b..77e4ebc206ae 100644
--- a/arch/um/drivers/vector_transports.c
+++ b/arch/um/drivers/vector_transports.c
@@ -120,7 +120,8 @@ static int raw_form_header(uint8_t *header,
skb,
vheader,
virtio_legacy_is_little_endian(),
- false
+ false,
+ 0
);
return 0;
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9b6cb780affe..f0f7cd977667 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -774,13 +774,16 @@ static ssize_t tap_put_user(struct tap_queue *q,
int total;
if (q->flags & IFF_VNET_HDR) {
+ int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
struct virtio_net_hdr vnet_hdr;
+
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
if (iov_iter_count(iter) < vnet_hdr_len)
return -EINVAL;
if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
- tap_is_little_endian(q), true))
+ tap_is_little_endian(q), true,
+ vlan_hlen))
BUG();
if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 23e9eb66197f..409eb8b74740 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2078,7 +2078,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
return -EINVAL;
if (virtio_net_hdr_from_skb(skb, &gso,
- tun_is_little_endian(tun), true)) {
+ tun_is_little_endian(tun), true,
+ vlan_hlen)) {
struct skb_shared_info *sinfo = skb_shinfo(skb);
pr_err("unexpected GSO type: "
"0x%x, gso_size %d, hdr_len %d\n",
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 032e1ac10a30..8c7207535179 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1358,7 +1358,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
hdr = skb_vnet_hdr(skb);
if (virtio_net_hdr_from_skb(skb, &hdr->hdr,
- virtio_is_little_endian(vi->vdev), false))
+ virtio_is_little_endian(vi->vdev), false,
+ 0))
BUG();
if (vi->mergeable_rx_bufs)
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index f144216febc6..9397628a1967 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -58,7 +58,8 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
struct virtio_net_hdr *hdr,
bool little_endian,
- bool has_data_valid)
+ bool has_data_valid,
+ int vlan_hlen)
{
memset(hdr, 0, sizeof(*hdr)); /* no info leak */
@@ -83,12 +84,8 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
if (skb->ip_summed == CHECKSUM_PARTIAL) {
hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
- if (skb_vlan_tag_present(skb))
- hdr->csum_start = __cpu_to_virtio16(little_endian,
- skb_checksum_start_offset(skb) + VLAN_HLEN);
- else
- hdr->csum_start = __cpu_to_virtio16(little_endian,
- skb_checksum_start_offset(skb));
+ hdr->csum_start = __cpu_to_virtio16(little_endian,
+ skb_checksum_start_offset(skb) + vlan_hlen);
hdr->csum_offset = __cpu_to_virtio16(little_endian,
skb->csum_offset);
} else if (has_data_valid &&
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index acb7b86574cd..9198997b39d1 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2037,7 +2037,7 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
return -EINVAL;
*len -= sizeof(vnet_hdr);
- if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true))
+ if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le(), true, 0))
return -EINVAL;
return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
@@ -2304,7 +2304,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
if (do_vnet) {
if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
sizeof(struct virtio_net_hdr),
- vio_le(), true)) {
+ vio_le(), true, 0)) {
spin_lock(&sk->sk_receive_queue.lock);
goto drop_n_account;
}
--
2.17.1.1185.g55be947832-goog
^ permalink raw reply related
* [PATCH v2 net-next] net/sched: add skbprio scheduler
From: Nishanth Devarajan @ 2018-06-06 15:40 UTC (permalink / raw)
To: xiyou.wangcong, jhs, jiri, davem; +Cc: netdev, doucette, michel
net/sched: add skbprio scheduler
Skbprio (SKB Priority Queue) is a queuing discipline that prioritizes IPv4
and IPv6 packets according to their skb->priority field. Although Skbprio can
be employed in any scenario in which a higher skb->priority field means a
higher priority packet, Skbprio was concieved as a solution for
denial-of-service defenses that need to route packets with different priorities.
v2:
*Use skb->priority field rather than DS field. Rename queueing discipline as
SKB Priority Queue (previously Gatekeeper Priority Queue).
*Queueing discipline is made classful to expose Skbprio's internal priority
queues.
Signed-off-by: Nishanth Devarajan <ndev2021@gmail.com>
Reviewed-by: Sachin Paryani <sachin.paryani@gmail.com>
Reviewed-by: Cody Doucette <doucette@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
---
include/uapi/linux/pkt_sched.h | 15 ++
net/sched/Kconfig | 13 ++
net/sched/Makefile | 1 +
net/sched/sch_skbprio.c | 347 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 376 insertions(+)
create mode 100644 net/sched/sch_skbprio.c
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096..6fd07e8 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -124,6 +124,21 @@ struct tc_fifo_qopt {
__u32 limit; /* Queue length: bytes for bfifo, packets for pfifo */
};
+/* SKBPRIO section */
+
+/*
+ * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1).
+ * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be able
+ * to map one to one the DS field of IPV4 and IPV6 headers.
+ * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY.
+ */
+
+#define SKBPRIO_MAX_PRIORITY 64
+
+struct tc_skbprio_qopt {
+ __u32 limit; /* Queue length in packets. */
+};
+
/* PRIO section */
#define TCQ_PRIO_BANDS 16
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169f..9ac4b53 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -240,6 +240,19 @@ config NET_SCH_MQPRIO
If unsure, say N.
+config NET_SCH_SKBPRIO
+ tristate "SKB priority queue scheduler (SKBPRIO)"
+ help
+ Say Y here if you want to use the SKB priority queue
+ scheduler. This schedules packets according to skb->priority,
+ which is useful for request packets in DoS mitigation systems such
+ as Gatekeeper.
+
+ To compile this driver as a module, choose M here: the module will
+ be called sch_skbprio.
+
+ If unsure, say N.
+
config NET_SCH_CHOKE
tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d38..a4d8893 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM) += sch_netem.o
obj-$(CONFIG_NET_SCH_DRR) += sch_drr.o
obj-$(CONFIG_NET_SCH_PLUG) += sch_plug.o
obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o
+obj-$(CONFIG_NET_SCH_SKBPRIO) += sch_skbprio.o
obj-$(CONFIG_NET_SCH_CHOKE) += sch_choke.o
obj-$(CONFIG_NET_SCH_QFQ) += sch_qfq.o
obj-$(CONFIG_NET_SCH_CODEL) += sch_codel.o
diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
new file mode 100644
index 0000000..f73ad62
--- /dev/null
+++ b/net/sched/sch_skbprio.c
@@ -0,0 +1,347 @@
+/*
+ * net/sched/sch_skbprio.c SKB Priority Queue.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: Nishanth Devarajan, <ndev2021@gmail.com>
+ * Cody Doucette, <doucette@bu.edu>
+ * original idea by Michel Machado, Cody Doucette, and Qiaobin Fu
+ */
+
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+#include <net/inet_ecn.h>
+
+
+/* SKB Priority Queue
+ * =================================
+ *
+ * This qdisc schedules a packet according to skb->priority, where a higher
+ * value places the packet closer to the exit of the queue. When the queue is
+ * full, the lowest priority packet in the queue is dropped to make room for
+ * the packet to be added if it has higher priority. If the packet to be added
+ * has lower priority than all packets in the queue, it is dropped.
+ *
+ * Without the SKB priority queue, queue length limits must be imposed
+ * for individual queues, and there is no easy way to enforce a global queue
+ * length limit across all priorities. With the SKBprio queue, a global
+ * queue length limit can be enforced while not restricting the queue lengths
+ * of individual priorities.
+ *
+ * This is especially useful for a denial-of-service defense system like
+ * Gatekeeper, which prioritizes packets in flows that demonstrate expected
+ * behavior of legitimate users. The queue is flexible to allow any number
+ * of packets of any priority up to the global limit of the scheduler
+ * without risking resource overconsumption by a flood of low priority packets.
+ *
+ * The Gatekeeper codebase is found here:
+ *
+ * https://github.com/AltraMayor/gatekeeper
+ */
+
+struct skbprio_sched_data {
+ /* Parameters. */
+ u32 max_limit;
+
+ /* Queue state. */
+ struct sk_buff_head qdiscs[SKBPRIO_MAX_PRIORITY];
+ struct gnet_stats_queue qstats[SKBPRIO_MAX_PRIORITY];
+ u16 highest_prio;
+ u16 lowest_prio;
+};
+
+static u16 calc_new_high_prio(const struct skbprio_sched_data *q)
+{
+ int prio;
+
+ for (prio = q->highest_prio - 1; prio >= q->lowest_prio; prio--) {
+ if (!skb_queue_empty(&q->qdiscs[prio]))
+ return prio;
+ }
+
+ /* SKB queue is empty, return 0 (default highest priority setting). */
+ return 0;
+}
+
+static u16 calc_new_low_prio(const struct skbprio_sched_data *q)
+{
+ int prio;
+
+ for (prio = q->lowest_prio + 1; prio <= q->highest_prio; prio++) {
+ if (!skb_queue_empty(&q->qdiscs[prio]))
+ return prio;
+ }
+
+ /* SKB queue is empty, return SKBPRIO_MAX_PRIORITY - 1
+ * (default lowest priority setting).
+ */
+ return SKBPRIO_MAX_PRIORITY - 1;
+}
+
+static int skbprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ struct sk_buff_head *qdisc;
+ struct sk_buff_head *lp_qdisc;
+ struct sk_buff *to_drop;
+ const unsigned int max_priority = SKBPRIO_MAX_PRIORITY - 1;
+ u16 prio, lp;
+
+ /* Obtain the priority of @skb. */
+ prio = min(skb->priority, max_priority);
+
+ qdisc = &q->qdiscs[prio];
+ if (sch->q.qlen < q->max_limit) {
+ __skb_queue_tail(qdisc, skb);
+ qdisc_qstats_backlog_inc(sch, skb);
+ q->qstats[prio].backlog += qdisc_pkt_len(skb);
+
+ /* Check to update highest and lowest priorities. */
+ if (prio > q->highest_prio)
+ q->highest_prio = prio;
+
+ if (prio < q->lowest_prio)
+ q->lowest_prio = prio;
+
+ sch->q.qlen++;
+ return NET_XMIT_SUCCESS;
+ }
+
+ /* If this packet has the lowest priority, drop it. */
+ lp = q->lowest_prio;
+ if (prio <= lp) {
+ q->qstats[prio].drops++;
+ return qdisc_drop(skb, sch, to_free);
+ }
+
+ /* Drop the packet at the tail of the lowest priority qdisc. */
+ lp_qdisc = &q->qdiscs[lp];
+ to_drop = __skb_dequeue_tail(lp_qdisc);
+ BUG_ON(!to_drop);
+ qdisc_qstats_backlog_dec(sch, to_drop);
+ qdisc_drop(to_drop, sch, to_free);
+
+ q->qstats[lp].backlog -= qdisc_pkt_len(to_drop);
+ q->qstats[lp].drops++;
+
+ __skb_queue_tail(qdisc, skb);
+ qdisc_qstats_backlog_inc(sch, skb);
+ q->qstats[prio].backlog += qdisc_pkt_len(skb);
+
+ /* Check to update highest and lowest priorities. */
+ if (skb_queue_empty(lp_qdisc)) {
+ if (q->lowest_prio == q->highest_prio) {
+ BUG_ON(sch->q.qlen);
+ q->lowest_prio = prio;
+ q->highest_prio = prio;
+ } else {
+ q->lowest_prio = calc_new_low_prio(q);
+ }
+ }
+
+ if (prio > q->highest_prio)
+ q->highest_prio = prio;
+
+ return NET_XMIT_CN;
+}
+
+static struct sk_buff *skbprio_dequeue(struct Qdisc *sch)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ struct sk_buff_head *hpq = &q->qdiscs[q->highest_prio];
+ struct sk_buff *skb = __skb_dequeue(hpq);
+
+ if (unlikely(!skb))
+ return NULL;
+
+ sch->q.qlen--;
+ qdisc_qstats_backlog_dec(sch, skb);
+ qdisc_bstats_update(sch, skb);
+
+ q->qstats[q->highest_prio].backlog -= qdisc_pkt_len(skb);
+
+ /* Update highest priority field. */
+ if (skb_queue_empty(hpq)) {
+ if (q->lowest_prio == q->highest_prio) {
+ BUG_ON(sch->q.qlen);
+ q->highest_prio = 0;
+ q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+ } else {
+ q->highest_prio = calc_new_high_prio(q);
+ }
+ }
+ return skb;
+}
+
+static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ struct tc_skbprio_qopt *ctl = nla_data(opt);
+ const unsigned int min_limit = 1;
+
+ if (ctl->limit == (typeof(ctl->limit))-1)
+ q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
+ else if (ctl->limit < min_limit ||
+ ctl->limit > qdisc_dev(sch)->tx_queue_len)
+ return -EINVAL;
+ else
+ q->max_limit = ctl->limit;
+
+ return 0;
+}
+
+static int skbprio_init(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ int prio;
+ const unsigned int min_limit = 1;
+
+ /* Initialise all queues, one for each possible priority. */
+ for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+ __skb_queue_head_init(&q->qdiscs[prio]);
+
+ memset(&q->qstats, 0, sizeof(q->qstats));
+ q->highest_prio = 0;
+ q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+ if (!opt) {
+ q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
+ return 0;
+ }
+ return skbprio_change(sch, opt, extack);
+}
+
+static int skbprio_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ struct tc_skbprio_qopt opt;
+
+ opt.limit = q->max_limit;
+
+ if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
+ return -1;
+
+ return skb->len;
+}
+
+static void skbprio_reset(struct Qdisc *sch)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ int prio;
+
+ sch->qstats.backlog = 0;
+ sch->q.qlen = 0;
+
+ for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+ __skb_queue_purge(&q->qdiscs[prio]);
+
+ memset(&q->qstats, 0, sizeof(q->qstats));
+ q->highest_prio = 0;
+ q->lowest_prio = SKBPRIO_MAX_PRIORITY - 1;
+}
+
+static void skbprio_destroy(struct Qdisc *sch)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ int prio;
+
+ for (prio = 0; prio < SKBPRIO_MAX_PRIORITY; prio++)
+ __skb_queue_purge(&q->qdiscs[prio]);
+}
+
+static struct Qdisc *skbprio_leaf(struct Qdisc *sch, unsigned long arg)
+{
+ return NULL;
+}
+
+static unsigned long skbprio_find(struct Qdisc *sch, u32 classid)
+{
+ return 0;
+}
+
+static int skbprio_dump_class(struct Qdisc *sch, unsigned long cl,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+ tcm->tcm_handle |= TC_H_MIN(cl);
+ return 0;
+}
+
+static int skbprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+ struct gnet_dump *d)
+{
+ struct skbprio_sched_data *q = qdisc_priv(sch);
+ if (gnet_stats_copy_queue(d, NULL, &q->qstats[cl - 1],
+ q->qstats[cl - 1].qlen) < 0)
+ return -1;
+ return 0;
+}
+
+static void skbprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+ unsigned int i;
+
+ if (arg->stop)
+ return;
+
+ for (i = 0; i < SKBPRIO_MAX_PRIORITY; i++) {
+ if (arg->count < arg->skip) {
+ arg->count++;
+ continue;
+ }
+ if (arg->fn(sch, i + 1, arg) < 0) {
+ arg->stop = 1;
+ break;
+ }
+ arg->count++;
+ }
+}
+
+static const struct Qdisc_class_ops skbprio_class_ops = {
+ .leaf = skbprio_leaf,
+ .find = skbprio_find,
+ .dump = skbprio_dump_class,
+ .dump_stats = skbprio_dump_class_stats,
+ .walk = skbprio_walk,
+};
+
+static struct Qdisc_ops skbprio_qdisc_ops __read_mostly = {
+ .cl_ops = &skbprio_class_ops,
+ .id = "skbprio",
+ .priv_size = sizeof(struct skbprio_sched_data),
+ .enqueue = skbprio_enqueue,
+ .dequeue = skbprio_dequeue,
+ .peek = qdisc_peek_dequeued,
+ .init = skbprio_init,
+ .reset = skbprio_reset,
+ .change = skbprio_change,
+ .dump = skbprio_dump,
+ .destroy = skbprio_destroy,
+ .owner = THIS_MODULE,
+};
+
+static int __init skbprio_module_init(void)
+{
+ return register_qdisc(&skbprio_qdisc_ops);
+}
+
+static void __exit skbprio_module_exit(void)
+{
+ unregister_qdisc(&skbprio_qdisc_ops);
+}
+
+module_init(skbprio_module_init)
+module_exit(skbprio_module_exit)
+
+MODULE_LICENSE("GPL");
--
1.9.1
^ permalink raw reply related
* Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan
From: Jakub Kicinski @ 2018-06-06 15:46 UTC (permalink / raw)
To: Or Gerlitz
Cc: David Miller, Paul Blakey, Jiri Pirko, Cong Wang,
Jamal Hadi Salim, Linux Netdev List, Yevgeny Kliteynik, Roi Dayan,
Shahar Klein, Mark Bloch, Or Gerlitz
In-Reply-To: <CAJ3xEMh5TffWyD1V1w_EWGgnQuhA26O1BX8kZrdkHu3bB9+7-g@mail.gmail.com>
On Wed, 6 Jun 2018 08:15:27 +0300, Or Gerlitz wrote:
> On Wed, Jun 6, 2018 at 12:27 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:
> >> From: Jakub Kicinski <kubakici@wp.pl>
> >> Date: Tue, 5 Jun 2018 11:57:47 -0700
> >>
> >> > Do we still care about correctness and not breaking backward
> >> > compatibility?
> >>
> >> Jakub let me know if you want me to revert this change.
> >
> > Yes, I think this patch introduces a regression when block is shared
> > between offload capable and in-capable device, therefore it should be
> > reverted. Shared blocks went through a number of review cycles to
> > ensure such cases are handled correctly.
> >
> >
> > Longer story about the actual issue which is never explained in the
> > commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
> > supported on tunnels in cls_flower only:
> >
> > static int fl_hw_replace_filter(struct tcf_proto *tp,
> > [...]
> > if (!tc_can_offload(dev)) {
> > if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
> > (f->hw_dev && !tc_can_offload(f->hw_dev))) {
> > f->hw_dev = dev;
> > return tc_skip_sw(f->flags) ? -EINVAL : 0;
> > }
> > dev = f->hw_dev;
> > cls_flower.egress_dev = true;
> > } else {
> > f->hw_dev = dev;
> > }
> >
> >
> > In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
> > promoted to a generic TC thing supported on all filters but it no
> > longer works with skip_sw.
> >
> > I'd argue skip_sw is incorrect for tunnels, because the rule will only
> > apply to traffic ingressing on ASIC ports, not all traffic which hits
> > the tunnel netdev.
>
> This argument makes sense, however, skip_sw for tunnel decap rules
> **is** allowed since 4.10 and we have some sort of regression here (turns
> out that before and after the patch..)
As I said it was allowed in 4 releases, which was a mistake, in last 3
it wasn't. I understand your use case, but the semantics of skip_sw
are not preserved here so we should find a different solution.
> > Therefore we should keep the 4.15 - 4.17 behaviour.
> > But that's a side note, I don't think we should be breaking offload on
> > shared blocks whether we decide to support skip_sw on tunnels or not.
>
> skip_sw on tunnels was there before shared-block, newer features should
> take care not to break existing ones.
Oh, I agree we shouldn't break existing use cases so please don't break
the use case I mentioned above. I want to set up shared block between
a LAG and its members. Now since the bond will not be offload-capable
TC will not even make an attempt to offload to members.
I'm gonna test that my reading of the code is correct and send a revert
later today, sorry.
^ permalink raw reply
* Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan
From: Jakub Kicinski @ 2018-06-06 15:50 UTC (permalink / raw)
To: Or Gerlitz
Cc: Paul Blakey, Jiri Pirko, Cong Wang, Jamal Hadi Salim,
David Miller, Linux Netdev List, Yevgeny Kliteynik, Roi Dayan,
Shahar Klein, Mark Bloch, Or Gerlitz
In-Reply-To: <CAJ3xEMg0QJ2MdcXgWbe1J+7KssN9b36X_2+nBFaGAKjkcy2xPg@mail.gmail.com>
On Wed, 6 Jun 2018 08:12:20 +0300, Or Gerlitz wrote:
> On Tue, Jun 5, 2018 at 9:59 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> > On Tue, 5 Jun 2018 11:57:47 -0700, Jakub Kicinski wrote:
> >> On Tue, 5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:
> >> > When using a vxlan device as the ingress dev, we count it as a
> >> > "no offload dev", so when such a rule comes and err stop is true,
> >> > we fail early and don't try the egdev route which can offload it
> >> > through the egress device.
> >> >
> >> > Fix that by not calling the block offload if one of the devices
> >> > attached to it is not offload capable, but make sure egress on such case
> >> > is capable instead.
> >> >
> >> > Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> >> > Reviewed-by: Roi Dayan <roid@mellanox.com>
> >> > Acked-by: Jiri Pirko <jiri@mellanox.com>
> >> > Signed-off-by: Paul Blakey <paulb@mellanox.com>
> >>
> >> Very poor commit message. What you're doing is re-enabling skip_sw
> >> filters on tunnel devices which semantically make no sense and
> >> shouldn't have been allowed in the first place.
> >>
> >> This will breaks block sharing between tunnels and HW netdevs (because
> >> you skip the tcf_block_cb_call() completely). The entire egdev idea
> >> remains badly broken accepting filters like this:
> >>
> >> # tc filter add dev vxlan0 ingress \
> >> matchall action skip_sw \
> >> mirred egress redirect dev lo \
> >> mirred egress redirect dev sw1np0
> >
> > For above we probably need something like this (untested):
> >
> > diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> > index 3f4cf930f809..71e5eebec31a 100644
> > --- a/net/sched/act_api.c
> > +++ b/net/sched/act_api.c
> > @@ -1511,7 +1511,7 @@ int tc_setup_cb_egdev_call(const struct net_device *dev,
> > struct tcf_action_egdev *egdev = tcf_action_egdev_lookup(dev);
> >
> > if (!egdev)
> > - return 0;
> > + return err_stop ? -EOPNOTSUPP : 0;
> > return tcf_action_egdev_cb_call(egdev, type, type_data, err_stop);
> > }
> > EXPORT_SYMBOL_GPL(tc_setup_cb_egdev_call);
>
> Jakub,
>
> We will test it out and let you know
That's probably insufficient, because the second action should be
skipped completely. But an improvement nonetheless :)
> > But the correct fix is to remove egdev crutch completely IMO.
>
> Not against it, sometimes designs should change and be replaced
> with better ones, happens
Cool, I'm not sure when we'll get to it, but perhaps we can go over
the ideas I presented at netconf on switchdev call next week (if it's
still active)?
^ permalink raw reply
* Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan
From: Jakub Kicinski @ 2018-06-06 15:57 UTC (permalink / raw)
To: Paul Blakey
Cc: David Miller, jiri, xiyou.wangcong, jhs, netdev, kliteyn, roid,
shahark, markb, ogerlitz
In-Reply-To: <fae451a1-a925-b4c9-ef26-29e6bf58c88a@mellanox.com>
On Wed, 6 Jun 2018 10:59:30 +0300, Paul Blakey wrote:
> Maybe we can apply my patch logic of still trying the egress dev if the
> block has a single device, and not shared. Is that ok with you?
I don't remember that patch but sounds pretty bad.
> You're patch seems good as an add on, but the egress hw device (sw1np0)
> would go over the tc actions and see if it can offload such rule (output
> to software lo device) and would fail anyway.
Ack, I'm not trying to solve the skip_sw problem. It's a hard one.
^ permalink raw reply
* Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup
From: Tobias Hommel @ 2018-06-06 16:03 UTC (permalink / raw)
To: Kristian Evensen; +Cc: Steffen Klassert, Markus Berner, Network Development
In-Reply-To: <CAKfDRXhHMg5HKX=T1BKs5mx+BL-L8VyzqApS+8E52mDipZzU6Q@mail.gmail.com>
Hi,
On Wed, Jun 06, 2018 at 12:41:53PM +0200, Kristian Evensen wrote:
> Hi,
>
> I am experiencing the same issue on a PC Engines APU2 running kernel
> 4.14.34, both with and without hardware encryption. With hw.
> encryption, the crash occurs within 2-4 hours. Without hw. encryption,
> it takes 7-8 hours. My setup is nothing crazy, between 7 and 20
> tunnels with heavy RX/TX.
>
> On Fri, Feb 2, 2018 at 9:09 AM, Steffen Klassert
> <steffen.klassert@secunet.com> wrote:
> > Thanks for offering help, but I fear we have to wait until
> > Tobias has bisected it.
>
> Has any progress been made here? I would like to try a bisect myself,
Sorry no progress until now, I currently do not get time to have a deeper look
into that. We're back to 4.1.6 right now.
> but my board is running OpenWRT, which makes bisecting hard. The only
> observation I have so far, is that I did not see the issue with kernel
> 4.9.
>
> BR,
> Kristian
Tobi
^ permalink raw reply
* [PATCH bpf] tools/bpf: fix selftest get_cgroup_id_user
From: Yonghong Song @ 2018-06-06 16:12 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
Commit f269099a7e7a ("tools/bpf: add a selftest for
bpf_get_current_cgroup_id() helper") added a test
for bpf_get_current_cgroup_id() helper. The bpf program
is attached to tracepoint syscalls/sys_enter_nanosleep
and will record the cgroup id if the tracepoint is hit.
The test program creates a cgroup and attachs itself to
this cgroup and expects that the test program process
cgroup id is the same as the cgroup_id retrieved
by the bpf program.
In a light system where no other processes called
nanosleep syscall, the test case can pass.
In a busy system where many different processes can hit
syscalls/sys_enter_nanosleep tracepoint, the cgroup id
recorded by bpf program may not match the test program
process cgroup_id.
This patch fixed an issue by communicating the test program
pid to bpf program. The bpf program only records
cgroup id if the current task pid is the same as
passed-in pid. This ensures that the recorded cgroup_id
is for the cgroup within which the test program resides.
Fixes: f269099a7e7a ("tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper")
Signed-off-by: Yonghong Song <yhs@fb.com>
---
tools/testing/selftests/bpf/get_cgroup_id_kern.c | 14 +++++++++++++-
tools/testing/selftests/bpf/get_cgroup_id_user.c | 12 ++++++++++--
2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/get_cgroup_id_kern.c b/tools/testing/selftests/bpf/get_cgroup_id_kern.c
index 2cf8cb2..014dba1 100644
--- a/tools/testing/selftests/bpf/get_cgroup_id_kern.c
+++ b/tools/testing/selftests/bpf/get_cgroup_id_kern.c
@@ -11,12 +11,24 @@ struct bpf_map_def SEC("maps") cg_ids = {
.max_entries = 1,
};
+struct bpf_map_def SEC("maps") pidmap = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(__u32),
+ .value_size = sizeof(__u32),
+ .max_entries = 1,
+};
+
SEC("tracepoint/syscalls/sys_enter_nanosleep")
int trace(void *ctx)
{
- __u32 key = 0;
+ __u32 pid = bpf_get_current_pid_tgid();
+ __u32 key = 0, *expected_pid;
__u64 *val;
+ expected_pid = bpf_map_lookup_elem(&pidmap, &key);
+ if (!expected_pid || *expected_pid != pid)
+ return 0;
+
val = bpf_map_lookup_elem(&cg_ids, &key);
if (val)
*val = bpf_get_current_cgroup_id();
diff --git a/tools/testing/selftests/bpf/get_cgroup_id_user.c b/tools/testing/selftests/bpf/get_cgroup_id_user.c
index ea19a42..e8da7b3 100644
--- a/tools/testing/selftests/bpf/get_cgroup_id_user.c
+++ b/tools/testing/selftests/bpf/get_cgroup_id_user.c
@@ -50,13 +50,13 @@ int main(int argc, char **argv)
const char *probe_name = "syscalls/sys_enter_nanosleep";
const char *file = "get_cgroup_id_kern.o";
int err, bytes, efd, prog_fd, pmu_fd;
+ int cgroup_fd, cgidmap_fd, pidmap_fd;
struct perf_event_attr attr = {};
- int cgroup_fd, cgidmap_fd;
struct bpf_object *obj;
__u64 kcgid = 0, ucgid;
+ __u32 key = 0, pid;
int exit_code = 1;
char buf[256];
- __u32 key = 0;
err = setup_cgroup_environment();
if (CHECK(err, "setup_cgroup_environment", "err %d errno %d\n", err,
@@ -81,6 +81,14 @@ int main(int argc, char **argv)
cgidmap_fd, errno))
goto close_prog;
+ pidmap_fd = bpf_find_map(__func__, obj, "pidmap");
+ if (CHECK(pidmap_fd < 0, "bpf_find_map", "err %d errno %d\n",
+ pidmap_fd, errno))
+ goto close_prog;
+
+ pid = getpid();
+ bpf_map_update_elem(pidmap_fd, &key, &pid, 0);
+
snprintf(buf, sizeof(buf),
"/sys/kernel/debug/tracing/events/%s/id", probe_name);
efd = open(buf, O_RDONLY, 0);
--
2.9.5
^ permalink raw reply related
* BUG: jumbo frames broken after commit xen-netfront: Fix race between device setup and open
From: Andrew Jeddeloh @ 2018-06-06 16:29 UTC (permalink / raw)
To: netdev
Hi all,
The patch "xen-netfront: Fix race between device setup and open" seems
to have introduced a regression preventing setting MTU's larger than
1500. We experienced this downstream with Container Linux and
confirmed with Fedora 28 as well.
It's commit f599c64fdf7d9c108e8717fb04bc41c680120da4 in the linux-stable tree.
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=f599c64fdf7d9c108e8717fb04bc41c680120da4
Downstream bugs:
https://github.com/coreos/bugs/issues/2443
https://bugzilla.redhat.com/show_bug.cgi?id=1584216
We've confirmed that reverting that commit fixes the bug. It be
reliably can be reproduced on AWS with t2.micro instances (and
presumably other systems using the same driver). Both using
systemd-networkd to set the mtu and manual ip link commands cause the
link to repsond with "Invalid argument" when trying to set the MTU >
1500.
I'm not sure why that commit introduced the regression.
Please let me know if there's any more information that would be helpful.
- Andrew
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox