* Re: Problem with the kernel 4.15 - cutting the band (tc)
From: Cong Wang @ 2018-04-06 23:50 UTC (permalink / raw)
To: Linus Torvalds; +Cc: David Miller, Linux Kernel Network Developers
In-Reply-To: <CAADWXX-z+=E=Da3DtJsyLes6uc+ohAF4NSG7kUboCwqPX3+ArA@mail.gmail.com>
On Fri, Apr 6, 2018 at 2:56 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Forwarding a report about what looks like a regression between 4.14 and 4.15.
>
> New ENOSPC issue? I don't even knew where to start guessing where to look.
>
> Help me, Davem-Wan Kenobi, you are my only hope.
>
> (But adding netdev just in case somebody else goes "That's obviously Xyz")
>
> Linus
>
> ---------- Forwarded message ----------
> From: Marcin Kabiesz <admin@hostcenter.eu>
> Date: Thu, Apr 5, 2018 at 10:38 AM
> Subject: Problem with the kernel 4.15 - cutting the band (tc)
>
>
> Hello,
> I have a problem with bandwidth cutting on kernel 4.15. On the version
> up to 4.15, i.e. 4.14, this problem does not occur.
>
> uname -a: Linux router 4.14.15 #1 SMP x86_64 Intel Xeon E3-1230 v6
> command to reproduce:
>
> tc qdisc add dev ifb0 root handle 1: htb r2q 2
> tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
> 10gbit quantum 16000
> tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
> tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
> match ip dst 0.0.0.0/0 hashkey mask 0x000000ff at 16 link 1:
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
>
> This ok, no error/warnings and dmesg log.
>
> uname -a: Linux router 4.15.8 #1 SMP x86_64 Intel Xeon E3-1230 v6 (or
> 4.15.14 this same effect)
> command to reproduce:
>
> tc qdisc add dev ifb0 root handle 1: htb r2q 2
> tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
> 10gbit quantum 16000
> tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
> tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
> match ip dst 0.0.0.0/0 hashkey mask 0x000000ff at 16 link 1:
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> RTNETLINK answers: No space left on device
> We have an error talking to the kernel
>
> This not ok, on error/warnings and no dmesg log.
We forgot to call idr_remove() when deleting u32 key...
I am cooking a fix now.
Thanks!
^ permalink raw reply
* [PATCH net 4/5] ibmvnic: Fix failover case for non-redundant configuration
From: Thomas Falcon @ 2018-04-06 23:37 UTC (permalink / raw)
To: netdev; +Cc: jallen, nfont, Thomas Falcon
In-Reply-To: <1523057826-5262-1-git-send-email-tlfalcon@linux.vnet.ibm.com>
There is a failover case for a non-redundant pseries VNIC
configuration that was not being handled properly. The current
implementation assumes that the driver will always have a redandant
device to communicate with following a failover notification. There
are cases, however, when a non-redundant configuration can receive
a failover request. If that happens, the driver should wait until
it receives a signal that the device is ready for operation.
The driver is agnostic of its backing hardware configuration,
so this fix necessarily affects all device failover management.
The driver needs to wait until it receives a signal that the device
is ready for resetting. A flag is introduced to track this intermediary
state where the driver is waiting for an active device.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 37 +++++++++++++++++++++++++++++--------
drivers/net/ethernet/ibm/ibmvnic.h | 1 +
2 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index bbcd07a..151542e 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -325,10 +325,11 @@ static void replenish_rx_pool(struct ibmvnic_adapter *adapter,
adapter->replenish_add_buff_failure++;
atomic_add(buffers_added, &pool->available);
- if (lpar_rc == H_CLOSED) {
+ if (lpar_rc == H_CLOSED || adapter->failover_pending) {
/* Disable buffer pool replenishment and report carrier off if
- * queue is closed. Firmware guarantees that a signal will
- * be sent to the driver, triggering a reset.
+ * queue is closed or pending failover.
+ * Firmware guarantees that a signal will be sent to the
+ * driver, triggering a reset.
*/
deactivate_rx_pools(adapter);
netif_carrier_off(adapter->netdev);
@@ -1068,6 +1069,14 @@ static int ibmvnic_open(struct net_device *netdev)
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
int rc;
+ /* If device failover is pending, just set device state and return.
+ * Device operation will be handled by reset routine.
+ */
+ if (adapter->failover_pending) {
+ adapter->state = VNIC_OPEN;
+ return 0;
+ }
+
mutex_lock(&adapter->reset_lock);
if (adapter->state != VNIC_CLOSED) {
@@ -1225,6 +1234,14 @@ static int ibmvnic_close(struct net_device *netdev)
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
int rc;
+ /* If device failover is pending, just set device state and return.
+ * Device operation will be handled by reset routine.
+ */
+ if (adapter->failover_pending) {
+ adapter->state = VNIC_CLOSED;
+ return 0;
+ }
+
mutex_lock(&adapter->reset_lock);
rc = __ibmvnic_close(netdev);
mutex_unlock(&adapter->reset_lock);
@@ -1559,8 +1576,9 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
dev_kfree_skb_any(skb);
tx_buff->skb = NULL;
- if (lpar_rc == H_CLOSED) {
- /* Disable TX and report carrier off if queue is closed.
+ if (lpar_rc == H_CLOSED || adapter->failover_pending) {
+ /* Disable TX and report carrier off if queue is closed
+ * or pending failover.
* Firmware guarantees that a signal will be sent to the
* driver, triggering a reset or some other action.
*/
@@ -1884,9 +1902,10 @@ static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
int ret;
if (adapter->state == VNIC_REMOVING ||
- adapter->state == VNIC_REMOVED) {
+ adapter->state == VNIC_REMOVED ||
+ adapter->failover_pending) {
ret = EBUSY;
- netdev_dbg(netdev, "Adapter removing, skipping reset\n");
+ netdev_dbg(netdev, "Adapter removing or pending failover, skipping reset\n");
goto err;
}
@@ -4162,7 +4181,9 @@ static void ibmvnic_handle_crq(union ibmvnic_crq *crq,
case IBMVNIC_CRQ_INIT:
dev_info(dev, "Partner initialized\n");
adapter->from_passive_init = true;
+ adapter->failover_pending = false;
complete(&adapter->init_done);
+ ibmvnic_reset(adapter, VNIC_RESET_FAILOVER);
break;
case IBMVNIC_CRQ_INIT_COMPLETE:
dev_info(dev, "Partner initialization complete\n");
@@ -4179,7 +4200,7 @@ static void ibmvnic_handle_crq(union ibmvnic_crq *crq,
ibmvnic_reset(adapter, VNIC_RESET_MOBILITY);
} else if (gen_crq->cmd == IBMVNIC_DEVICE_FAILOVER) {
dev_info(dev, "Backing device failover detected\n");
- ibmvnic_reset(adapter, VNIC_RESET_FAILOVER);
+ adapter->failover_pending = true;
} else {
/* The adapter lost the connection */
dev_err(dev, "Virtual Adapter failed (rc=%d)\n",
diff --git a/drivers/net/ethernet/ibm/ibmvnic.h b/drivers/net/ethernet/ibm/ibmvnic.h
index 89efe70..99c0b58 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -1108,6 +1108,7 @@ struct ibmvnic_adapter {
bool napi_enabled, from_passive_init;
bool mac_change_pending;
+ bool failover_pending;
struct ibmvnic_tunables desired;
struct ibmvnic_tunables fallback;
--
1.8.3.1
^ permalink raw reply related
* [PATCH net 5/5] ibmvnic: Do not reset CRQ for Mobility driver resets
From: Thomas Falcon @ 2018-04-06 23:37 UTC (permalink / raw)
To: netdev; +Cc: jallen, nfont, Thomas Falcon
In-Reply-To: <1523057826-5262-1-git-send-email-tlfalcon@linux.vnet.ibm.com>
From: Nathan Fontenot <nfont@linux.vnet.ibm.com>
When resetting the ibmvnic driver after a partition migration occurs
there is no requirement to do a reset of the main CRQ. The current
driver code does the required re-enable of the main CRQ, then does
a reset of the main CRQ later.
What we should be doing for a driver reset after a migration is to
re-enable the main CRQ, release all the sub-CRQs, and then allocate
new sub-CRQs after capability negotiation.
This patch updates the handling of mobility resets to do the proper
work and not reset the main CRQ. To do this the initialization/reset
of the main CRQ had to be moved out of the ibmvnic_init routine
and in to the ibmvnic_probe and do_reset routines.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 55 ++++++++++++++++++++++----------------
1 file changed, 32 insertions(+), 23 deletions(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 151542e..aad5658 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -118,6 +118,7 @@ static union sub_crq *ibmvnic_next_scrq(struct ibmvnic_adapter *,
static int ibmvnic_init(struct ibmvnic_adapter *);
static void release_crq_queue(struct ibmvnic_adapter *);
static int __ibmvnic_set_mac(struct net_device *netdev, struct sockaddr *p);
+static int init_crq_queue(struct ibmvnic_adapter *adapter);
struct ibmvnic_stat {
char name[ETH_GSTRING_LEN];
@@ -1224,7 +1225,6 @@ static int __ibmvnic_close(struct net_device *netdev)
rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN);
if (rc)
return rc;
- ibmvnic_cleanup(netdev);
adapter->state = VNIC_CLOSED;
return 0;
}
@@ -1244,6 +1244,7 @@ static int ibmvnic_close(struct net_device *netdev)
mutex_lock(&adapter->reset_lock);
rc = __ibmvnic_close(netdev);
+ ibmvnic_cleanup(netdev);
mutex_unlock(&adapter->reset_lock);
return rc;
@@ -1726,14 +1727,10 @@ static int do_reset(struct ibmvnic_adapter *adapter,
old_num_rx_queues = adapter->req_rx_queues;
old_num_tx_queues = adapter->req_tx_queues;
- if (rwi->reset_reason == VNIC_RESET_MOBILITY) {
- rc = ibmvnic_reenable_crq_queue(adapter);
- if (rc)
- return 0;
- ibmvnic_cleanup(netdev);
- } else if (rwi->reset_reason == VNIC_RESET_FAILOVER) {
- ibmvnic_cleanup(netdev);
- } else {
+ ibmvnic_cleanup(netdev);
+
+ if (adapter->reset_reason != VNIC_RESET_MOBILITY &&
+ adapter->reset_reason != VNIC_RESET_FAILOVER) {
rc = __ibmvnic_close(netdev);
if (rc)
return rc;
@@ -1752,6 +1749,23 @@ static int do_reset(struct ibmvnic_adapter *adapter,
*/
adapter->state = VNIC_PROBED;
+ if (adapter->wait_for_reset) {
+ rc = init_crq_queue(adapter);
+ } else if (adapter->reset_reason == VNIC_RESET_MOBILITY) {
+ rc = ibmvnic_reenable_crq_queue(adapter);
+ release_sub_crqs(adapter, 1);
+ } else {
+ rc = ibmvnic_reset_crq(adapter);
+ if (!rc)
+ rc = vio_enable_interrupts(adapter->vdev);
+ }
+
+ if (rc) {
+ netdev_err(adapter->netdev,
+ "Couldn't initialize crq. rc=%d\n", rc);
+ return rc;
+ }
+
rc = ibmvnic_init(adapter);
if (rc)
return IBMVNIC_INIT_FAILED;
@@ -4500,19 +4514,6 @@ static int ibmvnic_init(struct ibmvnic_adapter *adapter)
u64 old_num_rx_queues, old_num_tx_queues;
int rc;
- if (adapter->resetting && !adapter->wait_for_reset) {
- rc = ibmvnic_reset_crq(adapter);
- if (!rc)
- rc = vio_enable_interrupts(adapter->vdev);
- } else {
- rc = init_crq_queue(adapter);
- }
-
- if (rc) {
- dev_err(dev, "Couldn't initialize crq. rc=%d\n", rc);
- return rc;
- }
-
adapter->from_passive_init = false;
old_num_rx_queues = adapter->req_rx_queues;
@@ -4537,7 +4538,8 @@ static int ibmvnic_init(struct ibmvnic_adapter *adapter)
return -1;
}
- if (adapter->resetting && !adapter->wait_for_reset) {
+ if (adapter->resetting && !adapter->wait_for_reset &&
+ adapter->reset_reason != VNIC_RESET_MOBILITY) {
if (adapter->req_rx_queues != old_num_rx_queues ||
adapter->req_tx_queues != old_num_tx_queues) {
release_sub_crqs(adapter, 0);
@@ -4625,6 +4627,13 @@ static int ibmvnic_probe(struct vio_dev *dev, const struct vio_device_id *id)
adapter->mac_change_pending = false;
do {
+ rc = init_crq_queue(adapter);
+ if (rc) {
+ dev_err(&dev->dev, "Couldn't initialize crq. rc=%d\n",
+ rc);
+ goto ibmvnic_init_fail;
+ }
+
rc = ibmvnic_init(adapter);
if (rc && rc != EAGAIN)
goto ibmvnic_init_fail;
--
1.8.3.1
^ permalink raw reply related
* [PATCH net 3/5] ibmvnic: Fix reset scheduler error handling
From: Thomas Falcon @ 2018-04-06 23:37 UTC (permalink / raw)
To: netdev; +Cc: jallen, nfont, Thomas Falcon
In-Reply-To: <1523057826-5262-1-git-send-email-tlfalcon@linux.vnet.ibm.com>
In some cases, if the driver is waiting for a reset following
a device parameter change, failure to schedule a reset can result
in a hang since a completion signal is never sent.
If the device configuration is being altered by a tool such
as ethtool or ifconfig, it could cause the console to hang
if the reset request does not get scheduled. Add some additional
error handling code to exit the wait_for_completion if there is
one in progress.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 39 ++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 153a868..bbcd07a 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1875,23 +1875,25 @@ static void __ibmvnic_reset(struct work_struct *work)
mutex_unlock(&adapter->reset_lock);
}
-static void ibmvnic_reset(struct ibmvnic_adapter *adapter,
- enum ibmvnic_reset_reason reason)
+static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
+ enum ibmvnic_reset_reason reason)
{
struct ibmvnic_rwi *rwi, *tmp;
struct net_device *netdev = adapter->netdev;
struct list_head *entry;
+ int ret;
if (adapter->state == VNIC_REMOVING ||
adapter->state == VNIC_REMOVED) {
+ ret = EBUSY;
netdev_dbg(netdev, "Adapter removing, skipping reset\n");
- return;
+ goto err;
}
if (adapter->state == VNIC_PROBING) {
netdev_warn(netdev, "Adapter reset during probe\n");
- adapter->init_done_rc = EAGAIN;
- return;
+ ret = adapter->init_done_rc = EAGAIN;
+ goto err;
}
mutex_lock(&adapter->rwi_lock);
@@ -1901,7 +1903,8 @@ static void ibmvnic_reset(struct ibmvnic_adapter *adapter,
if (tmp->reset_reason == reason) {
netdev_dbg(netdev, "Skipping matching reset\n");
mutex_unlock(&adapter->rwi_lock);
- return;
+ ret = EBUSY;
+ goto err;
}
}
@@ -1909,7 +1912,8 @@ static void ibmvnic_reset(struct ibmvnic_adapter *adapter,
if (!rwi) {
mutex_unlock(&adapter->rwi_lock);
ibmvnic_close(netdev);
- return;
+ ret = ENOMEM;
+ goto err;
}
rwi->reset_reason = reason;
@@ -1918,6 +1922,12 @@ static void ibmvnic_reset(struct ibmvnic_adapter *adapter,
netdev_dbg(adapter->netdev, "Scheduling reset (reason %d)\n", reason);
schedule_work(&adapter->ibmvnic_reset);
+
+ return 0;
+err:
+ if (adapter->wait_for_reset)
+ adapter->wait_for_reset = false;
+ return -ret;
}
static void ibmvnic_tx_timeout(struct net_device *dev)
@@ -2052,6 +2062,8 @@ static void ibmvnic_netpoll_controller(struct net_device *dev)
static int wait_for_reset(struct ibmvnic_adapter *adapter)
{
+ int rc, ret;
+
adapter->fallback.mtu = adapter->req_mtu;
adapter->fallback.rx_queues = adapter->req_rx_queues;
adapter->fallback.tx_queues = adapter->req_tx_queues;
@@ -2059,11 +2071,15 @@ static int wait_for_reset(struct ibmvnic_adapter *adapter)
adapter->fallback.tx_entries = adapter->req_tx_entries_per_subcrq;
init_completion(&adapter->reset_done);
- ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
adapter->wait_for_reset = true;
+ rc = ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
+ if (rc)
+ return rc;
wait_for_completion(&adapter->reset_done);
+ ret = 0;
if (adapter->reset_done_rc) {
+ ret = -EIO;
adapter->desired.mtu = adapter->fallback.mtu;
adapter->desired.rx_queues = adapter->fallback.rx_queues;
adapter->desired.tx_queues = adapter->fallback.tx_queues;
@@ -2071,12 +2087,15 @@ static int wait_for_reset(struct ibmvnic_adapter *adapter)
adapter->desired.tx_entries = adapter->fallback.tx_entries;
init_completion(&adapter->reset_done);
- ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
+ adapter->wait_for_reset = true;
+ rc = ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
+ if (rc)
+ return ret;
wait_for_completion(&adapter->reset_done);
}
adapter->wait_for_reset = false;
- return adapter->reset_done_rc;
+ return ret;
}
static int ibmvnic_change_mtu(struct net_device *netdev, int new_mtu)
--
1.8.3.1
^ permalink raw reply related
* [PATCH net 2/5] ibmvnic: Zero used TX descriptor counter on reset
From: Thomas Falcon @ 2018-04-06 23:37 UTC (permalink / raw)
To: netdev; +Cc: jallen, nfont, Thomas Falcon
In-Reply-To: <1523057826-5262-1-git-send-email-tlfalcon@linux.vnet.ibm.com>
The counter that tracks used TX descriptors pending completion
needs to be zeroed as part of a device reset. This change fixes
a bug causing transmit queues to be stopped unnecessarily and in
some cases a transmit queue stall and timeout reset. If the counter
is not reset, the remaining descriptors will not be "removed",
effectively reducing queue capacity. If the queue is over half full,
it will cause the queue to stall if stopped.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 58e0143..153a868 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2361,6 +2361,7 @@ static int reset_one_sub_crq_queue(struct ibmvnic_adapter *adapter,
}
memset(scrq->msgs, 0, 4 * PAGE_SIZE);
+ atomic_set(&scrq->used, 0);
scrq->cur = 0;
rc = h_reg_sub_crq(adapter->vdev->unit_address, scrq->msg_token,
--
1.8.3.1
^ permalink raw reply related
* [PATCH net 1/5] ibmvnic: Fix DMA mapping mistakes
From: Thomas Falcon @ 2018-04-06 23:37 UTC (permalink / raw)
To: netdev; +Cc: jallen, nfont, Thomas Falcon
In-Reply-To: <1523057826-5262-1-git-send-email-tlfalcon@linux.vnet.ibm.com>
Fix some mistakes caught by the DMA debugger. The first change
fixes a unnecessary unmap that should have been removed in an
earlier update. The next hunk fixes another bad unmap by zeroing
the bit checked to determine that an unmap is needed. The final
change fixes some buffers that are unmapped with the wrong
direction specified.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index b492af6..58e0143 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -320,9 +320,6 @@ static void replenish_rx_pool(struct ibmvnic_adapter *adapter,
dev_info(dev, "replenish pools failure\n");
pool->free_map[pool->next_free] = index;
pool->rx_buff[index].skb = NULL;
- if (!dma_mapping_error(dev, dma_addr))
- dma_unmap_single(dev, dma_addr, pool->buff_size,
- DMA_FROM_DEVICE);
dev_kfree_skb_any(skb);
adapter->replenish_add_buff_failure++;
@@ -2574,7 +2571,7 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
union sub_crq *next;
int index;
int i, j;
- u8 first;
+ u8 *first;
restart_loop:
while (pending_scrq(adapter, scrq)) {
@@ -2605,11 +2602,12 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
txbuff->data_dma[j] = 0;
}
/* if sub_crq was sent indirectly */
- first = txbuff->indir_arr[0].generic.first;
- if (first == IBMVNIC_CRQ_CMD) {
+ first = &txbuff->indir_arr[0].generic.first;
+ if (*first == IBMVNIC_CRQ_CMD) {
dma_unmap_single(dev, txbuff->indir_dma,
sizeof(txbuff->indir_arr),
DMA_TO_DEVICE);
+ *first = 0;
}
if (txbuff->last_frag) {
@@ -3882,9 +3880,9 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
int i;
dma_unmap_single(dev, adapter->login_buf_token, adapter->login_buf_sz,
- DMA_BIDIRECTIONAL);
+ DMA_TO_DEVICE);
dma_unmap_single(dev, adapter->login_rsp_buf_token,
- adapter->login_rsp_buf_sz, DMA_BIDIRECTIONAL);
+ adapter->login_rsp_buf_sz, DMA_FROM_DEVICE);
/* If the number of queues requested can't be allocated by the
* server, the login response will return with code 1. We will need
--
1.8.3.1
^ permalink raw reply related
* [PATCH net 0/5] ibmvnic: Fix driver reset and DMA bugs
From: Thomas Falcon @ 2018-04-06 23:37 UTC (permalink / raw)
To: netdev; +Cc: jallen, nfont, Thomas Falcon
This patch series introduces some fixes to the driver reset
routines and a patch that fixes mistakes caught by the kernel
DMA debugger.
The reset fixes include a fix to reset TX queue counters properly
after a reset as well as updates to driver reset error-handling code.
It also provides updates to the reset handling routine for redundant
backing VF failover and partition migration cases.
Nathan Fontenot (1):
ibmvnic: Do not reset CRQ for Mobility driver resets
Thomas Falcon (4):
ibmvnic: Fix DMA mapping mistakes
ibmvnic: Zero used TX descriptor counter on reset
ibmvnic: Fix reset scheduler error handling
ibmvnic: Fix failover case for non-redundant configuration
drivers/net/ethernet/ibm/ibmvnic.c | 146 ++++++++++++++++++++++++-------------
drivers/net/ethernet/ibm/ibmvnic.h | 1 +
2 files changed, 98 insertions(+), 49 deletions(-)
--
1.8.3.1
^ permalink raw reply
* Re: [RFC PATCH net-next v5 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-06 23:08 UTC (permalink / raw)
To: Jiri Pirko
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <20180406125758.GC19345@nanopsycho>
On 4/6/2018 5:57 AM, Jiri Pirko wrote:
> Thu, Apr 05, 2018 at 11:08:21PM CEST, sridhar.samudrala@intel.com wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation. A
>> paravirtual driver can use this module by registering a set of ops and
>> each instance of the device when it is probed.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>> include/net/bypass.h | 80 ++++++++++
>> net/Kconfig | 18 +++
>> net/core/Makefile | 1 +
>> net/core/bypass.c | 406 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 505 insertions(+)
>> create mode 100644 include/net/bypass.h
>> create mode 100644 net/core/bypass.c
>>
>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>> new file mode 100644
>> index 000000000000..e2dd122f951a
>> --- /dev/null
>> +++ b/include/net/bypass.h
>> @@ -0,0 +1,80 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_BYPASS_H
>> +#define _NET_BYPASS_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct bypass_ops {
> Perhaps "net_bypass_" would be better prefix for this module structs
> and functions. No strong opinion though.
>
>
>> + int (*register_child)(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev);
> We have master/slave upper/lower netdevices. This adds "child". Consider
> using some existing names. Not sure if possible without loss of meaning.
OK. will change this to register_slave()
>
>
>> + int (*join_child)(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev);
>> + int (*unregister_child)(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev);
>> + int (*release_child)(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev);
>> + int (*update_link)(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev);
>> + rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct bypass_instance {
>> + struct list_head list;
>> + struct net_device __rcu *bypass_netdev;
>> + struct bypass *bypass;
>> +};
>> +
>> +struct bypass {
>> + struct list_head list;
>> + const struct bypass_ops *ops;
>> + const struct net_device_ops *netdev_ops;
>> + struct list_head instance_list;
>> + struct mutex lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> +
>> +struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>> + const struct net_device_ops *netdev_ops);
>> +void bypass_unregister_driver(struct bypass *bypass);
>> +
>> +int bypass_register_instance(struct bypass *bypass, struct net_device *dev);
>> +int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev);
>> +
>> +int bypass_unregister_child(struct net_device *child_netdev);
>> +
>> +#else
>> +
>> +static inline
>> +struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>> + const struct net_device_ops *netdev_ops)
>> +{
>> + return NULL;
>> +}
>> +
>> +static inline void bypass_unregister_driver(struct bypass *bypass)
>> +{
>> +}
>> +
>> +static inline int bypass_register_instance(struct bypass *bypass,
>> + struct net_device *dev)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline int bypass_unregister_instance(struct bypass *bypass,
>> + struct net_device *dev)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline int bypass_unregister_child(struct net_device *child_netdev)
>> +{
>> + return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_BYPASS_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..994445f4a96a 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> devlink is a loadable module and the driver using it is built-in.
>>
>> +config NET_BYPASS
>> + tristate "Bypass interface"
>> + ---help---
>> + This provides a generic interface for paravirtual drivers to listen
>> + for netdev register/unregister/link change events from pci ethernet
>> + devices with the same MAC and takeover their datapath. This also
>> + enables live migration of a VM with direct attached VF by failing
>> + over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_BYPASS
>> + tristate
>> + default m if NET_BYPASS=m
>> + default y if NET_BYPASS=y || NET_BYPASS=n
>> + help
>> + Drivers using the bypass infrastructure should have a dependency
>> + on MAY_USE_BYPASS to ensure they do not cause link errors when
>> + bypass is a loadable module and the driver using it is built-in.
>> +
>> endif # if NET
>>
>> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..a9727ed1c8fc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> obj-$(CONFIG_HWBM) += hwbm.o
>> obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> diff --git a/net/core/bypass.c b/net/core/bypass.c
>> new file mode 100644
>> index 000000000000..7bde962ec3d4
>> --- /dev/null
>> +++ b/net/core/bypass.c
>> @@ -0,0 +1,406 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/bypass.h>
>> +
>> +static LIST_HEAD(bypass_list);
>> +
>> +static DEFINE_MUTEX(bypass_mutex);
> Why mutex? Apparently you don't need to sleep while holding a lock.
> Simple spinlock would do.
>
>
>> +
>> +struct bypass_instance *bypass_instance_alloc(struct net_device *dev)
>> +{
>> + struct bypass_instance *bypass_instance;
>> +
>> + bypass_instance = kzalloc(sizeof(*bypass_instance), GFP_KERNEL);
>> + if (!bypass_instance)
>> + return NULL;
>> +
>> + dev_hold(dev);
>> + rcu_assign_pointer(bypass_instance->bypass_netdev, dev);
>> +
>> + return bypass_instance;
>> +}
>> +
>> +void bypass_instance_free(struct bypass_instance *bypass_instance)
>> +{
>> + struct net_device *bypass_netdev;
>> +
>> + bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>> +
>> + dev_put(bypass_netdev);
>> + kfree(bypass_instance);
>> +}
>> +
>> +static struct bypass_instance *bypass_get_instance_bymac(u8 *mac)
>> +{
>> + struct bypass_instance *bypass_instance;
>> + struct net_device *bypass_netdev;
>> + struct bypass *bypass;
>> +
>> + list_for_each_entry(bypass, &bypass_list, list) {
>> + mutex_lock(&bypass->lock);
>> + list_for_each_entry(bypass_instance, &bypass->instance_list,
>> + list) {
>> + bypass_netdev =
>> + rcu_dereference(bypass_instance->bypass_netdev);
>> + if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> + mutex_unlock(&bypass->lock);
>> + goto out;
>> + }
>> + }
>> + mutex_unlock(&bypass->lock);
>> + }
>> +
>> + bypass_instance = NULL;
>> +out:
>> + return bypass_instance;
>> +}
>> +
>> +static int bypass_register_child(struct net_device *child_netdev)
>> +{
>> + struct bypass_instance *bypass_instance;
>> + struct bypass *bypass;
>> + struct net_device *bypass_netdev;
>> + int ret, orig_mtu;
>> +
>> + ASSERT_RTNL();
>> +
>> + mutex_lock(&bypass_mutex);
>> + bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
>> + if (!bypass_instance) {
>> + mutex_unlock(&bypass_mutex);
>> + goto done;
>> + }
>> +
>> + bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>> + bypass = bypass_instance->bypass;
>> + mutex_unlock(&bypass_mutex);
>> +
>> + if (!bypass->ops->register_child)
>> + goto done;
>> +
>> + ret = bypass->ops->register_child(bypass_netdev, child_netdev);
>> + if (ret != 0)
>> + goto done;
>> +
>> + ret = netdev_rx_handler_register(child_netdev,
>> + bypass->ops->handle_frame,
>> + bypass_netdev);
>> + if (ret != 0) {
>> + netdev_err(child_netdev,
>> + "can not register bypass rx handler (err = %d)\n",
>> + ret);
>> + goto rx_handler_failed;
>> + }
>> +
>> + ret = netdev_upper_dev_link(child_netdev, bypass_netdev, NULL);
>> + if (ret != 0) {
>> + netdev_err(child_netdev,
> No line-wrap is needed here and in other cases like this.
>
>
>> + "can not set master device %s (err = %d)\n",
>> + bypass_netdev->name, ret);
>> + goto upper_link_failed;
>> + }
>> +
>> + child_netdev->flags |= IFF_SLAVE;
> Don't reuse IFF_SLAVE. That is bonding-specific thing. I know that
> netvsc uses it, it is wrong.
> Please rather introduce:
> IFF_BYPASS for master
> and IFF_BYPASS_SLAVE for slaves.
OK. Will add these flags.
>
>
>
>
>> +
>> + if (netif_running(bypass_netdev)) {
>> + ret = dev_open(child_netdev);
>> + if (ret && (ret != -EBUSY)) {
>> + netdev_err(bypass_netdev,
>> + "Opening child %s failed ret:%d\n",
>> + child_netdev->name, ret);
>> + goto err_interface_up;
>> + }
>> + }
>> +
>> + /* Align MTU of child with master */
>> + orig_mtu = child_netdev->mtu;
>> + ret = dev_set_mtu(child_netdev, bypass_netdev->mtu);
>> + if (ret != 0) {
>> + netdev_err(bypass_netdev,
>> + "unable to change mtu of %s to %u register failed\n",
>> + child_netdev->name, bypass_netdev->mtu);
>> + goto err_set_mtu;
>> + }
>> +
>> + ret = bypass->ops->join_child(bypass_netdev, child_netdev);
>> + if (ret != 0)
>> + goto err_join;
>> +
>> + call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>> +
>> + goto done;
>> +
>> +err_join:
>> + dev_set_mtu(child_netdev, orig_mtu);
>> +err_set_mtu:
>> + dev_close(child_netdev);
>> +err_interface_up:
>> + netdev_upper_dev_unlink(child_netdev, bypass_netdev);
>> + child_netdev->flags &= ~IFF_SLAVE;
>> +upper_link_failed:
>> + netdev_rx_handler_unregister(child_netdev);
>> +rx_handler_failed:
>> + bypass->ops->unregister_child(bypass_netdev, child_netdev);
>> +
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +
>> +int bypass_unregister_child(struct net_device *child_netdev)
>> +{
>> + struct bypass_instance *bypass_instance;
>> + struct net_device *bypass_netdev;
>> + struct bypass *bypass;
>> + int ret;
>> +
>> + ASSERT_RTNL();
>> +
>> + mutex_lock(&bypass_mutex);
>> + bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
>> + if (!bypass_instance) {
>> + mutex_unlock(&bypass_mutex);
>> + goto done;
>> + }
>> +
>> + bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>> + bypass = bypass_instance->bypass;
>> + mutex_unlock(&bypass_mutex);
>> +
>> + ret = bypass->ops->release_child(bypass_netdev, child_netdev);
>> + if (ret != 0)
>> + goto done;
>> +
>> + netdev_rx_handler_unregister(child_netdev);
>> + netdev_upper_dev_unlink(child_netdev, bypass_netdev);
>> + child_netdev->flags &= ~IFF_SLAVE;
>> +
>> + if (!bypass->ops->unregister_child)
>> + goto done;
>> +
>> + bypass->ops->unregister_child(bypass_netdev, child_netdev);
>> +
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL(bypass_unregister_child);
> Please use "EXPORT_SYMBOL_GPL" for all exported symbols.
OK.
>
>
>> +
>> +static int bypass_update_link(struct net_device *child_netdev)
>> +{
>> + struct bypass_instance *bypass_instance;
>> + struct net_device *bypass_netdev;
>> + struct bypass *bypass;
>> +
>> + ASSERT_RTNL();
>> +
>> + mutex_lock(&bypass_mutex);
>> + bypass_instance = bypass_get_instance_bymac(child_netdev->perm_addr);
> You don't really need this lookup. The kernel knows about the master
> device, you can just use netdev_master_upper_dev_get_rcu() to get it.
Sure.
>
>
>> + if (!bypass_instance) {
>> + mutex_unlock(&bypass_mutex);
>> + goto done;
>> + }
>> +
>> + bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>> + bypass = bypass_instance->bypass;
>> + mutex_unlock(&bypass_mutex);
>> +
>> + if (!bypass->ops->update_link)
>> + goto done;
>> +
>> + bypass->ops->update_link(bypass_netdev, child_netdev);
>> +
>> +done:
>> + return NOTIFY_DONE;
>> +}
>> +
>> +static bool bypass_validate_child_dev(struct net_device *dev)
>> +{
>> + /* Avoid non-Ethernet type devices */
>> + if (dev->type != ARPHRD_ETHER)
>> + return false;
>> +
>> + /* Avoid Vlan dev with same MAC registering as VF */
>> + if (is_vlan_dev(dev))
>> + return false;
>> +
>> + /* Avoid Bonding master dev with same MAC registering as child dev */
>> + if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> +static int
>> +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> + struct bypass *bypass;
>> +
>> + /* Skip Parent events */
>> + mutex_lock(&bypass_mutex);
>> + list_for_each_entry(bypass, &bypass_list, list) {
>> + if (event_dev->netdev_ops == bypass->netdev_ops) {
>> + mutex_unlock(&bypass_mutex);
>> + return NOTIFY_DONE;
>> + }
> What you need instead of this is an identification helper
> netif_is_bypass_master()
> similar to
> netif_is_team_master()
> netif_is_bridge_master()
> etc
OK. Will introduce this helper.
>
>
>
>> + }
>> + mutex_unlock(&bypass_mutex);
>> +
>> + if (!bypass_validate_child_dev(event_dev))
>> + return NOTIFY_DONE;
>> +
>> + switch (event) {
>> + case NETDEV_REGISTER:
>> + return bypass_register_child(event_dev);
>> + case NETDEV_UNREGISTER:
>> + return bypass_unregister_child(event_dev);
>> + case NETDEV_UP:
>> + case NETDEV_DOWN:
>> + case NETDEV_CHANGE:
>> + return bypass_update_link(event_dev);
>> + default:
>> + return NOTIFY_DONE;
>> + }
>> +}
>> +
>> +static struct notifier_block bypass_notifier = {
>> + .notifier_call = bypass_event,
>> +};
>> +
>> +static void bypass_register_existing_child(struct net_device *bypass_netdev)
>> +{
>> + struct net *net = dev_net(bypass_netdev);
>> + struct net_device *dev;
>> +
>> + rtnl_lock();
>> + for_each_netdev(net, dev) {
>> + if (dev == bypass_netdev)
>> + continue;
>> + if (!bypass_validate_child_dev(dev))
>> + continue;
>> + if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> + bypass_register_child(dev);
>> + }
>> + rtnl_unlock();
>> +}
>> +
>> +int bypass_register_instance(struct bypass *bypass, struct net_device *dev)
>> +{
>> + struct bypass_instance *bypass_instance;
> No need to allocate this instace here. You can just have is embedded
> inside netdevice priv and pass pointer to it here. You can pass the
> pointer back to the driver when you call ops as the driver can get priv
> back by it.
>
> I would also call it "struct bypass_master" and this function
> "bypass_master_register".
>
> It should contain the ops pointer too.
OK.
>
>
>> + struct net_device *bypass_netdev;
>> + int ret = 0;
>> +
>> + mutex_lock(&bypass->lock);
>> + list_for_each_entry(bypass_instance, &bypass->instance_list, list) {
>> + bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>> + if (bypass_netdev == dev) {
> This means the driver registered one netdev twice. That is a bug in
> driver, so WARN_ON would be nice here to point that out.
OK.
>
>
>> + ret = -EEXIST;
>> + goto done;
>> + }
>> + }
>> +
>> + bypass_instance = bypass_instance_alloc(dev);
>> + if (!bypass_instance) {
>> + ret = -ENOMEM;
>> + goto done;
>> + }
>> +
>> + bypass_instance->bypass = bypass;
>> + list_add_tail(&bypass_instance->list, &bypass->instance_list);
>> +
>> +done:
>> + mutex_unlock(&bypass->lock);
>> + bypass_register_existing_child(dev);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(bypass_register_instance);
>> +
>> +int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev)
>> +{
>> + struct bypass_instance *bypass_instance;
>> + struct net_device *bypass_netdev;
>> + int ret = 0;
>> +
>> + mutex_lock(&bypass->lock);
>> + list_for_each_entry(bypass_instance, &bypass->instance_list, list) {
>> + bypass_netdev = rcu_dereference(bypass_instance->bypass_netdev);
>> + if (bypass_netdev == dev) {
>> + list_del(&bypass_instance->list);
>> + bypass_instance_free(bypass_instance);
>> + goto done;
>> + }
>> + }
>> +
>> + ret = -ENOENT;
>> +done:
>> + mutex_unlock(&bypass->lock);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(bypass_unregister_instance);
>> +
>> +struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>> + const struct net_device_ops *netdev_ops)
> I don't see why you need a list of drivers. What you need is just a list
> of instances - bypass masters (probably to call them like that in the
> code as well). Well, you can use the common netdevice list for that
> purpose with the identification helper I mentioned above. Then you need
> no lists and no mutexes/spinlocks.
OK. looks like i should be able to just register instances and pass bypass_netdev,
bypass_ops and a flag to indicate 3/2 netdev model.
>
>
>> +{
>> + struct bypass *bypass;
>> +
>> + bypass = kzalloc(sizeof(*bypass), GFP_KERNEL);
>> + if (!bypass)
>> + return NULL;
>> +
>> + bypass->ops = ops;
>> + bypass->netdev_ops = netdev_ops;
>> + INIT_LIST_HEAD(&bypass->instance_list);
>> +
>> + mutex_lock(&bypass_mutex);
>> + list_add_tail(&bypass->list, &bypass_list);
>> + mutex_unlock(&bypass_mutex);
>> +
>> + return bypass;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_register_driver);
>> +
>> +void bypass_unregister_driver(struct bypass *bypass)
>> +{
>> + mutex_lock(&bypass_mutex);
>> + list_del(&bypass->list);
>> + mutex_unlock(&bypass_mutex);
>> +
>> + kfree(bypass);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_unregister_driver);
>> +
>> +static __init int
>> +bypass_init(void)
>> +{
>> + register_netdevice_notifier(&bypass_notifier);
>> +
>> + return 0;
>> +}
>> +module_init(bypass_init);
>> +
>> +static __exit
>> +void bypass_exit(void)
>> +{
>> + unregister_netdevice_notifier(&bypass_notifier);
>> +}
>> +module_exit(bypass_exit);
>> +
>> +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> --
>> 2.14.3
>>
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-06 22:59 UTC (permalink / raw)
To: Jiri Pirko
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <20180406124814.GA24525@nanopsycho>
On 4/6/2018 5:48 AM, Jiri Pirko wrote:
> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
<snip>
>
>> +
>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev)
>> +{
>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> + struct net_device *child_netdev;
>> +
>> + rcu_read_lock();
>> +
>> + child_netdev = rcu_dereference(vbi->active_netdev);
>> + if (child_netdev) {
>> + dev_uc_sync_multiple(child_netdev, dev);
>> + dev_mc_sync_multiple(child_netdev, dev);
>> + }
>> +
>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>> + if (child_netdev) {
>> + dev_uc_sync_multiple(child_netdev, dev);
>> + dev_mc_sync_multiple(child_netdev, dev);
>> + }
>> +
>> + rcu_read_unlock();
>> +}
> This should be moved to bypass module.
Sure. All these bypass ndo_ops can be moved to bypass module and any
paravirtual driver that want to go with 3 netdev model can reuse these functions.
>
>
>> +
>> +static const struct net_device_ops virtnet_bypass_netdev_ops = {
>> + .ndo_open = virtnet_bypass_open,
>> + .ndo_stop = virtnet_bypass_close,
>> + .ndo_start_xmit = virtnet_bypass_start_xmit,
>> + .ndo_select_queue = virtnet_bypass_select_queue,
>> + .ndo_get_stats64 = virtnet_bypass_get_stats,
>> + .ndo_change_mtu = virtnet_bypass_change_mtu,
>> + .ndo_set_rx_mode = virtnet_bypass_set_rx_mode,
>> + .ndo_validate_addr = eth_validate_addr,
>> + .ndo_features_check = passthru_features_check,
>> +};
>> +
>> +static int
>> +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> + struct ethtool_link_ksettings *cmd)
>> +{
>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> + struct net_device *child_netdev;
>> +
>> + child_netdev = rtnl_dereference(vbi->active_netdev);
>> + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
>> + child_netdev = rtnl_dereference(vbi->backup_netdev);
>> + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
>> + cmd->base.duplex = DUPLEX_UNKNOWN;
>> + cmd->base.port = PORT_OTHER;
>> + cmd->base.speed = SPEED_UNKNOWN;
>> +
>> + return 0;
>> + }
>> + }
>> +
>> + return __ethtool_get_link_ksettings(child_netdev, cmd);
>> +}
>> +
>> +#define BYPASS_DRV_NAME "virtnet_bypass"
>> +#define BYPASS_DRV_VERSION "0.1"
>> +
>> +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
>> + struct ethtool_drvinfo *drvinfo)
>> +{
>> + strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> + strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
>> + .get_drvinfo = virtnet_bypass_ethtool_get_drvinfo,
>> + .get_link = ethtool_op_get_link,
>> + .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings,
>> +};
>> +
>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev)
>> +{
>> + struct virtnet_bypass_info *vbi;
>> + bool backup;
>> +
>> + vbi = netdev_priv(bypass_netdev);
>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> + rtnl_dereference(vbi->active_netdev)) {
>> + netdev_info(bypass_netdev,
>> + "%s attempting to join bypass dev when %s already present\n",
>> + child_netdev->name, backup ? "backup" : "active");
> Bypass module should check if there is already some other netdev
> enslaved and refuse right there.
This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
as its bypass_netdev is same as the backup_netdev.
Will add a flag while registering with the bypass module to indicate if the driver is doing
a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
for 3 netdev scenario.
>
> The active/backup terminology is quite confusing. From the bonding world
> that means active is the one which is currently used for tx of the
> packets. And it depends on link and other things what netdev is declared
> active. However here, it is different. Backup is always the virtio_net
> instance even when it is active. Odd. Please change the terminology.
> For "active" I suggest to use name "stolen".
I am not too happy with 'stolen' name. Will see if i can come up with a
better name.
>
> *** Also, the 2 slave netdev pointers should be stored in the bypass
> module instance, not in the drivers.
Will move virtnet_bypass_info struct to bypass.h
>
>
>
>> + return -EEXIST;
>> + }
>> +
>> + dev_hold(child_netdev);
>> +
>> + if (backup) {
>> + rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>> + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>> + } else {
>> + rcu_assign_pointer(vbi->active_netdev, child_netdev);
>> + dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>> + bypass_netdev->min_mtu = child_netdev->min_mtu;
>> + bypass_netdev->max_mtu = child_netdev->max_mtu;
>> + }
>> +
>> + netdev_info(bypass_netdev, "child:%s joined\n", child_netdev->name);
>> +
>> + return 0;
>> +}
>> +
>> +static int virtnet_bypass_register_child(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev)
>> +{
>> + struct virtnet_bypass_info *vbi;
>> + bool backup;
>> +
>> + vbi = netdev_priv(bypass_netdev);
>> + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> + rtnl_dereference(vbi->active_netdev)) {
>> + netdev_info(bypass_netdev,
>> + "%s attempting to register bypass dev when %s already present\n",
>> + child_netdev->name, backup ? "backup" : "active");
>> + return -EEXIST;
>> + }
>> +
>> + /* Avoid non pci devices as active netdev */
>> + if (!backup && (!child_netdev->dev.parent ||
>> + !dev_is_pci(child_netdev->dev.parent)))
>> + return -EINVAL;
>> +
>> + netdev_info(bypass_netdev, "child:%s registered\n", child_netdev->name);
>> +
>> + return 0;
>> +}
>> +
>> +static int virtnet_bypass_release_child(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev)
>> +{
>> + struct net_device *backup_netdev, *active_netdev;
>> + struct virtnet_bypass_info *vbi;
>> +
>> + vbi = netdev_priv(bypass_netdev);
>> + active_netdev = rtnl_dereference(vbi->active_netdev);
>> + backup_netdev = rtnl_dereference(vbi->backup_netdev);
>> +
>> + if (child_netdev != active_netdev && child_netdev != backup_netdev)
>> + return -EINVAL;
>> +
>> + netdev_info(bypass_netdev, "child:%s released\n", child_netdev->name);
>> +
>> + return 0;
>> +}
>> +
>> +static int virtnet_bypass_unregister_child(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev)
>> +{
>> + struct net_device *backup_netdev, *active_netdev;
>> + struct virtnet_bypass_info *vbi;
>> +
>> + vbi = netdev_priv(bypass_netdev);
>> + active_netdev = rtnl_dereference(vbi->active_netdev);
>> + backup_netdev = rtnl_dereference(vbi->backup_netdev);
>> +
>> + if (child_netdev != active_netdev && child_netdev != backup_netdev)
>> + return -EINVAL;
>> +
>> + if (child_netdev == backup_netdev) {
>> + RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>> + } else {
>> + RCU_INIT_POINTER(vbi->active_netdev, NULL);
>> + if (backup_netdev) {
>> + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> + }
>> + }
>> +
>> + dev_put(child_netdev);
>> +
>> + netdev_info(bypass_netdev, "child:%s unregistered\n",
>> + child_netdev->name);
>> +
>> + return 0;
>> +}
>> +
>> +static int virtnet_bypass_update_link(struct net_device *bypass_netdev,
>> + struct net_device *child_netdev)
>> +{
>> + struct net_device *active_netdev, *backup_netdev;
>> + struct virtnet_bypass_info *vbi;
>> +
>> + if (!netif_running(bypass_netdev))
>> + return 0;
>> +
>> + vbi = netdev_priv(bypass_netdev);
>> +
>> + active_netdev = rtnl_dereference(vbi->active_netdev);
>> + backup_netdev = rtnl_dereference(vbi->backup_netdev);
>> +
>> + if (child_netdev != active_netdev && child_netdev != backup_netdev)
>> + return -EINVAL;
>> +
>> + if ((active_netdev && virtnet_bypass_xmit_ready(active_netdev)) ||
>> + (backup_netdev && virtnet_bypass_xmit_ready(backup_netdev))) {
>> + netif_carrier_on(bypass_netdev);
>> + netif_tx_wake_all_queues(bypass_netdev);
>> + } else {
>> + netif_carrier_off(bypass_netdev);
>> + netif_tx_stop_all_queues(bypass_netdev);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/* Called when child dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb)
>> +{
>> + struct sk_buff *skb = *pskb;
>> + struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> + skb->dev = ndev;
>> +
>> + return RX_HANDLER_ANOTHER;
>> +}
> Hmm, you have the rx_handler defined in drivers and you register it in
> bypass module. It is odd because here you assume that the bypass module
> passed "ndev" and rx_handler_data. Also, you don't need advanced
> features rx_handler provides.
> Instead, just register a common rx_handler defined in the bypass module
> and have simple skb_rx callback here (static void).
Will move the rx_handler also to bypass.c
>
>
>> +
>> +static const struct bypass_ops virtnet_bypass_ops = {
>> + .register_child = virtnet_bypass_register_child,
>> + .join_child = virtnet_bypass_join_child,
>> + .unregister_child = virtnet_bypass_unregister_child,
>> + .release_child = virtnet_bypass_release_child,
>> + .update_link = virtnet_bypass_update_link,
>> + .handle_frame = virtnet_bypass_handle_frame,
>> +};
>> +
>> +static struct bypass *virtnet_bypass;
>> +
>> +static int virtnet_bypass_create(struct virtnet_info *vi)
>> +{
>> + struct net_device *backup_netdev = vi->dev;
>> + struct device *dev = &vi->vdev->dev;
>> + struct net_device *bypass_netdev;
>> + int res;
>> +
>> + /* Alloc at least 2 queues, for now we are going with 16 assuming
>> + * that most devices being bonded won't have too many queues.
>> + */
>> + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> + 16);
>> + if (!bypass_netdev) {
>> + dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> + return -ENOMEM;
>> + }
>> +
>> + dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> + SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>> +
>> + /* Initialize the device options */
>> + bypass_netdev->flags |= IFF_MASTER;
> I think I pointed that out already. Don't use "IFF_MASTER". That is
> specific to bonding. As I suggested in the reply to the patch #2, you
> should introduce IFF_BYPASS. Also, this flag should be set by the bypass
> module. Just create the netdev and do things specific to virtio and then
> call to bypass module, pass the netdev so it can do the rest. I think
> that the flags, features etc would be also fine to set there.
OK.
>
>
>> + bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> + bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> + IFF_TX_SKB_SHARING);
>> +
>> + /* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> + bypass_netdev->features |= NETIF_F_LLTX;
>> +
>> + /* Don't allow bypass devices to change network namespaces. */
>> + bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> + bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> + NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> + NETIF_F_HIGHDMA | NETIF_F_LRO;
>> +
>> + bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> + bypass_netdev->features |= bypass_netdev->hw_features;
>> +
>> + /* For now treat bypass netdev as VLAN challenged since we
>> + * cannot assume VLAN functionality with a VF
> Why? I don't see such drivers. But to be 100% correct, you should check
> the NETIF_F_VLAN_CHALLENGED feature in bypass module during VF enslave
> and forbid to do so if it is on.
Will look into it.
>
>
>> + */
>> + bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED;
>> +
>> + memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> + bypass_netdev->addr_len);
>> +
>> + bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> + bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +
>> + res = register_netdev(bypass_netdev);
>> + if (res < 0) {
>> + dev_err(dev, "Unable to register bypass_netdev!\n");
>> + goto err_register_netdev;
>> + }
>> +
>> + netif_carrier_off(bypass_netdev);
>> +
>> + res = bypass_register_instance(virtnet_bypass, bypass_netdev);
>> + if (res < 0)
>> + goto err_bypass;
>> +
>> + rcu_assign_pointer(vi->bypass_netdev, bypass_netdev);
>> +
>> + return 0;
>> +
>> +err_bypass:
>> + unregister_netdev(bypass_netdev);
>> +err_register_netdev:
>> + free_netdev(bypass_netdev);
>> +
>> + return res;
>> +}
>> +
>> +static void virtnet_bypass_destroy(struct virtnet_info *vi)
>> +{
>> + struct net_device *bypass_netdev;
>> + struct virtnet_bypass_info *vbi;
>> + struct net_device *child_netdev;
>> +
>> + bypass_netdev = rcu_dereference(vi->bypass_netdev);
>> + /* no device found, nothing to free */
>> + if (!bypass_netdev)
>> + return;
>> +
>> + vbi = netdev_priv(bypass_netdev);
>> +
>> + netif_device_detach(bypass_netdev);
>> +
>> + rtnl_lock();
>> +
>> + child_netdev = rtnl_dereference(vbi->active_netdev);
>> + if (child_netdev)
>> + bypass_unregister_child(child_netdev);
>> +
>> + child_netdev = rtnl_dereference(vbi->backup_netdev);
>> + if (child_netdev)
>> + bypass_unregister_child(child_netdev);
>> +
>> + unregister_netdevice(bypass_netdev);
>> +
>> + bypass_unregister_instance(virtnet_bypass, bypass_netdev);
>> +
>> + rtnl_unlock();
>> +
>> + free_netdev(bypass_netdev);
>> +}
>> +
>> +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */
>> +
>> static int virtnet_probe(struct virtio_device *vdev)
>> {
>> int i, err = -ENOMEM;
>> @@ -2839,10 +3432,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>>
>> virtnet_init_settings(dev);
>>
>> + if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
>> + if (virtnet_bypass_create(vi) != 0)
> You need to do:
> err = virtnet_bypass_create(vi);
> if (err)
> otherwise you ignore err and virtnet_probe would return 0;
Sure.
>
>
>> + goto free_vqs;
>> + }
>> +
>> err = register_netdev(dev);
>> if (err) {
>> pr_debug("virtio_net: registering device failed\n");
>> - goto free_vqs;
>> + goto free_bypass;
>> }
>>
>> virtio_device_ready(vdev);
>> @@ -2879,6 +3477,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>> vi->vdev->config->reset(vdev);
>>
>> unregister_netdev(dev);
>> +free_bypass:
>> + virtnet_bypass_destroy(vi);
>> free_vqs:
>> cancel_delayed_work_sync(&vi->refill);
>> free_receive_page_frags(vi);
>> @@ -2913,6 +3513,8 @@ static void virtnet_remove(struct virtio_device *vdev)
>>
>> unregister_netdev(vi->dev);
>>
>> + virtnet_bypass_destroy(vi);
>> +
>> remove_vq_common(vi);
>>
>> free_netdev(vi->dev);
>> @@ -2996,6 +3598,11 @@ static __init int virtio_net_driver_init(void)
>> {
>> int ret;
>>
>> + virtnet_bypass = bypass_register_driver(&virtnet_bypass_ops,
>> + &virtnet_bypass_netdev_ops);
>> + if (!virtnet_bypass)
>> + return -ENOMEM;
> If CONFIG_NET_BYPASS is undefined, you will always return -ENOMEM here.
Will fix.
>
>
>> +
>> ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "virtio/net:online",
>> virtnet_cpu_online,
>> virtnet_cpu_down_prep);
>> @@ -3010,12 +3617,14 @@ static __init int virtio_net_driver_init(void)
>> ret = register_virtio_driver(&virtio_net_driver);
>> if (ret)
>> goto err_virtio;
>> +
>> return 0;
>> err_virtio:
>> cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>> err_dead:
>> cpuhp_remove_multi_state(virtionet_online);
>> out:
>> + bypass_unregister_driver(virtnet_bypass);
>> return ret;
>> }
>> module_init(virtio_net_driver_init);
>> @@ -3025,6 +3634,7 @@ static __exit void virtio_net_driver_exit(void)
>> unregister_virtio_driver(&virtio_net_driver);
>> cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>> cpuhp_remove_multi_state(virtionet_online);
>> + bypass_unregister_driver(virtnet_bypass);
>> }
>> module_exit(virtio_net_driver_exit);
>>
>> --
>> 2.14.3
>>
^ permalink raw reply
* Fwd: Problem with the kernel 4.15 - cutting the band (tc)
From: Linus Torvalds @ 2018-04-06 21:56 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <42a29f8a-beb3-84da-9129-fbda7ef81be4@hostcenter.eu>
Forwarding a report about what looks like a regression between 4.14 and 4.15.
New ENOSPC issue? I don't even knew where to start guessing where to look.
Help me, Davem-Wan Kenobi, you are my only hope.
(But adding netdev just in case somebody else goes "That's obviously Xyz")
Linus
---------- Forwarded message ----------
From: Marcin Kabiesz <admin@hostcenter.eu>
Date: Thu, Apr 5, 2018 at 10:38 AM
Subject: Problem with the kernel 4.15 - cutting the band (tc)
Hello,
I have a problem with bandwidth cutting on kernel 4.15. On the version
up to 4.15, i.e. 4.14, this problem does not occur.
uname -a: Linux router 4.14.15 #1 SMP x86_64 Intel Xeon E3-1230 v6
command to reproduce:
tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
match ip dst 0.0.0.0/0 hashkey mask 0x000000ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
This ok, no error/warnings and dmesg log.
uname -a: Linux router 4.15.8 #1 SMP x86_64 Intel Xeon E3-1230 v6 (or
4.15.14 this same effect)
command to reproduce:
tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
match ip dst 0.0.0.0/0 hashkey mask 0x000000ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
RTNETLINK answers: No space left on device
We have an error talking to the kernel
This not ok, on error/warnings and no dmesg log.
Best Regards
Please forgive my English
Marcin Kabiesz
^ permalink raw reply
* [RFC PATCH bpf-next 1/6] bpf: change prototype for stack_map_get_build_id_offset
From: Yonghong Song @ 2018-04-06 21:48 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180406214846.916265-1-yhs@fb.com>
This patch didn't incur functionality change. The function prototype
got changed so that the same function can be reused later.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
kernel/bpf/stackmap.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 57eeb12..04f6ec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct *vma,
return ret;
}
-static void stack_map_get_build_id_offset(struct bpf_map *map,
- struct stack_map_bucket *bucket,
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
u64 *ips, u32 trace_nr, bool user)
{
int i;
struct vm_area_struct *vma;
- struct bpf_stack_build_id *id_offs;
-
- bucket->nr = trace_nr;
- id_offs = (struct bpf_stack_build_id *)bucket->data;
/*
* We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
- stack_map_get_build_id_offset(map, new_bucket, ips,
- trace_nr, user);
+ new_bucket->nr = trace_nr;
+ stack_map_get_build_id_offset(
+ (struct bpf_stack_build_id *)new_bucket->data,
+ ips, trace_nr, user);
trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
--
2.9.5
^ permalink raw reply related
* [RFC PATCH bpf-next 6/6] tools/bpf: add a test case for bpf_get_stack helper
From: Yonghong Song @ 2018-04-06 21:48 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180406214846.916265-1-yhs@fb.com>
The test_stacktrace_map is enhanced to call bpf_get_stack
in the helper to get the stack trace as well.
The stack traces from bpf_get_stack and bpf_get_stackid
are compared to ensure that for the same stack as
represented as the same hash, their ip addresses
must be the same.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
tools/testing/selftests/bpf/test_progs.c | 41 ++++++++++++++++++++++-
tools/testing/selftests/bpf/test_stacktrace_map.c | 20 +++++++++--
2 files changed, 57 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index faadbe2..8aa2844 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -865,9 +865,39 @@ static int compare_map_keys(int map1_fd, int map2_fd)
return 0;
}
+static int compare_stack_ips(int smap_fd, int amap_fd)
+{
+ int max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
+ __u32 key, next_key, *cur_key_p, *next_key_p;
+ char val_buf1[max_len], val_buf2[max_len];
+ int i, err;
+
+ cur_key_p = NULL;
+ next_key_p = &key;
+ while (bpf_map_get_next_key(smap_fd, cur_key_p, next_key_p) == 0) {
+ err = bpf_map_lookup_elem(smap_fd, next_key_p, val_buf1);
+ if (err)
+ return err;
+ err = bpf_map_lookup_elem(amap_fd, next_key_p, val_buf2);
+ if (err)
+ return err;
+ for (i = 0; i < max_len; i++) {
+ if (val_buf1[i] != val_buf2[i])
+ return -1;
+ }
+ key = *next_key_p;
+ cur_key_p = &key;
+ next_key_p = &next_key;
+ }
+ if (errno != ENOENT)
+ return -1;
+
+ return 0;
+}
+
static void test_stacktrace_map()
{
- int control_map_fd, stackid_hmap_fd, stackmap_fd;
+ int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
const char *file = "./test_stacktrace_map.o";
int bytes, efd, err, pmu_fd, prog_fd;
struct perf_event_attr attr = {};
@@ -925,6 +955,10 @@ static void test_stacktrace_map()
if (stackmap_fd < 0)
goto disable_pmu;
+ stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+ if (stack_amap_fd < 0)
+ goto disable_pmu;
+
/* give some time for bpf program run */
sleep(1);
@@ -946,6 +980,11 @@ static void test_stacktrace_map()
"err %d errno %d\n", err, errno))
goto disable_pmu_noerr;
+ err = compare_stack_ips(stackmap_fd, stack_amap_fd);
+ if (CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+ "err %d errno %d\n", err, errno))
+ goto disable_pmu_noerr;
+
goto disable_pmu_noerr;
disable_pmu:
error_cnt++;
diff --git a/tools/testing/selftests/bpf/test_stacktrace_map.c b/tools/testing/selftests/bpf/test_stacktrace_map.c
index 76d85c5d..f83c7b6 100644
--- a/tools/testing/selftests/bpf/test_stacktrace_map.c
+++ b/tools/testing/selftests/bpf/test_stacktrace_map.c
@@ -19,14 +19,21 @@ struct bpf_map_def SEC("maps") stackid_hmap = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(__u32),
.value_size = sizeof(__u32),
- .max_entries = 10000,
+ .max_entries = 16384,
};
struct bpf_map_def SEC("maps") stackmap = {
.type = BPF_MAP_TYPE_STACK_TRACE,
.key_size = sizeof(__u32),
.value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
- .max_entries = 10000,
+ .max_entries = 16384,
+};
+
+struct bpf_map_def SEC("maps") stack_amap = {
+ .type = BPF_MAP_TYPE_ARRAY,
+ .key_size = sizeof(__u32),
+ .value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
+ .max_entries = 16384,
};
/* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
@@ -44,7 +51,10 @@ struct sched_switch_args {
SEC("tracepoint/sched/sched_switch")
int oncpu(struct sched_switch_args *ctx)
{
+ __u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
__u32 key = 0, val = 0, *value_p;
+ void *stack_p;
+
value_p = bpf_map_lookup_elem(&control_map, &key);
if (value_p && *value_p)
@@ -52,8 +62,12 @@ int oncpu(struct sched_switch_args *ctx)
/* The size of stackmap and stackid_hmap should be the same */
key = bpf_get_stackid(ctx, &stackmap, 0);
- if ((int)key >= 0)
+ if ((int)key >= 0) {
bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+ stack_p = bpf_map_lookup_elem(&stack_amap, &key);
+ if (stack_p)
+ bpf_get_stack(ctx, stack_p, max_len, 0);
+ }
return 0;
}
--
2.9.5
^ permalink raw reply related
* [RFC PATCH bpf-next 3/6] tools/bpf: add bpf_get_stack helper to tools headers
From: Yonghong Song @ 2018-04-06 21:48 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180406214846.916265-1-yhs@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
tools/include/uapi/linux/bpf.h | 17 +++++++++++++++--
tools/testing/selftests/bpf/bpf_helpers.h | 2 ++
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465..3930463 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -517,6 +517,17 @@ union bpf_attr {
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*
+ * int bpf_get_stack(ctx, buf, size, flags)
+ * walk user or kernel stack and store the ips in buf
+ * @ctx: struct pt_regs*
+ * @buf: user buffer to fill stack
+ * @size: the buf size
+ * @flags: bits 0-7 - numer of stack frames to skip
+ * bit 8 - collect user stack instead of kernel
+ * bit 11 - get build-id as well if user stack
+ * other bits - reserved
+ * Return: >= 0 size copied on success or negative error
+ *
* s64 bpf_csum_diff(from, from_size, to, to_size, seed)
* calculate csum diff
* @from: raw from buffer
@@ -821,7 +832,8 @@ union bpf_attr {
FN(msg_apply_bytes), \
FN(msg_cork_bytes), \
FN(msg_pull_data), \
- FN(bind),
+ FN(bind), \
+ FN(get_stack),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -855,11 +867,12 @@ enum bpf_func_id {
/* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
#define BPF_F_TUNINFO_IPV6 (1ULL << 0)
-/* BPF_FUNC_get_stackid flags. */
+/* BPF_FUNC_get_stackid and BPF_FUNC_get_stack flags. */
#define BPF_F_SKIP_FIELD_MASK 0xffULL
#define BPF_F_USER_STACK (1ULL << 8)
#define BPF_F_FAST_STACK_CMP (1ULL << 9)
#define BPF_F_REUSE_STACKID (1ULL << 10)
+#define BPF_F_USER_BUILD_ID (1ULL << 11)
/* BPF_FUNC_skb_set_tunnel_key flags. */
#define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d8223d9..acaed02 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -96,6 +96,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+ (void *) BPF_FUNC_get_stack;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
--
2.9.5
^ permalink raw reply related
* [RFC PATCH bpf-next 5/6] samples/bpf: add a test for bpf_get_stack helper
From: Yonghong Song @ 2018-04-06 21:48 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180406214846.916265-1-yhs@fb.com>
The test attached a kprobe program to kernel function sys_write.
It tested to get stack for user space, kernel space and user
space with build_id request. It also tested to get user
and kernel stack into the same buffer with back-to-back
bpf_get_stack helper calls.
Whenever the kernel stack is available, the user space
application will check to ensure that sys_write/SyS_write
is part of the stack.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
samples/bpf/Makefile | 4 +
samples/bpf/trace_get_stack_kern.c | 80 ++++++++++++++++++++
samples/bpf/trace_get_stack_user.c | 150 +++++++++++++++++++++++++++++++++++++
3 files changed, 234 insertions(+)
create mode 100644 samples/bpf/trace_get_stack_kern.c
create mode 100644 samples/bpf/trace_get_stack_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6ed..94e7b10 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
hostprogs-y += xdp_rxq_info
hostprogs-y += syscall_tp
hostprogs-y += cpustat
+hostprogs-y += trace_get_stack
# Libbpf dependencies
LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
+trace_get_stack-objs := bpf_load.o $(LIBBPF) trace_get_stack_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o
always += xdp2skb_meta_kern.o
always += syscall_tp_kern.o
always += cpustat_kern.o
+always += trace_get_stack_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
HOSTLOADLIBES_xdp_rxq_info += -lelf
HOSTLOADLIBES_syscall_tp += -lelf
HOSTLOADLIBES_cpustat += -lelf
+HOSTLOADLIBES_trace_get_stack += -lelf
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/trace_get_stack_kern.c b/samples/bpf/trace_get_stack_kern.c
new file mode 100644
index 0000000..c7cc7b1
--- /dev/null
+++ b/samples/bpf/trace_get_stack_kern.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK 100
+struct stack_trace_t {
+ int pid;
+ int kern_stack_size;
+ int user_stack_size;
+ int user_stack_buildid_size;
+ u64 kern_stack[MAX_STACK];
+ u64 user_stack[MAX_STACK];
+ struct bpf_stack_build_id user_stack_buildid[MAX_STACK];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+ .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+ .key_size = sizeof(int),
+ .value_size = sizeof(u32),
+ .max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+ .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+ .key_size = sizeof(u32),
+ .value_size = sizeof(struct stack_trace_t),
+ .max_entries = 1,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+ int max_len, max_buildid_len, usize, ksize, total_size;
+ struct stack_trace_t *data;
+ void *raw_data;
+ u32 key = 0;
+
+ data = bpf_map_lookup_elem(&stackdata_map, &key);
+ if (!data)
+ return 0;
+
+ max_len = MAX_STACK * sizeof(u64);
+ max_buildid_len = MAX_STACK * sizeof(struct bpf_stack_build_id);
+ data->pid = bpf_get_current_pid_tgid();
+ data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+ max_len, 0);
+ data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+ BPF_F_USER_STACK);
+ data->user_stack_buildid_size = bpf_get_stack(
+ ctx, data->user_stack_buildid, max_buildid_len,
+ BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+ bpf_perf_event_output(ctx, &perfmap, 0, data, sizeof(*data));
+
+ /* write both kernel and user stacks to the same buffer */
+ raw_data = (void *)data;
+ usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+ if (usize < 0)
+ return 0;
+
+ ksize = 0;
+ if (usize < max_len) {
+ ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize,
+ 0);
+ if (ksize < 0)
+ return 0;
+ }
+ total_size = (usize < max_len ? usize : 0) +
+ (ksize < max_len ? ksize : 0);
+ if (total_size > 0 && total_size < max_len)
+ bpf_perf_event_output(ctx, &perfmap, 0, raw_data, total_size);
+
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/trace_get_stack_user.c b/samples/bpf/trace_get_stack_user.c
new file mode 100644
index 0000000..f64f5a5
--- /dev/null
+++ b/samples/bpf/trace_get_stack_user.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <signal.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+static int pmu_fd;
+
+#define MAX_CNT 10ull
+#define MAX_STACK 100
+struct stack_trace_t {
+ int pid;
+ int kern_stack_size;
+ int user_stack_size;
+ int user_stack_buildid_size;
+ __u64 kern_stack[MAX_STACK];
+ __u64 user_stack[MAX_STACK];
+ struct bpf_stack_build_id user_stack_buildid[MAX_STACK];
+};
+
+static void print_bpf_output(void *data, int size)
+{
+ struct stack_trace_t *e = data;
+ int i, num_stack;
+ static __u64 cnt;
+ bool found = false;
+
+ cnt++;
+
+ if (size < sizeof(struct stack_trace_t)) {
+ __u64 *raw_data = data;
+
+ num_stack = size / sizeof(__u64);
+ printf("sample size = %d, raw stack\n\t", size);
+ for (i = 0; i < num_stack; i++) {
+ struct ksym *ks = ksym_search(raw_data[i]);
+
+ printf("0x%llx ", raw_data[i]);
+ if (ks && (strcmp(ks->name, "sys_write") == 0 ||
+ strcmp(ks->name, "SyS_write") == 0))
+ found = true;
+ }
+ printf("\n");
+ } else {
+ printf("sample size = %d, pid %d\n", size, e->pid);
+ if (e->kern_stack_size > 0) {
+ num_stack = e->kern_stack_size / sizeof(__u64);
+ printf("\tkernel_stack(%d): ", num_stack);
+ for (i = 0; i < num_stack; i++) {
+ struct ksym *ks = ksym_search(e->kern_stack[i]);
+
+ printf("0x%llx ", e->kern_stack[i]);
+ if (ks && (strcmp(ks->name, "sys_write") == 0 ||
+ strcmp(ks->name, "SyS_write") == 0))
+ found = true;
+ }
+ printf("\n");
+ }
+ if (e->user_stack_size > 0) {
+ num_stack = e->user_stack_size / sizeof(__u64);
+ printf("\tuser_stack(%d): ", num_stack);
+ for (i = 0; i < num_stack; i++)
+ printf("0x%llx ", e->user_stack[i]);
+ printf("\n");
+ }
+ if (e->user_stack_buildid_size > 0) {
+ num_stack = e->user_stack_buildid_size /
+ sizeof(struct bpf_stack_build_id);
+ printf("\tuser_stack_buildid(%d): ", num_stack);
+ for (i = 0; i < num_stack; i++) {
+ int j;
+
+ printf("(%d, 0x", e->user_stack_buildid[i].status);
+ for (j = 0; j < BPF_BUILD_ID_SIZE; j++)
+ printf("%02x", e->user_stack_buildid[i].build_id[i]);
+ printf(", %llx) ", e->user_stack_buildid[i].offset);
+ }
+ printf("\n");
+ }
+ }
+ if (!found) {
+ printf("received %lld events, kern symbol not found, exiting ...\n", cnt);
+ kill(0, SIGINT);
+ }
+
+ if (cnt == MAX_CNT) {
+ printf("received max %lld events, exiting ...\n", cnt);
+ kill(0, SIGINT);
+ }
+}
+
+static void test_bpf_perf_event(void)
+{
+ struct perf_event_attr attr = {
+ .sample_type = PERF_SAMPLE_RAW,
+ .type = PERF_TYPE_SOFTWARE,
+ .config = PERF_COUNT_SW_BPF_OUTPUT,
+ };
+ int key = 0;
+
+ pmu_fd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+
+ assert(pmu_fd >= 0);
+ assert(bpf_map_update_elem(map_fd[0], &key, &pmu_fd, BPF_ANY) == 0);
+ ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+}
+
+static void action(void)
+{
+ FILE *f;
+
+ f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+ (void) f;
+}
+
+int main(int argc, char **argv)
+{
+ char filename[256];
+
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (load_kallsyms()) {
+ printf("failed to process /proc/kallsyms\n");
+ return 2;
+ }
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ test_bpf_perf_event();
+ return perf_event_poller(pmu_fd, action, print_bpf_output);
+}
--
2.9.5
^ permalink raw reply related
* [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-06 21:48 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180406214846.916265-1-yhs@fb.com>
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.
This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
include/linux/bpf.h | 1 +
include/linux/filter.h | 3 ++-
include/uapi/linux/bpf.h | 17 +++++++++++++--
kernel/bpf/stackmap.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 12 ++++++++++-
kernel/bpf/verifier.c | 3 +++
kernel/trace/bpf_trace.c | 50 +++++++++++++++++++++++++++++++++++++++++-
7 files changed, 137 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..72ccb9a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -676,6 +676,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
extern const struct bpf_func_proto bpf_get_stackid_proto;
+extern const struct bpf_func_proto bpf_get_stack_proto;
extern const struct bpf_func_proto bpf_sock_map_update_proto;
/* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fc4e8f9..9b64f63 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -467,7 +467,8 @@ struct bpf_prog {
dst_needed:1, /* Do we need dst entry? */
blinded:1, /* Was blinded */
is_func:1, /* program is a bpf function */
- kprobe_override:1; /* Do we override a kprobe? */
+ kprobe_override:1, /* Do we override a kprobe? */
+ need_callchain_buf:1; /* Needs callchain buffer? */
enum bpf_prog_type type; /* Type of BPF program */
enum bpf_attach_type expected_attach_type; /* For some prog types */
u32 len; /* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec897..a4ff5b7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -517,6 +517,17 @@ union bpf_attr {
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*
+ * int bpf_get_stack(ctx, buf, size, flags)
+ * walk user or kernel stack and store the ips in buf
+ * @ctx: struct pt_regs*
+ * @buf: user buffer to fill stack
+ * @size: the buf size
+ * @flags: bits 0-7 - numer of stack frames to skip
+ * bit 8 - collect user stack instead of kernel
+ * bit 11 - get build-id as well if user stack
+ * other bits - reserved
+ * Return: >= 0 size copied on success or negative error
+ *
* s64 bpf_csum_diff(from, from_size, to, to_size, seed)
* calculate csum diff
* @from: raw from buffer
@@ -821,7 +832,8 @@ union bpf_attr {
FN(msg_apply_bytes), \
FN(msg_cork_bytes), \
FN(msg_pull_data), \
- FN(bind),
+ FN(bind), \
+ FN(get_stack),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -855,11 +867,12 @@ enum bpf_func_id {
/* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
#define BPF_F_TUNINFO_IPV6 (1ULL << 0)
-/* BPF_FUNC_get_stackid flags. */
+/* BPF_FUNC_get_stackid and BPF_FUNC_get_stack flags. */
#define BPF_F_SKIP_FIELD_MASK 0xffULL
#define BPF_F_USER_STACK (1ULL << 8)
#define BPF_F_FAST_STACK_CMP (1ULL << 9)
#define BPF_F_REUSE_STACKID (1ULL << 10)
+#define BPF_F_USER_BUILD_ID (1ULL << 11)
/* BPF_FUNC_skb_set_tunnel_key flags. */
#define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 04f6ec1..371c72e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -402,6 +402,62 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
.arg3_type = ARG_ANYTHING,
};
+BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
+ u64, flags)
+{
+ u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
+ bool user_build_id = flags & BPF_F_USER_BUILD_ID;
+ u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+ bool user = flags & BPF_F_USER_STACK;
+ struct perf_callchain_entry *trace;
+ bool kernel = !user;
+ u64 *ips;
+
+ if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+ BPF_F_USER_BUILD_ID)))
+ return -EINVAL;
+
+ elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id)
+ : sizeof(u64);
+ if (unlikely(size % elem_size))
+ return -EINVAL;
+
+ num_elem = size / elem_size;
+ if (sysctl_perf_event_max_stack < num_elem)
+ init_nr = 0;
+ else
+ init_nr = sysctl_perf_event_max_stack - num_elem;
+ trace = get_perf_callchain(regs, init_nr, kernel, user,
+ sysctl_perf_event_max_stack, false, false);
+ if (unlikely(!trace))
+ return -EFAULT;
+
+ trace_nr = trace->nr - init_nr;
+ if (trace_nr <= skip)
+ return -EFAULT;
+
+ trace_nr -= skip;
+ trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
+ copy_len = trace_nr * elem_size;
+ ips = trace->ip + skip + init_nr;
+ if (user && user_build_id)
+ stack_map_get_build_id_offset(buf, ips, trace_nr, user);
+ else
+ memcpy(buf, ips, copy_len);
+
+ return copy_len;
+}
+
+const struct bpf_func_proto bpf_get_stack_proto = {
+ .func = bpf_get_stack,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg3_type = ARG_CONST_SIZE_OR_ZERO,
+ .arg4_type = ARG_ANYTHING,
+};
+
/* Called from eBPF program */
static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
{
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0244973..2aa3a65 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -984,10 +984,13 @@ void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock)
static void __bpf_prog_put_rcu(struct rcu_head *rcu)
{
struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);
+ bool need_callchain_buf = aux->prog->need_callchain_buf;
free_used_maps(aux);
bpf_prog_uncharge_memlock(aux->prog);
security_bpf_prog_free(aux);
+ if (need_callchain_buf)
+ put_callchain_buffers();
bpf_prog_free(aux->prog);
}
@@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
bpf_prog_kallsyms_del(prog->aux->func[i]);
bpf_prog_kallsyms_del(prog);
- call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
+ synchronize_rcu();
+ __bpf_prog_put_rcu(&prog->aux->rcu);
}
}
@@ -1341,6 +1345,12 @@ static int bpf_prog_load(union bpf_attr *attr)
if (err)
goto free_used_maps;
+ if (prog->need_callchain_buf) {
+ err = get_callchain_buffers(sysctl_perf_event_max_stack);
+ if (err)
+ goto free_used_maps;
+ }
+
err = bpf_prog_new_fd(prog);
if (err < 0) {
/* failed to allocate fd.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb..aba9425 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2460,6 +2460,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
if (err)
return err;
+ if (func_id == BPF_FUNC_get_stack)
+ env->prog->need_callchain_buf = true;
+
if (changes_data)
clear_all_pkt_pointers(env);
return 0;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d88e96d..fe8476f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -20,6 +20,7 @@
#include "trace.h"
u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_get_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
/**
* trace_call_bpf - invoke BPF program
@@ -577,6 +578,8 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_perf_event_output_proto;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto;
+ case BPF_FUNC_get_stack:
+ return &bpf_get_stack_proto;
case BPF_FUNC_perf_event_read_value:
return &bpf_perf_event_read_value_proto;
#ifdef CONFIG_BPF_KPROBE_OVERRIDE
@@ -664,6 +667,25 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
.arg3_type = ARG_ANYTHING,
};
+BPF_CALL_4(bpf_get_stack_tp, void *, tp_buff, void *, buf, u32, size,
+ u64, flags)
+{
+ struct pt_regs *regs = *(struct pt_regs **)tp_buff;
+
+ return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+ (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_tp = {
+ .func = bpf_get_stack_tp,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg3_type = ARG_CONST_SIZE_OR_ZERO,
+ .arg4_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *
tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -672,6 +694,8 @@ tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_perf_event_output_proto_tp;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto_tp;
+ case BPF_FUNC_get_stack:
+ return &bpf_get_stack_proto_tp;
default:
return tracing_func_proto(func_id, prog);
}
@@ -734,6 +758,8 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_perf_event_output_proto_tp;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto_tp;
+ case BPF_FUNC_get_stack:
+ return &bpf_get_stack_proto_tp;
case BPF_FUNC_perf_prog_read_value:
return &bpf_perf_prog_read_value_proto;
default:
@@ -744,7 +770,7 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
/*
* bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
* to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output and/or bpf_get_stack_id
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
*/
static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
@@ -787,6 +813,26 @@ static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
.arg3_type = ARG_ANYTHING,
};
+BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
+ void *, buf, u32, size, u64, flags)
+{
+ struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+
+ perf_fetch_caller_regs(regs);
+ return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+ (unsigned long) size, flags, 0);
+}
+
+static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = {
+ .func = bpf_get_stack_raw_tp,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_MEM,
+ .arg3_type = ARG_CONST_SIZE_OR_ZERO,
+ .arg4_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *
raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -795,6 +841,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_perf_event_output_proto_raw_tp;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto_raw_tp;
+ case BPF_FUNC_get_stack:
+ return &bpf_get_stack_proto_raw_tp;
default:
return tracing_func_proto(func_id, prog);
}
--
2.9.5
^ permalink raw reply related
* [RFC PATCH bpf-next 0/6] bpf: add bpf_get_stack_helper
From: Yonghong Song @ 2018-04-06 21:48 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.
This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.
Patches #1 and #2 implemented the core kernel support.
Patch #3 synced the new helper to tools headers.
Patches #4 and #5 added a test in samples/bpf by attaching
to a kprobe, and Patch #6 added a test in tools/bpf by
attaching to a tracepoint.
Yonghong Song (6):
bpf: change prototype for stack_map_get_build_id_offset
bpf: add bpf_get_stack helper
tools/bpf: add bpf_get_stack helper to tools headers
samples/bpf: move common-purpose perf_event functions to bpf_load.c
samples/bpf: add a test for bpf_get_stack helper
tools/bpf: add a test case for bpf_get_stack helper
include/linux/bpf.h | 1 +
include/linux/filter.h | 3 +-
include/uapi/linux/bpf.h | 17 ++-
kernel/bpf/stackmap.c | 69 ++++++++--
kernel/bpf/syscall.c | 12 +-
kernel/bpf/verifier.c | 3 +
kernel/trace/bpf_trace.c | 50 +++++++-
samples/bpf/Makefile | 4 +
samples/bpf/bpf_load.c | 104 +++++++++++++++
samples/bpf/bpf_load.h | 5 +
samples/bpf/trace_get_stack_kern.c | 80 ++++++++++++
samples/bpf/trace_get_stack_user.c | 150 ++++++++++++++++++++++
samples/bpf/trace_output_user.c | 113 ++--------------
tools/include/uapi/linux/bpf.h | 17 ++-
tools/testing/selftests/bpf/bpf_helpers.h | 2 +
tools/testing/selftests/bpf/test_progs.c | 41 +++++-
tools/testing/selftests/bpf/test_stacktrace_map.c | 20 ++-
17 files changed, 568 insertions(+), 123 deletions(-)
create mode 100644 samples/bpf/trace_get_stack_kern.c
create mode 100644 samples/bpf/trace_get_stack_user.c
--
2.9.5
^ permalink raw reply
* [RFC PATCH bpf-next 4/6] samples/bpf: move common-purpose perf_event functions to bpf_load.c
From: Yonghong Song @ 2018-04-06 21:48 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180406214846.916265-1-yhs@fb.com>
There is no functionality change in this patch. The common-purpose
perf_event functions are moved from trace_output_user.c to bpf_load.c
so that these function can be reused later.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
samples/bpf/bpf_load.c | 104 ++++++++++++++++++++++++++++++++++++
samples/bpf/bpf_load.h | 5 ++
samples/bpf/trace_output_user.c | 113 ++++------------------------------------
3 files changed, 118 insertions(+), 104 deletions(-)
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index bebe418..62aa5cc 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -713,3 +713,107 @@ struct ksym *ksym_search(long key)
return &syms[0];
}
+static int page_size;
+static int page_cnt = 8;
+static volatile struct perf_event_mmap_page *header;
+
+static int perf_event_mmap(int fd)
+{
+ void *base;
+ int mmap_size;
+
+ page_size = getpagesize();
+ mmap_size = page_size * (page_cnt + 1);
+
+ base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (base == MAP_FAILED) {
+ printf("mmap err\n");
+ return -1;
+ }
+
+ header = base;
+ return 0;
+}
+
+static int perf_event_poll(int fd)
+{
+ struct pollfd pfd = { .fd = fd, .events = POLLIN };
+
+ return poll(&pfd, 1, 1000);
+}
+
+struct perf_event_sample {
+ struct perf_event_header header;
+ __u32 size;
+ char data[];
+};
+
+static void perf_event_read(perf_event_print_fn fn)
+{
+ __u64 data_tail = header->data_tail;
+ __u64 data_head = header->data_head;
+ __u64 buffer_size = page_cnt * page_size;
+ void *base, *begin, *end;
+ char buf[256];
+
+ asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
+ if (data_head == data_tail)
+ return;
+
+ base = ((char *)header) + page_size;
+
+ begin = base + data_tail % buffer_size;
+ end = base + data_head % buffer_size;
+
+ while (begin != end) {
+ struct perf_event_sample *e;
+
+ e = begin;
+ if (begin + e->header.size > base + buffer_size) {
+ long len = base + buffer_size - begin;
+
+ assert(len < e->header.size);
+ memcpy(buf, begin, len);
+ memcpy(buf + len, base, e->header.size - len);
+ e = (void *) buf;
+ begin = base + e->header.size - len;
+ } else if (begin + e->header.size == base + buffer_size) {
+ begin = base;
+ } else {
+ begin += e->header.size;
+ }
+
+ if (e->header.type == PERF_RECORD_SAMPLE) {
+ fn(e->data, e->size);
+ } else if (e->header.type == PERF_RECORD_LOST) {
+ struct {
+ struct perf_event_header header;
+ __u64 id;
+ __u64 lost;
+ } *lost = (void *) e;
+ printf("lost %lld events\n", lost->lost);
+ } else {
+ printf("unknown event type=%d size=%d\n",
+ e->header.type, e->header.size);
+ }
+ }
+
+ __sync_synchronize(); /* smp_mb() */
+ header->data_tail = data_head;
+}
+
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+ perf_event_print_fn output_fn)
+{
+ if (perf_event_mmap(fd) < 0)
+ return 1;
+
+ exec_fn();
+
+ for (;;) {
+ perf_event_poll(fd);
+ perf_event_read(output_fn);
+ }
+
+ return 0;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 453c200..d618750 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -62,4 +62,9 @@ struct ksym {
int load_kallsyms(void);
struct ksym *ksym_search(long key);
int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
+
+typedef void (*perf_event_exec_fn)(void);
+typedef void (*perf_event_print_fn)(void *data, int size);
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+ perf_event_print_fn output_fn);
#endif
diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c
index ccca1e3..3d3991f 100644
--- a/samples/bpf/trace_output_user.c
+++ b/samples/bpf/trace_output_user.c
@@ -24,97 +24,6 @@
static int pmu_fd;
-int page_size;
-int page_cnt = 8;
-volatile struct perf_event_mmap_page *header;
-
-typedef void (*print_fn)(void *data, int size);
-
-static int perf_event_mmap(int fd)
-{
- void *base;
- int mmap_size;
-
- page_size = getpagesize();
- mmap_size = page_size * (page_cnt + 1);
-
- base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
- if (base == MAP_FAILED) {
- printf("mmap err\n");
- return -1;
- }
-
- header = base;
- return 0;
-}
-
-static int perf_event_poll(int fd)
-{
- struct pollfd pfd = { .fd = fd, .events = POLLIN };
-
- return poll(&pfd, 1, 1000);
-}
-
-struct perf_event_sample {
- struct perf_event_header header;
- __u32 size;
- char data[];
-};
-
-static void perf_event_read(print_fn fn)
-{
- __u64 data_tail = header->data_tail;
- __u64 data_head = header->data_head;
- __u64 buffer_size = page_cnt * page_size;
- void *base, *begin, *end;
- char buf[256];
-
- asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
- if (data_head == data_tail)
- return;
-
- base = ((char *)header) + page_size;
-
- begin = base + data_tail % buffer_size;
- end = base + data_head % buffer_size;
-
- while (begin != end) {
- struct perf_event_sample *e;
-
- e = begin;
- if (begin + e->header.size > base + buffer_size) {
- long len = base + buffer_size - begin;
-
- assert(len < e->header.size);
- memcpy(buf, begin, len);
- memcpy(buf + len, base, e->header.size - len);
- e = (void *) buf;
- begin = base + e->header.size - len;
- } else if (begin + e->header.size == base + buffer_size) {
- begin = base;
- } else {
- begin += e->header.size;
- }
-
- if (e->header.type == PERF_RECORD_SAMPLE) {
- fn(e->data, e->size);
- } else if (e->header.type == PERF_RECORD_LOST) {
- struct {
- struct perf_event_header header;
- __u64 id;
- __u64 lost;
- } *lost = (void *) e;
- printf("lost %lld events\n", lost->lost);
- } else {
- printf("unknown event type=%d size=%d\n",
- e->header.type, e->header.size);
- }
- }
-
- __sync_synchronize(); /* smp_mb() */
- header->data_tail = data_head;
-}
-
static __u64 time_get_ns(void)
{
struct timespec ts;
@@ -166,10 +75,17 @@ static void test_bpf_perf_event(void)
ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
}
+static void exec_action(void)
+{
+ FILE *f;
+
+ f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
+ (void) f;
+}
+
int main(int argc, char **argv)
{
char filename[256];
- FILE *f;
snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
@@ -180,17 +96,6 @@ int main(int argc, char **argv)
test_bpf_perf_event();
- if (perf_event_mmap(pmu_fd) < 0)
- return 1;
-
- f = popen("taskset 1 dd if=/dev/zero of=/dev/null", "r");
- (void) f;
-
start_time = time_get_ns();
- for (;;) {
- perf_event_poll(pmu_fd);
- perf_event_read(print_bpf_output);
- }
-
- return 0;
+ return perf_event_poller(pmu_fd, exec_action, print_bpf_output);
}
--
2.9.5
^ permalink raw reply related
* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-06 21:29 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Michael S. Tsirkin, Jiri Pirko, Alexander Duyck, David Miller,
Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala, Sridhar,
Netdev, virtualization, virtio-dev, si-wei liu
In-Reply-To: <20180403160834.51594373@xeon-e3>
(put discussions back on list which got accidentally removed)
On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Sun, 1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>> >
>> >> Hidden netdevice is not visible to userspace such that
>> >> typical network utilites e.g. ip, ifconfig and et al,
>> >> cannot sense its existence or configure it. Internally
>> >> hidden netdev may associate with an upper level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purpose, the
>> >> kobject for hidden netdev still presents in the sysfs
>> >> hierarchy, however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> For that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now netdev name that
>> >> starts with colon i.e. ':' is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The absence of namescope started
>> >> with ':' can rightly be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collision
>> and thus breaking apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the one hidden and it starts
>> with a dot. It would fail, and users don't know its just because the
>> name starts with dot. IMHO users should be agnostic of (the namespace
>> of) hidden device at all if what they pick is a valid name.
>>
>> ":" is an invalid prefix to userspace, there's no such problem if
>> being used to construct the namescope for hidden devices.
>>
>> However, technically, just as what I alluded to in the reply earlier,
>> it might really be consistent to put this under a separeate namespace
>> instead than fiddling with name prefix. But I am just not sure if that
>> is a big hammer and would like to earn enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> Atrribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is if this a security feature then it needs to be more
I don't expect the namespace to be a security aspect of feature, but
rather a way to make old userspace unmodified to work with a new
feature. And, we're going to add API to expose the netdev info for the
invisible IFF_AUTO_MANAGED links anyway. We don't need to make it
secure and all hidden under the dark to be honest.
> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.
Great, thanks!
-Siwei
^ permalink raw reply
* Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink
From: David Ahern @ 2018-04-06 21:22 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, davem, roopa, shm, jiri, idosch, jakub.kicinski,
andy.roulin
In-Reply-To: <20180406055255.GB19345@nanopsycho>
On 4/5/18 11:52 PM, Jiri Pirko wrote:
> Thu, Apr 05, 2018 at 11:06:41PM CEST, dsa@cumulusnetworks.com wrote:
>> On 4/5/18 2:10 PM, David Ahern wrote:
>>>
>>> The ASIC here is the kernel tables in a namespace. It does not make
>>> sense to have 2 devlink instances for a single namespace.
>>
>> I put this example controller in netdevsim per a suggestion from Ido.
>> The netdevsim seemed like a good idea given that modules intention --
>> testing network facilities. Perhaps I should have done this as a
>> completely standalone module ...
>>
>> The intention is to treat the kernel's tables *per namespace* as a
>> standalone entity that can be managed very similar to ASIC resources.
>
> So you say you want to treat a namespace as an ASIC? That sounds very
> odd to me :/
Why? The kernel has forwarding tables, acl's, etc just like the ASIC,
and each namespace is a separate set of tables.
If you think about it, userspace "programs" the kernel just like mlxsw
and userspace SDKs "program" an asic.
>> Given that I can add a resource controller module
>> (drivers/net/kern_res_mgr.c?) that creates a 'struct device' per network
>> namespace with a devlink instance. In this case the device would very
>> much be tied to the namespace 1:1.
>
> That sounds more reasonable and accurate, yet still odd. You would not
> have any netdevices there? Any ports?
>
Sure, what ever ports are assigned to or created in the namespace.
Nothing about the devlink API says it has to be a real h/w device.
Nothing about the devlink API says it can only be used for real h/w that
has ports represented by netdevices that the devlink instance some how
has "control" over.
As the netdevsim demo shows, I can build an L3 resource controller for
the kernel tables using just the devlink API and the in-kernel notifiers.
^ permalink raw reply
* Re: [PATCH net v6 4/4] ipv6: udp: set dst cache for a connected sk if current not valid
From: Sasha Levin @ 2018-04-06 20:05 UTC (permalink / raw)
To: Sasha Levin, Alexey Kodanev, netdev@vger.kernel.org
Cc: Eric Dumazet, stable@vger.kernel.org
In-Reply-To: <1522756810-24985-5-git-send-email-alexey.kodanev@oracle.com>
Hi,
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 33c162a980fe ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update.
The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92.
v4.16: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()")
v4.15.15: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()")
v4.14.32: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()")
v4.9.92: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()")
^ permalink raw reply
* Re: [PATCH net] net: dsa: b53: Fix sparse warnings in b53_mmap.c
From: Sasha Levin @ 2018-04-06 20:05 UTC (permalink / raw)
To: Sasha Levin, Florian Fainelli, netdev@vger.kernel.org
Cc: jonas.gorski@gmail.com, Florian Fainelli, stable@vger.kernel.org
In-Reply-To: <20180402231701.17348-1-f.fainelli@gmail.com>
Hi,
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 967dd82ffc52 net: dsa: b53: Add support for Broadcom RoboSwitch.
The bot has also determined it's probably a bug fixing patch. (score: 8.8847)
The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92.
v4.16: Build OK!
v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Build OK!
^ permalink raw reply
* Re: [PATCH net 2/2] net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()
From: Sasha Levin @ 2018-04-06 20:05 UTC (permalink / raw)
To: Sasha Levin, Florian Fainelli, netdev@vger.kernel.org
Cc: Florian Fainelli, stable@vger.kernel.org
In-Reply-To: <20180402225856.4351-3-f.fainelli@gmail.com>
Hi,
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 80105befdb4b net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver.
The bot has also determined it's probably a bug fixing patch. (score: 50.4075)
The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92, v4.4.126.
v4.16: Build OK!
v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Build OK!
v4.4.126: Build OK!
^ permalink raw reply
* Re: [PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()
From: Sasha Levin @ 2018-04-06 20:05 UTC (permalink / raw)
To: Sasha Levin, Florian Fainelli, netdev@vger.kernel.org
Cc: Florian Fainelli, stable@vger.kernel.org
In-Reply-To: <20180402225856.4351-2-f.fainelli@gmail.com>
Hi,
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 1c1008c793fa net: bcmgenet: add main driver file.
The bot has also determined it's probably a bug fixing patch. (score: 49.2621)
The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92, v4.4.126.
v4.16: Build OK!
v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Build OK!
v4.4.126: Build OK!
^ permalink raw reply
* [PATCH v4] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev @ 2018-04-06 19:53 UTC (permalink / raw)
To: sgoutham, sunil.kovvuri, rric, linux-arm-kernel, netdev,
linux-kernel, davem
Cc: dnelson, robin.murphy, hch, gustavo, Vadim Lomovtsev
In-Reply-To: <20180406140443.15181-1-Vadim.Lomovtsev@caviumnetworks.com>
From: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.
This eventually removes multiple kmalloc() calls, aviod memory
fragmentation and allow to put single null check on kmalloc
return value in order to prevent a potential null pointer dereference.
Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
---
Changes from v1 to v2:
- C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
Changes from v2 to v3:
- update commit description with 'Reported-by: Dan Carpenter';
- update size calculations for mc list to offsetof() call
instead of explicit arithmetic;
Changes from v3 to v4:
- change loop control variable type from u8 to int, accordingly
to mc_count size;
---
drivers/net/ethernet/cavium/thunder/nic.h | 7 +-----
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +++++++++---------------
2 files changed, 11 insertions(+), 24 deletions(-)
diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5..448d1fa 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
struct cavium_ptp;
-struct xcast_addr {
- struct list_head list;
- u64 addr;
-};
-
struct xcast_addr_list {
- struct list_head list;
int count;
+ u64 mc[];
};
struct nicvf_work {
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31f..6bd5658 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
work.work);
struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
union nic_mbx mbx = {};
- struct xcast_addr *xaddr, *next;
+ int idx = 0;
if (!vf_work)
return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
/* check if we have any specific MACs to be added to PF DMAC filter */
if (vf_work->mc) {
/* now go through kernel list of MACs and add them one by one */
- list_for_each_entry_safe(xaddr, next,
- &vf_work->mc->list, list) {
+ for (idx = 0; idx < vf_work->mc->count; idx++) {
mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
- mbx.xcast.data.mac = xaddr->addr;
+ mbx.xcast.data.mac = vf_work->mc->mc[idx];
nicvf_send_msg_to_pf(nic, &mbx);
-
- /* after receiving ACK from PF release memory */
- list_del(&xaddr->list);
- kfree(xaddr);
- vf_work->mc->count--;
}
kfree(vf_work->mc);
}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
mode |= BGX_XCAST_MCAST_FILTER;
/* here we need to copy mc addrs */
if (netdev_mc_count(netdev)) {
- struct xcast_addr *xaddr;
-
- mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
- INIT_LIST_HEAD(&mc_list->list);
+ mc_list = kmalloc(offsetof(typeof(*mc_list),
+ mc[netdev_mc_count(netdev)]),
+ GFP_ATOMIC);
+ if (unlikely(!mc_list))
+ return;
+ mc_list->count = 0;
netdev_hw_addr_list_for_each(ha, &netdev->mc) {
- xaddr = kmalloc(sizeof(*xaddr),
- GFP_ATOMIC);
- xaddr->addr =
+ mc_list->mc[mc_list->count] =
ether_addr_to_u64(ha->addr);
- list_add_tail(&xaddr->list,
- &mc_list->list);
mc_list->count++;
}
}
--
1.8.3.1
^ permalink raw reply related
* kernel BUG at drivers/vhost/vhost.c:LINE! (2)
From: syzbot @ 2018-04-06 19:02 UTC (permalink / raw)
To: jasowang, kvm, linux-kernel, mst, netdev, syzkaller-bugs,
virtualization
Hello,
syzbot hit the following crash on upstream commit
38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +0000)
Merge tag 'armsoc-drivers' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
So far this crash happened 4 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6586748079439872
syzkaller reproducer:
https://syzkaller.appspot.com/x/repro.syz?id=5974272052822016
Raw console output:
https://syzkaller.appspot.com/x/log.txt?id=6224632407392256
Kernel config:
https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
compiler: gcc (GCC) 8.0.1 20180301 (experimental)
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for
details.
If you forward the report, please keep this part and the footer.
------------[ cut here ]------------
kernel BUG at drivers/vhost/vhost.c:1652!
invalid opcode: 0000 [#1] SMP KASAN
------------[ cut here ]------------
Dumping ftrace buffer:
kernel BUG at drivers/vhost/vhost.c:1652!
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4461 Comm: syzkaller684218 Not tainted 4.16.0+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:set_bit_to_user drivers/vhost/vhost.c:1652 [inline]
RIP: 0010:log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676
RSP: 0018:ffff8801b256f920 EFLAGS: 00010293
RAX: ffff8801adc9e2c0 RBX: dffffc0000000000 RCX: ffffffff85924a0f
RDX: 0000000000000000 RSI: ffffffff85924cea RDI: 0000000000000005
RBP: ffff8801b256fa58 R08: ffff8801adc9e2c0 R09: ffffed003962412d
R10: ffff8801b256fad8 R11: ffff8801cb12096f R12: 0001ffffffffffff
R13: ffffed00364adf36 R14: 0000000000000000 R15: ffff8801b256fa30
FS: 00007fdf24b19700(0000) GS:ffff8801db100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020bf6000 CR3: 00000001ae6a7000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
vhost_update_used_flags+0x3af/0x4a0 drivers/vhost/vhost.c:1723
vhost_vq_init_access+0x117/0x590 drivers/vhost/vhost.c:1763
vhost_vsock_start drivers/vhost/vsock.c:446 [inline]
vhost_vsock_dev_ioctl+0x751/0x920 drivers/vhost/vsock.c:678
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:500 [inline]
do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
SYSC_ioctl fs/ioctl.c:708 [inline]
SyS_ioctl+0x24/0x30 fs/ioctl.c:706
do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4456c9
RSP: 002b:00007fdf24b18da8 EFLAGS: 00000297 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000006dac24 RCX: 00000000004456c9
RDX: 0000000020f82ffc RSI: 000000004004af61 RDI: 000000000000001b
RBP: 00000000006dac20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000297 R12: 6b636f73762d7473
R13: 6f68762f7665642f R14: fffffffffffffffc R15: 0000000000000007
Code: e8 7c 5e e4 fb 4c 89 ef e8 e4 16 06 fc 48 8d 85 58 ff ff ff 48 c1 e8
03 c6 04 18 f8 e9 46 ff ff ff 45 31 f6 eb 91 e8 56 5e e4 fb <0f> 0b e8 4f
5e e4 fb 48 c7 c6 a0 a3 24 88 4c 89 ef e8 60 b6 10
RIP: set_bit_to_user drivers/vhost/vhost.c:1652 [inline] RSP:
ffff8801b256f920
RIP: log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676 RSP: ffff8801b256f920
invalid opcode: 0000 [#2] SMP KASAN
---[ end trace 0d0ff45aa44d8a23 ]---
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.
syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox