Netdev List

* Re: [PATCH v2 2/2] i40e: add support for XDP_REDIRECT
From: Alexander Duyck @ 2018-03-22 14:52 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Jesper Dangaard Brouer, Jeff Kirsher, intel-wired-lan,
	Björn Töpel, Karlsson, Magnus, Netdev,
	Duyck, Alexander H
In-Reply-To: <CAJ+HfNiGWgJEE9K1rh3OrgixFaTJE4mDCBzYVGQJv3R8_rpLTQ@mail.gmail.com>

On Thu, Mar 22, 2018 at 5:20 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> 2018-03-22 12:58 GMT+01:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>>
>> On Thu, 22 Mar 2018 10:03:07 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:
>>
>>> +/**
>>> + * i40e_xdp_xmit - Implements ndo_xdp_xmit
>>> + * @dev: netdev
>>> + * @xdp: XDP buffer
>>> + *
>>> + * Returns Zero if sent, else an error code
>>> + **/
>>> +int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
>>> +{
>>
>> The return code is used by the XDP redirect tracepoint... this is the
>> only way we have to debug/troubleshoot runtime issues with XDP. Thus,
>> these need to be consistent across drivers and distinguishable.
>>
>
> Thanks for pointing this out! I'll address all your comments and do a
> respin (but I'll wait for Alex' comments, if any).
>
>
> Björn
>

The patch mostly looks okay to me. Maybe a bit of reverse xmas tree
formatting needs to be addressed for the variable declarations in your
two new functions but that is about it in terms of what I see.

- Alex

* Re: [RFC PATCH 1/5] net: macb: Check MDIO state before read/write and use timeouts
From: Andrew Lunn @ 2018-03-22 14:47 UTC (permalink / raw)
  To: harinikatakamlinux
  Cc: nicolas.ferre, davem, netdev, linux-kernel, harinik, michals,
	appanad, Shubhrajyoti Datta
In-Reply-To: <1521726700-22634-2-git-send-email-harinikatakamlinux@gmail.com>

On Thu, Mar 22, 2018 at 07:21:36PM +0530, harinikatakamlinux@gmail.com wrote:
> From: Harini Katakam <harinik@xilinx.com>
> 
> Replace the while loop in MDIO read/write functions with a timeout.
> In addition, add a check for MDIO bus busy before initiating a new
> operation as well to make sure there is no ongoing MDIO operation.
> 
> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xilinx.com>
> Signed-off-by: Harini Katakam <harinik@xilinx.com>
> ---
>  drivers/net/ethernet/cadence/macb_main.c | 54 ++++++++++++++++++++++++++++++--
>  1 file changed, 52 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index d09bd43..f4030c1 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -321,6 +321,21 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
>  {
>  	struct macb *bp = bus->priv;
>  	int value;
> +	ulong timeout;
> +
> +	timeout = jiffies + msecs_to_jiffies(1000);
> +	/* wait for end of transfer */
> +	do {
> +		if (MACB_BFEXT(IDLE, macb_readl(bp, NSR)))
> +			break;
> +
> +		cpu_relax();
> +	} while (!time_after_eq(jiffies, timeout));
> +
> +	if (time_after_eq(jiffies, timeout)) {
> +		netdev_err(bp->dev, "wait for end of transfer timed out\n");
> +		return -ETIMEDOUT;
> +	}

Hi Harini

It looks like you have repeated the same code 4 times. Please move it
into a helper function.

     Andrew
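
A minimal user-space sketch of the kind of helper suggested above. It is
not the macb driver code: struct fake_macb, fake_mdio_idle() and the
max_polls bound are hypothetical stand-ins for the real register read and
the jiffies-based timeout in the patch. The point is only that the
wait-for-idle loop lives in one function and each MDIO accessor calls it.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; models only the IDLE status bit. */
struct fake_macb {
	int polls_until_idle;	/* number of status reads that report "busy" */
};

/* Stand-in for MACB_BFEXT(IDLE, macb_readl(bp, NSR)). */
bool fake_mdio_idle(struct fake_macb *bp)
{
	if (bp->polls_until_idle > 0) {
		bp->polls_until_idle--;
		return false;
	}
	return true;
}

/* The single shared helper: poll until idle or give up after max_polls. */
int fake_mdio_wait_for_idle(struct fake_macb *bp, int max_polls)
{
	int i;

	for (i = 0; i < max_polls; i++) {
		if (fake_mdio_idle(bp))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Each accessor calls the helper instead of open-coding the loop. */
int fake_mdio_read(struct fake_macb *bp, int max_polls)
{
	int err = fake_mdio_wait_for_idle(bp, max_polls);

	if (err < 0)
		return err;
	return 0x1234;	/* placeholder register value */
}
```

In the driver itself the bound would presumably stay time-based, as in
the posted patch, rather than a poll count.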

* [PATCH v2 net-next] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
From: Jay Vosburgh @ 2018-03-22 14:42 UTC (permalink / raw)
  To: netdev; +Cc: Michael S. Tsirkin, Jason Wang, David Miller, Ben Hutchings

	The operstate update logic will leave an interface in the
default UNKNOWN operstate if the interface carrier state never changes
from the default carrier up state set at creation.  This includes the
case of an explicit call to netif_carrier_on, as the carrier on to on
transition has no effect on operstate.

	This affects virtio-net for the case that the virtio peer does
not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
updates).  Without this feature, the virtio specification states that
"the link should be assumed active," so, logically, the operstate should
be UP instead of UNKNOWN.  This has impact on user space applications
that use the operstate to make availability decisions for the interface.

	Resolve this by changing the virtio probe logic slightly to call
netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
cases, and then the existing call to netif_carrier_on for the "without"
case will cause an operstate transition.
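
	As a rough illustration of the behaviour described above, here
is a user-space model (not kernel code). The fake_* names are
hypothetical, and the model assumes operstate is updated directly on a
carrier change, whereas the kernel defers that to linkwatch. It shows
why an on-to-on call leaves UNKNOWN in place while off-then-on yields UP.

```c
#include <stdbool.h>

enum fake_operstate { FAKE_OPER_UNKNOWN, FAKE_OPER_DOWN, FAKE_OPER_UP };

struct fake_netdev {
	bool carrier_ok;
	enum fake_operstate operstate;
};

/* New devices start with carrier up and operstate UNKNOWN. */
void fake_netdev_init(struct fake_netdev *dev)
{
	dev->carrier_ok = true;
	dev->operstate = FAKE_OPER_UNKNOWN;
}

/* Only an actual off -> on transition updates operstate. */
void fake_carrier_on(struct fake_netdev *dev)
{
	if (dev->carrier_ok)
		return;
	dev->carrier_ok = true;
	dev->operstate = FAKE_OPER_UP;
}

/* Only an actual on -> off transition updates operstate. */
void fake_carrier_off(struct fake_netdev *dev)
{
	if (!dev->carrier_ok)
		return;
	dev->carrier_ok = false;
	dev->operstate = FAKE_OPER_DOWN;
}

/* Models the "without VIRTIO_NET_F_STATUS" probe path. */
enum fake_operstate fake_probe_no_status(bool patched)
{
	struct fake_netdev dev;

	fake_netdev_init(&dev);
	if (patched)
		fake_carrier_off(&dev);	/* the call this patch hoists */
	fake_carrier_on(&dev);
	return dev.operstate;
}
```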

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>

---

	I considered resolving this by changing linkwatch_init_dev to
unconditionally call rfc2863_policy, as that would always set operstate
for all interfaces.

	This would not have any impact on most cases (as most drivers
call netif_carrier_off during probe), except for the loopback device,
which currently has an operstate of UNKNOWN (because it never does any
carrier state transitions).  This change would add a round trip on the
dev_base_lock for every loopback device creation, which could have a
negative impact when creating many loopback devices, e.g., when
concurrently creating large numbers of containers.


 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..7b187ec7411e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Assume link up if device can't report link status,
 	   otherwise get link status from config. */
+	netif_carrier_off(dev);
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
-		netif_carrier_off(dev);
 		schedule_work(&vi->config_work);
 	} else {
 		vi->status = VIRTIO_NET_S_LINK_UP;
-- 
2.14.1

* Re: [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
From: Roopa Prabhu @ 2018-03-22 14:40 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, David Miller, Ido Schimmel, Jakub Kicinski, mlxsw,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Michael Chan,
	ganeshgr, Saeed Mahameed, Simon Horman, pieter.jansenvanvuuren,
	John Hurley, Dirk van der Merwe, Alexander Duyck, Or Gerlitz,
	David Ahern, vijaya.guvva, satananda.burla, raghu.vatsavayi,
	felix.manlunas, Andy Gospodarek, sathya.perla, vasundhara-v.volam,
	tariqt, eranbe, Jeff Kirsher
In-Reply-To: <20180322105522.8186-1-jiri@resnulli.us>

On Thu, Mar 22, 2018 at 3:55 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> This patchset resolves 2 issues we have right now:
> 1) There are many netdevices / ports in the system, for port, pf, vf
>    representation but the user has no way to see which is which
> 2) The ndo_get_phys_port_name is implemented in each driver separately,
>    which may lead to inconsistent names between drivers.
>
> This patchset introduces port flavours which should address the first
> problem. I'm testing this with Netronome nfp hardware. When the user
> has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:
> # devlink port
> pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
> pci/0000:05:00.0/268435456: type eth netdev eth0 flavour physical number 0
> pci/0000:05:00.0/268435460: type eth netdev enp5s0np1 flavour physical number 1
> pci/0000:05:00.0/536875008: type eth netdev eth2 flavour pf_rep number 536875008
> pci/0000:05:00.0/536870912: type eth netdev eth1 flavour vf_rep number 0
> pci/0000:05:00.0/536870976: type eth netdev eth3 flavour vf_rep number 1
> pci/0000:05:00.0/536871040: type eth netdev eth4 flavour vf_rep number 2
> pci/0000:05:00.0/536871104: type eth netdev eth5 flavour vf_rep number 3
>
> The indexes are weird numbers now. That needs to be fixed. Also, netdev
> renaming does not work correctly for me now for some reason.
> Also, there is one extra port whose purpose I don't understand -
> something nfp specific perhaps.
>
> The desired output should look like this:
> # devlink port
> pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
> pci/0000:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
> pci/0000:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
> pci/0000:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
> pci/0000:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
> pci/0000:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
> pci/0000:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3
>
> As you can see, the netdev names are generated according to the flavour
> and port number. In case the port is split, the split subnumber is also
> included.
>
> I tested this for mlxsw and nfp. I have no way to test this on DSA hw,
> I would really appreciate it if the DSA guys could test this. Thanks!
>

Nice series. I like that the user can query a port's flavor (I get
asked for this all the time).

thanks
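
The naming rule from the cover letter (flavour plus port number, plus a
split subnumber when the port is split) could look roughly like the
user-space sketch below. This is not the code from the series:
fake_phys_port_name() and the enum are hypothetical, the "p", "pf" and
"vf" prefixes are read off the desired output quoted above, and the
"s<subnumber>" suffix for split ports is an assumption.

```c
#include <stddef.h>
#include <stdio.h>

enum fake_flavour { FAKE_PHYSICAL, FAKE_PF_REP, FAKE_VF_REP };

/*
 * Build the phys_port_name part of the netdev name, e.g. the "p0" in
 * "enp5s0np0". Returns 0 on success, -1 on bad input or truncation.
 */
int fake_phys_port_name(char *buf, size_t len, enum fake_flavour flavour,
			unsigned int number, int split,
			unsigned int subnumber)
{
	int n;

	switch (flavour) {
	case FAKE_PHYSICAL:
		if (split)	/* assumed format for split ports */
			n = snprintf(buf, len, "p%us%u", number, subnumber);
		else
			n = snprintf(buf, len, "p%u", number);
		break;
	case FAKE_PF_REP:
		n = snprintf(buf, len, "pf%u", number);
		break;
	case FAKE_VF_REP:
		n = snprintf(buf, len, "vf%u", number);
		break;
	default:
		return -1;
	}

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```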

* Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
From: Jay Vosburgh @ 2018-03-22 14:34 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, Jason Wang, David Miller, Ben Hutchings
In-Reply-To: <20180322160132-mutt-send-email-mst@kernel.org>

Michael S. Tsirkin <mst@redhat.com> wrote:

>On Thu, Mar 22, 2018 at 12:02:10PM +0000, Jay Vosburgh wrote:
>> Michael S. Tsirkin <mst@redhat.com> wrote:
>> 
>> >On Thu, Mar 22, 2018 at 09:05:52AM +0000, Jay Vosburgh wrote:
>> >> 	The operstate update logic will leave an interface in the
>> >> default UNKNOWN operstate if the interface carrier state never changes
>> >> from the default carrier up state set at creation.  This includes the
>> >> case of an explicit call to netif_carrier_on, as the carrier on to on
>> >> transition has no effect on operstate.
>> >> 
>> >> 	This affects virtio-net for the case that the virtio peer does
>> >> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
>> >> updates).  Without this feature, the virtio specification states that
>> >> "the link should be assumed active," so, logically, the operstate should
>> >> be UP instead of UNKNOWN.  This has impact on user space applications
>> >> that use the operstate to make availability decisions for the interface.
>> >> 
>> >> 	Resolve this by changing the virtio probe logic slightly to call
>> >> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
>> >> cases, and then the existing call to netif_carrier_on for the "without"
>> >> case will cause an operstate transition.
>> >> 
>> >> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> >> Cc: Jason Wang <jasowang@redhat.com>
>> >> Cc: Ben Hutchings <ben@decadent.org.uk>
>> >> Fixes: 167c25e4c550 ("virtio-net: init link state correctly")
>> >
>> >I'd say that's an abuse of this notation. operstate was UNKNOWN
>> >even before that fix.
>> 
>> 	I went back to the commit that added the dependency on
>> VIRTIO_NET_F_STATUS (and that this patch would thus apply on top of).
>> If that's an issue, I can resubmit without it.
>> 
>> 	-J
>
>The patch can be trivially backported to any version that has virtio.
>
>The issue was present since the original version of virtio.
>VIRTIO_NET_F_STATUS fixed it for new devices.
>So the tag is incorrectly blaming a partial fix for not being a full
>one.
>
>Also, I think it's more appropriate for net-next - it's a
>minor ABI change (previously presence of VIRTIO_NET_F_STATUS
>could be detected by looking at operstate, now it can't).
>Hopefully this makes more apps work than it breaks.
>
>So yes, pls repost without Fixes and with net-next unless
>davem can make the change himself.

	Reposting with requested changes.

	-J

>> >> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>> >
>> >Acked-by: Michael S. Tsirkin <mst@redhat.com>
>> >
>> >
>> >> ---
>> >> 
>> >> 	I considered resolving this by changing linkwatch_init_dev to
>> >> unconditionally call rfc2863_policy, as that would always set operstate
>> >> for all interfaces.
>> >> 
>> >> 	This would not have any impact on most cases (as most drivers
>> >> call netif_carrier_off during probe), except for the loopback device,
>> >> which currently has an operstate of UNKNOWN (because it never does any
>> >> carrier state transitions).  This change would add a round trip on the
>> >> dev_base_lock for every loopback device creation, which could have a
>> >> negative impact when creating many loopback devices, e.g., when
>> >> concurrently creating large numbers of containers.
>> >> 
>> >> 
>> >>  drivers/net/virtio_net.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> 
>> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> index 23374603e4d9..7b187ec7411e 100644
>> >> --- a/drivers/net/virtio_net.c
>> >> +++ b/drivers/net/virtio_net.c
>> >> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>> >>  
>> >>  	/* Assume link up if device can't report link status,
>> >>  	   otherwise get link status from config. */
>> >> +	netif_carrier_off(dev);
>> >>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
>> >> -		netif_carrier_off(dev);
>> >>  		schedule_work(&vi->config_work);
>> >>  	} else {
>> >>  		vi->status = VIRTIO_NET_S_LINK_UP;
>> >> -- 
>> >> 2.14.1

* Re: [PATCH net-next 1/1] net/ipv4: disable SMC TCP option with SYN Cookies
From: Eric Dumazet @ 2018-03-22 14:30 UTC (permalink / raw)
  To: Ursula Braun, davem
  Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl
In-Reply-To: <9b88f9f8-1d3b-7d7d-f612-b823069afa75@linux.vnet.ibm.com>



On 03/22/2018 06:23 AM, Ursula Braun wrote:

> We moved the clear to cookie_v4_check()/cookie_v6_check. However, this does not seem to
> be sufficient to prevent the SYNACK from containing the SMC experimental option.
> We found that an additional check in tcp_conn_request() helps:
> 
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6248,6 +6248,9 @@ int tcp_conn_request(struct request_sock
>  	if (want_cookie && !tmp_opt.saw_tstamp)
>  		tcp_clear_options(&tmp_opt);
>  
> +	if (IS_ENABLED(CONFIG_SMC) && want_cookie && tmp_opt.smc_ok)
> +		tmp_opt.smc_ok = 0;
> +
>  	tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
>  	tcp_openreq_init(req, &tmp_opt, skb, sk);
>  	inet_rsk(req)->no_srccheck = inet_sk(sk)->transparent;
> 
> Do you think this could be the right place for clearing the smc_ok bit?


Yes, but since tmp_opt is private to this thread/cpu, there is no false sharing
to be afraid of, so there is no need to test smc_ok before clearing it:

if (IS_ENABLED(CONFIG_SMC) && want_cookie)
    tmp_opt.smc_ok = 0;
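
A small user-space check that the two forms leave the option in the same
state for every input. struct fake_tcp_opts and the fake_* functions are
hypothetical stand-ins for the on-stack tmp_opt, and the
IS_ENABLED(CONFIG_SMC) test is left out because it is identical in both.

```c
/* Stand-in for the one relevant bit of the received TCP options. */
struct fake_tcp_opts {
	unsigned int smc_ok : 1;
};

/* Form from the quoted patch: test the bit before clearing it. */
void fake_clear_guarded(struct fake_tcp_opts *opt, int want_cookie)
{
	if (want_cookie && opt->smc_ok)
		opt->smc_ok = 0;
}

/* Simplified form: clear unconditionally when cookies are wanted. */
void fake_clear_simple(struct fake_tcp_opts *opt, int want_cookie)
{
	if (want_cookie)
		opt->smc_ok = 0;
}

/* Returns 1 if both forms agree for all four input combinations. */
int fake_variants_agree(void)
{
	int want_cookie, smc_ok;

	for (want_cookie = 0; want_cookie <= 1; want_cookie++) {
		for (smc_ok = 0; smc_ok <= 1; smc_ok++) {
			struct fake_tcp_opts a = { .smc_ok = smc_ok };
			struct fake_tcp_opts b = { .smc_ok = smc_ok };

			fake_clear_guarded(&a, want_cookie);
			fake_clear_simple(&b, want_cookie);
			if (a.smc_ok != b.smc_ok)
				return 0;
		}
	}
	return 1;
}
```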

* [PATCH net-next 9/9] net: hns3: Changes required in PF mailbox to support VF reset
From: Salil Mehta @ 2018-03-22 14:29 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

PF needs to assert the VF reset when it receives the request to
reset from VF. After receiving the request, PF acknowledges it
by replying with an MBX_ASSERTING_RESET message to the VF.
VF then goes to pending state and waits for hardware to complete
the reset.

This patch contains code to handle the received VF message, inform
the VF of assertion and reset the VF using cmdq interface.
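
The ordering described above (acknowledge first, reset only if the
acknowledgement was delivered) can be modelled in user space roughly as
follows. All fake_* names are hypothetical; the real driver goes through
the mailbox and cmdq interfaces rather than touching VF state directly.

```c
enum fake_mbx_code { FAKE_MBX_RESET = 1, FAKE_MBX_ASSERTING_RESET };

/* Hypothetical model of one VF as seen from the PF. */
struct fake_vf {
	int mailbox_broken;	/* non-zero: PF -> VF messages fail */
	int reset_pending;	/* set once the VF was told about the reset */
	int resets_done;	/* number of function resets issued */
};

/* PF -> VF message; fails if the mailbox is unusable. */
int fake_pf_send(struct fake_vf *vf, enum fake_mbx_code code)
{
	if (vf->mailbox_broken)
		return -1;
	if (code == FAKE_MBX_ASSERTING_RESET)
		vf->reset_pending = 1;
	return 0;
}

/* PF side handling of a message received from the VF. */
int fake_pf_handle_vf_msg(struct fake_vf *vf, enum fake_mbx_code code)
{
	if (code != FAKE_MBX_RESET)
		return -1;

	/* Inform the VF first; skip the reset if that fails. */
	if (fake_pf_send(vf, FAKE_MBX_ASSERTING_RESET))
		return -1;

	vf->resets_done++;	/* stands in for the function reset command */
	return 0;
}
```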

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    |  2 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |  1 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 42 ++++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index a3e00da..bede411 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2749,7 +2749,7 @@ static int hclge_reset_wait(struct hclge_dev *hdev)
 	return 0;
 }
 
-static int hclge_func_reset_cmd(struct hclge_dev *hdev, int func_id)
+int hclge_func_reset_cmd(struct hclge_dev *hdev, int func_id)
 {
 	struct hclge_desc desc;
 	struct hclge_reset_cmd *req = (struct hclge_reset_cmd *)desc.data;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 8c14d10..0f4157e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -657,4 +657,5 @@ void hclge_mbx_handler(struct hclge_dev *hdev);
 void hclge_reset_tqp(struct hnae3_handle *handle, u16 queue_id);
 void hclge_reset_vf_queue(struct hclge_vport *vport, u16 queue_id);
 int hclge_cfg_flowctrl(struct hclge_dev *hdev);
+int hclge_func_reset_cmd(struct hclge_dev *hdev, int func_id);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
index 949da0c..3901333 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
@@ -79,6 +79,18 @@ static int hclge_send_mbx_msg(struct hclge_vport *vport, u8 *msg, u16 msg_len,
 	return status;
 }
 
+int hclge_inform_reset_assert_to_vf(struct hclge_vport *vport)
+{
+	u8 msg_data[2];
+	u8 dest_vfid;
+
+	dest_vfid = (u8)vport->vport_id;
+
+	/* send this requested info to VF */
+	return hclge_send_mbx_msg(vport, msg_data, sizeof(u8),
+				  HCLGE_MBX_ASSERTING_RESET, dest_vfid);
+}
+
 static void hclge_free_vector_ring_chain(struct hnae3_ring_chain_node *head)
 {
 	struct hnae3_ring_chain_node *chain_tmp, *chain;
@@ -339,6 +351,33 @@ static void hclge_mbx_reset_vf_queue(struct hclge_vport *vport,
 	hclge_gen_resp_to_vf(vport, mbx_req, 0, NULL, 0);
 }
 
+static void hclge_reset_vf(struct hclge_vport *vport,
+			   struct hclge_mbx_vf_to_pf_cmd *mbx_req)
+{
+	struct hclge_dev *hdev = vport->back;
+	int ret;
+
+	dev_warn(&hdev->pdev->dev, "PF received VF reset request from VF %d!\n",
+		 mbx_req->mbx_src_vfid);
+
+	/* Acknowledge VF that PF is now about to assert the reset for the VF.
+	 * On receiving this message VF will get into pending state and will
+	 * start polling for the hardware reset completion status.
+	 */
+	ret = hclge_inform_reset_assert_to_vf(vport);
+	if (ret) {
+		dev_err(&hdev->pdev->dev,
+			"PF fail(%d) to inform VF(%d) of reset, reset failed!\n",
+			ret, vport->vport_id);
+		return;
+	}
+
+	dev_warn(&hdev->pdev->dev, "PF is now resetting VF %d.\n",
+		 mbx_req->mbx_src_vfid);
+	/* reset this virtual function */
+	hclge_func_reset_cmd(hdev, mbx_req->mbx_src_vfid);
+}
+
 void hclge_mbx_handler(struct hclge_dev *hdev)
 {
 	struct hclge_cmq_ring *crq = &hdev->hw.cmq.crq;
@@ -416,6 +455,9 @@ void hclge_mbx_handler(struct hclge_dev *hdev)
 		case HCLGE_MBX_QUEUE_RESET:
 			hclge_mbx_reset_vf_queue(vport, req);
 			break;
+		case HCLGE_MBX_RESET:
+			hclge_reset_vf(vport, req);
+			break;
 		default:
 			dev_err(&hdev->pdev->dev,
 				"un-supported mailbox message, code = %d\n",
-- 
2.7.4

* [PATCH net-next 8/9] net: hns3: Add *Asserting Reset* mailbox message & handling in VF
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

The Reset Asserting message is sent by the PF to inform the VF about
the hardware reset which is about to happen. This might be due
to an earlier VF reset request received by the PF or because the PF
for any reason decides to undergo reset. On receiving this message
the VF goes into pending state, in which it polls the hardware until
the reset completes and then resets/tears down its own stack.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h          |  1 +
 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
index f3e90c2..519e2bd 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
@@ -11,6 +11,7 @@
 
 enum HCLGE_MBX_OPCODE {
 	HCLGE_MBX_RESET = 0x01,		/* (VF -> PF) assert reset */
+	HCLGE_MBX_ASSERTING_RESET,	/* (PF -> VF) PF is asserting reset */
 	HCLGE_MBX_SET_UNICAST,		/* (VF -> PF) set UC addr */
 	HCLGE_MBX_SET_MULTICAST,	/* (VF -> PF) set MC addr */
 	HCLGE_MBX_SET_VLAN,		/* (VF -> PF) set VLAN */
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
index 7687911..a286184 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
@@ -170,6 +170,7 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 			}
 			break;
 		case HCLGE_MBX_LINK_STAT_CHANGE:
+		case HCLGE_MBX_ASSERTING_RESET:
 			/* set this mbx event as pending. This is required as we
 			 * might lose interrupt event when mbx task is busy
 			 * handling. This shall be cleared when mbx task just
@@ -242,6 +243,17 @@ void hclgevf_mbx_async_handler(struct hclgevf_dev *hdev)
 			hclgevf_update_speed_duplex(hdev, speed, duplex);
 
 			break;
+		case HCLGE_MBX_ASSERTING_RESET:
+			/* PF has asserted reset hence VF should go in pending
+			 * state and poll for the hardware reset status till it
+			 * has been completely reset. After this stack should
+			 * eventually be re-initialized.
+			 */
+			hdev->nic.reset_level = HNAE3_VF_RESET;
+			set_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state);
+			hclgevf_reset_task_schedule(hdev);
+
+			break;
 		default:
 			dev_err(&hdev->pdev->dev,
 				"fetched unsupported(%d) message from arq\n",
-- 
2.7.4

* [PATCH net-next 7/9] net: hns3: Changes to support ARQ(Asynchronous Receive Queue)
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

The current mailbox CRQ can contain both synchronous and async
responses from the PF. Synchronous responses are time critical
and should be handed over to the waiting tasks/context as quickly
as possible, otherwise a timeout occurs.

The above problem is accentuated if the CRQ contains even a single
async message. Hence, it is important to have quick handling of
synchronous messages and possibly deferred handling of async messages.
This patch introduces a separate ARQ (async receive queue) for the
async messages. These messages are processed later, in the context
of the mailbox task, while synchronous messages still get processed
in the context of the mailbox interrupt.

ARQ is important as VF reset introduces some new async messages
like MBX_ASSERTING_RESET, which add to the pressure on the
responses for synchronous messages, so they time out even more
quickly.
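
As an illustration of the queueing idea only, here is a user-space sketch
of a fixed-size message ring with a drop-when-full policy. The fake_*
names and the tiny slot count are hypothetical. Wrapping the indices at
the number of slots and copying a whole message by its size in bytes are
choices of this sketch and not a statement about the driver code.

```c
#include <stdint.h>
#include <string.h>

#define FAKE_ARQ_MSG_SIZE	8	/* 16-bit words per message */
#define FAKE_ARQ_MSG_NUM	4	/* slots; kept tiny for illustration */

struct fake_arq {
	uint32_t head;	/* next slot to consume */
	uint32_t tail;	/* next slot to fill */
	uint32_t count;	/* messages currently queued */
	uint16_t msg_q[FAKE_ARQ_MSG_NUM][FAKE_ARQ_MSG_SIZE];
};

/* Fast path: queue an async message, or drop it if the ring is full. */
int fake_arq_push(struct fake_arq *arq, const uint16_t *msg)
{
	if (arq->count >= FAKE_ARQ_MSG_NUM)
		return -1;

	/* sizeof() gives the slot size in bytes, not in words */
	memcpy(arq->msg_q[arq->tail], msg, sizeof(arq->msg_q[0]));
	arq->tail = (arq->tail + 1) % FAKE_ARQ_MSG_NUM;
	arq->count++;
	return 0;
}

/* Slow path: take the oldest queued message, if any. */
int fake_arq_pop(struct fake_arq *arq, uint16_t *msg)
{
	if (arq->count == 0)
		return -1;

	memcpy(msg, arq->msg_q[arq->head], sizeof(arq->msg_q[0]));
	arq->head = (arq->head + 1) % FAKE_ARQ_MSG_NUM;
	arq->count--;
	return 0;
}
```

In this model the interrupt handler would call fake_arq_push() and the
deferred mailbox task would drain the ring with fake_arq_pop().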

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h    | 15 ++++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c   |  6 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 16 +++--
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  5 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 83 +++++++++++++++++++---
 5 files changed, 111 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
index e6e1d22..f3e90c2 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
@@ -85,6 +85,21 @@ struct hclge_mbx_pf_to_vf_cmd {
 	u16 msg[8];
 };
 
+/* used by VF to store the received Async responses from PF */
+struct hclgevf_mbx_arq_ring {
+#define HCLGE_MBX_MAX_ARQ_MSG_SIZE	8
+#define HCLGE_MBX_MAX_ARQ_MSG_NUM	1024
+	struct hclgevf_dev *hdev;
+	u32 head;
+	u32 tail;
+	u32 count;
+	u16 msg_q[HCLGE_MBX_MAX_ARQ_MSG_NUM][HCLGE_MBX_MAX_ARQ_MSG_SIZE];
+};
+
 #define hclge_mbx_ring_ptr_move_crq(crq) \
 	(crq->next_to_use = (crq->next_to_use + 1) % crq->desc_num)
+#define hclge_mbx_tail_ptr_move_arq(arq) \
+	(arq.tail = (arq.tail + 1) % HCLGE_MBX_MAX_ARQ_MSG_NUM)
+#define hclge_mbx_head_ptr_move_arq(arq) \
+	(arq.head = (arq.head + 1) % HCLGE_MBX_MAX_ARQ_MSG_NUM)
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
index 85985e7..1bbfe13 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c
@@ -315,6 +315,12 @@ int hclgevf_cmd_init(struct hclgevf_dev *hdev)
 		goto err_csq;
 	}
 
+	/* initialize the pointers of async rx queue of mailbox */
+	hdev->arq.hdev = hdev;
+	hdev->arq.head = 0;
+	hdev->arq.tail = 0;
+	hdev->arq.count = 0;
+
 	/* get firmware version */
 	ret = hclgevf_cmd_query_firmware_version(&hdev->hw, &version);
 	if (ret) {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 6dd7561..2b84264 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1010,10 +1010,13 @@ void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev)
 	}
 }
 
-static void hclgevf_mbx_task_schedule(struct hclgevf_dev *hdev)
+void hclgevf_mbx_task_schedule(struct hclgevf_dev *hdev)
 {
-	if (!test_and_set_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state))
+	if (!test_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state) &&
+	    !test_bit(HCLGEVF_STATE_MBX_HANDLING, &hdev->state)) {
+		set_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state);
 		schedule_work(&hdev->mbx_service_task);
+	}
 }
 
 static void hclgevf_task_schedule(struct hclgevf_dev *hdev)
@@ -1025,6 +1028,10 @@ static void hclgevf_task_schedule(struct hclgevf_dev *hdev)
 
 static void hclgevf_deferred_task_schedule(struct hclgevf_dev *hdev)
 {
+	/* if we have any pending mailbox event then schedule the mbx task */
+	if (hdev->mbx_event_pending)
+		hclgevf_mbx_task_schedule(hdev);
+
 	if (test_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state))
 		hclgevf_reset_task_schedule(hdev);
 }
@@ -1118,7 +1125,7 @@ static void hclgevf_mailbox_service_task(struct work_struct *work)
 
 	clear_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state);
 
-	hclgevf_mbx_handler(hdev);
+	hclgevf_mbx_async_handler(hdev);
 
 	clear_bit(HCLGEVF_STATE_MBX_HANDLING, &hdev->state);
 }
@@ -1178,8 +1185,7 @@ static irqreturn_t hclgevf_misc_irq_handle(int irq, void *data)
 	if (!hclgevf_check_event_cause(hdev, &clearval))
 		goto skip_sched;
 
-	/* schedule the VF mailbox service task, if not already scheduled */
-	hclgevf_mbx_task_schedule(hdev);
+	hclgevf_mbx_handler(hdev);
 
 	hclgevf_clear_event_cause(hdev, clearval);
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
index 8cdc602..a477a7c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
@@ -152,7 +152,9 @@ struct hclgevf_dev {
 	int *vector_irq;
 
 	bool accept_mta_mc; /* whether to accept mta filter multicast */
+	bool mbx_event_pending;
 	struct hclgevf_mbx_resp_status mbx_resp; /* mailbox response */
+	struct hclgevf_mbx_arq_ring arq; /* mailbox async rx queue */
 
 	struct timer_list service_timer;
 	struct work_struct service_task;
@@ -187,8 +189,11 @@ int hclgevf_send_mbx_msg(struct hclgevf_dev *hdev, u16 code, u16 subcode,
 			 const u8 *msg_data, u8 msg_len, bool need_resp,
 			 u8 *resp_data, u16 resp_len);
 void hclgevf_mbx_handler(struct hclgevf_dev *hdev);
+void hclgevf_mbx_async_handler(struct hclgevf_dev *hdev);
+
 void hclgevf_update_link_status(struct hclgevf_dev *hdev, int link_state);
 void hclgevf_update_speed_duplex(struct hclgevf_dev *hdev, u32 speed,
 				 u8 duplex);
 void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev);
+void hclgevf_mbx_task_schedule(struct hclgevf_dev *hdev);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
index a63ed3a..7687911 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c
@@ -132,9 +132,8 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 	struct hclge_mbx_pf_to_vf_cmd *req;
 	struct hclgevf_cmq_ring *crq;
 	struct hclgevf_desc *desc;
-	u16 link_status, flag;
-	u32 speed;
-	u8 duplex;
+	u16 *msg_q;
+	u16 flag;
 	u8 *temp;
 	int i;
 
@@ -146,6 +145,12 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 		desc = &crq->desc[crq->next_to_use];
 		req = (struct hclge_mbx_pf_to_vf_cmd *)desc->data;
 
+		/* synchronous messages are time critical and need preferential
+		 * treatment. Therefore, we need to acknowledge all the sync
+		 * responses as quickly as possible so that waiting tasks do not
+		 * timeout and simultaneously queue the async messages for later
+		 * processing in context of mailbox task i.e. the slow path.
+		 */
 		switch (req->msg[0]) {
 		case HCLGE_MBX_PF_VF_RESP:
 			if (resp->received_resp)
@@ -165,13 +170,30 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 			}
 			break;
 		case HCLGE_MBX_LINK_STAT_CHANGE:
-			link_status = le16_to_cpu(req->msg[1]);
-			memcpy(&speed, &req->msg[2], sizeof(speed));
-			duplex = (u8)le16_to_cpu(req->msg[4]);
+			/* set this mbx event as pending. This is required as we
+			 * might lose interrupt event when mbx task is busy
+			 * handling. This shall be cleared when mbx task just
+			 * enters handling state.
+			 */
+			hdev->mbx_event_pending = true;
 
-			/* update upper layer with new link link status */
-			hclgevf_update_link_status(hdev, link_status);
-			hclgevf_update_speed_duplex(hdev, speed, duplex);
+			/* we will drop the async msg if we find ARQ as full
+			 * and continue with next message
+			 */
+			if (hdev->arq.count >= HCLGE_MBX_MAX_ARQ_MSG_NUM) {
+				dev_warn(&hdev->pdev->dev,
+					 "Async Q full, dropping msg(%d)\n",
+					 req->msg[1]);
+				break;
+			}
+
+			/* tail the async message in arq */
+			msg_q = hdev->arq.msg_q[hdev->arq.tail];
+			memcpy(&msg_q[0], req->msg, HCLGE_MBX_MAX_ARQ_MSG_SIZE * sizeof(u16));
+			hclge_mbx_tail_ptr_move_arq(hdev->arq);
+			hdev->arq.count++;
+
+			hclgevf_mbx_task_schedule(hdev);
 
 			break;
 		default:
@@ -189,3 +211,46 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev)
 	hclgevf_write_dev(&hdev->hw, HCLGEVF_NIC_CRQ_HEAD_REG,
 			  crq->next_to_use);
 }
+
+void hclgevf_mbx_async_handler(struct hclgevf_dev *hdev)
+{
+	u16 link_status;
+	u16 *msg_q;
+	u8 duplex;
+	u32 speed;
+	u32 tail;
+
+	/* we can safely clear it now as we are at start of the async message
+	 * processing
+	 */
+	hdev->mbx_event_pending = false;
+
+	tail = hdev->arq.tail;
+
+	/* process all the async queue messages */
+	while (tail != hdev->arq.head) {
+		msg_q = hdev->arq.msg_q[hdev->arq.head];
+
+		switch (msg_q[0]) {
+		case HCLGE_MBX_LINK_STAT_CHANGE:
+			link_status = le16_to_cpu(msg_q[1]);
+			memcpy(&speed, &msg_q[2], sizeof(speed));
+			duplex = (u8)le16_to_cpu(msg_q[4]);
+
+			/* update upper layer with new link status */
+			hclgevf_update_link_status(hdev, link_status);
+			hclgevf_update_speed_duplex(hdev, speed, duplex);
+
+			break;
+		default:
+			dev_err(&hdev->pdev->dev,
+				"fetched unsupported(%d) message from arq\n",
+				msg_q[0]);
+			break;
+		}
+
+		hclge_mbx_head_ptr_move_arq(hdev->arq);
+		hdev->arq.count--;
+		msg_q = hdev->arq.msg_q[hdev->arq.head];
+	}
+}
-- 
2.7.4

* [PATCH net-next 6/9] net: hns3: Add support to re-initialize the hclge device
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

After the hardware reset we should re-fetch the configuration, such
as queue info and tc info, from the PF. This might affect existing
allocations, such as those of the TQPs. Hence, we should release all
such allocations and re-allocate them according to the newly fetched
configuration after reset.
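
For readers skimming the series, the re-init idea can be pictured with
a rough userspace sketch. It is not the driver code: every sketch_*
name is made up for illustration, free()/calloc() stand in for
devm_kfree()/devm_kcalloc(), and a plain flag stands in for the
"ongoing VF reset" check.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device structure. */
struct sketch_dev {
	bool in_reset;			/* assumed "ongoing VF reset" flag */
	unsigned int pci_init_calls;	/* counts real one-time initializations */
	unsigned int num_tqps;
	int *tqps;
};

/* One-time setup: skipped when we are only re-initializing after a reset. */
static int sketch_pci_init(struct sketch_dev *d)
{
	if (d->in_reset)
		return 0;

	d->pci_init_calls++;
	return 0;
}

/* Queue memory is released and re-allocated, since the count may differ. */
static int sketch_alloc_tqps(struct sketch_dev *d, unsigned int num)
{
	if (d->in_reset) {
		free(d->tqps);
		d->tqps = NULL;
	}

	d->tqps = calloc(num, sizeof(*d->tqps));
	if (!d->tqps)
		return -ENOMEM;

	d->num_tqps = num;
	return 0;
}

/* Shared by first-time init and by re-init after a reset. */
static int sketch_init_dev(struct sketch_dev *d, unsigned int num_from_pf)
{
	int ret;

	ret = sketch_pci_init(d);
	if (ret)
		return ret;

	return sketch_alloc_tqps(d, num_from_pf);
}
```

The point of the sketch is only that one init path serves both cases,
with one-time steps guarded and size-dependent allocations redone.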

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 106 ++++++++++++++++++---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  14 +++
 2 files changed, 106 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index bd45b11..6dd7561 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -10,6 +10,8 @@
 
 #define HCLGEVF_NAME	"hclgevf"
 
+static int hclgevf_init_hdev(struct hclgevf_dev *hdev);
+static void hclgevf_uninit_hdev(struct hclgevf_dev *hdev);
 static struct hnae3_ae_algo ae_algovf;
 
 static const struct pci_device_id ae_algovf_pci_tbl[] = {
@@ -209,6 +211,12 @@ static int hclgevf_alloc_tqps(struct hclgevf_dev *hdev)
 	struct hclgevf_tqp *tqp;
 	int i;
 
+	/* if this is an ongoing reset then we need to re-allocate the TQPs
+	 * since we cannot assume we would get same number of TQPs back from PF
+	 */
+	if (hclgevf_dev_ongoing_reset(hdev))
+		devm_kfree(&hdev->pdev->dev, hdev->htqp);
+
 	hdev->htqp = devm_kcalloc(&hdev->pdev->dev, hdev->num_tqps,
 				  sizeof(struct hclgevf_tqp), GFP_KERNEL);
 	if (!hdev->htqp)
@@ -252,6 +260,12 @@ static int hclgevf_knic_setup(struct hclgevf_dev *hdev)
 	new_tqps = kinfo->rss_size * kinfo->num_tc;
 	kinfo->num_tqps = min(new_tqps, hdev->num_tqps);
 
+	/* during an ongoing reset we also need to re-allocate the hnae queues
+	 * since the number of TQPs from PF might have changed.
+	 */
+	if (hclgevf_dev_ongoing_reset(hdev))
+		devm_kfree(&hdev->pdev->dev, kinfo->tqp);
+
 	kinfo->tqp = devm_kcalloc(&hdev->pdev->dev, kinfo->num_tqps,
 				  sizeof(struct hnae3_queue *), GFP_KERNEL);
 	if (!kinfo->tqp)
@@ -878,10 +892,18 @@ static int hclgevf_reset_wait(struct hclgevf_dev *hdev)
 
 static int hclgevf_reset_stack(struct hclgevf_dev *hdev)
 {
+	int ret;
+
 	/* uninitialize the nic client */
 	hclgevf_notify_client(hdev, HNAE3_UNINIT_CLIENT);
 
-	/* re-initialize the hclge device - add code here */
+	/* re-initialize the hclge device */
+	ret = hclgevf_init_hdev(hdev);
+	if (ret) {
+		dev_err(&hdev->pdev->dev,
+			"hclge device re-init failed, VF is disabled!\n");
+		return ret;
+	}
 
 	/* bring up the nic client again */
 	hclgevf_notify_client(hdev, HNAE3_INIT_CLIENT);
@@ -1179,6 +1201,22 @@ static int hclgevf_configure(struct hclgevf_dev *hdev)
 	return hclgevf_get_tc_info(hdev);
 }
 
+static int hclgevf_alloc_hdev(struct hnae3_ae_dev *ae_dev)
+{
+	struct pci_dev *pdev = ae_dev->pdev;
+	struct hclgevf_dev *hdev = ae_dev->priv;
+
+	hdev = devm_kzalloc(&pdev->dev, sizeof(*hdev), GFP_KERNEL);
+	if (!hdev)
+		return -ENOMEM;
+
+	hdev->pdev = pdev;
+	hdev->ae_dev = ae_dev;
+	ae_dev->priv = hdev;
+
+	return 0;
+}
+
 static int hclgevf_init_roce_base_info(struct hclgevf_dev *hdev)
 {
 	struct hnae3_handle *roce = &hdev->roce;
@@ -1284,6 +1322,10 @@ static void hclgevf_ae_stop(struct hnae3_handle *handle)
 
 static void hclgevf_state_init(struct hclgevf_dev *hdev)
 {
+	/* if this is an ongoing reset then skip this initialization */
+	if (hclgevf_dev_ongoing_reset(hdev))
+		return;
+
 	/* setup tasks for the MBX */
 	INIT_WORK(&hdev->mbx_service_task, hclgevf_mailbox_service_task);
 	clear_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state);
@@ -1325,6 +1367,10 @@ static int hclgevf_init_msi(struct hclgevf_dev *hdev)
 	int vectors;
 	int i;
 
+	/* if this is an ongoing reset then skip this initialization */
+	if (hclgevf_dev_ongoing_reset(hdev))
+		return 0;
+
 	hdev->num_msi = HCLGEVF_MAX_VF_VECTOR_NUM;
 
 	vectors = pci_alloc_irq_vectors(pdev, 1, hdev->num_msi,
@@ -1375,6 +1421,10 @@ static int hclgevf_misc_irq_init(struct hclgevf_dev *hdev)
 {
 	int ret = 0;
 
+	/* if this is an ongoing reset then skip this initialization */
+	if (hclgevf_dev_ongoing_reset(hdev))
+		return 0;
+
 	hclgevf_get_misc_vector(hdev);
 
 	ret = request_irq(hdev->misc_vector.vector_irq, hclgevf_misc_irq_handle,
@@ -1485,6 +1535,14 @@ static int hclgevf_pci_init(struct hclgevf_dev *hdev)
 	struct hclgevf_hw *hw;
 	int ret;
 
+	/* check if we need to skip initialization of pci. This will happen if
+	 * device is undergoing VF reset. Otherwise, we would need to
+	 * re-initialize pci interface again i.e. when device is not going
+	 * through *any* reset or actually undergoing full reset.
+	 */
+	if (hclgevf_dev_ongoing_reset(hdev))
+		return 0;
+
 	ret = pci_enable_device(pdev);
 	if (ret) {
 		dev_err(&pdev->dev, "failed to enable PCI device\n");
@@ -1536,19 +1594,16 @@ static void hclgevf_pci_uninit(struct hclgevf_dev *hdev)
 	pci_set_drvdata(pdev, NULL);
 }
 
-static int hclgevf_init_ae_dev(struct hnae3_ae_dev *ae_dev)
+static int hclgevf_init_hdev(struct hclgevf_dev *hdev)
 {
-	struct pci_dev *pdev = ae_dev->pdev;
-	struct hclgevf_dev *hdev;
+	struct pci_dev *pdev = hdev->pdev;
 	int ret;
 
-	hdev = devm_kzalloc(&pdev->dev, sizeof(*hdev), GFP_KERNEL);
-	if (!hdev)
-		return -ENOMEM;
-
-	hdev->pdev = pdev;
-	hdev->ae_dev = ae_dev;
-	ae_dev->priv = hdev;
+	/* check if device is undergoing a full reset (i.e. PCIe as well) */
+	if (hclgevf_dev_ongoing_full_reset(hdev)) {
+		dev_warn(&pdev->dev, "device is undergoing full reset\n");
+		hclgevf_uninit_hdev(hdev);
+	}
 
 	ret = hclgevf_pci_init(hdev);
 	if (ret) {
@@ -1633,15 +1688,38 @@ static int hclgevf_init_ae_dev(struct hnae3_ae_dev *ae_dev)
 	return ret;
 }
 
-static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev *ae_dev)
+static void hclgevf_uninit_hdev(struct hclgevf_dev *hdev)
 {
-	struct hclgevf_dev *hdev = ae_dev->priv;
-
 	hclgevf_cmd_uninit(hdev);
 	hclgevf_misc_irq_uninit(hdev);
 	hclgevf_state_uninit(hdev);
 	hclgevf_uninit_msi(hdev);
 	hclgevf_pci_uninit(hdev);
+}
+
+static int hclgevf_init_ae_dev(struct hnae3_ae_dev *ae_dev)
+{
+	struct pci_dev *pdev = ae_dev->pdev;
+	int ret;
+
+	ret = hclgevf_alloc_hdev(ae_dev);
+	if (ret) {
+		dev_err(&pdev->dev, "hclge device allocation failed\n");
+		return ret;
+	}
+
+	ret = hclgevf_init_hdev(ae_dev->priv);
+	if (ret)
+		dev_err(&pdev->dev, "hclge device initialization failed\n");
+
+	return ret;
+}
+
+static void hclgevf_uninit_ae_dev(struct hnae3_ae_dev *ae_dev)
+{
+	struct hclgevf_dev *hdev = ae_dev->priv;
+
+	hclgevf_uninit_hdev(hdev);
 	ae_dev->priv = NULL;
 }
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
index afdb15d..8cdc602 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
@@ -169,6 +169,20 @@ struct hclgevf_dev {
 	u32 flag;
 };
 
+static inline bool hclgevf_dev_ongoing_reset(struct hclgevf_dev *hdev)
+{
+	return (hdev &&
+		(test_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state)) &&
+		(hdev->nic.reset_level == HNAE3_VF_RESET));
+}
+
+static inline bool hclgevf_dev_ongoing_full_reset(struct hclgevf_dev *hdev)
+{
+	return (hdev &&
+		(test_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state)) &&
+		(hdev->nic.reset_level == HNAE3_VF_FULL_RESET));
+}
+
 int hclgevf_send_mbx_msg(struct hclgevf_dev *hdev, u16 code, u16 subcode,
 			 const u8 *msg_data, u8 msg_len, bool need_resp,
 			 u8 *resp_data, u16 resp_len);
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 5/9] net: hns3: Add support to reset the enet/ring mgmt layer
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

Once the VF driver knows that the hardware reset has completed
successfully, it should go ahead and reset the enet layer.
This primarily consists of bringing down the interface, clearing
the TX/RX rings, disassociating vectors from rings, etc.
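
The "wait for hardware reset completion" step is a bounded poll. A
minimal userspace sketch of that pattern follows; the sketch_* names
and the fake register are assumptions made for illustration, and the
sleep between polls is only indicated by a comment. Unlike the patch,
the sketch decides on the last sampled bit rather than on the counter;
that is a choice made here, not something the series does.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_RESET_WAIT_CNT	20	/* maximum number of re-polls */

/* Hypothetical fake hardware: reports "busy" for the first busy_polls reads. */
struct sketch_hw {
	unsigned int busy_polls;
};

/* Stand-in for reading the "reset in progress" bit from a register. */
static bool sketch_reset_in_progress(struct sketch_hw *hw)
{
	if (hw->busy_polls == 0)
		return false;

	hw->busy_polls--;
	return true;
}

/* Returns 0 once the bit clears, -EBUSY if it is still set after the
 * initial read plus SKETCH_RESET_WAIT_CNT re-polls.
 */
static int sketch_reset_wait(struct sketch_hw *hw)
{
	unsigned int cnt = 0;
	bool busy;

	busy = sketch_reset_in_progress(hw);
	while (busy && cnt < SKETCH_RESET_WAIT_CNT) {
		/* a real driver would sleep here between polls */
		busy = sketch_reset_in_progress(hw);
		cnt++;
	}

	return busy ? -EBUSY : 0;
}
```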

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 103 ++++++++++++++++++++-
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |   3 +
 2 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index b648311..bd45b11 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -2,6 +2,7 @@
 // Copyright (c) 2016-2017 Hisilicon Limited.
 
 #include <linux/etherdevice.h>
+#include <net/rtnetlink.h>
 #include "hclgevf_cmd.h"
 #include "hclgevf_main.h"
 #include "hclge_mbx.h"
@@ -832,6 +833,101 @@ static void hclgevf_reset_tqp(struct hnae3_handle *handle, u16 queue_id)
 			     2, true, NULL, 0);
 }
 
+static int hclgevf_notify_client(struct hclgevf_dev *hdev,
+				 enum hnae3_reset_notify_type type)
+{
+	struct hnae3_client *client = hdev->nic_client;
+	struct hnae3_handle *handle = &hdev->nic;
+
+	if (!client->ops->reset_notify)
+		return -EOPNOTSUPP;
+
+	return client->ops->reset_notify(handle, type);
+}
+
+static int hclgevf_reset_wait(struct hclgevf_dev *hdev)
+{
+#define HCLGEVF_RESET_WAIT_MS	500
+#define HCLGEVF_RESET_WAIT_CNT	20
+	u32 val, cnt = 0;
+
+	/* wait to check the hardware reset completion status */
+	val = hclgevf_read_dev(&hdev->hw, HCLGEVF_FUN_RST_ING);
+	while (hnae_get_bit(val, HCLGEVF_FUN_RST_ING_B) &&
+			    (cnt < HCLGEVF_RESET_WAIT_CNT)) {
+		msleep(HCLGEVF_RESET_WAIT_MS);
+		val = hclgevf_read_dev(&hdev->hw, HCLGEVF_FUN_RST_ING);
+		cnt++;
+	}
+
+	/* hardware completion status should be available by this time */
+	if (cnt >= HCLGEVF_RESET_WAIT_CNT) {
+		dev_warn(&hdev->pdev->dev,
+			 "couldn't get reset done status from h/w, timeout!\n");
+		return -EBUSY;
+	}
+
+	/* we will wait a bit more to let the reset of the stack complete.
+	 * This might happen in case reset assertion was made by PF. Yes, this
+	 * also means we might end up waiting a bit more even for VF reset.
+	 */
+	msleep(5000);
+
+	return 0;
+}
+
+static int hclgevf_reset_stack(struct hclgevf_dev *hdev)
+{
+	/* uninitialize the nic client */
+	hclgevf_notify_client(hdev, HNAE3_UNINIT_CLIENT);
+
+	/* re-initialize the hclge device - add code here */
+
+	/* bring up the nic client again */
+	hclgevf_notify_client(hdev, HNAE3_INIT_CLIENT);
+
+	return 0;
+}
+
+static int hclgevf_reset(struct hclgevf_dev *hdev)
+{
+	int ret;
+
+	rtnl_lock();
+
+	/* bring down the nic to stop any ongoing TX/RX */
+	hclgevf_notify_client(hdev, HNAE3_DOWN_CLIENT);
+
+	/* check if VF could successfully fetch the hardware reset completion
+	 * status from the hardware
+	 */
+	ret = hclgevf_reset_wait(hdev);
+	if (ret) {
+		/* can't do much in this situation, will disable VF */
+		dev_err(&hdev->pdev->dev,
+			"VF failed(=%d) to fetch H/W reset completion status\n",
+			ret);
+
+		dev_warn(&hdev->pdev->dev, "VF reset failed, disabling VF!\n");
+		hclgevf_notify_client(hdev, HNAE3_UNINIT_CLIENT);
+
+		rtnl_unlock();
+		return ret;
+	}
+
+	/* now, re-initialize the nic client and ae device */
+	ret = hclgevf_reset_stack(hdev);
+	if (ret)
+		dev_err(&hdev->pdev->dev, "failed to reset VF stack\n");
+
+	/* bring up the nic to enable TX/RX again */
+	hclgevf_notify_client(hdev, HNAE3_UP_CLIENT);
+
+	rtnl_unlock();
+
+	return ret;
+}
+
 static int hclgevf_do_reset(struct hclgevf_dev *hdev)
 {
 	int status;
@@ -940,10 +1036,9 @@ static void hclgevf_reset_service_task(struct work_struct *work)
 		 */
 		hdev->reset_attempts = 0;
 
-		/* code to check/wait for hardware reset completion and the
-		 * further initiating software stack reset would be added here
-		 */
-
+		ret = hclgevf_reset(hdev);
+		if (ret)
+			dev_err(&hdev->pdev->dev, "VF stack reset failed.\n");
 	} else if (test_and_clear_bit(HCLGEVF_RESET_REQUESTED,
 				      &hdev->reset_state)) {
 		/* we could be here when either of below happens:
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
index 1c9cf87..afdb15d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
@@ -34,6 +34,9 @@
 #define HCLGEVF_VECTOR0_RX_CMDQ_INT_B	1
 
 #define HCLGEVF_TQP_RESET_TRY_TIMES	10
+/* Reset related Registers */
+#define HCLGEVF_FUN_RST_ING		0x20C00
+#define HCLGEVF_FUN_RST_ING_B		0
 
 #define HCLGEVF_RSS_IND_TBL_SIZE		512
 #define HCLGEVF_RSS_SET_BITMAP_MSK	0xffff
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 4/9] net: hns3: Add support to request VF Reset to PF
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

VF driver depends upon PF to eventually reset the hardware. This
request is made using the mailbox command. This patch adds the
required function to achieve this.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 0d204e2..b648311 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -832,6 +832,20 @@ static void hclgevf_reset_tqp(struct hnae3_handle *handle, u16 queue_id)
 			     2, true, NULL, 0);
 }
 
+static int hclgevf_do_reset(struct hclgevf_dev *hdev)
+{
+	int status;
+	u8 respmsg;
+
+	status = hclgevf_send_mbx_msg(hdev, HCLGE_MBX_RESET, 0, NULL,
+				      0, false, &respmsg, sizeof(u8));
+	if (status)
+		dev_err(&hdev->pdev->dev,
+			"VF reset request to PF failed(=%d)\n", status);
+
+	return status;
+}
+
 static void hclgevf_reset_event(struct hnae3_handle *handle)
 {
 	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
@@ -910,6 +924,7 @@ static void hclgevf_reset_service_task(struct work_struct *work)
 {
 	struct hclgevf_dev *hdev =
 		container_of(work, struct hclgevf_dev, rst_service_task);
+	int ret;
 
 	if (test_and_set_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state))
 		return;
@@ -965,6 +980,10 @@ static void hclgevf_reset_service_task(struct work_struct *work)
 			hdev->reset_attempts++;
 
 			/* request PF for resetting this VF via mailbox */
+			ret = hclgevf_do_reset(hdev);
+			if (ret)
+				dev_warn(&hdev->pdev->dev,
+					 "VF rst fail, stack will call\n");
 		}
 	}
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 3/9] net: hns3: Add VF Reset device state and its handling
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

This introduces the hclge device reset states of "requested" and
"pending" and their handling in the context of the Reset Service Task.

Device gets into requested state because of any VF reset request
asserted from upper layers, for example due to watchdog timeout
expiration. Requested state would result in eventually forwarding
the VF reset request to PF which would actually reset the VF.

Device will get into pending state if:
1. VF receives the acknowledgement from PF for the VF reset
   request it originally sent to PF.
2. Reset Service Task detects that after asserting VF reset for
   certain times the data-path is not working and device then
   decides to assert full VF reset(this means also resetting the
   PCIe interface).
3. PF intimates the VF that it has undergone reset.
The pending state results in the VF polling for the hardware reset
completion status and then resetting the stack/enet layer, which
in turn means reinitializing the ring management/enet layer.

Note: support for case 3 will be added later as a separate patch.
This decision should not affect VF reset, as its event handling
is generic in nature.
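
The requested/pending handling described above can be pictured as a
small state machine. This is a simplified userspace sketch only: the
sketch_* names are invented, booleans replace the atomic bit
operations, and the actions are returned as values instead of being
carried out.

```c
#include <stdbool.h>

/* What the service task would do on this pass (illustrative only). */
enum sketch_action {
	SKETCH_NONE,
	SKETCH_STACK_RESET,	/* pending: poll hardware, then reset stack */
	SKETCH_ASK_PF,		/* requested: forward the reset request to PF */
	SKETCH_ESCALATE_FULL,	/* too many attempts: go for a full reset */
};

struct sketch_dev {
	bool requested;
	bool pending;
	bool full_reset;
	unsigned int attempts;
};

static enum sketch_action sketch_reset_service(struct sketch_dev *d)
{
	if (d->pending) {
		/* pending is checked first and clears the attempt count */
		d->pending = false;
		d->attempts = 0;
		return SKETCH_STACK_RESET;
	}

	if (d->requested) {
		d->requested = false;

		if (d->attempts > 3) {
			/* PF never acknowledged: mark full reset and defer */
			d->full_reset = true;
			d->pending = true;
			return SKETCH_ESCALATE_FULL;
		}

		d->attempts++;
		return SKETCH_ASK_PF;
	}

	return SKETCH_NONE;
}
```

With this shape, four unanswered requests are forwarded to the PF and
the fifth one escalates; the next pass then sees the pending state.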

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h        |  1 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 67 ++++++++++++++++++++--
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  5 ++
 3 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 56f9e650..37ec1b3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -119,6 +119,7 @@ enum hnae3_reset_notify_type {
 
 enum hnae3_reset_type {
 	HNAE3_VF_RESET,
+	HNAE3_VF_FULL_RESET,
 	HNAE3_FUNC_RESET,
 	HNAE3_CORE_RESET,
 	HNAE3_GLOBAL_RESET,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index cdb6e7a..0d204e2 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -840,7 +840,9 @@ static void hclgevf_reset_event(struct hnae3_handle *handle)
 
 	handle->reset_level = HNAE3_VF_RESET;
 
-	/* request VF reset here. Code added later */
+	/* reset of this VF requested */
+	set_bit(HCLGEVF_RESET_REQUESTED, &hdev->reset_state);
+	hclgevf_reset_task_schedule(hdev);
 
 	handle->last_reset_time = jiffies;
 }
@@ -889,6 +891,12 @@ static void hclgevf_task_schedule(struct hclgevf_dev *hdev)
 		schedule_work(&hdev->service_task);
 }
 
+static void hclgevf_deferred_task_schedule(struct hclgevf_dev *hdev)
+{
+	if (test_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state))
+		hclgevf_reset_task_schedule(hdev);
+}
+
 static void hclgevf_service_timer(struct timer_list *t)
 {
 	struct hclgevf_dev *hdev = from_timer(hdev, t, service_timer);
@@ -908,10 +916,57 @@ static void hclgevf_reset_service_task(struct work_struct *work)
 
 	clear_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state);
 
-	/* body of the reset service task will constitute of hclge device
-	 * reset state handling. This code shall be added subsequently in
-	 * next patches.
-	 */
+	if (test_and_clear_bit(HCLGEVF_RESET_PENDING,
+			       &hdev->reset_state)) {
+		/* PF has intimated that it is about to reset the hardware.
+		 * We now have to poll & check if the hardware has actually
+		 * completed the reset sequence. On hardware reset completion,
+		 * VF needs to reset the client and ae device.
+		 */
+		hdev->reset_attempts = 0;
+
+		/* code to check/wait for hardware reset completion and the
+		 * further initiating software stack reset would be added here
+		 */
+
+	} else if (test_and_clear_bit(HCLGEVF_RESET_REQUESTED,
+				      &hdev->reset_state)) {
+		/* we could be here when either of below happens:
+		 * 1. reset was initiated due to watchdog timeout due to
+		 *    a. IMP was earlier reset and our TX got choked down and
+		 *       which resulted in watchdog reacting and inducing VF
+		 *       reset. This also means our cmdq would be unreliable.
+		 *    b. problem in TX due to other lower layer(example link
+		 *       layer not functioning properly etc.)
+		 * 2. VF reset might have been initiated due to some config
+		 *    change.
+		 *
+		 * NOTE: There's no clear way to detect the above cases other
+		 * than to react to the response of PF for this reset request.
+		 * PF will ack cases 1b and 2 but we get no intimation about 1a
+		 * from PF as cmdq would be in unreliable state i.e. mailbox
+		 * communication between PF and VF would be broken.
+		 */
+
+		/* if we are never getting into pending state it means either:
+		 * 1. PF is not receiving our request which could be due to IMP
+		 *    reset
+		 * 2. PF is screwed
+		 * We cannot do much for 2. but to check first we can try reset
+		 * our PCIe + stack and see if it alleviates the problem.
+		 */
+		if (hdev->reset_attempts > 3) {
+			/* prepare for full reset of stack + pcie interface */
+			hdev->nic.reset_level = HNAE3_VF_FULL_RESET;
+
+			/* "defer" schedule the reset task again */
+			set_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state);
+		} else {
+			hdev->reset_attempts++;
+
+			/* request PF for resetting this VF via mailbox */
+		}
+	}
 
 	clear_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state);
 }
@@ -943,6 +998,8 @@ static void hclgevf_service_task(struct work_struct *work)
 	 */
 	hclgevf_request_link_info(hdev);
 
+	hclgevf_deferred_task_schedule(hdev);
+
 	clear_bit(HCLGEVF_STATE_SERVICE_SCHED, &hdev->state);
 }
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
index 8b5fa67..1c9cf87 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
@@ -124,6 +124,11 @@ struct hclgevf_dev {
 	struct hclgevf_rss_cfg rss_cfg;
 	unsigned long state;
 
+#define HCLGEVF_RESET_REQUESTED		0
+#define HCLGEVF_RESET_PENDING		1
+	unsigned long reset_state;	/* requested, pending */
+	u32 reset_attempts;
+
 	u32 fw_version;
 	u16 num_tqps;		/* num task queue pairs of this PF */
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 2/9] net: hns3: Add VF Reset Service Task to support event handling
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

VF reset involves handling different reset-related events from
the stack, physical function, mailbox etc. The reset service task
services such reset event requests and will later handle waiting
for hardware completion and initiating the stack resets.
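
The scheduling guard for such a service task might be pictured like
this. It is a minimal single-threaded sketch: all sketch_* names are
made up here, plain bit masks stand in for the kernel's atomic
test_bit()/set_bit() helpers, and a counter stands in for
schedule_work().

```c
#include <stdbool.h>

enum {
	SKETCH_RST_SERVICE_SCHED = 1u << 0,	/* task has been queued */
	SKETCH_RST_HANDLING	 = 1u << 1,	/* task is currently running */
};

struct sketch_dev {
	unsigned int state;
	unsigned int queued;	/* how many times work was actually queued */
};

/* Queue the task only if it is neither queued nor running already. */
static bool sketch_reset_task_schedule(struct sketch_dev *d)
{
	if (d->state & (SKETCH_RST_SERVICE_SCHED | SKETCH_RST_HANDLING))
		return false;

	d->state |= SKETCH_RST_SERVICE_SCHED;
	d->queued++;
	return true;
}

/* Body of the task: mark handling, drop the queued mark, do the work. */
static void sketch_reset_service_task(struct sketch_dev *d)
{
	if (d->state & SKETCH_RST_HANDLING)
		return;

	d->state |= SKETCH_RST_HANDLING;
	d->state &= ~SKETCH_RST_SERVICE_SCHED;

	/* reset state handling would go here */

	d->state &= ~SKETCH_RST_HANDLING;
}
```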

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 31 ++++++++++++++++++++++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  4 +++
 2 files changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 6c3881d..cdb6e7a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -867,6 +867,15 @@ static void hclgevf_get_misc_vector(struct hclgevf_dev *hdev)
 	hdev->num_msi_used += 1;
 }
 
+void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev)
+{
+	if (!test_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state) &&
+	    !test_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state)) {
+		set_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state);
+		schedule_work(&hdev->rst_service_task);
+	}
+}
+
 static void hclgevf_mbx_task_schedule(struct hclgevf_dev *hdev)
 {
 	if (!test_and_set_bit(HCLGEVF_STATE_MBX_SERVICE_SCHED, &hdev->state))
@@ -889,6 +898,24 @@ static void hclgevf_service_timer(struct timer_list *t)
 	hclgevf_task_schedule(hdev);
 }
 
+static void hclgevf_reset_service_task(struct work_struct *work)
+{
+	struct hclgevf_dev *hdev =
+		container_of(work, struct hclgevf_dev, rst_service_task);
+
+	if (test_and_set_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state))
+		return;
+
+	clear_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state);
+
+	/* body of the reset service task will constitute of hclge device
+	 * reset state handling. This code shall be added subsequently in
+	 * next patches.
+	 */
+
+	clear_bit(HCLGEVF_STATE_RST_HANDLING, &hdev->state);
+}
+
 static void hclgevf_mailbox_service_task(struct work_struct *work)
 {
 	struct hclgevf_dev *hdev;
@@ -1097,6 +1124,8 @@ static void hclgevf_state_init(struct hclgevf_dev *hdev)
 	INIT_WORK(&hdev->service_task, hclgevf_service_task);
 	clear_bit(HCLGEVF_STATE_SERVICE_SCHED, &hdev->state);
 
+	INIT_WORK(&hdev->rst_service_task, hclgevf_reset_service_task);
+
 	mutex_init(&hdev->mbx_resp.mbx_mutex);
 
 	/* bring the device down */
@@ -1113,6 +1142,8 @@ static void hclgevf_state_uninit(struct hclgevf_dev *hdev)
 		cancel_work_sync(&hdev->service_task);
 	if (hdev->mbx_service_task.func)
 		cancel_work_sync(&hdev->mbx_service_task);
+	if (hdev->rst_service_task.func)
+		cancel_work_sync(&hdev->rst_service_task);
 
 	mutex_destroy(&hdev->mbx_resp.mbx_mutex);
 }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
index 0eaea06..8b5fa67 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h
@@ -52,6 +52,8 @@ enum hclgevf_states {
 	HCLGEVF_STATE_DISABLED,
 	/* task states */
 	HCLGEVF_STATE_SERVICE_SCHED,
+	HCLGEVF_STATE_RST_SERVICE_SCHED,
+	HCLGEVF_STATE_RST_HANDLING,
 	HCLGEVF_STATE_MBX_SERVICE_SCHED,
 	HCLGEVF_STATE_MBX_HANDLING,
 };
@@ -146,6 +148,7 @@ struct hclgevf_dev {
 
 	struct timer_list service_timer;
 	struct work_struct service_task;
+	struct work_struct rst_service_task;
 	struct work_struct mbx_service_task;
 
 	struct hclgevf_tqp *htqp;
@@ -165,4 +168,5 @@ void hclgevf_mbx_handler(struct hclgevf_dev *hdev);
 void hclgevf_update_link_status(struct hclgevf_dev *hdev, int link_state);
 void hclgevf_update_speed_duplex(struct hclgevf_dev *hdev, u32 speed,
 				 u8 duplex);
+void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev);
 #endif
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 1/9] net: hns3: Changes to make enet watchdog timeout func common for PF/VF
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm
In-Reply-To: <20180322142900.22860-1-salil.mehta@huawei.com>

The HNS3 driver's enet layer, used for ring management and stack
interaction, is common to both VF and PF. PF already supports reset
functionality to handle the network stack watchdog timeout trigger,
but the existing code is not generic enough to support VF reset as
well.
This patch does the following:
1. Makes the existing watchdog timeout handler in the enet layer
   generic, i.e. suitable for both VF and PF.
2. Introduces the new reset event handler for the VF code.
3. Changes the existing reset event handler of the PF code to
   initialize the reset level.
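
The reset level handling on the PF side (item 3) can be illustrated
with a small userspace sketch. Everything named sketch_* is
hypothetical: a caller-supplied tick value replaces jiffies, and
SKETCH_HZ is an arbitrary stand-in for HZ. The sketch returns the
level it would request instead of scheduling anything.

```c
#include <stdbool.h>

#define SKETCH_HZ	100UL	/* assumed ticks per second */

enum sketch_reset_level {
	SKETCH_FUNC_RESET,
	SKETCH_CORE_RESET,
	SKETCH_GLOBAL_RESET,
};

struct sketch_handle {
	unsigned long last_reset_time;
	enum sketch_reset_level reset_level;
};

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Returns the level requested for this event, then escalates for the next. */
static enum sketch_reset_level
sketch_reset_event(struct sketch_handle *h, unsigned long now)
{
	enum sketch_reset_level requested;

	/* a request long after the last one starts over at function reset */
	if (sketch_time_after(now, h->last_reset_time + 4 * 5 * SKETCH_HZ))
		h->reset_level = SKETCH_FUNC_RESET;

	requested = h->reset_level;

	/* repeated timeouts in quick succession climb to the next level */
	if (h->reset_level < SKETCH_GLOBAL_RESET)
		h->reset_level++;

	h->last_reset_time = now;
	return requested;
}
```

Keeping the level and timestamp in the handle rather than in the enet
private data is what lets the same timeout handler serve PF and VF.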

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h        |  7 +++--
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    | 30 +++++-------------
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h    |  2 --
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 36 ++++++++++++----------
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 14 +++++++++
 5 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 9daa88d..56f9e650 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -118,6 +118,7 @@ enum hnae3_reset_notify_type {
 };
 
 enum hnae3_reset_type {
+	HNAE3_VF_RESET,
 	HNAE3_FUNC_RESET,
 	HNAE3_CORE_RESET,
 	HNAE3_GLOBAL_RESET,
@@ -400,8 +401,7 @@ struct hnae3_ae_ops {
 	int (*set_vf_vlan_filter)(struct hnae3_handle *handle, int vfid,
 				  u16 vlan, u8 qos, __be16 proto);
 	int (*enable_hw_strip_rxvtag)(struct hnae3_handle *handle, bool enable);
-	void (*reset_event)(struct hnae3_handle *handle,
-			    enum hnae3_reset_type reset);
+	void (*reset_event)(struct hnae3_handle *handle);
 	void (*get_channels)(struct hnae3_handle *handle,
 			     struct ethtool_channels *ch);
 	void (*get_tqps_and_rss_info)(struct hnae3_handle *h,
@@ -495,6 +495,9 @@ struct hnae3_handle {
 	struct hnae3_ae_algo *ae_algo;  /* the class who provides this handle */
 	u64 flags; /* Indicate the capabilities for this handle*/
 
+	unsigned long last_reset_time;
+	enum hnae3_reset_type reset_level;
+
 	union {
 		struct net_device *netdev; /* first member */
 		struct hnae3_knic_private_info kinfo;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 0b4a676..40a3eb7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -320,7 +320,7 @@ static int hns3_nic_net_open(struct net_device *netdev)
 		return ret;
 	}
 
-	priv->last_reset_time = jiffies;
+	priv->ae_handle->last_reset_time = jiffies;
 	return 0;
 }
 
@@ -1543,7 +1543,6 @@ static bool hns3_get_tx_timeo_queue_info(struct net_device *ndev)
 static void hns3_nic_net_timeout(struct net_device *ndev)
 {
 	struct hns3_nic_priv *priv = netdev_priv(ndev);
-	unsigned long last_reset_time = priv->last_reset_time;
 	struct hnae3_handle *h = priv->ae_handle;
 
 	if (!hns3_get_tx_timeo_queue_info(ndev))
@@ -1551,24 +1550,12 @@ static void hns3_nic_net_timeout(struct net_device *ndev)
 
 	priv->tx_timeout_count++;
 
-	/* This timeout is far away enough from last timeout,
-	 * if timeout again,set the reset type to PF reset
-	 */
-	if (time_after(jiffies, (last_reset_time + 20 * HZ)))
-		priv->reset_level = HNAE3_FUNC_RESET;
-
-	/* Don't do any new action before the next timeout */
-	else if (time_before(jiffies, (last_reset_time + ndev->watchdog_timeo)))
+	if (time_before(jiffies, (h->last_reset_time + ndev->watchdog_timeo)))
 		return;
 
-	priv->last_reset_time = jiffies;
-
+	/* request the reset */
 	if (h->ae_algo->ops->reset_event)
-		h->ae_algo->ops->reset_event(h, priv->reset_level);
-
-	priv->reset_level++;
-	if (priv->reset_level > HNAE3_GLOBAL_RESET)
-		priv->reset_level = HNAE3_GLOBAL_RESET;
+		h->ae_algo->ops->reset_event(h);
 }
 
 static const struct net_device_ops hns3_nic_netdev_ops = {
@@ -3122,8 +3109,8 @@ static int hns3_client_init(struct hnae3_handle *handle)
 	priv->dev = &pdev->dev;
 	priv->netdev = netdev;
 	priv->ae_handle = handle;
-	priv->last_reset_time = jiffies;
-	priv->reset_level = HNAE3_FUNC_RESET;
+	priv->ae_handle->reset_level = HNAE3_NONE_RESET;
+	priv->ae_handle->last_reset_time = jiffies;
 	priv->tx_timeout_count = 0;
 
 	handle->kinfo.netdev = netdev;
@@ -3355,7 +3342,6 @@ static int hns3_reset_notify_down_enet(struct hnae3_handle *handle)
 static int hns3_reset_notify_up_enet(struct hnae3_handle *handle)
 {
 	struct hnae3_knic_private_info *kinfo = &handle->kinfo;
-	struct hns3_nic_priv *priv = netdev_priv(kinfo->netdev);
 	int ret = 0;
 
 	if (netif_running(kinfo->netdev)) {
@@ -3365,8 +3351,7 @@ static int hns3_reset_notify_up_enet(struct hnae3_handle *handle)
 				   "hns net up fail, ret=%d!\n", ret);
 			return ret;
 		}
-
-		priv->last_reset_time = jiffies;
+		handle->last_reset_time = jiffies;
 	}
 
 	return ret;
@@ -3378,7 +3363,6 @@ static int hns3_reset_notify_init_enet(struct hnae3_handle *handle)
 	struct hns3_nic_priv *priv = netdev_priv(netdev);
 	int ret;
 
-	priv->reset_level = 1;
 	hns3_init_mac_addr(netdev);
 	hns3_nic_set_rx_mode(netdev);
 	hns3_recover_hw_addr(netdev);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 39daa01..9e4cfbb 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -532,8 +532,6 @@ struct hns3_nic_priv {
 	/* The most recently read link state */
 	int link;
 	u64 tx_timeout_count;
-	enum hnae3_reset_type reset_level;
-	unsigned long last_reset_time;
 
 	unsigned long state;
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 31e90b5..a3e00da 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2845,27 +2845,31 @@ static void hclge_reset(struct hclge_dev *hdev)
 	hclge_notify_client(hdev, HNAE3_UP_CLIENT);
 }
 
-static void hclge_reset_event(struct hnae3_handle *handle,
-			      enum hnae3_reset_type reset)
+static void hclge_reset_event(struct hnae3_handle *handle)
 {
 	struct hclge_vport *vport = hclge_get_vport(handle);
 	struct hclge_dev *hdev = vport->back;
 
-	dev_info(&hdev->pdev->dev,
-		 "Receive reset event , reset_type is %d", reset);
+	/* Check whether this is a new reset request or a retry: we may be here
+	 * only because the last reset attempt failed and the watchdog fired
+	 * again. It is a new request if the last one was not recent (the
+	 * watchdog timer is 5 * HZ, so we check against 4 * 5 * HZ). For a new
+	 * request, set the reset level back to PF (function) reset.
+	 */
+	if (time_after(jiffies, (handle->last_reset_time + 4 * 5 * HZ)))
+		handle->reset_level = HNAE3_FUNC_RESET;
 
-	switch (reset) {
-	case HNAE3_FUNC_RESET:
-	case HNAE3_CORE_RESET:
-	case HNAE3_GLOBAL_RESET:
-		/* request reset & schedule reset task */
-		set_bit(reset, &hdev->reset_request);
-		hclge_reset_task_schedule(hdev);
-		break;
-	default:
-		dev_warn(&hdev->pdev->dev, "Unsupported reset event:%d", reset);
-		break;
-	}
+	dev_info(&hdev->pdev->dev, "received reset event , reset type is %d",
+		 handle->reset_level);
+
+	/* request reset & schedule reset task */
+	set_bit(handle->reset_level, &hdev->reset_request);
+	hclge_reset_task_schedule(hdev);
+
+	if (handle->reset_level < HNAE3_GLOBAL_RESET)
+		handle->reset_level++;
+
+	handle->last_reset_time = jiffies;
 }
 
 static void hclge_reset_subtask(struct hclge_dev *hdev)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 906dfa3..6c3881d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -832,6 +832,19 @@ static void hclgevf_reset_tqp(struct hnae3_handle *handle, u16 queue_id)
 			     2, true, NULL, 0);
 }
 
+static void hclgevf_reset_event(struct hnae3_handle *handle)
+{
+	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+
+	dev_info(&hdev->pdev->dev, "received reset request from VF enet\n");
+
+	handle->reset_level = HNAE3_VF_RESET;
+
+	/* request VF reset here. Code added later */
+
+	handle->last_reset_time = jiffies;
+}
+
 static u32 hclgevf_get_fw_version(struct hnae3_handle *handle)
 {
 	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
@@ -1526,6 +1539,7 @@ static const struct hnae3_ae_ops hclgevf_ops = {
 	.get_tc_size = hclgevf_get_tc_size,
 	.get_fw_version = hclgevf_get_fw_version,
 	.set_vlan_filter = hclgevf_set_vlan_filter,
+	.reset_event = hclgevf_reset_event,
 	.get_channels = hclgevf_get_channels,
 	.get_tqps_and_rss_info = hclgevf_get_tqps_and_rss_info,
 	.get_status = hclgevf_get_status,
-- 
2.7.4


* [PATCH net-next 0/9] Add support of VF Reset to HNS3 VF driver
From: Salil Mehta @ 2018-03-22 14:28 UTC (permalink / raw)
  To: davem
  Cc: salil.mehta, yisen.zhuang, mehta.salil.lnk, netdev, linux-kernel,
	linuxarm

This patch-set adds support for VF reset to the existing VF driver.
A VF reset can be triggered by the TX watchdog firing because the TX
data path is not working. A VF reset can also result from internal
configuration changes that require a reset, or from a
PF/Core/Global/IMP (Integrated Management Processor) reset happening
in the PF.

Summary of Patches:
* Watchdog timer trigger changes are present in Patch 1.
* Reset Service Task and related event handling are present in Patches {2,3}
* Changes to send the reset request to the PF, reset the stack and re-initialize
  the hclge device are present in Patches {4,5,6}
* Changes related to ARQ (Asynchronous Receive Queue) and its event handling
  are present in Patches {7,8}
* Changes required in the PF to handle the VF reset request and actually perform
  the hardware VF reset are present in Patch 9.
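
For readers following the watchdog-triggered reset path, the escalation
idea can be sketched in plain user-space C. This is only an illustrative
sketch, not the driver code: the enum, struct handle, after() and
reset_event() are hypothetical stand-ins, HZ is an assumed tick rate, and
the caller passes the current tick count instead of reading jiffies. A
request that arrives long after the previous one starts again at function
level; a request inside the retry window escalates one level, up to
global reset.

```c
#define HZ 100UL                       /* assumed tick rate, illustration only */
#define WATCHDOG_TIMEOUT (5 * HZ)      /* watchdog period, as in the comment */
#define RETRY_WINDOW (4 * WATCHDOG_TIMEOUT)

/* Hypothetical reset levels, ordered from least to most disruptive. */
enum reset_level { RESET_NONE, RESET_FUNC, RESET_CORE, RESET_GLOBAL };

/* Hypothetical stand-in for the per-handle reset bookkeeping. */
struct handle {
	enum reset_level reset_level;
	unsigned long last_reset_time;
};

/* Wraparound-safe "a is after b", the same idea as the kernel's time_after(). */
static int after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Returns the level requested for this event and updates the bookkeeping. */
static enum reset_level reset_event(struct handle *h, unsigned long now)
{
	enum reset_level requested;

	/* A request long after the last one is treated as new: start low. */
	if (after(now, h->last_reset_time + RETRY_WINDOW))
		h->reset_level = RESET_FUNC;

	requested = h->reset_level;

	/* If the watchdog fires again soon, the next attempt goes deeper. */
	if (h->reset_level < RESET_GLOBAL)
		h->reset_level++;

	h->last_reset_time = now;
	return requested;
}
```

With these assumed values, three events arriving 5 * HZ apart would
request function, core and global reset in turn, and an event arriving
more than 20 * HZ after the previous one would start again at function
reset.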

NOTE: This patch-set depends on "[PATCH net-next 00/11] fix some bugs for HNS3 driver"
	Link: https://lkml.org/lkml/2018/3/21/72

Salil Mehta (9):
  net: hns3: Changes to make enet watchdog timeout func common for PF/VF
  net: hns3: Add VF Reset Service Task to support event handling
  net: hns3: Add VF Reset device state and its handling
  net: hns3: Add support to request VF Reset to PF
  net: hns3: Add support to reset the enet/ring mgmt layer
  net: hns3: Add support to re-initialize the hclge device
  net: hns3: Changes to support ARQ(Asynchronous Receive Queue)
  net: hns3: Add *Asserting Reset* mailbox message & handling in VF
  net: hns3: Changes required in PF mailbox to support VF reset

 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h    |  16 +
 drivers/net/ethernet/hisilicon/hns3/hnae3.h        |   8 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    |  30 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h    |   2 -
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    |  38 +--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |   1 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c |  42 +++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c   |   6 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 336 +++++++++++++++++++--
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  31 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   |  95 +++++-
 11 files changed, 534 insertions(+), 71 deletions(-)

-- 
2.7.4


* Re: Fwd: Kernel panic when using KVM and mlx4_en driver (when bonding and sriov enabled)
From: Tariq Toukan @ 2018-03-22 14:24 UTC (permalink / raw)
  To: kvaps, Tariq Toukan; +Cc: netdev
In-Reply-To: <CAGO-sgOehi4++WUXUb3RrkafV5-kAr_=XWgZ1Tu20fHVyUzu-w@mail.gmail.com>



On 20/03/2018 10:14 PM, kvaps wrote:
> Hello, I have one bug with new HPE ProLiant m710x Server Cartridges,
> there is Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> Ethernet controller.
> 
> When I use bonding + VFs and KVM, the kernel hangs with these
> messages on the console:
> 
> [ 1011.070739] kvm [16361]: vcpu0, guest rIP: 0xffffffff810644d8 disabled perfctr wrmsr: 0xc2 data 0xffff
> [ 1011.528347] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
> [ 1011.927642] general protection fault: 0000 [#1] SMP PTI
> [ 1012.185439] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
> 
> I've already reported this bug on launchpad:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1755268
> But since the bug is present in the latest kernel, I was advised to
> contact you directly.
> 

Thanks for that!

I will check the details below and let you know of any questions/updates 
I have.

Regards,
Tariq

> === Steps to reproduce ===
> 
> I have the following network configuration:
> 
> eno1 (physical)    eno1d1 (physical)    eno2 (virtual function)    eno2d1 (virtual function)
>          |                  |
>          +------ bond0 -----+
>                    |
>                    |
>                  vmbr0 (bridge)
> 
> 
> After my machine has booted, I can run these commands:
> 
> # wget http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-virt-3.7.0-x86_64.iso -O alpine.iso
> # qemu-system-x86_64 -machine pc-i440fx-xenial,accel=kvm,usb=off -boot d -cdrom alpine.iso -m 512 -nographic -device e1000,netdev=net0 -netdev tap,id=net0
> 
> And the kernel will crash.
> 
> === System information ===
> 
> ##################
> # Network config #
> ##################
> 
> This is my /etc/network/interfaces file:
> 
> auto lo
> iface lo inet loopback
> 
> auto openibd
> iface openibd inet manual
>          pre-up /etc/init.d/openibd start
>          pre-down /etc/init.d/openibd force-stop
> 
> auto bond0
> iface bond0 inet manual
>          pre-up ip link add bond0 type bond || true
>          pre-up ip link set bond0 down
>          pre-up ip link set bond0 type bond mode active-backup arp_interval 2000 arp_ip_target 10.36.0.1 arp_validate 3 primary eno1
>          pre-up ip link set eno1 down
>          pre-up ip link set eno1d1 down
>          pre-up ip link set eno1 master bond0
>          pre-up ip link set eno1d1 master bond0
>          pre-up ip link set bond0 up
>          pre-down ip link del bond0
> 
> auto vmbr0
> iface vmbr0 inet static
>          address 10.36.128.217
>          netmask 255.255.0.0
>          gateway 10.36.0.1
>          bridge_ports bond0
>          bridge_stp off
>          bridge_fd 0
> 
> 
> ##################
> # Kernel version #
> ##################
> 
> Latest kernel that I've tested:
> 
> # cat /proc/version
> Linux version 4.16.0-041600rc6-generic (kernel@gloin) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu3.2)) #201803182230 SMP Mon Mar 19 02:32:18 UTC 2018
> 
> ##################
> # Driver version #
> ##################
> 
> Both drivers that I tested:
> 
> # Mellanox driver on stock and hwe kernels:
> 
> # ethtool -i eno1
> driver: mlx4_en
> version: 4.3-1.0.1
> firmware-version: 2.40.5540
> expansion-rom-version:
> bus-info: 0000:11:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: no
> supports-register-dump: no
> supports-priv-flags: yes
> 
> Built-in driver from latest kernel:
> 
> # ethtool -i eno1
> driver: mlx4_en
> version: 4.0-0
> firmware-version: 2.42.5004
> expansion-rom-version:
> bus-info: 0000:11:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: no
> supports-register-dump: no
> supports-priv-flags: yes
> 
> ###############
> # NIC Details #
> ###############
> 
> # mst status
> MST modules:
> ------------
>      MST PCI module loaded
>      MST PCI configuration module loaded
> 
> MST devices:
> ------------
> /dev/mst/mt4103_pci_cr0 - PCI direct access.
>                                     domain:bus:dev.fn=0000:11:00.0
> bar=0x7f100000 size=0x100000
>                                     Chip revision is: 00
> /dev/mst/mt4103_pciconf0 - PCI configuration cycles access.
>                                     domain:bus:dev.fn=0000:11:00.0
> addr.reg=88 data.reg=92
>                                     Chip revision is: 00
> 
> # ibv_devinfo
> hca_id: mlx4_1
>          transport: InfiniBand (0)
>          fw_ver: 2.40.5540
>          node_guid: 0014:0500:d300:bc52
>          sys_image_guid: f403:4303:00fd:102d
>          vendor_id: 0x02c9
>          vendor_part_id: 4100
>          hw_ver: 0x0
>          board_id: HP_1690110017
>          phys_port_cnt: 2
>          Device ports:
>                  port: 1
>                          state: PORT_DOWN (1)
>                          max_mtu: 4096 (5)
>                          active_mtu: 1024 (3)
>                          sm_lid: 0
>                          port_lid: 0
>                          port_lmc: 0x00
>                          link_layer: Ethernet
> 
>                  port: 2
>                          state: PORT_DOWN (1)
>                          max_mtu: 4096 (5)
>                          active_mtu: 1024 (3)
>                          sm_lid: 0
>                          port_lid: 0
>                          port_lmc: 0x00
>                          link_layer: Ethernet
> 
> hca_id: mlx4_0
>          transport: InfiniBand (0)
>          fw_ver: 2.40.5540
>          node_guid: f403:4303:00fd:102d
>          sys_image_guid: f403:4303:00fd:102d
>          vendor_id: 0x02c9
>          vendor_part_id: 4103
>          hw_ver: 0x0
>          board_id: HP_1690110017
>          phys_port_cnt: 2
>          Device ports:
>                  port: 1
>                          state: PORT_ACTIVE (4)
>                          max_mtu: 4096 (5)
>                          active_mtu: 1024 (3)
>                          sm_lid: 0
>                          port_lid: 0
>                          port_lmc: 0x00
>                          link_layer: Ethernet
> 
>                  port: 2
>                          state: PORT_ACTIVE (4)
>                          max_mtu: 4096 (5)
>                          active_mtu: 1024 (3)
>                          sm_lid: 0
>                          port_lid: 0
>                          port_lmc: 0x00
>                          link_layer: Ethernet
> 
> #########################
> # Full Hardware Details #
> #########################
> 
> # lspci -vvv
> 00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM
> Registers (rev 0a)
>      Subsystem: Hewlett-Packard Company Skylake Host Bridge/DRAM Registers
>      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx-
>      Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ >SERR- <PERR- INTx-
>      Latency: 0
>      Capabilities: [e0] Vendor Specific Information: Len=10 <?>
>      Kernel driver in use: ie31200_edac
>      Kernel modules: ie31200_edac
> 
> 00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16)
> (rev 0a) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin ? routed to IRQ 26
>      Bus: primary=00, secondary=11, subordinate=11, sec-latency=0
>      I/O behind bridge: 0000f000-00000fff
>      Memory behind bridge: 7f100000-7f1fffff
>      Prefetchable memory behind bridge: 0000002000000000-00000020047fffff
>      Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ <SERR- <PERR-
>      BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
>          PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>      Capabilities: [88] Subsystem: Hewlett-Packard Company Skylake PCIe
> Controller (x16)
>      Capabilities: [80] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee001f8  Data: 0000
>      Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
>          DevCap:    MaxPayload 256 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 256 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>          LnkCap:    Port #2, Speed 8GT/s, Width x8, ASPM L0s L1, Exit
> Latency L0s <256ns, L1 <8us
>              ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 8GT/s, Width x8, TrErr- Train- SlotClk+
> DLActive- BWMgmt+ ABWMgmt+
>          SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>              Slot #1, PowerLimit 75.000W; Interlock- NoCompl+
>          SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> HPIrq- LinkChg-
>              Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>          SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>              Changed: MRL- PresDet+ LinkState-
>          RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible-
>          RootCap: CRSVisible-
>          RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>          DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+,
> OBFF Via WAKE# ARIFwd-
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled ARIFwd-
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete+, EqualizationPhase1+
>               EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
>      Capabilities: [100 v1] Virtual Channel
>          Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
>          Arb:    Fixed- WRR32- WRR64- WRR128-
>          Ctrl:    ArbSelect=Fixed
>          Status:    InProgress-
>          VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
>              Arb:    Fixed+ WRR32- WRR64- WRR128- TWRR128- WRR256-
>              Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
>              Status:    NegoPending- InProgress-
>      Capabilities: [140 v1] Root Complex Link
>          Desc:    PortNumber=02 ComponentID=01 EltType=Config
>          Link0:    Desc:    TargetPort=00 TargetComponent=01 AssocRCRB-
> LinkType=MemMapped LinkValid+
>              Addr:    00000000fed19000
>      Capabilities: [1c0 v1] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt+
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Capabilities: [d94 v1] #19
>      Kernel driver in use: pcieport
>      Kernel modules: shpchp
> 
> 00:02.0 VGA compatible controller: Intel Corporation Device 193a (rev
> 09) (prog-if 00 [VGA controller])
>      Subsystem: Hewlett-Packard Company Device 18a9
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 43
>      Region 0: Memory at 3ff0000000 (64-bit, non-prefetchable) [size=16M]
>      Region 2: Memory at 3fe0000000 (64-bit, prefetchable) [size=256M]
>      Region 4: I/O ports at 2000 [size=64]
>      Capabilities: [40] Vendor Specific Information: Len=0c <?>
>      Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
>          DevCap:    MaxPayload 128 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>          DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-,
> OBFF Not Supported
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
>      Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee00018  Data: 0000
>      Capabilities: [d0] Power Management version 2
>          Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>          Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v1] #1b
>      Capabilities: [200 v1] Address Translation Service (ATS)
>          ATSCap:    Invalidate Queue Depth: 00
>          ATSCtl:    Enable-, Smallest Translation Unit: 00
>      Capabilities: [300 v1] #13
>      Kernel driver in use: i915
>      Kernel modules: i915
> 
> 00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI
> Controller (rev 31) (prog-if 30 [XHCI])
>      Subsystem: Hewlett-Packard Company Sunrise Point-H USB 3.0 xHCI Controller
>      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0
>      Interrupt: pin A routed to IRQ 41
>      Region 0: Memory at 3ff1000000 (64-bit, non-prefetchable) [size=64K]
>      Capabilities: [70] Power Management version 2
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
> PME(D0-,D1-,D2-,D3hot+,D3cold+)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
>          Address: 00000000fee003f8  Data: 0000
>      Kernel driver in use: xhci_hcd
> 
> 00:16.0 Communication controller: Intel Corporation Sunrise Point-H
> CSME HECI #1 (rev 31)
>      Subsystem: Hewlett-Packard Company Sunrise Point-H CSME HECI
>      Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx-
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Interrupt: pin A routed to IRQ 255
>      Region 0: Memory at 3ff1011000 (64-bit, non-prefetchable)
> [disabled] [size=4K]
>      Capabilities: [50] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot+,D3cold-)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+
>          Address: 0000000000000000  Data: 0000
>      Kernel modules: mei_me
> 
> 00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA
> controller [AHCI mode] (rev 31) (prog-if 01 [AHCI 1.0])
>      DeviceName: Embedded SATA Controller #1
>      Subsystem: Hewlett-Packard Company Sunrise Point-H SATA controller
> [AHCI mode]
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0
>      Interrupt: pin A routed to IRQ 42
>      Region 0: Memory at 7f280000 (32-bit, non-prefetchable) [size=32K]
>      Region 1: Memory at 7f28c000 (32-bit, non-prefetchable) [size=256]
>      Region 2: I/O ports at 2080 [size=8]
>      Region 3: I/O ports at 2088 [size=4]
>      Region 4: I/O ports at 2060 [size=32]
>      Region 5: Memory at 7f200000 (32-bit, non-prefetchable) [size=512K]
>      Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee00418  Data: 0000
>      Capabilities: [70] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot+,D3cold-)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
>      Kernel driver in use: ahci
>      Kernel modules: ahci
> 
> 00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port
> #17 (rev f1) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 27
>      Bus: primary=00, secondary=0e, subordinate=0e, sec-latency=0
>      I/O behind bridge: 0000f000-00000fff
>      Memory behind bridge: 7f000000-7f0fffff
>      Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>      Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ <SERR- <PERR-
>      BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
>          PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>      Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>          DevCap:    MaxPayload 256 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>          LnkCap:    Port #17, Speed 8GT/s, Width x4, ASPM not
> supported, Exit Latency L0s <1us, L1 <16us
>              ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 8GT/s, Width x4, TrErr- Train- SlotClk+
> DLActive+ BWMgmt+ ABWMgmt-
>          SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>              Slot #3, PowerLimit 25.000W; Interlock- NoCompl+
>          SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> HPIrq- LinkChg-
>              Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>          SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>              Changed: MRL- PresDet- LinkState+
>          RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible-
>          RootCap: CRSVisible-
>          RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>          DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+,
> OBFF Not Supported ARIFwd+
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled ARIFwd-
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -3.5dB,
> EqualizationComplete+, EqualizationPhase1+
>               EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
>      Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee00218  Data: 0000
>      Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise
> Point-H PCI Root Port
>      Capabilities: [a0] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v1] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Capabilities: [140 v1] Access Control Services
>          ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
> UpstreamFwd- EgressCtrl- DirectTrans-
>          ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir-
> UpstreamFwd- EgressCtrl- DirectTrans-
>      Capabilities: [220 v1] #19
>      Kernel driver in use: pcieport
>      Kernel modules: shpchp
> 
> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> Port #1 (rev f1) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 28
>      Bus: primary=00, secondary=08, subordinate=08, sec-latency=0
>      I/O behind bridge: 0000f000-00000fff
>      Memory behind bridge: fff00000-000fffff
>      Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>      Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ <SERR- <PERR-
>      BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
>          PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>      Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>          DevCap:    MaxPayload 256 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>          LnkCap:    Port #1, Speed 8GT/s, Width x2, ASPM not supported,
> Exit Latency L0s unlimited, L1 <16us
>              ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x0, TrErr- Train+ SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>          SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>              Slot #5, PowerLimit 25.000W; Interlock- NoCompl+
>          SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> HPIrq- LinkChg-
>              Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>          SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
>              Changed: MRL- PresDet- LinkState-
>          RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible-
>          RootCap: CRSVisible-
>          RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>          DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+,
> OBFF Not Supported ARIFwd+
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled ARIFwd-
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -3.5dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee00238  Data: 0000
>      Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise
> Point-H PCI Express Root Port
>      Capabilities: [a0] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v1] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Capabilities: [140 v1] Access Control Services
>          ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
> UpstreamFwd- EgressCtrl- DirectTrans-
>          ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir-
> UpstreamFwd- EgressCtrl- DirectTrans-
>      Capabilities: [220 v1] #19
>      Kernel driver in use: pcieport
>      Kernel modules: shpchp
> 
> 00:1c.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> Port #4 (rev f1) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin D routed to IRQ 29
>      Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>      I/O behind bridge: 00001000-00001fff
>      Memory behind bridge: 90000000-92afffff
>      Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>      Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ <SERR- <PERR-
>      BridgeCtl: Parity+ SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
>          PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>      Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>          DevCap:    MaxPayload 256 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>          LnkCap:    Port #4, Speed 2.5GT/s, Width x1, ASPM not
> supported, Exit Latency L0s unlimited, L1 <16us
>              ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
> DLActive+ BWMgmt- ABWMgmt-
>          SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>              Slot #0, PowerLimit 10.000W; Interlock- NoCompl+
>          SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> HPIrq- LinkChg-
>              Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>          SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
>              Changed: MRL- PresDet- LinkState+
>          RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible-
>          RootCap: CRSVisible-
>          RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>          DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+,
> OBFF Not Supported ARIFwd+
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled ARIFwd-
>          LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -3.5dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee00258  Data: 0000
>      Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise
> Point-H PCI Express Root Port
>      Capabilities: [a0] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v1] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Capabilities: [140 v1] Access Control Services
>          ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
> UpstreamFwd- EgressCtrl- DirectTrans-
>          ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir-
> UpstreamFwd- EgressCtrl- DirectTrans-
>      Kernel driver in use: pcieport
>      Kernel modules: shpchp
> 
> 00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> Port #5 (rev f1) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 30
>      Bus: primary=00, secondary=0b, subordinate=0b, sec-latency=0
>      I/O behind bridge: 0000f000-00000fff
>      Memory behind bridge: fff00000-000fffff
>      Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>      Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ <SERR- <PERR-
>      BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
>          PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>      Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>          DevCap:    MaxPayload 256 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>          LnkCap:    Port #5, Speed 8GT/s, Width x4, ASPM not supported,
> Exit Latency L0s unlimited, L1 <16us
>              ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x0, TrErr- Train+ SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>          SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>              Slot #4, PowerLimit 25.000W; Interlock- NoCompl+
>          SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> HPIrq- LinkChg-
>              Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>          SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
>              Changed: MRL- PresDet- LinkState-
>          RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible-
>          RootCap: CRSVisible-
>          RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>          DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+,
> OBFF Not Supported ARIFwd+
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled ARIFwd-
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -3.5dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee00278  Data: 0000
>      Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise
> Point-H PCI Express Root Port
>      Capabilities: [a0] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v1] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Capabilities: [140 v1] Access Control Services
>          ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
> UpstreamFwd- EgressCtrl- DirectTrans-
>          ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir-
> UpstreamFwd- EgressCtrl- DirectTrans-
>      Capabilities: [220 v1] #19
>      Kernel driver in use: pcieport
>      Kernel modules: shpchp
> 
> 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> Port #9 (rev f1) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 31
>      Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
>      I/O behind bridge: 0000f000-00000fff
>      Memory behind bridge: fff00000-000fffff
>      Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>      Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ <SERR- <PERR-
>      BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
>          PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>      Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>          DevCap:    MaxPayload 256 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>          LnkCap:    Port #9, Speed 8GT/s, Width x4, ASPM not supported,
> Exit Latency L0s unlimited, L1 <16us
>              ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x0, TrErr- Train+ SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>          SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>              Slot #2, PowerLimit 25.000W; Interlock- NoCompl+
>          SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> HPIrq- LinkChg-
>              Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>          SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
>              Changed: MRL- PresDet- LinkState-
>          RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible-
>          RootCap: CRSVisible-
>          RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>          DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+,
> OBFF Not Supported ARIFwd+
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled ARIFwd-
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -3.5dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee00298  Data: 0000
>      Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise
> Point-H PCI Express Root Port
>      Capabilities: [a0] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v1] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Capabilities: [140 v1] Access Control Services
>          ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
> UpstreamFwd- EgressCtrl- DirectTrans-
>          ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir-
> UpstreamFwd- EgressCtrl- DirectTrans-
>      Capabilities: [220 v1] #19
>      Kernel driver in use: pcieport
>      Kernel modules: shpchp
> 
> 00:1d.5 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root
> Port #14 (rev f1) (prog-if 00 [Normal decode])
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin B routed to IRQ 32
>      Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
>      I/O behind bridge: 0000f000-00000fff
>      Memory behind bridge: fff00000-000fffff
>      Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>      Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort+ <SERR- <PERR-
>      BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
>          PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>      Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>          DevCap:    MaxPayload 256 bytes, PhantFunc 0
>              ExtTag- RBE+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>          LnkCap:    Port #14, Speed 8GT/s, Width x1, ASPM not
> supported, Exit Latency L0s unlimited, L1 <16us
>              ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x0, TrErr- Train+ SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>          SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>              Slot #1, PowerLimit 10.000W; Interlock- NoCompl+
>          SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
> HPIrq- LinkChg-
>              Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
>          SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
>              Changed: MRL- PresDet- LinkState-
>          RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible-
>          RootCap: CRSVisible-
>          RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>          DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+,
> OBFF Not Supported ARIFwd+
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled ARIFwd-
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -3.5dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
>          Address: fee002b8  Data: 0000
>      Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise
> Point-H PCI Express Root Port
>      Capabilities: [a0] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v1] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP- CmpltTO+ CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Capabilities: [140 v1] Access Control Services
>          ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
> UpstreamFwd- EgressCtrl- DirectTrans-
>          ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir-
> UpstreamFwd- EgressCtrl- DirectTrans-
>      Capabilities: [220 v1] #19
>      Kernel driver in use: pcieport
>      Kernel modules: shpchp
> 
> 00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>      Subsystem: Hewlett-Packard Company Sunrise Point-H LPC Controller
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>      Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0
> 
> 00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>      Subsystem: Hewlett-Packard Company Sunrise Point-H PMC
>      Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx-
>      Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Region 0: Memory at 7f288000 (32-bit, non-prefetchable) [disabled]
> [size=16K]
> 
> 00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>      Subsystem: Hewlett-Packard Company Sunrise Point-H SMBus
>      Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx-
>      Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Interrupt: pin A routed to IRQ 255
>      Region 0: Memory at 3ff1010000 (64-bit, non-prefetchable)
> [disabled] [size=256]
>      Region 4: I/O ports at efa0 [size=32]
>      Kernel modules: i2c_i801
> 
> 01:00.0 System peripheral: Hewlett-Packard Company Integrated
> Lights-Out Standard Slave Instrumentation & System Support (rev 06)
>      Subsystem: Hewlett-Packard Company iLO4
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 255
>      Region 0: I/O ports at 1200 [size=256]
>      Region 1: Memory at 92a8d000 (32-bit, non-prefetchable) [size=512]
>      Region 2: I/O ports at 1100 [size=256]
>      Capabilities: [78] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [b0] MSI: Enable- Count=1/1 Maskable- 64bit+
>          Address: 0000000000000000  Data: 0000
>      Capabilities: [c0] Express (v2) Legacy Endpoint, MSI 00
>          DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> unlimited, L1 unlimited
>              ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>          LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit
> Latency L0s <4us, L1 <4us
>              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk-
> DLActive- BWMgmt- ABWMgmt-
>          DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-,
> OBFF Not Supported
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
>          LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [100 v2] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Kernel modules: hpwdt
> 
> 01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA
> G200EH (rev 01) (prog-if 00 [VGA controller])
>      Subsystem: Hewlett-Packard Company iLO4
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin B routed to IRQ 16
>      Region 0: Memory at 91000000 (32-bit, prefetchable) [size=16M]
>      Region 1: Memory at 92a88000 (32-bit, non-prefetchable) [size=16K]
>      Region 2: Memory at 92000000 (32-bit, non-prefetchable) [size=8M]
>      [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
>      Capabilities: [a8] Power Management version 3
>          Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>          Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [b0] MSI: Enable- Count=1/1 Maskable- 64bit+
>          Address: 0000000000000000  Data: 0000
>      Capabilities: [c0] Express (v2) Legacy Endpoint, MSI 00
>          DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> unlimited, L1 unlimited
>              ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>          LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit
> Latency L0s <4us, L1 <4us
>              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk-
> DLActive- BWMgmt- ABWMgmt-
>          DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-,
> OBFF Not Supported
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [100 v2] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Kernel driver in use: mgag200
>      Kernel modules: mgag200
> 
> 01:00.2 System peripheral: Hewlett-Packard Company Integrated
> Lights-Out Standard Management Processor Support and Messaging (rev
> 06)
>      Subsystem: Hewlett-Packard Company iLO4
>      Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin B routed to IRQ 16
>      Region 0: I/O ports at 1000 [size=256]
>      Region 1: Memory at 92a8c000 (32-bit, non-prefetchable) [size=256]
>      Region 2: Memory at 92900000 (32-bit, non-prefetchable) [size=1M]
>      Region 3: Memory at 92a00000 (32-bit, non-prefetchable) [size=512K]
>      Region 4: Memory at 92a80000 (32-bit, non-prefetchable) [size=32K]
>      Region 5: Memory at 92800000 (32-bit, non-prefetchable) [size=1M]
>      [virtual] Expansion ROM at 90000000 [disabled] [size=64K]
>      Capabilities: [78] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [b0] MSI: Enable- Count=1/1 Maskable- 64bit+
>          Address: 0000000000000000  Data: 0000
>      Capabilities: [c0] Express (v2) Legacy Endpoint, MSI 00
>          DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> unlimited, L1 unlimited
>              ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>          LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit
> Latency L0s <4us, L1 <4us
>              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk-
> DLActive- BWMgmt- ABWMgmt-
>          DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-,
> OBFF Not Supported
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [100 v2] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Kernel driver in use: hpilo
>      Kernel modules: hpilo
> 
> 01:00.4 USB controller: Hewlett-Packard Company Integrated Lights-Out
> Standard Virtual USB Controller (rev 03) (prog-if 00 [UHCI])
>      Subsystem: Hewlett-Packard Company iLO4
>      Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin B routed to IRQ 16
>      Region 4: I/O ports at 1300 [size=32]
>      Capabilities: [70] MSI: Enable- Count=1/1 Maskable- 64bit+
>          Address: 0000000000000000  Data: 0000
>      Capabilities: [80] Express (v2) Legacy Endpoint, MSI 00
>          DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> unlimited, L1 unlimited
>              ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>          LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit
> Latency L0s <4us, L1 <4us
>              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk-
> DLActive- BWMgmt- ABWMgmt-
>          DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-,
> OBFF Not Supported
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [f0] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [100 v2] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>      Kernel driver in use: uhci_hcd
> 
> 0e:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5
> (rev 03) (prog-if 02 [NVM Express])
>      Subsystem: Intel Corporation Device 390a
>      Physical Slot: 3
>      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 16
>      Region 0: Memory at 7f000000 (64-bit, non-prefetchable) [size=16K]
>      Capabilities: [40] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>          Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [70] Express (v2) Endpoint, MSI 00
>          DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> unlimited, L1 unlimited
>              ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>          DevCtl:    Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>              RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
>              MaxPayload 128 bytes, MaxReadReq 4096 bytes
>          DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>          LnkCap:    Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit
> Latency L0s <1us, L1 <8us
>              ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>              ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 8GT/s, Width x4, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>          DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
> OBFF Via message
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> OBFF Disabled
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete+, EqualizationPhase1+
>               EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
>      Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
>          Vector table: BAR=0 offset=00002000
>          PBA: BAR=0 offset=00002100
>      Capabilities: [100 v2] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>      Capabilities: [158 v1] #19
>      Capabilities: [178 v1] Latency Tolerance Reporting
>          Max snoop latency: 71680ns
>          Max no snoop latency: 71680ns
>      Capabilities: [180 v1] L1 PM Substates
>          L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
> L1_PM_Substates+
>                PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
>      Kernel driver in use: nvme
> 
> 11:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
> [ConnectX-3 Pro]
>      DeviceName: Embedded LOM 1 Port 1
>      Subsystem: Hewlett Packard Enterprise MT27520 Family [ConnectX-3 Pro]
>      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 16
>      Region 0: Memory at 7f100000 (64-bit, non-prefetchable) [size=1M]
>      Region 2: Memory at 2000000000 (64-bit, prefetchable) [size=8M]
>      Capabilities: [40] Power Management version 3
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>          Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>      Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
>          Vector table: BAR=0 offset=0007c000
>          PBA: BAR=0 offset=0007d000
>      Capabilities: [60] Express (v2) Endpoint, MSI 00
>          DevCap:    MaxPayload 512 bytes, PhantFunc 0, Latency L0s
> <64ns, L1 unlimited
>              ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>          DevCtl:    Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
>              MaxPayload 256 bytes, MaxReadReq 512 bytes
>          DevSta:    CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>          LnkCap:    Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit
> Latency L0s unlimited, L1 unlimited
>              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed 8GT/s, Width x8, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
>          DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-,
> OBFF Not Supported
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
>          LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
>               Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
>               Compliance De-emphasis: -6dB
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete+, EqualizationPhase1+
>               EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
>      Capabilities: [c0] Vendor Specific Information: Len=18 <?>
>      Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
>          ARICap:    MFVC- ACS-, Next Function: 0
>          ARICtl:    MFVC- ACS-, Function Group: 0
>      Capabilities: [148 v1] Device Serial Number f4-03-43-03-00-df-aa-59
>      Capabilities: [108 v1] Single Root I/O Virtualization (SR-IOV)
>          IOVCap:    Migration-, Interrupt Message Number: 000
>          IOVCtl:    Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
>          IOVSta:    Migration-
>          Initial VFs: 8, Total VFs: 8, Number of VFs: 1, Function
> Dependency Link: 00
>          VF offset: 1, stride: 1, Device ID: 1004
>          Supported Page Size: 000007ff, System Page Size: 00000001
>          Region 2: Memory at 0000002000800000 (64-bit, prefetchable)
>          VF Migration: offset: 00000000, BIR: 0
>      Capabilities: [154 v2] Advanced Error Reporting
>          UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>          UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>          CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>          CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>          AERCap:    First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>      Capabilities: [18c v1] #19
>      Kernel driver in use: mlx4_core
>      Kernel modules: mlx4_core
> 
> 11:00.1 Ethernet controller: Mellanox Technologies MT27500/MT27520
> Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
>      DeviceName: Embedded LOM 1 Port 2
>      Subsystem: Hewlett Packard Enterprise MT27500/MT27520 Family
> [ConnectX-3/ConnectX-3 Pro Virtual Function]
>      Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR- FastB2B- DisINTx-
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0
>      Region 2: [virtual] Memory at 2000800000 (64-bit, prefetchable) [size=8M]
>      Capabilities: [60] Express (v2) Endpoint, MSI 00
>          DevCap:    MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>              ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset+
>          DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>              RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
>              MaxPayload 128 bytes, MaxReadReq 128 bytes
>          DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>          LnkCap:    Port #0, Speed 8GT/s, Width x8, ASPM not supported,
> Exit Latency L0s <64ns, L1 <1us
>              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
>          LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>          LnkSta:    Speed unknown, Width x0, TrErr- Train- SlotClk-
> DLActive- BWMgmt- ABWMgmt-
>          DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-,
> OBFF Not Supported
>          DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-,
> OBFF Disabled
>          LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
>               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>      Capabilities: [9c] MSI-X: Enable+ Count=256 Masked-
>          Vector table: BAR=2 offset=00002000
>          PBA: BAR=2 offset=00003000
>      Capabilities: [40] Power Management version 0
>          Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>          Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>      Kernel driver in use: mlx4_core
>      Kernel modules: mlx4_core
> 
> ####################
> # Kernel trace log #
> ####################
> 
> Usually I get nothing more than these messages:
> 
> [ 1011.070739] kvm [16361]: vcpu0, guest rIP: 0xffffffff810644d8
> disabled perfctr wrmsr: 0xc2 data 0xffff
> [ 1011.528347] cache_from_obj: Wrong slab cache. kmalloc-256 but
> object is from kmalloc-192
> [ 1011.927642] general protection fault: 0000 [#1] SMP PTI
> [ 1012.185439] cache_from_obj: Wrong slab cache. kmalloc-256 but
> object is from kmalloc-192
> 
> But a few times I've got the full trace log:
> 
> [  108.416627] kvm [7297]: vcpu0, guest rIP: 0xffffffff810644d8
> disabled perfctr wrmsr: 0xc2 data 0xffff
> [  108.868512] cache_from_obj: Wrong slab cache. kmalloc-256 but
> object is from kmalloc-192
> [  108.868517] ------------[ cut here ]------------
> [  108.868521] WARNING: CPU: 1 PID: 16 at
> /build/linux-hwe-4GXcua/linux-hwe-4.13.0/mm/slab.h:377
> kmem_cache_free+0x129/0x1c0
> [  108.868522] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
> ip_set_hash_ip xt_mac xt_physdev vhost_net vhost tap act_police
> cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM iptable_mangle
> ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter
> ip6_tables xt_set ip_set_list_set ip_set_hash_net veth dummy
> beegfs(OE) nf_conntrack_netlink xt_nat xt_tcpudp xt_recent ip_set
> nfnetlink ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace
> sunrpc fscache xt_comment xt_mark netconsole ipt_MASQUERADE
> nf_nat_masquerade_ipv4 xfrm_user iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
> xt_conntrack x_tables nf_nat nf_conntrack libcrc32c br_netfilter 8021q
> garp mrp bridge stp llc bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE)
> iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
> [  108.868553]  ib_uverbs(OE) ib_umad(OE) esp6_offload esp6
> esp4_offload esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE)
> mlx5_core(OE) mlxfw(OE) mlx4_ib(OE) mlx4_en(OE) ib_core(OE) ptp
> pps_core mlx4_core(OE) devlink mlx_compat(OE) ipmi_ssif intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> aes_x86_64 crypto_simd hpilo glue_helper cryptd ipmi_si mei_me
> intel_cstate ipmi_devintf intel_rapl_perf mei ipmi_msghandler shpchp
> acpi_power_meter mac_hid ie31200_edac knem(OE) autofs4 overlay nbd
> i915 mgag200 video ttm i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect sysimgblt fb_sys_fops drm ahci nvme nvme_core libahci
> [last unloaded: devlink]
> [  108.868584] CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G
> OE   4.13.0-36-generic #40~16.04.1-Ubuntu
> [  108.868585] Hardware name: HP ProLiant m710x Server
> Cartridge/ProLiant m710x Server Cartridge, BIOS H07 07/17/2017
> [  108.868586] task: ffff9cf2f9cedf00 task.stack: ffffb477862fc000
> [  108.868587] RIP: 0010:kmem_cache_free+0x129/0x1c0
> [  108.868588] RSP: 0018:ffffb477862ff728 EFLAGS: 00010282
> [  108.868589] RAX: 000000000000004c RBX: ffff9cf2c4ed46e0 RCX: 000000000000001f
> [  108.868590] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000246
> [  108.868590] RBP: ffffb477862ff740 R08: 0000000000000000 R09: 000000000000004c
> [  108.868591] R10: 00000000a0000000 R11: 0000000000000000 R12: ffff9cf300803200
> [  108.868592] R13: ffffffffb8ea3318 R14: ffff9cf2b06cbb84 R15: ffff9cf2bb764000
> [  108.868593] FS:  0000000000000000(0000) GS:ffff9cf341440000(0000)
> knlGS:0000000000000000
> [  108.868594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  108.868594] CR2: 00000000ffffffff CR3: 00000009eb00a002 CR4: 00000000003626e0
> [  108.868600] Call Trace:
> [  108.868604]  ? tun_net_xmit+0x98/0x340
> [  108.868606]  kfree_skbmem+0x59/0x60
> [  108.868608]  kfree_skb+0x3a/0xa0
> [  108.868609]  tun_net_xmit+0x98/0x340
> [  108.868611]  dev_hard_start_xmit+0xa6/0x210
> [  108.868613]  sch_direct_xmit+0xfc/0x1c0
> [  108.868615]  __qdisc_run+0x12a/0x280
> [  108.868617]  __dev_queue_xmit+0x242/0x6a0
> [  108.868620]  ? ebt_do_table+0x58f/0x680 [ebtables]
> [  108.868625]  ? br_port_flags_change+0x20/0x20 [bridge]
> [  108.868626]  dev_queue_xmit+0x10/0x20
> [  108.868628]  ? dev_queue_xmit+0x10/0x20
> [  108.868631]  br_dev_queue_push_xmit+0x7a/0x140 [bridge]
> [  108.868635]  br_forward_finish+0x3d/0xb0 [bridge]
> [  108.868638]  ? br_fdb_offloaded_set+0x50/0x50 [bridge]
> [  108.868698]  __br_forward+0x15e/0x1d0 [bridge]
> [  108.868701]  ? br_dev_queue_push_xmit+0x140/0x140 [bridge]
> [  108.868705]  deliver_clone+0x37/0x50 [bridge]
> [  108.868708]  br_flood+0xfa/0x210 [bridge]
> [  108.868711]  br_handle_frame_finish+0x29e/0x530 [bridge]
> [  108.868714]  br_handle_frame+0x1a4/0x2f0 [bridge]
> [  108.868791]  ? br_pass_frame_up+0x150/0x150 [bridge]
> [  108.868794]  __netif_receive_skb_core+0x342/0xaf0
> [  108.868798]  ? update_load_avg+0x41c/0x590
> [  108.868800]  ? select_task_rq_fair+0x7d2/0xb40
> [  108.868801]  __netif_receive_skb+0x18/0x60
> [  108.868803]  ? __netif_receive_skb+0x18/0x60
> [  108.868804]  netif_receive_skb_internal+0x3f/0x400
> [  108.868806]  ? dev_gro_receive+0x274/0x4a0
> [  108.868807]  napi_gro_frags+0xee/0x230
> [  108.868811]  mlx4_en_process_rx_cq+0xacb/0xea0 [mlx4_en]
> [  108.868814]  mlx4_en_poll_rx_cq+0x64/0x110 [mlx4_en]
> [  108.868815]  net_rx_action+0x24d/0x380
> [  108.868818]  ? __switch_to+0x450/0x540
> [  108.868820]  __do_softirq+0xf2/0x287
> [  108.868823]  run_ksoftirqd+0x29/0x60
> [  108.868832]  smpboot_thread_fn+0x11a/0x170
> [  108.868833]  kthread+0x10c/0x140
> [  108.868835]  ? sort_range+0x30/0x30
> [  108.868836]  ? kthread_create_on_node+0x70/0x70
> [  108.868839]  ret_from_fork+0x35/0x40
> [  108.868840] Code: 00 00 4c 3b a7 d8 00 00 00 0f 84 16 ff ff ff 48
> 8b 4f 60 49 8b 54 24 60 48 c7 c6 a0 f5 63 b9 48 c7 c7 58 1e 8c b9 e8
> 78 b2 eb ff <0f> ff 4c 89 e7 e9 f0 fe ff ff 65 8b 05 be 25 5e 47 89 c0
> 48 0f
> [  108.868861] ---[ end trace 196b820a8fbb4908 ]---
> [  108.868906] BUG: unable to handle kernel NULL pointer dereference
> at           (null)
> [  108.868910] IP: __netif_receive_skb_core+0x26a/0xaf0
> [  108.868911] PGD 0
> [  108.868912] P4D 0
> [  108.868913]
> [  108.868915] Oops: 0000 [#1] SMP PTI
> [  108.868917] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
> ip_set_hash_ip xt_mac xt_physdev vhost_net vhost tap act_police
> cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM iptable_mangle
> ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter
> ip6_tables xt_set ip_set_list_set ip_set_hash_net veth dummy
> beegfs(OE) nf_conntrack_netlink xt_nat xt_tcpudp xt_recent ip_set
> nfnetlink ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace
> sunrpc fscache xt_comment xt_mark netconsole ipt_MASQUERADE
> nf_nat_masquerade_ipv4 xfrm_user iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
> xt_conntrack x_tables nf_nat nf_conntrack libcrc32c br_netfilter 8021q
> garp mrp bridge stp llc bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE)
> iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
> [  108.868970]  ib_uverbs(OE) ib_umad(OE) esp6_offload esp6
> esp4_offload esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE)
> mlx5_core(OE) mlxfw(OE) mlx4_ib(OE) mlx4_en(OE) ib_core(OE) ptp
> pps_core mlx4_core(OE) devlink mlx_compat(OE) ipmi_ssif intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> aes_x86_64 crypto_simd hpilo glue_helper cryptd ipmi_si mei_me
> intel_cstate ipmi_devintf intel_rapl_perf mei ipmi_msghandler shpchp
> acpi_power_meter mac_hid ie31200_edac knem(OE) autofs4 overlay nbd
> i915 mgag200 video ttm i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect sysimgblt fb_sys_fops drm ahci nvme nvme_core libahci
> [last unloaded: devlink]
> [  108.869001] CPU: 1 PID: 16 Comm: ksoftirqd/1 Tainted: G        W
> OE   4.13.0-36-generic #40~16.04.1-Ubuntu
> [  108.869002] Hardware name: HP ProLiant m710x Server
> Cartridge/ProLiant m710x Server Cartridge, BIOS H07 07/17/2017
> [  108.869003] task: ffff9cf2f9cedf00 task.stack: ffffb477862fc000
> [  108.869005] RIP: 0010:__netif_receive_skb_core+0x26a/0xaf0
> [  108.869006] RSP: 0018:ffffb477862ff8c0 EFLAGS: 00010246
> [  108.869007] RAX: ffff9cf2000000cc RBX: ffff9cf2bceae600 RCX: 0000000000000000
> [  108.869008] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9cf2bceae600
> [  108.869008] RBP: ffffb477862ff950 R08: 0000000000000001 R09: 0000000000000022
> [  108.869009] R10: ffffb477862ff970 R11: ffffb477862ff75c R12: ffffffffffffffd8
> [  108.869010] R13: ffff9cf20000003c R14: 0000000000000000 R15: 0000000000000001
> [  108.869011] FS:  0000000000000000(0000) GS:ffff9cf341440000(0000)
> knlGS:0000000000000000
> [  108.869012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  108.869012] CR2: 0000000000000000 CR3: 00000009eb00a002 CR4: 00000000003626e0
> [  108.869013] Call Trace:
> [  108.869059]  ? ebt_do_table+0x58f/0x680 [ebtables]
> [  108.869062]  __netif_receive_skb+0x18/0x60
> [  108.869065]  ? __netif_receive_skb+0x18/0x60
> [  108.869067]  netif_receive_skb_internal+0x3f/0x400
> [  108.869129]  ? br_fdb_offloaded_set+0x50/0x50 [bridge]
> [  108.869131]  netif_receive_skb+0x1c/0x70
> [  108.869219]  br_netif_receive_skb+0x34/0x50 [bridge]
> [  108.869223]  br_pass_frame_up+0xcd/0x150 [bridge]
> [  108.869226]  ? br_port_flags_change+0x20/0x20 [bridge]
> [  108.869229]  br_handle_frame_finish+0x203/0x530 [bridge]
> [  108.869232]  br_handle_frame+0x1a4/0x2f0 [bridge]
> [  108.869235]  ? br_pass_frame_up+0x150/0x150 [bridge]
> [  108.869237]  __netif_receive_skb_core+0x342/0xaf0
> [  108.869240]  __netif_receive_skb+0x18/0x60
> [  108.869242]  ? __netif_receive_skb+0x18/0x60
> [  108.869244]  netif_receive_skb_internal+0x3f/0x400
> [  108.869246]  ? dev_gro_receive+0x274/0x4a0
> [  108.869248]  napi_gro_frags+0xee/0x230
> [  108.869257]  mlx4_en_process_rx_cq+0xacb/0xea0 [mlx4_en]
> [  108.869260]  mlx4_en_poll_rx_cq+0x64/0x110 [mlx4_en]
> [  108.869262]  net_rx_action+0x24d/0x380
> [  108.869264]  ? __switch_to+0x450/0x540
> [  108.869267]  __do_softirq+0xf2/0x287
> [  108.869269]  run_ksoftirqd+0x29/0x60
> [  108.869272]  smpboot_thread_fn+0x11a/0x170
> [  108.869273]  kthread+0x10c/0x140
> [  108.869276]  ? sort_range+0x30/0x30
> [  108.869277]  ? kthread_create_on_node+0x70/0x70
> [  108.869280]  ret_from_fork+0x35/0x40
> [  108.869281] Code: 44 01 03 08 0f 85 8d 03 00 00 f0 ff 83 e4 00 00
> 00 48 8b 73 20 48 8b 42 10 4c 89 e9 48 89 df e8 9d 22 42 00 41 89 c7
> 48 8b 5d 88 <49> 8b 4c 24 28 4c 89 e2 48 8b 43 20 48 8d 71 d8 48 05 90
> 00 00
> [  108.869310] RIP: __netif_receive_skb_core+0x26a/0xaf0 RSP: ffffb477862ff8c0
> [  108.869311] CR2: 0000000000000000
> [  108.869313] ---[ end trace 196b820a8fbb4909 ]---
> [  108.869314] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000028
> [  108.869318] IP: netif_skb_features+0x102/0x250
> [  108.869319] PGD 0
> [  108.870072] P4D 0
> [  108.870076]
> [  108.870078] Oops: 0000 [#2] SMP PTI
> [  108.870079] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
> ip_set_hash_ip xt_mac xt_physdev vhost_net vhost tap act_police
> cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM iptable_mangle
> ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter
> ip6_tables xt_set ip_set_list_set ip_set_hash_net veth dummy
> beegfs(OE) nf_conntrack_netlink xt_nat xt_tcpudp xt_recent ip_set
> nfnetlink ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace
> sunrpc fscache xt_comment xt_mark netconsole ipt_MASQUERADE
> nf_nat_masquerade_ipv4 xfrm_user iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
> xt_conntrack x_tables nf_nat nf_conntrack libcrc32c br_netfilter 8021q
> garp mrp bridge stp llc bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE)
> iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
> [  108.870111]  ib_uverbs(OE) ib_umad(OE) esp6_offload esp6
> esp4_offload esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE)
> mlx5_core(OE) mlxfw(OE) mlx4_ib(OE) mlx4_en(OE) ib_core(OE) ptp
> pps_core mlx4_core(OE) devlink mlx_compat(OE) ipmi_ssif intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> aes_x86_64 crypto_simd hpilo glue_helper cryptd ipmi_si mei_me
> intel_cstate ipmi_devintf intel_rapl_perf mei ipmi_msghandler shpchp
> acpi_power_meter mac_hid ie31200_edac knem(OE) autofs4 overlay nbd
> i915 mgag200 video ttm i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect sysimgblt fb_sys_fops drm ahci nvme nvme_core libahci
> [last unloaded: devlink]
> [  108.870135] CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G      D W  OE
>    4.13.0-36-generic #40~16.04.1-Ubuntu
> [  108.870136] Hardware name: HP ProLiant m710x Server
> Cartridge/ProLiant m710x Server Cartridge, BIOS H07 07/17/2017
> [  108.870137] task: ffff9cf2fa382f80 task.stack: ffffb4778627c000
> [  108.870139] RIP: 0010:netif_skb_features+0x102/0x250
> [  108.870140] RSP: 0018:ffffb4778627f770 EFLAGS: 00010246
> [  108.870141] RAX: 0000000000001000 RBX: ffff9cf2bceae600 RCX: 0000000000000000
> [  108.870142] RDX: 0000000000000000 RSI: ffff9cf20000003c RDI: 0000000000000000
> [  108.870143] RBP: ffffb4778627f790 R08: ffff9cf29c78909c R09: 0000000000000001
> [  108.870143] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
> [  108.870144] R13: ffff9cf2bb764000 R14: ffff9cf2bb764000 R15: ffff9cf2bb764000
> [  108.870147] FS:  0000000000000000(0000) GS:ffff9cf341400000(0000)
> knlGS:0000000000000000
> [  108.870148] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  108.870149] CR2: 0000000000000028 CR3: 00000009eb00a001 CR4: 00000000003626f0
> [  108.870150] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  108.870150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  108.870151] Call Trace:
> [  108.870154]  validate_xmit_skb+0x21/0x2c0
> [  108.870156]  validate_xmit_skb_list+0x43/0x70
> [  108.870158]  sch_direct_xmit+0x16b/0x1c0
> [  108.870161]  __qdisc_run+0x12a/0x280
> [  108.870162]  __dev_queue_xmit+0x242/0x6a0
> [  108.870165]  ? ebt_do_table+0x58f/0x680 [ebtables]
> [  108.870167]  dev_queue_xmit+0x10/0x20
> [  108.870169]  ? dev_queue_xmit+0x10/0x20
> [  108.870173]  br_dev_queue_push_xmit+0x7a/0x140 [bridge]
> [  108.870178]  br_forward_finish+0x3d/0xb0 [bridge]
> [  108.870182]  ? br_fdb_offloaded_set+0x50/0x50 [bridge]
> [  108.870186]  __br_forward+0x15e/0x1d0 [bridge]
> [  108.870190]  ? br_dev_queue_push_xmit+0x140/0x140 [bridge]
> [  108.870194]  deliver_clone+0x37/0x50 [bridge]
> [  108.870198]  br_flood+0xfa/0x210 [bridge]
> [  108.870202]  br_handle_frame_finish+0x29e/0x530 [bridge]
> [  108.870208]  br_handle_frame+0x1a4/0x2f0 [bridge]
> [  108.870212]  ? br_pass_frame_up+0x150/0x150 [bridge]
> [  108.870214]  __netif_receive_skb_core+0x342/0xaf0
> [  108.870215]  ? __build_skb+0x2a/0xe0
> [  108.870217]  __netif_receive_skb+0x18/0x60
> [  108.870219]  ? __netif_receive_skb+0x18/0x60
> [  108.870221]  netif_receive_skb_internal+0x3f/0x400
> [  108.870223]  napi_gro_frags+0xee/0x230
> [  108.870226]  mlx4_en_process_rx_cq+0xacb/0xea0 [mlx4_en]
> [  108.870229]  ? rcu_accelerate_cbs+0x27/0x1b0
> [  108.870232]  mlx4_en_poll_rx_cq+0x64/0x110 [mlx4_en]
> [  108.870233]  net_rx_action+0x24d/0x380
> [  108.870235]  ? rcu_process_callbacks+0xf9/0x4f0
> [  108.870237]  __do_softirq+0xf2/0x287
> [  108.870240]  run_ksoftirqd+0x29/0x60
> [  108.870242]  smpboot_thread_fn+0x11a/0x170
> [  108.870244]  kthread+0x10c/0x140
> [  108.870245]  ? sort_range+0x30/0x30
> [  108.870247]  ? kthread_create_on_node+0x70/0x70
> [  108.870249]  ret_from_fork+0x35/0x40
> [  108.870250] Code: 8e e8 00 00 00 48 ba 80 00 00 00 00 08 00 00 4c
> 89 e7 48 09 ca 48 31 d7 83 e7 08 0f 85 d9 00 00 00 49 21 d4 48 8b 96
> f0 01 00 00 <48> 8b 4a 28 48 85 c9 0f 84 9d 00 00 00 4c 89 e2 48 89 df
> e8 66
> [  108.870274] RIP: netif_skb_features+0x102/0x250 RSP: ffffb4778627f770
> [  108.870274] CR2: 0000000000000028
> [  108.870277] ---[ end trace 196b820a8fbb490a ]---
> [  108.873245] Kernel panic - not syncing: Fatal exception in interrupt
> [  109.895732] Shutting down cpus with NMI
> [  109.895746] Kernel Offset: 0x37800000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [  109.895747] ------------[ cut here ]------------
> [  109.895748] WARNING: CPU: 6 PID: 294 at
> /build/linux-hwe-4GXcua/linux-hwe-4.13.0/kernel/workqueue.c:1513
> __queue_delayed_work+0x1f/0xa0
> [  109.895748] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
> ip_set_hash_ip xt_mac xt_physdev vhost_net vhost tap act_police
> cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM iptable_mangle
> ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter
> ip6_tables xt_set ip_set_list_set ip_set_hash_net veth dummy
> beegfs(OE) nf_conntrack_netlink xt_nat xt_tcpudp xt_recent ip_set
> nfnetlink ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace
> sunrpc fscache xt_comment xt_mark netconsole ipt_MASQUERADE
> nf_nat_masquerade_ipv4 xfrm_user iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
> xt_conntrack x_tables nf_nat nf_conntrack libcrc32c br_netfilter 8021q
> garp mrp bridge stp llc bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE)
> iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
> [  109.895759]  ib_uverbs(OE) ib_umad(OE) esp6_offload esp6
> esp4_offload esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE)
> mlx5_core(OE) mlxfw(OE) mlx4_ib(OE) mlx4_en(OE) ib_core(OE) ptp
> pps_core mlx4_core(OE) devlink mlx_compat(OE) ipmi_ssif intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> aes_x86_64 crypto_simd hpilo glue_helper cryptd ipmi_si mei_me
> intel_cstate ipmi_devintf intel_rapl_perf mei ipmi_msghandler shpchp
> acpi_power_meter mac_hid ie31200_edac knem(OE) autofs4 overlay nbd
> i915 mgag200 video ttm i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect sysimgblt fb_sys_fops drm ahci nvme nvme_core libahci
> [last unloaded: devlink]
> [  109.895769] CPU: 6 PID: 294 Comm: kworker/u16:4 Tainted: G      D W
>   OE   4.13.0-36-generic #40~16.04.1-Ubuntu
> [  109.895769] Hardware name: HP ProLiant m710x Server
> Cartridge/ProLiant m710x Server Cartridge, BIOS H07 07/17/2017
> [  109.895769] Workqueue: events_power_efficient fb_flashcursor
> [  109.895770] task: ffff9cf2ee9d8000 task.stack: ffffb47787098000
> [  109.895770] RIP: 0010:__queue_delayed_work+0x1f/0xa0
> [  109.895770] RSP: 0018:ffffb4778709bb20 EFLAGS: 00010007
> [  109.895771] RAX: 0000000000000046 RBX: 0000000000000046 RCX: 0000000000000000
> [  109.895771] RDX: ffff9cf2c4ed46f8 RSI: ffff9cf300819000 RDI: ffff9cf2c4ed4718
> [  109.895771] RBP: ffffb4778709bb20 R08: 0000000000002000 R09: ffffffffb8fcaab9
> [  109.895772] R10: ffffe38ebfc1fb80 R11: ffff9cf2af495e00 R12: ffff9cf2b07eee00
> [  109.895772] R13: ffff9cf2a6530900 R14: ffff9cf2bb77aa00 R15: ffff9cf2bb764000
> [  109.895772] FS:  0000000000000000(0000) GS:ffff9cf341580000(0000)
> knlGS:0000000000000000
> [  109.895772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  109.895773] CR2: 000000c421ff4000 CR3: 00000009eb00a005 CR4: 00000000003626e0
> [  109.895773] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  109.895773] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  109.895773] Call Trace:
> [  109.895773]  queue_delayed_work_on+0x27/0x40
> [  109.895774]  netpoll_send_skb_on_dev+0xae/0x200
> [  109.895774]  __br_forward+0x1ad/0x1d0 [bridge]
> [  109.895774]  ? skb_clone+0x54/0xa0
> [  109.895774]  ? __skb_clone+0x2e/0x140
> [  109.895774]  deliver_clone+0x37/0x50 [bridge]
> [  109.895775]  br_flood+0x18e/0x210 [bridge]
> [  109.895775]  br_dev_xmit+0x240/0x2b0 [bridge]
> [  109.895775]  netpoll_start_xmit+0x142/0x1d0
> [  109.895775]  ? __alloc_skb+0x5b/0x1d0
> [  109.895775]  netpoll_send_skb_on_dev+0x13d/0x200
> [  109.895776]  netpoll_send_udp+0x2de/0x420
> [  109.895776]  write_msg+0xb2/0xf0 [netconsole]
> [  109.895776]  console_unlock+0x409/0x4f0
> [  109.895776]  ? update_attr.isra.2+0x90/0x90
> [  109.895776]  fb_flashcursor+0x5c/0x110
> [  109.895777]  process_one_work+0x15b/0x410
> [  109.895777]  worker_thread+0x4b/0x460
> [  109.895777]  kthread+0x10c/0x140
> [  109.895777]  ? process_one_work+0x410/0x410
> [  109.895778]  ? kthread_create_on_node+0x70/0x70
> [  109.895778]  ret_from_fork+0x35/0x40
> [  109.895778] Code: a8 fb ff ff 5d c3 66 0f 1f 44 00 00 0f 1f 44 00
> 00 55 48 85 f6 41 89 f8 48 8d 7a 20 48 89 e5 74 5d 48 81 7a 38 60 09
> 8a b8 74 4b <0f> ff 48 83 7a 28 00 75 4e 48 8b 42 08 4c 8d 4a 08 49 39
> c1 75
> [  109.895786] ---[ end trace 196b820a8fbb490b ]---
> [  109.895786] ------------[ cut here ]------------
> [  109.895787] WARNING: CPU: 6 PID: 294 at
> /build/linux-hwe-4GXcua/linux-hwe-4.13.0/kernel/workqueue.c:1515
> __queue_delayed_work+0x85/0xa0
> [  109.895787] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6
> ip_set_hash_ip xt_mac xt_physdev vhost_net vhost tap act_police
> cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM iptable_mangle
> ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter
> ip6_tables xt_set ip_set_list_set ip_set_hash_net veth dummy
> beegfs(OE) nf_conntrack_netlink xt_nat xt_tcpudp xt_recent ip_set
> nfnetlink ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace
> sunrpc fscache xt_comment xt_mark netconsole ipt_MASQUERADE
> nf_nat_masquerade_ipv4 xfrm_user iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
> xt_conntrack x_tables nf_nat nf_conntrack libcrc32c br_netfilter 8021q
> garp mrp bridge stp llc bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE)
> iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
> [  109.895797]  ib_uverbs(OE) ib_umad(OE) esp6_offload esp6
> esp4_offload esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE)
> mlx5_core(OE) mlxfw(OE) mlx4_ib(OE) mlx4_en(OE) ib_core(OE) ptp
> pps_core mlx4_core(OE) devlink mlx_compat(OE) ipmi_ssif intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel
> aes_x86_64 crypto_simd hpilo glue_helper cryptd ipmi_si mei_me
> intel_cstate ipmi_devintf intel_rapl_perf mei ipmi_msghandler shpchp
> acpi_power_meter mac_hid ie31200_edac knem(OE) autofs4 overlay nbd
> i915 mgag200 video ttm i2c_algo_bit drm_kms_helper syscopyarea
> sysfillrect sysimgblt fb_sys_fops drm ahci nvme nvme_core libahci
> [last unloaded: devlink]
> [  109.895807] CPU: 6 PID: 294 Comm: kworker/u16:4 Tainted: G      D W
>   OE   4.13.0-36-generic #40~16.04.1-Ubuntu
> [  109.895807] Hardware name: HP ProLiant m710x Server
> Cartridge/ProLiant m710x Server Cartridge, BIOS H07 07/17/2017
> [  109.895807] Workqueue: events_power_efficient fb_flashcursor
> [  109.895808] task: ffff9cf2ee9d8000 task.stack: ffffb47787098000
> [  109.895808] RIP: 0010:__queue_delayed_work+0x85/0xa0
> [  109.895808] RSP: 0018:ffffb4778709bb20 EFLAGS: 00010006
> [  109.895809] RAX: ffff9cf2bb764000 RBX: 0000000000000046 RCX: 0000000000000000
> [  109.895809] RDX: ffff9cf2c4ed46f8 RSI: ffff9cf300819000 RDI: ffff9cf2c4ed4718
> [  109.895809] RBP: ffffb4778709bb20 R08: 0000000000002000 R09: ffff9cf2c4ed4700
> [  109.895809] R10: ffffe38ebfc1fb80 R11: ffff9cf2af495e00 R12: ffff9cf2b07eee00
> [  109.895810] R13: ffff9cf2a6530900 R14: ffff9cf2bb77aa00 R15: ffff9cf2bb764000
> [  109.895810] FS:  0000000000000000(0000) GS:ffff9cf341580000(0000)
> knlGS:0000000000000000
> [  109.895810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  109.895810] CR2: 000000c421ff4000 CR3: 00000009eb00a005 CR4: 00000000003626e0
> [  109.895811] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  109.895811] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  109.895811] Call Trace:
> [  109.895811]  queue_delayed_work_on+0x27/0x40
> [  109.895811]  netpoll_send_skb_on_dev+0xae/0x200
> [  109.895812]  __br_forward+0x1ad/0x1d0 [bridge]
> [  109.895812]  ? skb_clone+0x54/0xa0
> [  109.895812]  ? __skb_clone+0x2e/0x140
> [  109.895812]  deliver_clone+0x37/0x50 [bridge]
> [  109.895812]  br_flood+0x18e/0x210 [bridge]
> [  109.895813]  br_dev_xmit+0x240/0x2b0 [bridge]
> [  109.895813]  netpoll_start_xmit+0x142/0x1d0
> [  109.895813]  ? __alloc_skb+0x5b/0x1d0
> [  109.895813]  netpoll_send_skb_on_dev+0x13d/0x200
> [  109.895813]  netpoll_send_udp+0x2de/0x420
> [  109.895814]  write_msg+0xb2/0xf0 [netconsole]
> [  109.895814]  console_unlock+0x409/0x4f0
> [  109.895814]  ? update_attr.isra.2+0x90/0x90
> [  109.895814]  fb_flashcursor+0x5c/0x110
> [  109.895814]  process_one_work+0x15b/0x410
> [  109.895815]  worker_thread+0x4b/0x460
> [  109.895815]  kthread+0x10c/0x140
> [  109.895815]  ? process_one_work+0x410/0x410
> [  109.895815]  ? kthread_create_on_node+0x70/0x70
> [  109.895815]  ret_from_fork+0x35/0x40
> [  109.895816] Code: 52 14 06 00 5d c3 44 89 c7 e8 38 fb ff ff 5d c3
> 48 3b 52 40 75 af eb af 0f ff eb 9f 0f ff 48 8b 42 08 4c 8d 4a 08 49
> 39
> [  109.895821] Lost 282 message(s)!
> [  211.038097] Rebooting in 10 seconds..
> 
> 
> Thanks in advance
> - kvaps
> 

^ permalink raw reply

* [bpf-next V4 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-03-22 14:22 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

Change the ndo_xdp_xmit API to take a struct xdp_frame instead of a
struct xdp_buff.  This brings xdp_return_frame and ndo_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++++++++++----------
 drivers/net/tun.c                             |   19 ++++++++++++-------
 drivers/net/virtio_net.c                      |   24 ++++++++++++++----------
 include/linux/netdevice.h                     |    4 ++--
 net/core/filter.c                             |   17 +++++++++++++++--
 5 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e6e9b28ecfba..f78096ed4c86 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2252,7 +2252,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring,
 #define IXGBE_XDP_TX 2
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp);
+			       struct xdp_frame *xdpf);
 
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
@@ -2260,6 +2260,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 {
 	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
+	struct xdp_frame *xdpf;
 	u32 act;
 
 	rcu_read_lock();
@@ -2273,7 +2274,12 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		result = ixgbe_xmit_xdp_ring(adapter, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf)) {
+			result = IXGBE_XDP_CONSUMED;
+			break;
+		}
+		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
@@ -8329,20 +8335,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 }
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
@@ -9995,7 +9996,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct ixgbe_ring *ring;
@@ -10011,7 +10012,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (unlikely(!ring))
 		return -ENXIO;
 
-	err = ixgbe_xmit_xdp_ring(adapter, xdp);
+	err = ixgbe_xmit_xdp_ring(adapter, xdpf);
 	if (err != IXGBE_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a7e42ae1b220..da0402ebc5ce 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1293,18 +1293,13 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_get_stats64	= tun_net_get_stats64,
 };
 
-static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	frame = convert_to_xdp_frame(xdp);
-	if (unlikely(!frame))
-		return -EOVERFLOW;
-
 	rcu_read_lock();
 
 	numqueues = READ_ONCE(tun->numqueues);
@@ -1328,6 +1323,16 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	return ret;
 }
 
+static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return tun_xdp_xmit(dev, frame);
+}
+
 static void tun_xdp_flush(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1675,7 +1680,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		case XDP_TX:
 			get_page(alloc_frag->page);
 			alloc_frag->offset += buflen;
-			if (tun_xdp_xmit(tun->dev, &xdp))
+			if (tun_xdp_tx(tun->dev, &xdp))
 				goto err_redirect;
 			tun_xdp_flush(tun->dev);
 			rcu_read_unlock();
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 479a80339fad..906fcd9ff49b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -416,10 +416,10 @@ static void virtnet_xdp_flush(struct net_device *dev)
 }
 
 static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	struct xdp_frame *xdpf, *xdpf_sent;
+	struct xdp_frame *xdpf_sent;
 	struct send_queue *sq;
 	unsigned int len;
 	unsigned int qp;
@@ -432,10 +432,6 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
 		xdp_return_frame(xdpf_sent);
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	/* virtqueue want to use data area in-front of packet */
 	if (unlikely(xdpf->metasize > 0))
 		return -EOPNOTSUPP;
@@ -460,7 +456,7 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 	return true;
 }
 
-static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct receive_queue *rq = vi->rq;
@@ -474,7 +470,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!xdp_prog)
 		return -ENXIO;
 
-	sent = __virtnet_xdp_xmit(vi, xdp);
+	sent = __virtnet_xdp_xmit(vi, xdpf);
 	if (!sent)
 		return -ENOSPC;
 	return 0;
@@ -575,6 +571,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
 		struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
+		struct xdp_frame *xdpf;
 		struct xdp_buff xdp;
 		void *orig_data;
 		u32 act;
@@ -617,7 +614,10 @@ static struct sk_buff *receive_small(struct net_device *dev,
 			delta = orig_data - xdp.data;
 			break;
 		case XDP_TX:
-			sent = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			sent = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(!sent)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				goto err_xdp;
@@ -709,6 +709,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		struct xdp_frame *xdpf;
 		struct page *xdp_page;
 		struct xdp_buff xdp;
 		void *data;
@@ -773,7 +774,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			break;
 		case XDP_TX:
-			sent = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			sent = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(!sent)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				if (unlikely(xdp_page != page))
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 913b1cc882cf..62d984ac6c7c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1164,7 +1164,7 @@ struct dev_ifalias {
  *	This function is used to set or query state related to XDP on the
  *	netdevice and manage BPF offload. See definition of
  *	enum bpf_netdev_command for details.
- * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
+ * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
  * void (*ndo_xdp_flush)(struct net_device *dev);
@@ -1355,7 +1355,7 @@ struct net_device_ops {
 	int			(*ndo_bpf)(struct net_device *dev,
 					   struct netdev_bpf *bpf);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
-						struct xdp_buff *xdp);
+						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index c86f03fd9ea5..189ae8e4dda3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2724,13 +2724,18 @@ static int __bpf_tx_xdp(struct net_device *dev,
 			struct xdp_buff *xdp,
 			u32 index)
 {
+	struct xdp_frame *xdpf;
 	int err;
 
 	if (!dev->netdev_ops->ndo_xdp_xmit) {
 		return -EOPNOTSUPP;
 	}
 
-	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 	if (err)
 		return err;
 	dev->netdev_ops->ndo_xdp_flush(dev);
@@ -2746,11 +2751,19 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
 		struct net_device *dev = fwd;
+		struct xdp_frame *xdpf;
 
 		if (!dev->netdev_ops->ndo_xdp_xmit)
 			return -EOPNOTSUPP;
 
-		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf))
+			return -EOVERFLOW;
+
+		/* TODO: move to inside map code instead, for bulk support
+		 * err = dev_map_enqueue(dev, xdp);
+		 */
+		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);

^ permalink raw reply related

* [bpf-next V4 PATCH 14/15] xdp: transition into using xdp_frame for return API
From: Jesper Dangaard Brouer @ 2018-03-22 14:22 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

Changing the xdp_return_frame() API to take a struct xdp_frame as its
argument seems like a natural choice. But this is a deliberate choice,
because there are some subtle performance details here that need extra
care.

De-referencing the xdp_frame on a remote CPU during DMA-TX completion
results in the cache-line changing to the "Shared" state. Later, when
the page is reused for RX, this xdp_frame cache-line is written, which
changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-reference the xdp_frame.

It is only the ixgbe driver that had an optimization, in which it could
avoid de-referencing the xdp_frame.  The driver already has a
TX-ring queue, which (in case of remote DMA-TX completion) has to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing the xdp_frame.

To compensate for losing this optimization, a prefetchw is used to tell
the cache coherency protocol about our access pattern.  My benchmarks
show that this prefetchw is enough to compensate in the ixgbe driver.
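
As a rough userspace illustration of the prefetch-for-write idea (not
the driver code), the sketch below issues a write-intent hint on the
headroom cache-line before the frame header is written there.  It
assumes a GCC/Clang toolchain, where __builtin_prefetch(addr, 1, ...)
requests write intent; model_prefetchw, model_frame_hdr and
model_rx_prepare are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the frame header stored in the headroom. */
struct model_frame_hdr {
	uint8_t *data;
	uint16_t len;
};

/* Hint that the cache-line at addr will be written soon.  The second
 * argument (1) asks for write intent, the third (3) for high locality.
 */
static inline void model_prefetchw(const void *addr)
{
	__builtin_prefetch(addr, 1, 3);
}

/* Issue the hint early in RX processing, then perform the write that
 * the hint was for.  hard_start is assumed suitably aligned.
 */
static struct model_frame_hdr *
model_rx_prepare(uint8_t *hard_start, uint8_t *data, uint16_t len)
{
	struct model_frame_hdr *hdr = (struct model_frame_hdr *)hard_start;

	assert(data >= hard_start + sizeof(*hdr));
	model_prefetchw(hard_start);
	/* ... other per-packet RX work would run here ... */
	hdr->data = data;
	hdr->len = len;
	return hdr;
}
```

The hint is only useful when enough independent work sits between the
prefetch and the write; whether it pays off on a given machine has to be
measured, as the benchmarks mentioned above did for ixgbe.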

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h        |    4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    1 +
 drivers/net/tun.c                               |    4 ++--
 drivers/net/virtio_net.c                        |    2 +-
 include/net/xdp.h                               |    2 +-
 kernel/bpf/cpumap.c                             |    6 +++---
 net/core/xdp.c                                  |    4 +++-
 8 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index cbc20f199364..dfbc15a45cb4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -240,8 +240,7 @@ struct ixgbe_tx_buffer {
 	unsigned long time_stamp;
 	union {
 		struct sk_buff *skb;
-		/* XDP uses address ptr on irq_clean */
-		void *data;
+		struct xdp_frame *xdpf;
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
@@ -249,7 +248,6 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
-	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ff069597fccf..e6e9b28ecfba 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
+			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -5787,7 +5788,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8333,16 +8334,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
+	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	len = xdp->data_end - xdp->data;
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
 		return IXGBE_XDP_CONSUMED;
 
-	dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
+	dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
 	if (dma_mapping_error(ring->dev, dma))
 		return IXGBE_XDP_CONSUMED;
 
@@ -8357,8 +8363,7 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
-	tx_buffer->data = xdp->data;
-	tx_buffer->xdp_mem = xdp->rxq->mem;
+	tx_buffer->xdpf = xdpf;
 
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2ac78b88fc3d..00b9b13d9fea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -896,6 +896,7 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 				      di->addr + wi->offset,
 				      0, frag_size,
 				      DMA_FROM_DEVICE);
+	prefetchw(va); /* xdp_frame data area */
 	prefetch(data);
 	wi->offset += frag_size;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 81fddf9cc58f..a7e42ae1b220 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -663,7 +663,7 @@ static void tun_ptr_free(void *ptr)
 	if (tun_is_xdp_frame(ptr)) {
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -2188,7 +2188,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
 		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		struct sk_buff *skb = ptr;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 48c86accd3b8..479a80339fad 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -430,7 +430,7 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
-		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
+		xdp_return_frame(xdpf_sent);
 
 	xdpf = convert_to_xdp_frame(xdp);
 	if (unlikely(!xdpf))
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 98b55eaf8fd7..35aa9825fdd0 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -103,7 +103,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem);
+void xdp_return_frame(struct xdp_frame *xdpf);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index bcdc4dea5ce7..c95b04ec103e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -219,7 +219,7 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 
 	while ((xdpf = ptr_ring_consume(ring)))
 		if (WARN_ON_ONCE(xdpf))
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -275,7 +275,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdpf->data, &xdpf->mem);
+				xdp_return_frame(xdpf);
 				continue;
 			}
 
@@ -578,7 +578,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index fe8e87abc266..6ed3d73a73be 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -294,9 +294,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
 	struct xdp_mem_allocator *xa = NULL;
+	struct xdp_mem_info *mem = &xdpf->mem;
+	void *data = xdpf->data;
 
 	rcu_read_lock();
 	if (mem->id)

^ permalink raw reply related

* [bpf-next V4 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call
From: Jesper Dangaard Brouer @ 2018-03-22 14:22 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

This patch shows how it is possible to keep the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding SKB
put_page, and at the same time have pages returned to the
page_pool from ndo_xdp_xmit DMA completion.

Performance is surprisingly good. Tested DMA-TX completion on ixgbe,
which calls "xdp_return_frame", which in turn calls page_pool_put_page().
Stats show DMA-TX-completion runs on CPU#9 and mlx5 RX runs on CPU#5.
(Internally page_pool uses ptr_ring, which is what gives the good
cross CPU performance).

Show adapter(s) (ixgbe2 mlx5p2) statistics (ONLY that changed!)
Ethtool(ixgbe2  ) stat:    732863573 (    732,863,573) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:    781724427 (    781,724,427) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat:     12214393 (     12,214,393) <= tx_packets /sec
Ethtool(ixgbe2  ) stat:     12214435 (     12,214,435) <= tx_pkts_nic /sec
Ethtool(mlx5p2  ) stat:     12211786 (     12,211,786) <= rx3_cache_empty /sec
Ethtool(mlx5p2  ) stat:     36506736 (     36,506,736) <= rx_64_bytes_phy /sec
Ethtool(mlx5p2  ) stat:   2336430575 (  2,336,430,575) <= rx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:     12211786 (     12,211,786) <= rx_cache_empty /sec
Ethtool(mlx5p2  ) stat:     22823073 (     22,823,073) <= rx_discards_phy /sec
Ethtool(mlx5p2  ) stat:      1471860 (      1,471,860) <= rx_out_of_buffer /sec
Ethtool(mlx5p2  ) stat:     36506715 (     36,506,715) <= rx_packets_phy /sec
Ethtool(mlx5p2  ) stat:   2336542282 (  2,336,542,282) <= rx_prio0_bytes /sec
Ethtool(mlx5p2  ) stat:     13683921 (     13,683,921) <= rx_prio0_packets /sec
Ethtool(mlx5p2  ) stat:    821015537 (    821,015,537) <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:     13683608 (     13,683,608) <= rx_vport_unicast_packets /sec

Before this patch, single-flow performance was 6Mpps, and if I started
two flows the collective performance dropped to 4Mpps, because we hit
the page allocator lock (further negative scaling occurs).

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes to not return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
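
The pool size calculation mentioned in the last item can be sketched in
isolation as follows.  This is a simplified userspace model of the
arithmetic in the mlx5e_alloc_rq() hunk below: one pool entry per RX
descriptor, multiplied by the pages-per-WQE factor for the striding RQ
type.  The model_* names are hypothetical, and pages_per_wqe is passed
in as a parameter because the value of MLX5_MPWRQ_PAGES_PER_WQE is not
stated in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two work-queue types. */
enum model_wq_type {
	MODEL_WQ_LINKED_LIST,
	MODEL_WQ_STRIDING_RQ,
};

/* Base size is 2^log_rq_size entries; a striding RQ consumes several
 * pages per WQE, so its pool is scaled by that factor.
 */
static uint32_t model_pool_size(enum model_wq_type type,
				unsigned int log_rq_size,
				uint32_t pages_per_wqe)
{
	uint32_t pool_size;

	assert(log_rq_size < 32);
	pool_size = UINT32_C(1) << log_rq_size;
	if (type == MODEL_WQ_STRIDING_RQ)
		pool_size *= pages_per_wqe;
	return pool_size;
}
```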

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++++++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 28cc26debeda..ab91166f7c5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -535,6 +537,7 @@ struct mlx5e_rq {
 	/* XDP */
 	struct bpf_prog       *xdp_prog;
 	struct mlx5e_xdpsq     xdpsq;
+	struct page_pool      *page_pool;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2e4ca0f15b62..bf17e6d614d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
 #include <linux/bpf.h>
+#include <net/page_pool.h>
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -387,10 +388,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 			  struct mlx5e_rq_param *rqp,
 			  struct mlx5e_rq *rq)
 {
+	struct page_pool_params pp_params = { 0 };
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-	u32 byte_count;
+	u32 byte_count, pool_size;
 	int npages;
 	int wq_sz;
 	int err;
@@ -429,10 +431,13 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = params->rq_headroom;
+	pool_size = 1 << params->log_rq_size;
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 
+		pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;
+
 		rq->post_wqes = mlx5e_post_rx_mpwqes;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -506,13 +511,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
-	/* This must only be activate for order-0 pages */
-	if (rq->xdp_prog) {
-		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-						 MEM_TYPE_PAGE_ORDER0, NULL);
-		if (err)
-			goto err_rq_wq_destroy;
+	/* Create a page_pool and register it with rxq */
+	pp_params.size      = PAGE_POOL_PARAMS_SIZE;
+	pp_params.order     = rq->buff.page_order;
+	pp_params.dev       = c->pdev;
+	pp_params.nid       = cpu_to_node(c->cpu);
+	pp_params.dma_dir   = rq->buff.map_dir;
+	pp_params.pool_size = pool_size;
+	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
+
+	/* page_pool can be used even when there is no rq->xdp_prog,
+	 * given page_pool does not handle DMA mapping there is no
+	 * required state to clear. And page_pool gracefully handle
+	 * elevated refcnt.
+	 */
+	rq->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(rq->page_pool)) {
+		kfree(rq->wqe.frag_info);
+		err = PTR_ERR(rq->page_pool);
+		rq->page_pool = NULL;
+		goto err_rq_wq_destroy;
 	}
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+					 MEM_TYPE_PAGE_POOL, rq->page_pool);
+	if (err)
+		goto err_rq_wq_destroy;
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
@@ -550,6 +573,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy_rcu(rq->page_pool);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
@@ -563,6 +588,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 		bpf_prog_put(rq->xdp_prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy_rcu(rq->page_pool);
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 6dcc3e8fbd3e..2ac78b88fc3d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -37,6 +37,7 @@
 #include <linux/bpf_trace.h>
 #include <net/busy_poll.h>
 #include <net/ip6_checksum.h>
+#include <net/page_pool.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -221,7 +222,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
 
-	dma_info->page = dev_alloc_pages(rq->buff.page_order);
+	dma_info->page = page_pool_dev_alloc_pages(rq->page_pool);
 	if (unlikely(!dma_info->page))
 		return -ENOMEM;
 
@@ -246,11 +247,16 @@ static inline void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
+	if (likely(recycle)) {
+		if (mlx5e_rx_cache_put(rq, dma_info))
+			return;
 
-	mlx5e_page_dma_unmap(rq, dma_info);
-	put_page(dma_info->page);
+		mlx5e_page_dma_unmap(rq, dma_info);
+		page_pool_recycle_direct(rq->page_pool, dma_info->page);
+	} else {
+		mlx5e_page_dma_unmap(rq, dma_info);
+		put_page(dma_info->page);
+	}
 }
 
 static inline bool mlx5e_page_reuse(struct mlx5e_rq *rq,

^ permalink raw reply related

* [bpf-next V4 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame
From: Jesper Dangaard Brouer @ 2018-03-22 14:22 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the driver's
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver has called xdp_rxq_info_unreg(), it is the
driver's responsibility to free the page_pool, but with an RCU free
call.  This is done easily via the page_pool helper
page_pool_destroy_rcu() (which avoids touching any driver code during
the RCU callback, which could happen after the driver has been
unloaded).
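
The return-path dispatch can be pictured with the userspace model below.
It is only a sketch: a small array indexed by id stands in for the
rhashtable lookup, counters stand in for page_pool_put_page() and
put_page(), and RCU is not modelled at all.  Every model_* name is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum model_mem_type {
	MODEL_MEM_PAGE_SHARED,
	MODEL_MEM_PAGE_ORDER0,
	MODEL_MEM_PAGE_POOL,
};

struct model_mem_info {
	enum model_mem_type type;
	unsigned int id;           /* 0 means "no allocator registered" */
};

struct model_pool {
	unsigned int recycled;     /* pages handed back to this pool */
};

#define MODEL_MAX_ID 8
/* Stands in for the id -> allocator hash table. */
static struct model_pool *model_registry[MODEL_MAX_ID];
/* Counts pages released to the generic allocator instead of a pool. */
static unsigned int model_freed;

static int model_register(unsigned int id, struct model_pool *pool)
{
	if (id == 0 || id >= MODEL_MAX_ID)
		return -1;
	model_registry[id] = pool;
	return 0;
}

static void model_unregister(unsigned int id)
{
	if (id < MODEL_MAX_ID)
		model_registry[id] = NULL;
}

/* Recycle into the registered pool when one is found for this id;
 * otherwise fall back to releasing the page.
 */
static void model_return_frame(const struct model_mem_info *mem)
{
	struct model_pool *pool = NULL;

	assert(mem->id < MODEL_MAX_ID);
	if (mem->id)
		pool = model_registry[mem->id];

	if (mem->type == MODEL_MEM_PAGE_POOL && pool) {
		pool->recycled++;
		return;
	}
	model_freed++;
}
```

The fallback branch matters for the ordering described above: a frame
may still be in flight after the driver has unregistered, in which case
the lookup finds nothing and the page is simply released.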

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h |    3 +++
 net/core/xdp.c    |   23 ++++++++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 859aa9b737fe..98b55eaf8fd7 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -36,6 +36,7 @@
 enum mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
+	MEM_TYPE_PAGE_POOL,
 	MEM_TYPE_MAX,
 };
 
@@ -44,6 +45,8 @@ struct xdp_mem_info {
 	u32 id;
 };
 
+struct page_pool;
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 06a5b39491ad..fe8e87abc266 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/rhashtable.h>
+#include <net/page_pool.h>
 
 #include <net/xdp.h>
 
@@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
 
 struct xdp_mem_allocator {
 	struct xdp_mem_info mem;
-	void *allocator;
+	union {
+		void *allocator;
+		struct page_pool *page_pool;
+	};
 	struct rhash_head node;
 	struct rcu_head rcu;
 };
@@ -74,7 +78,9 @@ void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	/* Allow this ID to be reused */
 	ida_simple_remove(&mem_id_pool, xa->mem.id);
 
-	/* TODO: Depending on allocator type/pointer free resources */
+	/* Notice, driver is expected to free the *allocator,
+	 * e.g. page_pool, and MUST also use RCU free.
+	 */
 
 	/* Poison memory */
 	xa->mem.id = 0xFFFF;
@@ -290,11 +296,21 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
-	struct xdp_mem_allocator *xa;
+	struct xdp_mem_allocator *xa = NULL;
 
 	rcu_read_lock();
 	if (mem->id)
 		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+	if (mem->type == MEM_TYPE_PAGE_POOL) {
+		struct page *page = virt_to_head_page(data);
+
+		if (xa)
+			page_pool_put_page(xa->page_pool, page);
+		else
+			put_page(page);
+		rcu_read_unlock(); /* must not return inside the RCU read section */
+		return;
+	}
 	rcu_read_unlock();
 
 	if (mem->type == MEM_TYPE_PAGE_SHARED) {
@@ -306,6 +322,7 @@ void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 		struct page *page = virt_to_page(data); /* Assumes order0 page*/
 
 		put_page(page);
+		return;
 	}
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);

^ permalink raw reply related

* [bpf-next V4 PATCH 11/15] page_pool: refurbish version of page_pool code
From: Jesper Dangaard Brouer @ 2018-03-22 14:22 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

The ndo_xdp_xmit API needs a fast page recycle mechanism for returning
pages at DMA-TX completion time, one with good cross-CPU performance,
given that DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016.
Adapt the page_pool code to not depend on the page allocator or on
integration into struct page.  The DMA mapping feature is kept,
even though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
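
As a rough userspace illustration of the recycling idea (not part of
this patch; every toy_* name is hypothetical and no locking, DMA or
softirq context is modeled), a pool can keep a small LIFO cache on the
allocation side and a FIFO ring on the return side, and only recycle a
page when it holds the last reference:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_CACHE_SIZE 4	/* stand-in for PP_ALLOC_CACHE_SIZE */
#define TOY_RING_SIZE  8	/* stand-in for the ptr_ring size */

struct toy_page {
	int refcnt;
};

struct toy_pool {
	struct toy_page *cache[TOY_CACHE_SIZE];	/* alloc side, LIFO */
	size_t cache_count;
	struct toy_page *ring[TOY_RING_SIZE];	/* return side, FIFO */
	size_t head, tail, ring_count;
	size_t fresh_allocs;			/* pages taken from malloc() */
};

static struct toy_page *toy_ring_consume(struct toy_pool *p)
{
	struct toy_page *page;

	if (p->ring_count == 0)
		return NULL;
	page = p->ring[p->head];
	p->head = (p->head + 1) % TOY_RING_SIZE;
	p->ring_count--;
	return page;
}

static int toy_ring_produce(struct toy_pool *p, struct toy_page *page)
{
	if (p->ring_count == TOY_RING_SIZE)
		return -1;	/* ring full */
	p->ring[p->tail] = page;
	p->tail = (p->tail + 1) % TOY_RING_SIZE;
	p->ring_count++;
	return 0;
}

struct toy_page *toy_pool_alloc(struct toy_pool *p)
{
	struct toy_page *page;

	/* Fast path: pop from the alloc-side cache */
	if (p->cache_count)
		return p->cache[--p->cache_count];

	/* Cache empty: refill it from the return ring */
	while (p->cache_count < TOY_CACHE_SIZE &&
	       (page = toy_ring_consume(p)))
		p->cache[p->cache_count++] = page;
	if (p->cache_count)
		return p->cache[--p->cache_count];

	/* Slow path: nothing to recycle, do a real allocation */
	page = malloc(sizeof(*page));
	if (!page)
		return NULL;
	page->refcnt = 1;
	p->fresh_allocs++;
	return page;
}

void toy_pool_put(struct toy_pool *p, struct toy_page *page)
{
	if (page->refcnt == 1) {
		/* Last reference: recycle, or free if the ring is full */
		if (toy_ring_produce(p, page) != 0)
			free(page);
		return;
	}
	/* Elevated refcnt: another user still holds the page */
	page->refcnt--;
}
```

With a zero-initialized struct toy_pool, a page handed back via
toy_pool_put() is the one returned by the next toy_pool_alloc(), so no
second malloc() happens.  The real code differs in the details (refill
watermark, ptr_ring locking, DMA unmap on the fallback path).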

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes, don't return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
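
To illustrate why an ERR_PTR-only return simplifies callers, here is a
userspace model of the idiom (a sketch only; the toy_* helpers imitate
the kernel's ERR_PTR/IS_ERR/PTR_ERR from include/linux/err.h, and
demo_pool_create() is a hypothetical stand-in, not page_pool_create()):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno in the (never valid) top of the address space */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-TOY_MAX_ERRNO);
}

struct demo_pool {
	unsigned int size;
};

/* Never returns NULL: either a valid pool or an encoded errno */
struct demo_pool *demo_pool_create(unsigned int size)
{
	struct demo_pool *pool;

	if (size == 0)
		return toy_err_ptr(-EINVAL);

	pool = calloc(1, sizeof(*pool));
	if (!pool)
		return toy_err_ptr(-ENOMEM);

	pool->size = size;
	return pool;
}
```

A caller then needs a single toy_is_err() check and can propagate
toy_ptr_err() directly, instead of handling both NULL and error
pointers.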

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |  133 +++++++++++++++++++
 net/core/Makefile       |    1 
 net/core/page_pool.c    |  329 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 463 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index 000000000000..1ff11e641b2e
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * page_pool.h
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but has fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involves replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call.  Drivers should likely use
+ * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
+ *
+ * If page_pool handles DMA mapping (using page->private), then the API
+ * user is responsible for invoking page_pool_put_page() once.  In case
+ * of elevated refcnt, the DMA state is released, assuming other users
+ * of the page will eventually call put_page().
+ *
+ * If no DMA mapping is done, then it can act as a shim layer that
+ * falls through to alloc_page.  As no state is kept on the page, the
+ * regular put_page() call is sufficient.
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include <linux/mm.h> /* Needed by ptr_ring */
+#include <linux/ptr_ring.h>
+#include <linux/dma-direction.h>
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALL	PP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark are related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Room is kept for more objects because of the XDP_DROP use-case:
+ * XDP_DROP offers the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection.  If the
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE	128
+#define PP_ALLOC_CACHE_REFILL	64
+struct pp_alloc_cache {
+	u32 count ____cacheline_aligned_in_smp;
+	void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+	u32		size; /* caller sets size of struct */
+	unsigned int	order;
+	unsigned long	flags;
+	struct device	*dev; /* device, for DMA pre-mapping purposes */
+	int		nid;  /* NUMA node id to allocate pages from */
+	enum dma_data_direction dma_dir; /* DMA mapping direction */
+	unsigned int	pool_size;
+	char		end_marker[0]; /* must be last struct member */
+};
+#define	PAGE_POOL_PARAMS_SIZE	offsetof(struct page_pool_params, end_marker)
+
+struct page_pool {
+	struct page_pool_params p;
+
+	/*
+	 * Data structure for allocation side
+	 *
+	 * A driver's allocation side usually already performs some kind
+	 * of resource protection.  Piggyback on this protection, and
+	 * require the driver to protect the allocation side.
+	 *
+	 * For NIC drivers this means allocating a page_pool per
+	 * RX-queue, as the RX-queue is already protected by
+	 * Softirq/BH scheduling and napi_schedule.  NAPI scheduling
+	 * guarantees that a single napi_struct will only be scheduled
+	 * on a single CPU (see napi_schedule).
+	 */
+	struct pp_alloc_cache alloc;
+
+	/* Data structure for storing recycled pages.
+	 *
+	 * Returning/freeing pages is more complicated synchronization
+	 * wise, because frees can happen on remote CPUs, with no
+	 * association with the allocation resource.
+	 *
+	 * Use ptr_ring, as it separates consumer and producer
+	 * efficiently, in a way that doesn't bounce cache-lines.
+	 *
+	 * TODO: Implement bulk return pages into this structure.
+	 */
+	struct ptr_ring ring;
+
+	struct rcu_head rcu;
+};
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy_rcu(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_put_page(pool, page, false);
+}
+/* Very limited use-cases allow recycle direct */
+static inline void page_pool_recycle_direct(struct page_pool *pool,
+					    struct page *page)
+{
+	__page_pool_put_page(pool, page, true);
+}
+
+#endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..100a2b3b2a08 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -14,6 +14,7 @@ obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
+obj-y += page_pool.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
new file mode 100644
index 000000000000..04112feb2df6
--- /dev/null
+++ b/net/core/page_pool.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * page_pool.c
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <net/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+int page_pool_init(struct page_pool *pool,
+		   const struct page_pool_params *params)
+{
+	int ring_qsize = 1024; /* Default */
+	int param_copy_sz;
+
+	if (!pool)
+		return -EFAULT;
+
+	/* Note, the struct compat code below was primarily needed when
+	 * page_pool code lived under MM-tree control, given mmots and
+	 * net-next trees progress at very different rates.
+	 *
+	 * Allow kernel devel trees and driver to progress at different rates
+	 */
+	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
+	memset(&pool->p, 0, param_copy_sz);
+	if (params->size < param_copy_sz) {
+		/* Older module calling newer kernel, handled by only
+		 * copying supplied size, and keep remaining params zero
+		 */
+		param_copy_sz = params->size;
+	} else if (params->size > param_copy_sz) {
+		/* Newer module calling older kernel. Need to validate
+		 * no new features were requested.
+		 */
+		unsigned char *addr = (unsigned char *)params + param_copy_sz;
+		unsigned char *end  = (unsigned char *)params + params->size;
+
+		for (; addr < end; addr++) {
+			if (*addr != 0)
+				return -E2BIG;
+		}
+	}
+	memcpy(&pool->p, params, param_copy_sz);
+
+	/* Validate only known flags were used */
+	if (pool->p.flags & ~(PP_FLAG_ALL))
+		return -EINVAL;
+
+	if (pool->p.pool_size)
+		ring_qsize = pool->p.pool_size;
+
+	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
+		return -ENOMEM;
+
+	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
+	 * which is the XDP_TX use-case.
+	 */
+	if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+		return -EINVAL;
+
+	return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+	struct page_pool *pool;
+	int err = 0;
+
+	if (params->size < offsetof(struct page_pool_params, nid)) {
+		WARN(1, "Fix page_pool_params->size code\n");
+		return ERR_PTR(-EBADR);
+	}
+
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	err = page_pool_init(pool, params);
+	if (err < 0) {
+		pr_warn("%s() gave up with errno %d\n", __func__, err);
+		kfree(pool);
+		return ERR_PTR(err);
+	}
+	return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *page;
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Test for safe-context, caller should provide this guarantee */
+	if (likely(in_serving_softirq())) {
+		if (likely(pool->alloc.count)) {
+			/* Fast-path */
+			page = pool->alloc.cache[--pool->alloc.count];
+			return page;
+		}
+		/* Slower-path: Alloc array empty, time to refill
+		 *
+		 * Open-coded bulk ptr_ring consumer.
+		 *
+		 * Discussion: the ring consumer lock is not really
+		 * needed due to the softirq/NAPI protection, but
+		 * later need the ability to reclaim pages on the
+		 * ring. Thus, keeping the locks.
+		 */
+		spin_lock(&r->consumer_lock);
+		while ((page = __ptr_ring_consume(r))) {
+			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+				break;
+			pool->alloc.cache[pool->alloc.count++] = page;
+		}
+		spin_unlock(&r->consumer_lock);
+		return page;
+	}
+
+	/* Slow-path: Get page from locked ring queue */
+	page = ptr_ring_consume(&pool->ring);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
+						 gfp_t _gfp)
+{
+	struct page *page;
+	gfp_t gfp = _gfp;
+	dma_addr_t dma;
+
+	/* We could always set __GFP_COMP, and avoid this branch, as
+	 * prep_new_page() can handle order-0 with __GFP_COMP.
+	 */
+	if (pool->p.order)
+		gfp |= __GFP_COMP;
+
+	/* FUTURE development:
+	 *
+	 * Current slow-path essentially falls back to single page
+	 * allocations, which doesn't improve performance.  This code
+	 * needs bulk allocation support from the page allocator code.
+	 */
+
+	/* Cache was empty, do real allocation */
+	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	if (!page)
+		return NULL;
+
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		goto skip_dma_map;
+
+	/* Setup DMA mapping: use page->private for DMA-addr
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page(pool->p.dev, page, 0,
+			   (PAGE_SIZE << pool->p.order),
+			   pool->p.dma_dir);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	set_page_private(page, dma); /* page->private = dma; */
+
+skip_dma_map:
+	/* A page that was just alloc'ed should/must have refcnt 1. */
+	return page;
+}
+
+/* Replacement for alloc_pages() API calls.  The caller must provide
+ * the synchronization guarantee for the allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct page *page;
+
+	/* Fast-path: Get a page from cache */
+	page = __page_pool_get_cached(pool);
+	if (page)
+		return page;
+
+	/* Slow-path: cache empty, do real allocation */
+	page = __page_pool_alloc_pages_slow(pool, gfp);
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+static void __page_pool_clean_page(struct page_pool *pool,
+				   struct page *page)
+{
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return;
+
+	/* DMA unmap */
+	dma_unmap_page(pool->p.dev, page_private(page),
+		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+	set_page_private(page, 0);
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+	/* An optimization would be to call __free_pages(page, pool->p.order)
+	 * knowing page is not part of page-cache (thus avoiding a
+	 * __page_cache_release() call).
+	 */
+}
+
+bool __page_pool_recycle_into_ring(struct page_pool *pool,
+				   struct page *page)
+{
+	int ret;
+	/* BH protection not needed if current is serving softirq */
+	if (in_serving_softirq())
+		ret = ptr_ring_produce(&pool->ring, page);
+	else
+		ret = ptr_ring_produce_bh(&pool->ring, page);
+
+	return (ret == 0) ? true : false;
+}
+
+/* Only allow direct recycling in special circumstances, into the
+ * alloc side cache.  E.g. during RX-NAPI processing for XDP_DROP use-case.
+ *
+ * Caller must provide appropriate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+				       struct page_pool *pool)
+{
+	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+		return false;
+
+	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	pool->alloc.cache[pool->alloc.count++] = page;
+	return true;
+}
+
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct)
+{
+	/* This allocator is optimized for the XDP mode that uses
+	 * one-frame-per-page, but has fallbacks that act like the
+	 * regular page allocator APIs.
+	 *
+	 * refcnt == 1 means page_pool owns page, and can recycle it.
+	 */
+	if (likely(page_ref_count(page) == 1)) {
+		/* Read barrier done in page_ref_count / READ_ONCE */
+
+		if (allow_direct && in_serving_softirq())
+			if (__page_pool_recycle_direct(page, pool))
+				return;
+
+		if (!__page_pool_recycle_into_ring(pool, page)) {
+			/* Ring full, fall back to freeing the page */
+			__page_pool_return_page(pool, page);
+		}
+		return;
+	}
+	/* Fallback/non-XDP mode: API user has elevated refcnt.
+	 *
+	 * Many drivers split up the page into fragments, and some
+	 * want to keep doing this to save memory and do refcnt based
+	 * recycling. Support this use case too, to ease drivers
+	 * switching between XDP/non-XDP.
+	 *
+	 * In case page_pool maintains the DMA mapping, the API user must
+	 * call page_pool_put_page once.  In this elevated refcnt
+	 * case, the DMA is unmapped/released, as the driver is likely
+	 * doing refcnt based recycle tricks, meaning another process
+	 * will be invoking put_page.
+	 */
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+/* Cleanup and release resources */
+void __page_pool_destroy_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+	struct page *page;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+
+	/* Empty alloc cache, assume caller made sure this is
+	 * no longer in use, and page_pool_alloc_pages() cannot be
+	 * called concurrently.
+	 */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
+
+	/* Empty recycle ring */
+	while ((page = ptr_ring_consume(&pool->ring))) {
+		/* Verify the refcnt invariant of cached pages */
+		if (!(page_ref_count(page) == 1)) {
+			pr_crit("%s() page_pool refcnt %d violation\n",
+				__func__, page_ref_count(page));
+			WARN_ON(1);
+		}
+		__page_pool_return_page(pool, page);
+	}
+	ptr_ring_cleanup(&pool->ring, NULL);
+	kfree(pool);
+}
+
+void page_pool_destroy_rcu(struct page_pool *pool)
+{
+	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
+}
+EXPORT_SYMBOL(page_pool_destroy_rcu);

^ permalink raw reply related

* [bpf-next V4 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
From: Jesper Dangaard Brouer @ 2018-03-22 14:21 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable for creating the ID to
pointer lookup table, because this is faster.
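
The cyclic ID scheme can be pictured with a small userspace model (a
sketch under assumptions: a plain bitmap replaces the IDA, the range is
tiny on purpose, and all demo_* names are hypothetical).  IDs are handed
out in increasing order starting from the last allocation, and the
search restarts from the minimum once when the end of the range is hit:

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_ID_MIN 1
#define DEMO_ID_MAX 8	/* exclusive upper bound, tiny on purpose */

static bool demo_id_used[DEMO_ID_MAX];
static int demo_id_next = DEMO_ID_MIN;

/* First free ID in [start, DEMO_ID_MAX), or -1 when that range is full */
static int demo_id_get_range(int start)
{
	for (int id = start; id < DEMO_ID_MAX; id++) {
		if (!demo_id_used[id]) {
			demo_id_used[id] = true;
			return id;
		}
	}
	return -1;
}

/* Cyclic allocation: continue after the last ID, wrap around once */
int demo_id_cyclic_get(void)
{
	int id = demo_id_get_range(demo_id_next);

	if (id < 0) {
		demo_id_next = DEMO_ID_MIN;
		id = demo_id_get_range(demo_id_next);
		if (id < 0)
			return -1;	/* every ID is in use */
	}
	demo_id_next = id + 1;
	return id;
}

void demo_id_put(int id)
{
	demo_id_used[id] = false;
}
```

The point of the cyclic behavior is that a just-released ID is not
handed out again immediately, which makes a stale ID carried by an
in-flight frame less likely to match a newly registered allocator.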

The problem being solved here is that the xdp_rxq_info pointer
(stored in xdp_buff) cannot be used directly, as the guaranteed
lifetime is too short.  The info is needed on a (potentially) remote
CPU during DMA-TX completion time.  In an xdp_frame the xdp_mem_info
is stored when it is converted from an xdp_buff, which is sufficient
for the simple page refcnt based recycle schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID to pointer map.  The removal and validity of the
allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().
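
A single-threaded userspace sketch of the lookup idea follows (an
illustration only: a fixed array replaces the rhashtable, RCU is not
modeled at all, and every demo_* name is hypothetical).  The frame
carries only a small ID; the return path resolves it to a pointer and
falls back when the allocator has been unregistered in the meantime:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TABLE_SIZE 16

struct demo_allocator {
	int recycled;		/* frames handed back to this allocator */
};

struct demo_mem_info {
	unsigned int type;
	unsigned int id;	/* 0 means "no allocator registered" */
};

static struct demo_allocator *demo_table[DEMO_TABLE_SIZE];

int demo_reg(struct demo_mem_info *mem, unsigned int id,
	     struct demo_allocator *a)
{
	if (id == 0 || id >= DEMO_TABLE_SIZE || demo_table[id])
		return -1;
	demo_table[id] = a;
	mem->id = id;
	return 0;
}

void demo_unreg(struct demo_mem_info *mem)
{
	if (mem->id && mem->id < DEMO_TABLE_SIZE)
		demo_table[mem->id] = NULL;
	mem->id = 0;
}

/* Returns 1 if the frame was recycled, 0 if the fallback path was taken */
int demo_return_frame(const struct demo_mem_info *mem)
{
	struct demo_allocator *a = NULL;

	if (mem->id && mem->id < DEMO_TABLE_SIZE)
		a = demo_table[mem->id];

	if (a) {
		a->recycled++;
		return 1;
	}
	return 0;	/* allocator gone: caller frees the page normally */
}
```

The frame keeps a copy of the mem info, so a frame that is still in
flight after demo_unreg() simply misses in the table instead of
dereferencing a stale allocator pointer.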

It is up for debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
 drivers/net/tun.c                             |    6 +
 drivers/net/virtio_net.c                      |    7 +
 include/net/xdp.h                             |   15 --
 net/core/xdp.c                                |  230 ++++++++++++++++++++++++-
 5 files changed, 248 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 45520eb503ee..ff069597fccf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6360,7 +6360,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	struct device *dev = rx_ring->dev;
 	int orig_node = dev_to_node(dev);
 	int ring_node = -1;
-	int size;
+	int size, err;
 
 	size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
 
@@ -6397,6 +6397,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 			     rx_ring->queue_index) < 0)
 		goto err;
 
+	err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err) {
+		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+		goto err;
+	}
+
 	rx_ring->xdp_prog = adapter->xdp_prog;
 
 	return 0;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6750980d9f30..81fddf9cc58f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -846,6 +846,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 				       tun->dev, tfile->queue_index);
 		if (err < 0)
 			goto out;
+		err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+			goto out;
+		}
 		err = 0;
 	}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6c4220450506..48c86accd3b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1312,6 +1312,13 @@ static int virtnet_open(struct net_device *dev)
 		if (err < 0)
 			return err;
 
+		err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+			return err;
+		}
+
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
diff --git a/include/net/xdp.h b/include/net/xdp.h
index bc0cb97e20dc..859aa9b737fe 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,7 +41,7 @@ enum mem_type {
 
 struct xdp_mem_info {
 	u32 type; /* enum mem_type, but known size type */
-	/* u32 id; will be added later in this patchset */
+	u32 id;
 };
 
 struct xdp_rxq_info {
@@ -100,18 +100,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-static inline
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
-{
-	if (mem->type == MEM_TYPE_PAGE_SHARED)
-		page_frag_free(data);
-
-	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
-		struct page *page = virt_to_page(data); /* Assumes order0 page*/
-
-		put_page(page);
-	}
-}
+void xdp_return_frame(void *data, struct xdp_mem_info *mem);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9eee0c431126..06a5b39491ad 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -5,6 +5,9 @@
  */
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <linux/rhashtable.h>
 
 #include <net/xdp.h>
 
@@ -13,6 +16,99 @@
 #define REG_STATE_UNREGISTERED	0x2
 #define REG_STATE_UNUSED	0x3
 
+DEFINE_IDA(mem_id_pool);
+static DEFINE_MUTEX(mem_id_lock);
+#define MEM_ID_MAX 0xFFFE
+#define MEM_ID_MIN 1
+static int mem_id_next = MEM_ID_MIN;
+
+static bool mem_id_init; /* false */
+static struct rhashtable *mem_id_ht;
+
+struct xdp_mem_allocator {
+	struct xdp_mem_info mem;
+	void *allocator;
+	struct rhash_head node;
+	struct rcu_head rcu;
+};
+
+static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
+{
+	const u32 *k = data;
+	const u32 key = *k;
+
+	BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id)
+		     != sizeof(u32));
+
+	/* Use cyclic increasing ID as direct hash key, see rht_bucket_index */
+	return key << RHT_HASH_RESERVED_SPACE;
+}
+
+static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
+			  const void *ptr)
+{
+	const struct xdp_mem_allocator *xa = ptr;
+	u32 mem_id = *(u32 *)arg->key;
+
+	return xa->mem.id != mem_id;
+}
+
+static const struct rhashtable_params mem_id_rht_params = {
+	.nelem_hint = 64,
+	.head_offset = offsetof(struct xdp_mem_allocator, node),
+	.key_offset  = offsetof(struct xdp_mem_allocator, mem.id),
+	.key_len = FIELD_SIZEOF(struct xdp_mem_allocator, mem.id),
+	.max_size = MEM_ID_MAX,
+	.min_size = 8,
+	.automatic_shrinking = true,
+	.hashfn    = xdp_mem_id_hashfn,
+	.obj_cmpfn = xdp_mem_id_cmp,
+};
+
+void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
+{
+	struct xdp_mem_allocator *xa;
+
+	xa = container_of(rcu, struct xdp_mem_allocator, rcu);
+
+	/* Allow this ID to be reused */
+	ida_simple_remove(&mem_id_pool, xa->mem.id);
+
+	/* TODO: Depending on allocator type/pointer free resources */
+
+	/* Poison memory */
+	xa->mem.id = 0xFFFF;
+	xa->mem.type = 0xF0F0;
+	xa->allocator = (void *)0xDEAD9001;
+
+	kfree(xa);
+}
+
+void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+{
+	struct xdp_mem_allocator *xa;
+	int id = xdp_rxq->mem.id;
+	int err;
+
+	if (id == 0)
+		return;
+
+	mutex_lock(&mem_id_lock);
+
+	xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params);
+	if (!xa) {
+		mutex_unlock(&mem_id_lock);
+		return;
+	}
+
+	err = rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params);
+	WARN_ON(err);
+
+	call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
+
+	mutex_unlock(&mem_id_lock);
+}
+
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 {
 	/* Simplify driver cleanup code paths, allow unreg "unused" */
@@ -21,8 +117,14 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 
 	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
+	__xdp_rxq_info_unreg_mem_model(xdp_rxq);
+
 	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
 	xdp_rxq->dev = NULL;
+
+	/* Reset mem info to defaults */
+	xdp_rxq->mem.id = 0;
+	xdp_rxq->mem.type = 0;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
 
@@ -72,20 +174,138 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
 
+int __mem_id_init_hash_table(void)
+{
+	struct rhashtable *rht;
+	int ret;
+
+	if (unlikely(mem_id_init))
+		return 0;
+
+	rht = kzalloc(sizeof(*rht), GFP_KERNEL);
+	if (!rht)
+		return -ENOMEM;
+
+	ret = rhashtable_init(rht, &mem_id_rht_params);
+	if (ret < 0) {
+		kfree(rht);
+		return ret;
+	}
+	mem_id_ht = rht;
+	smp_mb(); /* mutex lock should provide enough pairing */
+	mem_id_init = true;
+
+	return 0;
+}
+
+/* Allocate a cyclic ID that maps to allocator pointer.
+ * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
+ *
+ * Caller must lock mem_id_lock.
+ */
+static int __mem_id_cyclic_get(gfp_t gfp)
+{
+	int retries = 1;
+	int id;
+
+again:
+	id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX, gfp);
+	if (id < 0) {
+		if (id == -ENOSPC) {
+			/* Cyclic allocator, reset next id */
+			if (retries--) {
+				mem_id_next = MEM_ID_MIN;
+				goto again;
+			}
+		}
+		return id; /* errno */
+	}
+	mem_id_next = id + 1;
+
+	return id;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum mem_type type, void *allocator)
 {
+	struct xdp_mem_allocator *xdp_alloc;
+	gfp_t gfp = GFP_KERNEL;
+	int id, errno, ret;
+	void *ptr;
+
+	if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
+		WARN(1, "Missing register, driver bug");
+		return -EFAULT;
+	}
+
 	if (type >= MEM_TYPE_MAX)
 		return -EINVAL;
 
 	xdp_rxq->mem.type = type;
 
-	if (allocator)
-		return -EOPNOTSUPP;
+	if (!allocator)
+		return 0;
+
+	/* Delay init of rhashtable to save memory if feature isn't used */
+	if (!mem_id_init) {
+		mutex_lock(&mem_id_lock);
+		ret = __mem_id_init_hash_table();
+		mutex_unlock(&mem_id_lock);
+		if (ret < 0) {
+			WARN_ON(1);
+			return ret;
+		}
+	}
+
+	xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
+	if (!xdp_alloc)
+		return -ENOMEM;
+
+	mutex_lock(&mem_id_lock);
+	id = __mem_id_cyclic_get(gfp);
+	if (id < 0) {
+		errno = id;
+		goto err;
+	}
+	xdp_rxq->mem.id = id;
+	xdp_alloc->mem  = xdp_rxq->mem;
+	xdp_alloc->allocator = allocator;
+
+	/* Insert allocator into ID lookup table */
+	ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
+	if (IS_ERR(ptr)) {
+		errno = PTR_ERR(ptr);
+		goto err;
+	}
+
+	mutex_unlock(&mem_id_lock);
 
-	/* TODO: Allocate an ID that maps to allocator pointer
-	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
-	 */
 	return 0;
+err:
+	mutex_unlock(&mem_id_lock);
+	kfree(xdp_alloc);
+	return errno;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
+
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+	struct xdp_mem_allocator *xa;
+
+	rcu_read_lock();
+	if (mem->id)
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+	rcu_read_unlock();
+
+	if (mem->type == MEM_TYPE_PAGE_SHARED) {
+		page_frag_free(data);
+		return;
+	}
+
+	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+		struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+		put_page(page);
+	}
+}
+EXPORT_SYMBOL_GPL(xdp_return_frame);

^ permalink raw reply related

* [bpf-next V4 PATCH 09/15] mlx5: register a memory model when XDP is enabled
From: Jesper Dangaard Brouer @ 2018-03-22 14:21 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enables a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index da94c8cba5ee..2e4ca0f15b62 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -506,6 +506,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
+	/* This must only be activated for order-0 pages */
+	if (rq->xdp_prog) {
+		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+						 MEM_TYPE_PAGE_ORDER0, NULL);
+		if (err)
+			goto err_rq_wq_destroy;
+	}
+
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 

^ permalink raw reply related

* [bpf-next V4 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame
From: Jesper Dangaard Brouer @ 2018-03-22 14:21 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152172842149.20979.12110131083451936498.stgit@firesoul>

The generic xdp_frame format was inspired by cpumap's own internal
xdp_pkt format.  It is now time to convert cpumap over to the generic
xdp_frame format.  The cpumap needs one extra field, dev_rx.
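
For readers unfamiliar with the layout, the conversion stores the frame
descriptor in the packet's own headroom rather than allocating separate
memory.  A simplified userspace sketch (hypothetical demo_* names, no
metadata area and no mem info, so not the actual xdp_frame layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>	/* malloc()/free() for the caller's packet buffer */

struct demo_buff {
	unsigned char *data_hard_start;	/* start of the buffer */
	unsigned char *data;		/* start of packet data */
	unsigned char *data_end;	/* end of packet data */
};

struct demo_frame {
	void *data;
	uint16_t len;
	uint16_t headroom;	/* headroom left after the descriptor */
};

/* Place the descriptor at the top of the headroom; NULL if it won't fit.
 * data_hard_start must be suitably aligned for struct demo_frame.
 */
struct demo_frame *demo_convert(struct demo_buff *b)
{
	ptrdiff_t headroom = b->data - b->data_hard_start;
	struct demo_frame *f;

	if (headroom < (ptrdiff_t)sizeof(*f))
		return NULL;

	f = (struct demo_frame *)b->data_hard_start;
	f->data = b->data;
	f->len = (uint16_t)(b->data_end - b->data);
	f->headroom = (uint16_t)(headroom - sizeof(*f));
	return f;
}
```

For a malloc()'ed 256-byte buffer with 64 bytes of headroom and a
100-byte packet, the returned descriptor sits at the buffer start and
reports len 100; with only 4 bytes of headroom the conversion fails.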

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h   |    1 +
 kernel/bpf/cpumap.c |  100 ++++++++++++++-------------------------------------
 2 files changed, 29 insertions(+), 72 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 13f71a15c79f..bc0cb97e20dc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -68,6 +68,7 @@ struct xdp_frame {
 	 * while mem info is valid on remote CPU.
 	 */
 	struct xdp_mem_info mem;
+	struct net_device *dev_rx; /* used by cpumap */
 };
 
 /* Convert xdp_buff to xdp_frame */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3e4bbcbe3e86..bcdc4dea5ce7 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work)
 	kthread_stop(rcpu->kthread);
 }
 
-/* For now, xdp_pkt is a cpumap internal data structure, with info
- * carried between enqueue to dequeue. It is mapped into the top
- * headroom of the packet, to avoid allocating separate mem.
- */
-struct xdp_pkt {
-	void *data;
-	u16 len;
-	u16 headroom;
-	u16 metasize;
-	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
-	 * while mem info is valid on remote CPU.
-	 */
-	struct xdp_mem_info mem;
-	struct net_device *dev_rx;
-};
-
-/* Convert xdp_buff to xdp_pkt */
-static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
-{
-	struct xdp_pkt *xdp_pkt;
-	int metasize;
-	int headroom;
-
-	/* Assure headroom is available for storing info */
-	headroom = xdp->data - xdp->data_hard_start;
-	metasize = xdp->data - xdp->data_meta;
-	metasize = metasize > 0 ? metasize : 0;
-	if (unlikely((headroom - metasize) < sizeof(*xdp_pkt)))
-		return NULL;
-
-	/* Store info in top of packet */
-	xdp_pkt = xdp->data_hard_start;
-
-	xdp_pkt->data = xdp->data;
-	xdp_pkt->len  = xdp->data_end - xdp->data;
-	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
-	xdp_pkt->metasize = metasize;
-
-	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
-	xdp_pkt->mem = xdp->rxq->mem;
-
-	return xdp_pkt;
-}
-
 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
-					 struct xdp_pkt *xdp_pkt)
+					 struct xdp_frame *xdpf)
 {
 	unsigned int frame_size;
 	void *pkt_data_start;
@@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	 * would be preferred to set frame_size to 2048 or 4096
 	 * depending on the driver.
 	 *   frame_size = 2048;
-	 *   frame_len  = frame_size - sizeof(*xdp_pkt);
+	 *   frame_len  = frame_size - sizeof(*xdp_frame);
 	 *
 	 * Instead, with info avail, skb_shared_info in placed after
 	 * packet len.  This, unfortunately fakes the truesize.
@@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	 * is not at a fixed memory location, with mixed length
 	 * packets, which is bad for cache-line hotness.
 	 */
-	frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
+	frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom +
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
+	pkt_data_start = xdpf->data - xdpf->headroom;
 	skb = build_skb(pkt_data_start, frame_size);
 	if (!skb)
 		return NULL;
 
-	skb_reserve(skb, xdp_pkt->headroom);
-	__skb_put(skb, xdp_pkt->len);
-	if (xdp_pkt->metasize)
-		skb_metadata_set(skb, xdp_pkt->metasize);
+	skb_reserve(skb, xdpf->headroom);
+	__skb_put(skb, xdpf->len);
+	if (xdpf->metasize)
+		skb_metadata_set(skb, xdpf->metasize);
 
 	/* Essential SKB info: protocol and skb->dev */
-	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
+	skb->protocol = eth_type_trans(skb, xdpf->dev_rx);
 
 	/* Optional SKB info, currently missing:
 	 * - HW checksum info		(skb->ip_summed)
@@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
 	 * gracefully and warn once.
 	 */
-	struct xdp_pkt *xdp_pkt;
+	struct xdp_frame *xdpf;
 
-	while ((xdp_pkt = ptr_ring_consume(ring)))
-		if (WARN_ON_ONCE(xdp_pkt))
-			xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+	while ((xdpf = ptr_ring_consume(ring)))
+		if (WARN_ON_ONCE(xdpf))
+			xdp_return_frame(xdpf->data, &xdpf->mem);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -290,7 +246,7 @@ static int cpu_map_kthread_run(void *data)
 	 */
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
 		unsigned int processed = 0, drops = 0, sched = 0;
-		struct xdp_pkt *xdp_pkt;
+		struct xdp_frame *xdpf;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -313,13 +269,13 @@ static int cpu_map_kthread_run(void *data)
 		 * kthread CPU pinned. Lockless access to ptr_ring
 		 * consume side valid as no-resize allowed of queue.
 		 */
-		while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
+		while ((xdpf = __ptr_ring_consume(rcpu->queue))) {
 			struct sk_buff *skb;
 			int ret;
 
-			skb = cpu_map_build_skb(rcpu, xdp_pkt);
+			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+				xdp_return_frame(xdpf->data, &xdpf->mem);
 				continue;
 			}
 
@@ -616,13 +572,13 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 	spin_lock(&q->producer_lock);
 
 	for (i = 0; i < bq->count; i++) {
-		struct xdp_pkt *xdp_pkt = bq->q[i];
+		struct xdp_frame *xdpf = bq->q[i];
 		int err;
 
-		err = __ptr_ring_produce(q, xdp_pkt);
+		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);
+			xdp_return_frame(xdpf->data, &xdpf->mem);
 		}
 		processed++;
 	}
@@ -637,7 +593,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
  * Thus, safe percpu variable access.
  */
-static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
+static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
@@ -648,28 +604,28 @@ static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
 	 * driver to code invoking us to finished, due to driver
 	 * (e.g. ixgbe) recycle tricks based on page-refcnt.
 	 *
-	 * Thus, incoming xdp_pkt is always queued here (else we race
+	 * Thus, incoming xdp_frame is always queued here (else we race
 	 * with another CPU on page-refcnt and remaining driver code).
 	 * Queue time is very short, as driver will invoke flush
 	 * operation, when completing napi->poll call.
 	 */
-	bq->q[bq->count++] = xdp_pkt;
+	bq->q[bq->count++] = xdpf;
 	return 0;
 }
 
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
 		    struct net_device *dev_rx)
 {
-	struct xdp_pkt *xdp_pkt;
+	struct xdp_frame *xdpf;
 
-	xdp_pkt = convert_to_xdp_pkt(xdp);
-	if (unlikely(!xdp_pkt))
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
 		return -EOVERFLOW;
 
 	/* Info needed when constructing SKB on remote CPU */
-	xdp_pkt->dev_rx = dev_rx;
+	xdpf->dev_rx = dev_rx;
 
-	bq_enqueue(rcpu, xdp_pkt);
+	bq_enqueue(rcpu, xdpf);
 	return 0;
 }
 

