Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH v1 net-next 00/15] tc-taprio offload for SJA1105 DSA
From: Gomes, Vinicius @ 2019-09-11  0:45 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: David Miller, f.fainelli@gmail.com, vivien.didelot@gmail.com,
	andrew@lunn.ch, Patel, Vedang, richardcochran@gmail.com,
	Voon, Weifeng, jiri@mellanox.com, m-karicheri2@ti.com,
	Jose.Abreu@synopsys.com, ilias.apalodimas@linaro.org,
	jhs@mojatatu.com, xiyou.wangcong@gmail.com,
	kurt.kanzenbach@linutronix.de, netdev@vger.kernel.org
In-Reply-To: <CA+h21hoQ-DaFGzALVmGo2mDJancUp5Fndc=o0f4LfD_9yaNi0g@mail.gmail.com>

Hi Vladimir,

[...]

> 
> I'll make sure this subtlety is more clearly formulated in the next version of the
> patch.
> 

Ack.

> Actually let me ask you a few questions as well:
> 
> - I'm trying to understand what is the correct use of the tc-mqprio "queues"
> argument. I've only tested it with "1@0 1@1 1@2 1@3 1@4 1@5
> 1@6 1@7", which I believe is equivalent to not specifying it at all? I believe it
> should be interpreted as: "allocate this many netdev queues for each traffic
> class", where "traffic class" means a group of queues having the same priority
> (equal to the traffic class's number), but engaged in a strict priority scheme with
> other groups of queues (traffic classes). Right?

Specifying the "queues" is mandatory, IIRC. Yeah, your reading of those arguments
for you example matches mine.

So you mean, that you only tested situations when only one queue is "open" at a time?
I think this is another good thing to test.

> 
> - DSA can only formally support multi-queue, because its connection to the Linux
> host is through an Ethernet MAC (FIFO). Even if the DSA master netdevice may
> be multi-queue, allocating and separating those queues for each front-panel
> switch port is a task best left to the user/administrator. This means that DSA
> should reject all other "queues" mappings except the trivial one I pointed to
> above?
> 
> - I'm looking at the "tc_mask_to_queue_mask" function that I'm carrying along
> from your initial offload RFC. Are you sure this is the right approach? I don't feel
> a need to translate from traffic class to netdev queues, considering that in the
> general case, a traffic class is a group of queues, and 802.1Qbv doesn't really
> specify that you can gate individual queues from a traffic class. In the software
> implementation you are only looking at netdev_get_prio_tc_map, which is not
> equivalent as far as my understanding goes, but saner.
> Actually 802.1Q-2018 does not really clarify this either. It looks to me like they
> use the term "queue" and "traffic class" interchangeably.
> See two examples below (emphasis mine):

I spent quite a long time thinking about this, still not sure that I got it right. Let me begin
with the objective for that "translation". Scheduled traffic only makes sense when
the whole network shares the same schedule, so, I wanted a way so I minimize the
amount of information of each schedule that's controller dependent, Linux already 
does most of it with the separation of traffic classes and queues (you are right that 
802.1Q is confusing on this), the idea is that the only thing that needs to change from 
one node to another in the network is the "queues" parameter. Because each node might 
have different number of queues, or assign different priorities to different queues.  

So, that's the idea of doing that intermediate "transformation" step: taprio knows about
traffic classes and HW queues, but the driver only knows about HW queues. And unless I made
a mistake, tc_mask_to_queue_mask() should be equivalent to:  

netdev_get_prio_tc_map() + scanning the gatemask for BIT(tc).

(Thinking more about this, I am having a few ideas about ways to simplify software mode :-)

> 
> Q.2 Using gate operations to create protected windows The enhancements for
> scheduled traffic described in 8.6.8.4 allow transmission to be switched on and
> off on a timed basis for each _traffic class_ that is implemented on a port. This
> switching is achieved by means of individual on/off transmission gates
> associated with each _traffic class_ and a list of gate operations that control the
> gates; an individual SetGateStates operation has a time delay parameter that
> indicates the delay after the gate operation is executed until the next operation
> is to occur, and a GateState parameter that defines a vector of up to eight state
> values (open or
> closed) that is to be applied to each gate when the operation is executed. The
> gate operations allow any combination of open/closed states to be defined, and
> the mechanism makes no assumptions about which _traffic classes_ are being
> “protected” and which are “unprotected”; any such assumptions are left to the
> designer of the sequence of gate operations.
> 
> Table 8-7—Gate operations
> The GateState parameter indicates a value, open or closed, for each of the
> Port’s _queues_.
> 
> - What happens with the "clockid" argument now that hardware offload is
> possible? Do we allow "/dev/ptp0" to be specified as input?
> Actually this question is relevant to your txtime-assist mode as well:
> doesn't it assume that there is an implicit phc2sys instance running to keep the
> system time in sync with the PHC?

That's a very interesting question. I think, for now, allowing specifying /dev/ptp* clocks
won't work "always": if the driver or something needs to add a timer to be able to run 
the schedule, it won't be able to use /dev/ptp* clocks (hrtimers and ptp clocks don’t mix).
But for "full" offloads, it should work.

So, you are right, taprio and txtime-assisted (and ETF) require the system clock and phc 
clock to be synchronized, via something like phc2sys.

Hope I got all your questions.

Cheers,
--
Vinicius


^ permalink raw reply

* Re: [PATCH net-next] tcp: force a PSH flag on TSO packets
From: Neal Cardwell @ 2019-09-11  0:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh,
	Yuchung Cheng, Daniel Borkmann, Tariq Toukan
In-Reply-To: <20190910214928.220727-1-edumazet@google.com>

On Tue, Sep 10, 2019 at 5:49 PM Eric Dumazet <edumazet@google.com> wrote:
>
> When tcp sends a TSO packet, adding a PSH flag on it
> reduces the sojourn time of GRO packet in GRO receivers.
>
> This is particularly the case under pressure, since RX queues
> receive packets for many concurrent flows.
>
> A sender can give a hint to GRO engines when it is
> appropriate to flush a super-packet, especially when pacing
> is in the picture, since next packet is probably delayed by
> one ms.
>
> Having less packets in GRO engine reduces chance
> of LRU eviction or inflated RTT, and reduces GRO cost.
>
> We found recently that we must not set the PSH flag on
> individual full-size MSS segments [1] :
>
>  Under pressure (CWR state), we better let the packet sit
>  for a small delay (depending on NAPI logic) so that the
>  ACK packet is delayed, and thus next packet we send is
>  also delayed a bit. Eventually the bottleneck queue can
>  be drained. DCTCP flows with CWND=1 have demonstrated
>  the issue.
>
> This patch allows to slowdown the aggregate traffic without
> involving high resolution timers on senders and/or
> receivers.
>
> It has been used at Google for about four years,
> and has been discussed at various networking conferences.
>
> [1] segments smaller than MSS already have PSH flag set
>     by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
>     has been requested by the user.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
>  net/ipv4/tcp_output.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 42abc9bd687a5fea627cbc7cfa750d022f393d84..fec6d67bfd146dc78f0f25173fd71b8b8cc752fe 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1050,11 +1050,22 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
>         tcb = TCP_SKB_CB(skb);
>         memset(&opts, 0, sizeof(opts));
>
> -       if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
> +       if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
>                 tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
> -       else
> +       } else {
>                 tcp_options_size = tcp_established_options(sk, skb, &opts,
>                                                            &md5);
> +               /* Force a PSH flag on all (GSO) packets to expedite GRO flush
> +                * at receiver : This slightly improve GRO performance.
> +                * Note that we do not force the PSH flag for non GSO packets,
> +                * because they might be sent under high congestion events,
> +                * and in this case it is better to delay the delivery of 1-MSS
> +                * packets and thus the corresponding ACK packet that would
> +                * release the following packet.
> +                */
> +               if (tcp_skb_pcount(skb) > 1)
> +                       tcb->tcp_flags |= TCPHDR_PSH;
> +       }
>         tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
>
>         /* if no packet is in qdisc/device queue, then allow XPS to select

Acked-by: Neal Cardwell <ncardwell@google.com>

Thanks, Eric!

neal

^ permalink raw reply

* Re: Strange routing with VRF and 5.2.7+
From: Ben Greear @ 2019-09-11  1:08 UTC (permalink / raw)
  To: netdev
In-Reply-To: <91749b17-7800-44c0-d137-5242b8ceb819@candelatech.com>

On 9/10/19 3:17 PM, Ben Greear wrote:
> Today we were testing creating 200 virtual station vdevs on ath9k, and using
> VRF for the routing.

Looks like the same issue happens w/out VRF, but there I have oodles of routing
rules, so it is an area ripe for failure.

Will upgrade to 5.2.14+ and retest, and try 4.20 as well....

Thanks,
Ben

> 
> This really slows down the machine in question.
> 
> During the minutes that it takes to bring these up and configure them,
> we loose network connectivity on the management port.
> 
> If I do 'ip route show', it just shows the default route out of eth0, and
> the subnet route.  But, if I try to ping the gateway, I get an ICMP error
> coming back from the gateway of one of the virtual stations (which should be
> safely using VRFs and so not in use when I do a plain 'ping' from the shell).
> 
> I tried running tshark on eth0 in the background and running ping, and it captures
> no packets leaving eth0.
> 
> After some time (and during this time, my various scripts will be (re)configuring
> vrfs and stations and related vrf routing tables and such,
> but should *not* be messing with the main routing table, then suddenly
> things start working again.
> 
> I am curious if anyone has seen anything similar or has suggestions for more
> ways to debug this.  It seems reproducible, but it is a pain to
> debug.
> 
> Thanks,
> Ben
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Re: [PATCH net-next 03/11] net: aquantia: add basic ptp_clock callbacks
From: kbuild test robot @ 2019-09-09 23:27 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: kbuild-all, netdev@vger.kernel.org, richardcochran@gmail.com,
	davem@davemloft.net, Egor Pomozov, Sergey Samoilenko,
	Dmitry Bezrukov, Igor Russkikh
In-Reply-To: <8868449ec5508f498131ee141399149bf801ea94.1568034880.git.igor.russkikh@aquantia.com>

[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]

Hi Igor,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Igor-Russkikh/net-aquantia-PTP-support-for-AQC-devices/20190909-233041
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-11) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: drivers/net/ethernet/aquantia/atlantic/aq_nic.o: in function `aq_nic_update_link_status':
   aq_nic.c:(.text+0xe1): undefined reference to `aq_ptp_clock_init'
   ld: drivers/net/ethernet/aquantia/atlantic/aq_ptp.o: in function `aq_ptp_init':
   aq_ptp.c:(.text+0x367): undefined reference to `aq_ptp_clock_init'
   ld: drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.o: in function `hw_atl_b0_adj_clock_freq':
>> hw_atl_b0.c:(.text+0xad): undefined reference to `__udivdi3'
>> ld: hw_atl_b0.c:(.text+0xc2): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x11b): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x12b): undefined reference to `__divdi3'
>> ld: hw_atl_b0.c:(.text+0x165): undefined reference to `__udivdi3'
   ld: hw_atl_b0.c:(.text+0x177): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x1ca): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x1da): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x247): undefined reference to `__divdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 68904 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 03/11] net: aquantia: add basic ptp_clock callbacks
From: kbuild test robot @ 2019-09-10  2:45 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: kbuild-all, netdev@vger.kernel.org, richardcochran@gmail.com,
	davem@davemloft.net, Egor Pomozov, Sergey Samoilenko,
	Dmitry Bezrukov, Igor Russkikh
In-Reply-To: <8868449ec5508f498131ee141399149bf801ea94.1568034880.git.igor.russkikh@aquantia.com>

[-- Attachment #1: Type: text/plain, Size: 966 bytes --]

Hi Igor,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Igor-Russkikh/net-aquantia-PTP-support-for-AQC-devices/20190909-233041
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-11) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ERROR: "aq_ptp_clock_init" [drivers/net/ethernet/aquantia/atlantic/atlantic.ko] undefined!
>> ERROR: "__udivdi3" [drivers/net/ethernet/aquantia/atlantic/atlantic.ko] undefined!
>> ERROR: "__divdi3" [drivers/net/ethernet/aquantia/atlantic/atlantic.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 69589 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 11/11] net: aquantia: add support for PIN funcs
From: kbuild test robot @ 2019-09-10  2:45 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: kbuild-all, netdev@vger.kernel.org, richardcochran@gmail.com,
	davem@davemloft.net, Egor Pomozov, Sergey Samoilenko,
	Dmitry Bezrukov, Igor Russkikh, Nikita Danilov
In-Reply-To: <b3eeb5dd7d38dab79ded5f4b7cca2a84505c8fac.1568034880.git.igor.russkikh@aquantia.com>

[-- Attachment #1: Type: text/plain, Size: 1575 bytes --]

Hi Igor,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Igor-Russkikh/net-aquantia-PTP-support-for-AQC-devices/20190909-233041
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-11) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: drivers/net/ethernet/aquantia/atlantic/aq_ptp.o: in function `aq_ptp_gpio_feature_enable':
>> aq_ptp.c:(.text+0x7fe): undefined reference to `__umoddi3'
   ld: drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.o: in function `hw_atl_b0_adj_clock_freq':
   hw_atl_b0.c:(.text+0xdd): undefined reference to `__udivdi3'
   ld: hw_atl_b0.c:(.text+0xf2): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x14b): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x15b): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x195): undefined reference to `__udivdi3'
   ld: hw_atl_b0.c:(.text+0x1a7): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x1fa): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x20a): undefined reference to `__divdi3'
   ld: hw_atl_b0.c:(.text+0x277): undefined reference to `__divdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 68904 bytes --]

^ permalink raw reply

* [PATCH v2 net] net: sonic: replace dev_kfree_skb in sonic_send_packet
From: Mao Wenan @ 2019-09-11  1:36 UTC (permalink / raw)
  To: tsbogend, davem; +Cc: netdev, linux-kernel, kernel-janitors, Mao Wenan
In-Reply-To: <a48a6690-eeb4-91d2-bed8-439d14b63e2f@cogentembedded.com>

sonic_send_packet will be processed in irq or non-irq 
context, so it would better use dev_kfree_skb_any
instead of dev_kfree_skb.

Fixes: d9fb9f384292 ("*sonic/natsemi/ns83829: Move the National Semi-conductor drivers")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 v2: change 'none irq' to 'non-irq'.
 drivers/net/ethernet/natsemi/sonic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/natsemi/sonic.c b/drivers/net/ethernet/natsemi/sonic.c
index 18fd62fbfb64..b339125b2f09 100644
--- a/drivers/net/ethernet/natsemi/sonic.c
+++ b/drivers/net/ethernet/natsemi/sonic.c
@@ -233,7 +233,7 @@ static int sonic_send_packet(struct sk_buff *skb, struct net_device *dev)
 	laddr = dma_map_single(lp->device, skb->data, length, DMA_TO_DEVICE);
 	if (!laddr) {
 		pr_err_ratelimited("%s: failed to map tx DMA buffer.\n", dev->name);
-		dev_kfree_skb(skb);
+		dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
 	}
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net 1/2] sctp: remove redundant assignment when call sctp_get_port_local
From: maowenan @ 2019-09-11  1:30 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: vyasevich, nhorman, marcelo.leitner, davem, linux-sctp, netdev,
	linux-kernel, kernel-janitors
In-Reply-To: <20190910192207.GE20699@kadam>



On 2019/9/11 3:22, Dan Carpenter wrote:
> On Tue, Sep 10, 2019 at 09:57:10PM +0300, Dan Carpenter wrote:
>> On Tue, Sep 10, 2019 at 03:13:42PM +0800, Mao Wenan wrote:
>>> There are more parentheses in if clause when call sctp_get_port_local
>>> in sctp_do_bind, and redundant assignment to 'ret'. This patch is to
>>> do cleanup.
>>>
>>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
>>> ---
>>>  net/sctp/socket.c | 3 +--
>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index 9d1f83b10c0a..766b68b55ebe 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -399,9 +399,8 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
>>>  	 * detection.
>>>  	 */
>>>  	addr->v4.sin_port = htons(snum);
>>> -	if ((ret = sctp_get_port_local(sk, addr))) {
>>> +	if (sctp_get_port_local(sk, addr))
>>>  		return -EADDRINUSE;
>>
>> sctp_get_port_local() returns a long which is either 0,1 or a pointer
>> casted to long.  It's not documented what it means and neither of the
>> callers use the return since commit 62208f12451f ("net: sctp: simplify
>> sctp_get_port").
> 
> Actually it was commit 4e54064e0a13 ("sctp: Allow only 1 listening
> socket with SO_REUSEADDR") from 11 years ago.  That patch fixed a bug,
> because before the code assumed that a pointer casted to an int was the
> same as a pointer casted to a long.

commit 4e54064e0a13 treated non-zero return value as unexpected, so the current
cleanup is ok?

> 
> regards,
> dan carpenter
> 
> 
> .
> 


^ permalink raw reply

* Re: [PATCH 0/3] rtlwifi: use generic rtl_evm_db_to_percentage
From: Pkshih @ 2019-09-11  1:31 UTC (permalink / raw)
  To: kvalo@codeaurora.org, straube.linux@gmail.com
  Cc: linux-wireless@vger.kernel.org, davem@davemloft.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190910190422.63378-1-straube.linux@gmail.com>

On Tue, 2019-09-10 at 21:04 +0200, Michael Straube wrote:
> Functions _rtl92{c,d}_evm_db_to_percentage are functionally identical
> to the generic version rtl_evm_db_to percentage. This series converts
> rtl8192ce, rtl8192cu and rtl8192de to use the generic version.
> 
> Michael Straube (3):
>   rtlwifi: rtl8192ce: replace _rtl92c_evm_db_to_percentage with generic
>     version
>   rtlwifi: rtl8192cu: replace _rtl92c_evm_db_to_percentage with generic
>     version
>   rtlwifi: rtl8192de: replace _rtl92d_evm_db_to_percentage with generic
>     version
> 
>  .../wireless/realtek/rtlwifi/rtl8192ce/trx.c  | 23 +------------------
>  .../wireless/realtek/rtlwifi/rtl8192cu/mac.c  | 18 +--------------
>  .../wireless/realtek/rtlwifi/rtl8192de/trx.c  | 18 ++-------------
>  3 files changed, 4 insertions(+), 55 deletions(-)
> 

I checked the generic version and removed functions, and they are indeed
identical. Thanks for your patches.

Acked-by: Ping-Ke Shih <pkshih@realtek.com>



^ permalink raw reply

* Re: [RFC PATCH 3/4] virtio: introudce a mdev based transport
From: Tiwei Bie @ 2019-09-11  1:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, kvm, virtualization, netdev, linux-kernel, kwankhede,
	alex.williamson, cohuck, maxime.coquelin, cunming.liang,
	zhihong.wang, rob.miller, idos, xiao.w.wang, haotian.wang
In-Reply-To: <20190910081935.30516-4-jasowang@redhat.com>

On Tue, Sep 10, 2019 at 04:19:34PM +0800, Jason Wang wrote:
> This path introduces a new mdev transport for virtio. This is used to
> use kernel virtio driver to drive the mediated device that is capable
> of populating virtqueue directly.
> 
> A new virtio-mdev driver will be registered to the mdev bus, when a
> new virtio-mdev device is probed, it will register the device with
> mdev based config ops. This means, unlike the exist hardware
> transport, this is a software transport between mdev driver and mdev
> device. The transport was implemented through:
> 
> - configuration access was implemented through parent_ops->read()/write()
> - vq/config callback was implemented through parent_ops->ioctl()
> 
> This transport is derived from virtio MMIO protocol and was wrote for
> kernel driver. But for the transport itself, but the design goal is to
> be generic enough to support userspace driver (this part will be added
> in the future).
> 
> Note:
> - current mdev assume all the parameter of parent_ops was from
>   userspace. This prevents us from implementing the kernel mdev
>   driver. For a quick POC, this patch just abuse those parameter and
>   assume the mdev device implementation will treat them as kernel
>   pointer. This should be addressed in the formal series by extending
>   mdev_parent_ops.
> - for a quick POC, I just drive the transport from MMIO, I'm pretty
>   there's lot of optimization space for this.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vfio/mdev/Kconfig        |   7 +
>  drivers/vfio/mdev/Makefile       |   1 +
>  drivers/vfio/mdev/virtio_mdev.c  | 500 +++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_mdev.h | 131 ++++++++
>  4 files changed, 639 insertions(+)
>  create mode 100644 drivers/vfio/mdev/virtio_mdev.c
>  create mode 100644 include/uapi/linux/virtio_mdev.h
> 
[...]
> diff --git a/include/uapi/linux/virtio_mdev.h b/include/uapi/linux/virtio_mdev.h
> new file mode 100644
> index 000000000000..8040de6b960a
> --- /dev/null
> +++ b/include/uapi/linux/virtio_mdev.h
> @@ -0,0 +1,131 @@
> +/*
> + * Virtio mediated device driver
> + *
> + * Copyright 2019, Red Hat Corp.
> + *
> + * Based on Virtio MMIO driver by ARM Ltd, copyright ARM Ltd. 2011
> + *
> + * This header is BSD licensed so anyone can use the definitions to implement
> + * compatible drivers/servers.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +#ifndef _LINUX_VIRTIO_MDEV_H
> +#define _LINUX_VIRTIO_MDEV_H
> +
> +#include <linux/interrupt.h>
> +#include <linux/vringh.h>
> +#include <uapi/linux/virtio_net.h>
> +
> +/*
> + * Ioctls
> + */
> +
> +struct virtio_mdev_callback {
> +	irqreturn_t (*callback)(void *);
> +	void *private;
> +};
> +
> +#define VIRTIO_MDEV 0xAF
> +#define VIRTIO_MDEV_SET_VQ_CALLBACK _IOW(VIRTIO_MDEV, 0x00, \
> +					 struct virtio_mdev_callback)
> +#define VIRTIO_MDEV_SET_CONFIG_CALLBACK _IOW(VIRTIO_MDEV, 0x01, \
> +					struct virtio_mdev_callback)
> +
> +#define VIRTIO_MDEV_DEVICE_API_STRING		"virtio-mdev"
> +
> +/*
> + * Control registers
> + */
> +
> +/* Magic value ("virt" string) - Read Only */
> +#define VIRTIO_MDEV_MAGIC_VALUE		0x000
> +
> +/* Virtio device version - Read Only */
> +#define VIRTIO_MDEV_VERSION		0x004
> +
> +/* Virtio device ID - Read Only */
> +#define VIRTIO_MDEV_DEVICE_ID		0x008
> +
> +/* Virtio vendor ID - Read Only */
> +#define VIRTIO_MDEV_VENDOR_ID		0x00c
> +
> +/* Bitmask of the features supported by the device (host)
> + * (32 bits per set) - Read Only */
> +#define VIRTIO_MDEV_DEVICE_FEATURES	0x010
> +
> +/* Device (host) features set selector - Write Only */
> +#define VIRTIO_MDEV_DEVICE_FEATURES_SEL	0x014
> +
> +/* Bitmask of features activated by the driver (guest)
> + * (32 bits per set) - Write Only */
> +#define VIRTIO_MDEV_DRIVER_FEATURES	0x020
> +
> +/* Activated features set selector - Write Only */
> +#define VIRTIO_MDEV_DRIVER_FEATURES_SEL	0x024
> +
> +/* Queue selector - Write Only */
> +#define VIRTIO_MDEV_QUEUE_SEL		0x030
> +
> +/* Maximum size of the currently selected queue - Read Only */
> +#define VIRTIO_MDEV_QUEUE_NUM_MAX	0x034
> +
> +/* Queue size for the currently selected queue - Write Only */
> +#define VIRTIO_MDEV_QUEUE_NUM		0x038
> +
> +/* Ready bit for the currently selected queue - Read Write */
> +#define VIRTIO_MDEV_QUEUE_READY		0x044
> +
> +/* Alignment of virtqueue - Read Only */
> +#define VIRTIO_MDEV_QUEUE_ALIGN		0x048
> +
> +/* Queue notifier - Write Only */
> +#define VIRTIO_MDEV_QUEUE_NOTIFY	0x050
> +
> +/* Device status register - Read Write */
> +#define VIRTIO_MDEV_STATUS		0x060
> +
> +/* Selected queue's Descriptor Table address, 64 bits in two halves */
> +#define VIRTIO_MDEV_QUEUE_DESC_LOW	0x080
> +#define VIRTIO_MDEV_QUEUE_DESC_HIGH	0x084
> +
> +/* Selected queue's Available Ring address, 64 bits in two halves */
> +#define VIRTIO_MDEV_QUEUE_AVAIL_LOW	0x090
> +#define VIRTIO_MDEV_QUEUE_AVAIL_HIGH	0x094
> +
> +/* Selected queue's Used Ring address, 64 bits in two halves */
> +#define VIRTIO_MDEV_QUEUE_USED_LOW	0x0a0
> +#define VIRTIO_MDEV_QUEUE_USED_HIGH	0x0a4
> +
> +/* Configuration atomicity value */
> +#define VIRTIO_MDEV_CONFIG_GENERATION	0x0fc
> +
> +/* The config space is defined by each driver as
> + * the per-driver configuration space - Read Write */
> +#define VIRTIO_MDEV_CONFIG		0x100

IIUC, we can use above registers with virtio-mdev parent's
read()/write() to access the mdev device from kernel driver.
As you suggested, it's a choice to build vhost-mdev on top
of this abstraction as well. But virtio is the frontend
device which lacks some vhost backend features, e.g. get
vring base, set vring base, negotiate vhost features, etc.
So I'm wondering, does it make sense to reserve some space
for vhost-mdev in kernel to do vhost backend specific setups?
Or do you have any other thoughts?

Besides, I'm also wondering, what's the purpose of making
above registers part of UAPI? And if we make them part
of UAPI, do we also need to make them part of virtio spec?

Thanks!
Tiwei

> +
> +#endif
> +
> +
> +/* Ready bit for the currently selected queue - Read Write */
> -- 
> 2.19.1
> 

^ permalink raw reply

* Re: [PATCH net-next 1/7] net: hns3: add ethtool_ops.set_channels support for HNS3 VF driver
From: tanhuazhong @ 2019-09-11  2:01 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski, huangguangbin2
In-Reply-To: <20190910.192516.1686418457520996592.davem@davemloft.net>



On 2019/9/11 1:25, David Miller wrote:
> From: Huazhong Tan <tanhuazhong@huawei.com>
> Date: Tue, 10 Sep 2019 16:58:22 +0800
> 
>> +	/* Set to user value, no larger than max_rss_size. */
>> +	if (kinfo->req_rss_size != kinfo->rss_size && kinfo->req_rss_size &&
>> +	    kinfo->req_rss_size <= max_rss_size) {
>> +		dev_info(&hdev->pdev->dev, "rss changes from %u to %u\n",
>> +			 kinfo->rss_size, kinfo->req_rss_size);
>> +		kinfo->rss_size = kinfo->req_rss_size;
> 
> Please do not emit kernel log messages for normal operations.
> 

Will remove this log in V2.
Thanks.

> .
> 


^ permalink raw reply

* Re: [RFC PATCH 3/4] virtio: introudce a mdev based transport
From: Jason Wang @ 2019-09-11  2:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, kwankhede,
	alex.williamson, cohuck, tiwei.bie, maxime.coquelin,
	cunming.liang, zhihong.wang, rob.miller, idos, xiao.w.wang,
	haotian.wang
In-Reply-To: <20190910094807-mutt-send-email-mst@kernel.org>


On 2019/9/10 下午9:52, Michael S. Tsirkin wrote:
> On Tue, Sep 10, 2019 at 09:13:02PM +0800, Jason Wang wrote:
>> On 2019/9/10 下午6:01, Michael S. Tsirkin wrote:
>>>> +#ifndef _LINUX_VIRTIO_MDEV_H
>>>> +#define _LINUX_VIRTIO_MDEV_H
>>>> +
>>>> +#include <linux/interrupt.h>
>>>> +#include <linux/vringh.h>
>>>> +#include <uapi/linux/virtio_net.h>
>>>> +
>>>> +/*
>>>> + * Ioctls
>>>> + */
>>> Pls add a bit more content here. It's redundant to state these
>>> are ioctls. Much better to document what does each one do.
>>
>> Ok.
>>
>>
>>>> +
>>>> +struct virtio_mdev_callback {
>>>> +	irqreturn_t (*callback)(void *);
>>>> +	void *private;
>>>> +};
>>>> +
>>>> +#define VIRTIO_MDEV 0xAF
>>>> +#define VIRTIO_MDEV_SET_VQ_CALLBACK _IOW(VIRTIO_MDEV, 0x00, \
>>>> +					 struct virtio_mdev_callback)
>>>> +#define VIRTIO_MDEV_SET_CONFIG_CALLBACK _IOW(VIRTIO_MDEV, 0x01, \
>>>> +					struct virtio_mdev_callback)
>>> Function pointer in an ioctl parameter? How does this ever make sense?
>>
>> I admit this is hacky (casting).
>>
>>
>>> And can't we use a couple of registers for this, and avoid ioctls?
>>
>> Yes, how about something like interrupt numbers for each virtqueue and
>> config?
> Should we just reuse VIRTIO_PCI_COMMON_Q_XXX then?


You mean something like VIRTIO_PCI_COMMON_Q_MSIX? Then it becomes a PCI 
transport in fact. And using either MSIX or irq number is actually 
another layer of indirection. So I think we can just write callback 
function and parameter through registers.


>
>
>>>> +
>>>> +#define VIRTIO_MDEV_DEVICE_API_STRING		"virtio-mdev"
>>>> +
>>>> +/*
>>>> + * Control registers
>>>> + */
>>>> +
>>>> +/* Magic value ("virt" string) - Read Only */
>>>> +#define VIRTIO_MDEV_MAGIC_VALUE		0x000
>>>> +
>>>> +/* Virtio device version - Read Only */
>>>> +#define VIRTIO_MDEV_VERSION		0x004
>>>> +
>>>> +/* Virtio device ID - Read Only */
>>>> +#define VIRTIO_MDEV_DEVICE_ID		0x008
>>>> +
>>>> +/* Virtio vendor ID - Read Only */
>>>> +#define VIRTIO_MDEV_VENDOR_ID		0x00c
>>>> +
>>>> +/* Bitmask of the features supported by the device (host)
>>>> + * (32 bits per set) - Read Only */
>>>> +#define VIRTIO_MDEV_DEVICE_FEATURES	0x010
>>>> +
>>>> +/* Device (host) features set selector - Write Only */
>>>> +#define VIRTIO_MDEV_DEVICE_FEATURES_SEL	0x014
>>>> +
>>>> +/* Bitmask of features activated by the driver (guest)
>>>> + * (32 bits per set) - Write Only */
>>>> +#define VIRTIO_MDEV_DRIVER_FEATURES	0x020
>>>> +
>>>> +/* Activated features set selector - Write Only */
>>>> +#define VIRTIO_MDEV_DRIVER_FEATURES_SEL	0x024
>>>> +
>>>> +/* Queue selector - Write Only */
>>>> +#define VIRTIO_MDEV_QUEUE_SEL		0x030
>>>> +
>>>> +/* Maximum size of the currently selected queue - Read Only */
>>>> +#define VIRTIO_MDEV_QUEUE_NUM_MAX	0x034
>>>> +
>>>> +/* Queue size for the currently selected queue - Write Only */
>>>> +#define VIRTIO_MDEV_QUEUE_NUM		0x038
>>>> +
>>>> +/* Ready bit for the currently selected queue - Read Write */
>>>> +#define VIRTIO_MDEV_QUEUE_READY		0x044
>>> Is this same as started?
>>
>> Do you mean "status"?
> I really meant "enabled", didn't remember the correct name.
> As in:  VIRTIO_PCI_COMMON_Q_ENABLE


Yes, it's the same.

Thanks


>
>>>> +
>>>> +/* Alignment of virtqueue - Read Only */
>>>> +#define VIRTIO_MDEV_QUEUE_ALIGN		0x048
>>>> +
>>>> +/* Queue notifier - Write Only */
>>>> +#define VIRTIO_MDEV_QUEUE_NOTIFY	0x050
>>>> +
>>>> +/* Device status register - Read Write */
>>>> +#define VIRTIO_MDEV_STATUS		0x060
>>>> +
>>>> +/* Selected queue's Descriptor Table address, 64 bits in two halves */
>>>> +#define VIRTIO_MDEV_QUEUE_DESC_LOW	0x080
>>>> +#define VIRTIO_MDEV_QUEUE_DESC_HIGH	0x084
>>>> +
>>>> +/* Selected queue's Available Ring address, 64 bits in two halves */
>>>> +#define VIRTIO_MDEV_QUEUE_AVAIL_LOW	0x090
>>>> +#define VIRTIO_MDEV_QUEUE_AVAIL_HIGH	0x094
>>>> +
>>>> +/* Selected queue's Used Ring address, 64 bits in two halves */
>>>> +#define VIRTIO_MDEV_QUEUE_USED_LOW	0x0a0
>>>> +#define VIRTIO_MDEV_QUEUE_USED_HIGH	0x0a4
>>>> +
>>>> +/* Configuration atomicity value */
>>>> +#define VIRTIO_MDEV_CONFIG_GENERATION	0x0fc
>>>> +
>>>> +/* The config space is defined by each driver as
>>>> + * the per-driver configuration space - Read Write */
>>>> +#define VIRTIO_MDEV_CONFIG		0x100
>>> Mixing device and generic config space is what virtio pci did,
>>> caused lots of problems with extensions.
>>> It would be better to reserve much more space.
>>
>> I see, will do this.
>>
>> Thanks
>>
>>
>>>
>>>> +
>>>> +#endif
>>>> +
>>>> +
>>>> +/* Ready bit for the currently selected queue - Read Write */
>>>> -- 
>>>> 2.19.1

^ permalink raw reply

* [PATCH V2 net-next 5/7] net: hns3: modify some logs format
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <1568169639-43658-1-git-send-email-tanhuazhong@huawei.com>

From: Guangbin Huang <huangguangbin2@huawei.com>

The pfc_en and pfc_map need to be displayed in hexadecimal notation,
printing dma address should use %pad, and the end of printed string
needs to be add "\n".

This patch modifies them.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c      | 7 +++++--
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c  | 2 +-
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 2 +-
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
index 5cf4c1e..28961a6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
@@ -166,6 +166,7 @@ static int hns3_dbg_bd_info(struct hnae3_handle *h, const char *cmd_buf)
 	struct hns3_enet_ring *ring;
 	u32 tx_index, rx_index;
 	u32 q_num, value;
+	dma_addr_t addr;
 	int cnt;
 
 	cnt = sscanf(&cmd_buf[8], "%u %u", &q_num, &tx_index);
@@ -194,8 +195,9 @@ static int hns3_dbg_bd_info(struct hnae3_handle *h, const char *cmd_buf)
 	}
 
 	tx_desc = &ring->desc[tx_index];
+	addr = le64_to_cpu(tx_desc->addr);
 	dev_info(dev, "TX Queue Num: %u, BD Index: %u\n", q_num, tx_index);
-	dev_info(dev, "(TX)addr: 0x%llx\n", tx_desc->addr);
+	dev_info(dev, "(TX)addr: %pad\n", &addr);
 	dev_info(dev, "(TX)vlan_tag: %u\n", tx_desc->tx.vlan_tag);
 	dev_info(dev, "(TX)send_size: %u\n", tx_desc->tx.send_size);
 	dev_info(dev, "(TX)vlan_tso: %u\n", tx_desc->tx.type_cs_vlan_tso);
@@ -217,8 +219,9 @@ static int hns3_dbg_bd_info(struct hnae3_handle *h, const char *cmd_buf)
 	rx_index = (cnt == 1) ? value : tx_index;
 	rx_desc	 = &ring->desc[rx_index];
 
+	addr = le64_to_cpu(rx_desc->addr);
 	dev_info(dev, "RX Queue Num: %u, BD Index: %u\n", q_num, rx_index);
-	dev_info(dev, "(RX)addr: 0x%llx\n", rx_desc->addr);
+	dev_info(dev, "(RX)addr: %pad\n", &addr);
 	dev_info(dev, "(RX)l234_info: %u\n", rx_desc->rx.l234_info);
 	dev_info(dev, "(RX)pkt_len: %u\n", rx_desc->rx.pkt_len);
 	dev_info(dev, "(RX)size: %u\n", rx_desc->rx.size);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c
index 816f920..c063301 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c
@@ -342,7 +342,7 @@ static int hclge_ieee_setpfc(struct hnae3_handle *h, struct ieee_pfc *pfc)
 	hdev->tm_info.pfc_en = pfc->pfc_en;
 
 	netif_dbg(h, drv, netdev,
-		  "set pfc: pfc_en=%u, pfc_map=%u, num_tc=%u\n",
+		  "set pfc: pfc_en=%x, pfc_map=%x, num_tc=%u\n",
 		  pfc->pfc_en, pfc_map, hdev->tm_info.num_tc);
 
 	hclge_tm_pfc_info_update(hdev);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 8d4dc1b..bc5bad3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3751,7 +3751,7 @@ static void hclge_reset_event(struct pci_dev *pdev, struct hnae3_handle *handle)
 	else if (time_after(jiffies, (hdev->last_reset_time + 4 * 5 * HZ)))
 		hdev->reset_level = HNAE3_FUNC_RESET;
 
-	dev_info(&hdev->pdev->dev, "received reset event , reset type is %d",
+	dev_info(&hdev->pdev->dev, "received reset event, reset type is %d\n",
 		 hdev->reset_level);
 
 	/* request reset & schedule reset task */
-- 
2.7.4


^ permalink raw reply related

* [PATCH V2 net-next 6/7] net: hns3: check NULL pointer before use
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <1568169639-43658-1-git-send-email-tanhuazhong@huawei.com>

From: Guangbin Huang <huangguangbin2@huawei.com>

This patch checks ops->set_default_reset_request whether is NULL
before using it in function hns3_slot_reset.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 8dbaf36..616cad0 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2006,7 +2006,8 @@ static pci_ers_result_t hns3_slot_reset(struct pci_dev *pdev)
 
 	ops = ae_dev->ops;
 	/* request the reset */
-	if (ops->reset_event && ops->get_reset_level) {
+	if (ops->reset_event && ops->get_reset_level &&
+	    ops->set_default_reset_request) {
 		if (ae_dev->hw_err_reset_req) {
 			reset_type = ops->get_reset_level(ae_dev,
 						&ae_dev->hw_err_reset_req);
-- 
2.7.4


^ permalink raw reply related

* [PATCH V2 net-next 7/7] net: hns3: add some DFX info for reset issue
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <1568169639-43658-1-git-send-email-tanhuazhong@huawei.com>

This patch adds more information for reset DFX. Also, adds some
cleanups to reset info, move reset_fail_cnt into struct
hclge_rst_stats, and modifies some print formats.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 32 ++++++++++++++++------
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 11 ++++----
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |  2 +-
 3 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
index 6dcce48..d0128d7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
@@ -931,22 +931,36 @@ static void hclge_dbg_fd_tcam(struct hclge_dev *hdev)
 
 static void hclge_dbg_dump_rst_info(struct hclge_dev *hdev)
 {
-	dev_info(&hdev->pdev->dev, "PF reset count: %d\n",
+	dev_info(&hdev->pdev->dev, "PF reset count: %u\n",
 		 hdev->rst_stats.pf_rst_cnt);
-	dev_info(&hdev->pdev->dev, "FLR reset count: %d\n",
+	dev_info(&hdev->pdev->dev, "FLR reset count: %u\n",
 		 hdev->rst_stats.flr_rst_cnt);
-	dev_info(&hdev->pdev->dev, "CORE reset count: %d\n",
-		 hdev->rst_stats.core_rst_cnt);
-	dev_info(&hdev->pdev->dev, "GLOBAL reset count: %d\n",
+	dev_info(&hdev->pdev->dev, "GLOBAL reset count: %u\n",
 		 hdev->rst_stats.global_rst_cnt);
-	dev_info(&hdev->pdev->dev, "IMP reset count: %d\n",
+	dev_info(&hdev->pdev->dev, "IMP reset count: %u\n",
 		 hdev->rst_stats.imp_rst_cnt);
-	dev_info(&hdev->pdev->dev, "reset done count: %d\n",
+	dev_info(&hdev->pdev->dev, "reset done count: %u\n",
 		 hdev->rst_stats.reset_done_cnt);
-	dev_info(&hdev->pdev->dev, "HW reset done count: %d\n",
+	dev_info(&hdev->pdev->dev, "HW reset done count: %u\n",
 		 hdev->rst_stats.hw_reset_done_cnt);
-	dev_info(&hdev->pdev->dev, "reset count: %d\n",
+	dev_info(&hdev->pdev->dev, "reset count: %u\n",
 		 hdev->rst_stats.reset_cnt);
+	dev_info(&hdev->pdev->dev, "reset count: %u\n",
+		 hdev->rst_stats.reset_cnt);
+	dev_info(&hdev->pdev->dev, "reset fail count: %u\n",
+		 hdev->rst_stats.reset_fail_cnt);
+	dev_info(&hdev->pdev->dev, "vector0 interrupt enable status: 0x%x\n",
+		 hclge_read_dev(&hdev->hw, HCLGE_MISC_VECTOR_REG_BASE));
+	dev_info(&hdev->pdev->dev, "reset interrupt source: 0x%x\n",
+		 hclge_read_dev(&hdev->hw, HCLGE_MISC_RESET_STS_REG));
+	dev_info(&hdev->pdev->dev, "reset interrupt status: 0x%x\n",
+		 hclge_read_dev(&hdev->hw, HCLGE_MISC_VECTOR_INT_STS));
+	dev_info(&hdev->pdev->dev, "hardware reset status: 0x%x\n",
+		 hclge_read_dev(&hdev->hw, HCLGE_GLOBAL_RESET_REG));
+	dev_info(&hdev->pdev->dev, "handshake status: 0x%x\n",
+		 hclge_read_dev(&hdev->hw, HCLGE_NIC_CSQ_DEPTH_REG));
+	dev_info(&hdev->pdev->dev, "function reset status: 0x%x\n",
+		 hclge_read_dev(&hdev->hw, HCLGE_FUN_RST_ING));
 }
 
 static void hclge_dbg_get_m7_stats_info(struct hclge_dev *hdev)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index bc5bad3..fd7f943 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3547,12 +3547,12 @@ static bool hclge_reset_err_handle(struct hclge_dev *hdev)
 			 "reset failed because new reset interrupt\n");
 		hclge_clear_reset_cause(hdev);
 		return false;
-	} else if (hdev->reset_fail_cnt < MAX_RESET_FAIL_CNT) {
-		hdev->reset_fail_cnt++;
+	} else if (hdev->rst_stats.reset_fail_cnt < MAX_RESET_FAIL_CNT) {
+		hdev->rst_stats.reset_fail_cnt++;
 		set_bit(hdev->reset_type, &hdev->reset_pending);
 		dev_info(&hdev->pdev->dev,
 			 "re-schedule reset task(%d)\n",
-			 hdev->reset_fail_cnt);
+			 hdev->rst_stats.reset_fail_cnt);
 		return true;
 	}
 
@@ -3679,7 +3679,8 @@ static void hclge_reset(struct hclge_dev *hdev)
 	/* ignore RoCE notify error if it fails HCLGE_RESET_MAX_FAIL_CNT - 1
 	 * times
 	 */
-	if (ret && hdev->reset_fail_cnt < HCLGE_RESET_MAX_FAIL_CNT - 1)
+	if (ret &&
+	    hdev->rst_stats.reset_fail_cnt < HCLGE_RESET_MAX_FAIL_CNT - 1)
 		goto err_reset;
 
 	rtnl_lock();
@@ -3695,7 +3696,7 @@ static void hclge_reset(struct hclge_dev *hdev)
 		goto err_reset;
 
 	hdev->last_reset_time = jiffies;
-	hdev->reset_fail_cnt = 0;
+	hdev->rst_stats.reset_fail_cnt = 0;
 	hdev->rst_stats.reset_done_cnt++;
 	ae_dev->reset_type = HNAE3_NONE_RESET;
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 870550f..3e9574a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -659,6 +659,7 @@ struct hclge_rst_stats {
 	u32 global_rst_cnt;	/* the number of GLOBAL */
 	u32 imp_rst_cnt;	/* the number of IMP reset */
 	u32 reset_cnt;		/* the number of reset */
+	u32 reset_fail_cnt;	/* the number of reset fail */
 };
 
 /* time and register status when mac tunnel interruption occur */
@@ -725,7 +726,6 @@ struct hclge_dev {
 	unsigned long reset_request;	/* reset has been requested */
 	unsigned long reset_pending;	/* client rst is pending to be served */
 	struct hclge_rst_stats rst_stats;
-	u32 reset_fail_cnt;
 	u32 fw_version;
 	u16 num_vmdq_vport;		/* Num vmdq vport this PF has set up */
 	u16 num_tqps;			/* Num task queue pairs of this PF */
-- 
2.7.4


^ permalink raw reply related

* [PATCH V2 net-next 4/7] net: hns3: fix port setting handle for fibre port
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <1568169639-43658-1-git-send-email-tanhuazhong@huawei.com>

From: Guangbin Huang <huangguangbin2@huawei.com>

For hardware doesn't support use specified speed and duplex
to negotiate, it's unnecessary to check and modify the port
speed and duplex for fibre port when autoneg is on.

Fixes: 22f48e24a23d ("net: hns3: add autoneg and change speed support for fibre port")
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index f5a681d..680c350 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -726,6 +726,12 @@ static int hns3_check_ksettings_param(const struct net_device *netdev,
 	u8 duplex;
 	int ret;
 
+	/* hw doesn't support use specified speed and duplex to negotiate,
+	 * unnecessary to check them when autoneg on.
+	 */
+	if (cmd->base.autoneg)
+		return 0;
+
 	if (ops->get_ksettings_an_result) {
 		ops->get_ksettings_an_result(handle, &autoneg, &speed, &duplex);
 		if (cmd->base.autoneg == autoneg && cmd->base.speed == speed &&
@@ -787,6 +793,15 @@ static int hns3_set_link_ksettings(struct net_device *netdev,
 			return ret;
 	}
 
+	/* hw doesn't support use specified speed and duplex to negotiate,
+	 * ignore them when autoneg on.
+	 */
+	if (cmd->base.autoneg) {
+		netdev_info(netdev,
+			    "autoneg is on, ignore the speed and duplex\n");
+		return 0;
+	}
+
 	if (ops->cfg_mac_speed_dup_h)
 		ret = ops->cfg_mac_speed_dup_h(handle, cmd->base.speed,
 					       cmd->base.duplex);
-- 
2.7.4


^ permalink raw reply related

* [PATCH V2 net-next 0/7] net: hns3: add a feature & bugfixes & cleanups
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski

This patch-set includes a VF feature, bugfixes and cleanups for the HNS3
ethernet controller driver.

[patch 01/07] adds ethtool_ops.set_channels support for HNS3 VF driver

[patch 02/07] adds a recovery for setting channel fail.

[patch 03/07] fixes an error related to shaper parameter algorithm.

[patch 04/07] fixes an error related to ksetting.

[patch 05/07] adds cleanups for some log pinting.

[patch 06/07] adds a NULL pointer check before function calling.

[patch 07/07] adds some debugging information for reset issue.

Change log:
V1->V2: fixes comment from David Miller.

Guangbin Huang (4):
  net: hns3: add ethtool_ops.set_channels support for HNS3 VF driver
  net: hns3: fix port setting handle for fibre port
  net: hns3: modify some logs format
  net: hns3: check NULL pointer before use

Huazhong Tan (1):
  net: hns3: add some DFX info for reset issue

Peng Li (1):
  net: hns3: revert to old channel when setting new channel num fail

Yonglong Liu (1):
  net: hns3: fix shaper parameter algorithm

 drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c |  7 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c    | 54 ++++++++++----
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 16 +++++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c |  2 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 32 ++++++---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c    | 13 ++--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h    |  2 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c  | 11 ++-
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 83 ++++++++++++++++++++--
 9 files changed, 174 insertions(+), 46 deletions(-)

-- 
2.7.4


^ permalink raw reply

* [PATCH V2 net-next 1/7] net: hns3: add ethtool_ops.set_channels support for HNS3 VF driver
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <1568169639-43658-1-git-send-email-tanhuazhong@huawei.com>

From: Guangbin Huang <huangguangbin2@huawei.com>

This patch adds ethtool_ops.set_channels support for HNS3 VF driver,
and updates related TQP information and RSS information, to support
modification of VF TQP number, and uses current rss_size instead of
max_rss_size to initialize RSS.

Also, fixes a format error in hclgevf_get_rss().

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  1 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 83 ++++++++++++++++++++--
 2 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index aa692b1..f5a681d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -1397,6 +1397,7 @@ static const struct ethtool_ops hns3vf_ethtool_ops = {
 	.set_rxfh = hns3_set_rss,
 	.get_link_ksettings = hns3_get_link_ksettings,
 	.get_channels = hns3_get_channels,
+	.set_channels = hns3_set_channels,
 	.get_coalesce = hns3_get_coalesce,
 	.set_coalesce = hns3_set_coalesce,
 	.get_regs_len = hns3_get_regs_len,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index 594cae8..e3090b3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -743,7 +743,7 @@ static int hclgevf_get_rss(struct hnae3_handle *handle, u32 *indir, u8 *key,
 }
 
 static int hclgevf_set_rss(struct hnae3_handle *handle, const u32 *indir,
-			   const  u8 *key, const  u8 hfunc)
+			   const u8 *key, const u8 hfunc)
 {
 	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
 	struct hclgevf_rss_cfg *rss_cfg = &hdev->rss_cfg;
@@ -2060,9 +2060,10 @@ static int hclgevf_config_gro(struct hclgevf_dev *hdev, bool en)
 static int hclgevf_rss_init_hw(struct hclgevf_dev *hdev)
 {
 	struct hclgevf_rss_cfg *rss_cfg = &hdev->rss_cfg;
-	int i, ret;
+	int ret;
+	u32 i;
 
-	rss_cfg->rss_size = hdev->rss_size_max;
+	rss_cfg->rss_size = hdev->nic.kinfo.rss_size;
 
 	if (hdev->pdev->revision >= 0x21) {
 		rss_cfg->hash_algo = HCLGEVF_RSS_HASH_ALGO_SIMPLE;
@@ -2099,13 +2100,13 @@ static int hclgevf_rss_init_hw(struct hclgevf_dev *hdev)
 
 	/* Initialize RSS indirect table */
 	for (i = 0; i < HCLGEVF_RSS_IND_TBL_SIZE; i++)
-		rss_cfg->rss_indirection_tbl[i] = i % hdev->rss_size_max;
+		rss_cfg->rss_indirection_tbl[i] = i % rss_cfg->rss_size;
 
 	ret = hclgevf_set_rss_indir_table(hdev);
 	if (ret)
 		return ret;
 
-	return hclgevf_set_rss_tc_mode(hdev, hdev->rss_size_max);
+	return hclgevf_set_rss_tc_mode(hdev, rss_cfg->rss_size);
 }
 
 static int hclgevf_init_vlan_config(struct hclgevf_dev *hdev)
@@ -2835,6 +2836,77 @@ static void hclgevf_get_tqps_and_rss_info(struct hnae3_handle *handle,
 	*max_rss_size = hdev->rss_size_max;
 }
 
+static void hclgevf_update_rss_size(struct hnae3_handle *handle,
+				    u32 new_tqps_num)
+{
+	struct hnae3_knic_private_info *kinfo = &handle->kinfo;
+	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+	u16 max_rss_size;
+
+	kinfo->req_rss_size = new_tqps_num;
+
+	max_rss_size = min_t(u16, hdev->rss_size_max,
+			     hdev->num_tqps / kinfo->num_tc);
+
+	/* Use the user's configuration when it is not larger than
+	 * max_rss_size, otherwise, use the maximum specification value.
+	 */
+	if (kinfo->req_rss_size != kinfo->rss_size && kinfo->req_rss_size &&
+	    kinfo->req_rss_size <= max_rss_size)
+		kinfo->rss_size = kinfo->req_rss_size;
+	else if (kinfo->rss_size > max_rss_size ||
+		 (!kinfo->req_rss_size && kinfo->rss_size < max_rss_size))
+		kinfo->rss_size = max_rss_size;
+
+	kinfo->num_tqps = kinfo->num_tc * kinfo->rss_size;
+}
+
+static int hclgevf_set_channels(struct hnae3_handle *handle, u32 new_tqps_num,
+				bool rxfh_configured)
+{
+	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
+	struct hnae3_knic_private_info *kinfo = &handle->kinfo;
+	u16 cur_rss_size = kinfo->rss_size;
+	u16 cur_tqps = kinfo->num_tqps;
+	u32 *rss_indir;
+	unsigned int i;
+	int ret;
+
+	hclgevf_update_rss_size(handle, new_tqps_num);
+
+	ret = hclgevf_set_rss_tc_mode(hdev, kinfo->rss_size);
+	if (ret)
+		return ret;
+
+	/* RSS indirection table has been configuared by user */
+	if (rxfh_configured)
+		goto out;
+
+	/* Reinitializes the rss indirect table according to the new RSS size */
+	rss_indir = kcalloc(HCLGEVF_RSS_IND_TBL_SIZE, sizeof(u32), GFP_KERNEL);
+	if (!rss_indir)
+		return -ENOMEM;
+
+	for (i = 0; i < HCLGEVF_RSS_IND_TBL_SIZE; i++)
+		rss_indir[i] = i % kinfo->rss_size;
+
+	ret = hclgevf_set_rss(handle, rss_indir, NULL, 0);
+	if (ret)
+		dev_err(&hdev->pdev->dev, "set rss indir table fail, ret=%d\n",
+			ret);
+
+	kfree(rss_indir);
+
+out:
+	if (!ret)
+		dev_info(&hdev->pdev->dev,
+			 "Channels changed, rss_size from %u to %u, tqps from %u to %u",
+			 cur_rss_size, kinfo->rss_size,
+			 cur_tqps, kinfo->rss_size * kinfo->num_tc);
+
+	return ret;
+}
+
 static int hclgevf_get_status(struct hnae3_handle *handle)
 {
 	struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle);
@@ -3042,6 +3114,7 @@ static const struct hnae3_ae_ops hclgevf_ops = {
 	.enable_hw_strip_rxvtag = hclgevf_en_hw_strip_rxvtag,
 	.reset_event = hclgevf_reset_event,
 	.set_default_reset_request = hclgevf_set_def_reset_request,
+	.set_channels = hclgevf_set_channels,
 	.get_channels = hclgevf_get_channels,
 	.get_tqps_and_rss_info = hclgevf_get_tqps_and_rss_info,
 	.get_regs_len = hclgevf_get_regs_len,
-- 
2.7.4


^ permalink raw reply related

* [PATCH V2 net-next 2/7] net: hns3: revert to old channel when setting new channel num fail
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <1568169639-43658-1-git-send-email-tanhuazhong@huawei.com>

From: Peng Li <lipeng321@huawei.com>

After setting new channel num, it needs free old ring memory and
allocate new ring memory. If there is no enough memory and allocate
new ring memory fail, the ring may initialize fail. To make sure
the network interface can work normally, driver should revert the
channel to the old configuration.

Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 51 ++++++++++++++++++-------
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 9f3f8e3..8dbaf36 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -4410,6 +4410,30 @@ static int hns3_reset_notify(struct hnae3_handle *handle,
 	return ret;
 }
 
+static int hns3_change_channels(struct hnae3_handle *handle, u32 new_tqp_num,
+				bool rxfh_configured)
+{
+	int ret;
+
+	ret = handle->ae_algo->ops->set_channels(handle, new_tqp_num,
+						 rxfh_configured);
+	if (ret) {
+		dev_err(&handle->pdev->dev,
+			"Change tqp num(%u) fail.\n", new_tqp_num);
+		return ret;
+	}
+
+	ret = hns3_reset_notify(handle, HNAE3_INIT_CLIENT);
+	if (ret)
+		return ret;
+
+	ret =  hns3_reset_notify(handle, HNAE3_UP_CLIENT);
+	if (ret)
+		hns3_reset_notify(handle, HNAE3_UNINIT_CLIENT);
+
+	return ret;
+}
+
 int hns3_set_channels(struct net_device *netdev,
 		      struct ethtool_channels *ch)
 {
@@ -4450,24 +4474,23 @@ int hns3_set_channels(struct net_device *netdev,
 		return ret;
 
 	org_tqp_num = h->kinfo.num_tqps;
-	ret = h->ae_algo->ops->set_channels(h, new_tqp_num, rxfh_configured);
+	ret = hns3_change_channels(h, new_tqp_num, rxfh_configured);
 	if (ret) {
-		ret = h->ae_algo->ops->set_channels(h, org_tqp_num,
-						    rxfh_configured);
-		if (ret) {
-			/* If revert to old tqp failed, fatal error occurred */
-			dev_err(&netdev->dev,
-				"Revert to old tqp num fail, ret=%d", ret);
-			return ret;
+		int ret1;
+
+		netdev_warn(netdev,
+			    "Change channels fail, revert to old value\n");
+		ret1 = hns3_change_channels(h, org_tqp_num, rxfh_configured);
+		if (ret1) {
+			netdev_err(netdev,
+				   "revert to old channel fail\n");
+			return ret1;
 		}
-		dev_info(&netdev->dev,
-			 "Change tqp num fail, Revert to old tqp num");
-	}
-	ret = hns3_reset_notify(h, HNAE3_INIT_CLIENT);
-	if (ret)
+
 		return ret;
+	}
 
-	return hns3_reset_notify(h, HNAE3_UP_CLIENT);
+	return 0;
 }
 
 static const struct hns3_hw_error_info hns3_hw_err[] = {
-- 
2.7.4


^ permalink raw reply related

* [PATCH V2 net-next 3/7] net: hns3: fix shaper parameter algorithm
From: Huazhong Tan @ 2019-09-11  2:40 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	jakub.kicinski
In-Reply-To: <1568169639-43658-1-git-send-email-tanhuazhong@huawei.com>

From: Yonglong Liu <liuyonglong@huawei.com>

Currently when hns3 driver configures the tm shaper to limit
bandwidth below 20Mbit using the parameters calculated by
hclge_shaper_para_calc(), the actual bandwidth limited by tm
hardware module is not accurate enough, for example, 1.28 Mbit
when the user is configuring 1 Mbit.

This patch adjusts the ir_calc to be closer to ir, and
always calculate the ir_b parameter when user is configuring
a small bandwidth. Also, removes an unnecessary parenthesis
when calculating denominator.

Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver")
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
index e829101..9f0e35f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
@@ -81,16 +81,13 @@ static int hclge_shaper_para_calc(u32 ir, u8 shaper_level,
 		return 0;
 	} else if (ir_calc > ir) {
 		/* Increasing the denominator to select ir_s value */
-		while (ir_calc > ir) {
+		while (ir_calc >= ir && ir) {
 			ir_s_calc++;
 			ir_calc = DIVISOR_IR_B_126 / (tick * (1 << ir_s_calc));
 		}
 
-		if (ir_calc == ir)
-			*ir_b = 126;
-		else
-			*ir_b = (ir * tick * (1 << ir_s_calc) +
-				 (DIVISOR_CLK >> 1)) / DIVISOR_CLK;
+		*ir_b = (ir * tick * (1 << ir_s_calc) + (DIVISOR_CLK >> 1)) /
+			DIVISOR_CLK;
 	} else {
 		/* Increasing the numerator to select ir_u value */
 		u32 numerator;
@@ -104,7 +101,7 @@ static int hclge_shaper_para_calc(u32 ir, u8 shaper_level,
 		if (ir_calc == ir) {
 			*ir_b = 126;
 		} else {
-			u32 denominator = (DIVISOR_CLK * (1 << --ir_u_calc));
+			u32 denominator = DIVISOR_CLK * (1 << --ir_u_calc);
 			*ir_b = (ir * tick + (denominator >> 1)) / denominator;
 		}
 	}
-- 
2.7.4


^ permalink raw reply related

* [PATCH v2] rtl8xxxu: add bluetooth co-existence support for single antenna
From: Chris Chiu @ 2019-09-11  2:50 UTC (permalink / raw)
  To: Jes.Sorensen, kvalo, davem; +Cc: linux-wireless, netdev, linux-kernel, linux

The RTL8723BU suffers the wifi disconnection problem while bluetooth
device connected. While wifi is doing tx/rx, the bluetooth will scan
without results. This is due to the wifi and bluetooth share the same
single antenna for RF communication and they need to have a mechanism
to collaborate.

BT information is provided via the packet sent from co-processor to
host (C2H). It contains the status of BT but the rtl8723bu_handle_c2h
dose not really handle it. And there's no bluetooth coexistence
mechanism to deal with it.

This commit adds a workqueue to set the tdma configurations and
coefficient table per the parsed bluetooth link status and given
wifi connection state. The tdma/coef table comes from the vendor
driver code of the RTL8192EU and RTL8723BU. However, this commit is
only for single antenna scenario which RTL8192EU is default dual
antenna. The rtl8xxxu_parse_rxdesc24 which invokes the handle_c2h
is only for 8723b and 8192e so the mechanism is expected to work
on both chips with single antenna. Note RTL8192EU dual antenna is
not supported.

Signed-off-by: Chris Chiu <chiu@endlessm.com>
---

Notes:
  v2:
   - Add helper functions to replace bunch of tdma settings
   - Reformat some lines to meet kernel coding style


 .../net/wireless/realtek/rtl8xxxu/rtl8xxxu.h  |  37 +++
 .../realtek/rtl8xxxu/rtl8xxxu_8723b.c         |   2 -
 .../wireless/realtek/rtl8xxxu/rtl8xxxu_core.c | 262 +++++++++++++++++-
 3 files changed, 294 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h
index 582c2a346cec..22e95b11bfbb 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h
@@ -1220,6 +1220,37 @@ enum ratr_table_mode_new {
 	RATEID_IDX_BGN_3SS = 14,
 };
 
+#define BT_INFO_8723B_1ANT_B_FTP		BIT(7)
+#define BT_INFO_8723B_1ANT_B_A2DP		BIT(6)
+#define BT_INFO_8723B_1ANT_B_HID		BIT(5)
+#define BT_INFO_8723B_1ANT_B_SCO_BUSY		BIT(4)
+#define BT_INFO_8723B_1ANT_B_ACL_BUSY		BIT(3)
+#define BT_INFO_8723B_1ANT_B_INQ_PAGE		BIT(2)
+#define BT_INFO_8723B_1ANT_B_SCO_ESCO		BIT(1)
+#define BT_INFO_8723B_1ANT_B_CONNECTION	BIT(0)
+
+enum _BT_8723B_1ANT_STATUS {
+	BT_8723B_1ANT_STATUS_NON_CONNECTED_IDLE      = 0x0,
+	BT_8723B_1ANT_STATUS_CONNECTED_IDLE          = 0x1,
+	BT_8723B_1ANT_STATUS_INQ_PAGE                = 0x2,
+	BT_8723B_1ANT_STATUS_ACL_BUSY                = 0x3,
+	BT_8723B_1ANT_STATUS_SCO_BUSY                = 0x4,
+	BT_8723B_1ANT_STATUS_ACL_SCO_BUSY            = 0x5,
+	BT_8723B_1ANT_STATUS_MAX
+};
+
+struct rtl8xxxu_btcoex {
+	u8      bt_status;
+	bool	bt_busy;
+	bool	has_sco;
+	bool	has_a2dp;
+	bool    has_hid;
+	bool    has_pan;
+	bool	hid_only;
+	bool	a2dp_only;
+	bool    c2h_bt_inquiry;
+};
+
 #define RTL8XXXU_RATR_STA_INIT 0
 #define RTL8XXXU_RATR_STA_HIGH 1
 #define RTL8XXXU_RATR_STA_MID  2
@@ -1340,6 +1371,10 @@ struct rtl8xxxu_priv {
 	 */
 	struct ieee80211_vif *vif;
 	struct delayed_work ra_watchdog;
+	struct work_struct c2hcmd_work;
+	struct sk_buff_head c2hcmd_queue;
+	spinlock_t c2hcmd_lock;
+	struct rtl8xxxu_btcoex bt_coex;
 };
 
 struct rtl8xxxu_rx_urb {
@@ -1486,6 +1521,8 @@ void rtl8xxxu_fill_txdesc_v2(struct ieee80211_hw *hw, struct ieee80211_hdr *hdr,
 			     struct rtl8xxxu_txdesc32 *tx_desc32, bool sgi,
 			     bool short_preamble, bool ampdu_enable,
 			     u32 rts_rate);
+void rtl8723bu_set_ps_tdma(struct rtl8xxxu_priv *priv,
+			   u8 arg1, u8 arg2, u8 arg3, u8 arg4, u8 arg5);
 
 extern struct rtl8xxxu_fileops rtl8192cu_fops;
 extern struct rtl8xxxu_fileops rtl8192eu_fops;
diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
index ceffe05bd65b..9ba661b3d767 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c
@@ -1580,9 +1580,7 @@ static void rtl8723b_enable_rf(struct rtl8xxxu_priv *priv)
 	/*
 	 * Software control, antenna at WiFi side
 	 */
-#ifdef NEED_PS_TDMA
 	rtl8723bu_set_ps_tdma(priv, 0x08, 0x00, 0x00, 0x00, 0x00);
-#endif
 
 	rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0x55555555);
 	rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0x55555555);
diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
index a6f358b9e447..e4c1b08c8070 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
@@ -3820,9 +3820,8 @@ void rtl8xxxu_power_off(struct rtl8xxxu_priv *priv)
 	rtl8xxxu_write8(priv, REG_RSV_CTRL, 0x0e);
 }
 
-#ifdef NEED_PS_TDMA
-static void rtl8723bu_set_ps_tdma(struct rtl8xxxu_priv *priv,
-				  u8 arg1, u8 arg2, u8 arg3, u8 arg4, u8 arg5)
+void rtl8723bu_set_ps_tdma(struct rtl8xxxu_priv *priv,
+			   u8 arg1, u8 arg2, u8 arg3, u8 arg4, u8 arg5)
 {
 	struct h2c_cmd h2c;
 
@@ -3835,7 +3834,6 @@ static void rtl8723bu_set_ps_tdma(struct rtl8xxxu_priv *priv,
 	h2c.b_type_dma.data5 = arg5;
 	rtl8xxxu_gen2_h2c_cmd(priv, &h2c, sizeof(h2c.b_type_dma));
 }
-#endif
 
 void rtl8xxxu_gen2_disable_rf(struct rtl8xxxu_priv *priv)
 {
@@ -5186,12 +5184,258 @@ static void rtl8xxxu_rx_urb_work(struct work_struct *work)
 	}
 }
 
+void rtl8723bu_set_coex_with_type(struct rtl8xxxu_priv *priv, u8 type)
+{
+	switch (type) {
+	case 0:
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0x55555555);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0x55555555);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE3, 0x00ffffff);
+		rtl8xxxu_write8(priv, REG_BT_COEX_TABLE4, 0x03);
+		break;
+	case 1:
+	case 3:
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0x55555555);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0x5a5a5a5a);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE3, 0x00ffffff);
+		rtl8xxxu_write8(priv, REG_BT_COEX_TABLE4, 0x03);
+		break;
+	case 2:
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0x5a5a5a5a);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0x5a5a5a5a);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE3, 0x00ffffff);
+		rtl8xxxu_write8(priv, REG_BT_COEX_TABLE4, 0x03);
+		break;
+	case 4:
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0x5a5a5a5a);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0xaaaa5a5a);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE3, 0x00ffffff);
+		rtl8xxxu_write8(priv, REG_BT_COEX_TABLE4, 0x03);
+		break;
+	case 5:
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0x5a5a5a5a);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0xaa5a5a5a);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE3, 0x00ffffff);
+		rtl8xxxu_write8(priv, REG_BT_COEX_TABLE4, 0x03);
+		break;
+	case 6:
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0x55555555);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0xaaaaaaaa);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE3, 0x00ffffff);
+		rtl8xxxu_write8(priv, REG_BT_COEX_TABLE4, 0x03);
+		break;
+	case 7:
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE1, 0xaaaaaaaa);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE2, 0xaaaaaaaa);
+		rtl8xxxu_write32(priv, REG_BT_COEX_TABLE3, 0x00ffffff);
+		rtl8xxxu_write8(priv, REG_BT_COEX_TABLE4, 0x03);
+		break;
+	default:
+		break;
+	}
+}
+
+void rtl8723bu_update_bt_link_info(struct rtl8xxxu_priv *priv, u8 bt_info)
+{
+	struct rtl8xxxu_btcoex *btcoex = &priv->bt_coex;
+
+	if (bt_info & BT_INFO_8723B_1ANT_B_INQ_PAGE)
+		btcoex->c2h_bt_inquiry = true;
+	else
+		btcoex->c2h_bt_inquiry = false;
+
+	if (!(bt_info & BT_INFO_8723B_1ANT_B_CONNECTION)) {
+		btcoex->bt_status = BT_8723B_1ANT_STATUS_NON_CONNECTED_IDLE;
+		btcoex->has_sco = false;
+		btcoex->has_hid = false;
+		btcoex->has_pan = false;
+		btcoex->has_a2dp = false;
+	} else {
+		if ((bt_info & 0x1f) == BT_INFO_8723B_1ANT_B_CONNECTION)
+			btcoex->bt_status = BT_8723B_1ANT_STATUS_CONNECTED_IDLE;
+		else if ((bt_info & BT_INFO_8723B_1ANT_B_SCO_ESCO) ||
+			 (bt_info & BT_INFO_8723B_1ANT_B_SCO_BUSY))
+			btcoex->bt_status = BT_8723B_1ANT_STATUS_SCO_BUSY;
+		else if (bt_info & BT_INFO_8723B_1ANT_B_ACL_BUSY)
+			btcoex->bt_status = BT_8723B_1ANT_STATUS_ACL_BUSY;
+		else
+			btcoex->bt_status = BT_8723B_1ANT_STATUS_MAX;
+
+		if (bt_info & BT_INFO_8723B_1ANT_B_FTP)
+			btcoex->has_pan = true;
+		else
+			btcoex->has_pan = false;
+
+		if (bt_info & BT_INFO_8723B_1ANT_B_A2DP)
+			btcoex->has_a2dp = true;
+		else
+			btcoex->has_a2dp = false;
+
+		if (bt_info & BT_INFO_8723B_1ANT_B_HID)
+			btcoex->has_hid = true;
+		else
+			btcoex->has_hid = false;
+
+		if (bt_info & BT_INFO_8723B_1ANT_B_SCO_ESCO)
+			btcoex->has_sco = true;
+		else
+			btcoex->has_sco = false;
+	}
+
+	if (!btcoex->has_a2dp && !btcoex->has_sco &&
+	    !btcoex->has_pan && btcoex->has_hid)
+		btcoex->hid_only = true;
+	else
+		btcoex->hid_only = false;
+
+	if (!btcoex->has_sco && !btcoex->has_pan &&
+	    !btcoex->has_hid && btcoex->has_a2dp)
+		btcoex->has_a2dp = true;
+	else
+		btcoex->has_a2dp = false;
+
+	if (btcoex->bt_status == BT_8723B_1ANT_STATUS_SCO_BUSY ||
+	    btcoex->bt_status == BT_8723B_1ANT_STATUS_ACL_BUSY)
+		btcoex->bt_busy = true;
+	else
+		btcoex->bt_busy = false;
+}
+
+void rtl8723bu_handle_bt_inquiry(struct rtl8xxxu_priv *priv)
+{
+	struct ieee80211_vif *vif;
+	struct rtl8xxxu_btcoex *btcoex;
+	bool wifi_connected;
+
+	vif = priv->vif;
+	btcoex = &priv->bt_coex;
+	wifi_connected = (vif && vif->bss_conf.assoc);
+
+	if (!wifi_connected) {
+		rtl8723bu_set_ps_tdma(priv, 0x8, 0x0, 0x0, 0x0, 0x0);
+		rtl8723bu_set_coex_with_type(priv, 0);
+	} else if (btcoex->has_sco || btcoex->has_hid || btcoex->has_a2dp) {
+		rtl8723bu_set_ps_tdma(priv, 0x61, 0x35, 0x3, 0x11, 0x11);
+		rtl8723bu_set_coex_with_type(priv, 4);
+	} else if (btcoex->has_pan) {
+		rtl8723bu_set_ps_tdma(priv, 0x61, 0x3f, 0x3, 0x11, 0x11);
+		rtl8723bu_set_coex_with_type(priv, 4);
+	} else {
+		rtl8723bu_set_ps_tdma(priv, 0x8, 0x0, 0x0, 0x0, 0x0);
+		rtl8723bu_set_coex_with_type(priv, 7);
+	}
+}
+
+void rtl8723bu_handle_bt_info(struct rtl8xxxu_priv *priv)
+{
+	struct ieee80211_vif *vif;
+	struct rtl8xxxu_btcoex *btcoex;
+	bool wifi_connected;
+
+	vif = priv->vif;
+	btcoex = &priv->bt_coex;
+	wifi_connected = (vif && vif->bss_conf.assoc);
+
+	if (wifi_connected) {
+		u32 val32 = 0;
+		u32 high_prio_tx = 0, high_prio_rx = 0;
+
+		val32 = rtl8xxxu_read32(priv, 0x770);
+		high_prio_tx = val32 & 0x0000ffff;
+		high_prio_rx = (val32  & 0xffff0000) >> 16;
+
+		if (btcoex->bt_busy) {
+			if (btcoex->hid_only) {
+				rtl8723bu_set_ps_tdma(priv, 0x61, 0x20,
+						      0x3, 0x11, 0x11);
+				rtl8723bu_set_coex_with_type(priv, 5);
+			} else if (btcoex->a2dp_only) {
+				rtl8723bu_set_ps_tdma(priv, 0x61, 0x35,
+						      0x3, 0x11, 0x11);
+				rtl8723bu_set_coex_with_type(priv, 4);
+			} else if ((btcoex->has_a2dp && btcoex->has_pan) ||
+				   (btcoex->has_hid && btcoex->has_a2dp &&
+				    btcoex->has_pan)) {
+				rtl8723bu_set_ps_tdma(priv, 0x51, 0x21,
+						      0x3, 0x10, 0x10);
+				rtl8723bu_set_coex_with_type(priv, 4);
+			} else if (btcoex->has_hid && btcoex->has_a2dp) {
+				rtl8723bu_set_ps_tdma(priv, 0x51, 0x21,
+						      0x3, 0x10, 0x10);
+				rtl8723bu_set_coex_with_type(priv, 3);
+			} else {
+				rtl8723bu_set_ps_tdma(priv, 0x61, 0x35,
+						      0x3, 0x11, 0x11);
+				rtl8723bu_set_coex_with_type(priv, 4);
+			}
+		} else {
+			rtl8723bu_set_ps_tdma(priv, 0x8, 0x0, 0x0, 0x0, 0x0);
+			if (high_prio_tx + high_prio_rx <= 60)
+				rtl8723bu_set_coex_with_type(priv, 2);
+			else
+				rtl8723bu_set_coex_with_type(priv, 7);
+		}
+	} else {
+		rtl8723bu_set_ps_tdma(priv, 0x8, 0x0, 0x0, 0x0, 0x0);
+		rtl8723bu_set_coex_with_type(priv, 0);
+	}
+}
+
+static void rtl8xxxu_c2hcmd_callback(struct work_struct *work)
+{
+	struct rtl8xxxu_priv *priv;
+	struct rtl8723bu_c2h *c2h;
+	struct ieee80211_vif *vif;
+	struct device *dev;
+	struct sk_buff *skb = NULL;
+	unsigned long flags;
+	int len;
+	u8 bt_info = 0;
+	struct rtl8xxxu_btcoex *btcoex;
+
+	priv = container_of(work, struct rtl8xxxu_priv, c2hcmd_work);
+	vif = priv->vif;
+	btcoex = &priv->bt_coex;
+	dev = &priv->udev->dev;
+
+	if (priv->rf_paths > 1)
+		goto out;
+
+	while (!skb_queue_empty(&priv->c2hcmd_queue)) {
+		spin_lock_irqsave(&priv->c2hcmd_lock, flags);
+		skb = __skb_dequeue(&priv->c2hcmd_queue);
+		spin_unlock_irqrestore(&priv->c2hcmd_lock, flags);
+
+		c2h = (struct rtl8723bu_c2h *)skb->data;
+		len = skb->len - 2;
+
+		switch (c2h->id) {
+		case C2H_8723B_BT_INFO:
+			bt_info = c2h->bt_info.bt_info;
+
+			rtl8723bu_update_bt_link_info(priv, bt_info);
+			if (btcoex->c2h_bt_inquiry) {
+				rtl8723bu_handle_bt_inquiry(priv);
+				break;
+			}
+			rtl8723bu_handle_bt_info(priv);
+			break;
+		default:
+			break;
+		}
+	}
+
+out:
+	dev_kfree_skb(skb);
+}
+
 static void rtl8723bu_handle_c2h(struct rtl8xxxu_priv *priv,
 				 struct sk_buff *skb)
 {
 	struct rtl8723bu_c2h *c2h = (struct rtl8723bu_c2h *)skb->data;
 	struct device *dev = &priv->udev->dev;
 	int len;
+	unsigned long flags;
 
 	len = skb->len - 2;
 
@@ -5229,6 +5473,12 @@ static void rtl8723bu_handle_c2h(struct rtl8xxxu_priv *priv,
 			       16, 1, c2h->raw.payload, len, false);
 		break;
 	}
+
+	spin_lock_irqsave(&priv->c2hcmd_lock, flags);
+	__skb_queue_tail(&priv->c2hcmd_queue, skb);
+	spin_unlock_irqrestore(&priv->c2hcmd_lock, flags);
+
+	schedule_work(&priv->c2hcmd_work);
 }
 
 int rtl8xxxu_parse_rxdesc16(struct rtl8xxxu_priv *priv, struct sk_buff *skb)
@@ -5353,7 +5603,6 @@ int rtl8xxxu_parse_rxdesc24(struct rtl8xxxu_priv *priv, struct sk_buff *skb)
 		struct device *dev = &priv->udev->dev;
 		dev_dbg(dev, "%s: C2H packet\n", __func__);
 		rtl8723bu_handle_c2h(priv, skb);
-		dev_kfree_skb(skb);
 		return RX_TYPE_C2H;
 	}
 
@@ -6272,6 +6521,9 @@ static int rtl8xxxu_probe(struct usb_interface *interface,
 	spin_lock_init(&priv->rx_urb_lock);
 	INIT_WORK(&priv->rx_urb_wq, rtl8xxxu_rx_urb_work);
 	INIT_DELAYED_WORK(&priv->ra_watchdog, rtl8xxxu_watchdog_callback);
+	spin_lock_init(&priv->c2hcmd_lock);
+	INIT_WORK(&priv->c2hcmd_work, rtl8xxxu_c2hcmd_callback);
+	skb_queue_head_init(&priv->c2hcmd_queue);
 
 	usb_set_intfdata(interface, hw);
 
-- 
2.20.1


^ permalink raw reply related

* Re: [RFC PATCH 3/4] virtio: introudce a mdev based transport
From: Jason Wang @ 2019-09-11  2:52 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: mst, kvm, virtualization, netdev, linux-kernel, kwankhede,
	alex.williamson, cohuck, maxime.coquelin, cunming.liang,
	zhihong.wang, rob.miller, idos, xiao.w.wang, haotian.wang
In-Reply-To: <20190911014726.GA14798@___>


On 2019/9/11 上午9:47, Tiwei Bie wrote:
> On Tue, Sep 10, 2019 at 04:19:34PM +0800, Jason Wang wrote:
>> This path introduces a new mdev transport for virtio. This is used to
>> use kernel virtio driver to drive the mediated device that is capable
>> of populating virtqueue directly.
>>
>> A new virtio-mdev driver will be registered to the mdev bus, when a
>> new virtio-mdev device is probed, it will register the device with
>> mdev based config ops. This means, unlike the exist hardware
>> transport, this is a software transport between mdev driver and mdev
>> device. The transport was implemented through:
>>
>> - configuration access was implemented through parent_ops->read()/write()
>> - vq/config callback was implemented through parent_ops->ioctl()
>>
>> This transport is derived from virtio MMIO protocol and was wrote for
>> kernel driver. But for the transport itself, but the design goal is to
>> be generic enough to support userspace driver (this part will be added
>> in the future).
>>
>> Note:
>> - current mdev assume all the parameter of parent_ops was from
>>    userspace. This prevents us from implementing the kernel mdev
>>    driver. For a quick POC, this patch just abuse those parameter and
>>    assume the mdev device implementation will treat them as kernel
>>    pointer. This should be addressed in the formal series by extending
>>    mdev_parent_ops.
>> - for a quick POC, I just drive the transport from MMIO, I'm pretty
>>    there's lot of optimization space for this.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   drivers/vfio/mdev/Kconfig        |   7 +
>>   drivers/vfio/mdev/Makefile       |   1 +
>>   drivers/vfio/mdev/virtio_mdev.c  | 500 +++++++++++++++++++++++++++++++
>>   include/uapi/linux/virtio_mdev.h | 131 ++++++++
>>   4 files changed, 639 insertions(+)
>>   create mode 100644 drivers/vfio/mdev/virtio_mdev.c
>>   create mode 100644 include/uapi/linux/virtio_mdev.h
>>
> [...]
>> diff --git a/include/uapi/linux/virtio_mdev.h b/include/uapi/linux/virtio_mdev.h
>> new file mode 100644
>> index 000000000000..8040de6b960a
>> --- /dev/null
>> +++ b/include/uapi/linux/virtio_mdev.h
>> @@ -0,0 +1,131 @@
>> +/*
>> + * Virtio mediated device driver
>> + *
>> + * Copyright 2019, Red Hat Corp.
>> + *
>> + * Based on Virtio MMIO driver by ARM Ltd, copyright ARM Ltd. 2011
>> + *
>> + * This header is BSD licensed so anyone can use the definitions to implement
>> + * compatible drivers/servers.
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions
>> + * are met:
>> + * 1. Redistributions of source code must retain the above copyright
>> + *    notice, this list of conditions and the following disclaimer.
>> + * 2. Redistributions in binary form must reproduce the above copyright
>> + *    notice, this list of conditions and the following disclaimer in the
>> + *    documentation and/or other materials provided with the distribution.
>> + * 3. Neither the name of IBM nor the names of its contributors
>> + *    may be used to endorse or promote products derived from this software
>> + *    without specific prior written permission.
>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
>> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
>> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>> + * SUCH DAMAGE.
>> + */
>> +#ifndef _LINUX_VIRTIO_MDEV_H
>> +#define _LINUX_VIRTIO_MDEV_H
>> +
>> +#include <linux/interrupt.h>
>> +#include <linux/vringh.h>
>> +#include <uapi/linux/virtio_net.h>
>> +
>> +/*
>> + * Ioctls
>> + */
>> +
>> +struct virtio_mdev_callback {
>> +	irqreturn_t (*callback)(void *);
>> +	void *private;
>> +};
>> +
>> +#define VIRTIO_MDEV 0xAF
>> +#define VIRTIO_MDEV_SET_VQ_CALLBACK _IOW(VIRTIO_MDEV, 0x00, \
>> +					 struct virtio_mdev_callback)
>> +#define VIRTIO_MDEV_SET_CONFIG_CALLBACK _IOW(VIRTIO_MDEV, 0x01, \
>> +					struct virtio_mdev_callback)
>> +
>> +#define VIRTIO_MDEV_DEVICE_API_STRING		"virtio-mdev"
>> +
>> +/*
>> + * Control registers
>> + */
>> +
>> +/* Magic value ("virt" string) - Read Only */
>> +#define VIRTIO_MDEV_MAGIC_VALUE		0x000
>> +
>> +/* Virtio device version - Read Only */
>> +#define VIRTIO_MDEV_VERSION		0x004
>> +
>> +/* Virtio device ID - Read Only */
>> +#define VIRTIO_MDEV_DEVICE_ID		0x008
>> +
>> +/* Virtio vendor ID - Read Only */
>> +#define VIRTIO_MDEV_VENDOR_ID		0x00c
>> +
>> +/* Bitmask of the features supported by the device (host)
>> + * (32 bits per set) - Read Only */
>> +#define VIRTIO_MDEV_DEVICE_FEATURES	0x010
>> +
>> +/* Device (host) features set selector - Write Only */
>> +#define VIRTIO_MDEV_DEVICE_FEATURES_SEL	0x014
>> +
>> +/* Bitmask of features activated by the driver (guest)
>> + * (32 bits per set) - Write Only */
>> +#define VIRTIO_MDEV_DRIVER_FEATURES	0x020
>> +
>> +/* Activated features set selector - Write Only */
>> +#define VIRTIO_MDEV_DRIVER_FEATURES_SEL	0x024
>> +
>> +/* Queue selector - Write Only */
>> +#define VIRTIO_MDEV_QUEUE_SEL		0x030
>> +
>> +/* Maximum size of the currently selected queue - Read Only */
>> +#define VIRTIO_MDEV_QUEUE_NUM_MAX	0x034
>> +
>> +/* Queue size for the currently selected queue - Write Only */
>> +#define VIRTIO_MDEV_QUEUE_NUM		0x038
>> +
>> +/* Ready bit for the currently selected queue - Read Write */
>> +#define VIRTIO_MDEV_QUEUE_READY		0x044
>> +
>> +/* Alignment of virtqueue - Read Only */
>> +#define VIRTIO_MDEV_QUEUE_ALIGN		0x048
>> +
>> +/* Queue notifier - Write Only */
>> +#define VIRTIO_MDEV_QUEUE_NOTIFY	0x050
>> +
>> +/* Device status register - Read Write */
>> +#define VIRTIO_MDEV_STATUS		0x060
>> +
>> +/* Selected queue's Descriptor Table address, 64 bits in two halves */
>> +#define VIRTIO_MDEV_QUEUE_DESC_LOW	0x080
>> +#define VIRTIO_MDEV_QUEUE_DESC_HIGH	0x084
>> +
>> +/* Selected queue's Available Ring address, 64 bits in two halves */
>> +#define VIRTIO_MDEV_QUEUE_AVAIL_LOW	0x090
>> +#define VIRTIO_MDEV_QUEUE_AVAIL_HIGH	0x094
>> +
>> +/* Selected queue's Used Ring address, 64 bits in two halves */
>> +#define VIRTIO_MDEV_QUEUE_USED_LOW	0x0a0
>> +#define VIRTIO_MDEV_QUEUE_USED_HIGH	0x0a4
>> +
>> +/* Configuration atomicity value */
>> +#define VIRTIO_MDEV_CONFIG_GENERATION	0x0fc
>> +
>> +/* The config space is defined by each driver as
>> + * the per-driver configuration space - Read Write */
>> +#define VIRTIO_MDEV_CONFIG		0x100
> IIUC, we can use above registers with virtio-mdev parent's
> read()/write() to access the mdev device from kernel driver.
> As you suggested, it's a choice to build vhost-mdev on top
> of this abstraction as well. But virtio is the frontend
> device which lacks some vhost backend features, e.g. get
> vring base, set vring base, negotiate vhost features, etc.
> So I'm wondering, does it make sense to reserve some space
> for vhost-mdev in kernel to do vhost backend specific setups?
> Or do you have any other thoughts?


Good point. It's just a quick POC, I plan to add vhost features in the 
next release:

1) set/get virtqueue state (e.g vring base)
2) dirty page tracking
3) for vhost features, if you mean _F_LOG_ALL, it could be done by 2), 
for IOTLB, it requires more thought on the API since current IOTLB API 
is no implemented through ioctl ...


>
> Besides, I'm also wondering, what's the purpose of making
> above registers part of UAPI?


Sorry if it brings confusion. If we do vhost-mdev on top of this, 
there's no need for this to go for UAPI.


> And if we make them part
> of UAPI, do we also need to make them part of virtio spec?


Yes, but I do not think we should put it into UAPI. Technically, 
userspace can use this transport as well with some modification on the 
interrupt part, e.g using eventfd. But is there any value for doing this 
instead of vhost?

Thanks


>
> Thanks!
> Tiwei
>
>> +
>> +#endif
>> +
>> +
>> +/* Ready bit for the currently selected queue - Read Write */
>> -- 
>> 2.19.1
>>

^ permalink raw reply

* Re: [PATCH v4] tun: fix use-after-free when register netdev failed
From: Jason Wang @ 2019-09-11  2:57 UTC (permalink / raw)
  To: Yang Yingliang, netdev; +Cc: eric.dumazet, xiyou.wangcong, davem, weiyongjun1
In-Reply-To: <1568113017-79840-1-git-send-email-yangyingliang@huawei.com>


On 2019/9/10 下午6:56, Yang Yingliang wrote:
> I got a UAF repport in tun driver when doing fuzzy test:
>
> [  466.269490] ==================================================================
> [  466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0
> [  466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699
> [  466.271810]
> [  466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427
> [  466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [  466.271838] Call Trace:
> [  466.271858]  dump_stack+0xca/0x13e
> [  466.271871]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271890]  print_address_description+0x79/0x440
> [  466.271906]  ? vprintk_func+0x5e/0xf0
> [  466.271920]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271935]  __kasan_report+0x15c/0x1df
> [  466.271958]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271976]  kasan_report+0xe/0x20
> [  466.271987]  tun_chr_read_iter+0x2ca/0x2d0
> [  466.272013]  do_iter_readv_writev+0x4b7/0x740
> [  466.272032]  ? default_llseek+0x2d0/0x2d0
> [  466.272072]  do_iter_read+0x1c5/0x5e0
> [  466.272110]  vfs_readv+0x108/0x180
> [  466.299007]  ? compat_rw_copy_check_uvector+0x440/0x440
> [  466.299020]  ? fsnotify+0x888/0xd50
> [  466.299040]  ? __fsnotify_parent+0xd0/0x350
> [  466.299064]  ? fsnotify_first_mark+0x1e0/0x1e0
> [  466.304548]  ? vfs_write+0x264/0x510
> [  466.304569]  ? ksys_write+0x101/0x210
> [  466.304591]  ? do_preadv+0x116/0x1a0
> [  466.304609]  do_preadv+0x116/0x1a0
> [  466.309829]  do_syscall_64+0xc8/0x600
> [  466.309849]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.309861] RIP: 0033:0x4560f9
> [  466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [  466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127
> [  466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9
> [  466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
> [  466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000
> [  466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10
> [  466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000
> [  466.323057]
> [  466.323064] Allocated by task 2605:
> [  466.335165]  save_stack+0x19/0x80
> [  466.336240]  __kasan_kmalloc.constprop.8+0xa0/0xd0
> [  466.337755]  kmem_cache_alloc+0xe8/0x320
> [  466.339050]  getname_flags+0xca/0x560
> [  466.340229]  user_path_at_empty+0x2c/0x50
> [  466.341508]  vfs_statx+0xe6/0x190
> [  466.342619]  __do_sys_newstat+0x81/0x100
> [  466.343908]  do_syscall_64+0xc8/0x600
> [  466.345303]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.347034]
> [  466.347517] Freed by task 2605:
> [  466.348471]  save_stack+0x19/0x80
> [  466.349476]  __kasan_slab_free+0x12e/0x180
> [  466.350726]  kmem_cache_free+0xc8/0x430
> [  466.351874]  putname+0xe2/0x120
> [  466.352921]  filename_lookup+0x257/0x3e0
> [  466.354319]  vfs_statx+0xe6/0x190
> [  466.355498]  __do_sys_newstat+0x81/0x100
> [  466.356889]  do_syscall_64+0xc8/0x600
> [  466.358037]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.359567]
> [  466.360050] The buggy address belongs to the object at ffff888372139100
> [  466.360050]  which belongs to the cache names_cache of size 4096
> [  466.363735] The buggy address is located 336 bytes inside of
> [  466.363735]  4096-byte region [ffff888372139100, ffff88837213a100)
> [  466.367179] The buggy address belongs to the page:
> [  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
> [  466.371582] flags: 0x2fffff80010200(slab|head)
> [  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
> [  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
> [  466.377778] page dumped because: kasan: bad access detected
> [  466.379730]
> [  466.380288] Memory state around the buggy address:
> [  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.388257]                                                  ^
> [  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.394667] ==================================================================
>
> tun_chr_read_iter() accessed the memory which freed by free_netdev()
> called by tun_set_iff():
>
>          CPUA                                           CPUB
>    tun_set_iff()
>      alloc_netdev_mqs()
>      tun_attach()
>                                                    tun_chr_read_iter()
>                                                      tun_get()
>                                                      tun_do_read()
>                                                        tun_ring_recv()
>      register_netdevice() <-- inject error
>      goto err_detach
>      tun_detach_all() <-- set RCV_SHUTDOWN
>      free_netdev() <-- called from
>                       err_free_dev path
>        netdev_freemem() <-- free the memory
>                          without check refcount
>        (In this path, the refcount cannot prevent
>         freeing the memory of dev, and the memory
>         will be used by dev_put() called by
>         tun_chr_read_iter() on CPUB.)
>                                                       (Break from tun_ring_recv(),
>                                                       because RCV_SHUTDOWN is set)
>                                                     tun_put()
>                                                       dev_put() <-- use the memory
>                                                                     freed by netdev_freemem()
>
> Put the publishing of tfile->tun after register_netdevice(),
> so tun_get() won't get the tun pointer that freed by
> err_detach path if register_netdevice() failed.
>
> Fixes: eb0fb363f920 ("tuntap: attach queue 0 before registering netdevice")
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>   drivers/net/tun.c | 16 +++++++++++-----
>   1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index db16d7a13e00..aab0be40d443 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -787,7 +787,8 @@ static void tun_detach_all(struct net_device *dev)
>   }
>   
>   static int tun_attach(struct tun_struct *tun, struct file *file,
> -		      bool skip_filter, bool napi, bool napi_frags)
> +		      bool skip_filter, bool napi, bool napi_frags,
> +		      bool publish_tun)
>   {
>   	struct tun_file *tfile = file->private_data;
>   	struct net_device *dev = tun->dev;
> @@ -870,7 +871,8 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
>   	 * initialized tfile; otherwise we risk using half-initialized
>   	 * object.
>   	 */
> -	rcu_assign_pointer(tfile->tun, tun);
> +	if (publish_tun)
> +		rcu_assign_pointer(tfile->tun, tun);
>   	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
>   	tun->numqueues++;
>   	tun_set_real_num_queues(tun);
> @@ -2730,7 +2732,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>   
>   		err = tun_attach(tun, file, ifr->ifr_flags & IFF_NOFILTER,
>   				 ifr->ifr_flags & IFF_NAPI,
> -				 ifr->ifr_flags & IFF_NAPI_FRAGS);
> +				 ifr->ifr_flags & IFF_NAPI_FRAGS, true);
>   		if (err < 0)
>   			return err;
>   
> @@ -2829,13 +2831,17 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>   
>   		INIT_LIST_HEAD(&tun->disabled);
>   		err = tun_attach(tun, file, false, ifr->ifr_flags & IFF_NAPI,
> -				 ifr->ifr_flags & IFF_NAPI_FRAGS);
> +				 ifr->ifr_flags & IFF_NAPI_FRAGS, false);
>   		if (err < 0)
>   			goto err_free_flow;
>   
>   		err = register_netdevice(tun->dev);
>   		if (err < 0)
>   			goto err_detach;
> +		/* free_netdev() won't check refcnt, to aovid race


A typo that comes from my original patch "avoid".

Other  looks good.

Thanks


> +		 * with dev_put() we need publish tun after registration.
> +		 */
> +		rcu_assign_pointer(tfile->tun, tun);
>   	}
>   
>   	netif_carrier_on(tun->dev);
> @@ -2978,7 +2984,7 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)
>   		if (ret < 0)
>   			goto unlock;
>   		ret = tun_attach(tun, file, false, tun->flags & IFF_NAPI,
> -				 tun->flags & IFF_NAPI_FRAGS);
> +				 tun->flags & IFF_NAPI_FRAGS, true);
>   	} else if (ifr->ifr_flags & IFF_DETACH_QUEUE) {
>   		tun = rtnl_dereference(tfile->tun);
>   		if (!tun || !(tun->flags & IFF_MULTI_QUEUE) || tfile->detached)

^ permalink raw reply

* Re: [RFC PATCH 3/4] virtio: introudce a mdev based transport
From: Tiwei Bie @ 2019-09-11  3:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, kvm, virtualization, netdev, linux-kernel, kwankhede,
	alex.williamson, cohuck, maxime.coquelin, cunming.liang,
	zhihong.wang, rob.miller, idos, xiao.w.wang, haotian.wang
In-Reply-To: <8428eec1-b53c-9efc-1260-4e5ce2651461@redhat.com>

On Wed, Sep 11, 2019 at 10:52:03AM +0800, Jason Wang wrote:
> On 2019/9/11 上午9:47, Tiwei Bie wrote:
> > On Tue, Sep 10, 2019 at 04:19:34PM +0800, Jason Wang wrote:
> > > This path introduces a new mdev transport for virtio. This is used to
> > > use kernel virtio driver to drive the mediated device that is capable
> > > of populating virtqueue directly.
> > > 
> > > A new virtio-mdev driver will be registered to the mdev bus, when a
> > > new virtio-mdev device is probed, it will register the device with
> > > mdev based config ops. This means, unlike the exist hardware
> > > transport, this is a software transport between mdev driver and mdev
> > > device. The transport was implemented through:
> > > 
> > > - configuration access was implemented through parent_ops->read()/write()
> > > - vq/config callback was implemented through parent_ops->ioctl()
> > > 
> > > This transport is derived from virtio MMIO protocol and was wrote for
> > > kernel driver. But for the transport itself, but the design goal is to
> > > be generic enough to support userspace driver (this part will be added
> > > in the future).
> > > 
> > > Note:
> > > - current mdev assume all the parameter of parent_ops was from
> > >    userspace. This prevents us from implementing the kernel mdev
> > >    driver. For a quick POC, this patch just abuse those parameter and
> > >    assume the mdev device implementation will treat them as kernel
> > >    pointer. This should be addressed in the formal series by extending
> > >    mdev_parent_ops.
> > > - for a quick POC, I just drive the transport from MMIO, I'm pretty
> > >    there's lot of optimization space for this.
> > > 
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > ---
> > >   drivers/vfio/mdev/Kconfig        |   7 +
> > >   drivers/vfio/mdev/Makefile       |   1 +
> > >   drivers/vfio/mdev/virtio_mdev.c  | 500 +++++++++++++++++++++++++++++++
> > >   include/uapi/linux/virtio_mdev.h | 131 ++++++++
> > >   4 files changed, 639 insertions(+)
> > >   create mode 100644 drivers/vfio/mdev/virtio_mdev.c
> > >   create mode 100644 include/uapi/linux/virtio_mdev.h
> > > 
> > [...]
> > > diff --git a/include/uapi/linux/virtio_mdev.h b/include/uapi/linux/virtio_mdev.h
> > > new file mode 100644
> > > index 000000000000..8040de6b960a
> > > --- /dev/null
> > > +++ b/include/uapi/linux/virtio_mdev.h
> > > @@ -0,0 +1,131 @@
> > > +/*
> > > + * Virtio mediated device driver
> > > + *
> > > + * Copyright 2019, Red Hat Corp.
> > > + *
> > > + * Based on Virtio MMIO driver by ARM Ltd, copyright ARM Ltd. 2011
> > > + *
> > > + * This header is BSD licensed so anyone can use the definitions to implement
> > > + * compatible drivers/servers.
> > > + *
> > > + * Redistribution and use in source and binary forms, with or without
> > > + * modification, are permitted provided that the following conditions
> > > + * are met:
> > > + * 1. Redistributions of source code must retain the above copyright
> > > + *    notice, this list of conditions and the following disclaimer.
> > > + * 2. Redistributions in binary form must reproduce the above copyright
> > > + *    notice, this list of conditions and the following disclaimer in the
> > > + *    documentation and/or other materials provided with the distribution.
> > > + * 3. Neither the name of IBM nor the names of its contributors
> > > + *    may be used to endorse or promote products derived from this software
> > > + *    without specific prior written permission.
> > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
> > > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> > > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > > + * SUCH DAMAGE.
> > > + */
> > > +#ifndef _LINUX_VIRTIO_MDEV_H
> > > +#define _LINUX_VIRTIO_MDEV_H
> > > +
> > > +#include <linux/interrupt.h>
> > > +#include <linux/vringh.h>
> > > +#include <uapi/linux/virtio_net.h>
> > > +
> > > +/*
> > > + * Ioctls
> > > + */
> > > +
> > > +struct virtio_mdev_callback {
> > > +	irqreturn_t (*callback)(void *);
> > > +	void *private;
> > > +};
> > > +
> > > +#define VIRTIO_MDEV 0xAF
> > > +#define VIRTIO_MDEV_SET_VQ_CALLBACK _IOW(VIRTIO_MDEV, 0x00, \
> > > +					 struct virtio_mdev_callback)
> > > +#define VIRTIO_MDEV_SET_CONFIG_CALLBACK _IOW(VIRTIO_MDEV, 0x01, \
> > > +					struct virtio_mdev_callback)
> > > +
> > > +#define VIRTIO_MDEV_DEVICE_API_STRING		"virtio-mdev"
> > > +
> > > +/*
> > > + * Control registers
> > > + */
> > > +
> > > +/* Magic value ("virt" string) - Read Only */
> > > +#define VIRTIO_MDEV_MAGIC_VALUE		0x000
> > > +
> > > +/* Virtio device version - Read Only */
> > > +#define VIRTIO_MDEV_VERSION		0x004
> > > +
> > > +/* Virtio device ID - Read Only */
> > > +#define VIRTIO_MDEV_DEVICE_ID		0x008
> > > +
> > > +/* Virtio vendor ID - Read Only */
> > > +#define VIRTIO_MDEV_VENDOR_ID		0x00c
> > > +
> > > +/* Bitmask of the features supported by the device (host)
> > > + * (32 bits per set) - Read Only */
> > > +#define VIRTIO_MDEV_DEVICE_FEATURES	0x010
> > > +
> > > +/* Device (host) features set selector - Write Only */
> > > +#define VIRTIO_MDEV_DEVICE_FEATURES_SEL	0x014
> > > +
> > > +/* Bitmask of features activated by the driver (guest)
> > > + * (32 bits per set) - Write Only */
> > > +#define VIRTIO_MDEV_DRIVER_FEATURES	0x020
> > > +
> > > +/* Activated features set selector - Write Only */
> > > +#define VIRTIO_MDEV_DRIVER_FEATURES_SEL	0x024
> > > +
> > > +/* Queue selector - Write Only */
> > > +#define VIRTIO_MDEV_QUEUE_SEL		0x030
> > > +
> > > +/* Maximum size of the currently selected queue - Read Only */
> > > +#define VIRTIO_MDEV_QUEUE_NUM_MAX	0x034
> > > +
> > > +/* Queue size for the currently selected queue - Write Only */
> > > +#define VIRTIO_MDEV_QUEUE_NUM		0x038
> > > +
> > > +/* Ready bit for the currently selected queue - Read Write */
> > > +#define VIRTIO_MDEV_QUEUE_READY		0x044
> > > +
> > > +/* Alignment of virtqueue - Read Only */
> > > +#define VIRTIO_MDEV_QUEUE_ALIGN		0x048
> > > +
> > > +/* Queue notifier - Write Only */
> > > +#define VIRTIO_MDEV_QUEUE_NOTIFY	0x050
> > > +
> > > +/* Device status register - Read Write */
> > > +#define VIRTIO_MDEV_STATUS		0x060
> > > +
> > > +/* Selected queue's Descriptor Table address, 64 bits in two halves */
> > > +#define VIRTIO_MDEV_QUEUE_DESC_LOW	0x080
> > > +#define VIRTIO_MDEV_QUEUE_DESC_HIGH	0x084
> > > +
> > > +/* Selected queue's Available Ring address, 64 bits in two halves */
> > > +#define VIRTIO_MDEV_QUEUE_AVAIL_LOW	0x090
> > > +#define VIRTIO_MDEV_QUEUE_AVAIL_HIGH	0x094
> > > +
> > > +/* Selected queue's Used Ring address, 64 bits in two halves */
> > > +#define VIRTIO_MDEV_QUEUE_USED_LOW	0x0a0
> > > +#define VIRTIO_MDEV_QUEUE_USED_HIGH	0x0a4
> > > +
> > > +/* Configuration atomicity value */
> > > +#define VIRTIO_MDEV_CONFIG_GENERATION	0x0fc
> > > +
> > > +/* The config space is defined by each driver as
> > > + * the per-driver configuration space - Read Write */
> > > +#define VIRTIO_MDEV_CONFIG		0x100
> > IIUC, we can use above registers with virtio-mdev parent's
> > read()/write() to access the mdev device from kernel driver.
> > As you suggested, it's a choice to build vhost-mdev on top
> > of this abstraction as well. But virtio is the frontend
> > device which lacks some vhost backend features, e.g. get
> > vring base, set vring base, negotiate vhost features, etc.
> > So I'm wondering, does it make sense to reserve some space
> > for vhost-mdev in kernel to do vhost backend specific setups?
> > Or do you have any other thoughts?
> 
> 
> Good point. It's just a quick POC, I plan to add vhost features in the next
> release:
> 
> 1) set/get virtqueue state (e.g vring base)
> 2) dirty page tracking
> 3) for vhost features, if you mean _F_LOG_ALL, it could be done by 2), for
> IOTLB, it requires more thought on the API since current IOTLB API is no
> implemented through ioctl ...
> 
> 
> > 
> > Besides, I'm also wondering, what's the purpose of making
> > above registers part of UAPI?
> 
> 
> Sorry if it brings confusion. If we do vhost-mdev on top of this, there's no
> need for this to go for UAPI.
> 
> 
> > And if we make them part
> > of UAPI, do we also need to make them part of virtio spec?
> 
> 
> Yes, but I do not think we should put it into UAPI. Technically, userspace
> can use this transport as well with some modification on the interrupt part,
> e.g using eventfd. But is there any value for doing this instead of vhost?

Yeah, I agree. As we already have vhost, I don't see the value
to make virtio become "virtio + vhost".

Thanks,
Tiwei

^ permalink raw reply

* Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
From: Carlos Antonio Neira Bustos @ 2019-09-11  4:32 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Al Viro, netdev@vger.kernel.org, ebiederm@xmission.com,
	brouer@redhat.com, bpf@vger.kernel.org
In-Reply-To: <dadf3657-2648-14ef-35ee-e09efb2cdb3e@fb.com>

On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote:
Thanks a lot Yonghong.
I'll include this patch when submitting changes for version 11 of
this patch. 
> 
> Carlos,
> 
> Discussed with Eric today for what is the best way to get
> the device number for a namespace. The following patch seems
> a reasonable start although Eric would like to see
> how the helper is used in order to decide whether the
> interface looks right.
> 
> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
> Author: Yonghong Song <yhs@fb.com>
> Date:   Mon Sep 9 21:50:51 2019 -0700
> 
>      nsfs: add an interface function ns_get_inum_dev()
> 
>      This patch added an interface function
>      ns_get_inum_dev(). Given a ns_common structure,
>      the function returns the inode and device
>      numbers. The function will be used later
>      by a newly added bpf helper.
> 
>      Signed-off-by: Yonghong Song <yhs@fb.com>
> 
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index a0431642c6b5..a603c6fc3f54 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
>          return ERR_PTR(-EINVAL);
>   }
> 
> +/* Get the device number for the current task pidns.
> + */
> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
> +{
> +       *inum = ns->inum;
> +       *dev = nsfs_mnt->mnt_sb->s_dev;
> +}
> +
>   static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
>   {
>          struct inode *inode = d_inode(dentry);
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index d31cb6215905..b8fc680cdf1a 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct 
> task_struct *task,
>   typedef struct ns_common *ns_get_path_helper_t(void *);
>   extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t 
> ns_get_cb,
>                              void *private_data);
> +extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev);
> 
>   extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
>                          const struct proc_ns_operations *ns_ops);
> 
> Could you put the above change and patch #1 and then have
> all your other patches? In your kernel change, please use
> interface function ns_get_inum_dev() to get pidns inode number
> and dev number.
> 
> On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote:
> > Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and
> > provide technical insights needed on this one.
> > But how do we move this forward?
> > Al Viro's review is clear that this will not work and we should strip the name
> > resolution code (thanks for your detailed analysis).
> > As there is currently only one instance of the nsfs device on the system,
> > I think we could leave out the retrieval of the pidns device number and address it
> > when the situation changes.
> > What do you think?
> > 
> > 
> > On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote:
> >>
> >>
> >> On 9/6/19 5:10 PM, Al Viro wrote:
> >>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote:
> >>>
> >>>> -bash-4.4$ readlink /proc/self/ns/pid
> >>>> pid:[4026531836]
> >>>> -bash-4.4$ stat /proc/self/ns/pid
> >>>>      File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’
> >>>>      Size: 0               Blocks: 0          IO Block: 1024   symbolic link
> >>>> Device: 4h/4d   Inode: 344795989   Links: 1
> >>>> Access: (0777/lrwxrwxrwx)  Uid: (128203/     yhs)   Gid: (  100/   users)
> >>>> Context: user_u:base_r:base_t
> >>>> Access: 2019-09-06 16:06:09.431616380 -0700
> >>>> Modify: 2019-09-06 16:06:09.431616380 -0700
> >>>> Change: 2019-09-06 16:06:09.431616380 -0700
> >>>>     Birth: -
> >>>> -bash-4.4$
> >>>>
> >>>> Based on a discussion with Eric Biederman back in 2019 Linux
> >>>> Plumbers, Eric suggested that to uniquely identify a
> >>>> namespace, device id (major/minor) number should also
> >>>> be included. Although today's kernel implementation
> >>>> has the same device for all namespace pseudo files,
> >>>> but from uapi perspective, device id should be included.
> >>>>
> >>>> That is the reason why we try to get device id which holds
> >>>> pid namespace pseudo file.
> >>>>
> >>>> Do you have a better suggestion on how to get
> >>>> the device id for 'current' pid namespace? Or from design, we
> >>>> really should not care about device id at all?
> >>>
> >>> What the hell is "device id for pid namespace"?  This is the
> >>> first time I've heard about that mystery object, so it's
> >>> hard to tell where it could be found.
> >>>
> >>> I can tell you what device numbers are involved in the areas
> >>> you seem to be looking in.
> >>>
> >>> 1) there's whatever device number that gets assigned to
> >>> (this) procfs instance.  That, ironically, _is_ per-pidns, but
> >>> that of the procfs instance, not that of your process (and
> >>> those can be different).  That's what you get in ->st_dev
> >>> when doing lstat() of anything in /proc (assuming that
> >>> procfs is mounted there, in the first place).  NOTE:
> >>> that's lstat(2), not stat(2).  stat(1) uses lstat(2),
> >>> unless given -L (in which case it's stat(2) time).  The
> >>> difference:
> >>>
> >>> root@kvm1:~# stat /proc/self/ns/pid
> >>>     File: /proc/self/ns/pid -> pid:[4026531836]
> >>>     Size: 0               Blocks: 0          IO Block: 1024   symbolic link
> >>> Device: 4h/4d   Inode: 17396       Links: 1
> >>> Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
> >>> Access: 2019-09-06 19:43:11.871312319 -0400
> >>> Modify: 2019-09-06 19:43:11.871312319 -0400
> >>> Change: 2019-09-06 19:43:11.871312319 -0400
> >>>    Birth: -
> >>> root@kvm1:~# stat -L /proc/self/ns/pid
> >>>     File: /proc/self/ns/pid
> >>>     Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> >>> Device: 3h/3d   Inode: 4026531836  Links: 1
> >>> Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> >>> Access: 2019-09-06 19:43:15.955313293 -0400
> >>> Modify: 2019-09-06 19:43:15.955313293 -0400
> >>> Change: 2019-09-06 19:43:15.955313293 -0400
> >>>    Birth: -
> >>>
> >>> The former is lstat, the latter - stat.
> >>>
> >>> 2) device number of the filesystem where the symlink target lives.
> >>> In this case, it's nsfs and there's only one instance on the entire
> >>> system.  _That_ would be obtained by looking at st_dev in stat(2) on
> >>> /proc/self/ns/pid (0:3 above).
> >>>
> >>> 3) device number *OF* the symlink.  That would be st_rdev in lstat(2).
> >>> There's none - it's a symlink, not a character or block device.  It's
> >>> always zero and always will be zero.
> >>>
> >>> 4) the same for the target; st_rdev in stat(2) results and again,
> >>> there's no such beast - it's neither character nor block device.
> >>>
> >>> Your code is looking at (3).  Please, reread any textbook on Unix
> >>> in the section that would cover stat(2) and discussion of the
> >>> difference between st_dev and st_rdev.
> >>>
> >>> I have no idea what Eric had been talking about - it's hard to
> >>> reconstruct by what you said so far.  Making nsfs per-userns,
> >>> perhaps?  But that makes no sense whatsoever, not that userns
> >>> ever had...  Cheap shots aside, I really can't guess what that's
> >>> about.  Sorry.
> >>
> >> Thanks for the detailed information. The device number we want
> >> is nsfs. Indeed, currently, there is only one instance
> >> on the entire system. But not exactly sure what is the possibility
> >> to have more than one nsfs device in the future. Maybe per-userns
> >> or any other criteria?
> >>
> >>>
> >>> In any case, pathname resolution is *NOT* for the situations where
> >>> you can't block.  Even if it's procfs (and from the same pidns as
> >>> the process) mounted there, there is no promise that the target
> >>> of /proc/self has already been looked up and not evicted from
> >>> memory since then.  And in case of cache miss pathwalk will
> >>> have to call ->lookup(), which requires locking the directory
> >>> (rw_sem, shared).  You can't do that in such context.
> >>>
> >>> And that doesn't even go into the possibility that process has
> >>> something very different mounted on /proc.
> >>>
> >>> Again, I don't know what it is that you want to get to, but
> >>> I would strongly recommend finding a way to get to that data
> >>> that would not involve going anywhere near pathname resolution.
> >>>
> >>> How would you expect the userland to work with that value,
> >>> whatever it might be?  If it's just a 32bit field that will
> >>> never be read, you might as well store there the same value
> >>> you store now (0, that is) in much cheaper and safer way ;-)
> >>
> >> Suppose inside pid namespace, user can pass the device number,
> >> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map
> >> or JIT). At runtime, bpf program will try to get device number,
> >> say n2, for the 'current' process. If n1 is not the same as
> >> n2, that means they are not in the same namespace. 'current'
> >> is in the same pid namespace as the user iff
> >> n1 == n2 and also pidns id is the same for 'current' and
> >> the one with `lsns -t pid`.
> >>
> >> Are you aware of any way to get the pidns device number
> >> for 'current' without going through the pathname
> >> lookup?
> >>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox