* Re: Regression caused by commit 7bb05b85bc2d ("r8169: don't use MSI-X on RTL8106e")
From: Jian-Hong Pan @ 2018-09-13 5:50 UTC (permalink / raw)
To: Kai-Heng Feng, Heiner Kallweit
Cc: Thomas Gleixner, Linux Netdev List, Linux Kernel Mailing List,
Linux Upstreaming Team, Daniel Drake, Steve Dodd
In-Reply-To: <BF554FD5-17BA-4F16-9FDF-705D1FC76C12@canonical.com>
2018-09-12 16:19 GMT+08:00 Kai-Heng Feng <kai.heng.feng@canonical.com>:
> at 14:32, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> On Wed, 12 Sep 2018, Kai-Heng Feng wrote:
>>
>>> There's a Dell machine with RTL8106e stops to work after S3 since the
>>> commit introduced. So I am wondering if it's possible to revert the
>>> commit and use DMI/subsystem id based quirk table?
>>
>>
>> Probably.
>
>
> Hopefully Jian-Hong can cook up a quirk table for the issue.
Module r8169 gets nothing in the PCI BAR after system resumes which
makes MSI-X fail on some ASUS laptops equipped with RTL8106e chip.
https://www.spinics.net/lists/linux-pci/msg75598.html
Actually, I am waiting for the patch "PCI: Reprogram bridge prefetch
registers on resume" being merged.
https://marc.info/?l=linux-pm&m=153680987814299&w=2
It resolves the drivers which get nothing in PCI BAR after system resumes.
After that, I can remove the falling back code of RTL8106e.
Heiner, any comment?
Regards,
Jian-Hong Pan
>>
>>> It's because of commit bc976233a872 ("genirq/msi, x86/vector: Prevent
>>> reservation mode for non maskable MSI") cleared the reservation mode, and
>>> I
>>> can see this after S3:
>>>
>>> [ 94.872838] do_IRQ: 3.33 No irq handler for vector
>>
>>
>> It's not because of that commit, really. There is a interrupt sent after
>> resume to the wrong vector for whatever reason. The MSI vector cannot be
>> masked it seems in the device, but the driver should quiescen the device
>> to
>> a point where it does not send interrupts.
>
>
> Understood.
>
>>
>>> If the device uses MSI-X instead of MSI, the issue doesn't happen because
>>> of
>>> reservation mode.
>>
>>
>> Reservation mode has absolutely nothing to do with that. What prevents the
>> issue is the fact that MSI-X can be masked by the IRQ core.
>
>
> So in this case I think keep the device using MSI-X is a better route, it's
> MSI-X capable anyway.
>
>>
>>> Is it something should be handled by x86 BIOS? Because I don't see this
>>> issue
>>> when I use Suspend-to-Idle, which doesn't use BIOS to do suspend.
>>
>>
>> Suspend to idle works completely different and I don't see the BIOS at
>> fault here. it's more an issue of MSI not being maskable on that device,
>> which can't be fixed in BIOS or it's some half quiescened state which is
>> used when suspending and that's a pure driver issue.
>
>
> Understood.
> Thanks for all the info!
>
> Kai-Heng
>
>>
>> Thanks,
>>
>> tglx
>
>
>
^ permalink raw reply
* Re: [PATCHv2] net: ipv4: Use BUG_ON directly instead of a if condition followed by BUG
From: Simon Horman @ 2018-09-13 10:28 UTC (permalink / raw)
To: zhong jiang; +Cc: davem, edumazet, kuznet, netdev, linux-kernel
In-Reply-To: <1536671593-14746-1-git-send-email-zhongjiang@huawei.com>
On Tue, Sep 11, 2018 at 09:13:13PM +0800, zhong jiang wrote:
> The if condition can be removed if we use BUG_ON directly.
> The issule is detected with the help of Coccinelle.
>
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> ---
> net/ipv4/tcp_input.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 62508a2..526c05e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4934,8 +4934,8 @@ void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
> BUG_ON(offset < 0);
> if (size > 0) {
> size = min(copy, size);
> - if (skb_copy_bits(skb, offset, skb_put(nskb, size), size))
> - BUG();
> + BUG_ON(skb_copy_bits(skb, offset,
> + skb_put(nskb, size), size));
Perhaps things have changed but it doesn't seem desirable to me to use
BUG_ON with statements that have side-effects in case at some point BUG_ON
gets compiled out.
> TCP_SKB_CB(nskb)->end_seq += size;
> copy -= size;
> start += size;
> @@ -5327,8 +5327,8 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
> /* Is the urgent pointer pointing into this packet? */
> if (ptr < skb->len) {
> u8 tmp;
> - if (skb_copy_bits(skb, ptr, &tmp, 1))
> - BUG();
> +
> + BUG_ON(skb_copy_bits(skb, ptr, &tmp, 1));
> tp->urg_data = TCP_URG_VALID | tmp;
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_data_ready(sk);
> --
> 1.7.12.4
>
^ permalink raw reply
* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Yonghong Song @ 2018-09-13 5:01 UTC (permalink / raw)
To: Daniel Borkmann, ast, netdev; +Cc: kernel-team
In-Reply-To: <654fd0ce-5be5-0288-d7ed-42dab183f871@iogearbox.net>
On 9/12/18 3:40 PM, Daniel Borkmann wrote:
> On 09/13/2018 12:29 AM, Daniel Borkmann wrote:
>> On 09/06/2018 01:58 AM, Yonghong Song wrote:
>>> Add "bpftool net" support. Networking devices are enumerated
>>> to dump device index/name associated with xdp progs.
>>>
>>> For each networking device, tc classes and qdiscs are enumerated
>>> in order to check their bpf filters.
>>> In addition, root handle and clsact ingress/egress are also checked for
>>> bpf filters. Not all filter information is printed out. Only ifindex,
>>> kind, filter name, prog_id and tag are printed out, which are good
>>> enough to show attachment information. If the filter action
>>> is a bpf action, its bpf program id, bpf name and tag will be
>>> printed out as well.
>>>
>>> For example,
>>> $ ./bpftool net
>>> xdp [
>>> ifindex 2 devname eth0 prog_id 198
>>> ]
>>
>> Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
>> zero information but will take lots of space. 'eth0 (2)' would for example make it
>> shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(
>>
>>> tc_filters [
>>> ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
>>> prog_id 111727 tag d08fe3b4319bc2fd act []
>>> ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
>>> prog_id 130246 tag 3f265c7f26db62c9 act []
>>> ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
>>> prog_id 111726 tag 99a197826974c876
>>> ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
>>> prog_id 108619 tag dc4630674fd72dcc act []
>>> ifindex 2 kind qdisc_clsact_egress name fbflow_egress
>>> prog_id 130245 tag 72d2d830d6888d2c
>>> ]
>>
>> Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
>> the output of bpftool perf [0] doesn't provide it either and therefore makes the list
>> as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
>> do not show dev name as opposed to xdp progs? Can we shorten everything to make it
>> a one-liner like in bpftool perf?
>>
>> Should we have a small indicator here if the tc prog was offloaded?
>>
>> Does the dump work with tc shared blocks?
>>
>> Should we also dump networking related cgroup BPF progs here under bpftool net?
>
> One more thought, I think it would make sense to also explicitly document current
> limitations of the progs that this listing does not show where user should consult
> other tools aka iproute2 e.g. lwt, seg6, tc bpf actions, etc, so the current scope
> would be more clear for users.
I will do a followup patch to address a few issues mentioned in my
previous email (mostly more concise output, briefly mentioning
limitation of the tool, etc.) and also add more information in the
documentation and also in the 'bpf net help' output to set user's
expectation right about what this tool can do and how to find more
information.
>
>> Thanks,
>> Daniel
>>
>> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e
>>
>
^ permalink raw reply
* Re: [PATCH bpf-next 4/4] tools/bpf: bpftool: add net support
From: Yonghong Song @ 2018-09-13 4:57 UTC (permalink / raw)
To: Daniel Borkmann, ast, netdev; +Cc: kernel-team
In-Reply-To: <b5b3d620-80a5-3b00-e84f-12ebce6c6925@iogearbox.net>
On 9/12/18 3:29 PM, Daniel Borkmann wrote:
> On 09/06/2018 01:58 AM, Yonghong Song wrote:
>> Add "bpftool net" support. Networking devices are enumerated
>> to dump device index/name associated with xdp progs.
>>
>> For each networking device, tc classes and qdiscs are enumerated
>> in order to check their bpf filters.
>> In addition, root handle and clsact ingress/egress are also checked for
>> bpf filters. Not all filter information is printed out. Only ifindex,
>> kind, filter name, prog_id and tag are printed out, which are good
>> enough to show attachment information. If the filter action
>> is a bpf action, its bpf program id, bpf name and tag will be
>> printed out as well.
>>
>> For example,
>> $ ./bpftool net
>> xdp [
>> ifindex 2 devname eth0 prog_id 198
>> ]
>
> Could we make the output more terse? E.g. the 'ifindex' and 'devname' is basically
> zero information but will take lots of space. 'eth0 (2)' would for example make it
> shorter. Also info is missing whether the attached prog is driver/hw/generic XDP. :(
Right 'eth0 (2)' is a good idea. Similarly to other bpftool plain
output, agree we should have concise printout.
For the above xdp output, I guess I tested on an old kernel and will
test on newer kernel to ensure it works properly.
>
>> tc_filters [
>> ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
>> prog_id 111727 tag d08fe3b4319bc2fd act []
>> ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
>> prog_id 130246 tag 3f265c7f26db62c9 act []
>> ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
>> prog_id 111726 tag 99a197826974c876
>> ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
>> prog_id 108619 tag dc4630674fd72dcc act []
>> ifindex 2 kind qdisc_clsact_egress name fbflow_egress
>> prog_id 130245 tag 72d2d830d6888d2c
>> ]
>
> Similar comment here. Do we need the tag here? I think it's not really needed, e.g.
> the output of bpftool perf [0] doesn't provide it either and therefore makes the list
> as nice one-liners, so overview is much nicer there. Is there a reason that tc progs
> do not show dev name as opposed to xdp progs? Can we shorten everything to make it
> a one-liner like in bpftool perf?
Yes, we should remove 'tag'. Users can use prog_id to find tag.
There is no IFNAME attribute in the tc filter return message.
But I already got them in previous iplink message, so I can
print out here.
> Should we have a small indicator here if the tc prog was offloaded?
Not sure about this one. As you suggested in the next email, we
can have a message like if users want more information they
can use more specific tools 'ip link ...', 'tc filter ...' etc.
>
> Does the dump work with tc shared blocks?
I does not.I did not have experiences with tc shared block and that is
why I did not add support for it.
>
> Should we also dump networking related cgroup BPF progs here under bpftool net?
probably not, but in additional to the above suggestions about using
different tools, we can add a message to suggest check `bpftool cgroup
tree` for cgroup related net programs.
>
> Thanks,
> Daniel
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04df400c30235fa347313c9e2a0695549bd2c8e
>
^ permalink raw reply
* Re: Re: [PATCH net-next v2 0/5] virtio: support packed ring
From: Jason Wang @ 2018-09-13 9:47 UTC (permalink / raw)
To: Tiwei Bie, Michael S. Tsirkin
Cc: virtualization, linux-kernel, netdev, virtio-dev, wexu, jfreimann
In-Reply-To: <20180913085919.GA42049@fbsd1.sh.intel.com>
On 2018年09月13日 16:59, Tiwei Bie wrote:
>> If what you say is true then we should take a careful look
>> and not supporting these generic things with packed layout.
>> Once we do support them it will be too late and we won't
>> be able to get performance back.
> I think it's a good point that we don't need to support
> everything in packed ring (especially these which would
> hurt the performance), as the packed ring aims at high
> performance. I'm also wondering about the features. Is
> there any possibility that we won't support the out of
> order processing (at least not by default) in packed ring?
> If I didn't miss anything, the need to support out of order
> processing in packed ring will make the data structure
> inside the driver not cache friendly which is similar to
> the case of the descriptor table in the split ring (the
> difference is that, it only happens in driver now).
Out of order is not the only user, DMA is another one. We don't have
used ring(len), so we need to maintain buffer length somewhere even for
in order device. But if it's not too late, I second for a OUT_OF_ORDER
feature. Starting from in order can have much simpler code in driver.
Thanks
^ permalink raw reply
* [PATCH] stmmac: fix valid numbers of unicast filter entries
From: Jongsung Kim @ 2018-09-13 9:32 UTC (permalink / raw)
To: linux-kernel, netdev, Giuseppe Cavallaro, Alexandre Torgue,
Jose Abreu, David S . Miller
Cc: Chanho Min, Jongsung Kim
Synopsys DWC Ethernet MAC can be configured to have 1..32, 64, or
128 unicast filter entries. (Table 7-8 MAC Address Registers from
databook) Fix dwmac1000_validate_ucast_entries() to accept values
between 1 and 32 in addition.
Signed-off-by: Jongsung Kim <neidhard.kim@lge.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 3609c7b..2b800ce 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -67,7 +67,7 @@ static int dwmac1000_validate_mcast_bins(int mcast_bins)
* Description:
* This function validates the number of Unicast address entries supported
* by a particular Synopsys 10/100/1000 controller. The Synopsys controller
- * supports 1, 32, 64, or 128 Unicast filter entries for it's Unicast filter
+ * supports 1..32, 64, or 128 Unicast filter entries for it's Unicast filter
* logic. This function validates a valid, supported configuration is
* selected, and defaults to 1 Unicast address if an unsupported
* configuration is selected.
@@ -77,8 +77,7 @@ static int dwmac1000_validate_ucast_entries(int ucast_entries)
int x = ucast_entries;
switch (x) {
- case 1:
- case 32:
+ case 1 ... 32:
case 64:
case 128:
break;
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net-next V2] virtio_net: ethtool tx napi configuration
From: Jason Wang @ 2018-09-13 9:07 UTC (permalink / raw)
To: mst; +Cc: davem, virtualization, netdev, linux-kernel, Willem de Bruijn
In-Reply-To: <20180913053545.18585-1-jasowang@redhat.com>
On 2018年09月13日 13:35, Jason Wang wrote:
> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> Interrupt moderation is currently not supported, so these accept and
> display the default settings of 0 usec and 1 frame.
>
> Toggle tx napi through a bit in tx-frames. So as to not interfere
> with possible future interrupt moderation, value 1 means tx napi while
> value 0 means not.
>
> To properly synchronize with the data path, tx napi is disabled and
> tx lock is held when changing the value of napi weight. And two more
> places that can access tx napi weight:
>
> - speculative tx polling in rx napi, we can leave it as is since it
> not a must for correctness.
> - skb_xmit_done(), one more check of napi weight is added before
> trying to enable tx to avoid tx to be disabled forever if napi is
> disabled after skb_xmit_done() but before the napi
>
> Link: https://patchwork.ozlabs.org/patch/948149/
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> Changes from V1:
> - try to synchronize with datapath to allow changing mode when
> interface is up.
> - use tx-frames 0 as to disable tx napi while tx-frames 1 to enable tx napi
> ---
> drivers/net/virtio_net.c | 64 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 765920905226..6e70864f5899 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
>
> #define VIRTNET_DRIVER_VERSION "1.0.0"
>
> +static const u32 ethtool_coalesce_napi_mask = (1UL << 10);
> +
> static const unsigned long guest_offloads[] = {
> VIRTIO_NET_F_GUEST_TSO4,
> VIRTIO_NET_F_GUEST_TSO6,
> @@ -1444,7 +1446,10 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>
> virtqueue_napi_complete(napi, sq->vq, 0);
>
> - if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
> + /* Check napi.weight to avoid tx stall since it could be set
> + * to zero by ethtool after skb_xmit_done().
> + */
> + if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS || !sq->napi.weight)
> netif_tx_wake_queue(txq);
>
> return 0;
> @@ -2181,6 +2186,61 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
> return 0;
> }
>
> +static int virtnet_set_coalesce(struct net_device *dev,
> + struct ethtool_coalesce *ec)
> +{
> + struct ethtool_coalesce ec_default = {
> + .cmd = ETHTOOL_SCOALESCE,
> + .rx_max_coalesced_frames = 1,
> + };
> + struct virtnet_info *vi = netdev_priv(dev);
> + int i, napi_weight;
> +
> + if (ec->tx_max_coalesced_frames > 1)
> + return -EINVAL;
> +
> + ec_default.tx_max_coalesced_frames = ec->tx_max_coalesced_frames;
> + napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
> +
> + /* disallow changes to fields not explicitly tested above */
> + if (memcmp(ec, &ec_default, sizeof(ec_default)))
> + return -EINVAL;
> +
> + if (napi_weight ^ vi->sq[0].napi.weight) {
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> + struct netdev_queue *txq =
> + netdev_get_tx_queue(vi->dev, i);
> +
> + virtnet_napi_tx_disable(&vi->sq[i].napi);
> + __netif_tx_lock_bh(txq);
Need to check netif_running() before disabling napi. Otherwise we may
have a infinite loop here.
The discussion is still ongoing, if we decide to go set_coalesce, I will
post a V3.
Thanks
> + vi->sq[i].napi.weight = napi_weight;
> + __netif_tx_unlock_bh(txq);
> + virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> + &vi->sq[i].napi);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int virtnet_get_coalesce(struct net_device *dev,
> + struct ethtool_coalesce *ec)
> +{
> + struct ethtool_coalesce ec_default = {
> + .cmd = ETHTOOL_GCOALESCE,
> + .rx_max_coalesced_frames = 1,
> + .tx_max_coalesced_frames = 0,
> + };
> + struct virtnet_info *vi = netdev_priv(dev);
> +
> + memcpy(ec, &ec_default, sizeof(ec_default));
> +
> + if (vi->sq[0].napi.weight)
> + ec->tx_max_coalesced_frames = 1;
> +
> + return 0;
> +}
> +
> static void virtnet_init_settings(struct net_device *dev)
> {
> struct virtnet_info *vi = netdev_priv(dev);
> @@ -2219,6 +2279,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
> .get_ts_info = ethtool_op_get_ts_info,
> .get_link_ksettings = virtnet_get_link_ksettings,
> .set_link_ksettings = virtnet_set_link_ksettings,
> + .set_coalesce = virtnet_set_coalesce,
> + .get_coalesce = virtnet_get_coalesce,
> };
>
> static void virtnet_freeze_down(struct virtio_device *vdev)
^ permalink raw reply
* Re: Re: [PATCH net-next v2 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-09-13 8:59 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, virtualization, linux-kernel, netdev, virtio-dev,
wexu, jfreimann
In-Reply-To: <20180912121457-mutt-send-email-mst@kernel.org>
On Wed, Sep 12, 2018 at 12:16:32PM -0400, Michael S. Tsirkin wrote:
> On Tue, Sep 11, 2018 at 01:37:26PM +0800, Tiwei Bie wrote:
> > On Mon, Sep 10, 2018 at 11:33:17AM +0800, Jason Wang wrote:
> > > On 2018年09月10日 11:00, Tiwei Bie wrote:
> > > > On Fri, Sep 07, 2018 at 09:00:49AM -0400, Michael S. Tsirkin wrote:
> > > > > On Fri, Sep 07, 2018 at 09:22:25AM +0800, Tiwei Bie wrote:
> > > > > > On Mon, Aug 27, 2018 at 05:00:40PM +0300, Michael S. Tsirkin wrote:
> > > > > > > Are there still plans to test the performance with vost pmd?
> > > > > > > vhost doesn't seem to show a performance gain ...
> > > > > > >
> > > > > > I tried some performance tests with vhost PMD. In guest, the
> > > > > > XDP program will return XDP_DROP directly. And in host, testpmd
> > > > > > will do txonly fwd.
> > > > > >
> > > > > > When burst size is 1 and packet size is 64 in testpmd and
> > > > > > testpmd needs to iterate 5 Tx queues (but only the first two
> > > > > > queues are enabled) to prepare and inject packets, I got ~12%
> > > > > > performance boost (5.7Mpps -> 6.4Mpps). And if the vhost PMD
> > > > > > is faster (e.g. just need to iterate the first two queues to
> > > > > > prepare and inject packets), then I got similar performance
> > > > > > for both rings (~9.9Mpps) (packed ring's performance can be
> > > > > > lower, because it's more complicated in driver.)
> > > > > >
> > > > > > I think packed ring makes vhost PMD faster, but it doesn't make
> > > > > > the driver faster. In packed ring, the ring is simplified, and
> > > > > > the handling of the ring in vhost (device) is also simplified,
> > > > > > but things are not simplified in driver, e.g. although there is
> > > > > > no desc table in the virtqueue anymore, driver still needs to
> > > > > > maintain a private desc state table (which is still managed as
> > > > > > a list in this patch set) to support the out-of-order desc
> > > > > > processing in vhost (device).
> > > > > >
> > > > > > I think this patch set is mainly to make the driver have a full
> > > > > > functional support for the packed ring, which makes it possible
> > > > > > to leverage the packed ring feature in vhost (device). But I'm
> > > > > > not sure whether there is any other better idea, I'd like to
> > > > > > hear your thoughts. Thanks!
> > > > > Just this: Jens seems to report a nice gain with virtio and
> > > > > vhost pmd across the board. Try to compare virtio and
> > > > > virtio pmd to see what does pmd do better?
> > > > The virtio PMD (drivers/net/virtio) in DPDK doesn't need to share
> > > > the virtio ring operation code with other drivers and is highly
> > > > optimized for network. E.g. in Rx, the Rx burst function won't
> > > > chain descs. So the ID management for the Rx ring can be quite
> > > > simple and straightforward, we just need to initialize these IDs
> > > > when initializing the ring and don't need to change these IDs
> > > > in data path anymore (the mergable Rx code in that patch set
> > > > assumes the descs will be written back in order, which should be
> > > > fixed. I.e., the ID in the desc should be used to index vq->descx[]).
> > > > The Tx code in that patch set also assumes the descs will be
> > > > written back by device in order, which should be fixed.
> > >
> > > Yes it is. I think I've pointed it out in some early version of pmd patch.
> > > So I suspect part (or all) of the boost may come from in order feature.
> > >
> > > >
> > > > But in kernel virtio driver, the virtio_ring.c is very generic.
> > > > The enqueue (virtqueue_add()) and dequeue (virtqueue_get_buf_ctx())
> > > > functions need to support all the virtio devices and should be
> > > > able to handle all the possible cases that may happen. So although
> > > > the packed ring can be very efficient in some cases, currently
> > > > the room to optimize the performance in kernel's virtio_ring.c
> > > > isn't that much. If we want to take the fully advantage of the
> > > > packed ring's efficiency, we need some further e.g. API changes
> > > > in virtio_ring.c, which shouldn't be part of this patch set.
> > >
> > > Could you please share more thoughts on this e.g how to improve the API?
> > > Notice since the API is shared by both split ring and packed ring, it may
> > > improve the performance of split ring as well. One can easily imagine a
> > > batching API, but it does not have many real users now, the only case is the
> > > XDP transmission which can accept an array of XDP frames.
> >
> > I don't have detailed thoughts on this yet. But kernel's
> > virtio_ring.c is quite generic compared with what we did
> > in virtio PMD.
>
> In what way? What are some things that aren't implemented there?
Below is the code corresponding to the virtqueue_add()
for Rx ring in virtio PMD:
https://github.com/DPDK/dpdk/blob/3605968c2fa7/drivers/net/virtio/virtio_rxtx.c#L278-L304
And below is the code of virtqueue_add() in Linux:
https://github.com/torvalds/linux/blob/54eda9df17f3/drivers/virtio/virtio_ring.c#L275-L417
In virtio PMD, for each packet (mbuf), the code is pretty
straightforward, it will just check whether there is one
available desc. If it's true, it will just fill this desc
directly.
But in virtqueue_add(), it's obvious that, the logic is
much more complicated or generic. It's supposed to be
able to handle sglist (which may consist of multiple IN
buffers and multiple OUT buffers at the same time), and
it will try to use indirect descriptors. Then it needs
several loops to parse the sglist. That's why I said
it's quite generic.
>
> If what you say is true then we should take a careful look
> and not supporting these generic things with packed layout.
> Once we do support them it will be too late and we won't
> be able to get performance back.
I think it's a good point that we don't need to support
everything in packed ring (especially these which would
hurt the performance), as the packed ring aims at high
performance. I'm also wondering about the features. Is
there any possibility that we won't support the out of
order processing (at least not by default) in packed ring?
If I didn't miss anything, the need to support out of order
processing in packed ring will make the data structure
inside the driver not cache friendly which is similar to
the case of the descriptor table in the split ring (the
difference is that, it only happens in driver now).
>
>
>
> > >
> > > > So
> > > > I still think this patch set is mainly to make the kernel virtio
> > > > driver to have a full functional support of the packed ring, and
> > > > we can't expect impressive performance boost with it.
> > >
> > > We can only gain when virtio ring layout is the bottleneck. If there're
> > > bottlenecks elsewhere, we probably won't see any increasing in the numbers.
> > > Vhost-net is an example, and lots of optimizations have proved that virtio
> > > ring is not the main bottleneck for the current codes. I suspect it also the
> > > case of virtio driver. Did perf tell us any interesting things in virtio
> > > driver?
> > >
> > > Thanks
> > >
> > > >
> > > > >
> > > > > > > On Wed, Jul 11, 2018 at 10:27:06AM +0800, Tiwei Bie wrote:
> > > > > > > > Hello everyone,
> > > > > > > >
> > > > > > > > This patch set implements packed ring support in virtio driver.
> > > > > > > >
> > > > > > > > Some functional tests have been done with Jason's
> > > > > > > > packed ring implementation in vhost:
> > > > > > > >
> > > > > > > > https://lkml.org/lkml/2018/7/3/33
> > > > > > > >
> > > > > > > > Both of ping and netperf worked as expected.
> > > > > > > >
> > > > > > > > v1 -> v2:
> > > > > > > > - Use READ_ONCE() to read event off_wrap and flags together (Jason);
> > > > > > > > - Add comments related to ccw (Jason);
> > > > > > > >
> > > > > > > > RFC (v6) -> v1:
> > > > > > > > - Avoid extra virtio_wmb() in virtqueue_enable_cb_delayed_packed()
> > > > > > > > when event idx is off (Jason);
> > > > > > > > - Fix bufs calculation in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > > > > > - Test the state of the desc at used_idx instead of last_used_idx
> > > > > > > > in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > > > > > - Save wrap counter (as part of queue state) in the return value
> > > > > > > > of virtqueue_enable_cb_prepare_packed();
> > > > > > > > - Refine the packed ring definitions in uapi;
> > > > > > > > - Rebase on the net-next tree;
> > > > > > > >
> > > > > > > > RFC v5 -> RFC v6:
> > > > > > > > - Avoid tracking addr/len/flags when DMA API isn't used (MST/Jason);
> > > > > > > > - Define wrap counter as bool (Jason);
> > > > > > > > - Use ALIGN() in vring_init_packed() (Jason);
> > > > > > > > - Avoid using pointer to track `next` in detach_buf_packed() (Jason);
> > > > > > > > - Add comments for barriers (Jason);
> > > > > > > > - Don't enable RING_PACKED on ccw for now (noticed by Jason);
> > > > > > > > - Refine the memory barrier in virtqueue_poll();
> > > > > > > > - Add a missing memory barrier in virtqueue_enable_cb_delayed_packed();
> > > > > > > > - Remove the hacks in virtqueue_enable_cb_prepare_packed();
> > > > > > > >
> > > > > > > > RFC v4 -> RFC v5:
> > > > > > > > - Save DMA addr, etc in desc state (Jason);
> > > > > > > > - Track used wrap counter;
> > > > > > > >
> > > > > > > > RFC v3 -> RFC v4:
> > > > > > > > - Make ID allocation support out-of-order (Jason);
> > > > > > > > - Various fixes for EVENT_IDX support;
> > > > > > > >
> > > > > > > > RFC v2 -> RFC v3:
> > > > > > > > - Split into small patches (Jason);
> > > > > > > > - Add helper virtqueue_use_indirect() (Jason);
> > > > > > > > - Just set id for the last descriptor of a list (Jason);
> > > > > > > > - Calculate the prev in virtqueue_add_packed() (Jason);
> > > > > > > > - Fix/improve desc suppression code (Jason/MST);
> > > > > > > > - Refine the code layout for XXX_split/packed and wrappers (MST);
> > > > > > > > - Fix the comments and API in uapi (MST);
> > > > > > > > - Remove the BUG_ON() for indirect (Jason);
> > > > > > > > - Some other refinements and bug fixes;
> > > > > > > >
> > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > > - Some other refinements and bug fixes;
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > Tiwei Bie (5):
> > > > > > > > virtio: add packed ring definitions
> > > > > > > > virtio_ring: support creating packed ring
> > > > > > > > virtio_ring: add packed ring support
> > > > > > > > virtio_ring: add event idx support in packed ring
> > > > > > > > virtio_ring: enable packed ring
> > > > > > > >
> > > > > > > > drivers/s390/virtio/virtio_ccw.c | 14 +
> > > > > > > > drivers/virtio/virtio_ring.c | 1365 ++++++++++++++++++++++------
> > > > > > > > include/linux/virtio_ring.h | 8 +-
> > > > > > > > include/uapi/linux/virtio_config.h | 3 +
> > > > > > > > include/uapi/linux/virtio_ring.h | 43 +
> > > > > > > > 5 files changed, 1157 insertions(+), 276 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.18.0
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > >
> > >
^ permalink raw reply
* Re: [PATCH net] geneve: add ttl inherit support
From: David Miller @ 2018-09-13 3:38 UTC (permalink / raw)
To: liuhangbin; +Cc: netdev, pshelar, jbenc, sbrivio
In-Reply-To: <1536717861-21590-1-git-send-email-liuhangbin@gmail.com>
From: Hangbin Liu <liuhangbin@gmail.com>
Date: Wed, 12 Sep 2018 10:04:21 +0800
> Similar with commit 72f6d71e491e6 ("vxlan: add ttl inherit support"),
> currently ttl == 0 means "use whatever default value" on geneve instead
> of inherit inner ttl. To respect compatibility with old behavior, let's
> add a new IFLA_GENEVE_TTL_INHERIT for geneve ttl inherit support.
>
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Suggested-by: Jiri Benc <jbenc@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Applied to net-next.
^ permalink raw reply
* [PATCH 1/2] netlink: add NLA_REJECT policy type
From: Johannes Berg @ 2018-09-13 8:46 UTC (permalink / raw)
To: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA
Cc: Michal Kubecek, Johannes Berg
From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
In some situations some netlink attributes may be used for output
only (kernel->userspace) or may be reserved for future use. It's
then helpful to be able to prevent userspace from using them in
messages sent to the kernel, since they'd otherwise be ignored and
any future will become impossible if this happens.
Add NLA_REJECT to the policy which does nothing but reject (with
EINVAL) validation of any messages containing this attribute.
Allow for returning a specific extended ACK error message in the
validation_data pointer.
While at it fix the indentation of NLA_BITFIELD32 and describe the
validation_data pointer for it.
The specific case I have in mind now is a shared nested attribute
containing request/response data, and it would be pointless and
potentially confusing to have userspace include response data in
the messages that actually contain a request.
Signed-off-by: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
include/net/netlink.h | 6 +++++-
lib/nlattr.c | 22 +++++++++++++++-------
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/include/net/netlink.h b/include/net/netlink.h
index 0c154f98e987..04e40fcc70d6 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -180,6 +180,7 @@ enum {
NLA_S32,
NLA_S64,
NLA_BITFIELD32,
+ NLA_REJECT,
__NLA_TYPE_MAX,
};
@@ -208,7 +209,10 @@ enum {
* NLA_MSECS Leaving the length field zero will verify the
* given type fits, using it verifies minimum length
* just like "All other"
- * NLA_BITFIELD32 A 32-bit bitmap/bitselector attribute
+ * NLA_BITFIELD32 A 32-bit bitmap/bitselector attribute, validation
+ * data must point to a u32 value of valid flags
+ * NLA_REJECT Reject this attribute, validation data may point
+ * to a string to report as the error in extended ACK.
* All other Minimum length of attribute payload
*
* Example:
diff --git a/lib/nlattr.c b/lib/nlattr.c
index e335bcafa9e4..56e0aae5cf23 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -69,7 +69,8 @@ static int validate_nla_bitfield32(const struct nlattr *nla,
}
static int validate_nla(const struct nlattr *nla, int maxtype,
- const struct nla_policy *policy)
+ const struct nla_policy *policy,
+ struct netlink_ext_ack *extack)
{
const struct nla_policy *pt;
int minlen = 0, attrlen = nla_len(nla), type = nla_type(nla);
@@ -87,6 +88,11 @@ static int validate_nla(const struct nlattr *nla, int maxtype,
}
switch (pt->type) {
+ case NLA_REJECT:
+ if (pt->validation_data && extack)
+ extack->_msg = pt->validation_data;
+ return -EINVAL;
+
case NLA_FLAG:
if (attrlen > 0)
return -ERANGE;
@@ -180,11 +186,10 @@ int nla_validate(const struct nlattr *head, int len, int maxtype,
int rem;
nla_for_each_attr(nla, head, len, rem) {
- int err = validate_nla(nla, maxtype, policy);
+ int err = validate_nla(nla, maxtype, policy, extack);
if (err < 0) {
- if (extack)
- extack->bad_attr = nla;
+ NL_SET_BAD_ATTR(extack, nla);
return err;
}
}
@@ -251,10 +256,13 @@ int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
if (type > 0 && type <= maxtype) {
if (policy) {
- err = validate_nla(nla, maxtype, policy);
+ err = validate_nla(nla, maxtype, policy,
+ extack);
if (err < 0) {
- NL_SET_ERR_MSG_ATTR(extack, nla,
- "Attribute failed policy validation");
+ NL_SET_BAD_ATTR(extack, nla);
+ if (extack && !extack->_msg)
+ NL_SET_ERR_MSG(extack,
+ "Attribute failed policy validation");
goto errout;
}
}
--
2.14.4
^ permalink raw reply related
* Re: [PATCH] net: dsa: mv88e6xxx: Make sure to configure ports with external PHYs
From: David Miller @ 2018-09-13 3:36 UTC (permalink / raw)
To: marex; +Cc: netdev, andrew
In-Reply-To: <20180911221524.12938-1-marex@denx.de>
From: Marek Vasut <marex@denx.de>
Date: Wed, 12 Sep 2018 00:15:24 +0200
> The MV88E6xxx can have external PHYs attached to certain ports and those
> PHYs could even be on different MDIO bus than the one within the switch.
> This patch makes sure that ports with such PHYs are configured correctly
> according to the information provided by the PHY.
>
> Signed-off-by: Marek Vasut <marex@denx.de>
Applied to net-next.
^ permalink raw reply
* Re: [PATCH] net: ethernet: Use DIV_ROUND_UP instead of reimplementing its function
From: David Miller @ 2018-09-13 3:35 UTC (permalink / raw)
To: zhongjiang
Cc: claudiu.manoil, tariqt, saeedm, derek.chickles, leon, jdmason,
netdev, linux-kernel
In-Reply-To: <1536671295-14432-1-git-send-email-zhongjiang@huawei.com>
From: zhong jiang <zhongjiang@huawei.com>
Date: Tue, 11 Sep 2018 21:08:15 +0800
> DIV_ROUND_UP has implemented the code-opened function. Therefore, just
> replace the implementation with DIV_ROUND_UP.
>
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Applied to net-next, thanks.
^ permalink raw reply
* Re: [PATCH net-next] qlcnic: Remove set but not used variables 'fw_mbx' and 'hdr_size'
From: David Miller @ 2018-09-13 3:33 UTC (permalink / raw)
To: yuehaibing
Cc: harish.patil, manish.chopra, Dept-GELinuxNICDev, netdev,
kernel-janitors
In-Reply-To: <1536666689-3906-1-git-send-email-yuehaibing@huawei.com>
From: YueHaibing <yuehaibing@huawei.com>
Date: Tue, 11 Sep 2018 11:51:29 +0000
> From: Yue Haibing <yuehaibing@huawei.com>
>
> Fixes gcc '-Wunused-but-set-variable' warning:
>
> drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c: In function 'qlcnic_sriov_pull_bc_msg':
> drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c:907:6: warning:
> variable 'fw_mbx' set but not used [-Wunused-but-set-variable]
>
> drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c: In function 'qlcnic_sriov_issue_bc_post':
> drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_common.c:939:16: warning:
> variable 'hdr_size' set but not used [-Wunused-but-set-variable]
>
> Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH v3 net-next 00/12] Preparing for phylib limkmodes
From: David Miller @ 2018-09-13 3:24 UTC (permalink / raw)
To: andrew; +Cc: netdev, f.fainelli
In-Reply-To: <1536709999-13420-1-git-send-email-andrew@lunn.ch>
From: Andrew Lunn <andrew@lunn.ch>
Date: Wed, 12 Sep 2018 01:53:07 +0200
> phylib currently makes us of a u32 bitmap for advertising, supported,
> and link partner capabilities. For a long time, this has been
> sufficient, for devices up to 1Gbps. With more MAC/PHY combinations
> now supporting speeds greater than 1Gbps, we have run out of
> bits. There is the need to replace this u32 with an
> __ETHTOOL_DECLARE_LINK_MODE_MASK, which makes use of linux's generic
> bitmaps.
>
> This patchset does some of the work preparing for this change. A few
> cleanups are applied to PHY drivers. Some MAC drivers directly access
> members of phydev which are going to change type. These patches adds
> some helpers and swaps MAC drivers to use them, mostly dealing with
> Pause configuration.
>
> v3:
> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
> Add missing at in commit message
> Change Subject of patch 5
> Fix return in from phy_set_asym_pause
> Fix kerneldoc in phy_set_pause
>
> v2:
> Fixup bad indentation in tg3.c
> Rename phy_support_pause() to phy_support_sym_pause()
> Also trigger autoneg if the advertising settings have changed.
> Rename phy_set_pause() to phy_set_sym_pause()
> Use the bcm63xx_enet.c logic, not fec_main.c for validating pause
Series applied, thanks Andrew.
^ permalink raw reply
* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Alexei Starovoitov @ 2018-09-13 2:07 UTC (permalink / raw)
To: Sowmini Varadhan
Cc: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
john.fastabend, kafai, rdna, yhs, netdev, rds-devel
In-Reply-To: <20180913005954.GA30305@oracle.com>
On Wed, Sep 12, 2018 at 08:59:54PM -0400, Sowmini Varadhan wrote:
> > On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> > >please no samples.
> > >Add this as proper test to tools/testing/selftests/bpf
> > >that reports PASS/FAIL and can be run automatically.
> > >samples/bpf is effectively dead code.
>
> Just a second.
>
> You do realize that RDS is doing real networking, so it needs
> RDMA capable hardware to test the rds_rdma paths? Also, when we
> "talk to ourselves" we default to the rds_loop transport, so
> we would even bypass the rds-tcp module.
>
> I dont think this can be tested with some academic "test it
> over lo0" exercise.. I suppose you can add example code in
> sefltests for this, but asking for a "proper test" may be
> a litte unrealistic here- a proper test needs proper hardware
> in this case.
I didn't know that. The way I understand your statement that
this new program type, new sg logic, and all the complexity
are only applicable to RDMA capable hw and RDS.
In such case, I think, such BPF extensions do not belong
in the upstream kernel. I don't want BPF to support niche technologies,
since the maintenance overhead makes it prohibitive long term.
If the kernel had a way to deprecate the api it would have been
different story. Every new feature and bpf extension lands once
and then being maintained forever. Hence we have to be very
selective weighting the benefits vs long term maintenance.
I'm happy to review the code further and suggest improvements,
but it's not going to be applied.
^ permalink raw reply
* Re: [PATCH v3] PCI: Reprogram bridge prefetch registers on resume
From: Rafael J. Wysocki @ 2018-09-13 6:52 UTC (permalink / raw)
To: Daniel Drake
Cc: andy.shevchenko-VuQAYsv1563Yd54FQh9/CA, Linux PM, Linux PCI,
Rafael Wysocki, nic_swsd-Rasf1IRRPZFBDgjK7y7TUQ, Keith Busch,
netdev, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Bjorn Helgaas,
rchang-eYqpPyKDWXRBDgjK7y7TUQ, Linux Upstreaming Team,
David Miller, jonathan.derrick-ral2JQCrhuEAvxtiuMwx3w,
Heiner Kallweit
In-Reply-To: <20180913033745.11178-1-drake-6IF/jdPJHihWk0Htik3J/w@public.gmane.org>
On Thu, Sep 13, 2018 at 5:37 AM Daniel Drake <drake@endlessm.com> wrote:
>
> On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
> after S3 suspend/resume. The affected products include multiple
> generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> many errors such as:
>
> fifo: fault 00 [READ] at 0000005555555000 engine 00 [GR] client 04
> [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
> DRM: failed to idle channel 0 [DRM]
>
> Similarly, the nvidia proprietary driver also fails after resume
> (black screen, 100% CPU usage in Xorg process). We shipped a sample
> to Nvidia for diagnosis, and their response indicated that it's a
> problem with the parent PCI bridge (on the Intel SoC), not the GPU.
>
> Runtime suspend/resume works fine, only S3 suspend is affected.
>
> We found a workaround: on resume, rewrite the Intel PCI bridge
> 'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
> the cases that I checked, this register has value 0 and we just have to
> rewrite that value.
>
> Linux already saves and restores PCI config space during suspend/resume,
> but this register was being skipped because upon resume, it already
> has value 0 (the correct, pre-suspend value).
>
> Intel appear to have previously acknowledged this behaviour and the
> requirement to rewrite this register.
> https://bugzilla.kernel.org/show_bug.cgi?id=116851#c23
>
> Based on that, rewrite the prefetch register values even when that
> appears unnecessary.
>
> We have confirmed this solution on all the affected models we have
> in-hands (X542UQ, UX533FD, X530UN, V272UN).
>
> Additionally, this solves an issue where r8169 MSI-X interrupts were
> broken after S3 suspend/resume on Asus X441UAR. This issue was recently
> worked around in commit 7bb05b85bc2d ("r8169: don't use MSI-X on
> RTL8106e"). It also fixes the same issue on RTL6186evl/8111evl on an
> Aimfor-tech laptop that we had not yet patched. I suspect it will also
> fix the issue that was worked around in commit 7c53a722459c ("r8169:
> don't use MSI-X on RTL8168g").
>
> Thomas Martitz reports that this change also solves an issue where
> the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
> after S3 suspend/resume.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201069
> Signed-off-by: Daniel Drake <drake@endlessm.com>
Still
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> drivers/pci/pci.c | 25 +++++++++++++++++--------
> 1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 29ff9619b5fa..5d58220b6997 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1289,12 +1289,12 @@ int pci_save_state(struct pci_dev *dev)
> EXPORT_SYMBOL(pci_save_state);
>
> static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> - u32 saved_val, int retry)
> + u32 saved_val, int retry, bool force)
> {
> u32 val;
>
> pci_read_config_dword(pdev, offset, &val);
> - if (val == saved_val)
> + if (!force && val == saved_val)
> return;
>
> for (;;) {
> @@ -1313,25 +1313,34 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> }
>
> static void pci_restore_config_space_range(struct pci_dev *pdev,
> - int start, int end, int retry)
> + int start, int end, int retry,
> + bool force)
> {
> int index;
>
> for (index = end; index >= start; index--)
> pci_restore_config_dword(pdev, 4 * index,
> pdev->saved_config_space[index],
> - retry);
> + retry, force);
> }
>
> static void pci_restore_config_space(struct pci_dev *pdev)
> {
> if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
> - pci_restore_config_space_range(pdev, 10, 15, 0);
> + pci_restore_config_space_range(pdev, 10, 15, 0, false);
> /* Restore BARs before the command register. */
> - pci_restore_config_space_range(pdev, 4, 9, 10);
> - pci_restore_config_space_range(pdev, 0, 3, 0);
> + pci_restore_config_space_range(pdev, 4, 9, 10, false);
> + pci_restore_config_space_range(pdev, 0, 3, 0, false);
> + } else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> + pci_restore_config_space_range(pdev, 12, 15, 0, false);
> + /* Force rewriting of prefetch registers to avoid
> + * S3 resume issues on Intel PCI bridges that occur when
> + * these registers are not explicitly written.
> + */
> + pci_restore_config_space_range(pdev, 9, 11, 0, true);
> + pci_restore_config_space_range(pdev, 0, 8, 0, false);
> } else {
> - pci_restore_config_space_range(pdev, 0, 15, 0);
> + pci_restore_config_space_range(pdev, 0, 15, 0, false);
> }
> }
>
> --
> 2.17.1
>
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
^ permalink raw reply
* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Milan Broz @ 2018-09-13 6:39 UTC (permalink / raw)
To: Andy Lutomirski, Ard Biesheuvel
Cc: Jason A. Donenfeld, LKML, Netdev, David Miller,
Greg Kroah-Hartman, Samuel Neves, Jean-Philippe Aumasson,
Linux Crypto Mailing List
In-Reply-To: <CALCETrU2qkLVaxC=cpU5iAeT5B7xGsR+m2ZWtLVK37jMMWtcAA@mail.gmail.com>
On 13/09/18 01:45, Andy Lutomirski wrote:
> On Wed, Sep 12, 2018 at 3:56 PM, Ard Biesheuvel
...
> b) Crypto that is used dynamically. This includes dm-crypt
> (aes-xts-plain64, aes-cbc-essiv, etc), all the ALG_IF interfaces, a
> lot of IPSEC stuff, possibly KCM, and probably many more. These will
> get comparatively little benefit from being converted to a zinc-like
> interface. For some of these cases, it wouldn't make any sense at all
> to convert them. Certainly the ones that do async hardware crypto
> using DMA engines will never look at all like zinc, even under the
> hood.
Please note, that dm-crypt now uses not only block ciphers and modes,
but also authenticated encryption and hashes (for ESSIV and HMAC
in authenticated composed modes) and RNG (for random IV).
We use crypto API, including async variants (I hope correctly :)
There is a long time battle to move initialization vectors generators
from dm-crypt to crypto API. If there are any plans to use a new library,
this issue should be discussed as well.
(Some dm-crypt IV generators are disk encryption specific, some do more
that just IV so porting is not straightforward etc).
Related problem here is an optimization of chain of sectors encryption -
if we have new crypto API, it would be nice if can take chain of sectors
so possible implementation can process this chain in one batch
(every sector need to be tweaked by differently generated IV - and we
are back in problem above).
I think filesystem encryption uses the same pattern.
And btw, we use the same algorithms through AF_ALG in userspace (cryptsetup).
So please, if you mention dm-crypt, note that it is very complex
crypto API consumer :) And everything is dynamic, configurable through
dm-crypt options.
That said, I would be more than happy to help in experiments to porting dm-crypt
to any other crypto library, but if it doesn't not help with problems
mentioned above, I do not see any compelling reason for the new library for dm-crypt...
Milan
^ permalink raw reply
* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
From: Sowmini Varadhan @ 2018-09-13 0:59 UTC (permalink / raw)
To: Tushar Dave
Cc: Alexei Starovoitov, ast, daniel, davem, santosh.shilimkar,
jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
john.fastabend, kafai, rdna, yhs, netdev, rds-devel
In-Reply-To: <74f959c3-27ef-a67c-6a54-599d84cde90b@oracle.com>
> On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> >please no samples.
> >Add this as proper test to tools/testing/selftests/bpf
> >that reports PASS/FAIL and can be run automatically.
> >samples/bpf is effectively dead code.
Just a second.
You do realize that RDS is doing real networking, so it needs
RDMA capable hardware to test the rds_rdma paths? Also, when we
"talk to ourselves" we default to the rds_loop transport, so
we would even bypass the rds-tcp module.
I dont think this can be tested with some academic "test it
over lo0" exercise.. I suppose you can add example code in
sefltests for this, but asking for a "proper test" may be
a litte unrealistic here- a proper test needs proper hardware
in this case.
--Sowmini
^ permalink raw reply
* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Ard Biesheuvel @ 2018-09-13 5:41 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jason A. Donenfeld, LKML, Netdev, David Miller,
Greg Kroah-Hartman, Samuel Neves, Jean-Philippe Aumasson,
Linux Crypto Mailing List
In-Reply-To: <CALCETrU2qkLVaxC=cpU5iAeT5B7xGsR+m2ZWtLVK37jMMWtcAA@mail.gmail.com>
On 13 September 2018 at 01:45, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Sep 12, 2018 at 3:56 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> I realize you've put a lot of good and hard work into the existing
>> I am also concerned about your claim that all software algorithms will
>> be moved into this crypto library. You are not specific about whose
>> responsibility it will be that this is going to happen in a timely
>> fashion. But more importantly, it is not clear at all how you expect
>> this to work for, e.g., h/w instruction based SHAxxx or AES in various
>> chaining modes, which should be used only on cores that implement
>> those instructions (note that on arm64, we have optional instructions
>> for AES, PMULL, SHA1, SHA256, SHA512, SHA3, SM3 and SM4). Are all
>> those implementations (only few of which will be used on a certain
>> core) going to be part of the monolithic library? What are the APIs
>> going to look like for block ciphers, taking chaining modes into
>> account?
>
> I'm not convinced that there's any real need for *all* crypto
> algorithms to move into lib/zinc or to move at all. As I see it,
> there are two classes of crypto algorithms in the kernel:
>
> a) Crypto that is used by code that chooses its algorithm statically
> and wants synchronous operations. These include everything in
> drivers/char/random.c, but also a bunch of various networking things
> that are hardcoded and basically everything that uses stack buffers.
> (This means it includes all the code that I broke when I did
> VMAP_STACK. Sign.)
>
> b) Crypto that is used dynamically. This includes dm-crypt
> (aes-xts-plain64, aes-cbc-essiv, etc), all the ALG_IF interfaces, a
> lot of IPSEC stuff, possibly KCM, and probably many more. These will
> get comparatively little benefit from being converted to a zinc-like
> interface. For some of these cases, it wouldn't make any sense at all
> to convert them. Certainly the ones that do async hardware crypto
> using DMA engines will never look at all like zinc, even under the
> hood.
>
> I think that, as a short-term goal, it makes a lot of sense to have
> implementations of the crypto that *new* kernel code (like Wireguard)
> wants to use in style (a) that live in /lib, and it obviously makes
> sense to consolidate their implementations with the crypto/
> implementations in a timely manner. As a medium-term goal, adding
> more algorithms as needed for things that could use the simpler APIs
> (Bluetooth, perhaps) would make sense.
>
> But I see no reason at all that /lib should ever contain a grab-bag of
> crypto implementations just for the heck of it. They should have real
> in-kernel users IMO. And this means that there will probably always
> be some crypto implementations in crypto/ for things like aes-xts.
>
But one of the supposed selling points of this crypto library is that
it gives engineers who are frightened of crypto in general and the
crypto API in particular simple and easy to use crypto primitives
rather than having to jump through the crypto API's hoops.
A crypto library whose only encryption algorithm is a stream cipher
does *not* deliver on that promise, since it is only suitable for
cases where IVs are guaranteed not to be reused. You yourself were
bitten by the clunkiness of the crypto API when attempting to use the
SHA26 code, right? So shouldn't we move that into this crypto library
as well?
I think it is reasonable for WireGuard to standardize on
ChaCha20/Poly1305 only, although I have my concerns about the flag day
that will be required if this 'one true cipher' ever does turn out to
be compromised (either that, or we will have to go back in time and
add some kind of protocol versioning to existing deployments of
WireGuard)
And frankly, if the code were as good as the prose, we wouldn't be
having this discussion. Zinc adds its own clunky ways to mix arch and
generic code, involving GCC -include command line arguments and
#ifdefs everywhere. My review comments on this were completely ignored
by Jason.
^ permalink raw reply
* Re: [PATCH iproute2] iproute2: fix use-after-free
From: Stephen Hemminger @ 2018-09-13 0:33 UTC (permalink / raw)
To: Mahesh Bandewar; +Cc: netdev, Mahesh Bandewar
In-Reply-To: <20180912232928.166085-1-mahesh@bandewar.net>
On Wed, 12 Sep 2018 16:29:28 -0700
Mahesh Bandewar <mahesh@bandewar.net> wrote:
> From: Mahesh Bandewar <maheshb@google.com>
>
> A local program using iproute2 lib pointed out the issue and looking
> at the code it is pretty obvious -
>
> a = (struct nlmsghdr *)b;
> ...
> free(b);
> if (a->nlmsg_seq == seq)
> ...
>
> Fixes: 86bf43c7c2fd ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Yes, this is a real problem.
Maybe a minimal patch like this would be enough:
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 928de1dd16d8..ab2d8452e4a1 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -661,6 +661,8 @@ next:
if (l < sizeof(struct nlmsgerr)) {
fprintf(stderr, "ERROR truncated\n");
} else if (!err->error) {
+ __u32 err_seq = h->nlmsg_seq;
+
/* check messages from kernel */
nl_dump_ext_ack(h, errfn);
@@ -668,7 +670,8 @@ next:
*answer = (struct nlmsghdr *)buf;
else
free(buf);
- if (h->nlmsg_seq == seq)
+
+ if (err_seq == seq)
return 0;
else if (i < iovlen)
goto next;
^ permalink raw reply related
* [PATCH net-next V2] virtio_net: ethtool tx napi configuration
From: Jason Wang @ 2018-09-13 5:35 UTC (permalink / raw)
To: mst, jasowang
Cc: netdev, Willem de Bruijn, davem, linux-kernel, virtualization
Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
Interrupt moderation is currently not supported, so these accept and
display the default settings of 0 usec and 1 frame.
Toggle tx napi through a bit in tx-frames. So as to not interfere
with possible future interrupt moderation, value 1 means tx napi while
value 0 means not.
To properly synchronize with the data path, tx napi is disabled and
tx lock is held when changing the value of napi weight. And two more
places that can access tx napi weight:
- speculative tx polling in rx napi, we can leave it as is since it
not a must for correctness.
- skb_xmit_done(), one more check of napi weight is added before
trying to enable tx to avoid tx to be disabled forever if napi is
disabled after skb_xmit_done() but before the napi
Link: https://patchwork.ozlabs.org/patch/948149/
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
Changes from V1:
- try to synchronize with datapath to allow changing mode when
interface is up.
- use tx-frames 0 as to disable tx napi while tx-frames 1 to enable tx napi
---
drivers/net/virtio_net.c | 64 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 63 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 765920905226..6e70864f5899 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
#define VIRTNET_DRIVER_VERSION "1.0.0"
+static const u32 ethtool_coalesce_napi_mask = (1UL << 10);
+
static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_TSO4,
VIRTIO_NET_F_GUEST_TSO6,
@@ -1444,7 +1446,10 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
virtqueue_napi_complete(napi, sq->vq, 0);
- if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+ /* Check napi.weight to avoid tx stall since it could be set
+ * to zero by ethtool after skb_xmit_done().
+ */
+ if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS || !sq->napi.weight)
netif_tx_wake_queue(txq);
return 0;
@@ -2181,6 +2186,61 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
return 0;
}
+static int virtnet_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct ethtool_coalesce ec_default = {
+ .cmd = ETHTOOL_SCOALESCE,
+ .rx_max_coalesced_frames = 1,
+ };
+ struct virtnet_info *vi = netdev_priv(dev);
+ int i, napi_weight;
+
+ if (ec->tx_max_coalesced_frames > 1)
+ return -EINVAL;
+
+ ec_default.tx_max_coalesced_frames = ec->tx_max_coalesced_frames;
+ napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
+
+ /* disallow changes to fields not explicitly tested above */
+ if (memcmp(ec, &ec_default, sizeof(ec_default)))
+ return -EINVAL;
+
+ if (napi_weight ^ vi->sq[0].napi.weight) {
+ for (i = 0; i < vi->max_queue_pairs; i++) {
+ struct netdev_queue *txq =
+ netdev_get_tx_queue(vi->dev, i);
+
+ virtnet_napi_tx_disable(&vi->sq[i].napi);
+ __netif_tx_lock_bh(txq);
+ vi->sq[i].napi.weight = napi_weight;
+ __netif_tx_unlock_bh(txq);
+ virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+ &vi->sq[i].napi);
+ }
+ }
+
+ return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct ethtool_coalesce ec_default = {
+ .cmd = ETHTOOL_GCOALESCE,
+ .rx_max_coalesced_frames = 1,
+ .tx_max_coalesced_frames = 0,
+ };
+ struct virtnet_info *vi = netdev_priv(dev);
+
+ memcpy(ec, &ec_default, sizeof(ec_default));
+
+ if (vi->sq[0].napi.weight)
+ ec->tx_max_coalesced_frames = 1;
+
+ return 0;
+}
+
static void virtnet_init_settings(struct net_device *dev)
{
struct virtnet_info *vi = netdev_priv(dev);
@@ -2219,6 +2279,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_link_ksettings = virtnet_get_link_ksettings,
.set_link_ksettings = virtnet_set_link_ksettings,
+ .set_coalesce = virtnet_set_coalesce,
+ .get_coalesce = virtnet_get_coalesce,
};
static void virtnet_freeze_down(struct virtio_device *vdev)
--
2.17.1
^ permalink raw reply related
* Re: [PATCH bpf-next 11/11] Documentation: Describe bpf reference tracking
From: Alexei Starovoitov @ 2018-09-13 0:12 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-12-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:40PM -0700, Joe Stringer wrote:
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
just few words in commit log would be better than nothing.
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH bpf-next 10/11] selftests/bpf: Add C tests for reference tracking
From: Alexei Starovoitov @ 2018-09-13 0:11 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-11-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:39PM -0700, Joe Stringer wrote:
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
really nice set of tests.
please describe them briefly in commit log.
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH bpf-next 09/11] libbpf: Support loading individual progs
From: Alexei Starovoitov @ 2018-09-13 0:10 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-10-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:38PM -0700, Joe Stringer wrote:
> Allow the individual program load to be invoked. This will help with
> testing, where a single ELF may contain several sections, some of which
> denote subprograms that are expected to fail verification, along with
> some which are expected to pass verification. By allowing programs to be
> iterated and individually loaded, each program can be independently
> checked against its expected verification result.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
* Re: [PATCH bpf-next 08/11] selftests/bpf: Add tests for reference tracking
From: Alexei Starovoitov @ 2018-09-13 0:09 UTC (permalink / raw)
To: Joe Stringer
Cc: daniel, netdev, ast, john.fastabend, tgraf, kafai, nitin.hande,
mauricio.vasquez
In-Reply-To: <20180912003640.28316-9-joe@wand.net.nz>
On Tue, Sep 11, 2018 at 05:36:37PM -0700, Joe Stringer wrote:
> reference tracking: leak potential reference
> reference tracking: leak potential reference on stack
> reference tracking: leak potential reference on stack 2
> reference tracking: zero potential reference
> reference tracking: copy and zero potential references
> reference tracking: release reference without check
> reference tracking: release reference
> reference tracking: release reference twice
> reference tracking: release reference twice inside branch
> reference tracking: alloc, check, free in one subbranch
> reference tracking: alloc, check, free in both subbranches
> reference tracking in call: free reference in subprog
> reference tracking in call: free reference in subprog and outside
> reference tracking in call: alloc & leak reference in subprog
> reference tracking in call: alloc in subprog, release outside
> reference tracking in call: sk_ptr leak into caller stack
> reference tracking in call: sk_ptr spill into caller stack
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
...
> + "reference tracking in call: alloc in subprog, release outside",
> + .insns = {
> + BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
> + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 5),
> + BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
> + BPF_MOV64_IMM(BPF_REG_2, 0),
> + BPF_EMIT_CALL(BPF_FUNC_sk_release),
> + BPF_EXIT_INSN(),
> +
> + /* subprog 1 */
> + BPF_SK_LOOKUP,
> + BPF_EXIT_INSN(), /* return sk */
> + },
Thanks a lot for adding subprog tests and checking that return to the caller
and spill works too.
Awesome stuff!
Acked-by: Alexei Starovoitov <ast@kernel.org>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox