Netdev List

* Re: [PATCH] net: thunderx: prevent concurrent data re-writing by nicvf_set_rx_mode
From: David Miller @ 2018-06-12 22:25 UTC (permalink / raw)
  To: dnelson
  Cc: Vadim.Lomovtsev, rric, sgoutham, linux-arm-kernel, netdev,
	linux-kernel, Vadim.Lomovtsev
In-Reply-To: <036618ae-887f-44b5-2b39-451b81191cc1@redhat.com>

From: Dean Nelson <dnelson@redhat.com>
Date: Mon, 11 Jun 2018 06:22:14 -0500

> On 06/10/2018 02:35 PM, David Miller wrote:
>> From: Vadim Lomovtsev <Vadim.Lomovtsev@caviumnetworks.com>
>> Date: Fri,  8 Jun 2018 02:27:59 -0700
>> 
>>> +	/* Save message data locally to prevent them from
>>> +	 * being overwritten by next ndo_set_rx_mode call().
>>> +	 */
>>> +	spin_lock(&nic->rx_mode_wq_lock);
>>> +	mode = vf_work->mode;
>>> +	mc = vf_work->mc;
>>> +	vf_work->mc = NULL;
> 
> If I'm reading this code correctly, I believe nic->rx_mode_work.mc will
> have been set to NULL before the lock is dropped by
> nicvf_set_rx_mode_task() and acquired by nicvf_set_rx_mode().
> 
> 
>>> +	spin_unlock(&nic->rx_mode_wq_lock);
>> At the moment you drop this lock, the memory behind 'mc' can be
>> freed up by:
>> 
>>> +	spin_lock(&nic->rx_mode_wq_lock);
>>> +	kfree(nic->rx_mode_work.mc);
> 
> So the kfree() will be called with a NULL pointer and quickly return.
> 
> 
>> And you'll crash when you dereference it above via
>> __nicvf_set_rx_mode_task().
>> 
> 
> I believe the call to kfree() in nicvf_set_rx_mode() is there to free
> up a mc_list that has been allocated by nicvf_set_rx_mode() during a
> previous callback to the function, one that has not yet been processed
> by nicvf_set_rx_mode_task().
> 
> In this way only the last 'unprocessed' callback to nicvf_set_rx_mode()
> gets processed should there be multiple callbacks occurring between the
> times the nicvf_set_rx_mode_task() runs.
> 
> In my testing with this patch, this is what I see happening.

You're right, my bad.

Patch applied.
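
For readers following the thread, the hand-off Dean describes can be sketched in userspace C. This is only a rough model under stated assumptions: a pthread mutex stands in for rx_mode_wq_lock, free() stands in for kfree(), and the struct and function names (mc_list, rx_mode_work, set_rx_mode, take_pending) are hypothetical, not the driver's code.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's structures (illustration only). */
struct mc_list {
	int count;
};

struct rx_mode_work {
	pthread_mutex_t lock;	/* plays the role of rx_mode_wq_lock */
	struct mc_list *mc;	/* pending list that has not been processed yet */
	int mode;
};

/* Producer side: drop any unprocessed list and publish the new one. */
static void set_rx_mode(struct rx_mode_work *w, int mode, struct mc_list *mc)
{
	pthread_mutex_lock(&w->lock);
	free(w->mc);		/* free(NULL) is a no-op, like kfree(NULL) */
	w->mc = mc;
	w->mode = mode;
	pthread_mutex_unlock(&w->lock);
}

/* Consumer side: take ownership of the pending list under the lock. */
static struct mc_list *take_pending(struct rx_mode_work *w, int *mode)
{
	struct mc_list *mc;

	pthread_mutex_lock(&w->lock);
	*mode = w->mode;
	mc = w->mc;
	w->mc = NULL;		/* the producer can no longer free what we hold */
	pthread_mutex_unlock(&w->lock);
	return mc;		/* the caller processes it and frees it */
}
```

Because the consumer clears the shared pointer before unlocking, a later set_rx_mode() call can only free a list that nobody has taken yet, which matches the "only the last unprocessed callback gets processed" behaviour described above.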

^ permalink raw reply

* Re: [PATCH v2] tcp: verify the checksum of the first data segment in a new connection
From: van der Linden, Frank @ 2018-06-12 22:30 UTC (permalink / raw)
  To: Eric Dumazet, edumazet@google.com, netdev@vger.kernel.org
In-Reply-To: <212193c0-2fee-7f88-5473-9f5f4c548cb8@gmail.com>

Sure, fair enough. I was assuming there might be a reason why tcp_filter was always done after the data (not pseudo header) checksum. If there isn't (and obviously the possible MD5 checks are done before it too), then that's definitely the right thing to do.

I'll resend. Though if you have the simpler change already lined up, I'll happily refrain from sending it myself.

Frank

On 6/12/18, 3:03 PM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

    On 06/12/2018 02:53 PM, van der Linden, Frank wrote:
    > The convention seems to be to call tcp_checksum_complete after tcp_filter has a chance to deal with the packet. I wanted to preserve that.
    > 
    > If that is not a concern, then I agree that this is a far better way to go.
    > 
    > Frank

    Given that we can drop the packet earlier from :

    if (skb_checksum_init(skb, IPPROTO_TCP, inet_compute_pseudo))
         goto csum_error;

    I am quite sure we really do not care of tcp_filter() being
    hit or not by packets with bad checksum.

    Thanks

^ permalink raw reply

* Re: [PATCH net] tc-testing: ife: fix wrong teardown command in test b7b8
From: David Miller @ 2018-06-12 22:32 UTC (permalink / raw)
  To: dcaratti; +Cc: lucasb, mrv, netdev
In-Reply-To: <37eb01ee5c46cb7c5e094390e65eb476aa09f07e.1528725486.git.dcaratti@redhat.com>

From: Davide Caratti <dcaratti@redhat.com>
Date: Mon, 11 Jun 2018 16:02:36 +0200

> fix failures in the 'teardown' stage of test b7b8, probably a leftover of
> commit 7c5995b33d6e ("tc-testing: fixed copy-pasting error in ife tests")
> 
> Fixes: a56e6bcd34b55 ("tc-testing: updated ife test cases")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Applied, thank you.

^ permalink raw reply

* Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v2 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue
From: Alexander Duyck @ 2018-06-12 22:33 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: Alexander Duyck, intel-wired-lan, Jeff Kirsher, Netdev
In-Reply-To: <be1b5bed-d8b2-244e-167a-1f79bfb5f6e9@gmail.com>

On Tue, Jun 12, 2018 at 10:56 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 06/12/2018 08:18 AM, Alexander Duyck wrote:
>> This patch series is meant to allow support for the L2 forward offload, aka
>> MACVLAN offload without the need for using ndo_select_queue.
>>
>> The existing solution currently requires that we use ndo_select_queue in
>> the transmit path if we want to associate specific Tx queues with a given
>> MACVLAN interface. In order to get away from this we need to repurpose the
>> tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
>> a means of accessing the queues on the lower device. As a result we cannot
>> offload a device that is configured as multiqueue, however it doesn't
>> really make sense to configure a macvlan interface as being multiqueue
>> anyway since it doesn't really have a qdisc of its own in the first place.
>
> Interesting, so at some point I had come up with the following for
> mapping queues between the DSA slave network devices and the DSA master
> network device (doing the actual transmission). The DSA master network
> device driver is just a normal network device driver.
>
> The set-up is as follows: 4 external Ethernet switch ports, each with 8
> egress queues and the DSA master (bcmsysport.c), aka CPU Ethernet
> controller has 32 output queues, so you can do a 1:1 mapping of those,
> that's actually what we want. A subsequent hardware generation only
> provides 16 output queues, so we can still do 2:1 mapping.
>
> The implementation is done like this:
>
> - DSA slave network devices are always created after the DSA master
> network device so we can leverage that
>
> - a specific notifier is running from the DSA core and tells the DSA
> master about the switch position in the tree (position 0 = directly
> attached), and the switch port number and a pointer to the slave network
> device
>
> - we establish the mapping between the queues within the bcmsysport
> driver as a simple array
>
> - when transmitting, DSA slave network devices set a specific queue/port
> number within the 16-bits that skb->queue_mapping permits
>
> - this gets re-used by bcmsysport.c to extract the correct queue number
> during ndo_select_queue such that the appropriate queue number gets used
> and congestion works end-to-end.
>
> The reason why we do that is because there is some out of band HW that
> monitors the queue depth of the switch port's egress queue and
> back-pressure the Ethernet controller directly when trying to transmit
> to a congested queue.
>
> I had initially considered establishing the mapping using tc and some
> custom "bind" argument of some kind, but ended up doing things the way
> they are, which is more automatic though it leaves less configuration
> to the user. This has a number of caveats though:
>
> - this is made generic within the context of DSA in that nothing is
> switch driver or Ethernet MAC driver specific and the notifier
> represents the contract between these two seemingly independent subsystems
>
> - the queue indicated between DSA slave and master is unfortunately
> switch driver/controller specific (BRCM_TAG_SET_PORT_QUEUE,
> BRCM_TAG_GET_PORT, BRCM_TAG_GET_QUEUE)
>
> What I like about your patchset is the mapping establishment, but as you
> will read from my reply in patch 2, I think the (upper) 1:N (lower)
> mapping might not work for my specific use case.
>
> Anyhow, not intended to be blocking this, as it seems to be going in the
> right direction anyway.

I think I am still not getting why the 1:N would be an issue. At least
the way I have the code implemented here the lower queues all have a
qdisc associated with them, just not the upper device. Generally I am
using the macvlan as a bump in the wire to take care of filtering for
the bridging mode. If I have to hairpin packets and send them back up
on one of the upper interfaces I want to do that in software rather
than hardware so I try to take care of it there instead of routing it
through the hardware.
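
As an aside for readers, the port/queue encoding Florian describes (a port and queue number carried in the 16 bits of skb->queue_mapping, then mapped onto 32 or 16 master queues) might look roughly like the sketch below. The bit layout (port in the high byte, queue in the low byte), the helper names, and the way the 2:1 case is collapsed are all assumptions for illustration, not the bcmsysport implementation.

```c
#include <stdint.h>

#define SWITCH_PORTS	4	/* from the description: 4 external ports */
#define QUEUES_PER_PORT	8	/* from the description: 8 egress queues each */

/* Pack port (assumed high byte) and queue (assumed low byte) into 16 bits. */
static uint16_t pack_port_queue(unsigned int port, unsigned int queue)
{
	return (uint16_t)(((port & 0xff) << 8) | (queue & 0xff));
}

static unsigned int unpack_port(uint16_t v)
{
	return v >> 8;
}

static unsigned int unpack_queue(uint16_t v)
{
	return v & 0xff;
}

/*
 * Map a packed (port, queue) pair onto one of num_txq master queues.
 * With 32 master queues this is 1:1; with 16 it folds two switch
 * queues onto each master queue (2:1).
 */
static unsigned int master_txq(uint16_t v, unsigned int num_txq)
{
	unsigned int flat = unpack_port(v) * QUEUES_PER_PORT + unpack_queue(v);
	unsigned int total = SWITCH_PORTS * QUEUES_PER_PORT;

	return flat * num_txq / total;
}
```

The point of the sketch is only that the slave device and the master driver have to agree on the packing, which is why the thread notes that this part ends up being switch driver specific.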

>>
>> I am submitting this as an RFC for the netdev mailing list, and officially
>> submitting it for testing to Jeff Kirsher's next-queue in order to validate
>> the ixgbe specific bits.
>>
>> The big changes in this set are:
>>   Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
>>   Disable XPS for single queue devices
>>   Replace accel_priv with sb_dev in ndo_select_queue
>>   Add sb_dev parameter to fallback function for ndo_select_queue
>>   Consolidated ndo_select_queue functions that appeared to be duplicates
>
> Interesting, turns out I had a possibly similar use case with DSA with
> the slave network devices need to select an outgoing queue number for

I was kind of assuming this could be applied to a number of possible
use cases. As it was I was wondering if maybe we should look at adding
this as an option for just a standard VLAN as we could perform the
same kind of filtering and just deliver the packet directly to the
VLAN interface instead of requiring the extra trip through the stack
after the tag has been stripped.

^ permalink raw reply

* Re: Problems in tc-matchall.8, tc-sample.8
From: Stephen Hemminger @ 2018-06-12 22:33 UTC (permalink / raw)
  To: Eric S. Raymond; +Cc: netdev
In-Reply-To: <20180612220003.GE4849@thyrsus.com>

On Tue, 12 Jun 2018 18:00:03 -0400
"Eric S. Raymond" <esr@thyrsus.com> wrote:

> Stephen Hemminger <stephen@networkplumber.org>:
> > Please resubmit as real patch with signed-off-by  
> 
> I would like to follow your instructions, but that description leaves me
> not quite certain what you want. A git format-patch thing?   If so, what
> git url should I clone from?

iproute patches are handled the same as the Linux kernel.
Please submit patches to netdev@vger.kernel.org with the same kind
of diff format (and signed-off-by) as the kernel.

Like the kernel, patches which are pure bug fixes go to the master
branch, and patches with new functionality are handled with the iproute2-next repository.

^ permalink raw reply

* Re: Backport bonding patches to fix active-passive
From: David Miller @ 2018-06-12 22:35 UTC (permalink / raw)
  To: nate; +Cc: netdev
In-Reply-To: <CAG2YfWNKXy+VkrbaxfaKof_T8bOF5skESCdQmPzU9_DQ-7an_w@mail.gmail.com>

From: Nate Clark <nate@neworld.us>
Date: Mon, 11 Jun 2018 13:44:40 -0400

> Would it be possible to queue up the three commits for backporting to
> 4.9 stable:
> b5bf0f5b16b9c316c34df9f31d4be8729eb86845 bonding: correctly update
> link status during mii-commit
> 3f3c278c94dd994fe0d9f21679ae19b9c0a55292 bonding: fix active-backup transition
> ad729bc9acfb7c47112964b4877ef5404578ed13 bonding: require speed/duplex
> only for 802.3ad, alb and tlb
> 
> All of those commits apply cleanly to 4.9.107.

I only deal with -stable backports to the most recent two releases.

If you want something to happen for earlier releases you'll need to
ask the -stable tree maintainers directly.

Thank you.

^ permalink raw reply

* Re: [PATCH] net: stmmac: dwmac-meson8b: Fix an error handling path in 'meson8b_dwmac_probe()'
From: David Miller @ 2018-06-12 22:36 UTC (permalink / raw)
  To: christophe.jaillet
  Cc: peppe.cavallaro, alexandre.torgue, joabreu, carlo, khilman,
	netdev, linux-arm-kernel, linux-amlogic, linux-kernel,
	kernel-janitors
In-Reply-To: <20180611175227.27509-1-christophe.jaillet@wanadoo.fr>

From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Mon, 11 Jun 2018 19:52:27 +0200

> If 'of_device_get_match_data()' fails, we need to release some resources as
> done in the other error handling path of this function.
> 
> Fixes: efacb568c962 ("net: stmmac: dwmac-meson: extend phy mode setting")
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Applied.

^ permalink raw reply

* Re: [Patch net] smc: convert to ->poll_mask
From: David Miller @ 2018-06-12 22:37 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, penguin-kernel, hch, ubraun
In-Reply-To: <20180611210714.3754-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon, 11 Jun 2018 14:07:14 -0700

> smc->clcsock is an internal TCP socket, after TCP socket
> converts to ->poll_mask, ->poll doesn't exist any more.
> So just convert smc socket to ->poll_mask too.
> 
> Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> Reported-by: syzbot+f5066e369b2d5fff630f@syzkaller.appspotmail.com
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ursula Braun <ubraun@linux.ibm.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks Cong.

^ permalink raw reply

* Re: [PATCH v2] xen/netfront: raise max number of slots in xennet_get_responses()
From: David Miller @ 2018-06-12 22:43 UTC (permalink / raw)
  To: jgross; +Cc: linux-kernel, xen-devel, netdev, boris.ostrovsky
In-Reply-To: <20180612065753.10569-1-jgross@suse.com>

From: Juergen Gross <jgross@suse.com>
Date: Tue, 12 Jun 2018 08:57:53 +0200

> The max number of slots used in xennet_get_responses() is set to
> MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD).
> 
> In old kernel-xen MAX_SKB_FRAGS was 18, while nowadays it is 17. This
> difference is resulting in frequent messages "too many slots" and a
> reduced network throughput for some workloads (factor 10 below that of
> a kernel-xen based guest).
> 
> Replacing MAX_SKB_FRAGS by XEN_NETIF_NR_SLOTS_MIN for calculation of
> the max number of slots to use solves that problem (tests showed no
> more messages "too many slots" and throughput was as high as with the
> kernel-xen based guest system).
> 
> Replace MAX_SKB_FRAGS-2 by XEN_NETIF_NR_SLOTS_MIN-1 in
> netfront_tx_slot_available() for making it clearer what is really being
> tested without actually modifying the tested value.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

Applied, thanks.
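
The slot arithmetic in the commit message can be checked with a small sketch. The constant values here are assumptions for illustration (MAX_SKB_FRAGS of 17 as stated above, XEN_NETIF_NR_SLOTS_MIN taken to be 18, RX_COPY_THRESHOLD taken to be 256), and the Tx expression is assumed to have the form "ring size - MAX_SKB_FRAGS - 2", which is one reading of the "MAX_SKB_FRAGS-2" wording.

```c
/* Values assumed for illustration; check the headers of the tree you build. */
#define MAX_SKB_FRAGS_NOW	17
#define XEN_NETIF_NR_SLOTS_MIN	18
#define RX_COPY_THRESHOLD	256

/* Rx side: slot budget is the base plus one extra slot for small responses. */
static int rx_max_slots(int base, int rx_status)
{
	return base + (rx_status <= RX_COPY_THRESHOLD);
}

/* Tx side, old form: ring size minus MAX_SKB_FRAGS minus 2. */
static int tx_limit_old(int ring_size)
{
	return ring_size - MAX_SKB_FRAGS_NOW - 2;
}

/* Tx side, new form: ring size minus XEN_NETIF_NR_SLOTS_MIN minus 1. */
static int tx_limit_new(int ring_size)
{
	return ring_size - XEN_NETIF_NR_SLOTS_MIN - 1;
}
```

Under these assumptions the Rx budget for a small response grows from 18 to 19 slots, while both Tx forms reserve the same 19 slots, consistent with the statement that the tested value is not modified.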

^ permalink raw reply

* Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v2 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue
From: Alexander Duyck @ 2018-06-12 22:47 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Alexander Duyck, intel-wired-lan, Netdev
In-Reply-To: <20180612105029.77b40381@xeon-e3>

On Tue, Jun 12, 2018 at 10:50 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 12 Jun 2018 11:18:25 -0400
> Alexander Duyck <alexander.h.duyck@intel.com> wrote:
>
>> This patch series is meant to allow support for the L2 forward offload, aka
>> MACVLAN offload without the need for using ndo_select_queue.
>>
>> The existing solution currently requires that we use ndo_select_queue in
>> the transmit path if we want to associate specific Tx queues with a given
>> MACVLAN interface. In order to get away from this we need to repurpose the
>> tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
>> a means of accessing the queues on the lower device. As a result we cannot
>> offload a device that is configured as multiqueue, however it doesn't
>> really make sense to configure a macvlan interface as being multiqueue
>> anyway since it doesn't really have a qdisc of its own in the first place.
>>
>> I am submitting this as an RFC for the netdev mailing list, and officially
>> submitting it for testing to Jeff Kirsher's next-queue in order to validate
>> the ixgbe specific bits.
>>
>> The big changes in this set are:
>>   Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
>>   Disable XPS for single queue devices
>>   Replace accel_priv with sb_dev in ndo_select_queue
>>   Add sb_dev parameter to fallback function for ndo_select_queue
>>   Consolidated ndo_select_queue functions that appeared to be duplicates
>>
>> v2: Implement generic "select_queue" functions instead of "fallback" functions.
>>     Tweak last two patches to account for changes in dev_pick_tx_xxx functions.
>>
>> ---
>>
>> Alexander Duyck (7):
>>       net-sysfs: Drop support for XPS and traffic_class on single queue device
>>       net: Add support for subordinate device traffic classes
>>       ixgbe: Add code to populate and use macvlan tc to Tx queue map
>>       net: Add support for subordinate traffic classes to netdev_pick_tx
>>       net: Add generic ndo_select_queue functions
>>       net: allow ndo_select_queue to pass netdev
>>       net: allow fallback function to pass netdev
>>
>>
>>  drivers/infiniband/hw/hfi1/vnic_main.c            |    2
>>  drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |    4 -
>>  drivers/net/bonding/bond_main.c                   |    3
>>  drivers/net/ethernet/amazon/ena/ena_netdev.c      |    5 -
>>  drivers/net/ethernet/broadcom/bcmsysport.c        |    6 -
>>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |    6 +
>>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |    3
>>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |    5 -
>>  drivers/net/ethernet/hisilicon/hns/hns_enet.c     |    5 -
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |   62 ++++++--
>>  drivers/net/ethernet/lantiq_etop.c                |   10 -
>>  drivers/net/ethernet/mellanox/mlx4/en_tx.c        |    7 +
>>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h      |    3
>>  drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3
>>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |    5 -
>>  drivers/net/ethernet/renesas/ravb_main.c          |    3
>>  drivers/net/ethernet/sun/ldmvsw.c                 |    3
>>  drivers/net/ethernet/sun/sunvnet.c                |    3
>>  drivers/net/ethernet/ti/netcp_core.c              |    9 -
>>  drivers/net/hyperv/netvsc_drv.c                   |    6 -
>>  drivers/net/macvlan.c                             |   10 -
>>  drivers/net/net_failover.c                        |    7 +
>>  drivers/net/team/team.c                           |    3
>>  drivers/net/tun.c                                 |    3
>>  drivers/net/wireless/marvell/mwifiex/main.c       |    3
>>  drivers/net/xen-netback/interface.c               |    4 -
>>  drivers/net/xen-netfront.c                        |    3
>>  drivers/staging/netlogic/xlr_net.c                |    9 -
>>  drivers/staging/rtl8188eu/os_dep/os_intfs.c       |    3
>>  drivers/staging/rtl8723bs/os_dep/os_intfs.c       |    7 -
>>  include/linux/netdevice.h                         |   34 ++++-
>>  net/core/dev.c                                    |  156 ++++++++++++++++++---
>>  net/core/net-sysfs.c                              |   36 ++++-
>>  net/mac80211/iface.c                              |    4 -
>>  net/packet/af_packet.c                            |    7 +
>>  35 files changed, 312 insertions(+), 130 deletions(-)
>>
>> --
>
> This makes sense. I thought you were hoping to get rid of select queue in future?

That would be nice, however there are still a bunch of corner cases
that are not handled that have been dumped into select queue. For
example in the case of ixgbe the issue is FCoE. There are a number of
other places that are using it as well as I seem to recall netvsc and
bonding both use it to store off the original Rx->Tx queue mapping
when passing through the interface.

For now I figure we can take this one hill at a time and I am just
making it so we don't have to use ndo_select_queue in order to make
vmdq work for macvlan offload.

- Alex
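
For context, the idea of using a tc_to_txq style array to reach queues on the lower device can be sketched as follows. This is a simplified model, not the kernel's code: the struct and function names are hypothetical, and the hash scaling is just one common way to spread a 32-bit hash over a queue range.

```c
#include <stdint.h>

/* Hypothetical per-traffic-class entry: a contiguous range of Tx queues. */
struct tc_txq {
	uint16_t count;		/* number of queues in the range */
	uint16_t offset;	/* first queue of the range on the lower device */
};

/*
 * Pick a lower-device Tx queue for traffic class tc from a flow hash.
 * The hash is scaled into [0, count) and added to the range offset.
 */
static uint16_t pick_txq(const struct tc_txq *map, unsigned int tc,
			 uint32_t hash)
{
	const struct tc_txq *e = &map[tc];

	if (e->count == 0)
		return 0;	/* no range configured: fall back to queue 0 */
	return (uint16_t)(e->offset + (((uint64_t)hash * e->count) >> 32));
}
```

With such a map owned by the upper (macvlan) device but filled in by the lower device, the generic Tx path can select a queue without calling into a driver-specific ndo_select_queue.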

^ permalink raw reply

* Re: [PATCH 1/1] ip: add rmnet initial support
From: Subash Abhinov Kasiviswanathan @ 2018-06-12 23:06 UTC (permalink / raw)
  To: Daniele Palmas; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1528812777-7512-1-git-send-email-dnlplm@gmail.com>

> +
> +static void print_explain(FILE *f)
> +{
> +	fprintf(f,
> +		"Usage: ... rmnet mux_id MUXID\n"
> +		"\n"
> +		"MUXID := 1-127\n"
> +	);
> +}

Hi Daniele

This range can be from 1-254.

> +
> +static void explain(void)
> +{
> +	print_explain(stderr);
> +}
> +
> +static int rmnet_parse_opt(struct link_util *lu, int argc, char **argv,
> +			   struct nlmsghdr *n)
> +{
> +	__u16 mux_id;
> +
> +	while (argc > 0) {
> +		if (matches(*argv, "mux_id") == 0) {
> +			NEXT_ARG();
> +			if (get_u16(&mux_id, *argv, 0))
> +				invarg("mux_id is invalid", *argv);
> +			addattr_l(n, 1024, IFLA_RMNET_MUX_ID, &mux_id, 2);

You could use addattr16() instead since it is __u16.
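
A standalone sketch of the range check implied by the comment above (mux_id limited to 1-254) is given here. The helper name parse_mux_id and the use of strtoul() are assumptions for illustration; in iproute2 the value would come from get_u16() and then be added with addattr16() as suggested.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MUX_ID_MIN	1
#define MUX_ID_MAX	254

/* Parse arg as a mux_id; return 0 and store it on success, -1 otherwise. */
static int parse_mux_id(const char *arg, uint16_t *mux_id)
{
	char *end;
	unsigned long v;

	if (!arg || !*arg)
		return -1;

	errno = 0;
	v = strtoul(arg, &end, 0);	/* base 0: accepts decimal, hex, octal */
	if (errno || *end != '\0')
		return -1;		/* overflow or trailing garbage */
	if (v < MUX_ID_MIN || v > MUX_ID_MAX)
		return -1;		/* outside the valid 1-254 range */

	*mux_id = (uint16_t)v;
	return 0;
}
```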

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* [PATCH v3] tcp: verify the checksum of the first data segment in a new connection
From: Frank van der Linden @ 2018-06-12 23:09 UTC (permalink / raw)
  To: edumazet, netdev; +Cc: fllinden

commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.

But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly.  These lower-level processing functions do not do any checksum
verification.

Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 net/ipv4/tcp_ipv4.c | 4 ++++
 net/ipv6/tcp_ipv6.c | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..ef8cd0f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1689,6 +1689,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			reqsk_put(req);
 			goto discard_it;
 		}
+		if (tcp_checksum_complete(skb)) {
+			reqsk_put(req);
+			goto csum_error;
+		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
 			inet_csk_reqsk_queue_drop_and_put(sk, req);
 			goto lookup;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d8..5d4eb9d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1475,6 +1475,10 @@ static int tcp_v6_rcv(struct sk_buff *skb)
 			reqsk_put(req);
 			goto discard_it;
 		}
+		if (tcp_checksum_complete(skb)) {
+			reqsk_put(req);
+			goto csum_error;
+		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
 			inet_csk_reqsk_queue_drop_and_put(sk, req);
 			goto lookup;
-- 
1.8.3.1
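
For readers unfamiliar with what the missing verification amounts to, here is a minimal userspace sketch of the Internet checksum (RFC 1071 style ones' complement sum). It is not the kernel's tcp_checksum_complete(): it ignores the TCP pseudo header, and the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Add a buffer to a running ones' complement sum and fold it to 16 bits.
 * Fine for packet-sized buffers; very large inputs could overflow sum.
 */
static uint16_t csum_fold_buf(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;	/* pad odd byte with 0 */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	return (uint16_t)sum;
}

/* Value to store in the checksum field (computed with that field zeroed). */
static uint16_t csum_compute(const uint8_t *buf, size_t len)
{
	return (uint16_t)~csum_fold_buf(buf, len, 0);
}

/* A buffer verifies if the sum over all bytes, checksum included, is 0xffff. */
static int csum_ok(const uint8_t *buf, size_t len)
{
	return csum_fold_buf(buf, len, 0) == 0xffff;
}
```

A receive path that skips the equivalent of csum_ok() for the data carried on the final handshake ACK would hand corrupted payload to the application, which is the gap the patch closes.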

^ permalink raw reply related

* Re: [PATCH v2] tcp: verify the checksum of the first data segment in a new connection
From: van der Linden, Frank @ 2018-06-12 23:12 UTC (permalink / raw)
  To: Eric Dumazet, edumazet@google.com, netdev@vger.kernel.org
In-Reply-To: <212193c0-2fee-7f88-5473-9f5f4c548cb8@gmail.com>

Ok, patch v3 sent.

It was rightly pointed out to me that I shouldn't commit the mortal sin of top posting - but bear with me guys, I'll dig up my 25-year old .muttrc :-)

Frank

On 6/12/18, 3:03 PM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

   
    
    On 06/12/2018 02:53 PM, van der Linden, Frank wrote:
    > The convention seems to be to call tcp_checksum_complete after tcp_filter has a chance to deal with the packet. I wanted to preserve that.
    > 
    > If that is not a concern, then I agree that this is a far better way to go.
    > 
    > Frank
    
    Given that we can drop the packet earlier from :
    
    if (skb_checksum_init(skb, IPPROTO_TCP, inet_compute_pseudo))
         goto csum_error;
    
    I am quite sure we really do not care of tcp_filter() being
    hit or not by packets with bad checksum.
    
    Thanks
    
    

    


^ permalink raw reply

* Re: Problems in tc-matchall.8, tc-sample.8
From: Eric S. Raymond @ 2018-06-12 23:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20180612153350.75e77f01@xeon-e3>

Stephen Hemminger <stephen@networkplumber.org>:
> On Tue, 12 Jun 2018 18:00:03 -0400
> "Eric S. Raymond" <esr@thyrsus.com> wrote:
> 
> > Stephen Hemminger <stephen@networkplumber.org>:
> > > Please resubmit as real patch with signed-off-by  
> > 
> > I would like to follow your instructions, but that description leaves me
> > not quite certain what you want. A git format-patch thing?   If so, what
> > git url should I clone from?
> 
> iproute patches are handled the same as the Linux kernel.
> Please submit patches to netdev@vger.kernel.org with the same kind
> of diff format (and signed-off-by) as the kernel.
> 
> Like the kernel, patches which are pure bug fixes go to the master
> branch, and patches with new functionality are handled with the iproute2-next repository.

Then I should bugfix against this repository?

https://github.com/shemminger/iproute2
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

My work is funded by the Internet Civil Engineering Institute: https://icei.org
Please visit their site and donate: the civilization you save might be your own.

^ permalink raw reply

* Re: [PATCH bpf v3] tools/bpftool: fix a bug in bpftool perf
From: Daniel Borkmann @ 2018-06-13  0:04 UTC (permalink / raw)
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180612053548.901931-1-yhs@fb.com>

On 06/12/2018 07:35 AM, Yonghong Song wrote:
> Commit b04df400c302 ("tools/bpftool: add perf subcommand")
> introduced bpftool subcommand perf to query bpf program
> kuprobe and tracepoint attachments.
> 
> The perf subcommand will first test whether bpf subcommand
> BPF_TASK_FD_QUERY is supported in kernel or not. It does it
> by opening a file with argv[0] and feeds the file descriptor
> and current task pid to the kernel for querying.
> 
> Such an approach won't work if the argv[0] cannot be opened
> successfully in the current directory. This is especially
> true when bpftool is accessible through PATH env variable.
> The error below reflects the open failure for file argv[0]
> at home directory.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: No such file or directory
> 
> To fix the issue, let us open root directory ("/")
> which exists in every linux system. With the fix, the
> error message will correctly reflect the permission issue.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: Operation not permitted
>   HINT: non root or kernel doesn't support TASK_FD_QUERY
> 
> Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
> Reported-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied to bpf, thanks Yonghong!
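
The failure mode described in the commit message can be illustrated with a short sketch. The function names are hypothetical and this is not the bpftool code; it only contrasts opening argv[0] relative to the current directory with opening "/".

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Fragile variant: argv[0] is whatever the shell passed. When the tool is
 * found through PATH it is often a bare name such as "bpftool", which
 * open() resolves relative to the current directory.
 */
static int open_probe_fd_fragile(const char *argv0)
{
	return open(argv0, O_RDONLY);
}

/* Robust variant: "/" exists no matter where the tool was started from. */
static int open_probe_fd(void)
{
	return open("/", O_RDONLY);
}
```

Either function returns a file descriptor that a feature probe could hand to the kernel, or -1 with errno set. Only the second one is independent of the working directory, so a later failure reflects the probe itself rather than the open.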

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-13  0:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
	netdev, qemu-devel, loseweigh, virtualization
In-Reply-To: <20180612142557-mutt-send-email-mst@kernel.org>

On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>>> act as a standby for another device with the same MAC address.
>>>>>>
>>>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>>>> by default as i am using libvirt to start the VMs.
>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>>>> XML file?
>>>>>>
>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> So I do not think we can commit to this interface: we
>>>>> really need to control visibility of the primary device.
>>>> The problem is legacy guest won't use primary device at all if we do this.
>>> And that's by design - I think it's the only way to ensure the
>>> legacy guest isn't confused.
>> Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
>> of the primary device. The primary device may not be specified as an argument to Qemu. It
>> may be plugged in later.
>> The cloud service provider is providing a feature that enables low latency datapath and live
>> migration capability.
>> A tenant can use this feature only if he is running a VM that has virtio-net with failover support.
> Well live migration is there already. The new feature is low latency
> data path.

We get live migration with just virtio. But I meant live migration with VF as
primary device.

>
> And it's the guest that needs failover support not the VM.

Aren't guest and VM synonymous?


>
>
>> I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
>> an upper layer indicating if the STANDBY feature is successfully negotiated or not.
>> The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
>> If VF is successfully hot plugged, virtio-net link should be disabled.
> Did you even talk to upper layer management about it?
> Just list the steps they need to do and you will see
> that's a lot of machinery to manage by the upper layer.
>
> What do we gain in flexibility? As far as I can see the
> only gain is some resources saved for legacy VMs.
>
> That's not a lot as tenant of the upper layer probably already has
> at least a hunch that it's a new guest otherwise
> why bother specifying the feature at all - you
> save even more resources without it.
>

I am not all that familiar with how Qemu manages network devices. If we can do all the
required management of the primary/standby devices within Qemu, that is definitely a better
approach without upper layer involvement.
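
As a reference point for the negotiation being discussed, a feature bit check can be sketched as below. The bit number 62 is taken from the patch under discussion; the helper names are hypothetical and the sketch only models "offered by the host and acknowledged by the guest" as a bitwise AND, not QEMU's actual feature handling.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_STANDBY	62	/* bit number from the patch under discussion */

/* Turn a feature bit number into a 64-bit mask. */
static uint64_t feature_mask(unsigned int bit)
{
	return UINT64_C(1) << bit;
}

/* A feature is in effect only if the host offers it and the guest acks it. */
static uint64_t negotiate(uint64_t host_features, uint64_t guest_features)
{
	return host_features & guest_features;
}

/* True if STANDBY was both offered and acknowledged. */
static bool standby_negotiated(uint64_t host_features, uint64_t guest_features)
{
	uint64_t mask = feature_mask(VIRTIO_NET_F_STANDBY);

	return (negotiate(host_features, guest_features) & mask) != 0;
}
```

A legacy guest that never acknowledges the bit would make standby_negotiated() return false, which is the case an upper layer (or QEMU itself) would need to detect before hot plugging a VF with the same MAC.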


>
>
>>>> How about controlling the visibility of standby device?
>>>>
>>>> Thanks
>>> standby is always there to guarantee no downtime.
>>>
>>>>> However just for testing purposes, we could add a non-stable
>>>>> interface "x-standby" with the understanding that as any
>>>>> x- prefix it's unstable and will be changed down the road,
>>>>> likely in the next release.
>>>>>
>>>>>
>>>>>> ---
>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>     2 files changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>> index 90502fca7c..38b3140670 100644
>>>>>> --- a/hw/net/virtio-net.c
>>>>>> +++ b/hw/net/virtio-net.c
>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>                          true),
>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>>>> +                      false),
>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>     };
>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>> @@ -57,6 +57,9 @@
>>>>>>     					 * Steering */
>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>>>>> +                                         * with the same MAC.
>>>>>> +                                         */
>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>>>>> -- 
>>>>>> 2.14.3

^ permalink raw reply

* Re: [PATCH] selftests: bpf: config: add config fragments
From: Daniel Borkmann @ 2018-06-13  0:08 UTC (permalink / raw)
  To: Anders Roxell, ast, shuah
  Cc: netdev, linux-kernel, linux-kselftest, William Tu
In-Reply-To: <20180612110510.11731-1-anders.roxell@linaro.org>

On 06/12/2018 01:05 PM, Anders Roxell wrote:
> Test test_tunnel.sh fails because the required config fragments aren't enabled.
> 
> Fixes: 933a741e3b82 ("selftests/bpf: bpf tunnel test.")
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
> ---
> 
> All tests pass except ip6gretap, which still fails. I'm unsure why.
> Ideas?

William (Cc) might be able to help you out.

Applied the one below in the meantime to bpf, thanks!

> Cheers,
> Anders
> 
>  tools/testing/selftests/bpf/config | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index 1eefe211a4a8..7eb613ffef55 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -7,3 +7,13 @@ CONFIG_CGROUP_BPF=y
>  CONFIG_NETDEVSIM=m
>  CONFIG_NET_CLS_ACT=y
>  CONFIG_NET_SCH_INGRESS=y
> +CONFIG_NET_IPIP=y
> +CONFIG_IPV6=y
> +CONFIG_NET_IPGRE_DEMUX=y
> +CONFIG_NET_IPGRE=y
> +CONFIG_IPV6_GRE=y
> +CONFIG_CRYPTO_USER_API_HASH=m
> +CONFIG_CRYPTO_HMAC=m
> +CONFIG_CRYPTO_SHA256=m
> +CONFIG_VXLAN=y
> +CONFIG_GENEVE=y
> 

^ permalink raw reply

* Re: [PATCH 1/1] ip: add rmnet initial support
From: Stephen Hemminger @ 2018-06-13  0:22 UTC (permalink / raw)
  To: Daniele Palmas; +Cc: netdev, Subash Abhinov Kasiviswanathan
In-Reply-To: <1528812777-7512-1-git-send-email-dnlplm@gmail.com>

On Tue, 12 Jun 2018 16:12:57 +0200
Daniele Palmas <dnlplm@gmail.com> wrote:

> This patch adds basic support for Qualcomm rmnet devices.
> 
> Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
> ---
>  ip/Makefile       |  2 +-
>  ip/iplink.c       |  2 +-
>  ip/iplink_rmnet.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 72 insertions(+), 2 deletions(-)
>  create mode 100644 ip/iplink_rmnet.c

I am glad to see integrated tool support, but this needs to be targeted at
iproute2-next since it is a new feature.

Some things that I would like to see changed:
  1. All of iproute2 now uses SPDX license identifiers, so you should not
     include the GPL boilerplate.
  2. You should provide a dump (print_opt) routine as well as a parse routine.
     The output format should use the print_uint (JSON print) routines.
  3. Please update the manual page (man/man8/ip-link.8.in) to include the new
     option.

^ permalink raw reply

* Re: Problems in tc-matchall.8, tc-sample.8
From: Stephen Hemminger @ 2018-06-13  0:24 UTC (permalink / raw)
  To: Eric S. Raymond; +Cc: netdev
In-Reply-To: <20180612234103.GB14546@thyrsus.com>

On Tue, 12 Jun 2018 19:41:03 -0400
"Eric S. Raymond" <esr@thyrsus.com> wrote:

> Stephen Hemminger <stephen@networkplumber.org>:
> > On Tue, 12 Jun 2018 18:00:03 -0400
> > "Eric S. Raymond" <esr@thyrsus.com> wrote:
> >   
> > > Stephen Hemminger <stephen@networkplumber.org>:  
> > > > Please resubmit as real patch with signed-off-by    
> > > 
> > > > I would like to follow your instructions, but that description leaves me
> > > not quite certain what you want. A git format-patch thing?   If so, what
> > > git url should I clone from?  
> > 
> > iproute patches are handled the same as the Linux kernel.
> > Please submit patches to the netdev@vger.kernel.org with the same kind
> > of diff format (and signed-off-by) as the kernel.
> > 
> > Like the kernel, patches which are pure bug fixes go to the master
> > branch, and patches with new functionality are handled with the iproute2-next repository.  
> 
> Then I should bugfix against this repository?
> 
> https://github.com/shemminger/iproute2

The upstream repositories for master and net-next branch are now
split. Master branch is at:
  git://git.kernel.org/pub/scm/network/iproute2/iproute2.git

and patches for next release are in (master branch):
  git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git


GitHub is an out-of-date clone (like all the kernels on there).

^ permalink raw reply

* Re: [bpf PATCH] bpf: selftest fix for sockmap
From: Daniel Borkmann @ 2018-06-13  0:31 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev
In-Reply-To: <20180611184735.31255.51105.stgit@john-Precision-Tower-5810>

On 06/11/2018 08:47 PM, John Fastabend wrote:
> In selftest test_maps the sockmap test case attempts to add a socket
> in listening state to the sockmap. This is no longer a valid operation
> so it fails as expected. However, the test wrongly reports this as an
> error now. Fix the test to avoid adding sockets in listening state.
> 
> Fixes: 945ae430aa44 ("bpf: sockmap only allow ESTABLISHED sock state")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

(fyi, discussed with John that this will be rolled into the set of
 fixes he has pending for bpf since the test is related to the one
 restricting to ESTABLISHED state.)

^ permalink raw reply

* Re: [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
From: Alexei Starovoitov @ 2018-06-13  0:43 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, ast, daniel, netdev, David S. Miller, ecree
In-Reply-To: <20180612092812.vptmhuekmpb4pn5z@breakpoint.cc>

On Tue, Jun 12, 2018 at 11:28:12AM +0200, Florian Westphal wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > On Fri, Jun 01, 2018 at 05:32:11PM +0200, Florian Westphal wrote:
> > > The userspace helper translates the rules, and, if successful, installs the
> > > generated program(s) via bpf syscall.
> > > 
> > > For each rule a small response containing the corresponding ebpf file
> > > descriptor (can be -1 on failure) and an attribute count (how many
> > > expressions were jitted) gets sent back to kernel via pipe.
> > > 
> > > If translation fails, the rule will be processed by nf_tables
> > > interpreter (as before this patch).
> > > 
> > > If translation succeeded, nf_tables fetches the bpf program using the file
> > > descriptor identifier, allocates a new rule blob containing the new 'ebpf'
> > > expression (and possible trailing un-translated expressions).
> > > 
> > > It then replaces the original rule in the transaction log with the new
> > > 'ebpf-rule'.  The original rule is retained in a private area inside the ebpf
> > > expression to be able to present the original expressions back to userspace
> > > on 'nft list ruleset'.
> > > 
> > > For easier review, this contains the kernel-side only.
> > > nf_tables_jit_work() will not do anything, yet.
> > > 
> > > Unresolved issues:
> > >  - maps and sets.
> > >    It might be possible to add a new ebpf map type that just wraps
> > >    the nft set infrastructure for lookups.
> > >    This would allow nft userspace to continue to work as-is while
> > >    not requiring new ebpf helper.
> > >    Anonymous set should be a lot easier as they're immutable
> > >    and could probably be handled already by existing infra.
> > > 
> > >  - BPF_PROG_RUN() is bolted into nft main loop via a middleman expression.
> > >    I'm also abusing skb->cb[] to pass network and transport header offsets.
> > >    It's not 'public' api so this can be changed later.
> > > 
> > >  - always uses BPF_PROG_TYPE_SCHED_CLS.
> > >    This is because it "works" for current RFC purposes.
> > > 
> > >  - we should eventually support translating multiple (adjacent) rules
> > >    into single program.
> > > 
> > >    If we do this kernel will need to track mapping of rules to
> > >    program (to re-jit when a rule is changed).  This isn't implemented
> > >    so far, but can be added later.  Alternatively, one could also add a
> > >    'readonly' table switch to just prevent further updates.
> > > 
> > >    We will also need to dump the 'next' generation of the
> > >    to-be-translated table.  The kernel has this information, so it's only
> > >    a matter of serializing it back to userspace from the commit phase.
> > > 
> > > The jitter is still limited.  So far it supports:
> > > 
> > >  * payload expression for network and transport header
> > >  * meta mark, nfproto, l4proto
> > >  * 32 bit immediates
> > >  * 32 bit bitmask ops
> > >  * accept/drop verdicts
> > > 
> > > As this uses netlink, there is also no technical requirement for
> > > libnftnl, it's simply used here for convenience.
> > > 
> > > It doesn't need any userspace changes. Patches for libnftnl and nftables
> > > make debug info available (e.g. to map rule to its bpf prog id).
> > > 
> > > Comments welcome.
> > 
> > The implementation of patch 5 looks good to me, but I'm concerned with
> > patch 2 that adds 'ebpf expression' to nft. I see no reason to do so.
> 
> I think it's important user(space) can see which rules are jitted, and
> which ebpf prog corresponds to which rule(s). Using an expression as
> a container allows re-using existing nft config plane code to serialize
> this via netlink attributes.

In my mind it would be all or nothing. I don't think it helps
to convert some rules and not all.

> > It seems existing support for infinite number of nft expressions is
> > used as a way to execute infinite number of bpf programs sequentially.
> 
> In this RFC, yes.
> 
> > I don't think it was a scalable approach before and won't scale in the future.
> > I think the algorithm should consider all nft rules at once and generate
> > a program or two that will execute fast even when number of rules is large.
> 
> Yes, but existence of the ebpf expression doesn't prevent doing this in
> the future.  Doing it now complicates things and given unresolved issues
> (see above cover letter) I'm reluctant to implement this already. The
> UMH in this RFC can translate only a very small subset of
> expressions.  To make full-table realistic I think issues outlined above
> need to be addressed first.
> 
> It can be done, in such case the ebpf expression would replace not just
> rule but possibly all of them.

I think 'all of them' is mandatory. Same for bpfilter.
Existing iptables/nft work as fallback already.
Only when converting all rules do we get a performance benefit.
Partial conversion only makes things harder to debug and confuses users.
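
A minimal sketch of the all-or-nothing policy described above, assuming a
hypothetical set of expression kinds in which only NAT cannot be translated.
None of the names below are real nf_tables or bpfilter identifiers; they exist
only to illustrate the decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical expression kinds; not the real nf_tables expression set. */
enum expr_kind { EXPR_PAYLOAD, EXPR_META, EXPR_BITWISE, EXPR_VERDICT, EXPR_NAT };

struct rule {
	const enum expr_kind *exprs;
	size_t nexprs;
};

enum datapath { DATAPATH_INTERPRETER, DATAPATH_BPF };

/* Assumption for illustration: the translator handles everything except NAT. */
bool expr_translatable(enum expr_kind kind)
{
	return kind != EXPR_NAT;
}

/*
 * All-or-nothing: a single untranslatable expression keeps the whole
 * ruleset on the existing interpreter, which acts as the fallback.
 */
enum datapath choose_datapath(const struct rule *rules, size_t nrules)
{
	assert(rules != NULL || nrules == 0);

	for (size_t i = 0; i < nrules; i++)
		for (size_t j = 0; j < rules[i].nexprs; j++)
			if (!expr_translatable(rules[i].exprs[j]))
				return DATAPATH_INTERPRETER;
	return DATAPATH_BPF;
}
```

With this policy a ruleset never runs partly as bpf and partly interpreted, so
there is only one place to look when debugging a verdict.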

> Netlink dump of such a fully-translated table would have the ebpf
> expression at the beginning of the first rule, exposing ebpf program id/tag,
> and a list of the nft rule IDs that it replaced.  In the extreme (ideal)
> case, it would thus list all rule handle IDs of the chain (including
> those reachable via jump-to-user-defined-chains).
> 
> Rest of dump would be as if ebpf did not exist, but these rules would
> all be "dead" from packet-path point of view.  They are linked from
> the nft ebpf pseudo-expression, but are no different from an arbitrary
> cookie/comment.
> 
> As explained above, this also needs kernel to track mapping of
> n nft rules to m ebpf progs, rather than the simple 1:1 mapping done
> in this RFC.
> 
> The 1:1 mapping is not set in stone here, it's just the initial
> step to get the needed plumbing in, also see "Unresolved issues"
> in cover letter above.
> 
> So:
> 
> Step 1: 1:1 mapping, an nft rule has at most one ebpf prog.
> Step 2: figure out how to handle maps, sets, and how to cope with
>         not-yet-translatable expressions
> Step 3: m:n mapping: kernel provides adjacent rules to the UMH for
>         jitting.  Example: user appends rules a, b, c.  UMH creates
>         single ebpf prog from a/b/c.
>         nft-pseudo-expression replaces a/b/c in the
>         packet path, original rules a/b/c are linked from the pseudo
>         expression for tracking.  If user deletes rule b, we provide
>         a/c to UMH to create new ebpf prog that replaces new
>         sequence a/c.
> Step 4: always provide entire future base chain and all reachable chains
>         to the umh.  Ideally all of it is replaced by single program.

Right. I think the first implementation of the converter should
translate all rules at once. Not necessarily all features,
but all rules. Even if 60% of rules can be translated as bpf+trie
there is not much benefit in doing that and somehow mixing and matching
the other 40% with old style iterative rule evaluation.
Algorithms are too different. Iterative will be a drag on trie.

> 
> Eventually, entire eval loop could be replaced by ebpf prog.
> But it will need some time to get there -- at this point existing
> nft expressions would no longer provide an ->eval() function.
> 
> Does that make sense to you?
> 
> If you see this as flawed, please let me know, but as I have no idea
> how to resolve these issues going from 0 to 4 makes no sense to me.

I think the challenge is how to implement 4 without doing step 1, right?
imo doing such 1:1 (single rule to single bpf prog) translation does not
help to break the hard problem into smaller pieces. Such 1:1 is great
for a prototype, but not to land upstream.
For the same reasons in bpfilter we did single iptables rule to single
bpf prog translation, but such code doesn't belong in the upstream tree,
since it's not a scalable approach.
It's too easy to follow that road, but it goes nowhere.
Hence my proposal to invest time into building a decision tree based
algorithm coupled with pre- and post- bpf progs that supply the 'key'
into the decision trie lookup and interpret the result.
This way thousands of basic firewall rules will be translated
in an efficient way, but even a tiny ruleset with complex features (like
nat) won't be translated and that's ok.
We can build on top of an algorithm that considers all rules at once,
but not on top of a translator that does one rule at a time.
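
To illustrate why a key-based lookup scales where sequential matching does
not, here is a rough sketch of a binary prefix trie keyed on a 32-bit
destination address. It is only an assumption about what such a structure
could look like: the names are hypothetical, it is plain userspace C rather
than bpf, and it uses longest-prefix-match semantics instead of the
first-match semantics of an ordered ruleset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum verdict { VERDICT_NONE, VERDICT_ACCEPT, VERDICT_DROP };

struct trie_node {
	struct trie_node *child[2];
	enum verdict verdict;	/* VERDICT_NONE if no rule ends at this node */
};

/* Insert "prefix/plen -> verdict". Returns 0 on success, -1 if out of memory. */
int trie_insert(struct trie_node *root, uint32_t prefix, unsigned int plen,
		enum verdict verdict)
{
	struct trie_node *node = root;

	assert(plen <= 32);
	for (unsigned int i = 0; i < plen; i++) {
		unsigned int bit = (prefix >> (31 - i)) & 1;

		if (!node->child[bit]) {
			/* calloc zeroes the node, so verdict starts as VERDICT_NONE */
			node->child[bit] = calloc(1, sizeof(*node));
			if (!node->child[bit])
				return -1;
		}
		node = node->child[bit];
	}
	node->verdict = verdict;
	return 0;
}

/* Longest-prefix match: at most 32 steps, independent of the rule count. */
enum verdict trie_lookup(const struct trie_node *root, uint32_t key)
{
	const struct trie_node *node = root;
	enum verdict best = root->verdict;

	for (unsigned int i = 0; i < 32 && node; i++) {
		node = node->child[(key >> (31 - i)) & 1];
		if (node && node->verdict != VERDICT_NONE)
			best = node->verdict;
	}
	return best;
}
```

In the terms used above, a 'pre' step would extract the key from the packet
(here just the destination address), and a 'post' step would interpret the
result, for example by mapping VERDICT_NONE to the chain policy.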

> > There are papers on scalable packet classification algorithms that
> > use decision trees (hicuts, hypercuts, efficuts, etc)
> > Imo that is the direction we should be looking at.
> 
> Okay, but without any idea how to consider existing expressions,
> sets, maps etc. I'm not sure it makes sense to work on that at this
> point.

I think sets and ipset (in case of iptables) fit well into trie model.

> We also have the second problem that the netfilter base hook infra
> (NF_HOOK) already imposes indirect calls on us.
> 
> Is there a plan to have a way to replace those indirect calls with
> direct ones?  We can't do that easily because most of the functions are
> in modules, but AFAIU ebpf could rewrite that to a sequence of direct
> calls.

Yes. The abundance of indirect calls is a separate but equally important
problem. We need to address both of them.

> 
> [..]
> 
> > imo this way majority of iptables/nft rules can be converted and
> > performance will be great even with large rulesets.
> 
> Oh, I do not doubt that multiple rules can be compiled into single program,
> sorry if the RFC 1:1 mapping was confusing or gave that impression.

I think bpfilter RFC also made folks believe that translating
iptables rules one by one is what we're going to do as well.
I hope this confusion is now resolved.
The kernel doesn't need another sequential match firewall.

^ permalink raw reply

* Backport 3c75f6ee139d ("net_sched: sch_htb: add per class overlimits counter")
From: Cong Wang @ 2018-06-13  0:50 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Linux Kernel Network Developers

Hi, Dave

Please backport 3c75f6ee139d ("net_sched: sch_htb: add per class
overlimits counter") to the stable branches you take care of.
Technically it doesn't fix any bug, but it is useful for diagnostic
purposes. And of course, it is easy to backport too.

Please let me know if you need my help to backport it.

Thanks!

^ permalink raw reply

* Re: Problems in tc-matchall.8, tc-sample.8
From: Eric S. Raymond @ 2018-06-13  1:12 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20180612172439.6416b4e7@xeon-e3>

[-- Attachment #1: Type: text/plain, Size: 866 bytes --]

Stephen Hemminger <stephen@networkplumber.org>:
> The upstream repositories for master and net-next branch are now
> split. Master branch is at:
>   git://git.kernel.org/pub/scm/network/iproute2/iproute2.git
> 
> and patches for next release are in (master branch):
>   git://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git
> 
> 
> Github is an out of date clone (like all the kernels on there).

OK.  Patch fixing markup in 7 files enclosed, with signoff.

No content changes in these patches.  The intent is just to fix syntax bugs
so doclifter can do a clean lift to DocBook-XML, from which high-quality
HTML can be generated.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

My work is funded by the Internet Civil Engineering Institute: https://icei.org
Please visit their site and donate: the civilization you save might be your own.



[-- Attachment #2: 0001-Markup-fixes-for-various-manual-pages.patch --]
[-- Type: text/x-diff, Size: 2885 bytes --]

From c06f4f46b0a09ebb21c11ccc894fd827f5a6250a Mon Sep 17 00:00:00 2001
From: "Eric S. Raymond" <esr@thyrsus.com>
Date: Tue, 12 Jun 2018 21:02:38 -0400
Subject: [PATCH] Markup fixes for various manual pages.

Signed-off-by: Eric S. Raymond <esr@thyrsus.com>
---
 man/man8/tc-cbq-details.8 | 4 ++--
 man/man8/tc-cbq.8         | 4 ++--
 man/man8/tc-htb.8         | 4 ++--
 man/man8/tc-matchall.8    | 2 --
 man/man8/tc-mqprio.8      | 4 ++--
 man/man8/tc-prio.8        | 4 ++--
 man/man8/tc-sample.8      | 2 --
 7 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/man/man8/tc-cbq-details.8 b/man/man8/tc-cbq-details.8
index 9368103b..42027732 100644
--- a/man/man8/tc-cbq-details.8
+++ b/man/man8/tc-cbq-details.8
@@ -4,9 +4,9 @@ CBQ \- Class Based Queueing
 .SH SYNOPSIS
 .B tc qdisc ... dev
 dev
-.B  ( parent
+.B  { parent
 classid
-.B | root) [ handle
+.B | root} [ handle
 major:
 .B ] cbq avpkt
 bytes
diff --git a/man/man8/tc-cbq.8 b/man/man8/tc-cbq.8
index 301265d8..0d958843 100644
--- a/man/man8/tc-cbq.8
+++ b/man/man8/tc-cbq.8
@@ -4,9 +4,9 @@ CBQ \- Class Based Queueing
 .SH SYNOPSIS
 .B tc qdisc ... dev
 dev
-.B  ( parent
+.B  { parent
 classid
-.B | root) [ handle
+.B | root } [ handle
 major:
 .B ] cbq [ allot
 bytes
diff --git a/man/man8/tc-htb.8 b/man/man8/tc-htb.8
index ae310f43..b1a364bd 100644
--- a/man/man8/tc-htb.8
+++ b/man/man8/tc-htb.8
@@ -4,9 +4,9 @@ HTB \- Hierarchy Token Bucket
 .SH SYNOPSIS
 .B tc qdisc ... dev
 dev
-.B  ( parent
+.B  { parent
 classid
-.B | root) [ handle
+.B | root } [ handle
 major:
 .B ] htb [ default
 minor-id
diff --git a/man/man8/tc-matchall.8 b/man/man8/tc-matchall.8
index e3cddb1f..28969461 100644
--- a/man/man8/tc-matchall.8
+++ b/man/man8/tc-matchall.8
@@ -81,7 +81,5 @@ tc filter add dev eth0 parent ffff: matchall \\
      action sample rate 100 group 12
 .EE
 .RE
-
-.EE
 .SH SEE ALSO
 .BR tc (8),
diff --git a/man/man8/tc-mqprio.8 b/man/man8/tc-mqprio.8
index a1bedd35..0936b2be 100644
--- a/man/man8/tc-mqprio.8
+++ b/man/man8/tc-mqprio.8
@@ -4,9 +4,9 @@ MQPRIO \- Multiqueue Priority Qdisc (Offloaded Hardware QOS)
 .SH SYNOPSIS
 .B tc qdisc ... dev
 dev
-.B  ( parent
+.B  { parent
 classid
-.B | root) [ handle
+.B | root } [ handle
 major:
 .B ] mqprio [ numtc
 tcs
diff --git a/man/man8/tc-prio.8 b/man/man8/tc-prio.8
index 605f3d39..8c5b21dd 100644
--- a/man/man8/tc-prio.8
+++ b/man/man8/tc-prio.8
@@ -4,9 +4,9 @@ PRIO \- Priority qdisc
 .SH SYNOPSIS
 .B tc qdisc ... dev
 dev
-.B  ( parent
+.B  { parent
 classid
-.B | root) [ handle
+.B | root } [ handle
 major:
 .B ] prio [ bands
 bands
diff --git a/man/man8/tc-sample.8 b/man/man8/tc-sample.8
index 3e03eba2..0facd3c5 100644
--- a/man/man8/tc-sample.8
+++ b/man/man8/tc-sample.8
@@ -116,8 +116,6 @@ tc filter add dev eth1 parent ffff: matchall \\
      action sample index 19
 .EE
 .RE
-
-.EE
 .RE
 .SH SEE ALSO
 .BR tc (8),
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net] VSOCK: check sk state before receive
From: Hangbin Liu @ 2018-06-13  1:44 UTC (permalink / raw)
  To: Jorgen S. Hansen; +Cc: Stefan Hajnoczi, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <E9BA11C2-0F15-4FFF-8E29-74640E82D046@vmware.com>

On Mon, Jun 04, 2018 at 04:02:39PM +0000, Jorgen S. Hansen wrote:
> 
> > On May 30, 2018, at 11:17 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Sun, May 27, 2018 at 11:29:45PM +0800, Hangbin Liu wrote:
> >> Hmm... Although I can't reproduce this bug with my reproducer after
> >> applying my patch, I can still get a similar issue with the syzkaller sock vnet test.
> >> 
> >> It looks like this patch is not complete. Here is the KASAN call trace with my patch.
> >> I can also reproduce it without my patch.
> > 
> > Seems like a race between vmci_datagram_destroy_handle() and the
> > delayed callback, vmci_transport_recv_dgram_cb().
> > 
> > I don't know the VMCI transport well so I'll leave this to Jorgen.
> 
> Yes, it looks like we are calling the delayed callback after we return from vmci_datagram_destroy_handle(). I’ll take a closer look at the VMCI side here - the refcounting of VMCI datagram endpoints should guard against this, since the delayed callback does a get on the datagram resource, so this could be a VMCI driver issue, and not a problem in the VMCI transport for AF_VSOCK.

Hi Jorgen,

Thanks for helping to look at this. I'm happy to test your patch.

Thanks
Hangbin

^ permalink raw reply

* Re: KASAN: use-after-free Read in rds_cong_queue_updates
From: syzbot @ 2018-06-13  2:51 UTC (permalink / raw)
  To: davem, linux-kernel, linux-rdma, netdev, rds-devel,
	santosh.shilimkar, syzkaller-bugs
In-Reply-To: <089e08e548431cd0f90565c9f4e5@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    f0dc7f9c6dd9 Merge git://git.kernel.org/pub/scm/linux/kern..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1461f03f800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=fa9c20c48788d1c1
dashboard link: https://syzkaller.appspot.com/bug?extid=4c20b3866171ce8441d2
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=16cbfeaf800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=165227f7800000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
==================================================================
BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: use-after-free in refcount_read include/linux/refcount.h:42 [inline]
BUG: KASAN: use-after-free in check_net include/net/net_namespace.h:236 [inline]
BUG: KASAN: use-after-free in rds_destroy_pending net/rds/rds.h:897 [inline]
BUG: KASAN: use-after-free in rds_cong_queue_updates+0x255/0x590 net/rds/cong.c:226
Read of size 4 at addr ffff8801ab180044 by task syz-executor199/4800

CPU: 1 PID: 4800 Comm: syz-executor199 Not tainted 4.17.0+ #84
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
  check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
  kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
  atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
  refcount_read include/linux/refcount.h:42 [inline]
  check_net include/net/net_namespace.h:236 [inline]
  rds_destroy_pending net/rds/rds.h:897 [inline]
  rds_cong_queue_updates+0x255/0x590 net/rds/cong.c:226
  rds_recv_rcvbuf_delta.part.3+0x211/0x350 net/rds/recv.c:126
  rds_recv_rcvbuf_delta net/rds/recv.c:735 [inline]
  rds_clear_recv_queue+0x2f0/0x4c0 net/rds/recv.c:735
  rds_release+0x15c/0x550 net/rds/af_rds.c:72
  __sock_release+0xd7/0x260 net/socket.c:603
  sock_close+0x19/0x20 net/socket.c:1186
  __fput+0x353/0x890 fs/file_table.c:209
  ____fput+0x15/0x20 fs/file_table.c:243
  task_work_run+0x1e4/0x290 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x1aee/0x2730 kernel/exit.c:865
  do_group_exit+0x16f/0x430 kernel/exit.c:968
  get_signal+0x886/0x1960 kernel/signal.c:2468
  do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
  exit_to_usermode_loop+0x2cf/0x360 arch/x86/entry/common.c:162
  prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
  do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:293
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44f439
Code: e8 ac be 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b ff fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fc65567dcf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00000000006edadc RCX: 000000000044f439
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000006edadc
RBP: 00000000006edad8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fff3df31b1f R14: 00007fc65567e9c0 R15: 0000000000000061

Allocated by task 4800:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
  kmem_cache_zalloc include/linux/slab.h:696 [inline]
  net_alloc net/core/net_namespace.c:383 [inline]
  copy_net_ns+0x159/0x4c0 net/core/net_namespace.c:423
  create_new_namespaces+0x69d/0x8f0 kernel/nsproxy.c:107
  unshare_nsproxy_namespaces+0xc3/0x1f0 kernel/nsproxy.c:206
  ksys_unshare+0x708/0xf90 kernel/fork.c:2411
  __do_sys_unshare kernel/fork.c:2479 [inline]
  __se_sys_unshare kernel/fork.c:2477 [inline]
  __x64_sys_unshare+0x31/0x40 kernel/fork.c:2477
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 746:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
  net_free net/core/net_namespace.c:399 [inline]
  net_drop_ns.part.14+0x11a/0x130 net/core/net_namespace.c:406
  net_drop_ns net/core/net_namespace.c:405 [inline]
  cleanup_net+0x6a1/0xb20 net/core/net_namespace.c:541
  process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
  worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

The buggy address belongs to the object at ffff8801ab180040
  which belongs to the cache net_namespace(17:syz0) of size 8896
The buggy address is located 4 bytes inside of
  8896-byte region [ffff8801ab180040, ffff8801ab182300)
The buggy address belongs to the page:
page:ffffea0006ac6000 count:1 mapcount:0 mapping:ffff8801aeaa0080 index:0x0 compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801d3827048 ffff8801d3827048 ffff8801aeaa0080
raw: 0000000000000000 ffff8801ab180040 0000000100000001 ffff8801ab7cae40
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff8801ab7cae40

Memory state around the buggy address:
  ffff8801ab17ff00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff8801ab17ff80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ffff8801ab180000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                            ^
  ffff8801ab180080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801ab180100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

^ permalink raw reply
