Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-15  2:34 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, qemu-devel, virtualization, Siwei Liu, Netdev,
	aaron.f.brown
In-Reply-To: <20180614120231.0a72bd5f.cohuck@redhat.com>

On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
> 
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
> Should it be possible to add a virtio/vfio pair later on?

Yes, that's the plan, more or less.

> > >>> I think Qemu should check if guest virtio-net supports this feature and
> > >>> provide a mechanism for
> > >>> an upper layer indicating if the STANDBY feature is successfully
> > >>> negotiated or not.
> > >>> The upper layer can then decide if it should hot plug a VF with the same
> > >>> MAC and manage the 2 links.
> > >>> If VF is successfully hot plugged, virtio-net link should be disabled.  

BTW I see no reason to do this last part. The role of the
standby device is to be always there.

> > >>
> > >> Did you even talk to upper layer management about it?
> > >> Just list the steps they need to do and you will see
> > >> that's a lot of machinery to manage by the upper layer.
> > >>
> > >> What do we gain in flexibility? As far as I can see the
> > >> only gain is some resources saved for legacy VMs.
> > >>
> > >> That's not a lot as tenant of the upper layer probably already has
> > >> at least a hunch that it's a new guest otherwise
> > >> why bother specifying the feature at all - you
> > >> save even more resources without it.
> > >>  
> > >
> > > I am not all that familiar with how Qemu manages network devices. If we can
> > > do all the
> > > required management of the primary/standby devices within Qemu, that is
> > > definitely a better
> > > approach without upper layer involvement.  
> > 
> > Right. I would imagine in the extreme case the upper layer doesn't
> > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > The management tool can supply passthrough device and virtio with the
> > same group UUID, QEMU auto-manages the presence of the primary, and
> > hot plug the device as needed before or after the migration.
> 
> I do not really see how you can manage that kind of stuff in QEMU only.

So right now failover is limited to pci passthrough devices only.
The idea is to realize the vfio device but not expose it
to guest. Have a separate command to expose it to guest.
Hotunplug would also hide it from guest but not unrealize it.

This will help ensure that e.g. on migration failure we can
re-expose the device without risk of running out of resources.
-- 
MST

^ permalink raw reply

* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Matthew Wilcox @ 2018-06-15  2:37 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Juergen Gross, Jens Axboe, kvm, linux-scsi, netdev, linux-usb,
	linux-kernel, virtualization, target-devel, qla2xxx-upstream,
	linux1394-devel, Kent Overstreet
In-Reply-To: <yq136xols59.fsf@oracle.com>

On Thu, Jun 14, 2018 at 10:06:58PM -0400, Martin K. Petersen wrote:
> 
> Matthew,
> 
> > Removing the percpu_ida code nets over 400 lines of removal.  It's not
> > as spectacular as deleting an entire architecture, but it's still a
> > worthy reduction in lines of code.
> 
> Since most of the changes are in scsi or target, should I take this
> series through my tree?

I'd welcome that.  Nick seems to be inactive as target maintainer;
his tree on kernel.org hasn't seen any updates in five months.

Thanks!

^ permalink raw reply

* [PATCH v2] net: cxgb3: add error handling for sysfs_create_group
From: Zhouyang Jia @ 2018-06-15  3:06 UTC (permalink / raw)
  Cc: Zhouyang Jia, Santosh Raspatur, David S. Miller, netdev,
	linux-kernel
In-Reply-To: <1528984571-53320-1-git-send-email-jiazhouyang09@gmail.com>

When sysfs_create_group fails, the lack of error-handling code may
cause unexpected results.

This patch adds error-handling code after calling sysfs_create_group.

Signed-off-by: Zhouyang Jia <jiazhouyang09@gmail.com>
---
v1->v2:
- Turn off led when sysfs_create_group fails
---
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
index 2edfdbd..7b795ed 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_main.c
@@ -3362,10 +3362,17 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	err = sysfs_create_group(&adapter->port[0]->dev.kobj,
 				 &cxgb3_attr_group);
+	if (err) {
+		dev_err(&pdev->dev, "cannot create sysfs group\n");
+		goto out_close_led;
+	}
 
 	print_port_info(adapter, ai);
 	return 0;
 
+out_close_led:
+	t3_set_reg_field(adapter, A_T3DBG_GPIO_EN, F_GPIO0_OUT_VAL, 0);
+
 out_free_dev:
 	iounmap(adapter->regs);
 	for (i = ai->nports0 + ai->nports1 - 1; i >= 0; --i)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net] hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
From: David Miller @ 2018-06-15  3:16 UTC (permalink / raw)
  To: haiyangz, haiyangz; +Cc: olaf, sthemmin, netdev, linux-kernel, devel, vkuznets
In-Reply-To: <20180615012909.13440-1-haiyangz@linuxonhyperv.com>

From: Haiyang Zhang <haiyangz@linuxonhyperv.com>
Date: Thu, 14 Jun 2018 18:29:09 -0700

> From: Haiyang Zhang <haiyangz@microsoft.com>
> 
> These fields in struct ndis_ipsecv2_offload and struct ndis_rsc_offload
> are one byte according to the specs. This patch defines them with the
> right size. These structs are not in use right now, but will be used soon.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Florian Fainelli @ 2018-06-15  3:24 UTC (permalink / raw)
  To: Don Bollinger, Andrew Lunn
  Cc: Tom Lendacky, Arnd Bergmann, Greg Kroah-Hartman, linux-kernel,
	brandon_chuang, wally_wang, roy_lee, rick_burchett, quentin.chang,
	steven.noble, jeffrey.townsend, scotte, roopa, David Ahern,
	luke.williams, Guohan Lu, Russell King, netdev@vger.kernel.org
In-Reply-To: <20180615022652.t6oqpnwwvdmbooab@thebollingers.org>



On 06/14/2018 07:26 PM, Don Bollinger wrote:
> On Tue, Jun 12, 2018 at 08:11:09PM +0200, Andrew Lunn wrote:
>>> There's an SFP driver under drivers/net/phy.  Can that driver be extended
>>> to provide this support?  Adding Russel King who developed sfp.c, as well
>>> at the netdev mailing list.
>>
>> I agree, the current SFP code should be used.
>>
>> My observations seem to be there are two different ways {Q}SFP are used:
>>
>> 1) The Linux kernel has full control, as assumed by the devlink/SFP
>> frame work. We parse the SFP data to find the capabilities of the SFP
>> and use it to program the MAC to use the correct mode. The MAC can be
>> a NIC, but it can also be a switch. DSA is gaining support for
>> PHYLINK, so SFP modules should just work with most switches which DSA
>> support.  And there is no reason a plain switchdev switch can not use
>> PHYLINK.
>>
>> 2) Firmware is in control of the PHY layer, but there is a wish to
>> expose some of the data which is available via i2c from the {Q}SFP to
>> linux.
>>
>> It appears this optoe supports this second case. It does not appear to
>> support any in kernel API to actually make use of the SFP data in the
>> kernel.
>>
>> We should not be duplicating code. We should share the SFP code for
>> both use cases above. There is also a Linux standard API for getting
>> access to this information. ethtool -m/--module-info. Anything which
>> is exporting {Q}SFP data needs to use this API.
>>
>>    Andrew
> 
> Actually this is better described by a third use case.  The target
> switches are PHY-less (see various designs at
> www.compute.org/wiki/Networking/SpecsAndDesigns). The AS5712 for example
> says "The AS5712-54X is a PHY-Less design with the SFP+ and QSFP+
> connections directly attaching to the Serdes interfaces of the Broadcom
> BCM56854 720G Trident 2 switching silicon..."
> 
> The electrical controls of the {Q}SFP devices (TxDisable for example)
> are organized in a platform dependent way, through CPLD devices, and
> managed by a platform specific CPLD driver.
> 
> The i2c bus is muxed from the CPU to all of the {Q}SFP devices, which
> are set up as standard linux i2c devices
> (/sys/bus/i2c/devices/i2c-xxxx).
> 
> There is no MDIO bus between the CPU and the {Q}SFP devices.
> 
>> 2) Firmware is in control of the PHY layer, but there is a wish to
>> expose some of the data which is available via i2c from the {Q}SFP to
>> linux.
> 
> So the switch silicon is in control of the PHY layer.  The platform
> driver is in control of the electrical interfaces.  And the EEPROM data
> is available via I2C.
> 
> And, there isn't actually 'a wish to expose' the EEPROM data to linux
> (the kernel).  It turns out that none of the NOS partners I'm working
> with use that data *in the kernel*.  It is all managed from user space.
> 
> More generally, I think sfp.c and optoe are not actually trying to
> accomplish the same thing at all.  sfp.c combines all three functions
> (PHY, electrical control, EEPROM access).  optoe is only providing EEPROM
> access, and only to user space.  This is a real need in the white box
> switch environment, and is not met by sfp.c.  optoe isn't better, sfp.c
> isn't better, they're just different.

sfp exposes standard ethtool hooks such as get_module_info() which pass
the whole data blob to user-space, e.g: ethtool where all of this is
better interpreted.

> 
> Finally, sfp.c does not recognize that SFP devices have data beyond 512
> bytes, accessible via a page register.  It also does not recognize QSFP
> devices at all.  QSFP devices have only 256 bytes accessible (one i2c
> address) before switching to paged access for the remaining data.  The
> first design requirement for optoe was to access all the available
> pages, because there is information and controls that we (optics
> vendors) want to make available to network management applications.

Patches welcome if you wish to extend sfp.c to support QSFP devices for
instances.

> 
> If sfp.c creates a standard linux i2c client for each SFP device, it
> should be possible to create an optoe managed device 'under' sfp.c to
> provide access to the full EEPROM address space:

It's the other way around, SFP relies on a standard Linux i2c bus master
to exist such that it can read the EEPROM from the standard slave
address location, same goes with a possibly present PHY.

>   # echo optoe2 0x50 >/sys/bus/i2c/devices/i2c-xx/new_device
> This might prove useful to user space consumers of that data.  We could
> also easily add a kernel API (eg the nvmem framework) to optoe to provide
> kernel access.  In other words, sfp.c could assign EEPROM management to
> optoe, while managing the electrical interfaces.  (This is actually
> pretty close to how the platfom drivers work in the switch environment.)
> sfp.c would get SFP page support and QSFP EEPROM access 'for free'.

That sounds like a possibly good approach.

> 
>>                       There is also a Linux standard API for getting
>> access to this information. ethtool -m/--module-info. Anything which
>> is exporting {Q}SFP data needs to use this API.
> 
> optoe simply provides direct access from user space to the full EEPROM
> data.  There is more data there than ethtool knows about, and in some
> devices there are proprietary registers that ethtool will never know
> about.  optoe does not interpret any of the EEPROM content (except the
> bare minimum to access pages correctly).  optoe also does not get in the
> way of ethtool.  It could prove to be a handy way for ethtool to access
> new EEPROM fields in the future.  QSFP-DD/OSFP are coming soon, they
> will have a different (incompatible) set of new fields to be decoded.

sfp is the same it only passes the EEPROM information to user-space and
interprets just what it needs to get the work done.

> 
> Bottom Line:  sfp.c is not a useful starting point for the switch
> environment I'm working in.  The underlying hardware architecture is
> quite different.  optoe is not a competing alternative.  Its only
> function is to provide user-space access to the EEPROM data in {Q}SFP
> devices.

I just don't understand why would we want optoe when the standard way to
expose both EEPROM and diagnostics modules has been through ethtool's
get_module_info since the basic paradigm is that a network port is a
net_device instance in the kernel. If that basic device driver model
does not exist, then it is unclear to me what are the benefits.

Would I be completely wrong if I wrote that you are likely working with
a switch SDK which primarily runs in user-space and so with lack of a
proper kernel-based device driver for your piece of hardware you are
attempting to create a driver which is some sort of a "tap" for some
specific portion of that larger hardware block?
-- 
Florian

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Wei Wang @ 2018-06-15  3:53 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Linus Torvalds, Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <0f18063c-c76b-4728-5145-810f069988ea@redhat.com>

On 06/14/2018 11:01 PM, Nitesh Narayan Lal wrote:
> Hi Wei,
>
>
> On 06/12/2018 07:05 AM, Wei Wang wrote:
>> On 06/12/2018 09:59 AM, Linus Torvalds wrote:
>>> On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <mst@redhat.com>
>>> wrote:
>>>> Maybe it will help to have GFP_NONE which will make any allocation
>>>> fail if attempted. Linus, would this address your comment?
>>> It would definitely have helped me initially overlook that call chain.
>>>
>>> But then when I started looking at the whole dma_map_page() thing, it
>>> just raised my hackles again.
>>>
>>> I would seriously suggest having a much simpler version for the "no
>>> allocation, no dma mapping" case, so that it's *obvious* that that
>>> never happens.
>>>
>>> So instead of having virtio_balloon_send_free_pages() call a really
>>> generic complex chain of functions that in _some_ cases can do memory
>>> allocation, why isn't there a short-circuited "vitruque_add_datum()"
>>> that is guaranteed to never do anything like that?
>>>
>>> Honestly, I look at "add_one_sg()" and it really doesn't make me
>>> happy. It looks hacky as hell. If I read the code right, you're really
>>> trying to just queue up a simple tuple of <pfn,len>, except you encode
>>> it as a page pointer in order to play games with the SG logic, and
>>> then you hmap that to the ring, except in this case it's all a fake
>>> ring that just adds the cpu-physical address instead.
>>>
>>> And to figuer that out, it's like five layers of indirection through
>>> different helper functions that *can* do more generic things but in
>>> this case don't.
>>>
>>> And you do all of this from a core VM callback function with some
>>> _really_ core VM locks held.
>>>
>>> That makes no sense to me.
>>>
>>> How about this:
>>>
>>>    - get rid of all that code
>>>
>>>    - make the core VM callback save the "these are the free memory
>>> regions" in a fixed and limited array. One that DOES JUST THAT. No
>>> crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
>>> size, pre-allocated for that virtio instance.
>>>
>>>    - make it obvious that what you do in that sequence is ten
>>> instructions and no allocations ("Look ma, I wrote a value to an array
>>> and incremented the array idex, and I'M DONE")
>>>
>>>    - then in that workqueue entry that you start *anyway*, you empty the
>>> array and do all the crazy virtio stuff.
>>>
>>> In fact, while at it, just simplify the VM interface too. Instead of
>>> traversing a random number of buddy lists, just trraverse *one* - the
>>> top-level one. Are you seriously ever going to shrink or mark
>>> read-only anythin *but* something big enough to be in the maximum
>>> order?
>>>
>>> MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
>>> want the balloon code to work on smaller things, particularly since
>>> the whole interface is fundamentally racy and opportunistic to begin
>>> with?
>> OK, I will implement a new version based on the suggestions. Thanks.
> I have been working on a similar series [1] that is more generic, which
> solves the problem of giving unused memory back to the host and could be
> used to solve the migration problem as well. Can you take a look and see
> if you can use my series in some way?

Hi Nitesh,

I actually checked the last version, which dates back to last year. It 
seems the new version does not have fundamental differences.

Actually there are obvious differences between the two series. This 
series provides a simple lightweight method (will continue to post out a 
new version with simpler interfaces based on the above suggestions) to 
offer free pages hints, and the hints are quit helpful for usages like 
accelerating live migration and guest snapshot. If I read that 
correctly, that series seems to provide true (guaranteed) free pages 
with much more heavyweight logics, but true free pages are not necessary 
for the live migration optimization, which is the goal and motivation of 
this work. And from my point of view, that series seems more like an 
alternative function to ballooning, which takes out free pages (or say 
guest unused pages) via allocation.

I will join the discussion in that thread. Probably we would need to 
think about other new usages for that series.

Best,
Wei

^ permalink raw reply

* Mutual Coperation Thank you
From: Mr Obama Basolle. @ 2018-06-15  3:59 UTC (permalink / raw)


-- 
Dear Friend,

I know that this message will come to you as a surprise. I am the
Auditing and Accounting section manager with African Development Bank,
Ouagadougou Burkina faso. I Hope that you will not expose or betray
this trust and confident that I am about to repose on you for the
mutual benefit of our both families.

I need your urgent assistance in transferring the sum of($39.5)million
to your account within 10 or 14 banking days. This money has been
dormant for years in our Bank without claim.I want the bank to release
the money to you as the nearest person to our deceased customer late
George small. who died along with his supposed next of kin in an air
crash since 31st October 1999.

I don't want the money to go into government treasury as an abandoned
fund. So this is the reason why I am contacting you so that the bank
can release the money to you as the next of kin to the deceased
customer. Please I would like you to keep this proposal as atop secret
and delete it if you are not interested.

Upon receipt of your reply, I will give you full details on how the
business will be executed and also note that you will have 40% of the
above mentioned sum if you agree to handle this business with me.

I am expecting your urgent response as soon as you receive my message

Best Regard,

Mr Obama Bassole
Ouagadougou Burkina Faso

^ permalink raw reply

* Re: [PATCH RFC v2] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen()
From: Herbert Xu @ 2018-06-15  4:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: NeilBrown, Thomas Graf, Linux Kernel Network Developers, LKML,
	Tom Herbert
In-Reply-To: <CALx6S35Yr8th_LFoCLjEqfzti0cp7+7nKZC7KO3dzeQkU7c0nw@mail.gmail.com>

On Thu, Jun 14, 2018 at 10:41:05AM -0700, Tom Herbert wrote:
> On Mon, Jun 11, 2018 at 7:48 PM, NeilBrown <neilb@suse.com> wrote:
> >
> > rhashtable_walk_last_seen() does most of the work that
> > rhashtable_walk_peek() needs done, so use it and put
> > it in a "static inline".
> > Also update the documentation for rhashtable_walk_peek() to clarify
> > the expected use case.
> >
> > Signed-off-by: NeilBrown <neilb@suse.com>
> 
> Acked-by: Tom Herbert <tom@quantonium.net>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [bpf PATCH v2 1/6] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-15  4:46 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: ast, daniel, netdev
In-Reply-To: <20180614235321.hi3qcno7cee4cgc4@kafai-mbp.dhcp.thefacebook.com>

On 06/14/2018 04:53 PM, Martin KaFai Lau wrote:
> On Thu, Jun 14, 2018 at 09:44:46AM -0700, John Fastabend wrote:
>> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
>> of tcpv6_prot.
>>
>> Previously we overwrote the sk->prot field with tcp_prot even in the
>> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
>> are used.
> 
>> Further, only allow ESTABLISHED connections to join the
>> map per note in TLS ULP,
>>
>>    /* The TLS ulp is currently supported only for TCP sockets
>>     * in ESTABLISHED state.
>>     * Supporting sockets in LISTEN state will require us
>>     * to modify the accept implementation to clone rather then
>>     * share the ulp context.
>>     */
> This bit has been moved to patch 2.

Yep better cut the comment as well.

> 
>>
>> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
>> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
>> crashing case here.
>>
>> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
>> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Wei Wang <weiwan@google.com>
>> ---
>>  0 files changed
>>

0 files changed will fix that as well.

>> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
>> index 52a91d8..f6dd4cd 100644
>> --- a/kernel/bpf/sockmap.c
>> +++ b/kernel/bpf/sockmap.c
>> @@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>>  static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
>>  static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>>  			    int offset, size_t size, int flags);
>> +static void bpf_tcp_close(struct sock *sk, long timeout);
>>  
>>  static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
>>  {
>> @@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
>>  	return !empty;
>>  }
>>  
>> -static struct proto tcp_bpf_proto;
>> +enum {
>> +	SOCKMAP_IPV4,
>> +	SOCKMAP_IPV6,
>> +	SOCKMAP_NUM_PROTS,
>> +};
>> +
>> +enum {
>> +	SOCKMAP_BASE,
>> +	SOCKMAP_TX,
>> +	SOCKMAP_NUM_CONFIGS,
>> +};
>> +
>> +static struct proto *saved_tcpv6_prot;
> __read_mostly
> 

Sure makes sense.

>> +static DEFINE_MUTEX(tcpv6_prot_mutex);
>> +static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS];
>> +static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
>> +			 struct proto *base)
>> +{
>> +	prot[SOCKMAP_BASE]			= *base;
>> +	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
>> +	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
>> +	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
>> +
>> +	prot[SOCKMAP_TX]			= prot[SOCKMAP_BASE];
>> +	prot[SOCKMAP_TX].sendmsg		= bpf_tcp_sendmsg;
>> +	prot[SOCKMAP_TX].sendpage		= bpf_tcp_sendpage;
>> +}
>> +
>> +static void update_sk_prot(struct sock *sk, struct smap_psock *psock)
>> +{
>> +	int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4;
>> +	int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE;
>> +
>> +	sk->sk_prot = &bpf_tcp_prots[family][conf];
>> +}
>> +
>>  static int bpf_tcp_init(struct sock *sk)
>>  {
>>  	struct smap_psock *psock;
>> @@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk)
>>  	psock->save_close = sk->sk_prot->close;
>>  	psock->sk_proto = sk->sk_prot;
>>  
>> -	if (psock->bpf_tx_msg) {
>> -		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
>> -		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
>> -		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
>> -		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
>> +	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
>> +	if (sk->sk_family == AF_INET6 &&
>> +	    unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
>> +		mutex_lock(&tcpv6_prot_mutex);
> bpf_tcp_init() can be called by skops?
> Can mutex_lock() be used here?
> 

No mutex lock can not be used here. Both are called
with rcu_read_lock() and we can not sleep. Thanks
for catching. Also this will give a kernel splat now
that I have the right config options. Guess we need
a v3 :/

Thanks,
John

^ permalink raw reply

* Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format
From: Yonghong Song @ 2018-06-15  4:56 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Alexei Starovoitov
  Cc: Martin KaFai Lau, netdev, Daniel Borkmann, kernel-team, Wang Nan,
	Jiri Olsa, Namhyung Kim, Ingo Molnar
In-Reply-To: <20180614180017.GJ30043@kernel.org>



On 6/14/18 11:00 AM, Arnaldo Carvalho de Melo wrote:
> Em Thu, Jun 14, 2018 at 02:47:59PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Thu, Jun 14, 2018 at 10:21:30AM -0700, Alexei Starovoitov escreveu:
>>> On 6/14/18 10:18 AM, Arnaldo Carvalho de Melo wrote:
>>>> Just out of curiosity, is there any plan to have this as a clang option?
>    
>>> I think
>>> clang ... -mllvm -mattr=dwarfris
>>> should work.
>   
>> The message "(LLVM option parsing)" implies what you suggest, but didn't
>> worked :-\
>   
>>    -mllvm <value>          Additional arguments to forward to LLVM's option processing
>   
>> Almost there tho :-\
> 
> So I thought that this -mattr=dwarfris would be available only after I
> set the target, because I tried 'llc -mattr=help' and dwarfris wasn't
> there:
> 
> [acme@jouet perf]$ llc -mattr=help |& grep dwarf
> [acme@jouet perf]$
> 
> Only after I set the arch it appears:
> 
> [acme@jouet perf]$ llc -march=bpf -mattr=help |& grep dwarf
>    dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
>    dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
>    dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
> [acme@jouet perf]$
> 
> But even after moving the '-mllvm -mattr=dwarfris' to after '-target
> bpf' it still can't grok it :-\
> 
> /usr/local/bin/clang -D__KERNEL__ -D__NR_CPUS__=4 -DLINUX_VERSION_CODE=0x41100 -g -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h  -I/home/acme/lib/include/perf/bpf -Wno-unused-value -Wno-pointer-sign -working-directory /lib/modules/4.17.0-rc5/build -c /home/acme/bpf/hello.c -target bpf -mllvm -mattr=dwarfris -O2 -o hello.o
> 
> So onlye with 'clang ... -target bpf -emit-llvm -O2 -o - | llc -march=bpf -mattr=dwarfris ...'
> things work as we expect.

Right. Currently, the only way to use option -mattr=dwarfris is through 
llc. The "clang -mllvm -mattr=dwarfris" won't work since
    -mllvm <value> Additional arguments to forward to LLVM's option 
processing
and -mattr=dwarfris is not in LLVM auto option processing system.
Those options, in llvm source code, typically have a pattern like below:
===
static cl::opt<unsigned> MemCmpEqZeroNumLoadsPerBlock(
     "memcmp-num-loads-per-block", cl::Hidden, cl::init(1),
     cl::desc("The number of loads per basic block for inline expansion of "
              "memcmp that is only being compared against zero."));
===

I really want to get rid of this option as well. To make pahole work
with the default default format, I need to add bpf support to
libdwfl in elfutils repo. I will work on that.



> - Arnaldo
> 

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Steffen Klassert @ 2018-06-15  5:23 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Pablo Neira Ayuso, netfilter-devel, Network Development
In-Reply-To: <CAF=yD-+fJ_isaGtfLFVgMQ4_7K-v7b+Kyo6iOB-5kUbbPJXT7Q@mail.gmail.com>

On Thu, Jun 14, 2018 at 11:50:49AM -0400, Willem de Bruijn wrote:
> > This patchset supports both layer 3 IPv4 and IPv6, and layer 4 TCP and
> > UDP protocols. This fastpath also integrates with the IPSec
> > infrastructure and the ESP protocol.
> >
> > We have collected performance numbers:
> >
> >         TCP TSO         TCP Fast Forward
> >         32.5 Gbps       35.6 Gbps
> >
> >         UDP             UDP Fast Forward
> >         17.6 Gbps       35.6 Gbps
> >
> >         ESP             ESP Fast Forward
> >         6 Gbps          7.5 Gbps
> >
> > For UDP, this is doubling performance, and we almost achieve line rate
> > with one single CPU using the Intel i40e NIC. We got similar numbers
> > with the Mellanox ConnectX-4. For TCP, this is slightly improving things
> > even if TSO is being defeated given that we need to segment the packet
> > chain in software.
> 
> The difference between TCP and UDP stems from lack of GRO for UDP.

Right.

> We
> recently added UDP GSO to allow for batch traversal of the UDP stack on
> transmission. Adding a UDP GRO handler can probably extend batching to
> the forwarding path in a similar way without the need for a new infrastructure.

That's more or less what we did. The batching method ist just
optimized for the forwarding path. We are generating skb chains
by chaning at the frag_list pointer of the first skb. With that,
we don't need to mange packet. We keep the packets in the native
form, so the 'segmentation' is rather easy.

The rest is just to be able to configure this and to make
sure that we handle only flows that are going to be (fast)
forwarded, as the upper stack can not (yet) handle such
skb chains.

^ permalink raw reply

* Re: [PATCH bpf-net] selftests/bpf: delete xfrm tunnel when test exits.
From: Eyal Birger @ 2018-06-15  5:24 UTC (permalink / raw)
  To: William Tu; +Cc: netdev, anders.roxell
In-Reply-To: <1528977666-26477-1-git-send-email-u9012063@gmail.com>

> On 14 Jun 2018, at 15:01, William Tu <u9012063@gmail.com> wrote:
> 
> Make the printting of bpf xfrm tunnel better and
> cleanup xfrm state and policy when xfrm test finishes.

Yeah the ‘tee’ was useful when developing the test - I could see what’s going on :)

Now that it’s in ‘selftests’ it’s definitely better without it.

Thanks for the cleanup!
Eyal.

^ permalink raw reply

* Re: [PATCH RFC v2] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen()
From: NeilBrown @ 2018-06-15  5:31 UTC (permalink / raw)
  To: Herbert Xu, Tom Herbert
  Cc: Thomas Graf, Linux Kernel Network Developers, LKML, Tom Herbert
In-Reply-To: <20180615042331.3dmqi3bumzk6lu5c@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 857 bytes --]

On Fri, Jun 15 2018, Herbert Xu wrote:

> On Thu, Jun 14, 2018 at 10:41:05AM -0700, Tom Herbert wrote:
>> On Mon, Jun 11, 2018 at 7:48 PM, NeilBrown <neilb@suse.com> wrote:
>> >
>> > rhashtable_walk_last_seen() does most of the work that
>> > rhashtable_walk_peek() needs done, so use it and put
>> > it in a "static inline".
>> > Also update the documentation for rhashtable_walk_peek() to clarify
>> > the expected use case.
>> >
>> > Signed-off-by: NeilBrown <neilb@suse.com>
>> 
>> Acked-by: Tom Herbert <tom@quantonium.net>
>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks to your both.
I assume this is an implicit Ack for
  rhashtable: add rhashtable_walk_last_seen()

I'll send out a patchset of all the acked patches shortly after the
merge window closes, and then see where we are up to.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH net-next v3 1/2] r8169: Don't disable ASPM in the driver
From: Kai-Heng Feng @ 2018-06-15  5:32 UTC (permalink / raw)
  To: davem
  Cc: ryankao, hayeswang, hau, hkallweit1, romieu, bhelgaas, acelan.kao,
	netdev, linux-pci, linux-kernel, Kai-Heng Feng

Enable or disable ASPM should be done in PCI core instead of in the
device driver.

Commit ba04c7c93bbc ("r8169: disable ASPM") uses
pci_disable_link_state() to disable ASPM, but it's not the best way to
do it. If the device really wants to disable ASPM, we can use a quirk in
PCI core to prevent the PCI core from setting ASPM before probe.

Let's remove pci_disable_link_state() for now. Use PCI core quirks if
any regression happens.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v3:
- Change commit message wording.
- Rename the function to rtl_hw_aspm_clkreq_enable().

v2:
- Remove module parameter.
- Remove pci_disable_link_state().

 drivers/net/ethernet/realtek/r8169.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 75dfac0248f4..9b55ce513a36 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -25,7 +25,6 @@
 #include <linux/dma-mapping.h>
 #include <linux/pm_runtime.h>
 #include <linux/firmware.h>
-#include <linux/pci-aspm.h>
 #include <linux/prefetch.h>
 #include <linux/ipv6.h>
 #include <net/ip6_checksum.h>
@@ -7647,10 +7646,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	mii->reg_num_mask = 0x1f;
 	mii->supports_gmii = cfg->has_gmii;
 
-	/* disable ASPM completely as that cause random device stop working
-	 * problems as well as full system hangs for some PCIe devices users */
-	pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1 |
-				     PCIE_LINK_STATE_CLKPM);
 
 	/* enable device (incl. PCI PM wakeup and hotplug setup) */
 	rc = pcim_enable_device(pdev);
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next v3 2/2] r8169: Reinstate ASPM Support
From: Kai-Heng Feng @ 2018-06-15  5:32 UTC (permalink / raw)
  To: davem
  Cc: ryankao, hayeswang, hau, hkallweit1, romieu, bhelgaas, acelan.kao,
	netdev, linux-pci, linux-kernel, Kai-Heng Feng
In-Reply-To: <20180615053219.14053-1-kai.heng.feng@canonical.com>

On Intel Coffe Lake platforms, ASPM support in r8169 is the last missing
puzzle to let CPU's Package C-State reaches PC8.
Without ASPM support, the CPU cannot reach beyond PC3. PC8 can save
additional ~3W in comparison with PC3 on my testing platform, Dell G3
3779.

This is based on the work from Chunhao Lin <hau@realtek.com>.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v3:
- Change commit message wording.
- Rename the function to rtl_hw_aspm_clkreq_enable().

v2:
- Remove module parameter.
- Remove pci_disable_link_state().

 drivers/net/ethernet/realtek/r8169.c | 40 +++++++++++++++++++---------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 9b55ce513a36..269ac7561368 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5289,6 +5289,17 @@ static void rtl_pcie_state_l2l3_enable(struct rtl8169_private *tp, bool enable)
 	RTL_W8(tp, Config3, data);
 }
 
+static void rtl_hw_aspm_clkreq_enable(struct rtl8169_private *tp, bool enable)
+{
+	if (enable) {
+		RTL_W8(tp, Config2, RTL_R8(tp, Config2) | ClkReqEn);
+		RTL_W8(tp, Config5, RTL_R8(tp, Config5) | ASPM_en);
+	} else {
+		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
+		RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	}
+}
+
 static void rtl_hw_start_8168bb(struct rtl8169_private *tp)
 {
 	RTL_W8(tp, Config3, RTL_R8(tp, Config3) & ~Beacon_en);
@@ -5645,9 +5656,9 @@ static void rtl_hw_start_8168g_1(struct rtl8169_private *tp)
 	rtl_hw_start_8168g(tp);
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168g_1, ARRAY_SIZE(e_info_8168g_1));
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168g_2(struct rtl8169_private *tp)
@@ -5680,9 +5691,9 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
 	rtl_hw_start_8168g(tp);
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8411_2, ARRAY_SIZE(e_info_8411_2));
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
@@ -5699,8 +5710,7 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168h_1, ARRAY_SIZE(e_info_8168h_1));
 
 	RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
@@ -5779,6 +5789,8 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
 	r8168_mac_ocp_write(tp, 0xe63e, 0x0000);
 	r8168_mac_ocp_write(tp, 0xc094, 0x0000);
 	r8168_mac_ocp_write(tp, 0xc09e, 0x0000);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168ep(struct rtl8169_private *tp)
@@ -5830,11 +5842,12 @@ static void rtl_hw_start_8168ep_1(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168ep_1, ARRAY_SIZE(e_info_8168ep_1));
 
 	rtl_hw_start_8168ep(tp);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168ep_2(struct rtl8169_private *tp)
@@ -5846,14 +5859,15 @@ static void rtl_hw_start_8168ep_2(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168ep_2, ARRAY_SIZE(e_info_8168ep_2));
 
 	rtl_hw_start_8168ep(tp);
 
 	RTL_W8(tp, DLLPR, RTL_R8(tp, DLLPR) & ~PFM_EN);
 	RTL_W8(tp, MISC_1, RTL_R8(tp, MISC_1) & ~PFM_D3COLD_EN);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
@@ -5867,8 +5881,7 @@ static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168ep_3, ARRAY_SIZE(e_info_8168ep_3));
 
 	rtl_hw_start_8168ep(tp);
@@ -5888,6 +5901,8 @@ static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
 	data = r8168_mac_ocp_read(tp, 0xe860);
 	data |= 0x0080;
 	r8168_mac_ocp_write(tp, 0xe860, data);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168(struct rtl8169_private *tp)
@@ -7646,7 +7661,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	mii->reg_num_mask = 0x1f;
 	mii->supports_gmii = cfg->has_gmii;
 
-
 	/* enable device (incl. PCI PM wakeup and hotplug setup) */
 	rc = pcim_enable_device(pdev);
 	if (rc < 0) {
-- 
2.17.0

^ permalink raw reply related

* Re: [bpf PATCH v2 3/6] bpf: sockhash fix omitted bucket lock in sock_close
From: Martin KaFai Lau @ 2018-06-15  5:41 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164457.24994.33710.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:44:57AM -0700, John Fastabend wrote:
> First in tcp_close, reduce scope of sk_callback_lock() the lock is
> only needed for protecting smap_release_sock() the ingress and cork
> lists are protected by sock lock. Having the lock in wider scope is
> harmless but may confuse the reader who may infer it is in fact
> needed.
> 
> Next, in sock_hash_delete_elem() the pattern is as follows,
> 
>   sock_hash_delete_elem()
>      [...]
>      spin_lock(bucket_lock)
>      l = lookup_elem_raw()
>      if (l)
>         hlist_del_rcu()
>         write_lock(sk_callback_lock)
>          .... destroy psock ...
>         write_unlock(sk_callback_lock)
>      spin_unlock(bucket_lock)
> 
> The ordering is necessary because we only know the {p}sock after
> dereferencing the hash table which we can't do unless we have the
> bucket lock held. Once we have the bucket lock and the psock element
> it is deleted from the hashmap to ensure any other path doing a lookup
> will fail. Finally, the refcnt is decremented and if zero the psock
> is destroyed.
> 
> In parallel with the above (or free'ing the map) a tcp close event
> may trigger tcp_close(). Which at the moment omits the bucket lock
> altogether (oops!) where the flow looks like this,
> 
>   bpf_tcp_close()
>      [...]
>      write_lock(sk_callback_lock)
>      for each psock->maps // list of maps this sock is part of
>          hlist_del_rcu(ref_hash_node);
>          .... destroy psock ...
>      write_unlock(sk_callback_lock)
> 
> Obviously, and demonstrated by syzbot, this is broken because
> we can have multiple threads deleting entries via hlist_del_rcu().
> 
> To fix this we might be tempted to wrap the hlist operation in a
> bucket lock but that would create a lock inversion problem. In
> summary to follow locking rules maps needs the sk_callback_lock but we
> need the bucket lock to do the hlist_del_rcu. To resolve the lock
> inversion problem note that when bpf_tcp_close is called no updates
> can happen in parallel, due to ESTABLISH state check in update logic,
> so pop the head of the list repeatedly and remove the reference until
> no more are left. If a delete happens in parallel from the BPF API
> that is OK as well because it will do a similar action, lookup the
> sock in the map/hash, delete it from the map/hash, and dec the refcnt.
> We check for this case before doing a destroy on the psock to ensure
> we don't have two threads tearing down a psock. The new logic is
> as follows,
> 
>   bpf_tcp_close()
>   e = psock_map_pop(psock->maps) // done with sk_callback_lock
>   bucket_lock() // lock hash list bucket
>   l = lookup_elem_raw(head, hash, key, key_size);
>   if (l) {
>      //only get here if elmnt was not already removed
>      hlist_del_rcu()
>      ... destroy psock...
>   }
>   bucket_unlock()
> 
> And finally for all the above to work add missing sk_callback_lock
> around smap_list_remove in sock_hash_ctx_update_elem(). Otherwise
> delete and update may corrupt maps list.
> 
> (As an aside the sk_callback_lock serves two purposes. The
>  first, is to update the sock callbacks sk_data_ready, sk_write_space,
>  etc. The second is to protect the psock 'maps' list. The 'maps' list
>  is used to (as shown above) to delete all map/hash references to a
>  sock when the sock is closed)
> 
> (If we did not have the ESTABLISHED state guarantee from tcp_close
>  then we could not ensure completion because updates could happen
>  forever and pin thread in delete loop.)
> 
> Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index f1ab52d..04764f5 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -258,16 +258,54 @@ static void bpf_tcp_release(struct sock *sk)
>  	rcu_read_unlock();
>  }
>  
> +static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
> +					 u32 hash, void *key, u32 key_size)
> +{
> +	struct htab_elem *l;
> +
> +	hlist_for_each_entry_rcu(l, head, hash_node) {
> +		if (l->hash == hash && !memcmp(&l->key, key, key_size))
> +			return l;
> +	}
> +
> +	return NULL;
> +}
> +
> +static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
> +{
> +	return &htab->buckets[hash & (htab->n_buckets - 1)];
> +}
> +
> +static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
> +{
> +	return &__select_bucket(htab, hash)->head;
> +}
> +
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
>  {
>  	atomic_dec(&htab->count);
>  	kfree_rcu(l, rcu);
>  }
>  
> +struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
> +					   struct smap_psock *psock)
> +{
> +	struct smap_psock_map_entry *e;
> +
> +	write_lock_bh(&sk->sk_callback_lock);
> +	e = list_first_entry_or_null(&psock->maps,
> +				     struct smap_psock_map_entry,
> +				     list);
> +	if (e)
> +		list_del(&e->list);
> +	write_unlock_bh(&sk->sk_callback_lock);
> +	return e;
> +}
> +
>  static void bpf_tcp_close(struct sock *sk, long timeout)
>  {
>  	void (*close_fun)(struct sock *sk, long timeout);
> -	struct smap_psock_map_entry *e, *tmp;
> +	struct smap_psock_map_entry *e;
>  	struct sk_msg_buff *md, *mtmp;
>  	struct smap_psock *psock;
>  	struct sock *osk;
> @@ -286,7 +324,6 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>  	 */
>  	close_fun = psock->save_close;
>  
> -	write_lock_bh(&sk->sk_callback_lock);
>  	if (psock->cork) {
>  		free_start_sg(psock->sock, psock->cork);
>  		kfree(psock->cork);
> @@ -299,20 +336,48 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>  		kfree(md);
>  	}
>  
> -	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
> +	/* Sock is in TCP_CLOSE state so any concurrent adds or updates will be
> +	 * blocked by ESTABLISHED check. However, tcp_close() + delete + free
> +	 * can all run at the same time. If a tcp_close + delete happens each
> +	 * code path will remove the entry for the map/hash before deleting it.
> +	 * In the map case a xchg and then check to verify we have a sk protects
> +	 * two paths from tearing down the same object. For hash map we lock the
> +	 * bucket and remove the object from the hash map before destroying to
> +	 * ensure that only one reference exists. By pulling object off the head
> +	 * of the list with (with sk_callback_lock) if multiple deleters are
> +	 * running we avoid duplicate references.
> +	 */
> +	e = psock_map_pop(sk, psock);
> +	while (e) {
>  		if (e->entry) {
>  			osk = cmpxchg(e->entry, sk, NULL);
>  			if (osk == sk) {
> -				list_del(&e->list);
>  				smap_release_sock(psock, sk);
>  			}
>  		} else {
> -			hlist_del_rcu(&e->hash_link->hash_node);
> -			smap_release_sock(psock, e->hash_link->sk);
> -			free_htab_elem(e->htab, e->hash_link);
> +			struct htab_elem *link = e->hash_link;
> +			struct hlist_head *head;
> +			struct htab_elem *l;
> +			struct bucket *b;
> +
> +			b = __select_bucket(e->htab, link->hash);
> +			head = &b->head;
> +			raw_spin_lock_bh(&b->lock);
> +			l = lookup_elem_raw(head,
> +					    link->hash, link->key,
> +					    e->htab->elem_size);
> +			/* If another thread deleted this object skip deletion.
> +			 * The refcnt on psock may or may not be zero.
> +			 */
> +			if (l) {
> +				hlist_del_rcu(&e->hash_link->hash_node);
> +				smap_release_sock(psock, e->hash_link->sk);
> +				free_htab_elem(e->htab, e->hash_link);
> +			}
> +			raw_spin_unlock_bh(&b->lock);
>  		}
> +		e = psock_map_pop(sk, psock);
>  	}
> -	write_unlock_bh(&sk->sk_callback_lock);
>  	rcu_read_unlock();
>  	close_fun(sk, timeout);
>  }
> @@ -2088,16 +2153,6 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
>  	return ERR_PTR(err);
>  }
>  
> -static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
> -{
> -	return &htab->buckets[hash & (htab->n_buckets - 1)];
> -}
> -
> -static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
> -{
> -	return &__select_bucket(htab, hash)->head;
> -}
> -
>  static void sock_hash_free(struct bpf_map *map)
>  {
>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> @@ -2114,10 +2169,13 @@ static void sock_hash_free(struct bpf_map *map)
>  	 */
>  	rcu_read_lock();
>  	for (i = 0; i < htab->n_buckets; i++) {
> -		struct hlist_head *head = select_bucket(htab, i);
> +		struct bucket *b = __select_bucket(htab, i);
> +		struct hlist_head *head;
>  		struct hlist_node *n;
>  		struct htab_elem *l;
>  
> +		raw_spin_lock_bh(&b->lock);
There is a synchronize_rcu() at the beginning of sock_hash_free().
Taking the bucket's lock of the free-ing map at this point is a bit
suspicious.  What may still access the map after synchronize_rcu()?


> +		head = &b->head;
>  		hlist_for_each_entry_safe(l, n, head, hash_node) {
>  			struct sock *sock = l->sk;
>  			struct smap_psock *psock;
> @@ -2137,6 +2195,7 @@ static void sock_hash_free(struct bpf_map *map)
>  			write_unlock_bh(&sock->sk_callback_lock);
>  			kfree(l);
>  		}
> +		raw_spin_unlock_bh(&b->lock);
>  	}
>  	rcu_read_unlock();
>  	bpf_map_area_free(htab->buckets);
> @@ -2167,19 +2226,6 @@ static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
>  	return l_new;
>  }
>  
> -static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
> -					 u32 hash, void *key, u32 key_size)
> -{
> -	struct htab_elem *l;
> -
> -	hlist_for_each_entry_rcu(l, head, hash_node) {
> -		if (l->hash == hash && !memcmp(&l->key, key, key_size))
> -			return l;
> -	}
> -
> -	return NULL;
> -}
> -
>  static inline u32 htab_map_hash(const void *key, u32 key_len)
>  {
>  	return jhash(key, key_len, 0);
> @@ -2307,8 +2353,10 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  		psock = smap_psock_sk(l_old->sk);
>  
>  		hlist_del_rcu(&l_old->hash_node);
> +		write_lock_bh(&l_old->sk->sk_callback_lock);
>  		smap_list_remove(psock, NULL, l_old);
>  		smap_release_sock(psock, l_old->sk);
> +		write_unlock_bh(&l_old->sk->sk_callback_lock);
>  		free_htab_elem(htab, l_old);
>  	}
>  	raw_spin_unlock_bh(&b->lock);
> 

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Steffen Klassert @ 2018-06-15  6:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev
In-Reply-To: <70e4a28c-ccc3-6fd5-0d43-08ae72b7ad1b@gmail.com>

On Thu, Jun 14, 2018 at 08:57:20AM -0700, Eric Dumazet wrote:
> 
> 
> On 06/14/2018 07:19 AM, Pablo Neira Ayuso wrote:
> > Hi,
> > 
> 
> > We have collected performance numbers:
> > 
> >         TCP TSO         TCP Fast Forward
> >         32.5 Gbps       35.6 Gbps
> > 
> >         UDP             UDP Fast Forward
> >         17.6 Gbps       35.6 Gbps
> > 
> >         ESP             ESP Fast Forward
> >         6 Gbps          7.5 Gbps
> > 
> > For UDP, this is doubling performance, and we almost achieve line rate
> > with one single CPU using the Intel i40e NIC. We got similar numbers
> > with the Mellanox ConnectX-4. For TCP, this is slightly improving things
> > even if TSO is being defeated given that we need to segment the packet
> > chain in software. We would like to explore HW GRO support with hardware
> > vendors with this new mode, we think that should improve the TCP numbers
> > we are showing above even more.
> 
> Hi Pablo
> 
> Not very convincing numbers, because it is unclear what traffic patterns were used.
> 
> We normally use packets per second to measure a forwarding workload,
> and it is not clear if you tried a DDOS, or/and a mix of packets being locally
> delivered and packets being forwarded.

Yes, these number need some more explaination. We used my IPsec
forwarding test setup for this. It looks like this:

	   ------------         ------------
	-->| router 1 |-------->| router 2 |--
	|  ------------         ------------  |
	|                                     |
	|       --------------------          |
	--------|Spirent Testcenter|<----------
	        --------------------

The numbers are from single stream forwarding tests, no local delivery.
Packet size in the UDP case was 1460 byte. I used this packet size
because such packets still fit into the mtu when encapsulated by IPsec.

> 
> Presumably adding cache line misses (to probe for flows) will slow down the things.
> 
> I suspect the NIC you use has some kind of bottleneck on sending TSO packets,
> or that you hit the issue that GRO might cook suboptimal packets for forwarding workloads
> (eg setting frag_list)

That might be, I was a bit surprised about the TCP numbers myself.
I was more focused on UDP and IPsec because these don't have
hardware segmentation support. I've just added a TCP handler to
see what happens, the numbers looked ok, so I kept it.

All this is based on the approach I pesented last year at the nefilter
workshop.

> 
> This path series add yet more code to GRO engine which is already very fat
> to the point many people advocate to turn it off.

We tried to stay away from the generic codepath as much as possible.
Currently we need five 'if' statements, two of them are in error
paths (Patch 4).

> Saving cpu cycles on moderate load is not okay if added complexity
> slows down the DDOS (or stress) by 10 % :/

Why 10%?

> 
> To me, GRO is specialized to optimize the non-forwarding case,
> so it is counter-intuitive to base a fast forwarding path on top of it.

It is optimized for the non-forwarding case, but it seems that forwarding
can benefit from that too with very little cost for the non-forwarding case.

^ permalink raw reply

* Re: [bpf PATCH v2 4/6] bpf: sockmap, tcp_disconnect to listen transition
From: Martin KaFai Lau @ 2018-06-15  6:04 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164502.24994.38682.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:45:02AM -0700, John Fastabend wrote:
> After adding checks to ensure TCP is in ESTABLISHED state when a
> sock is added we need to also ensure that user does not transition
> through tcp_disconnect() and back into ESTABLISHED state without
> sockmap removing the sock.
> 
> To do this add unhash hook and remove sock from map there.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
LGTM. One nit.

Acked-by: Martin KaFai Lau <kafai@fb.com>

> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 04764f5..ffc5152 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -130,6 +130,7 @@ struct smap_psock {
>  
>  	struct proto *sk_proto;
>  	void (*save_close)(struct sock *sk, long timeout);
> +	void (*save_unhash)(struct sock *sk);
>  	void (*save_data_ready)(struct sock *sk);
>  	void (*save_write_space)(struct sock *sk);
>  };
> @@ -141,6 +142,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>  			    int offset, size_t size, int flags);
>  static void bpf_tcp_close(struct sock *sk, long timeout);
> +static void bpf_tcp_unhash(struct sock *sk);
>  
>  static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
>  {
> @@ -182,6 +184,7 @@ static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
>  {
>  	prot[SOCKMAP_BASE]			= *base;
>  	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
> +	prot[SOCKMAP_BASE].unhash		= bpf_tcp_unhash;
>  	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
>  	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
>  
> @@ -215,6 +218,7 @@ static int bpf_tcp_init(struct sock *sk)
>  	}
>  
>  	psock->save_close = sk->sk_prot->close;
> +	psock->save_unhash = sk->sk_prot->unhash;
>  	psock->sk_proto = sk->sk_prot;
>  
>  	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
> @@ -302,28 +306,12 @@ struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
>  	return e;
>  }
>  
> -static void bpf_tcp_close(struct sock *sk, long timeout)
> +static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
>  {
> -	void (*close_fun)(struct sock *sk, long timeout);
>  	struct smap_psock_map_entry *e;
>  	struct sk_msg_buff *md, *mtmp;
> -	struct smap_psock *psock;
>  	struct sock *osk;
>  
> -	rcu_read_lock();
> -	psock = smap_psock_sk(sk);
> -	if (unlikely(!psock)) {
> -		rcu_read_unlock();
> -		return sk->sk_prot->close(sk, timeout);
> -	}
> -
> -	/* The psock may be destroyed anytime after exiting the RCU critial
> -	 * section so by the time we use close_fun the psock may no longer
> -	 * be valid. However, bpf_tcp_close is called with the sock lock
> -	 * held so the close hook and sk are still valid.
> -	 */
> -	close_fun = psock->save_close;
> -
>  	if (psock->cork) {
>  		free_start_sg(psock->sock, psock->cork);
>  		kfree(psock->cork);
> @@ -378,6 +366,51 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>  		}
>  		e = psock_map_pop(sk, psock);
>  	}
> +}
> +
> +static void bpf_tcp_unhash(struct sock *sk)
> +{
> +	void (*unhash_fun)(struct sock *sk);
> +	struct smap_psock *psock;
> +
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock)) {
> +		rcu_read_unlock();
> +		return sk->sk_prot->unhash(sk);
> +	}
> +
> +	/* The psock may be destroyed anytime after exiting the RCU critial
> +	 * section so by the time we use close_fun the psock may no longer
> +	 * be valid. However, bpf_tcp_close is called with the sock lock
> +	 * held so the close hook and sk are still valid.
> +	 */
Nit. s/close/unhash/

> +	unhash_fun = psock->save_unhash;
> +	bpf_tcp_remove(sk, psock);
> +	rcu_read_unlock();
> +	unhash_fun(sk);
> +
> +}
> +
> +static void bpf_tcp_close(struct sock *sk, long timeout)
> +{
> +	void (*close_fun)(struct sock *sk, long timeout);
> +	struct smap_psock *psock;
> +
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock)) {
> +		rcu_read_unlock();
> +		return sk->sk_prot->close(sk, timeout);
> +	}
> +
> +	/* The psock may be destroyed anytime after exiting the RCU critial
> +	 * section so by the time we use close_fun the psock may no longer
> +	 * be valid. However, bpf_tcp_close is called with the sock lock
> +	 * held so the close hook and sk are still valid.
> +	 */
> +	close_fun = psock->save_close;
> +	bpf_tcp_remove(sk, psock);
>  	rcu_read_unlock();
>  	close_fun(sk, timeout);
>  }
> 

^ permalink raw reply

* Re: [bpf PATCH v2 5/6] bpf: sockhash, add release routine
From: Martin KaFai Lau @ 2018-06-15  6:05 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164507.24994.83616.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:45:07AM -0700, John Fastabend wrote:
> Add map_release_uref pointer to hashmap ops. This was dropped when
> original sockhash code was ported into bpf-next before initial
> commit.
> 
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>

> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index ffc5152..77fe204 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -2518,6 +2518,7 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
>  	.map_get_next_key = sock_hash_get_next_key,
>  	.map_update_elem = sock_hash_update_elem,
>  	.map_delete_elem = sock_hash_delete_elem,
> +	.map_release_uref = sock_map_release,
>  };
>  
>  static bool bpf_is_valid_sock(struct bpf_sock_ops_kern *ops)
> 

^ permalink raw reply

* Re: [bpf PATCH v2 6/6] bpf: selftest remove attempts to add LISTEN sockets to sockmap
From: Martin KaFai Lau @ 2018-06-15  6:07 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180614164512.24994.56879.stgit@john-Precision-Tower-5810>

On Thu, Jun 14, 2018 at 09:45:12AM -0700, John Fastabend wrote:
> In selftest test_maps the sockmap test case attempts to add a socket
> in listening state to the sockmap. This is no longer a valid operation
> so it fails as expected. However, the test wrongly reports this as an
> error now. Fix the test to avoid adding sockets in listening state.
> 
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>

> ---
>  0 files changed
> 
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index 6c25334..9fed5f0 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -564,7 +564,7 @@ static void test_sockmap(int tasks, void *data)
>  	}
>  
>  	/* Test update without programs */
> -	for (i = 0; i < 6; i++) {
> +	for (i = 2; i < 6; i++) {
>  		err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
>  		if (err) {
>  			printf("Failed noprog update sockmap '%i:%i'\n",
> @@ -727,7 +727,7 @@ static void test_sockmap(int tasks, void *data)
>  	}
>  
>  	/* Test map update elem afterwards fd lives in fd and map_fd */
> -	for (i = 0; i < 6; i++) {
> +	for (i = 2; i < 6; i++) {
>  		err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
>  		if (err) {
>  			printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
> 

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Steffen Klassert @ 2018-06-15  6:17 UTC (permalink / raw)
  To: David Miller; +Cc: pablo, netfilter-devel, netdev
In-Reply-To: <20180614.101831.465275975690050595.davem@davemloft.net>

On Thu, Jun 14, 2018 at 10:18:31AM -0700, David Miller wrote:
> From: Pablo Neira Ayuso <pablo@netfilter.org>
> Date: Thu, 14 Jun 2018 16:19:34 +0200
> 
> > This patchset proposes a new fast forwarding path infrastructure
> > that combines the GRO/GSO and the flowtable infrastructures. The
> > idea is to add a hook at the GRO layer that is invoked before the
> > standard GRO protocol offloads. This allows us to build custom
> > packet chains that we can quickly pass in one go to the neighbour
> > layer to define fast forwarding path for flows.
> 
> We have full, complete, customizability of the packet path via XDP
> and eBPF.
> 
> XDP and eBPF supports everything necessary to accomplish that,
> there are implementations of forwarding implementations in
> the tree and elsewhere.
> 
> And most importantly, XDP and eBPF are optimized in drivers and
> offloaded to hardware.
> 
> There really is no need for something like what you are proposing.

I started with this last year because I wanted to improve
the IPsec (and UDP) forwarding path. Batching packets
at layer2  and send them directly to the output path
seemed to be a good method to improve this.

In particular, we need to do only one IPsec lookup
for the whole packet chain. So it relaxes the pain
from reomoving the IPsec flowcache a bit. It can be
only a first step, but we need some improvements here
as people start to complain about that.

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Steffen Klassert @ 2018-06-15  6:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Pablo Neira Ayuso, netfilter-devel,
	Linux Kernel Network Developers
In-Reply-To: <CALx6S345B2-sejH52beewdiHuGo5nAY0v0uFmNuiJ0hQ6Wu3xw@mail.gmail.com>

On Thu, Jun 14, 2018 at 01:52:03PM -0700, Tom Herbert wrote:
> On Thu, Jun 14, 2018 at 7:19 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Hi,
> >
> > This patchset proposes a new fast forwarding path infrastructure that
> > combines the GRO/GSO and the flowtable infrastructures. The idea is to
> > add a hook at the GRO layer that is invoked before the standard GRO
> > protocol offloads. This allows us to build custom packet chains that we
> > can quickly pass in one go to the neighbour layer to define fast
> > forwarding path for flows.
> >
> > For each packet that gets into the GRO layer, we first check if there is
> > an entry in the flowtable, if so, the packet is placed in a list until
> > the GRO infrastructure decides to send the batch from gro_complete to
> > the neighbour layer. The first packet in the list takes the route from
> > the flowtable entry, so we avoid reiterative routing lookups.
> >
> > In case no entry is found in the flowtable, the packet is passed up to
> > the classic GRO offload handlers. Thus, this packet follows the standard
> > forwarding path. Note that the initial packets of the flow always go
> > through the standard IPv4/IPv6 netfilter forward hook, that is used to
> > configure what flows are placed in the flowtable. Therefore, only a few
> > (initial) packets follow the standard forwarding path while most of the
> > follow up packets take this new fast forwarding path.
> >
> 
> IIRC, there was a similar proposal a while back that want to bundle
> packets of the same flow together (without doing GRO) so that they
> could be processed by various functions by looking at just one
> representative packet in the group. The concept had some promise, but
> in the end it created quite a bit of complexity since at some point
> the packet bundle needed to be undone to go back to processing the
> individual packets.

With the way we chain the packets it is not too complicated to
undo this chaining (nft_skb_segment in patch 5 implements this).
After that, this looks like a chain of usual segments, so we
trigger xmit_more with every packet chain.

^ permalink raw reply

* Re: [PATCH net-next,RFC 00/13] New fast forwarding path
From: Steffen Klassert @ 2018-06-15  6:34 UTC (permalink / raw)
  To: David Miller; +Cc: tom, pablo, netfilter-devel, netdev
In-Reply-To: <20180614.165834.338565136334574983.davem@davemloft.net>

On Thu, Jun 14, 2018 at 04:58:34PM -0700, David Miller wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Thu, 14 Jun 2018 13:52:03 -0700
> 
> > IIRC, there was a similar proposal a while back that want to bundle
> > packets of the same flow together (without doing GRO) so that they
> > could be processed by various functions by looking at just one
> > representative packet in the group. The concept had some promise, but
> > in the end it created quite a bit of complexity since at some point
> > the packet bundle needed to be undone to go back to processing the
> > individual packets.
> 
> You're probably talking about Edward Cree's SKB list stuff, and as
> per his presenation at netconf 2 weeks ago he plans to revitalize
> it given how Spectre et al. gives cause to reevaluate all bulking
> techniques.

Are there patches for the proposal Edward did a while ago,
or was it just a concept?

Maybe we can somehow put things together, I just need some
batching method that works for IPsec and UDP. It does not
need to be exactly the one we proposing here.

^ permalink raw reply

* Re: [PATCH] selftests: bpf: config: add config fragments
From: Anders Roxell @ 2018-06-15  6:41 UTC (permalink / raw)
  To: William Tu
  Cc: Daniel Borkmann, Alexei Starovoitov, Shuah Khan, Networking,
	Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK
In-Reply-To: <CALDO+SZXvnUU7VKd_UwnBCU0NiTHjqei4xEUCdcktkKCD3MnEg@mail.gmail.com>

On Thu, 14 Jun 2018 at 14:09, William Tu <u9012063@gmail.com> wrote:
>
> On Thu, Jun 14, 2018 at 4:42 AM, Anders Roxell <anders.roxell@linaro.org> wrote:
> > On 14 June 2018 at 13:06, William Tu <u9012063@gmail.com> wrote:
> >> On Tue, Jun 12, 2018 at 5:08 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>> On 06/12/2018 01:05 PM, Anders Roxell wrote:
> >>>> Tests test_tunnel.sh fails due to config fragments ins't enabled.
> >>>>
> >>>> Fixes: 933a741e3b82 ("selftests/bpf: bpf tunnel test.")
> >>>> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
> >>>> ---
> >>>>
> >>>> All tests passes except ip6gretap that still fails. I'm unsure why.
> >>>> Ideas?
> >>
> >> Hi Anders,
> >>
> >> ip6erspan is based on ip6gretap, does ip6erspan pass?
> >
> > it did pass  when I was sending the email.
> > However, I retested this on next-20180613 and now it fails.
> >
> Does 'ip -s link show' show any errors/dropped on ip6gretap device?

I rerun the test_ip6gretap test only and added "set -x" to
test_tunnel.sh here's the output.
I added "ip -s link show ip6gretap11" before the cleanup function in the script.

# ./test_tunnel.sh
+ PING_ARG='-c 3 -w 10 -q'
+ ret=0
+ GREEN='\033[0;92m'
+ RED='\033[0;31m'
+ NC='\033[0m'
+ trap cleanup 0 3 6
+ trap cleanup_exit 2 9
+ cleanup
+ ip netns delete at_ns0
+ ip link del veth1
+ ip link del ipip11
+ ip link del ipip6tnl11
+ ip link del gretap11
+ ip link del ip6gre11
+ ip link del ip6gretap11
+ ip link del vxlan11
+ ip link del ip6vxlan11
+ ip link del geneve11
+ ip link del ip6geneve11
+ ip link del erspan11
+ ip link del ip6erspan11
+ bpf_tunnel_test
+ echo 'Testing IP6GRETAP tunnel...'
Testing IP6GRETAP tunnel...
+ test_ip6gretap
+ TYPE=ip6gretap
+ DEV_NS=ip6gretap00
+ DEV=ip6gretap11
+ ret=0
+ check ip6gretap
+ ip link help ip6gretap
+ grep -q '^Usage:'
+ '[' 0 -ne 0 ']'
+ config_device
+ ip netns add at_ns0
+ ip link add veth0 type veth peer name veth1
+ ip link set veth0 netns at_ns0
+ ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0
+ ip netns exec at_ns0 ip link set dev veth0 up
+ ip link set dev veth1 up mtu 1500
+ ip addr add dev veth1 172.16.1.200/24
+ add_ip6gretap_tunnel
+ ip netns exec at_ns0 ip addr add ::11/96 dev veth0
+ ip netns exec at_ns0 ip link set dev veth0 up
+ ip addr add dev veth1 ::22/96
+ ip link set dev veth1 up
+ ip netns exec at_ns0 ip link add dev ip6gretap00 type ip6gretap seq
flowlabel 0xbcdef key 2 local ::11 remote ::22
+ ip netns exec at_ns0 ip addr add dev ip6gretap00 10.1.1.100/24
+ ip netns exec at_ns0 ip addr add dev ip6gretap00 fc80::100/96
+ ip netns exec at_ns0 ip link set dev ip6gretap00 up
+ ip link add dev ip6gretap11 type ip6gretap external
+ ip addr add dev ip6gretap11 10.1.1.200/24
+ ip addr add dev ip6gretap11 fc80::200/24
+ ip link set dev ip6gretap11 up
+ attach_bpf ip6gretap11 ip6gretap_set_tunnel ip6gretap_get_tunnel
+ DEV=ip6gretap11
+ SET=ip6gretap_set_tunnel
+ GET=ip6gretap_get_tunnel
+ tc qdisc add dev ip6gretap11 clsact
+ tc filter add dev ip6gretap11 egress bpf da obj test_tunnel_kern.o
sec ip6gretap_set_tunnel
+ tc filter add dev ip6gretap11 ingress bpf da obj test_tunnel_kern.o
sec ip6gretap_get_tunnel
+ ping6 -c 3 -w 10 -q ::11
PING ::11 (::11): 56 data bytes

--- ::11 ping statistics ---
5 packets transmitted, 3 packets received, 40% packet loss
round-trip min/avg/max = 0.139/1.857/5.293 ms
+ ip netns exec at_ns0 ping -c 3 -w 10 -q 10.1.1.200
PING 10.1.1.200 (10.1.1.200): 56 data bytes

--- 10.1.1.200 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.214/0.256/0.305 ms
+ ping -c 3 -w 10 -q 10.1.1.100
PING 10.1.1.100 (10.1.1.100): 56 data bytes

--- 10.1.1.100 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.210/0.211/0.213 ms
+ check_err 0
+ '[' 0 -eq 0 ']'
+ ret=0
+ ip netns exec at_ns0 ping6 -c 3 -w 10 -q fc80::200
PING fc80::200 (fc80::200): 56 data bytes

--- fc80::200 ping statistics ---
10 packets transmitted, 0 packets received, 100% packet loss
+ check_err 1
+ '[' 0 -eq 0 ']'
+ ret=1
+ ip -s link show ip6gretap11
19: ip6gretap11@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1434 qdisc
pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether de:d2:0c:53:80:8c brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    2096       25       0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    5324       36       5       5       0       0
+ cleanup
+ ip netns delete at_ns0
+ ip link del veth1
+ ip link del ipip11
+ ip link del ipip6tnl11
+ ip link del gretap11
+ ip link del ip6gre11
+ ip link del ip6gretap11
+ ip link del vxlan11
+ ip link del ip6vxlan11
+ ip link del geneve11
+ ip link del ip6geneve11
+ ip link del erspan11
+ ip link del ip6erspan11
+ '[' 1 -ne 0 ']'
+ echo -e '\033[0;31mFAIL: ip6gretap\033[0m'
FAIL: ip6gretap
+ return 1
+ exit 0
+ cleanup
+ ip netns delete at_ns0
+ ip link del veth1
+ ip link del ipip11
+ ip link del ipip6tnl11
+ ip link del gretap11
+ ip link del ip6gre11
+ ip link del ip6gretap11
+ ip link del vxlan11
+ ip link del ip6vxlan11
+ ip link del geneve11
+ ip link del ip6geneve11
+ ip link del erspan11
+ ip link del ip6erspan11

^ permalink raw reply

* [PATCH RFC ipsec-next] xfrm: Extend the output_mark to support input direction and masking.
From: Steffen Klassert @ 2018-06-15  6:55 UTC (permalink / raw)
  To: netdev; +Cc: Tobias Brunner, Eyal Birger, Lorenzo Colitti

We already support setting an output mark at the xfrm_state,
unfortunately this does not support the input direction and
masking the marks that will be applied to the skb. This change
adds support applying a masked value in both directions.

The existing XFRMA_OUTPUT_MARK number is reused for this purpose
and as it is now bi-directional, it is renamed to XFRMA_SET_MARK.

An additional XFRMA_SET_MARK_MASK attribute is added for setting the
mask. If the attribute mask not provided, it is set to 0xffffffff,
keeping the XFRMA_OUTPUT_MARK existing 'full mask' semantics.

Co-developed-by: Tobias Brunner <tobias@strongswan.org>
Co-developed-by: Eyal Birger <eyal.birger@gmail.com>
Co-developed-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 include/net/xfrm.h        |  9 ++++++++-
 include/uapi/linux/xfrm.h |  4 +++-
 net/xfrm/xfrm_input.c     |  2 ++
 net/xfrm/xfrm_output.c    |  3 +--
 net/xfrm/xfrm_policy.c    |  5 +++--
 net/xfrm/xfrm_user.c      | 48 +++++++++++++++++++++++++++++++++++++----------
 6 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 45e75c36b738..8727b2484855 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -166,7 +166,7 @@ struct xfrm_state {
 		int		header_len;
 		int		trailer_len;
 		u32		extra_flags;
-		u32		output_mark;
+		struct xfrm_mark	smark;
 	} props;
 
 	struct xfrm_lifetime_cfg lft;
@@ -2012,6 +2012,13 @@ static inline int xfrm_mark_put(struct sk_buff *skb, const struct xfrm_mark *m)
 	return ret;
 }
 
+static inline __u32 xfrm_smark_get(__u32 mark, struct xfrm_state *x)
+{
+	struct xfrm_mark *m = &x->props.smark;
+
+	return (m->v & m->m) | (mark & ~m->m);
+}
+
 static inline int xfrm_tunnel_check(struct sk_buff *skb, struct xfrm_state *x,
 				    unsigned int family)
 {
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index e3af2859188b..5a6ed7ce5a29 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -305,9 +305,11 @@ enum xfrm_attr_type_t {
 	XFRMA_ADDRESS_FILTER,	/* struct xfrm_address_filter */
 	XFRMA_PAD,
 	XFRMA_OFFLOAD_DEV,	/* struct xfrm_state_offload */
-	XFRMA_OUTPUT_MARK,	/* __u32 */
+	XFRMA_SET_MARK,		/* __u32 */
+	XFRMA_SET_MARK_MASK,	/* __u32 */
 	__XFRMA_MAX
 
+#define XFRMA_OUTPUT_MARK XFRMA_SET_MARK	/* Compatibility */
 #define XFRMA_MAX (__XFRMA_MAX - 1)
 };
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 352abca2605f..074810436242 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -339,6 +339,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 			goto drop;
 		}
 
+		skb->mark = xfrm_smark_get(skb->mark, x);
+
 		skb->sp->xvec[skb->sp->len++] = x;
 
 lock:
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 89b178a78dc7..45ba07ab3e4f 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -66,8 +66,7 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 			goto error_nolock;
 		}
 
-		if (x->props.output_mark)
-			skb->mark = x->props.output_mark;
+		skb->mark = xfrm_smark_get(skb->mark, x);
 
 		err = x->outer_mode->output(x, skb);
 		if (err) {
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..f95f5f75748c 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1607,10 +1607,11 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
 		dst_copy_metrics(dst1, dst);
 
 		if (xfrm[i]->props.mode != XFRM_MODE_TRANSPORT) {
+			__u32 mark = xfrm_smark_get(fl->flowi_mark, xfrm[i]);
+
 			family = xfrm[i]->props.family;
 			dst = xfrm_dst_lookup(xfrm[i], tos, fl->flowi_oif,
-					      &saddr, &daddr, family,
-					      xfrm[i]->props.output_mark);
+					      &saddr, &daddr, family, mark);
 			err = PTR_ERR(dst);
 			if (IS_ERR(dst))
 				goto put_states;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 080035f056d9..9602cc9e05ab 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -527,6 +527,19 @@ static void xfrm_update_ae_params(struct xfrm_state *x, struct nlattr **attrs,
 		x->replay_maxdiff = nla_get_u32(rt);
 }
 
+static void xfrm_smark_init(struct nlattr **attrs, struct xfrm_mark *m)
+{
+	if (attrs[XFRMA_SET_MARK]) {
+		m->v = nla_get_u32(attrs[XFRMA_SET_MARK]);
+		if (attrs[XFRMA_SET_MARK_MASK])
+			m->m = nla_get_u32(attrs[XFRMA_SET_MARK_MASK]);
+		else
+			m->m = 0xffffffff;
+	} else {
+		m->v = m->m = 0;
+	}
+}
+
 static struct xfrm_state *xfrm_state_construct(struct net *net,
 					       struct xfrm_usersa_info *p,
 					       struct nlattr **attrs,
@@ -579,8 +592,7 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
 
 	xfrm_mark_get(attrs, &x->mark);
 
-	if (attrs[XFRMA_OUTPUT_MARK])
-		x->props.output_mark = nla_get_u32(attrs[XFRMA_OUTPUT_MARK]);
+	xfrm_smark_init(attrs, &x->props.smark);
 
 	err = __xfrm_init_state(x, false, attrs[XFRMA_OFFLOAD_DEV]);
 	if (err)
@@ -824,6 +836,18 @@ static int copy_to_user_auth(struct xfrm_algo_auth *auth, struct sk_buff *skb)
 	return 0;
 }
 
+static int xfrm_smark_put(struct sk_buff *skb, struct xfrm_mark *m)
+{
+	int ret = 0;
+
+	if (m->v | m->m) {
+		ret = nla_put_u32(skb, XFRMA_SET_MARK, m->v);
+		if (!ret)
+			ret = nla_put_u32(skb, XFRMA_SET_MARK_MASK, m->m);
+	}
+	return ret;
+}
+
 /* Don't change this without updating xfrm_sa_len! */
 static int copy_to_user_state_extra(struct xfrm_state *x,
 				    struct xfrm_usersa_info *p,
@@ -887,6 +911,11 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
 	ret = xfrm_mark_put(skb, &x->mark);
 	if (ret)
 		goto out;
+
+	ret = xfrm_smark_put(skb, &x->props.smark);
+	if (ret)
+		goto out;
+
 	if (x->replay_esn)
 		ret = nla_put(skb, XFRMA_REPLAY_ESN_VAL,
 			      xfrm_replay_state_esn_len(x->replay_esn),
@@ -900,11 +929,7 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
 		ret = copy_user_offload(&x->xso, skb);
 	if (ret)
 		goto out;
-	if (x->props.output_mark) {
-		ret = nla_put_u32(skb, XFRMA_OUTPUT_MARK, x->props.output_mark);
-		if (ret)
-			goto out;
-	}
+
 	if (x->security)
 		ret = copy_sec_ctx(x->security, skb);
 out:
@@ -2493,7 +2518,8 @@ static const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
 	[XFRMA_PROTO]		= { .type = NLA_U8 },
 	[XFRMA_ADDRESS_FILTER]	= { .len = sizeof(struct xfrm_address_filter) },
 	[XFRMA_OFFLOAD_DEV]	= { .len = sizeof(struct xfrm_user_offload) },
-	[XFRMA_OUTPUT_MARK]	= { .type = NLA_U32 },
+	[XFRMA_SET_MARK]	= { .type = NLA_U32 },
+	[XFRMA_SET_MARK_MASK]	= { .type = NLA_U32 },
 };
 
 static const struct nla_policy xfrma_spd_policy[XFRMA_SPD_MAX+1] = {
@@ -2719,8 +2745,10 @@ static inline unsigned int xfrm_sa_len(struct xfrm_state *x)
 		l += nla_total_size(sizeof(x->props.extra_flags));
 	if (x->xso.dev)
 		 l += nla_total_size(sizeof(x->xso));
-	if (x->props.output_mark)
-		l += nla_total_size(sizeof(x->props.output_mark));
+	if (x->props.smark.v | x->props.smark.m) {
+		l += nla_total_size(sizeof(x->props.smark.v));
+		l += nla_total_size(sizeof(x->props.smark.m));
+	}
 
 	/* Must count x->lastused as it may become non-zero behind our back. */
 	l += nla_total_size_64bit(sizeof(u64));
-- 
2.14.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox