Netdev List

* Re: [pci PATCH v6 4/5] nvme: Migrate over to unmanaged SR-IOV support
From: Christoph Hellwig @ 2018-03-14  8:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, alexander.h.duyck, linux-pci, virtio-dev, kvm, netdev,
	dan.daly, linux-kernel, linux-nvme, keith.busch, netanel, ddutile,
	mheyne, liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180313213034.3553.47677.stgit@localhost.localdomain>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [pci PATCH v6 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Christoph Hellwig @ 2018-03-14  8:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, alexander.h.duyck, linux-pci, virtio-dev, kvm, netdev,
	dan.daly, linux-kernel, linux-nvme, keith.busch, netanel, ddutile,
	mheyne, liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180313212855.3553.97762.stgit@localhost.localdomain>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [pci PATCH v6 1/5] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
From: Christoph Hellwig @ 2018-03-14  8:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, alexander.h.duyck, linux-pci, virtio-dev, kvm, netdev,
	dan.daly, linux-kernel, linux-nvme, keith.busch, netanel, ddutile,
	mheyne, liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180313212754.3553.72176.stgit@localhost.localdomain>

On Tue, Mar 13, 2018 at 02:28:49PM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This patch adds a common configuration function called
> pci_sriov_configure_simple that will allow for managing VFs on devices
> where the PF is not capable of managing VF resources.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
> v5: New patch replacing pci_sriov_configure_unmanaged with
>       pci_sriov_configure_simple
>     Dropped bits related to autoprobe changes
> v6: Defined pci_sriov_configure_simple as NULL if IOV is disabled
> 
>  drivers/pci/iov.c   |   32 ++++++++++++++++++++++++++++++++
>  include/linux/pci.h |    3 +++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 677924ae0350..bd7021491fdb 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -807,3 +807,35 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  	return dev->sriov->total_VFs;
>  }
>  EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
> +
> +/**
> + * pci_sriov_configure_simple - helper to configure unmanaged SR-IOV
> + * @dev: the PCI device
> + * @nr_virtfn: number of virtual functions to enable, 0 to disable
> + *
> + * Used to provide generic enable/disable SR-IOV option for devices
> + * that do not manage the VFs generated by their driver
> + */
> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
> +{
> +	int err = -EINVAL;

This initial assignment looks like it is never used.

> +
> +	might_sleep();
> +
> +	if (!dev->is_physfn)
> +		return -ENODEV;
> +
> +	if (pci_vfs_assigned(dev)) {
> +		pci_warn(dev,
> +			 "Cannot modify SR-IOV while VFs are assigned\n");
> +		err = -EPERM;

Why not:

	if (pci_vfs_assigned(dev)) {
		pci_warn(dev,
			 "Cannot modify SR-IOV while VFs are assigned\n");
		return -EPERM;
	}

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
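
The early-return shape suggested above can be sketched in userspace as
follows. This is a minimal model under stated assumptions, not the kernel
code: struct fake_pf and sriov_configure_simple_sketch() are hypothetical
names, and everything after the -EPERM check is illustrative only, since
the quoted hunk ends before that point.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct pci_dev; only what the sketch needs. */
struct fake_pf {
	bool is_physfn;     /* true if the device is a physical function */
	int  vfs_assigned;  /* number of VFs currently assigned to guests */
	int  num_vfs;       /* number of VFs currently enabled */
};

/* Returns the number of VFs enabled (0 on disable) or a negative errno. */
int sriov_configure_simple_sketch(struct fake_pf *dev, int nr_virtfn)
{
	assert(dev != NULL);

	if (!dev->is_physfn)
		return -ENODEV;

	if (dev->vfs_assigned > 0) {
		fprintf(stderr, "Cannot modify SR-IOV while VFs are assigned\n");
		return -EPERM;	/* return directly; no err variable needed */
	}

	/* Assumption: the remainder is illustrative, not taken from the patch. */
	if (nr_virtfn < 0)
		return -EINVAL;

	dev->num_vfs = nr_virtfn;
	return nr_virtfn;
}
```

With every failure path returning immediately, the function needs no
pre-initialised error variable at all, which is the point of both review
comments.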

* Re: [PATCH] pktgen: Fix memory leak in pktgen_if_write
From: Arnd Bergmann @ 2018-03-14  8:40 UTC (permalink / raw)
  To: Gustavo A. R. Silva
  Cc: David S. Miller, Wang Jian, Networking, Linux Kernel Mailing List
In-Reply-To: <20180314080727.GA17319@embeddedgus>

On Wed, Mar 14, 2018 at 9:07 AM, Gustavo A. R. Silva
<gustavo@embeddedor.com> wrote:
> _buf_ is an array and the one that must be freed is _tp_ instead.
>
> Fixes: a870a02cc963 ("pktgen: use dynamic allocation for debug print buffer")
> Reported-by: Wang Jian <jianjian.wang1@gmail.com>
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Acked-by: Arnd Bergmann <arnd@arndb.de>

Thanks for fixing up my mistake so quickly, and thanks to Wang for the report.

I was about to send the same patch, but you got there first.

* Re: [PATCH] hv_netvsc: Make sure out channel is fully opened on send
From: Dan Carpenter @ 2018-03-14  8:27 UTC (permalink / raw)
  To: Mohammed Gamal
  Cc: otubo, sthemmin, netdev, linux-kernel, devel, vkuznets, davem
In-Reply-To: <1520968010-20733-1-git-send-email-mgamal@redhat.com>

On Tue, Mar 13, 2018 at 08:06:50PM +0100, Mohammed Gamal wrote:
> @@ -791,6 +791,7 @@ static inline int netvsc_send_pkt(
>  				       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
>  	}
>  
> +	ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
>  	if (ret == 0) {
>  		atomic_inc_return(&nvchan->queue_sends);
>  

Could you move the assignment inside the "ret == 0" path closer to where
it's used?

regards,
dan carpenter

* Re: [PATCH] pktgen: use dynamic allocation for debug print buffer
From: Gustavo A. R. Silva @ 2018-03-14  8:16 UTC (permalink / raw)
  To: Wang Jian, David Miller
  Cc: arnd, dima, johannes.berg, edumazet, netdev, linux-kernel
In-Reply-To: <CAP4sYWURRroc0H7vDcMdmi2cyuKxHfcj12X0zqWjw1vcoru06A@mail.gmail.com>

Arnd:

Thanks for the fix.

On 03/13/2018 10:02 PM, Wang Jian wrote:
>>> +  kfree(buf);
> free tp? buf is an array.
> 

Wang:

Thanks for the report. I already sent a patch to fix this: 
https://patchwork.kernel.org/patch/10281587/

--
Gustavo

> On Wed, Mar 14, 2018 at 8:25 AM, David Miller <davem@davemloft.net> wrote:
>> From: Arnd Bergmann <arnd@arndb.de>
>> Date: Tue, 13 Mar 2018 21:58:39 +0100
>>
>>> After the removal of the VLA, we get a harmless warning about a large
>>> stack frame:
>>>
>>> net/core/pktgen.c: In function 'pktgen_if_write':
>>> net/core/pktgen.c:1710:1: error: the frame size of 1076 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
>>>
>>> The function was previously shown to be safe despite hitting
>>> the 1024 byte warning level. To get rid of the annoying warning,
>>> while keeping it readable, this changes it to use strndup_user().
>>>
>>> Obviously this is not a fast path, so the kmalloc() overhead
>>> can be disregarded.
>>>
>>> Fixes: 35951393bbff ("pktgen: Remove VLA usage")
>>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>>
>> Applied, thanks.

* Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]
From: Marco Berizzi @ 2018-03-14  8:10 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpWBo2A4dNpPmocgKd-cHm=gHR_hvS=pqmtEm=EhvcfLRQ@mail.gmail.com>

> On 9 March 2018 at 0:14, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> 
> 
> On Thu, Mar 8, 2018 at 8:02 AM, Marco Berizzi <pupilla@libero.it> wrote:
> >> Marco Berizzi wrote:
> >>
> >>
> >> Hello everyone,
> >>
> >> Yesterday I got this error on a slackware linux 4.16-rc4 system
> >> running as a traffic shaping gateway and netfilter nat.
> >> The error arose after a partial ISP network outage,
> >> so unfortunately it will not be trivial for me to reproduce it again.
> >
> > Hello everyone,
> >
> > I'm getting this error twice/day, so fortunately I'm able to
> > reproduce it.
> 
> IIRC, there was a patch for this, but it got lost...
> 
> I will take a look anyway.

OK, thanks for the response. Let me know when a patch is available to
test.

* [PATCH] pktgen: Fix memory leak in pktgen_if_write
From: Gustavo A. R. Silva @ 2018-03-14  8:07 UTC (permalink / raw)
  To: David S. Miller, Arnd Bergmann, Wang Jian
  Cc: netdev, linux-kernel, Gustavo A. R. Silva

_buf_ is an array and the one that must be freed is _tp_ instead.

Fixes: a870a02cc963 ("pktgen: use dynamic allocation for debug print buffer")
Reported-by: Wang Jian <jianjian.wang1@gmail.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 net/core/pktgen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fd65761..545cf08 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -913,7 +913,7 @@ static ssize_t pktgen_if_write(struct file *file,
 			return PTR_ERR(tp);
 
 		pr_debug("%s,%zu  buffer -:%s:-\n", name, count, tp);
-		kfree(buf);
+		kfree(tp);
 	}
 
 	if (!strcmp(name, "min_pkt_size")) {
-- 
2.7.4
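
The bug class fixed here can be illustrated with a small userspace
sketch. The names dup_bounded() and log_input() are hypothetical, and
dup_bounded() is only a stand-in for the role strndup_user() plays in the
kernel. The point is that the automatic array must never be passed to
free(), while the heap copy must be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strndup_user(): heap copy of at most max bytes. */
char *dup_bounded(const char *src, size_t max)
{
	size_t len = 0;
	char *copy;

	while (len < max && src[len] != '\0')
		len++;

	copy = malloc(len + 1);
	if (!copy)
		return NULL;
	memcpy(copy, src, len);
	copy[len] = '\0';
	return copy;
}

/* Logs the input; returns the number of bytes logged, or -1 on failure. */
int log_input(const char *input, size_t count)
{
	char buf[32];	/* automatic storage: must never be freed */
	char *tp;	/* heap copy: the only pointer to free */
	int n;

	assert(input != NULL);

	snprintf(buf, sizeof(buf), "%zu bytes", count);
	tp = dup_bounded(input, count);
	if (!tp)
		return -1;

	n = (int)strlen(tp);
	fprintf(stderr, "%s  buffer -:%s:-\n", buf, tp);
	free(tp);	/* free(buf) here would be undefined behaviour */
	return n;
}
```

In this sketch, freeing buf instead of tp would both leak the heap copy
and hand a stack address to the allocator.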

* [v4.15.9] BUG: KASAN: slab-out-of-bounds in __dev_queue_xmit+0x2e5/0x14c0
From: Andrei Vagin @ 2018-03-14  6:55 UTC (permalink / raw)
  To: netdev

Hi,

I got the following warning on the v4.15.9 kernel.

:[ 4483.052174] ==================================================================
:[ 4483.052659] BUG: KASAN: slab-out-of-bounds in __dev_queue_xmit+0x2e5/0x14c0
:[ 4483.052937] Read of size 1 at addr ffff880067ef7bc0 by task objtool/26177
:
:[ 4483.053361] CPU: 0 PID: 26177 Comm: objtool Not tainted 4.15.9 #1
:[ 4483.053603] Hardware name: Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform, BIOS 6.12.26068.1232434 02/27/2017
:[ 4483.054116] Call Trace:
:[ 4483.054272]  <IRQ>
:[ 4483.054419]  dump_stack+0xda/0x16f
:[ 4483.054589]  ? _atomic_dec_and_lock+0x101/0x101
:[ 4483.054810]  ? rcu_lockdep_current_cpu_online+0xba/0x120
:[ 4483.055077]  print_address_description+0x6a/0x270
:[ 4483.055312]  kasan_report+0x277/0x360
:[ 4483.055491]  ? __dev_queue_xmit+0x2e5/0x14c0
:[ 4483.055688]  __dev_queue_xmit+0x2e5/0x14c0
:[ 4483.055892]  ? do_raw_spin_unlock+0x147/0x220
:[ 4483.056122]  ? netdev_pick_tx+0x150/0x150
:[ 4483.056369]  ? mark_held_locks+0x52/0x90
:[ 4483.056560]  ? __lock_acquire+0x61b/0x2060
:[ 4483.056771]  ? match_held_lock+0x8d/0x420
:[ 4483.056969]  ? mark_lock+0x1c9/0xa30
:[ 4483.057173]  ? save_trace+0x1e0/0x1e0
:[ 4483.057367]  ? print_irqtrace_events+0x110/0x110
:[ 4483.057602]  ? nf_conntrack_alter_reply+0x2a0/0x2a0 [nf_conntrack]
:[ 4483.057867]  ? tcp_new+0x510/0x510 [nf_conntrack]
:[ 4483.058101]  ? debug_check_no_locks_freed+0x1b0/0x1b0
:[ 4483.058360]  ? kernel_text_address+0xec/0x100
:[ 4483.058562]  ? find_held_lock+0x6d/0xd0
:[ 4483.058754]  ? lock_downgrade+0x320/0x320
:[ 4483.058959]  ? lock_release+0x4d0/0x4d0
:[ 4483.059184]  ? nf_ct_get_tuple+0x98/0xd0 [nf_conntrack]
:[ 4483.059422]  ? rcu_lockdep_current_cpu_online+0xba/0x120
:[ 4483.059655]  ? mark_held_locks+0x52/0x90
:[ 4483.059845]  ? ip_finish_output2+0x83d/0xb10
:[ 4483.060068]  ip_finish_output2+0x93f/0xb10
:[ 4483.060292]  ? ip_copy_metadata+0x320/0x320
:[ 4483.060485]  ? save_trace+0x1e0/0x1e0
:[ 4483.060659]  ? rcu_is_watching+0x81/0xc0
:[ 4483.060872]  ? ipv4_nlattr_to_tuple+0x80/0x80 [nf_conntrack_ipv4]
:[ 4483.061166]  ? nf_ct_deliver_cached_events+0x1a3/0x450 [nf_conntrack]
:[ 4483.061461]  ? __local_bh_enable_ip+0x9a/0x110
:[ 4483.061662]  ? ipt_do_table+0x65c/0x7e0
:[ 4483.061845]  ? ipv4_mtu+0x1ac/0x220
:[ 4483.062025]  ? find_held_lock+0x6d/0xd0
:[ 4483.062267]  ? ip_finish_output+0x435/0x590
:[ 4483.062462]  ip_finish_output+0x435/0x590
:[ 4483.062649]  ? ip_fragment.constprop.45+0xf0/0xf0
:[ 4483.062860]  ? ipv4_nlattr_to_tuple+0x80/0x80 [nf_conntrack_ipv4]
:[ 4483.063142]  ? iptable_nat_ipv4_fn+0x20/0x20 [iptable_nat]
:[ 4483.063393]  ? iptable_nat_ipv4_local_fn+0x20/0x20 [iptable_nat]
:[ 4483.063634]  ? rcu_is_watching+0x81/0xc0
:[ 4483.063829]  ? nf_hook_slow+0xa4/0xe0
:[ 4483.064031]  ip_output+0x12a/0x450
:[ 4483.064237]  ? ip_mc_output+0xc30/0xc30
:[ 4483.064435]  ? ip_fragment.constprop.45+0xf0/0xf0
:[ 4483.064644]  ? tcp_make_synack+0x7b9/0x950
:[ 4483.064849]  ip_build_and_send_pkt+0x2f7/0x480
:[ 4483.065086]  ? ip_local_out+0x90/0x90
:[ 4483.065283]  ? __lockdep_init_map+0x98/0x2a0
:[ 4483.065485]  ? inet_bind_hash+0x130/0x130
:[ 4483.065681]  tcp_v4_send_synack+0x1b7/0x280
:[ 4483.065878]  ? tcp_v4_send_check+0x40/0x40
:[ 4483.066094]  ? ip_mc_output+0x4b0/0xc30
:[ 4483.066344]  ? inet_csk_reqsk_queue_hash_add+0x11b/0x170
:[ 4483.066569]  ? inet_csk_route_child_sock+0x430/0x430
:[ 4483.066798]  tcp_conn_request+0x152e/0x1a70
:[ 4483.067017]  ? tcp_event_data_recv+0x6a0/0x6a0
:[ 4483.067259]  ? __lock_acquire+0x61b/0x2060
:[ 4483.067483]  ? debug_check_no_locks_freed+0x1b0/0x1b0
:[ 4483.067696]  ? print_irqtrace_events+0x110/0x110
:[ 4483.067902]  ? __lock_acquire+0x61b/0x2060
:[ 4483.068126]  ? match_held_lock+0x8d/0x420
:[ 4483.068376]  ? match_held_lock+0x8d/0x420
:[ 4483.068617]  ? match_held_lock+0x8d/0x420
:[ 4483.068868]  ? save_trace+0x1e0/0x1e0
:[ 4483.069132]  ? save_trace+0x1e0/0x1e0
:[ 4483.069383]  ? save_trace+0x1e0/0x1e0
:[ 4483.069615]  ? find_held_lock+0x6d/0xd0
:[ 4483.069888]  ? __lock_is_held+0x71/0xc0
:[ 4483.070181]  ? tcp_rcv_state_process+0x507/0x1fb0
:[ 4483.070557]  tcp_rcv_state_process+0x507/0x1fb0
:[ 4483.070824]  ? rcu_is_watching+0x81/0xc0
:[ 4483.071103]  ? tcp_finish_connect+0x180/0x180
:[ 4483.071394]  ? sk_filter_trim_cap+0x30b/0x510
:[ 4483.071658]  ? sk_skb_is_valid_access+0xd0/0xd0
:[ 4483.071933]  ? tcp_parse_md5sig_option+0x6d/0x90
:[ 4483.072231]  ? tcp_v4_inbound_md5_hash+0xca/0x2a0
:[ 4483.072530]  ? tcp_v4_do_rcv+0x266/0x340
:[ 4483.072763]  tcp_v4_do_rcv+0x266/0x340
:[ 4483.073018]  tcp_v4_rcv+0x1255/0x1290
:[ 4483.073324]  ? tcp_v4_early_demux+0x3b0/0x3b0
:[ 4483.073583]  ? find_held_lock+0xb0/0xd0
:[ 4483.073840]  ip_local_deliver_finish+0x1c9/0x5f0
:[ 4483.074137]  ? ipv4_nlattr_to_tuple+0x80/0x80 [nf_conntrack_ipv4]
:[ 4483.074425]  ? inet_del_offload+0x40/0x40
:[ 4483.074618]  ? nf_hook_slow+0xa4/0xe0
:[ 4483.074799]  ip_local_deliver+0x324/0x410
:[ 4483.075005]  ? ip_call_ra_chain+0x390/0x390
:[ 4483.075239]  ? inet_del_offload+0x40/0x40
:[ 4483.075460]  ip_rcv_finish+0x587/0xbb0
:[ 4483.075646]  ? ip_local_deliver_finish+0x5f0/0x5f0
:[ 4483.075860]  ? find_held_lock+0x6d/0xd0
:[ 4483.076067]  ? ip_rcv+0x70b/0x940
:[ 4483.076252]  ? lock_downgrade+0x320/0x320
:[ 4483.076556]  ? tcp_v4_send_synack+0x280/0x280
:[ 4483.076757]  ? do_add_counters+0x2b0/0x2b0
:[ 4483.076958]  ? rcu_is_watching+0x81/0xc0
:[ 4483.077179]  ? iptable_nat_ipv4_out+0x20/0x20 [iptable_nat]
:[ 4483.077424]  ? nf_hook_slow+0xa4/0xe0
:[ 4483.077606]  ip_rcv+0x54d/0x940
:[ 4483.077776]  ? ip_local_deliver+0x410/0x410
:[ 4483.077985]  ? ip_local_deliver_finish+0x5f0/0x5f0
:[ 4483.078229]  ? match_held_lock+0x8d/0x420
:[ 4483.078455]  ? ip_local_deliver+0x410/0x410
:[ 4483.078653]  __netif_receive_skb_core+0x13d7/0x1a20
:[ 4483.078884]  ? enqueue_to_backlog+0x730/0x730
:[ 4483.079110]  ? __is_insn_slot_addr+0x17b/0x240
:[ 4483.079332]  ? lock_downgrade+0x320/0x320
:[ 4483.079535]  ? find_held_lock+0x6d/0xd0
:[ 4483.079727]  ? is_bpf_text_address+0x60/0xe0
:[ 4483.079931]  ? match_held_lock+0x8d/0x420
:[ 4483.080138]  ? lock_downgrade+0x320/0x320
:[ 4483.080344]  ? save_trace+0x1e0/0x1e0
:[ 4483.080518]  ? lock_release+0x4d0/0x4d0
:[ 4483.080699]  ? __free_insn_slot+0x3e0/0x3e0
:[ 4483.080892]  ? rcu_is_watching+0x81/0xc0
:[ 4483.081104]  ? rcutorture_record_progress+0x10/0x10
:[ 4483.081339]  ? page_fault+0x7b/0x80
:[ 4483.081514]  ? match_held_lock+0x8d/0x420
:[ 4483.081705]  ? save_trace+0x1e0/0x1e0
:[ 4483.081882]  ? find_held_lock+0x6d/0xd0
:[ 4483.082093]  ? inet_gro_receive+0x21e/0x7c0
:[ 4483.082309]  ? lock_downgrade+0x320/0x320
:[ 4483.082504]  ? lock_release+0x4d0/0x4d0
:[ 4483.082695]  ? find_held_lock+0x6d/0xd0
:[ 4483.082887]  ? lock_acquire+0x129/0x320
:[ 4483.083090]  ? lock_acquire+0x129/0x320
:[ 4483.083293]  ? netif_receive_skb_internal+0xb2/0x4b0
:[ 4483.083519]  ? lock_release+0x4d0/0x4d0
:[ 4483.083703]  ? rcu_is_watching+0x81/0xc0
:[ 4483.083889]  ? rcu_is_watching+0x81/0xc0
:[ 4483.084097]  ? rcutorture_record_progress+0x10/0x10
:[ 4483.084335]  ? save_trace+0x1e0/0x1e0
:[ 4483.084518]  ? netif_receive_skb_internal+0xfa/0x4b0
:[ 4483.084729]  netif_receive_skb_internal+0xfa/0x4b0
:[ 4483.084962]  ? dev_cpu_dead+0x500/0x500
:[ 4483.085176]  ? net_rx_action+0xbf0/0xbf0
:[ 4483.085386]  ? __lock_is_held+0x51/0xc0
:[ 4483.085588]  napi_gro_receive+0x262/0x2e0
:[ 4483.085773]  ? dev_gro_receive+0xfe0/0xfe0
:[ 4483.085966]  ? eth_type_trans+0x133/0x280
:[ 4483.086180]  ? eth_gro_receive+0x3d0/0x3d0
:[ 4483.086411]  e1000_clean_rx_irq+0x2fa/0x940 [e1000]
:[ 4483.086654]  ? e1000_clean_jumbo_rx_irq+0x1110/0x1110 [e1000]
:[ 4483.086904]  ? update_max_interval+0x40/0x40
:[ 4483.087145]  ? __lock_is_held+0x71/0xc0
:[ 4483.087348]  ? __calc_delta+0xf6/0x140
:[ 4483.087529]  ? update_min_vruntime+0x7d/0xb0
:[ 4483.087731]  ? e1000_clean_jumbo_rx_irq+0x1110/0x1110 [e1000]
:[ 4483.087989]  e1000_clean+0x65e/0x1190 [e1000]
:[ 4483.088252]  ? e1000_unmap_and_free_tx_resource.isra.45+0x120/0x120 [e1000]
:[ 4483.088545]  ? do_raw_spin_trylock+0x100/0x100
:[ 4483.088744]  ? find_held_lock+0xb0/0xd0
:[ 4483.088940]  ? calc_global_load_tick+0x90/0x170
:[ 4483.089178]  ? match_held_lock+0xa5/0x420
:[ 4483.089446]  ? match_held_lock+0x8d/0x420
:[ 4483.089637]  ? save_trace+0x1e0/0x1e0
:[ 4483.089824]  ? enqueue_hrtimer+0xe2/0x290
:[ 4483.090023]  ? mark_held_locks+0x6e/0x90
:[ 4483.090241]  ? net_rx_action+0x2e3/0xbf0
:[ 4483.090441]  net_rx_action+0x477/0xbf0
:[ 4483.090647]  ? napi_complete_done+0x350/0x350
:[ 4483.090848]  ? lock_downgrade+0x320/0x320
:[ 4483.091078]  ? find_held_lock+0x6d/0xd0
:[ 4483.091293]  ? match_held_lock+0xa5/0x420
:[ 4483.091481]  ? ktime_get+0x18f/0x250
:[ 4483.091655]  ? mark_lock+0x1c9/0xa30
:[ 4483.091828]  ? do_raw_spin_unlock+0x147/0x220
:[ 4483.092053]  ? print_irqtrace_events+0x110/0x110
:[ 4483.092304]  ? pvclock_clocksource_read+0x12c/0x230
:[ 4483.092525]  ? pvclock_read_flags+0x50/0x50
:[ 4483.092725]  ? native_apic_msr_write+0x27/0x30
:[ 4483.092928]  ? lapic_next_event+0x36/0x40
:[ 4483.093139]  ? idle_cpu+0x96/0x110
:[ 4483.093325]  ? task_prio+0x20/0x20
:[ 4483.093495]  ? sched_clock_cpu+0x14/0xe0
:[ 4483.093683]  ? irqtime_account_irq+0xa1/0xd0
:[ 4483.093893]  ? rcu_irq_exit+0x62/0xb0
:[ 4483.094095]  ? irq_exit+0x7a/0x150
:[ 4483.094322]  ? smp_apic_timer_interrupt+0x13e/0x490
:[ 4483.094534]  ? smp_call_function_single_interrupt+0x430/0x430
:[ 4483.094773]  ? trace_hardirqs_off_caller+0x70/0x100
:[ 4483.095001]  ? match_held_lock+0xa5/0x420
:[ 4483.095227]  ? save_trace+0x1e0/0x1e0
:[ 4483.095417]  ? mark_held_locks+0x6e/0x90
:[ 4483.095599]  ? retint_kernel+0x10/0x10
:[ 4483.095779]  ? trace_hardirqs_on_caller+0x17f/0x260
:[ 4483.096018]  ? trace_hardirqs_on_thunk+0x1a/0x1c
:[ 4483.096263]  ? irq_exit+0x7a/0x150
:[ 4483.096448]  ? __lock_is_held+0x51/0xc0
:[ 4483.096646]  __do_softirq+0x1de/0x765
:[ 4483.096840]  ? __irqentry_text_end+0x1fa1d7/0x1fa1d7
:[ 4483.097081]  ? handle_irq+0x109/0x1c0
:[ 4483.097280]  ? lock_downgrade+0x320/0x320
:[ 4483.097473]  ? pvclock_clocksource_read+0x12c/0x230
:[ 4483.097690]  ? pvclock_read_flags+0x50/0x50
:[ 4483.097884]  ? __irq_complete_move+0x15/0x50
:[ 4483.098100]  ? kzalloc.constprop.11+0x15/0x15
:[ 4483.098314]  ? ioapic_ack_level+0xbb/0x1e0
:[ 4483.098526]  ? sched_clock+0x5/0x10
:[ 4483.098693]  ? sched_clock_cpu+0x14/0xe0
:[ 4483.098899]  irq_exit+0x146/0x150
:[ 4483.099093]  do_IRQ+0xb0/0x130
:[ 4483.099290]  common_interrupt+0x91/0x91
:[ 4483.099474]  </IRQ>
:[ 4483.099601] RIP: 0010:lock_release+0x280/0x4d0
:[ 4483.099794] RSP: 0000:ffff880011667918 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda
:[ 4483.100123] RAX: 0000000000000000 RBX: 1ffff100022ccf26 RCX: ffffffff911cc36f
:[ 4483.100417] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: 0000000000000246
:[ 4483.100689] RBP: ffff880062aea7c0 R08: 0000000000000000 R09: 0000000000000000
:[ 4483.100975] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880062aea7c0
:[ 4483.101289] R13: 0000000000000001 R14: 0000000000000001 R15: e9e54f45c56e85aa
:[ 4483.101598]  ? lock_release+0x26f/0x4d0
:[ 4483.101798]  ? __handle_mm_fault+0xc29/0x2040
:[ 4483.102046]  ? lock_downgrade+0x320/0x320
:[ 4483.102257]  ? lock_release+0x4d0/0x4d0
:[ 4483.102448]  ? do_raw_spin_trylock+0x100/0x100
:[ 4483.102670]  _raw_spin_unlock+0x1c/0x30
:[ 4483.102850]  __handle_mm_fault+0xc29/0x2040
:[ 4483.103077]  ? __pmd_alloc+0x320/0x320
:[ 4483.103302]  ? handle_mm_fault+0x17a/0x4d0
:[ 4483.103499]  ? lock_downgrade+0x320/0x320
:[ 4483.103706]  ? mem_cgroup_from_task+0xb4/0x170
:[ 4483.103910]  ? rcu_is_watching+0x81/0xc0
:[ 4483.104137]  handle_mm_fault+0x204/0x4d0
:[ 4483.104345]  ? __handle_mm_fault+0x2040/0x2040
:[ 4483.104546]  ? vmacache_find+0xe6/0x110
:[ 4483.104739]  __do_page_fault+0x3b1/0x6e0
:[ 4483.104935]  ? spurious_fault+0x320/0x320
:[ 4483.105151]  ? __do_page_fault+0x5dd/0x6e0
:[ 4483.105369]  do_page_fault+0xb6/0x440
:[ 4483.105545]  ? __do_page_fault+0x6e0/0x6e0
:[ 4483.105736]  ? exit_to_usermode_loop+0xb7/0x170
:[ 4483.105946]  ? trace_raw_output_sys_exit+0x80/0x80
:[ 4483.106183]  ? __do_page_fault+0x5dd/0x6e0
:[ 4483.106388]  ? lockdep_sys_exit+0x16/0x8e
:[ 4483.106572]  ? syscall_return_slowpath+0x1bc/0x2c0
:[ 4483.106783]  ? mark_held_locks+0x1c/0x90
:[ 4483.107093]  ? retint_user+0x18/0x18
:[ 4483.107281]  ? page_fault+0x65/0x80
:[ 4483.107462]  ? trace_hardirqs_off_caller+0xbe/0x100
:[ 4483.107674]  ? trace_hardirqs_off_thunk+0x1a/0x1c
:[ 4483.107890]  ? page_fault+0x65/0x80
:[ 4483.108079]  page_fault+0x7b/0x80
:[ 4483.108267] RIP: 0033:0x408de0
:[ 4483.108434] RSP: 002b:00007ffc27610e80 EFLAGS: 00010202
:[ 4483.108656] RAX: 00007fb1b53da000 RBX: 00007fb1b7152068 RCX: 00007fb1b540a880
:[ 4483.108939] RDX: 00007fb1b538a870 RSI: 0000000000081000 RDI: 0000000000000000
:[ 4483.109248] RBP: 00007fb1b7152010 R08: 00007fb1b538a010 R09: 0000000000000000
:[ 4483.109532] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000029
:[ 4483.109804] R13: 00007fb1b538a010 R14: 000000000115b3b8 R15: 0000000000000000
:
:[ 4483.110264] Allocated by task 0:
:[ 4483.110429] (stack is not available)
:
:[ 4483.110702] Freed by task 0:
:[ 4483.110853] (stack is not available)
:
:[ 4483.111159] The buggy address belongs to the object at ffff880067ef7b00
:                which belongs to the cache request_sock_TCP of size 328
:[ 4483.111629] The buggy address is located 192 bytes inside of
:                328-byte region [ffff880067ef7b00, ffff880067ef7c48)
:[ 4483.112063] The buggy address belongs to the page:
:[ 4483.112289] page:ffffea00019fbd00 count:1 mapcount:0 mapping:0000000000000000 index:0xffff880067ef7e30 compound_mapcount: 0
:[ 4483.112699] flags: 0xfffe000008100(slab|head)
:[ 4483.112900] raw: 000fffe000008100 0000000000000000 ffff880067ef7e30 0000000100280002
:[ 4483.113232] raw: ffff880069909780 ffff880069909780 ffff88006a186f80 0000000000000000
:[ 4483.113539] page dumped because: kasan: bad access detected
:
:[ 4483.113872] Memory state around the buggy address:
:[ 4483.114108]  ffff880067ef7a80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
:[ 4483.114415]  ffff880067ef7b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
:[ 4483.114695] >ffff880067ef7b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
:[ 4483.114990]                                            ^
:[ 4483.115246]  ffff880067ef7c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
:[ 4483.115537]  ffff880067ef7c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
:[ 4483.115816] ==================================================================
:[ 4483.116132] Disabling lock debugging due to kernel taint

/root/linux/./include/linux/cgroup-defs.h:761
   169c2:       49 8d bc 24 f0 03 00    lea    0x3f0(%r12),%rdi
   169c9:       00
   169ca:       41 bd 01 00 00 00       mov    $0x1,%r13d
   169d0:       e8 00 00 00 00          callq  169d5 <__dev_queue_xmit+0x2e5>
   169d5:       41 f6 84 24 f0 03 00    testb  $0x1,0x3f0(%r12)
   169dc:       00 01 
   169de:       74 16                   je     169f6 <__dev_queue_xmit+0x306>
   169e0:       49 8d bc 24 f2 03 00    lea    0x3f2(%r12),%rdi
   169e7:       00 
   169e8:       e8 00 00 00 00          callq  169ed <__dev_queue_xmit+0x2fd>
   169ed:       45 0f b7 ac 24 f2 03    movzwl 0x3f2(%r12),%r13d
   169f4:       00 00

static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
{  
            /* fallback to 1 which is always the ID of the root cgroup */
761:        return (skcd->is_data & 1) ? skcd->prioidx : 1;
} 
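
For readers following the disassembly: the one-byte read that KASAN
reports appears to correspond to the is_data test in the helper above
(the testb at offset 0x3f0). The helper's logic can be modelled in
userspace as below. This is only a sketch: struct skcd_model is a
simplified, hypothetical layout, not the kernel's struct
sock_cgroup_data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical model of struct sock_cgroup_data. */
struct skcd_model {
	uint8_t  is_data;	/* low bit set when prioidx is valid */
	uint16_t prioidx;	/* priority index of the owning cgroup */
};

uint16_t prioidx_or_root(const struct skcd_model *skcd)
{
	assert(skcd != NULL);

	/* Fall back to 1, which is always the ID of the root cgroup. */
	return (skcd->is_data & 1) ? skcd->prioidx : 1;
}
```

The helper dereferences its argument unconditionally, so it is only as
safe as the pointer it is given. If that pointer does not refer to a
fully sized socket object, the very first field access is already out of
bounds.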

* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
From: Or Gerlitz @ 2018-03-14  6:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman,
	Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey
In-Reply-To: <20180313185002.45264fb1@cakuba.netronome.com>

On Wed, Mar 14, 2018 at 3:50 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
>> > Starting with type 2, in our current NIC HW APIs we have to duplicate
>> > these rules into two rules set to HW:
>> >
>> > 2.1 VF rep --> uplink 0
>> > 2.2 VF rep --> uplink 1
>> >
>> > and we do that in the driver (add/del two HW rules, combine the stat
>> > results, etc)
>
> Ack, I think our HW API also will require us to duplicate the rules
> today, but IMHO we should implement some common helper module in the
> core that would work for any block sharing rather than bond specific
> solution.

To be clear, you refer to the case where the bond is the egress device
of the rule?

For the case where the bond is the ingress device, are you OK with the
approach Jiri suggested, propagating the tc setup ndo call into the lower
devices so that they bind to/unbind from any block the upper device does?
This approach is applicable to bond/team/vlan devices for both NIC and
Switch ASIC (or NPU...) drivers. Do you want to make a helper out of this?

* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Eric Dumazet @ 2018-03-14  6:21 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-2-ast@kernel.org>



On 03/13/2018 08:39 PM, Alexei Starovoitov wrote:
> From: Andrey Ignatov <rdna@fb.com>
> 
> == The problem ==
> 
> There is a use-case when all processes inside a cgroup should use one
> single IP address on a host that has multiple IP configured.  Those
> processes should use the IP for both ingress and egress, for TCP and UDP
> traffic. So TCP/UDP servers should be bound to that IP to accept
> incoming connections on it, and TCP/UDP clients should make outgoing
> connections from that IP. It should not require changing application
> code since it's often not possible.
> 
> Currently it's solved by intercepting glibc wrappers around syscalls
> such as `bind(2)` and `connect(2)`. It's done by a shared library that
> is preloaded for every process in a cgroup so that whenever TCP/UDP
> server calls `bind(2)`, the library replaces IP in sockaddr before
> passing arguments to syscall. When application calls `connect(2)` the
> library transparently binds the local end of connection to that IP
> (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
> 
> The shared library approach is fragile though, e.g.:
> * some applications clear env vars (incl. `LD_PRELOAD`);
> * `/etc/ld.so.preload` doesn't help since some applications are linked
>    with option `-z nodefaultlib`;
> * other applications don't use glibc and there is nothing to intercept.
> 
> == The solution ==
> 
> The patch provides a much more reliable in-kernel solution for the 1st
> part of the problem: binding TCP/UDP servers on desired IP. It does not
> depend on application environment and implementation details (whether
> glibc is used or not).
>


If I understand correctly, strace(1) will not show the real IP/port
(after modification by eBPF)?

What about SELinux and other LSMs?

We now have network namespaces for full isolation. Soon ILA will come.

The argument that it is not convenient (or even possible) to change the
application or to use modern isolation is quite strange, considering the
added burden/complexity/bloat to the kernel.

The post hook for sys_bind is clearly a failure of the model, since
releasing the port might already be too late: another thread might fail
to get it during a non-zero time window.
It seems this is exactly the case where a netns would be the correct answer.


If you want to provide an alternate port allocation strategy, better 
provide a correct eBPF hook.

* Re: [PATCH iproute2 1/1] tc: use get_u32() in psample action to match types
From: yotam gigi @ 2018-03-14  5:41 UTC (permalink / raw)
  To: Roman Mashak
  Cc: stephen, netdev, kernel, Jamal Hadi Salim, xiyou.wangcong, jiri
In-Reply-To: <1520975783-8593-1-git-send-email-mrv@mojatatu.com>

On Tue, Mar 13, 2018 at 11:16 PM, Roman Mashak <mrv@mojatatu.com> wrote:

Makes sense :)

Acked-by: Yotam Gigi <yotam.gi@gmail.com>

> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
>  tc/m_sample.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tc/m_sample.c b/tc/m_sample.c
> index ff5ee6bd1ef6..dff986f59999 100644
> --- a/tc/m_sample.c
> +++ b/tc/m_sample.c
> @@ -65,7 +65,7 @@ static int parse_sample(struct action_util *a, int *argc_p, char ***argv_p,
>         while (argc > 0) {
>                 if (matches(*argv, "rate") == 0) {
>                         NEXT_ARG();
> -                       if (get_unsigned(&rate, *argv, 10) != 0) {
> +                       if (get_u32(&rate, *argv, 10) != 0) {
>                                 fprintf(stderr, "Illegal rate %s\n", *argv);
>                                 usage();
>                                 return -1;
> @@ -73,7 +73,7 @@ static int parse_sample(struct action_util *a, int *argc_p, char ***argv_p,
>                         rate_set = true;
>                 } else if (matches(*argv, "group") == 0) {
>                         NEXT_ARG();
> -                       if (get_unsigned(&group, *argv, 10) != 0) {
> +                       if (get_u32(&group, *argv, 10) != 0) {
>                                 fprintf(stderr, "Illegal group num %s\n",
>                                         *argv);
>                                 usage();
> @@ -82,7 +82,7 @@ static int parse_sample(struct action_util *a, int *argc_p, char ***argv_p,
>                         group_set = true;
>                 } else if (matches(*argv, "trunc") == 0) {
>                         NEXT_ARG();
> -                       if (get_unsigned(&trunc, *argv, 10) != 0) {
> +                       if (get_u32(&trunc, *argv, 10) != 0) {
>                                 fprintf(stderr, "Illegal truncation size %s\n",
>                                         *argv);
>                                 usage();
> --
> 2.7.4
>
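
The subject line says the change is made to match types, presumably
because the parsed values are stored in 32-bit variables. A strict
32-bit parser of the kind the patch switches to can be sketched as
below. parse_u32() is a hypothetical name and this is not the iproute2
implementation of get_u32(); it only illustrates the checks such a
helper would be expected to make.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 0 and stores the value on success, -1 on any parse error. */
int parse_u32(uint32_t *val, const char *arg, int base)
{
	unsigned long res;
	char *end;

	assert(val != NULL);

	if (!arg || !*arg)
		return -1;

	errno = 0;
	res = strtoul(arg, &end, base);

	/* No digits consumed, or trailing junk after the number. */
	if (end == arg || *end != '\0')
		return -1;

	/* Value does not fit in unsigned long at all. */
	if (errno == ERANGE)
		return -1;

	/* unsigned long may be wider than 32 bits; reject what does not fit. */
	if (res > UINT32_MAX)
		return -1;

	*val = (uint32_t)res;
	return 0;
}
```

The explicit upper-bound check is what distinguishes this from a parser
that targets a plain unsigned type, whose width depends on the platform.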

* Re: [PATCH 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
From: Timur Tabi @ 2018-03-14  5:08 UTC (permalink / raw)
  To: Sinan Kaya, netdev, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Jeff Kirsher, intel-wired-lan,
	linux-kernel
In-Reply-To: <1520997629-17361-7-git-send-email-okaya@codeaurora.org>

On 3/13/18 10:20 PM, Sinan Kaya wrote:
> +/* Assumes caller has executed a write barrier to order memory and device
> + * requests.
> + */
>   static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
>   {
> -	writel(value, ring->tail);
> +	writel_relaxed(value, ring->tail);
>   }

Why not put the wmb() in this function, or just get rid of the wmb() in 
the rest of the file and keep this as writel?  That way, you can avoid 
the comment and the risk that comes with it.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
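
The first alternative raised above, putting the barrier inside the
helper, can be sketched in userspace with C11 atomics. This is an
analogy only: C11 fences order ordinary CPU memory accesses, not MMIO
writes to a device, and struct ring_model, ring_write_tail() and
ring_publish() are hypothetical names rather than driver code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: desc stands in for descriptor memory, tail for the
 * device tail register. */
struct ring_model {
	uint32_t         desc;
	_Atomic uint32_t tail;
};

/* The helper issues the barrier itself, so callers cannot forget it. */
void ring_write_tail(struct ring_model *ring, uint32_t value)
{
	assert(ring != NULL);

	/* Order all earlier writes before the tail update becomes visible. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ring->tail, value, memory_order_relaxed);
}

/* Fill one descriptor, then bump the tail. */
void ring_publish(struct ring_model *ring, uint32_t desc, uint32_t new_tail)
{
	ring->desc = desc;
	ring_write_tail(ring, new_tail);
}
```

With the barrier inside the helper, the ordering requirement no longer
has to be stated in a comment and upheld by every caller. The trade-off
is that a caller which has already issued its own barrier pays for a
second one.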

* [PATCH net-next] liquidio: Add support for liquidio 10GBase-T NIC
From: Felix Manlunas @ 2018-03-14  5:04 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	felix.manlunas, veerasenareddy.burru

From: Veerasenareddy Burru <veerasenareddy.burru@cavium.com>

Add ethtool changes to show the port type as TP (Twisted Pair) for
10GBASE-T ports. The same driver and firmware work for liquidio NICs with
either SFP+ or TP ports.

Signed-off-by: Veerasenareddy Burru <veerasenareddy.burru@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 24 ++++++++++++++++------
 .../net/ethernet/cavium/liquidio/liquidio_common.h | 12 +++++++++--
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index a63ddf0..550ac29 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -232,10 +232,16 @@ static int lio_get_link_ksettings(struct net_device *netdev,
 
 	linfo = &lio->linfo;
 
-	if (linfo->link.s.if_mode == INTERFACE_MODE_XAUI ||
-	    linfo->link.s.if_mode == INTERFACE_MODE_RXAUI ||
-	    linfo->link.s.if_mode == INTERFACE_MODE_XLAUI ||
-	    linfo->link.s.if_mode == INTERFACE_MODE_XFI) {
+	switch (linfo->link.s.phy_type) {
+	case LIO_PHY_PORT_TP:
+		ecmd->base.port = PORT_TP;
+		supported = (SUPPORTED_10000baseT_Full |
+			     SUPPORTED_TP | SUPPORTED_Pause);
+		advertising = (ADVERTISED_10000baseT_Full | ADVERTISED_Pause);
+		ecmd->base.autoneg = AUTONEG_DISABLE;
+		break;
+
+	case LIO_PHY_PORT_FIBRE:
 		ecmd->base.port = PORT_FIBRE;
 
 		if (linfo->link.s.speed == SPEED_10000) {
@@ -245,12 +251,18 @@ static int lio_get_link_ksettings(struct net_device *netdev,
 
 		supported |= SUPPORTED_FIBRE | SUPPORTED_Pause;
 		advertising |= ADVERTISED_Pause;
+		ecmd->base.autoneg = AUTONEG_DISABLE;
+		break;
+	}
+
+	if (linfo->link.s.if_mode == INTERFACE_MODE_XAUI ||
+	    linfo->link.s.if_mode == INTERFACE_MODE_RXAUI ||
+	    linfo->link.s.if_mode == INTERFACE_MODE_XLAUI ||
+	    linfo->link.s.if_mode == INTERFACE_MODE_XFI) {
 		ethtool_convert_legacy_u32_to_link_mode(
 			ecmd->link_modes.supported, supported);
 		ethtool_convert_legacy_u32_to_link_mode(
 			ecmd->link_modes.advertising, advertising);
-		ecmd->base.autoneg = AUTONEG_DISABLE;
-
 	} else {
 		dev_err(&oct->pci_dev->dev, "Unknown link interface reported %d\n",
 			linfo->link.s.if_mode);
diff --git a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
index 522dcc4..ae8566c 100644
--- a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
+++ b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
@@ -675,9 +675,11 @@ struct octeon_instr_rdp {
 		u64 if_mode:5;
 		u64 pause:1;
 		u64 flashing:1;
-		u64 reserved:15;
+		u64 phy_type:5;
+		u64 reserved:10;
 #else
-		u64 reserved:15;
+		u64 reserved:10;
+		u64 phy_type:5;
 		u64 flashing:1;
 		u64 pause:1;
 		u64 if_mode:5;
@@ -690,6 +692,12 @@ struct octeon_instr_rdp {
 	} s;
 };
 
+enum lio_phy_type {
+	LIO_PHY_PORT_TP = 0x0,
+	LIO_PHY_PORT_FIBRE = 0x1,
+	LIO_PHY_PORT_UNKNOWN,
+};
+
 /** The txpciq info passed to host from the firmware */
 
 union oct_txpciq {
-- 
1.8.3.1

* [PATCH v2 2/2] doc: Change the udp/sctp rmem/wmem default value.
From: Tonghao Zhang @ 2018-03-14  4:57 UTC (permalink / raw)
  To: davem, pabeni, edumazet; +Cc: netdev, Tonghao Zhang
In-Reply-To: <1521003437-20341-1-git-send-email-xiangxia.m.yue@gmail.com>

SK_MEM_QUANTUM was changed from PAGE_SIZE to 4096, so update the documented
udp/sctp rmem/wmem defaults from "1 page" to 4K.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 Documentation/networking/ip-sysctl.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 783675a..1d11207 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -755,13 +755,13 @@ udp_rmem_min - INTEGER
 	Minimal size of receive buffer used by UDP sockets in moderation.
 	Each UDP socket is able to use the size for receiving data, even if
 	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
-	Default: 1 page
+	Default: 4K
 
 udp_wmem_min - INTEGER
 	Minimal size of send buffer used by UDP sockets in moderation.
 	Each UDP socket is able to use the size for sending data, even if
 	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
-	Default: 1 page
+	Default: 4K
 
 CIPSOv4 Variables:
 
@@ -2101,7 +2101,7 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
 	It is guaranteed to each SCTP socket (but not association) even
 	under moderate memory pressure.
 
-	Default: 1 page
+	Default: 4K
 
 sctp_wmem  - vector of 3 INTEGERs: min, default, max
 	Currently this tunable has no effect.
-- 
1.8.3.1

* [PATCH v2 1/2] udp: Move the udp sysctl to namespace.
From: Tonghao Zhang @ 2018-03-14  4:57 UTC (permalink / raw)
  To: davem, pabeni, edumazet; +Cc: netdev, Tonghao Zhang

This patch moves udp_rmem_min and udp_wmem_min into the network
namespace and initializes udp_l3mdev_accept explicitly.

udp_rmem_min and udp_wmem_min affect the UDP rx/tx queues; with this
patch, each namespace can set them differently.
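
As a rough illustration of the offset-based lookup this relies on, here
is a user-space sketch. All *_sketch types and get_rmem_min() are
hypothetical stand-ins, not the kernel's definitions; the only idea
taken from the patch is storing an offsetof() into the namespace struct
instead of a pointer to a global:

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for struct net / struct proto. */
struct netns_ipv4_sketch {
	int sysctl_udp_wmem_min;
	int sysctl_udp_rmem_min;
};

struct net_sketch {
	struct netns_ipv4_sketch ipv4;
};

struct proto_sketch {
	size_t sysctl_wmem_offset;
	size_t sysctl_rmem_offset;
};

/* One protocol descriptor shared by every namespace. */
static const struct proto_sketch udp_prot_sketch = {
	.sysctl_wmem_offset = offsetof(struct net_sketch, ipv4.sysctl_udp_wmem_min),
	.sysctl_rmem_offset = offsetof(struct net_sketch, ipv4.sysctl_udp_rmem_min),
};

/* Resolve the per-namespace value by adding the stored offset to the
 * base address of the namespace the socket belongs to.
 */
static int get_rmem_min(const struct net_sketch *net,
			const struct proto_sketch *prot)
{
	return *(const int *)((const char *)net + prot->sysctl_rmem_offset);
}
```

Two namespaces that share udp_prot_sketch can then hold different
minimums, because the value is read from whichever namespace is passed
in rather than from a single global variable.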

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 v2: use a common helper to avoid the code duplication.
---
 include/net/netns/ipv4.h   |  3 ++
 net/ipv4/sysctl_net_ipv4.c | 32 ++++++++---------
 net/ipv4/udp.c             | 86 +++++++++++++++++++++++++++-------------------
 net/ipv6/udp.c             | 52 ++++++++++++++--------------
 4 files changed, 96 insertions(+), 77 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 3a970e4..382bfd7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -168,6 +168,9 @@ struct netns_ipv4 {
 	atomic_t tfo_active_disable_times;
 	unsigned long tfo_active_disable_stamp;
 
+	int sysctl_udp_wmem_min;
+	int sysctl_udp_rmem_min;
+
 #ifdef CONFIG_NET_L3_MASTER_DEV
 	int sysctl_udp_l3mdev_accept;
 #endif
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 011de9a..5b72d97 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -520,22 +520,6 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
-	{
-		.procname	= "udp_rmem_min",
-		.data		= &sysctl_udp_rmem_min,
-		.maxlen		= sizeof(sysctl_udp_rmem_min),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one
-	},
-	{
-		.procname	= "udp_wmem_min",
-		.data		= &sysctl_udp_wmem_min,
-		.maxlen		= sizeof(sysctl_udp_wmem_min),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one
-	},
 	{ }
 };
 
@@ -1167,6 +1151,22 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+	{
+		.procname	= "udp_rmem_min",
+		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_udp_rmem_min),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one
+	},
+	{
+		.procname	= "udp_wmem_min",
+		.data		= &init_net.ipv4.sysctl_udp_wmem_min,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_udp_wmem_min),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one
+	},
 	{ }
 };
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404..908fc02 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -122,12 +122,6 @@
 long sysctl_udp_mem[3] __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_mem);
 
-int sysctl_udp_rmem_min __read_mostly;
-EXPORT_SYMBOL(sysctl_udp_rmem_min);
-
-int sysctl_udp_wmem_min __read_mostly;
-EXPORT_SYMBOL(sysctl_udp_wmem_min);
-
 atomic_long_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
@@ -2533,35 +2527,35 @@ int udp_abort(struct sock *sk, int err)
 EXPORT_SYMBOL_GPL(udp_abort);
 
 struct proto udp_prot = {
-	.name		   = "UDP",
-	.owner		   = THIS_MODULE,
-	.close		   = udp_lib_close,
-	.connect	   = ip4_datagram_connect,
-	.disconnect	   = udp_disconnect,
-	.ioctl		   = udp_ioctl,
-	.init		   = udp_init_sock,
-	.destroy	   = udp_destroy_sock,
-	.setsockopt	   = udp_setsockopt,
-	.getsockopt	   = udp_getsockopt,
-	.sendmsg	   = udp_sendmsg,
-	.recvmsg	   = udp_recvmsg,
-	.sendpage	   = udp_sendpage,
-	.release_cb	   = ip4_datagram_release_cb,
-	.hash		   = udp_lib_hash,
-	.unhash		   = udp_lib_unhash,
-	.rehash		   = udp_v4_rehash,
-	.get_port	   = udp_v4_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
-	.sysctl_wmem	   = &sysctl_udp_wmem_min,
-	.sysctl_rmem	   = &sysctl_udp_rmem_min,
-	.obj_size	   = sizeof(struct udp_sock),
-	.h.udp_table	   = &udp_table,
+	.name			= "UDP",
+	.owner			= THIS_MODULE,
+	.close			= udp_lib_close,
+	.connect		= ip4_datagram_connect,
+	.disconnect		= udp_disconnect,
+	.ioctl			= udp_ioctl,
+	.init			= udp_init_sock,
+	.destroy		= udp_destroy_sock,
+	.setsockopt		= udp_setsockopt,
+	.getsockopt		= udp_getsockopt,
+	.sendmsg		= udp_sendmsg,
+	.recvmsg		= udp_recvmsg,
+	.sendpage		= udp_sendpage,
+	.release_cb		= ip4_datagram_release_cb,
+	.hash			= udp_lib_hash,
+	.unhash			= udp_lib_unhash,
+	.rehash			= udp_v4_rehash,
+	.get_port		= udp_v4_get_port,
+	.memory_allocated	= &udp_memory_allocated,
+	.sysctl_mem		= sysctl_udp_mem,
+	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_udp_wmem_min),
+	.sysctl_rmem_offset	= offsetof(struct net, ipv4.sysctl_udp_rmem_min),
+	.obj_size		= sizeof(struct udp_sock),
+	.h.udp_table		= &udp_table,
 #ifdef CONFIG_COMPAT
-	.compat_setsockopt = compat_udp_setsockopt,
-	.compat_getsockopt = compat_udp_getsockopt,
+	.compat_setsockopt	= compat_udp_setsockopt,
+	.compat_getsockopt	= compat_udp_getsockopt,
 #endif
-	.diag_destroy	   = udp_abort,
+	.diag_destroy		= udp_abort,
 };
 EXPORT_SYMBOL(udp_prot);
 
@@ -2831,6 +2825,26 @@ u32 udp_flow_hashrnd(void)
 }
 EXPORT_SYMBOL(udp_flow_hashrnd);
 
+static void __udp_sysctl_init(struct net *net)
+{
+	net->ipv4.sysctl_udp_rmem_min = SK_MEM_QUANTUM;
+	net->ipv4.sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	net->ipv4.sysctl_udp_l3mdev_accept = 0;
+#endif
+}
+
+static int __net_init udp_sysctl_init(struct net *net)
+{
+	__udp_sysctl_init(net);
+	return 0;
+}
+
+static struct pernet_operations __net_initdata udp_sysctl_ops = {
+	.init       = udp_sysctl_init,
+};
+
 void __init udp_init(void)
 {
 	unsigned long limit;
@@ -2843,8 +2857,7 @@ void __init udp_init(void)
 	sysctl_udp_mem[1] = limit;
 	sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;
 
-	sysctl_udp_rmem_min = SK_MEM_QUANTUM;
-	sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+	__udp_sysctl_init(&init_net);
 
 	/* 16 spinlocks per cpu */
 	udp_busylocks_log = ilog2(nr_cpu_ids) + 4;
@@ -2854,4 +2867,7 @@ void __init udp_init(void)
 		panic("UDP: failed to alloc udp_busylocks\n");
 	for (i = 0; i < (1U << udp_busylocks_log); i++)
 		spin_lock_init(udp_busylocks + i);
+
+	if (register_pernet_subsys(&udp_sysctl_ops))
+		panic("UDP: failed to init sysctl parameters.\n");
 }
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..ad30f5e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1509,34 +1509,34 @@ void udp6_proc_exit(struct net *net)
 /* ------------------------------------------------------------------------ */
 
 struct proto udpv6_prot = {
-	.name		   = "UDPv6",
-	.owner		   = THIS_MODULE,
-	.close		   = udp_lib_close,
-	.connect	   = ip6_datagram_connect,
-	.disconnect	   = udp_disconnect,
-	.ioctl		   = udp_ioctl,
-	.init		   = udp_init_sock,
-	.destroy	   = udpv6_destroy_sock,
-	.setsockopt	   = udpv6_setsockopt,
-	.getsockopt	   = udpv6_getsockopt,
-	.sendmsg	   = udpv6_sendmsg,
-	.recvmsg	   = udpv6_recvmsg,
-	.release_cb	   = ip6_datagram_release_cb,
-	.hash		   = udp_lib_hash,
-	.unhash		   = udp_lib_unhash,
-	.rehash		   = udp_v6_rehash,
-	.get_port	   = udp_v6_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
-	.sysctl_wmem	   = &sysctl_udp_wmem_min,
-	.sysctl_rmem	   = &sysctl_udp_rmem_min,
-	.obj_size	   = sizeof(struct udp6_sock),
-	.h.udp_table	   = &udp_table,
+	.name			= "UDPv6",
+	.owner			= THIS_MODULE,
+	.close			= udp_lib_close,
+	.connect		= ip6_datagram_connect,
+	.disconnect		= udp_disconnect,
+	.ioctl			= udp_ioctl,
+	.init			= udp_init_sock,
+	.destroy		= udpv6_destroy_sock,
+	.setsockopt		= udpv6_setsockopt,
+	.getsockopt		= udpv6_getsockopt,
+	.sendmsg		= udpv6_sendmsg,
+	.recvmsg		= udpv6_recvmsg,
+	.release_cb		= ip6_datagram_release_cb,
+	.hash			= udp_lib_hash,
+	.unhash			= udp_lib_unhash,
+	.rehash			= udp_v6_rehash,
+	.get_port		= udp_v6_get_port,
+	.memory_allocated	= &udp_memory_allocated,
+	.sysctl_mem		= sysctl_udp_mem,
+	.sysctl_wmem_offset     = offsetof(struct net, ipv4.sysctl_udp_wmem_min),
+	.sysctl_rmem_offset     = offsetof(struct net, ipv4.sysctl_udp_rmem_min),
+	.obj_size		= sizeof(struct udp6_sock),
+	.h.udp_table		= &udp_table,
 #ifdef CONFIG_COMPAT
-	.compat_setsockopt = compat_udpv6_setsockopt,
-	.compat_getsockopt = compat_udpv6_getsockopt,
+	.compat_setsockopt	= compat_udpv6_setsockopt,
+	.compat_getsockopt	= compat_udpv6_getsockopt,
 #endif
-	.diag_destroy      = udp_abort,
+	.diag_destroy		= udp_abort,
 };
 
 static struct inet_protosw udpv6_protosw = {
-- 
1.8.3.1

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
From: Siwei Liu @ 2018-03-14  4:50 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Stephen Hemminger, David Miller, Netdev,
	Jiri Pirko, virtio-dev, Brandeburg, Jesse, Alexander Duyck,
	Jakub Kicinski
In-Reply-To: <eac018bf-58fe-188d-0dad-f454c9affebb@intel.com>

On Tue, Mar 13, 2018 at 5:28 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 3/12/2018 3:44 PM, Siwei Liu wrote:
>>
>> Apologies, still some comments going. Please see inline.
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudrala@intel.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include <linux/cpu.h>
>>>   #include <linux/average.h>
>>>   #include <linux/filter.h>
>>> +#include <linux/netdevice.h>
>>> +#include <linux/pci.h>
>>>   #include <net/route.h>
>>>   #include <net/xdp.h>
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>          u32 speed;
>>>
>>>          unsigned long guest_offloads;
>>> +
>>> +       /* upper netdev created when BACKUP feature enabled */
>>> +       struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>          }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> +                                     size_t len)
>>> +{
>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>> +       int ret;
>>> +
>>> +       if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       ret = snprintf(buf, len, "_bkup");
>>> +       if (ret >= len)
>>> +               return -EOPNOTSUPP;
>>> +
>>> +       return 0;
>>> +}
>>> +
>>>   static const struct net_device_ops virtnet_netdev = {
>>>          .ndo_open            = virtnet_open,
>>>          .ndo_stop            = virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev =
>>> {
>>>          .ndo_xdp_xmit           = virtnet_xdp_xmit,
>>>          .ndo_xdp_flush          = virtnet_xdp_flush,
>>>          .ndo_features_check     = passthru_features_check,
>>> +       .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>>   };
>>>
>>>   static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device
>>> *vdev)
>>>          return 0;
>>>   }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + */
>>> +
>>> +/* bypass state maintained when BACKUP feature is enabled */
>>> +struct virtnet_bypass_info {
>>> +       /* passthru netdev with same MAC */
>>> +       struct net_device __rcu *active_netdev;
>>> +
>>> +       /* virtio_net netdev */
>>> +       struct net_device __rcu *backup_netdev;
>>> +
>>> +       /* active netdev stats */
>>> +       struct rtnl_link_stats64 active_stats;
>>> +
>>> +       /* backup netdev stats */
>>> +       struct rtnl_link_stats64 backup_stats;
>>> +
>>> +       /* aggregated stats */
>>> +       struct rtnl_link_stats64 bypass_stats;
>>> +
>>> +       /* spinlock while updating stats */
>>> +       spinlock_t stats_lock;
>>> +};
>>> +
>>> +static void virtnet_bypass_child_open(struct net_device *dev,
>>> +                                     struct net_device *child_netdev)
>>> +{
>>> +       int err = dev_open(child_netdev);
>>> +
>>> +       if (err)
>>> +               netdev_warn(dev, "unable to open slave: %s: %d\n",
>>> +                           child_netdev->name, err);
>>> +}
>>
>> Why ignoring the error?? i.e. virtnet_bypass_child_open should have
>> return value. I don't believe the caller is in a safe context to
>> guarantee the dev_open always succeeds.
>
>
> OK.  Will handle this in the next revision.
>
>
>>
>>> +
>>> +static int virtnet_bypass_open(struct net_device *dev)
>>> +{
>>> +       struct virtnet_bypass_info *vbi = netdev_priv(dev);
>>> +       struct net_device *child_netdev;
>>> +
>>> +       netif_carrier_off(dev);
>>> +       netif_tx_wake_all_queues(dev);
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->active_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>
>> Handle the error?
>>
>>> +
>>> +       child_netdev = rtnl_dereference(vbi->backup_netdev);
>>> +       if (child_netdev)
>>> +               virtnet_bypass_child_open(dev, child_netdev);
>>
>> Handle the error and unwind?
>
>
> Sure.
>
> <snip>
>
>
>> +
>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> +{
>> +       struct virtnet_bypass_info *vbi;
>> +       struct net_device *dev;
>> +       bool backup;
>> +       int ret;
>> +
>> +       if (child_netdev->addr_len != ETH_ALEN)
>> +               return NOTIFY_DONE;
>> +
>> +       /* We will use the MAC address to locate the virtnet_bypass netdev
>> +        * to associate with the child netdev. If we don't find a matching
>> +        * bypass netdev, move on.
>> +        */
>> +       dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> +                                      child_netdev->perm_addr);
>> +       if (!dev)
>> +               return NOTIFY_DONE;
>> +
>> +       vbi = netdev_priv(dev);
>> +       backup = (child_netdev->dev.parent == dev->dev.parent);
>> +       if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> +                       rtnl_dereference(vbi->active_netdev)) {
>> +               netdev_info(dev,
>> +                           "%s attempting to join bypass dev when %s
>> already present\n",
>> +                           child_netdev->name, backup ? "backup" :
>> "active");
>> +               return NOTIFY_DONE;
>> +       }
>> +
>> +       /* Avoid non pci devices as active netdev */
>> +       if (!backup && (!child_netdev->dev.parent ||
>> +                       !dev_is_pci(child_netdev->dev.parent)))
>> +               return NOTIFY_DONE;
>> +
>> There's a problem here in terms of error (particularly on the active
>> slave, e.g. VF), see below:
>>
>>> +       ret = netdev_rx_handler_register(child_netdev,
>>> +                                        virtnet_bypass_handle_frame,
>>> dev);
>>> +       if (ret != 0) {
>>> +               netdev_err(child_netdev,
>>> +                          "can not register bypass receive handler (err
>>> = %d)\n",
>>> +                          ret);
>>> +               goto rx_handler_failed;
>>> +       }
>>> +
>>> +       ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>>> +       if (ret != 0) {
>>> +               netdev_err(child_netdev,
>>> +                          "can not set master device %s (err = %d)\n",
>>> +                          dev->name, ret);
>>> +               goto upper_link_failed;
>>> +       }
>>> +
>>> +       child_netdev->flags |= IFF_SLAVE;
>>> +
>>> +       if (netif_running(dev)) {
>>> +               ret = dev_open(child_netdev);
>>> +               if (ret && (ret != -EBUSY)) {
>>> +                       netdev_err(dev, "Opening child %s failed
>>> ret:%d\n",
>>> +                                  child_netdev->name, ret);
>>> +                       goto err_interface_up;
>>> +               }
>>> +       }
>>> +
>>> +       /* Align MTU of child with master */
>>> +       ret = dev_set_mtu(child_netdev, dev->mtu);
>>> +       if (ret) {
>>> +               netdev_err(dev,
>>> +                          "unable to change mtu of %s to %u register
>>> failed\n",
>>> +                          child_netdev->name, dev->mtu);
>>> +               goto err_set_mtu;
>>> +       }
>>
>> If any of the above calls returns non-zero, the current code steps
>> back to undo what's being done on that spefic slave previously. For
>> instance, if the netdev_rx_handler_register returns non-zero because
>> of busy rx handler, this register_child function would give up
>> enslaving the VF and leave the upper virtio_bypass interface behind
>> once it returns.
>
>
> virtio_bypass interface is the upper master netdev and it is always present
> when
> BACKUP feature is enabled.
> If there is failure with enslaving VF, there will be 2 netdevs,
> master virtio_bypass and slave virtio_net and the VM can be migrated.

Failure to enslave the VF does not mean its netdev disappears.
That means three separate netdevs are left behind: a master
virtio_bypass with a slave virtio_net, plus a standalone VF. In this
case the VF is still usable, as its datapath is still there, while
the datapath for virtio_* is on standby. Which interface do you expect
the user to continue to use in case of failure?

If you choose the VF interface, virtio_net should detect that
condition and hide the virtio interfaces from the user, presumably by
unregistering those two virtio netdevs (the master virtio_bypass and
the slave virtio_net).

If instead the user has to use virtio_bypass, how do you prevent users
from using the VF, or, consequently, indicate/propagate the failure to
the backend and switch the host datapath to virtio?

I don't see either of the above being handled in your patch. Unless I
am missing something obvious, I don't think this patch is close to
achieving the goal in terms of completeness.

>
>
>> I am not sure if it's a good idea to leave the
>> virtio_bypass around if running into failure: the guest is not
>> migratable as the VF doesn't have a backup path,
>
>
> Are you talking about a failure when registering backup netdev?  This should

No, see above. I was mainly concerned with the failure in enslaving VF
(active netdev).

If users continue to use raw VF interface within the VM, the guest
cannot be migrated for sure.

> not
> happen, but i guess we can improve error handling in such scenario.
>
>
>> And perhaps the worse
>> part is that, it now has two interfaces with identical MAC address but
>> one of them is invalid (user cannot use the virtio interface as it has
>> a dampened datapath). IMHO the virtio_bypass and its lower netdev
>> should be destroyed at all when it fails to bind the VF, and
>> technically, there should be some way to propogate the failure status
>> to the hypervisor/backend, indicating that the VM is not migratable
>> because of guest software errors (maybe by clearing out the backup
>> feature from the guest virtio driver so host can see/learn it).
>
>
> In BACKUP mode, user can only use the upper virtio_bypass netdev and that
> will
> always be there. Any failure to enslave VF netdev is not fatal, but i will
> see
> if we can improve the error handling of failure to enslave backup netdev.
> Also, i don't think the BACKUP feature bit is negotiable with the host.

I don't believe that is the case. In any event, you should come up
with some means to indicate to the hypervisor that some VFs are
still left unbound from virtio, so QEMU knows the VM cannot be live
migrated when the host user attempts to do so. Or, alternatively,
propagate the failure to the host, unplug the VF and switch the host
datapath to virtio.  Please work out a complete solution to address
the error case.

Regards,
-Siwei

>
> Thanks
> Sridhar
>
>

* Re: [PATCH 3/7] RDMA/qedr: eliminate duplicate barriers on weakly-ordered archs
From: Jason Gunthorpe @ 2018-03-14  4:12 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Michal Kalderon, Ariel Elior, Doug Ledford, linux-rdma,
	linux-kernel
In-Reply-To: <1520997629-17361-3-git-send-email-okaya@codeaurora.org>

On Tue, Mar 13, 2018 at 11:20:24PM -0400, Sinan Kaya wrote:
> Code includes wmb() followed by writel() in multiple places. writel()
> already has a barrier on some architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>  drivers/infiniband/hw/qedr/verbs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Sure matches my understanding of writel_relaxed

This is part of a series, should we take just this patch through the
rdma tree? If not:

Acked-by: Jason Gunthorpe <jgg@mellanox.com>

Thanks,
Jason

* Re: [bug, bisected] pfifo_fast causes packet reordering
From: John Fastabend @ 2018-03-14  4:03 UTC (permalink / raw)
  To: Dave Taht, Jakob Unterwurzacher
  Cc: netdev, linux-kernel, David S. Miller, linux-can@vger.kernel.org,
	Martin Elshuber
In-Reply-To: <CAA93jw5FSbvg8AZ=EmneW8juqd8LCgp=xUrSxYm2GBEucKkT3w@mail.gmail.com>

On 03/13/2018 11:35 AM, Dave Taht wrote:
> On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
> <jakob.unterwurzacher@theobroma-systems.com> wrote:
>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
>> delivered out-of-order.
>>

Is the stress-testing tool available somewhere? What type of packets
are being sent?


>> We have tracked the problem down to the driver interface level, and it seems
>> that the driver's net_device_ops.ndo_start_xmit() function gets the packets
>> handed over in the wrong order.
>>
>> This behavior was not observed on Linux v4.15 and I have bisected the
>> problem down to this patch:
>>
>>> commit c5ad119fb6c09b0297446be05bd66602fa564758
>>> Author: John Fastabend <john.fastabend@gmail.com>
>>> Date:   Thu Dec 7 09:58:19 2017 -0800
>>>
>>>    net: sched: pfifo_fast use skb_array
>>>
>>>    This converts the pfifo_fast qdisc to use the skb_array data structure
>>>    and set the lockless qdisc bit. pfifo_fast is the first qdisc to
>>> support
>>>    the lockless bit that can be a child of a qdisc requiring locking. So
>>>    we add logic to clear the lock bit on initialization in these cases
>>> when
>>>    the qdisc graft operation occurs.
>>>
>>>    This also removes the logic used to pick the next band to dequeue from
>>>    and instead just checks a per priority array for packets from top
>>> priority
>>>    to lowest. This might need to be a bit more clever but seems to work
>>>    for now.
>>>
>>>    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>>    Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>>
>> The patch does not revert cleanly, but moving to one commit earlier makes
>> the problem go away.
>>
>> Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem go
>> away as well.
> 

Is this a single queue device or a multiqueue device? Running
'tc -s qdisc show dev foo' would help some.

> I am of course, a fan of obsoleting pfifo_fast. There's no good reason
> for it anymore.
> 
>>
>> Is this an unintended side-effect of the patch or is there something the
>> driver has to do to request in-order delivery?
>>
If we introduced an out-of-order (OOO) edge case somewhere, that was
not intended, so I'll take a look into it. But if you can provide
a few more details on how the stress testing is done to cause the
issue, that would help.

Thanks,
John

>> Thanks,
>> Jakob
> 
> 
> 

* Re: [RESEND] rsi: Remove stack VLA usage
From: Tobin C. Harding @ 2018-03-14  3:43 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Shevchenko, Kalle Valo, Kernel Hardening,
	Linux Kernel Mailing List, netdev, open list:TI WILINK WIRELES...,
	Tycho Andersen, Konstantin Ryabitsev
In-Reply-To: <CAGXu5jJsDsegY+EZFq1Hip+sRNLR5SS+-Euf1rm22okAq7REJw@mail.gmail.com>

Added Konstantin in case he is in charge of administering patchwork.kernel.org.

On Tue, Mar 13, 2018 at 07:53:34PM -0700, Kees Cook wrote:
> On Tue, Mar 13, 2018 at 7:11 PM, Tobin C. Harding <me@tobin.cc> wrote:
> > On Tue, Mar 13, 2018 at 11:00:47PM +0200, Andy Shevchenko wrote:
> >> On Tue, Mar 13, 2018 at 10:17 PM, tcharding <me@tobin.cc> wrote:
> >> > On Mon, Mar 12, 2018 at 09:46:06AM +0000, Kalle Valo wrote:
> >> >> tcharding <me@tobin.cc> wrote:
> >>
> >> I'm pretty much sure it depends on the original email headers, like
> >> above ^^^ — no name.
> >> Perhaps git config on your side should be done.
> >
> > Thanks for the suggestion Andy but the 'tcharding' as the name was
> > munged by either Kalle or patchwork.  I'm guessing patchwork.
> 
> Something you're sending from is using "tcharding" (see the email Andy
> quotes). I see the headers as:
> 
> Date: Wed, 14 Mar 2018 07:17:57 +1100
> From: tcharding <me@tobin.cc>
> ...
> Message-ID: <20180313201757.GK8631@eros>
> X-Mailer: Mutt 1.5.24 (2015-08-30)
> User-Agent: Mutt/1.5.24 (2015-08-30)
> 
> Your most recent email shows "Tobin C. Harding" though, and was also
> sent with Mutt...
>
> Do you have multiple Mutt configurations? Is something lacking a
> "From" header insertion and your MTA is filling it in for you from
> your username?

Thanks for taking the time to respond, Kees (and Tycho).  I have mutt
configured to reply from whichever email address I receive at, so if
patchwork sent an email to 'tcharding <me@tobin.cc>' (which are the
details it has) and I hit reply, it would have come from 'tcharding',
hence Andy's reply.  I wouldn't bet my life on it, but I'm kinda
confident that I cannot initiate an email from 'tcharding' with my
current setup.

It's super bad form to blame someone (or something) else, but I think
this is a problem with how my patchwork account is configured.  Either
way, that is still my fault: I should have added my real name to
patchwork when I signed up (not just the username 'tcharding').

Is patchwork.kernel.org administered by Konstantin Ryabitsev? Added
Konstantin to CC's.

thanks,
Tobin.

^ permalink raw reply

* [PATCH RFC bpf-next 5/6] selftests/bpf: Selftest for sys_connect hooks
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

From: Andrey Ignatov <rdna@fb.com>

Add selftest for BPF_CGROUP_INET4_CONNECT and BPF_CGROUP_INET6_CONNECT
attach types.

Try to connect(2) to specified IP:port and test that:
* remote IP:port pair is overridden;
* local end of connection is bound to specified IP.

All combinations of IPv4/IPv6 and TCP/UDP are tested.
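
The rewrite addresses used by the test programs are written as
host-order constants (SRC_REWRITE_IP4 is 0x7f000004U, the IPv6 source is
the words 0, 0, 0, 6) and converted with bpf_htonl(). As a minimal
userspace sketch, assuming libc htonl() and inet_ntop() as stand-ins for
the BPF-side conversion, the following shows how such constants map to
the 127.0.0.4 and ::6 addresses that appear in the example output. The
helper names ip4_to_str() and ip6_to_str() are hypothetical and not part
of the patch:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: format a host-order IPv4 constant as dotted quad. */
static const char *ip4_to_str(uint32_t host_order, char *buf, socklen_t len)
{
	struct in_addr a;

	assert(buf != NULL);
	/* Userspace stand-in for the bpf_htonl() conversion in the test progs. */
	a.s_addr = htonl(host_order);
	return inet_ntop(AF_INET, &a, buf, len);
}

/* Hypothetical helper: format four host-order 32-bit words as IPv6 text. */
static const char *ip6_to_str(const uint32_t words[4], char *buf, socklen_t len)
{
	struct in6_addr a;
	int i;

	assert(buf != NULL);
	for (i = 0; i < 4; i++) {
		uint32_t be = htonl(words[i]);

		/* Copy each big-endian word into its 4-byte slot. */
		memcpy(&a.s6_addr[i * 4], &be, sizeof(be));
	}
	return inet_ntop(AF_INET6, &a, buf, len);
}
```

Calling ip4_to_str(0x7f000004U, buf, sizeof(buf)) with a buffer of at
least INET6_ADDRSTRLEN bytes is one way to double-check a constant before
putting it into a BPF program.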

Example:
  # tcpdump -pn -i lo -w connect.pcap 2>/dev/null &
  [1] 271
  # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh
  Wait for testing IPv4/IPv6 to become available ... OK
  Attached bind4 program.
  Attached connect4 program.
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 45780)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 44708)
  Attached bind6 program.
  Attached connect6 program.
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *) .
             Actual: connect(::1, 6666) from (::6, 37510)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *) .
             Actual: connect(::1, 6666) from (::6, 51749)
  ### SUCCESS
  # egrep 'connect\(.*AF_INET' connect.trace | \
      egrep -vw 'htons\(1025\)' | fold -b -s -w 72
  295   connect(7, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  295   connect(8, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  295   connect(9, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  295   connect(10, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  # fg
  tcpdump -pn -i lo -w connect.pcap 2> /dev/null
  # tcpdump -r connect.pcap -n tcp | cut -c 1-72
  reading from file connect.pcap, link-type EN10MB (Ethernet)
  17:20:03.346047 IP 127.0.0.4.45780 > 127.0.0.1.4444: Flags [S], seq 2460
  17:20:03.346084 IP 127.0.0.1.4444 > 127.0.0.4.45780: Flags [S.], seq 320
  17:20:03.346110 IP 127.0.0.4.45780 > 127.0.0.1.4444: Flags [.], ack 1, w
  17:20:03.347218 IP 127.0.0.1.4444 > 127.0.0.4.45780: Flags [R.], seq 1,
  17:20:03.356698 IP6 ::6.37510 > ::1.6666: Flags [S], seq 2155424486, win
  17:20:03.356733 IP6 ::1.6666 > ::6.37510: Flags [S.], seq 1308562754, ac
  17:20:03.356752 IP6 ::6.37510 > ::1.6666: Flags [.], ack 1, win 342, opt
  17:20:03.357639 IP6 ::1.6666 > ::6.37510: Flags [R.], seq 1, ack 1, win

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h                | 15 +++++-
 tools/testing/selftests/bpf/Makefile          |  5 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |  2 +
 tools/testing/selftests/bpf/connect4_prog.c   | 45 ++++++++++++++++
 tools/testing/selftests/bpf/connect6_prog.c   | 61 +++++++++++++++++++++
 tools/testing/selftests/bpf/test_sock_addr.c  | 78 +++++++++++++++++++++++++++
 tools/testing/selftests/bpf/test_sock_addr.sh | 57 ++++++++++++++++++++
 7 files changed, 260 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
 create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
 create mode 100755 tools/testing/selftests/bpf/test_sock_addr.sh

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a6af06bb5efb..11e1b633808a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -135,6 +135,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_CGROUP_INET4_BIND,
 	BPF_PROG_TYPE_CGROUP_INET6_BIND,
+	BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
+	BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
 };
 
 enum bpf_attach_type {
@@ -147,6 +149,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_DEVICE,
 	BPF_CGROUP_INET4_BIND,
 	BPF_CGROUP_INET6_BIND,
+	BPF_CGROUP_INET4_CONNECT,
+	BPF_CGROUP_INET6_CONNECT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -700,6 +704,14 @@ union bpf_attr {
  * int bpf_override_return(pt_regs, rc)
  *	@pt_regs: pointer to struct pt_regs
  *	@rc: the return value to set
+ *
+ * int bpf_bind(ctx, addr, addr_len)
+ *     Bind socket to address. Only binding to IP is supported, no port can be
+ *     set in addr.
+ *     @ctx: pointer to context of type bpf_sock_addr
+ *     @addr: pointer to struct sockaddr to bind socket to
+ *     @addr_len: length of sockaddr structure
+ *     Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -761,7 +773,8 @@ union bpf_attr {
 	FN(perf_prog_read_value),	\
 	FN(getsockopt),			\
 	FN(override_return),		\
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set),	\
+	FN(bind),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f319b67fd0f6..a3f8d40647f2 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -21,14 +21,15 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
 	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
-	sample_map_ret0.o test_tcpbpf_kern.o
+	sample_map_ret0.o test_tcpbpf_kern.o connect4_prog.o connect6_prog.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
 	test_libbpf.sh \
 	test_xdp_redirect.sh \
 	test_xdp_meta.sh \
-	test_offload.py
+	test_offload.py \
+	test_sock_addr.sh
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index dde2c11d7771..edf4971554e2 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -86,6 +86,8 @@ static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
 	(void *) BPF_FUNC_perf_prog_read_value;
 static int (*bpf_override_return)(void *ctx, unsigned long rc) =
 	(void *) BPF_FUNC_override_return;
+static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
+	(void *) BPF_FUNC_bind;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/connect4_prog.c b/tools/testing/selftests/bpf/connect4_prog.c
new file mode 100644
index 000000000000..d6507e504f91
--- /dev/null
+++ b/tools/testing/selftests/bpf/connect4_prog.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <string.h>
+
+#include <linux/stddef.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <sys/socket.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define SRC_REWRITE_IP4		0x7f000004U
+#define DST_REWRITE_IP4		0x7f000001U
+#define DST_REWRITE_PORT4	4444
+
+int _version SEC("version") = 1;
+
+SEC("connect4")
+int connect_v4_prog(struct bpf_sock_addr *ctx)
+{
+	struct sockaddr_in sa;
+
+	/* Rewrite destination. */
+	ctx->user_ip4 = bpf_htonl(DST_REWRITE_IP4);
+	ctx->user_port = bpf_htons(DST_REWRITE_PORT4);
+
+	if (ctx->type == SOCK_DGRAM || ctx->type == SOCK_STREAM) {
+		/* Rewrite source. */
+		memset(&sa, 0, sizeof(sa));
+
+		sa.sin_family = AF_INET;
+		sa.sin_port = bpf_htons(0);
+		sa.sin_addr.s_addr = bpf_htonl(SRC_REWRITE_IP4);
+
+		if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
+			return 0;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/connect6_prog.c b/tools/testing/selftests/bpf/connect6_prog.c
new file mode 100644
index 000000000000..553b2f630c88
--- /dev/null
+++ b/tools/testing/selftests/bpf/connect6_prog.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <string.h>
+
+#include <linux/stddef.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <sys/socket.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define SRC_REWRITE_IP6_0	0
+#define SRC_REWRITE_IP6_1	0
+#define SRC_REWRITE_IP6_2	0
+#define SRC_REWRITE_IP6_3	6
+
+#define DST_REWRITE_IP6_0	0
+#define DST_REWRITE_IP6_1	0
+#define DST_REWRITE_IP6_2	0
+#define DST_REWRITE_IP6_3	1
+
+#define DST_REWRITE_PORT6	6666
+
+int _version SEC("version") = 1;
+
+SEC("connect6")
+int connect_v6_prog(struct bpf_sock_addr *ctx)
+{
+	struct sockaddr_in6 sa;
+
+	/* Rewrite destination. */
+	ctx->user_ip6[0] = bpf_htonl(DST_REWRITE_IP6_0);
+	ctx->user_ip6[1] = bpf_htonl(DST_REWRITE_IP6_1);
+	ctx->user_ip6[2] = bpf_htonl(DST_REWRITE_IP6_2);
+	ctx->user_ip6[3] = bpf_htonl(DST_REWRITE_IP6_3);
+
+	ctx->user_port = bpf_htons(DST_REWRITE_PORT6);
+
+	if (ctx->type == SOCK_DGRAM || ctx->type == SOCK_STREAM) {
+		/* Rewrite source. */
+		memset(&sa, 0, sizeof(sa));
+
+		sa.sin6_family = AF_INET6;
+		sa.sin6_port = bpf_htons(0);
+
+		sa.sin6_addr.s6_addr32[0] = bpf_htonl(SRC_REWRITE_IP6_0);
+		sa.sin6_addr.s6_addr32[1] = bpf_htonl(SRC_REWRITE_IP6_1);
+		sa.sin6_addr.s6_addr32[2] = bpf_htonl(SRC_REWRITE_IP6_2);
+		sa.sin6_addr.s6_addr32[3] = bpf_htonl(SRC_REWRITE_IP6_3);
+
+		if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
+			return 0;
+	}
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
index 18ea250484dc..d4a7ba1242ab 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -12,10 +12,13 @@
 #include <linux/filter.h>
 
 #include <bpf/bpf.h>
+#include <bpf/libbpf.h>
 
 #include "cgroup_helpers.h"
 
 #define CG_PATH	"/foo"
+#define CONNECT4_PROG_PATH	"./connect4_prog.o"
+#define CONNECT6_PROG_PATH	"./connect6_prog.o"
 
 #define SERV4_IP		"192.168.1.254"
 #define SERV4_REWRITE_IP	"127.0.0.1"
@@ -245,6 +248,31 @@ static int bind6_prog_load(void)
 			  "bind() for AF_INET6");
 }
 
+static int connect_prog_load_path(const char *path, enum bpf_prog_type type)
+{
+	struct bpf_object *obj;
+	int prog_fd;
+
+	if (bpf_prog_load(path, type, &obj, &prog_fd)) {
+		log_err(">>> Loading %s program error.\n", path);
+		return -1;
+	}
+
+	return prog_fd;
+}
+
+static int connect4_prog_load(void)
+{
+	return connect_prog_load_path(CONNECT4_PROG_PATH,
+				      BPF_PROG_TYPE_CGROUP_INET4_CONNECT);
+}
+
+static int connect6_prog_load(void)
+{
+	return connect_prog_load_path(CONNECT6_PROG_PATH,
+				      BPF_PROG_TYPE_CGROUP_INET6_CONNECT);
+}
+
 static void print_ip_port(int sockfd, info_fn fn, const char *fmt)
 {
 	char addr_buf[INET_NTOP_BUF];
@@ -281,6 +309,11 @@ static void print_local_ip_port(int sockfd, const char *fmt)
 	print_ip_port(sockfd, getsockname, fmt);
 }
 
+static void print_remote_ip_port(int sockfd, const char *fmt)
+{
+	print_ip_port(sockfd, getpeername, fmt);
+}
+
 static int start_server(int type, const struct sockaddr_storage *addr,
 			socklen_t addr_len)
 {
@@ -315,6 +348,39 @@ static int start_server(int type, const struct sockaddr_storage *addr,
 	return fd;
 }
 
+static int connect_to_server(int type, const struct sockaddr_storage *addr,
+			     socklen_t addr_len)
+{
+	int domain;
+	int fd;
+
+	domain = addr->ss_family;
+
+	if (domain != AF_INET && domain != AF_INET6) {
+		log_err("Unsupported address family");
+		return -1;
+	}
+
+	fd = socket(domain, type, 0);
+	if (fd == -1) {
+		log_err("Failed to create client socket");
+		return -1;
+	}
+
+	if (connect(fd, (const struct sockaddr *)addr, addr_len) == -1) {
+		log_err("Failed to connect to server");
+		goto err;
+	}
+
+	print_remote_ip_port(fd, "\t   Actual: connect(%s, %d)");
+	print_local_ip_port(fd, " from (%s, %d)\n");
+
+	return 0;
+err:
+	close(fd);
+	return -1;
+}
+
 static void print_test_case_num(int domain, int type)
 {
 	static int test_num;
@@ -347,6 +413,10 @@ static int run_test_case(int domain, int type, const char *ip,
 	if (servfd == -1)
 		goto err;
 
+	printf("\tRequested: connect(%s, %d) from (*, *) ..\n", ip, port);
+	if (connect_to_server(type, &addr, addr_len))
+		goto err;
+
 	goto out;
 err:
 	err = -1;
@@ -421,11 +491,13 @@ static int run_test(void)
 
 	struct program inet6_progs[] = {
 		{BPF_CGROUP_INET6_BIND, bind6_prog_load, -1, "bind6"},
+		{BPF_CGROUP_INET6_CONNECT, connect6_prog_load, -1, "connect6"},
 	};
 	inet6_prog_cnt = sizeof(inet6_progs) / sizeof(struct program);
 
 	struct program inet_progs[] = {
 		{BPF_CGROUP_INET4_BIND, bind4_prog_load, -1, "bind4"},
+		{BPF_CGROUP_INET4_CONNECT, connect4_prog_load, -1, "connect4"},
 	};
 	inet_prog_cnt = sizeof(inet_progs) / sizeof(struct program);
 
@@ -459,5 +531,11 @@ static int run_test(void)
 
 int main(int argc, char **argv)
 {
+	if (argc < 2) {
+		fprintf(stderr,
+			"%s has to be run via %s.sh. Skip direct run.\n",
+			argv[0], argv[0]);
+		exit(0);
+	}
 	return run_test();
 }
diff --git a/tools/testing/selftests/bpf/test_sock_addr.sh b/tools/testing/selftests/bpf/test_sock_addr.sh
new file mode 100755
index 000000000000..c6e1dcf992c4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_addr.sh
@@ -0,0 +1,57 @@
+#!/bin/sh
+
+set -eu
+
+ping_once()
+{
+	ping -q -c 1 -W 1 ${1%%/*} >/dev/null 2>&1
+}
+
+wait_for_ip()
+{
+	local _i
+	echo -n "Wait for testing IPv4/IPv6 to become available "
+	for _i in $(seq ${MAX_PING_TRIES}); do
+		echo -n "."
+		if ping_once ${TEST_IPv4} && ping_once ${TEST_IPv6}; then
+			echo " OK"
+			return
+		fi
+	done
+	echo 1>&2 "ERROR: Timeout waiting for test IP to become available."
+	exit 1
+}
+
+setup()
+{
+	# Create testing interfaces not to interfere with current environment.
+	ip link add dev ${TEST_IF} type veth peer name ${TEST_IF_PEER}
+	ip link set ${TEST_IF} up
+	ip link set ${TEST_IF_PEER} up
+
+	ip -4 addr add ${TEST_IPv4} dev ${TEST_IF}
+	ip -6 addr add ${TEST_IPv6} dev ${TEST_IF}
+	wait_for_ip
+}
+
+cleanup()
+{
+	ip link del ${TEST_IF} 2>/dev/null || :
+	ip link del ${TEST_IF_PEER} 2>/dev/null || :
+}
+
+main()
+{
+	trap cleanup EXIT 2 3 6 15
+	setup
+	./test_sock_addr setup_done
+}
+
+BASENAME=$(basename $0 .sh)
+TEST_IF="${BASENAME}1"
+TEST_IF_PEER="${BASENAME}2"
+TEST_IPv4="127.0.0.4/8"
+TEST_IPv6="::6/128"
+MAX_PING_TRIES=5
+
+main
-- 
2.9.5

^ permalink raw reply related

* [PATCH RFC bpf-next 4/6] bpf: Hooks for sys_connect
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

From: Andrey Ignatov <rdna@fb.com>

== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connections from a desired IP.

It adds new program types `BPF_PROG_TYPE_CGROUP_INET4_CONNECT` and
`BPF_PROG_TYPE_CGROUP_INET6_CONNECT` and corresponding attach types that
can be used to override both source and destination of a connection at
connect(2) time.

The local end of a connection can be bound to a desired IP using the
newly introduced BPF helper `bpf_bind()`. It only allows binding to an
IP, though, and doesn't support binding to a port, i.e. it behaves as if
the `IP_BIND_ADDRESS_NO_PORT` socket option were set. There are two
reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use-case for port.
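
The IP-only restriction can be sketched in userspace as an argument
check. This is only an illustration modelled on the checks the helper
performs in this patch (the port must be zero and the length must cover
the sockaddr of the given family). The function name
sketch_check_bind_addr(), the SKETCH_SIN6_LEN_RFC2133 constant and the
extra minimum-length check are assumptions of the sketch, and the actual
bind is left out:

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Assumption: mirrors the kernel's SIN6_LEN_RFC2133 (24 bytes). */
#define SKETCH_SIN6_LEN_RFC2133 24

/*
 * Hypothetical userspace mirror of the argument checks in bpf_bind():
 * returns 0 if the address is acceptable (IP only, port == 0),
 * -EINVAL for a short or port-carrying address, and
 * -EAFNOSUPPORT for any family other than AF_INET / AF_INET6.
 */
static int sketch_check_bind_addr(const struct sockaddr *addr, size_t addr_len)
{
	assert(addr != NULL);

	/* Extra guard added by this sketch: sa_family must be readable. */
	if (addr_len < sizeof(addr->sa_family))
		return -EINVAL;

	if (addr->sa_family == AF_INET) {
		const struct sockaddr_in *sa4 =
			(const struct sockaddr_in *)addr;

		if (addr_len < sizeof(*sa4))
			return -EINVAL;
		if (sa4->sin_port != htons(0))
			return -EINVAL;	/* binding to a port is refused */
		return 0;
	} else if (addr->sa_family == AF_INET6) {
		const struct sockaddr_in6 *sa6 =
			(const struct sockaddr_in6 *)addr;

		if (addr_len < SKETCH_SIN6_LEN_RFC2133)
			return -EINVAL;
		if (sa6->sin6_port != htons(0))
			return -EINVAL;	/* binding to a port is refused */
		return 0;
	}

	return -EAFNOSUPPORT;
}
```

A caller would fill in only the family and the address, leave the port
at zero, and treat any non-zero return as "do not attempt the bind".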

As for the remote end (`struct sockaddr *` passed by the user), both
parts of it can be overridden: the remote IP and the remote port. This
is useful if an application inside a cgroup wants to connect to another
application inside the same cgroup or to itself, but knows nothing about
the IP assigned to the cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP.

IPv4 and IPv6 have separate program types for the same reason as the
sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6
fields when the user passes a sockaddr_in, since that would be
out-of-bounds.

Program types for sys_bind itself can't be reused for sys_connect either
since there is a difference in allowed helpers to call: `bpf_bind()` is
useful to call from sys_connect hooks, but doesn't make sense in
sys_bind hooks.

== Implementation notes ==

The patch introduces a new field in `struct proto`: `pre_connect`, a
pointer to a function with the same signature as `connect` that is
called before it. The reason is that in some cases BPF hooks should be
called well before control is passed to `sk->sk_prot->connect`.
Specifically, `inet_dgram_connect` autobinds the socket before calling
`sk->sk_prot->connect`, so `bpf_bind()` cannot be called from hooks in
e.g. `ip4_datagram_connect` or `ip6_datagram_connect`, since that would
cause a double-bind. `proto.pre_connect`, on the other hand, provides a
flexible way to add BPF hooks for connect only for the necessary `proto`
and to call them at the desired time before `connect`. Since `bpf_bind()`
is allowed to bind only to an IP and the autobind in `inet_dgram_connect`
binds only a port, there is no chance of a double-bind.
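
The ordering argument can be illustrated with a small userspace toy
model. It is a sketch under stated assumptions, not kernel code: all
`toy_*` names are hypothetical, the "socket" is a plain struct, and the
hard-coded IP and port values are arbitrary. It only shows that an
optional `pre_connect` step which sets the IP, followed by an autobind
which sets only the port, leaves each half of the local address bound
exactly once:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct toy_proto;

/* Toy socket: 0 in src_ip / src_port means "not bound yet". */
struct toy_sock {
	uint32_t src_ip;
	uint16_t src_port;
	int connected;
	const struct toy_proto *prot;
};

/* Toy counterpart of struct proto with the new optional callback. */
struct toy_proto {
	int (*pre_connect)(struct toy_sock *sk);	/* may be NULL */
	int (*connect)(struct toy_sock *sk);		/* required */
};

/* Stand-in for a BPF hook calling bpf_bind(): binds the IP only. */
static int toy_pre_connect(struct toy_sock *sk)
{
	if (sk->src_ip != 0)
		return -EINVAL;		/* would be a double-bind of the IP */
	sk->src_ip = 0x7f000004U;	/* arbitrary example: 127.0.0.4 */
	return 0;
}

/* Stand-in for autobind: picks a port only, never touches the IP. */
static int toy_autobind(struct toy_sock *sk)
{
	sk->src_port = 40000;		/* arbitrary example port */
	return 0;
}

static int toy_connect(struct toy_sock *sk)
{
	sk->connected = 1;
	return 0;
}

/* Same order as described above: pre_connect, autobind, connect. */
static int toy_dgram_connect(struct toy_sock *sk)
{
	int err;

	assert(sk != NULL && sk->prot != NULL && sk->prot->connect != NULL);

	if (sk->prot->pre_connect) {
		err = sk->prot->pre_connect(sk);
		if (err)
			return err;
	}
	if (!sk->src_port && toy_autobind(sk))
		return -EAGAIN;
	return sk->prot->connect(sk);
}

static const struct toy_proto toy_udp_prot = {
	.pre_connect	= toy_pre_connect,
	.connect	= toy_connect,
};
```

Pointing a zero-initialised `struct toy_sock` at `toy_udp_prot` and
calling `toy_dgram_connect()` on it walks through all three steps.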

`bpf_bind()` sets `force_bind_address_no_port` to bind to only IP
regardless of the value of the `bind_address_no_port` socket field.

`bpf_bind()` sets `with_lock` to `false` when calling `__inet_bind()`
and `__inet6_bind()` since all call sites where `bpf_bind()` is called
already hold the socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf-cgroup.h | 31 +++++++++++++++++++++
 include/linux/bpf_types.h  |  2 ++
 include/net/sock.h         |  3 +++
 include/net/udp.h          |  1 +
 include/uapi/linux/bpf.h   | 15 ++++++++++-
 kernel/bpf/syscall.c       | 14 ++++++++++
 kernel/bpf/verifier.c      |  2 ++
 net/core/filter.c          | 67 ++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         | 13 +++++++++
 net/ipv4/tcp_ipv4.c        | 16 +++++++++++
 net/ipv4/udp.c             | 14 ++++++++++
 net/ipv6/tcp_ipv6.c        | 16 +++++++++++
 net/ipv6/udp.c             | 20 ++++++++++++++
 13 files changed, 213 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index dd0cfbddcfbe..6b5c25ef1482 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -116,12 +116,38 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type)			       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled)	{					       \
+		lock_sock(sk);						       \
+		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);    \
+		release_sock(sk);					       \
+	}								       \
+	__ret;								       \
+})
+
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)			       \
 	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
 
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)			       \
 	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
 
+#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (cgroup_bpf_enabled && \
+					    sk->sk_prot->pre_connect)
+
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
+
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr)		       \
+	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
 ({									       \
 	int __ret = 0;							       \
@@ -151,11 +177,16 @@ struct cgroup_bpf {};
 static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 
+#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0)
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index eefd877f8e68..52a571827b9f 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_CONNECT, cg_inet4_connect)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_CONNECT, cg_inet6_connect)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
diff --git a/include/net/sock.h b/include/net/sock.h
index b9624581d639..997259e0ecae 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1026,6 +1026,9 @@ static inline void sk_prot_clear_nulls(struct sock *sk, int size)
 struct proto {
 	void			(*close)(struct sock *sk,
 					long timeout);
+	int			(*pre_connect)(struct sock *sk,
+					struct sockaddr *uaddr,
+					int addr_len);
 	int			(*connect)(struct sock *sk,
 					struct sockaddr *uaddr,
 					int addr_len);
diff --git a/include/net/udp.h b/include/net/udp.h
index 850a8e581cce..0676b272f6ac 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -273,6 +273,7 @@ void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 int udp_init_sock(struct sock *sk);
+int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int __udp_disconnect(struct sock *sk, int flags);
 int udp_disconnect(struct sock *sk, int flags);
 __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 78628a3f3cd8..441a674f385a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -135,6 +135,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_CGROUP_INET4_BIND,
 	BPF_PROG_TYPE_CGROUP_INET6_BIND,
+	BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
+	BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
 };
 
 enum bpf_attach_type {
@@ -147,6 +149,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_DEVICE,
 	BPF_CGROUP_INET4_BIND,
 	BPF_CGROUP_INET6_BIND,
+	BPF_CGROUP_INET4_CONNECT,
+	BPF_CGROUP_INET6_CONNECT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -700,6 +704,14 @@ union bpf_attr {
  * int bpf_override_return(pt_regs, rc)
  *	@pt_regs: pointer to struct pt_regs
  *	@rc: the return value to set
+ *
+ * int bpf_bind(ctx, addr, addr_len)
+ *     Bind socket to address. Only binding to IP is supported, no port can be
+ *     set in addr.
+ *     @ctx: pointer to context of type bpf_sock_addr
+ *     @addr: pointer to struct sockaddr to bind socket to
+ *     @addr_len: length of sockaddr structure
+ *     Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -761,7 +773,8 @@ union bpf_attr {
 	FN(perf_prog_read_value),	\
 	FN(getsockopt),			\
 	FN(override_return),		\
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set),	\
+	FN(bind),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7f86542aa42c..145de3332e32 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1382,6 +1382,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
+		break;
+	case BPF_CGROUP_INET6_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_CONNECT;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1443,6 +1449,12 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
+		break;
+	case BPF_CGROUP_INET6_CONNECT:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_CONNECT;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1492,6 +1504,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_INET_SOCK_CREATE:
 	case BPF_CGROUP_INET4_BIND:
 	case BPF_CGROUP_INET6_BIND:
+	case BPF_CGROUP_INET4_CONNECT:
+	case BPF_CGROUP_INET6_CONNECT:
 	case BPF_CGROUP_SOCK_OPS:
 	case BPF_CGROUP_DEVICE:
 		break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 01b54afcb762..cda7830a2c1b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3874,6 +3874,8 @@ static int check_return_code(struct bpf_verifier_env *env)
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_INET4_BIND:
 	case BPF_PROG_TYPE_CGROUP_INET6_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET4_CONNECT:
+	case BPF_PROG_TYPE_CGROUP_INET6_CONNECT:
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 		break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 78907cf3b42f..916195b86a23 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -33,6 +33,7 @@
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/gfp.h>
+#include <net/inet_common.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/netlink.h>
@@ -3400,6 +3401,43 @@ static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_bind, struct bpf_sock_addr_kern *, ctx, struct sockaddr *, addr,
+	   int, addr_len)
+{
+	struct sock *sk = ctx->sk;
+	int err;
+
+	/* Binding to port can be expensive so it's prohibited in the helper.
+	 * Only binding to IP is supported.
+	 */
+
+	err = -EINVAL;
+	if (addr->sa_family == AF_INET) {
+		if (addr_len < sizeof(struct sockaddr_in))
+			return err;
+		if (((struct sockaddr_in *)addr)->sin_port != htons(0))
+			return err;
+		return __inet_bind(sk, addr, addr_len, true, false);
+	} else if (addr->sa_family == AF_INET6) {
+		if (addr_len < SIN6_LEN_RFC2133)
+			return err;
+		if (((struct sockaddr_in6 *)addr)->sin6_port != htons(0))
+			return err;
+		return __inet6_bind(sk, addr, addr_len, true, false);
+	}
+
+	return -EAFNOSUPPORT;
+}
+
+static const struct bpf_func_proto bpf_bind_proto = {
+	.func		= bpf_bind,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3457,6 +3495,17 @@ inet_bind_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
+inet_connect_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_bind:
+		return &bpf_bind_proto;
+	default:
+		return inet_bind_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -5091,6 +5140,24 @@ const struct bpf_verifier_ops cg_inet6_bind_verifier_ops = {
 const struct bpf_prog_ops cg_inet6_bind_prog_ops = {
 };
 
+const struct bpf_verifier_ops cg_inet4_connect_verifier_ops = {
+	.get_func_proto		= inet_connect_func_proto,
+	.is_valid_access	= sock_addr4_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet4_connect_prog_ops = {
+};
+
+const struct bpf_verifier_ops cg_inet6_connect_verifier_ops = {
+	.get_func_proto		= inet_connect_func_proto,
+	.is_valid_access	= sock_addr6_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet6_connect_prog_ops = {
+};
+
 const struct bpf_verifier_ops sock_ops_verifier_ops = {
 	.get_func_proto		= sock_ops_func_proto,
 	.is_valid_access	= sock_ops_is_valid_access,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e203a39d6988..488fe26ac8e5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -547,12 +547,19 @@ int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
 		       int addr_len, int flags)
 {
 	struct sock *sk = sock->sk;
+	int err;
 
 	if (addr_len < sizeof(uaddr->sa_family))
 		return -EINVAL;
 	if (uaddr->sa_family == AF_UNSPEC)
 		return sk->sk_prot->disconnect(sk, flags);
 
+	if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
+		err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
+		if (err)
+			return err;
+	}
+
 	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
 		return -EAGAIN;
 	return sk->sk_prot->connect(sk, uaddr, addr_len);
@@ -633,6 +640,12 @@ int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 		if (sk->sk_state != TCP_CLOSE)
 			goto out;
 
+		if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
+			err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
+			if (err)
+				goto out;
+		}
+
 		err = sk->sk_prot->connect(sk, uaddr, addr_len);
 		if (err < 0)
 			goto out;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2c6aec2643e8..3c11d992d784 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -140,6 +140,21 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 }
 EXPORT_SYMBOL_GPL(tcp_twsk_unique);
 
+static int tcp_v4_pre_connect(struct sock *sk, struct sockaddr *uaddr,
+			      int addr_len)
+{
+	/* This check is replicated from tcp_v4_connect() and intended to
+	 * prevent BPF program called below from accessing bytes that are out
+	 * of the bound specified by user in addr_len.
+	 */
+	if (addr_len < sizeof(struct sockaddr_in))
+		return -EINVAL;
+
+	sock_owned_by_me(sk);
+
+	return BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr);
+}
+
 /* This will initiate an outgoing connection. */
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
@@ -2409,6 +2424,7 @@ struct proto tcp_prot = {
 	.name			= "TCP",
 	.owner			= THIS_MODULE,
 	.close			= tcp_close,
+	.pre_connect		= tcp_v4_pre_connect,
 	.connect		= tcp_v4_connect,
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404d0935..0cbf66deed6f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1664,6 +1664,19 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	goto try_again;
 }
 
+int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+{
+	/* This check is replicated from __ip4_datagram_connect() and
+	 * intended to prevent BPF program called below from accessing bytes
+	 * that are out of the bound specified by user in addr_len.
+	 */
+	if (addr_len < sizeof(struct sockaddr_in))
+		return -EINVAL;
+
+	return BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr);
+}
+EXPORT_SYMBOL(udp_pre_connect);
+
 int __udp_disconnect(struct sock *sk, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -2536,6 +2549,7 @@ struct proto udp_prot = {
 	.name		   = "UDP",
 	.owner		   = THIS_MODULE,
 	.close		   = udp_lib_close,
+	.pre_connect	   = udp_pre_connect,
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5425d7b100ee..6469b741cf5a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -117,6 +117,21 @@ static u32 tcp_v6_init_ts_off(const struct net *net, const struct sk_buff *skb)
 				   ipv6_hdr(skb)->saddr.s6_addr32);
 }
 
+static int tcp_v6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
+			      int addr_len)
+{
+	/* This check is replicated from tcp_v6_connect() and intended to
+	 * prevent BPF program called below from accessing bytes that are out
+	 * of the bound specified by user in addr_len.
+	 */
+	if (addr_len < SIN6_LEN_RFC2133)
+		return -EINVAL;
+
+	sock_owned_by_me(sk);
+
+	return BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr);
+}
+
 static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 			  int addr_len)
 {
@@ -1925,6 +1940,7 @@ struct proto tcpv6_prot = {
 	.name			= "TCPv6",
 	.owner			= THIS_MODULE,
 	.close			= tcp_close,
+	.pre_connect		= tcp_v6_pre_connect,
 	.connect		= tcp_v6_connect,
 	.disconnect		= tcp_disconnect,
 	.accept			= inet_csk_accept,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0e6f50..636904ca63ba 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -957,6 +957,25 @@ static void udp_v6_flush_pending_frames(struct sock *sk)
 	}
 }
 
+static int udpv6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
+			     int addr_len)
+{
+	/* The following checks are replicated from __ip6_datagram_connect()
+	 * and intended to prevent BPF program called below from accessing
+	 * bytes that are out of the bound specified by user in addr_len.
+	 */
+	if (uaddr->sa_family == AF_INET) {
+		if (__ipv6_only_sock(sk))
+			return -EAFNOSUPPORT;
+		return udp_pre_connect(sk, uaddr, addr_len);
+	}
+
+	if (addr_len < SIN6_LEN_RFC2133)
+		return -EINVAL;
+
+	return BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr);
+}
+
 /**
  *	udp6_hwcsum_outgoing  -  handle outgoing HW checksumming
  *	@sk:	socket we are sending on
@@ -1512,6 +1531,7 @@ struct proto udpv6_prot = {
 	.name		   = "UDPv6",
 	.owner		   = THIS_MODULE,
 	.close		   = udp_lib_close,
+	.pre_connect	   = udpv6_pre_connect,
 	.connect	   = ip6_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
-- 
2.9.5

* [PATCH RFC bpf-next 6/6] bpf: Post-hooks for sys_bind
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

From: Andrey Ignatov <rdna@fb.com>

"Post-hooks" are hooks that are called right before returning from
sys_bind. At this point the IP and port are already allocated and no
further changes to `struct sock` can happen before returning from
sys_bind, but a BPF program still has a chance to inspect the socket
and change the sys_bind result.

Specifically, it can e.g. inspect which port was allocated and, if it
doesn't satisfy some policy, force sys_bind to release that port and
return an error to the user.
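
The port-policy case above can be sketched in plain user-space C, so it
compiles and runs outside the kernel. `struct sock_view`,
`post_bind_policy()`, `run_post_bind()` and the 8000-8999 range are
hypothetical names and values chosen for illustration, and the port is
assumed to already be in host byte order. The only part taken from this
series is the convention that a verdict of 1 allows the call and any
other value becomes `-EPERM`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified view of what a post-bind program may read. */
struct sock_view {
	uint32_t family;   /* e.g. AF_INET */
	uint32_t src_port; /* port chosen by bind; host byte order assumed */
};

/* Hypothetical policy: only ports in this range are acceptable. */
enum { POLICY_PORT_MIN = 8000, POLICY_PORT_MAX = 8999 };

/* Program verdict: 1 lets sys_bind succeed, 0 makes it fail. */
static int post_bind_policy(const struct sock_view *sk)
{
	if (sk->src_port < POLICY_PORT_MIN || sk->src_port > POLICY_PORT_MAX)
		return 0;
	return 1;
}

/* Maps the verdict to a result the way the cgroup run helpers do:
 * ret == 1 ? 0 : -EPERM.
 */
static int run_post_bind(const struct sock_view *sk)
{
	return post_bind_policy(sk) == 1 ? 0 : -EPERM;
}
```

A caller would treat a non-zero result from `run_post_bind()` as the
signal to release the port and fail the bind.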

Another example of usage is recording the IP:port pair in a map for
use in later calls to sys_connect. E.g. if a TCP server inside a
cgroup was bound to some IP:port and a TCP client inside the same
cgroup then tries to connect to 127.0.0.1:port, the BPF hook for
sys_connect can override the destination and connect the application
to IP:port instead of 127.0.0.1:port. That helps force all
applications inside a cgroup to use the desired IP without breaking
applications that use e.g. localhost to communicate with each other.
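
The record-then-redirect idea can be sketched as a user-space C
simulation. A fixed-size array stands in for the BPF map, and
`bound_table`, `record_bind()`, `rewrite_connect_dst()` and `MAX_BOUND`
are hypothetical names that are not part of this series. Addresses are
kept in host byte order here purely for readability.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BOUND   16          /* hypothetical table size */
#define LOOPBACK_IP 0x7f000001u /* 127.0.0.1, host byte order */

struct bound_entry {
	uint32_t ip;
	uint16_t port;
	int used;
};

/* Stand-in for a map shared by the post-bind and connect programs. */
static struct bound_entry bound_table[MAX_BOUND];

/* Post-bind side: remember which IP a port was bound to.
 * Returns 1 on success, 0 if the table is full.
 */
static int record_bind(uint32_t ip, uint16_t port)
{
	for (int i = 0; i < MAX_BOUND; i++) {
		if (!bound_table[i].used || bound_table[i].port == port) {
			bound_table[i] = (struct bound_entry){ ip, port, 1 };
			return 1;
		}
	}
	return 0;
}

/* Connect side: redirect 127.0.0.1:port to the recorded IP, if any.
 * Any other destination is returned unchanged.
 */
static uint32_t rewrite_connect_dst(uint32_t dst_ip, uint16_t dst_port)
{
	if (dst_ip != LOOPBACK_IP)
		return dst_ip;
	for (int i = 0; i < MAX_BOUND; i++) {
		if (bound_table[i].used && bound_table[i].port == dst_port)
			return bound_table[i].ip;
	}
	return dst_ip;
}
```

Entries are never deleted in this sketch, so used slots stay contiguous
at the front of the table; a real map would also need removal when a
socket is closed.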

== Implementation details ==

Post-hooks are implemented as two new prog types
`BPF_PROG_TYPE_CGROUP_INET4_POST_BIND` and
`BPF_PROG_TYPE_CGROUP_INET6_POST_BIND` and corresponding attach types
`BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND`.

Separate prog types for IPv4 and IPv6 are introduced to avoid access to
IPv6 fields in `struct sock` from `inet_bind()` and to IPv4 fields from
`inet6_bind()`, since those fields might not make sense in such cases.

The `BPF_PROG_TYPE_CGROUP_SOCK` prog type is not reused because it
provides write access to some `struct sock` fields, but the socket must
not be changed in post-hooks for sys_bind.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf-cgroup.h |  16 ++++-
 include/linux/bpf_types.h  |   2 +
 include/uapi/linux/bpf.h   |  13 ++++
 kernel/bpf/syscall.c       |  14 ++++
 kernel/bpf/verifier.c      |   2 +
 net/core/filter.c          | 170 ++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/af_inet.c         |   3 +-
 net/ipv6/af_inet6.c        |   3 +-
 8 files changed, 202 insertions(+), 21 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 6b5c25ef1482..693c542632e3 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -98,16 +98,24 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 	__ret;								       \
 })
 
-#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
+#define BPF_CGROUP_RUN_SK_PROG(sk, type)				       \
 ({									       \
 	int __ret = 0;							       \
 	if (cgroup_bpf_enabled) {					       \
-		__ret = __cgroup_bpf_run_filter_sk(sk,			       \
-						 BPF_CGROUP_INET_SOCK_CREATE); \
+		__ret = __cgroup_bpf_run_filter_sk(sk, type);		       \
 	}								       \
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET_SOCK_CREATE)
+
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk)				       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET4_POST_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk)				       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET6_POST_BIND)
+
 #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type) 			       \
 ({									       \
 	int __ret = 0;							       \
@@ -183,6 +191,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 52a571827b9f..23a97978b544 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_POST_BIND, cg_inet4_post_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_POST_BIND, cg_inet6_post_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_CONNECT, cg_inet4_connect)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_CONNECT, cg_inet6_connect)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 441a674f385a..7dcc75a65a97 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -137,6 +137,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_INET6_BIND,
 	BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
 	BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
+	BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
+	BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,
 };
 
 enum bpf_attach_type {
@@ -151,6 +153,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET6_BIND,
 	BPF_CGROUP_INET4_CONNECT,
 	BPF_CGROUP_INET6_CONNECT,
+	BPF_CGROUP_INET4_POST_BIND,
+	BPF_CGROUP_INET6_POST_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -903,6 +907,15 @@ struct bpf_sock {
 	__u32 protocol;
 	__u32 mark;
 	__u32 priority;
+	__u32 src_ip4;		/* Allows 1,2,4-byte read.
+				 * Stored in network byte order.
+				 */
+	__u32 src_ip6[4];	/* Allows 1,2,4-byte read.
+				 * Stored in network byte order.
+				 */
+	__u32 src_port;		/* Allows 4-byte read.
+				 * Stored in network byte order
+				 */
 };
 
 #define XDP_PACKET_HEADROOM 256
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 145de3332e32..2eb941dacbc5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1382,6 +1382,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_POST_BIND;
+		break;
+	case BPF_CGROUP_INET6_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_POST_BIND;
+		break;
 	case BPF_CGROUP_INET4_CONNECT:
 		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
 		break;
@@ -1449,6 +1455,12 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET6_BIND:
 		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
 		break;
+	case BPF_CGROUP_INET4_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_POST_BIND;
+		break;
+	case BPF_CGROUP_INET6_POST_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_POST_BIND;
+		break;
 	case BPF_CGROUP_INET4_CONNECT:
 		ptype = BPF_PROG_TYPE_CGROUP_INET4_CONNECT;
 		break;
@@ -1504,6 +1516,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_INET_SOCK_CREATE:
 	case BPF_CGROUP_INET4_BIND:
 	case BPF_CGROUP_INET6_BIND:
+	case BPF_CGROUP_INET4_POST_BIND:
+	case BPF_CGROUP_INET6_POST_BIND:
 	case BPF_CGROUP_INET4_CONNECT:
 	case BPF_CGROUP_INET6_CONNECT:
 	case BPF_CGROUP_SOCK_OPS:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cda7830a2c1b..84faec85fe3e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3874,6 +3874,8 @@ static int check_return_code(struct bpf_verifier_env *env)
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_INET4_BIND:
 	case BPF_PROG_TYPE_CGROUP_INET6_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET4_POST_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET6_POST_BIND:
 	case BPF_PROG_TYPE_CGROUP_INET4_CONNECT:
 	case BPF_PROG_TYPE_CGROUP_INET6_CONNECT:
 	case BPF_PROG_TYPE_SOCK_OPS:
diff --git a/net/core/filter.c b/net/core/filter.c
index 916195b86a23..e27196248c10 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3840,6 +3840,62 @@ static bool sock_filter_is_valid_access(int off, int size,
 	return true;
 }
 
+static bool __sock_is_valid_access(unsigned short ctx_family, int off, int size,
+				   enum bpf_access_type type,
+				   struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+	unsigned short requested_family = 0;
+
+	if (off < 0 || off >= sizeof(struct bpf_sock))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (type != BPF_READ)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sock, src_ip4):
+		requested_family = AF_INET;
+		/* FALLTHROUGH */
+	case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]):
+		if (!requested_family)
+			requested_family = AF_INET6;
+		/* Disallow access to IPv6 fields from IPv4 context and vice
+		 * versa.
+		 */
+		if (requested_family != ctx_family)
+			return false;
+		bpf_ctx_record_field_size(info, size_default);
+		if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+			return false;
+		break;
+	case bpf_ctx_range(struct bpf_sock, family):
+	case bpf_ctx_range(struct bpf_sock, type):
+	case bpf_ctx_range(struct bpf_sock, protocol):
+	case bpf_ctx_range(struct bpf_sock, src_port):
+		if (size != size_default)
+			return false;
+		break;
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+static bool sock4_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info)
+{
+	return __sock_is_valid_access(AF_INET, off, size, type, info);
+}
+
+static bool sock6_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info)
+{
+	return __sock_is_valid_access(AF_INET6, off, size, type, info);
+}
+
 static int bpf_unclone_prologue(struct bpf_insn *insn_buf, bool direct_write,
 				const struct bpf_prog *prog, int drop_verdict)
 {
@@ -4406,6 +4462,40 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 __sock_convert_ctx_access(enum bpf_access_type type,
+				     const struct bpf_insn *si,
+				     struct bpf_insn *insn_buf,
+				     struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sock, family):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_family) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct sock, sk_family));
+		break;
+
+	case offsetof(struct bpf_sock, type):
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct sock, __sk_flags_offset));
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT);
+		break;
+
+	case offsetof(struct bpf_sock, protocol):
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct sock, __sk_flags_offset));
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg,
+					SK_FL_PROTO_SHIFT);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
 					  const struct bpf_insn *si,
 					  struct bpf_insn *insn_buf,
@@ -4447,26 +4537,56 @@ static u32 sock_filter_convert_ctx_access(enum bpf_access_type type,
 				      offsetof(struct sock, sk_priority));
 		break;
 
-	case offsetof(struct bpf_sock, family):
-		BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_family) != 2);
+	default:
+		return __sock_convert_ctx_access(type, si, insn_buf, prog,
+						 target_size);
+	}
 
-		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
-				      offsetof(struct sock, sk_family));
-		break;
+	return insn - insn_buf;
+}
 
-	case offsetof(struct bpf_sock, type):
-		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
-				      offsetof(struct sock, __sk_flags_offset));
-		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK);
-		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT);
-		break;
+static u32 sock_convert_ctx_access(enum bpf_access_type type,
+				   const struct bpf_insn *si,
+				   struct bpf_insn *insn_buf,
+				   struct bpf_prog *prog,
+				   u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
 
-	case offsetof(struct bpf_sock, protocol):
-		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
-				      offsetof(struct sock, __sk_flags_offset));
-		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK);
-		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_PROTO_SHIFT);
+	switch (si->off) {
+	case offsetof(struct bpf_sock, src_ip4):
+		*insn++ = BPF_LDX_MEM(
+			BPF_SIZE(si->code), si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common, skc_rcv_saddr,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_rcv_saddr),
+				       target_size));
 		break;
+	case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]):
+		off = si->off;
+		off -= offsetof(struct bpf_sock, src_ip6[0]);
+		*insn++ = BPF_LDX_MEM(
+			BPF_SIZE(si->code), si->dst_reg, si->src_reg,
+			bpf_target_off(
+				struct sock_common,
+				skc_v6_rcv_saddr.s6_addr32[0],
+				FIELD_SIZEOF(struct sock_common,
+					     skc_v6_rcv_saddr.s6_addr32[0]),
+				target_size) + off);
+		break;
+	case offsetof(struct bpf_sock, src_port):
+		*insn++ = BPF_LDX_MEM(
+			BPF_FIELD_SIZEOF(struct sock_common, skc_num),
+			si->dst_reg, si->src_reg,
+			bpf_target_off(struct sock_common, skc_num,
+				       FIELD_SIZEOF(struct sock_common,
+						    skc_num),
+				       target_size));
+		break;
+	default:
+		return __sock_convert_ctx_access(type, si, insn_buf, prog,
+						 target_size);
 	}
 
 	return insn - insn_buf;
@@ -5122,6 +5242,24 @@ const struct bpf_verifier_ops cg_sock_verifier_ops = {
 const struct bpf_prog_ops cg_sock_prog_ops = {
 };
 
+const struct bpf_verifier_ops cg_inet4_post_bind_verifier_ops = {
+	.get_func_proto		= sock_filter_func_proto,
+	.is_valid_access	= sock4_is_valid_access,
+	.convert_ctx_access	= sock_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet4_post_bind_prog_ops = {
+};
+
+const struct bpf_verifier_ops cg_inet6_post_bind_verifier_ops = {
+	.get_func_proto		= sock_filter_func_proto,
+	.is_valid_access	= sock6_is_valid_access,
+	.convert_ctx_access	= sock_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet6_post_bind_prog_ops = {
+};
+
 const struct bpf_verifier_ops cg_inet4_bind_verifier_ops = {
 	.get_func_proto		= inet_bind_func_proto,
 	.is_valid_access	= sock_addr4_is_valid_access,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 488fe26ac8e5..28e2e7fdd5b1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -521,7 +521,8 @@ int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 	/* Make sure we are allowed to bind here. */
 	if ((snum || !(inet->bind_address_no_port ||
 		       force_bind_address_no_port)) &&
-	    sk->sk_prot->get_port(sk, snum)) {
+	    (sk->sk_prot->get_port(sk, snum) ||
+	     BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk))) {
 		inet->inet_saddr = inet->inet_rcv_saddr = 0;
 		err = -EADDRINUSE;
 		goto out_release_sock;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 13110bee5c14..473cc55a3a7d 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -414,7 +414,8 @@ int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 	/* Make sure we are allowed to bind here. */
 	if ((snum || !(inet->bind_address_no_port ||
 		       force_bind_address_no_port)) &&
-	    sk->sk_prot->get_port(sk, snum)) {
+	    (sk->sk_prot->get_port(sk, snum) ||
+	     BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk))) {
 		sk->sk_ipv6only = saved_ipv6only;
 		inet_reset_saddr(sk);
 		err = -EADDRINUSE;
-- 
2.9.5

* [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

From: Andrey Ignatov <rdna@fb.com>

== The problem ==

There is a use case where all processes inside a cgroup should use a
single IP address on a host that has multiple IPs configured. Those
processes should use that IP for both ingress and egress, for TCP and
UDP traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. This should not require changing application
code, since that is often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).

Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers to the desired IP. It does
not depend on the application environment or on implementation details
(whether glibc is used or not).

It adds new eBPF program types `BPF_PROG_TYPE_CGROUP_INET4_BIND` and
`BPF_PROG_TYPE_CGROUP_INET6_BIND` and corresponding attach types
`BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already
existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program types are intended to be used with sockets (`struct sock`)
in a cgroup and a user-provided `struct sockaddr`. Pointers to both of
them are part of the context passed to programs of the newly added types.

The new attach types provide hooks in the `bind(2)` system call for both
IPv4 and IPv6, so that one can write a program to override the IP
addresses and ports a user program tries to bind to and apply such a
program to a whole cgroup.
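
As a rough illustration of such an override, the sketch below models
the IPv4 case in plain user-space C. `struct sock_addr_ctx` copies only
three of the fields this patch adds to `struct bpf_sock_addr`, while
`bind_v4_prog()` and the address 192.168.1.10 are hypothetical. It is
not a loadable BPF program, only a model of what one would do with the
context.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Subset of the fields added to struct bpf_sock_addr by this patch. */
struct sock_addr_ctx {
	uint32_t user_family;
	uint32_t user_ip4;  /* network byte order */
	uint32_t user_port; /* network byte order */
};

/* Hypothetical address every server in the cgroup should bind to. */
#define CGROUP_IP4 0xc0a8010au /* 192.168.1.10, host byte order */

/* Forces the bind address and leaves the requested port untouched.
 * Returns 1, the verdict that lets bind(2) proceed.
 */
static int bind_v4_prog(struct sock_addr_ctx *ctx)
{
	ctx->user_ip4 = htonl(CGROUP_IP4);
	return 1;
}
```

The store is a full 4-byte write, matching the access rule documented
for `user_ip4` in the uapi header.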

== Implementation notes ==

[1]
Separate prog/attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`
with the value to write and `dst` with the pointer to the context, which
can't be changed without breaking later instructions. But the fields
that may be written are not available directly; to access them, the
address of the corresponding pointer has to be loaded first. To get an
additional register, the first one not used by `src` or `dst` is taken,
its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register
is used to load the address of the pointer field, and finally the
register's content is restored from the temporary field after writing
the `src` value.
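
The spill, load, store, restore sequence can be modelled in user-space
C with an array standing in for the BPF registers. All names here
(`fake_ctx`, `fake_uaddr`, `pick_tmp_reg()`, `indirect_store()`) are
hypothetical. Searching for the scratch register from index 0 upward is
an assumption of this sketch, not necessarily the order the kernel code
uses.

```c
#include <assert.h>
#include <stdint.h>

struct fake_uaddr {
	uint32_t ip4;
};

struct fake_ctx {
	struct fake_uaddr *uaddr; /* nested struct reachable only via pointer */
	uint64_t tmp_reg;         /* spill slot, like bpf_sock_addr_kern.tmp_reg */
};

/* Pick the first register index that is neither src nor dst. */
static int pick_tmp_reg(int src, int dst)
{
	int r = 0;

	while (r == src || r == dst)
		r++;
	return r;
}

/* Emulates "ctx->uaddr->ip4 = regs[src]" where regs[dst] holds the ctx
 * pointer and neither src nor dst may be clobbered.
 */
static void indirect_store(uint64_t regs[], int src, int dst)
{
	struct fake_ctx *ctx = (struct fake_ctx *)(uintptr_t)regs[dst];
	int tmp = pick_tmp_reg(src, dst);
	struct fake_uaddr *target;

	ctx->tmp_reg = regs[tmp];                    /* 1. spill scratch reg */
	regs[tmp] = (uint64_t)(uintptr_t)ctx->uaddr; /* 2. load nested pointer */
	target = (struct fake_uaddr *)(uintptr_t)regs[tmp];
	target->ip4 = (uint32_t)regs[src];           /* 3. store src value */
	regs[tmp] = ctx->tmp_reg;                    /* 4. restore scratch reg */
}
```

After `indirect_store()` returns, every register holds the value it had
on entry, which is the property the temporary field exists to provide.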

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf-cgroup.h |  21 ++++
 include/linux/bpf_types.h  |   2 +
 include/linux/filter.h     |  10 ++
 include/uapi/linux/bpf.h   |  24 +++++
 kernel/bpf/cgroup.c        |  36 +++++++
 kernel/bpf/syscall.c       |  14 +++
 kernel/bpf/verifier.c      |   2 +
 net/core/filter.c          | 242 +++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c         |   7 ++
 net/ipv6/af_inet6.c        |   7 ++
 10 files changed, 365 insertions(+)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 8a4566691c8f..dd0cfbddcfbe 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -6,6 +6,7 @@
 #include <uapi/linux/bpf.h>
 
 struct sock;
+struct sockaddr;
 struct cgroup;
 struct sk_buff;
 struct bpf_sock_ops_kern;
@@ -63,6 +64,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 int __cgroup_bpf_run_filter_sk(struct sock *sk,
 			       enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+				      struct sockaddr *uaddr,
+				      enum bpf_attach_type type);
+
 int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
 				     struct bpf_sock_ops_kern *sock_ops,
 				     enum bpf_attach_type type);
@@ -103,6 +108,20 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type) 			       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled)						       \
+		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);    \
+	__ret;								       \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr)			       \
+	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
 ({									       \
 	int __ret = 0;							       \
@@ -135,6 +154,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 19b8349a3809..eefd877f8e68 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act)
 BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET4_BIND, cg_inet4_bind)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_INET6_BIND, cg_inet6_bind)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fdb691b520c0..fe469320feab 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1001,6 +1001,16 @@ static inline int bpf_tell_extensions(void)
 	return SKF_AD_MAX;
 }
 
+struct bpf_sock_addr_kern {
+	struct sock *sk;
+	struct sockaddr *uaddr;
+	/* Temporary "register" to make indirect stores to nested structures
+	 * defined above. We need three registers to make such a store, but
+	 * only two (src and dst) are available at convert_ctx_access time
+	 */
+	u64 tmp_reg;
+};
+
 struct bpf_sock_ops_kern {
 	struct	sock *sk;
 	u32	op;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a66769e5875..78628a3f3cd8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -133,6 +133,8 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SOCK_OPS,
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
+	BPF_PROG_TYPE_CGROUP_INET4_BIND,
+	BPF_PROG_TYPE_CGROUP_INET6_BIND,
 };
 
 enum bpf_attach_type {
@@ -143,6 +145,8 @@ enum bpf_attach_type {
 	BPF_SK_SKB_STREAM_PARSER,
 	BPF_SK_SKB_STREAM_VERDICT,
 	BPF_CGROUP_DEVICE,
+	BPF_CGROUP_INET4_BIND,
+	BPF_CGROUP_INET6_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -953,6 +957,26 @@ struct bpf_map_info {
 	__u64 netns_ino;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
+ * by user and intended to be used by socket (e.g. to bind to, depends on
+ * attach type).
+ */
+struct bpf_sock_addr {
+	__u32 user_family;	/* Allows 4-byte read, but no write. */
+	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_ip6[4];	/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_port;	/* Allows 4-byte read and write.
+				 * Stored in network byte order
+				 */
+	__u32 family;		/* Allows 4-byte read, but no write */
+	__u32 type;		/* Allows 4-byte read, but no write */
+	__u32 protocol;		/* Allows 4-byte read, but no write */
+};
+
 /* User bpf_sock_ops struct to access socket values and specify request ops
  * and their replies.
  * Some of this fields are in network (bigendian) byte order and may need
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index c1c0b60d3f2f..78ef086a7c2d 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -495,6 +495,42 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
 
 /**
+ * __cgroup_bpf_run_filter_sock_addr() - Run a program on a sock and
+ *                                       provided by user sockaddr
+ * @sk: sock struct that will use sockaddr
+ * @uaddr: sockaddr struct provided by user
+ * @type: The type of program to be executed
+ *
+ * socket is expected to be of type INET or INET6.
+ *
+ * This function will return %-EPERM if an attached program is found and
+ * returned value != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
+				      struct sockaddr *uaddr,
+				      enum bpf_attach_type type)
+{
+	struct bpf_sock_addr_kern ctx = {
+		.sk = sk,
+		.uaddr = uaddr,
+	};
+	struct cgroup *cgrp;
+	int ret;
+
+	/* Check socket family since not all sockets represent network
+	 * endpoint (e.g. AF_UNIX).
+	 */
+	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
+
+	return ret == 1 ? 0 : -EPERM;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_sock_addr);
+
+/**
  * __cgroup_bpf_run_filter_sock_ops() - Run a program on a sock
  * @sk: socket to get cgroup from
  * @sock_ops: bpf_sock_ops_kern struct to pass to program. Contains
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e24aa3241387..7f86542aa42c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1376,6 +1376,12 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_SOCK_CREATE:
 		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
 		break;
+	case BPF_CGROUP_INET4_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_BIND;
+		break;
+	case BPF_CGROUP_INET6_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1431,6 +1437,12 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_SOCK_CREATE:
 		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
 		break;
+	case BPF_CGROUP_INET4_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET4_BIND;
+		break;
+	case BPF_CGROUP_INET6_BIND:
+		ptype = BPF_PROG_TYPE_CGROUP_INET6_BIND;
+		break;
 	case BPF_CGROUP_SOCK_OPS:
 		ptype = BPF_PROG_TYPE_SOCK_OPS;
 		break;
@@ -1478,6 +1490,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
 	case BPF_CGROUP_INET_SOCK_CREATE:
+	case BPF_CGROUP_INET4_BIND:
+	case BPF_CGROUP_INET6_BIND:
 	case BPF_CGROUP_SOCK_OPS:
 	case BPF_CGROUP_DEVICE:
 		break;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index eb79a34359c0..01b54afcb762 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3872,6 +3872,8 @@ static int check_return_code(struct bpf_verifier_env *env)
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_CGROUP_SKB:
 	case BPF_PROG_TYPE_CGROUP_SOCK:
+	case BPF_PROG_TYPE_CGROUP_INET4_BIND:
+	case BPF_PROG_TYPE_CGROUP_INET6_BIND:
 	case BPF_PROG_TYPE_SOCK_OPS:
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 		break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 33edfa8372fd..78907cf3b42f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3443,6 +3443,20 @@ sock_filter_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
+inet_bind_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	/* inet and inet6 sockets are created in a process
+	 * context so there is always a valid uid/gid
+	 */
+	case BPF_FUNC_get_current_uid_gid:
+		return &bpf_get_current_uid_gid_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 sk_filter_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -3900,6 +3914,70 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+static bool __sock_addr_is_valid_access(unsigned short ctx_family, int off,
+					int size, enum bpf_access_type type,
+					struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+	unsigned short requested_family = 0;
+
+	if (off < 0 || off >= sizeof(struct bpf_sock_addr))
+		return false;
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sock_addr, user_ip4):
+		requested_family = AF_INET;
+		/* FALLTHROUGH */
+	case bpf_ctx_range_till(struct bpf_sock_addr, user_ip6[0], user_ip6[3]):
+		if (!requested_family)
+			requested_family = AF_INET6;
+		/* Disallow access to IPv6 fields from IPv4 context and vice
+		 * versa.
+		 */
+		if (requested_family != ctx_family)
+			return false;
+		/* Only narrow read access allowed for now. */
+		if (type == BPF_READ) {
+			bpf_ctx_record_field_size(info, size_default);
+			if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+				return false;
+		} else {
+			if (size != size_default)
+				return false;
+		}
+		break;
+	case bpf_ctx_range(struct bpf_sock_addr, user_port):
+		if (size != size_default)
+			return false;
+		break;
+	default:
+		if (type == BPF_READ) {
+			if (size != size_default)
+				return false;
+		} else {
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static bool sock_addr4_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       struct bpf_insn_access_aux *info)
+{
+	return __sock_addr_is_valid_access(AF_INET, off, size, type, info);
+}
+
+static bool sock_addr6_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       struct bpf_insn_access_aux *info)
+{
+	return __sock_addr_is_valid_access(AF_INET6, off, size, type, info);
+}
+
 static bool sock_ops_is_valid_access(int off, int size,
 				     enum bpf_access_type type,
 				     struct bpf_insn_access_aux *info)
@@ -4415,6 +4493,152 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+/* SOCK_ADDR_LOAD_NESTED_FIELD() loads Nested Field S.F.NF where S is type of
+ * context Structure, F is Field in context structure that contains a pointer
+ * to Nested Structure of type NS that has the field NF.
+ *
+ * SIZE encodes the load size (BPF_B, BPF_H, etc). It's up to caller to make
+ * sure that SIZE is not greater than actual size of S.F.NF.
+ *
+ * If offset OFF is provided, the load happens from that offset relative to
+ * offset of NF.
+ */
+#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF, SIZE, OFF)	       \
+	do {								       \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg,     \
+				      si->src_reg, offsetof(S, F));	       \
+		*insn++ = BPF_LDX_MEM(					       \
+			SIZE, si->dst_reg, si->dst_reg,			       \
+			bpf_target_off(NS, NF, FIELD_SIZEOF(NS, NF),	       \
+				       target_size)			       \
+				+ OFF);					       \
+	} while (0)
+
+#define SOCK_ADDR_LOAD_NESTED_FIELD(S, NS, F, NF)			       \
+	SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF,		       \
+					     BPF_FIELD_SIZEOF(NS, NF), 0)
+
+/* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantics similar to
+ * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operations.
+ *
+ * It doesn't support a SIZE argument though, since narrow stores are not
+ * supported for now.
+ *
+ * In addition it uses Temporary Field TF (member of struct S) as the 3rd
+ * "register" since two registers available in convert_ctx_access are not
+ * enough: we can override neither SRC, since it contains value to store, nor
+ * DST since it contains pointer to context that may be used by later
+ * instructions. But we need a temporary place to save pointer to nested
+ * structure whose field we want to store to.
+ */
+#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF, TF)		       \
+	do {								       \
+		int tmp_reg = BPF_REG_9;				       \
+		if (si->src_reg == tmp_reg || si->dst_reg == tmp_reg)	       \
+			--tmp_reg;					       \
+		if (si->src_reg == tmp_reg || si->dst_reg == tmp_reg)	       \
+			--tmp_reg;					       \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, tmp_reg,	       \
+				      offsetof(S, TF));			       \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), tmp_reg,	       \
+				      si->dst_reg, offsetof(S, F));	       \
+		*insn++ = BPF_STX_MEM(					       \
+			BPF_FIELD_SIZEOF(NS, NF), tmp_reg, si->src_reg,	       \
+			bpf_target_off(NS, NF, FIELD_SIZEOF(NS, NF),	       \
+				       target_size)			       \
+				+ OFF);					       \
+		*insn++ = BPF_LDX_MEM(BPF_DW, tmp_reg, si->dst_reg,	       \
+				      offsetof(S, TF));			       \
+	} while (0)
+
+#define SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(S, NS, F, NF, SIZE, OFF, \
+						      TF)		       \
+	do {								       \
+		if (type == BPF_WRITE) {				       \
+			SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF,    \
+							 TF);		       \
+		} else {						       \
+			SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(		       \
+				S, NS, F, NF, SIZE, OFF);  \
+		}							       \
+	} while (0)
+
+#define SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD(S, NS, F, NF, TF)		       \
+	SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(			       \
+		S, NS, F, NF, BPF_FIELD_SIZEOF(NS, NF), 0, TF)
+
+static u32 sock_addr_convert_ctx_access(enum bpf_access_type type,
+					const struct bpf_insn *si,
+					struct bpf_insn *insn_buf,
+					struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sock_addr, user_family):
+		SOCK_ADDR_LOAD_NESTED_FIELD(struct bpf_sock_addr_kern,
+					    struct sockaddr, uaddr, sa_family);
+		break;
+
+	case offsetof(struct bpf_sock_addr, user_ip4):
+		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sockaddr_in, uaddr,
+			sin_addr, BPF_SIZE(si->code), 0, tmp_reg);
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sock_addr, user_ip6[0], user_ip6[3]):
+		off = si->off;
+		off -= offsetof(struct bpf_sock_addr, user_ip6[0]);
+		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sockaddr_in6, uaddr,
+			sin6_addr.s6_addr32[0], BPF_SIZE(si->code), off,
+			tmp_reg);
+		break;
+
+	case offsetof(struct bpf_sock_addr, user_port):
+		/* To get port we need to know sa_family first and then treat
+		 * sockaddr as either sockaddr_in or sockaddr_in6.
+		 * Though we can simplify since port field has same offset and
+		 * size in both structures.
+		 * Here we check this invariant and use just one of the
+		 * structures if it's true.
+		 */
+		BUILD_BUG_ON(offsetof(struct sockaddr_in, sin_port) !=
+			     offsetof(struct sockaddr_in6, sin6_port));
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sockaddr_in, sin_port) !=
+			     FIELD_SIZEOF(struct sockaddr_in6, sin6_port));
+		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD(struct bpf_sock_addr_kern,
+						     struct sockaddr_in6, uaddr,
+						     sin6_port, tmp_reg);
+		break;
+
+	case offsetof(struct bpf_sock_addr, family):
+		SOCK_ADDR_LOAD_NESTED_FIELD(struct bpf_sock_addr_kern,
+					    struct sock, sk, sk_family);
+		break;
+
+	case offsetof(struct bpf_sock_addr, type):
+		SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sock, sk,
+			__sk_flags_offset, BPF_W, 0);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT);
+		break;
+
+	case offsetof(struct bpf_sock_addr, protocol):
+		SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(
+			struct bpf_sock_addr_kern, struct sock, sk,
+			__sk_flags_offset, BPF_W, 0);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg,
+					SK_FL_PROTO_SHIFT);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				       const struct bpf_insn *si,
 				       struct bpf_insn *insn_buf,
@@ -4849,6 +5073,24 @@ const struct bpf_verifier_ops cg_sock_verifier_ops = {
 const struct bpf_prog_ops cg_sock_prog_ops = {
 };
 
+const struct bpf_verifier_ops cg_inet4_bind_verifier_ops = {
+	.get_func_proto		= inet_bind_func_proto,
+	.is_valid_access	= sock_addr4_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet4_bind_prog_ops = {
+};
+
+const struct bpf_verifier_ops cg_inet6_bind_verifier_ops = {
+	.get_func_proto		= inet_bind_func_proto,
+	.is_valid_access	= sock_addr6_is_valid_access,
+	.convert_ctx_access	= sock_addr_convert_ctx_access,
+};
+
+const struct bpf_prog_ops cg_inet6_bind_prog_ops = {
+};
+
 const struct bpf_verifier_ops sock_ops_verifier_ops = {
 	.get_func_proto		= sock_ops_func_proto,
 	.is_valid_access	= sock_ops_is_valid_access,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e8c7fad8c329..2dec266507dc 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -450,6 +450,13 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (addr_len < sizeof(struct sockaddr_in))
 		goto out;
 
+	/* BPF prog is run before any checks are done so that if the prog
+	 * changes context in a wrong way it will be caught.
+	 */
+	err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
+	if (err)
+		goto out;
+
 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
 		 * only if s_addr is INADDR_ANY.
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index dbbe04018813..fa24e3f06ac6 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -295,6 +295,13 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (addr_len < SIN6_LEN_RFC2133)
 		return -EINVAL;
 
+	/* BPF prog is run before any checks are done so that if the prog
+	 * changes context in a wrong way it will be caught.
+	 */
+	err = BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr);
+	if (err)
+		return err;
+
 	if (addr->sin6_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH RFC bpf-next 3/6] net: Introduce __inet_bind() and __inet6_bind()
From: Alexei Starovoitov @ 2018-03-14  3:39 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

From: Andrey Ignatov <rdna@fb.com>

Refactor the `bind()` code so that it can be called from the BPF helper
function `bpf_bind()` (will be added soon). The implementations of
`inet_bind()` and `inet6_bind()` are split out into `__inet_bind()` and
`__inet6_bind()` respectively. These functions can be used from both the
`sk_prot->bind` and `bpf_bind()` contexts.

The new functions take two additional arguments.

`force_bind_address_no_port` forces binding to the IP only, without
checking the `inet_sock.bind_address_no_port` field. It allows
`bpf_bind()` to bind the local end of a connection to the desired IP
without changing the socket's `bind_address_no_port` field. This is useful
because `bpf_bind()` can return an error, and if we changed the field
before calling the helper we would have to restore its original value in
that case.
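
As an illustration only, the port-allocation condition that both
`__inet_bind()` and `__inet6_bind()` use after this change can be modelled
in user space as below. The function name `needs_get_port()` is
hypothetical and does not exist in the kernel; the boolean expression is
the one from the diff, where a true result means `sk_prot->get_port()` is
called.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the condition in the patch:
 *
 *   (snum || !(inet->bind_address_no_port || force_bind_address_no_port))
 *
 * Returns true when the bind path would go on to call get_port().
 * This is a sketch for reasoning about the flag, not kernel code.
 */
static bool needs_get_port(unsigned short snum,
			   bool bind_address_no_port,
			   bool force_bind_address_no_port)
{
	/* An explicit port always goes through get_port(). */
	if (snum)
		return true;

	/* With port 0, either flag is enough to skip port allocation. */
	return !(bind_address_no_port || force_bind_address_no_port);
}
```

With `snum == 0`, passing `force_bind_address_no_port = true` gives the same
result as a socket whose `bind_address_no_port` field is set, which is why
the field itself never has to be modified and later restored.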

`with_lock` specifies whether or not to lock the socket when working with
`struct sock`. The argument is set to `true` for `sk_prot->bind`, i.e. the
old behavior is preserved, but it will be set to `false` for the
`bpf_bind()` use case. The reason is that all call sites where `bpf_bind()`
will be called already hold the socket lock.
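
The conditional locking pattern can be sketched in user space as follows.
Everything here is hypothetical and only mimics the shape of the change:
`struct fake_sock`, `fake_lock()`, `fake_release()`, `fake_bind_core()` and
the two wrappers are made-up names, and the lock is a plain counter rather
than a real socket lock.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a socket; lock_depth counts lock holders. */
struct fake_sock {
	int lock_depth;
	int bound;
};

static void fake_lock(struct fake_sock *sk)    { sk->lock_depth++; }
static void fake_release(struct fake_sock *sk) { sk->lock_depth--; }

/*
 * Core routine in the style of __inet_bind(): it takes the lock only
 * when the caller asks for it, and releases it on the same condition.
 */
static int fake_bind_core(struct fake_sock *sk, bool with_lock)
{
	int err = 0;

	if (with_lock)
		fake_lock(sk);

	/* The work below must run with the lock held exactly once. */
	if (sk->lock_depth != 1) {
		err = -1;
		goto out;
	}
	sk->bound = 1;
out:
	if (with_lock)
		fake_release(sk);
	return err;
}

/* Syscall-style entry point: nobody holds the lock yet. */
static int fake_bind(struct fake_sock *sk)
{
	return fake_bind_core(sk, true);
}

/* Helper-style entry point: the caller already holds the lock. */
static int fake_helper_bind(struct fake_sock *sk)
{
	return fake_bind_core(sk, false);
}
```

In this model, calling `fake_bind_core(sk, true)` on a socket that is
already locked fails, which stands in for the double-locking problem that
`with_lock = false` avoids.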

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/net/inet_common.h |  2 ++
 include/net/ipv6.h        |  2 ++
 net/ipv4/af_inet.c        | 39 ++++++++++++++++++++++++---------------
 net/ipv6/af_inet6.c       | 37 ++++++++++++++++++++++++-------------
 4 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 500f81375200..384b90c62c0b 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -32,6 +32,8 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr,
 		 int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 50a6f0ddb878..2e5fedc56e59 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1066,6 +1066,8 @@ void ipv6_local_error(struct sock *sk, int err, struct flowi6 *fl6, u32 info);
 void ipv6_local_rxpmtu(struct sock *sk, struct flowi6 *fl6, u32 mtu);
 
 int inet6_release(struct socket *sock);
+int __inet6_bind(struct sock *sock, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock);
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
 int inet6_getname(struct socket *sock, struct sockaddr *uaddr,
 		  int peer);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2dec266507dc..e203a39d6988 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -432,30 +432,37 @@ EXPORT_SYMBOL(inet_release);
 
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk = sock->sk;
-	struct inet_sock *inet = inet_sk(sk);
-	struct net *net = sock_net(sk);
-	unsigned short snum;
-	int chk_addr_ret;
-	u32 tb_id = RT_TABLE_LOCAL;
 	int err;
 
 	/* If the socket has its own bind function then use it. (RAW) */
 	if (sk->sk_prot->bind) {
-		err = sk->sk_prot->bind(sk, uaddr, addr_len);
-		goto out;
+		return sk->sk_prot->bind(sk, uaddr, addr_len);
 	}
-	err = -EINVAL;
 	if (addr_len < sizeof(struct sockaddr_in))
-		goto out;
+		return -EINVAL;
 
 	/* BPF prog is run before any checks are done so that if the prog
 	 * changes context in a wrong way it will be caught.
 	 */
 	err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
 	if (err)
-		goto out;
+		return err;
+
+	return __inet_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet_bind);
+
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock)
+{
+	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct net *net = sock_net(sk);
+	unsigned short snum;
+	int chk_addr_ret;
+	u32 tb_id = RT_TABLE_LOCAL;
+	int err;
 
 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
@@ -499,7 +506,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	 *      would be illegal to use them (multicast/broadcast) in
 	 *      which case the sending device address is used.
 	 */
-	lock_sock(sk);
+	if (with_lock)
+		lock_sock(sk);
 
 	/* Check these errors (active socket, double bind). */
 	err = -EINVAL;
@@ -511,7 +519,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		inet->inet_saddr = 0;  /* Use device */
 
 	/* Make sure we are allowed to bind here. */
-	if ((snum || !inet->bind_address_no_port) &&
+	if ((snum || !(inet->bind_address_no_port ||
+		       force_bind_address_no_port)) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		inet->inet_saddr = inet->inet_rcv_saddr = 0;
 		err = -EADDRINUSE;
@@ -528,11 +537,11 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	sk_dst_reset(sk);
 	err = 0;
 out_release_sock:
-	release_sock(sk);
+	if (with_lock)
+		release_sock(sk);
 out:
 	return err;
 }
-EXPORT_SYMBOL(inet_bind);
 
 int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
 		       int addr_len, int flags)
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index fa24e3f06ac6..13110bee5c14 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -277,15 +277,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 /* bind for INET6 API */
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-	struct sockaddr_in6 *addr = (struct sockaddr_in6 *)uaddr;
 	struct sock *sk = sock->sk;
-	struct inet_sock *inet = inet_sk(sk);
-	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct net *net = sock_net(sk);
-	__be32 v4addr = 0;
-	unsigned short snum;
-	bool saved_ipv6only;
-	int addr_type = 0;
 	int err = 0;
 
 	/* If the socket has its own bind function then use it. */
@@ -302,11 +294,28 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (err)
 		return err;
 
+	return __inet6_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet6_bind);
+
+int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock)
+{
+	struct sockaddr_in6 *addr = (struct sockaddr_in6 *)uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct net *net = sock_net(sk);
+	__be32 v4addr = 0;
+	unsigned short snum;
+	bool saved_ipv6only;
+	int addr_type = 0;
+	int err = 0;
+
 	if (addr->sin6_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
 	addr_type = ipv6_addr_type(&addr->sin6_addr);
-	if ((addr_type & IPV6_ADDR_MULTICAST) && sock->type == SOCK_STREAM)
+	if ((addr_type & IPV6_ADDR_MULTICAST) && sk->sk_type == SOCK_STREAM)
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
@@ -314,7 +323,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
-	lock_sock(sk);
+	if (with_lock)
+		lock_sock(sk);
 
 	/* Check these errors (active socket, double bind). */
 	if (sk->sk_state != TCP_CLOSE || inet->inet_num) {
@@ -402,7 +412,8 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		sk->sk_ipv6only = 1;
 
 	/* Make sure we are allowed to bind here. */
-	if ((snum || !inet->bind_address_no_port) &&
+	if ((snum || !(inet->bind_address_no_port ||
+		       force_bind_address_no_port)) &&
 	    sk->sk_prot->get_port(sk, snum)) {
 		sk->sk_ipv6only = saved_ipv6only;
 		inet_reset_saddr(sk);
@@ -418,13 +429,13 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	inet->inet_dport = 0;
 	inet->inet_daddr = 0;
 out:
-	release_sock(sk);
+	if (with_lock)
+		release_sock(sk);
 	return err;
 out_unlock:
 	rcu_read_unlock();
 	goto out;
 }
-EXPORT_SYMBOL(inet6_bind);
 
 int inet6_release(struct socket *sock)
 {
-- 
2.9.5

^ permalink raw reply related
