* Re: general protection fault in tls_trim_both_msgs
From: syzbot @ 2019-07-28 3:46 UTC
To: ast, aviadye, borisp, bpf, corbet, daniel, davejwatson, davem,
jakub.kicinski, john.fastabend, kafai, linux-doc, linux-kernel,
netdev, songliubraving, syzkaller-bugs, yhs
In-Reply-To: <0000000000002b4896058e7abf78@google.com>
syzbot has bisected this bug to:
commit 32857cf57f920cdc03b5095f08febec94cf9c36b
Author: John Fastabend <john.fastabend@gmail.com>
Date: Fri Jul 19 17:29:18 2019 +0000
net/tls: fix transition through disconnect with close
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=155064d8600000
start commit: fde50b96 Add linux-next specific files for 20190726
git tree: linux-next
final crash: https://syzkaller.appspot.com/x/report.txt?x=175064d8600000
console output: https://syzkaller.appspot.com/x/log.txt?x=135064d8600000
kernel config: https://syzkaller.appspot.com/x/.config?x=4b58274564b354c1
dashboard link: https://syzkaller.appspot.com/bug?extid=0e0fedcad708d12d3032
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=14779d64600000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1587c842600000
Reported-by: syzbot+0e0fedcad708d12d3032@syzkaller.appspotmail.com
Fixes: 32857cf57f92 ("net/tls: fix transition through disconnect with close")
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
* Re: [PATCH] hv_sock: use HV_HYP_PAGE_SIZE instead of PAGE_SIZE_4K
From: kbuild test robot @ 2019-07-28 4:06 UTC
To: Himadri Pandya
Cc: kbuild-all, mikelley, kys, haiyangz, sthemmin, sashal, davem,
linux-hyperv, netdev, linux-kernel, Himadri Pandya
In-Reply-To: <20190725051125.10605-1-himadri18.07@gmail.com>
Hi Himadri,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[cannot apply to v5.3-rc1 next-20190726]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Himadri-Pandya/hv_sock-use-HV_HYP_PAGE_SIZE-instead-of-PAGE_SIZE_4K/20190726-085229
reproduce:
# apt-get install sparse
# sparse version: v0.6.1-rc1-7-g2b96cd8-dirty
make ARCH=x86_64 allmodconfig
make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'
If you fix the issue, kindly add the following tag:
Reported-by: kbuild test robot <lkp@intel.com>
sparse warnings: (new ones prefixed by >>)
include/linux/sched.h:609:43: sparse: sparse: bad integer constant expression
include/linux/sched.h:609:73: sparse: sparse: invalid named zero-width bitfield `value'
include/linux/sched.h:610:43: sparse: sparse: bad integer constant expression
include/linux/sched.h:610:67: sparse: sparse: invalid named zero-width bitfield `bucket_id'
net/vmw_vsock/hyperv_transport.c:214:39: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:214:39: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:214:39: sparse: sparse: incompatible types for operation (-)
>> net/vmw_vsock/hyperv_transport.c:214:39: sparse: left side has type bad type
>> net/vmw_vsock/hyperv_transport.c:214:39: sparse: right side has type int
net/vmw_vsock/hyperv_transport.c:214:39: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:214:39: sparse: sparse: incompatible types for operation (-)
>> net/vmw_vsock/hyperv_transport.c:214:39: sparse: left side has type bad type
>> net/vmw_vsock/hyperv_transport.c:214:39: sparse: right side has type int
net/vmw_vsock/hyperv_transport.c:65:17: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:65:17: sparse: sparse: bad constant expression type
net/vmw_vsock/hyperv_transport.c:387:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:388:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
>> net/vmw_vsock/hyperv_transport.c:390:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:391:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:392:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:392:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:392:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:392:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:393:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:394:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:395:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:395:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:395:26: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:395:26: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:465:25: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:466:25: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:666:9: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: cast from unknown type
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: undefined identifier 'HV_HYP_PAGE_SIZE'
net/vmw_vsock/hyperv_transport.c:681:28: sparse: sparse: cast from unknown type
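Every hyperv_transport.c warning above follows from one cause visible in the log itself: HV_HYP_PAGE_SIZE is an undefined identifier in the tree the patch was built against, so each expression that uses it (the array size at line 65, the ALIGN() calls, the min_t/max_t casts) has no usable type. The sketch below is a userspace illustration only, not the fix for the patch. It assumes HV_HYP_PAGE_SIZE is 4096 bytes, as the PAGE_SIZE_4K name it replaces suggests. HVS_ALIGN_UP is a hypothetical stand-in for the kernel's ALIGN(), and the #ifndef fallbacks are assumptions made so the snippet compiles on its own.

```c
/* Assumption: fallback definitions so this snippet builds outside the
 * kernel. In the kernel the value would come from a Hyper-V header,
 * not from this file.
 */
#ifndef HV_HYP_PAGE_SHIFT
#define HV_HYP_PAGE_SHIFT 12
#endif
#ifndef HV_HYP_PAGE_SIZE
#define HV_HYP_PAGE_SIZE (1UL << HV_HYP_PAGE_SHIFT)
#endif

/* Hypothetical stand-in for the kernel's ALIGN(): round x up to the next
 * multiple of a, where a must be a power of two.
 */
#define HVS_ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* Compile-time checks: a 5000-byte request needs two 4096-byte pages,
 * and an already aligned size is left unchanged.
 */
_Static_assert(HVS_ALIGN_UP(5000UL, HV_HYP_PAGE_SIZE) == 8192UL,
	       "5000 rounds up to two pages");
_Static_assert(HVS_ALIGN_UP(4096UL, HV_HYP_PAGE_SIZE) == 4096UL,
	       "aligned size is unchanged");
```

Compiling this with a C11 compiler (for example "cc -std=c11 -c") is enough to evaluate the checks. Removing the two #ifndef fallback blocks reproduces the same class of failure as in the report: the macro name is left undefined and every expression built on it fails to compile.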
vim +214 net/vmw_vsock/hyperv_transport.c
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 59
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 60 struct hvs_send_buf {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 61 /* The header before the payload data */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 62 struct vmpipe_proto_header hdr;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 63
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 64 /* The payload */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 @65 u8 data[HVS_SEND_BUF_SIZE];
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 66 };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 67
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 68 #define HVS_HEADER_LEN (sizeof(struct vmpacket_descriptor) + \
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 69 sizeof(struct vmpipe_proto_header))
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 70
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 71 /* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write(), and
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 72 * __hv_pkt_iter_next().
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 73 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 74 #define VMBUS_PKT_TRAILER_SIZE (sizeof(u64))
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 75
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 76 #define HVS_PKT_LEN(payload_len) (HVS_HEADER_LEN + \
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 77 ALIGN((payload_len), 8) + \
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 78 VMBUS_PKT_TRAILER_SIZE)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 79
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 80 union hvs_service_id {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 81 uuid_le srv_id;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 82
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 83 struct {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 84 unsigned int svm_port;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 85 unsigned char b[sizeof(uuid_le) - sizeof(unsigned int)];
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 86 };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 87 };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 88
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 89 /* Per-socket state (accessed via vsk->trans) */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 90 struct hvsock {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 91 struct vsock_sock *vsk;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 92
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 93 uuid_le vm_srv_id;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 94 uuid_le host_srv_id;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 95
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 96 struct vmbus_channel *chan;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 97 struct vmpacket_descriptor *recv_desc;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 98
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 99 /* The length of the payload not delivered to userland yet */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 100 u32 recv_data_len;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 101 /* The offset of the payload */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 102 u32 recv_data_off;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 103
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 104 /* Have we sent the zero-length packet (FIN)? */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 105 bool fin_sent;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 106 };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 107
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 108 /* In the VM, we support Hyper-V Sockets with AF_VSOCK, and the endpoint is
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 109 * <cid, port> (see struct sockaddr_vm). Note: cid is not really used here:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 110 * when we write apps to connect to the host, we can only use VMADDR_CID_ANY
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 111 * or VMADDR_CID_HOST (both are equivalent) as the remote cid, and when we
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 112 * write apps to bind() & listen() in the VM, we can only use VMADDR_CID_ANY
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 113 * as the local cid.
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 114 *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 115 * On the host, Hyper-V Sockets are supported by Winsock AF_HYPERV:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 116 * https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 117 * guide/make-integration-service, and the endpoint is <VmID, ServiceId> with
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 118 * the below sockaddr:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 119 *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 120 * struct SOCKADDR_HV
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 121 * {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 122 * ADDRESS_FAMILY Family;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 123 * USHORT Reserved;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 124 * GUID VmId;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 125 * GUID ServiceId;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 126 * };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 127 * Note: VmID is not used by Linux VM and actually it isn't transmitted via
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 128 * VMBus, because here it's obvious the host and the VM can easily identify
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 129 * each other. Though the VmID is useful on the host, especially in the case
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 130 * of Windows container, Linux VM doesn't need it at all.
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 131 *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 132 * To make use of the AF_VSOCK infrastructure in Linux VM, we have to limit
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 133 * the available GUID space of SOCKADDR_HV so that we can create a mapping
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 134 * between AF_VSOCK port and SOCKADDR_HV Service GUID. The rule of writing
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 135 * Hyper-V Sockets apps on the host and in Linux VM is:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 136 *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 137 ****************************************************************************
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 138 * The only valid Service GUIDs, from the perspectives of both the host and *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 139 * Linux VM, that can be connected by the other end, must conform to this *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 140 * format: <port>-facb-11e6-bd58-64006a7986d3, and the "port" must be in *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 141 * this range [0, 0x7FFFFFFF]. *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 142 ****************************************************************************
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 143 *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 144 * When we write apps on the host to connect(), the GUID ServiceID is used.
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 145 * When we write apps in Linux VM to connect(), we only need to specify the
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 146 * port and the driver will form the GUID and use that to request the host.
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 147 *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 148 * From the perspective of Linux VM:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 149 * 1. the local ephemeral port (i.e. the local auto-bound port when we call
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 150 * connect() without explicit bind()) is generated by __vsock_bind_stream(),
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 151 * and the range is [1024, 0xFFFFFFFF).
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 152 * 2. the remote ephemeral port (i.e. the auto-generated remote port for
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 153 * a connect request initiated by the host's connect()) is generated by
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 154 * hvs_remote_addr_init() and the range is [0x80000000, 0xFFFFFFFF).
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 155 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 156
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 157 #define MAX_LISTEN_PORT ((u32)0x7FFFFFFF)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 158 #define MAX_VM_LISTEN_PORT MAX_LISTEN_PORT
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 159 #define MAX_HOST_LISTEN_PORT MAX_LISTEN_PORT
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 160 #define MIN_HOST_EPHEMERAL_PORT (MAX_HOST_LISTEN_PORT + 1)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 161
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 162 /* 00000000-facb-11e6-bd58-64006a7986d3 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 163 static const uuid_le srv_id_template =
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 164 UUID_LE(0x00000000, 0xfacb, 0x11e6, 0xbd, 0x58,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 165 0x64, 0x00, 0x6a, 0x79, 0x86, 0xd3);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 166
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 167 static bool is_valid_srv_id(const uuid_le *id)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 168 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 169 return !memcmp(&id->b[4], &srv_id_template.b[4], sizeof(uuid_le) - 4);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 170 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 171
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 172 static unsigned int get_port_by_srv_id(const uuid_le *svr_id)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 173 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 174 return *((unsigned int *)svr_id);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 175 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 176
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 177 static void hvs_addr_init(struct sockaddr_vm *addr, const uuid_le *svr_id)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 178 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 179 unsigned int port = get_port_by_srv_id(svr_id);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 180
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 181 vsock_addr_init(addr, VMADDR_CID_ANY, port);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 182 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 183
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 184 static void hvs_remote_addr_init(struct sockaddr_vm *remote,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 185 struct sockaddr_vm *local)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 186 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 187 static u32 host_ephemeral_port = MIN_HOST_EPHEMERAL_PORT;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 188 struct sock *sk;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 189
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 190 vsock_addr_init(remote, VMADDR_CID_ANY, VMADDR_PORT_ANY);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 191
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 192 while (1) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 193 /* Wrap around ? */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 194 if (host_ephemeral_port < MIN_HOST_EPHEMERAL_PORT ||
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 195 host_ephemeral_port == VMADDR_PORT_ANY)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 196 host_ephemeral_port = MIN_HOST_EPHEMERAL_PORT;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 197
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 198 remote->svm_port = host_ephemeral_port++;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 199
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 200 sk = vsock_find_connected_socket(remote, local);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 201 if (!sk) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 202 /* Found an available ephemeral port */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 203 return;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 204 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 205
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 206 /* Release refcnt got in vsock_find_connected_socket */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 207 sock_put(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 208 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 209 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 210
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 211 static void hvs_set_channel_pending_send_size(struct vmbus_channel *chan)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 212 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 213 set_channel_pending_send_size(chan,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 @214 HVS_PKT_LEN(HVS_SEND_BUF_SIZE));
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 215
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 216 virt_mb();
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 217 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 218
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 219 static bool hvs_channel_readable(struct vmbus_channel *chan)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 220 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 221 u32 readable = hv_get_bytes_to_read(&chan->inbound);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 222
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 223 /* 0-size payload means FIN */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 224 return readable >= HVS_PKT_LEN(0);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 225 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 226
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 227 static int hvs_channel_readable_payload(struct vmbus_channel *chan)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 228 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 229 u32 readable = hv_get_bytes_to_read(&chan->inbound);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 230
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 231 if (readable > HVS_PKT_LEN(0)) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 232 /* At least we have 1 byte to read. We don't need to return
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 233 * the exact readable bytes: see vsock_stream_recvmsg() ->
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 234 * vsock_stream_has_data().
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 235 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 236 return 1;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 237 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 238
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 239 if (readable == HVS_PKT_LEN(0)) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 240 /* 0-size payload means FIN */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 241 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 242 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 243
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 244 /* No payload or FIN */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 245 return -1;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 246 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 247
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 248 static size_t hvs_channel_writable_bytes(struct vmbus_channel *chan)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 249 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 250 u32 writeable = hv_get_bytes_to_write(&chan->outbound);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 251 size_t ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 252
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 253 /* The ringbuffer mustn't be 100% full, and we should reserve a
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 254 * zero-length-payload packet for the FIN: see hv_ringbuffer_write()
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 255 * and hvs_shutdown().
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 256 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 257 if (writeable <= HVS_PKT_LEN(1) + HVS_PKT_LEN(0))
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 258 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 259
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 260 ret = writeable - HVS_PKT_LEN(1) - HVS_PKT_LEN(0);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 261
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 262 return round_down(ret, 8);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 263 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 264
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 265 static int hvs_send_data(struct vmbus_channel *chan,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 266 struct hvs_send_buf *send_buf, size_t to_write)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 267 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 268 send_buf->hdr.pkt_type = 1;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 269 send_buf->hdr.data_size = to_write;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 270 return vmbus_sendpacket(chan, &send_buf->hdr,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 271 sizeof(send_buf->hdr) + to_write,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 272 0, VM_PKT_DATA_INBAND, 0);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 273 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 274
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 275 static void hvs_channel_cb(void *ctx)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 276 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 277 struct sock *sk = (struct sock *)ctx;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 278 struct vsock_sock *vsk = vsock_sk(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 279 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 280 struct vmbus_channel *chan = hvs->chan;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 281
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 282 if (hvs_channel_readable(chan))
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 283 sk->sk_data_ready(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 284
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 285 if (hv_get_bytes_to_write(&chan->outbound) > 0)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 286 sk->sk_write_space(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 287 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 288
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 289 static void hvs_do_close_lock_held(struct vsock_sock *vsk,
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 290 bool cancel_timeout)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 291 {
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 292 struct sock *sk = sk_vsock(vsk);
b4562ca7925a3be Dexuan Cui 2017-10-19 293
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 294 sock_set_flag(sk, SOCK_DONE);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 295 vsk->peer_shutdown = SHUTDOWN_MASK;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 296 if (vsock_stream_has_data(vsk) <= 0)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 297 sk->sk_state = TCP_CLOSING;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 298 sk->sk_state_change(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 299 if (vsk->close_work_scheduled &&
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 300 (!cancel_timeout || cancel_delayed_work(&vsk->close_work))) {
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 301 vsk->close_work_scheduled = false;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 302 vsock_remove_sock(vsk);
b4562ca7925a3be Dexuan Cui 2017-10-19 303
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 304 /* Release the reference taken while scheduling the timeout */
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 305 sock_put(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 306 }
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 307 }
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 308
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 309 static void hvs_close_connection(struct vmbus_channel *chan)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 310 {
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 311 struct sock *sk = get_per_channel_state(chan);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 312
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 313 lock_sock(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 314 hvs_do_close_lock_held(vsock_sk(sk), true);
b4562ca7925a3be Dexuan Cui 2017-10-19 315 release_sock(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 316 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 317
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 318 static void hvs_open_connection(struct vmbus_channel *chan)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 319 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 320 uuid_le *if_instance, *if_type;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 321 unsigned char conn_from_host;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 322
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 323 struct sockaddr_vm addr;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 324 struct sock *sk, *new = NULL;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 325 struct vsock_sock *vnew = NULL;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 326 struct hvsock *hvs = NULL;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 327 struct hvsock *hvs_new = NULL;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 328 int rcvbuf;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 329 int ret;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 330 int sndbuf;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 331
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 332 if_type = &chan->offermsg.offer.if_type;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 333 if_instance = &chan->offermsg.offer.if_instance;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 334 conn_from_host = chan->offermsg.offer.u.pipe.user_def[0];
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 335
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 336 /* The host or the VM should only listen on a port in
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 337 * [0, MAX_LISTEN_PORT]
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 338 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 339 if (!is_valid_srv_id(if_type) ||
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 340 get_port_by_srv_id(if_type) > MAX_LISTEN_PORT)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 341 return;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 342
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 343 hvs_addr_init(&addr, conn_from_host ? if_type : if_instance);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 344 sk = vsock_find_bound_socket(&addr);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 345 if (!sk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 346 return;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 347
b4562ca7925a3be Dexuan Cui 2017-10-19 348 lock_sock(sk);
3b4477d2dcf2709 Stefan Hajnoczi 2017-10-05 349 if ((conn_from_host && sk->sk_state != TCP_LISTEN) ||
3b4477d2dcf2709 Stefan Hajnoczi 2017-10-05 350 (!conn_from_host && sk->sk_state != TCP_SYN_SENT))
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 351 goto out;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 352
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 353 if (conn_from_host) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 354 if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 355 goto out;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 356
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 357 new = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 358 sk->sk_type, 0);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 359 if (!new)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 360 goto out;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 361
3b4477d2dcf2709 Stefan Hajnoczi 2017-10-05 362 new->sk_state = TCP_SYN_SENT;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 363 vnew = vsock_sk(new);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 364 hvs_new = vnew->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 365 hvs_new->chan = chan;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 366 } else {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 367 hvs = vsock_sk(sk)->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 368 hvs->chan = chan;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 369 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 370
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 371 set_channel_read_mode(chan, HV_CALL_DIRECT);
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 372
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 373 /* Use the socket buffer sizes as hints for the VMBUS ring size. For
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 374 * server side sockets, 'sk' is the parent socket and thus, this will
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 375 * allow the child sockets to inherit the size from the parent. Keep
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 376 * the mins to the default value and align to page size as per VMBUS
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 377 * requirements.
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 378 * For the max, the socket core library will limit the socket buffer
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 379 * size that can be set by the user, but, since currently, the hv_sock
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 380 * VMBUS ring buffer is physically contiguous allocation, restrict it
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 381 * further.
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 382 * Older versions of hv_sock host side code cannot handle bigger VMBUS
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 383 * ring buffer size. Use the version number to limit the change to newer
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 384 * versions.
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 385 */
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 386 if (vmbus_proto_version < VERSION_WIN10_V5) {
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 387 sndbuf = RINGBUFFER_HVS_SND_SIZE;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 388 rcvbuf = RINGBUFFER_HVS_RCV_SIZE;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 389 } else {
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 @390 sndbuf = max_t(int, sk->sk_sndbuf, RINGBUFFER_HVS_SND_SIZE);
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 391 sndbuf = min_t(int, sndbuf, RINGBUFFER_HVS_MAX_SIZE);
31113cc83e30924 Himadri Pandya 2019-07-25 392 sndbuf = ALIGN(sndbuf, HV_HYP_PAGE_SIZE);
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 393 rcvbuf = max_t(int, sk->sk_rcvbuf, RINGBUFFER_HVS_RCV_SIZE);
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 394 rcvbuf = min_t(int, rcvbuf, RINGBUFFER_HVS_MAX_SIZE);
31113cc83e30924 Himadri Pandya 2019-07-25 395 rcvbuf = ALIGN(rcvbuf, HV_HYP_PAGE_SIZE);
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 396 }
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 397
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 398 ret = vmbus_open(chan, sndbuf, rcvbuf, NULL, 0, hvs_channel_cb,
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 399 conn_from_host ? new : sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 400 if (ret != 0) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 401 if (conn_from_host) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 402 hvs_new->chan = NULL;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 403 sock_put(new);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 404 } else {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 405 hvs->chan = NULL;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 406 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 407 goto out;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 408 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 409
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 410 set_per_channel_state(chan, conn_from_host ? new : sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 411 vmbus_set_chn_rescind_callback(chan, hvs_close_connection);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 412
cb359b60416701c Sunil Muthuswamy 2019-06-17 413 /* Set the pending send size to max packet size to always get
cb359b60416701c Sunil Muthuswamy 2019-06-17 414 * notifications from the host when there is enough writable space.
cb359b60416701c Sunil Muthuswamy 2019-06-17 415 * The host is optimized to send notifications only when the pending
cb359b60416701c Sunil Muthuswamy 2019-06-17 416 * size boundary is crossed, and not always.
cb359b60416701c Sunil Muthuswamy 2019-06-17 417 */
cb359b60416701c Sunil Muthuswamy 2019-06-17 418 hvs_set_channel_pending_send_size(chan);
cb359b60416701c Sunil Muthuswamy 2019-06-17 419
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 420 if (conn_from_host) {
3b4477d2dcf2709 Stefan Hajnoczi 2017-10-05 421 new->sk_state = TCP_ESTABLISHED;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 422 sk->sk_ack_backlog++;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 423
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 424 hvs_addr_init(&vnew->local_addr, if_type);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 425 hvs_remote_addr_init(&vnew->remote_addr, &vnew->local_addr);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 426
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 427 hvs_new->vm_srv_id = *if_type;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 428 hvs_new->host_srv_id = *if_instance;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 429
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 430 vsock_insert_connected(vnew);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 431
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 432 vsock_enqueue_accept(sk, new);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 433 } else {
3b4477d2dcf2709 Stefan Hajnoczi 2017-10-05 434 sk->sk_state = TCP_ESTABLISHED;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 435 sk->sk_socket->state = SS_CONNECTED;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 436
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 437 vsock_insert_connected(vsock_sk(sk));
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 438 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 439
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 440 sk->sk_state_change(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 441
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 442 out:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 443 /* Release refcnt obtained when we called vsock_find_bound_socket() */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 444 sock_put(sk);
b4562ca7925a3be Dexuan Cui 2017-10-19 445
b4562ca7925a3be Dexuan Cui 2017-10-19 446 release_sock(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 447 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 448
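Note on the ring sizing in the function above: the comment at lines 373-385 describes a clamp-then-align rule (keep at least the default size, cap at a maximum, round up to the hypervisor page size). A minimal userspace sketch of that rule follows. The names `sketch_align_up` and `sketch_ring_size` and the three `SKETCH_*` constants are hypothetical; their values are assumptions for illustration and are not taken from this listing.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only; not taken from the listing. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_RING_MIN  (6u * SKETCH_PAGE_SIZE)
#define SKETCH_RING_MAX  (64u * SKETCH_PAGE_SIZE)

/* Round x up to the next multiple of align (align must be a power of two). */
static unsigned int sketch_align_up(unsigned int x, unsigned int align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Clamp a requested buffer size into [min, max], then page-align it. */
static unsigned int sketch_ring_size(int requested)
{
	unsigned int size;

	if (requested < (int)SKETCH_RING_MIN)
		size = SKETCH_RING_MIN;		/* never go below the default */
	else if (requested > (int)SKETCH_RING_MAX)
		size = SKETCH_RING_MAX;		/* cap the contiguous allocation */
	else
		size = (unsigned int)requested;

	return sketch_align_up(size, SKETCH_PAGE_SIZE);
}
```

Because both bounds in this sketch are already page multiples, the final alignment only changes values that fall strictly between them.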
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 449 static u32 hvs_get_local_cid(void)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 450 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 451 return VMADDR_CID_ANY;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 452 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 453
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 454 static int hvs_sock_init(struct vsock_sock *vsk, struct vsock_sock *psk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 455 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 456 struct hvsock *hvs;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 457 struct sock *sk = sk_vsock(vsk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 458
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 459 hvs = kzalloc(sizeof(*hvs), GFP_KERNEL);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 460 if (!hvs)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 461 return -ENOMEM;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 462
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 463 vsk->trans = hvs;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 464 hvs->vsk = vsk;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 465 sk->sk_sndbuf = RINGBUFFER_HVS_SND_SIZE;
ac383f58f3c98de Sunil Muthuswamy 2019-05-22 466 sk->sk_rcvbuf = RINGBUFFER_HVS_RCV_SIZE;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 467 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 468 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 469
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 470 static int hvs_connect(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 471 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 472 union hvs_service_id vm, host;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 473 struct hvsock *h = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 474
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 475 vm.srv_id = srv_id_template;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 476 vm.svm_port = vsk->local_addr.svm_port;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 477 h->vm_srv_id = vm.srv_id;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 478
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 479 host.srv_id = srv_id_template;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 480 host.svm_port = vsk->remote_addr.svm_port;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 481 h->host_srv_id = host.srv_id;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 482
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 483 return vmbus_send_tl_connect_request(&h->vm_srv_id, &h->host_srv_id);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 484 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 485
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 486 static void hvs_shutdown_lock_held(struct hvsock *hvs, int mode)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 487 {
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 488 struct vmpipe_proto_header hdr;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 489
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 490 if (hvs->fin_sent || !hvs->chan)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 491 return;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 492
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 493 /* It can't fail: see hvs_channel_writable_bytes(). */
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 494 (void)hvs_send_data(hvs->chan, (struct hvs_send_buf *)&hdr, 0);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 495 hvs->fin_sent = true;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 496 }
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 497
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 498 static int hvs_shutdown(struct vsock_sock *vsk, int mode)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 499 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 500 struct sock *sk = sk_vsock(vsk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 501
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 502 if (!(mode & SEND_SHUTDOWN))
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 503 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 504
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 505 lock_sock(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 506 hvs_shutdown_lock_held(vsk->trans, mode);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 507 release_sock(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 508 return 0;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 509 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 510
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 511 static void hvs_close_timeout(struct work_struct *work)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 512 {
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 513 struct vsock_sock *vsk =
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 514 container_of(work, struct vsock_sock, close_work.work);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 515 struct sock *sk = sk_vsock(vsk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 516
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 517 sock_hold(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 518 lock_sock(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 519 if (!sock_flag(sk, SOCK_DONE))
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 520 hvs_do_close_lock_held(vsk, false);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 521
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 522 vsk->close_work_scheduled = false;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 523 release_sock(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 524 sock_put(sk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 525 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 526
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 527 /* Returns true, if it is safe to remove socket; false otherwise */
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 528 static bool hvs_close_lock_held(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 529 {
b4562ca7925a3be Dexuan Cui 2017-10-19 530 struct sock *sk = sk_vsock(vsk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 531
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 532 if (!(sk->sk_state == TCP_ESTABLISHED ||
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 533 sk->sk_state == TCP_CLOSING))
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 534 return true;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 535
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 536 if ((sk->sk_shutdown & SHUTDOWN_MASK) != SHUTDOWN_MASK)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 537 hvs_shutdown_lock_held(vsk->trans, SHUTDOWN_MASK);
b4562ca7925a3be Dexuan Cui 2017-10-19 538
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 539 if (sock_flag(sk, SOCK_DONE))
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 540 return true;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 541
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 542 /* This reference will be dropped by the delayed close routine */
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 543 sock_hold(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 544 INIT_DELAYED_WORK(&vsk->close_work, hvs_close_timeout);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 545 vsk->close_work_scheduled = true;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 546 schedule_delayed_work(&vsk->close_work, HVS_CLOSE_TIMEOUT);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 547 return false;
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 548 }
b4562ca7925a3be Dexuan Cui 2017-10-19 549
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 550 static void hvs_release(struct vsock_sock *vsk)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 551 {
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 552 struct sock *sk = sk_vsock(vsk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 553 bool remove_sock;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 554
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 555 lock_sock(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 556 remove_sock = hvs_close_lock_held(vsk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 557 release_sock(sk);
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 558 if (remove_sock)
a9eeb998c28d550 Sunil Muthuswamy 2019-05-15 559 vsock_remove_sock(vsk);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 560 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 561
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 562 static void hvs_destruct(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 563 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 564 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 565 struct vmbus_channel *chan = hvs->chan;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 566
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 567 if (chan)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 568 vmbus_hvsock_device_unregister(chan);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 569
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 570 kfree(hvs);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 571 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 572
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 573 static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 574 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 575 return -EOPNOTSUPP;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 576 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 577
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 578 static int hvs_dgram_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 579 size_t len, int flags)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 580 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 581 return -EOPNOTSUPP;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 582 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 583
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 584 static int hvs_dgram_enqueue(struct vsock_sock *vsk,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 585 struct sockaddr_vm *remote, struct msghdr *msg,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 586 size_t dgram_len)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 587 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 588 return -EOPNOTSUPP;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 589 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 590
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 591 static bool hvs_dgram_allow(u32 cid, u32 port)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 592 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 593 return false;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 594 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 595
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 596 static int hvs_update_recv_data(struct hvsock *hvs)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 597 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 598 struct hvs_recv_buf *recv_buf;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 599 u32 payload_len;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 600
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 601 recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 602 payload_len = recv_buf->hdr.data_size;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 603
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 604 if (payload_len > HVS_MTU_SIZE)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 605 return -EIO;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 606
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 607 if (payload_len == 0)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 608 hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 609
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 610 hvs->recv_data_len = payload_len;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 611 hvs->recv_data_off = 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 612
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 613 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 614 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 615
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 616 static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 617 size_t len, int flags)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 618 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 619 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 620 bool need_refill = !hvs->recv_desc;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 621 struct hvs_recv_buf *recv_buf;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 622 u32 to_read;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 623 int ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 624
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 625 if (flags & MSG_PEEK)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 626 return -EOPNOTSUPP;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 627
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 628 if (need_refill) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 629 hvs->recv_desc = hv_pkt_iter_first(hvs->chan);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 630 ret = hvs_update_recv_data(hvs);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 631 if (ret)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 632 return ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 633 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 634
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 635 recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 636 to_read = min_t(u32, len, hvs->recv_data_len);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 637 ret = memcpy_to_msg(msg, recv_buf->data + hvs->recv_data_off, to_read);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 638 if (ret != 0)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 639 return ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 640
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 641 hvs->recv_data_len -= to_read;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 642 if (hvs->recv_data_len == 0) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 643 hvs->recv_desc = hv_pkt_iter_next(hvs->chan, hvs->recv_desc);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 644 if (hvs->recv_desc) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 645 ret = hvs_update_recv_data(hvs);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 646 if (ret)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 647 return ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 648 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 649 } else {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 650 hvs->recv_data_off += to_read;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 651 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 652
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 653 return to_read;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 654 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 655
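Note on hvs_stream_dequeue() above: it keeps a per-socket cursor (remaining length plus offset) so that a caller asking for fewer bytes than the current packet holds can resume on the next call. A rough userspace sketch of that bookkeeping, assuming a single in-memory packet, is shown here. `struct sketch_rx` and `sketch_read` are hypothetical names, and fetching the next packet is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over one received packet. */
struct sketch_rx {
	const unsigned char *pkt;	/* start of the packet payload */
	size_t len;			/* bytes not yet handed to the caller */
	size_t off;			/* offset of the next unread byte */
};

/* Copy up to 'want' bytes out of the current packet; return bytes copied. */
static size_t sketch_read(struct sketch_rx *rx, unsigned char *dst, size_t want)
{
	size_t n;

	assert(rx != NULL && dst != NULL);
	n = want < rx->len ? want : rx->len;
	memcpy(dst, rx->pkt + rx->off, n);

	rx->len -= n;
	if (rx->len == 0)
		rx->off = 0;	/* packet drained; a caller would fetch the next one */
	else
		rx->off += n;	/* partial read; resume here next time */

	return n;
}
```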
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 656 static ssize_t hvs_stream_enqueue(struct vsock_sock *vsk, struct msghdr *msg,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 657 size_t len)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 658 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 659 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 660 struct vmbus_channel *chan = hvs->chan;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 661 struct hvs_send_buf *send_buf;
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 662 ssize_t to_write, max_writable;
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 663 ssize_t ret = 0;
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 664 ssize_t bytes_written = 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 665
31113cc83e30924 Himadri Pandya 2019-07-25 666 BUILD_BUG_ON(sizeof(*send_buf) != HV_HYP_PAGE_SIZE);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 667
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 668 send_buf = kmalloc(sizeof(*send_buf), GFP_KERNEL);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 669 if (!send_buf)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 670 return -ENOMEM;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 671
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 672 /* Reader(s) could be draining data from the channel as we write.
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 673 * Maximize bandwidth, by iterating until the channel is found to be
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 674 * full.
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 675 */
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 676 while (len) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 677 max_writable = hvs_channel_writable_bytes(chan);
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 678 if (!max_writable)
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 679 break;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 680 to_write = min_t(ssize_t, len, max_writable);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 681 to_write = min_t(ssize_t, to_write, HVS_SEND_BUF_SIZE);
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 682 /* memcpy_from_msg is safe for loop as it advances the offsets
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 683 * within the message iterator.
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 684 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 685 ret = memcpy_from_msg(send_buf->data, msg, to_write);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 686 if (ret < 0)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 687 goto out;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 688
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 689 ret = hvs_send_data(hvs->chan, send_buf, to_write);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 690 if (ret < 0)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 691 goto out;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 692
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 693 bytes_written += to_write;
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 694 len -= to_write;
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 695 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 696 out:
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 697 /* If any data has been sent, return that */
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 698 if (bytes_written)
14a1eaa8820e8f3 Sunil Muthuswamy 2019-05-22 699 ret = bytes_written;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 700 kfree(send_buf);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 701 return ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 702 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 703
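Note on the send loop in hvs_stream_enqueue() above: it keeps writing chunks until either the caller's data is exhausted or the channel reports no writable space, and it returns the partial byte count rather than an error once anything has been queued. The sketch below mimics that control flow against a toy fixed-size buffer. `struct sketch_chan`, `sketch_writable`, `sketch_enqueue`, and `SKETCH_CHUNK` are hypothetical stand-ins, not VMBus APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CHUNK 8u	/* hypothetical per-packet payload limit */

/* Hypothetical stand-in for a channel with limited free space. */
struct sketch_chan {
	unsigned char ring[32];
	size_t used;
};

static size_t sketch_writable(const struct sketch_chan *c)
{
	return sizeof(c->ring) - c->used;
}

/* Queue as much of src as fits, one chunk at a time; return bytes queued. */
static size_t sketch_enqueue(struct sketch_chan *c, const unsigned char *src,
			     size_t len)
{
	size_t written = 0;

	assert(c != NULL && src != NULL);
	while (len) {
		size_t room = sketch_writable(c);
		size_t n;

		if (!room)
			break;	/* channel full: report the partial count */

		n = len < room ? len : room;
		if (n > SKETCH_CHUNK)
			n = SKETCH_CHUNK;

		memcpy(c->ring + c->used, src + written, n);
		c->used += n;
		written += n;
		len -= n;
	}
	return written;
}
```

In this toy version nothing drains the buffer concurrently, so the loop stops as soon as it fills; in the listing a reader may free space between iterations, which is why the available space is re-queried on every pass.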
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 704 static s64 hvs_stream_has_data(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 705 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 706 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 707 s64 ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 708
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 709 if (hvs->recv_data_len > 0)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 710 return 1;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 711
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 712 switch (hvs_channel_readable_payload(hvs->chan)) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 713 case 1:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 714 ret = 1;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 715 break;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 716 case 0:
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 717 vsk->peer_shutdown |= SEND_SHUTDOWN;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 718 ret = 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 719 break;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 720 default: /* -1 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 721 ret = 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 722 break;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 723 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 724
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 725 return ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 726 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 727
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 728 static s64 hvs_stream_has_space(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 729 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 730 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 731
cb359b60416701c Sunil Muthuswamy 2019-06-17 732 return hvs_channel_writable_bytes(hvs->chan);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 733 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 734
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 735 static u64 hvs_stream_rcvhiwat(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 736 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 737 return HVS_MTU_SIZE + 1;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 738 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 739
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 740 static bool hvs_stream_is_active(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 741 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 742 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 743
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 744 return hvs->chan != NULL;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 745 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 746
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 747 static bool hvs_stream_allow(u32 cid, u32 port)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 748 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 749 /* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0xFFFFFFFF) is
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 750 * reserved as ephemeral ports, which are used as the host's ports
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 751 * when the host initiates connections.
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 752 *
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 753 * Perform this check in the guest so an immediate error is produced
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 754 * instead of a timeout.
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 755 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 756 if (port > MAX_HOST_LISTEN_PORT)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 757 return false;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 758
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 759 if (cid == VMADDR_CID_HOST)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 760 return true;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 761
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 762 return false;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 763 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 764
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 765 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 766 int hvs_notify_poll_in(struct vsock_sock *vsk, size_t target, bool *readable)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 767 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 768 struct hvsock *hvs = vsk->trans;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 769
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 770 *readable = hvs_channel_readable(hvs->chan);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 771 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 772 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 773
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 774 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 775 int hvs_notify_poll_out(struct vsock_sock *vsk, size_t target, bool *writable)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 776 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 777 *writable = hvs_stream_has_space(vsk) > 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 778
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 779 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 780 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 781
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 782 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 783 int hvs_notify_recv_init(struct vsock_sock *vsk, size_t target,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 784 struct vsock_transport_recv_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 785 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 786 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 787 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 788
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 789 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 790 int hvs_notify_recv_pre_block(struct vsock_sock *vsk, size_t target,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 791 struct vsock_transport_recv_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 792 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 793 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 794 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 795
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 796 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 797 int hvs_notify_recv_pre_dequeue(struct vsock_sock *vsk, size_t target,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 798 struct vsock_transport_recv_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 799 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 800 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 801 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 802
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 803 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 804 int hvs_notify_recv_post_dequeue(struct vsock_sock *vsk, size_t target,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 805 ssize_t copied, bool data_read,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 806 struct vsock_transport_recv_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 807 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 808 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 809 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 810
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 811 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 812 int hvs_notify_send_init(struct vsock_sock *vsk,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 813 struct vsock_transport_send_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 814 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 815 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 816 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 817
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 818 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 819 int hvs_notify_send_pre_block(struct vsock_sock *vsk,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 820 struct vsock_transport_send_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 821 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 822 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 823 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 824
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 825 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 826 int hvs_notify_send_pre_enqueue(struct vsock_sock *vsk,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 827 struct vsock_transport_send_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 828 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 829 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 830 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 831
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 832 static
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 833 int hvs_notify_send_post_enqueue(struct vsock_sock *vsk, ssize_t written,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 834 struct vsock_transport_send_notify_data *d)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 835 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 836 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 837 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 838
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 839 static void hvs_set_buffer_size(struct vsock_sock *vsk, u64 val)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 840 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 841 /* Ignored. */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 842 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 843
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 844 static void hvs_set_min_buffer_size(struct vsock_sock *vsk, u64 val)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 845 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 846 /* Ignored. */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 847 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 848
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 849 static void hvs_set_max_buffer_size(struct vsock_sock *vsk, u64 val)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 850 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 851 /* Ignored. */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 852 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 853
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 854 static u64 hvs_get_buffer_size(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 855 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 856 return -ENOPROTOOPT;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 857 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 858
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 859 static u64 hvs_get_min_buffer_size(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 860 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 861 return -ENOPROTOOPT;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 862 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 863
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 864 static u64 hvs_get_max_buffer_size(struct vsock_sock *vsk)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 865 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 866 return -ENOPROTOOPT;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 867 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 868
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 869 static struct vsock_transport hvs_transport = {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 870 .get_local_cid = hvs_get_local_cid,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 871
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 872 .init = hvs_sock_init,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 873 .destruct = hvs_destruct,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 874 .release = hvs_release,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 875 .connect = hvs_connect,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 876 .shutdown = hvs_shutdown,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 877
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 878 .dgram_bind = hvs_dgram_bind,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 879 .dgram_dequeue = hvs_dgram_dequeue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 880 .dgram_enqueue = hvs_dgram_enqueue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 881 .dgram_allow = hvs_dgram_allow,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 882
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 883 .stream_dequeue = hvs_stream_dequeue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 884 .stream_enqueue = hvs_stream_enqueue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 885 .stream_has_data = hvs_stream_has_data,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 886 .stream_has_space = hvs_stream_has_space,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 887 .stream_rcvhiwat = hvs_stream_rcvhiwat,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 888 .stream_is_active = hvs_stream_is_active,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 889 .stream_allow = hvs_stream_allow,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 890
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 891 .notify_poll_in = hvs_notify_poll_in,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 892 .notify_poll_out = hvs_notify_poll_out,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 893 .notify_recv_init = hvs_notify_recv_init,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 894 .notify_recv_pre_block = hvs_notify_recv_pre_block,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 895 .notify_recv_pre_dequeue = hvs_notify_recv_pre_dequeue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 896 .notify_recv_post_dequeue = hvs_notify_recv_post_dequeue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 897 .notify_send_init = hvs_notify_send_init,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 898 .notify_send_pre_block = hvs_notify_send_pre_block,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 899 .notify_send_pre_enqueue = hvs_notify_send_pre_enqueue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 900 .notify_send_post_enqueue = hvs_notify_send_post_enqueue,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 901
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 902 .set_buffer_size = hvs_set_buffer_size,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 903 .set_min_buffer_size = hvs_set_min_buffer_size,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 904 .set_max_buffer_size = hvs_set_max_buffer_size,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 905 .get_buffer_size = hvs_get_buffer_size,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 906 .get_min_buffer_size = hvs_get_min_buffer_size,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 907 .get_max_buffer_size = hvs_get_max_buffer_size,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 908 };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 909
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 910 static int hvs_probe(struct hv_device *hdev,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 911 const struct hv_vmbus_device_id *dev_id)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 912 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 913 struct vmbus_channel *chan = hdev->channel;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 914
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 915 hvs_open_connection(chan);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 916
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 917 /* Always return success to suppress the unnecessary error message
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 918 * in vmbus_probe(): on error the host will rescind the device in
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 919 * 30 seconds and we can do cleanup at that time in
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 920 * vmbus_onoffer_rescind().
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 921 */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 922 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 923 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 924
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 925 static int hvs_remove(struct hv_device *hdev)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 926 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 927 struct vmbus_channel *chan = hdev->channel;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 928
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 929 vmbus_close(chan);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 930
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 931 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 932 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 933
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 934 /* This isn't really used. See vmbus_match() and vmbus_probe() */
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 935 static const struct hv_vmbus_device_id id_table[] = {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 936 {},
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 937 };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 938
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 939 static struct hv_driver hvs_drv = {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 940 .name = "hv_sock",
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 941 .hvsock = true,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 942 .id_table = id_table,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 943 .probe = hvs_probe,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 944 .remove = hvs_remove,
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 945 };
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 946
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 947 static int __init hvs_init(void)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 948 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 949 int ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 950
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 951 if (vmbus_proto_version < VERSION_WIN10)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 952 return -ENODEV;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 953
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 954 ret = vmbus_driver_register(&hvs_drv);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 955 if (ret != 0)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 956 return ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 957
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 958 ret = vsock_core_init(&hvs_transport);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 959 if (ret) {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 960 vmbus_driver_unregister(&hvs_drv);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 961 return ret;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 962 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 963
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 964 return 0;
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 965 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 966
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 967 static void __exit hvs_exit(void)
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 968 {
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 969 vsock_core_exit();
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 970 vmbus_driver_unregister(&hvs_drv);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 971 }
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 972
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 973 module_init(hvs_init);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 974 module_exit(hvs_exit);
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 975
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 976 MODULE_DESCRIPTION("Hyper-V Sockets");
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 977 MODULE_VERSION("1.0.0");
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 978 MODULE_LICENSE("GPL");
ae0078fcf0a5eb3 Dexuan Cui 2017-08-26 979 MODULE_ALIAS_NETPROTO(PF_VSOCK);
:::::: The code at line 214 was first introduced by commit
:::::: ae0078fcf0a5eb3a8623bfb5f988262e0911fdb9 hv_sock: implements Hyper-V transport for Virtual Sockets (AF_VSOCK)
:::::: TO: Dexuan Cui <decui@microsoft.com>
:::::: CC: David S. Miller <davem@davemloft.net>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
^ permalink raw reply
* Re: [PATCH v3 bpf-next 0/9] Revamp test_progs as a test running framework
From: Alexei Starovoitov @ 2019-07-28 5:41 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
Stanislav Fomichev, Andrii Nakryiko, Kernel Team
In-Reply-To: <20190728032531.2358749-1-andriin@fb.com>
On Sat, Jul 27, 2019 at 8:25 PM Andrii Nakryiko <andriin@fb.com> wrote:
>
> This patch set makes a number of changes to test_progs selftest, which is
> a collection of many other tests (and sometimes sub-tests as well), to provide
> a better testing experience and allow converting many individual test
> programs under selftests/bpf into a single and convenient test runner.
>
> Patch #1 fixes issue with Makefile, which makes prog_tests/test.h compiled as
> a C code. This fix allows to change how test.h is generated, providing ability
> to have more control on what and how tests are run.
>
> Patch #2 changes how test.h is auto-generated, which allows to have test
> definitions, instead of just running test functions. This gives ability to do
> more complicated test run policies.
>
> Patch #3 adds `-t <test-name>` and `-n <test-num>` selectors to run only
> subset of tests.
>
> Patch #4 changes libbpf_set_print() to return previously set print callback,
> allowing to temporarily replace current print callback and then set it back.
> This is necessary for some tests that want more control over libbpf logging.
>
> Patch #5 sets up and takes over libbpf logging from individual tests to
> test_prog runner, adding -vv verbosity to capture debug output from libbpf.
> This is useful when debugging failing tests.
>
> Patch #6 furthers test output management and buffers it by default, emitting
> log output only if test fails. This gives succinct and clean default test
> output. It's possible to bypass this behavior with -v flag, which will turn
> off test output buffering.
>
> Patch #7 adds support for sub-tests. It also enhances -t and -n selectors to
> both support ability to specify sub-test selectors, as well as enhancing
> number selector to accept sets of test, instead of just individual test
> number.
>
> Patch #8 converts bpf_verif_scale.c test to use sub-test APIs.
>
> Patch #9 converts send_signal.c tests to use sub-test APIs.
>
> v2->v3:
> - fix buffered output rare uninitialized value bug (Alexei);
> - fix buffered output va_list reuse bug (Alexei);
> - fix buffered output truncation due to interleaving zero terminators;
Looks great.
Applied. Thanks!
^ permalink raw reply
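The buffered-output behaviour from Patch #6 and the va_list reuse fix listed under v2->v3 can be illustrated with a small userspace sketch. This is not the test_progs code: `struct test_log`, `test_log_vprintf()`, `test_log_printf()` and `test_log_finish()` are hypothetical names chosen for illustration. It assumes only that output is collected in memory and written out when a test fails.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable log buffer; not the actual test_progs code. */
struct test_log {
	char *buf;
	size_t len;	/* bytes used, excluding the terminating NUL */
	size_t cap;
};

/* Append formatted text. vsnprintf() consumes the va_list it is given,
 * so the sizing pass runs on a copy and the real pass uses the original.
 */
static int test_log_vprintf(struct test_log *log, const char *fmt, va_list args)
{
	va_list copy;
	size_t need;
	int n;

	va_copy(copy, args);
	n = vsnprintf(NULL, 0, fmt, copy);
	va_end(copy);
	if (n < 0)
		return -1;

	need = log->len + (size_t)n + 1;
	if (need > log->cap) {
		char *buf = realloc(log->buf, need * 2);

		if (!buf)
			return -1;
		log->buf = buf;
		log->cap = need * 2;
	}

	/* Start at the previous terminator so no NUL is left mid-buffer. */
	vsnprintf(log->buf + log->len, (size_t)n + 1, fmt, args);
	log->len += (size_t)n;
	return n;
}

static int test_log_printf(struct test_log *log, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = test_log_vprintf(log, fmt, args);
	va_end(args);
	return n;
}

/* Flush the buffered text only if the test failed, then reset the buffer. */
static size_t test_log_finish(struct test_log *log, int failed, FILE *out)
{
	size_t written = 0;

	assert(out);
	if (failed && log->len)
		written = fwrite(log->buf, 1, log->len, out);
	log->len = 0;
	return written;
}
```

A caller would zero-initialize a `struct test_log`, route each test's prints through `test_log_printf()`, and call `test_log_finish()` with the test's pass/fail status once it completes.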
* [PATCH net-next v4 0/3] flow_offload: add indr-block in nf_table_offload
From: wenxu @ 2019-07-28 6:52 UTC (permalink / raw)
To: pablo, fw, jakub.kicinski; +Cc: netfilter-devel, netdev
From: wenxu <wenxu@ucloud.cn>
This patch series makes nftables offload support vlan and
tunnel device offload through the indr-block architecture.
The first patch moves the tc indr block to flow offload and renames
it to flow-indr-block.
Because the new flow-indr-block can't get the tcf_block
directly, the second patch provides a way to get the tcf_block
immediately when the device registers and contains an ingress block.
The third patch makes nf_tables_offload support flow-indr-block.
wenxu (3):
flow_offload: move tc indirect block to flow offload
flow_offload: Support get default block from tc immediately
netfilter: nf_tables_offload: support indr block call
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 10 +-
.../net/ethernet/netronome/nfp/flower/offload.c | 10 +-
include/net/flow_offload.h | 39 ++++
include/net/pkt_cls.h | 42 +---
include/net/sch_generic.h | 3 -
net/core/flow_offload.c | 181 +++++++++++++++
net/netfilter/nf_tables_offload.c | 131 +++++++++--
net/sched/cls_api.c | 246 ++++-----------------
8 files changed, 385 insertions(+), 277 deletions(-)
--
1.8.3.1
^ permalink raw reply
* [PATCH net-next v4 2/3] flow_offload: Support get default block from tc immediately
From: wenxu @ 2019-07-28 6:52 UTC (permalink / raw)
To: pablo, fw, jakub.kicinski; +Cc: netfilter-devel, netdev
In-Reply-To: <1564296769-32294-1-git-send-email-wenxu@ucloud.cn>
From: wenxu <wenxu@ucloud.cn>
When the indr device registers, it can get the default block
from tc immediately if the block exists.
Signed-off-by: wenxu <wenxu@ucloud.cn>
---
v3: no change
v4: get tc default block without callback
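A userspace sketch of the idea, for readers following along: when the per-device entry is created, look for an existing ingress block and, if one is found, record it together with the command callback. All `*_model` types and the function names below are hypothetical stand-ins for the kernel structures touched by this patch; this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for flow_block, tcf_block, Qdisc and net_device. */
struct flow_block_model { int id; };

struct tcf_block_model { struct flow_block_model flow_block; };

struct qdisc_model {
	/* NULL when the qdisc exposes no classifier block. */
	struct tcf_block_model *(*tcf_block)(struct qdisc_model *q);
	struct tcf_block_model block;
};

struct netdev_model {
	struct qdisc_model *ingress_qdisc;	/* NULL when none is attached */
};

struct indr_dev_model {
	struct netdev_model *dev;
	struct flow_block_model *flow_block;	/* stays NULL without a block */
	int has_ing_cmd_cb;			/* models ing_cmd_cb being set */
};

/* Example accessor a qdisc could install as its tcf_block hook. */
static struct tcf_block_model *qdisc_block(struct qdisc_model *q)
{
	return &q->block;
}

/* Walk the chain of optional pointers; any missing link means "no block". */
static struct tcf_block_model *dev_ingress_block(struct netdev_model *dev)
{
	struct qdisc_model *q = dev->ingress_qdisc;

	if (!q || !q->tcf_block)
		return NULL;
	return q->tcf_block(q);
}

/* Called once when the per-device entry is created. */
static void indr_get_default_block(struct indr_dev_model *indr_dev)
{
	struct tcf_block_model *block = dev_ingress_block(indr_dev->dev);

	if (block) {
		indr_dev->flow_block = &block->flow_block;
		indr_dev->has_ing_cmd_cb = 1;
	}
}
```

A device without an ingress qdisc leaves both fields untouched, so a later registration on that device has no block to bind against.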
include/net/pkt_cls.h | 7 +++++++
net/core/flow_offload.c | 2 ++
net/sched/cls_api.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 0790a4e..77c3a42 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -54,6 +54,8 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
struct tcf_block_ext_info *ei);
+void tc_indr_get_default_block(struct flow_indr_block_dev *indr_dev);
+
static inline bool tcf_block_shared(struct tcf_block *block)
{
return block->index;
@@ -74,6 +76,11 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
struct tcf_result *res, bool compat_mode);
#else
+static inline
+void tc_indr_get_default_block(struct flow_indr_block_dev *indr_dev)
+{
+}
+
static inline bool tcf_block_shared(struct tcf_block *block)
{
return false;
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 9f1ae67..0ca3d51 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -3,6 +3,7 @@
#include <linux/slab.h>
#include <net/flow_offload.h>
#include <linux/rtnetlink.h>
+#include <net/pkt_cls.h>
struct flow_rule *flow_rule_alloc(unsigned int num_actions)
{
@@ -312,6 +313,7 @@ static struct flow_indr_block_dev *flow_indr_block_dev_get(struct net_device *de
INIT_LIST_HEAD(&indr_dev->cb_list);
indr_dev->dev = dev;
+ tc_indr_get_default_block(indr_dev);
if (rhashtable_insert_fast(&indr_setup_block_ht, &indr_dev->ht_node,
flow_indr_setup_block_ht_params)) {
kfree(indr_dev);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index d551c56..59e9572 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -576,6 +576,39 @@ static void tc_indr_block_ing_cmd(struct net_device *dev,
tcf_block_setup(block, &bo);
}
+static struct tcf_block *tc_dev_ingress_block(struct net_device *dev)
+{
+ const struct Qdisc_class_ops *cops;
+ struct Qdisc *qdisc;
+
+ if (!dev_ingress_queue(dev))
+ return NULL;
+
+ qdisc = dev_ingress_queue(dev)->qdisc_sleeping;
+ if (!qdisc)
+ return NULL;
+
+ cops = qdisc->ops->cl_ops;
+ if (!cops)
+ return NULL;
+
+ if (!cops->tcf_block)
+ return NULL;
+
+ return cops->tcf_block(qdisc, TC_H_MIN_INGRESS, NULL);
+}
+
+void tc_indr_get_default_block(struct flow_indr_block_dev *indr_dev)
+{
+ struct tcf_block *block = tc_dev_ingress_block(indr_dev->dev);
+
+ if (block) {
+ indr_dev->flow_block = &block->flow_block;
+ indr_dev->ing_cmd_cb = tc_indr_block_ing_cmd;
+ }
+}
+EXPORT_SYMBOL(tc_indr_get_default_block);
+
static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
struct tcf_block_ext_info *ei,
enum flow_block_command command,
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next v4 3/3] netfilter: nf_tables_offload: support indr block call
From: wenxu @ 2019-07-28 6:52 UTC (permalink / raw)
To: pablo, fw, jakub.kicinski; +Cc: netfilter-devel, netdev
In-Reply-To: <1564296769-32294-1-git-send-email-wenxu@ucloud.cn>
From: wenxu <wenxu@ucloud.cn>
Make nftables support the indr-block call. This lets nftables
offload rules on vlan and tunnel devices.
nft add table netdev firewall
nft add chain netdev firewall aclout { type filter hook ingress offload device mlx_pf0vf0 priority - 300 \; }
nft add rule netdev firewall aclout ip daddr 10.0.0.1 fwd to vlan0
nft add chain netdev firewall aclin { type filter hook ingress device vlan0 priority - 300 \; }
nft add rule netdev firewall aclin ip daddr 10.0.0.7 fwd to mlx_pf0vf0
Signed-off-by: wenxu <wenxu@ucloud.cn>
---
v3: subsys_initcall for init_flow_indr_rhashtable
v4: guarantee only one offload base chain used per indr dev.
If the indr_block_cmd bind fails, return unsupported.
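The dispatch this patch introduces can be sketched in userspace as follows: a device with its own setup hook is asked directly, otherwise every registered indirect callback is offered the request, and the command is reported as unsupported if none of them binds. The types and names (`dev_model`, `offload_req`, `offload_chain()` and the two example drivers) are hypothetical and only model the control flow, not the kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum block_cmd { BLOCK_BIND, BLOCK_UNBIND };

/* Hypothetical request; bound_cbs models a non-empty bo.cb_list. */
struct offload_req {
	enum block_cmd cmd;
	int bound_cbs;
};

typedef int (*setup_cb_t)(void *priv, struct offload_req *req);

struct indr_cb_model {
	setup_cb_t cb;
	void *priv;
};

struct dev_model {
	setup_cb_t ndo_setup_tc;	/* NULL for vlan/tunnel-like devices */
	void *ndo_priv;
	struct indr_cb_model *indr_cbs;	/* NULL when nothing is registered */
	size_t n_indr_cbs;
};

/* Direct path: the device driver handles the request itself. */
static int direct_cmd(struct dev_model *dev, enum block_cmd cmd)
{
	struct offload_req req = { .cmd = cmd };
	int err = dev->ndo_setup_tc(dev->ndo_priv, &req);

	return err < 0 ? err : 0;
}

/* Indirect path: offer the request to every registered callback. */
static int indirect_cmd(struct dev_model *dev, enum block_cmd cmd)
{
	struct offload_req req = { .cmd = cmd };
	size_t i;

	if (!dev->indr_cbs)
		return -EOPNOTSUPP;
	for (i = 0; i < dev->n_indr_cbs; i++)
		dev->indr_cbs[i].cb(dev->indr_cbs[i].priv, &req);
	/* Nobody attached a block callback: treat as unsupported. */
	if (req.bound_cbs == 0)
		return -EOPNOTSUPP;
	return 0;
}

static int offload_chain(struct dev_model *dev, enum block_cmd cmd)
{
	if (!dev)
		return -EOPNOTSUPP;
	if (dev->ndo_setup_tc)
		return direct_cmd(dev, cmd);
	return indirect_cmd(dev, cmd);
}

/* Example drivers: one binds a callback, one declines the request. */
static int accepting_driver(void *priv, struct offload_req *req)
{
	(void)priv;
	req->bound_cbs++;
	return 0;
}

static int declining_driver(void *priv, struct offload_req *req)
{
	(void)priv;
	(void)req;
	return -EOPNOTSUPP;
}
```

In this model a vlan-like device succeeds as soon as one registered driver accepts the request, even if others decline it.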
net/netfilter/nf_tables_offload.c | 131 +++++++++++++++++++++++++++++++-------
1 file changed, 107 insertions(+), 24 deletions(-)
diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index 64f5fd5..19214ad 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -171,24 +171,123 @@ static int nft_flow_offload_unbind(struct flow_block_offload *bo,
return 0;
}
+static int nft_block_setup(struct nft_base_chain *basechain,
+ struct flow_block_offload *bo,
+ enum flow_block_command cmd)
+{
+ int err;
+
+ switch (cmd) {
+ case FLOW_BLOCK_BIND:
+ err = nft_flow_offload_bind(bo, basechain);
+ break;
+ case FLOW_BLOCK_UNBIND:
+ err = nft_flow_offload_unbind(bo, basechain);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ err = -EOPNOTSUPP;
+ }
+
+ return err;
+}
+
+static int nft_block_offload_cmd(struct nft_base_chain *chain,
+ struct net_device *dev,
+ enum flow_block_command cmd)
+{
+ struct netlink_ext_ack extack = {};
+ struct flow_block_offload bo = {};
+ int err;
+
+ bo.net = dev_net(dev);
+ bo.block = &chain->flow_block;
+ bo.command = cmd;
+ bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
+ bo.extack = &extack;
+ INIT_LIST_HEAD(&bo.cb_list);
+
+ err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
+ if (err < 0)
+ return err;
+
+ return nft_block_setup(chain, &bo, cmd);
+}
+
+static void nft_indr_block_ing_cmd(struct net_device *dev,
+ struct flow_block *flow_block,
+ struct flow_indr_block_cb *indr_block_cb,
+ enum flow_block_command cmd)
+{
+ struct netlink_ext_ack extack = {};
+ struct flow_block_offload bo = {};
+ struct nft_base_chain *chain;
+
+ if (!flow_block)
+ return;
+
+ chain = container_of(flow_block, struct nft_base_chain, flow_block);
+
+ bo.net = dev_net(dev);
+ bo.block = flow_block;
+ bo.command = cmd;
+ bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
+ bo.extack = &extack;
+ INIT_LIST_HEAD(&bo.cb_list);
+
+ indr_block_cb->cb(dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK, &bo);
+
+ nft_block_setup(chain, &bo, cmd);
+}
+
+static int nft_indr_block_offload_cmd(struct nft_base_chain *chain,
+ struct net_device *dev,
+ enum flow_block_command cmd)
+{
+ struct flow_indr_block_cb *indr_block_cb;
+ struct flow_indr_block_dev *indr_dev;
+ struct flow_block_offload bo = {};
+ struct netlink_ext_ack extack = {};
+
+ bo.net = dev_net(dev);
+ bo.block = &chain->flow_block;
+ bo.command = cmd;
+ bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
+ bo.extack = &extack;
+ INIT_LIST_HEAD(&bo.cb_list);
+
+ indr_dev = flow_indr_block_dev_lookup(dev);
+ if (!indr_dev)
+ return -EOPNOTSUPP;
+
+ indr_dev->flow_block = cmd == FLOW_BLOCK_BIND ? &chain->flow_block : NULL;
+ indr_dev->ing_cmd_cb = cmd == FLOW_BLOCK_BIND ? nft_indr_block_ing_cmd : NULL;
+
+ list_for_each_entry(indr_block_cb, &indr_dev->cb_list, list)
+ indr_block_cb->cb(dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
+ &bo);
+
+ if (list_empty(&bo.cb_list))
+ return -EOPNOTSUPP;
+
+ return nft_block_setup(chain, &bo, cmd);
+}
+
#define FLOW_SETUP_BLOCK TC_SETUP_BLOCK
static int nft_flow_offload_chain(struct nft_trans *trans,
enum flow_block_command cmd)
{
struct nft_chain *chain = trans->ctx.chain;
- struct netlink_ext_ack extack = {};
- struct flow_block_offload bo = {};
struct nft_base_chain *basechain;
struct net_device *dev;
- int err;
if (!nft_is_base_chain(chain))
return -EOPNOTSUPP;
basechain = nft_base_chain(chain);
dev = basechain->ops.dev;
- if (!dev || !dev->netdev_ops->ndo_setup_tc)
+ if (!dev)
return -EOPNOTSUPP;
/* Only default policy to accept is supported for now. */
@@ -197,26 +296,10 @@ static int nft_flow_offload_chain(struct nft_trans *trans,
nft_trans_chain_policy(trans) != NF_ACCEPT)
return -EOPNOTSUPP;
- bo.command = cmd;
- bo.block = &basechain->flow_block;
- bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
- bo.extack = &extack;
- INIT_LIST_HEAD(&bo.cb_list);
-
- err = dev->netdev_ops->ndo_setup_tc(dev, FLOW_SETUP_BLOCK, &bo);
- if (err < 0)
- return err;
-
- switch (cmd) {
- case FLOW_BLOCK_BIND:
- err = nft_flow_offload_bind(&bo, basechain);
- break;
- case FLOW_BLOCK_UNBIND:
- err = nft_flow_offload_unbind(&bo, basechain);
- break;
- }
-
- return err;
+ if (dev->netdev_ops->ndo_setup_tc)
+ return nft_block_offload_cmd(basechain, dev, cmd);
+ else
+ return nft_indr_block_offload_cmd(basechain, dev, cmd);
}
int nft_flow_rule_offload_commit(struct net *net)
--
1.8.3.1
^ permalink raw reply related
* [PATCH net-next v4 1/3] flow_offload: move tc indirect block to flow offload
From: wenxu @ 2019-07-28 6:52 UTC (permalink / raw)
To: pablo, fw, jakub.kicinski; +Cc: netfilter-devel, netdev
In-Reply-To: <1564296769-32294-1-git-send-email-wenxu@ucloud.cn>
From: wenxu <wenxu@ucloud.cn>
Move the tc indirect block to flow_offload and rename
it to flow indirect block. nf_tables can then use the
indr block architecture.
Signed-off-by: wenxu <wenxu@ucloud.cn>
---
v3: subsys_initcall for init_flow_indr_rhashtable
v4: no change
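The lifetime rule behind flow_indr_block_dev_get() and flow_indr_block_dev_put() in the diff below can be shown with a small userspace sketch: the per-device entry is created on the first registration, each registration holds one reference, and the entry is freed when the last reference is dropped. The names are hypothetical, a linked list stands in for the rhashtable, and the callback list with its duplicate check is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct flow_indr_block_dev.
 * A linked list replaces the rhashtable, and the per-device callback
 * list with its duplicate check is left out.
 */
struct indr_dev {
	struct indr_dev *next;
	const void *dev;	/* lookup key, stands in for struct net_device * */
	unsigned int refcnt;	/* one reference per registered callback */
};

static struct indr_dev *indr_devs;

static struct indr_dev *indr_dev_lookup(const void *dev)
{
	struct indr_dev *d = indr_devs;

	while (d && d->dev != dev)
		d = d->next;
	return d;
}

/* Find or create the per-device entry and take a reference. */
static struct indr_dev *indr_dev_get(const void *dev)
{
	struct indr_dev *d = indr_dev_lookup(dev);

	if (!d) {
		d = calloc(1, sizeof(*d));
		if (!d)
			return NULL;
		d->dev = dev;
		d->next = indr_devs;
		indr_devs = d;
	}
	d->refcnt++;
	return d;
}

/* Drop a reference; unlink and free the entry when the last one goes. */
static void indr_dev_put(struct indr_dev *d)
{
	struct indr_dev **pp = &indr_devs;

	assert(d->refcnt > 0);
	if (--d->refcnt)
		return;
	while (*pp != d)
		pp = &(*pp)->next;
	*pp = d->next;
	free(d);
}
```

As in the patch, the counter is a plain integer, so this sketch assumes callers serialize access themselves (the kernel versions run under rtnl_lock).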
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 10 +-
.../net/ethernet/netronome/nfp/flower/offload.c | 10 +-
include/net/flow_offload.h | 39 ++++
include/net/pkt_cls.h | 35 ---
include/net/sch_generic.h | 3 -
net/core/flow_offload.c | 179 ++++++++++++++++
net/sched/cls_api.c | 235 ++-------------------
7 files changed, 247 insertions(+), 264 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 7f747cb..074573b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -785,9 +785,9 @@ static int mlx5e_rep_indr_register_block(struct mlx5e_rep_priv *rpriv,
{
int err;
- err = __tc_indr_block_cb_register(netdev, rpriv,
- mlx5e_rep_indr_setup_tc_cb,
- rpriv);
+ err = __flow_indr_block_cb_register(netdev, rpriv,
+ mlx5e_rep_indr_setup_tc_cb,
+ rpriv);
if (err) {
struct mlx5e_priv *priv = netdev_priv(rpriv->netdev);
@@ -800,8 +800,8 @@ static int mlx5e_rep_indr_register_block(struct mlx5e_rep_priv *rpriv,
static void mlx5e_rep_indr_unregister_block(struct mlx5e_rep_priv *rpriv,
struct net_device *netdev)
{
- __tc_indr_block_cb_unregister(netdev, mlx5e_rep_indr_setup_tc_cb,
- rpriv);
+ __flow_indr_block_cb_unregister(netdev, mlx5e_rep_indr_setup_tc_cb,
+ rpriv);
}
static int mlx5e_nic_rep_netdevice_event(struct notifier_block *nb,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index e209f15..6a0f034 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -1479,16 +1479,16 @@ int nfp_flower_reg_indir_block_handler(struct nfp_app *app,
return NOTIFY_OK;
if (event == NETDEV_REGISTER) {
- err = __tc_indr_block_cb_register(netdev, app,
- nfp_flower_indr_setup_tc_cb,
- app);
+ err = __flow_indr_block_cb_register(netdev, app,
+ nfp_flower_indr_setup_tc_cb,
+ app);
if (err)
nfp_flower_cmsg_warn(app,
"Indirect block reg failed - %s\n",
netdev->name);
} else if (event == NETDEV_UNREGISTER) {
- __tc_indr_block_cb_unregister(netdev,
- nfp_flower_indr_setup_tc_cb, app);
+ __flow_indr_block_cb_unregister(netdev,
+ nfp_flower_indr_setup_tc_cb, app);
}
return NOTIFY_OK;
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 00b9aab..66f89bc 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -4,6 +4,7 @@
#include <linux/kernel.h>
#include <linux/list.h>
#include <net/flow_dissector.h>
+#include <linux/rhashtable.h>
struct flow_match {
struct flow_dissector *dissector;
@@ -366,4 +367,42 @@ static inline void flow_block_init(struct flow_block *flow_block)
INIT_LIST_HEAD(&flow_block->cb_list);
}
+typedef int flow_indr_block_bind_cb_t(struct net_device *dev, void *cb_priv,
+ enum tc_setup_type type, void *type_data);
+
+struct flow_indr_block_cb {
+ struct list_head list;
+ void *cb_priv;
+ flow_indr_block_bind_cb_t *cb;
+ void *cb_ident;
+};
+
+typedef void flow_indr_block_ing_cmd_t(struct net_device *dev,
+ struct flow_block *flow_block,
+ struct flow_indr_block_cb *indr_block_cb,
+ enum flow_block_command command);
+
+struct flow_indr_block_dev {
+ struct rhash_head ht_node;
+ struct net_device *dev;
+ unsigned int refcnt;
+ struct list_head cb_list;
+ flow_indr_block_ing_cmd_t *ing_cmd_cb;
+ struct flow_block *flow_block;
+};
+
+struct flow_indr_block_dev *flow_indr_block_dev_lookup(struct net_device *dev);
+
+int __flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+ flow_indr_block_bind_cb_t *cb, void *cb_ident);
+
+void __flow_indr_block_cb_unregister(struct net_device *dev,
+ flow_indr_block_bind_cb_t *cb, void *cb_ident);
+
+int flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+ flow_indr_block_bind_cb_t *cb, void *cb_ident);
+
+void flow_indr_block_cb_unregister(struct net_device *dev,
+ flow_indr_block_bind_cb_t *cb, void *cb_ident);
+
#endif /* _NET_FLOW_OFFLOAD_H */
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e429809..0790a4e 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -70,15 +70,6 @@ static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
return block->q;
}
-int __tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
- tc_indr_block_bind_cb_t *cb, void *cb_ident);
-int tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
- tc_indr_block_bind_cb_t *cb, void *cb_ident);
-void __tc_indr_block_cb_unregister(struct net_device *dev,
- tc_indr_block_bind_cb_t *cb, void *cb_ident);
-void tc_indr_block_cb_unregister(struct net_device *dev,
- tc_indr_block_bind_cb_t *cb, void *cb_ident);
-
int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
struct tcf_result *res, bool compat_mode);
@@ -137,32 +128,6 @@ void tc_setup_cb_block_unregister(struct tcf_block *block, flow_setup_cb_t *cb,
{
}
-static inline
-int __tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- return 0;
-}
-
-static inline
-int tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- return 0;
-}
-
-static inline
-void __tc_indr_block_cb_unregister(struct net_device *dev,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
-}
-
-static inline
-void tc_indr_block_cb_unregister(struct net_device *dev,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
-}
-
static inline int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
struct tcf_result *res, bool compat_mode)
{
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6b6b012..d9f359a 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -23,9 +23,6 @@
struct module;
struct bpf_flow_keys;
-typedef int tc_indr_block_bind_cb_t(struct net_device *dev, void *cb_priv,
- enum tc_setup_type type, void *type_data);
-
struct qdisc_rate_table {
struct tc_ratespec rate;
u32 data[256];
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index d63b970..9f1ae67 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -2,6 +2,7 @@
#include <linux/kernel.h>
#include <linux/slab.h>
#include <net/flow_offload.h>
+#include <linux/rtnetlink.h>
struct flow_rule *flow_rule_alloc(unsigned int num_actions)
{
@@ -280,3 +281,181 @@ int flow_block_cb_setup_simple(struct flow_block_offload *f,
}
}
EXPORT_SYMBOL(flow_block_cb_setup_simple);
+
+static struct rhashtable indr_setup_block_ht;
+
+static const struct rhashtable_params flow_indr_setup_block_ht_params = {
+ .key_offset = offsetof(struct flow_indr_block_dev, dev),
+ .head_offset = offsetof(struct flow_indr_block_dev, ht_node),
+ .key_len = sizeof(struct net_device *),
+};
+
+struct flow_indr_block_dev *
+flow_indr_block_dev_lookup(struct net_device *dev)
+{
+ return rhashtable_lookup_fast(&indr_setup_block_ht, &dev,
+ flow_indr_setup_block_ht_params);
+}
+EXPORT_SYMBOL(flow_indr_block_dev_lookup);
+
+static struct flow_indr_block_dev *flow_indr_block_dev_get(struct net_device *dev)
+{
+ struct flow_indr_block_dev *indr_dev;
+
+ indr_dev = flow_indr_block_dev_lookup(dev);
+ if (indr_dev)
+ goto inc_ref;
+
+ indr_dev = kzalloc(sizeof(*indr_dev), GFP_KERNEL);
+ if (!indr_dev)
+ return NULL;
+
+ INIT_LIST_HEAD(&indr_dev->cb_list);
+ indr_dev->dev = dev;
+ if (rhashtable_insert_fast(&indr_setup_block_ht, &indr_dev->ht_node,
+ flow_indr_setup_block_ht_params)) {
+ kfree(indr_dev);
+ return NULL;
+ }
+
+inc_ref:
+ indr_dev->refcnt++;
+ return indr_dev;
+}
+
+static void flow_indr_block_dev_put(struct flow_indr_block_dev *indr_dev)
+{
+ if (--indr_dev->refcnt)
+ return;
+
+ rhashtable_remove_fast(&indr_setup_block_ht, &indr_dev->ht_node,
+ flow_indr_setup_block_ht_params);
+ kfree(indr_dev);
+}
+
+static struct flow_indr_block_cb *
+flow_indr_block_cb_lookup(struct flow_indr_block_dev *indr_dev,
+ flow_indr_block_bind_cb_t *cb, void *cb_ident)
+{
+ struct flow_indr_block_cb *indr_block_cb;
+
+ list_for_each_entry(indr_block_cb, &indr_dev->cb_list, list)
+ if (indr_block_cb->cb == cb &&
+ indr_block_cb->cb_ident == cb_ident)
+ return indr_block_cb;
+ return NULL;
+}
+
+static struct flow_indr_block_cb *
+flow_indr_block_cb_add(struct flow_indr_block_dev *indr_dev, void *cb_priv,
+ flow_indr_block_bind_cb_t *cb, void *cb_ident)
+{
+ struct flow_indr_block_cb *indr_block_cb;
+
+ indr_block_cb = flow_indr_block_cb_lookup(indr_dev, cb, cb_ident);
+ if (indr_block_cb)
+ return ERR_PTR(-EEXIST);
+
+ indr_block_cb = kzalloc(sizeof(*indr_block_cb), GFP_KERNEL);
+ if (!indr_block_cb)
+ return ERR_PTR(-ENOMEM);
+
+ indr_block_cb->cb_priv = cb_priv;
+ indr_block_cb->cb = cb;
+ indr_block_cb->cb_ident = cb_ident;
+ list_add(&indr_block_cb->list, &indr_dev->cb_list);
+
+ return indr_block_cb;
+}
+
+static void flow_indr_block_cb_del(struct flow_indr_block_cb *indr_block_cb)
+{
+ list_del(&indr_block_cb->list);
+ kfree(indr_block_cb);
+}
+
+int __flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+ flow_indr_block_bind_cb_t *cb,
+ void *cb_ident)
+{
+ struct flow_indr_block_cb *indr_block_cb;
+ struct flow_indr_block_dev *indr_dev;
+ int err;
+
+ indr_dev = flow_indr_block_dev_get(dev);
+ if (!indr_dev)
+ return -ENOMEM;
+
+ indr_block_cb = flow_indr_block_cb_add(indr_dev, cb_priv, cb, cb_ident);
+ err = PTR_ERR_OR_ZERO(indr_block_cb);
+ if (err)
+ goto err_dev_put;
+
+ if (indr_dev->ing_cmd_cb)
+ indr_dev->ing_cmd_cb(indr_dev->dev, indr_dev->flow_block, indr_block_cb,
+ FLOW_BLOCK_BIND);
+
+ return 0;
+
+err_dev_put:
+ flow_indr_block_dev_put(indr_dev);
+ return err;
+}
+EXPORT_SYMBOL_GPL(__flow_indr_block_cb_register);
+
+int flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
+ flow_indr_block_bind_cb_t *cb,
+ void *cb_ident)
+{
+ int err;
+
+ rtnl_lock();
+ err = __flow_indr_block_cb_register(dev, cb_priv, cb, cb_ident);
+ rtnl_unlock();
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(flow_indr_block_cb_register);
+
+void __flow_indr_block_cb_unregister(struct net_device *dev,
+ flow_indr_block_bind_cb_t *cb,
+ void *cb_ident)
+{
+ struct flow_indr_block_cb *indr_block_cb;
+ struct flow_indr_block_dev *indr_dev;
+
+ indr_dev = flow_indr_block_dev_lookup(dev);
+ if (!indr_dev)
+ return;
+
+ indr_block_cb = flow_indr_block_cb_lookup(indr_dev, cb, cb_ident);
+ if (!indr_block_cb)
+ return;
+
+ /* Send unbind message if required to free any block cbs. */
+ if (indr_dev->ing_cmd_cb)
+ indr_dev->ing_cmd_cb(indr_dev->dev, indr_dev->flow_block,
+ indr_block_cb,
+ FLOW_BLOCK_UNBIND);
+
+ flow_indr_block_cb_del(indr_block_cb);
+ flow_indr_block_dev_put(indr_dev);
+}
+EXPORT_SYMBOL_GPL(__flow_indr_block_cb_unregister);
+
+void flow_indr_block_cb_unregister(struct net_device *dev,
+ flow_indr_block_bind_cb_t *cb,
+ void *cb_ident)
+{
+ rtnl_lock();
+ __flow_indr_block_cb_unregister(dev, cb, cb_ident);
+ rtnl_unlock();
+}
+EXPORT_SYMBOL_GPL(flow_indr_block_cb_unregister);
+
+static int __init init_flow_indr_rhashtable(void)
+{
+ return rhashtable_init(&indr_setup_block_ht,
+ &flow_indr_setup_block_ht_params);
+}
+subsys_initcall(init_flow_indr_rhashtable);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3565d9a..d551c56 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -37,6 +37,7 @@
#include <net/tc_act/tc_skbedit.h>
#include <net/tc_act/tc_ct.h>
#include <net/tc_act/tc_mpls.h>
+#include <net/flow_offload.h>
extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
@@ -545,235 +546,43 @@ static void tcf_chain_flush(struct tcf_chain *chain, bool rtnl_held)
}
}
-static struct tcf_block *tc_dev_ingress_block(struct net_device *dev)
-{
- const struct Qdisc_class_ops *cops;
- struct Qdisc *qdisc;
-
- if (!dev_ingress_queue(dev))
- return NULL;
-
- qdisc = dev_ingress_queue(dev)->qdisc_sleeping;
- if (!qdisc)
- return NULL;
-
- cops = qdisc->ops->cl_ops;
- if (!cops)
- return NULL;
-
- if (!cops->tcf_block)
- return NULL;
-
- return cops->tcf_block(qdisc, TC_H_MIN_INGRESS, NULL);
-}
-
-static struct rhashtable indr_setup_block_ht;
-
-struct tc_indr_block_dev {
- struct rhash_head ht_node;
- struct net_device *dev;
- unsigned int refcnt;
- struct list_head cb_list;
- struct tcf_block *block;
-};
-
-struct tc_indr_block_cb {
- struct list_head list;
- void *cb_priv;
- tc_indr_block_bind_cb_t *cb;
- void *cb_ident;
-};
-
-static const struct rhashtable_params tc_indr_setup_block_ht_params = {
- .key_offset = offsetof(struct tc_indr_block_dev, dev),
- .head_offset = offsetof(struct tc_indr_block_dev, ht_node),
- .key_len = sizeof(struct net_device *),
-};
-
-static struct tc_indr_block_dev *
-tc_indr_block_dev_lookup(struct net_device *dev)
-{
- return rhashtable_lookup_fast(&indr_setup_block_ht, &dev,
- tc_indr_setup_block_ht_params);
-}
-
-static struct tc_indr_block_dev *tc_indr_block_dev_get(struct net_device *dev)
-{
- struct tc_indr_block_dev *indr_dev;
-
- indr_dev = tc_indr_block_dev_lookup(dev);
- if (indr_dev)
- goto inc_ref;
-
- indr_dev = kzalloc(sizeof(*indr_dev), GFP_KERNEL);
- if (!indr_dev)
- return NULL;
-
- INIT_LIST_HEAD(&indr_dev->cb_list);
- indr_dev->dev = dev;
- indr_dev->block = tc_dev_ingress_block(dev);
- if (rhashtable_insert_fast(&indr_setup_block_ht, &indr_dev->ht_node,
- tc_indr_setup_block_ht_params)) {
- kfree(indr_dev);
- return NULL;
- }
-
-inc_ref:
- indr_dev->refcnt++;
- return indr_dev;
-}
-
-static void tc_indr_block_dev_put(struct tc_indr_block_dev *indr_dev)
-{
- if (--indr_dev->refcnt)
- return;
-
- rhashtable_remove_fast(&indr_setup_block_ht, &indr_dev->ht_node,
- tc_indr_setup_block_ht_params);
- kfree(indr_dev);
-}
-
-static struct tc_indr_block_cb *
-tc_indr_block_cb_lookup(struct tc_indr_block_dev *indr_dev,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- struct tc_indr_block_cb *indr_block_cb;
-
- list_for_each_entry(indr_block_cb, &indr_dev->cb_list, list)
- if (indr_block_cb->cb == cb &&
- indr_block_cb->cb_ident == cb_ident)
- return indr_block_cb;
- return NULL;
-}
-
-static struct tc_indr_block_cb *
-tc_indr_block_cb_add(struct tc_indr_block_dev *indr_dev, void *cb_priv,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- struct tc_indr_block_cb *indr_block_cb;
-
- indr_block_cb = tc_indr_block_cb_lookup(indr_dev, cb, cb_ident);
- if (indr_block_cb)
- return ERR_PTR(-EEXIST);
-
- indr_block_cb = kzalloc(sizeof(*indr_block_cb), GFP_KERNEL);
- if (!indr_block_cb)
- return ERR_PTR(-ENOMEM);
-
- indr_block_cb->cb_priv = cb_priv;
- indr_block_cb->cb = cb;
- indr_block_cb->cb_ident = cb_ident;
- list_add(&indr_block_cb->list, &indr_dev->cb_list);
-
- return indr_block_cb;
-}
-
-static void tc_indr_block_cb_del(struct tc_indr_block_cb *indr_block_cb)
-{
- list_del(&indr_block_cb->list);
- kfree(indr_block_cb);
-}
-
static int tcf_block_setup(struct tcf_block *block,
struct flow_block_offload *bo);
-static void tc_indr_block_ing_cmd(struct tc_indr_block_dev *indr_dev,
- struct tc_indr_block_cb *indr_block_cb,
+static void tc_indr_block_ing_cmd(struct net_device *dev,
+ struct flow_block *flow_block,
+ struct flow_indr_block_cb *indr_block_cb,
enum flow_block_command command)
{
+ struct tcf_block *block = flow_block ?
+ container_of(flow_block,
+ struct tcf_block,
+ flow_block) : NULL;
struct flow_block_offload bo = {
.command = command,
.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS,
- .net = dev_net(indr_dev->dev),
- .block_shared = tcf_block_non_null_shared(indr_dev->block),
+ .net = dev_net(dev),
+ .block_shared = tcf_block_non_null_shared(block),
};
INIT_LIST_HEAD(&bo.cb_list);
- if (!indr_dev->block)
- return;
-
- bo.block = &indr_dev->block->flow_block;
-
- indr_block_cb->cb(indr_dev->dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
- &bo);
- tcf_block_setup(indr_dev->block, &bo);
-}
-
-int __tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- struct tc_indr_block_cb *indr_block_cb;
- struct tc_indr_block_dev *indr_dev;
- int err;
-
- indr_dev = tc_indr_block_dev_get(dev);
- if (!indr_dev)
- return -ENOMEM;
-
- indr_block_cb = tc_indr_block_cb_add(indr_dev, cb_priv, cb, cb_ident);
- err = PTR_ERR_OR_ZERO(indr_block_cb);
- if (err)
- goto err_dev_put;
-
- tc_indr_block_ing_cmd(indr_dev, indr_block_cb, FLOW_BLOCK_BIND);
- return 0;
-
-err_dev_put:
- tc_indr_block_dev_put(indr_dev);
- return err;
-}
-EXPORT_SYMBOL_GPL(__tc_indr_block_cb_register);
-
-int tc_indr_block_cb_register(struct net_device *dev, void *cb_priv,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- int err;
-
- rtnl_lock();
- err = __tc_indr_block_cb_register(dev, cb_priv, cb, cb_ident);
- rtnl_unlock();
-
- return err;
-}
-EXPORT_SYMBOL_GPL(tc_indr_block_cb_register);
-
-void __tc_indr_block_cb_unregister(struct net_device *dev,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- struct tc_indr_block_cb *indr_block_cb;
- struct tc_indr_block_dev *indr_dev;
-
- indr_dev = tc_indr_block_dev_lookup(dev);
- if (!indr_dev)
+ if (!block)
return;
- indr_block_cb = tc_indr_block_cb_lookup(indr_dev, cb, cb_ident);
- if (!indr_block_cb)
- return;
+ bo.block = flow_block;
- /* Send unbind message if required to free any block cbs. */
- tc_indr_block_ing_cmd(indr_dev, indr_block_cb, FLOW_BLOCK_UNBIND);
- tc_indr_block_cb_del(indr_block_cb);
- tc_indr_block_dev_put(indr_dev);
-}
-EXPORT_SYMBOL_GPL(__tc_indr_block_cb_unregister);
+ indr_block_cb->cb(dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK, &bo);
-void tc_indr_block_cb_unregister(struct net_device *dev,
- tc_indr_block_bind_cb_t *cb, void *cb_ident)
-{
- rtnl_lock();
- __tc_indr_block_cb_unregister(dev, cb, cb_ident);
- rtnl_unlock();
+ tcf_block_setup(block, &bo);
}
-EXPORT_SYMBOL_GPL(tc_indr_block_cb_unregister);
static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
struct tcf_block_ext_info *ei,
enum flow_block_command command,
struct netlink_ext_ack *extack)
{
- struct tc_indr_block_cb *indr_block_cb;
- struct tc_indr_block_dev *indr_dev;
+ struct flow_indr_block_cb *indr_block_cb;
+ struct flow_indr_block_dev *indr_dev;
struct flow_block_offload bo = {
.command = command,
.binder_type = ei->binder_type,
@@ -784,11 +593,12 @@ static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
};
INIT_LIST_HEAD(&bo.cb_list);
- indr_dev = tc_indr_block_dev_lookup(dev);
+ indr_dev = flow_indr_block_dev_lookup(dev);
if (!indr_dev)
return;
- indr_dev->block = command == FLOW_BLOCK_BIND ? block : NULL;
+ indr_dev->flow_block = command == FLOW_BLOCK_BIND ? &block->flow_block : NULL;
+ indr_dev->ing_cmd_cb = command == FLOW_BLOCK_BIND ? tc_indr_block_ing_cmd : NULL;
list_for_each_entry(indr_block_cb, &indr_dev->cb_list, list)
indr_block_cb->cb(dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
@@ -3358,11 +3168,6 @@ static int __init tc_filter_init(void)
if (err)
goto err_register_pernet_subsys;
- err = rhashtable_init(&indr_setup_block_ht,
- &tc_indr_setup_block_ht_params);
- if (err)
- goto err_rhash_setup_block_ht;
-
rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_new_tfilter, NULL,
RTNL_FLAG_DOIT_UNLOCKED);
rtnl_register(PF_UNSPEC, RTM_DELTFILTER, tc_del_tfilter, NULL,
@@ -3376,8 +3181,6 @@ static int __init tc_filter_init(void)
return 0;
-err_rhash_setup_block_ht:
- unregister_pernet_subsys(&tcf_net_ops);
err_register_pernet_subsys:
destroy_workqueue(tc_filter_wq);
return err;
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH] rocker: fix memory leaks of fib_work on two error return paths
From: Jiri Pirko @ 2019-07-28 7:46 UTC (permalink / raw)
To: Colin King
Cc: David Ahern, David S . Miller, netdev, kernel-janitors,
linux-kernel
In-Reply-To: <20190727233726.3121-1-colin.king@canonical.com>
Sun, Jul 28, 2019 at 01:37:26AM CEST, colin.king@canonical.com wrote:
>From: Colin Ian King <colin.king@canonical.com>
>
>Currently there are two error return paths that leak memory allocated
>to fib_work. Fix this by kfree'ing fib_work before returning.
>
>Addresses-Coverity: ("Resource leak")
>Fixes: 19a9d136f198 ("ipv4: Flag fib_info with a fib_nh using IPv6 gateway")
>Fixes: dbcc4fa718ee ("rocker: Fail attempts to use routes with nexthop objects")
>Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
^ permalink raw reply
* Re: INFO: rcu detected stall in vhost_worker
From: Michael S. Tsirkin @ 2019-07-28 8:36 UTC (permalink / raw)
To: Hillf Danton
Cc: syzbot, jasowang, kvm, linux-kbuild, linux-kernel, michal.lkml,
netdev, syzkaller-bugs, torvalds, virtualization, yamada.masahiro
In-Reply-To: <000000000000e87d14058e9728d7@google.com>
On Sat, Jul 27, 2019 at 04:23:23PM +0800, Hillf Danton wrote:
>
> Fri, 26 Jul 2019 08:26:01 -0700 (PDT)
> > syzbot has bisected this bug to:
> >
> > commit 0ecfebd2b52404ae0c54a878c872bb93363ada36
> > Author: Linus Torvalds <torvalds@linux-foundation.org>
> > Date: Sun Jul 7 22:41:56 2019 +0000
> >
> > Linux 5.2
> >
> > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=118810bfa00000
> > start commit: 13bf6d6a Add linux-next specific files for 20190725
> > git tree: linux-next
> > kernel config: https://syzkaller.appspot.com/x/.config?x=8ae987d803395886
> > dashboard link: https://syzkaller.appspot.com/bug?extid=36e93b425cd6eb54fcc1
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15112f3fa00000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=131ab578600000
> >
> > Reported-by: syzbot+36e93b425cd6eb54fcc1@syzkaller.appspotmail.com
> > Fixes: 0ecfebd2b524 ("Linux 5.2")
> >
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -787,7 +787,6 @@ static void vhost_setup_uaddr(struct vho
> size_t size, bool write)
> {
> struct vhost_uaddr *addr = &vq->uaddrs[index];
> - spin_lock(&vq->mmu_lock);
>
> addr->uaddr = uaddr;
> addr->size = size;
> @@ -797,7 +796,10 @@ static void vhost_setup_uaddr(struct vho
> static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
> {
> spin_lock(&vq->mmu_lock);
> -
> + /*
> + * deadlock if managing to take mmu_lock again while
> + * setting up uaddr
> + */
> vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
> (unsigned long)vq->desc,
> vhost_get_desc_size(vq, vq->num),
> --
Thanks!
I reverted this whole commit.
--
MST
^ permalink raw reply
* Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Gal Pressman @ 2019-07-28 8:45 UTC (permalink / raw)
To: Jason Gunthorpe, Michal Kalderon
Cc: Kamal Heib, Ariel Elior, dledford@redhat.com,
linux-rdma@vger.kernel.org, davem@davemloft.net,
netdev@vger.kernel.org
In-Reply-To: <20190726132316.GA8695@ziepe.ca>
On 26/07/2019 16:23, Jason Gunthorpe wrote:
> On Fri, Jul 26, 2019 at 08:42:07AM +0000, Michal Kalderon wrote:
>
>>>> But we don't free entries from the xa_array (only when ucontext is
>>>> destroyed) so how will there be an empty element after we wrap?
>>>
>>> Oh!
>>>
>>> That should be fixed up too, in the general case if a user is
>>> creating/destroying driver objects in loop we don't want memory usage to
>>> be unbounded.
>>>
>>> The rdma_user_mmap stuff has VMA ops that can refcount the xa entry and
>>> now that this is core code it is easy enough to harmonize the two things and
>>> track the xa side from the struct rdma_umap_priv
>>>
>>> The question is, does EFA or qedr have a use model for this that allows a
>>> userspace verb to create/destroy in a loop? ie do we need to fix this right
>>> now?
>
>> The mapping occurs for every qp and cq creation. So yes.
>>
>> So do you mean add a ref-cnt to the xarray entry and from umap
>> decrease the refcnt and free?
>
> Yes, free the entry (release the HW resource) and release the xa_array
> ID.
This is a bit tricky for EFA.
The UAR BAR resources (LLQ for example) aren't cleaned up until the UAR is
deallocated, so many of the entries won't really be freed when the refcount
reaches zero (i.e., the HW considers these entries as refcounted as long as the
UAR exists). The best we can do is free the DMA buffers for appropriate entries.
^ permalink raw reply
* [PATCH net-next] r8169: make use of xmit_more
From: Heiner Kallweit @ 2019-07-28 9:25 UTC (permalink / raw)
To: Realtek linux nic maintainers, David Miller
Cc: netdev@vger.kernel.org, Sander Eikelenboom, Eric Dumazet
There was a previous attempt to use xmit_more, but the change had to be
reverted because under load sometimes a transmit timeout occurred [0].
Maybe this was caused by a missing memory barrier. The new attempt
keeps the memory barrier before the call to netif_stop_queue, as the
driver does today. It also changes the order of some calls as
suggested by Eric.
[0] https://lkml.org/lkml/2019/2/10/39
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
drivers/net/ethernet/realtek/r8169_main.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 864ca529d..d9261e68f 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -5637,6 +5637,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
struct device *d = tp_to_dev(tp);
dma_addr_t mapping;
u32 opts[2], len;
+ bool stop_queue;
+ bool door_bell;
int frags;
if (unlikely(!rtl_tx_slots_avail(tp, skb_shinfo(skb)->nr_frags))) {
@@ -5680,13 +5682,13 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
txd->opts2 = cpu_to_le32(opts[1]);
- netdev_sent_queue(dev, skb->len);
-
skb_tx_timestamp(skb);
/* Force memory writes to complete before releasing descriptor */
dma_wmb();
+ door_bell = __netdev_sent_queue(dev, skb->len, netdev_xmit_more());
+
txd->opts1 = rtl8169_get_txd_opts1(opts[0], len, entry);
/* Force all memory writes to complete before notifying device */
@@ -5694,14 +5696,19 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
tp->cur_tx += frags + 1;
- RTL_W8(tp, TxPoll, NPQ);
-
- if (!rtl_tx_slots_avail(tp, MAX_SKB_FRAGS)) {
+ stop_queue = !rtl_tx_slots_avail(tp, MAX_SKB_FRAGS);
+ if (unlikely(stop_queue)) {
/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
* not miss a ring update when it notices a stopped queue.
*/
smp_wmb();
netif_stop_queue(dev);
+ }
+
+ if (door_bell)
+ RTL_W8(tp, TxPoll, NPQ);
+
+ if (unlikely(stop_queue)) {
/* Sync with rtl_tx:
* - publish queue status and cur_tx ring index (write barrier)
* - refresh dirty_tx ring index (read barrier).
--
2.22.0
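
For readers following the doorbell logic: below is a rough user-space model
of the decision the patch relies on, not driver code. It assumes the helper
asks for a doorbell whenever no further packets are announced, or whenever
the queue is already stopped. The names need_doorbell() and
count_doorbells() are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: ring the doorbell unless more packets are
 * announced, but always ring if the queue is already stopped.
 */
bool need_doorbell(bool xmit_more, bool queue_stopped)
{
	return !xmit_more || queue_stopped;
}

/* Count register writes for a burst, assuming the queue never stops. */
size_t count_doorbells(const bool *xmit_more, size_t n)
{
	size_t i, rings = 0;

	for (i = 0; i < n; i++)
		if (need_doorbell(xmit_more[i], false))
			rings++;
	return rings;
}
```

With a burst of four packets where only the last one has xmit_more cleared,
this model performs one TxPoll write instead of four.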
^ permalink raw reply related
* Re: [PATCH v6 rdma-next 1/6] RDMA/core: Create mmap database and cookie helper functions
From: Kamal Heib @ 2019-07-28 9:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Michal Kalderon, ariel.elior, dledford, galpress, linux-rdma,
davem, netdev
In-Reply-To: <20190725175540.GA18757@ziepe.ca>
On Thu, Jul 25, 2019 at 02:55:40PM -0300, Jason Gunthorpe wrote:
> On Tue, Jul 09, 2019 at 05:17:30PM +0300, Michal Kalderon wrote:
> > Create some common API's for adding entries to a xa_mmap.
> > Searching for an entry and freeing one.
> >
> > The code was copied from the efa driver almost as is, just renamed
> > function to be generic and not efa specific.
> >
> > Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
> > Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
> > drivers/infiniband/core/device.c | 1 +
> > drivers/infiniband/core/rdma_core.c | 1 +
> > drivers/infiniband/core/uverbs_cmd.c | 1 +
> > drivers/infiniband/core/uverbs_main.c | 135 ++++++++++++++++++++++++++++++++++
> > include/rdma/ib_verbs.h | 46 ++++++++++++
> > 5 files changed, 184 insertions(+)
> >
> > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > index 8a6ccb936dfe..a830c2c5d691 100644
> > +++ b/drivers/infiniband/core/device.c
> > @@ -2521,6 +2521,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
> > SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
> > SET_DEVICE_OP(dev_ops, map_phys_fmr);
> > SET_DEVICE_OP(dev_ops, mmap);
> > + SET_DEVICE_OP(dev_ops, mmap_free);
> > SET_DEVICE_OP(dev_ops, modify_ah);
> > SET_DEVICE_OP(dev_ops, modify_cq);
> > SET_DEVICE_OP(dev_ops, modify_device);
> > diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c
> > index ccf4d069c25c..1ed01b02401f 100644
> > +++ b/drivers/infiniband/core/rdma_core.c
> > @@ -816,6 +816,7 @@ static void ufile_destroy_ucontext(struct ib_uverbs_file *ufile,
> >
> > rdma_restrack_del(&ucontext->res);
> >
> > + rdma_user_mmap_entries_remove_free(ucontext);
> > ib_dev->ops.dealloc_ucontext(ucontext);
> > kfree(ucontext);
> >
> > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> > index 7ddd0e5bc6b3..44c0600245e4 100644
> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > @@ -254,6 +254,7 @@ static int ib_uverbs_get_context(struct uverbs_attr_bundle *attrs)
> >
> > mutex_init(&ucontext->per_mm_list_lock);
> > INIT_LIST_HEAD(&ucontext->per_mm_list);
> > + xa_init(&ucontext->mmap_xa);
> >
> > ret = get_unused_fd_flags(O_CLOEXEC);
> > if (ret < 0)
> > diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
> > index 11c13c1381cf..4b909d7b97de 100644
> > +++ b/drivers/infiniband/core/uverbs_main.c
> > @@ -965,6 +965,141 @@ int rdma_user_mmap_io(struct ib_ucontext *ucontext, struct vm_area_struct *vma,
> > }
> > EXPORT_SYMBOL(rdma_user_mmap_io);
> >
> > +static inline u64
> > +rdma_user_mmap_get_key(const struct rdma_user_mmap_entry *entry)
> > +{
> > + return (u64)entry->mmap_page << PAGE_SHIFT;
> > +}
> > +
> > +/**
> > + * rdma_user_mmap_entry_get() - Get an entry from the mmap_xa.
> > + *
> > + * @ucontext: associated user context.
> > + * @key: The key received from rdma_user_mmap_entry_insert which
> > + * is provided by user as the address to map.
> > + * @len: The length the user wants to map
> > + *
> > + * This function is called when a user tries to mmap a key it
> > + * initially received from the driver. The key was created by
> > + * the function rdma_user_mmap_entry_insert.
> > + *
> > + * Return an entry if exists or NULL if there is no match.
> > + */
> > +struct rdma_user_mmap_entry *
> > +rdma_user_mmap_entry_get(struct ib_ucontext *ucontext, u64 key, u64 len)
> > +{
> > + struct rdma_user_mmap_entry *entry;
> > + u64 mmap_page;
> > +
> > + mmap_page = key >> PAGE_SHIFT;
> > + if (mmap_page > U32_MAX)
> > + return NULL;
> > +
> > + entry = xa_load(&ucontext->mmap_xa, mmap_page);
> > + if (!entry || entry->length != len)
> > + return NULL;
> > +
> > + ibdev_dbg(ucontext->device,
> > + "mmap: obj[0x%p] key[%#llx] addr[%#llx] len[%#llx] removed\n",
> > + entry->obj, key, entry->address, entry->length);
> > +
> > + return entry;
> > +}
> > +EXPORT_SYMBOL(rdma_user_mmap_entry_get);
>
> It is a mistake we keep making, and maybe the war is hopelessly lost
> now, but functions called from a driver should not be part of the
> ib_uverbs module - ideally uverbs is an optional module. They should
> be in ib_core.
>
> Maybe put this in ib_core_uverbs.c ?
>
> Kamal, you've been tackling various cleanups, maybe making ib_uverbs
> unloadable again is something you'd be keen on?
>
Yes. Could you please give some background on that?
> > +/**
> > + * rdma_user_mmap_entry_insert() - Allocate and insert an entry to the mmap_xa.
> > + *
> > + * @ucontext: associated user context.
> > + * @obj: opaque driver object that will be stored in the entry.
> > + * @address: The address that will be mmapped to the user
> > + * @length: Length of the address that will be mmapped
> > + * @mmap_flag: opaque driver flags related to the address (For
> > + * example could be used for cachability)
> > + *
> > + * This function should be called by drivers that use the rdma_user_mmap
> > + * interface for handling user mmapped addresses. The database is handled in
> > + * the core and helper functions are provided to insert entries into the
> > + * database and extract entries when the user call mmap with the given key.
> > + * The function returns a unique key that should be provided to user, the user
> > + * will use the key to map the given address.
> > + *
> > + * Note this locking scheme cannot support removal of entries,
> > + * except during ucontext destruction when the core code
> > + * guarantees no concurrency.
> > + *
> > + * Return: unique key or RDMA_USER_MMAP_INVALID if entry was not added.
> > + */
> > +u64 rdma_user_mmap_entry_insert(struct ib_ucontext *ucontext, void *obj,
> > + u64 address, u64 length, u8 mmap_flag)
> > +{
> > + struct rdma_user_mmap_entry *entry;
> > + u32 next_mmap_page;
> > + int err;
> > +
> > + entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> > + if (!entry)
> > + return RDMA_USER_MMAP_INVALID;
> > +
> > + entry->obj = obj;
> > + entry->address = address;
> > + entry->length = length;
> > + entry->mmap_flag = mmap_flag;
> > +
> > + xa_lock(&ucontext->mmap_xa);
> > + if (check_add_overflow(ucontext->mmap_xa_page,
> > + (u32)(length >> PAGE_SHIFT),
>
> Should this be divide round up ?
>
> > + &next_mmap_page))
> > + goto err_unlock;
>
> I still don't like that this algorithm latches into a permanent
> failure when the xa_page wraps.
>
> It seems worth spending a bit more time here to tidy this.. Keep using
> the mmap_xa_page scheme, but instead do something like
>
> alloc_cyclic_range():
>
> while () {
> // Find first empty element in a cyclic way
> xa_page_first = mmap_xa_page;
> xa_find(xa, &xa_page_first, U32_MAX, XA_FREE_MARK)
>
> // Is there enough room for the range?
> if (check_add_overflow(xa_page_first, npages, &xa_page_end)) {
> mmap_xa_page = 0;
> continue;
> }
>
> // See if the element before intersects
> elm = xa_find(xa, &zero, xa_page_end, 0);
> if (elm && intersects(xa_page_first, xa_page_last, elm->first, elm->last)) {
> mmap_xa_page = elm->last + 1;
> continue
> }
>
> // xa_page_first -> xa_page_end should now be free
> xa_insert(xa, xa_page_start, entry);
> mmap_xa_page = xa_page_end + 1;
> return xa_page_start;
> }
>
> Approximately, please check it.
>
> > @@ -2199,6 +2201,17 @@ struct iw_cm_conn_param;
> >
> > #define DECLARE_RDMA_OBJ_SIZE(ib_struct) size_t size_##ib_struct
> >
> > +#define RDMA_USER_MMAP_FLAG_SHIFT 56
> > +#define RDMA_USER_MMAP_PAGE_MASK GENMASK(RDMA_USER_MMAP_FLAG_SHIFT - 1, 0)
> > +#define RDMA_USER_MMAP_INVALID U64_MAX
> > +struct rdma_user_mmap_entry {
> > + void *obj;
> > + u64 address;
> > + u64 length;
> > + u32 mmap_page;
> > + u8 mmap_flag;
> > +};
> > +
> > /**
> > * struct ib_device_ops - InfiniBand device operations
> > * This structure defines all the InfiniBand device operations, providers will
> > @@ -2311,6 +2324,19 @@ struct ib_device_ops {
> > struct ib_udata *udata);
> > void (*dealloc_ucontext)(struct ib_ucontext *context);
> > int (*mmap)(struct ib_ucontext *context, struct vm_area_struct *vma);
> > + /**
> > + * Memory that is mapped to the user can only be freed once the
> > + * ucontext of the application is destroyed. This is for
> > + * security reasons where we don't want an application to have a
> > + * mapping to physical memory that is freed and allocated to
> > + * another application. For this reason, all the entries are
> > + * stored in ucontext and once ucontext is freed mmap_free is
> > + * called on each of the entries. They type of the memory that
>
> They -> the
>
> > + * was mapped may differ between entries and is opaque to the
> > + * rdma_user_mmap interface. Therefore needs to be implemented
> > + * by the driver in mmap_free.
> > + */
> > + void (*mmap_free)(struct rdma_user_mmap_entry *entry);
> > void (*disassociate_ucontext)(struct ib_ucontext *ibcontext);
> > int (*alloc_pd)(struct ib_pd *pd, struct ib_udata *udata);
> > void (*dealloc_pd)(struct ib_pd *pd, struct ib_udata *udata);
> > @@ -2709,6 +2735,11 @@ void ib_set_device_ops(struct ib_device *device,
> > #if IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS)
> > int rdma_user_mmap_io(struct ib_ucontext *ucontext, struct vm_area_struct *vma,
> > unsigned long pfn, unsigned long size, pgprot_t prot);
> > +u64 rdma_user_mmap_entry_insert(struct ib_ucontext *ucontext, void *obj,
> > + u64 address, u64 length, u8 mmap_flag);
> > +struct rdma_user_mmap_entry *
> > +rdma_user_mmap_entry_get(struct ib_ucontext *ucontext, u64 key, u64 len);
> > +void rdma_user_mmap_entries_remove_free(struct ib_ucontext
> > *ucontext);
>
> Should remove_free be in the core-priv header?
>
> Jason
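
To make the two review points above concrete (rounding the length up to
whole pages, and not latching into permanent failure once the cursor wraps),
here is a small user-space sketch. It is only a model under stated
assumptions: a flat array stands in for the xarray, PAGE_SHIFT is taken as
12, and SPACE_PAGES, struct mmap_space, length_to_pages() and free_range()
are hypothetical names, not part of the patch.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT  12   /* assumption: 4 KiB pages */
#define SPACE_PAGES 16u  /* hypothetical, tiny key space for the demo */

struct mmap_space {
	uint8_t used[SPACE_PAGES]; /* 1 = page index is taken */
	uint32_t next;             /* cursor, playing the role of mmap_xa_page */
};

/* Pages needed to cover length bytes, rounded up (0 on overflow). */
uint64_t length_to_pages(uint64_t length)
{
	uint64_t rounded = length + (1ull << PAGE_SHIFT) - 1;

	if (rounded < length)
		return 0;
	return rounded >> PAGE_SHIFT;
}

/* Return the first page index of a free range, or -1 if none fits. */
int64_t alloc_cyclic_range(struct mmap_space *s, uint64_t length)
{
	uint64_t npages = length_to_pages(length);
	uint32_t k, i, start;

	if (npages == 0 || npages > SPACE_PAGES)
		return -1;

	/* Try every start position once, beginning at the cursor. */
	for (k = 0; k < SPACE_PAGES; k++) {
		start = (s->next + k) % SPACE_PAGES;
		if (start + npages > SPACE_PAGES)
			continue; /* range would run past the end */
		for (i = 0; i < npages; i++)
			if (s->used[start + i])
				break;
		if (i < npages)
			continue; /* intersects an existing entry */
		memset(&s->used[start], 1, npages);
		s->next = (uint32_t)((start + npages) % SPACE_PAGES);
		return start;
	}
	return -1;
}

/* Release a range previously returned by alloc_cyclic_range(). */
void free_range(struct mmap_space *s, uint32_t start, uint64_t length)
{
	memset(&s->used[start], 0, length_to_pages(length));
}
```

The linear scan is chosen for readability. A real implementation would
presumably use xa_find() as in the outline above and hold the xa lock.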
^ permalink raw reply
* RE: [PATCH] net/mlx5e: Fix zero table prio set by user.
From: Paul Blakey @ 2019-07-28 10:04 UTC (permalink / raw)
To: Marcelo Ricardo Leitner, wenxu
Cc: Or Gerlitz, Saeed Mahameed, Roi Dayan, Mark Bloch,
pablo@netfilter.org, netdev@vger.kernel.org
In-Reply-To: <20190726140142.GC4063@localhost.localdomain>
On 7/26/2019 5:01 PM, Marcelo Ricardo Leitner wrote:
> On Fri, Jul 26, 2019 at 08:39:43PM +0800, wenxu wrote:
>>
>> On 2019/7/26 20:19, Or Gerlitz wrote:
>>> On Fri, Jul 26, 2019 at 12:24 AM Saeed Mahameed <saeedm@mellanox.com> wrote:
>>>> On Thu, 2019-07-25 at 19:24 +0800, wenxu@ucloud.cn wrote:
>>>>> From: wenxu <wenxu@ucloud.cn>
>>>>>
>>>>> The flow_cls_common_offload prio is zero
>>>>>
>>>>> This leads to an invalid table prio in hw.
>>>>>
>>>>> Error: Could not process rule: Invalid argument
>>>>>
>>>>> kernel log:
>>>>> mlx5_core 0000:81:00.0: E-Switch: Failed to create FDB Table err -22
>>>>> (table prio: 65535, level: 0, size: 4194304)
>>>>>
>>>>> table_prio = (chain * FDB_MAX_PRIO) + prio - 1;
>>>>> should check (chain * FDB_MAX_PRIO) + prio is not 0
>>>>>
>>>>> Signed-off-by: wenxu <wenxu@ucloud.cn>
>>>>> ---
>>>>> drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 4 +++-
>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git
>>>>> a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>>>>> b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>>>>> index 089ae4d..64ca90f 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
>>>>> @@ -970,7 +970,9 @@ static int esw_add_fdb_miss_rule(struct
>>>> this piece of code isn't in this function, weird how it got to the
>>>> diff, patch applies correctly though !
>>>>
>>>>> mlx5_eswitch *esw)
>>>>> flags |= (MLX5_FLOW_TABLE_TUNNEL_EN_REFORMAT |
>>>>> MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
>>>>>
>>>>> - table_prio = (chain * FDB_MAX_PRIO) + prio - 1;
>>>>> + table_prio = (chain * FDB_MAX_PRIO) + prio;
>>>>> + if (table_prio)
>>>>> + table_prio = table_prio - 1;
>>>>>
>>>> This is black magic, even before this fix.
>>>> this -1 seems to be needed in order to call
>>>> create_next_size_table(table_prio) with the previous "table prio" ?
>>>> (table_prio - 1) ?
>>>>
>>>> The whole thing looks wrong to me since when prio is 0 and chain is 0,
>>>> there is no such thing as table_prio - 1.
>>>>
>>>> mlnx eswitch guys in the cc, please advise.
>>> basically, prio 0 is not something we ever get in the driver, since if
>>> user space
>>> specifies 0, the kernel generates some random non-zero prio, and we support
>>> only prios 1-16 -- Wenxu -- what do you run to get this error?
>>>
>>>
>> I run offload with nftables (but not tc); there is no prio for each rule.
>>
>> The prio of flow_cls_common_offload is initialized to 0.
>>
>> static void nft_flow_offload_common_init(struct flow_cls_common_offload *common,
>>
>> __be16 proto,
>> struct netlink_ext_ack *extack)
>> {
>> common->protocol = proto;
>> common->extack = extack;
>> }
>>
>>
>> flow_cls_common_offload
>
> Note that on
> [PATCH net-next] netfilter: nf_table_offload: Fix zero prio of flow_cls_common_offload
> I asked Pablo about how nftables should behave in this situation.
>
> It's the same issue as in the patch above but being fixed at a
> different level.
That's better. The original code relied on prio 0 not being valid, so the
suggested fix (net/mlx5e: Fix zero table prio set by user) maps NFT offload
prio 0 and tc prio 1 to the same hardware table. This is wrong and can cause
issues.
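
A quick stand-alone illustration of both effects, as a sketch only: it
assumes FDB_MAX_PRIO is 16 (matching the "prios 1-16" remark) and that the
result ends up in a 16-bit value, which is inferred from the "table prio:
65535" log line. The function names are made up for this example.

```c
#include <stdint.h>

#define FDB_MAX_PRIO 16 /* assumption, see the note above */

/* Original computation: chain 0 / prio 0 underflows. */
uint16_t table_prio_original(uint32_t chain, uint32_t prio)
{
	return (uint16_t)((chain * FDB_MAX_PRIO) + prio - 1);
}

/* Computation from the proposed fix: subtract only when non-zero. */
uint16_t table_prio_patched(uint32_t chain, uint32_t prio)
{
	uint32_t table_prio = (chain * FDB_MAX_PRIO) + prio;

	if (table_prio)
		table_prio = table_prio - 1;
	return (uint16_t)table_prio;
}
```

With chain 0, the original code yields 65535 for prio 0, while the patched
code yields 0 for both prio 0 and prio 1, which is the collision described
above.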
^ permalink raw reply
* Re: next-20190723: bpf/seccomp - systemd/journald issue?
From: Sedat Dilek @ 2019-07-28 11:09 UTC (permalink / raw)
To: Yonghong Song
Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
Martin Lau, Song Liu, netdev@vger.kernel.org, bpf@vger.kernel.org,
Clang-Built-Linux ML, Kees Cook, Nick Desaulniers,
Nathan Chancellor
In-Reply-To: <934a2a0a-c3fb-fd75-b8a3-c1042d73ca0c@fb.com>
On Sat, Jul 27, 2019 at 7:08 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/27/19 12:36 AM, Sedat Dilek wrote:
> > On Sat, Jul 27, 2019 at 4:24 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> On Fri, Jul 26, 2019 at 2:19 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>
> >>> On Fri, Jul 26, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/26/19 2:02 PM, Sedat Dilek wrote:
> >>>>> On Fri, Jul 26, 2019 at 10:38 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Yonghong Song,
> >>>>>>
> >>>>>> On Fri, Jul 26, 2019 at 5:45 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 7/26/19 1:26 AM, Sedat Dilek wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I have opened a new issue in the ClangBuiltLinux issue tracker.
> >>>>>>>
> >>>>>>> Glad to know clang 9 has asm goto support and now it can compile
> >>>>>>> kernel again.
> >>>>>>>
> >>>>>>
> >>>>>> Yupp.
> >>>>>>
> >>>>>>>>
> >>>>>>>> I am seeing a problem in the area bpf/seccomp causing
> >>>>>>>> systemd/journald/udevd services to fail.
> >>>>>>>>
> >>>>>>>> [Fri Jul 26 08:08:43 2019] systemd[453]: systemd-udevd.service: Failed
> >>>>>>>> to connect stdout to the journal socket, ignoring: Connection refused
> >>>>>>>>
> >>>>>>>> This happens when I use the (LLVM) LLD ld.lld-9 linker but not with
> >>>>>>>> BFD linker ld.bfd on Debian/buster AMD64.
> >>>>>>>> In both cases I use clang-9 (prerelease).
> >>>>>>>
> >>>>>>> Looks like it is a lld bug.
> >>>>>>>
> >>>>>>> I see the stack trace has __bpf_prog_run32() which is used by
> >>>>>>> kernel bpf interpreter. Could you try to enable bpf jit
> >>>>>>> sysctl net.core.bpf_jit_enable = 1
> >>>>>>> If this passed, it will prove it is interpreter related.
> >>>>>>>
> >>>>>>
> >>>>>> After...
> >>>>>>
> >>>>>> sysctl -w net.core.bpf_jit_enable=1
> >>>>>>
> >>>>>> I can start all failed systemd services.
> >>>>>>
> >>>>>> systemd-journald.service
> >>>>>> systemd-udevd.service
> >>>>>> haveged.service
> >>>>>>
> >>>>>> This is in maintenance mode.
> >>>>>>
> >>>>>> What is next: Do I set a permanent sysctl setting for net.core.bpf_jit_enable?
> >>>>>>
> >>>>>
> >>>>> This is what I did:
> >>>>
> >>>> I probably won't have cycles to debug this potential lld issue.
> >>>> Maybe you already did, I suggest you put enough reproducible
> >>>> details in the bug you filed against lld so they can take a look.
> >>>>
> >>>
> >>> I understand and will put the journalctl-log into the CBL issue
> >>> tracker and update the information.
> >>>
> >>> Thanks for your help understanding the BPF correlations.
> >>>
> >>> Is setting 'net.core.bpf_jit_enable = 2' helpful here?
> >>
> >> jit_enable=1 is enough.
> >> Or use CONFIG_BPF_JIT_ALWAYS_ON to workaround.
> >>
> >> It sounds like clang miscompiles interpreter.
> >> modprobe test_bpf
> >> should be able to point out which part of interpreter is broken.
> >
> > Maybe we need something like...
> >
> > "bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()"
> >
> > ...for clang?
>
> Not sure how you reached the conclusion that gcse is causing the problem.
> But anyway, adding such flag in the kernel is not a good idea.
> clang/llvm should be fixed instead. Esp. there is still time
> for 9.0.0 release to fix bugs.
>
To clarify: This is a snapshot release of clang-9 built with tc-build.
Building with -O0 is not possible as I see asm-goto failing.
- Sedat -
[1] https://github.com/ClangBuiltLinux/tc-build
> >
> > - Sedat -
> >
> > [1] https://git.kernel.org/linus/3193c0836f203a91bef96d88c64cccf0be090d9c
> >
^ permalink raw reply
* ip route JSON format is unparseable for "unreachable" routes
From: Michael Ziegler @ 2019-07-28 11:09 UTC (permalink / raw)
To: netdev
Hi,
I created a couple of "unreachable" routes on one of my systems, like so:
> ip route add unreachable 10.0.0.0/8 metric 255
> ip route add unreachable 192.168.0.0/16 metric 255
Unfortunately this results in unparseable JSON output from "ip":
> # ip -j route show | jq .
> parse error: Objects must consist of key:value pairs at line 1, column 84
The offending JSON objects are these:
> {"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}
> {"unreachable","dst":"192.168.0.0/16","metric":255,"flags":[]}
"unreachable" cannot appear on its own here, it needs to be some kind of
field.
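The parse failure can be reproduced without jq. What follows is only a minimal sketch using Python's standard json module; the "type" key in the second object is an assumption about what a well-formed object could look like, not necessarily what a fixed iproute2 prints:

```python
import json

# Object exactly as emitted by "ip -j route show" in this report.
broken = '{"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}'

# Hypothetical well-formed variant: the route type becomes the value of a key.
# The key name "type" is an assumption made for illustration only.
fixed = '{"type":"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}'


def parses(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(broken))  # the bare string inside the object is rejected
print(parses(fixed))
```

Any strict JSON parser should behave the same way, since every member of an object has to be a key:value pair.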
The manpage says to report here, thus I do :) I've searched the
archives, but I wasn't able to find any existing bug reports about this.
I'm running version
> ip utility, iproute2-ss190107
on Debian Buster.
Regards,
Michael.
^ permalink raw reply
* Re: next-20190723: bpf/seccomp - systemd/journald issue?
From: Sedat Dilek @ 2019-07-28 11:16 UTC (permalink / raw)
To: Yonghong Song
Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
Martin Lau, Song Liu, netdev@vger.kernel.org, bpf@vger.kernel.org,
Clang-Built-Linux ML, Kees Cook, Nick Desaulniers,
Nathan Chancellor
In-Reply-To: <57169960-35c2-d9d3-94e4-3b5a43d5aca7@fb.com>
On Sat, Jul 27, 2019 at 7:11 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/27/19 1:16 AM, Sedat Dilek wrote:
> > On Sat, Jul 27, 2019 at 9:36 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>
> >> On Sat, Jul 27, 2019 at 4:24 AM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Fri, Jul 26, 2019 at 2:19 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>>
> >>>> On Fri, Jul 26, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 7/26/19 2:02 PM, Sedat Dilek wrote:
> >>>>>> On Fri, Jul 26, 2019 at 10:38 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Yonghong Song,
> >>>>>>>
> >>>>>>> On Fri, Jul 26, 2019 at 5:45 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 7/26/19 1:26 AM, Sedat Dilek wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I have opened a new issue in the ClangBuiltLinux issue tracker.
> >>>>>>>>
> >>>>>>>> Glad to know clang 9 has asm goto support and now It can compile
> >>>>>>>> kernel again.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yupp.
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> I am seeing a problem in the area bpf/seccomp causing
> >>>>>>>>> systemd/journald/udevd services to fail.
> >>>>>>>>>
> >>>>>>>>> [Fri Jul 26 08:08:43 2019] systemd[453]: systemd-udevd.service: Failed
> >>>>>>>>> to connect stdout to the journal socket, ignoring: Connection refused
> >>>>>>>>>
> >>>>>>>>> This happens when I use the (LLVM) LLD ld.lld-9 linker but not with
> >>>>>>>>> BFD linker ld.bfd on Debian/buster AMD64.
> >>>>>>>>> In both cases I use clang-9 (prerelease).
> >>>>>>>>
> >>>>>>>> Looks like it is a lld bug.
> >>>>>>>>
> >>>>>>>> I see the stack trace has __bpf_prog_run32() which is used by
> >>>>>>>> kernel bpf interpreter. Could you try to enable bpf jit
> >>>>>>>> sysctl net.core.bpf_jit_enable = 1
> >>>>>>>> If this passed, it will prove it is interpreter related.
> >>>>>>>>
> >>>>>>>
> >>>>>>> After...
> >>>>>>>
> >>>>>>> sysctl -w net.core.bpf_jit_enable=1
> >>>>>>>
> >>>>>>> I can start all failed systemd services.
> >>>>>>>
> >>>>>>> systemd-journald.service
> >>>>>>> systemd-udevd.service
> >>>>>>> haveged.service
> >>>>>>>
> >>>>>>> This is in maintenance mode.
> >>>>>>>
> >>>>>>> What is next: Do set a permanent sysctl setting for net.core.bpf_jit_enable?
> >>>>>>>
> >>>>>>
> >>>>>> This is what I did:
> >>>>>
> >>>>> I probably won't have cycles to debug this potential lld issue.
> >>>>> Maybe you already did, I suggest you put enough reproducible
> >>>>> details in the bug you filed against lld so they can take a look.
> >>>>>
> >>>>
> >>>> I understand and will put the journalctl-log into the CBL issue
> >>>> tracker and update informations.
> >>>>
> >>>> Thanks for your help understanding the BPF correlations.
> >>>>
> >>>> Is setting 'net.core.bpf_jit_enable = 2' helpful here?
> >>>
> >>> jit_enable=1 is enough.
> >>> Or use CONFIG_BPF_JIT_ALWAYS_ON to workaround.
> >>>
> >>> It sounds like clang miscompiles interpreter.
> >
> > Just to clarify:
> > This does not happen with clang-9 + ld.bfd (GNU/ld linker).
> >
> >>> modprobe test_bpf
> >>> should be able to point out which part of interpreter is broken.
> >>
> >> Maybe we need something like...
> >>
> >> "bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()"
> >>
> >> ...for clang?
> >>
> >
> > Not sure if something like GCC's...
> >
> > -fgcse
> >
> > Perform a global common subexpression elimination pass. This pass also
> > performs global constant and copy propagation.
> >
> > Note: When compiling a program using computed gotos, a GCC extension,
> > you may get better run-time performance if you disable the global
> > common subexpression elimination pass by adding -fno-gcse to the
> > command line.
> >
> > Enabled at levels -O2, -O3, -Os.
> >
> > ...is available for clang.
> >
> > I tried, hoping to turn off "global common subexpression elimination":
> >
> > diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
> > index 383c87300b0d..92f934a1e9ff 100644
> > --- a/arch/x86/net/Makefile
> > +++ b/arch/x86/net/Makefile
> > @@ -3,6 +3,8 @@
> > # Arch-specific network modules
> > #
> >
> > +KBUILD_CFLAGS += -O0
>
> This won't work. First, you added to the wrong file. The interpreter
> is at kernel/bpf/core.c.
>
Thanks for the clarification.
I mixed up the x86 BPF JIT compiler with the BPF interpreter.
I see no diff in the disassembled kernel/bpf/core.o in my clang9-bfd
and clang9-lld build-dirs.
$ objdump -M intel -d linux.clang9-bfd/kernel/bpf/core.o > bpf_core_o_clang9-bfd.txt
$ objdump -M intel -d linux.clang9-lld/kernel/bpf/core.o > bpf_core_o_clang9-lld.txt
--- bpf_core_o_clang9-bfd.txt 2019-07-28 13:11:59.363552042 +0200
+++ bpf_core_o_clang9-lld.txt 2019-07-28 13:12:09.975535278 +0200
@@ -1,5 +1,5 @@
-linux.clang9-bfd/kernel/bpf/core.o: file format elf64-x86-64
+linux.clang9-lld/kernel/bpf/core.o: file format elf64-x86-64
Disassembly of section .text:
> Second, kernel may have compilation issues with -O0.
>
Confirmed.
- Sedat -
> > +
> > ifeq ($(CONFIG_X86_32),y)
> > obj-$(CONFIG_BPF_JIT) += bpf_jit_comp32.o
> > else
> >
> > Still see...
> > BROKEN: test_bpf: #294 BPF_MAXINSNS: Jump, gap, jump, ... jited:0
> >
> > - Sedat -
> >
^ permalink raw reply
* [patch net] net: fix ifindex collision during namespace removal
From: Jiri Pirko @ 2019-07-28 12:56 UTC (permalink / raw)
To: netdev
Cc: davem, xemul, edumazet, pabeni, idosch, petrm, sd, f.fainelli,
stephen, mlxsw, Jiri Pirko
From: Jiri Pirko <jiri@mellanox.com>
Commit aca51397d014 ("netns: Fix arbitrary net_device-s corruptions
on net_ns stop.") made it possible to hit a BUG when a device is
moved back to init_net and the following two conditions are met:
1) the dev->ifindex value is already used in the name of another
"dev%d" device in init_net.
2) dev->name is already used by another device in init_net.
Under real-life circumstances this is hard to hit, which is why the bug
has gone unnoticed for over 10 years. To reproduce:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
3: enp0s2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
$ ip netns add ns1
$ ip -n ns1 link add dummy1ns1 type dummy
$ ip -n ns1 link add dummy2ns1 type dummy
$ ip link set enp0s2 netns ns1
$ ip -n ns1 link set enp0s2 name dummy0
[ 100.858894] virtio_net virtio0 dummy0: renamed from enp0s2
$ ip link add dev4 type dummy
$ ip -n ns1 a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy1ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 16:63:4c:38:3e:ff brd ff:ff:ff:ff:ff:ff
3: dummy2ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether aa:9e:86:dd:6b:5d brd ff:ff:ff:ff:ff:ff
4: dummy0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
4: dev4: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 5a:e1:4a:b6:ec:f8 brd ff:ff:ff:ff:ff:ff
$ ip netns del ns1
[ 158.717795] default_device_exit: failed to move dummy0 to init_net: -17
[ 158.719316] ------------[ cut here ]------------
[ 158.720591] kernel BUG at net/core/dev.c:9824!
[ 158.722260] invalid opcode: 0000 [#1] SMP KASAN PTI
[ 158.723728] CPU: 0 PID: 56 Comm: kworker/u2:1 Not tainted 5.3.0-rc1+ #18
[ 158.725422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[ 158.727508] Workqueue: netns cleanup_net
[ 158.728915] RIP: 0010:default_device_exit.cold+0x1d/0x1f
[ 158.730683] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
[ 158.736854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
[ 158.738752] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
[ 158.741369] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
[ 158.743418] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
[ 158.745626] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
[ 158.748405] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
[ 158.750638] FS: 0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[ 158.752944] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 158.755245] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
[ 158.757654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 158.760012] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 158.762758] Call Trace:
[ 158.763882] ? dev_change_net_namespace+0xbb0/0xbb0
[ 158.766148] ? devlink_nl_cmd_set_doit+0x520/0x520
[ 158.768034] ? dev_change_net_namespace+0xbb0/0xbb0
[ 158.769870] ops_exit_list.isra.0+0xa8/0x150
[ 158.771544] cleanup_net+0x446/0x8f0
[ 158.772945] ? unregister_pernet_operations+0x4a0/0x4a0
[ 158.775294] process_one_work+0xa1a/0x1740
[ 158.776896] ? pwq_dec_nr_in_flight+0x310/0x310
[ 158.779143] ? do_raw_spin_lock+0x11b/0x280
[ 158.780848] worker_thread+0x9e/0x1060
[ 158.782500] ? process_one_work+0x1740/0x1740
[ 158.784454] kthread+0x31b/0x420
[ 158.786082] ? __kthread_create_on_node+0x3f0/0x3f0
[ 158.788286] ret_from_fork+0x3a/0x50
[ 158.789871] ---[ end trace defd6c657c71f936 ]---
[ 158.792273] RIP: 0010:default_device_exit.cold+0x1d/0x1f
[ 158.795478] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
[ 158.804854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
[ 158.807865] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
[ 158.811794] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
[ 158.816652] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
[ 158.820930] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
[ 158.825113] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
[ 158.829899] FS: 0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
[ 158.834923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 158.838164] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
[ 158.841917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 158.845149] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Fix this by checking whether a device with the fallback name already exists
in init_net and, if it does, falling back to the original code, which passes
"dev%d" so that a free name is allocated.
This was found using syzkaller.
Fixes: aca51397d014 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
net/core/dev.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index 2a3be2b279d3..1a24ba26b098 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9817,6 +9817,8 @@ static void __net_exit default_device_exit(struct net *net)
/* Push remaining network devices to init_net */
snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
+ if (__dev_get_by_name(&init_net, fb_name))
+ snprintf(fb_name, IFNAMSIZ, "dev%%d");
err = dev_change_net_namespace(dev, &init_net, fb_name);
if (err) {
pr_emerg("%s: failed to move %s to init_net: %d\n",
--
2.21.0
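The effect of the two added lines can be illustrated with a small userspace model. This is only a sketch: resolve_name() and fallback_name() are hypothetical helpers that approximate how a name is picked when a device moves into a namespace (keep the current name if free, expand a "%d" template to the first free index, otherwise fail with -EEXIST); they are not kernel functions.

```python
EEXIST = 17


def resolve_name(existing, current_name, pattern):
    """Model of the name choice when a device moves into a namespace.

    Returns the resulting name, or -EEXIST on a collision.
    """
    if current_name not in existing:
        return current_name          # current name is free, keep it
    if "%d" in pattern:
        n = 0                        # template: pick the first free index
        while pattern.replace("%d", str(n)) in existing:
            n += 1
        return pattern.replace("%d", str(n))
    return -EEXIST if pattern in existing else pattern


def fallback_name(existing, ifindex, patched):
    """Build the fallback name, with or without the proposed check."""
    fb_name = "dev%d" % ifindex
    if patched and fb_name in existing:
        fb_name = "dev%d"            # leave the template for the allocator
    return fb_name


init_net = {"lo", "dummy0", "dev4"}  # names present in the reproducer

before = resolve_name(init_net, "dummy0", fallback_name(init_net, 4, False))
after = resolve_name(init_net, "dummy0", fallback_name(init_net, 4, True))
print(before)  # -17, the error seen in the "failed to move dummy0" log line
print(after)   # dev0 in this model
```

Without the check, both the current name and "dev4" collide and the move fails; with it, the template is used and a free name is found.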
^ permalink raw reply related
* Re: [PATCH net] net: hns: fix LED configuration for marvell phy
From: Pavel Machek @ 2019-07-28 13:24 UTC (permalink / raw)
To: Andrew Lunn
Cc: liuyonglong, David Miller, netdev, linux-kernel, linuxarm,
salil.mehta, yisen.zhuang, shiju.jose
In-Reply-To: <20190725042829.GB14276@lunn.ch>
On Thu 2019-07-25 06:28:29, Andrew Lunn wrote:
> On Thu, Jul 25, 2019 at 11:00:08AM +0800, liuyonglong wrote:
> > > Revert "net: hns: fix LED configuration for marvell phy"
> > > This reverts commit f4e5f775db5a4631300dccd0de5eafb50a77c131.
> > >
> > > Andrew Lunn says this should be handled another way.
> > >
> > > Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> >
> > Hi Andrew:
> >
> > I see this patch have been reverted, can you tell me the better way to do this?
> > Thanks very much!
>
> Please take a look at the work Matthias Kaehlcke is doing. It has not
> got too far yet, but when it is complete, it should define a generic
> way to configure PHY LEDs.
I don't remember a PHY LED discussion on the LED mailing list. Would you have a pointer?
Would it make sense to coordinate with the LED subsystem?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply
* Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Eric Dumazet @ 2019-07-28 13:54 UTC (permalink / raw)
To: Josh Hunt; +Cc: netdev, David Miller
In-Reply-To: <a9ec9cfd-c381-c02e-7d67-e24373c693d6@akamai.com>
On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt <johunt@akamai.com> wrote:
>
> On 7/27/19 12:05 AM, Eric Dumazet wrote:
> > On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt <johunt@akamai.com> wrote:
> >>
> >> The current implementation of TCP MTU probing can considerably
> >> underestimate the MTU on lossy connections allowing the MSS to get down to
> >> 48. We have found that in almost all of these cases on our networks these
> >> paths can handle much larger MTUs meaning the connections are being
> >> artificially limited. Even though TCP MTU probing can raise the MSS back up
> >> we have seen this not to be the case causing connections to be "stuck" with
> >> an MSS of 48 when heavy loss is present.
> >>
> >> Prior to pushing out this change we could not keep TCP MTU probing enabled
> >> b/c of the above reasons. Now with a reasonable floor set we've had it
> >> enabled for the past 6 months.
> >
> > And what reasonable value have you used ???
>
> Reasonable for some may not be reasonable for others hence the new
> sysctl :) We're currently running with a fairly high value based off of
> the v6 min MTU minus headers and options, etc. We went conservative with
> our setting initially as it seemed a reasonable first step when
> re-enabling TCP MTU probing since with no configurable floor we saw a #
> of cases where connections were using severely reduced mss b/c of loss
> and not b/c of actual path restriction. I plan to reevaluate the setting
> at some point, but since the probing method is still the same it means
> the same clients who got stuck with mss of 48 before will land at
> whatever floor we set. Looking forward we are interested in trying to
> improve TCP MTU probing so it does not penalize clients like this.
>
> A suggestion for a more reasonable floor default would be 512, which is
> the same as the min_pmtu. Given both mechanisms are trying to achieve
> the same goal it seems like they should have a similar min/floor.
>
> >
> >>
> >> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
> >> administrators the ability to control the floor of MSS probing.
> >>
> >> Signed-off-by: Josh Hunt <johunt@akamai.com>
> >> ---
> >> Documentation/networking/ip-sysctl.txt | 6 ++++++
> >> include/net/netns/ipv4.h | 1 +
> >> net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
> >> net/ipv4/tcp_ipv4.c | 1 +
> >> net/ipv4/tcp_timer.c | 2 +-
> >> 5 files changed, 18 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> >> index df33674799b5..49e95f438ed7 100644
> >> --- a/Documentation/networking/ip-sysctl.txt
> >> +++ b/Documentation/networking/ip-sysctl.txt
> >> @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
> >> Path MTU discovery (MTU probing). If MTU probing is enabled,
> >> this is the initial MSS used by the connection.
> >>
> >> +tcp_mtu_probe_floor - INTEGER
> >> + If MTU probing is enabled this caps the minimum MSS used for search_low
> >> + for the connection.
> >> +
> >> + Default : 48
> >> +
> >> tcp_min_snd_mss - INTEGER
> >> TCP SYN and SYNACK messages usually advertise an ADVMSS option,
> >> as described in RFC 1122 and RFC 6691.
> >> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> >> index bc24a8ec1ce5..c0c0791b1912 100644
> >> --- a/include/net/netns/ipv4.h
> >> +++ b/include/net/netns/ipv4.h
> >> @@ -116,6 +116,7 @@ struct netns_ipv4 {
> >> int sysctl_tcp_l3mdev_accept;
> >> #endif
> >> int sysctl_tcp_mtu_probing;
> >> + int sysctl_tcp_mtu_probe_floor;
> >> int sysctl_tcp_base_mss;
> >> int sysctl_tcp_min_snd_mss;
> >> int sysctl_tcp_probe_threshold;
> >> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> >> index 0b980e841927..59ded25acd04 100644
> >> --- a/net/ipv4/sysctl_net_ipv4.c
> >> +++ b/net/ipv4/sysctl_net_ipv4.c
> >> @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
> >> .extra2 = &tcp_min_snd_mss_max,
> >> },
> >> {
> >> + .procname = "tcp_mtu_probe_floor",
> >> + .data = &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
> >> + .maxlen = sizeof(int),
> >> + .mode = 0644,
> >> + .proc_handler = proc_dointvec_minmax,
> >> + .extra1 = &tcp_min_snd_mss_min,
> >> + .extra2 = &tcp_min_snd_mss_max,
> >> + },
> >> + {
> >> .procname = "tcp_probe_threshold",
> >> .data = &init_net.ipv4.sysctl_tcp_probe_threshold,
> >> .maxlen = sizeof(int),
> >> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> >> index d57641cb3477..e0a372676329 100644
> >> --- a/net/ipv4/tcp_ipv4.c
> >> +++ b/net/ipv4/tcp_ipv4.c
> >> @@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
> >> net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
> >> net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
> >> net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
> >> + net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
> >>
> >> net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
> >> net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
> >> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> >> index c801cd37cc2a..dbd9d2d0ee63 100644
> >> --- a/net/ipv4/tcp_timer.c
> >> +++ b/net/ipv4/tcp_timer.c
> >> @@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
> >> } else {
> >> mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
> >> mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
> >> - mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
> >> + mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
> >> mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
> >> icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
> >> }
> >
> >
> > Existing sysctl should be enough ?
>
> I don't think so. Changing tcp_min_snd_mss could impact clients that
> really want/need a small mss. When you added the new sysctl I tried to
> analyze the mss values we're seeing to understand what we could possibly
> raise it to. While not a huge amount, we see more clients than I
> expected announcing mss values in the 180-512 range. Given that I would
> not feel comfortable setting tcp_min_snd_mss to say 512 as I suggested
> above.
If these clients need MSS values in the 180-512 range, how would MTU
probing work for them if you set the floor to 512?
Are we sure the intent of tcp_base_mss was not to act as a floor?
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a9c11f2dd4b9681137755e501a538..6d15895e9dcfb2eff51bbcf3608c7e68c1970a9e 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -153,7 +153,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
} else {
mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
- mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
+ mss = max(net->ipv4.sysctl_tcp_base_mss, mss);
mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
>
> >
> > tcp_min_snd_mss documentation could be slightly updated.
> >
> > And maybe its default value could be raised a bit.
> >
>
> Thanks
> Josh
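To compare the three behaviours discussed in this thread, here is a rough sketch of the clamp applied on each probing timeout. It is a simplification: it works on MSS values directly and skips the tcp_mtu_to_mss()/tcp_mss_to_mtu() conversions, and the starting MSS (1460), base MSS (1024), header length (32) and floor (512) are assumed values chosen for illustration, not measurements.

```python
def step(mss, base_mss, hdr_len, min_snd_mss, variant, probe_floor=48):
    """One halving step of search_low, expressed in MSS units."""
    mss >>= 1
    if variant == "base_as_floor":       # Eric's alternative: max() not min()
        mss = max(base_mss, mss)
    else:
        mss = min(base_mss, mss)
    if variant == "probe_floor":         # Josh's patch: configurable floor
        mss = max(mss, probe_floor)
    else:
        mss = max(mss, 68 - hdr_len)
    return max(mss, min_snd_mss)


def after_timeouts(variant, rounds=6, probe_floor=48):
    """Apply the clamp repeatedly, as on a path with heavy loss."""
    mss = 1460
    for _ in range(rounds):
        mss = step(mss, base_mss=1024, hdr_len=32, min_snd_mss=48,
                   variant=variant, probe_floor=probe_floor)
    return mss


print(after_timeouts("current"))                       # 48
print(after_timeouts("probe_floor", probe_floor=512))  # 512
print(after_timeouts("base_as_floor"))                 # 1024
```

Under these assumptions the existing code bottoms out at tcp_min_snd_mss, the new sysctl stops at the configured floor, and the max() variant never drops below tcp_base_mss.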
^ permalink raw reply
* [PATCH net-next] rt2800usb: Add new rt2800usb device PLANEX GW-USMicroN
From: Masanari Iida @ 2019-07-28 14:07 UTC (permalink / raw)
To: sgruszka, helmut.schaa, kvalo, davem, linux-wireless, netdev,
linux-kernel
Cc: Masanari Iida
This patch adds a device ID for PLANEX GW-USMicroN.
Without this patch, I had to echo the device IDs in order for
the driver to recognize the device.
# lsusb |grep PLANEX
Bus 002 Device 005: ID 2019:ed14 PLANEX GW-USMicroN
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
---
drivers/net/wireless/ralink/rt2x00/rt2800usb.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/wireless/ralink/rt2x00/rt2800usb.c b/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
index fdf0504b5f1d..0dfb55c69b73 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2800usb.c
@@ -1086,6 +1086,7 @@ static const struct usb_device_id rt2800usb_device_table[] = {
{ USB_DEVICE(0x0846, 0x9013) },
{ USB_DEVICE(0x0846, 0x9019) },
/* Planex */
+ { USB_DEVICE(0x2019, 0xed14) },
{ USB_DEVICE(0x2019, 0xed19) },
/* Ralink */
{ USB_DEVICE(0x148f, 0x3573) },
--
2.22.0.545.g9c9b961d7eb1
^ permalink raw reply related
* Re: [PATCH] gigaset: stop maintaining seperately
From: Tilman Schmidt @ 2019-07-28 14:17 UTC (permalink / raw)
To: Paul Bolle
Cc: David Miller, Hansjoerg Lipp, Arnd Bergmann, Karsten Keil, netdev,
linux-kernel
In-Reply-To: <20190726220541.28783-1-pebolle@tiscali.nl>
Thanks to you, Paul, for all your contributions, and specifically for
keeping the driver maintained for four more years after I had to abandon
it for the same reason.
I had a lot of fun working on that driver and I learned a lot in the
course. Now it's time to move on without regrets.
All the best,
Tilman
Am 27.07.2019 um 00:05 schrieb Paul Bolle:
> The Dutch consumer grade ISDN network will be shut down on September 1,
> 2019. This means I'll be converted to some sort of VOIP shortly. At that
> point it would be unwise to try to maintain the gigaset driver, even for
> odd fixes as I do. So I'll stop maintaining it as a separate driver and
> bump support to CAPI in staging. De facto this means the driver will be
> unmaintained, since no-one seems to be working on CAPI.
>
> I've lightly tested the hardware specific modules of this driver (bas-gigaset,
> ser-gigaset, and usb-gigaset) for v5.3-rc1. The basic functionality appears to
> be working. It's unclear whether anyone still cares. I'm aware of only one
> person sort of using the driver a few years ago.
>
> Thanks to Karsten Keil for the ISDN subsystems gigaset was using (I4L and
> CAPI). And many thanks to Hansjoerg Lipp and Tilman Schmidt for writing and
> upstreaming this driver.
>
> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
> ---
> MAINTAINERS | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 783569e3c4b4..e99afbd13355 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6822,13 +6822,6 @@ F: Documentation/filesystems/gfs2*.txt
> F: fs/gfs2/
> F: include/uapi/linux/gfs2_ondisk.h
>
> -GIGASET ISDN DRIVERS
> -M: Paul Bolle <pebolle@tiscali.nl>
> -L: gigaset307x-common@lists.sourceforge.net
> -W: http://gigaset307x.sourceforge.net/
> -S: Odd Fixes
> -F: drivers/staging/isdn/gigaset/
> -
> GNSS SUBSYSTEM
> M: Johan Hovold <johan@kernel.org>
> T: git git://git.kernel.org/pub/scm/linux/kernel/git/johan/gnss.git
>
^ permalink raw reply
* Re: memory leak in fdb_create
From: syzbot @ 2019-07-28 14:20 UTC (permalink / raw)
To: bridge, bsingharora, coreteam, davem, duwe, kaber, kadlec,
linux-kernel, mingo, mpe, netdev, netfilter-devel, nikolay, pablo,
roopa, rostedt, syzkaller-bugs
In-Reply-To: <0000000000005e6124058c0cbdbe@google.com>
syzbot has bisected this bug to:
commit 04cf31a759ef575f750a63777cee95500e410994
Author: Michael Ellerman <mpe@ellerman.id.au>
Date: Thu Mar 24 11:04:01 2016 +0000
ftrace: Make ftrace_location_range() global
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=1538c778600000
start commit: abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
git tree: upstream
final crash: https://syzkaller.appspot.com/x/report.txt?x=1738c778600000
console output: https://syzkaller.appspot.com/x/log.txt?x=1338c778600000
kernel config: https://syzkaller.appspot.com/x/.config?x=56f1da14935c3cce
dashboard link: https://syzkaller.appspot.com/bug?extid=88533dc8b582309bf3ee
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16de5c06a00000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=10546026a00000
Reported-by: syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com
Fixes: 04cf31a759ef ("ftrace: Make ftrace_location_range() global")
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
^ permalink raw reply
* Re: [PATCH net-next] mvpp2: document HW checksum behaviour
From: Matteo Croce @ 2019-07-28 14:30 UTC (permalink / raw)
To: Antoine Tenart, Marcin Wojtas, Stefan Chulski, Maxime Chevallier
Cc: netdev, LKML, David S . Miller
In-Reply-To: <CAGnkfhycOc8mvqeQDBcnXueUjrFQMC7hdfAOkxr5k0+xc_tnDw@mail.gmail.com>
On Sun, Jul 28, 2019 at 3:36 AM Matteo Croce <mcroce@redhat.com> wrote:
>
> On Fri, Jul 26, 2019 at 2:57 PM Antoine Tenart
> <antoine.tenart@bootlin.com> wrote:
> >
> > Hi Matteo,
> >
> > On Fri, Jul 26, 2019 at 01:15:46AM +0200, Matteo Croce wrote:
> > > The hardware can only offload checksum calculation on first port
> > > due to the Tx FIFO size limitation. Document this in a comment.
> > >
> > > Fixes: 576193f2d579 ("net: mvpp2: jumbo frames support")
> > > Signed-off-by: Matteo Croce <mcroce@redhat.com>
> >
> > Looks good. Please note there's a similar code path in the probe.
> > You could also add a comment there (or move this check/comment in a
> > common place).
> >
> > Thanks!
> > Antoine
> >
>
> Hi Antoine,
>
> I was making a v2, when I looked at the mvpp2_port_probe() which does:
>
> --------------------------------%<------------------------------
> features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> NETIF_F_TSO;
>
> if (port->pool_long->id == MVPP2_BM_JUMBO && port->id != 0) {
> dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
> dev->hw_features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
> }
>
> dev->vlan_features |= features;
> -------------------------------->%------------------------------
>
> Is it ok to remove NETIF_F_IP*_CSUM from dev->features and
> dev->hw_features but keep it in dev->vlan_features?
>
> Regards,
> --
> Matteo Croce
> per aspera ad upstream
Hi all,
Keeping the CSUM features in dev->vlan_features is probably safe, and it
avoids unnecessary calculation in some cases, but I have another question.
Does the PP2 hardware support checksumming at any offset? I
replaced 'NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM' with NETIF_F_HW_CSUM and
then stacked 5 VXLANs on top of a mvpp2 device, to have the last IP
header at offset 264:
ip link set $dev up
ip addr add 192.168.0.$last/24 dev $dev
for i in {1..5}; do
ip link add vx$i type vxlan id $i dstport 4789 remote 192.168.$((i-1)).$other
ip link set vx$i up
ip addr add 192.168.$i.$last/24 dev vx$i
done
00:51:82:11:22:00 > 3c:fd:fe:9c:60:6c, ethertype IPv4 (0x0800), length 348: 192.168.0.1.33625 > 192.168.0.2.4789: VXLAN, flags [I] (0x08), vni 1
02:25:60:da:87:03 > 92:20:05:45:3d:d3, ethertype IPv4 (0x0800), length 298: 192.168.1.1.33625 > 192.168.1.2.4789: VXLAN, flags [I] (0x08), vni 2
12:20:97:15:8f:aa > 66:08:23:c7:72:ea, ethertype IPv4 (0x0800), length 248: 192.168.2.1.33625 > 192.168.2.2.4789: VXLAN, flags [I] (0x08), vni 3
c6:1c:b9:fd:9d:28 > 22:ca:cb:6a:ea:68, ethertype IPv4 (0x0800), length 198: 192.168.3.1.33625 > 192.168.3.2.4789: VXLAN, flags [I] (0x08), vni 4
02:34:5f:45:a5:9d > d2:4e:d4:d7:42:31, ethertype IPv4 (0x0800), length 148: 192.168.4.1.34504 > 192.168.4.2.4789: VXLAN, flags [I] (0x08), vni 5
a2:99:fd:9c:1b:05 > 5a:81:3b:fc:6a:07, ethertype IPv4 (0x0800), length 98: 192.168.5.1 > 192.168.5.2: ICMP echo request, id 1654, seq 156, length 64
It seems that the HW is capable of doing it, can someone with a
datasheet confirm this?
Regards,
--
Matteo Croce
per aspera ad upstream
^ permalink raw reply
* RE: [EXT] Re: [PATCH net-next] mvpp2: document HW checksum behaviour
From: Stefan Chulski @ 2019-07-28 15:22 UTC (permalink / raw)
To: Matteo Croce, Antoine Tenart, Marcin Wojtas, Maxime Chevallier
Cc: netdev, LKML, David S . Miller
In-Reply-To: <CAGnkfhz+PezeLT+gyXdsnyJz2dnKpYkcb2HbqvXJoLdzNxuC6g@mail.gmail.com>
> Hi all,
>
> probably dev->vlan_features is safe to keep the CSUM features to avoid
> unnecessary calculation in some cases, but I have another question.
> Does the PP2 hardware support checksumming within any offset? I replaced
> 'NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM' with NETIF_F_HW_CSUM and
> then stacked 5 VxLANS on top of a mvpp2 device, to have the last IP header
> at offset 264:
>
> ip link set $dev up
> ip addr add 192.168.0.$last/24 dev $dev
>
> for i in {1..5}; do
> ip link add vx$i type vxlan id $i dstport 4789 remote 192.168.$((i-1)).$other
> ip link set vx$i up
> ip addr add 192.168.$i.$last/24 dev vx$i
> done
>
> 00:51:82:11:22:00 > 3c:fd:fe:9c:60:6c, ethertype IPv4 (0x0800), length 348:
> 192.168.0.1.33625 > 192.168.0.2.4789: VXLAN, flags [I] (0x08), vni 1
> 02:25:60:da:87:03 > 92:20:05:45:3d:d3, ethertype IPv4 (0x0800), length 298:
> 192.168.1.1.33625 > 192.168.1.2.4789: VXLAN, flags [I] (0x08), vni 2
> 12:20:97:15:8f:aa > 66:08:23:c7:72:ea, ethertype IPv4 (0x0800), length 248:
> 192.168.2.1.33625 > 192.168.2.2.4789: VXLAN, flags [I] (0x08), vni 3
> c6:1c:b9:fd:9d:28 > 22:ca:cb:6a:ea:68, ethertype IPv4 (0x0800), length 198:
> 192.168.3.1.33625 > 192.168.3.2.4789: VXLAN, flags [I] (0x08), vni 4
> 02:34:5f:45:a5:9d > d2:4e:d4:d7:42:31, ethertype IPv4 (0x0800), length 148:
> 192.168.4.1.34504 > 192.168.4.2.4789: VXLAN, flags [I] (0x08), vni 5
> a2:99:fd:9c:1b:05 > 5a:81:3b:fc:6a:07, ethertype IPv4 (0x0800), length 98:
> 192.168.5.1 > 192.168.5.2: ICMP echo request, id 1654, seq 156, length 64
>
> It seems that the HW is capable of doing it, can someone with a datasheet
> confirm this?
The L3_offset field in the TX descriptor is 7 bits wide, so the Layer 3 header must start at an offset below 128 bytes.
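As a rough check of that limit against the test quoted above, the sketch below computes where the innermost IP header lands for a given number of stacked VXLAN devices. It assumes untagged Ethernet, IPv4 without options, and the standard 8-byte UDP and VXLAN headers. The constant and function names are illustrative and are not taken from the mvpp2 driver.

```python
# Header sizes in bytes, assuming untagged Ethernet and IPv4 without options.
ETH_HLEN = 14
IPV4_HLEN = 20
UDP_HLEN = 8
VXLAN_HLEN = 8

# Overhead added by one VXLAN encapsulation (outer Ethernet + IP + UDP + VXLAN).
VXLAN_OVERHEAD = ETH_HLEN + IPV4_HLEN + UDP_HLEN + VXLAN_HLEN  # 50 bytes

# A 7-bit field can hold the values 0..127.
L3_OFFSET_BITS = 7
L3_OFFSET_MAX = (1 << L3_OFFSET_BITS) - 1


def innermost_l3_offset(vxlan_levels: int) -> int:
    """Offset of the innermost IP header from the start of the outermost frame."""
    return vxlan_levels * VXLAN_OVERHEAD + ETH_HLEN


def fits_l3_offset(offset: int) -> bool:
    """True if the offset is representable in a 7-bit L3_offset field."""
    return 0 <= offset <= L3_OFFSET_MAX


for levels in range(6):
    off = innermost_l3_offset(levels)
    print(f"{levels} VXLAN level(s): L3 offset {off:3d}, fits: {fits_l3_offset(off)}")
```

With five levels the innermost header sits at offset 264, which matches the figure in the quoted test and is well beyond what a 7-bit field can express. The 50-byte step also agrees with the frame lengths in the capture (348, 298, ..., 98). Under these assumptions, only up to two levels of VXLAN keep the innermost IP header below 128 bytes.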
Regards,
Stefan
* Re: ip route JSON format is unparseable for "unreachable" routes
From: Stephen Hemminger @ 2019-07-28 16:15 UTC (permalink / raw)
To: Michael Ziegler; +Cc: netdev
In-Reply-To: <6e88311b-5edc-4c62-1581-0f5b160a5f4e@michaelziegler.name>
On Sun, 28 Jul 2019 13:09:55 +0200
Michael Ziegler <ich@michaelziegler.name> wrote:
> Hi,
>
> I created a couple of "unreachable" routes on one of my systems, like so:
>
> > ip route add unreachable 10.0.0.0/8 metric 255
> > ip route add unreachable 192.168.0.0/16 metric 255
>
> Unfortunately this results in unparseable JSON output from "ip":
>
> > # ip -j route show | jq .
> > parse error: Objects must consist of key:value pairs at line 1, column 84
>
> The offending JSON objects are these:
>
> > {"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}
> > {"unreachable","dst":"192.168.0.0/16","metric":255,"flags":[]}
> "unreachable" cannot appear on its own here, it needs to be some kind of
> field.
>
> The manpage says to report here, thus I do :) I've searched the
> archives, but I wasn't able to find any existing bug reports about this.
> I'm running version
>
> > ip utility, iproute2-ss190107
>
> on Debian Buster.
>
> Regards,
> Michael.
Already fixed upstream by:
commit 073661773872709518d35d4d093f3a715281f21d
Author: Matteo Croce <mcroce@redhat.com>
Date: Mon Mar 18 18:19:29 2019 +0100
ip route: print route type in JSON output
ip route generates invalid JSON if the route type has to be printed,
e.g. when detailed mode is active, or the type is different than unicast:
$ ip -d -j -p route show
[ {"unicast",
"dst": "192.168.122.0/24",
"dev": "virbr0",
"protocol": "kernel",
"scope": "link",
"prefsrc": "192.168.122.1",
"flags": [ "linkdown" ]
} ]
$ ip -j -p route show
[ {"unreachable",
"dst": "192.168.23.0/24",
"flags": [ ]
},{"prohibit",
"dst": "192.168.24.0/24",
"flags": [ ]
},{"blackhole",
"dst": "192.168.25.0/24",
"flags": [ ]
} ]
Fix it by printing the route type as the "type" attribute:
$ ip -d -j -p route show
[ {
"type": "unicast",
"dst": "default",
"gateway": "192.168.85.1",
"dev": "wlp3s0",
"protocol": "dhcp",
"scope": "global",
"metric": 600,
"flags": [ ]
},{
"type": "unreachable",
"dst": "192.168.23.0/24",
"protocol": "boot",
"scope": "global",
"flags": [ ]
},{
"type": "prohibit",
"dst": "192.168.24.0/24",
"protocol": "boot",
"scope": "global",
"flags": [ ]
},{
"type": "blackhole",
"dst": "192.168.25.0/24",
"protocol": "boot",
"scope": "global",
"flags": [ ]
} ]
Fixes: 663c3cb23103 ("iproute: implement JSON and color output")
Acked-by: Phil Sutter <phil@nwl.cc>
Reviewed-and-tested-by: Andrea Claudi <aclaudi@redhat.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
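
The difference between the two output shapes can be checked with any JSON parser. Below is a minimal sketch in Python: the `broken` string is the object from the report, while the `fixed` string is constructed by hand by applying the change described in the commit message to that object. It is an assumption about the shape, not output captured from a fixed ip binary. The helper name `parses` is illustrative.

```python
import json

# Object shape printed by the affected iproute2 version (taken from the report).
broken = '{"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}'

# Same route with the route type emitted as a "type" attribute (hand-written).
fixed = '{"type":"unreachable","dst":"10.0.0.0/8","metric":255,"flags":[]}'


def parses(text: str) -> bool:
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(parses(broken))  # a bare string inside an object is not a key:value pair
print(parses(fixed))

# Once the type is a regular attribute, consumers can read it like any other field.
route = json.loads(fixed)
print(route["type"], route["dst"])
```

The first check fails for the same reason jq reports "Objects must consist of key:value pairs"; the second succeeds because every member of the object is a key:value pair.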