Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 net-next 2/2 tuntap: Increase the number of queues in tun.
From: Jason Wang @ 2014-12-05  7:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Pankaj Gupta, linux-kernel, netdev, davem, dgibson, vfalico,
	edumazet, vyasevic, hkchu, wuzhy, xemul, therbert, bhutchings,
	xii, stephen, jiri, sergei.shtylyov
In-Reply-To: <20141204102013.GC17122@redhat.com>


On 12/04/2014 06:20 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 04, 2014 at 03:03:34AM +0008, Jason Wang wrote:
>> > 
>> > 
>> > On Wed, Dec 3, 2014 at 5:52 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> > >On Wed, Dec 03, 2014 at 12:49:37PM +0530, Pankaj Gupta wrote:
>>>> > >> Networking under kvm works best if we allocate a per-vCPU RX and TX
>>>> > >> queue in a virtual NIC. This requires a per-vCPU queue on the host
>>>> > >>side.
>>>> > >> It is now safe to increase the maximum number of queues.
>>>> > >> Preceding patche: 'net: allow large number of rx queues'
>>> > >
>>> > >s/patche/patch/
>>> > >
>>>> > >> made sure this won't cause failures due to high order memory
>>>> > >> allocations. Increase it to 256: this is the max number of vCPUs
>>>> > >> KVM supports.
>>>> > >> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
>>>> > >> Reviewed-by: David Gibson <dgibson@redhat.com>
>>> > >
>>> > >Hmm it's kind of nasty that each tun device is now using x16 memory.
>>> > >Maybe we should look at using a flex array instead, and removing the
>>> > >limitation altogether (e.g. make it INT_MAX)?
>> > 
>> > But this only happens when IFF_MULTIQUEUE were used.
> I refer to this field:
>         struct tun_file __rcu   *tfiles[MAX_TAP_QUEUES];
> if we make MAX_TAP_QUEUES 256, this will use 4K bytes,
> apparently unconditionally.
>
>

How about just allocate one tfile if IFF_MULTIQUEUE were disabled?

^ permalink raw reply

* Re: [patch] ipvs: uninitialized data with IP_VS_IPV6
From: Dan Carpenter @ 2014-12-05  7:25 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Wensong Zhang, Simon Horman, Pablo Neira Ayuso, Patrick McHardy,
	Jozsef Kadlecsik, David S. Miller, netdev, lvs-devel,
	netfilter-devel, coreteam, kernel-janitors
In-Reply-To: <alpine.LFD.2.11.1412042308100.4841@ja.home.ssi.bg>

On Thu, Dec 04, 2014 at 11:19:34PM +0200, Julian Anastasov wrote:
> 
> 	I guess ip_vs_ftp_in() needs the same fix?

Good catch.  Thanks for noticing that.  I'll send a v2.

regards,
dan carpenter


^ permalink raw reply

* Re: [PATCH v2 1/6] net-PPP: Replacement of a printk() call by pr_warn() in mppe_rekey()
From: Julia Lawall @ 2014-12-05  7:21 UTC (permalink / raw)
  To: Joe Perches
  Cc: SF Markus Elfring, Sergei Shtylyov, Paul Mackerras, linux-ppp,
	netdev, Eric Dumazet, LKML, kernel-janitors
In-Reply-To: <1417731809.2721.17.camel@perches.com>



On Thu, 4 Dec 2014, Joe Perches wrote:

> On Thu, 2014-12-04 at 23:10 +0100, SF Markus Elfring wrote:
> > The mppe_rekey() function contained a few update candidates.
> > * Curly brackets were still used around a single function call "printk".
> > * Unwanted space characters
> > 
> > Let us improve these implementation details according to the current Linux
> > coding style convention.
> 
> trivia:
> 
> > diff --git a/drivers/net/ppp/ppp_mppe.c b/drivers/net/ppp/ppp_mppe.c
> []
> > @@ -172,9 +172,8 @@ static void mppe_rekey(struct ppp_mppe_state * state, int initial_key)
> >  		setup_sg(sg_in, state->sha1_digest, state->keylen);
> >  		setup_sg(sg_out, state->session_key, state->keylen);
> >  		if (crypto_blkcipher_encrypt(&desc, sg_out, sg_in,
> > -					     state->keylen) != 0) {
> > -    		    printk(KERN_WARNING "mppe_rekey: cipher_encrypt failed\n");
> > -		}
> > +					     state->keylen) != 0)
> > +			pr_warn("mppe_rekey: cipher_encrypt failed\n");
> 
> It's generally nicer to replace embedded function names
> with "%s: ", __func__
> 
> 			pr_warn("%s: cipher_encrypt failed\n", __func__);

Doing so may potentially allow some strings to be shared, thus saving a 
little space.  Perhaps not in this case, though.

julia

^ permalink raw reply

* Re: [PATCH v2 1/6] net-PPP: Replacement of a printk() call by pr_warn() in mppe_rekey()
From: SF Markus Elfring @ 2014-12-05  7:18 UTC (permalink / raw)
  To: Joe Perches
  Cc: Sergei Shtylyov, Paul Mackerras, linux-ppp, netdev, Eric Dumazet,
	LKML, kernel-janitors, Julia Lawall
In-Reply-To: <1417733101.2721.20.camel@perches.com>

>>> It's generally nicer to replace embedded function names
>>> with "%s: ", __func__
>>>
>>> 			pr_warn("%s: cipher_encrypt failed\n", __func__);
>>
>> Do you want that I send a third patch series for the fine-tuning of these parameters?
> 
> If you want.

Would "a committer" fix such a small source code adjustment also without a resend of
a patch series?


> I just wanted you to be aware of it for future patches.

Thanks for your tip.

Does it make sense to express such implementation details in the Linux coding
style documentation more explicitly (besides the fact that this update suggestion
was also triggered by a warning from the script "checkpatch.pl").

Regards,
Markus

^ permalink raw reply

* [RESEND PATCH] net/socket.c : introduce helper function do_sock_sendmsg to replace reduplicate code
From: Gu Zheng @ 2014-12-05  7:14 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-kernel, Gu Zheng
In-Reply-To: <1416799125-13972-1-git-send-email-guz.fnst@cn.fujitsu.com>

Introduce helper function do_sock_sendmsg() to simplify sock_sendmsg{_nosec},
and replace reduplicate code.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
---
 net/socket.c |   22 ++++++++++------------
 1 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index fe20c31..2a6f3c6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -651,7 +651,8 @@ static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock,
 	return err ?: __sock_sendmsg_nosec(iocb, sock, msg, size);
 }
 
-int sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
+static int do_sock_sendmsg(struct socket *sock, struct msghdr *msg,
+			   size_t size, bool nosec)
 {
 	struct kiocb iocb;
 	struct sock_iocb siocb;
@@ -659,25 +660,22 @@ int sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
 
 	init_sync_kiocb(&iocb, NULL);
 	iocb.private = &siocb;
-	ret = __sock_sendmsg(&iocb, sock, msg, size);
+	ret = nosec ? __sock_sendmsg_nosec(&iocb, sock, msg, size) :
+		      __sock_sendmsg(&iocb, sock, msg, size);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&iocb);
 	return ret;
 }
+
+int sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
+{
+	return do_sock_sendmsg(sock, msg, size, false);
+}
 EXPORT_SYMBOL(sock_sendmsg);
 
 static int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg, size_t size)
 {
-	struct kiocb iocb;
-	struct sock_iocb siocb;
-	int ret;
-
-	init_sync_kiocb(&iocb, NULL);
-	iocb.private = &siocb;
-	ret = __sock_sendmsg_nosec(&iocb, sock, msg, size);
-	if (-EIOCBQUEUED == ret)
-		ret = wait_on_sync_kiocb(&iocb);
-	return ret;
+	return do_sock_sendmsg(sock, msg, size, true);
 }
 
 int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
-- 
1.7.7

.

^ permalink raw reply related

* [RESEND PATCH] net: introduce helper macra CMSG_FOREACH_HDR
From: Gu Zheng @ 2014-12-05  7:14 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-kernel, Gu Zheng

Introduce helper macra CMSG_FOREACH_HDR as a wrapper of the enumerating
cmsghdr from msghdr, just cleanup. 

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
---
 .../networking/timestamping/timestamping.c         |    4 +---
 .../networking/timestamping/txtimestamp.c          |    4 +---
 crypto/af_alg.c                                    |    2 +-
 include/linux/socket.h                             |    4 ++++
 net/core/scm.c                                     |    3 +--
 net/dccp/proto.c                                   |    5 ++---
 net/ipv4/ip_sockglue.c                             |    2 +-
 net/ipv6/datagram.c                                |    2 +-
 net/iucv/af_iucv.c                                 |    4 +---
 net/rds/send.c                                     |    4 ++--
 net/rxrpc/ar-output.c                              |    2 +-
 net/sctp/socket.c                                  |    3 +--
 12 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
index 5cdfd74..3106e88 100644
--- a/Documentation/networking/timestamping/timestamping.c
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -169,9 +169,7 @@ static void printpacket(struct msghdr *msg, int res,
 	       res,
 	       inet_ntoa(from_addr->sin_addr),
 	       msg->msg_controllen);
-	for (cmsg = CMSG_FIRSTHDR(msg);
-	     cmsg;
-	     cmsg = CMSG_NXTHDR(msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		printf("   cmsg len %zu: ", cmsg->cmsg_len);
 		switch (cmsg->cmsg_level) {
 		case SOL_SOCKET:
diff --git a/Documentation/networking/timestamping/txtimestamp.c b/Documentation/networking/timestamping/txtimestamp.c
index b32fc2a..e44ef35 100644
--- a/Documentation/networking/timestamping/txtimestamp.c
+++ b/Documentation/networking/timestamping/txtimestamp.c
@@ -149,9 +149,7 @@ static void __recv_errmsg_cmsg(struct msghdr *msg, int payload_len)
 	struct scm_timestamping *tss = NULL;
 	struct cmsghdr *cm;
 
-	for (cm = CMSG_FIRSTHDR(msg);
-	     cm && cm->cmsg_len;
-	     cm = CMSG_NXTHDR(msg, cm)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (cm->cmsg_level == SOL_SOCKET &&
 		    cm->cmsg_type == SCM_TIMESTAMPING) {
 			tss = (void *) CMSG_DATA(cm);
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 6a3ad80..3df7d53 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -399,7 +399,7 @@ int af_alg_cmsg_send(struct msghdr *msg, struct af_alg_control *con)
 {
 	struct cmsghdr *cmsg;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 		if (cmsg->cmsg_level != SOL_ALG)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index bb9b836..d4b592f 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -94,6 +94,10 @@ struct cmsghdr {
 			     (cmsg)->cmsg_len <= (unsigned long) \
 			     ((mhdr)->msg_controllen - \
 			      ((char *)(cmsg) - (char *)(mhdr)->msg_control)))
+#define CMSG_FOREACH_HDR(cmsg, msg)  \
+	for (cmsg = CMSG_FIRSTHDR(msg); \
+	     cmsg; \
+	     cmsg = CMSG_NXTHDR(msg, cmsg))
 
 /*
  *	Get the next cmsg header
diff --git a/net/core/scm.c b/net/core/scm.c
index b442e7e..e938c49 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -129,8 +129,7 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
 	struct cmsghdr *cmsg;
 	int err;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg))
-	{
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		err = -EINVAL;
 
 		/* Verify that cmsg_len is at least sizeof(struct cmsghdr) */
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 5ab6627..d449cc5 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -703,7 +703,7 @@ EXPORT_SYMBOL_GPL(compat_dccp_getsockopt);
 
 static int dccp_msghdr_parse(struct msghdr *msg, struct sk_buff *skb)
 {
-	struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg);
+	struct cmsghdr *cmsg;
 
 	/*
 	 * Assign an (opaque) qpolicy priority value to skb->priority.
@@ -717,8 +717,7 @@ static int dccp_msghdr_parse(struct msghdr *msg, struct sk_buff *skb)
 	 */
 	skb->priority = 0;
 
-	for (; cmsg != NULL; cmsg = CMSG_NXTHDR(msg, cmsg)) {
-
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 9daf217..14a6f71 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -192,7 +192,7 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
 	int err, val;
 	struct cmsghdr *cmsg;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 2cdc383..9895b98 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -640,7 +640,7 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 	int len;
 	int err = 0;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		int addr_type;
 
 		if (!CMSG_OK(msg, cmsg)) {
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index a089b6b..eae4e08 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -1070,9 +1070,7 @@ static int iucv_sock_sendmsg(struct kiocb *iocb, struct socket *sock,
 	txmsg.class = 0;
 
 	/* iterate over control messages */
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg;
-		cmsg = CMSG_NXTHDR(msg, cmsg)) {
-
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg)) {
 			err = -EINVAL;
 			goto out;
diff --git a/net/rds/send.c b/net/rds/send.c
index 0a64541..1accb3e 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -826,7 +826,7 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 	int cmsg_groups = 0;
 	int retval;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 
@@ -878,7 +878,7 @@ static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 	struct cmsghdr *cmsg;
 	int ret = 0;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 
diff --git a/net/rxrpc/ar-output.c b/net/rxrpc/ar-output.c
index 0b4b9a7..f915e7e 100644
--- a/net/rxrpc/ar-output.c
+++ b/net/rxrpc/ar-output.c
@@ -45,7 +45,7 @@ static int rxrpc_sendmsg_cmsg(struct rxrpc_sock *rx, struct msghdr *msg,
 	if (msg->msg_controllen == 0)
 		return -EINVAL;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
 			return -EINVAL;
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 634a2ab..58834d7 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6592,8 +6592,7 @@ static int sctp_msghdr_parse(const struct msghdr *msg, sctp_cmsgs_t *cmsgs)
 	struct cmsghdr *cmsg;
 	struct msghdr *my_msg = (struct msghdr *)msg;
 
-	for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
-	     cmsg = CMSG_NXTHDR(my_msg, cmsg)) {
+	CMSG_FOREACH_HDR(cmsg, my_msg) {
 		if (!CMSG_OK(my_msg, cmsg))
 			return -EINVAL;
 
-- 
1.7.7

.

^ permalink raw reply related

* Re: [PATCH 2/3] bridge: offload bridge port attributes to switch asic if feature flag set
From: Roopa Prabhu @ 2014-12-05  7:10 UTC (permalink / raw)
  To: John Fastabend
  Cc: Scott Feldman, Jiří Pírko, Jamal Hadi Salim,
	Benjamin LaHaise, Thomas Graf, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <548156C8.5060602@gmail.com>

On 12/4/14, 10:55 PM, John Fastabend wrote:
> On 12/04/2014 10:41 PM, Scott Feldman wrote:
>> On Thu, Dec 4, 2014 at 6:26 PM, <roopa@cumulusnetworks.com> wrote:
>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>
>>> This allows offloading to switch asic without having the user to set
>>> any flag. And this is done in the bridge driver to rollback kernel 
>>> settings
>>> on hw offload failure if required in the future.
>>>
>>> With this, it also makes sure a notification goes out only after the
>>> attributes are set both in the kernel and hw.
>>
>> I like this approach as it streamlines the steps for the user in
>> setting port flags.  There is one case for FLOODING where you'll have
>> to turn off flooding for both, and then turn on flooding in hw. You
>> don't want flooding turned on on kernel and hw.
>>
>>> ---
>>>   net/bridge/br_netlink.c |   27 ++++++++++++++++++++++++++-
>>>   1 file changed, 26 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>>> index 9f5eb55..ce173f0 100644
>>> --- a/net/bridge/br_netlink.c
>>> +++ b/net/bridge/br_netlink.c
>>> @@ -407,9 +407,21 @@ int br_setlink(struct net_device *dev, struct 
>>> nlmsghdr *nlh)
>>>                                  afspec, RTM_SETLINK);
>>>          }
>>>
>>> +       if ((dev->features & NETIF_F_HW_SWITCH_OFFLOAD) &&
>>> + dev->netdev_ops->ndo_bridge_setlink) {
>>> +               int ret = dev->netdev_ops->ndo_bridge_setlink(dev, 
>>> nlh);
>>
>> I think you want to up-level this to net/core/rtnetlink.c because
>> you're only enabling the feature for one instance of a driver that
>> implements ndo_bridge_setlink: the bridge driver.  If another driver
>> was MASTER and implemented ndo_bridge_setlink, you'd want same check
>> to push setting down to SELF port driver.
>
> Also if the user set SELF && MASTER flags && had HW_SWITCH_OFFLOAD bit
> set we would call ndo_bridge_setlink twice on the dev. Maybe you can
> catch this case in rtnetlink.c and only call it once.

yes, thought about this and when i looked at iproute2 code, it is either 
master
or self today and i don't think anybody else uses both flags with the 
current
kernel implementation. But yes, that does not stop anybody from setting 
both flags.
I will handle it better in v2.
>
>>
>>> +               if (ret && ret != -EOPNOTSUPP) {
>>> +                       /* XXX Fix this in the future to rollback
>>> +                        * kernel settings and return error
>>> +                        */
>>
>> The future is now.  Let's fix this now for the rollback case (again up
>> in rtnetlink.c).  So then a general question comes to mind: for these
>> dual target sets, is it best to try HW first and then SW, or the other
>> way around?  Either way, on failure on second you need to rollback
>> first.  And, on failure, you need to know rollback value for first, so
>> you have to do a getlink on first before attempting set.
>
> It might be helpful to return some indication of what object failed as
> well.

ok...

thanks,
Roopa

^ permalink raw reply

* Re: [PATCH iproute2] bridge link: add option 'self'
From: Roopa Prabhu @ 2014-12-05  7:04 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Jiří Pírko, Jamal Hadi Salim, Benjamin LaHaise,
	Thomas Graf, john fastabend, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <CAE4R7bA3JSyHxe5ptVzpCzP5nHmKm8KyPqGvK7a_N2TWX-BNRw@mail.gmail.com>

On 12/4/14, 10:52 PM, Scott Feldman wrote:
> On Thu, Dec 4, 2014 at 6:27 PM,  <roopa@cumulusnetworks.com> wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> Currently self is set internally only if hwmode is set.
>> This makes it necessary for the hw to have a mode.
>> There is no hwmode really required to go to hardware. So, introduce
>> self for anybody who wants to target hardware.
> Can you include unwinding the hwmode 'swdev' setting for
> kernel/iproute2 in v2?  With this "self" option, and your new
> NETIF_F_HW_SWITCH_OFFLOAD flag, we don't need the hwmde 'swdev'.
yep. will do. thanks

>> ---
>>   bridge/link.c |    3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/bridge/link.c b/bridge/link.c
>> index 90d9e7f..2b86141 100644
>> --- a/bridge/link.c
>> +++ b/bridge/link.c
>> @@ -321,6 +321,9 @@ static int brlink_modify(int argc, char **argv)
>>                                          "\"veb\".\n");
>>                                  exit(-1);
>>                          }
>> +               } else if (strcmp(*argv, "self") == 0) {
>> +                       NEXT_ARG();
>> +                       flags = BRIDGE_FLAGS_SELF;
>>                  } else {
>>                          usage();
>>                  }
>> --
>> 1.7.10.4
>>

^ permalink raw reply

* Re: [PATCH 2/3] bridge: offload bridge port attributes to switch asic if feature flag set
From: Roopa Prabhu @ 2014-12-05  7:02 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Jiří Pírko, Jamal Hadi Salim, Benjamin LaHaise,
	Thomas Graf, john fastabend, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <CAE4R7bBOCgoH0zL6+82jj2qjcP8n3kSo51x+i5SoqkN8Ff1CBQ@mail.gmail.com>

On 12/4/14, 10:41 PM, Scott Feldman wrote:
> On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com> wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> This allows offloading to switch asic without having the user to set
>> any flag. And this is done in the bridge driver to rollback kernel settings
>> on hw offload failure if required in the future.
>>
>> With this, it also makes sure a notification goes out only after the
>> attributes are set both in the kernel and hw.
> I like this approach as it streamlines the steps for the user in
> setting port flags.  There is one case for FLOODING where you'll have
> to turn off flooding for both, and then turn on flooding in hw.  You
> don't want flooding turned on on kernel and hw.
ok, maybe using the higher bits as in 
https://patchwork.ozlabs.org/patch/413211/

might help with that. Let me think some more.
>
>> ---
>>   net/bridge/br_netlink.c |   27 ++++++++++++++++++++++++++-
>>   1 file changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>> index 9f5eb55..ce173f0 100644
>> --- a/net/bridge/br_netlink.c
>> +++ b/net/bridge/br_netlink.c
>> @@ -407,9 +407,21 @@ int br_setlink(struct net_device *dev, struct nlmsghdr *nlh)
>>                                  afspec, RTM_SETLINK);
>>          }
>>
>> +       if ((dev->features & NETIF_F_HW_SWITCH_OFFLOAD) &&
>> +                       dev->netdev_ops->ndo_bridge_setlink) {
>> +               int ret = dev->netdev_ops->ndo_bridge_setlink(dev, nlh);
> I think you want to up-level this to net/core/rtnetlink.c because
> you're only enabling the feature for one instance of a driver that
> implements ndo_bridge_setlink: the bridge driver.  If another driver
> was MASTER and implemented ndo_bridge_setlink, you'd want same check
> to push setting down to SELF port driver.

yeah, i thought about that. But i moved it here so that rollback would 
be easier.
>
>> +               if (ret && ret != -EOPNOTSUPP) {
>> +                       /* XXX Fix this in the future to rollback
>> +                        * kernel settings and return error
>> +                        */
> The future is now.  Let's fix this now for the rollback case (again up
> in rtnetlink.c).  So then a general question comes to mind: for these
> dual target sets, is it best to try HW first and then SW, or the other
> way around?  Either way, on failure on second you need to rollback
> first.  And, on failure, you need to know rollback value for first, so
> you have to do a getlink on first before attempting set.
yep, exactly, I went through the same thought process yesterday when i 
was trying to implement rollback.
>
>> +                       br_warn(p->br, "error offloading bridge attributes "
>> +                                       "on port %u(%s)\n", (unsigned int) p->port_no,
>> +                                       p->dev->name);
>> +               }
>> +       }
>> +
>>          if (err == 0)
>>                  br_ifinfo_notify(RTM_NEWLINK, p);
>> -
>>   out:
>>          return err;
>>   }
>> @@ -433,6 +445,19 @@ int br_dellink(struct net_device *dev, struct nlmsghdr *nlh)
>>          err = br_afspec((struct net_bridge *)netdev_priv(dev), p,
>>                          afspec, RTM_DELLINK);
>>
>> +       if (dev->features & NETIF_F_HW_SWITCH_OFFLOAD
>> +                       && dev->netdev_ops->ndo_bridge_setlink) {
>> +               int ret = dev->netdev_ops->ndo_bridge_dellink(dev, nlh);
>> +               if (ret && ret != -EOPNOTSUPP) {
>> +                       /* XXX Fix this in the future to rollback
>> +                        * kernel settings and return error
>> +                        */
>> +                       br_warn(p->br, "error offloading bridge attributes "
>> +                                       "on port %u(%s)\n", (unsigned int) p->port_no,
>> +                                       p->dev->name);
>> +               }
>> +       }
>> +
> Same comments as setlink above.
>
>>          return err;
>>   }
>>   static int br_validate(struct nlattr *tb[], struct nlattr *data[])
>> --
>> 1.7.10.4
>>

^ permalink raw reply

* Re: [PATCH 2/3] bridge: offload bridge port attributes to switch asic if feature flag set
From: John Fastabend @ 2014-12-05  6:55 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Roopa Prabhu, Jiří Pírko, Jamal Hadi Salim,
	Benjamin LaHaise, Thomas Graf, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <CAE4R7bBOCgoH0zL6+82jj2qjcP8n3kSo51x+i5SoqkN8Ff1CBQ@mail.gmail.com>

On 12/04/2014 10:41 PM, Scott Feldman wrote:
> On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com> wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> This allows offloading to switch asic without having the user to set
>> any flag. And this is done in the bridge driver to rollback kernel settings
>> on hw offload failure if required in the future.
>>
>> With this, it also makes sure a notification goes out only after the
>> attributes are set both in the kernel and hw.
>
> I like this approach as it streamlines the steps for the user in
> setting port flags.  There is one case for FLOODING where you'll have
> to turn off flooding for both, and then turn on flooding in hw.  You
> don't want flooding turned on on kernel and hw.
>
>> ---
>>   net/bridge/br_netlink.c |   27 ++++++++++++++++++++++++++-
>>   1 file changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
>> index 9f5eb55..ce173f0 100644
>> --- a/net/bridge/br_netlink.c
>> +++ b/net/bridge/br_netlink.c
>> @@ -407,9 +407,21 @@ int br_setlink(struct net_device *dev, struct nlmsghdr *nlh)
>>                                  afspec, RTM_SETLINK);
>>          }
>>
>> +       if ((dev->features & NETIF_F_HW_SWITCH_OFFLOAD) &&
>> +                       dev->netdev_ops->ndo_bridge_setlink) {
>> +               int ret = dev->netdev_ops->ndo_bridge_setlink(dev, nlh);
>
> I think you want to up-level this to net/core/rtnetlink.c because
> you're only enabling the feature for one instance of a driver that
> implements ndo_bridge_setlink: the bridge driver.  If another driver
> was MASTER and implemented ndo_bridge_setlink, you'd want same check
> to push setting down to SELF port driver.

Also if the user set SELF && MASTER flags && had HW_SWITCH_OFFLOAD bit
set we would call ndo_bridge_setlink twice on the dev. Maybe you can
catch this case in rtnetlink.c and only call it once.

>
>> +               if (ret && ret != -EOPNOTSUPP) {
>> +                       /* XXX Fix this in the future to rollback
>> +                        * kernel settings and return error
>> +                        */
>
> The future is now.  Let's fix this now for the rollback case (again up
> in rtnetlink.c).  So then a general question comes to mind: for these
> dual target sets, is it best to try HW first and then SW, or the other
> way around?  Either way, on failure on second you need to rollback
> first.  And, on failure, you need to know rollback value for first, so
> you have to do a getlink on first before attempting set.

It might be helpful to return some indication of what object failed as
well.

>
>> +                       br_warn(p->br, "error offloading bridge attributes "
>> +                                       "on port %u(%s)\n", (unsigned int) p->port_no,
>> +                                       p->dev->name);
>> +               }
>> +       }
>> +
>>          if (err == 0)
>>                  br_ifinfo_notify(RTM_NEWLINK, p);
>> -
>>   out:
>>          return err;
>>   }
>> @@ -433,6 +445,19 @@ int br_dellink(struct net_device *dev, struct nlmsghdr *nlh)
>>          err = br_afspec((struct net_bridge *)netdev_priv(dev), p,
>>                          afspec, RTM_DELLINK);
>>
>> +       if (dev->features & NETIF_F_HW_SWITCH_OFFLOAD
>> +                       && dev->netdev_ops->ndo_bridge_setlink) {
>> +               int ret = dev->netdev_ops->ndo_bridge_dellink(dev, nlh);
>> +               if (ret && ret != -EOPNOTSUPP) {
>> +                       /* XXX Fix this in the future to rollback
>> +                        * kernel settings and return error
>> +                        */
>> +                       br_warn(p->br, "error offloading bridge attributes "
>> +                                       "on port %u(%s)\n", (unsigned int) p->port_no,
>> +                                       p->dev->name);
>> +               }
>> +       }
>> +
>
> Same comments as setlink above.
>
>>          return err;
>>   }
>>   static int br_validate(struct nlattr *tb[], struct nlattr *data[])
>> --
>> 1.7.10.4
>>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH iproute2] bridge link: add option 'self'
From: Scott Feldman @ 2014-12-05  6:52 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiří Pírko, Jamal Hadi Salim, Benjamin LaHaise,
	Thomas Graf, john fastabend, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <1417746436-41023-1-git-send-email-roopa@cumulusnetworks.com>

On Thu, Dec 4, 2014 at 6:27 PM,  <roopa@cumulusnetworks.com> wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> Currently self is set internally only if hwmode is set.
> This makes it necessary for the hw to have a mode.
> There is no hwmode really required to go to hardware. So, introduce
> self for anybody who wants to target hardware.

Can you include unwinding the hwmode 'swdev' setting for
kernel/iproute2 in v2?  With this "self" option, and your new
NETIF_F_HW_SWITCH_OFFLOAD flag, we don't need the hwmde 'swdev'.

> ---
>  bridge/link.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/bridge/link.c b/bridge/link.c
> index 90d9e7f..2b86141 100644
> --- a/bridge/link.c
> +++ b/bridge/link.c
> @@ -321,6 +321,9 @@ static int brlink_modify(int argc, char **argv)
>                                         "\"veb\".\n");
>                                 exit(-1);
>                         }
> +               } else if (strcmp(*argv, "self") == 0) {
> +                       NEXT_ARG();
> +                       flags = BRIDGE_FLAGS_SELF;
>                 } else {
>                         usage();
>                 }
> --
> 1.7.10.4
>

^ permalink raw reply

* Re: [PATCH 1/3] netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads
From: Scott Feldman @ 2014-12-05  6:47 UTC (permalink / raw)
  To: John Fastabend
  Cc: Roopa Prabhu, Jiří Pírko, Jamal Hadi Salim,
	Benjamin LaHaise, Thomas Graf, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <54815180.30900@gmail.com>

On Thu, Dec 4, 2014 at 10:32 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/04/2014 10:08 PM, Scott Feldman wrote:
>>
>> On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com> wrote:
>>>
>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>
>>> This is a generic high level feature flag for all switch asic features
>>> today.
>>>
>>> switch drivers set this flag on switch ports. Logical devices like
>>> bridge, bonds, vxlans can inherit this flag from their slaves/ports.
>>>
>>> I had to use SWITCH in the name to avoid ambiguity with other feature
>>> flags. But, since i have been harping about not calling it 'switch',
>>> I am welcome to any suggestions :)
>>>
>>> An alternative to using a feature flag is to use a IFF_HW_OFFLOAD
>>> in net_device_flags.
>>> ---
>>>   include/linux/netdev_features.h |    2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/include/linux/netdev_features.h
>>> b/include/linux/netdev_features.h
>>> index 8e30685..68db1de 100644
>>> --- a/include/linux/netdev_features.h
>>> +++ b/include/linux/netdev_features.h
>>> @@ -66,6 +66,7 @@ enum {
>>>          NETIF_F_HW_VLAN_STAG_FILTER_BIT,/* Receive filtering on VLAN
>>> STAGs */
>>>          NETIF_F_HW_L2FW_DOFFLOAD_BIT,   /* Allow L2 Forwarding in
>>> Hardware */
>>>          NETIF_F_BUSY_POLL_BIT,          /* Busy poll */
>>> +       NETIF_F_HW_SWITCH_OFFLOAD_BIT,  /* HW switch offload */
>>>
>>>          /*
>>>           * Add your fresh new feature above and remember to update
>>> @@ -124,6 +125,7 @@ enum {
>>>   #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
>>>   #define NETIF_F_HW_L2FW_DOFFLOAD       __NETIF_F(HW_L2FW_DOFFLOAD)
>>>   #define NETIF_F_BUSY_POLL      __NETIF_F(BUSY_POLL)
>>> +#define NETIF_F_HW_SWITCH_OFFLOAD      __NETIF_F(HW_SWITCH_OFFLOAD)
>>
>>
>> Can we use existing flag NETIF_F_HW_L2FW_DOFFLOAD?  The comment above
>> on that flag suggests it would be perfect for what we're doing with L2
>> bridging: offloading the L2 forwarding path in HW.
>>
>> NETIF_F_HW_L2FW_DOFFLOAD seems to be orphaned in current net-next, so
>> I'm not sure of it's status, but see if it makes sense to recycle it.
>>
>> I could see a NETIF_F_HW_L3FW_DOFFLOAD for L3 fwd offload, down the road.
>>
>
> Its not orphaned its used by the macvlan device to indicate that the
> port switching should be done in hardware. For example put macvlan in
> VEPA mode and the hardware will deliver the packets directly to the
> netdev and get the pruning right. In VEB mode the hardware will forward
> packets correctly between macvlan netdevs. The implementation on ixgbe
> for example is to give a set of hardware queues per macvlan netdev. I
> still need to add offloaded QOS to make this more interesting.
>
> I could support a l3vlan device as well even on some of the host nics.
>
> The issue with generalizing it is at the moment it is used by ixgbe
> which can't forward between two physical ports. It works on macvlan
> because each macvlan netdev uses the same physical uplink. I think switching
> between different ports is going to require another feature
> bit like its done here.


My bad...I grepped for the _BIT variant and missed the usage in
macvlan, etc.  Sorry about that.

>
> .John
>
>
>>
>>>   /* Features valid for ethtool to change */
>>>   /* = all defined minus driver/device-class-related */
>>> --
>>> 1.7.10.4
>>>
>
>
> --
> John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH 2/3] bridge: offload bridge port attributes to switch asic if feature flag set
From: Scott Feldman @ 2014-12-05  6:41 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiří Pírko, Jamal Hadi Salim, Benjamin LaHaise,
	Thomas Graf, john fastabend, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <1417746401-8140-3-git-send-email-roopa@cumulusnetworks.com>

On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com> wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> This allows offloading to switch asic without having the user to set
> any flag. And this is done in the bridge driver to rollback kernel settings
> on hw offload failure if required in the future.
>
> With this, it also makes sure a notification goes out only after the
> attributes are set both in the kernel and hw.

I like this approach as it streamlines the steps for the user in
setting port flags.  There is one case for FLOODING where you'll have
to turn off flooding for both, and then turn on flooding in hw.  You
don't want flooding turned on on kernel and hw.

> ---
>  net/bridge/br_netlink.c |   27 ++++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 9f5eb55..ce173f0 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -407,9 +407,21 @@ int br_setlink(struct net_device *dev, struct nlmsghdr *nlh)
>                                 afspec, RTM_SETLINK);
>         }
>
> +       if ((dev->features & NETIF_F_HW_SWITCH_OFFLOAD) &&
> +                       dev->netdev_ops->ndo_bridge_setlink) {
> +               int ret = dev->netdev_ops->ndo_bridge_setlink(dev, nlh);

I think you want to up-level this to net/core/rtnetlink.c because
you're only enabling the feature for one instance of a driver that
implements ndo_bridge_setlink: the bridge driver.  If another driver
was MASTER and implemented ndo_bridge_setlink, you'd want same check
to push setting down to SELF port driver.

> +               if (ret && ret != -EOPNOTSUPP) {
> +                       /* XXX Fix this in the future to rollback
> +                        * kernel settings and return error
> +                        */

The future is now.  Let's fix this now for the rollback case (again up
in rtnetlink.c).  So then a general question comes to mind: for these
dual target sets, is it best to try HW first and then SW, or the other
way around?  Either way, on failure on second you need to rollback
first.  And, on failure, you need to know rollback value for first, so
you have to do a getlink on first before attempting set.

> +                       br_warn(p->br, "error offloading bridge attributes "
> +                                       "on port %u(%s)\n", (unsigned int) p->port_no,
> +                                       p->dev->name);
> +               }
> +       }
> +
>         if (err == 0)
>                 br_ifinfo_notify(RTM_NEWLINK, p);
> -
>  out:
>         return err;
>  }
> @@ -433,6 +445,19 @@ int br_dellink(struct net_device *dev, struct nlmsghdr *nlh)
>         err = br_afspec((struct net_bridge *)netdev_priv(dev), p,
>                         afspec, RTM_DELLINK);
>
> +       if (dev->features & NETIF_F_HW_SWITCH_OFFLOAD
> +                       && dev->netdev_ops->ndo_bridge_setlink) {
> +               int ret = dev->netdev_ops->ndo_bridge_dellink(dev, nlh);
> +               if (ret && ret != -EOPNOTSUPP) {
> +                       /* XXX Fix this in the future to rollback
> +                        * kernel settings and return error
> +                        */
> +                       br_warn(p->br, "error offloading bridge attributes "
> +                                       "on port %u(%s)\n", (unsigned int) p->port_no,
> +                                       p->dev->name);
> +               }
> +       }
> +

Same comments as setlink above.

>         return err;
>  }
>  static int br_validate(struct nlattr *tb[], struct nlattr *data[])
> --
> 1.7.10.4
>

^ permalink raw reply

* Re: [PATCH 3/3] rocker: set feature NETIF_F_HW_SWITCH_OFFLOAD
From: John Fastabend @ 2014-12-05  6:38 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Jianhua Xie, Roopa Prabhu, Jiří Pírko,
	Jamal Hadi Salim, Benjamin LaHaise, Thomas Graf,
	stephen@networkplumber.org, John Linville, nhorman@tuxdriver.com,
	Nicolas Dichtel, vyasevic@redhat.com, Florian Fainelli,
	buytenh@wantstofly.org, Aviad Raveh, Netdev, David S. Miller, shm,
	Andy Gospodarek
In-Reply-To: <CAE4R7bDCcE94WQE50J+suAr9F=OFvQnC73txA+xrC2JxMseYkw@mail.gmail.com>

On 12/04/2014 10:16 PM, Scott Feldman wrote:
> On Thu, Dec 4, 2014 at 7:25 PM, Jianhua Xie <jianhua.xie@freescale.com> wrote:
>>
>> 在 2014年12月05日 10:26, roopa@cumulusnetworks.com 写道:
>>
>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>
>>> This patch just sets the feature flag on rocker ports
>>> ---
>>>    drivers/net/ethernet/rocker/rocker.c |    3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/rocker/rocker.c
>>> b/drivers/net/ethernet/rocker/rocker.c
>>> index fded127..3fe19b0 100644
>>> --- a/drivers/net/ethernet/rocker/rocker.c
>>> +++ b/drivers/net/ethernet/rocker/rocker.c
>>> @@ -4003,7 +4003,8 @@ static int rocker_probe_port(struct rocker *rocker,
>>> unsigned int port_number)
>>>                         NAPI_POLL_WEIGHT);
>>>          rocker_carrier_init(rocker_port);
>>>    -     dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
>>> +       dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER |
>>> +                                        NETIF_F_HW_SWITCH_OFFLOAD;
>>
>> Do you have a plan on enabling/disabling  this flag dynamically by ethtool?
>
> Might get it for free if we recycle NETIF_F_HW_L2FW_DOFFLOAD;
> NETIF_F_HW_L2FW_DOFFLOAD seems to be supported by ethtool already.

See my other note but ethtool support for feature flags makes
this pretty close to trivial to implement these days.

Take a look @ netdev_features_strings.

>
>>
>> Thanks & B.R
>> Jianhua
>>
>>>          err = register_netdev(dev);
>>>          if (err) {
>>
>>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH 1/3] netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads
From: John Fastabend @ 2014-12-05  6:32 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Roopa Prabhu, Jiří Pírko, Jamal Hadi Salim,
	Benjamin LaHaise, Thomas Graf, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <CAE4R7bBC+W+9F3cmR5TVuO+BbhK+zBuiQhNtk6gCvK5CC2MkMg@mail.gmail.com>

On 12/04/2014 10:08 PM, Scott Feldman wrote:
> On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com> wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> This is a generic high level feature flag for all switch asic features today.
>>
>> switch drivers set this flag on switch ports. Logical devices like
>> bridge, bonds, vxlans can inherit this flag from their slaves/ports.
>>
>> I had to use SWITCH in the name to avoid ambiguity with other feature
>> flags. But, since i have been harping about not calling it 'switch',
>> I am welcome to any suggestions :)
>>
>> An alternative to using a feature flag is to use a IFF_HW_OFFLOAD
>> in net_device_flags.
>> ---
>>   include/linux/netdev_features.h |    2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
>> index 8e30685..68db1de 100644
>> --- a/include/linux/netdev_features.h
>> +++ b/include/linux/netdev_features.h
>> @@ -66,6 +66,7 @@ enum {
>>          NETIF_F_HW_VLAN_STAG_FILTER_BIT,/* Receive filtering on VLAN STAGs */
>>          NETIF_F_HW_L2FW_DOFFLOAD_BIT,   /* Allow L2 Forwarding in Hardware */
>>          NETIF_F_BUSY_POLL_BIT,          /* Busy poll */
>> +       NETIF_F_HW_SWITCH_OFFLOAD_BIT,  /* HW switch offload */
>>
>>          /*
>>           * Add your fresh new feature above and remember to update
>> @@ -124,6 +125,7 @@ enum {
>>   #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
>>   #define NETIF_F_HW_L2FW_DOFFLOAD       __NETIF_F(HW_L2FW_DOFFLOAD)
>>   #define NETIF_F_BUSY_POLL      __NETIF_F(BUSY_POLL)
>> +#define NETIF_F_HW_SWITCH_OFFLOAD      __NETIF_F(HW_SWITCH_OFFLOAD)
>
> Can we use existing flag NETIF_F_HW_L2FW_DOFFLOAD?  The comment above
> on that flag suggests it would be perfect for what we're doing with L2
> bridging: offloading the L2 forwarding path in HW.
>
> NETIF_F_HW_L2FW_DOFFLOAD seems to be orphaned in current net-next, so
> I'm not sure of it's status, but see if it makes sense to recycle it.
>
> I could see a NETIF_F_HW_L3FW_DOFFLOAD for L3 fwd offload, down the road.
>

Its not orphaned its used by the macvlan device to indicate that the
port switching should be done in hardware. For example put macvlan in
VEPA mode and the hardware will deliver the packets directly to the
netdev and get the pruning right. In VEB mode the hardware will forward
packets correctly between macvlan netdevs. The implementation on ixgbe
for example is to give a set of hardware queues per macvlan netdev. I
still need to add offloaded QOS to make this more interesting.

I could support a l3vlan device as well even on some of the host nics.

The issue with generalizing it is at the moment it is used by ixgbe
which can't forward between two physical ports. It works on macvlan
because each macvlan netdev uses the same physical uplink. I think 
switching between different ports is going to require another feature
bit like its done here.

.John

>
>>   /* Features valid for ethtool to change */
>>   /* = all defined minus driver/device-class-related */
>> --
>> 1.7.10.4
>>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH v2 1/6] net-PPP: Replacement of a printk() call by pr_warn() in mppe_rekey()
From: Julia Lawall @ 2014-12-05  6:26 UTC (permalink / raw)
  To: Joe Perches
  Cc: SF Markus Elfring, Sergei Shtylyov, Paul Mackerras, linux-ppp,
	netdev, Eric Dumazet, LKML, kernel-janitors, Julia Lawall
In-Reply-To: <1417733101.2721.20.camel@perches.com>

On Thu, 4 Dec 2014, Joe Perches wrote:

> On Thu, 2014-12-04 at 23:27 +0100, SF Markus Elfring wrote:
> > >> diff --git a/drivers/net/ppp/ppp_mppe.c b/drivers/net/ppp/ppp_mppe.c
> > > []
> > >> @@ -172,9 +172,8 @@ static void mppe_rekey(struct ppp_mppe_state * state, int initial_key)
> > >>  		setup_sg(sg_in, state->sha1_digest, state->keylen);
> > >>  		setup_sg(sg_out, state->session_key, state->keylen);
> > >>  		if (crypto_blkcipher_encrypt(&desc, sg_out, sg_in,
> > >> -					     state->keylen) != 0) {
> > >> -    		    printk(KERN_WARNING "mppe_rekey: cipher_encrypt failed\n");
> > >> -		}
> > >> +					     state->keylen) != 0)
> > >> +			pr_warn("mppe_rekey: cipher_encrypt failed\n");
> > > 
> > > It's generally nicer to replace embedded function names
> > > with "%s: ", __func__
> > > 
> > > 			pr_warn("%s: cipher_encrypt failed\n", __func__);
> > 
> > Do you want that I send a third patch series for the fine-tuning of these parameters?
> 
> If you want.
> 
> I just wanted you to be aware of it for future patches.

Markus, are you sure that you cannot use netdev_warn in this case?

julia

^ permalink raw reply

* Re: [PATCH 3/3] rocker: set feature NETIF_F_HW_SWITCH_OFFLOAD
From: Scott Feldman @ 2014-12-05  6:16 UTC (permalink / raw)
  To: Jianhua Xie
  Cc: Roopa Prabhu, Jiří Pírko, Jamal Hadi Salim,
	Benjamin LaHaise, Thomas Graf, john fastabend,
	stephen@networkplumber.org, John Linville, nhorman@tuxdriver.com,
	Nicolas Dichtel, vyasevic@redhat.com, Florian Fainelli,
	buytenh@wantstofly.org, Aviad Raveh, Netdev, David S. Miller, shm,
	Andy Gospodarek
In-Reply-To: <5481259C.3040707@freescale.com>

On Thu, Dec 4, 2014 at 7:25 PM, Jianhua Xie <jianhua.xie@freescale.com> wrote:
>
> 在 2014年12月05日 10:26, roopa@cumulusnetworks.com 写道:
>
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> This patch just sets the feature flag on rocker ports
>> ---
>>   drivers/net/ethernet/rocker/rocker.c |    3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/rocker/rocker.c
>> b/drivers/net/ethernet/rocker/rocker.c
>> index fded127..3fe19b0 100644
>> --- a/drivers/net/ethernet/rocker/rocker.c
>> +++ b/drivers/net/ethernet/rocker/rocker.c
>> @@ -4003,7 +4003,8 @@ static int rocker_probe_port(struct rocker *rocker,
>> unsigned int port_number)
>>                        NAPI_POLL_WEIGHT);
>>         rocker_carrier_init(rocker_port);
>>   -     dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
>> +       dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER |
>> +                                        NETIF_F_HW_SWITCH_OFFLOAD;
>
> Do you have a plan on enabling/disabling  this flag dynamically by ethtool?

Might get it for free if we recycle NETIF_F_HW_L2FW_DOFFLOAD;
NETIF_F_HW_L2FW_DOFFLOAD seems to be supported by ethtool already.

>
> Thanks & B.R
> Jianhua
>
>>         err = register_netdev(dev);
>>         if (err) {
>
>

^ permalink raw reply

* Re: [PATCH 1/3] netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads
From: Scott Feldman @ 2014-12-05  6:08 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiří Pírko, Jamal Hadi Salim, Benjamin LaHaise,
	Thomas Graf, john fastabend, stephen@networkplumber.org,
	John Linville, nhorman@tuxdriver.com, Nicolas Dichtel,
	vyasevic@redhat.com, Florian Fainelli, buytenh@wantstofly.org,
	Aviad Raveh, Netdev, David S. Miller, shm, Andy Gospodarek
In-Reply-To: <1417746401-8140-2-git-send-email-roopa@cumulusnetworks.com>

On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com> wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> This is a generic high level feature flag for all switch asic features today.
>
> switch drivers set this flag on switch ports. Logical devices like
> bridge, bonds, vxlans can inherit this flag from their slaves/ports.
>
> I had to use SWITCH in the name to avoid ambiguity with other feature
> flags. But, since i have been harping about not calling it 'switch',
> I am welcome to any suggestions :)
>
> An alternative to using a feature flag is to use a IFF_HW_OFFLOAD
> in net_device_flags.
> ---
>  include/linux/netdev_features.h |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 8e30685..68db1de 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -66,6 +66,7 @@ enum {
>         NETIF_F_HW_VLAN_STAG_FILTER_BIT,/* Receive filtering on VLAN STAGs */
>         NETIF_F_HW_L2FW_DOFFLOAD_BIT,   /* Allow L2 Forwarding in Hardware */
>         NETIF_F_BUSY_POLL_BIT,          /* Busy poll */
> +       NETIF_F_HW_SWITCH_OFFLOAD_BIT,  /* HW switch offload */
>
>         /*
>          * Add your fresh new feature above and remember to update
> @@ -124,6 +125,7 @@ enum {
>  #define NETIF_F_HW_VLAN_STAG_TX        __NETIF_F(HW_VLAN_STAG_TX)
>  #define NETIF_F_HW_L2FW_DOFFLOAD       __NETIF_F(HW_L2FW_DOFFLOAD)
>  #define NETIF_F_BUSY_POLL      __NETIF_F(BUSY_POLL)
> +#define NETIF_F_HW_SWITCH_OFFLOAD      __NETIF_F(HW_SWITCH_OFFLOAD)

Can we use existing flag NETIF_F_HW_L2FW_DOFFLOAD?  The comment above
on that flag suggests it would be perfect for what we're doing with L2
bridging: offloading the L2 forwarding path in HW.

NETIF_F_HW_L2FW_DOFFLOAD seems to be orphaned in current net-next, so
I'm not sure of it's status, but see if it makes sense to recycle it.

I could see a NETIF_F_HW_L3FW_DOFFLOAD for L3 fwd offload, down the road.


>  /* Features valid for ethtool to change */
>  /* = all defined minus driver/device-class-related */
> --
> 1.7.10.4
>

^ permalink raw reply

* [PATCH 03/12] ip_generic_getfrag, udplite_getfrag: switch to passing msghdr
From: Al Viro @ 2014-12-05  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20141205055623.GQ29748@ZenIV.linux.org.uk>

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/net/udplite.h | 3 ++-
 net/ipv4/ip_output.c  | 6 +++---
 net/ipv4/raw.c        | 2 +-
 net/ipv4/udp.c        | 4 ++--
 net/ipv6/raw.c        | 2 +-
 net/ipv6/udp.c        | 2 +-
 net/l2tp/l2tp_ip6.c   | 2 +-
 7 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/net/udplite.h b/include/net/udplite.h
index 9a28a51..d5baaba 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -19,7 +19,8 @@ extern struct udp_table		udplite_table;
 static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
 				      int len, int odd, struct sk_buff *skb)
 {
-	return memcpy_fromiovecend(to, (struct iovec *) from, offset, len);
+	struct msghdr *msg = from;
+	return memcpy_fromiovecend(to, msg->msg_iov, offset, len);
 }
 
 /* Designate sk as UDP-Lite socket */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4a929ad..cdedcf1 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -752,14 +752,14 @@ EXPORT_SYMBOL(ip_fragment);
 int
 ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb)
 {
-	struct iovec *iov = from;
+	struct msghdr *msg = from;
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
-		if (memcpy_fromiovecend(to, iov, offset, len) < 0)
+		if (memcpy_fromiovecend(to, msg->msg_iov, offset, len) < 0)
 			return -EFAULT;
 	} else {
 		__wsum csum = 0;
-		if (csum_partial_copy_fromiovecend(to, iov, offset, len, &csum) < 0)
+		if (csum_partial_copy_fromiovecend(to, msg->msg_iov, offset, len, &csum) < 0)
 			return -EFAULT;
 		skb->csum = csum_block_add(skb->csum, csum, odd);
 	}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 5c901eb..5d83bd2 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -478,7 +478,7 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd,
 
 	offset -= rfv->hlen;
 
-	return ip_generic_getfrag(rfv->msg->msg_iov, to, offset, len, odd, skb);
+	return ip_generic_getfrag(rfv->msg, to, offset, len, odd, skb);
 }
 
 static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b2d6068..0cac250 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1036,7 +1036,7 @@ back_from_confirm:
 
 	/* Lockless fast path for the non-corking case. */
 	if (!corkreq) {
-		skb = ip_make_skb(sk, fl4, getfrag, msg->msg_iov, ulen,
+		skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
 				  sizeof(struct udphdr), &ipc, &rt,
 				  msg->msg_flags);
 		err = PTR_ERR(skb);
@@ -1067,7 +1067,7 @@ back_from_confirm:
 
 do_append_data:
 	up->len += ulen;
-	err = ip_append_data(sk, fl4, getfrag, msg->msg_iov, ulen,
+	err = ip_append_data(sk, fl4, getfrag, msg, ulen,
 			     sizeof(struct udphdr), &ipc, &rt,
 			     corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 942f67b..11a9283 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -727,7 +727,7 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
 
 	offset -= rfv->hlen;
 
-	return ip_generic_getfrag(rfv->msg->msg_iov, to, offset, len, odd, skb);
+	return ip_generic_getfrag(rfv->msg, to, offset, len, odd, skb);
 }
 
 static int rawv6_sendmsg(struct kiocb *iocb, struct sock *sk,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7cfb5d74..5a164b6 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1299,7 +1299,7 @@ do_append_data:
 		dontfrag = np->dontfrag;
 	up->len += ulen;
 	getfrag  =  is_udplite ?  udplite_getfrag : ip_generic_getfrag;
-	err = ip6_append_data(sk, getfrag, msg->msg_iov, ulen,
+	err = ip6_append_data(sk, getfrag, msg, ulen,
 		sizeof(struct udphdr), hlimit, tclass, opt, &fl6,
 		(struct rt6_info *)dst,
 		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags, dontfrag);
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 2177b96..8611f1b 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -619,7 +619,7 @@ static int l2tp_ip6_sendmsg(struct kiocb *iocb, struct sock *sk,
 
 back_from_confirm:
 	lock_sock(sk);
-	err = ip6_append_data(sk, ip_generic_getfrag, msg->msg_iov,
+	err = ip6_append_data(sk, ip_generic_getfrag, msg,
 			      ulen, transhdrlen, hlimit, tclass, opt,
 			      &fl6, (struct rt6_info *)dst,
 			      msg->msg_flags, dontfrag);
-- 
2.1.3

^ permalink raw reply related

* [PATCH 12/12] bury memcpy_toiovec()
From: Al Viro @ 2014-12-05  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20141205055623.GQ29748@ZenIV.linux.org.uk>

From: Al Viro <viro@zeniv.linux.org.uk>

no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/uio.h |  1 -
 lib/iovec.c         | 25 -------------------------
 2 files changed, 26 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index bd8569a..a41e252 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -131,7 +131,6 @@ size_t csum_and_copy_to_iter(void *addr, size_t bytes, __wsum *csum, struct iov_
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
 
 int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
-int memcpy_toiovec(struct iovec *iov, unsigned char *kdata, int len);
 int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
 			int offset, int len);
 int memcpy_toiovecend(const struct iovec *v, unsigned char *kdata,
diff --git a/lib/iovec.c b/lib/iovec.c
index df3abd1..2d99cb4 100644
--- a/lib/iovec.c
+++ b/lib/iovec.c
@@ -29,31 +29,6 @@ EXPORT_SYMBOL(memcpy_fromiovec);
 
 /*
  *	Copy kernel to iovec. Returns -EFAULT on error.
- *
- *	Note: this modifies the original iovec.
- */
-
-int memcpy_toiovec(struct iovec *iov, unsigned char *kdata, int len)
-{
-	while (len > 0) {
-		if (iov->iov_len) {
-			int copy = min_t(unsigned int, iov->iov_len, len);
-			if (copy_to_user(iov->iov_base, kdata, copy))
-				return -EFAULT;
-			kdata += copy;
-			len -= copy;
-			iov->iov_len -= copy;
-			iov->iov_base += copy;
-		}
-		iov++;
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL(memcpy_toiovec);
-
-/*
- *	Copy kernel to iovec. Returns -EFAULT on error.
  */
 
 int memcpy_toiovecend(const struct iovec *iov, unsigned char *kdata,
-- 
2.1.3

^ permalink raw reply related

* [PATCH 11/12] skb_copy_datagram_iovec() can die
From: Al Viro @ 2014-12-05  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20141205055623.GQ29748@ZenIV.linux.org.uk>

From: Al Viro <viro@zeniv.linux.org.uk>

no callers other than itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/skbuff.h |  2 --
 net/core/datagram.c    | 84 --------------------------------------------------
 2 files changed, 86 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 45ac95c..f676e54 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2642,8 +2642,6 @@ struct sk_buff *skb_recv_datagram(struct sock *sk, unsigned flags, int noblock,
 				  int *err);
 unsigned int datagram_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait);
-int skb_copy_datagram_iovec(const struct sk_buff *from, int offset,
-			    struct iovec *to, int size);
 int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
 			   struct iov_iter *to, int size);
 static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 41075ed..df493d6 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -310,90 +310,6 @@ int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 EXPORT_SYMBOL(skb_kill_datagram);
 
 /**
- *	skb_copy_datagram_iovec - Copy a datagram to an iovec.
- *	@skb: buffer to copy
- *	@offset: offset in the buffer to start copying from
- *	@to: io vector to copy to
- *	@len: amount of data to copy from buffer to iovec
- *
- *	Note: the iovec is modified during the copy.
- */
-int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset,
-			    struct iovec *to, int len)
-{
-	int start = skb_headlen(skb);
-	int i, copy = start - offset;
-	struct sk_buff *frag_iter;
-
-	trace_skb_copy_datagram_iovec(skb, len);
-
-	/* Copy header. */
-	if (copy > 0) {
-		if (copy > len)
-			copy = len;
-		if (memcpy_toiovec(to, skb->data + offset, copy))
-			goto fault;
-		if ((len -= copy) == 0)
-			return 0;
-		offset += copy;
-	}
-
-	/* Copy paged appendix. Hmm... why does this look so complicated? */
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-		int end;
-		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
-		WARN_ON(start > offset + len);
-
-		end = start + skb_frag_size(frag);
-		if ((copy = end - offset) > 0) {
-			int err;
-			u8  *vaddr;
-			struct page *page = skb_frag_page(frag);
-
-			if (copy > len)
-				copy = len;
-			vaddr = kmap(page);
-			err = memcpy_toiovec(to, vaddr + frag->page_offset +
-					     offset - start, copy);
-			kunmap(page);
-			if (err)
-				goto fault;
-			if (!(len -= copy))
-				return 0;
-			offset += copy;
-		}
-		start = end;
-	}
-
-	skb_walk_frags(skb, frag_iter) {
-		int end;
-
-		WARN_ON(start > offset + len);
-
-		end = start + frag_iter->len;
-		if ((copy = end - offset) > 0) {
-			if (copy > len)
-				copy = len;
-			if (skb_copy_datagram_iovec(frag_iter,
-						    offset - start,
-						    to, copy))
-				goto fault;
-			if ((len -= copy) == 0)
-				return 0;
-			offset += copy;
-		}
-		start = end;
-	}
-	if (!len)
-		return 0;
-
-fault:
-	return -EFAULT;
-}
-EXPORT_SYMBOL(skb_copy_datagram_iovec);
-
-/**
  *	skb_copy_datagram_iter - Copy a datagram to an iovec iterator.
  *	@skb: buffer to copy
  *	@offset: offset in the buffer to start copying from
-- 
2.1.3

^ permalink raw reply related

* [PATCH 10/12] ppp_read(): switch to skb_copy_datagram_iter()
From: Al Viro @ 2014-12-05  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20141205055623.GQ29748@ZenIV.linux.org.uk>

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 drivers/net/ppp/ppp_generic.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 794a473..af034db 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -417,6 +417,7 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	ssize_t ret;
 	struct sk_buff *skb = NULL;
 	struct iovec iov;
+	struct iov_iter to;
 
 	ret = count;
 
@@ -462,7 +463,8 @@ static ssize_t ppp_read(struct file *file, char __user *buf,
 	ret = -EFAULT;
 	iov.iov_base = buf;
 	iov.iov_len = count;
-	if (skb_copy_datagram_iovec(skb, 0, &iov, skb->len))
+	iov_iter_init(&to, READ, &iov, 1, count);
+	if (skb_copy_datagram_iter(skb, 0, &to, skb->len))
 		goto outf;
 	ret = skb->len;
 
-- 
2.1.3

^ permalink raw reply related

* [PATCH 09/12] switch memcpy_to_msg() and skb_copy{,_and_csum}_datagram_msg() to primitives
From: Al Viro @ 2014-12-05  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20141205055623.GQ29748@ZenIV.linux.org.uk>

From: Al Viro <viro@zeniv.linux.org.uk>

... making both non-draining.  That means that tcp_recvmsg() becomes
non-draining.  And _that_ would break iscsit_do_rx_data() unless we
	a) make sure tcp_recvmsg() is uniformly non-draining (it is)
	b) make sure it copes with arbitrary (including shifted)
iov_iter (it does, all it uses is iov_iter primitives)
	c) make iscsit_do_rx_data() initialize ->msg_iter only once.

Fortunately, (c) is doable with minimal work and we are rid of one
the two places where kernel send/recvmsg users would be unhappy with
non-draining behaviour.

Actually, that makes all but one of ->recvmsg() instances iov_iter-clean.
The exception is skcipher_recvmsg() and it also isn't hard to convert
to primitives (iov_iter_get_pages() is needed there).  That'll wait
a bit - there's some interplay with ->sendmsg() path for that one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 drivers/target/iscsi/iscsi_target_util.c | 12 +++----
 include/linux/skbuff.h                   | 16 +++-------
 net/core/datagram.c                      | 54 +++++++++++---------------------
 3 files changed, 28 insertions(+), 54 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index ce87ce9..7c6a95b 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1326,21 +1326,19 @@ static int iscsit_do_rx_data(
 	struct iscsi_conn *conn,
 	struct iscsi_data_count *count)
 {
-	int data = count->data_length, rx_loop = 0, total_rx = 0, iov_len;
-	struct kvec *iov_p;
+	int data = count->data_length, rx_loop = 0, total_rx = 0;
 	struct msghdr msg;
 
 	if (!conn || !conn->sock || !conn->conn_ops)
 		return -1;
 
 	memset(&msg, 0, sizeof(struct msghdr));
-
-	iov_p = count->iov;
-	iov_len	= count->iov_count;
+	iov_iter_kvec(&msg.msg_iter, READ | ITER_KVEC,
+		      count->iov, count->iov_count, data);
 
 	while (total_rx < data) {
-		rx_loop = kernel_recvmsg(conn->sock, &msg, iov_p, iov_len,
-					(data - total_rx), MSG_WAITALL);
+		rx_loop = sock_recvmsg(conn->sock, &msg,
+				      (data - total_rx), MSG_WAITALL);
 		if (rx_loop <= 0) {
 			pr_debug("rx_loop: %d total_rx: %d\n",
 				rx_loop, total_rx);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4199dfa..45ac95c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2649,17 +2649,10 @@ int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
 static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
 					struct msghdr *msg, int size)
 {
-	/* XXX: stripping const */
-	return skb_copy_datagram_iovec(from, offset, (struct iovec *)msg->msg_iter.iov, size);
-}
-int skb_copy_and_csum_datagram_iovec(struct sk_buff *skb, int hlen,
-				     struct iovec *iov);
-static inline int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, int hlen,
-			    struct msghdr *msg)
-{
-	/* XXX: stripping const */
-	return skb_copy_and_csum_datagram_iovec(skb, hlen, (struct iovec *)msg->msg_iter.iov);
+	return skb_copy_datagram_iter(from, offset, &msg->msg_iter, size);
 }
+int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, int hlen,
+				   struct msghdr *msg);
 int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 				 struct iov_iter *from, int len);
 int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
@@ -2695,8 +2688,7 @@ static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
 
 static inline int memcpy_to_msg(struct msghdr *msg, void *data, int len)
 {
-	/* XXX: stripping const */
-	return memcpy_toiovec((struct iovec *)msg->msg_iter.iov, data, len);
+	return copy_to_iter(data, len, &msg->msg_iter) == len ? 0 : -EFAULT;
 }
 
 struct skb_checksum_ops {
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b6e303b..41075ed 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -615,27 +615,25 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
 static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
-				      u8 __user *to, int len,
+				      struct iov_iter *to, int len,
 				      __wsum *csump)
 {
 	int start = skb_headlen(skb);
 	int i, copy = start - offset;
 	struct sk_buff *frag_iter;
 	int pos = 0;
+	int n;
 
 	/* Copy header. */
 	if (copy > 0) {
-		int err = 0;
 		if (copy > len)
 			copy = len;
-		*csump = csum_and_copy_to_user(skb->data + offset, to, copy,
-					       *csump, &err);
-		if (err)
+		n = csum_and_copy_to_iter(skb->data + offset, copy, csump, to);
+		if (n != copy)
 			goto fault;
 		if ((len -= copy) == 0)
 			return 0;
 		offset += copy;
-		to += copy;
 		pos = copy;
 	}
 
@@ -647,26 +645,22 @@ static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
 
 		end = start + skb_frag_size(frag);
 		if ((copy = end - offset) > 0) {
-			__wsum csum2;
-			int err = 0;
-			u8  *vaddr;
+			__wsum csum2 = 0;
 			struct page *page = skb_frag_page(frag);
+			u8  *vaddr = kmap(page);
 
 			if (copy > len)
 				copy = len;
-			vaddr = kmap(page);
-			csum2 = csum_and_copy_to_user(vaddr +
-							frag->page_offset +
-							offset - start,
-						      to, copy, 0, &err);
+			n = csum_and_copy_to_iter(vaddr + frag->page_offset +
+						  offset - start, copy,
+						  &csum2, to);
 			kunmap(page);
-			if (err)
+			if (n != copy)
 				goto fault;
 			*csump = csum_block_add(*csump, csum2, pos);
 			if (!(len -= copy))
 				return 0;
 			offset += copy;
-			to += copy;
 			pos += copy;
 		}
 		start = end;
@@ -691,7 +685,6 @@ static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
 			if ((len -= copy) == 0)
 				return 0;
 			offset += copy;
-			to += copy;
 			pos += copy;
 		}
 		start = end;
@@ -744,20 +737,19 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
 EXPORT_SYMBOL(__skb_checksum_complete);
 
 /**
- *	skb_copy_and_csum_datagram_iovec - Copy and checksum skb to user iovec.
+ *	skb_copy_and_csum_datagram_msg - Copy and checksum skb to user iovec.
  *	@skb: skbuff
  *	@hlen: hardware length
- *	@iov: io vector
+ *	@msg: destination
  *
  *	Caller _must_ check that skb will fit to this iovec.
  *
  *	Returns: 0       - success.
  *		 -EINVAL - checksum failure.
- *		 -EFAULT - fault during copy. Beware, in this case iovec
- *			   can be modified!
+ *		 -EFAULT - fault during copy.
  */
-int skb_copy_and_csum_datagram_iovec(struct sk_buff *skb,
-				     int hlen, struct iovec *iov)
+int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
+				   int hlen, struct msghdr *msg)
 {
 	__wsum csum;
 	int chunk = skb->len - hlen;
@@ -765,28 +757,20 @@ int skb_copy_and_csum_datagram_iovec(struct sk_buff *skb,
 	if (!chunk)
 		return 0;
 
-	/* Skip filled elements.
-	 * Pretty silly, look at memcpy_toiovec, though 8)
-	 */
-	while (!iov->iov_len)
-		iov++;
-
-	if (iov->iov_len < chunk) {
+	if (iov_iter_count(&msg->msg_iter) < chunk) {
 		if (__skb_checksum_complete(skb))
 			goto csum_error;
-		if (skb_copy_datagram_iovec(skb, hlen, iov, chunk))
+		if (skb_copy_datagram_msg(skb, hlen, msg, chunk))
 			goto fault;
 	} else {
 		csum = csum_partial(skb->data, hlen, skb->csum);
-		if (skb_copy_and_csum_datagram(skb, hlen, iov->iov_base,
+		if (skb_copy_and_csum_datagram(skb, hlen, &msg->msg_iter,
 					       chunk, &csum))
 			goto fault;
 		if (csum_fold(csum))
 			goto csum_error;
 		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE))
 			netdev_rx_csum_fault(skb->dev);
-		iov->iov_len -= chunk;
-		iov->iov_base += chunk;
 	}
 	return 0;
 csum_error:
@@ -794,7 +778,7 @@ csum_error:
 fault:
 	return -EFAULT;
 }
-EXPORT_SYMBOL(skb_copy_and_csum_datagram_iovec);
+EXPORT_SYMBOL(skb_copy_and_csum_datagram_msg);
 
 /**
  * 	datagram_poll - generic datagram poll
-- 
2.1.3

^ permalink raw reply related

* [PATCH 08/12] first fruits - kill l2cap ->memcpy_fromiovec()
From: Al Viro @ 2014-12-05  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20141205055623.GQ29748@ZenIV.linux.org.uk>

From: Al Viro <viro@zeniv.linux.org.uk>

Just use copy_from_iter().  That's what this method is trying to do
in all cases, in a very convoluted fashion.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/net/bluetooth/l2cap.h | 29 -----------------------------
 net/bluetooth/6lowpan.c       |  3 +--
 net/bluetooth/a2mp.c          |  3 +--
 net/bluetooth/l2cap_core.c    |  7 +++----
 net/bluetooth/l2cap_sock.c    |  8 --------
 net/bluetooth/smp.c           |  4 +---
 6 files changed, 6 insertions(+), 48 deletions(-)

diff --git a/include/net/bluetooth/l2cap.h b/include/net/bluetooth/l2cap.h
index bca6fc0..692f786 100644
--- a/include/net/bluetooth/l2cap.h
+++ b/include/net/bluetooth/l2cap.h
@@ -606,10 +606,6 @@ struct l2cap_ops {
 	struct sk_buff		*(*alloc_skb) (struct l2cap_chan *chan,
 					       unsigned long hdr_len,
 					       unsigned long len, int nb);
-	int			(*memcpy_fromiovec) (struct l2cap_chan *chan,
-						     unsigned char *kdata,
-						     struct msghdr *msg,
-						     int len);
 };
 
 struct l2cap_conn {
@@ -903,31 +899,6 @@ static inline long l2cap_chan_no_get_sndtimeo(struct l2cap_chan *chan)
 	return 0;
 }
 
-static inline int l2cap_chan_no_memcpy_fromiovec(struct l2cap_chan *chan,
-						 unsigned char *kdata,
-						 struct msghdr *msg,
-						 int len)
-{
-	/* Following is safe since for compiler definitions of kvec and
-	 * iovec are identical, yielding the same in-core layout and alignment
-	 */
-	struct kvec *vec = (struct kvec *)msg->msg_iter.iov;
-
-	while (len > 0) {
-		if (vec->iov_len) {
-			int copy = min_t(unsigned int, len, vec->iov_len);
-			memcpy(kdata, vec->iov_base, copy);
-			len -= copy;
-			kdata += copy;
-			vec->iov_base += copy;
-			vec->iov_len -= copy;
-		}
-		vec++;
-	}
-
-	return 0;
-}
-
 extern bool disable_ertm;
 
 int l2cap_init_sockets(void);
diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index d8c67a5..76617be 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -541,7 +541,7 @@ static int send_pkt(struct l2cap_chan *chan, struct sk_buff *skb,
 	iv.iov_len = skb->len;
 
 	memset(&msg, 0, sizeof(msg));
-	iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *) &iv, 1, skb->len);
+	iov_iter_kvec(&msg.msg_iter, WRITE | ITER_KVEC, &iv, 1, skb->len);
 
 	err = l2cap_chan_send(chan, &msg, skb->len);
 	if (err > 0) {
@@ -1050,7 +1050,6 @@ static const struct l2cap_ops bt_6lowpan_chan_ops = {
 	.suspend		= chan_suspend_cb,
 	.get_sndtimeo		= chan_get_sndtimeo_cb,
 	.alloc_skb		= chan_alloc_skb_cb,
-	.memcpy_fromiovec	= l2cap_chan_no_memcpy_fromiovec,
 
 	.teardown		= l2cap_chan_no_teardown,
 	.defer			= l2cap_chan_no_defer,
diff --git a/net/bluetooth/a2mp.c b/net/bluetooth/a2mp.c
index 716d2a3..cedfbda 100644
--- a/net/bluetooth/a2mp.c
+++ b/net/bluetooth/a2mp.c
@@ -60,7 +60,7 @@ void a2mp_send(struct amp_mgr *mgr, u8 code, u8 ident, u16 len, void *data)
 
 	memset(&msg, 0, sizeof(msg));
 
-	iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *)&iv, 1, total_len);
+	iov_iter_kvec(&msg.msg_iter, WRITE | ITER_KVEC, &iv, 1, total_len);
 
 	l2cap_chan_send(chan, &msg, total_len);
 
@@ -719,7 +719,6 @@ static const struct l2cap_ops a2mp_chan_ops = {
 	.resume = l2cap_chan_no_resume,
 	.set_shutdown = l2cap_chan_no_set_shutdown,
 	.get_sndtimeo = l2cap_chan_no_get_sndtimeo,
-	.memcpy_fromiovec = l2cap_chan_no_memcpy_fromiovec,
 };
 
 static struct l2cap_chan *a2mp_chan_open(struct l2cap_conn *conn, bool locked)
diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
index 5201d61..1754040 100644
--- a/net/bluetooth/l2cap_core.c
+++ b/net/bluetooth/l2cap_core.c
@@ -2096,8 +2096,7 @@ static inline int l2cap_skbuff_fromiovec(struct l2cap_chan *chan,
 	struct sk_buff **frag;
 	int sent = 0;
 
-	if (chan->ops->memcpy_fromiovec(chan, skb_put(skb, count),
-					msg, count))
+	if (copy_from_iter(skb_put(skb, count), count, &msg->msg_iter) != count)
 		return -EFAULT;
 
 	sent += count;
@@ -2117,8 +2116,8 @@ static inline int l2cap_skbuff_fromiovec(struct l2cap_chan *chan,
 
 		*frag = tmp;
 
-		if (chan->ops->memcpy_fromiovec(chan, skb_put(*frag, count),
-						msg, count))
+		if (copy_from_iter(skb_put(*frag, count), count,
+				   &msg->msg_iter) != count)
 			return -EFAULT;
 
 		sent += count;
diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 205b298..f65caf4 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -1336,13 +1336,6 @@ static struct sk_buff *l2cap_sock_alloc_skb_cb(struct l2cap_chan *chan,
 	return skb;
 }
 
-static int l2cap_sock_memcpy_fromiovec_cb(struct l2cap_chan *chan,
-					  unsigned char *kdata,
-					  struct msghdr *msg, int len)
-{
-	return memcpy_from_msg(kdata, msg, len);
-}
-
 static void l2cap_sock_ready_cb(struct l2cap_chan *chan)
 {
 	struct sock *sk = chan->data;
@@ -1427,7 +1420,6 @@ static const struct l2cap_ops l2cap_chan_ops = {
 	.set_shutdown		= l2cap_sock_set_shutdown_cb,
 	.get_sndtimeo		= l2cap_sock_get_sndtimeo_cb,
 	.alloc_skb		= l2cap_sock_alloc_skb_cb,
-	.memcpy_fromiovec	= l2cap_sock_memcpy_fromiovec_cb,
 };
 
 static void l2cap_sock_destruct(struct sock *sk)
diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 21f555b..de7dc75 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -268,7 +268,7 @@ static void smp_send_cmd(struct l2cap_conn *conn, u8 code, u16 len, void *data)
 
 	memset(&msg, 0, sizeof(msg));
 
-	iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *)iv, 2, 1 + len);
+	iov_iter_kvec(&msg.msg_iter, WRITE | ITER_KVEC, iv, 2, 1 + len);
 
 	l2cap_chan_send(chan, &msg, 1 + len);
 
@@ -1629,7 +1629,6 @@ static const struct l2cap_ops smp_chan_ops = {
 	.suspend		= l2cap_chan_no_suspend,
 	.set_shutdown		= l2cap_chan_no_set_shutdown,
 	.get_sndtimeo		= l2cap_chan_no_get_sndtimeo,
-	.memcpy_fromiovec	= l2cap_chan_no_memcpy_fromiovec,
 };
 
 static inline struct l2cap_chan *smp_new_conn_cb(struct l2cap_chan *pchan)
@@ -1678,7 +1677,6 @@ static const struct l2cap_ops smp_root_chan_ops = {
 	.resume			= l2cap_chan_no_resume,
 	.set_shutdown		= l2cap_chan_no_set_shutdown,
 	.get_sndtimeo		= l2cap_chan_no_get_sndtimeo,
-	.memcpy_fromiovec	= l2cap_chan_no_memcpy_fromiovec,
 };
 
 int smp_register(struct hci_dev *hdev)
-- 
2.1.3

^ permalink raw reply related

* [PATCH 07/12] put iov_iter into msghdr
From: Al Viro @ 2014-12-05  5:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20141205055623.GQ29748@ZenIV.linux.org.uk>

From: Al Viro <viro@zeniv.linux.org.uk>

Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 crypto/algif_hash.c            |  4 ++--
 crypto/algif_skcipher.c        |  4 ++--
 drivers/net/macvtap.c          |  8 ++------
 drivers/net/tun.c              |  8 ++------
 drivers/vhost/net.c            |  8 +++-----
 fs/afs/rxrpc.c                 | 14 ++++++--------
 include/linux/skbuff.h         | 16 ++++++++++------
 include/linux/socket.h         |  3 +--
 include/net/bluetooth/l2cap.h  |  2 +-
 include/net/udplite.h          |  3 ++-
 net/atm/common.c               |  5 +----
 net/bluetooth/6lowpan.c        |  6 +++---
 net/bluetooth/a2mp.c           |  3 +--
 net/bluetooth/smp.c            |  3 +--
 net/caif/caif_socket.c         |  2 +-
 net/compat.c                   | 10 ++++++----
 net/ipv4/ip_output.c           |  6 ++++--
 net/ipv4/ping.c                |  3 ++-
 net/ipv4/raw.c                 |  3 ++-
 net/ipv4/tcp.c                 |  6 +++---
 net/ipv4/tcp_output.c          |  2 +-
 net/ipv6/ping.c                |  3 ++-
 net/ipv6/raw.c                 |  3 ++-
 net/netlink/af_netlink.c       |  2 +-
 net/packet/af_packet.c         |  7 ++-----
 net/rds/recv.c                 |  7 ++++---
 net/rds/send.c                 |  4 +---
 net/rxrpc/ar-output.c          |  8 +++-----
 net/sctp/socket.c              |  5 +----
 net/socket.c                   | 27 ++++++++++++---------------
 net/tipc/msg.c                 |  4 ++--
 net/tipc/socket.c              |  4 ++--
 net/unix/af_unix.c             | 10 ++--------
 net/vmw_vsock/vmci_transport.c |  3 ++-
 34 files changed, 92 insertions(+), 114 deletions(-)

diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 35c93ff..83cd2cc 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -42,7 +42,7 @@ static int hash_sendmsg(struct kiocb *unused, struct socket *sock,
 	struct alg_sock *ask = alg_sk(sk);
 	struct hash_ctx *ctx = ask->private;
 	unsigned long iovlen;
-	struct iovec *iov;
+	const struct iovec *iov;
 	long copied = 0;
 	int err;
 
@@ -58,7 +58,7 @@ static int hash_sendmsg(struct kiocb *unused, struct socket *sock,
 
 	ctx->more = 0;
 
-	for (iov = msg->msg_iov, iovlen = msg->msg_iovlen; iovlen > 0;
+	for (iov = msg->msg_iter.iov, iovlen = msg->msg_iter.nr_segs; iovlen > 0;
 	     iovlen--, iov++) {
 		unsigned long seglen = iov->iov_len;
 		char __user *from = iov->iov_base;
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index c3b482b..4f45dab 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -429,13 +429,13 @@ static int skcipher_recvmsg(struct kiocb *unused, struct socket *sock,
 	struct skcipher_sg_list *sgl;
 	struct scatterlist *sg;
 	unsigned long iovlen;
-	struct iovec *iov;
+	const struct iovec *iov;
 	int err = -EAGAIN;
 	int used;
 	long copied = 0;
 
 	lock_sock(sk);
-	for (iov = msg->msg_iov, iovlen = msg->msg_iovlen; iovlen > 0;
+	for (iov = msg->msg_iter.iov, iovlen = msg->msg_iter.nr_segs; iovlen > 0;
 	     iovlen--, iov++) {
 		unsigned long seglen = iov->iov_len;
 		char __user *from = iov->iov_base;
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 22b4cf2..4fb1222 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -1092,9 +1092,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
 			   struct msghdr *m, size_t total_len)
 {
 	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
-	struct iov_iter from;
-	iov_iter_init(&from, WRITE, m->msg_iov, m->msg_iovlen, total_len);
-	return macvtap_get_user(q, m, &from, m->msg_flags & MSG_DONTWAIT);
+	return macvtap_get_user(q, m, &m->msg_iter, m->msg_flags & MSG_DONTWAIT);
 }
 
 static int macvtap_recvmsg(struct kiocb *iocb, struct socket *sock,
@@ -1102,12 +1100,10 @@ static int macvtap_recvmsg(struct kiocb *iocb, struct socket *sock,
 			   int flags)
 {
 	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
-	struct iov_iter to;
 	int ret;
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
 		return -EINVAL;
-	iov_iter_init(&to, READ, m->msg_iov, m->msg_iovlen, total_len);
-	ret = macvtap_do_read(q, &to, flags & MSG_DONTWAIT);
+	ret = macvtap_do_read(q, &m->msg_iter, flags & MSG_DONTWAIT);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6d44da1..8d8bede 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1446,13 +1446,11 @@ static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
 	int ret;
 	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
 	struct tun_struct *tun = __tun_get(tfile);
-	struct iov_iter from;
 
 	if (!tun)
 		return -EBADFD;
 
-	iov_iter_init(&from, WRITE, m->msg_iov, m->msg_iovlen, total_len);
-	ret = tun_get_user(tun, tfile, m->msg_control, &from,
+	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
 			   m->msg_flags & MSG_DONTWAIT);
 	tun_put(tun);
 	return ret;
@@ -1464,7 +1462,6 @@ static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
 {
 	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
 	struct tun_struct *tun = __tun_get(tfile);
-	struct iov_iter to;
 	int ret;
 
 	if (!tun)
@@ -1479,8 +1476,7 @@ static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
 					 SOL_PACKET, TUN_TX_TIMESTAMP);
 		goto out;
 	}
-	iov_iter_init(&to, READ, m->msg_iov, m->msg_iovlen, total_len);
-	ret = tun_do_read(tun, tfile, &to, flags & MSG_DONTWAIT);
+	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT);
 	if (ret > total_len) {
 		m->msg_flags |= MSG_TRUNC;
 		ret = flags & MSG_TRUNC ? ret : total_len;
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..9f06e70 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -342,7 +342,6 @@ static void handle_tx(struct vhost_net *net)
 		.msg_namelen = 0,
 		.msg_control = NULL,
 		.msg_controllen = 0,
-		.msg_iov = vq->iov,
 		.msg_flags = MSG_DONTWAIT,
 	};
 	size_t len, total_len = 0;
@@ -396,8 +395,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 		/* Skip header. TODO: support TSO. */
 		s = move_iovec_hdr(vq->iov, nvq->hdr, hdr_size, out);
-		msg.msg_iovlen = out;
 		len = iov_length(vq->iov, out);
+		iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for TX: "
@@ -562,7 +561,6 @@ static void handle_rx(struct vhost_net *net)
 		.msg_namelen = 0,
 		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
 		.msg_controllen = 0,
-		.msg_iov = vq->iov,
 		.msg_flags = MSG_DONTWAIT,
 	};
 	struct virtio_net_hdr_mrg_rxbuf hdr = {
@@ -600,7 +598,7 @@ static void handle_rx(struct vhost_net *net)
 			break;
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
-			msg.msg_iovlen = 1;
+			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
 			err = sock->ops->recvmsg(NULL, sock, &msg,
 						 1, MSG_DONTWAIT | MSG_TRUNC);
 			pr_debug("Discarded rx packet: len %zd\n", sock_len);
@@ -626,7 +624,7 @@ static void handle_rx(struct vhost_net *net)
 			/* Copy the header for use in VIRTIO_NET_F_MRG_RXBUF:
 			 * needed because recvmsg can modify msg_iov. */
 			copy_iovec_hdr(vq->iov, nvq->hdr, sock_hlen, in);
-		msg.msg_iovlen = in;
+		iov_iter_init(&msg.msg_iter, READ, vq->iov, in, sock_len);
 		err = sock->ops->recvmsg(NULL, sock, &msg,
 					 sock_len, MSG_DONTWAIT | MSG_TRUNC);
 		/* Userspace might have consumed the packet meanwhile:
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 03a3beb..06e14bf 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -306,8 +306,8 @@ static int afs_send_pages(struct afs_call *call, struct msghdr *msg,
 
 			_debug("- range %u-%u%s",
 			       offset, to, msg->msg_flags ? " [more]" : "");
-			msg->msg_iov = (struct iovec *) iov;
-			msg->msg_iovlen = 1;
+			iov_iter_init(&msg->msg_iter, WRITE,
+				      (struct iovec *) iov, 1, to - offset);
 
 			/* have to change the state *before* sending the last
 			 * packet as RxRPC might give us the reply before it
@@ -384,8 +384,8 @@ int afs_make_call(struct in_addr *addr, struct afs_call *call, gfp_t gfp,
 
 	msg.msg_name		= NULL;
 	msg.msg_namelen		= 0;
-	msg.msg_iov		= (struct iovec *) iov;
-	msg.msg_iovlen		= 1;
+	iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *)iov, 1,
+		      call->request_size);
 	msg.msg_control		= NULL;
 	msg.msg_controllen	= 0;
 	msg.msg_flags		= (call->send_pages ? MSG_MORE : 0);
@@ -778,8 +778,7 @@ void afs_send_empty_reply(struct afs_call *call)
 	iov[0].iov_len		= 0;
 	msg.msg_name		= NULL;
 	msg.msg_namelen		= 0;
-	msg.msg_iov		= iov;
-	msg.msg_iovlen		= 0;
+	iov_iter_init(&msg.msg_iter, WRITE, iov, 0, 0);	/* WTF? */
 	msg.msg_control		= NULL;
 	msg.msg_controllen	= 0;
 	msg.msg_flags		= 0;
@@ -815,8 +814,7 @@ void afs_send_simple_reply(struct afs_call *call, const void *buf, size_t len)
 	iov[0].iov_len		= len;
 	msg.msg_name		= NULL;
 	msg.msg_namelen		= 0;
-	msg.msg_iov		= iov;
-	msg.msg_iovlen		= 1;
+	iov_iter_init(&msg.msg_iter, WRITE, iov, 1, len);
 	msg.msg_control		= NULL;
 	msg.msg_controllen	= 0;
 	msg.msg_flags		= 0;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7691ad5..4199dfa 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2644,22 +2644,24 @@ unsigned int datagram_poll(struct file *file, struct socket *sock,
 			   struct poll_table_struct *wait);
 int skb_copy_datagram_iovec(const struct sk_buff *from, int offset,
 			    struct iovec *to, int size);
+int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
+			   struct iov_iter *to, int size);
 static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset,
 					struct msghdr *msg, int size)
 {
-	return skb_copy_datagram_iovec(from, offset, msg->msg_iov, size);
+	/* XXX: stripping const */
+	return skb_copy_datagram_iovec(from, offset, (struct iovec *)msg->msg_iter.iov, size);
 }
 int skb_copy_and_csum_datagram_iovec(struct sk_buff *skb, int hlen,
 				     struct iovec *iov);
 static inline int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, int hlen,
 			    struct msghdr *msg)
 {
-	return skb_copy_and_csum_datagram_iovec(skb, hlen, msg->msg_iov);
+	/* XXX: stripping const */
+	return skb_copy_and_csum_datagram_iovec(skb, hlen, (struct iovec *)msg->msg_iter.iov);
 }
 int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 				 struct iov_iter *from, int len);
-int skb_copy_datagram_iter(const struct sk_buff *from, int offset,
-			   struct iov_iter *to, int size);
 int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
@@ -2687,12 +2689,14 @@ int skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci);
 
 static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
 {
-	return memcpy_fromiovec(data, msg->msg_iov, len);
+	/* XXX: stripping const */
+	return memcpy_fromiovec(data, (struct iovec *)msg->msg_iter.iov, len);
 }
 
 static inline int memcpy_to_msg(struct msghdr *msg, void *data, int len)
 {
-	return memcpy_toiovec(msg->msg_iov, data, len);
+	/* XXX: stripping const */
+	return memcpy_toiovec((struct iovec *)msg->msg_iter.iov, data, len);
 }
 
 struct skb_checksum_ops {
diff --git a/include/linux/socket.h b/include/linux/socket.h
index de52228..048d6d6 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -47,8 +47,7 @@ struct linger {
 struct msghdr {
 	void		*msg_name;	/* ptr to socket address structure */
 	int		msg_namelen;	/* size of socket address structure */
-	struct iovec	*msg_iov;	/* scatter/gather array */
-	__kernel_size_t	msg_iovlen;	/* # elements in msg_iov */
+	struct iov_iter	msg_iter;	/* data */
 	void		*msg_control;	/* ancillary data */
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	unsigned int	msg_flags;	/* flags on received message */
diff --git a/include/net/bluetooth/l2cap.h b/include/net/bluetooth/l2cap.h
index 4e23674..bca6fc0 100644
--- a/include/net/bluetooth/l2cap.h
+++ b/include/net/bluetooth/l2cap.h
@@ -911,7 +911,7 @@ static inline int l2cap_chan_no_memcpy_fromiovec(struct l2cap_chan *chan,
 	/* Following is safe since for compiler definitions of kvec and
 	 * iovec are identical, yielding the same in-core layout and alignment
 	 */
-	struct kvec *vec = (struct kvec *)msg->msg_iov;
+	struct kvec *vec = (struct kvec *)msg->msg_iter.iov;
 
 	while (len > 0) {
 		if (vec->iov_len) {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index d5baaba..ae7c8d1 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -20,7 +20,8 @@ static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
 				      int len, int odd, struct sk_buff *skb)
 {
 	struct msghdr *msg = from;
-	return memcpy_fromiovecend(to, msg->msg_iov, offset, len);
+	/* XXX: stripping const */
+	return memcpy_fromiovecend(to, (struct iovec *)msg->msg_iter.iov, offset, len);
 }
 
 /* Designate sk as UDP-Lite socket */
diff --git a/net/atm/common.c b/net/atm/common.c
index f591129..b84057e 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -577,9 +577,6 @@ int vcc_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *m,
 	struct atm_vcc *vcc;
 	struct sk_buff *skb;
 	int eff, error;
-	struct iov_iter from;
-
-	iov_iter_init(&from, WRITE, m->msg_iov, m->msg_iovlen, size);
 
 	lock_sock(sk);
 	if (sock->state != SS_CONNECTED) {
@@ -634,7 +631,7 @@ int vcc_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *m,
 		goto out;
 	skb->dev = NULL; /* for paths shared with net_device interfaces */
 	ATM_SKB(skb)->atm_options = vcc->atm_options;
-	if (copy_from_iter(skb_put(skb, size), size, &from) != size) {
+	if (copy_from_iter(skb_put(skb, size), size, &m->msg_iter) != size) {
 		kfree_skb(skb);
 		error = -EFAULT;
 		goto out;
diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index bdcaefd..d8c67a5 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -537,12 +537,12 @@ static int send_pkt(struct l2cap_chan *chan, struct sk_buff *skb,
 	 */
 	chan->data = skb;
 
-	memset(&msg, 0, sizeof(msg));
-	msg.msg_iov = (struct iovec *) &iv;
-	msg.msg_iovlen = 1;
 	iv.iov_base = skb->data;
 	iv.iov_len = skb->len;
 
+	memset(&msg, 0, sizeof(msg));
+	iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *) &iv, 1, skb->len);
+
 	err = l2cap_chan_send(chan, &msg, skb->len);
 	if (err > 0) {
 		netdev->stats.tx_bytes += err;
diff --git a/net/bluetooth/a2mp.c b/net/bluetooth/a2mp.c
index 5dcade5..716d2a3 100644
--- a/net/bluetooth/a2mp.c
+++ b/net/bluetooth/a2mp.c
@@ -60,8 +60,7 @@ void a2mp_send(struct amp_mgr *mgr, u8 code, u8 ident, u16 len, void *data)
 
 	memset(&msg, 0, sizeof(msg));
 
-	msg.msg_iov = (struct iovec *) &iv;
-	msg.msg_iovlen = 1;
+	iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *)&iv, 1, total_len);
 
 	l2cap_chan_send(chan, &msg, total_len);
 
diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
index 069b76e..21f555b 100644
--- a/net/bluetooth/smp.c
+++ b/net/bluetooth/smp.c
@@ -268,8 +268,7 @@ static void smp_send_cmd(struct l2cap_conn *conn, u8 code, u16 len, void *data)
 
 	memset(&msg, 0, sizeof(msg));
 
-	msg.msg_iov = (struct iovec *) &iv;
-	msg.msg_iovlen = 2;
+	iov_iter_init(&msg.msg_iter, WRITE, (struct iovec *)iv, 2, 1 + len);
 
 	l2cap_chan_send(chan, &msg, 1 + len);
 
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index ac618b0..769b185 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -535,7 +535,7 @@ static int caif_seqpkt_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		goto err;
 
 	ret = -EINVAL;
-	if (unlikely(msg->msg_iov->iov_base == NULL))
+	if (unlikely(msg->msg_iter.iov->iov_base == NULL))
 		goto err;
 	noblock = msg->msg_flags & MSG_DONTWAIT;
 
diff --git a/net/compat.c b/net/compat.c
index 062f157..3236b41 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -37,13 +37,14 @@ ssize_t get_compat_msghdr(struct msghdr *kmsg,
 			  struct iovec **iov)
 {
 	compat_uptr_t uaddr, uiov, tmp3;
+	compat_size_t nr_segs;
 	ssize_t err;
 
 	if (!access_ok(VERIFY_READ, umsg, sizeof(*umsg)) ||
 	    __get_user(uaddr, &umsg->msg_name) ||
 	    __get_user(kmsg->msg_namelen, &umsg->msg_namelen) ||
 	    __get_user(uiov, &umsg->msg_iov) ||
-	    __get_user(kmsg->msg_iovlen, &umsg->msg_iovlen) ||
+	    __get_user(nr_segs, &umsg->msg_iovlen) ||
 	    __get_user(tmp3, &umsg->msg_control) ||
 	    __get_user(kmsg->msg_controllen, &umsg->msg_controllen) ||
 	    __get_user(kmsg->msg_flags, &umsg->msg_flags))
@@ -68,14 +69,15 @@ ssize_t get_compat_msghdr(struct msghdr *kmsg,
 		kmsg->msg_namelen = 0;
 	}
 
-	if (kmsg->msg_iovlen > UIO_MAXIOV)
+	if (nr_segs > UIO_MAXIOV)
 		return -EMSGSIZE;
 
 	err = compat_rw_copy_check_uvector(save_addr ? READ : WRITE,
-					   compat_ptr(uiov), kmsg->msg_iovlen,
+					   compat_ptr(uiov), nr_segs,
 					   UIO_FASTIOV, *iov, iov);
 	if (err >= 0)
-		kmsg->msg_iov = *iov;
+		iov_iter_init(&kmsg->msg_iter, save_addr ? READ : WRITE,
+			      *iov, nr_segs, err);
 	return err;
 }
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cdedcf1..b50861b2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -755,11 +755,13 @@ ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk
 	struct msghdr *msg = from;
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
-		if (memcpy_fromiovecend(to, msg->msg_iov, offset, len) < 0)
+		/* XXX: stripping const */
+		if (memcpy_fromiovecend(to, (struct iovec *)msg->msg_iter.iov, offset, len) < 0)
 			return -EFAULT;
 	} else {
 		__wsum csum = 0;
-		if (csum_partial_copy_fromiovecend(to, msg->msg_iov, offset, len, &csum) < 0)
+		/* XXX: stripping const */
+		if (csum_partial_copy_fromiovecend(to, (struct iovec *)msg->msg_iter.iov, offset, len, &csum) < 0)
 			return -EFAULT;
 		skb->csum = csum_block_add(skb->csum, csum, odd);
 	}
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 8dd4ae0..c0d82f7 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -811,7 +811,8 @@ back_from_confirm:
 	pfh.icmph.checksum = 0;
 	pfh.icmph.un.echo.id = inet->inet_sport;
 	pfh.icmph.un.echo.sequence = user_icmph.un.echo.sequence;
-	pfh.iov = msg->msg_iov;
+	/* XXX: stripping const */
+	pfh.iov = (struct iovec *)msg->msg_iter.iov;
 	pfh.wcheck = 0;
 	pfh.family = AF_INET;
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 5d83bd2..0bb68df 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -625,7 +625,8 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 back_from_confirm:
 
 	if (inet->hdrincl)
-		err = raw_send_hdrinc(sk, &fl4, msg->msg_iov, len,
+		/* XXX: stripping const */
+		err = raw_send_hdrinc(sk, &fl4, (struct iovec *)msg->msg_iter.iov, len,
 				      &rt, msg->msg_flags);
 
 	 else {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4a96f37..54ba620 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1085,7 +1085,7 @@ static int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
 int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t size)
 {
-	struct iovec *iov;
+	const struct iovec *iov;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	int iovlen, flags, err, copied = 0;
@@ -1136,8 +1136,8 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	mss_now = tcp_send_mss(sk, &size_goal, flags);
 
 	/* Ok commence sending. */
-	iovlen = msg->msg_iovlen;
-	iov = msg->msg_iov;
+	iovlen = msg->msg_iter.nr_segs;
+	iov = msg->msg_iter.iov;
 	copied = 0;
 
 	err = -EPIPE;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd..3e225b0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3050,7 +3050,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 	syn_data->ip_summed = CHECKSUM_PARTIAL;
 	memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));
 	if (unlikely(memcpy_fromiovecend(skb_put(syn_data, space),
-					 fo->data->msg_iov, 0, space))) {
+					 fo->data->msg_iter.iov, 0, space))) {
 		kfree_skb(syn_data);
 		goto fallback;
 	}
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index 5b7a1ed..2d31483 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -163,7 +163,8 @@ int ping_v6_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	pfh.icmph.checksum = 0;
 	pfh.icmph.un.echo.id = inet->inet_sport;
 	pfh.icmph.un.echo.sequence = user_icmph.icmp6_sequence;
-	pfh.iov = msg->msg_iov;
+	/* XXX: stripping const */
+	pfh.iov = (struct iovec *)msg->msg_iter.iov;
 	pfh.wcheck = 0;
 	pfh.family = AF_INET6;
 
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 11a9283..ee25631 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -886,7 +886,8 @@ static int rawv6_sendmsg(struct kiocb *iocb, struct sock *sk,
 
 back_from_confirm:
 	if (inet->hdrincl)
-		err = rawv6_send_hdrinc(sk, msg->msg_iov, len, &fl6, &dst, msg->msg_flags);
+		/* XXX: stripping const */
+		err = rawv6_send_hdrinc(sk, (struct iovec *)msg->msg_iter.iov, len, &fl6, &dst, msg->msg_flags);
 	else {
 		lock_sock(sk);
 		err = ip6_append_data(sk, raw6_getfrag, &rfv,
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 63aa5c8..cc9bcf0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2305,7 +2305,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	}
 
 	if (netlink_tx_is_mmaped(sk) &&
-	    msg->msg_iov->iov_base == NULL) {
+	    msg->msg_iter.iov->iov_base == NULL) {
 		err = netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group,
 					   siocb);
 		goto out;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index efa8445..ed2e620 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2408,11 +2408,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	unsigned short gso_type = 0;
 	int hlen, tlen;
 	int extra_len = 0;
-	struct iov_iter from;
 	ssize_t n;
 
-	iov_iter_init(&from, WRITE, msg->msg_iov, msg->msg_iovlen, len);
-
 	/*
 	 *	Get and verify the address.
 	 */
@@ -2451,7 +2448,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		len -= vnet_hdr_len;
 
 		err = -EFAULT;
-		n = copy_from_iter(&vnet_hdr, vnet_hdr_len, &from);
+		n = copy_from_iter(&vnet_hdr, vnet_hdr_len, &msg->msg_iter);
 		if (n != vnet_hdr_len)
 			goto out_unlock;
 
@@ -2522,7 +2519,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	/* Returns -EFAULT on error */
-	err = skb_copy_datagram_from_iter(skb, offset, &from, len);
+	err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter, len);
 	if (err)
 		goto out_free;
 
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 47d7b10..f9ec1ac 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -404,7 +404,6 @@ int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 	int ret = 0, nonblock = msg_flags & MSG_DONTWAIT;
 	DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg->msg_name);
 	struct rds_incoming *inc = NULL;
-	struct iov_iter to;
 
 	/* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */
 	timeo = sock_rcvtimeo(sk, nonblock);
@@ -415,6 +414,7 @@ int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 		goto out;
 
 	while (1) {
+		struct iov_iter save;
 		/* If there are pending notifications, do those - and nothing else */
 		if (!list_empty(&rs->rs_notify_queue)) {
 			ret = rds_notify_queue_get(rs, msg);
@@ -450,8 +450,8 @@ int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 		rdsdebug("copying inc %p from %pI4:%u to user\n", inc,
 			 &inc->i_conn->c_faddr,
 			 ntohs(inc->i_hdr.h_sport));
-		iov_iter_init(&to, READ, msg->msg_iov, msg->msg_iovlen, size);
-		ret = inc->i_conn->c_trans->inc_copy_to_user(inc, &to);
+		save = msg->msg_iter;
+		ret = inc->i_conn->c_trans->inc_copy_to_user(inc, &msg->msg_iter);
 		if (ret < 0)
 			break;
 
@@ -464,6 +464,7 @@ int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 			rds_inc_put(inc);
 			inc = NULL;
 			rds_stats_inc(s_recv_deliver_raced);
+			msg->msg_iter = save;
 			continue;
 		}
 
diff --git a/net/rds/send.c b/net/rds/send.c
index 4de62ea..40a5629a 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -934,9 +934,7 @@ int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 	int queued = 0, allocated_mr = 0;
 	int nonblock = msg->msg_flags & MSG_DONTWAIT;
 	long timeo = sock_sndtimeo(sk, nonblock);
-	struct iov_iter from;
 
-	iov_iter_init(&from, WRITE, msg->msg_iov, msg->msg_iovlen, payload_len);
 	/* Mirror Linux UDP mirror of BSD error message compatibility */
 	/* XXX: Perhaps MSG_MORE someday */
 	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) {
@@ -984,7 +982,7 @@ int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 			ret = -ENOMEM;
 			goto out;
 		}
-		ret = rds_message_copy_from_user(rm, &from);
+		ret = rds_message_copy_from_user(rm, &msg->msg_iter);
 		if (ret)
 			goto out;
 	}
diff --git a/net/rxrpc/ar-output.c b/net/rxrpc/ar-output.c
index 0b4b9a7..86e0f10 100644
--- a/net/rxrpc/ar-output.c
+++ b/net/rxrpc/ar-output.c
@@ -531,14 +531,12 @@ static int rxrpc_send_data(struct kiocb *iocb,
 	struct rxrpc_skb_priv *sp;
 	unsigned char __user *from;
 	struct sk_buff *skb;
-	struct iovec *iov;
+	const struct iovec *iov;
 	struct sock *sk = &rx->sk;
 	long timeo;
 	bool more;
 	int ret, ioc, segment, copied;
 
-	_enter(",,,{%zu},%zu", msg->msg_iovlen, len);
-
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
 
 	/* this should be in poll */
@@ -547,8 +545,8 @@ static int rxrpc_send_data(struct kiocb *iocb,
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		return -EPIPE;
 
-	iov = msg->msg_iov;
-	ioc = msg->msg_iovlen - 1;
+	iov = msg->msg_iter.iov;
+	ioc = msg->msg_iter.nr_segs - 1;
 	from = iov->iov_base;
 	segment = iov->iov_len;
 	iov++;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 0397ac9..c92f96cd 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1609,9 +1609,6 @@ static int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
 	__u16 sinfo_flags = 0;
 	long timeo;
 	int err;
-	struct iov_iter from;
-
-	iov_iter_init(&from, WRITE, msg->msg_iov, msg->msg_iovlen, msg_len);
 
 	err = 0;
 	sp = sctp_sk(sk);
@@ -1950,7 +1947,7 @@ static int sctp_sendmsg(struct kiocb *iocb, struct sock *sk,
 	}
 
 	/* Break the message into multiple chunks of maximum size. */
-	datamsg = sctp_datamsg_from_user(asoc, sinfo, &from);
+	datamsg = sctp_datamsg_from_user(asoc, sinfo, &msg->msg_iter);
 	if (IS_ERR(datamsg)) {
 		err = PTR_ERR(datamsg);
 		goto out_free;
diff --git a/net/socket.c b/net/socket.c
index ee3ee39..46571ee 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -691,8 +691,7 @@ int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
 	 * the following is safe, since for compiler definitions of kvec and
 	 * iovec are identical, yielding the same in-core layout and alignment
 	 */
-	msg->msg_iov = (struct iovec *)vec;
-	msg->msg_iovlen = num;
+	iov_iter_init(&msg->msg_iter, WRITE, (struct iovec *)vec, num, size);
 	result = sock_sendmsg(sock, msg, size);
 	set_fs(oldfs);
 	return result;
@@ -855,7 +854,7 @@ int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
 	 * the following is safe, since for compiler definitions of kvec and
 	 * iovec are identical, yielding the same in-core layout and alignment
 	 */
-	msg->msg_iov = (struct iovec *)vec, msg->msg_iovlen = num;
+	iov_iter_init(&msg->msg_iter, READ, (struct iovec *)vec, num, size);
 	result = sock_recvmsg(sock, msg, size, flags);
 	set_fs(oldfs);
 	return result;
@@ -915,8 +914,7 @@ static ssize_t do_sock_read(struct msghdr *msg, struct kiocb *iocb,
 	msg->msg_namelen = 0;
 	msg->msg_control = NULL;
 	msg->msg_controllen = 0;
-	msg->msg_iov = (struct iovec *)iov;
-	msg->msg_iovlen = nr_segs;
+	iov_iter_init(&msg->msg_iter, READ, iov, nr_segs, size);
 	msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
 
 	return __sock_recvmsg(iocb, sock, msg, size, msg->msg_flags);
@@ -955,8 +953,7 @@ static ssize_t do_sock_write(struct msghdr *msg, struct kiocb *iocb,
 	msg->msg_namelen = 0;
 	msg->msg_control = NULL;
 	msg->msg_controllen = 0;
-	msg->msg_iov = (struct iovec *)iov;
-	msg->msg_iovlen = nr_segs;
+	iov_iter_init(&msg->msg_iter, WRITE, iov, nr_segs, size);
 	msg->msg_flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
 	if (sock->type == SOCK_SEQPACKET)
 		msg->msg_flags |= MSG_EOR;
@@ -1800,8 +1797,7 @@ SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
 	iov.iov_base = buff;
 	iov.iov_len = len;
 	msg.msg_name = NULL;
-	msg.msg_iov = &iov;
-	msg.msg_iovlen = 1;
+	iov_iter_init(&msg.msg_iter, WRITE, &iov, 1, len);
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
@@ -1858,10 +1854,9 @@ SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
 
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
-	msg.msg_iovlen = 1;
-	msg.msg_iov = &iov;
 	iov.iov_len = size;
 	iov.iov_base = ubuf;
+	iov_iter_init(&msg.msg_iter, READ, &iov, 1, size);
 	/* Save some cycles and don't copy the address if not needed */
 	msg.msg_name = addr ? (struct sockaddr *)&address : NULL;
 	/* We assume all kernel code knows the size of sockaddr_storage */
@@ -1995,13 +1990,14 @@ static ssize_t copy_msghdr_from_user(struct msghdr *kmsg,
 {
 	struct sockaddr __user *uaddr;
 	struct iovec __user *uiov;
+	size_t nr_segs;
 	ssize_t err;
 
 	if (!access_ok(VERIFY_READ, umsg, sizeof(*umsg)) ||
 	    __get_user(uaddr, &umsg->msg_name) ||
 	    __get_user(kmsg->msg_namelen, &umsg->msg_namelen) ||
 	    __get_user(uiov, &umsg->msg_iov) ||
-	    __get_user(kmsg->msg_iovlen, &umsg->msg_iovlen) ||
+	    __get_user(nr_segs, &umsg->msg_iovlen) ||
 	    __get_user(kmsg->msg_control, &umsg->msg_control) ||
 	    __get_user(kmsg->msg_controllen, &umsg->msg_controllen) ||
 	    __get_user(kmsg->msg_flags, &umsg->msg_flags))
@@ -2031,14 +2027,15 @@ static ssize_t copy_msghdr_from_user(struct msghdr *kmsg,
 		kmsg->msg_namelen = 0;
 	}
 
-	if (kmsg->msg_iovlen > UIO_MAXIOV)
+	if (nr_segs > UIO_MAXIOV)
 		return -EMSGSIZE;
 
 	err = rw_copy_check_uvector(save_addr ? READ : WRITE,
-				    uiov, kmsg->msg_iovlen,
+				    uiov, nr_segs,
 				    UIO_FASTIOV, *iov, iov);
 	if (err >= 0)
-		kmsg->msg_iov = *iov;
+		iov_iter_init(&kmsg->msg_iter, save_addr ? READ : WRITE,
+			      *iov, nr_segs, err);
 	return err;
 }
 
diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 5b06597..a687b30 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -194,7 +194,7 @@ int tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, int offset,
 		__skb_queue_tail(list, skb);
 		skb_copy_to_linear_data(skb, mhdr, mhsz);
 		pktpos = skb->data + mhsz;
-		if (!dsz || !memcpy_fromiovecend(pktpos, m->msg_iov, offset,
+		if (!dsz || !memcpy_fromiovecend(pktpos, m->msg_iter.iov, offset,
 						 dsz))
 			return dsz;
 		rc = -EFAULT;
@@ -224,7 +224,7 @@ int tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, int offset,
 		if (drem < pktrem)
 			pktrem = drem;
 
-		if (memcpy_fromiovecend(pktpos, m->msg_iov, offset, pktrem)) {
+		if (memcpy_fromiovecend(pktpos, m->msg_iter.iov, offset, pktrem)) {
 			rc = -EFAULT;
 			goto error;
 		}
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 9658d9b..4e78a7d 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -850,9 +850,9 @@ static int dest_name_check(struct sockaddr_tipc *dest, struct msghdr *m)
 	if (likely(dest->addr.name.name.type != TIPC_CFG_SRV))
 		return -EACCES;
 
-	if (!m->msg_iovlen || (m->msg_iov[0].iov_len < sizeof(hdr)))
+	if (!m->msg_iter.nr_segs || (m->msg_iter.iov[0].iov_len < sizeof(hdr)))
 		return -EMSGSIZE;
-	if (copy_from_user(&hdr, m->msg_iov[0].iov_base, sizeof(hdr)))
+	if (copy_from_user(&hdr, m->msg_iter.iov[0].iov_base, sizeof(hdr)))
 		return -EFAULT;
 	if ((ntohs(hdr.tcm_type) & 0xC000) && (!capable(CAP_NET_ADMIN)))
 		return -EACCES;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4450d62..8e1b102 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1459,9 +1459,6 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	struct scm_cookie tmp_scm;
 	int max_level;
 	int data_len = 0;
-	struct iov_iter from;
-
-	iov_iter_init(&from, WRITE, msg->msg_iov, msg->msg_iovlen, len);
 
 	if (NULL == siocb->scm)
 		siocb->scm = &tmp_scm;
@@ -1519,7 +1516,7 @@ static int unix_dgram_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	skb_put(skb, len - data_len);
 	skb->data_len = data_len;
 	skb->len = len;
-	err = skb_copy_datagram_from_iter(skb, 0, &from, len);
+	err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, len);
 	if (err)
 		goto out_free;
 
@@ -1641,9 +1638,6 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 	bool fds_sent = false;
 	int max_level;
 	int data_len;
-	struct iov_iter from;
-
-	iov_iter_init(&from, WRITE, msg->msg_iov, msg->msg_iovlen, len);
 
 	if (NULL == siocb->scm)
 		siocb->scm = &tmp_scm;
@@ -1700,7 +1694,7 @@ static int unix_stream_sendmsg(struct kiocb *kiocb, struct socket *sock,
 		skb_put(skb, size - data_len);
 		skb->data_len = data_len;
 		skb->len = size;
-		err = skb_copy_datagram_from_iter(skb, 0, &from, size);
+		err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
 		if (err) {
 			kfree_skb(skb);
 			goto out_err;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 20a0ba3..02d2e52 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1850,7 +1850,8 @@ static ssize_t vmci_transport_stream_enqueue(
 	struct msghdr *msg,
 	size_t len)
 {
-	return vmci_qpair_enquev(vmci_trans(vsk)->qpair, msg->msg_iov, len, 0);
+	/* XXX: stripping const */
+	return vmci_qpair_enquev(vmci_trans(vsk)->qpair, (struct iovec *)msg->msg_iter.iov, len, 0);
 }
 
 static s64 vmci_transport_stream_has_data(struct vsock_sock *vsk)
-- 
2.1.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox