Netdev List

* Re: [PATCH 2/3] net/unix: hook unix_socketpair() into LSM
From: David Miller @ 2018-04-24 17:56 UTC (permalink / raw)
  To: paul
  Cc: dh.herrmann, linux-kernel, jmorris, teg, sds, selinux,
	linux-security-module, eparis, serge, netdev
In-Reply-To: <CAHC9VhSv6tacFb+nEs1cCUOj52Vu+wXD6ZPZ1r1W4pYXo0VJMQ@mail.gmail.com>

From: Paul Moore <paul@paul-moore.com>
Date: Tue, 24 Apr 2018 13:55:31 -0400

> On Mon, Apr 23, 2018 at 9:30 AM, David Herrmann <dh.herrmann@gmail.com> wrote:
>> Use the newly created LSM-hook for unix_socketpair(). The default hook
>> return-value is 0, so behavior stays the same unless LSMs start using
>> this hook.
>>
>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>> ---
>>  net/unix/af_unix.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index 68bb70a62afe..bc9705ace9b1 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -1371,6 +1371,11 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
>>  static int unix_socketpair(struct socket *socka, struct socket *sockb)
>>  {
>>         struct sock *ska = socka->sk, *skb = sockb->sk;
>> +       int err;
>> +
>> +       err = security_unix_stream_socketpair(ska, skb);
>> +       if (err)
>> +               return err;
> 
> I recognize that AF_UNIX is really the only protocol that supports
> socketpair(2) at the moment, but I like to avoid protocol specific LSM
> hooks whenever possible.  Unless someone can think of a good
> objection, I would prefer to see the hook placed in __sys_socketpair()
> instead (and obviously drop the "unix_stream" portion from the hook
> name).

The counterargument is that after 30 years no other protocol has grown
usage of this operation. :-)

* Re: [PATCH 2/3] net/unix: hook unix_socketpair() into LSM
From: Paul Moore @ 2018-04-24 17:58 UTC (permalink / raw)
  To: David Miller
  Cc: dh.herrmann, linux-kernel, James Morris, teg, Stephen Smalley,
	selinux, linux-security-module, Eric Paris, serge, netdev
In-Reply-To: <20180424.135651.492329246141701047.davem@davemloft.net>

On Tue, Apr 24, 2018 at 1:56 PM, David Miller <davem@davemloft.net> wrote:
> From: Paul Moore <paul@paul-moore.com>
> Date: Tue, 24 Apr 2018 13:55:31 -0400
>
>> On Mon, Apr 23, 2018 at 9:30 AM, David Herrmann <dh.herrmann@gmail.com> wrote:
>>> Use the newly created LSM-hook for unix_socketpair(). The default hook
>>> return-value is 0, so behavior stays the same unless LSMs start using
>>> this hook.
>>>
>>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>>> ---
>>>  net/unix/af_unix.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>>> index 68bb70a62afe..bc9705ace9b1 100644
>>> --- a/net/unix/af_unix.c
>>> +++ b/net/unix/af_unix.c
>>> @@ -1371,6 +1371,11 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
>>>  static int unix_socketpair(struct socket *socka, struct socket *sockb)
>>>  {
>>>         struct sock *ska = socka->sk, *skb = sockb->sk;
>>> +       int err;
>>> +
>>> +       err = security_unix_stream_socketpair(ska, skb);
>>> +       if (err)
>>> +               return err;
>>
>> I recognize that AF_UNIX is really the only protocol that supports
>> socketpair(2) at the moment, but I like to avoid protocol specific LSM
>> hooks whenever possible.  Unless someone can think of a good
>> objection, I would prefer to see the hook placed in __sys_socketpair()
>> instead (and obviously drop the "unix_stream" portion from the hook
>> name).
>
> The counterargument is that after 30 years no other protocol has grown
> usage of this operation. :-)

Call me an optimist ;)

-- 
paul moore
www.paul-moore.com
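
For readers following the placement argument, here is a minimal userspace
sketch of a protocol-independent hook sitting in a generic socketpair path.
It is only a model under stated assumptions: struct sock_stub, pair_fn,
socketpair_hook, security_socketpair() and generic_socketpair() are
hypothetical names, not the kernel's, and the ordering of the hook relative
to the protocol operation is an assumption of the sketch. It shows the
property the patch description relies on: with no hook registered the
default return value is 0, so behaviour is unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; only here to give the hook arguments. */
struct sock_stub {
	int family;
};

typedef int (*pair_fn)(const struct sock_stub *a, const struct sock_stub *b);

/* NULL means no security module has registered interest. */
static pair_fn socketpair_hook;

/* Default result is 0, so callers see no change unless a hook is set. */
static int security_socketpair(const struct sock_stub *a,
			       const struct sock_stub *b)
{
	return socketpair_hook ? socketpair_hook(a, b) : 0;
}

/* Generic path: ask the hook first, then call the protocol's operation. */
static int generic_socketpair(const struct sock_stub *a,
			      const struct sock_stub *b, pair_fn proto_op)
{
	int err;

	assert(a != NULL && b != NULL);

	err = security_socketpair(a, b);
	if (err)
		return err;
	if (proto_op == NULL)
		return -EOPNOTSUPP;	/* protocol has no socketpair support */
	return proto_op(a, b);
}

/* Example hook that refuses every pairing. */
static int deny_all(const struct sock_stub *a, const struct sock_stub *b)
{
	(void)a;
	(void)b;
	return -EACCES;
}

/* Example protocol operation that always succeeds. */
static int proto_pair_ok(const struct sock_stub *a, const struct sock_stub *b)
{
	(void)a;
	(void)b;
	return 0;
}
```

Because the hook is consulted in the generic path, a protocol that grows
socketpair support later would be covered without a new hook.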

* [net-next v2] ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode
From: Ahmed Abdelsalam @ 2018-04-24 17:59 UTC (permalink / raw)
  To: davem, dav.lebrun, kuznet, yoshfuji, netdev, linux-kernel
  Cc: Ahmed Abdelsalam

ECMP (equal-cost multipath) hashes are typically computed on the packets'
5-tuple (src IP, dst IP, src port, dst port, L4 proto).

For encapsulated packets, the L4 data is not readily available and ECMP
hashing will often revert to (src IP, dst IP). This will lead to traffic
polarization on a single ECMP path, causing congestion and waste of network
capacity.

In IPv6, the 20-bit flow label field is also used as part of the ECMP
hash. In the absence of L4 data, the hashing will be on (src IP, dst IP,
flow label). Having a non-zero flow label is thus important for proper
traffic load balancing when L4 data is unavailable (i.e., when packets
are encapsulated).

Currently, the seg6_do_srh_encap() function extracts the original packet's
flow label and sets it as the outer IPv6 flow label. There are two issues
with this behaviour:

a) There is no guarantee that the inner flow label is set by the source.

b) If the original packet is not IPv6, the flow label will be set to
zero (e.g., IPv4 or L2 encap).

This patch adds a function, named seg6_make_flowlabel(), that computes a
flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
leverages the per-namespace 'seg6_flowlabel' sysctl value.

The currently supported behaviours are as follows:
-1 set flowlabel to zero
 0 copy flowlabel from the inner packet if it is IPv6
   (set flowlabel to 0 for IPv4/L2)
 1 compute the flowlabel from the skb hash (skb_get_hash())

This patch has been tested for IPv6, IPv4, and L2 traffic.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 include/net/netns/ipv6.h   |  1 +
 net/ipv6/seg6_iptunnel.c   | 24 ++++++++++++++++++++++--
 net/ipv6/sysctl_net_ipv6.c |  8 ++++++++
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 97b3a54..c978a31 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
 	int max_hbh_opts_cnt;
 	int max_dst_opts_len;
 	int max_hbh_opts_len;
+	int seg6_flowlabel;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 5fe1394..3d9cd86 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device *dev,
 	rcu_read_unlock();
 }
 
+/* Compute flowlabel for outer IPv6 header */
+__be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
+			   struct ipv6hdr *inner_hdr)
+{
+	int do_flowlabel = net->ipv6.sysctl.seg6_flowlabel;
+	__be32 flowlabel = 0;
+	u32 hash;
+
+	if (do_flowlabel > 0) {
+		hash = skb_get_hash(skb);
+		hash = rol32(hash, 16);
+		flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
+	} else if (!do_flowlabel && skb->protocol == htons(ETH_P_IPV6)) {
+		flowlabel = ip6_flowlabel(inner_hdr);
+	}
+	return flowlabel;
+}
+
 /* encapsulate an IPv6 packet within an outer IPv6 header with a given SRH */
 int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 {
@@ -99,6 +117,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 	struct ipv6hdr *hdr, *inner_hdr;
 	struct ipv6_sr_hdr *isrh;
 	int hdrlen, tot_len, err;
+	__be32 flowlabel;
 
 	hdrlen = (osrh->hdrlen + 1) << 3;
 	tot_len = hdrlen + sizeof(*hdr);
@@ -119,12 +138,13 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 	 * decapsulation will overwrite inner hlim with outer hlim
 	 */
 
+	flowlabel = seg6_make_flowlabel(net, skb, inner_hdr);
 	if (skb->protocol == htons(ETH_P_IPV6)) {
 		ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
-			     ip6_flowlabel(inner_hdr));
+			     flowlabel);
 		hdr->hop_limit = inner_hdr->hop_limit;
 	} else {
-		ip6_flow_hdr(hdr, 0, 0);
+		ip6_flow_hdr(hdr, 0, flowlabel);
 		hdr->hop_limit = ip6_dst_hoplimit(skb_dst(skb));
 	}
 
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 6fbdef6..e15cd37 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -152,6 +152,13 @@ static struct ctl_table ipv6_table_template[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "seg6_flowlabel",
+		.data		= &init_net.ipv6.sysctl.seg6_flowlabel,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
@@ -217,6 +224,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
 	ipv6_table[12].data = &net->ipv6.sysctl.max_dst_opts_len;
 	ipv6_table[13].data = &net->ipv6.sysctl.max_hbh_opts_len;
 	ipv6_table[14].data = &net->ipv6.sysctl.multipath_hash_policy,
+	ipv6_table[15].data = &net->ipv6.sysctl.seg6_flowlabel;
 
 	ipv6_route_table = ipv6_route_sysctl_init(net);
 	if (!ipv6_route_table)
-- 
2.1.4
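
For illustration only, the three seg6_flowlabel modes described in the
commit message can be modelled in userspace C. This is a sketch, not the
kernel code: make_flowlabel(), rotl32() and FLOWLABEL_MASK are made-up
names, the mask is kept in host byte order (the kernel operates on __be32
values), and the flow hash is passed in as a parameter instead of coming
from skb_get_hash(). Note that a rotate helper returns its result, so the
rotated value has to be assigned back before masking.

```c
#include <assert.h>
#include <stdint.h>

/* The IPv6 flow label is 20 bits wide; host byte order in this sketch. */
#define FLOWLABEL_MASK UINT32_C(0x000fffff)

/* Rotate a 32-bit value left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t v, unsigned int n)
{
	assert(n > 0 && n < 32);
	return (v << n) | (v >> (32 - n));
}

/*
 * mode  < 0: always use a zero flow label
 * mode == 0: copy the inner label if the inner packet is IPv6, else zero
 * mode  > 0: derive the label from the flow hash
 */
static uint32_t make_flowlabel(int mode, uint32_t flow_hash,
			       int inner_is_ipv6, uint32_t inner_label)
{
	if (mode > 0) {
		/* Keep the rotated value; rotl32() does not modify its argument. */
		flow_hash = rotl32(flow_hash, 16);
		return flow_hash & FLOWLABEL_MASK;
	}
	if (mode == 0 && inner_is_ipv6)
		return inner_label & FLOWLABEL_MASK;
	return 0;
}
```

With mode 1 and a hash of 0xabcd1234, the rotate gives 0x1234abcd and the
resulting label is 0x4abcd, so the label is taken from what were the upper
hash bits.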

* Re: [PATCH net-next] qed: Fix copying 2 strings
From: David Miller @ 2018-04-24 18:00 UTC (permalink / raw)
  To: denis.bolotin; +Cc: netdev, ariel.elior
In-Reply-To: <20180424123253.2333-1-denis.bolotin@cavium.com>

From: Denis Bolotin <denis.bolotin@cavium.com>
Date: Tue, 24 Apr 2018 15:32:53 +0300

> The strscpy() was a recent fix (net: qed: use correct strncpy() size) to
> prevent passing the length of the source buffer to strncpy() and guarantee
> null termination.
> It misses the goal of overwriting only the first 3 characters in
> "???_BIG_RAM" and "???_RAM" while keeping the rest of the string.
> Use strncpy() with the length of 3, without null termination.
> 
> Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
> Signed-off-by: Ariel Elior <ariel.elior@cavium.com>

Applied, thank you.
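
As an aside for readers, the difference between the two copies can be shown
with a small userspace sketch. It is an illustration under assumptions, not
the qed driver code: set_mem_prefix() and set_mem_prefix_terminated() are
hypothetical helpers, "XSTORM" is just an example source string, and
memcpy() of 3 bytes is used here in place of strncpy() with length 3 (for a
source of at least 3 characters the bytes written are the same).

```c
#include <assert.h>
#include <string.h>

/* Overwrite only the first 3 characters and keep the rest of the name. */
static void set_mem_prefix(char *name, const char *prefix)
{
	assert(name != NULL && prefix != NULL);
	assert(strlen(name) >= 3 && strlen(prefix) >= 3);
	memcpy(name, prefix, 3);	/* no terminator written on purpose */
}

/* Same copy, but with a terminator after the prefix: the suffix is lost. */
static void set_mem_prefix_terminated(char *name, const char *prefix)
{
	assert(name != NULL && prefix != NULL);
	assert(strlen(name) >= 3 && strlen(prefix) >= 3);
	memcpy(name, prefix, 3);
	name[3] = '\0';			/* cuts off "_BIG_RAM" / "_RAM" */
}
```

Applied to a buffer holding "???_BIG_RAM", the first helper leaves
"XST_BIG_RAM", while the second leaves only "XST".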

* Re: [PATCH] net: phy: allow scanning busses with missing phys
From: Andrew Lunn @ 2018-04-24 18:01 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Alexandre Belloni, David S . Miller, Allan Nielsen,
	Thomas Petazzoni, netdev, linux-kernel
In-Reply-To: <2833f30a-cf95-e5c7-e44f-218929e61024@gmail.com>

On Tue, Apr 24, 2018 at 09:37:09AM -0700, Florian Fainelli wrote:
> 
> 
> On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
> > Some MDIO busses will error out when trying to read a phy address with no
> > phy present at that address. In that case, probing the bus will fail
> > because __mdiobus_register() is scanning the bus for all possible phys
> > addresses.
> > 
> > In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
> > this address and set the phy ID to 0xffffffff which is then properly
> > handled in get_phy_device().
> 
> Humm, why not have your MDIO bus implementation do the scanning itself
> in a reset() callback, which happens before probing the bus, and based
> on the results, set phy_mask accordingly such that only PHYs present are
> populated?

Hi Florian

Seems a bit odd to have the driver do this, when the core could.

> My only concern with your change is that we are having a special
> treatment for EIO and ENODEV, so we must make sure MDIO bus drivers are
> all conforming to that.
 
I don't see how it could be a problem. It used to be that any error was
a real error, and would stop the bus from loading. It now means there is
nothing there. The only possible side effect is that an mdio bus driver
might remain loaded without any devices if all reads return
ENODEV/EIO, whereas before it would probably never load.

	    Andrew
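
A rough userspace model of the semantics under discussion, for anyone
skimming the thread. Everything here is an assumption made for the sketch:
mdio_read_fn, read_phy_id(), mock_read() and the register values it returns
are hypothetical, and only the two ID register numbers (2 and 3) follow the
standard MII layout. It shows -EIO/-ENODEV on the first ID read being
reported as "no device" (ID 0xffffffff, return 0), while any other error
still fails the probe.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NO_PHY_ID	UINT32_C(0xffffffff)
#define REG_PHYSID1	2	/* upper 16 bits of the PHY ID */
#define REG_PHYSID2	3	/* lower 16 bits of the PHY ID */

/* Hypothetical bus accessor: a 16-bit register value or a negative errno. */
typedef int (*mdio_read_fn)(int addr, int reg);

static int read_phy_id(mdio_read_fn read, int addr, uint32_t *phy_id)
{
	int val;

	assert(read != NULL && phy_id != NULL);

	val = read(addr, REG_PHYSID1);
	if (val == -EIO || val == -ENODEV) {
		/* Nothing answers at this address: not an error for a scan. */
		*phy_id = NO_PHY_ID;
		return 0;
	}
	if (val < 0)
		return -EIO;
	*phy_id = ((uint32_t)val & 0xffff) << 16;

	val = read(addr, REG_PHYSID2);
	if (val < 0)
		return -EIO;
	*phy_id |= (uint32_t)val & 0xffff;
	return 0;
}

/* Mock bus: a device at address 1, a timeout at 2, nothing elsewhere. */
static int mock_read(int addr, int reg)
{
	if (addr == 1)
		return reg == REG_PHYSID1 ? 0x0022 : 0x1550;
	if (addr == 2)
		return -ETIMEDOUT;
	return -ENODEV;
}
```

In this model a bus whose reads all return -ENODEV scans cleanly and ends
up with no devices, which is the side effect mentioned above.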

* Re: [PATCH V7 net-next 00/14] TLS offload, netdev & MLX5 support
From: David Miller @ 2018-04-24 18:02 UTC (permalink / raw)
  To: borisp; +Cc: netdev, saeedm, davejwatson, ktkhai
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Boris Pismenny <borisp@mellanox.com>
Date: Tue, 24 Apr 2018 16:12:51 +0300

> The following series provides TLS TX inline crypto offload.

Unfortunately the mlx5 bits don't apply cleanly to net-next, please
respin.

Thank you.

* Re: [PATCH] net: nfp: fix nfp_net_tx()'s return type
From: Jakub Kicinski @ 2018-04-24 18:09 UTC (permalink / raw)
  To: Luc Van Oostenryck
  Cc: linux-kernel, David S. Miller, Simon Horman, Daniel Borkmann,
	Dirk van der Merwe, Pablo Cascón, oss-drivers, netdev
In-Reply-To: <20180424131710.4206-1-luc.vanoostenryck@gmail.com>

On Tue, 24 Apr 2018 15:17:07 +0200, Luc Van Oostenryck wrote:
> The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
> which is a typedef for an enum type, but the implementation in this
> driver returns an 'int'.
> 
> Fix this by returning 'netdev_tx_t' in this driver too.
> 
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
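
For context, a tiny standalone sketch of the kind of mismatch being fixed.
The names are hypothetical (tx_status_t, struct pkt, struct dev_ops,
demo_xmit); only the idea is taken from the patch description: the
operations table declares the transmit callback as returning an enum
typedef, so the implementation should use that same return type instead of
'int'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status type, standing in for an enum-based typedef. */
typedef enum tx_status {
	TX_OK   = 0x00,
	TX_BUSY = 0x10
} tx_status_t;

struct pkt {
	size_t len;
};

/* The ops table fixes the callback's return type. */
struct dev_ops {
	tx_status_t (*start_xmit)(struct pkt *p);
};

/* Implementation declared with the typedef, matching the ops table. */
static tx_status_t demo_xmit(struct pkt *p)
{
	assert(p != NULL);
	/* Treating an empty packet as "busy" is arbitrary, for the demo only. */
	return p->len ? TX_OK : TX_BUSY;
}

static const struct dev_ops demo_ops = {
	.start_xmit = demo_xmit,
};
```

Declaring demo_xmit() as returning 'int' would make its type differ from
the start_xmit member, which is the sort of thing a static checker flags.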

* Re: [PATCH] net: phy: allow scanning busses with missing phys
From: Alexandre Belloni @ 2018-04-24 18:09 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, David S . Miller, Allan Nielsen, Thomas Petazzoni,
	netdev, linux-kernel
In-Reply-To: <2833f30a-cf95-e5c7-e44f-218929e61024@gmail.com>

On 24/04/2018 09:37:09-0700, Florian Fainelli wrote:
> 
> 
> On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
> > Some MDIO busses will error out when trying to read a phy address with no
> > phy present at that address. In that case, probing the bus will fail
> > because __mdiobus_register() is scanning the bus for all possible phys
> > addresses.
> > 
> > In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
> > this address and set the phy ID to 0xffffffff which is then properly
> > handled in get_phy_device().
> 
> Humm, why not have your MDIO bus implementation do the scanning itself
> in a reset() callback, which happens before probing the bus, and based
> on the results, set phy_mask accordingly such that only PHYs present are
> populated?
> 
> My only concern with your change is that we are having a special
> treatment for EIO and ENODEV, so we must make sure MDIO bus drivers are
> all conforming to that.
> 

That was what I was doing in [1] but it seems that Andrew preferred this way.

The third solution I was seeing was to return phy_reg instead of -EIO so
the MDIO driver can return -ENODEV and that would be passed to
get_phy_device(). __mdiobus_register() seems to handle -ENODEV properly.

My coccinelle-fu is not great but the following drivers can return
-ENODEV from their read callback:

drivers/net/ethernet/marvell/mvmdio.c
drivers/net/ethernet/hisilicon/hix5hd2_gmac.c (seeing the error message,
this has probably been copy pasted)

[1] https://marc.info/?l=linux-netdev&m=152183609927933&w=2

> > 
> > Suggested-by: Andrew Lunn <andrew@lunn.ch>
> > Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
> > ---
> >  drivers/net/phy/phy_device.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index ac23322a32e1..9e4ba8e80a18 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -535,8 +535,17 @@ static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id,
> >  
> >  	/* Grab the bits from PHYIR1, and put them in the upper half */
> >  	phy_reg = mdiobus_read(bus, addr, MII_PHYSID1);
> > -	if (phy_reg < 0)
> > +	if (phy_reg < 0) {
> > +		/* if there is no device, return without an error so scanning
> > +		 * the bus works properly
> > +		 */
> > +		if (phy_reg == -EIO || phy_reg == -ENODEV) {
> > +			*phy_id = 0xffffffff;
> > +			return 0;
> > +		}
> > +
> >  		return -EIO;
> > +	}
> >  
> >  	*phy_id = (phy_reg & 0xffff) << 16;
> >  
> > 
> 
> -- 
> Florian

-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com
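
For comparison, the alternative Florian describes (the driver scans first
and masks out empty addresses) could look roughly like the userspace sketch
below. The names probe_fn, build_phy_mask() and mock_probe() are
hypothetical, and the sketch assumes that a set bit in the mask means "skip
this address when populating the bus".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PHY_ADDR 32

/* Hypothetical probe: nonzero if something answers at this address. */
typedef int (*probe_fn)(int addr);

/* Build a mask with one bit set for every address that did not answer. */
static uint32_t build_phy_mask(probe_fn probe)
{
	uint32_t mask = 0;
	int addr;

	assert(probe != NULL);

	for (addr = 0; addr < MAX_PHY_ADDR; addr++) {
		if (!probe(addr))
			mask |= UINT32_C(1) << addr;
	}
	return mask;
}

/* Mock bus with devices at addresses 0 and 3 only. */
static int mock_probe(int addr)
{
	return addr == 0 || addr == 3;
}
```

With the mock above the mask comes out as 0xfffffff6, so only addresses 0
and 3 would be scanned later. The cost is that every affected driver has to
carry this loop, which is the objection raised earlier in the thread.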

* remove PCI_DMA_BUS_IS_PHYS V2
From: Christoph Hellwig @ 2018-04-24 18:16 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-ide-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi all,

this series tries to get rid of the global PCI_DMA_BUS_IS_PHYS flag,
which causes the block layer and networking code to bounce buffer memory
above the dma mask in some cases.  It is a leftover from i386 + highmem
days and is obsolete now that we have swiotlb or iommus so that the
dma ops implementations can always (minus the ISA DMA case which
will require further attention) handle memory passed to them.

Changes since V1:
 - dropped all patches not strictly required to remove
   PCI_DMA_BUS_IS_PHYS, those will be resent separately

* [PATCH 1/5] scsi: reduce use of block bounce buffers
From: Christoph Hellwig @ 2018-04-24 18:16 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-ide-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180424181625.22410-1-hch-jcswGhMUV9g@public.gmane.org>

We can rely on the dma-mapping code to handle any DMA limit that is
bigger than the ISA DMA mask for us (either using an iommu or swiotlb),
so remove setting the block layer bounce limit for anything but the
unchecked_isa_dma case, or the bouncing for highmem pages.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/scsi/scsi_lib.c | 24 ++----------------------
 1 file changed, 2 insertions(+), 22 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9b4f279d29c..e0b614c0b1e6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2149,27 +2149,6 @@ static int scsi_map_queues(struct blk_mq_tag_set *set)
 	return blk_mq_map_queues(set);
 }
 
-static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
-{
-	struct device *host_dev;
-	u64 bounce_limit = 0xffffffff;
-
-	if (shost->unchecked_isa_dma)
-		return BLK_BOUNCE_ISA;
-	/*
-	 * Platforms with virtual-DMA translation
-	 * hardware have no practical limit.
-	 */
-	if (!PCI_DMA_BUS_IS_PHYS)
-		return BLK_BOUNCE_ANY;
-
-	host_dev = scsi_get_device(shost);
-	if (host_dev && host_dev->dma_mask)
-		bounce_limit = (u64)dma_max_pfn(host_dev) << PAGE_SHIFT;
-
-	return bounce_limit;
-}
-
 void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
 	struct device *dev = shost->dma_dev;
@@ -2189,7 +2168,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 	}
 
 	blk_queue_max_hw_sectors(q, shost->max_sectors);
-	blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
+	if (shost->unchecked_isa_dma)
+		blk_queue_bounce_limit(q, BLK_BOUNCE_ISA);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);
 
-- 
2.17.0

* [PATCH 2/5] ide: kill ide_toggle_bounce
From: Christoph Hellwig @ 2018-04-24 18:16 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-ide-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180424181625.22410-1-hch-jcswGhMUV9g@public.gmane.org>

ide_toggle_bounce did select various strange block bounce limits, including
not bouncing at all as soon as an iommu is present in the system.  Given
that the dma_map routines now handle any required bounce buffering except
for ISA DMA, and the ide code already must handle either ISA DMA or highmem
at least for iommu equipped systems we can get rid of the block layer
bounce limit setting entirely.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/ide/ide-dma.c   |  2 --
 drivers/ide/ide-lib.c   | 26 --------------------------
 drivers/ide/ide-probe.c |  3 ---
 include/linux/ide.h     |  2 --
 4 files changed, 33 deletions(-)

diff --git a/drivers/ide/ide-dma.c b/drivers/ide/ide-dma.c
index 54d4d78ca46a..6f344654ef22 100644
--- a/drivers/ide/ide-dma.c
+++ b/drivers/ide/ide-dma.c
@@ -180,7 +180,6 @@ EXPORT_SYMBOL_GPL(ide_dma_unmap_sg);
 void ide_dma_off_quietly(ide_drive_t *drive)
 {
 	drive->dev_flags &= ~IDE_DFLAG_USING_DMA;
-	ide_toggle_bounce(drive, 0);
 
 	drive->hwif->dma_ops->dma_host_set(drive, 0);
 }
@@ -211,7 +210,6 @@ EXPORT_SYMBOL(ide_dma_off);
 void ide_dma_on(ide_drive_t *drive)
 {
 	drive->dev_flags |= IDE_DFLAG_USING_DMA;
-	ide_toggle_bounce(drive, 1);
 
 	drive->hwif->dma_ops->dma_host_set(drive, 1);
 }
diff --git a/drivers/ide/ide-lib.c b/drivers/ide/ide-lib.c
index e1180fa46196..78cb79eddc8b 100644
--- a/drivers/ide/ide-lib.c
+++ b/drivers/ide/ide-lib.c
@@ -6,32 +6,6 @@
 #include <linux/ide.h>
 #include <linux/bitops.h>
 
-/**
- *	ide_toggle_bounce	-	handle bounce buffering
- *	@drive: drive to update
- *	@on: on/off boolean
- *
- *	Enable or disable bounce buffering for the device. Drives move
- *	between PIO and DMA and that changes the rules we need.
- */
-
-void ide_toggle_bounce(ide_drive_t *drive, int on)
-{
-	u64 addr = BLK_BOUNCE_HIGH;	/* dma64_addr_t */
-
-	if (!PCI_DMA_BUS_IS_PHYS) {
-		addr = BLK_BOUNCE_ANY;
-	} else if (on && drive->media == ide_disk) {
-		struct device *dev = drive->hwif->dev;
-
-		if (dev && dev->dma_mask)
-			addr = *dev->dma_mask;
-	}
-
-	if (drive->queue)
-		blk_queue_bounce_limit(drive->queue, addr);
-}
-
 u64 ide_get_lba_addr(struct ide_cmd *cmd, int lba48)
 {
 	struct ide_taskfile *tf = &cmd->tf;
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 2019e66eada7..8d8ed036ca0a 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -805,9 +805,6 @@ static int ide_init_queue(ide_drive_t *drive)
 	/* assign drive queue */
 	drive->queue = q;
 
-	/* needs drive->queue to be set */
-	ide_toggle_bounce(drive, 1);
-
 	return 0;
 }
 
diff --git a/include/linux/ide.h b/include/linux/ide.h
index ca9d34feb572..11f0dd03a4b4 100644
--- a/include/linux/ide.h
+++ b/include/linux/ide.h
@@ -1508,8 +1508,6 @@ static inline void ide_set_hwifdata (ide_hwif_t * hwif, void *data)
 	hwif->hwif_data = data;
 }
 
-extern void ide_toggle_bounce(ide_drive_t *drive, int on);
-
 u64 ide_get_lba_addr(struct ide_cmd *, int);
 u8 ide_dump_status(ide_drive_t *, const char *, u8);
 
-- 
2.17.0

* [PATCH 3/5] ide: remove the PCI_DMA_BUS_IS_PHYS check
From: Christoph Hellwig @ 2018-04-24 18:16 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-ide-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180424181625.22410-1-hch-jcswGhMUV9g@public.gmane.org>

We now have ways to deal with drainage in the block layer, and libata has
been using it for ages.  We also want to get rid of PCI_DMA_BUS_IS_PHYS
now, so just reduce the PCI transfer size for ide - anyone who cares for
performance on PCI controllers should have switched to libata long ago.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 drivers/ide/ide-probe.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 8d8ed036ca0a..56d7bc228cb3 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -796,8 +796,7 @@ static int ide_init_queue(ide_drive_t *drive)
 	 * This will be fixed once we teach pci_map_sg() about our boundary
 	 * requirements, hopefully soon. *FIXME*
 	 */
-	if (!PCI_DMA_BUS_IS_PHYS)
-		max_sg_entries >>= 1;
+	max_sg_entries >>= 1;
 #endif /* CONFIG_PCI */
 
 	blk_queue_max_segments(q, max_sg_entries);
-- 
2.17.0

* [PATCH 4/5] net: remove the PCI_DMA_BUS_IS_PHYS check in illegal_highdma
From: Christoph Hellwig @ 2018-04-24 18:16 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-ide-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180424181625.22410-1-hch-jcswGhMUV9g@public.gmane.org>

These days the dma mapping routines must be able to handle any address
supported by the device, be that using an iommu, or swiotlb if none is
supported.  With that the PCI_DMA_BUS_IS_PHYS check in illegal_highdma
is not needed and can be removed.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 net/core/dev.c | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index af0558b00c6c..060256cbf4f3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2884,11 +2884,7 @@ void netdev_rx_csum_fault(struct net_device *dev)
 EXPORT_SYMBOL(netdev_rx_csum_fault);
 #endif
 
-/* Actually, we should eliminate this check as soon as we know, that:
- * 1. IOMMU is present and allows to map all the memory.
- * 2. No high memory really exists on this machine.
- */
-
+/* XXX: check that highmem exists at all on the given machine. */
 static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_HIGHMEM
@@ -2902,20 +2898,6 @@ static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
 				return 1;
 		}
 	}
-
-	if (PCI_DMA_BUS_IS_PHYS) {
-		struct device *pdev = dev->dev.parent;
-
-		if (!pdev)
-			return 0;
-		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-			dma_addr_t addr = page_to_phys(skb_frag_page(frag));
-
-			if (!pdev->dma_mask || addr + PAGE_SIZE - 1 > *pdev->dma_mask)
-				return 1;
-		}
-	}
 #endif
 	return 0;
 }
-- 
2.17.0
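
For reference, the address test that the removed block performed can be
modelled in a few lines of userspace C. This is a sketch with assumed
names (SKETCH_PAGE_SIZE, page_exceeds_mask(), any_frag_exceeds_mask()) and
a fixed 4 KiB page size; fragment addresses are passed in as plain
integers, and the case of a device without any dma_mask, which the removed
code also treated as illegal, is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096)	/* assumption for this sketch */

/* True if the last byte of the page lies above the device's DMA mask. */
static bool page_exceeds_mask(uint64_t page_phys, uint64_t dma_mask)
{
	return page_phys + SKETCH_PAGE_SIZE - 1 > dma_mask;
}

/* True if any fragment page is out of reach for the given mask. */
static bool any_frag_exceeds_mask(const uint64_t *frag_phys, size_t n,
				  uint64_t dma_mask)
{
	size_t i;

	assert(frag_phys != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (page_exceeds_mask(frag_phys[i], dma_mask))
			return true;
	}
	return false;
}
```

With a 32-bit mask, a page starting at 0xfffff000 still fits while one at
0x100000000 does not. The patch drops this test on the grounds that the dma
mapping routines (iommu or swiotlb) now have to cope with such pages
themselves.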

* [PATCH 5/5] PCI: remove PCI_DMA_BUS_IS_PHYS
From: Christoph Hellwig @ 2018-04-24 18:16 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-ide-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20180424181625.22410-1-hch-jcswGhMUV9g@public.gmane.org>

This was used by the ide, scsi and networking code in the past to
determine if they should bounce payloads.  Now that the dma mapping
implementations always have to support dma to all physical memory (thanks
to swiotlb for non-iommu systems) there is no need for this crude hack any
more.

Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
---
 arch/alpha/include/asm/pci.h      |  5 -----
 arch/arc/include/asm/pci.h        |  6 ------
 arch/arm/include/asm/pci.h        |  7 -------
 arch/arm64/include/asm/pci.h      |  5 -----
 arch/h8300/include/asm/pci.h      |  2 --
 arch/hexagon/kernel/dma.c         |  1 -
 arch/ia64/hp/common/sba_iommu.c   |  3 ---
 arch/ia64/include/asm/pci.h       | 17 -----------------
 arch/ia64/kernel/setup.c          | 12 ------------
 arch/ia64/sn/kernel/io_common.c   |  5 -----
 arch/m68k/include/asm/pci.h       |  6 ------
 arch/microblaze/include/asm/pci.h |  6 ------
 arch/mips/include/asm/pci.h       |  7 -------
 arch/parisc/include/asm/pci.h     | 23 -----------------------
 arch/parisc/kernel/setup.c        |  5 -----
 arch/powerpc/include/asm/pci.h    | 18 ------------------
 arch/riscv/include/asm/pci.h      |  3 ---
 arch/s390/include/asm/pci.h       |  2 --
 arch/s390/pci/pci_dma.c           |  2 --
 arch/sh/include/asm/pci.h         |  6 ------
 arch/sh/kernel/dma-nommu.c        |  1 -
 arch/sparc/include/asm/pci_32.h   |  4 ----
 arch/sparc/include/asm/pci_64.h   |  6 ------
 arch/x86/include/asm/pci.h        |  3 ---
 arch/xtensa/include/asm/pci.h     |  2 --
 drivers/parisc/ccio-dma.c         |  2 --
 drivers/parisc/sba_iommu.c        |  2 --
 include/asm-generic/pci.h         |  8 --------
 include/linux/dma-mapping.h       |  1 -
 lib/dma-direct.c                  |  1 -
 tools/virtio/linux/dma-mapping.h  |  2 --
 31 files changed, 173 deletions(-)

diff --git a/arch/alpha/include/asm/pci.h b/arch/alpha/include/asm/pci.h
index b9ec55351924..cf6bc1e64d66 100644
--- a/arch/alpha/include/asm/pci.h
+++ b/arch/alpha/include/asm/pci.h
@@ -56,11 +56,6 @@ struct pci_controller {
 
 /* IOMMU controls.  */
 
-/* The PCI address space does not equal the physical memory address space.
-   The networking and block device layers use this boolean for bounce buffer
-   decisions.  */
-#define PCI_DMA_BUS_IS_PHYS  0
-
 /* TODO: integrate with include/asm-generic/pci.h ? */
 static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
 {
diff --git a/arch/arc/include/asm/pci.h b/arch/arc/include/asm/pci.h
index ba56c23c1b20..4ff53c041c64 100644
--- a/arch/arc/include/asm/pci.h
+++ b/arch/arc/include/asm/pci.h
@@ -16,12 +16,6 @@
 #define PCIBIOS_MIN_MEM 0x100000
 
 #define pcibios_assign_all_busses()	1
-/*
- * The PCI address space does equal the physical memory address space.
- * The networking and block device layers use this boolean for bounce
- * buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS	1
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/arm/include/asm/pci.h b/arch/arm/include/asm/pci.h
index 1f0de808d111..0abd389cf0ec 100644
--- a/arch/arm/include/asm/pci.h
+++ b/arch/arm/include/asm/pci.h
@@ -19,13 +19,6 @@ static inline int pci_proc_domain(struct pci_bus *bus)
 }
 #endif /* CONFIG_PCI_DOMAINS */
 
-/*
- * The PCI address space does equal the physical memory address space.
- * The networking and block device layers use this boolean for bounce
- * buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS     (1)
-
 #define HAVE_PCI_MMAP
 #define ARCH_GENERIC_PCI_MMAP_RESOURCE
 
diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
index 8747f7c5e0e7..9e690686e8aa 100644
--- a/arch/arm64/include/asm/pci.h
+++ b/arch/arm64/include/asm/pci.h
@@ -18,11 +18,6 @@
 #define pcibios_assign_all_busses() \
 	(pci_has_flag(PCI_REASSIGN_ALL_BUS))
 
-/*
- * PCI address space differs from physical memory address space
- */
-#define PCI_DMA_BUS_IS_PHYS	(0)
-
 #define ARCH_GENERIC_PCI_MMAP_RESOURCE	1
 
 extern int isa_dma_bridge_buggy;
diff --git a/arch/h8300/include/asm/pci.h b/arch/h8300/include/asm/pci.h
index 7c9e55d62215..d4d345a52092 100644
--- a/arch/h8300/include/asm/pci.h
+++ b/arch/h8300/include/asm/pci.h
@@ -15,6 +15,4 @@ static inline void pcibios_penalize_isa_irq(int irq, int active)
 	/* We don't do dynamic PCI IRQ allocation */
 }
 
-#define PCI_DMA_BUS_IS_PHYS	(1)
-
 #endif /* _ASM_H8300_PCI_H */
diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c
index ad8347c29dcf..77459df34e2e 100644
--- a/arch/hexagon/kernel/dma.c
+++ b/arch/hexagon/kernel/dma.c
@@ -208,7 +208,6 @@ const struct dma_map_ops hexagon_dma_ops = {
 	.sync_single_for_cpu = hexagon_sync_single_for_cpu,
 	.sync_single_for_device = hexagon_sync_single_for_device,
 	.mapping_error	= hexagon_mapping_error,
-	.is_phys	= 1,
 };
 
 void __init hexagon_dma_init(void)
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index aec4a3354abe..6f05aba9012f 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1845,9 +1845,6 @@ static void ioc_init(unsigned long hpa, struct ioc *ioc)
 	ioc_resource_init(ioc);
 	ioc_sac_init(ioc);
 
-	if ((long) ~iovp_mask > (long) ia64_max_iommu_merge_mask)
-		ia64_max_iommu_merge_mask = ~iovp_mask;
-
 	printk(KERN_INFO PFX
 		"%s %d.%d HPA 0x%lx IOVA space %dMb at 0x%lx\n",
 		ioc->name, (ioc->rev >> 4) & 0xF, ioc->rev & 0xF,
diff --git a/arch/ia64/include/asm/pci.h b/arch/ia64/include/asm/pci.h
index b1d04e8bafc8..780e8744ba85 100644
--- a/arch/ia64/include/asm/pci.h
+++ b/arch/ia64/include/asm/pci.h
@@ -30,23 +30,6 @@ struct pci_vector_struct {
 #define PCIBIOS_MIN_IO		0x1000
 #define PCIBIOS_MIN_MEM		0x10000000
 
-/*
- * PCI_DMA_BUS_IS_PHYS should be set to 1 if there is _necessarily_ a direct
- * correspondence between device bus addresses and CPU physical addresses.
- * Platforms with a hardware I/O MMU _must_ turn this off to suppress the
- * bounce buffer handling code in the block and network device layers.
- * Platforms with separate bus address spaces _must_ turn this off and provide
- * a device DMA mapping implementation that takes care of the necessary
- * address translation.
- *
- * For now, the ia64 platforms which may have separate/multiple bus address
- * spaces all have I/O MMUs which support the merging of physically
- * discontiguous buffers, so we can use that as the sole factor to determine
- * the setting of PCI_DMA_BUS_IS_PHYS.
- */
-extern unsigned long ia64_max_iommu_merge_mask;
-#define PCI_DMA_BUS_IS_PHYS	(ia64_max_iommu_merge_mask == ~0UL)
-
 #define HAVE_PCI_MMAP
 #define ARCH_GENERIC_PCI_MMAP_RESOURCE
 #define arch_can_pci_mmap_wc()	1
diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index dee56bcb993d..ad43cbf70628 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -123,18 +123,6 @@ unsigned long ia64_i_cache_stride_shift = ~0;
 #define	CACHE_STRIDE_SHIFT	5
 unsigned long ia64_cache_stride_shift = ~0;
 
-/*
- * The merge_mask variable needs to be set to (max(iommu_page_size(iommu)) - 1).  This
- * mask specifies a mask of address bits that must be 0 in order for two buffers to be
- * mergeable by the I/O MMU (i.e., the end address of the first buffer and the start
- * address of the second buffer must be aligned to (merge_mask+1) in order to be
- * mergeable).  By default, we assume there is no I/O MMU which can merge physically
- * discontiguous buffers, so we set the merge_mask to ~0UL, which corresponds to a iommu
- * page-size of 2^64.
- */
-unsigned long ia64_max_iommu_merge_mask = ~0UL;
-EXPORT_SYMBOL(ia64_max_iommu_merge_mask);
-
 /*
  * We use a special marker for the end of memory and it uses the extra (+1) slot
  */
diff --git a/arch/ia64/sn/kernel/io_common.c b/arch/ia64/sn/kernel/io_common.c
index 11f2275570fb..8479e9a7ce16 100644
--- a/arch/ia64/sn/kernel/io_common.c
+++ b/arch/ia64/sn/kernel/io_common.c
@@ -480,11 +480,6 @@ sn_io_early_init(void)
 	tioca_init_provider();
 	tioce_init_provider();
 
-	/*
-	 * This is needed to avoid bounce limit checks in the blk layer
-	 */
-	ia64_max_iommu_merge_mask = ~PAGE_MASK;
-
 	sn_irq_lh_init();
 	INIT_LIST_HEAD(&sn_sysdata_list);
 	sn_init_cpei_timer();
diff --git a/arch/m68k/include/asm/pci.h b/arch/m68k/include/asm/pci.h
index ef26fae8cf0b..5a4bc223743b 100644
--- a/arch/m68k/include/asm/pci.h
+++ b/arch/m68k/include/asm/pci.h
@@ -4,12 +4,6 @@
 
 #include <asm-generic/pci.h>
 
-/* The PCI address space does equal the physical memory
- * address space.  The networking and block device layers use
- * this boolean for bounce buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS	(1)
-
 #define	pcibios_assign_all_busses()	1
 
 #define	PCIBIOS_MIN_IO		0x00000100
diff --git a/arch/microblaze/include/asm/pci.h b/arch/microblaze/include/asm/pci.h
index 5de871eb4a59..00337861472e 100644
--- a/arch/microblaze/include/asm/pci.h
+++ b/arch/microblaze/include/asm/pci.h
@@ -62,12 +62,6 @@ extern int pci_mmap_legacy_page_range(struct pci_bus *bus,
 
 #define HAVE_PCI_LEGACY	1
 
-/* The PCI address space does equal the physical memory
- * address space (no IOMMU).  The IDE and SCSI device layers use
- * this boolean for bounce buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS     (1)
-
 extern void pcibios_claim_one_bus(struct pci_bus *b);
 
 extern void pcibios_finish_adding_to_bus(struct pci_bus *bus);
diff --git a/arch/mips/include/asm/pci.h b/arch/mips/include/asm/pci.h
index 2339f42f047a..436099883022 100644
--- a/arch/mips/include/asm/pci.h
+++ b/arch/mips/include/asm/pci.h
@@ -121,13 +121,6 @@ extern unsigned long PCIBIOS_MIN_MEM;
 #include <linux/string.h>
 #include <asm/io.h>
 
-/*
- * The PCI address space does equal the physical memory address space.
- * The networking and block device layers use this boolean for bounce
- * buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS     (1)
-
 #ifdef CONFIG_PCI_DOMAINS_GENERIC
 static inline int pci_proc_domain(struct pci_bus *bus)
 {
diff --git a/arch/parisc/include/asm/pci.h b/arch/parisc/include/asm/pci.h
index 96b7deec512d..3328fd17c19d 100644
--- a/arch/parisc/include/asm/pci.h
+++ b/arch/parisc/include/asm/pci.h
@@ -87,29 +87,6 @@ struct pci_hba_data {
 #define PCI_F_EXTEND		0UL
 #endif /* !CONFIG_64BIT */
 
-/*
- * If the PCI device's view of memory is the same as the CPU's view of memory,
- * PCI_DMA_BUS_IS_PHYS is true.  The networking and block device layers use
- * this boolean for bounce buffer decisions.
- */
-#ifdef CONFIG_PA20
-/* All PA-2.0 machines have an IOMMU. */
-#define PCI_DMA_BUS_IS_PHYS	0
-#define parisc_has_iommu()	do { } while (0)
-#else
-
-#if defined(CONFIG_IOMMU_CCIO) || defined(CONFIG_IOMMU_SBA)
-extern int parisc_bus_is_phys; 	/* in arch/parisc/kernel/setup.c */
-#define PCI_DMA_BUS_IS_PHYS	parisc_bus_is_phys
-#define parisc_has_iommu()	do { parisc_bus_is_phys = 0; } while (0)
-#else
-#define PCI_DMA_BUS_IS_PHYS	1
-#define parisc_has_iommu()	do { } while (0)
-#endif
-
-#endif	/* !CONFIG_PA20 */
-
-
 /*
 ** Most PCI devices (eg Tulip, NCR720) also export the same registers
 ** to both MMIO and I/O port space.  Due to poor performance of I/O Port
diff --git a/arch/parisc/kernel/setup.c b/arch/parisc/kernel/setup.c
index 0e9675f857a5..8d3a7b80ac42 100644
--- a/arch/parisc/kernel/setup.c
+++ b/arch/parisc/kernel/setup.c
@@ -58,11 +58,6 @@ struct proc_dir_entry * proc_runway_root __read_mostly = NULL;
 struct proc_dir_entry * proc_gsc_root __read_mostly = NULL;
 struct proc_dir_entry * proc_mckinley_root __read_mostly = NULL;
 
-#if !defined(CONFIG_PA20) && (defined(CONFIG_IOMMU_CCIO) || defined(CONFIG_IOMMU_SBA))
-int parisc_bus_is_phys __read_mostly = 1;	/* Assume no IOMMU is present */
-EXPORT_SYMBOL(parisc_bus_is_phys);
-#endif
-
 void __init setup_cmdline(char **cmdline_p)
 {
 	extern unsigned int boot_args[];
diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 401c62aad5e4..2af9ded80540 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -92,24 +92,6 @@ extern int pci_mmap_legacy_page_range(struct pci_bus *bus,
 
 #define HAVE_PCI_LEGACY	1
 
-#ifdef CONFIG_PPC64
-
-/* The PCI address space does not equal the physical memory address
- * space (we have an IOMMU).  The IDE and SCSI device layers use
- * this boolean for bounce buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS	(0)
-
-#else /* 32-bit */
-
-/* The PCI address space does equal the physical memory
- * address space (no IOMMU).  The IDE and SCSI device layers use
- * this boolean for bounce buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS     (1)
-
-#endif /* CONFIG_PPC64 */
-
 extern void pcibios_claim_one_bus(struct pci_bus *b);
 
 extern void pcibios_finish_adding_to_bus(struct pci_bus *bus);
diff --git a/arch/riscv/include/asm/pci.h b/arch/riscv/include/asm/pci.h
index 0f2fc9ef20fc..b3638c505728 100644
--- a/arch/riscv/include/asm/pci.h
+++ b/arch/riscv/include/asm/pci.h
@@ -26,9 +26,6 @@
 /* RISC-V shim does not initialize PCI bus */
 #define pcibios_assign_all_busses() 1
 
-/* We do not have an IOMMU */
-#define PCI_DMA_BUS_IS_PHYS 1
-
 extern int isa_dma_bridge_buggy;
 
 #ifdef CONFIG_PCI
diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 12fe3591034f..94f8db468c9b 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -2,8 +2,6 @@
 #ifndef __ASM_S390_PCI_H
 #define __ASM_S390_PCI_H
 
-/* must be set before including asm-generic/pci.h */
-#define PCI_DMA_BUS_IS_PHYS (0)
 /* must be set before including pci_clp.h */
 #define PCI_BAR_COUNT	6
 
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index 2d15d84c20ed..10abf5ed6187 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -685,8 +685,6 @@ const struct dma_map_ops s390_pci_dma_ops = {
 	.map_page	= s390_dma_map_pages,
 	.unmap_page	= s390_dma_unmap_pages,
 	.mapping_error	= s390_mapping_error,
-	/* if we support direct DMA this must be conditional */
-	.is_phys	= 0,
 	/* dma_supported is unconditionally true without a callback */
 };
 EXPORT_SYMBOL_GPL(s390_pci_dma_ops);
diff --git a/arch/sh/include/asm/pci.h b/arch/sh/include/asm/pci.h
index 0033f0df2b3b..10a36b1cf2ea 100644
--- a/arch/sh/include/asm/pci.h
+++ b/arch/sh/include/asm/pci.h
@@ -71,12 +71,6 @@ extern unsigned long PCIBIOS_MIN_IO, PCIBIOS_MIN_MEM;
  * SuperH has everything mapped statically like x86.
  */
 
-/* The PCI address space does equal the physical memory
- * address space.  The networking and block device layers use
- * this boolean for bounce buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS	(dma_ops->is_phys)
-
 #ifdef CONFIG_PCI
 /*
  * None of the SH PCI controllers support MWI, it is always treated as a
diff --git a/arch/sh/kernel/dma-nommu.c b/arch/sh/kernel/dma-nommu.c
index 178457d7620c..3e3a32fc676e 100644
--- a/arch/sh/kernel/dma-nommu.c
+++ b/arch/sh/kernel/dma-nommu.c
@@ -78,7 +78,6 @@ const struct dma_map_ops nommu_dma_ops = {
 	.sync_single_for_device	= nommu_sync_single_for_device,
 	.sync_sg_for_device	= nommu_sync_sg_for_device,
 #endif
-	.is_phys		= 1,
 };
 
 void __init no_iommu_init(void)
diff --git a/arch/sparc/include/asm/pci_32.h b/arch/sparc/include/asm/pci_32.h
index 98917e48727d..cfc0ee9476c6 100644
--- a/arch/sparc/include/asm/pci_32.h
+++ b/arch/sparc/include/asm/pci_32.h
@@ -17,10 +17,6 @@
 
 #define PCI_IRQ_NONE		0xffffffff
 
-/* Dynamic DMA mapping stuff.
- */
-#define PCI_DMA_BUS_IS_PHYS	(0)
-
 #endif /* __KERNEL__ */
 
 #ifndef CONFIG_LEON_PCI
diff --git a/arch/sparc/include/asm/pci_64.h b/arch/sparc/include/asm/pci_64.h
index 671274e36cfa..fac77813402c 100644
--- a/arch/sparc/include/asm/pci_64.h
+++ b/arch/sparc/include/asm/pci_64.h
@@ -17,12 +17,6 @@
 
 #define PCI_IRQ_NONE		0xffffffff
 
-/* The PCI address space does not equal the physical memory
- * address space.  The networking and block device layers use
- * this boolean for bounce buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS	(0)
-
 /* PCI IOMMU mapping bypass support. */
 
 /* PCI 64-bit addressing works for all slots on all controller
diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index d32175e30259..662963681ea6 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -117,9 +117,6 @@ void native_restore_msi_irqs(struct pci_dev *dev);
 #define native_setup_msi_irqs		NULL
 #define native_teardown_msi_irq		NULL
 #endif
-
-#define PCI_DMA_BUS_IS_PHYS (dma_ops->is_phys)
-
 #endif  /* __KERNEL__ */
 
 #ifdef CONFIG_X86_64
diff --git a/arch/xtensa/include/asm/pci.h b/arch/xtensa/include/asm/pci.h
index d5a82153a7c5..6ddf0a30c60d 100644
--- a/arch/xtensa/include/asm/pci.h
+++ b/arch/xtensa/include/asm/pci.h
@@ -42,8 +42,6 @@ extern struct pci_controller* pcibios_alloc_controller(void);
  * decisions.
  */
 
-#define PCI_DMA_BUS_IS_PHYS	(1)
-
 /* Tell PCI code what kind of PCI resource mappings we support */
 #define HAVE_PCI_MMAP			1
 #define ARCH_GENERIC_PCI_MMAP_RESOURCE	1
diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c
index acba1f56af3e..2b129d8525d5 100644
--- a/drivers/parisc/ccio-dma.c
+++ b/drivers/parisc/ccio-dma.c
@@ -1596,8 +1596,6 @@ static int __init ccio_probe(struct parisc_device *dev)
 	}
 #endif
 	ioc_count++;
-
-	parisc_has_iommu();
 	return 0;
 }
 
diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c
index 0a9c762a70fa..a58c586ebd81 100644
--- a/drivers/parisc/sba_iommu.c
+++ b/drivers/parisc/sba_iommu.c
@@ -2017,8 +2017,6 @@ static int __init sba_driver_callback(struct parisc_device *dev)
 	proc_create("sba_iommu", 0, root, &sba_proc_fops);
 	proc_create("sba_iommu-bitmap", 0, root, &sba_proc_bitmap_fops);
 #endif
-
-	parisc_has_iommu();
 	return 0;
 }
 
diff --git a/include/asm-generic/pci.h b/include/asm-generic/pci.h
index 830d7659289b..6bb3cd3d695a 100644
--- a/include/asm-generic/pci.h
+++ b/include/asm-generic/pci.h
@@ -14,12 +14,4 @@ static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
 }
 #endif /* HAVE_ARCH_PCI_GET_LEGACY_IDE_IRQ */
 
-/*
- * By default, assume that no iommu is in use and that the PCI
- * space is mapped to address physical 0.
- */
-#ifndef PCI_DMA_BUS_IS_PHYS
-#define PCI_DMA_BUS_IS_PHYS	(1)
-#endif
-
 #endif /* _ASM_GENERIC_PCI_H */
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 14269d25498b..25a9a2b04f78 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -133,7 +133,6 @@ struct dma_map_ops {
 #ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
 	u64 (*get_required_mask)(struct device *dev);
 #endif
-	int is_phys;
 };
 
 extern const struct dma_map_ops dma_direct_ops;
diff --git a/lib/dma-direct.c b/lib/dma-direct.c
index bbfb229aa067..ebfee14ed2df 100644
--- a/lib/dma-direct.c
+++ b/lib/dma-direct.c
@@ -180,6 +180,5 @@ const struct dma_map_ops dma_direct_ops = {
 	.map_sg			= dma_direct_map_sg,
 	.dma_supported		= dma_direct_supported,
 	.mapping_error		= dma_direct_mapping_error,
-	.is_phys		= 1,
 };
 EXPORT_SYMBOL(dma_direct_ops);
diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h
index 1571e24e9494..f91aeb5fe571 100644
--- a/tools/virtio/linux/dma-mapping.h
+++ b/tools/virtio/linux/dma-mapping.h
@@ -6,8 +6,6 @@
 # error Virtio userspace code does not support CONFIG_HAS_DMA
 #endif
 
-#define PCI_DMA_BUS_IS_PHYS 1
-
 enum dma_data_direction {
 	DMA_BIDIRECTIONAL = 0,
 	DMA_TO_DEVICE = 1,
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-04-24 18:21 UTC (permalink / raw)
  To: David Miller
  Cc: Pravin Shelar, Linux Kernel Network Developers, Florian Westphal
In-Reply-To: <20180424.134219.297866275253495288.davem@davemloft.net>

On Tue, Apr 24, 2018 at 10:42 AM, David Miller <davem@davemloft.net> wrote:
> From: Pravin Shelar <pshelar@ovn.org>
> Date: Mon, 23 Apr 2018 23:34:48 -0700
>
>> OK. Thanks for the info.
>
> So, ACK, Reviewed-by, etc.? :-)
>

Pravin provided feedback in a previous email.  I will address it and
send out v3.

Thanks,

-Yi-Hung

^ permalink raw reply

* Re: [PATCH 5/5] PCI: remove PCI_DMA_BUS_IS_PHYS
From: Palmer Dabbelt @ 2018-04-24 18:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: iommu, linux-arch, linux-block, linux-ide, linux-scsi, netdev,
	davem, linux-kernel
In-Reply-To: <20180424181625.22410-6-hch@lst.de>

On Tue, 24 Apr 2018 11:16:25 PDT (-0700), Christoph Hellwig wrote:
> This was used by the ide, scsi and networking code in the past to
> determine if they should bounce payloads.  Now that the dma mapping
> implementations always have to support dma to all physical memory (thanks
> to swiotlb for non-iommu systems) there is no need for this crude hack
> any more.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> [...]
> diff --git a/arch/riscv/include/asm/pci.h b/arch/riscv/include/asm/pci.h
> index 0f2fc9ef20fc..b3638c505728 100644
> --- a/arch/riscv/include/asm/pci.h
> +++ b/arch/riscv/include/asm/pci.h
> @@ -26,9 +26,6 @@
>  /* RISC-V shim does not initialize PCI bus */
>  #define pcibios_assign_all_busses() 1
>
> -/* We do not have an IOMMU */
> -#define PCI_DMA_BUS_IS_PHYS 1
> -
>  extern int isa_dma_bridge_buggy;
>
>  #ifdef CONFIG_PCI

Thanks!

Acked-by: Palmer Dabbelt <palmer@sifive.com> (For the RISC-V change)

^ permalink raw reply

* [net-next v3] ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode
From: Ahmed Abdelsalam @ 2018-04-24 18:23 UTC (permalink / raw)
  To: davem, dav.lebrun, kuznet, yoshfuji, netdev, linux-kernel
  Cc: Ahmed Abdelsalam

ECMP (equal-cost multipath) hashes are typically computed on the packets'
5-tuple(src IP, dst IP, src port, dst port, L4 proto).

For encapsulated packets, the L4 data is not readily available and ECMP
hashing will often revert to (src IP, dst IP). This will lead to traffic
polarization on a single ECMP path, causing congestion and waste of network
capacity.

In IPv6, the 20-bit flow label field is also used as part of the ECMP hash.
In the lack of L4 data, the hashing will be on (src IP, dst IP, flow
label). Having a non-zero flow label is thus important for proper traffic
load balancing when L4 data is unavailable (i.e., when packets are
encapsulated).

Currently, the seg6_do_srh_encap() function extracts the original packet's
flow label and sets it as the outer IPv6 flow label. There are two issues
with this behaviour:

a) There is no guarantee that the inner flow label is set by the source.
b) If the original packet is not IPv6, the flow label will be set to
zero (e.g., IPv4 or L2 encap).

This patch adds a function, named seg6_make_flowlabel(), that computes a
flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
leverages the per-namespace 'seg6_flowlabel' sysctl value.

The currently supported behaviours are as follows:
-1 Set flowlabel to zero.
0 Copy flowlabel from the inner packet in case of inner IPv6
(set flowlabel to 0 in case of IPv4/L2)
1 Compute the flowlabel using seg6_make_flowlabel()
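
As an illustration only, the three modes above can be sketched in plain
userspace C. This is not the kernel code from the diff below: the helper
names (rotl32, sketch_outer_flowlabel) and their parameters are
hypothetical, the hash is passed in instead of coming from skb_get_hash(),
and the mask is in host byte order rather than the kernel's big-endian
IPV6_FLOWLABEL_MASK.

```c
#include <stdint.h>

#define FLOWLABEL_MASK 0x000FFFFFu /* 20-bit flow label, host byte order */

/* Rotate a 32-bit value left by `shift` bits (0 < shift < 32). */
static uint32_t rotl32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/*
 * Hypothetical helper mirroring the three modes listed above.
 *   mode         value of the seg6_flowlabel sysctl
 *   flow_hash    hash of the inner packet (stand-in for skb_get_hash())
 *   inner_is_v6  non-zero if the inner packet is IPv6
 *   inner_label  flow label of the inner IPv6 header, if any
 */
static uint32_t sketch_outer_flowlabel(int mode, uint32_t flow_hash,
				       int inner_is_v6, uint32_t inner_label)
{
	if (mode > 0)
		return rotl32(flow_hash, 16) & FLOWLABEL_MASK;
	if (mode == 0 && inner_is_v6)
		return inner_label & FLOWLABEL_MASK;
	return 0;
}
```

With mode 0 and an IPv4 or L2 payload the sketch falls through to zero,
which matches the parenthesised note in the list.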

This patch has been tested for IPv6, IPv4, and L2 traffic.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
---
 include/net/netns/ipv6.h   |  1 +
 net/ipv6/seg6_iptunnel.c   | 24 ++++++++++++++++++++++--
 net/ipv6/sysctl_net_ipv6.c |  8 ++++++++
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 97b3a54..c978a31 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
 	int max_hbh_opts_cnt;
 	int max_dst_opts_len;
 	int max_hbh_opts_len;
+	int seg6_flowlabel;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 5fe1394..9898926 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device *dev,
 	rcu_read_unlock();
 }
 
+/* Compute flowlabel for outer IPv6 header */
+static __be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
+				  struct ipv6hdr *inner_hdr)
+{
+	int do_flowlabel = net->ipv6.sysctl.seg6_flowlabel;
+	__be32 flowlabel = 0;
+	u32 hash;
+
+	if (do_flowlabel > 0) {
+		hash = skb_get_hash(skb);
+		hash = rol32(hash, 16);
+		flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
+	} else if (!do_flowlabel && skb->protocol == htons(ETH_P_IPV6)) {
+		flowlabel = ip6_flowlabel(inner_hdr);
+	}
+	return flowlabel;
+}
+
 /* encapsulate an IPv6 packet within an outer IPv6 header with a given SRH */
 int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 {
@@ -99,6 +117,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 	struct ipv6hdr *hdr, *inner_hdr;
 	struct ipv6_sr_hdr *isrh;
 	int hdrlen, tot_len, err;
+	__be32 flowlabel;
 
 	hdrlen = (osrh->hdrlen + 1) << 3;
 	tot_len = hdrlen + sizeof(*hdr);
@@ -119,12 +138,13 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 	 * decapsulation will overwrite inner hlim with outer hlim
 	 */
 
+	flowlabel = seg6_make_flowlabel(net, skb, inner_hdr);
 	if (skb->protocol == htons(ETH_P_IPV6)) {
 		ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
-			     ip6_flowlabel(inner_hdr));
+			     flowlabel);
 		hdr->hop_limit = inner_hdr->hop_limit;
 	} else {
-		ip6_flow_hdr(hdr, 0, 0);
+		ip6_flow_hdr(hdr, 0, flowlabel);
 		hdr->hop_limit = ip6_dst_hoplimit(skb_dst(skb));
 	}
 
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 6fbdef6..e15cd37 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -152,6 +152,13 @@ static struct ctl_table ipv6_table_template[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "seg6_flowlabel",
+		.data		= &init_net.ipv6.sysctl.seg6_flowlabel,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
@@ -217,6 +224,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
 	ipv6_table[12].data = &net->ipv6.sysctl.max_dst_opts_len;
 	ipv6_table[13].data = &net->ipv6.sysctl.max_hbh_opts_len;
 	ipv6_table[14].data = &net->ipv6.sysctl.multipath_hash_policy,
+	ipv6_table[15].data = &net->ipv6.sysctl.seg6_flowlabel;
 
 	ipv6_route_table = ipv6_route_sysctl_init(net);
 	if (!ipv6_route_table)
-- 
2.1.4

^ permalink raw reply related

* Re: [net-next v2] ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode
From: Ahmed Abdelsalam @ 2018-04-24 18:24 UTC (permalink / raw)
  To: Ahmed Abdelsalam
  Cc: davem, dav.lebrun, kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <1524592795-1467-1-git-send-email-amsalam20@gmail.com>

On Tue, 24 Apr 2018 19:59:55 +0200
Ahmed Abdelsalam <amsalam20@gmail.com> wrote:

> This patch has been tested for IPv6, IPv4, and L2 traffic.
> 
> Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
> ---
>  include/net/netns/ipv6.h   |  1 +
>  net/ipv6/seg6_iptunnel.c   | 24 ++++++++++++++++++++++--
>  net/ipv6/sysctl_net_ipv6.c |  8 ++++++++
>  3 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index 97b3a54..c978a31 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
>  	int max_hbh_opts_cnt;
>  	int max_dst_opts_len;
>  	int max_hbh_opts_len;
> +	int seg6_flowlabel;
>  };
>  
>  struct netns_ipv6 {
> diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
> index 5fe1394..3d9cd86 100644
> --- a/net/ipv6/seg6_iptunnel.c
> +++ b/net/ipv6/seg6_iptunnel.c
> @@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device *dev,
>  	rcu_read_unlock();
>  }
>  
> +/* Compute flowlabel for outer IPv6 header */
> +__be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
> +			   struct ipv6hdr *inner_hdr)

David, please take v3 of the patch.
I redefined seg6_make_flowlabel() as static to address the kbuild test
robot report.

-- 
Ahmed Abdelsalam <amsalam20@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support
From: Björn Töpel @ 2018-04-24 18:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-LAqKaC5U1ZR7ZNyZpmBonp9W68qp8z3v-P8cKFkXe4AA@mail.gmail.com>

2018-04-24 18:56 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> Here the actual receive functions of AF_XDP are implemented, that in a
>> later commit, will be called from the XDP layers.
>>
>> There's one set of functions for the XDP_DRV side and another for
>> XDP_SKB (generic).
>>
>> Support for the poll syscall is also implemented.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>
>> +/* Common functions operating for both RXTX and umem queues */
>> +
>> +static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
>> +{
>> +       u32 entries = q->prod_tail - q->cons_tail;
>> +
>> +       if (entries == 0) {
>> +               /* Refresh the local pointer */
>> +               q->prod_tail = READ_ONCE(q->ring->producer);
>> +       }
>> +
>> +       entries = q->prod_tail - q->cons_tail;
>
> Probably meant to be inside the branch? Though I see the same
> pattern in the userspace example program.
>

Yes! Nasty C&P going on here... :-(
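
A minimal userspace sketch of the suggested shape, with the recomputation
moved inside the branch. The struct is a simplified stand-in for
struct xsk_queue, a plain volatile read replaces READ_ONCE(), and the
final clamp to dcnt is an assumption, since the rest of the function is
not quoted above.

```c
#include <stdint.h>

/* Simplified stand-in for the queue state; only the fields used here. */
struct sketch_queue {
	uint32_t prod_tail;                /* cached copy of the producer index */
	uint32_t cons_tail;                /* local consumer index */
	const volatile uint32_t *producer; /* shared producer index */
};

static uint32_t sketch_nb_avail(struct sketch_queue *q, uint32_t dcnt)
{
	uint32_t entries = q->prod_tail - q->cons_tail;

	if (entries == 0) {
		/* Refresh the cached index, and recompute only in this case. */
		q->prod_tail = *q->producer;
		entries = q->prod_tail - q->cons_tail;
	}

	/* Assumed: report at most dcnt entries to the caller. */
	return entries > dcnt ? dcnt : entries;
}
```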

>> +static inline u32 *xskq_validate_id(struct xsk_queue *q)
>> +{
>> +       while (q->cons_tail != q->cons_head) {
>> +               struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>> +               unsigned int idx = q->cons_tail & q->ring_mask;
>> +
>> +               if (xskq_is_valid_id(q, ring->desc[idx]))
>> +                       return &ring->desc[idx];
>
> Missing a q->cons_tail increment in this loop?

Indeed! Good catch! Thanks!
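
Roughly what the fixed loop could look like, as a standalone sketch. The
struct, the validity rule (ids below a hypothetical nframes limit) and the
invalid counter are all assumptions made for illustration; only the
cons_tail increment reflects the point raised above.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_ring_queue {
	uint32_t cons_head;
	uint32_t cons_tail;
	uint32_t ring_mask; /* ring size - 1, ring size is a power of two */
	uint32_t nframes;   /* hypothetical: ids >= nframes are invalid */
	uint32_t invalid;   /* hypothetical counter of skipped ids */
	uint32_t *desc;
};

static uint32_t *sketch_validate_id(struct sketch_ring_queue *q)
{
	while (q->cons_tail != q->cons_head) {
		uint32_t idx = q->cons_tail & q->ring_mask;

		if (q->desc[idx] < q->nframes)
			return &q->desc[idx];

		/* Skip the invalid entry so the loop makes progress. */
		q->invalid++;
		q->cons_tail++;
	}

	return NULL;
}
```

Without the increment, a single invalid id at cons_tail would keep the
loop spinning on the same slot.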


Björn

^ permalink raw reply

* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Mikulas Patocka @ 2018-04-24 18:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, David Miller, Andrew Morton, linux-mm, eric.dumazet,
	edumazet, netdev, linux-kernel, mst, jasowang, virtualization,
	dm-devel, Vlastimil Babka, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim
In-Reply-To: <20180424171651.GC30577@bombadil.infradead.org>



On Tue, 24 Apr 2018, Matthew Wilcox wrote:

> On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> > 
> > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > Some bugs (such as buffer overflows) are better detected
> > > > with kmalloc code, so we must test the kmalloc path too.
> > > 
> > > Well now, this brings up another item for the collective TODO list --
> > > implement redzone checks for vmalloc.  Unless this is something already
> > > taken care of by kasan or similar.
> > 
> > The kmalloc overflow testing is also not ideal - it rounds the size up to 
> > the next slab size and detects buffer overflows only at this boundary.
> > 
> > Some time ago, I made a "kmalloc guard" patch that places a magic number
> > immediately after the requested size - so that it can detect overflows at
> > a byte boundary
> > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> > 
> > That patch found a bug in crypto code:
> > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
> 
> Is it still worth doing this, now we have kasan?

The kmalloc guard has much lower overhead than kasan.
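
For readers who have not seen the idea, a userspace sketch of a guard
value placed right after the requested size might look like this. It is
not the code from the linked patch: the header layout, the magic value
and all sketch_* names are hypothetical, and it ignores alignment beyond
size_t and overflow of the size computation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_GUARD_MAGIC 0x5a5aa5a5u /* arbitrary illustrative value */

/* Hypothetical header that remembers the size the caller asked for. */
struct sketch_guard_hdr {
	size_t size;
};

static void *sketch_guarded_alloc(size_t size)
{
	uint32_t magic = SKETCH_GUARD_MAGIC;
	struct sketch_guard_hdr *hdr;
	unsigned char *data;

	hdr = malloc(sizeof(*hdr) + size + sizeof(magic));
	if (!hdr)
		return NULL;

	hdr->size = size;
	data = (unsigned char *)(hdr + 1);
	/* The guard sits immediately after the requested bytes. */
	memcpy(data + size, &magic, sizeof(magic));
	return data;
}

/* Returns non-zero while the guard value is still untouched. */
static int sketch_guard_intact(const void *ptr)
{
	const struct sketch_guard_hdr *hdr =
		(const struct sketch_guard_hdr *)ptr - 1;
	uint32_t magic;

	memcpy(&magic, (const unsigned char *)ptr + hdr->size, sizeof(magic));
	return magic == SKETCH_GUARD_MAGIC;
}

static void sketch_guarded_free(void *ptr)
{
	if (ptr)
		free((struct sketch_guard_hdr *)ptr - 1);
}
```

The check costs one small compare per object, which is where the lower
overhead relative to full shadow-memory instrumentation comes from.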

(BTW. when I tried kasan, it oopsed with persistent memory)

Mikulas

^ permalink raw reply

* Re: [PATCH bpf-next 05/15] xsk: add support for bind for Rx
From: Björn Töpel @ 2018-04-24 18:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	michael.lundkvist, Brandeburg, Jesse, Singhai, Anjali,
	Zhang, Qi Z
In-Reply-To: <CAF=yD-J-KfXzEKNZPr55PP4c2gxziU=6nPKJ3sty1EB3quvUdA@mail.gmail.com>

2018-04-24 18:55 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Here, the bind syscall is added. Binding an AF_XDP socket, means
>> associating the socket to an umem, a netdev and a queue index. This
>> can be done in two ways.
>>
>> The first way, creating a "socket from scratch". Create the umem using
>> the XDP_UMEM_REG setsockopt and an associated fill queue with
>> XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
>> setsockopt. Call bind passing ifindex and queue index ("channel" in
>> ethtool speak).
>>
>> The second way to bind a socket, is simply skipping the
>> umem/netdev/queue index, and passing another already setup AF_XDP
>> socket. The new socket will then have the same umem/netdev/queue index
>> as the parent so it will share the same umem. You must also set the
>> flags field in the socket address to XDP_SHARED_UMEM.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> ---
>
>> +static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
>> +{
>> +       struct socket *sock;
>> +
>> +       *err = -ENOTSOCK;
>> +       sock = sockfd_lookup(fd, err);
>> +       if (!sock)
>> +               return NULL;
>> +
>> +       if (sock->sk->sk_family != PF_XDP) {
>> +               *err = -ENOPROTOOPT;
>> +               sockfd_put(sock);
>> +               return NULL;
>> +       }
>> +
>> +       *err = 0;
>> +       return sock;
>> +}
>
> In this and similar cases, can use ERR_PTR to avoid the extra argument.

Noted. Thanks!
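
For reference, the pattern being suggested, sketched in userspace. The
kernel helpers are ERR_PTR(), IS_ERR() and PTR_ERR() from linux/err.h; the
sketch_* versions below are simplified re-implementations, and the socket
table, the lookup function and the SKETCH_PF_XDP value are placeholders
rather than the real xsk code.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095
#define SKETCH_PF_XDP 44 /* placeholder family value for the sketch */

/* Encode a negative errno in the pointer value itself. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top SKETCH_MAX_ERRNO addresses. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_sock {
	int family;
};

/* Hypothetical lookup: no int *err argument, errors ride in the pointer. */
static struct sketch_sock *sketch_lookup(struct sketch_sock *table,
					 int nsocks, int fd)
{
	if (fd < 0 || fd >= nsocks)
		return sketch_err_ptr(-EBADF);
	if (table[fd].family != SKETCH_PF_XDP)
		return sketch_err_ptr(-ENOPROTOOPT);
	return &table[fd];
}
```

A caller then checks sketch_is_err() on the result and extracts the code
with sketch_ptr_err() instead of reading a separate error variable.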

^ permalink raw reply

* Re: [PATCH net-next 5/8] net: mscc: Add initial Ocelot switch support
From: Alexandre Belloni @ 2018-04-24 18:52 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, David S . Miller, Allan Nielsen,
	razvan.stefanescu, po.liu, Thomas Petazzoni, netdev, devicetree,
	linux-kernel, linux-mips
In-Reply-To: <20180330145008.GE28244@lunn.ch>

I realise now that I didn't reply to this comment:

On 30/03/2018 16:50:08+0200, Andrew Lunn wrote:
> > The fact is that ocelot doesn't have separate controls. The port is
> > either forwarding or not. If it is not forwarding, then there is nothing
> > to tell the HW to do.
> 
> Think about the following sequence:
> 
> ip link set lan0 up
> 
> After this command, i expect to see packets on lan0 arrive at the
> host, tcpdump to work, etc. This probably means the port is in
> 'forwarding' mode, or for B53, STP is disabled.
> 

On Ocelot, forwarding packets to the host (i.e. forwarding frames
received on the port to the cpu port) is separate from bridging ports
together. So after that command, the host can receive packets on lan0.


-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
From: Björn Töpel @ 2018-04-24 18:58 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S. Tsirkin, Network Development,
	Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <CAF=yD-+W8=wDXc1=wHi8KF0whgyFLqvo=tROOq16XfA7MDkR+Q@mail.gmail.com>

2018-04-24 18:56 GMT+02:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The xskmap is yet another BPF map, very much inspired by
>> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
>> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
>> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
>>
>> Note that a socket that is bound to certain ifindex/queue index will
>> *only* accept XDP frames from that netdev/queue index. If an XDP
>> program tries to redirect from a netdev/queue index other than what
>> the socket is bound to, the frame will not be received on the socket.
>>
>> A socket can reside in multiple maps.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>
>> +struct xsk_map_entry {
>> +       struct xdp_sock *xs;
>> +       struct rcu_head rcu;
>> +};
>
>> +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
>> +{
>> +       struct xsk_map *m = container_of(map, struct xsk_map, map);
>> +       struct xsk_map_entry *entry;
>> +
>> +       if (key >= map->max_entries)
>> +               return NULL;
>> +
>> +       entry = READ_ONCE(m->xsk_map[key]);
>> +       return entry ? entry->xs : NULL;
>> +}
>
> This dynamically allocated structure adds an extra cacheline lookup. If
> xdp_sock gets an rcu_head, it can be linked into the map directly.

Nice one! I'll try this out!
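
A rough sketch of the layout being proposed, in standalone C. The types
are simplified stand-ins (sketch_rcu_head is not the kernel's struct
rcu_head, and the queue_id field is invented); the point is only that the
map slot holds the socket pointer directly, so the lookup does one
dereference less.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct rcu_head. */
struct sketch_rcu_head {
	void (*func)(struct sketch_rcu_head *head);
};

struct sketch_xdp_sock {
	int queue_id;               /* hypothetical payload field */
	struct sketch_rcu_head rcu; /* embedded, so no wrapper entry is needed */
};

struct sketch_xsk_map {
	uint32_t max_entries;
	struct sketch_xdp_sock **xsk_map; /* slots point at sockets directly */
};

static struct sketch_xdp_sock *
sketch_lookup_elem(const struct sketch_xsk_map *m, uint32_t key)
{
	if (key >= m->max_entries)
		return NULL;

	/* Kernel code would read the slot with READ_ONCE(). */
	return m->xsk_map[key];
}
```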

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Paul Moore @ 2018-04-24 19:01 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <20180424020200.imonhbkwtb73luxl@madcap2.tricolour.ca>

On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-04-23 19:15, Paul Moore wrote:
>> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> > On 2018-04-18 19:47, Paul Moore wrote:
>> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > Implement the proc fs write to set the audit container ID of a process,
>> >> > emitting an AUDIT_CONTAINER record to document the event.
>> >> >
>> >> > This is a write from the container orchestrator task to a proc entry of
>> >> > the form /proc/PID/containerid where PID is the process ID of the newly
>> >> > created task that is to become the first task in a container, or an
>> >> > additional task added to a container.
>> >> >
>> >> > The write expects up to a u64 value (unset: 18446744073709551615).
>> >> >
>> >> > This will produce a record such as this:
>> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>> >> >
>> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
>> >> > the orchestrator while the "opid" field is the object's PID, the process
>> >> > being "contained".  Old and new container ID values are given in the
>> >> > "contid" fields, while res indicates its success.
>> >> >
>> >> > It is not permitted to self-set, unset or re-set the container ID.  A
>> >> > child inherits its parent's container ID, but then can be set only once
>> >> > after.
>> >> >
>> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> >> >
>> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
>> >> > ---
>> >> >  fs/proc/base.c             | 37 ++++++++++++++++++++
>> >> >  include/linux/audit.h      | 16 +++++++++
>> >> >  include/linux/init_task.h  |  4 ++-
>> >> >  include/linux/sched.h      |  1 +
>> >> >  include/uapi/linux/audit.h |  2 ++
>> >> >  kernel/auditsc.c           | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> >  6 files changed, 143 insertions(+), 1 deletion(-)

...

>> >> >  /* audit_rule_data supports filter rules with both integer and string
>> >> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> >> > index 4e0a4ac..29c8482 100644
>> >> > --- a/kernel/auditsc.c
>> >> > +++ b/kernel/auditsc.c
>> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
>> >> >         return rc;
>> >> >  }
>> >> >
>> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>> >> > +{
>> >> > +       struct task_struct *parent;
>> >> > +       u64 pcontainerid, ccontainerid;
>> >> > +
>> >> > +       /* Don't allow to set our own containerid */
>> >> > +       if (current == task)
>> >> > +               return -EPERM;
>> >>
>> >> Why not?  Is there some obvious security concern that I am missing?
>> >
>> > We then lose the distinction in the AUDIT_CONTAINER record between the
>> > initiating PID and the target PID.  This was outlined in the proposal.
>>
>> I just went back and reread the v3 proposal and I still don't see a
>> good explanation of this.  Why is this bad?  What's the security
>> concern?
>
> I don't remember, specifically.  Maybe this has been addressed by the
> check for children/threads or identical parent container ID.  So, I'm
> reluctantly willing to remove that check for now.

Okay.  For the record, if someone can explain to me why this
restriction saves us from some terrible situation I'm all for leaving
it.  I'm just opposed to restrictions without solid reasoning behind
them.

>> > Having said that, I'm still not sure we have protected sufficiently from
>> > a child turning around and setting its parent's as yet unset or
>> > inherited audit container ID.
>>
>> Yes, I believe we only want to let a task set the audit container for
>> its children (or itself/threads if we decide to allow that, see
>> above).  There *has* to be a function to check to see if a task is a
>> child of a given task ... right? ... although this is likely to be a
>> pointer traversal and locking nightmare ... hmmm.
>
> Isn't that just (struct task_struct)parent == (struct
> task_struct)child->parent (or ->real_parent)?
>
> And now that I say that, it is covered by the following patch's child
> check, so as long as we keep that, we should be fine.

I was thinking of checking not just current's immediate children, but
any of its descendants as I believe that is what we want to limit,
yes?  I just worry that it isn't really practical to perform that
check.
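
A minimal sketch of the descendant check discussed above, written as a userspace model. The struct and function names are hypothetical, not kernel APIs; in the kernel the walk over real_parent would presumably need RCU or tasklist_lock protection, which this model omits:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task node; not the kernel's struct task_struct. */
struct task_model {
	int pid;
	struct task_model *real_parent;   /* NULL at the root of the tree */
};

/*
 * Return true if 'task' is a strict descendant of 'ancestor'. The parent
 * chain is walked upward from 'task', so the cost is proportional to the
 * depth of 'task' in the tree rather than to the number of descendants.
 */
bool task_is_descendant_model(const struct task_model *ancestor,
			      const struct task_model *task)
{
	const struct task_model *t;

	if (!ancestor || !task)
		return false;
	for (t = task->real_parent; t; t = t->real_parent) {
		if (t == ancestor)
			return true;
	}
	return false;
}
```

Walking upward from the target avoids enumerating every child of current, which is the part that looks impractical; the open question in the thread is the locking, not the traversal itself.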

>> >> I ask because I suppose it might be possible for some container
>> >> runtime to do a fork, set up some of the environment and then exec the
>> >> container (before you answer the obvious "namespaces!" please remember
>> >> we're not trying to define containers).
>> >
>> > I don't think namespaces have any bearing on this concern since none are
>> > required.
>> >
>> >> > +       /* Don't allow the containerid to be unset */
>> >> > +       if (!cid_valid(containerid))
>> >> > +               return -EINVAL;
>> >> > +       /* if we don't have caps, reject */
>> >> > +       if (!capable(CAP_AUDIT_CONTROL))
>> >> > +               return -EPERM;
>> >> > +       /* if containerid is unset, allow */
>> >> > +       if (!audit_containerid_set(task))
>> >> > +               return 0;
>> >> > +       /* it is already set, and not inherited from the parent, reject */
>> >> > +       ccontainerid = audit_get_containerid(task);
>> >> > +       rcu_read_lock();
>> >> > +       parent = rcu_dereference(task->real_parent);
>> >> > +       rcu_read_unlock();
>> >> > +       task_lock(parent);
>> >> > +       pcontainerid = audit_get_containerid(parent);
>> >> > +       task_unlock(parent);
>> >> > +       if (ccontainerid != pcontainerid)
>> >> > +               return -EPERM;
>> >> > +       return 0;

I'm looking at the parent checks again and I wonder if the logic above
is what we really want.  Maybe it is, but I'm not sure.

Things I'm wondering about:

* "ccontainerid" and "containerid" are too close in name, I kept
confusing myself when looking at this code.  Please change one.  Bonus
points if it is shorter.

* What if the orchestrator wants to move the task to a new container?
Right now it looks like you can only do that once, then the
task's audit container ID will no longer be the same as real_parent
... or does the orchestrator change that?  *Can* the orchestrator
change real_parent (I suspect the answer is "no")?

* I think the key is the relationship between current and task, not
between task and task->real_parent.  I believe what we really care
about is that task is a descendant of current.  We might also want to
allow current to change the audit container ID if it holds
CAP_AUDIT_CONTROL, regardless of its relationship with task.

>> >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
>> >> > +                                     u64 containerid, int rc)
>> >> > +{
>> >> > +       struct audit_buffer *ab;
>> >> > +       uid_t uid;
>> >> > +       struct tty_struct *tty;
>> >> > +
>> >> > +       if (!audit_enabled)
>> >> > +               return;
>> >> > +
>> >> > +       ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
>> >> > +       if (!ab)
>> >> > +               return;
>> >> > +
>> >> > +       uid = from_kuid(&init_user_ns, task_uid(current));
>> >> > +       tty = audit_get_tty(current);
>> >> > +
>> >> > +       audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
>> >> > +       audit_log_task_context(ab);
>> >> > +       audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
>> >> > +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
>> >> > +                        tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
>> >> > +                        task_tgid_nr(task), oldcontainerid, containerid, !rc);
>> >> > +
>> >> > +       audit_put_tty(tty);
>> >> > +       audit_log_end(ab);
>> >> > +}
>> >> > +
>> >> > +/**
>> >> > + * audit_set_containerid - set current task's audit_context containerid
>> >> > + * @containerid: containerid value
>> >> > + *
>> >> > + * Returns 0 on success, -EPERM on permission failure.
>> >> > + *
>> >> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
>> >> > + */
>> >> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
>> >> > +{
>> >> > +       u64 oldcontainerid;
>> >> > +       int rc;
>> >> > +
>> >> > +       oldcontainerid = audit_get_containerid(task);
>> >> > +
>> >> > +       rc = audit_set_containerid_perm(task, containerid);
>> >> > +       if (!rc) {
>> >> > +               task_lock(task);
>> >> > +               task->containerid = containerid;
>> >> > +               task_unlock(task);
>> >> > +       }
>> >> > +
>> >> > +       audit_log_set_containerid(task, oldcontainerid, containerid, rc);
>> >> > +       return rc;
>> >>
>> >> Why are audit_set_containerid_perm() and audit_log_containerid()
>> >> separate functions?
>> >
>> > (I assume you mean audit_log_set_containerid()?)
>>
>> Yep.  My fingers got tired typing in that function name and decided a
>> shortcut was necessary.
>>
>> > It seemed clearer that all the permission checking was in one function
>> > and its return code could be used to report the outcome when logging the
>> > (attempted) action.  This is the same structure as audit_set_loginuid()
>> > and it made sense.
>>
>> When possible I really like it when the permission checks are in the
>> same function as the code which does the work; it's less likely to get
>> abused that way (you have to willfully bypass the access checks).  The
>> exceptions might be if you wanted to reuse the access control code, or
>> insert a modular access mechanism (e.g. LSMs).
>
> I don't follow how it could be abused.  The return code from the perm
> check gates setting the value and is used in the success field in the
> log.

If the permission checks are in the same function body as the code
which does the work you have to either split the function, or rewrite
it, if you want to bypass the permission checks.  It may be more of a
style issue than an actual safety issue, but the comments about
single-use functions in the same scope are the tie breaker.
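
A minimal userspace sketch of the structure argued for above, with the permission checks and the assignment in one function body. Everything here is a hypothetical model: the has_cap flag stands in for capable(CAP_AUDIT_CONTROL), the "set once" rule is a simplification that ignores the inherited-from-parent case in the quoted patch, and logged_res mirrors the patch's res=!rc field:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Matches "unset: 18446744073709551615" in the patch description. */
#define CID_UNSET_MODEL UINT64_MAX

/* Hypothetical stand-in for the per-task audit container ID. */
struct ctask_model {
	uint64_t contid;
};

/*
 * Check and set in a single function: a caller cannot reach the
 * assignment without passing through the checks above it.
 */
int set_contid_model(struct ctask_model *task, uint64_t contid,
		     bool has_cap, int *logged_res)
{
	int rc = 0;

	if (contid == CID_UNSET_MODEL)
		rc = -EINVAL;   /* the ID may not be unset */
	else if (!has_cap)
		rc = -EPERM;    /* caller lacks the capability */
	else if (task->contid != CID_UNSET_MODEL)
		rc = -EPERM;    /* simplification: already set, reject */

	if (!rc)
		task->contid = contid;

	*logged_res = !rc;      /* value an audit record would report */
	return rc;
}
```

The return code still gates the assignment and feeds the log field, as in the split version; the difference is only that bypassing the checks now requires editing this function.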

-- 
paul moore
www.paul-moore.com


* Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: David Miller @ 2018-04-24 19:03 UTC (permalink / raw)
  To: yihung.wei; +Cc: pshelar, netdev, fw
In-Reply-To: <CAG1aQhKtZ_4AYuKBTzEwG1YwUr9sFchcyh+eWXB_i64GSW_Z8A@mail.gmail.com>

From: Yi-Hung Wei <yihung.wei@gmail.com>
Date: Tue, 24 Apr 2018 11:21:33 -0700

> On Tue, Apr 24, 2018 at 10:42 AM, David Miller <davem@davemloft.net> wrote:
>> From: Pravin Shelar <pshelar@ovn.org>
>> Date: Mon, 23 Apr 2018 23:34:48 -0700
>>
>>> OK. Thanks for the info.
>>
>> So, ACK, Reviewed-by, etc.? :-)
>>
> 
> Pravin provided feedback in a previous email.  I will address it and
> send out v3.

Aha, I see, thanks for explaining.

