* Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
From: Yang Yingliang @ 2016-04-12 2:59 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, davem, Ding Tianhong
In-Reply-To: <1460376828.6473.538.camel@edumazet-glaptop3.roam.corp.google.com>
On 2016/4/11 20:13, Eric Dumazet wrote:
> On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
>>
>> On 2016/4/8 22:44, Eric Dumazet wrote:
>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>
>>>> I expand tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>
>>> Try :
>>>
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> And restart your flows.
>>>
>> cat /proc/sys/net/ipv4/tcp_rmem
>> 10240 2097152 10485760
>
> What about leaving the default values ?
I tried, it did not work.
>
> $ cat /proc/sys/net/ipv4/tcp_rmem
> 4096 87380 6291456
>
>>
>> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>
>> It seems has not effect.
>>
>
> I have no idea what you did on the sender side to allow it to send more
> than 1.5 MB then.
We are doing performance test. The sender send 256KB per-block with 128
threads to one socket. And the receiver uses 10Gb NIC to handle the
data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi.
I added some debug messages and found handling backlog packets in
__release_sock() cost about 11ms at most. This can cause backlog queue
overflow. The sk_data_ready is re-assigned, it may cost time in our
program. I will check it out.
^ permalink raw reply
* Re: [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY
From: Chen-Yu Tsai @ 2016-04-12 1:38 UTC (permalink / raw)
To: Florian Fainelli
Cc: Chen-Yu Tsai, Maxime Ripard, Rob Herring, Pawel Moll,
Mark Rutland, Ian Campbell, Kumar Gala, netdev, linux-arm-kernel,
linux-kernel, devicetree, LABBE Corentin, Andrew Lunn
In-Reply-To: <570BF9B0.2090602@gmail.com>
On Tue, Apr 12, 2016 at 3:23 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 04/04/16 09:22, Chen-Yu Tsai wrote:
>> The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
>> configured through a memory mapped hardware register.
>>
>> This same register also configures the MAC interface mode and TX clock
>> source. Also covered by the register, but not supported in these bindings,
>> are TX/RX clock delay chains and inverters, and an RMII module.
>>
>> Signed-off-by: Chen-Yu Tsai <wens@csie.org>
>> ---
>> .../bindings/net/allwinner,sun8i-h3-ephy.txt | 44 ++++++++++++++++++++++
>> 1 file changed, 44 insertions(+)
>> create mode 100644 Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>> new file mode 100644
>> index 000000000000..146f227e6d88
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>> @@ -0,0 +1,44 @@
>> +* Allwinner H3 E(thernet) PHY
>> +
>> +The Allwinner H3 integrates an MII ethernet PHY. As with external PHYs,
>> +before it can be configured over the MDIO bus and used, certain hardware
>> +features must be configured, such as the PHY address and LED polarity.
>
> Is the internal PHY address really configurable? Not that there is
> anything wrong with it, this is good.
It is. Things that are configured or provided to a discrete PHY are routed
to registers in the SoC, things such as PHY address, clocks, resets.
>> +The PHY must also be powered on and brought out of reset.
>> +
>> +This is accomplished with regulators and pull-up/downs for external PHYs.
>> +For this internal PHY, a hardware register is programmed.
>> +
>> +The same hardware register also contains clock and interface controls
>> +for the MAC. This is also present in earlier SoCs, and is covered by
>> +"allwinner,sun7i-a20-gmac-clk". The controls in the H3 are slightly
>> +different due to the inclusion of what appears to be an RMII-MII
>> +bridge.
>> +
>> +Required properties:
>> +- compatible: should be "allwinner,sun8i-h3-ephy"
>> +- reg: address and length of the register set for the device
>> +- clocks: A phandle to the reference clock for this device
>> +- resets: A phandle to the reset control for this device
>> +
>> +Ethernet PHY related properties:
>> +- allwinner,ephy-addr: the MDIO bus address the PHY should respond to.
>> + If this is not set, the external PHY is used, and
>> + everything else in this section is ignored.
>
> So we are going to put the same value here, and in the actual Ethernet
> PHY device tree node that should be hanging off the EMAC/MDIO bus
> controller, this is confusing and error prone.
Yes, that would be an issue when writing the DTS.
>
>> +- allwinner,leds-active-low: LEDs are active low. Without this, LEDs are
>> + active high.
>> +
>> +Ethernet MAC clock related properties:
>> +- #clock-cells: should be 0
>> +- clock-output-names: "mac_tx"
>> +
>> +Example:
>> +
>> +ethernet-phy@01c00030 {
>> + compatible = "allwinner,sun8i-h3-ephy";
>> + reg = <0x01c00030 0x4>;
>
> Looking at this register space this looks more like an internal PHY SHIM
> that is needed to be configured before the internal PHY can be access,
> not a proper Ethernet PHY per-se, see replies in aptch 2.
>
> Should not this block be a second cell associated with the Ethernet MAC
> block? One or the other are not going to be very useful without
> knowledge of each other.
True. However the lower half of the same register also controls the
MAC interface mode and TX clock source and delays. This we had a clock
driver that was used in conjuction with stmmac on earlier SoCs. I was
hoping to keep that model with the newer EMAC. At the time it was
argued that what seemed like a clock should be handled by a clock
driver, instead of just a "syscon". If this is reaching too far to
handle this new use case, I will happily just provide patches to merge
this into the MAC.
I would like to know how to deal with things like a PHY requiring
some sort of shim driver, be it an internal one, or an external mfd
chip that happens to have an Ethernet PHY included? How do we tie
this into the PHY node under the MDIO bus?
Regards
ChenYu
^ permalink raw reply
* Re: [PATCH] devlink: add missing install of header
From: David Miller @ 2016-04-12 1:34 UTC (permalink / raw)
To: stephen; +Cc: jiri, netdev
In-Reply-To: <1460407148-24228-1-git-send-email-stephen@networkplumber.org>
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Mon, 11 Apr 2016 13:39:08 -0700
> The new devlink.h in uapi was not being installed by
> make headers_install
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Applied, thanks Stephen.
I wish we could just have everything under uapi/linux automatically
get this treatment, but I guess there is a very good reason for the
current behavior.
^ permalink raw reply
* Re: Configuring ethernet link fails with No such device
From: David Miller @ 2016-04-12 1:29 UTC (permalink / raw)
To: stefan
Cc: systemd-devel, netdev, fabio.estevam, l.stach, u.kleine-koenig,
bryan.wu, bob.ham
In-Reply-To: <67b662523c63277fa95a59472cbf018f@agner.ch>
From: Stefan Agner <stefan@agner.ch>
Date: Mon, 11 Apr 2016 15:46:08 -0700
> What is the expectation/definition when link configuration should be
> possible? Only after the network device got opened or before?
Only after it is open. Drivers almost always have the entire chip in
powerdown state when it is not open, so we wouldn't be able to
properly do link settings even if we wanted to when the device is
closed.
^ permalink raw reply
* Re: [PATCH net-next] net: mdio: Fix lockdep falls positive splat
From: David Miller @ 2016-04-12 1:27 UTC (permalink / raw)
To: andrew; +Cc: f.fainelli, netdev, vivien.didelot
In-Reply-To: <1460403605-31882-1-git-send-email-andrew@lunn.ch>
From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 11 Apr 2016 21:40:05 +0200
> MDIO devices can be stacked upon each other. The current code supports
> two levels, which until recently has been enough for a DSA mdio bus on
> top of another bus. Now we have hardware which has an MDIO mux in the
> middle.
>
> Define an MDIO MUTEX class with three levels.
>
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Applied, thanks Andrew.
^ permalink raw reply
* Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
From: David Miller @ 2016-04-12 1:26 UTC (permalink / raw)
To: dhowells; +Cc: linux-afs, netdev, linux-kernel
In-Reply-To: <15783.1460412099@warthog.procyon.org.uk>
From: David Howells <dhowells@redhat.com>
Date: Mon, 11 Apr 2016 23:01:39 +0100
> David Miller <davem@davemloft.net> wrote:
>
>> Series applied, but I had to fix up some trailing whitespace (reported by
>> GIT) by hand.
>
> Do you perchance have a git hook script for checking for trailing whitespace?
Just try to apply the series yourself with "git am", that's what I do.
You can do "git apply --check --whitespace=error-all" on individual
patches, but that doesn't work for a series.
^ permalink raw reply
* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Eric Dumazet @ 2016-04-11 23:48 UTC (permalink / raw)
To: Cong Wang
Cc: Lars Persson, Lars Persson, Linux Kernel Network Developers,
Jamal Hadi Salim, LKML
In-Reply-To: <CAM_iQpVid8=LeO31p_ZiPQkhWtgCuioUzfCdgDPatFxkWrkbCg@mail.gmail.com>
On Mon, 2016-04-11 at 16:19 -0700, Cong Wang wrote:
> My point is, for example, in OOM case, we don't know processin
> more SKB would make it better or worse. Maybe we really need to
> check the error code to decide to continue to exit?
Really, given this bug has been there for a long time (v3.18 ???), I
doubt it matters.
Nothing can tell that following packets in the qdisc need any
transformation, and memory allocations.
So I would just fix the bug in the simplest way.
__qdisc_run() has all the checks needed to yield when needed
(if (quota <= 0 || need_resched())) , no need to add more complexity
there.
^ permalink raw reply
* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Cong Wang @ 2016-04-11 23:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Lars Persson, Lars Persson, Linux Kernel Network Developers,
Jamal Hadi Salim, LKML
In-Reply-To: <1460399456.6473.566.camel@edumazet-glaptop3.roam.corp.google.com>
On Mon, Apr 11, 2016 at 11:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-04-11 at 11:26 -0700, Eric Dumazet wrote:
>> On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote:
>>
>> > I am fine with either way as long as the loop stops on failure.
>
>
> Note that skb that could not be validated is already freed.
>
> So I do not see any value from stopping the loop, since
> we need to schedule the queue to avoid tx hang.
>
> Just process following skb if there is one, fact that skb is sent or
> dropped does not matter.
My point is, for example, in OOM case, we don't know processing
more SKB would make it better or worse. Maybe we really need to
check the error code to decide to continue to exit?
^ permalink raw reply
* Re: Configuring ethernet link fails with No such device
From: Florian Fainelli @ 2016-04-11 23:10 UTC (permalink / raw)
To: Stefan Agner, systemd-devel, netdev, davem
Cc: fabio.estevam, l.stach, u.kleine-koenig, bryan.wu, bob.ham,
Andrew Lunn
In-Reply-To: <67b662523c63277fa95a59472cbf018f@agner.ch>
On 11/04/16 15:46, Stefan Agner wrote:
> Hi All,
>
> I traced an issue in which we tried configuring duplex mode and speed in
> a systemd-udevd .link file failed with the following error:
> "Could not set speed or duplex of eth0 to 0 Mbps (half): No such device"
>
> The FEC driver (fec_main.c) does not initialize phy_dev until the device
> has been opened, and therefor the callback fec_enet_(get|set)_settings
> returns -19.
>
> What is the expectation/definition when link configuration should be
> possible? Only after the network device got opened or before?
It seems to me like systemd tries to configure the device while we are
still opening it, which is arguably something that can be done, but you
are in a transient state, the interface is not completely ready yet.
>
> Or in other words: Is this a Kernel or systemd issue? At what point is
> link configuration "guaranteed" to work?
Past ndo_open() had a chance to return, therefore connect to the PHY
device, is when the configuration/information querying is able to succeed.
>
> It seems that the FEC driver used to probe the PHY earlier, it has been
> moved to the open callback with commit 418bd0d4df ("netdev/fec: fix
> ifconfig eth0 down hang issue")...
Part of the reason for doing that, is that probing for the PHY during
the driver's probe function could leave you with unused HW that is
powered on for no good reason. Until we get into ndo_open, we have no
clue if the interface is going to be used or not, so it is better to
move part of the HW initialization as late as possible in the process.
>
> There has been a similar report on the systemd-devel mailing list a
> while ago, probably the same underlying issue:
> https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html
>
> --
> Stefan
>
--
Florian
^ permalink raw reply
* Configuring ethernet link fails with No such device
From: Stefan Agner @ 2016-04-11 22:46 UTC (permalink / raw)
To: systemd-devel, netdev, davem
Cc: fabio.estevam, l.stach, u.kleine-koenig, bryan.wu, bob.ham
Hi All,
I traced an issue in which we tried configuring duplex mode and speed in
a systemd-udevd .link file failed with the following error:
"Could not set speed or duplex of eth0 to 0 Mbps (half): No such device"
The FEC driver (fec_main.c) does not initialize phy_dev until the device
has been opened, and therefor the callback fec_enet_(get|set)_settings
returns -19.
What is the expectation/definition when link configuration should be
possible? Only after the network device got opened or before?
Or in other words: Is this a Kernel or systemd issue? At what point is
link configuration "guaranteed" to work?
It seems that the FEC driver used to probe the PHY earlier, it has been
moved to the open callback with commit 418bd0d4df ("netdev/fec: fix
ifconfig eth0 down hang issue")...
There has been a similar report on the systemd-devel mailing list a
while ago, probably the same underlying issue:
https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html
^ permalink raw reply
* Re: [PATCH RFT 1/2] phylib: add device reset GPIO support
From: Rob Herring @ 2016-04-11 22:46 UTC (permalink / raw)
To: Sergei Shtylyov
Cc: Grant Likely, devicetree@vger.kernel.org, Florian Fainelli,
netdev, Frank Rowand, Pawel Moll, Mark Rutland, Ian Campbell,
Kumar Gala, linux-kernel@vger.kernel.org
In-Reply-To: <570BFACF.30507@cogentembedded.com>
On Mon, Apr 11, 2016 at 2:28 PM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> On 04/11/2016 10:25 PM, Rob Herring wrote:
>
>>> The PHY devices sometimes do have their reset signal (maybe even power
>>> supply?) tied to some GPIO and sometimes it also does happen that a boot
>>> loader does not leave it deasserted. So far this issue has been attacked
>>> from (as I believe) a wrong angle: by teaching the MAC driver to
>>> manipulate
>>> the GPIO in question; that solution, when applied to the device trees,
>>> led to adding the PHY reset GPIO properties to the MAC device node, with
>>> one exception: Cadence MACB driver which could handle the "reset-gpios"
>>> prop in a PHY device subnode. I believe that the correct approach is
>>> to
>>> teach the 'phylib' to get the MDIO device reset GPIO from the device tree
>>> node corresponding to this device -- which this patch is doing...
>>>
>>> Note that I had to modify the AT803x PHY driver as it would stop working
>>> otherwise as it made use of the reset GPIO for its own purposes...
>>
>>
>> Lots of double spaces in here. Please fix.
>
>
> Oh, it's you again! :-D
Yep, one of those picky kernel maintainers that like a bad rash just
won't go away. :)
>>> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>>>
>>> ---
>>> Documentation/devicetree/bindings/net/phy.txt | 2 +
>>> drivers/net/phy/at803x.c | 19 ++------------
>>> drivers/net/phy/mdio_bus.c | 4 +++
>>> drivers/net/phy/mdio_device.c | 27
>>> +++++++++++++++++++--
>>> drivers/net/phy/phy_device.c | 33
>>> ++++++++++++++++++++++++--
>>> drivers/of/of_mdio.c | 16 ++++++++++++
>>> include/linux/mdio.h | 3 ++
>>> include/linux/phy.h | 5 +++
>>> 8 files changed, 89 insertions(+), 20 deletions(-)
>>
>>
>> [...]
>>
>>> Index: net-next/drivers/of/of_mdio.c
>>> ===================================================================
>>> --- net-next.orig/drivers/of/of_mdio.c
>>> +++ net-next/drivers/of/of_mdio.c
>>> @@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n
>>> static int of_mdiobus_register_phy(struct mii_bus *mdio, struct
>>> device_node *child,
>>> u32 addr)
>>> {
>>> + struct gpio_desc *gpiod;
>>> struct phy_device *phy;
>>> bool is_c45;
>>> int rc;
>>> @@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc
>>> is_c45 = of_device_is_compatible(child,
>>> "ethernet-phy-ieee802.3-c45");
>>>
>>> + gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");
>>
>>
>> Calling fwnode_* functions in a DT specific file/function? That doesn't
>> make sense.
>
>
> Really?! 8-)
> Where is a DT-only analog I wonder...
Ah, you're right. NM.
Rob
^ permalink raw reply
* Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
From: David Howells @ 2016-04-11 22:29 UTC (permalink / raw)
Cc: dhowells, David Miller, linux-afs, netdev, linux-kernel
In-Reply-To: <15783.1460412099@warthog.procyon.org.uk>
David Howells <dhowells@redhat.com> wrote:
> > Series applied, but I had to fix up some trailing whitespace (reported by
> > GIT) by hand.
>
> Do you perchance have a git hook script for checking for trailing whitespace?
No matter - there's already such a script installed by git clone - I just have
to enable it.
David
^ permalink raw reply
* [PATCH v2 net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.
Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph). It is because a UDP socket can become
connected after sending out some datagrams in un-connected state. or
It can be connected multiple times to different destinations. Hence,
iph may not be related to where sk is currently connected to.
It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect() (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.
For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.
Test:
Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium
A simple python code to create a connected UDP socket:
import socket
import errno
HOST = '2fac::1'
PORT = 8080
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()
Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300
Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
cache expires 463sec mtu 1300 pref medium
Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message. Here is the scapy script I have
used:
>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
include/net/ipv6.h | 1 +
net/ipv6/datagram.c | 20 +++++++++++---------
net/ipv6/route.c | 12 ++++++++++++
3 files changed, 24 insertions(+), 9 deletions(-)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index d0aeb97..fd02e90 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -959,6 +959,7 @@ int compat_ipv6_getsockopt(struct sock *sk, int level, int optname,
int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
int addr_len);
+int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 669585e..59e01f27 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -64,7 +64,7 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk)
security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
}
-static int ip6_datagram_dst_update(struct sock *sk)
+int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr)
{
struct ip6_flowlabel *flowlabel = NULL;
struct in6_addr *final_p, final;
@@ -93,14 +93,16 @@ static int ip6_datagram_dst_update(struct sock *sk)
goto out;
}
- if (ipv6_addr_any(&np->saddr))
- np->saddr = fl6.saddr;
+ if (fix_sk_saddr) {
+ if (ipv6_addr_any(&np->saddr))
+ np->saddr = fl6.saddr;
- if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
- sk->sk_v6_rcv_saddr = fl6.saddr;
- inet->inet_rcv_saddr = LOOPBACK4_IPV6;
- if (sk->sk_prot->rehash)
- sk->sk_prot->rehash(sk);
+ if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
+ sk->sk_v6_rcv_saddr = fl6.saddr;
+ inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+ if (sk->sk_prot->rehash)
+ sk->sk_prot->rehash(sk);
+ }
}
ip6_dst_store(sk, dst,
@@ -221,7 +223,7 @@ ipv4_connected:
* destination cache for it.
*/
- err = ip6_datagram_dst_update(sk);
+ err = ip6_datagram_dst_update(sk, true);
if (err)
goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1d8871a..d916d6a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1418,8 +1418,20 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu);
void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
{
+ struct dst_entry *dst;
+
ip6_update_pmtu(skb, sock_net(sk), mtu,
sk->sk_bound_dev_if, sk->sk_mark);
+
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst->obsolete ||
+ dst->ops->check(dst, inet6_sk(sk)->dst_cookie))
+ return;
+
+ bh_lock_sock(sk);
+ if (!sock_owned_by_user(sk) && !ipv6_addr_v4mapped(&sk->sk_v6_daddr))
+ ip6_datagram_dst_update(sk, false);
+ bh_unlock_sock(sk);
}
EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
--
2.5.1
^ permalink raw reply related
* [PATCH v2 net 4/4] ipv6: udp: Do a route lookup and update during release_cb
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>
This patch adds a release_cb for UDPv6. It does a route lookup
and updates sk->sk_dst_cache if it is needed. It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.
It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
include/net/ipv6.h | 1 +
net/ipv6/datagram.c | 20 ++++++++++++++++++++
net/ipv6/udp.c | 1 +
3 files changed, 22 insertions(+)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index fd02e90..1be050a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -960,6 +960,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
int addr_len);
int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
+void ip6_datagram_release_cb(struct sock *sk);
int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 59e01f27..9dd3882 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -119,6 +119,26 @@ out:
return err;
}
+void ip6_datagram_release_cb(struct sock *sk)
+{
+ struct dst_entry *dst;
+
+ if (ipv6_addr_v4mapped(&sk->sk_v6_daddr))
+ return;
+
+ rcu_read_lock();
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst->obsolete ||
+ dst->ops->check(dst, inet6_sk(sk)->dst_cookie)) {
+ rcu_read_unlock();
+ return;
+ }
+ rcu_read_unlock();
+
+ ip6_datagram_dst_update(sk, false);
+}
+EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
+
static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8125931..6bc5c66 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1539,6 +1539,7 @@ struct proto udpv6_prot = {
.sendmsg = udpv6_sendmsg,
.recvmsg = udpv6_recvmsg,
.backlog_rcv = __udpv6_queue_rcv_skb,
+ .release_cb = ip6_datagram_release_cb,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,
.rehash = udp_v6_rehash,
--
2.5.1
^ permalink raw reply related
* [PATCH v2 net 2/4] ipv6: datagram: Refactor dst lookup and update codes to a new function
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>
This patch moves the route lookup and update codes for connected
datagram sk to a newly created function ip6_datagram_dst_update()
It will be reused during the pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
net/ipv6/datagram.c | 103 +++++++++++++++++++++++++++++-----------------------
1 file changed, 57 insertions(+), 46 deletions(-)
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index f07c1dd..669585e 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -64,16 +64,65 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk)
security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
}
+static int ip6_datagram_dst_update(struct sock *sk)
+{
+ struct ip6_flowlabel *flowlabel = NULL;
+ struct in6_addr *final_p, final;
+ struct ipv6_txoptions *opt;
+ struct dst_entry *dst;
+ struct inet_sock *inet = inet_sk(sk);
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct flowi6 fl6;
+ int err = 0;
+
+ if (np->sndflow && (np->flow_label & IPV6_FLOWLABEL_MASK)) {
+ flowlabel = fl6_sock_lookup(sk, np->flow_label);
+ if (!flowlabel)
+ return -EINVAL;
+ }
+ ip6_datagram_flow_key_init(&fl6, sk);
+
+ rcu_read_lock();
+ opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
+ final_p = fl6_update_dst(&fl6, opt, &final);
+ rcu_read_unlock();
+
+ dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
+ if (IS_ERR(dst)) {
+ err = PTR_ERR(dst);
+ goto out;
+ }
+
+ if (ipv6_addr_any(&np->saddr))
+ np->saddr = fl6.saddr;
+
+ if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
+ sk->sk_v6_rcv_saddr = fl6.saddr;
+ inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+ if (sk->sk_prot->rehash)
+ sk->sk_prot->rehash(sk);
+ }
+
+ ip6_dst_store(sk, dst,
+ ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
+ &sk->sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
+ &np->saddr :
+#endif
+ NULL);
+
+out:
+ fl6_sock_release(flowlabel);
+ return err;
+}
+
static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
struct inet_sock *inet = inet_sk(sk);
struct ipv6_pinfo *np = inet6_sk(sk);
- struct in6_addr *daddr, *final_p, final;
- struct dst_entry *dst;
- struct flowi6 fl6;
- struct ip6_flowlabel *flowlabel = NULL;
- struct ipv6_txoptions *opt;
+ struct in6_addr *daddr;
int addr_type;
int err;
__be32 fl6_flowlabel = 0;
@@ -91,14 +140,8 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
if (usin->sin6_family != AF_INET6)
return -EAFNOSUPPORT;
- if (np->sndflow) {
+ if (np->sndflow)
fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
- if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) {
- flowlabel = fl6_sock_lookup(sk, fl6_flowlabel);
- if (!flowlabel)
- return -EINVAL;
- }
- }
addr_type = ipv6_addr_type(&usin->sin6_addr);
@@ -178,45 +221,13 @@ ipv4_connected:
* destination cache for it.
*/
- ip6_datagram_flow_key_init(&fl6, sk);
-
- rcu_read_lock();
- opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
- final_p = fl6_update_dst(&fl6, opt, &final);
- rcu_read_unlock();
-
- dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
- err = 0;
- if (IS_ERR(dst)) {
- err = PTR_ERR(dst);
+ err = ip6_datagram_dst_update(sk);
+ if (err)
goto out;
- }
-
- /* source address lookup done in ip6_dst_lookup */
-
- if (ipv6_addr_any(&np->saddr))
- np->saddr = fl6.saddr;
-
- if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
- sk->sk_v6_rcv_saddr = fl6.saddr;
- inet->inet_rcv_saddr = LOOPBACK4_IPV6;
- if (sk->sk_prot->rehash)
- sk->sk_prot->rehash(sk);
- }
-
- ip6_dst_store(sk, dst,
- ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
- &sk->sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
- &np->saddr :
-#endif
- NULL);
sk->sk_state = TCP_ESTABLISHED;
sk_set_txhash(sk);
out:
- fl6_sock_release(flowlabel);
return err;
}
--
2.5.1
^ permalink raw reply related
* [PATCH v2 net 1/4] ipv6: datagram: Refactor flowi6 init codes to a new function
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>
Move flowi6 init codes for connected datagram sk to a newly created
function ip6_datagram_flow_key_init().
Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
(addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()
This new function will be reused during pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
net/ipv6/datagram.c | 50 ++++++++++++++++++++++++++++++--------------------
1 file changed, 30 insertions(+), 20 deletions(-)
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 4281621..f07c1dd 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -40,6 +40,30 @@ static bool ipv6_mapped_addr_any(const struct in6_addr *a)
return ipv6_addr_v4mapped(a) && (a->s6_addr32[3] == 0);
}
+static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk)
+{
+ struct inet_sock *inet = inet_sk(sk);
+ struct ipv6_pinfo *np = inet6_sk(sk);
+
+ memset(fl6, 0, sizeof(*fl6));
+ fl6->flowi6_proto = sk->sk_protocol;
+ fl6->daddr = sk->sk_v6_daddr;
+ fl6->saddr = np->saddr;
+ fl6->flowi6_oif = sk->sk_bound_dev_if;
+ fl6->flowi6_mark = sk->sk_mark;
+ fl6->fl6_dport = inet->inet_dport;
+ fl6->fl6_sport = inet->inet_sport;
+ fl6->flowlabel = np->flow_label;
+
+ if (!fl6->flowi6_oif)
+ fl6->flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
+
+ if (!fl6->flowi6_oif && ipv6_addr_is_multicast(&fl6->daddr))
+ fl6->flowi6_oif = np->mcast_oif;
+
+ security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+}
+
static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
@@ -52,6 +76,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
struct ipv6_txoptions *opt;
int addr_type;
int err;
+ __be32 fl6_flowlabel = 0;
if (usin->sin6_family == AF_INET) {
if (__ipv6_only_sock(sk))
@@ -66,11 +91,10 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
if (usin->sin6_family != AF_INET6)
return -EAFNOSUPPORT;
- memset(&fl6, 0, sizeof(fl6));
if (np->sndflow) {
- fl6.flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK;
- if (fl6.flowlabel&IPV6_FLOWLABEL_MASK) {
- flowlabel = fl6_sock_lookup(sk, fl6.flowlabel);
+ fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
+ if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) {
+ flowlabel = fl6_sock_lookup(sk, fl6_flowlabel);
if (!flowlabel)
return -EINVAL;
}
@@ -145,7 +169,7 @@ ipv4_connected:
}
sk->sk_v6_daddr = *daddr;
- np->flow_label = fl6.flowlabel;
+ np->flow_label = fl6_flowlabel;
inet->inet_dport = usin->sin6_port;
@@ -154,21 +178,7 @@ ipv4_connected:
* destination cache for it.
*/
- fl6.flowi6_proto = sk->sk_protocol;
- fl6.daddr = sk->sk_v6_daddr;
- fl6.saddr = np->saddr;
- fl6.flowi6_oif = sk->sk_bound_dev_if;
- fl6.flowi6_mark = sk->sk_mark;
- fl6.fl6_dport = inet->inet_dport;
- fl6.fl6_sport = inet->inet_sport;
-
- if (!fl6.flowi6_oif)
- fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
-
- if (!fl6.flowi6_oif && (addr_type&IPV6_ADDR_MULTICAST))
- fl6.flowi6_oif = np->mcast_oif;
-
- security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
+ ip6_datagram_flow_key_init(&fl6, sk);
rcu_read_lock();
opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
--
2.5.1
^ permalink raw reply related
* [PATCH v2 net 0/4] ipv6: datagram: Update dst cache of a connected udp sk during pmtu update
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
v2:
~ Protect __sk_dst_get() operations with rcu_read_lock in
release_cb() because another thread may do ip6_dst_store()
for a udp sk without taking the sk lock (e.g. in sendmsg).
~ Do a ipv6_addr_v4mapped(&sk->sk_v6_daddr) check before
calling ip6_datagram_dst_update() in patch 3 and 4. It is
similar to how __ip6_datagram_connect handles it.
~ One fix in ip6_datagram_dst_update() in patch 2. It needs
to check (np->flow_label & IPV6_FLOWLABEL_MASK) before
doing fl6_sock_lookup. I was confused with the naming
of IPV6_FLOWLABEL_MASK and IPV6_FLOWINFO_MASK.
~ Check dst->obsolete just on the safe side, although I think it
should at least have DST_OBSOLETE_FORCE_CHK by now.
~ Add Fixes tag to patch 3 and 4
~ Add some points from the previous discussion about holding
sk lock to the commit message in patch 3.
v1:
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
Patch 1 and 2 are the prep work.
Patch 3 and 4 are the fixes.
^ permalink raw reply
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Alexei Starovoitov @ 2016-04-11 22:21 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Sagi Grimberg, Bart Van Assche, Christoph Hellwig,
James Bottomley, Tom Herbert, Brenden Blanco,
lsf@lists.linux-foundation.org, linux-mm, netdev@vger.kernel.org,
lsf-pc@lists.linux-foundation.org
In-Reply-To: <20160411234157.3fc9c6fe@redhat.com>
On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
>
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > >> This is also very interesting for storage targets, which face the same
> > >> issue. SCST has a mode where it caches some fully constructed SGLs,
> > >> which is probably very similar to what NICs want to do.
> > >
> > > I think a cached allocator for page sets + the scatterlists that
> > > describe these page sets would not only be useful for SCSI target
> > > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > > code reserves space in each scsi_cmnd for a scatterlist of
> > > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> > > sets less memory would be needed per scsi_cmnd.
> >
> > If we go down this road how about also attaching some driver opaques
> > to the page sets?
>
> That was the ultimate plan... to leave some opaques bytes left in the
> page struct that drivers could use.
>
> In struct page I would need a pointer back to my page_pool struct and a
> page flag. Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs). And a zero-copy cross-domain
> id.
I don't think we need to add anything to struct page.
This is supposed to be small cache of dma_mapped pages with lockless access.
It can be implemented as an array or link list where every element
is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
send it to back to page allocator.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH iproute2 0/4] vxlan: add VXLAN-GPE support
From: Stephen Hemminger @ 2016-04-11 22:17 UTC (permalink / raw)
To: Jiri Benc; +Cc: netdev
In-Reply-To: <cover.1460032274.git.jbenc@redhat.com>
On Thu, 7 Apr 2016 14:36:25 +0200
Jiri Benc <jbenc@redhat.com> wrote:
> This patchset adds a couple of refinements to the 'external' keyword and
> adds support for configuring VXLAN-GPE interfaces.
>
> The VXLAN-GPE support was recently merged to net-next.
>
> The first patch is not intended to be applied directly but I still include
> it in order for this patchset to be compilable and testable by those
> interested.
>
> Jiri Benc (4):
> if_link.h: rebase to the kernel
> vxlan: 'external' implies 'nolearning'
> ip-link.8: document "external" flag for vxlan
> vxlan: add support for VXLAN-GPE
>
> include/linux/if_link.h | 1 +
> ip/iplink_vxlan.c | 14 ++++++++++++--
> man/man8/ip-link.8.in | 17 +++++++++++++++++
> 3 files changed, 30 insertions(+), 2 deletions(-)
>
Applied to net-next
^ permalink raw reply
* Re: [PATCH] iproute2: tc_bpf.c: fix building with musl libc
From: Stephen Hemminger @ 2016-04-11 22:17 UTC (permalink / raw)
To: Gustavo Zacarias; +Cc: netdev
In-Reply-To: <1460120373-32096-1-git-send-email-gustavo@zacarias.com.ar>
On Fri, 8 Apr 2016 09:59:33 -0300
Gustavo Zacarias <gustavo@zacarias.com.ar> wrote:
> We need limits.h for PATH_MAX, fixes:
>
> tc_bpf.c: In function ‘bpf_map_selfcheck_pinned’:
> tc_bpf.c:222:12: error: ‘PATH_MAX’ undeclared (first use in this
> function)
> char file[PATH_MAX], buff[4096];
>
> Signed-off-by: Gustavo Zacarias <gustavo@zacarias.com.ar>
Applied
^ permalink raw reply
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Alexander Duyck @ 2016-04-11 22:02 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Sagi Grimberg, Bart Van Assche, Christoph Hellwig,
James Bottomley, Tom Herbert, Brenden Blanco,
lsf@lists.linux-foundation.org, linux-mm, netdev@vger.kernel.org,
lsf-pc@lists.linux-foundation.org, Alexei Starovoitov
In-Reply-To: <20160411234157.3fc9c6fe@redhat.com>
On Mon, Apr 11, 2016 at 2:41 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> >> This is also very interesting for storage targets, which face the same
>> >> issue. SCST has a mode where it caches some fully constructed SGLs,
>> >> which is probably very similar to what NICs want to do.
>> >
>> > I think a cached allocator for page sets + the scatterlists that
>> > describe these page sets would not only be useful for SCSI target
>> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
>> > code reserves space in each scsi_cmnd for a scatterlist of
>> > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
>> > sets less memory would be needed per scsi_cmnd.
>>
>> If we go down this road how about also attaching some driver opaques
>> to the page sets?
>
> That was the ultimate plan... to leave some opaques bytes left in the
> page struct that drivers could use.
>
> In struct page I would need a pointer back to my page_pool struct and a
> page flag. Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs). And a zero-copy cross-domain
> id.
>
>
> For my packet-page idea, I would need a packet length and an offset
> where data starts (I can derive the "head-room" for encap from these
> two).
Have you taken a look at possibly trying to optimize the DMA pool API
to work with pages? It sounds like it is supposed to do something
similar to what you are wanting to do.
- Alex
^ permalink raw reply
* Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
From: David Howells @ 2016-04-11 22:01 UTC (permalink / raw)
To: David Miller; +Cc: dhowells, linux-afs, netdev, linux-kernel
In-Reply-To: <20160411.153511.1531573815825221764.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
> Series applied, but I had to fix up some trailing whitespace (reported by
> GIT) by hand.
Do you perchance have a git hook script for checking for trailing whitespace?
David
^ permalink raw reply
* Re: [PATCH iproute2 -master 0/3] Minor tc/bpf updates
From: Stephen Hemminger @ 2016-04-11 21:55 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: netdev, alexei.starovoitov
In-Reply-To: <cover.1460153854.git.daniel@iogearbox.net>
On Sat, 9 Apr 2016 00:32:02 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:
> Some minor updates to improve error reporting, add signatures
> and recently introduced map flags attribute. Set is against
> master branch.
>
> Thanks!
>
> Daniel Borkmann (3):
> tc, bpf: add new csum and tunnel signatures
> tc, bpf: further improve error reporting
> tc, bpf: add support for map pre/allocation
>
> examples/bpf/bpf_cyclic.c | 9 ++++-
> examples/bpf/bpf_graft.c | 8 +++-
> examples/bpf/bpf_prog.c | 2 +
> examples/bpf/bpf_shared.c | 8 +++-
> examples/bpf/bpf_tailcall.c | 29 ++++++++++++--
> include/bpf_api.h | 52 ++++--------------------
> include/bpf_elf.h | 1 +
> tc/tc_bpf.c | 98 +++++++++++++++++++++++++++++++++++----------
> tc/tc_bpf.h | 4 ++
> 9 files changed, 138 insertions(+), 73 deletions(-)
>
Applied thanks
^ permalink raw reply
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Jesper Dangaard Brouer @ 2016-04-11 21:41 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Bart Van Assche, Christoph Hellwig, James Bottomley, Tom Herbert,
Brenden Blanco, lsf@lists.linux-foundation.org, linux-mm,
netdev@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
Alexei Starovoitov, brouer
In-Reply-To: <570A9F5B.5010600@grimberg.me>
On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
> >> This is also very interesting for storage targets, which face the same
> >> issue. SCST has a mode where it caches some fully constructed SGLs,
> >> which is probably very similar to what NICs want to do.
> >
> > I think a cached allocator for page sets + the scatterlists that
> > describe these page sets would not only be useful for SCSI target
> > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > code reserves space in each scsi_cmnd for a scatterlist of
> > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> > sets less memory would be needed per scsi_cmnd.
>
> If we go down this road how about also attaching some driver opaques
> to the page sets?
That was the ultimate plan... to leave some opaques bytes left in the
page struct that drivers could use.
In struct page I would need a pointer back to my page_pool struct and a
page flag. Then, I would need room to store the dma_unmap address.
(And then some of the usual fields are still needed, like the refcnt,
and reusing some of the list constructs). And a zero-copy cross-domain
id.
For my packet-page idea, I would need a packet length and an offset
where data starts (I can derive the "head-room" for encap from these
two).
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
From: Or Gerlitz @ 2016-04-11 21:37 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Saeed Mahameed, Matan Barak, Linux Netdev List,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
David S. Miller, Doug Ledford, Linus Torvalds, Or Gerlitz,
Leon Romanovsky, Tal Alon
In-Reply-To: <CALzJLG-PxUz9ov+cv4uiNSGpuxy0s9KamTvamV_g72SctqdhFA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed
<saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> feature --> features
> Correct, will fix.
>>> * Add vport to steering commands for SRIOV ACL support
>>> * Add mlcr, pcmr and mcia registers for dump module EEPROM
>>> * Add support for FCS, baeacon led and disable_link bits to hca caps
>>> * Add CQE period mode bit in CQ context for CQE based CQ
>>> moderation support
>>> * Add umr SQ bit for fragmented memory registration
>>> * Add needed bits and caps for Striding RQ support
>> AFAIK, all the above are features will go through net-next, what made
>> you anticipate conflicts with linux-rdma?
> FCS bit is needed also for rdma, so we took the liberty of updating
> all the needed HW structs, bits, caps, etc ..
> at once for all mlx5 features planned for 4.7 regardless of rdma/net conflicts.
The cover letter states that this series deals with shared code.
I guess you might also could extend it a bit to deal also with code
that you suspect could lead to conflicts, but I don't see why it
evolved to that extent.
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox