* Re: [PATCH net-next] sh_eth: fix *enum* {A|M}PR_BIT
From: Geert Uytterhoeven @ 2018-06-26 16:18 UTC (permalink / raw)
To: Sergei Shtylyov; +Cc: netdev, David S. Miller, Linux-Renesas
In-Reply-To: <e67e6256-4ae9-4527-d482-cf3bb50921cf@cogentembedded.com>
On Tue, Jun 26, 2018 at 5:43 PM Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> The *enum* {A|M}PR_BIT were declared in the commit 86a74ff21a7a ("net:
> sh_eth: add support for Renesas SuperH Ethernet") adding SH771x support,
> however the SH771x manual doesn't have the APR/MPR registers described
> and the code writing to them for SH7710 was later removed by the commit
> 380af9e390ec ("net: sh_eth: CPU dependency code collect to "struct
> sh_eth_cpu_data""). All the newer SoC manuals have these registers
> documented as having a 16-bit TIME parameter of the PAUSE frame, not
> 1-bit -- update the *enum* accordingly, fixing up the APR/MPR writes...
>
> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* [PATCH net] KEYS: DNS: fix parsing multiple options
From: David Howells @ 2018-06-26 16:20 UTC (permalink / raw)
To: netdev; +Cc: dhowells, linux-afs, linux-kernel
From: Eric Biggers <ebiggers@google.com>
My recent fix for dns_resolver_preparse() printing very long strings was
incomplete, as shown by syzbot which still managed to hit the
WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
precision 50001 too large
WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
The bug this time isn't just a printing bug, but also a logical error
when multiple options ("#"-separated strings) are given in the key
payload. Specifically, when separating an option string into name and
value, if there is no value then the name is incorrectly considered to
end at the end of the key payload, rather than the end of the current
option. This bypasses validation of the option length, and also means
that specifying multiple options is broken -- which presumably has gone
unnoticed as there is currently only one valid option anyway.
Fix it by correctly calculating the length of the option name.
Reproducer:
perl -e 'print "#A#", "\x00" x 50000' | keyctl padd dns_resolver desc @s
Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---
net/dns_resolver/dns_key.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
index 40c851693f77..d448823d4d2e 100644
--- a/net/dns_resolver/dns_key.c
+++ b/net/dns_resolver/dns_key.c
@@ -97,7 +97,7 @@ dns_resolver_preparse(struct key_preparsed_payload *prep)
return -EINVAL;
}
- eq = memchr(opt, '=', opt_len) ?: end;
+ eq = memchr(opt, '=', opt_len) ?: next_opt;
opt_nlen = eq - opt;
eq++;
opt_vlen = next_opt - eq; /* will be -1 if no value */
^ permalink raw reply related
* Re: [PATCH bpf-next 1/7] nfp: bpf: allow source ptr type be map ptr in memcpy optimization
From: Song Liu @ 2018-06-26 16:26 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <CAJpBn1zFrBPupsug7NFrx2PV5ew+2mc3D4Zn7vWnujzyT6wEbA@mail.gmail.com>
On Tue, Jun 26, 2018 at 12:08 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Mon, Jun 25, 2018 at 10:50 PM, Song Liu <liu.song.a23@gmail.com> wrote:
>> On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
>> <jakub.kicinski@netronome.com> wrote:
>>> From: Jiong Wang <jiong.wang@netronome.com>
>>>
>>> Map read has been supported on NFP, this patch enables optimization for
>>> memcpy from map to packet.
>>>
>>> This patch also fixed one latent bug which will cause copying from
>>> unexpected address once memcpy for map pointer enabled.
>>>
>>> Reported-by: Mary Pham <mary.pham@netronome.com>
>>> Reported-by: David Beckett <david.beckett@netronome.com>
>>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>>> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>> ---
>>> drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>>> index 8a92088df0d7..33111739b210 100644
>>> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>>> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
>>> @@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
>>> xfer_num = round_up(len, 4) / 4;
>>>
>>> if (src_40bit_addr)
>>> - addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
>>> + addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
>>> &off);
>>
>> Did this break other cases before this patch?
>>
>> I am sorry if this is a dumb question. I don't think I fully
>> understand addr40_offset().
>
> Only map memory uses 40 bit addressing right now, so the if was pretty
> much dead code before the patch.
>
> The memcpy optimization was left out of the initial map support due to
> insufficient test coverage, I should have probably left more of the 40
> bit addressing code out back then.
Thanks for the explanation!
Acked-by: Song Liu <songliubraving@fb.com>
^ permalink raw reply
* Re: [rds-devel] [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
From: Sowmini Varadhan @ 2018-06-26 16:28 UTC (permalink / raw)
To: David Miller; +Cc: netdev, rds-devel
In-Reply-To: <20180626145323.GH20575@oracle.com>
On (06/26/18 10:53), Sowmini Varadhan wrote:
> Date: Tue, 26 Jun 2018 10:53:23 -0400
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> To: David Miller <davem@davemloft.net>
> Cc: netdev@vger.kernel.org, rds-devel@oss.oracle.com
> Subject: Re: [rds-devel] [PATCH net-next] rds: clean up loopback
>
> and just to add, the fix itself is logically correct, so belongs in
> net-next. What I dont have (and therefore did not target net) is
> official confirmation that the syzbot failures are root-caused to the
> absence of this patch (since there is no reproducer for many of these,
> and no crash dumps available from syzbot).
>
With help from Dmitry, I just got the confirmation from syzbot that
"syzbot has tested the proposed patch and the reproducer did not trigger
crash:"
thus, we can mark this
Reported-and-tested-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
and yes, it can target net.
Thanks
--Sowmini
^ permalink raw reply
* Re: [PATCH V4 0/8] net: ethernet: stmmac: add support for stm32mp1
From: Alexandre Torgue @ 2018-06-26 16:31 UTC (permalink / raw)
To: Christophe Roullier, mark.rutland, mcoquelin.stm32,
peppe.cavallaro
Cc: devicetree, linux-arm-kernel, netdev, andrew
In-Reply-To: <1527090479-5263-1-git-send-email-christophe.roullier@st.com>
Hi Christophe,
On 05/23/2018 05:47 PM, Christophe Roullier wrote:
> Patches to have Ethernet support on stm32mp1
> Changelog:
> Remark from Rob Herring
> Move Documentation/devicetree/bindings/arm/stm32.txt in
> Documentation/devicetree/bindings/arm/stm32/stm32.txt and create
> Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
>
> Replace also in arch/arm/boot/dts/stm32mp157c.dtsi, syscfg: system-config@50020000
> with syscfg: syscon@50020000syscfg: system-config@50020000
>
> Christophe Roullier (8):
> net: ethernet: stmmac: add adaptation for stm32mp157c.
> dt-bindings: stm32-dwmac: add support of MPU families
> ARM: dts: stm32: add ethernet pins to stm32mp157c
> ARM: dts: stm32: Add syscfg on stm32mp1
> ARM: dts: stm32: Add ethernet dwmac on stm32mp1
> net: stmmac: add dwmac-4.20a compatible
> ARM: dts: stm32: add support of ethernet on stm32mp157c-ev1
> dt-bindings: stm32: add compatible for syscon
>
> Documentation/devicetree/bindings/arm/stm32.txt | 10 -
> .../devicetree/bindings/arm/stm32/stm32-syscon.txt | 14 ++
> .../devicetree/bindings/arm/stm32/stm32.txt | 10 +
> .../devicetree/bindings/net/stm32-dwmac.txt | 18 +-
> arch/arm/boot/dts/stm32mp157-pinctrl.dtsi | 46 ++++
> arch/arm/boot/dts/stm32mp157c-ev1.dts | 20 ++
> arch/arm/boot/dts/stm32mp157c.dtsi | 35 +++
> drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c | 270 +++++++++++++++++++--
> .../net/ethernet/stmicro/stmmac/stmmac_platform.c | 3 +-
> 9 files changed, 398 insertions(+), 28 deletions(-)
> delete mode 100644 Documentation/devicetree/bindings/arm/stm32.txt
> create mode 100644 Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
> create mode 100644 Documentation/devicetree/bindings/arm/stm32/stm32.txt
>
As discussed I squashed "ARM: dts: stm32: add ethernet pins to
stm32mp157c" and "ARM: dts: stm32: add support of ethernet on
stm32mp157c-ev1" ans fixed interrupt binding issue.
So DT patches applied on stm32-next.
regards
Alex
^ permalink raw reply
* Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute
From: Guenter Roeck @ 2018-06-26 16:32 UTC (permalink / raw)
To: Vadim Pasternak
Cc: Andrew Lunn, davem@davemloft.net, netdev@vger.kernel.org,
rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us, mlxsw,
Michael Shych
In-Reply-To: <HE1PR0502MB3753695D03AD5D8A8DA4CD84A2490@HE1PR0502MB3753.eurprd05.prod.outlook.com>
On Tue, Jun 26, 2018 at 02:47:01PM +0000, Vadim Pasternak wrote:
>
>
> > -----Original Message-----
> > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > Sent: Tuesday, June 26, 2018 5:29 PM
> > To: Vadim Pasternak <vadimp@mellanox.com>
> > Cc: davem@davemloft.net; netdev@vger.kernel.org; linux@roeck-us.net;
> > rui.zhang@intel.com; edubezval@gmail.com; jiri@resnulli.us; mlxsw
> > <mlxsw@mellanox.com>; Michael Shych <michaelsh@mellanox.com>
> > Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface
> > with FAN fault attribute
> >
> > > +static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
> > > + struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
> > > + container_of(attr, struct mlxsw_hwmon_attr,
> > dev_attr);
> > > + struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
> > > + char mfsm_pl[MLXSW_REG_MFSM_LEN];
> > > + u16 tach;
> > > + int err;
> > > +
> > > + mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
> > > + err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm),
> > mfsm_pl);
> > > + if (err) {
> > > + dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query
> > fan\n");
> > > + return err;
> > > + }
> > > + tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
> > > +
> > > + return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 : 0);
> > > +}
> >
> > Documentation/hwmon/sysfs-interface says:
> >
> > Alarms are direct indications read from the chips. The drivers do NOT make
> > comparisons of readings to thresholds. This allows violations between readings
> > to be caught and alarmed. The exact definition of an alarm (for example,
> > whether a threshold must be met or must be exceeded to cause an alarm) is
> > chip-dependent.
> >
> > Now, this is a fault, not an alarm. But does the same apply?
>
Yes, it does. There are no "soft" alarms / faults.
> Hi Andrew,
>
> Hardware provides minimum value for tachometer.
> Tachometer is considered as faulty in case it's below this
> value.
This is for user space to decide, not for the kernel.
> In case any tachometer is faulty, PWM according to the
> system requirements should be set to 100% until the fault
system requirements. Again, this is for user space to decide.
> is not recovered (f.e. by physical replacing of bad unit).
> This is the motivation to expose fan{x}_fault in the way
> it's exposed.
>
> Thanks,
> Vadim.
>
> >
> > Andrew
^ permalink raw reply
* Re: [PATCH 2/3] i40e: split XDP_TX tail and XDP_REDIRECT map flushing
From: Björn Töpel @ 2018-06-26 16:37 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Netdev, John Fastabend, jasowang, Daniel Borkmann,
Björn Töpel, Alexei Starovoitov, intel-wired-lan
In-Reply-To: <153002759360.15389.18278323388927454181.stgit@firesoul>
Den tis 26 juni 2018 kl 18:08 skrev Jesper Dangaard Brouer <brouer@redhat.com>:
>
> The driver was combining the XDP_TX tail flush and XDP_REDIRECT
> map flushing (xdp_do_flush_map). This is suboptimal, these two
> flush operations should be kept separate.
>
> It looks like the mistake was copy-pasted from ixgbe.
>
> Fixes: d9314c474d4f ("i40e: add support for XDP_REDIRECT")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 8ffb7454e67c..c1c027743159 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -2200,9 +2200,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
> return true;
> }
>
> -#define I40E_XDP_PASS 0
> -#define I40E_XDP_CONSUMED 1
> -#define I40E_XDP_TX 2
> +#define I40E_XDP_PASS 0
> +#define I40E_XDP_CONSUMED BIT(0)
> +#define I40E_XDP_TX BIT(1)
> +#define I40E_XDP_REDIR BIT(2)
>
> static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
> struct i40e_ring *xdp_ring);
> @@ -2249,7 +2250,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
> break;
> case XDP_REDIRECT:
> err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> - result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> + result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
> break;
> default:
> bpf_warn_invalid_xdp_action(act);
> @@ -2312,7 +2313,8 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> struct sk_buff *skb = rx_ring->skb;
> u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> - bool failure = false, xdp_xmit = false;
> + unsigned int xdp_xmit = 0;
> + bool failure = false;
> struct xdp_buff xdp;
>
> xdp.rxq = &rx_ring->xdp_rxq;
> @@ -2373,8 +2375,10 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> }
>
> if (IS_ERR(skb)) {
> - if (PTR_ERR(skb) == -I40E_XDP_TX) {
> - xdp_xmit = true;
> + unsigned int xdp_res = -PTR_ERR(skb);
> +
> + if (xdp_res & (I40E_XDP_TX | I40E_XDP_REDIR)) {
> + xdp_xmit |= xdp_res;
> i40e_rx_buffer_flip(rx_ring, rx_buffer, size);
> } else {
> rx_buffer->pagecnt_bias++;
> @@ -2428,12 +2432,14 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> total_rx_packets++;
> }
>
> - if (xdp_xmit) {
> + if (xdp_xmit & I40E_XDP_REDIR)
> + xdp_do_flush_map();
> +
> + if (xdp_xmit & I40E_XDP_TX) {
> struct i40e_ring *xdp_ring =
> rx_ring->vsi->xdp_rings[rx_ring->queue_index];
>
> i40e_xdp_ring_update_tail(xdp_ring);
> - xdp_do_flush_map();
> }
>
> rx_ring->skb = skb;
>
Nice! Added intel-wired-lan to Cc.
Acked-by: Björn Töpel <bjorn.topel@intel.com>
^ permalink raw reply
* [PATCH net-next] l2tp: define helper for parsing struct sockaddr_pppol2tp*
From: Guillaume Nault @ 2018-06-26 16:41 UTC (permalink / raw)
To: netdev; +Cc: James Chapman
'sockaddr_len' is checked against various values when entering
pppol2tp_connect(), to verify its validity. It is used again later, to
find out which sockaddr structure was passed from user space. This
patch combines these two operations into one new function in order to
simplify pppol2tp_connect().
A new structure, l2tp_connect_info, is used to pass sockaddr data back
to pppol2tp_connect(), to avoid passing too many parameters to
l2tp_sockaddr_get_info(). Also, the first parameter is void* in order
to avoid casting between all sockaddr_* structures manually.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
net/l2tp/l2tp_ppp.c | 173 ++++++++++++++++++++++++++------------------
1 file changed, 103 insertions(+), 70 deletions(-)
diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index eea5d7844473..d3a9355ac8ac 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -588,40 +588,113 @@ static void pppol2tp_session_init(struct l2tp_session *session)
}
}
+struct l2tp_connect_info {
+ u8 version;
+ int fd;
+ u32 tunnel_id;
+ u32 peer_tunnel_id;
+ u32 session_id;
+ u32 peer_session_id;
+};
+
+static int pppol2tp_sockaddr_get_info(const void *sa, int sa_len,
+ struct l2tp_connect_info *info)
+{
+ switch (sa_len) {
+ case sizeof(struct sockaddr_pppol2tp):
+ {
+ const struct sockaddr_pppol2tp *sa_v2in4 = sa;
+
+ if (sa_v2in4->sa_protocol != PX_PROTO_OL2TP)
+ return -EINVAL;
+
+ info->version = 2;
+ info->fd = sa_v2in4->pppol2tp.fd;
+ info->tunnel_id = sa_v2in4->pppol2tp.s_tunnel;
+ info->peer_tunnel_id = sa_v2in4->pppol2tp.d_tunnel;
+ info->session_id = sa_v2in4->pppol2tp.s_session;
+ info->peer_session_id = sa_v2in4->pppol2tp.d_session;
+
+ break;
+ }
+ case sizeof(struct sockaddr_pppol2tpv3):
+ {
+ const struct sockaddr_pppol2tpv3 *sa_v3in4 = sa;
+
+ if (sa_v3in4->sa_protocol != PX_PROTO_OL2TP)
+ return -EINVAL;
+
+ info->version = 3;
+ info->fd = sa_v3in4->pppol2tp.fd;
+ info->tunnel_id = sa_v3in4->pppol2tp.s_tunnel;
+ info->peer_tunnel_id = sa_v3in4->pppol2tp.d_tunnel;
+ info->session_id = sa_v3in4->pppol2tp.s_session;
+ info->peer_session_id = sa_v3in4->pppol2tp.d_session;
+
+ break;
+ }
+ case sizeof(struct sockaddr_pppol2tpin6):
+ {
+ const struct sockaddr_pppol2tpin6 *sa_v2in6 = sa;
+
+ if (sa_v2in6->sa_protocol != PX_PROTO_OL2TP)
+ return -EINVAL;
+
+ info->version = 2;
+ info->fd = sa_v2in6->pppol2tp.fd;
+ info->tunnel_id = sa_v2in6->pppol2tp.s_tunnel;
+ info->peer_tunnel_id = sa_v2in6->pppol2tp.d_tunnel;
+ info->session_id = sa_v2in6->pppol2tp.s_session;
+ info->peer_session_id = sa_v2in6->pppol2tp.d_session;
+
+ break;
+ }
+ case sizeof(struct sockaddr_pppol2tpv3in6):
+ {
+ const struct sockaddr_pppol2tpv3in6 *sa_v3in6 = sa;
+
+ if (sa_v3in6->sa_protocol != PX_PROTO_OL2TP)
+ return -EINVAL;
+
+ info->version = 3;
+ info->fd = sa_v3in6->pppol2tp.fd;
+ info->tunnel_id = sa_v3in6->pppol2tp.s_tunnel;
+ info->peer_tunnel_id = sa_v3in6->pppol2tp.d_tunnel;
+ info->session_id = sa_v3in6->pppol2tp.s_session;
+ info->peer_session_id = sa_v3in6->pppol2tp.d_session;
+
+ break;
+ }
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
/* connect() handler. Attach a PPPoX socket to a tunnel UDP socket
*/
static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
int sockaddr_len, int flags)
{
struct sock *sk = sock->sk;
- struct sockaddr_pppol2tp *sp = (struct sockaddr_pppol2tp *) uservaddr;
struct pppox_sock *po = pppox_sk(sk);
struct l2tp_session *session = NULL;
+ struct l2tp_connect_info info;
struct l2tp_tunnel *tunnel;
struct pppol2tp_session *ps;
struct l2tp_session_cfg cfg = { 0, };
- int error = 0;
- u32 tunnel_id, peer_tunnel_id;
- u32 session_id, peer_session_id;
bool drop_refcnt = false;
bool drop_tunnel = false;
bool new_session = false;
bool new_tunnel = false;
- int ver = 2;
- int fd;
-
- lock_sock(sk);
-
- error = -EINVAL;
+ int error;
- if (sockaddr_len != sizeof(struct sockaddr_pppol2tp) &&
- sockaddr_len != sizeof(struct sockaddr_pppol2tpv3) &&
- sockaddr_len != sizeof(struct sockaddr_pppol2tpin6) &&
- sockaddr_len != sizeof(struct sockaddr_pppol2tpv3in6))
- goto end;
+ error = pppol2tp_sockaddr_get_info(uservaddr, sockaddr_len, &info);
+ if (error < 0)
+ return error;
- if (sp->sa_protocol != PX_PROTO_OL2TP)
- goto end;
+ lock_sock(sk);
/* Check for already bound sockets */
error = -EBUSY;
@@ -633,56 +706,12 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
if (sk->sk_user_data)
goto end; /* socket is already attached */
- /* Get params from socket address. Handle L2TPv2 and L2TPv3.
- * This is nasty because there are different sockaddr_pppol2tp
- * structs for L2TPv2, L2TPv3, over IPv4 and IPv6. We use
- * the sockaddr size to determine which structure the caller
- * is using.
- */
- peer_tunnel_id = 0;
- if (sockaddr_len == sizeof(struct sockaddr_pppol2tp)) {
- fd = sp->pppol2tp.fd;
- tunnel_id = sp->pppol2tp.s_tunnel;
- peer_tunnel_id = sp->pppol2tp.d_tunnel;
- session_id = sp->pppol2tp.s_session;
- peer_session_id = sp->pppol2tp.d_session;
- } else if (sockaddr_len == sizeof(struct sockaddr_pppol2tpv3)) {
- struct sockaddr_pppol2tpv3 *sp3 =
- (struct sockaddr_pppol2tpv3 *) sp;
- ver = 3;
- fd = sp3->pppol2tp.fd;
- tunnel_id = sp3->pppol2tp.s_tunnel;
- peer_tunnel_id = sp3->pppol2tp.d_tunnel;
- session_id = sp3->pppol2tp.s_session;
- peer_session_id = sp3->pppol2tp.d_session;
- } else if (sockaddr_len == sizeof(struct sockaddr_pppol2tpin6)) {
- struct sockaddr_pppol2tpin6 *sp6 =
- (struct sockaddr_pppol2tpin6 *) sp;
- fd = sp6->pppol2tp.fd;
- tunnel_id = sp6->pppol2tp.s_tunnel;
- peer_tunnel_id = sp6->pppol2tp.d_tunnel;
- session_id = sp6->pppol2tp.s_session;
- peer_session_id = sp6->pppol2tp.d_session;
- } else if (sockaddr_len == sizeof(struct sockaddr_pppol2tpv3in6)) {
- struct sockaddr_pppol2tpv3in6 *sp6 =
- (struct sockaddr_pppol2tpv3in6 *) sp;
- ver = 3;
- fd = sp6->pppol2tp.fd;
- tunnel_id = sp6->pppol2tp.s_tunnel;
- peer_tunnel_id = sp6->pppol2tp.d_tunnel;
- session_id = sp6->pppol2tp.s_session;
- peer_session_id = sp6->pppol2tp.d_session;
- } else {
- error = -EINVAL;
- goto end; /* bad socket address */
- }
-
/* Don't bind if tunnel_id is 0 */
error = -EINVAL;
- if (tunnel_id == 0)
+ if (!info.tunnel_id)
goto end;
- tunnel = l2tp_tunnel_get(sock_net(sk), tunnel_id);
+ tunnel = l2tp_tunnel_get(sock_net(sk), info.tunnel_id);
if (tunnel)
drop_tunnel = true;
@@ -690,7 +719,7 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
* peer_session_id is 0. Otherwise look up tunnel using supplied
* tunnel id.
*/
- if ((session_id == 0) && (peer_session_id == 0)) {
+ if (!info.session_id && !info.peer_session_id) {
if (tunnel == NULL) {
struct l2tp_tunnel_cfg tcfg = {
.encap = L2TP_ENCAPTYPE_UDP,
@@ -700,12 +729,16 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
/* Prevent l2tp_tunnel_register() from trying to set up
* a kernel socket.
*/
- if (fd < 0) {
+ if (info.fd < 0) {
error = -EBADF;
goto end;
}
- error = l2tp_tunnel_create(sock_net(sk), fd, ver, tunnel_id, peer_tunnel_id, &tcfg, &tunnel);
+ error = l2tp_tunnel_create(sock_net(sk), info.fd,
+ info.version,
+ info.tunnel_id,
+ info.peer_tunnel_id, &tcfg,
+ &tunnel);
if (error < 0)
goto end;
@@ -734,9 +767,9 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
tunnel->recv_payload_hook = pppol2tp_recv_payload_hook;
if (tunnel->peer_tunnel_id == 0)
- tunnel->peer_tunnel_id = peer_tunnel_id;
+ tunnel->peer_tunnel_id = info.peer_tunnel_id;
- session = l2tp_session_get(sock_net(sk), tunnel, session_id);
+ session = l2tp_session_get(sock_net(sk), tunnel, info.session_id);
if (session) {
drop_refcnt = true;
@@ -765,8 +798,8 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
cfg.pw_type = L2TP_PWTYPE_PPP;
session = l2tp_session_create(sizeof(struct pppol2tp_session),
- tunnel, session_id,
- peer_session_id, &cfg);
+ tunnel, info.session_id,
+ info.peer_session_id, &cfg);
if (IS_ERR(session)) {
error = PTR_ERR(session);
goto end;
--
2.18.0
^ permalink raw reply related
* RE: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute
From: Vadim Pasternak @ 2018-06-26 16:47 UTC (permalink / raw)
To: Guenter Roeck
Cc: Andrew Lunn, davem@davemloft.net, netdev@vger.kernel.org,
rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us, mlxsw,
Michael Shych
In-Reply-To: <20180626163232.GA32079@roeck-us.net>
> -----Original Message-----
> From: Guenter Roeck [mailto:linux@roeck-us.net]
> Sent: Tuesday, June 26, 2018 7:33 PM
> To: Vadim Pasternak <vadimp@mellanox.com>
> Cc: Andrew Lunn <andrew@lunn.ch>; davem@davemloft.net;
> netdev@vger.kernel.org; rui.zhang@intel.com; edubezval@gmail.com;
> jiri@resnulli.us; mlxsw <mlxsw@mellanox.com>; Michael Shych
> <michaelsh@mellanox.com>
> Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface
> with FAN fault attribute
>
> On Tue, Jun 26, 2018 at 02:47:01PM +0000, Vadim Pasternak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > > Sent: Tuesday, June 26, 2018 5:29 PM
> > > To: Vadim Pasternak <vadimp@mellanox.com>
> > > Cc: davem@davemloft.net; netdev@vger.kernel.org; linux@roeck-us.net;
> > > rui.zhang@intel.com; edubezval@gmail.com; jiri@resnulli.us; mlxsw
> > > <mlxsw@mellanox.com>; Michael Shych <michaelsh@mellanox.com>
> > > Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon
> > > interface with FAN fault attribute
> > >
> > > > +static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
> > > > + struct device_attribute *attr,
> > > > + char *buf)
> > > > +{
> > > > + struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
> > > > + container_of(attr, struct mlxsw_hwmon_attr,
> > > dev_attr);
> > > > + struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
> > > > + char mfsm_pl[MLXSW_REG_MFSM_LEN];
> > > > + u16 tach;
> > > > + int err;
> > > > +
> > > > + mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
> > > > + err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm),
> > > mfsm_pl);
> > > > + if (err) {
> > > > + dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query
> > > fan\n");
> > > > + return err;
> > > > + }
> > > > + tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
> > > > +
> > > > + return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 :
> > > > +0); }
> > >
> > > Documentation/hwmon/sysfs-interface says:
> > >
> > > Alarms are direct indications read from the chips. The drivers do
> > > NOT make comparisons of readings to thresholds. This allows
> > > violations between readings to be caught and alarmed. The exact
> > > definition of an alarm (for example, whether a threshold must be met
> > > or must be exceeded to cause an alarm) is chip-dependent.
> > >
> > > Now, this is a fault, not an alarm. But does the same apply?
> >
> Yes, it does. There are no "soft" alarms / faults.
>
> > Hi Andrew,
> >
> > Hardware provides minimum value for tachometer.
> > Tachometer is considered as faulty in case it's below this value.
>
> This is for user space to decide, not for the kernel.
Hi Guenter,
Do you suggest to expose provide fan{x}_min, instead of fan{x}_fault
and give to user to compare fan{x}_input versus fan{x}_min for the
fault decision?
>
> > In case any tachometer is faulty, PWM according to the system
> > requirements should be set to 100% until the fault
>
> system requirements. Again, this is for user space to decide.
Yes, user should decide in this case and I wanted to provide to user
fan{x}_fault for this matter. But it could do it based on input and min
attributes, of course.
>
> > is not recovered (f.e. by physical replacing of bad unit).
> > This is the motivation to expose fan{x}_fault in the way it's exposed.
> >
> > Thanks,
> > Vadim.
> >
> > >
> > > Andrew
^ permalink raw reply
* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Okash Khawaja @ 2018-06-26 16:48 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Jakub Kicinski, Daniel Borkmann, Alexei Starovoitov,
Yonghong Song, Quentin Monnet, David S. Miller, netdev,
kernel-team, linux-kernel
In-Reply-To: <20180623002639.h4qxy7aakypi6t7b@kafai-mbp.dhcp.thefacebook.com>
On Fri, Jun 22, 2018 at 05:26:39PM -0700, Martin KaFai Lau wrote:
> On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:
> > On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:
> > > > > > > > > > > > "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > > > ],
> > > > > > > > > > > > "value_struct":{
> > > > > > > > > > > > "src_ip":2,
> > > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > > Is it breaking backward compat?
> > > > > > > i.e.
> > > > > > > struct five_tuples {
> > > > > > > - int src_ip;
> > > > > > > + int src_ip[4];
> > > > > > > /* ... */
> > > > > > > };
> > > > > >
> > > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > > not bpftool :) BTF changes so does the output.
> > > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > > Hence, "-j" and "-p" will stay as is. The whole existing json will
> > > > > be backward compat instead of only partly backward compat.
> > > >
> > > > No. There is a difference between user of a facility changing their
> > > > input and kernel/libraries providing different output in response to
> > > > that, and the libraries suddenly changing the output on their own.
> > > >
> > > > Your example is like saying if user started using IPv6 addresses
> > > > instead of IPv4 the netlink attributes in dumps will be different so
> > > > kernel didn't keep backwards compat. While what you're doing is more
> > > > equivalent to dropping support for old ioctl interfaces because there
> > > > is a better mechanism now.
> > > Sorry, I don't follow this. I don't see netlink suffer json issue like
> > > the one on "key" and "value".
> > >
> > > All I can grasp is, the json should normally be backward compat but now
> > > we are saying anything added by btf-output is an exception because
> > > the script parsing it will treat it differently than "key" and "value"
> >
> > Backward compatibility means that if I run *the same* program against
> > different kernels/libraries it continues to work. If someone decides
> > to upgrade their program to work with IPv6 (which was your example)
> > obviously there is no way system as a whole will look 1:1 the same.
> >
> > > > BTF in JSON is very useful, and will help people who writes simple
> > > > orchestration/scripts based on bpftool *a* *lot*. I really appreciate
> > > Can you share what the script will do? I want to understand why
> > > it cannot directly use the BTF format and the map data.
> >
> > Think about a python script which wants to read a counter in a map.
> > Right now it would have to get the BTF, find out which bytes are the
> > counter, then convert the bytes into a larger int. With JSON BTF if
> > just does entry["formatted"]["value"]["counter"].
> >
> > Real life example from my test code (conversion of 3 element counter
> > array):
> >
> > def str2int(strtab):
> > inttab = []
> > for i in strtab:
> > inttab.append(int(i, 16))
> > ba = bytearray(inttab)
> > if len(strtab) == 4:
> > fmt = "I"
> > elif len(strtab) == 8:
> > fmt = "Q"
> > else:
> > raise Exception("String array of len %d can't be unpacked to an int" %
> > (len(strtab)))
> > return struct.unpack(fmt, ba)[0]
> >
> > def convert(elems, idx):
> > val = []
> > for i in range(3):
> > part = elems[idx]["value"][i * length:(i + 1) * length]
> > val.append(str2int(part))
> > return val
> >
> > With BTF it would be:
> >
> > elems[idx]["formatted"]["value"]
> >
> > Which is fairly awesome.
> Thanks for the example. Agree that with BTF, things are easier in general.
>
> btw, what more awesome is,
> #> bpftool map find id 100 key 1
> {
> "counter_x": 1,
> "counter_y": 10
> }
>
> >
> > > > this addition to bpftool and will start using it myself as soon as it
> > > > lands. I'm not sure why the reluctance to slightly change the output
> > > > format?
> > > The initial change argument is because the json has to be backward compat.
> > >
> > > Then we show that btf-output is inherently not backward compat, so
> > > printing it in json does not make sense at all.
> > >
> > > However, now it is saying part of it does not have to be backward compat.
> >
> > Compatibility of "formatted" member is defined as -> fields broken out
> > according to BTF. So it is backward compatible. The definition of
> > "value" member is -> an array of unfortunately formatted array of
> > ugly hex strings :(
> >
> > > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > > case, other than the double output is still confusing. Lets wait for
> > > Okash's input.
> > >
> > > At the same time, the same output will be used as the default plaintext
> > > output when BTF is available. Then the plaintext BTF output
> > > will not be limited by the json restrictions when we want
> > > to improve human readability later. Apparently, the
> > > improvements on plaintext will not be always applicable
> > > to json output.
> >
hi,
so i guess following is what we want:
1. a "formatted" object nested inside -p and -j switches for bpf map
dump. this will be JSON and backward compatible
2. an output for humans - which is like the current output. this will
not be JSON. this won't have to be backward compatible. this output will
be shown when neither of -j and -p are supplied and btf info is
available.
i can update the patches to v2 which covers 2 above + all other comments
on the patchset. later we can follow up with a patch for 1.
thanks for valuable feedback :)
okash
^ permalink raw reply
* Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Guenter Roeck @ 2018-06-26 17:00 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Vadim Pasternak, linux-pm, netdev, rui.zhang, edubezval, jiri
In-Reply-To: <20180626142238.GB5064@lunn.ch>
On Tue, Jun 26, 2018 at 04:22:38PM +0200, Andrew Lunn wrote:
> On Tue, Jun 26, 2018 at 12:10:28PM +0000, Vadim Pasternak wrote:
>
> Adding the linux-pm@vger.kernel.org list.
>
> > Add new core_env module to allow port temperature reading. This
> > information has most critical impact on system's thermal monitoring and
> > is to be used by core_hwmon and core_thermal modules.
> >
> > New internal API reads the temperature from all the modules, which are
> > equipped with the thermal sensor and exposes temperature according to
> > the worst measure. All individual temperature values are normalized to
> > pre-defined range.
>
> This patchset has been sent to the netdev list before. I raised a few
> questions about this, which is why it is now being posted to a bigger
> group for review.
>
> The hardware has up to 64 temperature sensors. These sensors are
> hot-plugable, since they are inside SFP modules, which are
> hot-plugable. Different SFP modules can have different operating
> temperature ranges. They contain an EEPROM which lists upper and lower
> warning and fail temperatures, and report alarms when these thresholds
> a reached.
>
> This code takes the 64 sensors readings and calculates a single value
> it passes to one thermal zone. That thermal zone then controls one fan
> to keep this single value in range.
>
> I queried is this is the correct way to do this? Would it not be
> better to have up to 64 thermal zones? Leave the thermal core to
> iterate over all the zones in order to determine how the fan should be
> driven?
>
I very much think so. This problem must exist elsewhere; essentially
it is the bundling of multiple temperature sensors into a single thermal
zone. I am not sure if this should be 64 thermal zones or one thermal
zone with up to 64 sensors and some algorithm to select the relevant
temperature; that would be up to the thermal subsystem maintainers
to decide. Either case, the sensors should be handled and reported
as individual sensors, with appropriate limits, not as single sensor.
Yes, I understand that means we'll have hundreds of hwmon devices,
but that should not be a problem (and if it is, we'll have to fix
the problem, not the code exposing it).
I understand that the thermal subsystem does not currently support
handling this problem. There may also be some missing pieces between
the hwmon and thermal subsystems, such as reporting limits or alarms
when a hwmon driver register with the thermal subsystem.
Maybe it is time to add this support as part of this patch series ?
> This is possibly the first board with so many sensors. However, i
> doubt it is totally unique. Other big Ethernet switches with lots of
> SFP modules may be added later. Also, 10G copper PHYs often have
> temperature sensors, so this is not limited to just boards with
> optical ports. So having a generic solution would be good.
Agreed.
Thanks,
Guenter
>
> What do the Linux PM exports say about this?
>
> Thanks
> Andrew
^ permalink raw reply
* [PATCH v3 net-next 0/4] Updates for ipsec selftests
From: Shannon Nelson @ 2018-06-26 17:07 UTC (permalink / raw)
To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
Fix up the existing ipsec selftest and add tests for
the ipsec offload driver API.
v2: addressed formatting nits in netdevsim from Jakub Kicinski
v3: a couple more nits from Jakub
Shannon Nelson (4):
selftests: rtnetlink: clear the return code at start of ipsec test
selftests: rtnetlink: use dummydev as a test device
netdevsim: add ipsec offload testing
selftests: rtnetlink: add ipsec offload API test
drivers/net/netdevsim/Makefile | 4 +
drivers/net/netdevsim/ipsec.c | 345 +++++++++++++++++++++++++++++++
drivers/net/netdevsim/netdev.c | 7 +
drivers/net/netdevsim/netdevsim.h | 37 ++++
tools/testing/selftests/net/rtnetlink.sh | 132 +++++++++++-
5 files changed, 518 insertions(+), 7 deletions(-)
create mode 100644 drivers/net/netdevsim/ipsec.c
--
2.7.4
^ permalink raw reply
* [PATCH v3 net-next 2/4] selftests: rtnetlink: use dummydev as a test device
From: Shannon Nelson @ 2018-06-26 17:07 UTC (permalink / raw)
To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1530032875-30482-1-git-send-email-shannon.nelson@oracle.com>
We really shouldn't mess with local system settings, so let's
use the already created dummy device instead for ipsec testing.
Oh, and let's put the temp file into a proper directory.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
tools/testing/selftests/net/rtnetlink.sh | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 261a981..15948cf 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -523,21 +523,19 @@ kci_test_macsec()
kci_test_ipsec()
{
ret=0
-
- # find an ip address on this machine and make up a destination
- srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
- net=`echo $srcip | cut -f1-3 -d.`
- base=`echo $srcip | cut -f4 -d.`
- dstip="$net."`expr $base + 1`
-
algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+ srcip=192.168.123.1
+ dstip=192.168.123.2
+ spi=7
+
+ ip addr add $srcip dev $devdummy
# flush to be sure there's nothing configured
ip x s flush ; ip x p flush
check_err $?
# start the monitor in the background
- tmpfile=`mktemp ipsectestXXX`
+ tmpfile=`mktemp /var/run/ipsectestXXX`
mpid=`(ip x m > $tmpfile & echo $!) 2>/dev/null`
sleep 0.2
@@ -601,6 +599,7 @@ kci_test_ipsec()
check_err $?
ip x p flush
check_err $?
+ ip addr del $srcip/32 dev $devdummy
if [ $ret -ne 0 ]; then
echo "FAIL: ipsec"
--
2.7.4
^ permalink raw reply related
* [PATCH v3 net-next 1/4] selftests: rtnetlink: clear the return code at start of ipsec test
From: Shannon Nelson @ 2018-06-26 17:07 UTC (permalink / raw)
To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1530032875-30482-1-git-send-email-shannon.nelson@oracle.com>
Following the custom from the other functions, clear the global
ret code before starting the test so as to not have previously
failed tests cause us to thing this test has failed.
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
tools/testing/selftests/net/rtnetlink.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index b33a371..261a981 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -522,6 +522,8 @@ kci_test_macsec()
#-------------------------------------------------------------------
kci_test_ipsec()
{
+ ret=0
+
# find an ip address on this machine and make up a destination
srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
net=`echo $srcip | cut -f1-3 -d.`
--
2.7.4
^ permalink raw reply related
* [PATCH v3 net-next 3/4] netdevsim: add ipsec offload testing
From: Shannon Nelson @ 2018-06-26 17:07 UTC (permalink / raw)
To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1530032875-30482-1-git-send-email-shannon.nelson@oracle.com>
Implement the IPsec/XFRM offload API for testing.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
V2 - addressed formatting comments from Jakub Kicinski
V3 - a couple more little xmas tree nits
drivers/net/netdevsim/Makefile | 4 +
drivers/net/netdevsim/ipsec.c | 297 ++++++++++++++++++++++++++++++++++++++
drivers/net/netdevsim/netdev.c | 7 +
drivers/net/netdevsim/netdevsim.h | 41 ++++++
4 files changed, 349 insertions(+)
create mode 100644 drivers/net/netdevsim/ipsec.c
diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index 449b2a1..0fee1d0 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -13,3 +13,7 @@ endif
ifneq ($(CONFIG_NET_DEVLINK),)
netdevsim-objs += devlink.o fib.o
endif
+
+ifneq ($(CONFIG_XFRM_OFFLOAD),)
+netdevsim-objs += ipsec.o
+endif
diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
new file mode 100644
index 0000000..ceff544
--- /dev/null
+++ b/drivers/net/netdevsim/ipsec.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */
+
+#include <crypto/aead.h>
+#include <linux/debugfs.h>
+#include <net/xfrm.h>
+
+#include "netdevsim.h"
+
+#define NSIM_IPSEC_AUTH_BITS 128
+
+static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,
+ char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ struct netdevsim *ns = filp->private_data;
+ struct nsim_ipsec *ipsec = &ns->ipsec;
+ size_t bufsize;
+ char *buf, *p;
+ int len;
+ int i;
+
+ /* the buffer needed is
+ * (num SAs * 3 lines each * ~60 bytes per line) + one more line
+ */
+ bufsize = (ipsec->count * 4 * 60) + 60;
+ buf = kzalloc(bufsize, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ p = buf;
+ p += snprintf(p, bufsize - (p - buf),
+ "SA count=%u tx=%u\n",
+ ipsec->count, ipsec->tx);
+
+ for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+ struct nsim_sa *sap = &ipsec->sa[i];
+
+ if (!sap->used)
+ continue;
+
+ p += snprintf(p, bufsize - (p - buf),
+ "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
+ i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
+ sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
+ p += snprintf(p, bufsize - (p - buf),
+ "sa[%i] spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
+ i, be32_to_cpu(sap->xs->id.spi),
+ sap->xs->id.proto, sap->salt, sap->crypt);
+ p += snprintf(p, bufsize - (p - buf),
+ "sa[%i] key=0x%08x %08x %08x %08x\n",
+ i, sap->key[0], sap->key[1],
+ sap->key[2], sap->key[3]);
+ }
+
+ len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);
+
+ kfree(buf);
+ return len;
+}
+
+static const struct file_operations ipsec_dbg_fops = {
+ .owner = THIS_MODULE,
+ .open = simple_open,
+ .read = nsim_dbg_netdev_ops_read,
+};
+
+static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
+{
+ u32 i;
+
+ if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
+ return -ENOSPC;
+
+ /* search sa table */
+ for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+ if (!ipsec->sa[i].used)
+ return i;
+ }
+
+ return -ENOSPC;
+}
+
+static int nsim_ipsec_parse_proto_keys(struct xfrm_state *xs,
+ u32 *mykey, u32 *mysalt)
+{
+ const char aes_gcm_name[] = "rfc4106(gcm(aes))";
+ struct net_device *dev = xs->xso.dev;
+ unsigned char *key_data;
+ char *alg_name = NULL;
+ int key_len;
+
+ if (!xs->aead) {
+ netdev_err(dev, "Unsupported IPsec algorithm\n");
+ return -EINVAL;
+ }
+
+ if (xs->aead->alg_icv_len != NSIM_IPSEC_AUTH_BITS) {
+ netdev_err(dev, "IPsec offload requires %d bit authentication\n",
+ NSIM_IPSEC_AUTH_BITS);
+ return -EINVAL;
+ }
+
+ key_data = &xs->aead->alg_key[0];
+ key_len = xs->aead->alg_key_len;
+ alg_name = xs->aead->alg_name;
+
+ if (strcmp(alg_name, aes_gcm_name)) {
+ netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n",
+ aes_gcm_name);
+ return -EINVAL;
+ }
+
+ /* 160 accounts for 16 byte key and 4 byte salt */
+ if (key_len > NSIM_IPSEC_AUTH_BITS) {
+ *mysalt = ((u32 *)key_data)[4];
+ } else if (key_len == NSIM_IPSEC_AUTH_BITS) {
+ *mysalt = 0;
+ } else {
+ netdev_err(dev, "IPsec hw offload only supports 128 bit keys with optional 32 bit salt\n");
+ return -EINVAL;
+ }
+ memcpy(mykey, key_data, 16);
+
+ return 0;
+}
+
+static int nsim_ipsec_add_sa(struct xfrm_state *xs)
+{
+ struct nsim_ipsec *ipsec;
+ struct net_device *dev;
+ struct netdevsim *ns;
+ struct nsim_sa sa;
+ u16 sa_idx;
+ int ret;
+
+ dev = xs->xso.dev;
+ ns = netdev_priv(dev);
+ ipsec = &ns->ipsec;
+
+ if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
+ netdev_err(dev, "Unsupported protocol 0x%04x for ipsec offload\n",
+ xs->id.proto);
+ return -EINVAL;
+ }
+
+ if (xs->calg) {
+ netdev_err(dev, "Compression offload not supported\n");
+ return -EINVAL;
+ }
+
+ /* find the first unused index */
+ ret = nsim_ipsec_find_empty_idx(ipsec);
+ if (ret < 0) {
+ netdev_err(dev, "No space for SA in Rx table!\n");
+ return ret;
+ }
+ sa_idx = (u16)ret;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.used = true;
+ sa.xs = xs;
+
+ if (sa.xs->id.proto & IPPROTO_ESP)
+ sa.crypt = xs->ealg || xs->aead;
+
+ /* get the key and salt */
+ ret = nsim_ipsec_parse_proto_keys(xs, sa.key, &sa.salt);
+ if (ret) {
+ netdev_err(dev, "Failed to get key data for SA table\n");
+ return ret;
+ }
+
+ if (xs->xso.flags & XFRM_OFFLOAD_INBOUND) {
+ sa.rx = true;
+
+ if (xs->props.family == AF_INET6)
+ memcpy(sa.ipaddr, &xs->id.daddr.a6, 16);
+ else
+ memcpy(&sa.ipaddr[3], &xs->id.daddr.a4, 4);
+ }
+
+ /* the preparations worked, so save the info */
+ memcpy(&ipsec->sa[sa_idx], &sa, sizeof(sa));
+
+ /* the XFRM stack doesn't like offload_handle == 0,
+ * so add a bitflag in case our array index is 0
+ */
+ xs->xso.offload_handle = sa_idx | NSIM_IPSEC_VALID;
+ ipsec->count++;
+
+ return 0;
+}
+
+static void nsim_ipsec_del_sa(struct xfrm_state *xs)
+{
+ struct netdevsim *ns = netdev_priv(xs->xso.dev);
+ struct nsim_ipsec *ipsec = &ns->ipsec;
+ u16 sa_idx;
+
+ sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
+ if (!ipsec->sa[sa_idx].used) {
+ netdev_err(ns->netdev, "Invalid SA for delete sa_idx=%d\n",
+ sa_idx);
+ return;
+ }
+
+ memset(&ipsec->sa[sa_idx], 0, sizeof(struct nsim_sa));
+ ipsec->count--;
+}
+
+static bool nsim_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
+{
+ struct netdevsim *ns = netdev_priv(xs->xso.dev);
+ struct nsim_ipsec *ipsec = &ns->ipsec;
+
+ ipsec->ok++;
+
+ return true;
+}
+
+static const struct xfrmdev_ops nsim_xfrmdev_ops = {
+ .xdo_dev_state_add = nsim_ipsec_add_sa,
+ .xdo_dev_state_delete = nsim_ipsec_del_sa,
+ .xdo_dev_offload_ok = nsim_ipsec_offload_ok,
+};
+
+bool nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
+{
+ struct nsim_ipsec *ipsec = &ns->ipsec;
+ struct xfrm_state *xs;
+ struct nsim_sa *tsa;
+ u32 sa_idx;
+
+ /* do we even need to check this packet? */
+ if (!skb->sp)
+ return true;
+
+ if (unlikely(!skb->sp->len)) {
+ netdev_err(ns->netdev, "no xfrm state len = %d\n",
+ skb->sp->len);
+ return false;
+ }
+
+ xs = xfrm_input_state(skb);
+ if (unlikely(!xs)) {
+ netdev_err(ns->netdev, "no xfrm_input_state() xs = %p\n", xs);
+ return false;
+ }
+
+ sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
+ if (unlikely(sa_idx > NSIM_IPSEC_MAX_SA_COUNT)) {
+ netdev_err(ns->netdev, "bad sa_idx=%d max=%d\n",
+ sa_idx, NSIM_IPSEC_MAX_SA_COUNT);
+ return false;
+ }
+
+ tsa = &ipsec->sa[sa_idx];
+ if (unlikely(!tsa->used)) {
+ netdev_err(ns->netdev, "unused sa_idx=%d\n", sa_idx);
+ return false;
+ }
+
+ if (xs->id.proto != IPPROTO_ESP && xs->id.proto != IPPROTO_AH) {
+ netdev_err(ns->netdev, "unexpected proto=%d\n", xs->id.proto);
+ return false;
+ }
+
+ ipsec->tx++;
+
+ return true;
+}
+
+void nsim_ipsec_init(struct netdevsim *ns)
+{
+ ns->netdev->xfrmdev_ops = &nsim_xfrmdev_ops;
+
+#define NSIM_ESP_FEATURES (NETIF_F_HW_ESP | \
+ NETIF_F_HW_ESP_TX_CSUM | \
+ NETIF_F_GSO_ESP)
+
+ ns->netdev->features |= NSIM_ESP_FEATURES;
+ ns->netdev->hw_enc_features |= NSIM_ESP_FEATURES;
+
+ ns->ipsec.pfile = debugfs_create_file("ipsec", 0400, ns->ddir, ns,
+ &ipsec_dbg_fops);
+}
+
+void nsim_ipsec_teardown(struct netdevsim *ns)
+{
+ struct nsim_ipsec *ipsec = &ns->ipsec;
+
+ if (ipsec->count)
+ netdev_err(ns->netdev, "tearing down IPsec offload with %d SAs left\n",
+ ipsec->count);
+ debugfs_remove_recursive(ipsec->pfile);
+}
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index ec68f38..6ce8604d 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -171,6 +171,8 @@ static int nsim_init(struct net_device *dev)
if (err)
goto err_unreg_dev;
+ nsim_ipsec_init(ns);
+
return 0;
err_unreg_dev:
@@ -186,6 +188,7 @@ static void nsim_uninit(struct net_device *dev)
{
struct netdevsim *ns = netdev_priv(dev);
+ nsim_ipsec_teardown(ns);
nsim_devlink_teardown(ns);
debugfs_remove_recursive(ns->ddir);
nsim_bpf_uninit(ns);
@@ -203,11 +206,15 @@ static netdev_tx_t nsim_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct netdevsim *ns = netdev_priv(dev);
+ if (!nsim_ipsec_tx(ns, skb))
+ goto out;
+
u64_stats_update_begin(&ns->syncp);
ns->tx_packets++;
ns->tx_bytes += skb->len;
u64_stats_update_end(&ns->syncp);
+out:
dev_kfree_skb(skb);
return NETDEV_TX_OK;
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index 3a8581a..29448e8 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -29,6 +29,27 @@ struct bpf_prog;
struct dentry;
struct nsim_vf_config;
+#define NSIM_IPSEC_MAX_SA_COUNT 33
+#define NSIM_IPSEC_VALID BIT(31)
+
+struct nsim_sa {
+ struct xfrm_state *xs;
+ __be32 ipaddr[4];
+ u32 key[4];
+ u32 salt;
+ bool used;
+ bool crypt;
+ bool rx;
+};
+
+struct nsim_ipsec {
+ struct nsim_sa sa[NSIM_IPSEC_MAX_SA_COUNT];
+ struct dentry *pfile;
+ u32 count;
+ u32 tx;
+ u32 ok;
+};
+
struct netdevsim {
struct net_device *netdev;
@@ -67,6 +88,7 @@ struct netdevsim {
#if IS_ENABLED(CONFIG_NET_DEVLINK)
struct devlink *devlink;
#endif
+ struct nsim_ipsec ipsec;
};
extern struct dentry *nsim_ddir;
@@ -147,6 +169,25 @@ static inline void nsim_devlink_exit(void)
}
#endif
+#if IS_ENABLED(CONFIG_XFRM_OFFLOAD)
+void nsim_ipsec_init(struct netdevsim *ns);
+void nsim_ipsec_teardown(struct netdevsim *ns);
+bool nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb);
+#else
+static inline void nsim_ipsec_init(struct netdevsim *ns)
+{
+}
+
+static inline void nsim_ipsec_teardown(struct netdevsim *ns)
+{
+}
+
+static inline bool nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
+{
+ return true;
+}
+#endif
+
static inline struct netdevsim *to_nsim(struct device *ptr)
{
return container_of(ptr, struct netdevsim, dev);
--
2.7.4
^ permalink raw reply related
* [PATCH v3 net-next 4/4] selftests: rtnetlink: add ipsec offload API test
From: Shannon Nelson @ 2018-06-26 17:07 UTC (permalink / raw)
To: davem, netdev, jakub.kicinski; +Cc: anders.roxell, linux-kselftest
In-Reply-To: <1530032875-30482-1-git-send-email-shannon.nelson@oracle.com>
Using the netdevsim as a device for testing, try out the XFRM commands
for setting up IPsec hardware offloads.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
tools/testing/selftests/net/rtnetlink.sh | 114 +++++++++++++++++++++++++++++++
1 file changed, 114 insertions(+)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 15948cf..9e1a82e 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -608,6 +608,119 @@ kci_test_ipsec()
echo "PASS: ipsec"
}
+#-------------------------------------------------------------------
+# Example commands
+# ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \
+# spi 0x07 mode transport reqid 0x07 replay-window 32 \
+# aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \
+# sel src 14.0.0.52/24 dst 14.0.0.70/24
+# offload dev sim1 dir out
+# ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 \
+# tmpl proto esp src 14.0.0.52 dst 14.0.0.70 \
+# spi 0x07 mode transport reqid 0x07
+#
+#-------------------------------------------------------------------
+kci_test_ipsec_offload()
+{
+ ret=0
+ algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+ srcip=192.168.123.3
+ dstip=192.168.123.4
+ dev=simx1
+ sysfsd=/sys/kernel/debug/netdevsim/$dev
+ sysfsf=$sysfsd/ipsec
+
+ # setup netdevsim since dummydev doesn't have offload support
+ modprobe netdevsim
+ check_err $?
+ if [ $ret -ne 0 ]; then
+ echo "FAIL: ipsec_offload can't load netdevsim"
+ return 1
+ fi
+
+ ip link add $dev type netdevsim
+ ip addr add $srcip dev $dev
+ ip link set $dev up
+ if [ ! -d $sysfsd ] ; then
+ echo "FAIL: ipsec_offload can't create device $dev"
+ return 1
+ fi
+ if [ ! -f $sysfsf ] ; then
+ echo "FAIL: ipsec_offload netdevsim doesn't support IPsec offload"
+ return 1
+ fi
+
+ # flush to be sure there's nothing configured
+ ip x s flush ; ip x p flush
+
+ # create offloaded SAs, both in and out
+ ip x p add dir out src $srcip/24 dst $dstip/24 \
+ tmpl proto esp src $srcip dst $dstip spi 9 \
+ mode transport reqid 42
+ check_err $?
+ ip x p add dir out src $dstip/24 dst $srcip/24 \
+ tmpl proto esp src $dstip dst $srcip spi 9 \
+ mode transport reqid 42
+ check_err $?
+
+ ip x s add proto esp src $srcip dst $dstip spi 9 \
+ mode transport reqid 42 $algo sel src $srcip/24 dst $dstip/24 \
+ offload dev $dev dir out
+ check_err $?
+ ip x s add proto esp src $dstip dst $srcip spi 9 \
+ mode transport reqid 42 $algo sel src $dstip/24 dst $srcip/24 \
+ offload dev $dev dir in
+ check_err $?
+ if [ $ret -ne 0 ]; then
+ echo "FAIL: ipsec_offload can't create SA"
+ return 1
+ fi
+
+ # does offload show up in ip output
+ lines=`ip x s list | grep -c "crypto offload parameters: dev $dev dir"`
+ if [ $lines -ne 2 ] ; then
+ echo "FAIL: ipsec_offload SA offload missing from list output"
+ check_err 1
+ fi
+
+ # use ping to exercise the Tx path
+ ping -I $dev -c 3 -W 1 -i 0 $dstip >/dev/null
+
+ # does driver have correct offload info
+ diff $sysfsf - << EOF
+SA count=2 tx=3
+sa[0] tx ipaddr=0x00000000 00000000 00000000 00000000
+sa[0] spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[0] key=0x34333231 38373635 32313039 36353433
+sa[1] rx ipaddr=0x00000000 00000000 00000000 037ba8c0
+sa[1] spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[1] key=0x34333231 38373635 32313039 36353433
+EOF
+ if [ $? -ne 0 ] ; then
+ echo "FAIL: ipsec_offload incorrect driver data"
+ check_err 1
+ fi
+
+ # does offload get removed from driver
+ ip x s flush
+ ip x p flush
+ lines=`grep -c "SA count=0" $sysfsf`
+ if [ $lines -ne 1 ] ; then
+ echo "FAIL: ipsec_offload SA not removed from driver"
+ check_err 1
+ fi
+
+ # clean up any leftovers
+ ip link del $dev
+ rmmod netdevsim
+
+ if [ $ret -ne 0 ]; then
+ echo "FAIL: ipsec_offload"
+ return 1
+ fi
+ echo "PASS: ipsec_offload"
+}
+
kci_test_gretap()
{
testns="testns"
@@ -862,6 +975,7 @@ kci_test_rtnl()
kci_test_encap
kci_test_macsec
kci_test_ipsec
+ kci_test_ipsec_offload
kci_del_dummy
}
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child
From: Yuchung Cheng @ 2018-06-26 17:16 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet
In-Reply-To: <20180626154549.102366-1-edumazet@google.com>
On Tue, Jun 26, 2018 at 8:45 AM, Eric Dumazet <edumazet@google.com> wrote:
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
nice refactor!
Acked-by: Yuchung Cheng <ycheng@google.com>
> net/ipv4/tcp_minisocks.c | 223 ++++++++++++++++++++-------------------
> 1 file changed, 113 insertions(+), 110 deletions(-)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 1dda1341a223937580b4efdbedb21ae50b221ff7..dac5893a52b4520d86ed2fcadbfb561a559fcd3d 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -449,119 +449,122 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> struct sk_buff *skb)
> {
> struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
> -
> - if (newsk) {
> - const struct inet_request_sock *ireq = inet_rsk(req);
> - struct tcp_request_sock *treq = tcp_rsk(req);
> - struct inet_connection_sock *newicsk = inet_csk(newsk);
> - struct tcp_sock *newtp = tcp_sk(newsk);
> - struct tcp_sock *oldtp = tcp_sk(sk);
> -
> - smc_check_reset_syn_req(oldtp, req, newtp);
> -
> - /* Now setup tcp_sock */
> - newtp->pred_flags = 0;
> -
> - newtp->rcv_wup = newtp->copied_seq =
> - newtp->rcv_nxt = treq->rcv_isn + 1;
> - newtp->segs_in = 1;
> -
> - newtp->snd_sml = newtp->snd_una =
> - newtp->snd_nxt = newtp->snd_up = treq->snt_isn + 1;
> -
> - INIT_LIST_HEAD(&newtp->tsq_node);
> - INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
> -
> - tcp_init_wl(newtp, treq->rcv_isn);
> -
> - newtp->srtt_us = 0;
> - newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
> - minmax_reset(&newtp->rtt_min, tcp_jiffies32, ~0U);
> - newicsk->icsk_rto = TCP_TIMEOUT_INIT;
> - newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
> -
> - newtp->packets_out = 0;
> - newtp->retrans_out = 0;
> - newtp->sacked_out = 0;
> - newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
> - newtp->tlp_high_seq = 0;
> - newtp->lsndtime = tcp_jiffies32;
> - newsk->sk_txhash = treq->txhash;
> - newtp->last_oow_ack_time = 0;
> - newtp->total_retrans = req->num_retrans;
> -
> - /* So many TCP implementations out there (incorrectly) count the
> - * initial SYN frame in their delayed-ACK and congestion control
> - * algorithms that we must have the following bandaid to talk
> - * efficiently to them. -DaveM
> - */
> - newtp->snd_cwnd = TCP_INIT_CWND;
> - newtp->snd_cwnd_cnt = 0;
> -
> - /* There's a bubble in the pipe until at least the first ACK. */
> - newtp->app_limited = ~0U;
> -
> - tcp_init_xmit_timers(newsk);
> - newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
> -
> - newtp->rx_opt.saw_tstamp = 0;
> -
> - newtp->rx_opt.dsack = 0;
> - newtp->rx_opt.num_sacks = 0;
> -
> - newtp->urg_data = 0;
> -
> - if (sock_flag(newsk, SOCK_KEEPOPEN))
> - inet_csk_reset_keepalive_timer(newsk,
> - keepalive_time_when(newtp));
> -
> - newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
> - newtp->rx_opt.sack_ok = ireq->sack_ok;
> - newtp->window_clamp = req->rsk_window_clamp;
> - newtp->rcv_ssthresh = req->rsk_rcv_wnd;
> - newtp->rcv_wnd = req->rsk_rcv_wnd;
> - newtp->rx_opt.wscale_ok = ireq->wscale_ok;
> - if (newtp->rx_opt.wscale_ok) {
> - newtp->rx_opt.snd_wscale = ireq->snd_wscale;
> - newtp->rx_opt.rcv_wscale = ireq->rcv_wscale;
> - } else {
> - newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 0;
> - newtp->window_clamp = min(newtp->window_clamp, 65535U);
> - }
> - newtp->snd_wnd = (ntohs(tcp_hdr(skb)->window) <<
> - newtp->rx_opt.snd_wscale);
> - newtp->max_window = newtp->snd_wnd;
> -
> - if (newtp->rx_opt.tstamp_ok) {
> - newtp->rx_opt.ts_recent = req->ts_recent;
> - newtp->rx_opt.ts_recent_stamp = get_seconds();
> - newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> - } else {
> - newtp->rx_opt.ts_recent_stamp = 0;
> - newtp->tcp_header_len = sizeof(struct tcphdr);
> - }
> - newtp->tsoffset = treq->ts_off;
> + const struct inet_request_sock *ireq = inet_rsk(req);
> + struct tcp_request_sock *treq = tcp_rsk(req);
> + struct inet_connection_sock *newicsk;
> + struct tcp_sock *oldtp, *newtp;
> +
> + if (!newsk)
> + return NULL;
> +
> + newicsk = inet_csk(newsk);
> + newtp = tcp_sk(newsk);
> + oldtp = tcp_sk(sk);
> +
> + smc_check_reset_syn_req(oldtp, req, newtp);
> +
> + /* Now setup tcp_sock */
> + newtp->pred_flags = 0;
> +
> + newtp->rcv_wup = newtp->copied_seq =
> + newtp->rcv_nxt = treq->rcv_isn + 1;
> + newtp->segs_in = 1;
> +
> + newtp->snd_sml = newtp->snd_una =
> + newtp->snd_nxt = newtp->snd_up = treq->snt_isn + 1;
> +
> + INIT_LIST_HEAD(&newtp->tsq_node);
> + INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
> +
> + tcp_init_wl(newtp, treq->rcv_isn);
> +
> + newtp->srtt_us = 0;
> + newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
> + minmax_reset(&newtp->rtt_min, tcp_jiffies32, ~0U);
> + newicsk->icsk_rto = TCP_TIMEOUT_INIT;
> + newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
> +
> + newtp->packets_out = 0;
> + newtp->retrans_out = 0;
> + newtp->sacked_out = 0;
> + newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
> + newtp->tlp_high_seq = 0;
> + newtp->lsndtime = tcp_jiffies32;
> + newsk->sk_txhash = treq->txhash;
> + newtp->last_oow_ack_time = 0;
> + newtp->total_retrans = req->num_retrans;
> +
> + /* So many TCP implementations out there (incorrectly) count the
> + * initial SYN frame in their delayed-ACK and congestion control
> + * algorithms that we must have the following bandaid to talk
> + * efficiently to them. -DaveM
> + */
> + newtp->snd_cwnd = TCP_INIT_CWND;
> + newtp->snd_cwnd_cnt = 0;
> +
> + /* There's a bubble in the pipe until at least the first ACK. */
> + newtp->app_limited = ~0U;
> +
> + tcp_init_xmit_timers(newsk);
> + newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
> +
> + newtp->rx_opt.saw_tstamp = 0;
> +
> + newtp->rx_opt.dsack = 0;
> + newtp->rx_opt.num_sacks = 0;
> +
> + newtp->urg_data = 0;
> +
> + if (sock_flag(newsk, SOCK_KEEPOPEN))
> + inet_csk_reset_keepalive_timer(newsk,
> + keepalive_time_when(newtp));
> +
> + newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
> + newtp->rx_opt.sack_ok = ireq->sack_ok;
> + newtp->window_clamp = req->rsk_window_clamp;
> + newtp->rcv_ssthresh = req->rsk_rcv_wnd;
> + newtp->rcv_wnd = req->rsk_rcv_wnd;
> + newtp->rx_opt.wscale_ok = ireq->wscale_ok;
> + if (newtp->rx_opt.wscale_ok) {
> + newtp->rx_opt.snd_wscale = ireq->snd_wscale;
> + newtp->rx_opt.rcv_wscale = ireq->rcv_wscale;
> + } else {
> + newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 0;
> + newtp->window_clamp = min(newtp->window_clamp, 65535U);
> + }
> + newtp->snd_wnd = ntohs(tcp_hdr(skb)->window) << newtp->rx_opt.snd_wscale;
> + newtp->max_window = newtp->snd_wnd;
> +
> + if (newtp->rx_opt.tstamp_ok) {
> + newtp->rx_opt.ts_recent = req->ts_recent;
> + newtp->rx_opt.ts_recent_stamp = get_seconds();
> + newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> + } else {
> + newtp->rx_opt.ts_recent_stamp = 0;
> + newtp->tcp_header_len = sizeof(struct tcphdr);
> + }
> + newtp->tsoffset = treq->ts_off;
> #ifdef CONFIG_TCP_MD5SIG
> - newtp->md5sig_info = NULL; /*XXX*/
> - if (newtp->af_specific->md5_lookup(sk, newsk))
> - newtp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
> + newtp->md5sig_info = NULL; /*XXX*/
> + if (newtp->af_specific->md5_lookup(sk, newsk))
> + newtp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
> #endif
> - if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
> - newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
> - newtp->rx_opt.mss_clamp = req->mss;
> - tcp_ecn_openreq_child(newtp, req);
> - newtp->fastopen_req = NULL;
> - newtp->fastopen_rsk = NULL;
> - newtp->syn_data_acked = 0;
> - newtp->rack.mstamp = 0;
> - newtp->rack.advanced = 0;
> - newtp->rack.reo_wnd_steps = 1;
> - newtp->rack.last_delivered = 0;
> - newtp->rack.reo_wnd_persist = 0;
> - newtp->rack.dsack_seen = 0;
> + if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
> + newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
> + newtp->rx_opt.mss_clamp = req->mss;
> + tcp_ecn_openreq_child(newtp, req);
> + newtp->fastopen_req = NULL;
> + newtp->fastopen_rsk = NULL;
> + newtp->syn_data_acked = 0;
> + newtp->rack.mstamp = 0;
> + newtp->rack.advanced = 0;
> + newtp->rack.reo_wnd_steps = 1;
> + newtp->rack.last_delivered = 0;
> + newtp->rack.reo_wnd_persist = 0;
> + newtp->rack.dsack_seen = 0;
> +
> + __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
>
> - __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> - }
> return newsk;
> }
> EXPORT_SYMBOL(tcp_create_openreq_child);
> --
> 2.18.0.rc2.346.g013aa6912e-goog
>
^ permalink raw reply
* Re: [PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child
From: Neal Cardwell @ 2018-06-26 17:19 UTC (permalink / raw)
To: Yuchung Cheng; +Cc: Eric Dumazet, David Miller, Netdev, Eric Dumazet
In-Reply-To: <CAK6E8=deSubEg3QnLw7ZGAe8q=yOnuHJk7L1bO7KPFv2HB7Low@mail.gmail.com>
On Tue, Jun 26, 2018 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> net/ipv4/tcp_minisocks.c | 223 ++++++++++++++++++++-------------------
> 1 file changed, 113 insertions(+), 110 deletions(-)
Yes, very nice clean-up! Thanks for doing this.
Acked-by: Neal Cardwell <ncardwell@google.com>
neal
^ permalink raw reply
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-26 17:42 UTC (permalink / raw)
To: Cornelia Huck
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
Joao Martins
In-Reply-To: <20180626180316.3723422f.cohuck@redhat.com>
On Tue, Jun 26, 2018 at 06:03:16PM +0200, Cornelia Huck wrote:
> Ok, that makes me conclude that we definitely need to involve the
> libvirt folks before we proceed further with defining QEMU interfaces.
That's always a wise thing to do.
--
MST
^ permalink raw reply
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-26 17:50 UTC (permalink / raw)
To: Cornelia Huck
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
Joao Martins
In-Reply-To: <20180626170813.4db094a1.cohuck@redhat.com>
On Tue, Jun 26, 2018 at 05:08:13PM +0200, Cornelia Huck wrote:
> On Fri, 22 Jun 2018 17:05:04 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
> > On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > I suspect the diveregence will be lost on most users though
> > > simply because they don't even care about vfio. They just
> > > want things to go fast.
> >
> > Like Jason said, VF isn't faster than virtio-net in all cases. It
> > depends on the workload and performance metrics: throughput, latency,
> > or packet per second.
>
> So, will it be guest/admin-controllable then where the traffic flows
> through? Just because we do have a vf available after negotiation of
> the feature bit, it does not necessarily mean we want to use it? Do we
> (the guest) even want to make it visible in that case?
I think these ideas belong to what Alex Duyck wanted to do:
some kind of advanced device that isn't tied to
any network interfaces and allows workload and performance
specific tuning.
Way out of scope for a simple failover, and more importantly,
no one is looking at even enumerating the problems involved,
much less solving them.
--
MST
^ permalink raw reply
* [PATCH net-next v2 1/3] net: phy: xgmiitorgmii: Check phy_driver ready before accessing
From: Brandon Maier @ 2018-06-26 17:50 UTC (permalink / raw)
To: netdev
Cc: andrew, f.fainelli, davem, michal.simek, clayton.shotwell,
kristopher.cory, linux-kernel, Brandon Maier
Since a phy_device is added to the global mdio_bus list during
phy_device_register(), but a phy_device's phy_driver doesn't get
attached until phy_probe(). It's possible of_phy_find_device() in
xgmiitorgmii will return a valid phy with a NULL phy_driver. Leading to
a NULL pointer access during the memcpy().
Fixes this Oops:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: Oops: 5 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.40 #1
Hardware name: Xilinx Zynq Platform
task: ce4c8d00 task.stack: ce4ca000
PC is at memcpy+0x48/0x330
LR is at xgmiitorgmii_probe+0x90/0xe8
pc : [<c074bc68>] lr : [<c0529548>] psr: 20000013
sp : ce4cbb54 ip : 00000000 fp : ce4cbb8c
r10: 00000000 r9 : 00000000 r8 : c0c49178
r7 : 00000000 r6 : cdc14718 r5 : ce762800 r4 : cdc14710
r3 : 00000000 r2 : 00000054 r1 : 00000000 r0 : cdc14718
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 18c5387d Table: 0000404a DAC: 00000051
Process swapper/0 (pid: 1, stack limit = 0xce4ca210)
...
[<c074bc68>] (memcpy) from [<c0529548>] (xgmiitorgmii_probe+0x90/0xe8)
[<c0529548>] (xgmiitorgmii_probe) from [<c0526a94>] (mdio_probe+0x28/0x34)
[<c0526a94>] (mdio_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
[<c04db98c>] (driver_probe_device) from [<c04dbd58>] (__device_attach_driver+0xac/0x10c)
[<c04dbd58>] (__device_attach_driver) from [<c04d96f4>] (bus_for_each_drv+0x84/0xc8)
[<c04d96f4>] (bus_for_each_drv) from [<c04db5bc>] (__device_attach+0xd0/0x134)
[<c04db5bc>] (__device_attach) from [<c04dbdd4>] (device_initial_probe+0x1c/0x20)
[<c04dbdd4>] (device_initial_probe) from [<c04da8fc>] (bus_probe_device+0x98/0xa0)
[<c04da8fc>] (bus_probe_device) from [<c04d8660>] (device_add+0x43c/0x5d0)
[<c04d8660>] (device_add) from [<c0526cb8>] (mdio_device_register+0x34/0x80)
[<c0526cb8>] (mdio_device_register) from [<c0580b48>] (of_mdiobus_register+0x170/0x30c)
[<c0580b48>] (of_mdiobus_register) from [<c05349c4>] (macb_probe+0x710/0xc00)
[<c05349c4>] (macb_probe) from [<c04dd700>] (platform_drv_probe+0x44/0x80)
[<c04dd700>] (platform_drv_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
[<c04db98c>] (driver_probe_device) from [<c04dbc58>] (__driver_attach+0x10c/0x118)
[<c04dbc58>] (__driver_attach) from [<c04d9600>] (bus_for_each_dev+0x8c/0xd0)
[<c04d9600>] (bus_for_each_dev) from [<c04db1fc>] (driver_attach+0x2c/0x30)
[<c04db1fc>] (driver_attach) from [<c04daa98>] (bus_add_driver+0x50/0x260)
[<c04daa98>] (bus_add_driver) from [<c04dc440>] (driver_register+0x88/0x108)
[<c04dc440>] (driver_register) from [<c04dd6b4>] (__platform_driver_register+0x50/0x58)
[<c04dd6b4>] (__platform_driver_register) from [<c0b31248>] (macb_driver_init+0x24/0x28)
[<c0b31248>] (macb_driver_init) from [<c010203c>] (do_one_initcall+0x60/0x1a4)
[<c010203c>] (do_one_initcall) from [<c0b00f78>] (kernel_init_freeable+0x15c/0x1f8)
[<c0b00f78>] (kernel_init_freeable) from [<c0763d10>] (kernel_init+0x18/0x124)
[<c0763d10>] (kernel_init) from [<c0112d74>] (ret_from_fork+0x14/0x20)
Code: ba000002 f5d1f03c f5d1f05c f5d1f07c (e8b151f8)
---[ end trace 3e4ec21905820a1f ]---
Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
---
v2:
- Add Oops to commit message
v1: https://marc.info/?l=linux-netdev&m=152838762210538&w=2
drivers/net/phy/xilinx_gmii2rgmii.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/phy/xilinx_gmii2rgmii.c b/drivers/net/phy/xilinx_gmii2rgmii.c
index 2e5150b0b8d5..04c8bec1c4c1 100644
--- a/drivers/net/phy/xilinx_gmii2rgmii.c
+++ b/drivers/net/phy/xilinx_gmii2rgmii.c
@@ -81,6 +81,11 @@ static int xgmiitorgmii_probe(struct mdio_device *mdiodev)
return -EPROBE_DEFER;
}
+ if (!priv->phy_dev->drv) {
+ dev_info(dev, "Attached phy not ready\n");
+ return -EPROBE_DEFER;
+ }
+
priv->addr = mdiodev->addr;
priv->phy_drv = priv->phy_dev->drv;
memcpy(&priv->conv_phy_drv, priv->phy_dev->drv,
--
2.17.1
^ permalink raw reply related
* [PATCH net-next v2 2/3] net: phy: xgmiitorgmii: Use correct mdio bus
From: Brandon Maier @ 2018-06-26 17:50 UTC (permalink / raw)
To: netdev
Cc: andrew, f.fainelli, davem, michal.simek, clayton.shotwell,
kristopher.cory, linux-kernel, Brandon Maier
In-Reply-To: <20180626175050.71165-1-brandon.maier@rockwellcollins.com>
The xgmiitorgmii is using the mii_bus of the device it's attached to,
instead of the bus it was given during probe.
Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
---
v2:
- Fix trivial typo in commit message
v1: https://marc.info/?l=linux-netdev&m=152838761310537&w=2
drivers/net/phy/xilinx_gmii2rgmii.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/net/phy/xilinx_gmii2rgmii.c b/drivers/net/phy/xilinx_gmii2rgmii.c
index 04c8bec1c4c1..d6f8b64cddbe 100644
--- a/drivers/net/phy/xilinx_gmii2rgmii.c
+++ b/drivers/net/phy/xilinx_gmii2rgmii.c
@@ -33,17 +33,19 @@ struct gmii2rgmii {
struct phy_device *phy_dev;
struct phy_driver *phy_drv;
struct phy_driver conv_phy_drv;
- int addr;
+ struct mdio_device *mdio;
};
static int xgmiitorgmii_read_status(struct phy_device *phydev)
{
struct gmii2rgmii *priv = phydev->priv;
+ struct mii_bus *bus = priv->mdio->bus;
+ int addr = priv->mdio->addr;
u16 val = 0;
priv->phy_drv->read_status(phydev);
- val = mdiobus_read(phydev->mdio.bus, priv->addr, XILINX_GMII2RGMII_REG);
+ val = mdiobus_read(bus, addr, XILINX_GMII2RGMII_REG);
val &= ~XILINX_GMII2RGMII_SPEED_MASK;
if (phydev->speed == SPEED_1000)
@@ -53,7 +55,7 @@ static int xgmiitorgmii_read_status(struct phy_device *phydev)
else
val |= BMCR_SPEED10;
- mdiobus_write(phydev->mdio.bus, priv->addr, XILINX_GMII2RGMII_REG, val);
+ mdiobus_write(bus, addr, XILINX_GMII2RGMII_REG, val);
return 0;
}
@@ -86,7 +88,7 @@ static int xgmiitorgmii_probe(struct mdio_device *mdiodev)
return -EPROBE_DEFER;
}
- priv->addr = mdiodev->addr;
+ priv->mdio = mdiodev;
priv->phy_drv = priv->phy_dev->drv;
memcpy(&priv->conv_phy_drv, priv->phy_dev->drv,
sizeof(struct phy_driver));
--
2.17.1
^ permalink raw reply related
* [PATCH net-next v2 3/3] net: phy: xgmiitorgmii: Check read_status results
From: Brandon Maier @ 2018-06-26 17:50 UTC (permalink / raw)
To: netdev
Cc: andrew, f.fainelli, davem, michal.simek, clayton.shotwell,
kristopher.cory, linux-kernel, Brandon Maier
In-Reply-To: <20180626175050.71165-1-brandon.maier@rockwellcollins.com>
We're ignoring the result of the attached phy device's read_status().
Return it so we can detect errors.
Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
---
v2:
- No change
v1: https://marc.info/?l=linux-netdev&m=152838766410559&w=2
drivers/net/phy/xilinx_gmii2rgmii.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/phy/xilinx_gmii2rgmii.c b/drivers/net/phy/xilinx_gmii2rgmii.c
index d6f8b64cddbe..74a8782313cf 100644
--- a/drivers/net/phy/xilinx_gmii2rgmii.c
+++ b/drivers/net/phy/xilinx_gmii2rgmii.c
@@ -42,8 +42,11 @@ static int xgmiitorgmii_read_status(struct phy_device *phydev)
struct mii_bus *bus = priv->mdio->bus;
int addr = priv->mdio->addr;
u16 val = 0;
+ int err;
- priv->phy_drv->read_status(phydev);
+ err = priv->phy_drv->read_status(phydev);
+ if (err < 0)
+ return err;
val = mdiobus_read(bus, addr, XILINX_GMII2RGMII_REG);
val &= ~XILINX_GMII2RGMII_SPEED_MASK;
--
2.17.1
^ permalink raw reply related
* RE: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-26 17:50 UTC (permalink / raw)
To: Guenter Roeck, Andrew Lunn
Cc: linux-pm@vger.kernel.org, netdev@vger.kernel.org,
rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us
In-Reply-To: <20180626170012.GA28370@roeck-us.net>
> -----Original Message-----
> From: Guenter Roeck [mailto:linux@roeck-us.net]
> Sent: Tuesday, June 26, 2018 8:00 PM
> To: Andrew Lunn <andrew@lunn.ch>
> Cc: Vadim Pasternak <vadimp@mellanox.com>; linux-pm@vger.kernel.org;
> netdev@vger.kernel.org; rui.zhang@intel.com; edubezval@gmail.com;
> jiri@resnulli.us
> Subject: Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment
> module for port temperature reading
>
> On Tue, Jun 26, 2018 at 04:22:38PM +0200, Andrew Lunn wrote:
> > On Tue, Jun 26, 2018 at 12:10:28PM +0000, Vadim Pasternak wrote:
> >
> > Adding the linux-pm@vger.kernel.org list.
> >
> > > Add new core_env module to allow port temperature reading. This
> > > information has most critical impact on system's thermal monitoring
> > > and is to be used by core_hwmon and core_thermal modules.
> > >
> > > New internal API reads the temperature from all the modules, which
> > > are equipped with the thermal sensor and exposes temperature
> > > according to the worst measure. All individual temperature values
> > > are normalized to pre-defined range.
> >
> > This patchset has been sent to the netdev list before. I raised a few
> > questions about this, which is why it is now being posted to a bigger
> > group for review.
> >
> > The hardware has up to 64 temperature sensors. These sensors are
> > hot-plugable, since they are inside SFP modules, which are
> > hot-plugable. Different SFP modules can have different operating
> > temperature ranges. They contain an EEPROM which lists upper and lower
> > warning and fail temperatures, and report alarms when these thresholds
> > a reached.
> >
> > This code takes the 64 sensors readings and calculates a single value
> > it passes to one thermal zone. That thermal zone then controls one fan
> > to keep this single value in range.
> >
> > I queried is this is the correct way to do this? Would it not be
> > better to have up to 64 thermal zones? Leave the thermal core to
> > iterate over all the zones in order to determine how the fan should be
> > driven?
> >
> I very much think so. This problem must exist elsewhere; essentially it is the
> bundling of multiple temperature sensors into a single thermal zone. I am not
> sure if this should be 64 thermal zones or one thermal zone with up to 64
> sensors and some algorithm to select the relevant temperature; that would be
> up to the thermal subsystem maintainers to decide. Either case, the sensors
> should be handled and reported as individual sensors, with appropriate limits,
> not as single sensor.
> Yes, I understand that means we'll have hundreds of hwmon devices, but that
> should not be a problem (and if it is, we'll have to fix the problem, not the code
> exposing it).
I guess that many thermal zones with single PWM control will not work.
PWM will never stabilize in case there are some hot and some cold modules.
It seems it could be only temperature input array providing to the thermal
zone. And additionally it should have arrays at least for the warning and critical
thresholds.
We are using step-wise thermal algorithm as a default.
In case thermal zone will have multi temperature inputs this algorithm possibly
should be adapted for handling temperature arrays (input and thresholds)
along with the thermal zone normalization parameters - more or less the same
normalization process as I provided in this patch, but generic for the thermal
subsystem.
Or another possibility - to add some new thermal algorithm "step-wise-multi"
or something like that.
However, I have some concerns on this matter.
Our hardware provides bulk reading of the modules temperature, means
I can get all inputs by one hardware request, which is important optimization.
Reading each module individually will be resulted in huge overhead and will
require maybe some cashing of temperature inputs.
And also, now we have up to 64 modules per system and on the way the
system supporting 128 modules.
Would it be good to have such huge number of hwmon configuration records,
like:
HWMON_T_INPUT | HWMON_T_MAX_ALARM | HWMON_T_CRIT_ALARM ?
>
> I understand that the thermal subsystem does not currently support handling this
> problem. There may also be some missing pieces between the hwmon and
> thermal subsystems, such as reporting limits or alarms when a hwmon driver
> register with the thermal subsystem.
>
> Maybe it is time to add this support as part of this patch series ?
>
> > This is possibly the first board with so many sensors. However, i
> > doubt it is totally unique. Other big Ethernet switches with lots of
> > SFP modules may be added later. Also, 10G copper PHYs often have
> > temperature sensors, so this is not limited to just boards with
> > optical ports. So having a generic solution would be good.
>
> Agreed.
>
> Thanks,
> Guenter
>
> >
> > What do the Linux PM exports say about this?
> >
> > Thanks
> > Andrew
^ permalink raw reply
* Re: [PATCH rdma-next 08/12] overflow.h: Add arithmetic shift helper
From: Jason Gunthorpe @ 2018-06-26 17:54 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Leon Romanovsky, Doug Ledford, Kees Cook, Leon Romanovsky,
RDMA mailing list, Hadar Hen Zion, Matan Barak, Michael J Ruhl,
Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
linux-netdev, linux-kernel
In-Reply-To: <CAKwiHFiRYbyiJqDYCgKXKZYRr0KjCt8q9AwKwfqoCA1sT2KFyQ@mail.gmail.com>
On Tue, Jun 26, 2018 at 10:07:07AM +0200, Rasmus Villemoes wrote:
> On 25 June 2018 at 19:11, Jason Gunthorpe <[1]jgg@mellanox.com> wrote:
>
> On Mon, Jun 25, 2018 at 11:26:05AM +0200, Rasmus Villemoes wrote:
> > check_shift_overflow(a, s, d) {
> > unsigned _nbits = 8*sizeof(a);
> > typeof(a) _a = (a);
> > typeof(s) _s = (s);
> > typeof(d) _d = (d);
> >
> > *_d = ((u64)(_a) << (_s & (_nbits-1)));
> > _s >= _nbits || (_s > 0 && (_a >> (_nbits - _s -
> > is_signed_type(a))) != 0);
> > }
> Those types are not quite right.. What about this?
> check_shift_overflow(a, s, d) ({
> unsigned int _nbits = 8*sizeof(d) - is_signed_type(d);
> typeof(d) _a = a; // Shift is always performed on type 'd'
> typeof(s) _s = s;
> typeof(d) _d = d;
> *_d = (_a << (_s & (_nbits-1)));
> (((*_d) >> (_s & (_nbits-1)) != _a);
> })
>
> No, because, the check_*_overflow (and the __builtin_*_overflow
> cousins) functions must do their job without causing undefined
> behaviour, regardless of what crazy input values and types they are
> given.
Okay, I see you are concerned about a UB during shifting signed
values. I didn't consider that..
> Also, the output must be completely defined for all inputs [1].
> I omitted it for brevity, but I also wanted a and *d to have the same
> type, so there should also be one of those (void)(&_a == _d);
Humm. No, that doesn't match the use case. Typically this would take
an ABI constant like a u32 and shift it into a size_t for use with an
allocator. So demanding a and d have equal types is not good, and
requiring user casting is not good as the casting could be truncating.
> statements. See the other check_*_overflow and the commit adding them.
> Without the (u64) cast, any signed (and negative) a would cause UB in
> your suggestion.
When thinking about signed cases.. The explicit u64 cast, and
implict promotion to typeof(d), produce something counter intuitive,
eg:
(u64)(s32)-1 == 0xffffffffffffffff
Which would result in a shift oucome that is not what anyone would
expect, IMHO... So the first version isn't what I'd expect either..
> Also, having _nbits be 31 when a (and/or *d) has type
> int, and then and'ing the shift by 30 doesn't make any sense; I have no
> idea what you're trying to do.
Yes, it is not helpful to avoid UB when a is signed..
> [1] For this one, it would probably be most consistent to say that the
> result is a*2^s computed in infinite-precision, then truncated to fit
> in d.
I think this does not match the usual use cases, this should strictly
be an unsigned shift. The output is guarenteed to always be positive
or overflow is signaled.
Signed types are alllowed, but negative values are not.
What about more like this?
check_shift_overflow(a, s, d) ({
// Shift is always performed on the machine's largest unsigned
u64 _a = a;
typeof(s) _s = s;
typeof(d) _d = d;
// Make s safe against UB
unsigned int _to_shift = _s >= 0 && _s < 8*sizeof(*d) : _s ? 0;
*_d = (_a << _to_shift);
// s is malformed
(_to_shift != _s ||
// d is a signed type and became negative
*_d < 0 ||
// a is a signed type and was negative
_a < 0 ||
// Not invertable means a was truncated during shifting
(*_d >> _to_shift) != a))
})
I'm not seeing a UB with this?
Jason
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox