Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH][net-next] net: hinic: make functions set_ctrl0 and set_ctrl1 static
From: David Miller @ 2017-08-24  5:20 UTC (permalink / raw)
  To: colin.king; +Cc: aviad.krawczyk, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20170823095940.27225-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed, 23 Aug 2017 10:59:40 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The functions set_ctrl0 and set_ctrl1 are local to the source and do
> not need to be in global scope, so make them static.
> 
> Cleans up sparse warnings:
> symbol 'set_ctrl0' was not declared. Should it be static?
> symbol 'set_ctrl1' was not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] net/sock: allow the user to set negative peek offset
From: David Miller @ 2017-08-24  5:19 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, matthew, willemdebruijn.kernel, samanthakumar
In-Reply-To: <64e676e59d203f13cd326c3b0cf6b05287b69e1b.1503481941.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 23 Aug 2017 11:57:51 +0200

> This is necessary to allow the user to disable peeking with
> offset once it's enabled.
> Unix sockets already allow the above, with this patch we
> permit it for udp[6] sockets, too.
> 
> Fixes: 627d2d6b5500 ("udp: enable MSG_PEEK at non-zero offset")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH net 4/7] net:ethernet:aquantia: Fix for MCP state change.
From: David Miller @ 2017-08-24  5:17 UTC (permalink / raw)
  To: Pavel.Belous
  Cc: netdev, darcari, Igor.Russkikh, Nadezhda.Krupnina, simon.edelhaus
In-Reply-To: <daa58e36851b90b099ade078270bff7cf2872787.1503496727.git.pavel.belous@aquantia.com>

From: Pavel Belous <Pavel.Belous@aquantia.com>
Date: Wed, 23 Aug 2017 17:05:05 +0300

> diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.h b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.h
> index a66aee5..fc69408a 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.h
> +++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.h
> @@ -160,6 +160,19 @@ enum hal_atl_utils_fw_state_e {
>  	MPI_POWER = 4,
>  };
>  
> +union hal_atl_utils_hw_mpi_state_reg {
> +	u32 val;
> +	struct {
> +		u8 e_state;
> +		u8 reserved1;
> +		u8 u_speed;
> +		u8 reserved2:1;
> +		u8 disable_dirty_wake:1;
> +		u8 reserved3:2;
> +		u8 u_downshift:4;
> +	};
> +};
> +

You need to consider endianness when declaring bitfields in C.

Seriously, I'd rather you simply fixed the shifts and masks into the
u32 value than going down this route.

^ permalink raw reply

* Re: [PATCH net] net: dsa: use consume_skb()
From: David Miller @ 2017-08-24  5:14 UTC (permalink / raw)
  To: eric.dumazet
  Cc: f.fainelli, netdev, vivien.didelot, Woojung.Huh, UNGLinuxDriver
In-Reply-To: <1503549632.2499.77.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 Aug 2017 21:40:32 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Two kfree_skb() should be consume_skb(), to be friend with drop monitor
> (perf record ... -e skb:kfree_skb)
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply

* [PATCH net] net: dsa: use consume_skb()
From: Eric Dumazet @ 2017-08-24  4:40 UTC (permalink / raw)
  To: David Miller
  Cc: f.fainelli, netdev, vivien.didelot, Woojung.Huh, UNGLinuxDriver
In-Reply-To: <20170823.203725.2147193612476095021.davem@davemloft.net>

From: Eric Dumazet <edumazet@google.com>

Two kfree_skb() should be consume_skb(), to be friend with drop monitor
(perf record ... -e skb:kfree_skb)

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/dsa/tag_ksz.c     |    2 +-
 net/dsa/tag_trailer.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
index 3bd6e2a83125..fcd90f79458e 100644
--- a/net/dsa/tag_ksz.c
+++ b/net/dsa/tag_ksz.c
@@ -67,7 +67,7 @@ static struct sk_buff *ksz_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (skb_put_padto(nskb, nskb->len + padlen))
 			return NULL;
 
-		kfree_skb(skb);
+		consume_skb(skb);
 	}
 
 	tag = skb_put(nskb, KSZ_INGRESS_TAG_LEN);
diff --git a/net/dsa/tag_trailer.c b/net/dsa/tag_trailer.c
index b09e56214005..9c7b1d74a5c6 100644
--- a/net/dsa/tag_trailer.c
+++ b/net/dsa/tag_trailer.c
@@ -40,7 +40,7 @@ static struct sk_buff *trailer_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_set_network_header(nskb, skb_network_header(skb) - skb->head);
 	skb_set_transport_header(nskb, skb_transport_header(skb) - skb->head);
 	skb_copy_and_csum_dev(skb, skb_put(nskb, skb->len));
-	kfree_skb(skb);
+	consume_skb(skb);
 
 	if (padlen) {
 		skb_put_zero(nskb, padlen);

^ permalink raw reply related

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Michael S. Tsirkin @ 2017-08-24  4:34 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <CAF=yD-KSek+LmZu0X0TXetmFEQ59iF3NpjZ4KugbwLo1BGfhaA@mail.gmail.com>

On Wed, Aug 23, 2017 at 11:28:24PM -0400, Willem de Bruijn wrote:
> >> > * as a generic solution, if we were to somehow overcome the safety issue, track
> >> > the delay and do copy if some threshold is reached could be an answer, but it's
> >> > hard for now.> * so things like the current vhost-net implementation of deciding whether or not
> >> > to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
> >> > practical compromise.
> >>
> >> The fragility of this mechanism is another argument for switching to tx napi
> >> as default.
> >>
> >> Is there any more data about the windows guest issues when completions
> >> are not queued within a reasonable timeframe? What is this timescale and
> >> do we really need to work around this.
> >
> > I think it's pretty large, many milliseconds.
> >
> > But I wonder what do you mean by "work around". Using buffers within
> > limited time frame sounds like a reasonable requirement to me.
> 
> Vhost-net zerocopy delays completions until the skb is really
> sent. Traffic shaping can introduce msec timescale latencies.
> 
> The delay may actually be a useful signal. If the guest does not
> orphan skbs early, TSQ will throttle the socket causing host
> queue build up.
> 
> But, if completions are queued in-order, unrelated flows may be
> throttled as well. Allowing out of order completions would resolve
> this HoL blocking.

There's no issue with out of order. It does not break any guests AFAIK.

> > Neither
> > do I see why would using tx interrupts within guest be a work around -
> > AFAIK windows driver uses tx interrupts.
> 
> It does not address completion latency itself. What I meant was
> that in an interrupt-driver model, additional starvation issues,
> such as the potential deadlock raised at the start of this thread,
> or the timer delay observed before packets were orphaned in
> virtio-net in commit b0c39dbdc204, are mitigated.
> 
> Specifically, it breaks the potential deadlock where sockets are
> blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
> yet completion handling is blocked waiting for a new packet to
> trigger free_old_xmit_skbs from start_xmit.
> 
> >> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
> >
> > We don't enable network watchdog on virtio but we could and maybe
> > should.
> 
> Can you elaborate?

^ permalink raw reply

* Re: [patch net-next 0/5] mlxsw: spectrum: Introduce multichain TC offload
From: David Miller @ 2017-08-24  3:45 UTC (permalink / raw)
  To: jiri; +Cc: netdev, idosch, mlxsw, jhs, xiyou.wangcong
In-Reply-To: <20170823080652.14847-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Wed, 23 Aug 2017 10:06:52 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> This patchset introduces offloading of rules added to chain with
> non-zero index, which was previously forbidden. Also, goto_chain
> termination action is offloaded allowing to jump to processing
> of desired chain.

Series applied, thanks Jiri.

^ permalink raw reply

* Re: [PATCH net-next 0/3] net: mvpp2: software TSO support
From: David Miller @ 2017-08-24  3:42 UTC (permalink / raw)
  To: antoine.tenart
  Cc: andrew, gregory.clement, thomas.petazzoni, nadavh, linux, mw,
	stefanc, netdev
In-Reply-To: <20170823074656.14595-1-antoine.tenart@free-electrons.com>

From: Antoine Tenart <antoine.tenart@free-electrons.com>
Date: Wed, 23 Aug 2017 09:46:53 +0200

> This series adds the s/w TSO support in the PPv2 driver, in addition to
> two cosmetic commits. As stated in patch 3/3:
> 
> Using iperf and 10G ports, using TSO shows a significant performance
> improvement by a factor 2 to reach around 9.5Gbps in TX; as well as a
> significant CPU usage drop (from 25% to 15%).

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net 0/3] nfp: fix SR-IOV deadlock and representor bugs
From: David Miller @ 2017-08-24  3:40 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20170823062244.26580-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 22 Aug 2017 23:22:41 -0700

> This series tackles the bug I've already tried to fix in commit 
> 6d48ceb27af1 ("nfp: allocate a private workqueue for driver work").
> I created a separate workqueue to avoid possible deadlock, and
> the lockdep error disappeared, coincidentally.  The way workqueues
> are operating, separate workqueue doesn't necessarily mean separate
> thread of execution.  Luckily we can safely forego the lock.
> 
> Second fix changes the order in which vNIC netdevs and representors
> are created/destroyed.  The fix is kept small and should be sufficient
> for net because of how flower uses representors, a more thorough fix 
> will be targeted at net-next.
> 
> Third fix avoids leaking mapped frame buffers if FW sent a frame with
> unknown portid.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: do metrics match when looking up and deleting a route
From: David Miller @ 2017-08-24  3:38 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, hannes
In-Reply-To: <63b0473e7248f4931d46c45ac64e63b07bea6eeb.1503454046.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 23 Aug 2017 10:07:26 +0800

> Now when ipv4 route inserts a fib_info, it memcmp fib_metrics.
> It means ipv4 route identifies one route also with metrics.
> 
> But when removing a route, it tries to find the route without
> caring about the metrics. It will cause that the route with
> right metrics can't be removed.
> 
> Thomas noticed this issue when doing the testing:
> 
> 1. add:
>    # ip route append 192.168.7.0/24 dev v window 1000
>    # ip route append 192.168.7.0/24 dev v window 1001
>    # ip route append 192.168.7.0/24 dev v window 1002
>    # ip route append 192.168.7.0/24 dev v window 1003
> 2. delete:
>    # ip route delete 192.168.7.0/24 dev v window 1002
> 3. show:
>      192.168.7.0/24 proto boot scope link window 1001
>      192.168.7.0/24 proto boot scope link window 1002
>      192.168.7.0/24 proto boot scope link window 1003
> 
> The one with window 1002 wasn't deleted but the first one was.
> 
> This patch is to do metrics match when looking up and deleting
> one route.
> 
> Reported-by: Thomas Haller <thaller@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Applied.

^ permalink raw reply

* Re: [PATCH net v2 0/2] net: dsa: Fix tag_ksz.c
From: David Miller @ 2017-08-24  3:37 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, vivien.didelot, Woojung.Huh, UNGLinuxDriver
In-Reply-To: <20170822221215.16305-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Tue, 22 Aug 2017 15:12:13 -0700

> This implements David's suggestion of providing low-level functions
> to control whether skb_pad() and skb_put_padto() should be freeing
> the passed skb.
> 
> We make use of it to fix a double free in net/dsa/tag_ksz.c that would
> occur if we kept using skb_put_padto() in both places.

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH v2 net-next 0/2] Add software rx timestamp for TCP.
From: David Miller @ 2017-08-24  3:32 UTC (permalink / raw)
  To: maloneykernel; +Cc: netdev, willemdebruijn.kernel, soheil, maloney
In-Reply-To: <20170822210849.23162-1-maloneykernel@gmail.com>

From: Mike Maloney <maloneykernel@gmail.com>
Date: Tue, 22 Aug 2017 17:08:47 -0400

> From: Mike Maloney <maloney@google.com>
> 
> Add software rx timestamps for TCP, and a test to ensure consistency of
> behavior between IP, UDP, and TCP implementation.
> 
> Changes since v1:
>   -Initialize tss->ts[1] to 0 if caller requested any timestamps.
>   -Fix test case to validate that tss->ts[1] is zero.
>   -Fix tests to actually use a raw socket.
>   -Fix --tcp flag to work on the test.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] liquidio: change manner of detecting whether or not NIC firmware is loaded
From: David Miller @ 2017-08-24  3:29 UTC (permalink / raw)
  To: felix.manlunas; +Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla
In-Reply-To: <20170822194637.GA6875@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Tue, 22 Aug 2017 12:46:37 -0700

> In the NIC firmware, the 1-bit flag indicating "firmware is loaded" moved
> from SLI_SCRATCH_1 to SLI_SCRATCH_2 (these are Octeon general-purpose
> scratch registers).  Make the PF driver conform to this change.
> 
> Remove code that sets the "firmware is loaded" flag because it's now the
> firmware's job to do that.
> 
> In the code that detects whether or not the firmware is loaded, don't just
> rely on checking the "firmware is loaded" flag because that may cause a
> rare false negative.  Add code that deduces whether or not the firmware is
> loaded; that will never give a false negative.
> 
> Also bump up driver version to match newer NIC firmware.
> 
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
> Signed-off-by: Derek Chickles <derek.chickles@cavium.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-24  3:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <20170824014553-mutt-send-email-mst@kernel.org>

>> > * as a generic solution, if we were to somehow overcome the safety issue, track
>> > the delay and do copy if some threshold is reached could be an answer, but it's
>> > hard for now.> * so things like the current vhost-net implementation of deciding whether or not
>> > to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
>> > practical compromise.
>>
>> The fragility of this mechanism is another argument for switching to tx napi
>> as default.
>>
>> Is there any more data about the windows guest issues when completions
>> are not queued within a reasonable timeframe? What is this timescale and
>> do we really need to work around this.
>
> I think it's pretty large, many milliseconds.
>
> But I wonder what do you mean by "work around". Using buffers within
> limited time frame sounds like a reasonable requirement to me.

Vhost-net zerocopy delays completions until the skb is really
sent. Traffic shaping can introduce msec timescale latencies.

The delay may actually be a useful signal. If the guest does not
orphan skbs early, TSQ will throttle the socket causing host
queue build up.

But, if completions are queued in-order, unrelated flows may be
throttled as well. Allowing out of order completions would resolve
this HoL blocking.

> Neither
> do I see why would using tx interrupts within guest be a work around -
> AFAIK windows driver uses tx interrupts.

It does not address completion latency itself. What I meant was
that in an interrupt-driver model, additional starvation issues,
such as the potential deadlock raised at the start of this thread,
or the timer delay observed before packets were orphaned in
virtio-net in commit b0c39dbdc204, are mitigated.

Specifically, it breaks the potential deadlock where sockets are
blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
yet completion handling is blocked waiting for a new packet to
trigger free_old_xmit_skbs from start_xmit.

>> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
>
> We don't enable network watchdog on virtio but we could and maybe
> should.

Can you elaborate?

^ permalink raw reply

* Re: [PATCH v2] net: stmmac: socfgpa: Ensure emac bit set in sys manager for MII/GMII/SGMII.
From: David Miller @ 2017-08-24  3:27 UTC (permalink / raw)
  To: stephan.gatzka; +Cc: peppe.cavallaro, alexandre.torgue, netdev, linux-kernel
In-Reply-To: <1503404707-25662-1-git-send-email-stephan.gatzka@gmail.com>

From: Stephan Gatzka <stephan.gatzka@gmail.com>
Date: Tue, 22 Aug 2017 14:25:07 +0200

> When using MII/GMII/SGMII in the Altera SoC, the phy needs to be
> wired through the FPGA. To ensure correct behavior, the appropriate
> bit in the System Manager FPGA Interface Group register needs to be
> set.
> 
> Signed-off-by: Stephan Gatzka <stephan.gatzka@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] inet_diag: report TCP MD5 signing keys and addresses
From: Eric Dumazet @ 2017-08-24  3:20 UTC (permalink / raw)
  To: Ivan Delalande; +Cc: David Miller, netdev
In-Reply-To: <20170824012210.m3b2hsll4avrr42i@ycc.fr>

On Thu, 2017-08-24 at 03:22 +0200, Ivan Delalande wrote:
> Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
> processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
> not possible to retrieve these from the kernel once they have been
> configured on sockets.

I really find that all these changes in net/ipv4/inet_diag.c
for TCP stuff (and #ifdef CONFIG_TCP_MD5SIG all over the places) are
ugly.

Also, since you do not lock the socket, inet_diag_put_md5sig()
might see different lists (&md5sig->head) and you could either write non
reserved memory, or at the contrary report not initialized kernel data
to user.

+static int inet_diag_put_md5sig(struct sk_buff *skb,
+                               const struct tcp_md5sig_info *md5sig)
+{
+       const struct tcp_md5sig_key *key;
+       struct nlattr *attr;
+       struct tcp_md5sig *info = NULL;
+       int md5sig_count = 0;
+
+       hlist_for_each_entry_rcu(key, &md5sig->head, node)
+               md5sig_count++;
+
+       attr = nla_reserve(skb, INET_DIAG_MD5SIG,
+                          md5sig_count * sizeof(struct tcp_md5sig));
+       if (!attr)
+               return -EMSGSIZE;
+
+       info = nla_data(attr);
+       hlist_for_each_entry_rcu(key, &md5sig->head, node) {
+               inet_diag_md5sig_fill(info, key);
+               info++;

Here we might see different keys than computed (in md5sig_count)

+       }
+
+       return 0;
+}
+#endif
+

^ permalink raw reply

* Re: [PATCH net-next] bpf: netdev is never null in __dev_map_flush
From: John Fastabend @ 2017-08-24  3:10 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, davem; +Cc: netdev
In-Reply-To: <1efe2357-43aa-8976-75a0-62d4572287cc@fb.com>

On 08/23/2017 06:25 PM, Alexei Starovoitov wrote:
> On 8/23/17 6:20 PM, Daniel Borkmann wrote:
>> No need to test for it in fast-path, every dev in bpf_dtab_netdev
>> is guaranteed to be non-NULL, otherwise dev_map_update_elem() will
>> fail in the first place.
>>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> 
> wow. interesting. I'm surprised you see a difference from
> such micro-optimization.
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> 
> 

Thanks for the clean up, I was a bit over paranoid here as well. Is
this actually noticeable in pps benchmark or just making the code
cleaner? Just curious.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next v7 08/10] bpf: Add a Landlock sandbox example
From: Alexei Starovoitov @ 2017-08-24  2:59 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Alexei Starovoitov,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, Jonathan Corbet,
	Matthew Garrett, Michael Kerrisk, Kees Cook, Paul Moore,
	Sargun Dhillon, Serge E . Hallyn, Shuah Khan, Tejun Heo,
	Thomas Graf <tgr
In-Reply-To: <20170821000933.13024-9-mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org>

On Mon, Aug 21, 2017 at 02:09:31AM +0200, Mickaël Salaün wrote:
> Add a basic sandbox tool to create a process isolated from some part of
> the system. This sandbox create a read-only environment. It is only
> allowed to write to a character device such as a TTY:
> 
>   # :> X
>   # echo $?
>   0
>   # ./samples/bpf/landlock1 /bin/sh -i
>   Launching a new sandboxed process.
>   # :> Y
>   cannot create Y: Operation not permitted
> 
> Signed-off-by: Mickaël Salaün <mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org>

...

> +SEC("landlock1")
> +static int landlock_fs_prog1(struct landlock_context *ctx)
> +{
> +	char fmt_error_mode[] = "landlock1: error: get_mode:%lld\n";
> +	char fmt_error_access[] = "landlock1: error: access denied\n";
> +	long long ret;
> +
> +	/*
> +	 * The argument ctx->arg2 contains bitflags of actions for which the
> +	 * rule is run.  The flag LANDLOCK_ACTION_FS_WRITE means that a write
> +	 * is requested by one of the userspace processes restricted by this
> +	 * rule. The following test allows any actions which does not include a
> +	 * write.
> +	 */
> +	if (!(ctx->arg2 & LANDLOCK_ACTION_FS_WRITE))
> +		return 0;
> +
> +	/*
> +	 * The argument ctx->arg1 is a file handle for which the process want
> +	 * to access. The function bpf_handle_fs_get_mode() return the mode of
> +	 * a file (e.g. S_IFBLK, S_IFDIR, S_IFREG...). If there is an error,
> +	 * for example if the argument is not a file handle, then an
> +	 * -errno value is returned. Otherwise the caller get the file mode as
> +	 *  with stat(2).
> +	 */
> +	ret = bpf_handle_fs_get_mode((void *)ctx->arg1);
> +	if (ret < 0) {
> +
> +		/*
> +		 * The bpf_trace_printk() function enable to write in the
> +		 * kernel eBPF debug log, accessible through
> +		 * /sys/kernel/debug/tracing/trace_pipe . To be allowed to call
> +		 * this function, a Landlock rule must have the
> +		 * LANDLOCK_SUBTYPE_ABILITY_DEBUG ability, which is only
> +		 * allowed for CAP_SYS_ADMIN.
> +		 */
> +		bpf_trace_printk(fmt_error_mode, sizeof(fmt_error_mode), ret);
> +		return 1;
> +	}
> +
> +	/*
> +	 * This check allows the action on the file if it is a directory or a
> +	 * pipe. Otherwise, a message is printed to the eBPF log.
> +	 */
> +	if (S_ISCHR(ret) || S_ISFIFO(ret))
> +		return 0;
> +	bpf_trace_printk(fmt_error_access, sizeof(fmt_error_access));
> +	return 1;
> +}
> +
> +/*
> + * This subtype enable to set the ABI, which ensure that the eBPF context and
> + * program behavior will be compatible with this Landlock rule.
> + */
> +SEC("subtype")
> +static const union bpf_prog_subtype _subtype = {
> +	.landlock_rule = {
> +		.abi = 1,
> +		.event = LANDLOCK_SUBTYPE_EVENT_FS,
> +		.ability = LANDLOCK_SUBTYPE_ABILITY_DEBUG,
> +	}
> +};

from rule writer perspective can you somehow merge subtype definition
with the program? It seems they go hand in hand.
Like section name of the program can be:
SEC("landlock_rule1/event=fs/ability=debug")
static int landlock_fs_prog1(struct landlock_context *ctx)...
and the loader can parse this string and prepare appropriate
data structures for the kernel.

^ permalink raw reply

* Re: [PATCH net-next v7 05/10] landlock: Add LSM hooks related to filesystem
From: Alexei Starovoitov @ 2017-08-24  2:50 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Alexei Starovoitov,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, Jonathan Corbet,
	Matthew Garrett, Michael Kerrisk, Kees Cook, Paul Moore,
	Sargun Dhillon, Serge E . Hallyn, Shuah Khan, Tejun Heo,
	Thomas Graf <tgr
In-Reply-To: <20170821000933.13024-6-mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org>

On Mon, Aug 21, 2017 at 02:09:28AM +0200, Mickaël Salaün wrote:
> Handle 33 filesystem-related LSM hooks for the Landlock filesystem
> event: LANDLOCK_SUBTYPE_EVENT_FS.
> 
> A Landlock event wrap LSM hooks for similar kernel object types (e.g.
> struct file, struct path...). Multiple LSM hooks can trigger the same
> Landlock event.
> 
> Landlock handle nine coarse-grained actions: read, write, execute, new,
> get, remove, ioctl, lock and fcntl. Each of them abstract LSM hook
> access control in a way that can be extended in the future.
> 
> The Landlock LSM hook registration is done after other LSM to only run
> actions from user-space, via eBPF programs, if the access was granted by
> major (privileged) LSMs.
> 
> Signed-off-by: Mickaël Salaün <mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org>

...

> +/* WRAP_ARG_SB */
> +#define WRAP_ARG_SB_TYPE	WRAP_TYPE_FS
> +#define WRAP_ARG_SB_DEC(arg)					\
> +	EXPAND_C(WRAP_TYPE_FS) wrap_##arg =			\
> +	{ .type = BPF_HANDLE_FS_TYPE_DENTRY, .dentry = arg->s_root };
> +#define WRAP_ARG_SB_VAL(arg)	((uintptr_t)&wrap_##arg)
> +#define WRAP_ARG_SB_OK(arg)	(arg && arg->s_root)
...

> +HOOK_NEW_FS(sb_remount, 2,
> +	struct super_block *, sb,
> +	void *, data,
> +	WRAP_ARG_SB, sb,
> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> +);

this looks wrong. casting super_block to dentry?

> +/* a directory inode contains only one dentry */
> +HOOK_NEW_FS(inode_create, 3,
> +	struct inode *, dir,
> +	struct dentry *, dentry,
> +	umode_t, mode,
> +	WRAP_ARG_INODE, dir,
> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> +);

more general question: why you're not wrapping all useful
arguments? Like in the above dentry can be acted upon
by the landlock rule and it's readily available...

The limitation of only 2 args looks odd.
Is it a hard limitation ? how hard to extend?

^ permalink raw reply

* Re: [PATCH v2 3/4] net: phy: realtek: add disable RX internal delay mode
From: kbuild test robot @ 2017-08-24  2:35 UTC (permalink / raw)
  To: Icenowy Zheng
  Cc: kbuild-all-JC7UmRfGjtg, Maxime Ripard, Chen-Yu Tsai,
	Florian Fainelli, Corentin Labbe,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-sunxi-/JYPxA39Uh5TLH3MbocFFw,
	Icenowy Zheng
In-Reply-To: <20170822040400.12166-4-icenowy-h8G6r0blFSE@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 2129 bytes --]

Hi Icenowy,

[auto build test WARNING on net-next/master]
[also build test WARNING on v4.13-rc6 next-20170823]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Icenowy-Zheng/Workaround-broken-RTL8211E-on-some-Pine64-boards/20170823-234405
config: x86_64-randconfig-b0-08240928 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/net/phy/realtek.c: In function 'rtl8211e_config_init':
>> drivers/net/phy/realtek.c:130: warning: unused variable 'of_node'

vim +/of_node +130 drivers/net/phy/realtek.c

   126	
   127	static int rtl8211e_config_init(struct phy_device *phydev)
   128	{
   129		struct device *dev = &phydev->mdio.dev;
 > 130		struct device_node *of_node = dev->of_node;
   131		int ret;
   132	
   133		ret = genphy_config_init(phydev);
   134		if (ret < 0)
   135			return ret;
   136	
   137		if (phydev->interface == PHY_INTERFACE_MODE_RGMII_TXID) {
   138			/* Disable the RX internal delay here.
   139			 *
   140			 * All the magic numbers are not documented on RTL8211E
   141			 * datasheet. They're said to be from Realtek by Pine64.
   142			 */
   143			phy_write(phydev, RTL8211_PAGE_SELECT, RTL8211E_EXT_PAGE);
   144			phy_write(phydev, RTL8211E_EXT_PAGE_SELECT, 0xa4);
   145			phy_write(phydev, 0x1c, 0xb591);
   146	
   147			/* Restore to default page 0 */
   148			phy_write(phydev, RTL8211_PAGE_SELECT, 0);
   149		}
   150	
   151		return 0;
   152	}
   153	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

-- 
You received this message because you are subscribed to the Google Groups "linux-sunxi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to linux-sunxi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit https://groups.google.com/d/optout.

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32072 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v7 01/10] selftest: Enhance kselftest_harness.h with a step mechanism
From: Alexei Starovoitov @ 2017-08-24  2:31 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf, Will Drewry,
	kernel-hardening, linux-api, linux-sec
In-Reply-To: <20170821000933.13024-2-mic@digikod.net>

On Mon, Aug 21, 2017 at 02:09:24AM +0200, Mickaël Salaün wrote:
> This step mechanism may be useful to return an information about the
> error without being able to write to TH_LOG_STREAM.
> 
> Set _metadata->no_print to true to print this counter.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Will Drewry <wad@chromium.org>
> Link: https://lkml.kernel.org/r/CAGXu5j+D-FP8Kt9unNOqKrQJP4DYTpmgkJxWykZyrYiVPz3Y3Q@mail.gmail.com
> ---
> 
> This patch is intended to the kselftest tree:
> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
> 
> Changes since v6:
> * add the step counter in assert/expect macros and use _metadata to
>   enable the counter (suggested by Kees Cook)
> ---
>  tools/testing/selftests/kselftest_harness.h   | 31 ++++++++++++++++++++++-----
>  tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
>  2 files changed, 27 insertions(+), 6 deletions(-)

is there a dependency on this in patches 2+ ?
if not, I would send this patch to selftests right away.

^ permalink raw reply

* Re: [PATCH net-next v7 03/10] bpf,landlock: Define an eBPF program type for a Landlock rule
From: Alexei Starovoitov @ 2017-08-24  2:28 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf, Will Drewry,
	kernel-hardening, linux-api, linux-sec
In-Reply-To: <20170821000933.13024-4-mic@digikod.net>

On Mon, Aug 21, 2017 at 02:09:26AM +0200, Mickaël Salaün wrote:
> Add a new type of eBPF program used by Landlock rules.
> 
> This new BPF program type will be registered with the Landlock LSM
> initialization.
> 
> Add an initial Landlock Kconfig.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: James Morris <james.l.morris@oracle.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Serge E. Hallyn <serge@hallyn.com>
> ---
> 
> Changes since v6:
> * add 3 more sub-events: IOCTL, LOCK, FCNTL
>   https://lkml.kernel.org/r/2fbc99a6-f190-f335-bd14-04bdeed35571@digikod.net
> * rename LANDLOCK_VERSION to LANDLOCK_ABI to better reflect its purpose,
>   and move it from landlock.h to common.h
> * rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE: an eBPF
>   program could be used for something else than a rule
> * simplify struct landlock_context by removing the arch and syscall_nr fields
> * remove all eBPF map functions call, remove ABILITY_WRITE
> * refactor bpf_landlock_func_proto() (suggested by Kees Cook)
> * constify pointers
> * fix doc inclusion
> 
> Changes since v5:
> * rename file hooks.c to init.c
> * fix spelling
> 
> Changes since v4:
> * merge a minimal (not enabled) LSM code and Kconfig in this commit
> 
> Changes since v3:
> * split commit
> * revamp the landlock_context:
>   * add arch, syscall_nr and syscall_cmd (ioctl, fcntl…) to be able to
>     cross-check action with the event type
>   * replace args array with dedicated fields to ease the addition of new
>     fields
> ---
>  include/linux/bpf_types.h      |  3 ++
>  include/uapi/linux/bpf.h       | 97 +++++++++++++++++++++++++++++++++++++++++
>  security/Kconfig               |  1 +
>  security/Makefile              |  2 +
>  security/landlock/Kconfig      | 18 ++++++++
>  security/landlock/Makefile     |  3 ++
>  security/landlock/common.h     | 21 +++++++++
>  security/landlock/init.c       | 98 ++++++++++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 97 +++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 340 insertions(+)
>  create mode 100644 security/landlock/Kconfig
>  create mode 100644 security/landlock/Makefile
>  create mode 100644 security/landlock/common.h
>  create mode 100644 security/landlock/init.c
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 6f1a567667b8..8bac93970a47 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -18,6 +18,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe_prog_ops)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint_prog_ops)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event_prog_ops)
>  #endif
> +#ifdef CONFIG_SECURITY_LANDLOCK
> +BPF_PROG_TYPE(BPF_PROG_TYPE_LANDLOCK_RULE, bpf_landlock_ops)
> +#endif
>  
>  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 8541ab85e432..20da634da941 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -129,6 +129,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_LWT_XMIT,
>  	BPF_PROG_TYPE_SOCK_OPS,
>  	BPF_PROG_TYPE_SK_SKB,
> +	BPF_PROG_TYPE_LANDLOCK_RULE,
>  };
>  
>  enum bpf_attach_type {
> @@ -879,4 +880,100 @@ enum {
>  #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
>  #define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
>  
> +/**
> + * enum landlock_subtype_event - event occurring when an action is performed on
> + * a particular kernel object
> + *
> + * An event is a policy decision point which exposes the same context type
> + * (especially the same arg[0-9] field types) for each rule execution.
> + *
> + * @LANDLOCK_SUBTYPE_EVENT_UNSPEC: invalid value
> + * @LANDLOCK_SUBTYPE_EVENT_FS: generic filesystem event
> + * @LANDLOCK_SUBTYPE_EVENT_FS_IOCTL: custom IOCTL sub-event
> + * @LANDLOCK_SUBTYPE_EVENT_FS_LOCK: custom LOCK sub-event
> + * @LANDLOCK_SUBTYPE_EVENT_FS_FCNTL: custom FCNTL sub-event
> + */
> +enum landlock_subtype_event {
> +	LANDLOCK_SUBTYPE_EVENT_UNSPEC,
> +	LANDLOCK_SUBTYPE_EVENT_FS,
> +	LANDLOCK_SUBTYPE_EVENT_FS_IOCTL,
> +	LANDLOCK_SUBTYPE_EVENT_FS_LOCK,
> +	LANDLOCK_SUBTYPE_EVENT_FS_FCNTL,
> +};
> +#define _LANDLOCK_SUBTYPE_EVENT_LAST LANDLOCK_SUBTYPE_EVENT_FS_FCNTL
> +
> +/**
> + * DOC: landlock_subtype_ability
> + *
> + * eBPF context and functions allowed for a rule
> + *
> + * - LANDLOCK_SUBTYPE_ABILITY_DEBUG: allows to do debug actions (e.g. writing
> + *   logs), which may be dangerous and should only be used for rule testing
> + */
> +#define LANDLOCK_SUBTYPE_ABILITY_DEBUG		(1ULL << 0)
> +#define _LANDLOCK_SUBTYPE_ABILITY_NB		1
> +#define _LANDLOCK_SUBTYPE_ABILITY_MASK		((1ULL << _LANDLOCK_SUBTYPE_ABILITY_NB) - 1)

can you move the last two macros out of uapi?
There is really no need for the implementation details to
leak into uapi.

> +
> +/*
> + * Future options for a Landlock rule (e.g. run even if a previous rule denied
> + * an action).
> + */
> +#define _LANDLOCK_SUBTYPE_OPTION_NB		0
> +#define _LANDLOCK_SUBTYPE_OPTION_MASK		((1ULL << _LANDLOCK_SUBTYPE_OPTION_NB) - 1)

same here

> +
> +/*
> + * Status visible in the @status field of a context (e.g. already called in
> + * this syscall session, with same args...).
> + *
> + * The @status field exposed to a rule shall depend on the rule version.
> + */
> +#define _LANDLOCK_SUBTYPE_STATUS_NB		0
> +#define _LANDLOCK_SUBTYPE_STATUS_MASK		((1ULL << _LANDLOCK_SUBTYPE_STATUS_NB) - 1)

and here

> +
> +/**
> + * DOC: landlock_action_fs
> + *
> + * - %LANDLOCK_ACTION_FS_EXEC: execute a file or walk through a directory
> + * - %LANDLOCK_ACTION_FS_WRITE: modify a file or a directory view (which
> + *   include mount actions)
> + * - %LANDLOCK_ACTION_FS_READ: read a file or a directory
> + * - %LANDLOCK_ACTION_FS_NEW: create a file or a directory
> + * - %LANDLOCK_ACTION_FS_GET: open or receive a file
> + * - %LANDLOCK_ACTION_FS_REMOVE: unlink a file or remove a directory
> + *
> + * Each of the following actions are specific to syscall multiplexers. Each of
> + * them trigger a dedicated Landlock event where their command can be read.
> + *
> + * - %LANDLOCK_ACTION_FS_IOCTL: ioctl command
> + * - %LANDLOCK_ACTION_FS_LOCK: flock or fcntl lock command
> + * - %LANDLOCK_ACTION_FS_FCNTL: fcntl command
> + */
> +#define LANDLOCK_ACTION_FS_EXEC			(1ULL << 0)
> +#define LANDLOCK_ACTION_FS_WRITE		(1ULL << 1)
> +#define LANDLOCK_ACTION_FS_READ			(1ULL << 2)
> +#define LANDLOCK_ACTION_FS_NEW			(1ULL << 3)
> +#define LANDLOCK_ACTION_FS_GET			(1ULL << 4)
> +#define LANDLOCK_ACTION_FS_REMOVE		(1ULL << 5)
> +#define LANDLOCK_ACTION_FS_IOCTL		(1ULL << 6)
> +#define LANDLOCK_ACTION_FS_LOCK			(1ULL << 7)
> +#define LANDLOCK_ACTION_FS_FCNTL		(1ULL << 8)
> +#define _LANDLOCK_ACTION_FS_NB			9
> +#define _LANDLOCK_ACTION_FS_MASK		((1ULL << _LANDLOCK_ACTION_FS_NB) - 1)

and here

> +
> +
> +/**
> + * struct landlock_context - context accessible to a Landlock rule
> + *
> + * @status: bitfield for future use (LANDLOCK_SUBTYPE_STATUS_*)
> + * @event: event type (&enum landlock_subtype_event)
> + * @arg1: event's first optional argument
> + * @arg2: event's second optional argument
> + */
> +struct landlock_context {
> +	__u64 status;
> +	__u64 event;
> +	__u64 arg1;
> +	__u64 arg2;
> +};

looking at all the uapi additions.. probably worth moving them
into separate .h like we did with bpf_perf_event.h
How about include/uapi/linux/landlock.h ?
It can include bpf.h first thing and then the rest of
landlock related bits.
so all, but BPF_PROG_TYPE_LANDLOCK_RULE will go there.

> +++ b/security/landlock/Kconfig
> @@ -0,0 +1,18 @@
> +config SECURITY_LANDLOCK
> +	bool "Landlock sandbox support"
> +	depends on SECURITY
> +	depends on BPF_SYSCALL
> +	depends on SECCOMP_FILTER
> +	default y

all new features need to start with default n

> +static inline const struct bpf_func_proto *bpf_landlock_func_proto(
> +		enum bpf_func_id func_id,
> +		const union bpf_prog_subtype *prog_subtype)
> +{
> +	/* generic functions */
> +	if (prog_subtype->landlock_rule.ability &
> +			LANDLOCK_SUBTYPE_ABILITY_DEBUG) {
> +		switch (func_id) {
> +		case BPF_FUNC_get_current_comm:
> +			return &bpf_get_current_comm_proto;
> +		case BPF_FUNC_get_current_pid_tgid:
> +			return &bpf_get_current_pid_tgid_proto;
> +		case BPF_FUNC_get_current_uid_gid:
> +			return &bpf_get_current_uid_gid_proto;

why current_*() helpers are 'debug' only?

^ permalink raw reply

* Re: UDP sockets oddities
From: Florian Fainelli @ 2017-08-24  2:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <1503535426.2499.66.camel@edumazet-glaptop3.roam.corp.google.com>

On 08/23/2017 05:43 PM, Eric Dumazet wrote:
> On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote:
> 
>> Attached is the perf report --stdio of:
>>
>> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60
>> WARNING: option -b implies udp testing
>> ------------------------------------------------------------
>> Client connecting to 192.168.1.23, UDP port 5001
>> Sending 1470 byte datagrams
>> UDP buffer size:  160 KByte (default)
>> ------------------------------------------------------------
>> [  4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001
>> [ ID] Interval       Transfer     Bandwidth
>> [  4]  0.0-60.0 sec   685 MBytes  95.8 Mbits/sec
>> [  4] Sent 488633 datagrams
>> [  4] Server Report:
>> [  4]  0.0-74.4 sec   685 MBytes  77.2 Mbits/sec   0.187 ms  300/488632
>> (0.061%)
>> [  4]  0.0-74.4 sec  58 datagrams received out-of-order
>>
>> It measured 488644 events which is greater than the number of packets
>> transmitted by iperf but I only count 8 non-IP packets being sent
>> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and
>> why there are not accounted for.
>>
>> Almost all events came from net_tx_action() except 2 that came from
>> net/core/neighbour.c::neigh_probe() and 65 that came from
>> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!?
> 
> Oh you have too much noise and need this fix :
> 
> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
> index dc3052751bc1..fcfa8d991850 100644
> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
> @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
>  
>  static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb)
>  {
> -	dev_kfree_skb_any(cb->skb);
> +	dev_consume_skb_any(cb->skb);
>  	cb->skb = NULL;
>  	dma_unmap_addr_set(cb, dma_addr, 0);
>  }
> 

Yay, now I am getting some sensible information, thanks!

iperf reports 143 packets lost, and perf report gets me 146 entries for
kfree_skb() looking like that:
#
     2.74%     2.74%  skbaddr=0xee5f30c0 protocol=2048 location=0xc0855cdc
            |
            ---0x2d
               write
               0xe9fc2368
               kfree_skb


What is annoying is that 0xc0855cdc is resolved to
arch/arm/include/asm/irqflags.h::arch_local_irq_save() which does not
really help telling me where exactly in the stack the drop is happening,
although now I know it is somewhere down the path from write(2)... of
course it is :)

Now what is disturbing too is that iperf does  have its write() system
call get an error returned, write happily returned the number of bytes
written even though the perf trace indicates there was packet drops down
the road..

I will keep investigating, thanks a lot Eric!
-- 
Florian

^ permalink raw reply

* Re: [Patch net-next] net_sched: kill u32_node pointer in Qdisc
From: Cong Wang @ 2017-08-24  2:02 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <20170823213839.GB22713@nanopsycho>

On Wed, Aug 23, 2017 at 2:38 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> I would not have to spend any time on it, if you would just follow the
> usual workflow. Clearly, you have some problem with that. I cannot
> say I understand it :/

You are amazing, you still waste your time even after I said
"I can fix it"...

You need look at current code base to see how much is
over 80-cols, not everyone agrees with you or checkpatch.pl.

Feel free to waste your time, but don't waste others', please.

^ permalink raw reply

* Re: [PATCH net] sctp: Avoid out-of-bounds reads from address storage
From: Xin Long @ 2017-08-24  2:02 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David S . Miller, network dev, LKML, stable, Vlad Yasevich,
	Neil Horman, linux-sctp
In-Reply-To: <7763d91bcf14744e49f09fc4bec0fb22c097774f.1502384055.git.sbrivio@redhat.com>

On Wed, Aug 23, 2017 at 11:27 PM, Stefano Brivio <sbrivio@redhat.com> wrote:
> inet_diag_msg_sctp{,l}addr_fill() and sctp_get_sctp_info() copy
> sizeof(sockaddr_storage) bytes to fill in sockaddr structs used
> to export diagnostic information to userspace.
>
> However, the memory allocated to store sockaddr information is
> smaller than that and depends on the address family, so we leak
> up to 100 uninitialized bytes to userspace. Just use the size of
> the source structs instead, in all the three cases this is what
> userspace expects. Zero out the remaining memory.
>
> Unused bytes (i.e. when IPv4 addresses are used) in source
> structs sctp_sockaddr_entry and sctp_transport are already
> cleared by sctp_add_bind_addr() and sctp_transport_new(),
> respectively.
>
> Noticed while testing KASAN-enabled kernel with 'ss':
>
> [ 2326.885243] BUG: KASAN: slab-out-of-bounds in inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] at addr ffff881be8779800
> [ 2326.896800] Read of size 128 by task ss/9527
> [ 2326.901564] CPU: 0 PID: 9527 Comm: ss Not tainted 4.11.0-22.el7a.x86_64 #1
> [ 2326.909236] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
> [ 2326.917585] Call Trace:
> [ 2326.920312]  dump_stack+0x63/0x8d
> [ 2326.924014]  kasan_object_err+0x21/0x70
> [ 2326.928295]  kasan_report+0x288/0x540
> [ 2326.932380]  ? inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag]
> [ 2326.938500]  ? skb_put+0x8b/0xd0
> [ 2326.942098]  ? memset+0x31/0x40
> [ 2326.945599]  check_memory_region+0x13c/0x1a0
> [ 2326.950362]  memcpy+0x23/0x50
> [ 2326.953669]  inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag]
> [ 2326.959596]  ? inet_diag_msg_sctpasoc_fill+0x460/0x460 [sctp_diag]
> [ 2326.966495]  ? __lock_sock+0x102/0x150
> [ 2326.970671]  ? sock_def_wakeup+0x60/0x60
> [ 2326.975048]  ? remove_wait_queue+0xc0/0xc0
> [ 2326.979619]  sctp_diag_dump+0x44a/0x760 [sctp_diag]
> [ 2326.985063]  ? sctp_ep_dump+0x280/0x280 [sctp_diag]
> [ 2326.990504]  ? memset+0x31/0x40
> [ 2326.994007]  ? mutex_lock+0x12/0x40
> [ 2326.997900]  __inet_diag_dump+0x57/0xb0 [inet_diag]
> [ 2327.003340]  ? __sys_sendmsg+0x150/0x150
> [ 2327.007715]  inet_diag_dump+0x4d/0x80 [inet_diag]
> [ 2327.012979]  netlink_dump+0x1e6/0x490
> [ 2327.017064]  __netlink_dump_start+0x28e/0x2c0
> [ 2327.021924]  inet_diag_handler_cmd+0x189/0x1a0 [inet_diag]
> [ 2327.028045]  ? inet_diag_rcv_msg_compat+0x1b0/0x1b0 [inet_diag]
> [ 2327.034651]  ? inet_diag_dump_compat+0x190/0x190 [inet_diag]
> [ 2327.040965]  ? __netlink_lookup+0x1b9/0x260
> [ 2327.045631]  sock_diag_rcv_msg+0x18b/0x1e0
> [ 2327.050199]  netlink_rcv_skb+0x14b/0x180
> [ 2327.054574]  ? sock_diag_bind+0x60/0x60
> [ 2327.058850]  sock_diag_rcv+0x28/0x40
> [ 2327.062837]  netlink_unicast+0x2e7/0x3b0
> [ 2327.067212]  ? netlink_attachskb+0x330/0x330
> [ 2327.071975]  ? kasan_check_write+0x14/0x20
> [ 2327.076544]  netlink_sendmsg+0x5be/0x730
> [ 2327.080918]  ? netlink_unicast+0x3b0/0x3b0
> [ 2327.085486]  ? kasan_check_write+0x14/0x20
> [ 2327.090057]  ? selinux_socket_sendmsg+0x24/0x30
> [ 2327.095109]  ? netlink_unicast+0x3b0/0x3b0
> [ 2327.099678]  sock_sendmsg+0x74/0x80
> [ 2327.103567]  ___sys_sendmsg+0x520/0x530
> [ 2327.107844]  ? __get_locked_pte+0x178/0x200
> [ 2327.112510]  ? copy_msghdr_from_user+0x270/0x270
> [ 2327.117660]  ? vm_insert_page+0x360/0x360
> [ 2327.122133]  ? vm_insert_pfn_prot+0xb4/0x150
> [ 2327.126895]  ? vm_insert_pfn+0x32/0x40
> [ 2327.131077]  ? vvar_fault+0x71/0xd0
> [ 2327.134968]  ? special_mapping_fault+0x69/0x110
> [ 2327.140022]  ? __do_fault+0x42/0x120
> [ 2327.144008]  ? __handle_mm_fault+0x1062/0x17a0
> [ 2327.148965]  ? __fget_light+0xa7/0xc0
> [ 2327.153049]  __sys_sendmsg+0xcb/0x150
> [ 2327.157133]  ? __sys_sendmsg+0xcb/0x150
> [ 2327.161409]  ? SyS_shutdown+0x140/0x140
> [ 2327.165688]  ? exit_to_usermode_loop+0xd0/0xd0
> [ 2327.170646]  ? __do_page_fault+0x55d/0x620
> [ 2327.175216]  ? __sys_sendmsg+0x150/0x150
> [ 2327.179591]  SyS_sendmsg+0x12/0x20
> [ 2327.183384]  do_syscall_64+0xe3/0x230
> [ 2327.187471]  entry_SYSCALL64_slow_path+0x25/0x25
> [ 2327.192622] RIP: 0033:0x7f41d18fa3b0
> [ 2327.196608] RSP: 002b:00007ffc3b731218 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> [ 2327.205055] RAX: ffffffffffffffda RBX: 00007ffc3b731380 RCX: 00007f41d18fa3b0
> [ 2327.213017] RDX: 0000000000000000 RSI: 00007ffc3b731340 RDI: 0000000000000003
> [ 2327.220978] RBP: 0000000000000002 R08: 0000000000000004 R09: 0000000000000040
> [ 2327.228939] R10: 00007ffc3b730f30 R11: 0000000000000246 R12: 0000000000000003
> [ 2327.236901] R13: 00007ffc3b731340 R14: 00007ffc3b7313d0 R15: 0000000000000084
> [ 2327.244865] Object at ffff881be87797e0, in cache kmalloc-64 size: 64
> [ 2327.251953] Allocated:
> [ 2327.254581] PID = 9484
> [ 2327.257215]  save_stack_trace+0x1b/0x20
> [ 2327.261485]  save_stack+0x46/0xd0
> [ 2327.265179]  kasan_kmalloc+0xad/0xe0
> [ 2327.269165]  kmem_cache_alloc_trace+0xe6/0x1d0
> [ 2327.274138]  sctp_add_bind_addr+0x58/0x180 [sctp]
> [ 2327.279400]  sctp_do_bind+0x208/0x310 [sctp]
> [ 2327.284176]  sctp_bind+0x61/0xa0 [sctp]
> [ 2327.288455]  inet_bind+0x5f/0x3a0
> [ 2327.292151]  SYSC_bind+0x1a4/0x1e0
> [ 2327.295944]  SyS_bind+0xe/0x10
> [ 2327.299349]  do_syscall_64+0xe3/0x230
> [ 2327.303433]  return_from_SYSCALL_64+0x0/0x6a
> [ 2327.308194] Freed:
> [ 2327.310434] PID = 4131
> [ 2327.313065]  save_stack_trace+0x1b/0x20
> [ 2327.317344]  save_stack+0x46/0xd0
> [ 2327.321040]  kasan_slab_free+0x73/0xc0
> [ 2327.325220]  kfree+0x96/0x1a0
> [ 2327.328530]  dynamic_kobj_release+0x15/0x40
> [ 2327.333195]  kobject_release+0x99/0x1e0
> [ 2327.337472]  kobject_put+0x38/0x70
> [ 2327.341266]  free_notes_attrs+0x66/0x80
> [ 2327.345545]  mod_sysfs_teardown+0x1a5/0x270
> [ 2327.350211]  free_module+0x20/0x2a0
> [ 2327.354099]  SyS_delete_module+0x2cb/0x2f0
> [ 2327.358667]  do_syscall_64+0xe3/0x230
> [ 2327.362750]  return_from_SYSCALL_64+0x0/0x6a
> [ 2327.367510] Memory state around the buggy address:
> [ 2327.372855]  ffff881be8779700: fc fc fc fc 00 00 00 00 00 00 00 00 fc fc fc fc
> [ 2327.380914]  ffff881be8779780: fb fb fb fb fb fb fb fb fc fc fc fc 00 00 00 00
> [ 2327.388972] >ffff881be8779800: 00 00 00 00 fc fc fc fc fb fb fb fb fb fb fb fb
> [ 2327.397031]                                ^
> [ 2327.401792]  ffff881be8779880: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
> [ 2327.409850]  ffff881be8779900: 00 00 00 00 00 04 fc fc fc fc fc fc 00 00 00 00
> [ 2327.417907] ==================================================================
>
> This fixes CVE-2017-7558.
>
> References: https://bugzilla.redhat.com/show_bug.cgi?id=1480266
> Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file")
> Cc: <stable@vger.kernel.org> # 4.7+
> Cc: Xin Long <lucien.xin@gmail.com>
> Cc: Vlad Yasevich <vyasevich@gmail.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: Xin Long <lucien.xin@gmail.com>

> ---
>  net/sctp/sctp_diag.c | 7 +++++--
>  net/sctp/socket.c    | 3 +--
>  2 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/net/sctp/sctp_diag.c b/net/sctp/sctp_diag.c
> index 9a647214a91e..e99518e79b52 100644
> --- a/net/sctp/sctp_diag.c
> +++ b/net/sctp/sctp_diag.c
> @@ -70,7 +70,8 @@ static int inet_diag_msg_sctpladdrs_fill(struct sk_buff *skb,
>
>         info = nla_data(attr);
>         list_for_each_entry_rcu(laddr, address_list, list) {
> -               memcpy(info, &laddr->a, addrlen);
> +               memcpy(info, &laddr->a, sizeof(laddr->a));
> +               memset(info + sizeof(laddr->a), 0, addrlen - sizeof(laddr->a));
>                 info += addrlen;
>         }
>
> @@ -93,7 +94,9 @@ static int inet_diag_msg_sctpaddrs_fill(struct sk_buff *skb,
>         info = nla_data(attr);
>         list_for_each_entry(from, &asoc->peer.transport_addr_list,
>                             transports) {
> -               memcpy(info, &from->ipaddr, addrlen);
> +               memcpy(info, &from->ipaddr, sizeof(from->ipaddr));
> +               memset(info + sizeof(from->ipaddr), 0,
> +                      addrlen - sizeof(from->ipaddr));
>                 info += addrlen;
>         }
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 1db478e34520..8d760863bc41 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4538,8 +4538,7 @@ int sctp_get_sctp_info(struct sock *sk, struct sctp_association *asoc,
>         info->sctpi_ictrlchunks = asoc->stats.ictrlchunks;
>
>         prim = asoc->peer.primary_path;
> -       memcpy(&info->sctpi_p_address, &prim->ipaddr,
> -              sizeof(struct sockaddr_storage));
> +       memcpy(&info->sctpi_p_address, &prim->ipaddr, sizeof(prim->ipaddr));
>         info->sctpi_p_state = prim->state;
>         info->sctpi_p_cwnd = prim->cwnd;
>         info->sctpi_p_srtt = prim->srtt;
> --
> 2.9.4
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox