* Re: [PATCH v2 2/2] i40e: add support for XDP_REDIRECT
From: Alexander Duyck @ 2018-03-22 14:52 UTC (permalink / raw)
To: Björn Töpel
Cc: Jesper Dangaard Brouer, Jeff Kirsher, intel-wired-lan,
Björn Töpel, Karlsson, Magnus, Netdev,
Duyck, Alexander H
In-Reply-To: <CAJ+HfNiGWgJEE9K1rh3OrgixFaTJE4mDCBzYVGQJv3R8_rpLTQ@mail.gmail.com>
On Thu, Mar 22, 2018 at 5:20 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> 2018-03-22 12:58 GMT+01:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>>
>> On Thu, 22 Mar 2018 10:03:07 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:
>>
>>> +/**
>>> + * i40e_xdp_xmit - Implements ndo_xdp_xmit
>>> + * @dev: netdev
>>> + * @xdp: XDP buffer
>>> + *
>>> + * Returns Zero if sent, else an error code
>>> + **/
>>> +int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
>>> +{
>>
>> The return code is used by the XDP redirect tracepoint... this is the
>> only way we have to debug/troubleshoot runtime issues with XDP. Thus,
>> these need to be consistent across drivers and distinguishable.
>>
>
> Thanks for pointing this out! I'll address all your comments and do a
> respin (but I'll wait for Alex' comments, if any).
>
>
> Björn
>
The patch mostly looks okay to me. Maybe a bit of reverse xmas tree
formatting needs to be addressed for the variable declarations in your
two new functions but that is about it in terms of what I see.
- Alex
* Re: [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
From: Jiri Pirko @ 2018-03-22 14:58 UTC (permalink / raw)
To: Roopa Prabhu
Cc: netdev, David Miller, Ido Schimmel, Jakub Kicinski, mlxsw,
Andrew Lunn, Vivien Didelot, Florian Fainelli, Michael Chan,
ganeshgr, Saeed Mahameed, Simon Horman, pieter.jansenvanvuuren,
John Hurley, Dirk van der Merwe, Alexander Duyck, Or Gerlitz,
David Ahern, vijaya.guvva, satananda.burla, raghu.vatsavayi,
felix.manlunas, Andy Gospodarek, sathya.perla, vasundhara-v.volam,
tariqt, eranbe, Jeff Kirsher
In-Reply-To: <CAJieiUhiOmxjuGZ=P48j7eBz57oapS53=dvE7ZxvioxKENxbcA@mail.gmail.com>
Thu, Mar 22, 2018 at 03:40:02PM CET, roopa@cumulusnetworks.com wrote:
>On Thu, Mar 22, 2018 at 3:55 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> This patchset resolves 2 issues we have right now:
>> 1) There are many netdevices / ports in the system, for port, pf, vf
>> representation but the user has no way to see which is which
>> 2) The ndo_get_phys_port_name is implemented in each driver separately,
>> which may lead to inconsistent names between drivers.
>>
>> This patchset introduces port flavours which should address the first
>> problem. I'm testing this with Netronome nfp hardware. When the user
>> has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:
>> # devlink port
>> pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
>> pci/0000:05:00.0/268435456: type eth netdev eth0 flavour physical number 0
>> pci/0000:05:00.0/268435460: type eth netdev enp5s0np1 flavour physical number 1
>> pci/0000:05:00.0/536875008: type eth netdev eth2 flavour pf_rep number 536875008
>> pci/0000:05:00.0/536870912: type eth netdev eth1 flavour vf_rep number 0
>> pci/0000:05:00.0/536870976: type eth netdev eth3 flavour vf_rep number 1
>> pci/0000:05:00.0/536871040: type eth netdev eth4 flavour vf_rep number 2
>> pci/0000:05:00.0/536871104: type eth netdev eth5 flavour vf_rep number 3
>>
>> The indexes are weird numbers now. That needs to be fixed. Also, netdev
>> renaming does not work correctly for me now for some reason.
>> Also, there is one extra port whose purpose I don't understand,
>> something nfp specific perhaps.
>>
>> The desired output should look like this:
>> # devlink port
>> pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
>> pci/0000:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
>> pci/0000:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
>> pci/0000:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
>> pci/0000:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
>> pci/0000:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
>> pci/0000:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3
>>
>> As you can see, the netdev names are generated according to the flavour
>> and port number. In case the port is split, the split subnumber is also
>> included.
>>
>> I tested this for mlxsw and nfp. I have no way to test this on DSA hw,
>> I would really appreciate DSA guys to test this. Thanks!
>>
>
>nice series, I like that the user can query a port's flavor (I get this
>ask all the time).
Yeah, it is really needed. I would like to fix this jungle so all
drivers behave the same. Started with nfp as they are leading with what
they have implemented. But I expect others to join in (please).
Many drivers just create devlink instance without any ports. Odd. I will
write some Documentation file as a part of this patchset. Also, I'm
thinking about adding some warnings in case a driver does some crippled
implementation.
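
For readers following along, here is a rough userspace sketch of how a
common phys_port_name generator could derive the name suffix from flavour
and port number, matching the "desired output" quoted above (p0, pf0, vf0).
The enum, the function name and the "s" separator for split ports are
assumptions made for this sketch and are not taken from the patchset.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flavour enum for this sketch; not the kernel's definition. */
enum sketch_port_flavour {
	SKETCH_FLAVOUR_PHYSICAL,
	SKETCH_FLAVOUR_PF_REP,
	SKETCH_FLAVOUR_VF_REP,
};

/*
 * Build a phys_port_name-style string from flavour and number.
 * Returns the snprintf() result, or -1 for an unknown flavour.
 */
static int sketch_phys_port_name(char *buf, size_t len,
				 enum sketch_port_flavour flavour,
				 unsigned int number,
				 int split, unsigned int subport)
{
	assert(buf && len > 0);

	switch (flavour) {
	case SKETCH_FLAVOUR_PHYSICAL:
		/* Assumed format: split ports append "s<subport>". */
		if (split)
			return snprintf(buf, len, "p%us%u", number, subport);
		return snprintf(buf, len, "p%u", number);
	case SKETCH_FLAVOUR_PF_REP:
		return snprintf(buf, len, "pf%u", number);
	case SKETCH_FLAVOUR_VF_REP:
		return snprintf(buf, len, "vf%u", number);
	}
	return -1;
}
```

With a udev-style prefix such as "enp5s0n" in front, the strings produced
here would give names of the form enp5s0np0, enp5s0npf0 and enp5s0nvf3.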
* Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
From: Kees Cook @ 2018-03-22 15:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Florian Weimer, Andrew Morton, Josh Poimboeuf,
Rasmus Villemoes, Randy Dunlap, Miguel Ojeda, Ingo Molnar,
David Laight, Ian Abbott, linux-input, linux-btrfs,
Network Development, Linux Kernel Mailing List, Kernel Hardening
In-Reply-To: <CA+55aFwxk=tUECYQkd4cog08qW4ZT=r2K7FQXzGnc-zuMc7JQA@mail.gmail.com>
On Tue, Mar 20, 2018 at 4:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Mar 17, 2018 at 1:07 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> No luck! :( gcc 4.4 refuses to play along. And, hilariously, not only
>> does it not change the complaint about __builtin_choose_expr(), it
>> also thinks that's a VLA now.
>
> Hmm. So thanks to the diseased mind of Martin Uecker, there's a better
> test for "__is_constant()":
>
> /* Glory to Martin Uecker <Martin.Uecker@med.uni-goettingen.de> */
> #define __is_constant(a) \
> (sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))
>
> that is actually *specified* by the C standard to work, and doesn't
> even depend on any gcc extensions.
I feel we risk awakening Cthulhu with this. :)
> The reason is some really subtle pointer conversion rules, where the
> type of the ternary operator will depend on whether one of the
> pointers is NULL or not.
>
> And the definition of NULL, in turn, very much depends on "integer
> constant expression that has the value 0".
>
> Are you willing to do one final try on a generic min/max? Same as my
> last patch, but using the above __is_constant() test instead of
> __builtin_constant_p?
So, this time it's not a catastrophic failure with gcc 4.4. Instead it
fails in 11 distinct places:
$ grep "first argument to ‘__builtin_choose_expr’ not a constant" log
| cut -d: -f1-2
crypto/ablkcipher.c:71
crypto/blkcipher.c:70
crypto/skcipher.c:95
mm/percpu.c:2453
net/ceph/osdmap.c:1545
net/ceph/osdmap.c:1756
net/ceph/osdmap.c:1763
mm/kmemleak.c:1371
mm/kmemleak.c:1403
drivers/infiniband/hw/hfi1/pio_copy.c:421
drivers/infiniband/hw/hfi1/pio_copy.c:547
Seems like it doesn't like void * arguments:
mm/percpu.c:
void *ptr;
...
base = min(ptr, base);
mm/kmemleak.c:
static void scan_large_block(void *start, void *end)
...
next = min(start + MAX_SCAN_SIZE, end);
I'll poke a bit more...
-Kees
--
Kees Cook
Pixel Security
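
A small userspace sketch of the quoted test may help when experimenting
with it outside the kernel tree. The macro body is the one quoted above;
the sketch_ names are made up here. When the argument is an integer
constant expression, the (void *) cast yields a null pointer constant and
the conditional takes the type int *; otherwise it takes the type void *.
Note that the comparison then evaluates sizeof(void), which GCC and Clang
accept as 1, so this sketch assumes one of those compilers.

```c
#include <assert.h>

/* Same expression as quoted above, renamed to avoid the "__" prefix. */
#define sketch_is_constant(a) \
	(sizeof(int) == sizeof(*(1 ? ((void *)((a) * 0l)) : (int *)1)))

/* 42 is an integer constant expression, so this yields 1. */
static int sketch_literal_is_constant(void)
{
	return sketch_is_constant(42);
}

/* A function parameter is not a constant expression, so this yields 0. */
static int sketch_parameter_is_constant(int n)
{
	return sketch_is_constant(n);
}

/* Convenience check that exercises both cases. */
static void sketch_self_check(int n)
{
	assert(sketch_literal_is_constant() == 1);
	assert(sketch_parameter_is_constant(n) == 0);
}
```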
* RE: [PATCH v5 0/2] Remove false-positive VLAs when using max()
From: David Laight @ 2018-03-22 15:13 UTC (permalink / raw)
To: 'Kees Cook', Linus Torvalds
Cc: Al Viro, Florian Weimer, Andrew Morton, Josh Poimboeuf,
Rasmus Villemoes, Randy Dunlap, Miguel Ojeda, Ingo Molnar,
Ian Abbott, linux-input, linux-btrfs, Network Development,
Linux Kernel Mailing List, Kernel Hardening
In-Reply-To: <CAGXu5j+5i+56R0KDLMDA=+_DRW5w9aUGCEo0dq6PZvHPBWkM1g@mail.gmail.com>
From: Kees Cook
> Sent: 22 March 2018 15:01
...
> > /* Glory to Martin Uecker <Martin.Uecker@med.uni-goettingen.de> */
> > #define __is_constant(a) \
> > (sizeof(int) == sizeof(*(1 ? ((void*)((a) * 0l)) : (int*)1)))
...
> So, this time it's not a catastrophic failure with gcc 4.4. Instead it
> fails in 11 distinct places:
...
> Seems like it doesn't like void * arguments:
>
> mm/percpu.c:
> void *ptr;
> ...
> base = min(ptr, base);
Try adding (unsigned long) before the (a).
David
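
A sketch of the suggested variant, for anyone who wants to try it in
userspace. Multiplying a pointer is not valid C, so (a) * 0l cannot be
compiled for void * arguments; converting to unsigned long first makes the
expression well-formed for pointers while keeping integer constants
constant. The sketch_ names are invented for this example, and it assumes
GCC or Clang (for sizeof(void)) and a target where unsigned long can hold
a pointer.

```c
#include <assert.h>

/* Quoted macro with the (unsigned long) cast added before (a). */
#define sketch_is_constant_ul(a) \
	(sizeof(int) == \
	 sizeof(*(1 ? ((void *)((unsigned long)(a) * 0l)) : (int *)1)))

/* A run-time pointer value: compiles with the cast and yields 0. */
static int sketch_pointer_is_constant(void *ptr)
{
	return sketch_is_constant_ul(ptr);
}

/* sizeof(long) is an integer constant expression: yields 1. */
static int sketch_sizeof_is_constant(void)
{
	return sketch_is_constant_ul(sizeof(long));
}

/* Convenience check that exercises both cases. */
static void sketch_self_check_ul(void *ptr)
{
	assert(sketch_pointer_is_constant(ptr) == 0);
	assert(sketch_sizeof_is_constant() == 1);
}
```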
* [PATCH v3 1/2] i40e: tweak page counting for XDP_REDIRECT
From: Björn Töpel @ 2018-03-22 15:14 UTC (permalink / raw)
To: jeffrey.t.kirsher, intel-wired-lan
Cc: Björn Töpel, magnus.karlsson, netdev, alexander.h.duyck,
alexander.duyck
From: Björn Töpel <bjorn.topel@intel.com>
This commit tweaks the page counting for XDP_REDIRECT to function
properly. XDP_REDIRECT support will be added in a future commit.
The current page counting scheme assumes that the reference count
cannot decrease until the received frame is sent to the upper layers
of the networking stack. This assumption does not hold for the
XDP_REDIRECT action, since a page (pointed out by xdp_buff) can have
its reference count decreased via the xdp_do_redirect call.
To work around that, we now start off by a large page count and then
don't allow a refcount less than two.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e8eef9a56b6b..2f817d1466eb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1588,9 +1588,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
bi->dma = dma;
bi->page = page;
bi->page_offset = i40e_rx_offset(rx_ring);
-
- /* initialize pagecnt_bias to 1 representing we fully own page */
- bi->pagecnt_bias = 1;
+ page_ref_add(page, USHRT_MAX - 1);
+ bi->pagecnt_bias = USHRT_MAX;
return true;
}
@@ -1956,8 +1955,8 @@ static bool i40e_can_reuse_rx_page(struct i40e_rx_buffer *rx_buffer)
* the pagecnt_bias and page count so that we fully restock the
* number of references the driver holds.
*/
- if (unlikely(!pagecnt_bias)) {
- page_ref_add(page, USHRT_MAX);
+ if (unlikely(pagecnt_bias == 1)) {
+ page_ref_add(page, USHRT_MAX - 1);
rx_buffer->pagecnt_bias = USHRT_MAX;
}
--
2.7.4
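
The counting scheme in this patch can be modelled in userspace roughly as
follows. This is only a sketch: the model_ types and helpers are invented
here, the page is reduced to a bare counter, and the "> 1" threshold in
model_can_reuse() is an assumption about the driver's reuse check. It
shows that the driver takes USHRT_MAX references up front, gives them out
by decrementing pagecnt_bias, and restocks when the bias reaches one, all
without changing the number of references held by others.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Userspace stand-ins; only the counters matter for this model. */
struct model_page {
	unsigned int refcount;
};

struct model_rx_buffer {
	struct model_page *page;
	unsigned short pagecnt_bias;
};

/* Mirrors the allocation hunk: take USHRT_MAX references up front. */
static void model_alloc(struct model_rx_buffer *buf, struct model_page *page)
{
	page->refcount = 1;              /* as returned by the allocator */
	page->refcount += USHRT_MAX - 1; /* page_ref_add(page, USHRT_MAX - 1) */
	buf->page = page;
	buf->pagecnt_bias = USHRT_MAX;
}

/* References currently held by someone other than the driver. */
static unsigned int model_outstanding(const struct model_rx_buffer *buf)
{
	return buf->page->refcount - buf->pagecnt_bias;
}

/* Driver hands one of its pre-taken references to a consumer. */
static void model_give(struct model_rx_buffer *buf)
{
	assert(buf->pagecnt_bias > 1);
	buf->pagecnt_bias--;
}

/* Consumer (stack or redirect target) drops its reference. */
static void model_put(struct model_page *page)
{
	assert(page->refcount > 1);
	page->refcount--;
}

/* Mirrors the reuse hunk: restock when one driver reference is left. */
static bool model_can_reuse(struct model_rx_buffer *buf)
{
	if (model_outstanding(buf) > 1)
		return false;

	if (buf->pagecnt_bias == 1) {
		buf->page->refcount += USHRT_MAX - 1;
		buf->pagecnt_bias = USHRT_MAX;
	}
	return true;
}
```

In this model the consumer may drop its reference at any time after
model_give(), which is the situation XDP_REDIRECT introduces, and the
page count still never reaches zero while the driver owns the buffer.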
* [PATCH v3 2/2] i40e: add support for XDP_REDIRECT
From: Björn Töpel @ 2018-03-22 15:14 UTC (permalink / raw)
To: jeffrey.t.kirsher, intel-wired-lan
Cc: Björn Töpel, magnus.karlsson, netdev, alexander.h.duyck,
alexander.duyck
In-Reply-To: <20180322151434.24338-1-bjorn.topel@gmail.com>
From: Björn Töpel <bjorn.topel@intel.com>
The driver now acts upon the XDP_REDIRECT return action. Two new ndos
are implemented, ndo_xdp_xmit and ndo_xdp_flush.
XDP_REDIRECT action enables XDP program to redirect frames to other
netdevs.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 74 +++++++++++++++++++++++++----
drivers/net/ethernet/intel/i40e/i40e_txrx.h | 2 +
3 files changed, 68 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 79ab52276d12..2fb4261b4fd9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11815,6 +11815,8 @@ static const struct net_device_ops i40e_netdev_ops = {
.ndo_bridge_getlink = i40e_ndo_bridge_getlink,
.ndo_bridge_setlink = i40e_ndo_bridge_setlink,
.ndo_bpf = i40e_xdp,
+ .ndo_xdp_xmit = i40e_xdp_xmit,
+ .ndo_xdp_flush = i40e_xdp_flush,
};
/**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 2f817d1466eb..53dc74d4d1db 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2214,7 +2214,7 @@ static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
struct xdp_buff *xdp)
{
- int result = I40E_XDP_PASS;
+ int err, result = I40E_XDP_PASS;
struct i40e_ring *xdp_ring;
struct bpf_prog *xdp_prog;
u32 act;
@@ -2233,6 +2233,10 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
result = i40e_xmit_xdp_ring(xdp, xdp_ring);
break;
+ case XDP_REDIRECT:
+ err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
+ result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
+ break;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -2268,6 +2272,15 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
#endif
}
+static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+{
+ /* Force memory writes to complete before letting h/w
+ * know there are new descriptors to fetch.
+ */
+ wmb();
+ writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+}
+
/**
* i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
* @rx_ring: rx descriptor ring to transact packets on
@@ -2402,16 +2415,11 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
}
if (xdp_xmit) {
- struct i40e_ring *xdp_ring;
-
- xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
+ struct i40e_ring *xdp_ring =
+ rx_ring->vsi->xdp_rings[rx_ring->queue_index];
- /* Force memory writes to complete before letting h/w
- * know there are new descriptors to fetch.
- */
- wmb();
-
- writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
+ i40e_xdp_ring_update_tail(xdp_ring);
+ xdp_do_flush_map();
}
rx_ring->skb = skb;
@@ -3659,3 +3667,49 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
return i40e_xmit_frame_ring(skb, tx_ring);
}
+
+/**
+ * i40e_xdp_xmit - Implements ndo_xdp_xmit
+ * @dev: netdev
+ * @xdp: XDP buffer
+ *
+ * Returns Zero if sent, else an error code
+ **/
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+{
+ struct i40e_netdev_priv *np = netdev_priv(dev);
+ unsigned int queue_index = smp_processor_id();
+ struct i40e_vsi *vsi = np->vsi;
+ int err;
+
+ if (test_bit(__I40E_VSI_DOWN, vsi->state))
+ return -ENETDOWN;
+
+ if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
+ return -ENXIO;
+
+ err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
+ if (err != I40E_XDP_TX)
+ return -ENOSPC;
+
+ return 0;
+}
+
+/**
+ * i40e_xdp_flush - Implements ndo_xdp_flush
+ * @dev: netdev
+ **/
+void i40e_xdp_flush(struct net_device *dev)
+{
+ struct i40e_netdev_priv *np = netdev_priv(dev);
+ unsigned int queue_index = smp_processor_id();
+ struct i40e_vsi *vsi = np->vsi;
+
+ if (test_bit(__I40E_VSI_DOWN, vsi->state))
+ return;
+
+ if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
+ return;
+
+ i40e_xdp_ring_update_tail(vsi->xdp_rings[queue_index]);
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 2444f338bb0c..a97e59721393 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -510,6 +510,8 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
void i40e_detect_recover_hung(struct i40e_vsi *vsi);
int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
bool __i40e_chk_linearize(struct sk_buff *skb);
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
+void i40e_xdp_flush(struct net_device *dev);
/**
* i40e_get_head - Retrieve head from head writeback
--
2.7.4
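
Since the return codes end up in the XDP redirect tracepoint, it may be
useful to see the mapping used by i40e_xdp_xmit() in isolation. Below is
a userspace sketch of the same decision order. The sketch_ struct and
enum are stand-ins invented for this example (the real function reads
this state from the VSI), and the ring result is passed in as a parameter
instead of being produced by an actual transmit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the driver's I40E_XDP_* results; values are illustrative. */
enum sketch_xdp_result {
	SKETCH_XDP_PASS,
	SKETCH_XDP_CONSUMED,
	SKETCH_XDP_TX,
};

/* Stand-in for the pieces of VSI state that the checks look at. */
struct sketch_vsi {
	bool down;
	bool xdp_enabled;
	unsigned int num_queue_pairs;
};

/* Same order of checks as in the patch: down, no XDP ring, ring full. */
static int sketch_xdp_xmit(const struct sketch_vsi *vsi,
			   unsigned int queue_index,
			   enum sketch_xdp_result ring_result)
{
	assert(vsi);

	if (vsi->down)
		return -ENETDOWN;

	if (!vsi->xdp_enabled || queue_index >= vsi->num_queue_pairs)
		return -ENXIO;

	if (ring_result != SKETCH_XDP_TX)
		return -ENOSPC;

	return 0;
}
```

Each failure cause gets its own errno value, so the three cases can be
told apart from the tracepoint output alone.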
* Re: [PATCH net-next] net: Convert can_pernet_ops
From: David Miller @ 2018-03-22 15:14 UTC (permalink / raw)
To: ktkhai; +Cc: socketcan, mkl, linux-can, netdev
In-Reply-To: <152145954531.26024.3744883546191004693.stgit@localhost.localdomain>
From: Kirill Tkhai <ktkhai@virtuozzo.com>
Date: Mon, 19 Mar 2018 14:39:05 +0300
> These pernet_operations create and destroy /proc entries
> and cancel per-net timer.
>
> Also, there are unneeded iterations over empty list of net
> devices, since all net devices must be already moved
> to init_net or unregistered by default_device_ops. This
> already was mentioned here:
>
> https://marc.info/?l=linux-can&m=150169589119335&w=2
>
> So, it looks safe to make them async.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Applied.
* Re: [PATCH net-next 1/2] net: Convert lowpan_frags_ops
From: David Miller @ 2018-03-22 15:14 UTC (permalink / raw)
To: ktkhai
Cc: alex.aring, stefan, pablo, kadlec, fw, yoshfuji, brouer, keescook,
linux-wpan, netdev, netfilter-devel
In-Reply-To: <152145993742.26348.7771294840455450666.stgit@localhost.localdomain>
From: Kirill Tkhai <ktkhai@virtuozzo.com>
Date: Mon, 19 Mar 2018 14:45:37 +0300
> These pernet_operations register and unregister sysctl.
> Also, there is inet_frags_exit_net() called in exit method,
> which has to be safe after a560002437d3 "net: Fix hlist
> corruptions in inet_evict_bucket()".
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Applied.
* Re: [PATCH net-next 2/2] net: Convert nf_ct_net_ops
From: David Miller @ 2018-03-22 15:14 UTC (permalink / raw)
To: ktkhai
Cc: alex.aring, stefan, pablo, kadlec, fw, yoshfuji, brouer, keescook,
linux-wpan, netdev, netfilter-devel
In-Reply-To: <152145994646.26348.3106975491506950180.stgit@localhost.localdomain>
From: Kirill Tkhai <ktkhai@virtuozzo.com>
Date: Mon, 19 Mar 2018 14:45:46 +0300
> These pernet_operations register and unregister sysctl.
> Also, there is inet_frags_exit_net() called in exit method,
> which has to be safe after a560002437d3 "net: Fix hlist
> corruptions in inet_evict_bucket()".
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Applied.
* Re: [PATCH net-next 1/2 v4] net: add uevent socket member
From: David Miller @ 2018-03-22 15:18 UTC (permalink / raw)
To: christian.brauner
Cc: ebiederm, gregkh, netdev, linux-kernel, serge, avagin, ktkhai
In-Reply-To: <20180319121731.14449-1-christian.brauner@ubuntu.com>
From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Mon, 19 Mar 2018 13:17:30 +0100
> This commit adds struct uevent_sock to struct net. Since struct uevent_sock
> records the position of the uevent socket in the uevent socket list we can
> trivially remove it from the uevent socket list during cleanup. This speeds
> up the old removal codepath.
> Note, list_del() will hit __list_del_entry_valid() in its call chain which
> will validate that the element is a member of the list. If it isn't it will
> take care that the list is not modified.
>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Applied.
* Re: [PATCH net-next 2/2 v4] netns: send uevent messages
From: David Miller @ 2018-03-22 15:18 UTC (permalink / raw)
To: christian.brauner
Cc: ebiederm, gregkh, netdev, linux-kernel, serge, avagin, ktkhai
In-Reply-To: <20180319121731.14449-2-christian.brauner@ubuntu.com>
From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Mon, 19 Mar 2018 13:17:31 +0100
> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
> to allow sending uevent messages into the network namespace the socket
> belongs to.
>
> Currently non-initial network namespaces are already isolated and don't
> receive uevents. There are a number of cases where it is beneficial for a
> sufficiently privileged userspace process to send a uevent into a network
> namespace.
>
> One such use case would be debugging and fuzzing of a piece of software
> which listens and reacts to uevents. By running a copy of that software
> inside a network namespace, specific uevents could then be presented to it.
> More concretely, this would allow for easy testing of udevd/ueventd.
>
> This will also allow some piece of software to run components inside a
> separate network namespace and then effectively filter what that software
> can receive. Some examples of software that do directly listen to uevents
> and that we have in the past attempted to run inside a network namespace
> are rbd (CEPH client) or the X server.
...
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Applied.
* Re: [PATCH net-next] rds: tcp: remove register_netdevice_notifier infrastructure.
From: David Miller @ 2018-03-22 15:21 UTC (permalink / raw)
To: sowmini.varadhan; +Cc: netdev, santosh.shilimkar, ktkhai
In-Reply-To: <1521467568-37876-1-git-send-email-sowmini.varadhan@oracle.com>
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Mon, 19 Mar 2018 06:52:48 -0700
> The netns deletion path does not need to wait for all net_devices
> to be unregistered before dismantling rds_tcp state for the netns
> (we are able to dismantle this state on module unload even when
> all net_devices are active so there is no dependency here).
>
> This patch removes code related to netdevice notifiers and
> refactors all the code needed to dismantle rds_tcp state
> into a ->exit callback for the pernet_operations used with
> register_pernet_device().
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Applied.
* Re: [PATCH 0/5] pull request for net: batman-adv 2018-03-19
From: David Miller @ 2018-03-22 15:26 UTC (permalink / raw)
To: sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
In-Reply-To: <20180319163726.10921-1-sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX@public.gmane.org>
From: Simon Wunderlich <sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX@public.gmane.org>
Date: Mon, 19 Mar 2018 17:37:21 +0100
> here are some more bugfixes for net.
>
> Please pull or let me know of any problem!
Pulled, thanks Simon.
* Re: [PATCH 0/3] pull request for net-next: batman-adv 2018-03-19
From: David Miller @ 2018-03-22 15:29 UTC (permalink / raw)
To: sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX
Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r
In-Reply-To: <20180319164153.11536-1-sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX@public.gmane.org>
From: Simon Wunderlich <sw-2YrNx6rUIHYiY0qSoAWiAoQuADTiUCJX@public.gmane.org>
Date: Mon, 19 Mar 2018 17:41:50 +0100
> here is another late feature/cleanup pull request of batman-adv to go into net-next.
>
> Please pull or let me know of any problem!
Also pulled, thanks Simon.
* [PATCH v2 0/8] page_frag_cache improvements
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
From: Matthew Wilcox <mawilcox@microsoft.com>
Version 1 was completely wrong-headed and I have repented of the error
of my ways. Thanks for educating me.
I still think it's possible to improve on the current state of the
page_frag allocator, and here are eight patches, each of which I think
represents an improvement. They're not all that interlinked, although
there will be textual conflicts, so I'll be happy to revise and drop
any that are not actual improvements.
I have discovered (today), much to my chagrin, that testing using trinity
in KVM doesn't actually test the page_frag allocator. I don't understand
why not. So, this turns out to only be compile tested. Sorry.
The net effect of all these patches is a reduction of four instructions
in the fastpath of the allocator on x86. The page_frag_cache structure
also shrinks, to as small as 8 bytes on 32-bit with CONFIG_BASE_SMALL.
The last patch is probably wrong. It'll definitely be inaccurate
because the call to page_frag_free() may not be the call which frees
a page; there's a really unlikely race where the page cache finds a
stale RCU pointer, bumps its refcount, discovers it's not the page it
was looking for and calls put_page(), which might end up being the last
reference count. We can do something about that inaccuracy, but I don't
even know if this is the best approach to accounting these pages.
Matthew Wilcox (8):
page_frag_cache: Remove pfmemalloc bool
page_frag_cache: Move slowpath code from page_frag_alloc
page_frag_cache: Rename 'nc' to 'pfc'
page_frag_cache: Rename fragsz to size
page_frag_cache: Save memory on small machines
page_frag_cache: Use a mask instead of offset
page_frag: Update documentation
page_frag: Account allocations
Documentation/vm/page_frags | 42 -----------
Documentation/vm/page_frags.rst | 24 +++++++
include/linux/mm_types.h | 20 ++++--
include/linux/mmzone.h | 3 +-
mm/page_alloc.c | 155 ++++++++++++++++++++++++----------------
net/core/skbuff.c | 5 +-
6 files changed, 135 insertions(+), 114 deletions(-)
delete mode 100644 Documentation/vm/page_frags
create mode 100644 Documentation/vm/page_frags.rst
--
2.16.2
* [PATCH v2 1/8] page_frag_cache: Remove pfmemalloc bool
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
Save 4/8 bytes by moving the pfmemalloc indicator from its own bool
to the top bit of pagecnt_bias. This has no effect on the fastpath
of the allocator since the pagecnt_bias cannot go negative. It's
a couple of extra instructions in the slowpath.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
include/linux/mm_types.h | 4 +++-
mm/page_alloc.c | 8 +++++---
net/core/skbuff.c | 5 ++---
3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b9591d..a63b138ad1a4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -218,6 +218,7 @@ struct page {
#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK)
#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE)
+#define PFC_MEMALLOC (1U << 31)
struct page_frag_cache {
void * va;
@@ -231,9 +232,10 @@ struct page_frag_cache {
* containing page->_refcount every time we allocate a fragment.
*/
unsigned int pagecnt_bias;
- bool pfmemalloc;
};
+#define page_frag_cache_pfmemalloc(pfc) ((pfc)->pagecnt_bias & PFC_MEMALLOC)
+
typedef unsigned long vm_flags_t;
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 635d7dd29d7f..61366f23e8c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4395,16 +4395,18 @@ void *page_frag_alloc(struct page_frag_cache *nc,
page_ref_add(page, size - 1);
/* reset page count bias and offset to start of new frag */
- nc->pfmemalloc = page_is_pfmemalloc(page);
nc->pagecnt_bias = size;
+ if (page_is_pfmemalloc(page))
+ nc->pagecnt_bias |= PFC_MEMALLOC;
nc->offset = size;
}
offset = nc->offset - fragsz;
if (unlikely(offset < 0)) {
+ unsigned int pagecnt_bias = nc->pagecnt_bias & ~PFC_MEMALLOC;
page = virt_to_page(nc->va);
- if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
+ if (!page_ref_sub_and_test(page, pagecnt_bias))
goto refill;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
@@ -4415,7 +4417,7 @@ void *page_frag_alloc(struct page_frag_cache *nc,
set_page_count(page, size);
/* reset page count bias and offset to start of new frag */
- nc->pagecnt_bias = size;
+ nc->pagecnt_bias = size | (nc->pagecnt_bias - pagecnt_bias);
offset = size - fragsz;
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0bb0d8877954..54bbde8f7541 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -412,7 +412,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
nc = this_cpu_ptr(&netdev_alloc_cache);
data = page_frag_alloc(nc, len, gfp_mask);
- pfmemalloc = nc->pfmemalloc;
+ pfmemalloc = page_frag_cache_pfmemalloc(nc);
local_irq_restore(flags);
@@ -485,8 +485,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
return NULL;
}
- /* use OR instead of assignment to avoid clearing of bits in mask */
- if (nc->page.pfmemalloc)
+ if (page_frag_cache_pfmemalloc(&nc->page))
skb->pfmemalloc = 1;
skb->head_frag = 1;
--
2.16.2
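
The bit-packing idea in this patch can be tried out in userspace with a
sketch like the one below. The sketch_ names are invented for this
example; only the (1U << 31) flag value comes from the patch. It shows
why the fast path is unaffected: a plain decrement of the whole word
leaves the top bit alone as long as the count in the low bits is
non-zero, and only the slow path has to mask the flag off.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PFC_MEMALLOC (1U << 31)

/* Only the field under discussion; the real structure has more members. */
struct sketch_pfc {
	unsigned int pagecnt_bias;
};

/* Slow path: store the count and fold the flag into the top bit. */
static void sketch_pfc_reset(struct sketch_pfc *pfc, unsigned int size,
			     bool pfmemalloc)
{
	assert(size > 0 && size < SKETCH_PFC_MEMALLOC);
	pfc->pagecnt_bias = size;
	if (pfmemalloc)
		pfc->pagecnt_bias |= SKETCH_PFC_MEMALLOC;
}

/* Read the flag back out of the top bit. */
static bool sketch_pfc_pfmemalloc(const struct sketch_pfc *pfc)
{
	return (pfc->pagecnt_bias & SKETCH_PFC_MEMALLOC) != 0;
}

/* Read the count with the flag masked off. */
static unsigned int sketch_pfc_bias(const struct sketch_pfc *pfc)
{
	return pfc->pagecnt_bias & ~SKETCH_PFC_MEMALLOC;
}

/* Fast path: plain decrement, no masking while the count is non-zero. */
static void sketch_pfc_take(struct sketch_pfc *pfc)
{
	assert(sketch_pfc_bias(pfc) > 0);
	pfc->pagecnt_bias--;
}
```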
* [PATCH v2 2/8] page_frag_cache: Move slowpath code from page_frag_alloc
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
Put all the unlikely code in __page_frag_cache_refill to make the
fastpath code more obvious.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
mm/page_alloc.c | 70 ++++++++++++++++++++++++++++-----------------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 61366f23e8c8..6d2c106f4e5d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4339,20 +4339,50 @@ EXPORT_SYMBOL(free_pages);
static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_t gfp_mask)
{
+ unsigned int size = PAGE_SIZE;
struct page *page = NULL;
+ struct page *old = nc->va ? virt_to_page(nc->va) : NULL;
gfp_t gfp = gfp_mask;
+ unsigned int pagecnt_bias = nc->pagecnt_bias & ~PFC_MEMALLOC;
+
+ /* If all allocations have been freed, we can reuse this page */
+ if (old && page_ref_sub_and_test(old, pagecnt_bias)) {
+ page = old;
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ /* if size can vary use size else just use PAGE_SIZE */
+ size = nc->size;
+#endif
+ /* Page count is 0, we can safely set it */
+ set_page_count(page, size);
+ goto reset;
+ }
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
__GFP_NOMEMALLOC;
page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
PAGE_FRAG_CACHE_MAX_ORDER);
- nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+ if (page)
+ size = PAGE_FRAG_CACHE_MAX_SIZE;
+ nc->size = size;
#endif
if (unlikely(!page))
page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+ if (!page) {
+ nc->va = NULL;
+ return NULL;
+ }
+
+ nc->va = page_address(page);
- nc->va = page ? page_address(page) : NULL;
+ /* Using atomic_set() would break get_page_unless_zero() users. */
+ page_ref_add(page, size - 1);
+reset:
+ /* reset page count bias and offset to start of new frag */
+ nc->pagecnt_bias = size;
+ if (page_is_pfmemalloc(page))
+ nc->pagecnt_bias |= PFC_MEMALLOC;
+ nc->offset = size;
return page;
}
@@ -4375,7 +4405,6 @@ EXPORT_SYMBOL(__page_frag_cache_drain);
void *page_frag_alloc(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
{
- unsigned int size = PAGE_SIZE;
struct page *page;
int offset;
@@ -4384,42 +4413,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
page = __page_frag_cache_refill(nc, gfp_mask);
if (!page)
return NULL;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
- /* Even if we own the page, we do not use atomic_set().
- * This would break get_page_unless_zero() users.
- */
- page_ref_add(page, size - 1);
-
- /* reset page count bias and offset to start of new frag */
- nc->pagecnt_bias = size;
- if (page_is_pfmemalloc(page))
- nc->pagecnt_bias |= PFC_MEMALLOC;
- nc->offset = size;
}
offset = nc->offset - fragsz;
- if (unlikely(offset < 0)) {
- unsigned int pagecnt_bias = nc->pagecnt_bias & ~PFC_MEMALLOC;
- page = virt_to_page(nc->va);
-
- if (!page_ref_sub_and_test(page, pagecnt_bias))
- goto refill;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
- /* OK, page count is 0, we can safely set it */
- set_page_count(page, size);
-
- /* reset page count bias and offset to start of new frag */
- nc->pagecnt_bias = size | (nc->pagecnt_bias - pagecnt_bias);
- offset = size - fragsz;
- }
+ if (unlikely(offset < 0))
+ goto refill;
nc->pagecnt_bias--;
nc->offset = offset;
--
2.16.2
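
To make the resulting fast path easier to picture, here is a userspace
sketch of just the offset arithmetic. The model_ names and the fixed
4096-byte buffer are assumptions for this example, and the refill simply
resets the offset on the same buffer. It therefore models only the
"all fragments freed, reuse the page" case and none of the reference
counting or page allocation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SIZE 4096

struct model_frag_cache {
	char buf[MODEL_SIZE];  /* stands in for the backing page */
	char *va;              /* NULL until the first refill */
	int offset;            /* counts down from MODEL_SIZE */
	unsigned int refills;  /* how often the slow path ran */
};

static void model_frag_cache_init(struct model_frag_cache *c)
{
	c->va = NULL;
	c->offset = 0;
	c->refills = 0;
}

/* Slow path: everything unlikely lives here. */
static void model_refill(struct model_frag_cache *c)
{
	c->va = c->buf;
	c->offset = MODEL_SIZE;
	c->refills++;
}

/* Fast path: one subtraction and one sign check per allocation. */
static void *model_frag_alloc(struct model_frag_cache *c, unsigned int fragsz)
{
	int offset;

	assert(c);
	if (fragsz > MODEL_SIZE)
		return NULL;

	if (!c->va)
		model_refill(c);

	offset = c->offset - (int)fragsz;
	if (offset < 0) {
		model_refill(c);
		offset = c->offset - (int)fragsz;
	}

	c->offset = offset;
	return c->va + offset;
}
```

Fragments are carved from the end of the buffer towards the start, so
running out of space shows up as a negative offset, which is the single
condition the fast path has to test.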
* [PATCH v2 3/8] page_frag_cache: Rename 'nc' to 'pfc'
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
This name was a legacy from the 'netdev_alloc_cache' days.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
mm/page_alloc.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d2c106f4e5d..c9fc76135dd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4336,21 +4336,21 @@ EXPORT_SYMBOL(free_pages);
* drivers to provide a backing region of memory for use as either an
* sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
*/
-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+static struct page *__page_frag_cache_refill(struct page_frag_cache *pfc,
gfp_t gfp_mask)
{
unsigned int size = PAGE_SIZE;
struct page *page = NULL;
- struct page *old = nc->va ? virt_to_page(nc->va) : NULL;
+ struct page *old = pfc->va ? virt_to_page(pfc->va) : NULL;
gfp_t gfp = gfp_mask;
- unsigned int pagecnt_bias = nc->pagecnt_bias & ~PFC_MEMALLOC;
+ unsigned int pagecnt_bias = pfc->pagecnt_bias & ~PFC_MEMALLOC;
/* If all allocations have been freed, we can reuse this page */
if (old && page_ref_sub_and_test(old, pagecnt_bias)) {
page = old;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
/* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
+ size = pfc->size;
#endif
/* Page count is 0, we can safely set it */
set_page_count(page, size);
@@ -4364,25 +4364,25 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
PAGE_FRAG_CACHE_MAX_ORDER);
if (page)
size = PAGE_FRAG_CACHE_MAX_SIZE;
- nc->size = size;
+ pfc->size = size;
#endif
if (unlikely(!page))
page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
if (!page) {
- nc->va = NULL;
+ pfc->va = NULL;
return NULL;
}
- nc->va = page_address(page);
+ pfc->va = page_address(page);
/* Using atomic_set() would break get_page_unless_zero() users. */
page_ref_add(page, size - 1);
reset:
/* reset page count bias and offset to start of new frag */
- nc->pagecnt_bias = size;
+ pfc->pagecnt_bias = size;
if (page_is_pfmemalloc(page))
- nc->pagecnt_bias |= PFC_MEMALLOC;
- nc->offset = size;
+ pfc->pagecnt_bias |= PFC_MEMALLOC;
+ pfc->offset = size;
return page;
}
@@ -4402,27 +4402,27 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
}
EXPORT_SYMBOL(__page_frag_cache_drain);
-void *page_frag_alloc(struct page_frag_cache *nc,
+void *page_frag_alloc(struct page_frag_cache *pfc,
unsigned int fragsz, gfp_t gfp_mask)
{
struct page *page;
int offset;
- if (unlikely(!nc->va)) {
+ if (unlikely(!pfc->va)) {
refill:
- page = __page_frag_cache_refill(nc, gfp_mask);
+ page = __page_frag_cache_refill(pfc, gfp_mask);
if (!page)
return NULL;
}
- offset = nc->offset - fragsz;
+ offset = pfc->offset - fragsz;
if (unlikely(offset < 0))
goto refill;
- nc->pagecnt_bias--;
- nc->offset = offset;
+ pfc->pagecnt_bias--;
+ pfc->offset = offset;
- return nc->va + offset;
+ return pfc->va + offset;
}
EXPORT_SYMBOL(page_frag_alloc);
--
2.16.2
^ permalink raw reply related
* [PATCH v2 4/8] page_frag_cache: Rename fragsz to size
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
The name 'size' used to hold the page size. That use is gone, so take
the name for the fragment size in place of 'fragsz'.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9fc76135dd8..5a2e3e293079 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4403,7 +4403,7 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
EXPORT_SYMBOL(__page_frag_cache_drain);
void *page_frag_alloc(struct page_frag_cache *pfc,
- unsigned int fragsz, gfp_t gfp_mask)
+ unsigned int size, gfp_t gfp_mask)
{
struct page *page;
int offset;
@@ -4415,7 +4415,7 @@ void *page_frag_alloc(struct page_frag_cache *pfc,
return NULL;
}
- offset = pfc->offset - fragsz;
+ offset = pfc->offset - size;
if (unlikely(offset < 0))
goto refill;
--
2.16.2
^ permalink raw reply related
* [PATCH v2 5/8] page_frag_cache: Save memory on small machines
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
Only allocate a single page if CONFIG_BASE_SMALL is set.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
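Editor's sketch, not part of the patch: the effect of the new #if on the
maximum cache size can be modelled in userspace. DEMO_ALIGN_MASK is meant
to mirror the kernel's __ALIGN_MASK; demo_frag_cache_max_size() and its
parameters are hypothetical names used only for illustration.

```c
/* Same arithmetic as the kernel's __ALIGN_MASK(x, mask). */
#define DEMO_ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))

/*
 * Model of PAGE_FRAG_CACHE_MAX_SIZE after this patch.
 * page_size must be a power of two; base_small stands in for
 * CONFIG_BASE_SMALL.
 */
unsigned long demo_frag_cache_max_size(unsigned long page_size, int base_small)
{
	if (base_small)
		return page_size;

	/* Round 32768 up to a whole number of pages. */
	return DEMO_ALIGN_MASK(32768UL, page_size - 1);
}
```

Under these assumptions, 4 KiB pages give 32768 bytes normally and 4096
bytes with base_small set, while 64 KiB pages give 65536 bytes either way.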
include/linux/mm_types.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a63b138ad1a4..0defff9e3c0e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -216,7 +216,11 @@ struct page {
#endif
} _struct_page_alignment;
+#if CONFIG_BASE_SMALL
+#define PAGE_FRAG_CACHE_MAX_SIZE PAGE_SIZE
+#else
#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK)
+#endif
#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE)
#define PFC_MEMALLOC (1U << 31)
--
2.16.2
^ permalink raw reply related
* [PATCH v2 6/8] page_frag_cache: Use a mask instead of offset
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
By combining 'va' and 'offset' into 'addr' and using a mask instead,
we can save a compare-and-branch in the fast-path of the allocator.
This removes 4 instructions on x86 (both 32 and 64 bit).
We can avoid storing the mask at all if we know that we're only allocating
a single page. This shrinks page_frag_cache from 12 to 8 bytes on a 32-bit
CONFIG_BASE_SMALL build.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
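Editor's sketch, not part of the patch: a userspace model of the
'addr plus mask' idea described above. It assumes a fixed, power-of-two
buffer obtained from aligned_alloc(); every demo_* name and DEMO_* macro
is hypothetical, and the refcounting that the real allocator does is left
out entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical buffer size for the demo; must be a power of two. */
#define DEMO_BUF_SIZE 4096u
#define DEMO_MASK (DEMO_BUF_SIZE - 1)

struct demo_frag_cache {
	char *addr;	/* start of the last fragment handed out, or NULL */
};

/* Allocate a size-aligned buffer and point addr one byte past its end. */
static char *demo_refill(struct demo_frag_cache *c)
{
	char *base = aligned_alloc(DEMO_BUF_SIZE, DEMO_BUF_SIZE);

	c->addr = base ? base + DEMO_BUF_SIZE : NULL;
	return c->addr;
}

void *demo_frag_alloc(struct demo_frag_cache *c, unsigned int size)
{
	char *addr = c->addr;
	/* On entry, the low bits of addr are the bytes still free below it. */
	unsigned int offset = (uintptr_t)addr & DEMO_MASK;

	/* Demo-only guard; the patch itself does not check for these. */
	if (size == 0 || size > DEMO_BUF_SIZE)
		return NULL;

	/* One compare covers both "no buffer yet" and "buffer exhausted". */
	if (offset < size) {
		/* The old buffer is simply leaked; the demo has no refcounting. */
		addr = demo_refill(c);
		if (!addr)
			return NULL;
	}

	addr -= size;
	c->addr = addr;
	return addr;
}
```

A NULL addr masks to offset 0, so the first call takes the refill path
without a separate NULL test, which is where the saved compare-and-branch
would come from in this model.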
include/linux/mm_types.h | 12 +++++++-----
mm/page_alloc.c | 40 +++++++++++++++-------------------------
2 files changed, 22 insertions(+), 30 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0defff9e3c0e..ebe93edec752 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,12 +225,9 @@ struct page {
#define PFC_MEMALLOC (1U << 31)
struct page_frag_cache {
- void * va;
+ void *addr;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- __u16 offset;
- __u16 size;
-#else
- __u32 offset;
+ unsigned int mask;
#endif
/* we maintain a pagecount bias, so that we dont dirty cache line
* containing page->_refcount every time we allocate a fragment.
@@ -239,6 +236,11 @@ struct page_frag_cache {
};
#define page_frag_cache_pfmemalloc(pfc) ((pfc)->pagecnt_bias & PFC_MEMALLOC)
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+#define page_frag_cache_mask(pfc) (pfc)->mask
+#else
+#define page_frag_cache_mask(pfc) (~PAGE_MASK)
+#endif
typedef unsigned long vm_flags_t;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5a2e3e293079..d15a5348a8e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4336,22 +4336,19 @@ EXPORT_SYMBOL(free_pages);
* drivers to provide a backing region of memory for use as either an
* sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
*/
-static struct page *__page_frag_cache_refill(struct page_frag_cache *pfc,
+static void *__page_frag_cache_refill(struct page_frag_cache *pfc,
gfp_t gfp_mask)
{
unsigned int size = PAGE_SIZE;
struct page *page = NULL;
- struct page *old = pfc->va ? virt_to_page(pfc->va) : NULL;
+ struct page *old = pfc->addr ? virt_to_head_page(pfc->addr) : NULL;
gfp_t gfp = gfp_mask;
unsigned int pagecnt_bias = pfc->pagecnt_bias & ~PFC_MEMALLOC;
/* If all allocations have been freed, we can reuse this page */
if (old && page_ref_sub_and_test(old, pagecnt_bias)) {
page = old;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = pfc->size;
-#endif
+ size = page_frag_cache_mask(pfc) + 1;
/* Page count is 0, we can safely set it */
set_page_count(page, size);
goto reset;
@@ -4364,27 +4361,24 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *pfc,
PAGE_FRAG_CACHE_MAX_ORDER);
if (page)
size = PAGE_FRAG_CACHE_MAX_SIZE;
- pfc->size = size;
+ pfc->mask = size - 1;
#endif
if (unlikely(!page))
page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
if (!page) {
- pfc->va = NULL;
+ pfc->addr = NULL;
return NULL;
}
- pfc->va = page_address(page);
-
/* Using atomic_set() would break get_page_unless_zero() users. */
page_ref_add(page, size - 1);
reset:
- /* reset page count bias and offset to start of new frag */
pfc->pagecnt_bias = size;
if (page_is_pfmemalloc(page))
pfc->pagecnt_bias |= PFC_MEMALLOC;
- pfc->offset = size;
+ pfc->addr = page_address(page) + size;
- return page;
+ return pfc->addr;
}
void __page_frag_cache_drain(struct page *page, unsigned int count)
@@ -4405,24 +4399,20 @@ EXPORT_SYMBOL(__page_frag_cache_drain);
void *page_frag_alloc(struct page_frag_cache *pfc,
unsigned int size, gfp_t gfp_mask)
{
- struct page *page;
- int offset;
+ void *addr = pfc->addr;
+ unsigned int offset = (unsigned long)addr & page_frag_cache_mask(pfc);
- if (unlikely(!pfc->va)) {
-refill:
- page = __page_frag_cache_refill(pfc, gfp_mask);
- if (!page)
+ if (unlikely(offset < size)) {
+ addr = __page_frag_cache_refill(pfc, gfp_mask);
+ if (!addr)
return NULL;
}
- offset = pfc->offset - size;
- if (unlikely(offset < 0))
- goto refill;
-
+ addr -= size;
+ pfc->addr = addr;
pfc->pagecnt_bias--;
- pfc->offset = offset;
- return pfc->va + offset;
+ return addr;
}
EXPORT_SYMBOL(page_frag_alloc);
--
2.16.2
^ permalink raw reply related
* [PATCH v2 7/8] page_frag: Update documentation
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
- Rename Documentation/vm/page_frags to page_frags.rst
- Change page_frags.rst to be a user's guide rather than implementation
detail.
- Add kernel-doc for the page_frag allocator
- Move implementation details to the comments in page_alloc.c
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
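Editor's sketch, not part of the patch: the new page_frags.rst text says
that rounding each allocation size keeps all objects aligned. A minimal
userspace model of that claim follows, assuming allocation moves downward
from a buffer end that is itself aligned to at least 'align'. Both demo_*
functions are hypothetical helpers, not kernel API.

```c
/* Round size up to a multiple of align; align must be a power of two. */
unsigned int demo_round_up(unsigned int size, unsigned int align)
{
	return (size + align - 1) & ~(align - 1);
}

/*
 * Model of the downward-moving allocation offset within one buffer.
 * Returns the new offset, or -1 if the rounded size does not fit.
 */
long demo_next_offset(unsigned int offset, unsigned int size,
		      unsigned int align)
{
	unsigned int rounded = demo_round_up(size, align);

	if (rounded > offset)
		return -1;
	return (long)(offset - rounded);
}
```

Because every step subtracts a multiple of 'align' from an aligned
starting offset, each returned offset stays a multiple of 'align'.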
Documentation/vm/page_frags | 42 ---------------------------
Documentation/vm/page_frags.rst | 24 ++++++++++++++++
mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++---------
3 files changed, 74 insertions(+), 55 deletions(-)
delete mode 100644 Documentation/vm/page_frags
create mode 100644 Documentation/vm/page_frags.rst
diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags
deleted file mode 100644
index a6714565dbf9..000000000000
--- a/Documentation/vm/page_frags
+++ /dev/null
@@ -1,42 +0,0 @@
-Page fragments
---------------
-
-A page fragment is an arbitrary-length arbitrary-offset area of memory
-which resides within a 0 or higher order compound page. Multiple
-fragments within that page are individually refcounted, in the page's
-reference counter.
-
-The page_frag functions, page_frag_alloc and page_frag_free, provide a
-simple allocation framework for page fragments. This is used by the
-network stack and network device drivers to provide a backing region of
-memory for use as either an sk_buff->head, or to be used in the "frags"
-portion of skb_shared_info.
-
-In order to make use of the page fragment APIs a backing page fragment
-cache is needed. This provides a central point for the fragment allocation
-and tracks allows multiple calls to make use of a cached page. The
-advantage to doing this is that multiple calls to get_page can be avoided
-which can be expensive at allocation time. However due to the nature of
-this caching it is required that any calls to the cache be protected by
-either a per-cpu limitation, or a per-cpu limitation and forcing interrupts
-to be disabled when executing the fragment allocation.
-
-The network stack uses two separate caches per CPU to handle fragment
-allocation. The netdev_alloc_cache is used by callers making use of the
-__netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is
-used by callers of the __napi_alloc_frag and __napi_alloc_skb calls. The
-main difference between these two calls is the context in which they may be
-called. The "netdev" prefixed functions are usable in any context as these
-functions will disable interrupts, while the "napi" prefixed functions are
-only usable within the softirq context.
-
-Many network device drivers use a similar methodology for allocating page
-fragments, but the page fragments are cached at the ring or descriptor
-level. In order to enable these cases it is necessary to provide a generic
-way of tearing down a page cache. For this reason __page_frag_cache_drain
-was implemented. It allows for freeing multiple references from a single
-page via a single call. The advantage to doing this is that it allows for
-cleaning up the multiple references that were added to a page in order to
-avoid calling get_page per allocation.
-
-Alexander Duyck, Nov 29, 2016.
diff --git a/Documentation/vm/page_frags.rst b/Documentation/vm/page_frags.rst
new file mode 100644
index 000000000000..e675bfad6710
--- /dev/null
+++ b/Documentation/vm/page_frags.rst
@@ -0,0 +1,24 @@
+==============
+Page fragments
+==============
+
+:Author: Alexander Duyck
+
+A page fragment is a physically contiguous area of memory that is smaller
+than a page. It may cross a page boundary, and may be allocated at
+an arbitrary alignment.
+
+The page fragment allocator is optimised for very fast allocation
+of arbitrary-sized objects which will likely be freed soon. It does
+not take any locks, relying on the caller to ensure that simultaneous
+allocations from the same page_frag_cache cannot occur. The allocator
+does not support red zones or poisoning. If the user has alignment
+requirements, rounding the size of each object allocated from the cache
+will ensure that all objects are aligned. Do not attempt to allocate
+0 bytes; it is not checked for and will end badly.
+
+Functions
+=========
+
+.. kernel-doc:: mm/page_alloc.c
+ :functions: page_frag_alloc page_frag_free __page_frag_cache_drain
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d15a5348a8e4..b9beafa5d2a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4326,15 +4326,27 @@ void free_pages(unsigned long addr, unsigned int order)
EXPORT_SYMBOL(free_pages);
/*
- * Page Fragment:
- * An arbitrary-length arbitrary-offset area of memory which resides
- * within a 0 or higher order page. Multiple fragments within that page
- * are individually refcounted, in the page's reference counter.
- *
- * The page_frag functions below provide a simple allocation framework for
- * page fragments. This is used by the network stack and network device
- * drivers to provide a backing region of memory for use as either an
+ * The page_frag functions below are used by the network stack and network
+ * device drivers to provide a backing region of memory for use as either an
* sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
+ *
+ * We attempt to use a compound page (unless the machine has a large
+ * PAGE_SIZE) in order to minimise trips into the page allocator. Allocation
+ * starts at the end of the page and proceeds towards the beginning of the
+ * page. Once there is insufficient space in the page to satisfy the
+ * next allocation, we call into __page_frag_cache_refill() in order to
+ * either recycle the existing page or start allocation from a new page.
+ *
+ * The allocation side maintains a count of the number of allocations it
+ * has made while frees are counted in the struct page reference count.
+ * We reconcile the two when there is no space left in the page. This
+ * minimises cache line bouncing when page frags are freed on a different
+ * CPU from the one they were allocated on.
+ *
+ * Several network drivers use a similar approach to the page_frag_cache,
+ * but specialise their allocator to return a dma_addr_t instead of a
+ * virtual address. They can also use page_frag_free(), and will use
+ * __page_frag_cache_drain() in order to destroy their caches.
*/
static void *__page_frag_cache_refill(struct page_frag_cache *pfc,
gfp_t gfp_mask)
@@ -4381,6 +4393,18 @@ static void *__page_frag_cache_refill(struct page_frag_cache *pfc,
return pfc->addr;
}
+/**
+ * __page_frag_cache_drain() - Stop using a page.
+ * @page: Current page in use.
+ * @count: Number of allocations remaining.
+ *
+ * When a page fragment cache is being destroyed, this function prepares
+ * the page to be freed. It will actually be freed if there are no
+ * outstanding allocations on that page; otherwise it will be freed when
+ * the last allocation is freed.
+ *
+ * Context: Any context.
+ */
void __page_frag_cache_drain(struct page *page, unsigned int count)
{
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
@@ -4396,14 +4420,22 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
}
EXPORT_SYMBOL(__page_frag_cache_drain);
-void *page_frag_alloc(struct page_frag_cache *pfc,
- unsigned int size, gfp_t gfp_mask)
+/**
+ * page_frag_alloc() - Allocate a page fragment.
+ * @pfc: page_frag cache.
+ * @size: Number of bytes to allocate.
+ * @gfp: Memory allocation flags.
+ *
+ * Context: Any context.
+ * Return: Address of allocated memory or %NULL.
+ */
+void *page_frag_alloc(struct page_frag_cache *pfc, unsigned int size, gfp_t gfp)
{
void *addr = pfc->addr;
unsigned int offset = (unsigned long)addr & page_frag_cache_mask(pfc);
if (unlikely(offset < size)) {
- addr = __page_frag_cache_refill(pfc, gfp_mask);
+ addr = __page_frag_cache_refill(pfc, gfp);
if (!addr)
return NULL;
}
@@ -4416,8 +4448,13 @@ void *page_frag_alloc(struct page_frag_cache *pfc,
}
EXPORT_SYMBOL(page_frag_alloc);
-/*
- * Frees a page fragment allocated out of either a compound or order 0 page.
+/**
+ * page_frag_free() - Free a page fragment.
+ * @addr: Address of page fragment.
+ *
+ * Free memory previously allocated by page_frag_alloc().
+ *
+ * Context: Any context.
*/
void page_frag_free(void *addr)
{
--
2.16.2
^ permalink raw reply related
* [PATCH v2 8/8] page_frag: Account allocations
From: Matthew Wilcox @ 2018-03-22 15:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Matthew Wilcox, netdev, linux-mm, Jesper Dangaard Brouer,
Eric Dumazet
In-Reply-To: <20180322153157.10447-1-willy@infradead.org>
From: Matthew Wilcox <mawilcox@microsoft.com>
Note the number of pages currently used in page_frag allocations.
This may help diagnose leaks in page_frag users.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
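Editor's sketch, not part of the patch: the accounting pattern can be
modelled in userspace as a counter that goes up once when a fresh page
enters a cache and down once when the last reference to it is dropped.
struct demo_page and the demo_* functions are hypothetical stand-ins for
struct page and the per-node vmstat helpers.

```c
/* Userspace stand-in; the kernel uses struct page and node vmstat. */
struct demo_page {
	unsigned int refcount;
};

static long demo_nr_page_frag;	/* plays the role of NR_PAGE_FRAG */

/* A fresh page enters the cache: count it once, however many frags follow. */
void demo_page_acquire(struct demo_page *page, unsigned int refs)
{
	page->refcount = refs;
	demo_nr_page_frag++;
}

/* Drop one reference; uncount the page only when the last one goes. */
int demo_frag_free(struct demo_page *page)
{
	if (--page->refcount == 0) {
		demo_nr_page_frag--;
		return 1;	/* the page would be freed here */
	}
	return 0;
}

long demo_page_frag_count(void)
{
	return demo_nr_page_frag;
}
```

As far as the diff below shows, the increment sits before the 'reset'
label, so a recycled page is not counted a second time; the model above
follows that by counting only in demo_page_acquire().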
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 10 +++++++---
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7522a6987595..ed6be33dcc7a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -139,10 +139,10 @@ enum zone_stat_item {
NR_ZONE_ACTIVE_FILE,
NR_ZONE_UNEVICTABLE,
NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */
+ /* Second 128 byte cacheline */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_PAGETABLE, /* used for pagetables */
NR_KERNEL_STACK_KB, /* measured in KiB */
- /* Second 128 byte cacheline */
NR_BOUNCE,
#if IS_ENABLED(CONFIG_ZSMALLOC)
NR_ZSPAGES, /* allocated in zsmalloc */
@@ -175,6 +175,7 @@ enum node_stat_item {
NR_SHMEM_THPS,
NR_SHMEM_PMDMAPPED,
NR_ANON_THPS,
+ NR_PAGE_FRAG,
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b9beafa5d2a5..5a9441b46604 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4382,6 +4382,7 @@ static void *__page_frag_cache_refill(struct page_frag_cache *pfc,
return NULL;
}
+ inc_node_page_state(page, NR_PAGE_FRAG);
/* Using atomic_set() would break get_page_unless_zero() users. */
page_ref_add(page, size - 1);
reset:
@@ -4460,8 +4461,10 @@ void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);
- if (unlikely(put_page_testzero(page)))
+ if (unlikely(put_page_testzero(page))) {
+ dec_node_page_state(page, NR_PAGE_FRAG);
__free_pages_ok(page, compound_order(page));
+ }
}
EXPORT_SYMBOL(page_frag_free);
@@ -4769,7 +4772,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
- " free:%lu free_pcp:%lu free_cma:%lu\n",
+ " free:%lu free_pcp:%lu free_cma:%lu page_frag:%lu\n",
global_node_page_state(NR_ACTIVE_ANON),
global_node_page_state(NR_INACTIVE_ANON),
global_node_page_state(NR_ISOLATED_ANON),
@@ -4788,7 +4791,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
global_zone_page_state(NR_BOUNCE),
global_zone_page_state(NR_FREE_PAGES),
free_pcp,
- global_zone_page_state(NR_FREE_CMA_PAGES));
+ global_zone_page_state(NR_FREE_CMA_PAGES),
+ global_node_page_state(NR_PAGE_FRAG));
for_each_online_pgdat(pgdat) {
if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
--
2.16.2
^ permalink raw reply related
* Re: [PATCH net-next 0/2] Fixes to allow mv88e6xxx module to be reloaded
From: David Miller @ 2018-03-22 15:33 UTC (permalink / raw)
To: andrew; +Cc: netdev, u.kleine-koenig
In-Reply-To: <1521494181-4653-1-git-send-email-andrew@lunn.ch>
From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 19 Mar 2018 22:16:19 +0100
> As reported by Uwe Kleine-König, the interrupt trigger is first
> configured by DT and then reconfigured to edge. This results in a
> failure on EPROBE_DEFER, or if the module is unloaded and reloaded.
>
> A second crash happens on module reload due to a missing call to the
> common IRQ free code when using polled interrupts.
>
> With these fixes in place, it becomes possible to load and unload the
> kernel modules a few times without it crashing.
Andrew, please respin this with the characters in Uwe's name fixed
up for patch #2.
Thank you.
^ permalink raw reply
* [PATCH net-next] bridge: Allow max MTU when multiple VLANs present
From: Chas Williams @ 2018-03-22 15:34 UTC (permalink / raw)
To: davem; +Cc: netdev, stephen, Chas Williams
If the bridge is allowing multiple VLANs, some VLANs may have
different MTUs. Instead of choosing the minimum MTU for the
bridge interface, choose the maximum MTU of the bridge members.
With this change, the user only needs to set a larger MTU on the member
ports that are participating in the large MTU VLANs.
Signed-off-by: Chas Williams <3chas3@gmail.com>
---
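Editor's sketch, not part of the patch: the selection rule described
above, modelled in userspace over a plain array of port MTUs. The demo_*
names and DEMO_ETH_DATA_LEN are hypothetical; the empty-bridge default of
1500 is assumed to match ETH_DATA_LEN.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ETH_DATA_LEN 1500	/* assumed default for a bridge with no ports */

static bool demo_less(int a, int b)
{
	return a < b;
}

static bool demo_greater(int a, int b)
{
	return a > b;
}

/* Pick the port MTU that 'better' prefers, or the default if n == 0. */
static int demo_pick_mtu(const int *port_mtu, size_t n,
			 bool (*better)(int, int))
{
	int mtu = 0;
	size_t i;

	if (n == 0)
		return DEMO_ETH_DATA_LEN;

	for (i = 0; i < n; i++) {
		if (!mtu || better(port_mtu[i], mtu))
			mtu = port_mtu[i];
	}
	return mtu;
}

/* Maximum of the ports when VLAN filtering is on, minimum otherwise. */
int demo_br_mtu(const int *port_mtu, size_t n, bool vlan_enabled)
{
	return demo_pick_mtu(port_mtu, n,
			     vlan_enabled ? demo_greater : demo_less);
}
```

For ports with MTUs 1500, 9000 and 1400, this model yields 1400 without
VLANs and 9000 with them.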
net/bridge/br.c | 2 +-
net/bridge/br_device.c | 2 +-
net/bridge/br_if.c | 26 ++++++++++++++++++++++----
net/bridge/br_private.h | 2 +-
4 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/net/bridge/br.c b/net/bridge/br.c
index 7770481a6506..a3f95ab9d6a3 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -52,7 +52,7 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
switch (event) {
case NETDEV_CHANGEMTU:
- dev_set_mtu(br->dev, br_min_mtu(br));
+ dev_set_mtu(br->dev, br_mtu(br));
break;
case NETDEV_CHANGEADDR:
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 1285ca30ab0a..278fc999d355 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -224,7 +224,7 @@ static void br_get_stats64(struct net_device *dev,
static int br_change_mtu(struct net_device *dev, int new_mtu)
{
struct net_bridge *br = netdev_priv(dev);
- if (new_mtu > br_min_mtu(br))
+ if (new_mtu > br_mtu(br))
return -EINVAL;
dev->mtu = new_mtu;
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 9ba4ed65c52b..48dc4d2e2be3 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -424,8 +424,18 @@ int br_del_bridge(struct net *net, const char *name)
return ret;
}
+static bool min_mtu(int a, int b)
+{
+ return a < b ? 1 : 0;
+}
+
+static bool max_mtu(int a, int b)
+{
+ return a > b ? 1 : 0;
+}
+
/* MTU of the bridge pseudo-device: ETH_DATA_LEN or the minimum of the ports */
-int br_min_mtu(const struct net_bridge *br)
+static int __br_mtu(const struct net_bridge *br, bool (compare_fn)(int, int))
{
const struct net_bridge_port *p;
int mtu = 0;
@@ -436,13 +446,21 @@ int br_min_mtu(const struct net_bridge *br)
mtu = ETH_DATA_LEN;
else {
list_for_each_entry(p, &br->port_list, list) {
- if (!mtu || p->dev->mtu < mtu)
+ if (!mtu || compare_fn(p->dev->mtu, mtu))
mtu = p->dev->mtu;
}
}
return mtu;
}
+int br_mtu(const struct net_bridge *br)
+{
+ if (br->vlan_enabled)
+ return __br_mtu(br, max_mtu);
+ else
+ return __br_mtu(br, min_mtu);
+}
+
static void br_set_gso_limits(struct net_bridge *br)
{
unsigned int gso_max_size = GSO_MAX_SIZE;
@@ -594,7 +612,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
if (changed_addr)
call_netdevice_notifiers(NETDEV_CHANGEADDR, br->dev);
- dev_set_mtu(br->dev, br_min_mtu(br));
+ dev_set_mtu(br->dev, br_mtu(br));
br_set_gso_limits(br);
kobject_uevent(&p->kobj, KOBJ_ADD);
@@ -641,7 +659,7 @@ int br_del_if(struct net_bridge *br, struct net_device *dev)
*/
del_nbp(p);
- dev_set_mtu(br->dev, br_min_mtu(br));
+ dev_set_mtu(br->dev, br_mtu(br));
br_set_gso_limits(br);
spin_lock_bh(&br->lock);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 8e13a64d8c99..048d5b51813b 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -578,7 +578,7 @@ int br_del_bridge(struct net *net, const char *name);
int br_add_if(struct net_bridge *br, struct net_device *dev,
struct netlink_ext_ack *extack);
int br_del_if(struct net_bridge *br, struct net_device *dev);
-int br_min_mtu(const struct net_bridge *br);
+int br_mtu(const struct net_bridge *br);
netdev_features_t br_features_recompute(struct net_bridge *br,
netdev_features_t features);
void br_port_flags_change(struct net_bridge_port *port, unsigned long mask);
--
2.13.6
^ permalink raw reply related