* [PATCH 02/14] netfilter: nft_osf: check if attribute is present
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
If the attribute is not sent, e.g. by an old libnftnl binary, then
tb[NFTA_OSF_TTL] is NULL and the kernel crashes in the _init path.
Fixes: a218dc82f0b5 ("netfilter: nft_osf: Add ttl option support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nft_osf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/nft_osf.c b/net/netfilter/nft_osf.c
index ca5e5d8c5ef8..b13618c764ec 100644
--- a/net/netfilter/nft_osf.c
+++ b/net/netfilter/nft_osf.c
@@ -50,7 +50,7 @@ static int nft_osf_init(const struct nft_ctx *ctx,
int err;
u8 ttl;
- if (nla_get_u8(tb[NFTA_OSF_TTL])) {
+ if (tb[NFTA_OSF_TTL]) {
ttl = nla_get_u8(tb[NFTA_OSF_TTL]);
if (ttl > 2)
return -EINVAL;
--
2.11.0
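For illustration only, here is a rough userspace sketch of the pattern this fix applies: an attribute table entry is tested for presence before it is dereferenced. This is not the kernel code. The `demo_*` types and names are hypothetical stand-ins for the netlink attribute table and `nla_get_u8()`, and the default of 0 for an absent attribute is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's netlink attribute types. */
enum { DEMO_ATTR_UNSPEC, DEMO_ATTR_TTL, DEMO_ATTR_MAX };

struct demo_attr {
	uint8_t value;
};

/* Like nla_get_u8(): dereferences its argument, so it must not be NULL. */
static uint8_t demo_get_u8(const struct demo_attr *attr)
{
	assert(attr != NULL);
	return attr->value;
}

/*
 * Parse the optional TTL attribute. tb[] holds one pointer per attribute
 * type; an entry is NULL when userspace did not send that attribute.
 * Returns 0 on success or -EINVAL when the value is out of range.
 */
static int demo_osf_init(const struct demo_attr *const *tb, uint8_t *ttl_out)
{
	*ttl_out = 0;	/* assumed default when the attribute is absent */

	if (tb[DEMO_ATTR_TTL]) {	/* presence check comes first */
		uint8_t ttl = demo_get_u8(tb[DEMO_ATTR_TTL]);

		if (ttl > 2)
			return -EINVAL;
		*ttl_out = ttl;
	}
	return 0;
}
```

With the pre-fix condition, the accessor itself would be called on the NULL entry; testing the table pointer first keeps an old sender that omits the attribute on the default path.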
* [PATCH 01/14] netfilter: ipv6: fix oops when defragmenting locally generated fragments
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20181105232832.21896-1-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
Unlike ipv4 and normal ipv6 defrag, netfilter ipv6 defragmentation did
not save/restore skb->dst.
This causes oops when handling locally generated ipv6 fragments, as
output path needs a valid dst.
Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Fixes: 84379c9afe01 ("netfilter: ipv6: nf_defrag: drop skb dst before queueing")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/ipv6/netfilter/nf_conntrack_reasm.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index b8ac369f98ad..d219979c3e52 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -587,11 +587,16 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
*/
ret = -EINPROGRESS;
if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
- fq->q.meat == fq->q.len &&
- nf_ct_frag6_reasm(fq, skb, dev))
- ret = 0;
- else
+ fq->q.meat == fq->q.len) {
+ unsigned long orefdst = skb->_skb_refdst;
+
+ skb->_skb_refdst = 0UL;
+ if (nf_ct_frag6_reasm(fq, skb, dev))
+ ret = 0;
+ skb->_skb_refdst = orefdst;
+ } else {
skb_dst_drop(skb);
+ }
out_unlock:
spin_unlock_bh(&fq->q.lock);
--
2.11.0
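For illustration only, a minimal userspace sketch of the save/clear/restore pattern used in the hunk above. The `demo_*` names are hypothetical, `refdst` merely stands in for `skb->_skb_refdst`, and the reassembly stub is an assumption that simply clobbers the field to show why the caller has to put it back.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical miniature of struct sk_buff: only the field of interest. */
struct demo_skb {
	unsigned long refdst;	/* stands in for skb->_skb_refdst */
};

/* Stand-in for a reassembly step that leaves the dst field cleared. */
static bool demo_reasm(struct demo_skb *skb, bool succeed)
{
	skb->refdst = 0UL;
	return succeed;
}

/*
 * Returns 0 when the queue was complete and reassembly succeeded,
 * -EINPROGRESS otherwise. On the "complete" path the caller's dst is
 * saved before reassembly and restored afterwards; on the other path
 * the dst is dropped, as in the original code.
 */
static int demo_gather(struct demo_skb *skb, bool complete, bool reasm_ok)
{
	int ret = -EINPROGRESS;

	assert(skb);
	if (complete) {
		unsigned long orefdst = skb->refdst;	/* save */

		skb->refdst = 0UL;			/* hide from reasm */
		if (demo_reasm(skb, reasm_ok))
			ret = 0;
		skb->refdst = orefdst;			/* restore */
	} else {
		skb->refdst = 0UL;	/* models skb_dst_drop() */
	}
	return ret;
}
```

The point of the sketch is only that the output path sees the same dst value after the call that it had before, whether or not reassembly succeeded.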
* [PATCH 00/14] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2018-11-05 23:28 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
Hi David,
The following patchset contains the first batch of Netfilter fixes for
your net tree:
1) Fix splat with IPv6 defragmenting locally generated fragments,
from Florian Westphal.
2) Fix incorrect check for missing attribute in nft_osf.
3) Missing INT_MIN & INT_MAX definition for netfilter bridge uapi
header, from Jiri Slaby.
4) Revert map lookup in nft_numgen, this is already possible with
the existing infrastructure without this extension.
5) Fix wrong listing of set reference counter, make counter
synchronous again, from Stefano Brivio.
6) Fix CIDR 0 in hash:net,port,net, from Eric Westbrook.
7) Fix allocation failure with large set, use kvcalloc().
From Andrey Ryabinin.
8) No need to disable BH when fetching the ip set comment, patch from
Jozsef Kadlecsik.
9) Sanity check for valid sysfs entry in xt_IDLETIMER, from
Taehee Yoo.
10) Fix suspicious rcu usage via ip_set() macro at netlink dump,
from Jozsef Kadlecsik.
11) Fix setting default timeout via nfnetlink_cttimeout, this
comes with preparation patch to add nf_{tcp,udp,...}_pernet()
helper.
12) Allow ebtables table nat to be of filter type via nft_compat.
From Florian Westphal.
13) Incorrect calculation of next bucket in early_drop, do not bump
hash value, update bucket counter instead. From Vasily Khoruzhick.
You can pull these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git
Thanks!
----------------------------------------------------------------
The following changes since commit 4f3ebb04d05fe36f74ef17c6ee06559626d47964:
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue (2018-10-24 16:27:33 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD
for you to fetch changes up to f393808dc64149ccd0e5a8427505ba2974a59854:
netfilter: conntrack: fix calculation of next bucket number in early_drop (2018-11-03 14:16:28 +0100)
----------------------------------------------------------------
Andrey Ryabinin (1):
netfilter: ipset: fix ip_set_list allocation failure
Eric Westbrook (1):
netfilter: ipset: actually allow allowable CIDR 0 in hash:net,port,net
Florian Westphal (2):
netfilter: ipv6: fix oops when defragmenting locally generated fragments
netfilter: nft_compat: ebtables 'nat' table is normal chain type
Jiri Slaby (1):
netfilter: bridge: define INT_MIN & INT_MAX in userspace
Jozsef Kadlecsik (2):
netfilter: ipset: Correct rcu_dereference() call in ip_set_put_comment()
netfilter: ipset: Fix calling ip_set() macro at dumping
Pablo Neira Ayuso (4):
netfilter: nft_osf: check if attribute is present
Revert "netfilter: nft_numgen: add map lookups for numgen random operations"
netfilter: conntrack: add nf_{tcp,udp,sctp,icmp,dccp,icmpv6,generic}_pernet()
netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr
Stefano Brivio (1):
netfilter: ipset: list:set: Decrease refcount synchronously on deletion and replace
Taehee Yoo (1):
netfilter: xt_IDLETIMER: add sysfs filename checking routine
Vasily Khoruzhick (1):
netfilter: conntrack: fix calculation of next bucket number in early_drop
include/linux/netfilter/ipset/ip_set.h | 2 +-
include/linux/netfilter/ipset/ip_set_comment.h | 4 +-
include/net/netfilter/nf_conntrack_l4proto.h | 39 ++++++++
include/uapi/linux/netfilter/nf_tables.h | 4 +-
include/uapi/linux/netfilter_bridge.h | 4 +
net/ipv6/netfilter/nf_conntrack_reasm.c | 13 ++-
net/netfilter/ipset/ip_set_core.c | 43 +++++----
net/netfilter/ipset/ip_set_hash_netportnet.c | 8 +-
net/netfilter/ipset/ip_set_list_set.c | 17 ++--
net/netfilter/nf_conntrack_core.c | 13 ++-
net/netfilter/nf_conntrack_proto_dccp.c | 13 +--
net/netfilter/nf_conntrack_proto_generic.c | 11 +--
net/netfilter/nf_conntrack_proto_icmp.c | 11 +--
net/netfilter/nf_conntrack_proto_icmpv6.c | 11 +--
net/netfilter/nf_conntrack_proto_sctp.c | 11 +--
net/netfilter/nf_conntrack_proto_tcp.c | 15 +--
net/netfilter/nf_conntrack_proto_udp.c | 11 +--
net/netfilter/nfnetlink_cttimeout.c | 47 +++++++--
net/netfilter/nft_compat.c | 21 ++--
net/netfilter/nft_numgen.c | 127 -------------------------
net/netfilter/nft_osf.c | 2 +-
net/netfilter/xt_IDLETIMER.c | 20 ++++
22 files changed, 200 insertions(+), 247 deletions(-)
* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
From: Aaron Lu @ 2018-11-06 8:47 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
Paweł Staszewski, Jesper Dangaard Brouer, Eric Dumazet,
Tariq Toukan, Ilias Apalodimas, Yoel Caspersen, Mel Gorman,
Saeed Mahameed, Michal Hocko, Dave Hansen, Alexander Duyck
In-Reply-To: <d6b4890c-0def-6114-2dcf-3ed120dea82c@suse.cz>
On Tue, Nov 06, 2018 at 09:16:20AM +0100, Vlastimil Babka wrote:
> On 11/6/18 6:30 AM, Aaron Lu wrote:
> > We have multiple places that free a page, most of them doing similar
> > things, and a common function can be used to reduce code duplication.
> >
> > It also avoids a bug being fixed in one function but left in another.
> >
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
Thanks.
> I assume there's no arch that would run page_ref_sub_and_test(1) slower
> than put_page_testzero(), for the critical __free_pages() case?
Good question.
I followed the non-arch-specific calls and found that
page_ref_sub_and_test() ends up calling atomic_sub_return(i, v) while
put_page_testzero() ends up calling atomic_sub_return(1, v). So they
should be the same for archs that do not have their own implementations.
Back to your question: I don't know either.
If this is deemed unsafe, we can probably keep the ref-modifying part in
the original functions and only move the freeing part into a common
function.
Regards,
Aaron
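To make the comparison concrete, here is a small userspace sketch using C11 atomics rather than the kernel's atomic_t API. It assumes the generic fallback described above, where both helpers reduce to "subtract, then compare the new value with zero". The `demo_*` names are hypothetical, and the sketch says nothing about arch-specific implementations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Models page_ref_sub_and_test(): drop nr references, report "hit zero". */
static bool demo_sub_and_test(atomic_int *ref, int nr)
{
	assert(nr > 0);
	/* atomic_fetch_sub() returns the old value; the new one is old - nr. */
	return atomic_fetch_sub(ref, nr) - nr == 0;
}

/* Models put_page_testzero(): drop exactly one reference. */
static bool demo_dec_and_test(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) - 1 == 0;
}
```

With nr == 1 the two helpers perform the same read-modify-write and the same comparison, which is the generic-case equivalence argued above.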
* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
From: Vlastimil Babka @ 2018-11-06 8:16 UTC (permalink / raw)
To: Aaron Lu, linux-mm, linux-kernel, netdev
Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
Mel Gorman, Saeed Mahameed, Michal Hocko, Dave Hansen,
Alexander Duyck
In-Reply-To: <20181106053037.GD6203@intel.com>
On 11/6/18 6:30 AM, Aaron Lu wrote:
> We have multiple places that free a page, most of them doing similar
> things, and a common function can be used to reduce code duplication.
>
> It also avoids a bug being fixed in one function but left in another.
>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
I assume there's no arch that would run page_ref_sub_and_test(1) slower
than put_page_testzero(), for the critical __free_pages() case?
* Re: [PATCH net] net/ipv6: Move anycast init/cleanup functions out of CONFIG_PROC_FS
From: Jeff Barnhill @ 2018-11-05 22:02 UTC (permalink / raw)
To: davem; +Cc: netdev, Alexey Kuznetsov, yoshfuji
In-Reply-To: <20181105.133702.1214041828910910455.davem@davemloft.net>
Thanks, David. Sorry for missing that in the original patch.
Jeff
On Mon, Nov 5, 2018 at 4:55 PM David Miller <davem@davemloft.net> wrote:
>
> From: Jeff Barnhill <0xeffeff@gmail.com>
> Date: Mon, 5 Nov 2018 20:36:45 +0000
>
> > Move the anycast.c init and cleanup functions out of the #ifdef
> > CONFIG_PROC_FS block, where they were inadvertently added.
> >
> > Fixes: 2384d02520ff ("net/ipv6: Add anycast addresses to a global hashtable")
> > Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
>
> Applied, thanks Jeff.
* Re: [PATCH net] net/ipv6: Move anycast init/cleanup functions out of CONFIG_PROC_FS
From: David Miller @ 2018-11-05 21:37 UTC (permalink / raw)
To: 0xeffeff; +Cc: netdev, kuznet, yoshfuji
In-Reply-To: <20181105203645.4993-1-0xeffeff@gmail.com>
From: Jeff Barnhill <0xeffeff@gmail.com>
Date: Mon, 5 Nov 2018 20:36:45 +0000
> Move the anycast.c init and cleanup functions out of the #ifdef
> CONFIG_PROC_FS block, where they were inadvertently added.
>
> Fixes: 2384d02520ff ("net/ipv6: Add anycast addresses to a global hashtable")
> Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Applied, thanks Jeff.
* Re: [PATCH 1/5] VSOCK: support fill mergeable rx buffer in guest
From: jiangyiwen @ 2018-11-06 6:22 UTC (permalink / raw)
To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <1801c013-6cfb-1d07-3401-49f536e01983@redhat.com>
On 2018/11/6 11:38, Jason Wang wrote:
>
> On 2018/11/5 下午3:45, jiangyiwen wrote:
>> In driver probing, if virtio has the VIRTIO_VSOCK_F_MRG_RXBUF feature,
>> the guest fills mergeable rx buffers so that the host can send into
>> mergeable rx buffers. It fills one page at a time to be compatible with
>> both small and big packets.
>>
>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
>> ---
>> include/linux/virtio_vsock.h | 3 ++
>> net/vmw_vsock/virtio_transport.c | 72 +++++++++++++++++++++++++++++-----------
>> 2 files changed, 56 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index e223e26..bf84418 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -14,6 +14,9 @@
>> #define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
>> #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
>>
>> +/* Virtio-vsock feature */
>> +#define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
>> +
>> enum {
>> VSOCK_VQ_RX = 0, /* for host to guest data */
>> VSOCK_VQ_TX = 1, /* for guest to host data */
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index 5d3cce9..2040a9e 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -64,6 +64,7 @@ struct virtio_vsock {
>> struct virtio_vsock_event event_list[8];
>>
>> u32 guest_cid;
>> + bool mergeable;
>> };
>>
>> static struct virtio_vsock *virtio_vsock_get(void)
>> @@ -256,6 +257,25 @@ static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
>> return 0;
>> }
>>
>> +static int fill_mergeable_rx_buff(struct virtqueue *vq)
>> +{
>> + void *page = NULL;
>> + struct scatterlist sg;
>> + int err;
>> +
>> + page = (void *)get_zeroed_page(GFP_KERNEL);
>
>
> Any reason to use zeroed page?
In the previous version, the entire virtio_vsock_pkt structure is preallocated
in the guest with kzalloc(), so it is contiguous, zeroed physical memory, but the
host only needs to fill in virtio_vsock_hdr.
In the mergeable rx buffer version, however, the guest only puts a page into the
vring descriptor, and the host reserves space the size of virtio_vsock_pkt instead
of writing the whole virtio_vsock_pkt. To keep the structure's values correct,
the page has to be zeroed in advance.
>
>
>> + if (!page)
>> + return -ENOMEM;
>> +
>> + sg_init_one(&sg, page, PAGE_SIZE);
>
>
> FYI, for virtio-net we have several optimizations for mergeable rx buffer:
>
> - skb_page_frag_refill() which can use high order page and reduce the stress of page allocator
>
You're right. Initially I wanted to use a memory pool to manage the rx buffers,
and I will adopt this in a later optimization patch. Thanks for the great advice.
> - we don't use fixed buffer size, instead we use EWMA to estimate the possible rx buffer size to avoid internal fragmentation
>
OK, I will analyze this feature and consider adding it to my patches.
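As a rough illustration of the EWMA idea mentioned above (not the virtio-net implementation), the sketch below keeps an exponentially weighted moving average of observed packet lengths and clamps it to pick the next buffer length. The `demo_*` names, the 1/8 weight for new samples, and the clamping bounds are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical EWMA state; avg == 0 means "no samples yet". */
struct demo_ewma {
	unsigned long avg;
};

/* Fold one observed packet length into the average (new sample weight 1/8). */
static void demo_ewma_add(struct demo_ewma *e, unsigned long len)
{
	e->avg = e->avg ? (e->avg * 7 + len) / 8 : len;
}

/* Pick the next rx buffer length: the running average clamped to [min, max]. */
static unsigned long demo_rx_buf_len(const struct demo_ewma *e,
				     unsigned long min, unsigned long max)
{
	assert(min <= max);
	if (e->avg < min)
		return min;
	if (e->avg > max)
		return max;
	return e->avg;
}
```

Sizing buffers from a running average rather than a fixed constant is what lets small-packet workloads avoid leaving most of each buffer unused.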
>
> If we can try to reuse virtio-net driver, we will get those nice features.
>
Yes, after all virtio-net has a very healthy ecosystem and has already
received many performance optimizations, so it is actually a good idea.
>
> Thanks
>
>
>> +
>> + err = virtqueue_add_inbuf(vq, &sg, 1, page, GFP_KERNEL);
>> + if (err < 0)
>> + free_page((unsigned long) page);
>> +
>> + return err;
>> +}
>> +
>> static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
>> {
>> int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
>> @@ -267,27 +287,33 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
>> vq = vsock->vqs[VSOCK_VQ_RX];
>>
>> do {
>> - pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>> - if (!pkt)
>> - break;
>> + if (vsock->mergeable) {
>> + ret = fill_mergeable_rx_buff(vq);
>> + if (ret)
>> + break;
>> + } else {
>> + pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>> + if (!pkt)
>> + break;
>>
>> - pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>> - if (!pkt->buf) {
>> - virtio_transport_free_pkt(pkt);
>> - break;
>> - }
>> + pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>> + if (!pkt->buf) {
>> + virtio_transport_free_pkt(pkt);
>> + break;
>> + }
>>
>> - pkt->len = buf_len;
>> + pkt->len = buf_len;
>>
>> - sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>> - sgs[0] = &hdr;
>> + sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>> + sgs[0] = &hdr;
>>
>> - sg_init_one(&buf, pkt->buf, buf_len);
>> - sgs[1] = &buf;
>> - ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
>> - if (ret) {
>> - virtio_transport_free_pkt(pkt);
>> - break;
>> + sg_init_one(&buf, pkt->buf, buf_len);
>> + sgs[1] = &buf;
>> + ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
>> + if (ret) {
>> + virtio_transport_free_pkt(pkt);
>> + break;
>> + }
>> }
>> vsock->rx_buf_nr++;
>> } while (vq->num_free);
>> @@ -588,6 +614,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
>> if (ret < 0)
>> goto out_vqs;
>>
>> + if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_MRG_RXBUF))
>> + vsock->mergeable = true;
>> +
>> vsock->rx_buf_nr = 0;
>> vsock->rx_buf_max_nr = 0;
>> atomic_set(&vsock->queued_replies, 0);
>> @@ -640,8 +669,12 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
>> vdev->config->reset(vdev);
>>
>> mutex_lock(&vsock->rx_lock);
>> - while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX])))
>> - virtio_transport_free_pkt(pkt);
>> + while ((pkt = virtqueue_detach_unused_buf(vsock->vqs[VSOCK_VQ_RX]))) {
>> + if (vsock->mergeable)
>> + free_page((unsigned long)(void *)pkt);
>> + else
>> + virtio_transport_free_pkt(pkt);
>> + }
>> mutex_unlock(&vsock->rx_lock);
>>
>> mutex_lock(&vsock->tx_lock);
>> @@ -683,6 +716,7 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
>> };
>>
>> static unsigned int features[] = {
>> + VIRTIO_VSOCK_F_MRG_RXBUF,
>> };
>>
>> static struct virtio_driver virtio_vsock_driver = {
>
> .
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
* [PATCH net] net/ipv6: Move anycast init/cleanup functions out of CONFIG_PROC_FS
From: Jeff Barnhill @ 2018-11-05 20:36 UTC (permalink / raw)
To: netdev; +Cc: davem, kuznet, yoshfuji, Jeff Barnhill
Move the anycast.c init and cleanup functions out of the #ifdef
CONFIG_PROC_FS block, where they were inadvertently added.
Fixes: 2384d02520ff ("net/ipv6: Add anycast addresses to a global hashtable")
Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
---
net/ipv6/anycast.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 7698637cf827..94999058e110 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -590,6 +590,7 @@ void ac6_proc_exit(struct net *net)
{
remove_proc_entry("anycast6", net->proc_net);
}
+#endif
/* Init / cleanup code
*/
@@ -611,4 +612,3 @@ void ipv6_anycast_cleanup(void)
WARN_ON(!hlist_empty(&inet6_acaddr_lst[i]));
spin_unlock(&acaddr_hash_lock);
}
-#endif
--
2.14.1
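For illustration only, a tiny sketch of why the position of `#endif` matters here. `DEMO_PROC_FS` is a hypothetical stand-in for CONFIG_PROC_FS and is left undefined to model a build without procfs; the `demo_*` functions are likewise made up. Because the guard closes before the init/cleanup code, that code is compiled either way.

```c
#include <assert.h>

/* DEMO_PROC_FS is a hypothetical stand-in for CONFIG_PROC_FS. */
#ifdef DEMO_PROC_FS
static int demo_proc_init(void)
{
	return 1;	/* would register the /proc entry */
}
#endif	/* the guard ends here, before the unconditional code */

/* Init / cleanup code: needed with or without DEMO_PROC_FS. */
static int demo_table_ready;

static int demo_anycast_init(void)
{
	demo_table_ready = 1;
	return 0;
}

static void demo_anycast_cleanup(void)
{
	assert(demo_table_ready);
	demo_table_ready = 0;
}
```

If the `#endif` were moved to the end of the file instead, a build without `DEMO_PROC_FS` would lose `demo_anycast_init()` and `demo_anycast_cleanup()` entirely.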
* Re: [PATCH 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: jiangyiwen @ 2018-11-06 5:53 UTC (permalink / raw)
To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <229559d5-1787-09b1-6c26-57b535f20006@redhat.com>
On 2018/11/6 11:32, Jason Wang wrote:
>
> On 2018/11/6 上午11:17, jiangyiwen wrote:
>> On 2018/11/6 10:41, Jason Wang wrote:
>>> On 2018/11/6 上午10:17, jiangyiwen wrote:
>>>> On 2018/11/5 17:21, Jason Wang wrote:
>>>>> On 2018/11/5 下午3:43, jiangyiwen wrote:
>>>>>> Currently vsock only supports sending and receiving small packets, so it
>>>>>> can't achieve high performance. As previously discussed with Jason Wang,
>>>>>> I revisited the mergeable rx buffer idea from vhost-net and implemented
>>>>>> it in vhost-vsock. It allows a big packet to be scattered
>>>>>> into different buffers and improves performance noticeably.
>>>>>>
>>>>>> I wrote a tool to test vhost-vsock performance, mainly sending big
>>>>>> packets (64K) in both directions, Guest->Host and Host->Guest. The
>>>>>> results are as follows:
>>>>>>
>>>>>> Before performance:
>>>>>>                Single socket    Multiple sockets (Max Bandwidth)
>>>>>> Guest->Host    ~400MB/s         ~480MB/s
>>>>>> Host->Guest    ~1450MB/s        ~1600MB/s
>>>>>>
>>>>>> After performance:
>>>>>>                Single socket    Multiple sockets (Max Bandwidth)
>>>>>> Guest->Host    ~1700MB/s        ~2900MB/s
>>>>>> Host->Guest    ~1700MB/s        ~2900MB/s
>>>>>>
>>>>>> From the test results, the performance is improved noticeably, and guest
>>>>>> memory will not be wasted.
>>>>> Hi:
>>>>>
>>>>> Thanks for the patches and the numbers are really impressive.
>>>>>
>>>>> But instead of duplicating code between sock and net, I was considering using virtio-net as a transport for vsock. Then we may have all the existing features like batching, mergeable rx buffers and multiqueue. Would you consider this idea? Thoughts?
>>>>>
>>>>>
>>>> Hi Jason,
>>>>
>>>> I am not very familiar with virtio-net, so I am afraid I can't give
>>>> much useful advice. I do have several questions:
>>>>
>>>> 1. If virtio-net is used as a transport, the guest should see a virtio-net
>>>> device instead of a virtio-vsock device, right? Is vsock then only a
>>>> transport between the socket and the net_device? Users should still use
>>>> the AF_VSOCK type to create a socket, right?
>>>
>>> Well, there are many choices. What you need is just to keep the socket API and hide the implementation. For example, you can keep the vsock device in the guest and switch to using vhost-net in the host. We probably need a new feature bit or header to let vhost know we are passing a vsock packet. And vhost-net could forward the packet to the vsock core on the host.
>>>
>>>
>>>> 2. I want to know whether work on this idea has already started, and what
>>>> the current progress is.
>>>
>>> Not yet started. I just want to hear from the community. If this sounds good, are you interested in implementing it?
>>>
>>>
>>>> 3. And what is Stefan's opinion?
>>>
>>> I talked with Stefan a little about this during KVM Forum. I think he tends to agree with this idea. Anyway, let's wait for his reply.
>>>
>>>
>>> Thanks
>>>
>>>
>> Hi Jason,
>>
>> Thanks for your reply. What you want is to avoid duplicate code and still
>> use the existing features of virtio-net.
>
>
> Yes, technically we can use the virtio-net driver in the guest as well, but we could do it step by step.
>
>
>> Yes, if this sounds good and most people agree with this idea, I am very
>> happy to implement this.
>
>
> Cool, thanks.
>
>
>>
>> In addition, I hope you can review these patches before the new idea is
>> implemented, after all the performance can be improved. :-)
>
>
> Ok.
>
>
> So the patch actually did three things:
>
> - mergeable buffer implementation
>
> - increase the default rx buffer size
>
> - add used and signal guest in a batch
>
> It would be helpful if you could measure the performance improvement of each part independently. This would give reviewers a better understanding of how much each part helped.
>
> Thanks
>
>
Great, I will test the performance of each part independently in a later version.
Thanks,
Yiwen.
* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Jesper Dangaard Brouer @ 2018-11-05 20:17 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: David Ahern, netdev, Yoel Caspersen, brouer
In-Reply-To: <394a0bf2-fa97-1085-2eda-98ddf476895c@itcare.pl>
On Sun, 4 Nov 2018 01:24:03 +0100 Paweł Staszewski <pstaszewski@itcare.pl> wrote:
> And today, again after applying the patch for the page allocator, I reached
> 64/64 Gbit/s
>
> with only 50-60% cpu load
Great.
> today no slowpath hit for networking :)
>
> But again dropped packets at 64Gbit RX and 64Gbit TX ....
> And as it should not be a PCI Express limit, I think something more is
Well, this does sound like a PCIe bandwidth limit to me.
See the PCIe BW here: https://en.wikipedia.org/wiki/PCI_Express
You likely have PCIe v3, where one lane has 984.6 MBytes/s or 7.87 Gbit/s.
Thus, x16 lanes have 15.75 GBytes/s or 126 Gbit/s. It does say "in each
direction", but you are also forwarding this RX->TX on both ports of a
dual-port NIC that shares the same PCIe slot.
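The arithmetic behind those figures can be sketched as follows. The sketch assumes the PCIe 3.0 raw rate of 8 GT/s per lane and 128b/130b line encoding, and it ignores protocol overhead such as TLP headers, so real usable throughput is lower. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Theoretical PCIe 3.0 link bandwidth in Gbit/s, per direction:
 * 8 GT/s raw per lane, scaled by the 128b/130b encoding efficiency.
 * Divide the result by 8 to get GBytes/s.
 */
static double pcie3_gbit_per_s(int lanes)
{
	const double gt_per_s = 8.0;		/* raw transfers per lane */
	const double encoding = 128.0 / 130.0;	/* 128b/130b efficiency */

	assert(lanes > 0);
	return gt_per_s * encoding * lanes;
}
```

For one lane this gives roughly 7.88 Gbit/s (about 984.6 MBytes/s), and for x16 roughly 126 Gbit/s (about 15.75 GBytes/s), matching the numbers quoted above.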
> going on there, and it is hard to catch, because perf top doesn't change
> besides there being no queued slowpath hit now
>
> I have now also ordered Intel cards to compare, but with a 3 week ETA.
> Sooner, in 3 days, I will have a Mellanox ConnectX-5, so I can
> separate traffic onto two different x16 PCIe buses
I do think you need to separate traffic to two different x16 PCIe
slots. I have found that the ConnectX-5 has significantly better
packet-per-sec performance than the ConnectX-4, but that is not your
use-case (max BW). I've not tested these NICs for maximum
_bidirectional_ bandwidth limits, I've only made sure I can do 100G
unidirectional, which can hit some funny motherboard memory limits
(remember to equip the motherboard with 4 RAM blocks for full memory BW).
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* [PATCH v2 2/2] mm/page_alloc: use a single function to free page
From: Aaron Lu @ 2018-11-06 5:30 UTC (permalink / raw)
To: linux-mm, linux-kernel, netdev
Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
Dave Hansen, Alexander Duyck
In-Reply-To: <20181105085820.6341-2-aaron.lu@intel.com>
We have multiple places that free a page, most of them doing similar
things, and a common function can be used to reduce code duplication.
It also avoids a bug being fixed in one function but left in another.
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: move comments close to code as suggested by Dave.
mm/page_alloc.c | 36 ++++++++++++++++--------------------
1 file changed, 16 insertions(+), 20 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a9a6af41a2..4faf6b7bf225 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
}
EXPORT_SYMBOL(get_zeroed_page);
-void __free_pages(struct page *page, unsigned int order)
+static inline void free_the_page(struct page *page, unsigned int order, int nr)
{
- if (put_page_testzero(page)) {
+ VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+ /*
+ * Free a page by reducing its ref count by @nr.
+ * If its refcount reaches 0, then according to its order:
+ * order0: send to PCP;
+ * high order: directly send to Buddy.
+ */
+ if (page_ref_sub_and_test(page, nr)) {
if (order == 0)
free_unref_page(page);
else
@@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
}
}
+void __free_pages(struct page *page, unsigned int order)
+{
+ free_the_page(page, order, 1);
+}
EXPORT_SYMBOL(__free_pages);
void free_pages(unsigned long addr, unsigned int order)
@@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
void __page_frag_cache_drain(struct page *page, unsigned int count)
{
- VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-
- if (page_ref_sub_and_test(page, count)) {
- unsigned int order = compound_order(page);
-
- if (order == 0)
- free_unref_page(page);
- else
- __free_pages_ok(page, order);
- }
+ free_the_page(page, compound_order(page), count);
}
EXPORT_SYMBOL(__page_frag_cache_drain);
@@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);
- if (unlikely(put_page_testzero(page))) {
- unsigned int order = compound_order(page);
-
- if (order == 0)
- free_unref_page(page);
- else
- __free_pages_ok(page, order);
- }
+ free_the_page(page, compound_order(page), 1);
}
EXPORT_SYMBOL(page_frag_free);
--
2.17.2
* [PATCH v2 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
From: Aaron Lu @ 2018-11-06 5:28 UTC (permalink / raw)
To: linux-mm, linux-kernel, netdev
Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
Dave Hansen, Alexander Duyck
In-Reply-To: <20181105085820.6341-1-aaron.lu@intel.com>
page_frag_free() calls __free_pages_ok() to free the page back to
Buddy. This is OK for high order pages, but for order-0 pages, it
misses the optimization opportunity of using Per-Cpu-Pages and can
cause zone lock contention when called frequently.
Paweł Staszewski recently shared his result of 'how Linux kernel
handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
found the lock contention comes from page allocator:
mlx5e_poll_tx_cq
|
--16.34%--napi_consume_skb
|
|--12.65%--__free_pages_ok
| |
| --11.86%--free_one_page
| |
| |--10.10%--queued_spin_lock_slowpath
| |
| --0.65%--_raw_spin_lock
|
|--1.55%--page_frag_free
|
--1.44%--skb_release_data
Jesper explained how it happened: the mlx5 driver's RX-page recycle
mechanism is not effective in this workload, so pages have to go
through the page allocator. The lock contention happens during the
mlx5 DMA TX completion cycle, and the page allocator cannot keep
up at these speeds.[2]
I had assumed that __free_pages_ok() was mostly freeing high-order
pages and that this was lock contention on high-order pages, but
Jesper explained in detail that __free_pages_ok() here is actually
freeing order-0 pages, because mlx5 uses order-0 pages to satisfy
its page pool allocation requests.[3]
The free path as pointed out by Jesper is:
skb_free_head()
-> skb_free_frag()
-> page_frag_free()
And the pages being freed on this path are order-0 pages.
Fix this by doing the same thing as __page_frag_cache_drain() -
send the page being freed to the PCP if it is an order-0 page, or
directly to Buddy if it is a high-order page.
With this change, Paweł has not seen lock contention in his workload
so far, and Jesper measured a 7% performance improvement in a
micro-benchmark, with the lock contention gone. Ilias' test on a
'low' speed 1Gbit interface on a Cortex-A53 shows a ~11% performance
boost with 64-byte packets, and __free_pages_ok() disappeared from
perf top.
[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html
Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: only changelog changes:
- remove the duplicated skb_free_frag() as pointed out by Jesper;
- add Ilias' test result;
- add people's ack/test tag.
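For readers skimming the changelog, the dispatch rule it describes can be
modelled in plain userspace C. This is only an illustrative sketch, not
kernel code: struct model_page, model_put_page_testzero(),
model_page_frag_free() and the two counters are made-up names. The counters
stand in for free_unref_page() (the PCP path) and __free_pages_ok() (the
Buddy path).

```c
#include <assert.h>

/* Userspace stand-in for struct page: only what the sketch needs. */
struct model_page {
	int refcount;
	unsigned int order;
};

/* Counters standing in for the two free paths (hypothetical names). */
static unsigned int pcp_frees;   /* models free_unref_page() */
static unsigned int buddy_frees; /* models __free_pages_ok() */

/* Drop one reference; return nonzero if it was the last one. */
static int model_put_page_testzero(struct model_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
	return page->refcount == 0;
}

/* Order-0 pages go to the per-CPU lists, anything larger to Buddy. */
static void model_page_frag_free(struct model_page *page)
{
	if (model_put_page_testzero(page)) {
		if (page->order == 0)
			pcp_frees++;
		else
			buddy_frees++;
	}
}
```

A page that still has other references is left alone; only the final put
reaches either free path, which is why the order check sits inside the
refcount test.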
mm/page_alloc.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..91a9a6af41a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);
- if (unlikely(put_page_testzero(page)))
- __free_pages_ok(page, compound_order(page));
+ if (unlikely(put_page_testzero(page))) {
+ unsigned int order = compound_order(page);
+
+ if (order == 0)
+ free_unref_page(page);
+ else
+ __free_pages_ok(page, order);
+ }
}
EXPORT_SYMBOL(page_frag_free);
--
2.17.2
^ permalink raw reply related
* Can NFS work with VRF?
From: Ben Greear @ 2018-11-05 19:42 UTC (permalink / raw)
To: netdev
Hello,
I was trying to improve my old series of patches that binds NFS to
a particular source IP address so that it could work with VRF in a 4.16
kernel. But, it seems a huge tangle to try to make NFS (and rpc, etc) able to bind to
a local netdevice, which I think is what would be needed to make it work with VRF.
Has anyone already worked on VRF support for NFS?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply
* Re: [RFC PATCH] net/mlx5e: Temp software stats variable is not required
From: David Miller @ 2018-11-05 19:27 UTC (permalink / raw)
To: saeedm; +Cc: jgg, netdev, eric.dumazet, eranbe, leonro, arnd, akpm
In-Reply-To: <002da677314734dedd31be1e1b199dd4f1ae8457.camel@mellanox.com>
From: Saeed Mahameed <saeedm@mellanox.com>
Date: Mon, 5 Nov 2018 19:13:59 +0000
> On Sat, 2018-11-03 at 19:36 -0700, David Miller wrote:
>> From: Saeed Mahameed <saeedm@mellanox.com>
>> Date: Fri, 2 Nov 2018 18:54:22 -0700
>>
>> > +static void mlx5e_fold_sw_stats(struct mlx5e_priv *priv, struct
>> > rtnl_link_stats64 *s)
>> > +{
>> > + int i;
>> > +
>> > + /* not required ? */
>> > + memset(s, 0, sizeof(*s));
>>
>> Why wouldn't this be required?
>>
>
> I just checked it is already done by the stack @dev_get_stats()
Then please document this in the commit message or similar.
^ permalink raw reply
* Re: [PATCH v3] arm64: dts: stratix10: fix multicast filtering
From: Dinh Nguyen @ 2018-11-05 19:14 UTC (permalink / raw)
To: Aaro Koskinen, Thor Thayer, Rob Herring, Mark Rutland
Cc: devicetree, linux-kernel, netdev, Aaro Koskinen
In-Reply-To: <20181102191048.22657-1-aaro.koskinen@iki.fi>
On 11/2/18 2:10 PM, Aaro Koskinen wrote:
> From: Aaro Koskinen <aaro.koskinen@nokia.com>
>
> On Stratix 10, the EMAC has 256 hash buckets for multicast filtering. This
> needs to be specified in DTS, otherwise the stmmac driver defaults to 64
> buckets and initializes the filter incorrectly. As a result, e.g. valid
> IPv6 multicast traffic ends up being dropped.
>
> Fixes: 78cd6a9d8e15 ("arm64: dts: Add base stratix 10 dtsi")
> Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
> ---
> arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 3 +++
> 1 file changed, 3 insertions(+)
>
Applied!
Thanks,
Dinh
^ permalink raw reply
* Re: [RFC PATCH] net/mlx5e: Temp software stats variable is not required
From: Saeed Mahameed @ 2018-11-05 19:13 UTC (permalink / raw)
To: davem@davemloft.net
Cc: Jason Gunthorpe, netdev@vger.kernel.org, eric.dumazet@gmail.com,
Eran Ben Elisha, Leon Romanovsky, arnd@arndb.de,
akpm@linux-foundation.org
In-Reply-To: <20181103.193617.810293775666516890.davem@davemloft.net>
On Sat, 2018-11-03 at 19:36 -0700, David Miller wrote:
> From: Saeed Mahameed <saeedm@mellanox.com>
> Date: Fri, 2 Nov 2018 18:54:22 -0700
>
> > +static void mlx5e_fold_sw_stats(struct mlx5e_priv *priv, struct
> > rtnl_link_stats64 *s)
> > +{
> > + int i;
> > +
> > + /* not required ? */
> > + memset(s, 0, sizeof(*s));
>
> Why wouldn't this be required?
>
I just checked: the stack already does this in dev_get_stats().
> I can see that perhaps you can only zero out the statistics that are
> used in
>
> the ndo_get_stats64() code path, but that's different.
mlx5e_fold_sw_stats() can only be called from ndo_get_stats64().
The "s" pointer I am trying to zero out here is the same pointer we
receive from ndo_get_stats64().
^ permalink raw reply
* Re: [PATCH 3/5] VSOCK: support receive mergeable rx buffer in guest
From: Jason Wang @ 2018-11-06 4:00 UTC (permalink / raw)
To: jiangyiwen, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <5BDFF57C.5020106@huawei.com>
On 2018/11/5 3:47 PM, jiangyiwen wrote:
> Guest receive mergeable rx buffer, it can merge
> scatter rx buffer into a big buffer and then copy
> to user space.
>
> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
> ---
> include/linux/virtio_vsock.h | 9 ++++
> net/vmw_vsock/virtio_transport.c | 75 +++++++++++++++++++++++++++++----
> net/vmw_vsock/virtio_transport_common.c | 59 ++++++++++++++++++++++----
> 3 files changed, 127 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index da9e1fe..6be3cd7 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -13,6 +13,8 @@
> #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
> #define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
> #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
> +/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
> +#define VIRTIO_VSOCK_MAX_MRG_BUF_NUM ((VIRTIO_VSOCK_MAX_PKT_BUF_SIZE / PAGE_SIZE) + 1)
>
> /* Virtio-vsock feature */
> #define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
> @@ -48,6 +50,11 @@ struct virtio_vsock_sock {
> struct list_head rx_queue;
> };
>
> +struct virtio_vsock_mrg_rxbuf {
> + void *buf;
> + u32 len;
> +};
> +
> struct virtio_vsock_pkt {
> struct virtio_vsock_hdr hdr;
> struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
> @@ -59,6 +66,8 @@ struct virtio_vsock_pkt {
> u32 len;
> u32 off;
> bool reply;
> + bool mergeable;
> + struct virtio_vsock_mrg_rxbuf mrg_rxbuf[VIRTIO_VSOCK_MAX_MRG_BUF_NUM];
> };
I think it's better to use an iov here and drop buf completely.
This would also be better done in a separate patch.
>
> struct virtio_vsock_pkt_info {
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 2040a9e..3557ad3 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -359,11 +359,62 @@ static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
> return val < virtqueue_get_vring_size(vq);
> }
>
> +static struct virtio_vsock_pkt *receive_mergeable(struct virtqueue *vq,
> + struct virtio_vsock *vsock, unsigned int *total_len)
> +{
> + struct virtio_vsock_pkt *pkt;
> + u16 num_buf;
> + void *page;
> + unsigned int len;
> + int i = 0;
> +
> + page = virtqueue_get_buf(vq, &len);
> + if (!page)
> + return NULL;
> +
> + *total_len = len;
> + vsock->rx_buf_nr--;
> +
> + pkt = page;
> + num_buf = le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers);
> + if (!num_buf || num_buf > VIRTIO_VSOCK_MAX_MRG_BUF_NUM)
> + goto err;
> +
> + pkt->mergeable = true;
> + if (!le32_to_cpu(pkt->hdr.len))
> + return pkt;
> +
> + len -= sizeof(struct virtio_vsock_pkt);
> + pkt->mrg_rxbuf[i].buf = page + sizeof(struct virtio_vsock_pkt);
> + pkt->mrg_rxbuf[i].len = len;
> + i++;
> +
> + while (--num_buf) {
> + page = virtqueue_get_buf(vq, &len);
> + if (!page)
> + goto err;
> +
> + *total_len += len;
> + vsock->rx_buf_nr--;
> +
> + pkt->mrg_rxbuf[i].buf = page;
> + pkt->mrg_rxbuf[i].len = len;
> + i++;
> + }
> +
> + return pkt;
> +err:
> + virtio_transport_free_pkt(pkt);
> + return NULL;
> +}
Similar logic can be found in the virtio-net driver.
I haven't thought about this deeply, but it looks to me like using the
virtio-net driver is also possible, e.g. for the data path, just register
vsock-specific callbacks.
> +
> static void virtio_transport_rx_work(struct work_struct *work)
> {
> struct virtio_vsock *vsock =
> container_of(work, struct virtio_vsock, rx_work);
> struct virtqueue *vq;
> + size_t vsock_hlen = vsock->mergeable ? sizeof(struct virtio_vsock_pkt) :
> + sizeof(struct virtio_vsock_hdr);
>
> vq = vsock->vqs[VSOCK_VQ_RX];
>
> @@ -383,21 +434,29 @@ static void virtio_transport_rx_work(struct work_struct *work)
> goto out;
> }
>
> - pkt = virtqueue_get_buf(vq, &len);
> - if (!pkt) {
> - break;
> - }
> + if (likely(vsock->mergeable)) {
> + pkt = receive_mergeable(vq, vsock, &len);
> + if (!pkt)
> + break;
>
> - vsock->rx_buf_nr--;
> + pkt->len = le32_to_cpu(pkt->hdr.len);
> + } else {
> + pkt = virtqueue_get_buf(vq, &len);
> + if (!pkt) {
> + break;
> + }
> +
> + vsock->rx_buf_nr--;
> + }
>
> /* Drop short/long packets */
> - if (unlikely(len < sizeof(pkt->hdr) ||
> - len > sizeof(pkt->hdr) + pkt->len)) {
> + if (unlikely(len < vsock_hlen ||
> + len > vsock_hlen + pkt->len)) {
> virtio_transport_free_pkt(pkt);
> continue;
> }
>
> - pkt->len = len - sizeof(pkt->hdr);
> + pkt->len = len - vsock_hlen;
> virtio_transport_deliver_tap_pkt(pkt);
> virtio_transport_recv_pkt(pkt);
> }
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 3ae3a33..7bef1d5 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -272,14 +272,49 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
> - err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
> - if (err)
> - goto out;
> + if (pkt->mergeable) {
> + struct virtio_vsock_mrg_rxbuf *buf = pkt->mrg_rxbuf;
> + size_t mrg_copy_bytes, last_buf_total = 0, rxbuf_off;
> + size_t tmp_bytes = bytes;
> + int i;
> +
> + for (i = 0; i < le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers); i++) {
> + if (pkt->off > last_buf_total + buf[i].len) {
> + last_buf_total += buf[i].len;
> + continue;
> + }
> +
> + rxbuf_off = pkt->off - last_buf_total;
> + mrg_copy_bytes = min(buf[i].len - rxbuf_off, tmp_bytes);
> + err = memcpy_to_msg(msg, buf[i].buf + rxbuf_off, mrg_copy_bytes);
> + if (err)
> + goto out;
> +
> + tmp_bytes -= mrg_copy_bytes;
> + pkt->off += mrg_copy_bytes;
> + last_buf_total += buf[i].len;
> + if (!tmp_bytes)
> + break;
> + }
After switching to iov, I believe you can use the iov_iter helpers to
avoid the open-coding above.
You can also drop the if (mergeable) condition.
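To make the suggestion concrete, here is a userspace analogy of the walk
that the quoted loop open-codes: copy a byte range out of a packet whose
payload is scattered over several buffers. It is a sketch under
assumptions, using the userspace struct iovec from <sys/uio.h>, not the
kernel iov_iter API; copy_from_iov() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy up to `bytes` bytes into dst, starting `off` bytes into the
 * logical concatenation of iov[0..iovcnt). Returns the bytes copied.
 */
static size_t copy_from_iov(void *dst, const struct iovec *iov, int iovcnt,
			    size_t off, size_t bytes)
{
	size_t copied = 0;
	int i;

	assert(dst && iov);

	for (i = 0; i < iovcnt && copied < bytes; i++) {
		size_t chunk;

		/* Skip segments that lie entirely before the offset. */
		if (off >= iov[i].iov_len) {
			off -= iov[i].iov_len;
			continue;
		}

		/* Take the rest of this segment, capped by what is left. */
		chunk = iov[i].iov_len - off;
		if (chunk > bytes - copied)
			chunk = bytes - copied;

		memcpy((char *)dst + copied,
		       (const char *)iov[i].iov_base + off, chunk);
		copied += chunk;
		off = 0; /* later segments are read from their start */
	}

	return copied;
}
```

With a single walker like this, the non-mergeable case is just an iov with
one segment, so there is no need for a separate branch.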
Thanks
> +
> + if (tmp_bytes) {
> + printk(KERN_WARNING "WARNING! bytes = %llu, "
> + "bytes = %llu\n",
> + (unsigned long long)bytes,
> + (unsigned long long)tmp_bytes);
> + }
> +
> + total += (bytes - tmp_bytes);
> + } else {
> + err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
> + if (err)
> + goto out;
> +
> + total += bytes;
> + pkt->off += bytes;
> + }
>
> spin_lock_bh(&vvs->rx_lock);
> -
> - total += bytes;
> - pkt->off += bytes;
> if (pkt->off == pkt->len) {
> virtio_transport_dec_rx_pkt(vvs, pkt);
> list_del(&pkt->list);
> @@ -1050,8 +1085,16 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
>
> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> {
> - kfree(pkt->buf);
> - kfree(pkt);
> + int i;
> +
> + if (pkt->mergeable) {
> + for (i = 1; i < le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers); i++)
> + free_page((unsigned long)pkt->mrg_rxbuf[i].buf);
> + free_page((unsigned long)(void *)pkt);
> + } else {
> + kfree(pkt->buf);
> + kfree(pkt);
> + }
> }
> EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Rules for retransmitting sk_buffs?
From: John Ousterhout @ 2018-11-05 18:34 UTC (permalink / raw)
To: netdev
I am creating a kernel module that implements the Homa transport
protocol (see paper in SIGCOMM 2018) and as a Linux kernel newbie I'm
struggling a bit to figure out how all of Linux's network plumbing
works.
I'm currently having problems retransmitting an sk_buff after packet
loss and hoping that perhaps someone here can help me understand the
rules and/or constraints around retransmission. Pointers to any
existing documentation would also be great.
I'm currently using the naive approach: Homa saves a reference to the
sk_buff after it is first transmitted, and if retransmission is
necessary it calls ip_queue_xmit again with the same sk_buff (it also
reuses the same flowi and dst as in the first call). The behavior I'm
seeing is very strange: the second call to ip_queue_xmit is modifying
the flowi so that its daddr is 127.0.0.1. This then results in an
ICMP_DEST_UNREACH error.
Am I doing something fundamentally wrong here? E.g., do I need to
clone the sk_buff before retransmitting it? If so, are there any
restrictions on *when* I clone it (I'd prefer not to do this unless
retransmission is necessary, just to save work).
Thanks in advance for any advice/pointers.
-John-
^ permalink raw reply
* Re: [PATCH] sock_diag: fix autoloading of the raw_diag module
From: Cyrill Gorcunov @ 2018-11-05 18:22 UTC (permalink / raw)
To: Andrei Vagin; +Cc: David S. Miller, netdev, Xin Long
In-Reply-To: <20181105071234.GC2869@uranus.lan>
On Mon, Nov 05, 2018 at 10:12:34AM +0300, Cyrill Gorcunov wrote:
>
> Andrew, looking into kernel code I wonder, maybe we should simply
> add this protocol into inet_protos during net/ipv4/af_inet.c:inet_init?
> It will require to add netns_ok into raw_prot of course.
After spending some time on this idea I think your patch is better,
since it is small and suitable for a -stable fix (because the former
issue is definitely changing the kernel behaviour, the fix should
be passed to -stable as well).
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
^ permalink raw reply
* Re: [PATCH 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: jiangyiwen @ 2018-11-06 3:17 UTC (permalink / raw)
To: Jason Wang, stefanha; +Cc: netdev, kvm, virtualization
In-Reply-To: <b4ce06c8-1c4b-724f-52e8-cb8d7b23fd27@redhat.com>
On 2018/11/6 10:41, Jason Wang wrote:
>
> On 2018/11/6 10:17 AM, jiangyiwen wrote:
>> On 2018/11/5 17:21, Jason Wang wrote:
>>> On 2018/11/5 3:43 PM, jiangyiwen wrote:
>>>> Now vsock only support send/receive small packet, it can't achieve
>>>> high performance. As previous discussed with Jason Wang, I revisit the
>>>> idea of vhost-net about mergeable rx buffer and implement the mergeable
>>>> rx buffer in vhost-vsock, it can allow big packet to be scattered in
>>>> into different buffers and improve performance obviously.
>>>>
>>>> I write a tool to test the vhost-vsock performance, mainly send big
>>>> packet(64K) included guest->Host and Host->Guest. The result as
>>>> follows:
>>>>
>>>> Before performance:
>>>> Single socket Multiple sockets(Max Bandwidth)
>>>> Guest->Host ~400MB/s ~480MB/s
>>>> Host->Guest ~1450MB/s ~1600MB/s
>>>>
>>>> After performance:
>>>> Single socket Multiple sockets(Max Bandwidth)
>>>> Guest->Host ~1700MB/s ~2900MB/s
>>>> Host->Guest ~1700MB/s ~2900MB/s
>>>>
>>>> From the test results, the performance is improved obviously, and guest
>>>> memory will not be wasted.
>>> Hi:
>>>
>>> Thanks for the patches and the numbers are really impressive.
>>>
>>> But instead of duplicating codes between sock and net. I was considering to use virtio-net as a transport of vsock. Then we may have all existed features likes batching, mergeable rx buffers and multiqueue. Want to consider this idea? Thoughts?
>>>
>>>
>> Hi Jason,
>>
>> I am not very familiar with virtio-net, so I am afraid I can't give too
>> much effective advice. Then I have several problems:
>>
>> 1. If use virtio-net as a transport, guest should see a virtio-net
>> device instead of virtio-vsock device, right? Is vsock only as a
>> transport between socket and net_device? User should still use
>> AF_VSOCK type to create socket, right?
>
>
> Well, there're many choices. What you need is just to keep the socket API and hide the implementation. For example, you can keep the vosck device in guest and switch to use vhost-net in host. We probably need a new feature bit or header to let vhost know we are passing vsock packet. And vhost-net could forward the packet to vsock core on host.
>
>
>>
>> 2. I want to know if this idea has already started, and how is
>> the current progress?
>
>
> Not yet started. Just want to listen from the community. If this sounds good, do you have interest in implementing this?
>
>
>>
>> 3. And what is stefan's idea?
>
>
> Talk with Stefan a little on this during KVM Forum. I think he tends to agree on this idea. Anyway, let's wait for his reply.
>
>
> Thanks
>
>
Hi Jason,
Thanks for your reply. What you want is to avoid duplicated code while
still using the existing virtio-net features.
Yes, if this sounds good and most people agree with the idea, I am very
happy to implement it.
In addition, I hope you can review these patches before the new idea is
implemented; after all, they do improve performance. :-)
Thanks,
Yiwen.
>>
>> Thanks,
>> Yiwen.
>>
>
> .
>
^ permalink raw reply
* Re: [PATCH 1/1] vhost: add per-vq worker thread
From: Vitaly Mayatskih @ 2018-11-06 2:59 UTC (permalink / raw)
To: Jason Wang; +Cc: Michael S . Tsirkin, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <617e0c54-7d11-09b6-21e5-968260107872@redhat.com>
On Mon, Nov 5, 2018 at 9:49 PM Jason Wang <jasowang@redhat.com> wrote:
> If you allow device to specify the worker itself, you can do any kinds
> of mapping bettween work and worker kthread I think. The advantage of
> doing this is that you can keep the vhost-net untouched. This makes
> things a little bit easier and proving two kthreads is better than one
> for -net workload is probably not as easy as it looks. We may get boost
> in some cases but degradation for the rest.
Sounds good. One worker is created by vhost as it is today; all the
others come in from userspace via ioctl.
--
wbr, Vitaly
^ permalink raw reply
* Re: [PATCH 1/1] vhost: add per-vq worker thread
From: Jason Wang @ 2018-11-06 2:48 UTC (permalink / raw)
To: Vitaly Mayatskih
Cc: Michael S . Tsirkin, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <CAGF4SLgDk+48aLKHhA_ZgRc6D30tGdnB89b5m5bZKwzyoDb0dQ@mail.gmail.com>
On 2018/11/5 11:28 AM, Vitaly Mayatskih wrote:
> On Sun, Nov 4, 2018 at 9:53 PM Jason Wang <jasowang@redhat.com> wrote:
>
>> I wonder whether or not it's better to allow the device to specific the
>> worker here instead of forcing a per vq worker model. Then we can keep
>> the behavior of exist implementation and do optimization on top?
> I was thinking about that too, but for the sake of simplicity it
> sounds valid that if the user wanted 8 parallel queues for the disk,
> they better be parallel, i.e. worker per queue. The rest of disks that
> don't need high-performance, can have 1 queue specified.
>
If you allow the device to specify the worker itself, you can do any kind
of mapping between work and worker kthread, I think. The advantage of
doing this is that you can keep vhost-net untouched. This makes
things a little easier, and proving that two kthreads are better than one
for the -net workload is probably not as easy as it looks. We may get a
boost in some cases but degradation in the rest.
Thanks
^ permalink raw reply
* [PATCH V2 6/7] net: maclorawan: Implement maclorawan class module
From: Jian-Hong Pan @ 2018-11-05 16:55 UTC (permalink / raw)
To: Andreas Färber
Cc: netdev, linux-arm-kernel, linux-kernel, Marcel Holtmann,
David S . Miller, Dollar Chen, Ken Yu, linux-wpan, Stefan Schmidt,
Jian-Hong Pan
In-Reply-To: <fc737f3940bbe91341fb15d85ac11931eb56d1fc.1535039998.git.starnight@g.ncu.edu.tw>
LoRaWAN defined by LoRa Alliance(TM) is the MAC layer over LoRa devices.
This patch implements part of Class A end-devices SoftMAC defined in
LoRaWAN(TM) Specification Ver. 1.0.2:
1. End-device receive slot timing
2. Only single channel and single data rate for now
3. Unconfirmed data up/down message types
On the other side, it defines the basic interface and operation
functions for compatible LoRa device drivers.
Signed-off-by: Jian-Hong Pan <starnight@g.ncu.edu.tw>
---
V2:
- Split the LoRaWAN class module patch in V1 into LoRaWAN socket and
LoRaWAN Soft MAC modules
- Modify for Big/Little-Endian
- Use SPDX license identifiers
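As a reading aid for the frame handling in mac.c, the fixed part of a data
frame header can be sketched in userspace C. The offsets match the ones the
patch reads back (DevAddr at byte 1, FCnt at byte 6), and the MType values
are taken from the LoRaWAN 1.0.2 MHDR definition. Every sketch_* name is
hypothetical and exists only for this illustration; FOpts, FPort, payload
and MIC are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MType occupies MHDR bits 7..5 (values per LoRaWAN 1.0.2). */
enum sketch_mtype {
	SKETCH_UNCONFIRMED_DATA_UP   = 0x2,
	SKETCH_UNCONFIRMED_DATA_DOWN = 0x3,
	SKETCH_CONFIRMED_DATA_UP     = 0x4,
	SKETCH_CONFIRMED_DATA_DOWN   = 0x5,
};

#define SKETCH_FCTRL_ACK 0x20 /* ACK bit in FCtrl */
#define SKETCH_HDR_LEN   8    /* MHDR(1) + DevAddr(4) + FCtrl(1) + FCnt(2) */

/* Write MHDR | DevAddr (LE) | FCtrl | FCnt (LE); return bytes written. */
static size_t sketch_build_hdr(uint8_t *buf, size_t buflen,
			       enum sketch_mtype mtype, uint32_t devaddr,
			       uint16_t fcnt, int ack)
{
	assert(buf && buflen >= SKETCH_HDR_LEN);

	buf[0] = (uint8_t)(mtype << 5);
	buf[1] = (uint8_t)(devaddr & 0xff);
	buf[2] = (uint8_t)((devaddr >> 8) & 0xff);
	buf[3] = (uint8_t)((devaddr >> 16) & 0xff);
	buf[4] = (uint8_t)((devaddr >> 24) & 0xff);
	buf[5] = ack ? SKETCH_FCTRL_ACK : 0;
	buf[6] = (uint8_t)(fcnt & 0xff);
	buf[7] = (uint8_t)(fcnt >> 8);

	return SKETCH_HDR_LEN;
}

static uint8_t sketch_get_mtype(const uint8_t *frame)
{
	return frame[0] >> 5;
}

static uint32_t sketch_get_devaddr(const uint8_t *frame)
{
	return (uint32_t)frame[1] | ((uint32_t)frame[2] << 8) |
	       ((uint32_t)frame[3] << 16) | ((uint32_t)frame[4] << 24);
}

static uint16_t sketch_get_fcnt(const uint8_t *frame)
{
	return (uint16_t)(frame[6] | (frame[7] << 8));
}
```

Writing the multi-byte fields one byte at a time keeps the on-air layout
little-endian regardless of host byte order.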
net/maclorawan/Kconfig | 14 +
net/maclorawan/Makefile | 2 +
net/maclorawan/mac.c | 522 ++++++++++++++++++++++++++++++++++
net/maclorawan/main.c | 600 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 1138 insertions(+)
create mode 100644 net/maclorawan/Kconfig
create mode 100644 net/maclorawan/Makefile
create mode 100644 net/maclorawan/mac.c
create mode 100644 net/maclorawan/main.c
diff --git a/net/maclorawan/Kconfig b/net/maclorawan/Kconfig
new file mode 100644
index 000000000000..177537d5f59f
--- /dev/null
+++ b/net/maclorawan/Kconfig
@@ -0,0 +1,14 @@
+config MACLORAWAN
+ tristate "Generic LoRaWAN Soft Networking Stack (maclorawan)"
+ depends on LORAWAN
+ select CRYPTO
+ select CRYPTO_CMAC
+ select CRYPTO_CBC
+ select CRYPTO_AES
+ ---help---
+ This option enables the hardware independent LoRaWAN
+ networking stack for SoftMAC devices (the ones implementing
+ only PHY level of LoRa standard).
+
+ If you plan to use HardMAC LoRaWAN devices, you can say N
+ here. Alternatively you can say M to compile it as a module.
diff --git a/net/maclorawan/Makefile b/net/maclorawan/Makefile
new file mode 100644
index 000000000000..562831e66c82
--- /dev/null
+++ b/net/maclorawan/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_MACLORAWAN) += maclorawan.o
+maclorawan-objs := main.o mac.o crypto.o
diff --git a/net/maclorawan/mac.c b/net/maclorawan/mac.c
new file mode 100644
index 000000000000..343fe729a883
--- /dev/null
+++ b/net/maclorawan/mac.c
@@ -0,0 +1,522 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later OR BSD-3-Clause */
+/*-
+ * LoRaWAN soft MAC
+ *
+ * Copyright (c) 2018 Jian-Hong, Pan <starnight@g.ncu.edu.tw>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <asm/uaccess.h>
+#include <linux/errno.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/timer.h>
+#include <linux/interrupt.h>
+#include <linux/netdevice.h>
+#include <linux/lora/lorawan.h>
+#include <linux/lora/lorawan_netdev.h>
+
+#include "maclorawan.h"
+#include "crypto.h"
+
+static void rx_timeout_work(struct work_struct *work);
+
+struct lrw_session *
+lrw_alloc_ss(struct lrw_struct *lrw_st)
+{
+ struct lrw_session *ss;
+
+ ss = kzalloc(sizeof(struct lrw_session), GFP_KERNEL);
+ if (!ss)
+ goto lrw_alloc_ss_end;
+
+ ss->lrw_st = lrw_st;
+ ss->devaddr = lrw_st->devaddr;
+ INIT_LIST_HEAD(&ss->entry);
+
+ ss->tx_should_ack = false;
+ ss->retry = 3;
+ spin_lock_init(&ss->state_lock);
+ INIT_WORK(&ss->timeout_work, rx_timeout_work);
+
+lrw_alloc_ss_end:
+ return ss;
+}
+
+void
+lrw_free_ss(struct lrw_session *ss)
+{
+ netdev_dbg(ss->lrw_st->ndev, "%s\n", __func__);
+ if (ss->tx_skb)
+ consume_skb(ss->tx_skb);
+ netdev_dbg(ss->lrw_st->ndev, "%s: free rx skb\n", __func__);
+ if (ss->rx_skb)
+ consume_skb(ss->rx_skb);
+
+ netdev_dbg(ss->lrw_st->ndev, "%s: free ss\n", __func__);
+ kfree(ss);
+}
+
+void
+lrw_del_ss(struct lrw_session *ss)
+{
+ netdev_dbg(ss->lrw_st->ndev, "%s\n", __func__);
+ list_del(&ss->entry);
+ lrw_free_ss(ss);
+}
+
+void
+lrw_del_all_ss(struct lrw_struct *lrw_st)
+{
+ struct lrw_session *ss, *tmp;
+
+ mutex_lock(&lrw_st->ss_list_lock);
+ lrw_st->_cur_ss = NULL;
+ list_for_each_entry_safe(ss, tmp, &lrw_st->ss_list, entry) {
+ del_timer(&ss->timer);
+ lrw_del_ss(ss);
+ }
+ mutex_unlock(&lrw_st->ss_list_lock);
+}
+
+void
+lrw_ready_hw(struct lrw_struct *lrw_st)
+{
+ lrw_st->state = LRW_STATE_IDLE;
+}
+
+int
+lrw_start_hw(struct lrw_struct *lrw_st)
+{
+ int ret = 0;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+ lrw_st->nwks_shash_tfm = lrw_mic_key_setup(lrw_st->nwkskey,
+ LRW_KEY_LEN);
+ lrw_st->nwks_skc_tfm = lrw_encrypt_key_setup(lrw_st->nwkskey,
+ LRW_KEY_LEN);
+ lrw_st->apps_skc_tfm = lrw_encrypt_key_setup(lrw_st->appskey,
+ LRW_KEY_LEN);
+ lrw_st->state = LRW_START;
+ ret = lrw_st->ops->start(&lrw_st->hw);
+ if (!ret)
+ lrw_ready_hw(lrw_st);
+
+ return ret;
+}
+
+void
+lrw_stop_hw(struct lrw_struct *lrw_st)
+{
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+ lrw_st->state = LRW_STOP;
+ netdev_dbg(lrw_st->ndev, "%s: going to stop hardware\n", __func__);
+ lrw_st->ops->stop(&lrw_st->hw);
+
+ netdev_dbg(lrw_st->ndev, "%s: going to kill tasks & flush works", __func__);
+ tasklet_kill(&lrw_st->xmit_task);
+ flush_work(&lrw_st->rx_work);
+
+ netdev_dbg(lrw_st->ndev, "%s: going to delete all session\n", __func__);
+ lrw_del_all_ss(lrw_st);
+
+ netdev_dbg(lrw_st->ndev, "%s: going to free mic tfm\n", __func__);
+ lrw_mic_key_free(lrw_st->nwks_shash_tfm);
+ netdev_dbg(lrw_st->ndev, "%s: going to free nwks tfm\n", __func__);
+ lrw_encrypt_key_free(lrw_st->nwks_skc_tfm);
+ netdev_dbg(lrw_st->ndev, "%s: going to free apps tfm\n", __func__);
+ lrw_encrypt_key_free(lrw_st->apps_skc_tfm);
+}
+
+void
+lrw_prepare_tx_frame(struct lrw_session *ss)
+{
+ struct lrw_struct *lrw_st = ss->lrw_st;
+ struct sk_buff *skb = ss->tx_skb;
+ __le32 le_devaddr = cpu_to_le32(ss->devaddr);
+ u8 mhdr, fctrl, fport;
+ u8 mic[4];
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ mhdr = LRW_UNCONFIRMED_DATA_UP << 5;
+ if ((mhdr & (0x6 << 5)) == (0x4 << 5))
+ ss->tx_should_ack = true;
+
+ fctrl = 0;
+ if (lrw_st->rx_should_ack) {
+ fctrl |= 0x20;
+ lrw_st->rx_should_ack = false;
+ }
+
+ /* Encrypt the plain buffer content */
+ lrw_encrypt_buf(lrw_st->apps_skc_tfm, LRW_UPLINK,
+ ss->devaddr, ss->fcnt_up, skb->data, skb->len);
+
+ /* Push FPort */
+ if (skb->len) {
+ fport = ss->fport;
+ memcpy(skb_push(skb, LRW_FPORT_LEN), &fport, LRW_FPORT_LEN);
+ }
+
+ /* Push FCnt_Up */
+ memcpy(skb_push(skb, 2), &ss->fcnt_up, 2);
+
+ /* Push FCtrl */
+ memcpy(skb_push(skb, 1), &fctrl, 1);
+
+ /* Push DevAddr */
+ memcpy(skb_push(skb, LRW_DEVADDR_LEN), &le_devaddr, LRW_DEVADDR_LEN);
+
+ /* Push MHDR */
+ memcpy(skb_push(skb, LRW_MHDR_LEN), &mhdr, LRW_MHDR_LEN);
+
+ /* Put MIC */
+ lrw_calc_mic(lrw_st->nwks_shash_tfm, LRW_UPLINK,
+ ss->devaddr, ss->fcnt_up, skb->data, skb->len, mic);
+ memcpy(skb_put(skb, LRW_MIC_LEN), mic, LRW_MIC_LEN);
+}
+
+void
+lrw_xmit(unsigned long data)
+{
+ struct lrw_struct *lrw_st = (struct lrw_struct *) data;
+ struct lrw_session *ss = lrw_st->_cur_ss;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+ ss->state = LRW_XMITTING_SS;
+ lrw_st->ops->xmit_async(&lrw_st->hw, ss->tx_skb);
+}
+
+void
+lrw_parse_frame(struct lrw_session *ss, struct sk_buff *skb)
+{
+ struct lrw_fhdr *fhdr = &ss->rx_fhdr;
+ __le16 *p_fcnt;
+
+ pr_debug("%s: %s\n", LORAWAN_MODULE_NAME, __func__);
+
+ /* Get message type */
+ fhdr->mtype = skb->data[0];
+ skb_pull(skb, LRW_MHDR_LEN);
+
+ /* Trim Device Address */
+ skb_pull(skb, 4);
+
+ /* Get frame control */
+ fhdr->fctrl = skb->data[0];
+ skb_pull(skb, 1);
+
+ /* Ack the original TX frame if it should be acked */
+ if (ss->tx_should_ack && (fhdr->fctrl & 0x20))
+ ss->tx_should_ack = false;
+
+ /* Get frame count */
+ p_fcnt = (__le16 *)skb->data;
+ fhdr->fcnt = le16_to_cpu(*p_fcnt);
+ skb_pull(skb, 2);
+
+ /* Get frame options */
+ fhdr->fopts_len = fhdr->fctrl & 0xF;
+ if (fhdr->fopts_len > 0) {
+ memcpy(fhdr->fopts, skb->data, fhdr->fopts_len);
+ skb_pull(skb, fhdr->fopts_len);
+ }
+
+ /* TODO: Parse frame options */
+
+ /* Remove message integrity code */
+ skb_trim(skb, skb->len - LRW_MIC_LEN);
+}
+
+struct lrw_session *
+lrw_rx_skb_2_session(struct lrw_struct *lrw_st, struct sk_buff *rx_skb)
+{
+ struct lrw_session *ss;
+ u16 fcnt;
+ __le16 *p_fcnt;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ p_fcnt = (__le16 *)(rx_skb->data + 6);
+ fcnt = le16_to_cpu(*p_fcnt);
+
+ /* Find the corresponding session */
+ ss = lrw_st->_cur_ss;
+
+ /* Frame count downlink check */
+ if (fcnt >= (ss->fcnt_down & 0xFFFF))
+ ss->rx_skb = rx_skb;
+ else
+ ss = NULL;
+
+ return ss;
+}
+
+void
+lrw_rx_work(struct work_struct *work)
+{
+ struct lrw_struct *lrw_st;
+ struct lrw_session *ss;
+ struct sk_buff *skb;
+
+ lrw_st = container_of(work, struct lrw_struct, rx_work);
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ skb = lrw_st->rx_skb_list.next;
+ skb_dequeue(&lrw_st->rx_skb_list);
+
+ /* Check and parse the RX frame */
+ ss = lrw_rx_skb_2_session(lrw_st, skb);
+ if (!ss)
+ goto lrw_rx_work_not_new_frame;
+
+ lrw_parse_frame(ss, skb);
+
+ /* Check the TX frame is acked or not */
+ if (ss->tx_should_ack) {
+ ss->rx_skb = NULL;
+ goto lrw_rx_work_not_new_frame;
+ }
+
+ /* The TX frame is acked or no need to be acked */
+ del_timer(&ss->timer);
+ lrw_st->rx_should_ack = (ss->rx_fhdr.mtype & 0xC0) == 0x40;
+
+ lrw_st->ndev->stats.rx_packets++;
+ lrw_st->ndev->stats.rx_bytes += ss->rx_skb->len;
+
+ if (ss->rx_skb->len > 0) {
+ spin_lock_bh(&ss->state_lock);
+ ss->state = LRW_RXRECEIVED_SS;
+ spin_unlock_bh(&ss->state_lock);
+
+ mac_cb(skb)->devaddr = lrw_st->devaddr;
+ netif_receive_skb(skb);
+
+ ss->rx_skb = NULL;
+ }
+
+ mutex_lock(&lrw_st->ss_list_lock);
+ lrw_st->fcnt_down = ss->rx_fhdr.fcnt;
+ lrw_st->_cur_ss = NULL;
+ lrw_del_ss(ss);
+ lrw_st->state = LRW_STATE_IDLE;
+ mutex_unlock(&lrw_st->ss_list_lock);
+
+ return;
+
+lrw_rx_work_not_new_frame:
+ /* Drop the RX frame if checked failed */
+ kfree_skb(skb);
+}
+
+int
+lrw_check_mic(struct crypto_shash *tfm, struct sk_buff *skb)
+{
+ u8 *buf;
+ size_t len;
+ u32 devaddr;
+ u16 fcnt;
+ __le16 *p_fcnt;
+ u8 cks[4];
+ u8 *mic;
+
+ buf = skb->data;
+ len = skb->len - 4;
+ devaddr = le32_to_cpu(*((__le32 *)(buf + 1)));
+ p_fcnt = (__le16 *)(buf + 6);
+ fcnt = le16_to_cpu(*p_fcnt);
+ mic = skb->data + len;
+
+ lrw_calc_mic(tfm, LRW_DOWNLINK, devaddr, fcnt, buf, len, cks);
+
+ return (!memcmp(cks, mic, 4));
+}
+
+/**
+ * lrw_rx_irqsave - Tell LoRaWAN module that there is a newly received frame
+ * @hw: the LoRa device
+ * @skb: the newly received frame
+ */
+void
+lrw_rx_irqsave(struct lrw_hw *hw, struct sk_buff *skb)
+{
+ struct lrw_struct *lrw_st = container_of(hw, struct lrw_struct, hw);
+ u8 mtype;
+ u32 devaddr;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ mtype = skb->data[0] >> 5;
+ devaddr = le32_to_cpu(*(__le32 *)(skb->data + LRW_MHDR_LEN));
+
+ /* Check that the frame is a downlink frame for this device */
+ if (((mtype == LRW_UNCONFIRMED_DATA_DOWN)
+ || (mtype == LRW_CONFIRMED_DATA_DOWN))
+ && (devaddr == lrw_st->devaddr)
+ && lrw_check_mic(lrw_st->nwks_shash_tfm, skb)) {
+ skb_queue_tail(&lrw_st->rx_skb_list, skb);
+ schedule_work(&lrw_st->rx_work);
+ }
+ else {
+ kfree_skb(skb);
+ }
+}
+EXPORT_SYMBOL(lrw_rx_irqsave);
+
+static void
+lrw_rexmit(struct timer_list *timer)
+{
+ struct lrw_session *ss = container_of(timer, struct lrw_session, timer);
+ struct lrw_struct *lrw_st = ss->lrw_st;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ lrw_st->state = LRW_STATE_TX;
+ lrw_xmit((unsigned long) lrw_st);
+}
+
+static void
+rx_timeout_work(struct work_struct *work)
+{
+ struct lrw_session *ss;
+ struct lrw_struct *lrw_st;
+
+ ss = container_of(work, struct lrw_session, timeout_work);
+ lrw_st = ss->lrw_st;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+ mutex_lock(&lrw_st->ss_list_lock);
+ lrw_st->_cur_ss = NULL;
+ lrw_st->state = LRW_STATE_IDLE;
+ lrw_del_ss(ss);
+ mutex_unlock(&lrw_st->ss_list_lock);
+}
+
+static void
+rx2_timeout_isr(struct timer_list *timer)
+{
+ struct lrw_session *ss = container_of(timer, struct lrw_session, timer);
+ struct lrw_struct *lrw_st = ss->lrw_st;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ /* Check whether the TX frame still waits for an ack */
+ if (!ss->tx_should_ack) {
+ spin_lock_bh(&ss->state_lock);
+ if (ss->state != LRW_RXRECEIVED_SS)
+ ss->state = LRW_RXTIMEOUT_SS;
+ spin_unlock_bh(&ss->state_lock);
+
+ if (ss->state == LRW_RXTIMEOUT_SS) {
+ netdev_dbg(lrw_st->ndev, "%s: rx time out\n", __func__);
+ goto rx2_timeout_isr_no_retry_rx_frame;
+ }
+ else {
+ return;
+ }
+ }
+
+ /* Check whether the session needs to be retransmitted */
+ if (ss->retry > 0) {
+ ss->state = LRW_RETRANSMIT_SS;
+ ss->retry--;
+
+ /* Start timer for ack timeout and retransmit */
+ ss->timer.function = lrw_rexmit;
+ ss->timer.expires = jiffies_64 + ss->ack_timeout * HZ;
+ add_timer(&ss->timer);
+ }
+ else {
+ /* Retry failed */
+rx2_timeout_isr_no_retry_rx_frame:
+ schedule_work(&ss->timeout_work);
+ }
+}
+
+static void
+rx2_delay_isr(struct timer_list *timer)
+{
+ struct lrw_session *ss = container_of(timer, struct lrw_session, timer);
+ struct lrw_struct *lrw_st = ss->lrw_st;
+ unsigned long delay;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ /* Start timer for RX2 window */
+ ss->timer.function = rx2_timeout_isr;
+ delay = jiffies_64 + (ss->rx2_window + 20) * HZ / 1000 + HZ;
+ ss->timer.expires = delay;
+ add_timer(&ss->timer);
+
+ /* Start LoRa hardware to RX2 window */
+ ss->state = LRW_RX2_SS;
+ lrw_st->ops->start_rx_window(&lrw_st->hw, ss->rx2_window + 20);
+}
+
+static void
+rx1_delay_isr(struct timer_list *timer)
+{
+ struct lrw_session *ss = container_of(timer, struct lrw_session, timer);
+ struct lrw_struct *lrw_st = ss->lrw_st;
+ unsigned long delay;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ /* Start timer for RX_Delay2 - RX_Delay1 */
+ ss->timer.function = rx2_delay_isr;
+ delay = jiffies_64 + (ss->rx_delay2 - ss->rx_delay1) * HZ - 20 * HZ / 1000;
+ ss->timer.expires = delay;
+ add_timer(&ss->timer);
+
+ /* Start LoRa hardware to RX1 window */
+ ss->state = LRW_RX1_SS;
+ lrw_st->ops->start_rx_window(&lrw_st->hw, ss->rx1_window + 20);
+}
+
+void
+lrw_sent_tx_work(struct lrw_struct *lrw_st, struct sk_buff *skb)
+{
+ struct lrw_session *ss = lrw_st->_cur_ss;
+ struct net_device *ndev;
+ unsigned long delay;
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ ss->state = LRW_XMITTED;
+
+ /* Start session timer for RX_Delay1 */
+ timer_setup(&ss->timer, rx1_delay_isr, 0);
+ delay = jiffies_64 + ss->rx_delay1 * HZ - 20 * HZ / 1000;
+ ss->timer.expires = delay;
+ add_timer(&ss->timer);
+
+ ndev = skb->dev;
+ ndev->stats.tx_packets++;
+ ndev->stats.tx_bytes += skb->len;
+ dev_consume_skb_any(skb);
+ ss->tx_skb = NULL;
+}
+
+/**
+ * lrw_xmit_complete - Tell LoRaWAN module that the frame is xmitted completely
+ * @hw: the LoRa device
+ * @skb: the xmitted frame
+ */
+void
+lrw_xmit_complete(struct lrw_hw *hw, struct sk_buff *skb)
+{
+ struct lrw_struct *lrw_st = container_of(hw, struct lrw_struct, hw);
+
+ netdev_dbg(lrw_st->ndev, "%s\n", __func__);
+
+ lrw_sent_tx_work(lrw_st, skb);
+ lrw_st->state = LRW_STATE_RX;
+}
+EXPORT_SYMBOL(lrw_xmit_complete);
diff --git a/net/maclorawan/main.c b/net/maclorawan/main.c
new file mode 100644
index 000000000000..002781295369
--- /dev/null
+++ b/net/maclorawan/main.c
@@ -0,0 +1,600 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later OR BSD-3-Clause */
+/*-
+ * LoRaWAN soft MAC
+ *
+ * Copyright (c) 2018 Jian-Hong, Pan <starnight@g.ncu.edu.tw>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/list.h>
+
+#include <linux/lora/lorawan.h>
+#include <linux/lora/lorawan_netdev.h>
+#include "maclorawan.h"
+
+#define PHY_NAME "lora"
+
+/* Need to find a way to define or assign */
+#define LORAWAN_MTU 20
+
+static struct class *lrw_sys_class;
+
+static void
+lrw_if_setup(struct net_device *ndev)
+{
+ ndev->addr_len = LRW_DEVADDR_LEN;
+ memset(ndev->broadcast, 0xFF, ndev->addr_len);
+ ndev->type = ARPHRD_LORAWAN;
+
+ ndev->hard_header_len = LRW_MHDR_LEN + LRW_FHDR_MAX_LEN + LRW_FPORT_LEN;
+ ndev->needed_tailroom = LRW_MIC_LEN;
+
+ /*
+ * TODO: M should be a dynamic value defined by the Regional Parameters.
+ * It is fixed for now and is going to be changed.
+ */
+ ndev->mtu = LORAWAN_MTU;
+}
+
+/**
+ * lrw_alloc_hw - Allocate memory for the LoRa device
+ * @priv_data_len: the private data size
+ * @ops: the implemented operations of the LoRa device
+ *
+ * Return: the LoRa device, NULL if @ops is incomplete, or an ERR_PTR() on error
+ */
+struct lrw_hw *
+lrw_alloc_hw(size_t priv_data_len, struct lrw_operations *ops)
+{
+ struct net_device *ndev;
+ struct lrw_struct *lrw_st;
+ int ret;
+
+ if (WARN_ON(!ops || !ops->start || !ops->stop || !ops->xmit_async ||
+ !ops->set_txpower || !ops->set_dr ||
+ !ops->start_rx_window || !ops->set_state))
+ return NULL;
+
+ /* In memory it'll be like this:
+ *
+ * +-----------------------+
+ * | struct net_device |
+ * +-----------------------+
+ * | struct lrw_struct |
+ * +-----------------------+
+ * | driver's private data |
+ * +-----------------------+
+ */
+ ndev = alloc_netdev(sizeof(struct lrw_struct) + priv_data_len,
+ PHY_NAME"%d", NET_NAME_ENUM, lrw_if_setup);
+ if (!ndev)
+ return ERR_PTR(-ENOMEM);
+ ret = dev_alloc_name(ndev, ndev->name);
+ if (ret < 0)
+ goto lrw_alloc_hw_err;
+
+ lrw_st = (struct lrw_struct *)netdev_priv(ndev);
+ lrw_st->ndev = ndev;
+
+ lrw_st->state = LRW_STOP;
+ lrw_st->ops = ops;
+ lrw_st->hw.priv = (void *) lrw_st + sizeof(struct lrw_struct);
+
+ ndev->flags |= IFF_NOARP;
+ ndev->features |= NETIF_F_HW_CSUM;
+
+ return &lrw_st->hw;
+
+lrw_alloc_hw_err:
+ free_netdev(ndev);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(lrw_alloc_hw);
+
+/**
+ * lrw_free_hw - Free the LoRa device's memory resource
+ * @hw: the LoRa device going to be freed
+ */
+void
+lrw_free_hw(struct lrw_hw *hw)
+{
+ struct lrw_struct *lrw_st;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+ free_netdev(lrw_st->ndev);
+}
+EXPORT_SYMBOL(lrw_free_hw);
+
+/**
+ * lrw_set_deveui - Set the LoRa device's DevEUI
+ * @hw: the LoRa device going to be set
+ * @eui: the global end-device ID in IEEE EUI64 address space
+ */
+void
+lrw_set_deveui(struct lrw_hw *hw, u64 eui)
+{
+ struct lrw_struct *lrw_st;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+ lrw_st->dev_eui = eui;
+}
+EXPORT_SYMBOL(lrw_set_deveui);
+
+/**
+ * lrw_get_deveui - Get the LoRa device's DevEUI
+ * @hw: the LoRa device to read from
+ *
+ * Return: the device's DevEUI in IEEE EUI64 address space
+ */
+u64
+lrw_get_deveui(struct lrw_hw *hw)
+{
+ struct lrw_struct *lrw_st;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+ return lrw_st->dev_eui;
+}
+EXPORT_SYMBOL(lrw_get_deveui);
+
+/**
+ * lrw_set_appeui - Set the LoRa device's AppEUI
+ * @hw: the LoRa device going to be set
+ * @eui: the global application ID in IEEE EUI64 address space
+ */
+void
+lrw_set_appeui(struct lrw_hw *hw, u64 eui)
+{
+ struct lrw_struct *lrw_st;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+ lrw_st->app_eui = eui;
+}
+EXPORT_SYMBOL(lrw_set_appeui);
+
+/**
+ * lrw_get_appeui - Get the LoRa device's AppEUI
+ * @hw: the LoRa device to read from
+ *
+ * Return: the device's AppEUI in IEEE EUI64 address space
+ */
+u64
+lrw_get_appeui(struct lrw_hw *hw)
+{
+ struct lrw_struct *lrw_st;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+ return lrw_st->app_eui;
+}
+EXPORT_SYMBOL(lrw_get_appeui);
+
+/**
+ * lrw_set_devaddr - Set the LoRa device's address
+ * @hw: the LoRa device going to be set
+ * @devaddr: the device address
+ */
+void
+lrw_set_devaddr(struct lrw_hw *hw, u32 devaddr)
+{
+ struct lrw_struct *lrw_st;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+ lrw_st->devaddr = devaddr;
+}
+EXPORT_SYMBOL(lrw_set_devaddr);
+
+/**
+ * lrw_get_devaddr - Get the LoRa device's address
+ * @hw: the LoRa device to read from
+ *
+ * Return: the device address
+ */
+u32
+lrw_get_devaddr(struct lrw_hw *hw)
+{
+ struct lrw_struct *lrw_st;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+ return lrw_st->devaddr;
+}
+EXPORT_SYMBOL(lrw_get_devaddr);
+
+/**
+ * lrw_add_hw - Add a LoRaWAN hardware as a network device
+ * @lrw_st: the LoRa device going to be added
+ *
+ * Return: 0 on success, a non-zero error number on failure
+ */
+int
+lrw_add_hw(struct lrw_struct *lrw_st)
+{
+ struct net_device *ndev = lrw_st->ndev;
+ __be32 be_addr;
+ int ret;
+
+ lrw_st->fcnt_up = 0;
+ lrw_st->fcnt_down = 0;
+ lrw_st->_cur_ss = NULL;
+
+ mutex_init(&lrw_st->ss_list_lock);
+ INIT_LIST_HEAD(&lrw_st->ss_list);
+
+ tasklet_init(&lrw_st->xmit_task, lrw_xmit, (unsigned long) lrw_st);
+ INIT_WORK(&lrw_st->rx_work, lrw_rx_work);
+
+ be_addr = cpu_to_be32(lrw_st->devaddr);
+ memcpy(ndev->perm_addr, &be_addr, ndev->addr_len);
+ memcpy(ndev->dev_addr, ndev->perm_addr, ndev->addr_len);
+
+ write_pnet(&lrw_st->_net, &init_net);
+ ret = register_netdev(ndev);
+
+ return ret;
+}
+
+/**
+ * lrw_remove_hw - Remove a LoRaWAN hardware from a network device
+ * @lrw_st: the LoRa device going to be removed
+ */
+void
+lrw_remove_hw(struct lrw_struct *lrw_st)
+{
+ unregister_netdev(lrw_st->ndev);
+ tasklet_kill(&lrw_st->xmit_task);
+}
+
+bool
+ready2write(struct lrw_struct *lrw_st)
+{
+ bool status = false;
+
+ if ((!lrw_st->_cur_ss) && (lrw_st->state == LRW_STATE_IDLE))
+ status = true;
+
+ return status;
+}
+
+bool
+ready2read(struct lrw_struct *lrw_st)
+{
+ bool status = false;
+ struct lrw_session *ss;
+
+ if (!list_empty(&lrw_st->ss_list) && (lrw_st->state != LRW_STOP)) {
+ ss = list_first_entry(&lrw_st->ss_list,
+ struct lrw_session,
+ entry);
+ if (ss->state == LRW_RXRECEIVED_SS)
+ status = true;
+ }
+
+ return status;
+}
+
+static int
+lrw_if_up(struct net_device *ndev)
+{
+ struct lrw_struct *lrw_st = NETDEV_2_LRW(ndev);
+ int ret = -EBUSY;
+
+ if (lrw_st->state == LRW_STOP)
+ ret = lrw_start_hw(lrw_st);
+ if (!ret)
+ netif_start_queue(ndev);
+
+ return ret;
+}
+
+static int
+lrw_if_down(struct net_device *ndev)
+{
+ struct lrw_struct *lrw_st = NETDEV_2_LRW(ndev);
+
+ if (lrw_st->state != LRW_STOP) {
+ netif_stop_queue(ndev);
+ lrw_stop_hw(lrw_st);
+ }
+
+ return 0;
+}
+
+netdev_tx_t
+lrw_if_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+{
+ struct lrw_struct *lrw_st = NETDEV_2_LRW(ndev);
+ struct lrw_session *ss;
+ netdev_tx_t ret = NETDEV_TX_OK;
+
+ ss = lrw_alloc_ss(lrw_st);
+ if (!ss)
+ return NETDEV_TX_BUSY;
+
+ mutex_lock(&lrw_st->ss_list_lock);
+ if (ready2write(lrw_st)) {
+ list_add_tail(&ss->entry, &lrw_st->ss_list);
+ lrw_st->state = LRW_STATE_TX;
+ lrw_st->_cur_ss = ss;
+ ss->fcnt_up = lrw_st->fcnt_up;
+ ss->fcnt_down = lrw_st->fcnt_down;
+ /* TODO: RX delay #1/#2 should be set by regional parameters */
+ ss->rx_delay1 = 1;
+ ss->rx_delay2 = 2;
+ ss->rx1_window = 500;
+ ss->rx2_window = 500;
+ }
+ else
+ ret = NETDEV_TX_BUSY;
+ mutex_unlock(&lrw_st->ss_list_lock);
+
+ if (ret == NETDEV_TX_OK) {
+ ss->state = LRW_INIT_SS;
+ ss->tx_skb = skb;
+ lrw_prepare_tx_frame(ss);
+ tasklet_schedule(&lrw_st->xmit_task);
+ }
+ else
+ lrw_free_ss(ss);
+
+ return ret;
+}
+
+inline int
+lrw_if_get_addr(struct lrw_struct *lrw_st, struct sockaddr_lorawan *addr)
+{
+ int ret = 0;
+
+ switch (addr->addr_in.addr_type) {
+ case LRW_ADDR_DEVADDR:
+ addr->addr_in.devaddr = lrw_st->devaddr;
+ break;
+ case LRW_ADDR_DEVEUI:
+ addr->addr_in.dev_eui = lrw_st->dev_eui;
+ break;
+ case LRW_ADDR_APPEUI:
+ addr->addr_in.app_eui = lrw_st->app_eui;
+ break;
+ default:
+ ret = -ENOTSUPP;
+ }
+
+ return ret;
+}
+
+inline int
+lrw_if_set_addr(struct lrw_struct *lrw_st, struct sockaddr_lorawan *addr)
+{
+ struct lrw_hw *hw = &lrw_st->hw;
+ int ret = 0;
+
+ if (netif_running(lrw_st->ndev))
+ return -EBUSY;
+
+ switch (addr->addr_in.addr_type) {
+ case LRW_ADDR_DEVADDR:
+ lrw_set_devaddr(hw, addr->addr_in.devaddr);
+ break;
+ case LRW_ADDR_DEVEUI:
+ lrw_set_deveui(hw, addr->addr_in.dev_eui);
+ break;
+ case LRW_ADDR_APPEUI:
+ lrw_set_appeui(hw, addr->addr_in.app_eui);
+ break;
+ default:
+ ret = -ENOTSUPP;
+ }
+
+ return ret;
+}
+
+inline void
+swap_bytes(u8 *dst, u8 *src, size_t l)
+{
+ /* Human reading is big-endian, but LoRaWAN is little-endian */
+ unsigned int i;
+ for (i = 0; i < l; i++)
+ dst[i] = src[l - i - 1];
+}
+
+int
+lrw_set_key(struct lrw_hw *hw, u8 type, u8 *key, size_t key_len)
+{
+ struct lrw_struct *lrw_st;
+ int ret = 0;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+
+ netdev_dbg(lrw_st->ndev, "%s: type=%d\n", __func__, type);
+ if (lrw_st->state != LRW_STOP)
+ return -EINVAL;
+
+ print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 16, 1, key, key_len, true);
+ switch (type) {
+ case LRW_APPKEY:
+ swap_bytes(lrw_st->appkey, key, key_len);
+ break;
+ case LRW_NWKSKEY:
+ swap_bytes(lrw_st->nwkskey, key, key_len);
+ break;
+ case LRW_APPSKEY:
+ swap_bytes(lrw_st->appskey, key, key_len);
+ break;
+ default:
+ ret = -ENOTSUPP;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(lrw_set_key);
+
+int
+lrw_get_key(struct lrw_hw *hw, u8 type, u8 *key, size_t key_len)
+{
+ struct lrw_struct *lrw_st;
+ int ret = 0;
+
+ lrw_st = container_of(hw, struct lrw_struct, hw);
+
+ netdev_dbg(lrw_st->ndev, "%s: type=%d\n", __func__, type);
+ switch (type) {
+ case LRW_APPKEY:
+ swap_bytes(key, lrw_st->appkey, key_len);
+ break;
+ case LRW_NWKSKEY:
+ swap_bytes(key, lrw_st->nwkskey, key_len);
+ break;
+ case LRW_APPSKEY:
+ swap_bytes(key, lrw_st->appskey, key_len);
+ break;
+ default:
+ ret = -ENOTSUPP;
+ }
+
+ return ret;
+}
+
+static int
+lrw_if_ioctl(struct net_device *ndev, struct ifreq *ifr, int cmd)
+{
+ struct lrw_struct *lrw_st = NETDEV_2_LRW(ndev);
+ struct sockaddr_lorawan *addr;
+ int ret = 0;
+
+ netdev_dbg(ndev, "%s: ioctl file (cmd=0x%X)\n", __func__, cmd);
+
+ /* I/O control by each command */
+ switch (cmd) {
+ /* Set & get the DevAddr, DevEUI and AppEUI */
+ case SIOCSIFADDR:
+ addr = (struct sockaddr_lorawan *)&ifr->ifr_addr;
+ ret = lrw_if_set_addr(lrw_st, addr);
+ break;
+ case SIOCGIFADDR:
+ addr = (struct sockaddr_lorawan *)&ifr->ifr_addr;
+ ret = lrw_if_get_addr(lrw_st, addr);
+ break;
+ default:
+ ret = -ENOTSUPP;
+ }
+
+ return ret;
+}
+
+static int
+lrw_if_set_mac(struct net_device *ndev, void *p)
+{
+ struct lrw_struct *lrw_st = NETDEV_2_LRW(ndev);
+ struct sockaddr *addr = p;
+ __be32 *be_addr = (__be32 *)addr->sa_data;
+
+ netdev_dbg(ndev, "%s: AF_TYPE:%d set mac address %X\n",
+ __func__, addr->sa_family, be32_to_cpu(*be_addr));
+
+ if (netif_running(ndev))
+ return -EBUSY;
+
+ lrw_set_devaddr(&lrw_st->hw, be32_to_cpu(*be_addr));
+ memcpy(ndev->dev_addr, be_addr, ndev->addr_len);
+
+ return 0;
+}
+
+static const struct net_device_ops lrw_if_ops = {
+ .ndo_open = lrw_if_up,
+ .ndo_stop = lrw_if_down,
+ .ndo_start_xmit = lrw_if_start_xmit,
+ .ndo_do_ioctl = lrw_if_ioctl,
+ .ndo_set_mac_address = lrw_if_set_mac,
+};
+
+/**
+ * lrw_register_hw - Register as a LoRaWAN compatible device
+ * @hw: LoRa device going to be registered
+ *
+ * Return: 0 on success, a negative error number on failure
+ */
+int
+lrw_register_hw(struct lrw_hw *hw)
+{
+ struct lrw_struct *lrw_st = container_of(hw, struct lrw_struct, hw);
+ int ret;
+
+ device_initialize(&lrw_st->dev);
+ dev_set_name(&lrw_st->dev, netdev_name(lrw_st->ndev));
+ lrw_st->dev.class = lrw_sys_class;
+ lrw_st->dev.platform_data = lrw_st;
+
+ ret = device_add(&lrw_st->dev);
+ if (ret)
+ goto lrw_register_hw_end;
+
+ /* Add a LoRa device node as a network device */
+ lrw_st->ndev->netdev_ops = &lrw_if_ops;
+ ret = lrw_add_hw(lrw_st);
+
+lrw_register_hw_end:
+ return ret;
+}
+EXPORT_SYMBOL(lrw_register_hw);
+
+/**
+ * lrw_unregister_hw - Unregister the LoRaWAN compatible device
+ * @hw: LoRa device going to be unregistered
+ */
+void
+lrw_unregister_hw(struct lrw_hw *hw)
+{
+ struct lrw_struct *lrw_st = container_of(hw, struct lrw_struct, hw);
+
+ pr_info("%s: unregister %s\n",
+ LORAWAN_MODULE_NAME, dev_name(&lrw_st->dev));
+
+ /* Stop and remove the LoRaWAN hardware from the system */
+ if (lrw_st->state != LRW_STOP)
+ lrw_stop_hw(lrw_st);
+ device_del(&lrw_st->dev);
+ lrw_remove_hw(lrw_st);
+
+ return;
+}
+EXPORT_SYMBOL(lrw_unregister_hw);
+
+static int __init
+lrw_init(void)
+{
+ int err = 0;
+
+ pr_info("%s: module inserted\n", LORAWAN_MODULE_NAME);
+
+ /* Create device class */
+ lrw_sys_class = class_create(THIS_MODULE, LORAWAN_MODULE_NAME);
+ if (IS_ERR(lrw_sys_class)) {
+ pr_err("%s: Failed to create a class of LoRaWAN\n",
+ LORAWAN_MODULE_NAME);
+ err = PTR_ERR(lrw_sys_class);
+ goto lrw_init_end;
+ }
+
+ pr_debug("%s: class created\n", LORAWAN_MODULE_NAME);
+
+lrw_init_end:
+ return err;
+}
+
+static void __exit
+lrw_exit(void)
+{
+ /* Delete device class */
+ class_destroy(lrw_sys_class);
+ pr_info("%s: module removed\n", LORAWAN_MODULE_NAME);
+}
+
+module_init(lrw_init);
+module_exit(lrw_exit);
+
+MODULE_AUTHOR("Jian-Hong Pan, <starnight@g.ncu.edu.tw>");
+MODULE_DESCRIPTION("LoRaWAN soft MAC kernel module");
+MODULE_LICENSE("Dual BSD/GPL");
--
2.19.1
^ permalink raw reply related
* [PATCH V2 4/7] net: maclorawan: Add maclorawan module declaration
From: Jian-Hong Pan @ 2018-11-05 16:55 UTC (permalink / raw)
To: Andreas Färber
Cc: netdev, linux-arm-kernel, linux-kernel, Marcel Holtmann,
David S . Miller, Dollar Chen, Ken Yu, linux-wpan, Stefan Schmidt,
Jian-Hong Pan
In-Reply-To: <fc737f3940bbe91341fb15d85ac11931eb56d1fc.1535039998.git.starnight@g.ncu.edu.tw>
Add the maclorawan header file for common APIs in the module.
Signed-off-by: Jian-Hong Pan <starnight@g.ncu.edu.tw>
---
V2:
- Split the LoRaWAN class module patch in V1 into LoRaWAN socket and
LoRaWAN Soft MAC modules
- Use SPDX license identifiers
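
For reviewers who want to check the header constants against real frames,
here is a minimal userspace sketch (not part of this patch) of how the
message type enum in maclorawan.h lines up with the 3-bit MType field in
the MHDR byte, and how the DevAddr and FCnt fields are read at the same
offsets the RX path uses (skb->data + LRW_MHDR_LEN and skb->data + 6).
The names demo_dl_hdr, demo_get_le16, demo_get_le32 and demo_parse_downlink
are hypothetical and exist only for this illustration; the MIC check is
left out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same order as the message type enum in maclorawan.h */
enum {
	LRW_JOIN_REQUEST,
	LRW_JOIN_ACCEPT,
	LRW_UNCONFIRMED_DATA_UP,
	LRW_UNCONFIRMED_DATA_DOWN,
	LRW_CONFIRMED_DATA_UP,
	LRW_CONFIRMED_DATA_DOWN,
	LRW_RFU,
	LRW_PROPRIETARY,
};

#define LRW_MHDR_LEN	1
#define LRW_MIC_LEN	4

/* Hypothetical container for the fields the RX path looks at */
struct demo_dl_hdr {
	uint8_t mtype;
	uint32_t devaddr;
	uint16_t fcnt;
};

/* Read little-endian values byte by byte, so alignment does not matter */
static uint16_t demo_get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t demo_get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return true if buf holds a downlink data frame addressed to own_addr.
 * Assumed layout: MHDR(1) | DevAddr(4) | FCtrl(1) | FCnt(2) | ... | MIC(4)
 */
static bool demo_parse_downlink(const uint8_t *buf, size_t len,
				uint32_t own_addr, struct demo_dl_hdr *hdr)
{
	if (len < LRW_MHDR_LEN + 4 + 1 + 2 + LRW_MIC_LEN)
		return false;

	hdr->mtype = buf[0] >> 5;		/* top 3 bits of MHDR */
	hdr->devaddr = demo_get_le32(buf + LRW_MHDR_LEN);
	hdr->fcnt = demo_get_le16(buf + 6);

	if (hdr->mtype != LRW_UNCONFIRMED_DATA_DOWN &&
	    hdr->mtype != LRW_CONFIRMED_DATA_DOWN)
		return false;

	return hdr->devaddr == own_addr;
}
```

The sketch also checks the frame length before touching any field. The
byte-wise accessors are a choice made for this illustration so that it
does not depend on the alignment of the buffer.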
net/maclorawan/maclorawan.h | 199 ++++++++++++++++++++++++++++++++++++
1 file changed, 199 insertions(+)
create mode 100644 net/maclorawan/maclorawan.h
diff --git a/net/maclorawan/maclorawan.h b/net/maclorawan/maclorawan.h
new file mode 100644
index 000000000000..66b87f051d51
--- /dev/null
+++ b/net/maclorawan/maclorawan.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later OR BSD-3-Clause */
+/*-
+ * LoRaWAN soft MAC
+ *
+ * Copyright (c) 2018 Jian-Hong, Pan <starnight@g.ncu.edu.tw>
+ *
+ */
+
+#ifndef __MAC_LORAWAN_H__
+#define __MAC_LORAWAN_H__
+
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#include <linux/workqueue.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/if_arp.h>
+#include <crypto/hash.h>
+#include <crypto/skcipher.h>
+#include <linux/lora/lorawan.h>
+
+#define LORAWAN_MODULE_NAME "maclorawan"
+
+/* List the message types of LoRaWAN */
+enum {
+ LRW_JOIN_REQUEST,
+ LRW_JOIN_ACCEPT,
+ LRW_UNCONFIRMED_DATA_UP,
+ LRW_UNCONFIRMED_DATA_DOWN,
+ LRW_CONFIRMED_DATA_UP,
+ LRW_CONFIRMED_DATA_DOWN,
+ LRW_RFU,
+ LRW_PROPRIETARY,
+};
+
+/* List the communication directions */
+enum {
+ LRW_UPLINK,
+ LRW_DOWNLINK,
+};
+
+/* States of LoRaWAN slot timing */
+enum {
+ LRW_INIT_SS,
+ LRW_XMITTING_SS,
+ LRW_XMITTED,
+ LRW_RX1_SS,
+ LRW_RX2_SS,
+ LRW_RXTIMEOUT_SS,
+ LRW_RXRECEIVED_SS,
+ LRW_RETRANSMIT_SS,
+};
+
+#define LRW_MHDR_LEN 1
+#define LRW_FHDR_MAX_LEN 22
+#define LRW_FOPS_MAX_LEN 15
+#define LRW_FPORT_LEN 1
+#define LRW_MIC_LEN 4
+
+/**
+ * lrw_fhdr - Hold the message's basic information of the frame
+ *
+ * @mtype: this message's type
+ * @fctrl: the frame control byte
+ * @fcnt: this message's frame counter value
+ * @fopts: this frame's options field
+ * @fopts_len: the length of the fopts
+ */
+struct lrw_fhdr {
+ u8 mtype;
+ u8 fctrl;
+ u16 fcnt;
+ u8 fopts[LRW_FOPS_MAX_LEN];
+ u8 fopts_len;
+};
+
+/**
+ * lrw_session - LoRaWAN session for Class A end device
+ *
+ * @lrw_st: points to the belonging lrw_st
+ * @entry: the entry of the ss_list in lrw_struct
+ * @devaddr: the LoRaWAN device address of this LoRaWAN hardware
+ * @fcnt_up: uplink frame counter
+ * @fcnt_down: downlink frame counter
+ * @fport: the LoRaWAN data message's port field
+ * @tx_skb: points to the TX skb, the frame
+ * @rx_skb: points to the RX skb, the frame
+ * @tx_fhdr: hold the message's basic information of the TX frame
+ * @rx_fhdr: hold the message's basic information of the RX frame
+ * @tx_should_ack: flag telling whether the TX frame should be acked or not
+ * @retry: remaining retries after a failed transmission
+ * @state: this session's current state
+ * @state_lock: lock of the session's state
+ * @timer: timing for this session and the state transition
+ * @timeout_work: work run when waiting for the acknowledgement times out
+ * @rx_delay1: RX1 delay time in seconds
+ * @rx_delay2: RX2 delay time in seconds
+ * @rx1_window: RX1 window opening time in milliseconds
+ * @rx2_window: RX2 window opening time in milliseconds
+ * @ack_timeout: time out time for waiting acknowledge in seconds
+ */
+struct lrw_session {
+ struct lrw_struct *lrw_st;
+ struct list_head entry;
+
+ u32 devaddr;
+ u16 fcnt_up;
+ u16 fcnt_down;
+ u8 fport;
+ struct sk_buff *tx_skb;
+ struct sk_buff *rx_skb;
+ struct lrw_fhdr tx_fhdr;
+ struct lrw_fhdr rx_fhdr;
+
+ bool tx_should_ack;
+ u8 retry;
+ u8 state;
+ spinlock_t state_lock;
+
+ struct timer_list timer;
+ struct work_struct timeout_work;
+ unsigned long rx_delay1;
+ unsigned long rx_delay2;
+ unsigned long rx1_window;
+ unsigned long rx2_window;
+ unsigned long ack_timeout;
+};
+
+/**
+ * lrw_struct - The full LoRaWAN hardware to the LoRa device.
+ *
+ * @dev: this LoRa device registered in the system
+ * @hw: the LoRa device of this LoRaWAN hardware
+ * @ops: handle of LoRa operations interfaces
+ * @rx_skb_list: the list of received frames
+ * @ss_list: LoRaWAN session list of this LoRaWAN hardware
+ * @_cur_ss: pointer of the current processing session
+ * @rx_should_ack: represent the current session should be acked or not
+ * @state: the state of this LoRaWAN hardware
+ * @app_eui: the LoRaWAN application EUI
+ * @dev_eui: the LoRaWAN device EUI
+ * @devaddr: the LoRaWAN device address of this LoRaWAN hardware
+ * @appkey: the Application key
+ * @nwkskey: the Network session key
+ * @appskey: the Application session key
+ * @nwks_shash_tfm: the hash handler for LoRaWAN network session
+ * @nwks_skc_tfm: the crypto handler for LoRaWAN network session
+ * @apps_skc_tfm: the crypto handler for LoRaWAN application session
+ * @fcnt_up: the counter of this LoRaWAN hardware's up frame
+ * @fcnt_down: the counter of this LoRaWAN hardware's down frame
+ * @xmit_task: the xmit task for the current LoRaWAN session
+ * @rx_work: the RX work in workqueue for the current LoRaWAN session
+ * @ndev: points to the emulating network device
+ * @_net: the current network namespace of this LoRaWAN hardware
+ */
+struct lrw_struct {
+ struct device dev;
+ struct lrw_hw hw;
+ struct lrw_operations *ops;
+
+ struct sk_buff_head rx_skb_list;
+ struct list_head ss_list;
+ struct mutex ss_list_lock;
+ struct lrw_session *_cur_ss;
+ bool rx_should_ack;
+ u8 state;
+
+ u64 app_eui;
+ u64 dev_eui;
+ u32 devaddr;
+ u8 appkey[LRW_KEY_LEN];
+ u8 nwkskey[LRW_KEY_LEN];
+ u8 appskey[LRW_KEY_LEN];
+ struct crypto_shash *nwks_shash_tfm;
+ struct crypto_skcipher *nwks_skc_tfm;
+ struct crypto_skcipher *apps_skc_tfm;
+
+ u16 fcnt_up;
+ u16 fcnt_down;
+
+ struct tasklet_struct xmit_task;
+ struct work_struct rx_work;
+
+ struct net_device *ndev;
+ possible_net_t _net;
+};
+
+#define NETDEV_2_LRW(ndev) ((struct lrw_struct *)netdev_priv(ndev))
+
+struct lrw_session * lrw_alloc_ss(struct lrw_struct *);
+void lrw_free_ss(struct lrw_session *);
+void lrw_del_ss(struct lrw_session *);
+int lrw_start_hw(struct lrw_struct *);
+void lrw_stop_hw(struct lrw_struct *);
+void lrw_prepare_tx_frame(struct lrw_session *);
+void lrw_xmit(unsigned long);
+void lrw_rx_work(struct work_struct *);
+
+#endif
--
2.19.1
^ permalink raw reply related