Netdev List


* Re: [PATCH net v3] tcp: ensure to use the most recently sent skb when filling the rate sample
From: Eric Dumazet @ 2022-04-22 21:06 UTC
  To: Jakub Kicinski
  Cc: Pengcheng Yang, Neal Cardwell, netdev, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern, Paolo Abeni
In-Reply-To: <20220422133712.17eebbcb@kernel.org>

On Fri, Apr 22, 2022 at 1:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 20 Apr 2022 10:34:41 +0800 Pengcheng Yang wrote:
> > If an ACK (s)acks multiple skbs, we favor the information
> > from the most recently sent skb by choosing the skb with
> > the highest prior_delivered count. But in the interval
> > between receiving ACKs, we send multiple skbs with the same
> > prior_delivered, because the tp->delivered only changes
> > when we receive an ACK.
> >
> > We used RACK's solution, copying tcp_rack_sent_after() as
> > tcp_skb_sent_after() helper to determine "which packet was
> > sent last?". Later, we will use tcp_skb_sent_after() instead
> > in RACK.
> >
> > Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
> > Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Paolo Abeni <pabeni@redhat.com>
>
> Somehow this patch got marked as archived in patchwork. Reviving it now.
>
> Eric, Neal, ack?

Oops, right, thanks !

Reviewed-by: Eric Dumazet <edumazet@google.com>
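
The "which packet was sent last?" test described in the quoted commit message can be sketched in userspace C. This is only an illustration under stated assumptions: the comparison follows RACK's rule as commonly described (later transmit timestamp wins, and equal timestamps are broken by the higher sequence number, compared wrap-safely). The names `sent_info`, `seq_after`, `sent_after` and `most_recently_sent` are hypothetical and are not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-packet record: when it was sent and where it ends. */
struct sent_info {
	uint64_t tx_time_us;	/* transmit timestamp, microseconds */
	uint32_t end_seq;	/* last sequence number covered by the packet */
};

/* Wrap-safe "seq1 is after seq2" for 32-bit sequence numbers. */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq2 - seq1) < 0;
}

/* True if packet 1 was sent after packet 2: a later timestamp wins,
 * and a timestamp tie is broken by the later sequence number.
 */
bool sent_after(uint64_t t1, uint64_t t2, uint32_t seq1, uint32_t seq2)
{
	return t1 > t2 || (t1 == t2 && seq_after(seq1, seq2));
}

/* Of two (s)acked packets, return the one that was sent last. */
const struct sent_info *most_recently_sent(const struct sent_info *a,
					   const struct sent_info *b)
{
	assert(a && b);
	return sent_after(b->tx_time_us, a->tx_time_us,
			  b->end_seq, a->end_seq) ? b : a;
}
```

Unlike a prior_delivered count, the timestamp and sequence pair differs for every packet sent between two ACKs, so it can still order them.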


* Re: [net-next v1] net: Add a second bind table hashed by port and address
From: Joanne Koong @ 2022-04-22 21:06 UTC
  To: Eric Dumazet; +Cc: netdev, Martin KaFai Lau, David Miller, Jakub Kicinski
In-Reply-To: <CANn89iLw835MMj5DXw+KyX0fscb7Jw3e0nF5TW54hwqMtsekfA@mail.gmail.com>

On Thu, Apr 21, 2022 at 3:50 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Apr 21, 2022 at 3:16 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > We currently have one tcp bind table (bhash) which hashes by port
> > number only. In the socket bind path, we check for bind conflicts by
> > traversing the specified port's inet_bind_bucket while holding the
> > bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
> >
> > In instances where there are tons of sockets hashed to the same port
> > at different addresses, checking for a bind conflict is time-intensive
> > and can cause softirq cpu lockups. It also blocks new tcp connections,
> > since __inet_inherit_port() contends for the same spinlock.
> >
> > This patch proposes adding a second bind table, bhash2, that hashes by
> > port and ip address. Searching the bhash2 table leads to significantly
> > faster conflict resolution and less time holding the spinlock.
> > When experimentally testing this on a local server, the results for how
> > long a bind request takes were as follows:
> >
> > when there are ~24k sockets already bound to the port -
> >
> > ipv4:
> > before - 0.002317 seconds
> > with bhash2 - 0.000018 seconds
> >
> > ipv6:
> > before - 0.002431 seconds
> > with bhash2 - 0.000021 seconds
>
>
> Hi Joanne
>
> Do you have a test for this ? Are you using 24k IPv6 addresses on the host ?
>
> I fear we add some extra code and cost for quite an unusual configuration.
>
> Thanks.
>
Hi Eric,

I have a test on my local server that populates the bhash table entry
with 24k sockets for a given port and address, then times how long a
bind request on that port takes. When populating the table entry, I
use the same IPv6 address on the host (with SO_REUSEADDR set).

The main motivation for this patch is a case we see at Facebook: some
internal teams submit bind requests for 400 VIPs on the same port from
concurrent threads and hit softirq lockups caused by contention on the
bhash table entry's spinlock.

Thanks for your reply.
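
A test of the kind described above might look roughly like the following userspace sketch. It is not the actual test: it assumes IPv4 loopback instead of the IPv6 address mentioned, lets the kernel pick the port, and all function names are hypothetical. Reaching 24k sockets would also require raising the process's open-file limit.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Create a TCP socket with SO_REUSEADDR set; returns the fd or -1. */
static int open_reuseaddr_socket(void)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Bind fd to 127.0.0.1:*port. If *port is 0 the kernel picks a port and
 * the chosen value is written back. Returns 0 on success, -1 on failure.
 */
static int bind_loopback(int fd, uint16_t *port)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(*port);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return -1;
	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return -1;
	*port = ntohs(addr.sin_port);
	return 0;
}

/* Pre-bind nprebound sockets to one address and port, then time one more
 * bind on that port (the timed region also includes a getsockname() call).
 * Returns the elapsed seconds, or -1.0 on any failure.
 */
double time_bind_with_prebound(int nprebound)
{
	struct timespec t0, t1;
	double elapsed = -1.0;
	uint16_t port = 0;
	int nopen = 0;
	int *fds;
	int i;

	assert(nprebound > 0);
	fds = calloc((size_t)nprebound + 1, sizeof(*fds));
	if (!fds)
		return -1.0;

	/* Open nprebound + 1 sockets and bind all but the last one. */
	for (i = 0; i <= nprebound; i++) {
		fds[i] = open_reuseaddr_socket();
		if (fds[i] < 0)
			goto out;
		nopen++;
		if (i < nprebound && bind_loopback(fds[i], &port) < 0)
			goto out;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (bind_loopback(fds[nprebound], &port) == 0) {
		clock_gettime(CLOCK_MONOTONIC, &t1);
		elapsed = (double)(t1.tv_sec - t0.tv_sec) +
			  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	}
out:
	for (i = 0; i < nopen; i++)
		close(fds[i]);
	free(fds);
	return elapsed;
}
```

None of the sockets listens, so with SO_REUSEADDR set on each of them every bind to the shared address and port is expected to succeed; the time of the last bind is what would be compared before and after the patch.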
> >
> > when there are ~12 million sockets already bound to the port -
> >
> > ipv4:
> > before - 7.498583 seconds
> > with bhash2 - 0.000021 seconds
> >
> > ipv6:
> > before - 7.813554 seconds
> > with bhash2 - 0.000029 seconds
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/net/inet_connection_sock.h |   3 +
> >  include/net/inet_hashtables.h      |  56 ++++++-
> >  include/net/sock.h                 |  14 ++
> >  net/dccp/proto.c                   |  14 +-
> >  net/ipv4/inet_connection_sock.c    | 227 +++++++++++++++++++++--------
> >  net/ipv4/inet_hashtables.c         | 188 ++++++++++++++++++++++--
> >  net/ipv4/tcp.c                     |  14 +-
> >  7 files changed, 438 insertions(+), 78 deletions(-)
> >
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index 3908296d103f..d89a78d10294 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -25,6 +25,7 @@
> >  #undef INET_CSK_CLEAR_TIMERS
> >
> >  struct inet_bind_bucket;
> > +struct inet_bind2_bucket;
> >  struct tcp_congestion_ops;
> >
> >  /*
> > @@ -57,6 +58,7 @@ struct inet_connection_sock_af_ops {
> >   *
> >   * @icsk_accept_queue:    FIFO of established children
> >   * @icsk_bind_hash:       Bind node
> > + * @icsk_bind2_hash:      Bind node in the bhash2 table
> >   * @icsk_timeout:         Timeout
> >   * @icsk_retransmit_timer: Resend (no ack)
> >   * @icsk_rto:             Retransmit timeout
> > @@ -84,6 +86,7 @@ struct inet_connection_sock {
> >         struct inet_sock          icsk_inet;
> >         struct request_sock_queue icsk_accept_queue;
> >         struct inet_bind_bucket   *icsk_bind_hash;
> > +       struct inet_bind2_bucket  *icsk_bind2_hash;
> >         unsigned long             icsk_timeout;
> >         struct timer_list         icsk_retransmit_timer;
> >         struct timer_list         icsk_delack_timer;
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index f72ec113ae56..143a33d815c2 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -90,11 +90,30 @@ struct inet_bind_bucket {
> >         struct hlist_head       owners;
> >  };
> >
> > +struct inet_bind2_bucket {
> > +       possible_net_t          ib_net;
> > +       int                     l3mdev;
> > +       unsigned short          port;
> > +       union {
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +               struct in6_addr         v6_rcv_saddr;
> > +#endif
> > +               __be32                  rcv_saddr;
> > +       };
> > +       struct hlist_node       node;           /* Node in the inet_bind2_hashbucket chain */
> > +       struct hlist_head       owners;         /* List of sockets hashed to this bucket */
> > +};
> > +
> >  static inline struct net *ib_net(struct inet_bind_bucket *ib)
> >  {
> >         return read_pnet(&ib->ib_net);
> >  }
> >
> > +static inline struct net *ib2_net(struct inet_bind2_bucket *ib)
> > +{
> > +       return read_pnet(&ib->ib_net);
> > +}
> > +
> >  #define inet_bind_bucket_for_each(tb, head) \
> >         hlist_for_each_entry(tb, head, node)
> >
> > @@ -103,6 +122,15 @@ struct inet_bind_hashbucket {
> >         struct hlist_head       chain;
> >  };
> >
> > +/* This is synchronized using the inet_bind_hashbucket's spinlock.
> > + * Instead of having separate spinlocks, the inet_bind2_hashbucket can share
> > + * the inet_bind_hashbucket's given that in every case where the bhash2 table
> > + * is useful, a lookup in the bhash table also occurs.
> > + */
> > +struct inet_bind2_hashbucket {
> > +       struct hlist_head       chain;
> > +};
> > +
> >  /* Sockets can be hashed in established or listening table.
> >   * We must use different 'nulls' end-of-chain value for all hash buckets :
> >   * A socket might transition from ESTABLISH to LISTEN state without
> > @@ -138,6 +166,11 @@ struct inet_hashinfo {
> >          */
> >         struct kmem_cache               *bind_bucket_cachep;
> >         struct inet_bind_hashbucket     *bhash;
> > +       /* The 2nd binding table hashed by port and address.
> > +        * This is used primarily for expediting the resolution of bind conflicts.
> > +        */
> > +       struct kmem_cache               *bind2_bucket_cachep;
> > +       struct inet_bind2_hashbucket    *bhash2;
> >         unsigned int                    bhash_size;
> >
> >         /* The 2nd listener table hashed by local port and address */
> > @@ -221,6 +254,27 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
> >  void inet_bind_bucket_destroy(struct kmem_cache *cachep,
> >                               struct inet_bind_bucket *tb);
> >
> > +static inline bool check_bind_bucket_match(struct inet_bind_bucket *tb, struct net *net,
> > +                                          const unsigned short port, int l3mdev)
> > +{
> > +       return net_eq(ib_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev;
> > +}
> > +
> > +struct inet_bind2_bucket *
> > +inet_bind2_bucket_create(struct kmem_cache *cachep, struct net *net,
> > +                        struct inet_bind2_hashbucket *head, const unsigned short port,
> > +                        int l3mdev, const struct sock *sk);
> > +
> > +void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb);
> > +
> > +struct inet_bind2_bucket *
> > +inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, const unsigned short port,
> > +                      int l3mdev, struct sock *sk, struct inet_bind2_hashbucket **head);
> > +
> > +bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, struct net *net,
> > +                                      const unsigned short port, int l3mdev,
> > +                                      const struct sock *sk);
> > +
> >  static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
> >                                const u32 bhash_size)
> >  {
> > @@ -228,7 +282,7 @@ static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
> >  }
> >
> >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > -                   const unsigned short snum);
> > +                   struct inet_bind2_bucket *tb2, const unsigned short snum);
> >
> >  /* These can have wildcards, don't try too hard. */
> >  static inline u32 inet_lhashfn(const struct net *net, const unsigned short num)
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index c4b91fc19b9c..a2198d5674f6 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -352,6 +352,7 @@ struct sk_filter;
> >    *    @sk_txtime_report_errors: set report errors mode for SO_TXTIME
> >    *    @sk_txtime_unused: unused txtime flags
> >    *    @ns_tracker: tracker for netns reference
> > +  *    @sk_bind2_node: bind node in the bhash2 table
> >    */
> >  struct sock {
> >         /*
> > @@ -542,6 +543,7 @@ struct sock {
> >  #endif
> >         struct rcu_head         sk_rcu;
> >         netns_tracker           ns_tracker;
> > +       struct hlist_node       sk_bind2_node;
> >  };
> >
> >  enum sk_pacing {
> > @@ -822,6 +824,16 @@ static inline void sk_add_bind_node(struct sock *sk,
> >         hlist_add_head(&sk->sk_bind_node, list);
> >  }
> >
> > +static inline void __sk_del_bind2_node(struct sock *sk)
> > +{
> > +       __hlist_del(&sk->sk_bind2_node);
> > +}
> > +
> > +static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list)
> > +{
> > +       hlist_add_head(&sk->sk_bind2_node, list);
> > +}
> > +
> >  #define sk_for_each(__sk, list) \
> >         hlist_for_each_entry(__sk, list, sk_node)
> >  #define sk_for_each_rcu(__sk, list) \
> > @@ -839,6 +851,8 @@ static inline void sk_add_bind_node(struct sock *sk,
> >         hlist_for_each_entry_safe(__sk, tmp, list, sk_node)
> >  #define sk_for_each_bound(__sk, list) \
> >         hlist_for_each_entry(__sk, list, sk_bind_node)
> > +#define sk_for_each_bound_bhash2(__sk, list) \
> > +       hlist_for_each_entry(__sk, list, sk_bind2_node)
> >
> >  /**
> >   * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
> > diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> > index a976b4d29892..e65768370170 100644
> > --- a/net/dccp/proto.c
> > +++ b/net/dccp/proto.c
> > @@ -1121,6 +1121,12 @@ static int __init dccp_init(void)
> >                                   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
> >         if (!dccp_hashinfo.bind_bucket_cachep)
> >                 goto out_free_hashinfo2;
> > +       dccp_hashinfo.bind2_bucket_cachep =
> > +               kmem_cache_create("dccp_bind2_bucket",
> > +                                 sizeof(struct inet_bind2_bucket), 0,
> > +                                 SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
> > +       if (!dccp_hashinfo.bind2_bucket_cachep)
> > +               goto out_free_bind_bucket_cachep;
> >
> >         /*
> >          * Size and allocate the main established and bind bucket
> > @@ -1151,7 +1157,7 @@ static int __init dccp_init(void)
> >
> >         if (!dccp_hashinfo.ehash) {
> >                 DCCP_CRIT("Failed to allocate DCCP established hash table");
> > -               goto out_free_bind_bucket_cachep;
> > +               goto out_free_bind2_bucket_cachep;
> >         }
> >
> >         for (i = 0; i <= dccp_hashinfo.ehash_mask; i++)
> > @@ -1170,6 +1176,8 @@ static int __init dccp_init(void)
> >                         continue;
> >                 dccp_hashinfo.bhash = (struct inet_bind_hashbucket *)
> >                         __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, bhash_order);
> > +               dccp_hashinfo.bhash2 = (struct inet_bind2_hashbucket *)
> > +                       __get_free_pages(GFP_ATOMIC | __GFP_NOWARN, bhash_order);
> >         } while (!dccp_hashinfo.bhash && --bhash_order >= 0);
> >
> >         if (!dccp_hashinfo.bhash) {
> > @@ -1180,6 +1188,7 @@ static int __init dccp_init(void)
> >         for (i = 0; i < dccp_hashinfo.bhash_size; i++) {
> >                 spin_lock_init(&dccp_hashinfo.bhash[i].lock);
> >                 INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain);
> > +               INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
> >         }
> >
> >         rc = dccp_mib_init();
> > @@ -1214,6 +1223,8 @@ static int __init dccp_init(void)
> >         inet_ehash_locks_free(&dccp_hashinfo);
> >  out_free_dccp_ehash:
> >         free_pages((unsigned long)dccp_hashinfo.ehash, ehash_order);
> > +out_free_bind2_bucket_cachep:
> > +       kmem_cache_destroy(dccp_hashinfo.bind2_bucket_cachep);
> >  out_free_bind_bucket_cachep:
> >         kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep);
> >  out_free_hashinfo2:
> > @@ -1222,6 +1233,7 @@ static int __init dccp_init(void)
> >         dccp_hashinfo.bhash = NULL;
> >         dccp_hashinfo.ehash = NULL;
> >         dccp_hashinfo.bind_bucket_cachep = NULL;
> > +       dccp_hashinfo.bind2_bucket_cachep = NULL;
> >         return rc;
> >  }
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 1e5b53c2bb26..482935f0c8f6 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -117,6 +117,30 @@ bool inet_rcv_saddr_any(const struct sock *sk)
> >         return !sk->sk_rcv_saddr;
> >  }
> >
> > +static bool use_bhash2_on_bind(const struct sock *sk)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       int addr_type;
> > +
> > +       if (sk->sk_family == AF_INET6) {
> > +               addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr);
> > +               return addr_type != IPV6_ADDR_ANY && addr_type != IPV6_ADDR_MAPPED;
> > +       }
> > +#endif
> > +       return sk->sk_rcv_saddr != htonl(INADDR_ANY);
> > +}
> > +
> > +static u32 get_bhash2_nulladdr_hash(const struct sock *sk, struct net *net, int port)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       struct in6_addr nulladdr = {};
> > +
> > +       if (sk->sk_family == AF_INET6)
> > +               return ipv6_portaddr_hash(net, &nulladdr, port);
> > +#endif
> > +       return ipv4_portaddr_hash(net, 0, port);
> > +}
> > +
> >  void inet_get_local_port_range(struct net *net, int *low, int *high)
> >  {
> >         unsigned int seq;
> > @@ -130,16 +154,58 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
> >  }
> >  EXPORT_SYMBOL(inet_get_local_port_range);
> >
> > -static int inet_csk_bind_conflict(const struct sock *sk,
> > -                                 const struct inet_bind_bucket *tb,
> > -                                 bool relax, bool reuseport_ok)
> > +static bool bind_conflict_exist(const struct sock *sk, struct sock *sk2,
> > +                               kuid_t sk_uid, bool relax, bool reuseport_cb_ok,
> > +                               bool reuseport_ok)
> > +{
> > +       if (sk != sk2 && (!sk->sk_bound_dev_if || !sk2->sk_bound_dev_if ||
> > +                         sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
> > +               if (sk->sk_reuse && sk2->sk_reuse && sk2->sk_state != TCP_LISTEN) {
> > +                       if (!relax || (!reuseport_ok && sk->sk_reuseport && sk2->sk_reuseport &&
> > +                                      reuseport_cb_ok && (sk2->sk_state == TCP_TIME_WAIT ||
> > +                                                          uid_eq(sk_uid, sock_i_uid(sk2)))))
> > +                               return true;
> > +               } else if (!reuseport_ok || !sk->sk_reuseport || !sk2->sk_reuseport ||
> > +                          !reuseport_cb_ok || (sk2->sk_state != TCP_TIME_WAIT &&
> > +                                               !uid_eq(sk_uid, sock_i_uid(sk2)))) {
> > +                       return true;
> > +               }
> > +       }
> > +       return false;
> > +}
> > +
> > +static bool check_bhash2_conflict(const struct sock *sk, struct inet_bind2_bucket *tb2,
> > +                                 kuid_t sk_uid, bool relax, bool reuseport_cb_ok,
> > +                                 bool reuseport_ok)
> >  {
> >         struct sock *sk2;
> > -       bool reuseport_cb_ok;
> > -       bool reuse = sk->sk_reuse;
> > -       bool reuseport = !!sk->sk_reuseport;
> > -       struct sock_reuseport *reuseport_cb;
> > +
> > +       sk_for_each_bound_bhash2(sk2, &tb2->owners) {
> > +               if (sk->sk_family == AF_INET && ipv6_only_sock(sk2))
> > +                       continue;
> > +
> > +               if (bind_conflict_exist(sk, sk2, sk_uid, relax,
> > +                                       reuseport_cb_ok, reuseport_ok))
> > +                       return true;
> > +       }
> > +       return false;
> > +}
> > +
> > +/* This should be called only when the corresponding inet_bind_bucket spinlock is held */
> > +static int inet_csk_bind_conflict(const struct sock *sk, int port,
> > +                                 struct inet_bind_bucket *tb,
> > +                                 struct inet_bind2_bucket *tb2, /* may be null */
> > +                                 bool relax, bool reuseport_ok)
> > +{
> > +       struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> >         kuid_t uid = sock_i_uid((struct sock *)sk);
> > +       struct sock_reuseport *reuseport_cb;
> > +       struct inet_bind2_hashbucket *head2;
> > +       bool reuseport_cb_ok;
> > +       struct sock *sk2;
> > +       struct net *net;
> > +       int l3mdev;
> > +       u32 hash;
> >
> >         rcu_read_lock();
> >         reuseport_cb = rcu_dereference(sk->sk_reuseport_cb);
> > @@ -150,36 +216,40 @@ static int inet_csk_bind_conflict(const struct sock *sk,
> >         /*
> >          * Unlike other sk lookup places we do not check
> >          * for sk_net here, since _all_ the socks listed
> > -        * in tb->owners list belong to the same net - the
> > -        * one this bucket belongs to.
> > +        * in tb->owners and tb2->owners list belong
> > +        * to the same net
> >          */
> >
> > -       sk_for_each_bound(sk2, &tb->owners) {
> > -               if (sk != sk2 &&
> > -                   (!sk->sk_bound_dev_if ||
> > -                    !sk2->sk_bound_dev_if ||
> > -                    sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
> > -                       if (reuse && sk2->sk_reuse &&
> > -                           sk2->sk_state != TCP_LISTEN) {
> > -                               if ((!relax ||
> > -                                    (!reuseport_ok &&
> > -                                     reuseport && sk2->sk_reuseport &&
> > -                                     reuseport_cb_ok &&
> > -                                     (sk2->sk_state == TCP_TIME_WAIT ||
> > -                                      uid_eq(uid, sock_i_uid(sk2))))) &&
> > -                                   inet_rcv_saddr_equal(sk, sk2, true))
> > -                                       break;
> > -                       } else if (!reuseport_ok ||
> > -                                  !reuseport || !sk2->sk_reuseport ||
> > -                                  !reuseport_cb_ok ||
> > -                                  (sk2->sk_state != TCP_TIME_WAIT &&
> > -                                   !uid_eq(uid, sock_i_uid(sk2)))) {
> > -                               if (inet_rcv_saddr_equal(sk, sk2, true))
> > -                                       break;
> > -                       }
> > -               }
> > +       if (!use_bhash2_on_bind(sk)) {
> > +               sk_for_each_bound(sk2, &tb->owners)
> > +                       if (bind_conflict_exist(sk, sk2, uid, relax,
> > +                                               reuseport_cb_ok, reuseport_ok) &&
> > +                           inet_rcv_saddr_equal(sk, sk2, true))
> > +                               return true;
> > +
> > +               return false;
> >         }
> > -       return sk2 != NULL;
> > +
> > +       if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, reuseport_ok))
> > +               return true;
> > +
> > +       net = sock_net(sk);
> > +
> > +       /* check there's no conflict with an existing IPV6_ADDR_ANY (if ipv6) or
> > +        * INADDR_ANY (if ipv4) socket.
> > +        */
> > +       hash = get_bhash2_nulladdr_hash(sk, net, port);
> > +       head2 = &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > +
> > +       l3mdev = inet_sk_bound_l3mdev(sk);
> > +       inet_bind_bucket_for_each(tb2, &head2->chain)
> > +               if (check_bind2_bucket_match_nulladdr(tb2, net, port, l3mdev, sk))
> > +                       break;
> > +
> > +       if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, reuseport_ok))
> > +               return true;
> > +
> > +       return false;
> >  }
> >
> >  /*
> > @@ -187,16 +257,20 @@ static int inet_csk_bind_conflict(const struct sock *sk,
> >   * inet_bind_hashbucket lock held.
> >   */
> >  static struct inet_bind_hashbucket *
> > -inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *port_ret)
> > +inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret,
> > +                       struct inet_bind2_bucket **tb2_ret,
> > +                       struct inet_bind2_hashbucket **head2_ret, int *port_ret)
> >  {
> >         struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> > -       int port = 0;
> > +       struct inet_bind2_hashbucket *head2;
> >         struct inet_bind_hashbucket *head;
> >         struct net *net = sock_net(sk);
> > -       bool relax = false;
> >         int i, low, high, attempt_half;
> > +       struct inet_bind2_bucket *tb2;
> >         struct inet_bind_bucket *tb;
> >         u32 remaining, offset;
> > +       bool relax = false;
> > +       int port = 0;
> >         int l3mdev;
> >
> >         l3mdev = inet_sk_bound_l3mdev(sk);
> > @@ -235,10 +309,11 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
> >                 head = &hinfo->bhash[inet_bhashfn(net, port,
> >                                                   hinfo->bhash_size)];
> >                 spin_lock_bh(&head->lock);
> > +               tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2);
> >                 inet_bind_bucket_for_each(tb, &head->chain)
> > -                       if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
> > -                           tb->port == port) {
> > -                               if (!inet_csk_bind_conflict(sk, tb, relax, false))
> > +                       if (check_bind_bucket_match(tb, net, port, l3mdev)) {
> > +                               if (!inet_csk_bind_conflict(sk, port, tb, tb2,
> > +                                                           relax, false))
> >                                         goto success;
> >                                 goto next_port;
> >                         }
> > @@ -268,6 +343,8 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
> >  success:
> >         *port_ret = port;
> >         *tb_ret = tb;
> > +       *tb2_ret = tb2;
> > +       *head2_ret = head2;
> >         return head;
> >  }
> >
> > @@ -363,54 +440,77 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
> >  {
> >         bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
> >         struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
> > -       int ret = 1, port = snum;
> > +       bool bhash_created = false, bhash2_created = false;
> > +       struct inet_bind2_bucket *tb2 = NULL;
> > +       struct inet_bind2_hashbucket *head2;
> > +       struct inet_bind_bucket *tb = NULL;
> >         struct inet_bind_hashbucket *head;
> >         struct net *net = sock_net(sk);
> > -       struct inet_bind_bucket *tb = NULL;
> > +       int ret = 1, port = snum;
> > +       bool found_port = false;
> >         int l3mdev;
> >
> >         l3mdev = inet_sk_bound_l3mdev(sk);
> >
> >         if (!port) {
> > -               head = inet_csk_find_open_port(sk, &tb, &port);
> > +               head = inet_csk_find_open_port(sk, &tb, &tb2, &head2, &port);
> >                 if (!head)
> >                         return ret;
> > +               if (tb && tb2)
> > +                       goto success;
> > +               found_port = true;
> > +       } else {
> > +               head = &hinfo->bhash[inet_bhashfn(net, port,
> > +                                                 hinfo->bhash_size)];
> > +               spin_lock_bh(&head->lock);
> > +               inet_bind_bucket_for_each(tb, &head->chain)
> > +                       if (check_bind_bucket_match(tb, net, port, l3mdev))
> > +                               break;
> > +
> > +               tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2);
> > +       }
> > +
> > +       if (!tb) {
> > +               tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep, net, head,
> > +                                            port, l3mdev);
> >                 if (!tb)
> > -                       goto tb_not_found;
> > -               goto success;
> > +                       goto fail_unlock;
> > +               bhash_created = true;
> > +       }
> > +
> > +       if (!tb2) {
> > +               tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
> > +                                              net, head2, port, l3mdev, sk);
> > +               if (!tb2)
> > +                       goto fail_unlock;
> > +               bhash2_created = true;
> >         }
> > -       head = &hinfo->bhash[inet_bhashfn(net, port,
> > -                                         hinfo->bhash_size)];
> > -       spin_lock_bh(&head->lock);
> > -       inet_bind_bucket_for_each(tb, &head->chain)
> > -               if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
> > -                   tb->port == port)
> > -                       goto tb_found;
> > -tb_not_found:
> > -       tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
> > -                                    net, head, port, l3mdev);
> > -       if (!tb)
> > -               goto fail_unlock;
> > -tb_found:
> > -       if (!hlist_empty(&tb->owners)) {
> > +
> > +       /* If we had to find an open port, we already checked for conflicts */
> > +       if (!found_port && !hlist_empty(&tb->owners)) {
> >                 if (sk->sk_reuse == SK_FORCE_REUSE)
> >                         goto success;
> > -
> >                 if ((tb->fastreuse > 0 && reuse) ||
> >                     sk_reuseport_match(tb, sk))
> >                         goto success;
> > -               if (inet_csk_bind_conflict(sk, tb, true, true))
> > +               if (inet_csk_bind_conflict(sk, port, tb, tb2, true, true))
> >                         goto fail_unlock;
> >         }
> >  success:
> >         inet_csk_update_fastreuse(tb, sk);
> > -
> >         if (!inet_csk(sk)->icsk_bind_hash)
> > -               inet_bind_hash(sk, tb, port);
> > +               inet_bind_hash(sk, tb, tb2, port);
> >         WARN_ON(inet_csk(sk)->icsk_bind_hash != tb);
> > +       WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
> >         ret = 0;
> >
> >  fail_unlock:
> > +       if (ret) {
> > +               if (bhash_created)
> > +                       inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
> > +               if (bhash2_created)
> > +                       inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, tb2);
> > +       }
> >         spin_unlock_bh(&head->lock);
> >         return ret;
> >  }
> > @@ -957,6 +1057,7 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
> >
> >                 inet_sk_set_state(newsk, TCP_SYN_RECV);
> >                 newicsk->icsk_bind_hash = NULL;
> > +               newicsk->icsk_bind2_hash = NULL;
> >
> >                 inet_sk(newsk)->inet_dport = inet_rsk(req)->ir_rmt_port;
> >                 inet_sk(newsk)->inet_num = inet_rsk(req)->ir_num;
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 17440840a791..9f0bece06609 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -81,6 +81,41 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
> >         return tb;
> >  }
> >
> > +struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
> > +                                                  struct net *net,
> > +                                                  struct inet_bind2_hashbucket *head,
> > +                                                  const unsigned short port,
> > +                                                  int l3mdev,
> > +                                                  const struct sock *sk)
> > +{
> > +       struct inet_bind2_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
> > +
> > +       if (tb) {
> > +               write_pnet(&tb->ib_net, net);
> > +               tb->l3mdev    = l3mdev;
> > +               tb->port      = port;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +               if (sk->sk_family == AF_INET6)
> > +                       tb->v6_rcv_saddr = sk->sk_v6_rcv_saddr;
> > +               else
> > +#endif
> > +                       tb->rcv_saddr = sk->sk_rcv_saddr;
> > +               INIT_HLIST_HEAD(&tb->owners);
> > +               hlist_add_head(&tb->node, &head->chain);
> > +       }
> > +       return tb;
> > +}
> > +
> > +static bool bind2_bucket_addr_match(struct inet_bind2_bucket *tb2, struct sock *sk)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       if (sk->sk_family == AF_INET6)
> > +               return ipv6_addr_equal(&tb2->v6_rcv_saddr,
> > +                                      &sk->sk_v6_rcv_saddr);
> > +#endif
> > +       return tb2->rcv_saddr == sk->sk_rcv_saddr;
> > +}
> > +
> >  /*
> >   * Caller must hold hashbucket lock for this tb with local BH disabled
> >   */
> > @@ -92,12 +127,25 @@ void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket
> >         }
> >  }
> >
> > +/* Caller must hold the lock for the corresponding hashbucket in the bhash table
> > + * with local BH disabled
> > + */
> > +void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
> > +{
> > +       if (hlist_empty(&tb->owners)) {
> > +               __hlist_del(&tb->node);
> > +               kmem_cache_free(cachep, tb);
> > +       }
> > +}
> > +
> >  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
> > -                   const unsigned short snum)
> > +                   struct inet_bind2_bucket *tb2, const unsigned short snum)
> >  {
> >         inet_sk(sk)->inet_num = snum;
> >         sk_add_bind_node(sk, &tb->owners);
> >         inet_csk(sk)->icsk_bind_hash = tb;
> > +       sk_add_bind2_node(sk, &tb2->owners);
> > +       inet_csk(sk)->icsk_bind2_hash = tb2;
> >  }
> >
> >  /*
> > @@ -109,6 +157,7 @@ static void __inet_put_port(struct sock *sk)
> >         const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
> >                         hashinfo->bhash_size);
> >         struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
> > +       struct inet_bind2_bucket *tb2;
> >         struct inet_bind_bucket *tb;
> >
> >         spin_lock(&head->lock);
> > @@ -117,6 +166,13 @@ static void __inet_put_port(struct sock *sk)
> >         inet_csk(sk)->icsk_bind_hash = NULL;
> >         inet_sk(sk)->inet_num = 0;
> >         inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
> > +
> > +       if (inet_csk(sk)->icsk_bind2_hash) {
> > +               tb2 = inet_csk(sk)->icsk_bind2_hash;
> > +               __sk_del_bind2_node(sk);
> > +               inet_csk(sk)->icsk_bind2_hash = NULL;
> > +               inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
> > +       }
> >         spin_unlock(&head->lock);
> >  }
> >
> > @@ -133,14 +189,19 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
> >         struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
> >         unsigned short port = inet_sk(child)->inet_num;
> >         const int bhash = inet_bhashfn(sock_net(sk), port,
> > -                       table->bhash_size);
> > +                                      table->bhash_size);
> >         struct inet_bind_hashbucket *head = &table->bhash[bhash];
> > +       struct inet_bind2_hashbucket *head_bhash2;
> > +       bool created_inet_bind_bucket = false;
> > +       struct net *net = sock_net(sk);
> > +       struct inet_bind2_bucket *tb2;
> >         struct inet_bind_bucket *tb;
> >         int l3mdev;
> >
> >         spin_lock(&head->lock);
> >         tb = inet_csk(sk)->icsk_bind_hash;
> > -       if (unlikely(!tb)) {
> > +       tb2 = inet_csk(sk)->icsk_bind2_hash;
> > +       if (unlikely(!tb || !tb2)) {
> >                 spin_unlock(&head->lock);
> >                 return -ENOENT;
> >         }
> > @@ -153,25 +214,45 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
> >                  * as that of the child socket. We have to look up or
> >                  * create a new bind bucket for the child here. */
> >                 inet_bind_bucket_for_each(tb, &head->chain) {
> > -                       if (net_eq(ib_net(tb), sock_net(sk)) &&
> > -                           tb->l3mdev == l3mdev && tb->port == port)
> > +                       if (check_bind_bucket_match(tb, net, port, l3mdev))
> >                                 break;
> >                 }
> >                 if (!tb) {
> >                         tb = inet_bind_bucket_create(table->bind_bucket_cachep,
> > -                                                    sock_net(sk), head, port,
> > -                                                    l3mdev);
> > +                                                    net, head, port, l3mdev);
> >                         if (!tb) {
> >                                 spin_unlock(&head->lock);
> >                                 return -ENOMEM;
> >                         }
> > +                       created_inet_bind_bucket = true;
> >                 }
> >                 inet_csk_update_fastreuse(tb, child);
> > +
> > +               goto bhash2_find;
> > +       } else if (!bind2_bucket_addr_match(tb2, child)) {
> > +               l3mdev = inet_sk_bound_l3mdev(sk);
> > +
> > +bhash2_find:
> > +               tb2 = inet_bind2_bucket_find(table, net, port, l3mdev, child,
> > +                                            &head_bhash2);
> > +               if (!tb2) {
> > +                       tb2 = inet_bind2_bucket_create(table->bind2_bucket_cachep,
> > +                                                      net, head_bhash2, port, l3mdev,
> > +                                                      child);
> > +                       if (!tb2)
> > +                               goto error;
> > +               }
> >         }
> > -       inet_bind_hash(child, tb, port);
> > +       inet_bind_hash(child, tb, tb2, port);
> >         spin_unlock(&head->lock);
> >
> >         return 0;
> > +
> > +error:
> > +       if (created_inet_bind_bucket)
> > +               inet_bind_bucket_destroy(table->bind_bucket_cachep, tb);
> > +       spin_unlock(&head->lock);
> > +       return -ENOMEM;
> >  }
> >  EXPORT_SYMBOL_GPL(__inet_inherit_port);
> >
> > @@ -722,6 +803,71 @@ void inet_unhash(struct sock *sk)
> >  }
> >  EXPORT_SYMBOL_GPL(inet_unhash);
> >
> > +static inline bool check_bind2_bucket_match(struct inet_bind2_bucket *tb, struct net *net,
> > +                                           unsigned short port, int l3mdev, struct sock *sk)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       if (sk->sk_family == AF_INET6)
> > +               return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
> > +                       ipv6_addr_equal(&tb->v6_rcv_saddr, &sk->sk_v6_rcv_saddr);
> > +       else
> > +#endif
> > +               return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
> > +                       tb->rcv_saddr == sk->sk_rcv_saddr;
> > +}
> > +
> > +bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, struct net *net,
> > +                                      const unsigned short port, int l3mdev, const struct sock *sk)
> > +{
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       struct in6_addr nulladdr = {};
> > +
> > +       if (sk->sk_family == AF_INET6)
> > +               return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
> > +                       ipv6_addr_equal(&tb->v6_rcv_saddr, &nulladdr);
> > +       else
> > +#endif
> > +               return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
> > +                       tb->rcv_saddr == 0;
> > +}
> > +
> > +static struct inet_bind2_hashbucket *
> > +inet_bhashfn_portaddr(struct inet_hashinfo *hinfo, const struct sock *sk,
> > +                     const struct net *net, unsigned short port)
> > +{
> > +       u32 hash;
> > +
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       if (sk->sk_family == AF_INET6)
> > +               hash = ipv6_portaddr_hash(net, &sk->sk_v6_rcv_saddr, port);
> > +       else
> > +#endif
> > +               hash = ipv4_portaddr_hash(net, sk->sk_rcv_saddr, port);
> > +       return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
> > +}
> > +
> > +/* This should only be called when the spinlock for the socket's corresponding
> > + * bind_hashbucket is held
> > + */
> > +struct inet_bind2_bucket *
> > +inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, const unsigned short port,
> > +                      int l3mdev, struct sock *sk, struct inet_bind2_hashbucket **head)
> > +{
> > +       struct inet_bind2_bucket *bhash2 = NULL;
> > +       struct inet_bind2_hashbucket *h;
> > +
> > +       h = inet_bhashfn_portaddr(hinfo, sk, net, port);
> > +       inet_bind_bucket_for_each(bhash2, &h->chain) {
> > +               if (check_bind2_bucket_match(bhash2, net, port, l3mdev, sk))
> > +                       break;
> > +       }
> > +
> > +       if (head)
> > +               *head = h;
> > +
> > +       return bhash2;
> > +}
> > +
> >  /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
> >   * Note that we use 32bit integers (vs RFC 'short integers')
> >   * because 2^16 is not a multiple of num_ephemeral and this
> > @@ -740,10 +886,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >  {
> >         struct inet_hashinfo *hinfo = death_row->hashinfo;
> >         struct inet_timewait_sock *tw = NULL;
> > +       struct inet_bind2_hashbucket *head2;
> >         struct inet_bind_hashbucket *head;
> >         int port = inet_sk(sk)->inet_num;
> >         struct net *net = sock_net(sk);
> > +       struct inet_bind2_bucket *tb2;
> >         struct inet_bind_bucket *tb;
> > +       bool tb_created = false;
> >         u32 remaining, offset;
> >         int ret, i, low, high;
> >         int l3mdev;
> > @@ -797,8 +946,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >                  * the established check is already unique enough.
> >                  */
> >                 inet_bind_bucket_for_each(tb, &head->chain) {
> > -                       if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
> > -                           tb->port == port) {
> > +                       if (check_bind_bucket_match(tb, net, port, l3mdev)) {
> >                                 if (tb->fastreuse >= 0 ||
> >                                     tb->fastreuseport >= 0)
> >                                         goto next_port;
> > @@ -816,6 +964,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >                         spin_unlock_bh(&head->lock);
> >                         return -ENOMEM;
> >                 }
> > +               tb_created = true;
> >                 tb->fastreuse = -1;
> >                 tb->fastreuseport = -1;
> >                 goto ok;
> > @@ -831,6 +980,17 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >         return -EADDRNOTAVAIL;
> >
> >  ok:
> > +       /* Find the corresponding tb2 bucket since we need to
> > +        * add the socket to the bhash2 table as well
> > +        */
> > +       tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2);
> > +       if (!tb2) {
> > +               tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, net,
> > +                                              head2, port, l3mdev, sk);
> > +               if (!tb2)
> > +                       goto error;
> > +       }
> > +
> >         /* If our first attempt found a candidate, skip next candidate
> >          * in 1/16 of cases to add some noise.
> >          */
> > @@ -839,7 +999,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >         WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
> >
> >         /* Head lock still held and bh's disabled */
> > -       inet_bind_hash(sk, tb, port);
> > +       inet_bind_hash(sk, tb, tb2, port);
> >         if (sk_unhashed(sk)) {
> >                 inet_sk(sk)->inet_sport = htons(port);
> >                 inet_ehash_nolisten(sk, (struct sock *)tw, NULL);
> > @@ -851,6 +1011,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> >                 inet_twsk_deschedule_put(tw);
> >         local_bh_enable();
> >         return 0;
> > +
> > +error:
> > +       if (tb_created)
> > +               inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
> > +       spin_unlock_bh(&head->lock);
> > +       return -ENOMEM;
> >  }
> >
> >  /*
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index cf18fbcbf123..5a143c9afd20 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4627,6 +4627,12 @@ void __init tcp_init(void)
> >                                   SLAB_HWCACHE_ALIGN | SLAB_PANIC |
> >                                   SLAB_ACCOUNT,
> >                                   NULL);
> > +       tcp_hashinfo.bind2_bucket_cachep =
> > +               kmem_cache_create("tcp_bind2_bucket",
> > +                                 sizeof(struct inet_bind2_bucket), 0,
> > +                                 SLAB_HWCACHE_ALIGN | SLAB_PANIC |
> > +                                 SLAB_ACCOUNT,
> > +                                 NULL);
> >
> >         /* Size and allocate the main established and bind bucket
> >          * hash tables.
> > @@ -4649,8 +4655,9 @@ void __init tcp_init(void)
> >         if (inet_ehash_locks_alloc(&tcp_hashinfo))
> >                 panic("TCP: failed to alloc ehash_locks");
> >         tcp_hashinfo.bhash =
> > -               alloc_large_system_hash("TCP bind",
> > -                                       sizeof(struct inet_bind_hashbucket),
> > +               alloc_large_system_hash("TCP bind bhash tables",
> > +                                       sizeof(struct inet_bind_hashbucket) +
> > +                                       sizeof(struct inet_bind2_hashbucket),
> >                                         tcp_hashinfo.ehash_mask + 1,
> >                                         17, /* one slot per 128 KB of memory */
> >                                         0,
> > @@ -4659,9 +4666,12 @@ void __init tcp_init(void)
> >                                         0,
> >                                         64 * 1024);
> >         tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
> > +       tcp_hashinfo.bhash2 =
> > +               (struct inet_bind2_hashbucket *)(tcp_hashinfo.bhash + tcp_hashinfo.bhash_size);
> >         for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
> >                 spin_lock_init(&tcp_hashinfo.bhash[i].lock);
> >                 INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
> > +               INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
> >         }
> >
> >
> > --
> > 2.30.2
> >

^ permalink raw reply

* Re: [PATCH net-next] 1588 support on bcm54210pe
From: Richard Cochran @ 2022-04-22 19:48 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Lasse Johnsen, netdev, Gordon Hollingworth, Ahmad Byagowi,
	Heiner Kallweit, Russell King, bcm-kernel-feedback-list,
	Florian Fainelli
In-Reply-To: <YmLC98NMfHUxwPF6@lunn.ch>

On Fri, Apr 22, 2022 at 05:00:07PM +0200, Andrew Lunn wrote:

> > I am confident that this code is relevant exclusively to the
> > BCM54210PE.

Not true.

> It will not even work with the BCM54210, BCM54210S and
> > BCM54210SE PHYs.

The registers you used are also present in the BCM541xx devices.
Pretty sure your code would work on those devices (after adjusting
register offsets).

> Florian can probably tell us more, but often hardware like this is
> shared by multiple devices. If it is, you might want to use a more
> generic prefix.

My understanding is that there are two implementations, gen1 and gen2.
Your bcm542xx and the bcm541xx are both gen1, and both support inband
Rx time stamping.

Because the registers are all the same (just the offsets are
different), I'd like to see a common module that can be used by all
gen1 devices.  The module could be named bcm-ptp-gen1.c for example.
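
A rough sketch of what "same registers, different offsets" could look
like in a shared module is below. Everything in it is hypothetical: the
register names, the variant names and the offset values are placeholders
for illustration only and are not taken from any Broadcom datasheet or
existing driver.

```c
#include <assert.h>
#include <stdint.h>

/* Logical registers common to all gen1 devices (hypothetical names). */
enum gen1_reg {
	GEN1_REG_CTRL,
	GEN1_REG_RX_TS,
	GEN1_REG_COUNT,
};

/* One table per device family; only the offsets differ. */
struct gen1_variant {
	const char *name;
	uint16_t reg[GEN1_REG_COUNT];
};

/* Placeholder offsets, NOT real hardware values. */
static const struct gen1_variant gen1_variant_a = {
	.name = "variant-a",
	.reg = { [GEN1_REG_CTRL] = 0x10, [GEN1_REG_RX_TS] = 0x11 },
};

static const struct gen1_variant gen1_variant_b = {
	.name = "variant-b",
	.reg = { [GEN1_REG_CTRL] = 0x80, [GEN1_REG_RX_TS] = 0x8a },
};

/* Translate a logical register into the offset used by one family. */
static uint16_t gen1_reg_offset(const struct gen1_variant *v, enum gen1_reg r)
{
	assert(r < GEN1_REG_COUNT);
	return v->reg[r];
}
```

With a table like this, the common code would only ever refer to the
logical register, and each PHY driver would pass in its own variant.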

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations
From: Shakeel Butt @ 2022-04-22 20:22 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Vlastimil Babka, Roman Gushchin, kernel, LKML, netdev, Cgroups,
	Michal Hocko, Florian Westphal, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <e9cd84f2-d2e9-33a8-d74e-edcf60d35236@openvz.org>

On Fri, Apr 22, 2022 at 1:09 PM Vasily Averin <vvs@openvz.org> wrote:
>
> On 4/22/22 23:01, Vasily Averin wrote:
> > On 4/21/22 18:56, Shakeel Butt wrote:
> >> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <vvs@openvz.org> wrote:
> >>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
> >>>                  * setup_net() and cleanup_net() are not possible.
> >>>                  */
> >>>                 for_each_net(net) {
> >>> +                       struct mem_cgroup *old, *memcg = NULL;
> >>> +#ifdef CONFIG_MEMCG
> >>> +                       memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
> >>
> >> memcg from obj is unstable, so you need a reference on memcg. You can
> >> introduce get_mem_cgroup_from_kmem() which works for both
> >> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
> >> init_net) it should return NULL.
> >
> > Could you please elaborate with more details?
> > It seems to me mem_cgroup_from_obj() does everything exactly as you say:
> > - for slab objects it returns memcg taken from according slab->memcg_data
> > - for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
> >     page_memcg_check() returns NULL
> > - for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM)
> >     page_memcg_check() returns objcg->memcg
> > - in other cases
> >     page_memcg_check() returns page->memcg_data,
> >     so for uncharged objects like init_net NULL should be returned.
> >
> > I can introduce exported get_mem_cgroup_from_kmem(), however it should only
> > call mem_cgroup_from_obj(), perhaps under rcu_read_lock()/rcu_read_unlock().
>
> I think I finally got your point:
> Do you mean I should use css_tryget(&memcg->css) for found memcg,
> like get_mem_cgroup_from_mm() does?

Yes.
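
For readers unfamiliar with the pattern, a minimal user-space model of
"tryget" is sketched below: a reference is taken only if the object is
still live. This is an illustration with C11 atomics and a hypothetical
struct obj; it is not the kernel's css_tryget() implementation, which is
built on a different refcount mechanism.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; stands in for a memcg in this sketch. */
struct obj {
	atomic_int refs;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refs);

	while (old > 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

The lookup and the tryget would both happen inside the RCU read-side
section, so that the object cannot be freed between finding it and
pinning it.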

^ permalink raw reply

* Re: [PATCH net v3] tcp: ensure to use the most recently sent skb when filling the rate sample
From: Neal Cardwell @ 2022-04-22 20:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Pengcheng Yang, Eric Dumazet, netdev, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern, Paolo Abeni
In-Reply-To: <20220422133712.17eebbcb@kernel.org>

On Fri, Apr 22, 2022 at 4:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 20 Apr 2022 10:34:41 +0800 Pengcheng Yang wrote:
> > If an ACK (s)acks multiple skbs, we favor the information
> > from the most recently sent skb by choosing the skb with
> > the highest prior_delivered count. But in the interval
> > between receiving ACKs, we send multiple skbs with the same
> > prior_delivered, because the tp->delivered only changes
> > when we receive an ACK.
> >
> > We used RACK's solution, copying tcp_rack_sent_after() as
> > tcp_skb_sent_after() helper to determine "which packet was
> > sent last?". Later, we will use tcp_skb_sent_after() instead
> > in RACK.
> >
> > Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
> > Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Paolo Abeni <pabeni@redhat.com>
>
> Somehow this patch got marked as archived in patchwork. Reviving it now.
>
> Eric, Neal, ack?

Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>

Looks good to me. Thanks for the patch!

neal

^ permalink raw reply

* [PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists
From: Eric Dumazet @ 2022-04-22 20:12 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows by moving the cost of skb
frees outside of the critical section where the socket lock is held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPCs fit in one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead().
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
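
The scheme described above can be modeled in user space roughly as
follows. This is only a sketch under stated assumptions: struct buf and
struct defer_queue are hypothetical stand-ins for struct sk_buff and the
new softnet_data fields, a C11 atomic_flag spinlock replaces
spin_lock_irqsave(), and the "kick" is returned to the caller instead of
being sent as an IPI. The threshold of 128 is the value used in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFER_KICK_THRESHOLD 128

/* Hypothetical stand-in for struct sk_buff: only the link matters. */
struct buf {
	struct buf *next;
};

/* Hypothetical stand-in for the defer_* fields of struct softnet_data. */
struct defer_queue {
	atomic_flag lock;          /* models sd->defer_lock */
	_Atomic(struct buf *) list; /* models sd->defer_list */
	int count;                 /* models sd->defer_count */
};

static void defer_queue_init(struct defer_queue *q)
{
	atomic_flag_clear(&q->lock);
	atomic_init(&q->list, NULL);
	q->count = 0;
}

static void defer_lock(struct defer_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void defer_unlock(struct defer_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Producer side: push onto the owner's list. Returns true exactly when
 * the queue length reaches the threshold, i.e. when the owner should be
 * kicked. */
static bool defer_push(struct defer_queue *q, struct buf *b)
{
	bool kick;

	assert(b != NULL);
	defer_lock(q);
	b->next = atomic_load_explicit(&q->list, memory_order_relaxed);
	atomic_store_explicit(&q->list, b, memory_order_relaxed);
	q->count++;
	kick = (q->count == DEFER_KICK_THRESHOLD);
	defer_unlock(q);
	return kick;
}

/* Owner side: detach the whole list under the lock, then free the
 * buffers outside of it. Returns the number of buffers freed. */
static int defer_flush(struct defer_queue *q)
{
	struct buf *b, *next;
	int freed = 0;

	/* Lockless fast path when nothing is queued. */
	if (!atomic_load_explicit(&q->list, memory_order_relaxed))
		return 0;

	defer_lock(q);
	b = atomic_load_explicit(&q->list, memory_order_relaxed);
	atomic_store_explicit(&q->list, NULL, memory_order_relaxed);
	q->count = 0;
	defer_unlock(q);

	while (b != NULL) {
		next = b->next;
		free(b);
		b = next;
		freed++;
	}
	return freed;
}
```

Note that the lock is held only for the pointer swap; the actual frees
run without it, which is what keeps the producer side cheap.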

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to the number of skbs/pages held in socket
receive queues).

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workloads.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() look better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  5 ++++
 include/linux/skbuff.h    |  3 +++
 include/net/sock.h        |  2 --
 include/net/tcp.h         | 12 ---------
 net/core/dev.c            | 31 ++++++++++++++++++++++++
 net/core/skbuff.c         | 51 ++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c           |  3 ---
 net/ipv4/tcp.c            | 25 +------------------
 net/ipv4/tcp_ipv4.c       |  1 -
 net/ipv6/tcp_ipv6.c       |  1 -
 net/tls/tls_sw.c          |  2 --
 11 files changed, 90 insertions(+), 46 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..ac8a5f71220a999aebabd73d8df2c8e2b1325ad4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3081,6 +3081,11 @@ struct softnet_data {
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 
+	/* Another possibly contended cache line */
+	spinlock_t		defer_lock ____cacheline_aligned_in_smp;
+	int			defer_count;
+	struct sk_buff		*defer_list;
+	call_single_data_t	defer_csd;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 84d78df60453955a8eaf05847f6e2145176a727a..5cbc184ca685d886306ccff70b82cd409082c229 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -888,6 +888,7 @@ typedef unsigned char *sk_buff_data_t;
  *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
+ *	@alloc_cpu: CPU which did the skb allocation.
  *	@secmark: security marking
  *	@mark: Generic packet mark
  *	@reserved_tailroom: (aka @mark) number of bytes of free space available
@@ -1080,6 +1081,7 @@ struct sk_buff {
 		unsigned int	sender_cpu;
 	};
 #endif
+	u16			alloc_cpu;
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32		secmark;
 #endif
@@ -1321,6 +1323,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
+void skb_attempt_defer_free(struct sk_buff *skb);
 
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index a01d6c421aa2caad4032167284eed9565d4bd545..f9f8ecae0f8decb3e0e74c6efaff5b890e3685ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -292,7 +292,6 @@ struct sk_filter;
   *	@sk_pacing_shift: scaling factor for TCP Small Queues
   *	@sk_lingertime: %SO_LINGER l_linger setting
   *	@sk_backlog: always used with the per-socket spinlock held
-  *	@defer_list: head of llist storing skbs to be freed
   *	@sk_callback_lock: used with the callbacks in the end of this struct
   *	@sk_error_queue: rarely used
   *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
@@ -417,7 +416,6 @@ struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
-	struct llist_head defer_list;
 
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 679b1964d49414fcb1c361778fd0ac664e8c8ea5..94a52ad1101c12e13c2957e8b028b697742c451f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1375,18 +1375,6 @@ static inline bool tcp_checksum_complete(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason);
 
-#ifdef CONFIG_INET
-void __sk_defer_free_flush(struct sock *sk);
-
-static inline void sk_defer_free_flush(struct sock *sk)
-{
-	if (llist_empty(&sk->defer_list))
-		return;
-	__sk_defer_free_flush(sk);
-}
-#else
-static inline void sk_defer_free_flush(struct sock *sk) {}
-#endif
 
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4a77ebda4fb155581a5f761a864446a046987f51..611bd719706412723561c27753150b27e1dc4e7a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4545,6 +4545,12 @@ static void rps_trigger_softirq(void *data)
 
 #endif /* CONFIG_RPS */
 
+/* Called from hardirq (IPI) context */
+static void trigger_rx_softirq(void *data __always_unused)
+{
+	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+}
+
 /*
  * Check if this softnet_data structure is another cpu one
  * If yes, queue it to our IPI list and return 1
@@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
 	return 0;
 }
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff *skb, *next;
+	unsigned long flags;
+
+	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+	if (!READ_ONCE(sd->defer_list))
+		return;
+
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb = sd->defer_list;
+	sd->defer_list = NULL;
+	sd->defer_count = 0;
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	while (skb != NULL) {
+		next = skb->next;
+		__kfree_skb(skb);
+		skb = next;
+	}
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@ -6616,6 +6644,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
 	net_rps_action_and_irq_enable(sd);
+	skb_defer_free_flush(sd);
 }
 
 struct netdev_adjacent {
@@ -11326,6 +11355,8 @@ static int __init net_dev_init(void)
 		INIT_CSD(&sd->csd, rps_trigger_softirq, sd);
 		sd->cpu = i;
 #endif
+		INIT_CSD(&sd->defer_csd, trigger_rx_softirq, NULL);
+		spin_lock_init(&sd->defer_lock);
 
 		init_gro_hash(&sd->backlog);
 		sd->backlog.poll = process_backlog;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 30b523fa4ad2e9be30bdefdc61f70f989c345bbf..028a280fbabd5b69770ddd6bf0e00eae7651bbf1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -204,7 +204,7 @@ static void __build_skb_around(struct sk_buff *skb, void *data,
 	skb_set_end_offset(skb, size);
 	skb->mac_header = (typeof(skb->mac_header))~0U;
 	skb->transport_header = (typeof(skb->transport_header))~0U;
-
+	skb->alloc_cpu = raw_smp_processor_id();
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
@@ -1037,6 +1037,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	CHECK_SKB_FIELD(napi_id);
 #endif
+	CHECK_SKB_FIELD(alloc_cpu);
 #ifdef CONFIG_XPS
 	CHECK_SKB_FIELD(sender_cpu);
 #endif
@@ -6486,3 +6487,51 @@ void __skb_ext_put(struct skb_ext *ext)
 }
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
+
+/**
+ * skb_attempt_defer_free - queue skb for remote freeing
+ * @skb: buffer
+ *
+ * Put @skb in a per-cpu list, using the cpu which
+ * allocated the skb/pages to reduce false sharing
+ * and memory zone spinlock contention.
+ */
+void skb_attempt_defer_free(struct sk_buff *skb)
+{
+	int cpu = skb->alloc_cpu;
+	struct softnet_data *sd;
+	unsigned long flags;
+	bool kick;
+
+	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
+	    !cpu_online(cpu) ||
+	    cpu == raw_smp_processor_id()) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	sd = &per_cpu(softnet_data, cpu);
+	/* We do not send an IPI or any signal.
+	 * Remote cpu will eventually call skb_defer_free_flush()
+	 */
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb->next = sd->defer_list;
+	/* Paired with READ_ONCE() in skb_defer_free_flush() */
+	WRITE_ONCE(sd->defer_list, skb);
+	sd->defer_count++;
+
+	/* kick every time queue length reaches 128.
+	 * This should avoid blocking in smp_call_function_single_async().
+	 * This condition should hardly be hit under normal conditions,
+	 * unless cpu suddenly stopped receiving NIC interrupts.
+	 */
+	kick = sd->defer_count == 128;
+
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
+	 * if we are unlucky enough (this seems very unlikely).
+	 */
+	if (unlikely(kick))
+		smp_call_function_single_async(cpu, &sd->defer_csd);
+}
diff --git a/net/core/sock.c b/net/core/sock.c
index 29abec3eabd8905f2671e0b5789878a129453ef6..a0f3989de3d62456665e8b6382a4681fba17d60c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2082,9 +2082,6 @@ void sk_destruct(struct sock *sk)
 {
 	bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
 
-	WARN_ON_ONCE(!llist_empty(&sk->defer_list));
-	sk_defer_free_flush(sk);
-
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		reuseport_detach_sock(sk);
 		use_call_rcu = true;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e20b87b3bf907a9b04b7531936129fb729e96c52..db55af9eb37b56bf0ec3b47212240c0302b86a1f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -843,7 +843,6 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	}
 
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (spliced)
 		return spliced;
@@ -1589,20 +1588,6 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
-void __sk_defer_free_flush(struct sock *sk)
-{
-	struct llist_node *head;
-	struct sk_buff *skb, *n;
-
-	head = llist_del_all(&sk->defer_list);
-	llist_for_each_entry_safe(skb, n, head, ll_node) {
-		prefetch(n);
-		skb_mark_not_on_list(skb);
-		__kfree_skb(skb);
-	}
-}
-EXPORT_SYMBOL(__sk_defer_free_flush);
-
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
@@ -1610,11 +1595,7 @@ static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
-		if (!skb_queue_empty(&sk->sk_receive_queue) ||
-		    !llist_empty(&sk->defer_list)) {
-			llist_add(&skb->ll_node, &sk->defer_list);
-			return;
-		}
+		return skb_attempt_defer_free(skb);
 	}
 	__kfree_skb(skb);
 }
@@ -2453,7 +2434,6 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			__sk_flush_backlog(sk);
 		} else {
 			tcp_cleanup_rbuf(sk, copied);
-			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2571,7 +2551,6 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	lock_sock(sk);
 	ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3096,7 +3075,6 @@ int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4225,7 +4203,6 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
-		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2c2d421425557c188c4bcf3dc113baea62e915c7..918816ec5dd49abe321f0179a2a64ca9a989a01c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2065,7 +2065,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 54277de7474b78f1fea033b7978acebc0647f3ad..60bdec257ba7220d6c05b48208a587c7be2b4087 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1728,7 +1728,6 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ddbe05ec5489dd352dee832e038884339f338b43..bc54f6c5b1a4cabbfe1e3eff1768128b2730c730 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1911,7 +1911,6 @@ int tls_sw_recvmsg(struct sock *sk,
 
 end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	if (psock)
 		sk_psock_put(sk, psock);
 	return copied ? : err;
@@ -1983,7 +1982,6 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 
 splice_read_end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	return copied ? : err;
 }
 
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related

* Re: [PATCH v2] mediatek/mt7601u: add debugfs exit function
From: Jakub Kicinski @ 2022-04-22 19:47 UTC (permalink / raw)
  To: Bernard Zhao
  Cc: Kalle Valo, David S. Miller, Paolo Abeni, Matthias Brugger,
	linux-wireless, netdev, linux-arm-kernel, linux-mediatek,
	linux-kernel, bernard
In-Reply-To: <20220422080854.490379-1-zhaojunkui2008@126.com>

On Fri, 22 Apr 2022 01:08:54 -0700 Bernard Zhao wrote:
> When mt7601u is loaded, there are two cases:
> First, when mt7601u is loaded, in function mt7601u_probe, if
> mt7601u_probe runs into the error label err_hw,
> mt7601u_cleanup doesn't clean up the debugfs node.
> Second, when the module disconnects, in function mt7601u_disconnect,
> mt7601u_cleanup doesn't clean up the debugfs node.
> This patch adds a debugfs exit function and tries to clean up the
> debugfs node when mt7601u fails to load or is unloaded.
> 
> Signed-off-by: Bernard Zhao <zhaojunkui2008@126.com>

Ah, missed that there was a v2. My point stands, wiphy debugfs dir
should do the cleanup.

Do you encounter problems in practice, or are you sending these patches
based on reading / static analysis of the code only?

^ permalink raw reply

* Re: [PATCH 0/1] add support for enum module parameters
From: Jakub Kicinski @ 2022-04-22 20:44 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Jani Nikula, Greg Kroah-Hartman, linux-kernel, intel-gfx,
	dri-devel, Andrew Morton, Lucas De Marchi, linux-wireless, netdev
In-Reply-To: <87sfq8qqus.fsf@tynnyri.adurom.net>

On Wed, 20 Apr 2022 08:13:47 +0300 Kalle Valo wrote:
> Wireless drivers would also desperately need to pass device specific
> parameters at (or before) probe time. And not only debug parameters but
> also configuration parameters, for example firmware memory allocations
> schemes (optimise for features vs number of clients etc) and whatnot.
> 
> Any ideas how to implement that? Is there any prior work for anything
> like this? This is pretty hard limiting usability of upstream wireless
> drivers and I really want to find a proper solution.

In netdev we have devlink, which is used for all sorts of device
configuration. devlink-resource sounds like what you need,
but it'd have to be extended to support configuration which requires
reload/re-probe. Currently only devlink-params supports that, but params
were a mistake so don't use them.

^ permalink raw reply

* Re: [PATCH net v3] tcp: ensure to use the most recently sent skb when filling the rate sample
From: Jakub Kicinski @ 2022-04-22 20:37 UTC (permalink / raw)
  To: Pengcheng Yang
  Cc: Eric Dumazet, Neal Cardwell, netdev, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern, Paolo Abeni
In-Reply-To: <1650422081-22153-1-git-send-email-yangpc@wangsu.com>

On Wed, 20 Apr 2022 10:34:41 +0800 Pengcheng Yang wrote:
> If an ACK (s)acks multiple skbs, we favor the information
> from the most recently sent skb by choosing the skb with
> the highest prior_delivered count. But in the interval
> between receiving ACKs, we send multiple skbs with the same
> prior_delivered, because the tp->delivered only changes
> when we receive an ACK.
> 
> We used RACK's solution, copying tcp_rack_sent_after() as
> tcp_skb_sent_after() helper to determine "which packet was
> sent last?". Later, we will use tcp_skb_sent_after() instead
> in RACK.
> 
> Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection")
> Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Paolo Abeni <pabeni@redhat.com>

Somehow this patch got marked as archived in patchwork. Reviving it now.

Eric, Neal, ack?

^ permalink raw reply

* Re: [PATCH net-next 1/5] net: ipqess: introduce the Qualcomm IPQESS driver
From: Andrew Lunn @ 2022-04-22 20:19 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: davem, Rob Herring, netdev, linux-kernel, devicetree,
	thomas.petazzoni, Florian Fainelli, Heiner Kallweit, Russell King,
	linux-arm-kernel, Vladimir Oltean, Luka Perkov, Robert Marko
In-Reply-To: <20220422180305.301882-2-maxime.chevallier@bootlin.com>

> +static int ipqess_axi_probe(struct platform_device *pdev)
> +{
> +	struct device_node *np = pdev->dev.of_node;
> +	struct net_device *netdev;
> +	phy_interface_t phy_mode;
> +	struct resource *res;
> +	struct ipqess *ess;
> +	int i, err = 0;
> +
> +	netdev = devm_alloc_etherdev_mqs(&pdev->dev, sizeof(struct ipqess),
> +					 IPQESS_NETDEV_QUEUES,
> +					 IPQESS_NETDEV_QUEUES);
> +	if (!netdev)
> +		return -ENOMEM;
> +
> +	ess = netdev_priv(netdev);
> +	ess->netdev = netdev;
> +	ess->pdev = pdev;
> +	spin_lock_init(&ess->stats_lock);
> +	SET_NETDEV_DEV(netdev, &pdev->dev);
> +	platform_set_drvdata(pdev, netdev);

....

> +
> +	ipqess_set_ethtool_ops(netdev);
> +
> +	err = register_netdev(netdev);
> +	if (err)
> +		goto err_out;

Before register_netdev() even returns, your devices can be in use, the
open callback called and packets sent. This is particularly true for
NFS root. Which means any setup done after this is probably wrong.

> +
> +	err = ipqess_hw_init(ess);
> +	if (err)
> +		goto err_out;
> +
> +	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
> +		int qid;
> +
> +		netif_tx_napi_add(netdev, &ess->tx_ring[i].napi_tx,
> +				  ipqess_tx_napi, 64);
> +		netif_napi_add(netdev,
> +			       &ess->rx_ring[i].napi_rx,
> +			       ipqess_rx_napi, 64);
> +
> +		qid = ess->tx_ring[i].idx;
> +		err = devm_request_irq(&ess->netdev->dev, ess->tx_irq[qid],
> +				       ipqess_interrupt_tx, 0,
> +				       ess->tx_irq_names[qid],
> +				       &ess->tx_ring[i]);
> +		if (err)
> +			goto err_out;
> +
> +		qid = ess->rx_ring[i].idx;
> +		err = devm_request_irq(&ess->netdev->dev, ess->rx_irq[qid],
> +				       ipqess_interrupt_rx, 0,
> +				       ess->rx_irq_names[qid],
> +				       &ess->rx_ring[i]);
> +		if (err)
> +			goto err_out;
> +	}

All this should probably go before register_netdev().
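
To illustrate the ordering concern, here is a minimal user-space sketch.
It is not driver code: struct fake_dev and the fake_*/probe_* helpers are
hypothetical stand-ins, and fake_register() simply models the fact that
the open callback may run before register_netdev() returns.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a driver's private state. */
struct fake_dev {
	bool hw_ready;    /* set by hardware init */
	bool irqs_ready;  /* set once NAPI/IRQs are set up */
	int  open_result; /* what an early open() observed */
};

/* Models the open callback: succeeds only if all setup is done. */
static int fake_open(const struct fake_dev *d)
{
	return (d->hw_ready && d->irqs_ready) ? 0 : -1;
}

/* Models register_netdev(): the device may be opened before it returns. */
static void fake_register(struct fake_dev *d)
{
	d->open_result = fake_open(d);
}

/* Ordering as in the quoted probe: register first, set up afterwards. */
static int probe_register_first(struct fake_dev *d)
{
	fake_register(d);
	d->hw_ready = true;
	d->irqs_ready = true;
	return d->open_result;
}

/* Suggested ordering: finish setup, then register as the last step. */
static int probe_register_last(struct fake_dev *d)
{
	d->hw_ready = true;
	d->irqs_ready = true;
	fake_register(d);
	return d->open_result;
}
```

In this model the early open fails only in the register-first variant,
which is the window the comment above is about.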

> +static int ipqess_get_strset_count(struct net_device *netdev, int sset)
> +{
> +	switch (sset) {
> +	case ETH_SS_STATS:
> +		return ARRAY_SIZE(ipqess_stats);
> +	default:
> +		netdev_dbg(netdev, "%s: Invalid string set", __func__);

Unsupported would be better than invalid.

> +		return -EOPNOTSUPP;
> +	}
> +}

  Andrew

^ permalink raw reply

* Re: [net-next v4 0/3] use standard sysctl macro
From: Jakub Kicinski @ 2022-04-22 19:43 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: xiangxia.m.yue, netdev, linux-fsdevel, Kees Cook, Iurii Zaikin,
	David S. Miller, Paolo Abeni, Hideaki YOSHIFUJI, David Ahern,
	Simon Horman, Julian Anastasov, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Shuah Khan, Andrew Morton,
	Alexei Starovoitov, Eric Dumazet, Lorenz Bauer, Akhmat Karakotov
In-Reply-To: <YmK/PM2x5PTG2b+c@bombadil.infradead.org>

On Fri, 22 Apr 2022 07:44:12 -0700 Luis Chamberlain wrote:
> On Fri, Apr 22, 2022 at 03:01:38PM +0800, xiangxia.m.yue@gmail.com wrote:
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> > 
> > This patchset introduce sysctl macro or replace var
> > with macro.
> > 
> > Tonghao Zhang (3):
> >   net: sysctl: use shared sysctl macro
> >   net: sysctl: introduce sysctl SYSCTL_THREE
> >   selftests/sysctl: add sysctl macro test  
> 
> I see these are based on net-next, to avoid conflicts with
> sysctl development this may be best based on sysctl-next
> though. Jakub?

I guess the base should be whatever we are going to use as
a base for a branch, the branch we can both pull in?

How many patches like that do you see flying around, tho?
I feel like I've seen at least 3 - netfilter, net core and bpf.
It's starting to feel like we should have one patch that adds all 
the constants and self test, put that in a branch anyone can pull in,
and then do the conversions in separate patches..

Option number two - rename the statics in the subsystems to SYSCTL_x,
and we can do a much smaller cleanup in the next cycle which would
replace those with centralized instances. That should have minimal
chance of conflicts so no need to do special branches.

Option number three - defer all this until the merge window.

^ permalink raw reply

* Re: [PATCH net] net: Use this_cpu_inc() to increment net->core_stats
From: Jakub Kicinski @ 2022-04-22 19:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: netdev, Eric Dumazet, David S. Miller, Paolo Abeni,
	Thomas Gleixner, Peter Zijlstra
In-Reply-To: <YmFjdOp+R5gVGZ7p@linutronix.de>

On Thu, 21 Apr 2022 16:00:20 +0200 Sebastian Andrzej Siewior wrote:
> @@ -3851,7 +3851,7 @@ static inline struct net_device_core_stats *dev_core_stats(struct net_device *de

I think this needs to return __percpu now?
Double check sparse is happy for v2, pls.

>  	struct net_device_core_stats __percpu *p = READ_ONCE(dev->core_stats);
>  
>  	if (likely(p))
> -		return this_cpu_ptr(p);
> +		return p;


^ permalink raw reply

* Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations
From: Vasily Averin @ 2022-04-22 20:01 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Roman Gushchin, kernel, LKML, netdev, Cgroups,
	Michal Hocko, Florian Westphal, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <CALvZod47PARcupR4P41p5XJRfCaTqSuy-cfXs7Ky9=-aJQuoFA@mail.gmail.com>

On 4/21/22 18:56, Shakeel Butt wrote:
> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <vvs@openvz.org> wrote:
>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
>>                  * setup_net() and cleanup_net() are not possible.
>>                  */
>>                 for_each_net(net) {
>> +                       struct mem_cgroup *old, *memcg = NULL;
>> +#ifdef CONFIG_MEMCG
>> +                       memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
> 
> memcg from obj is unstable, so you need a reference on memcg. You can
> introduce get_mem_cgroup_from_kmem() which works for both
> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
> init_net) it should return NULL.

Could you please elaborate with more details?
It seems to me mem_cgroup_from_obj() does everything exactly as you say:
- for slab objects it returns memcg taken from the corresponding slab->memcg_data
- for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
    page_memcg_check() returns NULL
- for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM) 
    page_memcg_check() returns objcg->memcg
- in other cases
    page_memcg_check() returns page->memcg_data,
    so for uncharged objects like init_net NULL should be returned.

I can introduce exported get_mem_cgroup_from_kmem(), however it should only
call mem_cgroup_from_obj(), perhaps under rcu_read_lock()/rcu_read_unlock().

Do you mean something like this?

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1768,4 +1768,14 @@ static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
 
 #endif /* CONFIG_MEMCG_KMEM */
 
+static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_obj(p);
+	rcu_read_unlock();
+
+	return memcg;
+}
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..4003c47965c9 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
+#include <linux/sched/mm.h>
 /*
  *	Our network namespace constructor/destructor lists
  */
@@ -1147,7 +1148,14 @@ static int __register_pernet_operations(struct list_head *list,
 		 * setup_net() and cleanup_net() are not possible.
 		 */
 		for_each_net(net) {
+			struct mem_cgroup *old, *memcg;
+
+			memcg = get_mem_cgroup_from_kmem(net);
+			if (memcg == NULL)
+				memcg = root_mem_cgroup;
+			old = set_active_memcg(memcg);
 			error = ops_init(ops, net);
+			set_active_memcg(old);
 			if (error)
 				goto out_undo;
 			list_add_tail(&net->exit_list, &net_exit_list);

^ permalink raw reply related

* Re: [PATCH iproute2-next v2] ip-link: put types on man page in alphabetic order
From: patchwork-bot+netdevbpf @ 2022-04-22 20:00 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20220420031115.26270-1-stephen@networkplumber.org>

Hello:

This patch was applied to iproute2/iproute2-next.git (main)
by David Ahern <dsahern@kernel.org>:

On Tue, 19 Apr 2022 20:11:15 -0700 you wrote:
> Lets try and keep man pages using alpha order, it looks like
> it started that way then drifted.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  man/man8/ip-link.8.in | 175 +++++++++++++++++++++---------------------
>  1 file changed, 89 insertions(+), 86 deletions(-)

Here is the summary with links:
  - [iproute2-next,v2] ip-link: put types on man page in alphabetic order
    https://git.kernel.org/pub/scm/network/iproute2/iproute2-next.git/commit/?id=f6559beaf7ab

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH] wireless: ipw2x00: Refine the error handling of ipw2100_pci_init_one()
From: Stanislav Yakovlev @ 2022-04-22 19:25 UTC (permalink / raw)
  To: Zheyu Ma
  Cc: kvalo, David S. Miller, Jakub Kicinski, pabeni, wireless, netdev,
	Linux Kernel Mailing List
In-Reply-To: <CAMhUBjkWcg4+YYynsd90jX1A+zp95tUUcLgYrTPAqSmbxM7TJA@mail.gmail.com>

Hi Zheyu,

On 18/04/2022, Zheyu Ma <zheyuma97@gmail.com> wrote:
> On Thu, Apr 14, 2022 at 2:40 AM Stanislav Yakovlev
> <stas.yakovlev@gmail.com> wrote:
>>
>> On Sat, 9 Apr 2022 at 02:25, Zheyu Ma <zheyuma97@gmail.com> wrote:
>> >
>> > The driver should release resources in reverse order, i.e., the
>> > resources requested first should be released last, and the driver
>> > should adjust the order of error handling code by this rule.
>> >
>> > Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
>> > ---
>> >  drivers/net/wireless/intel/ipw2x00/ipw2100.c | 34 +++++++++-----------
>> >  1 file changed, 16 insertions(+), 18 deletions(-)
>> >
>> [Skipped]
>>
>> > @@ -6306,9 +6303,13 @@ static int ipw2100_pci_init_one(struct pci_dev
>> > *pci_dev,
>> >  out:
>> >         return err;
>> >
>> > -      fail_unlock:
>> > +fail_unlock:
>> >         mutex_unlock(&priv->action_mutex);
>> > -      fail:
>> > +fail:
>> > +       pci_release_regions(pci_dev);
>> > +fail_disable:
>> > +       pci_disable_device(pci_dev);
>> We can't move these functions before the following block.
>>
>> > +fail_dev:
>> >         if (dev) {
>> >                 if (registered >= 2)
>> >                         unregister_netdev(dev);
>> This block continues with a function call to ipw2100_hw_stop_adapter
>> which assumes that device is still accessible via pci bus.
>
> Thanks for your reminder, but the existing error handling does need to
> be revised, I got the following warning when the probing fails at
> pci_resource_flags():
>
> [   20.712160] WARNING: CPU: 1 PID: 462 at lib/iomap.c:44
> pci_iounmap+0x40/0x50
> [   20.716583] RIP: 0010:pci_iounmap+0x40/0x50
> [   20.726342]  <TASK>
> [   20.726550]  ipw2100_pci_init_one+0x101/0x1ee0 [ipw2100]
>
> Since I am not familiar with the ipw2100, could someone give me some
> advice to fix this.

Could you please rebuild the kernel with IPW2100_DEBUG config option
enabled, rerun the test and post your results here? Also, please post
the output of "lspci -v" here.

Stanislav.

^ permalink raw reply

* Re: [PATCH memcg RFC] net: set proper memcg for net_init hooks allocations
From: Vasily Averin @ 2022-04-22 20:09 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Roman Gushchin, kernel, LKML, netdev, Cgroups,
	Michal Hocko, Florian Westphal, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <964ae72a-0484-67de-8143-a9a2d492a520@openvz.org>

On 4/22/22 23:01, Vasily Averin wrote:
> On 4/21/22 18:56, Shakeel Butt wrote:
>> On Sat, Apr 16, 2022 at 11:39 PM Vasily Averin <vvs@openvz.org> wrote:
>>> @@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
>>>                  * setup_net() and cleanup_net() are not possible.
>>>                  */
>>>                 for_each_net(net) {
>>> +                       struct mem_cgroup *old, *memcg = NULL;
>>> +#ifdef CONFIG_MEMCG
>>> +                       memcg = (net == &init_net) ? root_mem_cgroup : mem_cgroup_from_obj(net);
>>
>> memcg from obj is unstable, so you need a reference on memcg. You can
>> introduce get_mem_cgroup_from_kmem() which works for both
>> MEMCG_DATA_OBJCGS and MEMCG_DATA_KMEM. For uncharged objects (like
>> init_net) it should return NULL.
> 
> Could you please elaborate with more details?
> It seems to me mem_cgroup_from_obj() does everything exactly as you say:
> - for slab objects it returns memcg taken from according slab->memcg_data
> - for ex-slab objects (i.e. page->memcg_data & MEMCG_DATA_OBJCGS)
>     page_memcg_check() returns NULL
> - for kmem objects (i.e. page->memcg_data & MEMCG_DATA_KMEM) 
>     page_memcg_check() returns objcg->memcg
> - in another cases
>     page_memcg_check() returns page->memcg_data,
>     so for uncharged objects like init_net NULL should be returned.
> 
> I can introduce exported get_mem_cgroup_from_kmem(), however it should only
> call mem_cgroup_from_obj(), perhaps under read_rcu_lock/unlock.

I think I finally got your point:
Do you mean I should use css_tryget(&memcg->css) for found memcg,
like get_mem_cgroup_from_mm() does?
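
For readers following along, the "try to get a reference" idea can be
sketched in user space as below. This is only an analogy under stated
assumptions: struct obj, obj_tryget() and obj_put() are hypothetical
names, and the kernel's css_tryget() is implemented differently. The
sketch only shows the semantics of taking a reference if and only if the
object is not already on its way out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object; a count of 0 means it is being destroyed. */
struct obj {
	atomic_int refcnt;
};

/* Take a reference only if the count is still non-zero. */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference previously taken with obj_tryget(). */
static void obj_put(struct obj *o)
{
	atomic_fetch_sub(&o->refcnt, 1);
}
```

A caller that looked the object up without holding a reference would
use the pointer only when obj_tryget() returned true, and would call
obj_put() when done.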

Thank you,
	Vasily Averin

^ permalink raw reply

* Re: [syzbot] WARNING: kmalloc bug in bpf
From: syzbot @ 2022-04-22 18:53 UTC (permalink / raw)
  To: andrii, ast, bpf, daniel, davem, hawk, jiri, john.fastabend,
	kafai, kpsingh, kuba, leonro, linux-kernel, netdev,
	songliubraving, syzkaller-bugs, torvalds, yhs
In-Reply-To: <00000000000033acbf05d1a969aa@google.com>

syzbot suspects this issue was fixed by commit:

commit 0708a0afe291bdfe1386d74d5ec1f0c27e8b9168
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Fri Mar 4 14:26:32 2022 +0000

    mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1499c6fcf00000
start commit:   1d5a47424040 sfc: The RX page_ring is optional
git tree:       net
kernel config:  https://syzkaller.appspot.com/x/.config?x=1a86c22260afac2f
dashboard link: https://syzkaller.appspot.com/bug?extid=cecf5b7071a0dfb76530
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=176738e7b00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13b4508db00000

If the result looks correct, please mark the issue as fixed by replying with:

#syz fix: mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: vxlan: vxlan_core.c: Add extack support to vxlan_fdb_delet
From: David Ahern @ 2022-04-22 18:37 UTC (permalink / raw)
  To: Alaa Mohamed; +Cc: netdev, outreachy, roopa, roopa.prabhu, jdenham, sbrivio
In-Reply-To: <c6765ff1f66cf74ba6f25ba9b1c91dfe410abcfd.1650377624.git.eng.alaamohamedsoliman.am@gmail.com>

On Tue, Apr 19, 2022 at 04:37:18PM +0200, Alaa Mohamed wrote:
> diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
> index cf2f60037340..4ecbb5878fe2 100644
> --- a/drivers/net/vxlan/vxlan_core.c
> +++ b/drivers/net/vxlan/vxlan_core.c
> @@ -1129,18 +1129,20 @@ static void vxlan_fdb_dst_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f,
> 
>  static int vxlan_fdb_parse(struct nlattr *tb[], struct vxlan_dev *vxlan,
>  			   union vxlan_addr *ip, __be16 *port, __be32 *src_vni,
> -			   __be32 *vni, u32 *ifindex, u32 *nhid)
> +			   __be32 *vni, u32 *ifindex, u32 *nhid, struct netlink_ext_ack *extack)
>  {
>  	struct net *net = dev_net(vxlan->dev);
>  	int err;
> 
>  	if (tb[NDA_NH_ID] && (tb[NDA_DST] || tb[NDA_VNI] || tb[NDA_IFINDEX] ||
>  	    tb[NDA_PORT]))
> +		NL_SET_ERR_MSG(extack, "Missing required arguments");

That's a misleading error message; I think it should be something like:
		NL_SET_ERR_MSG(extack, "DST, VNI, ifindex and port are mutually exclusive with NH_ID");

>  		return -EINVAL;
> 
>  	if (tb[NDA_DST]) {
>  		err = vxlan_nla_get_addr(ip, tb[NDA_DST]);
>  		if (err)
> +			NL_SET_ERR_MSG(extack, "Unsupported address family");
>  			return err;
>  	} else {
>  		union vxlan_addr *remote = &vxlan->default_dst.remote_ip;
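
A related C detail in the two quoted hunks: adding a second statement
under an "if" that has no braces leaves the following "return" outside
the condition. The stand-alone sketch below shows the effect;
set_err_msg() is a hypothetical stand-in for NL_SET_ERR_MSG(), and the
parse_*() functions are illustrative only, not the vxlan code.

```c
#include <stddef.h>

static const char *last_msg;

/* Hypothetical stand-in for NL_SET_ERR_MSG(): records the message. */
static void set_err_msg(const char *msg)
{
	last_msg = msg;
}

/* Same statement structure as the quoted hunk: no braces. */
static int parse_unbraced(int err)
{
	if (err)
		set_err_msg("Unsupported address family");
	return err;	/* not guarded by the if: always executed */

	return 1;	/* "rest of the parsing", never reached */
}

/* With braces, the early return happens only on error. */
static int parse_braced(int err)
{
	if (err) {
		set_err_msg("Unsupported address family");
		return err;
	}

	return 1;	/* "rest of the parsing" */
}
```

With err == 0 the unbraced variant still returns early, while the
braced variant goes on to the rest of the function.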

^ permalink raw reply

* Re: [PATCH net-next 2/5] net: dsa: add out-of-band tagging protocol
From: Florian Fainelli @ 2022-04-22 18:28 UTC (permalink / raw)
  To: Maxime Chevallier, davem, Rob Herring
  Cc: netdev, linux-kernel, devicetree, thomas.petazzoni, Andrew Lunn,
	Heiner Kallweit, Russell King, linux-arm-kernel, Vladimir Oltean,
	Luka Perkov, Robert Marko
In-Reply-To: <20220422180305.301882-3-maxime.chevallier@bootlin.com>

On 4/22/22 11:03, Maxime Chevallier wrote:
> This tagging protocol is designed for the situation where the link
> between the MAC and the Switch is designed such that the Destination
> Port, which is usually embedded in some part of the Ethernet Header, is
> sent out-of-band, and isn't present at all in the Ethernet frame.
> 
> This can happen when the MAC and Switch are tightly integrated on an
> SoC, as is the case with the Qualcomm IPQ4019 for example, where the DSA
> tag is inserted directly into the DMA descriptors. In that case,
> the MAC driver is responsible for sending the tag to the switch using
> the out-of-band medium. To do so, the MAC driver needs to have the
> information of the destination port for that skb.
> 
> This tagging protocol relies on a new set of fields in skb->shinfo to
> transmit the dsa tagging information to and from the MAC driver.
> 
> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>

First off, I am not a big fan of expanding skb::shared_info because it 
is sensitive to cache line sizes and is critical for performance at much 
higher speeds; I would expect Eric and Jakub to not be terribly happy 
about it.

The Broadcom systemport (bcmsysport.c) has a mode where it can extract 
the Broadcom tag and put it in front of the actual packet contents which 
appears to be very similar here. From there on, you can have two strategies:

- have the Ethernet controller mangle the packet contents such that the 
QCA tag is located in front of the actual Ethernet frame and create a 
new tagging protocol variant for QCA, similar to the TAG_BRCM versus 
TAG_BRCM_PREPEND

- provide the necessary information for the tagger to work using an out 
of band mechanism, which is what you have done, in which case, maybe you 
can use skb->cb[] instead of using skb::shared_info?
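
As a rough illustration of the skb->cb[] idea, here is a user-space
sketch. Everything in it is an assumption made for illustration: struct
fake_skb only models the 48-byte control buffer, and struct oob_tag_cb,
tagger_set_port() and mac_get_port() are hypothetical names that do not
come from the patch. Whether cb[] contents actually survive between the
tagger and the MAC driver would still need to be checked.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for struct sk_buff; only the control buffer is modelled. */
struct fake_skb {
	_Alignas(8) char cb[48];
};

/* Hypothetical out-of-band tag layout stored in cb[]. */
struct oob_tag_cb {
	unsigned int port;
};

/* The private layout must fit in the control buffer. */
static_assert(sizeof(struct oob_tag_cb) <= sizeof(((struct fake_skb *)0)->cb),
	      "tag must fit in cb[]");

/* Tagger side: record the destination port for the MAC driver. */
static void tagger_set_port(struct fake_skb *skb, unsigned int port)
{
	struct oob_tag_cb tag = { .port = port };

	memcpy(skb->cb, &tag, sizeof(tag));
}

/* MAC driver side: read the port back, e.g. to fill a DMA descriptor. */
static unsigned int mac_get_port(const struct fake_skb *skb)
{
	struct oob_tag_cb tag;

	memcpy(&tag, skb->cb, sizeof(tag));
	return tag.port;
}
```

Kernel code usually casts skb->cb to the private struct directly; the
sketch uses memcpy() only to stay within strict user-space C rules.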
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v2] net: phy: broadcom: 1588 support on bcm54210pe
From: Jonathan Lemon @ 2022-04-22 18:20 UTC (permalink / raw)
  To: Lasse Johnsen
  Cc: Richard Cochran, netdev, Gordon Hollingworth, Ahmad Byagowi,
	Florian Fainelli, Andrew Lunn, Heiner Kallweit, Russell King,
	bcm-kernel-feedback-list, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <567C8D9F-BF2B-4DE6-8991-DB86A845C49C@timebeat.app>

On 22 Apr 2022, at 11:11, Lasse Johnsen wrote:

> Hi Jonathan,
>
> I suspect you make the conflation I also made when I started working on this PHY driver. Broadcom has a number of different, nearly identical chips. The BCM54210, the BCM54210E, the BCM54210PE, the BCM54210S and the BCM54210SE.
>
> It’s hard to imagine, but only the BCM54210PE is a first generation PHY and the BCM54210 (and others) are second generation. I have to be mighty careful not to breach my NDA, but I can furnish you with these quotes directly from the Broadcom engineers I worked with during the development:
>
> 24 March:
>
> "The BCM54210PE is the first-gen 40-nm GPHY, but the BCM54210 is the second-gen 40-nm GPHY.”
>
> "The 1588 Inband function only applied to BCM54210 or later PHYs. It doesn't be supported in the BCM54210PE”
>
> So, I quite agree with you that in-band would be preferable (subject to the issue with hawking the reserved field used in 1588-2019 I described in my note to Richard), but I am convinced that it is not supported in the BCM54210PE. Indeed if you are looking at a document describing features based on the RDB register access method it is not supported by the BCM54210PE.

Uhm, I have in-band timestamps working for RX on an RPI CM4.
—
Jonathan

^ permalink raw reply

* Re: [PATCH net-next v2] net: phy: broadcom: 1588 support on bcm54210pe
From: Lasse Johnsen @ 2022-04-22 18:11 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Richard Cochran, netdev, Gordon Hollingworth, Ahmad Byagowi,
	Florian Fainelli, Andrew Lunn, Heiner Kallweit, Russell King,
	bcm-kernel-feedback-list, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <20220422152209.cwofghzr2wyxopek@bsd-mbp.local>

Hi Jonathan,

I suspect you make the conflation I also made when I started working on this PHY driver. Broadcom has a number of different, nearly identical chips. The BCM54210, the BCM54210E, the BCM54210PE, the BCM54210S and the BCM54210SE.

It’s hard to imagine, but only the BCM54210PE is a first generation PHY and the BCM54210 (and others) are second generation. I have to be mighty careful not to breach my NDA, but I can furnish you with these quotes directly from the Broadcom engineers I worked with during the development:

24 March:

"The BCM54210PE is the first-gen 40-nm GPHY, but the BCM54210 is the second-gen 40-nm GPHY."

"The 1588 Inband function only applied to BCM54210 or later PHYs. It doesn't be supported in the BCM54210PE"

So, I quite agree with you that in-band would be preferable (subject to the issue with hawking the reserved field used in 1588-2019 I described in my note to Richard), but I am convinced that it is not supported in the BCM54210PE. Indeed if you are looking at a document describing features based on the RDB register access method it is not supported by the BCM54210PE.

I would like nothing better than to be wrong, but you will need to provide me with something substantial to investigate further. (Offline if the NDA requires it - happy to discuss any time).

In any event, I’m sure the time is not wasted and will be relevant when the Raspberry PI CM5,6&7 is launched… :-)

Thank you for your note and all the best,

Lasse

> On 22 Apr 2022, at 16:22, Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
> 
> On Fri, Apr 22, 2022 at 04:08:18PM +0100, Lasse Johnsen wrote:
>>> On 21 Apr 2022, at 15:48, Richard Cochran <richardcochran@gmail.com> wrote:
>>> Moreover: Does this device provide in-band Rx time stamps?  If so, why
>>> not use them?
>> 
>> This is the first generation PHY and it does not do in-band RX. I asked BCM and studied the documentation. I’m sure I’m allowed to say, that the second generation 40nm BCM PHY (which - "I am not making this up" is available in 3 versions: BCM54210, BCM54210S and BCM54210SE - not “PE”) - supports in-band rx timestamps. However, as a matter of curiosity, BCM utilise the field in the header now used for minor versioning in 1588-2019, so in due course using this silicon feature will be a significant challenge.
> 
> Actually, it does support in-band RX timestamps.  Doing this would be
> cleaner, and you'd only need to capture TX timestamps.
> -- 
> Jonathan

^ permalink raw reply

* [PATCH net-next 1/5] net: ipqess: introduce the Qualcomm IPQESS driver
From: Maxime Chevallier @ 2022-04-22 18:03 UTC (permalink / raw)
  To: davem, Rob Herring
  Cc: Maxime Chevallier, netdev, linux-kernel, devicetree,
	thomas.petazzoni, Andrew Lunn, Florian Fainelli, Heiner Kallweit,
	Russell King, linux-arm-kernel, Vladimir Oltean, Luka Perkov,
	Robert Marko
In-Reply-To: <20220422180305.301882-1-maxime.chevallier@bootlin.com>

The Qualcomm IPQESS controller is a simple 1G Ethernet controller found
on the IPQ4019 chip. This controller has some specificities, in that the
IPQ4019 platform that includes that controller also has an internal
switch, based on the QCA8K IP.

It is connected to that switch through an internal link, and doesn't
directly expose any external interface, hence it only supports
PHY_INTERFACE_MODE_INTERNAL for now.

It has 16 RX and TX queues, with a very basic RSS fanout configured at
init time.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
 MAINTAINERS                                   |    6 +
 drivers/net/ethernet/qualcomm/Kconfig         |   11 +
 drivers/net/ethernet/qualcomm/Makefile        |    2 +
 drivers/net/ethernet/qualcomm/ipqess/Makefile |    8 +
 drivers/net/ethernet/qualcomm/ipqess/ipqess.c | 1235 +++++++++++++++++
 drivers/net/ethernet/qualcomm/ipqess/ipqess.h |  515 +++++++
 .../ethernet/qualcomm/ipqess/ipqess_ethtool.c |  168 +++
 7 files changed, 1945 insertions(+)
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/Makefile
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/ipqess.c
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/ipqess.h
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/ipqess_ethtool.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 9b0480f1b153..29e6ec4f975a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16308,6 +16308,12 @@ L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/qualcomm/emac/
 
+QUALCOMM IPQESS ETHERNET DRIVER
+M:	Maxime Chevallier <maxime.chevallier@bootlin.com>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	drivers/net/ethernet/qualcomm/ipqess/
+
 QUALCOMM ETHQOS ETHERNET DRIVER
 M:	Vinod Koul <vkoul@kernel.org>
 L:	netdev@vger.kernel.org
diff --git a/drivers/net/ethernet/qualcomm/Kconfig b/drivers/net/ethernet/qualcomm/Kconfig
index a4434eb38950..a723ddbea248 100644
--- a/drivers/net/ethernet/qualcomm/Kconfig
+++ b/drivers/net/ethernet/qualcomm/Kconfig
@@ -60,6 +60,17 @@ config QCOM_EMAC
 	  low power, Receive-Side Scaling (RSS), and IEEE 1588-2008
 	  Precision Clock Synchronization Protocol.
 
+config QCOM_IPQ4019_ESS_EDMA
+	tristate "Qualcomm Atheros IPQ4019 ESS EDMA support"
+	depends on OF
+	select PHYLINK
+	help
+	  This driver supports the Qualcomm Atheros IPQ40xx built-in
+	  ESS EDMA ethernet controller.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called ipq_ess.
+
 source "drivers/net/ethernet/qualcomm/rmnet/Kconfig"
 
 endif # NET_VENDOR_QUALCOMM
diff --git a/drivers/net/ethernet/qualcomm/Makefile b/drivers/net/ethernet/qualcomm/Makefile
index 9250976dd884..db463c9ea1f9 100644
--- a/drivers/net/ethernet/qualcomm/Makefile
+++ b/drivers/net/ethernet/qualcomm/Makefile
@@ -11,4 +11,6 @@ qcauart-objs := qca_uart.o
 
 obj-y += emac/
 
+obj-$(CONFIG_QCOM_IPQ4019_ESS_EDMA) += ipqess/
+
 obj-$(CONFIG_RMNET) += rmnet/
diff --git a/drivers/net/ethernet/qualcomm/ipqess/Makefile b/drivers/net/ethernet/qualcomm/ipqess/Makefile
new file mode 100644
index 000000000000..4f2db7283ebf
--- /dev/null
+++ b/drivers/net/ethernet/qualcomm/ipqess/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for the IPQ ESS driver
+#
+
+obj-$(CONFIG_QCOM_IPQ4019_ESS_EDMA) += ipq_ess.o
+
+ipq_ess-objs := ipqess.o ipqess_ethtool.o
diff --git a/drivers/net/ethernet/qualcomm/ipqess/ipqess.c b/drivers/net/ethernet/qualcomm/ipqess/ipqess.c
new file mode 100644
index 000000000000..4ecb8c65417b
--- /dev/null
+++ b/drivers/net/ethernet/qualcomm/ipqess/ipqess.c
@@ -0,0 +1,1235 @@
+// SPDX-License-Identifier: GPL-2.0 OR ISC
+/* Copyright (c) 2014 - 2017, The Linux Foundation. All rights reserved.
+ * Copyright (c) 2017 - 2018, John Crispin <john@phrozen.org>
+ * Copyright (c) 2018 - 2019, Christian Lamparter <chunkeey@gmail.com>
+ * Copyright (c) 2020 - 2021, Gabor Juhos <j4g8y7@gmail.com>
+ * Copyright (c) 2021 - 2022, Maxime Chevallier <maxime.chevallier@bootlin.com>
+ *
+ */
+
+#include <linux/bitfield.h>
+#include <linux/if_vlan.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_device.h>
+#include <linux/of_mdio.h>
+#include <linux/of_net.h>
+#include <linux/phylink.h>
+#include <linux/platform_device.h>
+#include <linux/skbuff.h>
+#include <linux/vmalloc.h>
+#include <net/checksum.h>
+#include <net/ip6_checksum.h>
+
+#include "ipqess.h"
+
+#define IPQESS_RRD_SIZE		16
+#define IPQESS_NEXT_IDX(X, Y)  (((X) + 1) & ((Y) - 1))
+#define IPQESS_TX_DMA_BUF_LEN	0x3fff
+
+static void ipqess_w32(struct ipqess *ess, u32 reg, u32 val)
+{
+	writel(val, ess->hw_addr + reg);
+}
+
+static u32 ipqess_r32(struct ipqess *ess, u16 reg)
+{
+	return readl(ess->hw_addr + reg);
+}
+
+static void ipqess_m32(struct ipqess *ess, u32 mask, u32 val, u16 reg)
+{
+	u32 _val = ipqess_r32(ess, reg);
+
+	_val &= ~mask;
+	_val |= val;
+
+	ipqess_w32(ess, reg, _val);
+}
+
+void ipqess_update_hw_stats(struct ipqess *ess)
+{
+	u32 *p;
+	u32 stat;
+	int i;
+
+	lockdep_assert_held(&ess->stats_lock);
+
+	p = (u32 *)&ess->ipqess_stats;
+	for (i = 0; i < IPQESS_MAX_TX_QUEUE; i++) {
+		stat = ipqess_r32(ess, IPQESS_REG_TX_STAT_PKT_Q(i));
+		*p += stat;
+		p++;
+	}
+
+	for (i = 0; i < IPQESS_MAX_TX_QUEUE; i++) {
+		stat = ipqess_r32(ess, IPQESS_REG_TX_STAT_BYTE_Q(i));
+		*p += stat;
+		p++;
+	}
+
+	for (i = 0; i < IPQESS_MAX_RX_QUEUE; i++) {
+		stat = ipqess_r32(ess, IPQESS_REG_RX_STAT_PKT_Q(i));
+		*p += stat;
+		p++;
+	}
+
+	for (i = 0; i < IPQESS_MAX_RX_QUEUE; i++) {
+		stat = ipqess_r32(ess, IPQESS_REG_RX_STAT_BYTE_Q(i));
+		*p += stat;
+		p++;
+	}
+}
+
+static int ipqess_tx_ring_alloc(struct ipqess *ess)
+{
+	struct device *dev = &ess->pdev->dev;
+	int i;
+
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		struct ipqess_tx_ring *tx_ring = &ess->tx_ring[i];
+		size_t size;
+		u32 idx;
+
+		tx_ring->ess = ess;
+		tx_ring->ring_id = i;
+		tx_ring->idx = i * 4;
+		tx_ring->count = IPQESS_TX_RING_SIZE;
+		tx_ring->nq = netdev_get_tx_queue(ess->netdev, i);
+
+		size = sizeof(struct ipqess_buf) * IPQESS_TX_RING_SIZE;
+		tx_ring->buf = devm_kzalloc(dev, size, GFP_KERNEL);
+		if (!tx_ring->buf) {
+			netdev_err(ess->netdev, "buffer alloc of tx ring failed\n");
+			return -ENOMEM;
+		}
+
+		size = sizeof(struct ipqess_tx_desc) * IPQESS_TX_RING_SIZE;
+		tx_ring->hw_desc = dmam_alloc_coherent(dev, size, &tx_ring->dma,
+						       GFP_KERNEL | __GFP_ZERO);
+		if (!tx_ring->hw_desc) {
+			netdev_err(ess->netdev, "descriptor allocation for tx ring failed\n");
+			return -ENOMEM;
+		}
+
+		ipqess_w32(ess, IPQESS_REG_TPD_BASE_ADDR_Q(tx_ring->idx),
+			   (u32)tx_ring->dma);
+
+		idx = ipqess_r32(ess, IPQESS_REG_TPD_IDX_Q(tx_ring->idx));
+		idx >>= IPQESS_TPD_CONS_IDX_SHIFT; /* need u32 here */
+		idx &= 0xffff;
+		tx_ring->head = idx;
+		tx_ring->tail = idx;
+
+		ipqess_m32(ess, IPQESS_TPD_PROD_IDX_MASK << IPQESS_TPD_PROD_IDX_SHIFT,
+			   idx, IPQESS_REG_TPD_IDX_Q(tx_ring->idx));
+		ipqess_w32(ess, IPQESS_REG_TX_SW_CONS_IDX_Q(tx_ring->idx), idx);
+		ipqess_w32(ess, IPQESS_REG_TPD_RING_SIZE, IPQESS_TX_RING_SIZE);
+	}
+
+	return 0;
+}
+
+static int ipqess_tx_unmap_and_free(struct device *dev, struct ipqess_buf *buf)
+{
+	int len = 0;
+
+	if (buf->flags & IPQESS_DESC_SINGLE)
+		dma_unmap_single(dev, buf->dma, buf->length, DMA_TO_DEVICE);
+	else if (buf->flags & IPQESS_DESC_PAGE)
+		dma_unmap_page(dev, buf->dma, buf->length, DMA_TO_DEVICE);
+
+	if (buf->flags & IPQESS_DESC_LAST) {
+		len = buf->skb->len;
+		dev_kfree_skb_any(buf->skb);
+	}
+
+	buf->flags = 0;
+
+	return len;
+}
+
+static void ipqess_tx_ring_free(struct ipqess *ess)
+{
+	int i;
+
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		int j;
+
+		if (!ess->tx_ring[i].hw_desc)
+			continue;
+
+		for (j = 0; j < IPQESS_TX_RING_SIZE; j++) {
+			struct ipqess_buf *buf = &ess->tx_ring[i].buf[j];
+
+			ipqess_tx_unmap_and_free(&ess->pdev->dev, buf);
+		}
+
+		ess->tx_ring[i].buf = NULL;
+	}
+}
+
+static int ipqess_rx_buf_prepare(struct ipqess_buf *buf,
+				 struct ipqess_rx_ring *rx_ring)
+{
+	memset(buf->skb->data, 0, sizeof(struct ipqess_rx_desc));
+
+	buf->dma = dma_map_single(rx_ring->ppdev, buf->skb->data,
+				  IPQESS_RX_HEAD_BUFF_SIZE, DMA_FROM_DEVICE);
+	if (dma_mapping_error(rx_ring->ppdev, buf->dma)) {
+		dev_err_once(rx_ring->ppdev,
+			     "IPQESS DMA mapping failed for linear address %pad\n",
+			     &buf->dma);
+		dev_kfree_skb_any(buf->skb);
+		buf->skb = NULL;
+		return -EFAULT;
+	}
+
+	buf->length = IPQESS_RX_HEAD_BUFF_SIZE;
+	rx_ring->hw_desc[rx_ring->head] = (struct ipqess_rx_desc *)buf->dma;
+	rx_ring->head = (rx_ring->head + 1) % IPQESS_RX_RING_SIZE;
+
+	ipqess_m32(rx_ring->ess, IPQESS_RFD_PROD_IDX_BITS,
+		   (rx_ring->head + IPQESS_RX_RING_SIZE - 1) % IPQESS_RX_RING_SIZE,
+		   IPQESS_REG_RFD_IDX_Q(rx_ring->idx));
+
+	return 0;
+}
+
+/* locking is handled by the caller */
+static int ipqess_rx_buf_alloc_napi(struct ipqess_rx_ring *rx_ring)
+{
+	struct ipqess_buf *buf = &rx_ring->buf[rx_ring->head];
+
+	buf->skb = napi_alloc_skb(&rx_ring->napi_rx, IPQESS_RX_HEAD_BUFF_SIZE);
+	if (!buf->skb)
+		return -ENOMEM;
+
+	return ipqess_rx_buf_prepare(buf, rx_ring);
+}
+
+static int ipqess_rx_buf_alloc(struct ipqess_rx_ring *rx_ring)
+{
+	struct ipqess_buf *buf = &rx_ring->buf[rx_ring->head];
+
+	buf->skb = netdev_alloc_skb_ip_align(rx_ring->ess->netdev,
+					     IPQESS_RX_HEAD_BUFF_SIZE);
+
+	if (!buf->skb)
+		return -ENOMEM;
+
+	return ipqess_rx_buf_prepare(buf, rx_ring);
+}
+
+static void ipqess_refill_work(struct work_struct *work)
+{
+	struct ipqess_rx_ring_refill *rx_refill = container_of(work,
+		struct ipqess_rx_ring_refill, refill_work);
+	struct ipqess_rx_ring *rx_ring = rx_refill->rx_ring;
+	int refill = 0;
+
+	/* don't let this loop by accident. */
+	while (atomic_dec_and_test(&rx_ring->refill_count)) {
+		napi_disable(&rx_ring->napi_rx);
+		if (ipqess_rx_buf_alloc(rx_ring)) {
+			refill++;
+			dev_dbg(rx_ring->ppdev,
+				"Not all buffers were reallocated\n");
+		}
+		napi_enable(&rx_ring->napi_rx);
+	}
+
+	if (atomic_add_return(refill, &rx_ring->refill_count))
+		schedule_work(&rx_refill->refill_work);
+}
+
+static int ipqess_rx_ring_alloc(struct ipqess *ess)
+{
+	int i;
+
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		int j;
+
+		ess->rx_ring[i].ess = ess;
+		ess->rx_ring[i].ppdev = &ess->pdev->dev;
+		ess->rx_ring[i].ring_id = i;
+		ess->rx_ring[i].idx = i * 2;
+
+		ess->rx_ring[i].buf = devm_kzalloc(&ess->pdev->dev,
+						   sizeof(struct ipqess_buf) * IPQESS_RX_RING_SIZE,
+						   GFP_KERNEL);
+
+		if (!ess->rx_ring[i].buf)
+			return -ENOMEM;
+
+		ess->rx_ring[i].hw_desc =
+			dmam_alloc_coherent(&ess->pdev->dev,
+					    sizeof(struct ipqess_rx_desc) * IPQESS_RX_RING_SIZE,
+					    &ess->rx_ring[i].dma, GFP_KERNEL);
+
+		if (!ess->rx_ring[i].hw_desc)
+			return -ENOMEM;
+
+		for (j = 0; j < IPQESS_RX_RING_SIZE; j++)
+			if (ipqess_rx_buf_alloc(&ess->rx_ring[i]) < 0)
+				return -ENOMEM;
+
+		ess->rx_refill[i].rx_ring = &ess->rx_ring[i];
+		INIT_WORK(&ess->rx_refill[i].refill_work, ipqess_refill_work);
+
+		ipqess_w32(ess, IPQESS_REG_RFD_BASE_ADDR_Q(ess->rx_ring[i].idx),
+			   (u32)(ess->rx_ring[i].dma));
+	}
+
+	ipqess_w32(ess, IPQESS_REG_RX_DESC0,
+		   (IPQESS_RX_HEAD_BUFF_SIZE << IPQESS_RX_BUF_SIZE_SHIFT) |
+		   (IPQESS_RX_RING_SIZE << IPQESS_RFD_RING_SIZE_SHIFT));
+
+	return 0;
+}
+
+static void ipqess_rx_ring_free(struct ipqess *ess)
+{
+	int i;
+
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		int j;
+
+		atomic_set(&ess->rx_ring[i].refill_count, 0);
+		cancel_work_sync(&ess->rx_refill[i].refill_work);
+
+		for (j = 0; j < IPQESS_RX_RING_SIZE; j++) {
+			dma_unmap_single(&ess->pdev->dev,
+					 ess->rx_ring[i].buf[j].dma,
+					 ess->rx_ring[i].buf[j].length,
+					 DMA_FROM_DEVICE);
+			dev_kfree_skb_any(ess->rx_ring[i].buf[j].skb);
+		}
+	}
+}
+
+static struct net_device_stats *ipqess_get_stats(struct net_device *netdev)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+
+	spin_lock(&ess->stats_lock);
+	ipqess_update_hw_stats(ess);
+	spin_unlock(&ess->stats_lock);
+
+	return &ess->stats;
+}
+
+static int ipqess_rx_poll(struct ipqess_rx_ring *rx_ring, int budget)
+{
+	u32 length = 0, num_desc, tail, rx_ring_tail;
+	int done = 0;
+
+	rx_ring_tail = rx_ring->tail;
+
+	tail = ipqess_r32(rx_ring->ess, IPQESS_REG_RFD_IDX_Q(rx_ring->idx));
+	tail >>= IPQESS_RFD_CONS_IDX_SHIFT;
+	tail &= IPQESS_RFD_CONS_IDX_MASK;
+
+	while (done < budget) {
+		struct sk_buff *skb;
+		struct ipqess_rx_desc *rd;
+
+		if (rx_ring_tail == tail)
+			break;
+
+		dma_unmap_single(rx_ring->ppdev,
+				 rx_ring->buf[rx_ring_tail].dma,
+				 rx_ring->buf[rx_ring_tail].length,
+				 DMA_FROM_DEVICE);
+
+		skb = xchg(&rx_ring->buf[rx_ring_tail].skb, NULL);
+		rd = (struct ipqess_rx_desc *)skb->data;
+		rx_ring_tail = IPQESS_NEXT_IDX(rx_ring_tail, IPQESS_RX_RING_SIZE);
+
+		/* Check if RRD is valid */
+		if (!(rd->rrd7 & IPQESS_RRD_DESC_VALID)) {
+			num_desc = 1;
+			dev_kfree_skb_any(skb);
+			goto skip;
+		}
+
+		num_desc = rd->rrd1 & IPQESS_RRD_NUM_RFD_MASK;
+		length = rd->rrd6 & IPQESS_RRD_PKT_SIZE_MASK;
+
+		skb_reserve(skb, IPQESS_RRD_SIZE);
+		if (num_desc > 1) {
+			struct sk_buff *skb_prev = NULL;
+			int size_remaining;
+			int i;
+
+			skb->data_len = 0;
+			skb->tail += (IPQESS_RX_HEAD_BUFF_SIZE - IPQESS_RRD_SIZE);
+			skb->len = length;
+			skb->truesize = length;
+			size_remaining = length - (IPQESS_RX_HEAD_BUFF_SIZE - IPQESS_RRD_SIZE);
+
+			for (i = 1; i < num_desc; i++) {
+				struct sk_buff *skb_temp = rx_ring->buf[rx_ring_tail].skb;
+
+				dma_unmap_single(rx_ring->ppdev,
+						 rx_ring->buf[rx_ring_tail].dma,
+						 rx_ring->buf[rx_ring_tail].length,
+						 DMA_FROM_DEVICE);
+
+				skb_put(skb_temp, min(size_remaining, IPQESS_RX_HEAD_BUFF_SIZE));
+				if (skb_prev)
+					skb_prev->next = rx_ring->buf[rx_ring_tail].skb;
+				else
+					skb_shinfo(skb)->frag_list = rx_ring->buf[rx_ring_tail].skb;
+				skb_prev = rx_ring->buf[rx_ring_tail].skb;
+				rx_ring->buf[rx_ring_tail].skb->next = NULL;
+
+				skb->data_len += rx_ring->buf[rx_ring_tail].skb->len;
+				size_remaining -= rx_ring->buf[rx_ring_tail].skb->len;
+
+				rx_ring_tail = IPQESS_NEXT_IDX(rx_ring_tail, IPQESS_RX_RING_SIZE);
+			}
+
+		} else {
+			skb_put(skb, length);
+		}
+
+		skb->dev = rx_ring->ess->netdev;
+		skb->protocol = eth_type_trans(skb, rx_ring->ess->netdev);
+		skb_record_rx_queue(skb, rx_ring->ring_id);
+
+		if (rd->rrd6 & IPQESS_RRD_CSUM_FAIL_MASK)
+			skb_checksum_none_assert(skb);
+		else
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+		if (rd->rrd7 & IPQESS_RRD_CVLAN)
+			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+					       rd->rrd4);
+		else if (rd->rrd1 & IPQESS_RRD_SVLAN)
+			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD),
+					       rd->rrd4);
+
+		napi_gro_receive(&rx_ring->napi_rx, skb);
+
+		rx_ring->ess->stats.rx_packets++;
+		rx_ring->ess->stats.rx_bytes += length;
+
+		done++;
+skip:
+
+		num_desc += atomic_xchg(&rx_ring->refill_count, 0);
+		while (num_desc) {
+			if (ipqess_rx_buf_alloc_napi(rx_ring)) {
+				num_desc = atomic_add_return(num_desc,
+							     &rx_ring->refill_count);
+				if (num_desc >= ((4 * IPQESS_RX_RING_SIZE + 6) / 7))
+					schedule_work(&rx_ring->ess->rx_refill[rx_ring->ring_id].refill_work);
+				break;
+			}
+			num_desc--;
+		}
+	}
+
+	ipqess_w32(rx_ring->ess, IPQESS_REG_RX_SW_CONS_IDX_Q(rx_ring->idx),
+		   rx_ring_tail);
+	rx_ring->tail = rx_ring_tail;
+
+	return done;
+}
+
+static int ipqess_tx_complete(struct ipqess_tx_ring *tx_ring, int budget)
+{
+	int total = 0, ret;
+	int done = 0;
+	u32 tail;
+
+	tail = ipqess_r32(tx_ring->ess, IPQESS_REG_TPD_IDX_Q(tx_ring->idx));
+	tail >>= IPQESS_TPD_CONS_IDX_SHIFT;
+	tail &= IPQESS_TPD_CONS_IDX_MASK;
+
+	while ((tx_ring->tail != tail) && (done < budget)) {
+		ret = ipqess_tx_unmap_and_free(&tx_ring->ess->pdev->dev,
+					       &tx_ring->buf[tx_ring->tail]);
+		tx_ring->tail = IPQESS_NEXT_IDX(tx_ring->tail, tx_ring->count);
+
+		if (ret) {
+			total += ret;
+			done++;
+		}
+	}
+
+	ipqess_w32(tx_ring->ess, IPQESS_REG_TX_SW_CONS_IDX_Q(tx_ring->idx),
+		   tx_ring->tail);
+
+	if (netif_tx_queue_stopped(tx_ring->nq)) {
+		netdev_dbg(tx_ring->ess->netdev, "waking up tx queue %d\n",
+			   tx_ring->idx);
+		netif_tx_wake_queue(tx_ring->nq);
+	}
+
+	netdev_tx_completed_queue(tx_ring->nq, done, total);
+
+	return done;
+}
+
+static int ipqess_tx_napi(struct napi_struct *napi, int budget)
+{
+	struct ipqess_tx_ring *tx_ring = container_of(napi, struct ipqess_tx_ring,
+						    napi_tx);
+	int work_done = 0;
+	u32 tx_status;
+
+	tx_status = ipqess_r32(tx_ring->ess, IPQESS_REG_TX_ISR);
+	tx_status &= BIT(tx_ring->idx);
+
+	work_done = ipqess_tx_complete(tx_ring, budget);
+
+	ipqess_w32(tx_ring->ess, IPQESS_REG_TX_ISR, tx_status);
+
+	if (likely(work_done < budget)) {
+		if (napi_complete_done(napi, work_done))
+			ipqess_w32(tx_ring->ess,
+				   IPQESS_REG_TX_INT_MASK_Q(tx_ring->idx), 0x1);
+	}
+
+	return work_done;
+}
+
+static int ipqess_rx_napi(struct napi_struct *napi, int budget)
+{
+	struct ipqess_rx_ring *rx_ring = container_of(napi, struct ipqess_rx_ring,
+						    napi_rx);
+	struct ipqess *ess = rx_ring->ess;
+	u32 rx_mask = BIT(rx_ring->idx);
+	int remain_budget = budget;
+	int rx_done;
+	u32 status;
+
+poll_again:
+	ipqess_w32(ess, IPQESS_REG_RX_ISR, rx_mask);
+	rx_done = ipqess_rx_poll(rx_ring, remain_budget);
+
+	if (rx_done == remain_budget)
+		return budget;
+
+	status = ipqess_r32(ess, IPQESS_REG_RX_ISR);
+	if (status & rx_mask) {
+		remain_budget -= rx_done;
+		goto poll_again;
+	}
+
+	if (napi_complete_done(napi, rx_done + budget - remain_budget))
+		ipqess_w32(ess, IPQESS_REG_RX_INT_MASK_Q(rx_ring->idx), 0x1);
+
+	return rx_done + budget - remain_budget;
+}
+
+static irqreturn_t ipqess_interrupt_tx(int irq, void *priv)
+{
+	struct ipqess_tx_ring *tx_ring = (struct ipqess_tx_ring *)priv;
+
+	if (likely(napi_schedule_prep(&tx_ring->napi_tx))) {
+		__napi_schedule(&tx_ring->napi_tx);
+		ipqess_w32(tx_ring->ess, IPQESS_REG_TX_INT_MASK_Q(tx_ring->idx),
+			   0x0);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t ipqess_interrupt_rx(int irq, void *priv)
+{
+	struct ipqess_rx_ring *rx_ring = (struct ipqess_rx_ring *)priv;
+
+	if (likely(napi_schedule_prep(&rx_ring->napi_rx))) {
+		__napi_schedule(&rx_ring->napi_rx);
+		ipqess_w32(rx_ring->ess, IPQESS_REG_RX_INT_MASK_Q(rx_ring->idx),
+			   0x0);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static void ipqess_irq_enable(struct ipqess *ess)
+{
+	int i;
+
+	ipqess_w32(ess, IPQESS_REG_RX_ISR, 0xff);
+	ipqess_w32(ess, IPQESS_REG_TX_ISR, 0xffff);
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		ipqess_w32(ess, IPQESS_REG_RX_INT_MASK_Q(ess->rx_ring[i].idx), 1);
+		ipqess_w32(ess, IPQESS_REG_TX_INT_MASK_Q(ess->tx_ring[i].idx), 1);
+	}
+}
+
+static void ipqess_irq_disable(struct ipqess *ess)
+{
+	int i;
+
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		ipqess_w32(ess, IPQESS_REG_RX_INT_MASK_Q(ess->rx_ring[i].idx), 0);
+		ipqess_w32(ess, IPQESS_REG_TX_INT_MASK_Q(ess->tx_ring[i].idx), 0);
+	}
+}
+
+static int ipqess_init(struct net_device *netdev)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+	struct device_node *of_node = ess->pdev->dev.of_node;
+	int ret;
+
+	ret = of_get_ethdev_address(of_node, netdev);
+	if (ret)
+		eth_hw_addr_random(netdev);
+
+	return phylink_of_phy_connect(ess->phylink, of_node, 0);
+}
+
+static void ipqess_uninit(struct net_device *netdev)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+
+	phylink_disconnect_phy(ess->phylink);
+}
+
+static int ipqess_open(struct net_device *netdev)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+	int i;
+
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		napi_enable(&ess->tx_ring[i].napi_tx);
+		napi_enable(&ess->rx_ring[i].napi_rx);
+	}
+	ipqess_irq_enable(ess);
+	phylink_start(ess->phylink);
+	netif_tx_start_all_queues(netdev);
+
+	return 0;
+}
+
+static int ipqess_stop(struct net_device *netdev)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+	int i;
+
+	netif_tx_stop_all_queues(netdev);
+	phylink_stop(ess->phylink);
+	ipqess_irq_disable(ess);
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		napi_disable(&ess->tx_ring[i].napi_tx);
+		napi_disable(&ess->rx_ring[i].napi_rx);
+	}
+
+	return 0;
+}
+
+static int ipqess_do_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+
+	switch (cmd) {
+	case SIOCGMIIPHY:
+	case SIOCGMIIREG:
+	case SIOCSMIIREG:
+		return phylink_mii_ioctl(ess->phylink, ifr, cmd);
+	default:
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
+static inline u16 ipqess_tx_desc_available(struct ipqess_tx_ring *tx_ring)
+{
+	u16 count = 0;
+
+	if (tx_ring->tail <= tx_ring->head)
+		count = IPQESS_TX_RING_SIZE;
+
+	count += tx_ring->tail - tx_ring->head - 1;
+
+	return count;
+}
+
+static inline int ipqess_cal_txd_req(struct sk_buff *skb)
+{
+	int tpds;
+
+	/* one TPD for the header, and one for each fragment */
+	tpds = 1 + skb_shinfo(skb)->nr_frags;
+	if (skb_is_gso(skb) && skb_is_gso_v6(skb)) {
+		/* for LSOv2 one extra TPD is needed */
+		tpds++;
+	}
+
+	return tpds;
+}
+
+static struct ipqess_buf *ipqess_get_tx_buffer(struct ipqess_tx_ring *tx_ring,
+					       struct ipqess_tx_desc *desc)
+{
+	return &tx_ring->buf[desc - tx_ring->hw_desc];
+}
+
+static struct ipqess_tx_desc *ipqess_tx_desc_next(struct ipqess_tx_ring *tx_ring)
+{
+	struct ipqess_tx_desc *desc;
+
+	desc = &tx_ring->hw_desc[tx_ring->head];
+	tx_ring->head = IPQESS_NEXT_IDX(tx_ring->head, tx_ring->count);
+
+	return desc;
+}
+
+static void ipqess_rollback_tx(struct ipqess *eth,
+			       struct ipqess_tx_desc *first_desc, int ring_id)
+{
+	struct ipqess_tx_ring *tx_ring = &eth->tx_ring[ring_id];
+	struct ipqess_tx_desc *desc = NULL;
+	struct ipqess_buf *buf;
+	u16 start_index, index;
+
+	start_index = first_desc - tx_ring->hw_desc;
+
+	index = start_index;
+	while (index != tx_ring->head) {
+		desc = &tx_ring->hw_desc[index];
+		buf = &tx_ring->buf[index];
+		ipqess_tx_unmap_and_free(&eth->pdev->dev, buf);
+		memset(desc, 0, sizeof(struct ipqess_tx_desc));
+		if (++index == tx_ring->count)
+			index = 0;
+	}
+	tx_ring->head = start_index;
+}
+
+static int ipqess_tx_map_and_fill(struct ipqess_tx_ring *tx_ring,
+				  struct sk_buff *skb)
+{
+	struct ipqess_tx_desc *desc = NULL, *first_desc = NULL;
+	u32 word1 = 0, word3 = 0, lso_word1 = 0, svlan_tag = 0;
+	struct platform_device *pdev = tx_ring->ess->pdev;
+	struct ipqess_buf *buf = NULL;
+	u16 len;
+	int i;
+
+	if (skb_is_gso(skb)) {
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
+			lso_word1 |= IPQESS_TPD_IPV4_EN;
+			ip_hdr(skb)->check = 0;
+			tcp_hdr(skb)->check = ~csum_tcpudp_magic(ip_hdr(skb)->saddr,
+								 ip_hdr(skb)->daddr,
+								 0, IPPROTO_TCP, 0);
+		} else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
+			lso_word1 |= IPQESS_TPD_LSO_V2_EN;
+			ipv6_hdr(skb)->payload_len = 0;
+			tcp_hdr(skb)->check = ~csum_ipv6_magic(&ipv6_hdr(skb)->saddr,
+							       &ipv6_hdr(skb)->daddr,
+							       0, IPPROTO_TCP, 0);
+		}
+
+		lso_word1 |= IPQESS_TPD_LSO_EN |
+			     ((skb_shinfo(skb)->gso_size & IPQESS_TPD_MSS_MASK) <<
+							   IPQESS_TPD_MSS_SHIFT) |
+			     (skb_transport_offset(skb) << IPQESS_TPD_HDR_SHIFT);
+	} else if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
+		u8 css, cso;
+
+		cso = skb_checksum_start_offset(skb);
+		css = cso + skb->csum_offset;
+
+		word1 |= (IPQESS_TPD_CUSTOM_CSUM_EN);
+		word1 |= (cso >> 1) << IPQESS_TPD_HDR_SHIFT;
+		word1 |= ((css >> 1) << IPQESS_TPD_CUSTOM_CSUM_SHIFT);
+	}
+
+	if (skb_vlan_tag_present(skb)) {
+		switch (skb->vlan_proto) {
+		case htons(ETH_P_8021Q):
+			word3 |= BIT(IPQESS_TX_INS_CVLAN);
+			word3 |= skb_vlan_tag_get(skb) << IPQESS_TX_CVLAN_TAG_SHIFT;
+			break;
+		case htons(ETH_P_8021AD):
+			word1 |= BIT(IPQESS_TX_INS_SVLAN);
+			svlan_tag = skb_vlan_tag_get(skb);
+			break;
+		default:
+			dev_err(&pdev->dev, "no ctag or stag present\n");
+			goto vlan_tag_error;
+		}
+	}
+
+	if (eth_type_vlan(skb->protocol))
+		word1 |= IPQESS_TPD_VLAN_TAGGED;
+
+	if (skb->protocol == htons(ETH_P_PPP_SES))
+		word1 |= IPQESS_TPD_PPPOE_EN;
+
+	len = skb_headlen(skb);
+
+	first_desc = ipqess_tx_desc_next(tx_ring);
+	desc = first_desc;
+	if (lso_word1 & IPQESS_TPD_LSO_V2_EN) {
+		desc->addr = cpu_to_le16(skb->len);
+		desc->word1 = word1 | lso_word1;
+		desc->svlan_tag = svlan_tag;
+		desc->word3 = word3;
+		desc = ipqess_tx_desc_next(tx_ring);
+	}
+
+	buf = ipqess_get_tx_buffer(tx_ring, desc);
+	buf->length = len;
+	buf->dma = dma_map_single(&pdev->dev, skb->data, len, DMA_TO_DEVICE);
+
+	if (dma_mapping_error(&pdev->dev, buf->dma))
+		goto dma_error;
+
+	desc->addr = cpu_to_le32(buf->dma);
+	desc->len  = cpu_to_le16(len);
+
+	buf->flags |= IPQESS_DESC_SINGLE;
+	desc->word1 = word1 | lso_word1;
+	desc->svlan_tag = svlan_tag;
+	desc->word3 = word3;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		len = skb_frag_size(frag);
+		desc = ipqess_tx_desc_next(tx_ring);
+		buf = ipqess_get_tx_buffer(tx_ring, desc);
+		buf->length = len;
+		buf->flags |= IPQESS_DESC_PAGE;
+		buf->dma = skb_frag_dma_map(&pdev->dev, frag, 0, len,
+					    DMA_TO_DEVICE);
+
+		if (dma_mapping_error(&pdev->dev, buf->dma))
+			goto dma_error;
+
+		desc->addr = cpu_to_le32(buf->dma);
+		desc->len  = cpu_to_le16(len);
+		desc->svlan_tag = svlan_tag;
+		desc->word1 = word1 | lso_word1;
+		desc->word3 = word3;
+	}
+	desc->word1 |= 1 << IPQESS_TPD_EOP_SHIFT;
+	buf->skb = skb;
+	buf->flags |= IPQESS_DESC_LAST;
+
+	return 0;
+
+dma_error:
+	ipqess_rollback_tx(tx_ring->ess, first_desc, tx_ring->ring_id);
+	dev_err(&pdev->dev, "TX DMA map failed\n");
+
+vlan_tag_error:
+	return -ENOMEM;
+}
+
+static inline void ipqess_kick_tx(struct ipqess_tx_ring *tx_ring)
+{
+	/* Ensure that all TPDs have been written completely */
+	dma_wmb();
+
+	/* update software producer index */
+	ipqess_w32(tx_ring->ess, IPQESS_REG_TPD_IDX_Q(tx_ring->idx),
+		   tx_ring->head);
+}
+
+static netdev_tx_t ipqess_xmit(struct sk_buff *skb, struct net_device *netdev)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+	struct ipqess_tx_ring *tx_ring;
+	int avail;
+	int tx_num;
+	int ret;
+
+	tx_ring = &ess->tx_ring[skb_get_queue_mapping(skb)];
+	tx_num = ipqess_cal_txd_req(skb);
+	avail = ipqess_tx_desc_available(tx_ring);
+	if (avail < tx_num) {
+		netdev_dbg(netdev,
+			   "stopping tx queue %d, avail=%d req=%d im=%x\n",
+			   tx_ring->idx, avail, tx_num,
+			   ipqess_r32(tx_ring->ess,
+				      IPQESS_REG_TX_INT_MASK_Q(tx_ring->idx)));
+		netif_tx_stop_queue(tx_ring->nq);
+		ipqess_w32(tx_ring->ess, IPQESS_REG_TX_INT_MASK_Q(tx_ring->idx), 0x1);
+		ipqess_kick_tx(tx_ring);
+		return NETDEV_TX_BUSY;
+	}
+
+	ret = ipqess_tx_map_and_fill(tx_ring, skb);
+	if (ret) {
+		dev_kfree_skb_any(skb);
+		ess->stats.tx_errors++;
+		goto err_out;
+	}
+
+	ess->stats.tx_packets++;
+	ess->stats.tx_bytes += skb->len;
+	netdev_tx_sent_queue(tx_ring->nq, skb->len);
+
+	if (!netdev_xmit_more() || netif_xmit_stopped(tx_ring->nq))
+		ipqess_kick_tx(tx_ring);
+
+err_out:
+	return NETDEV_TX_OK;
+}
+
+static int ipqess_set_mac_address(struct net_device *netdev, void *p)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+	const u8 *macaddr = netdev->dev_addr;
+	int ret = eth_mac_addr(netdev, p);
+
+	if (ret)
+		return ret;
+
+	ipqess_w32(ess, IPQESS_REG_MAC_CTRL1, (macaddr[0] << 8) | macaddr[1]);
+	ipqess_w32(ess, IPQESS_REG_MAC_CTRL0,
+		   (macaddr[2] << 24) | (macaddr[3] << 16) | (macaddr[4] << 8) |
+		    macaddr[5]);
+
+	return 0;
+}
+
+static void ipqess_tx_timeout(struct net_device *netdev, unsigned int txq_id)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+	struct ipqess_tx_ring *tr = &ess->tx_ring[txq_id];
+
+	netdev_warn(netdev, "TX timeout on queue %d\n", tr->idx);
+}
+
+static const struct net_device_ops ipqess_axi_netdev_ops = {
+	.ndo_init		= ipqess_init,
+	.ndo_uninit		= ipqess_uninit,
+	.ndo_open		= ipqess_open,
+	.ndo_stop		= ipqess_stop,
+	.ndo_eth_ioctl		= ipqess_do_ioctl,
+	.ndo_start_xmit		= ipqess_xmit,
+	.ndo_get_stats		= ipqess_get_stats,
+	.ndo_set_mac_address	= ipqess_set_mac_address,
+	.ndo_tx_timeout		= ipqess_tx_timeout,
+};
+
+static void ipqess_hw_stop(struct ipqess *ess)
+{
+	int i;
+
+	/* disable all RX queue IRQs */
+	for (i = 0; i < IPQESS_MAX_RX_QUEUE; i++)
+		ipqess_w32(ess, IPQESS_REG_RX_INT_MASK_Q(i), 0);
+
+	/* disable all TX queue IRQs */
+	for (i = 0; i < IPQESS_MAX_TX_QUEUE; i++)
+		ipqess_w32(ess, IPQESS_REG_TX_INT_MASK_Q(i), 0);
+
+	/* disable all other IRQs */
+	ipqess_w32(ess, IPQESS_REG_MISC_IMR, 0);
+	ipqess_w32(ess, IPQESS_REG_WOL_IMR, 0);
+
+	/* clear the IRQ status registers */
+	ipqess_w32(ess, IPQESS_REG_RX_ISR, 0xff);
+	ipqess_w32(ess, IPQESS_REG_TX_ISR, 0xffff);
+	ipqess_w32(ess, IPQESS_REG_MISC_ISR, 0x1fff);
+	ipqess_w32(ess, IPQESS_REG_WOL_ISR, 0x1);
+	ipqess_w32(ess, IPQESS_REG_WOL_CTRL, 0);
+
+	/* disable RX and TX queues */
+	ipqess_m32(ess, IPQESS_RXQ_CTRL_EN_MASK, 0, IPQESS_REG_RXQ_CTRL);
+	ipqess_m32(ess, IPQESS_TXQ_CTRL_TXQ_EN, 0, IPQESS_REG_TXQ_CTRL);
+}
+
+static int ipqess_hw_init(struct ipqess *ess)
+{
+	int i, err;
+	u32 tmp;
+
+	ipqess_hw_stop(ess);
+
+	ipqess_m32(ess, BIT(IPQESS_INTR_SW_IDX_W_TYP_SHIFT),
+		   IPQESS_INTR_SW_IDX_W_TYPE << IPQESS_INTR_SW_IDX_W_TYP_SHIFT,
+		   IPQESS_REG_INTR_CTRL);
+
+	/* enable IRQ delay slot */
+	ipqess_w32(ess, IPQESS_REG_IRQ_MODRT_TIMER_INIT,
+		   (IPQESS_TX_IMT << IPQESS_IRQ_MODRT_TX_TIMER_SHIFT) |
+		   (IPQESS_RX_IMT << IPQESS_IRQ_MODRT_RX_TIMER_SHIFT));
+
+	/* Set Customer and Service VLAN TPIDs */
+	ipqess_w32(ess, IPQESS_REG_VLAN_CFG,
+		   (ETH_P_8021Q << IPQESS_VLAN_CFG_CVLAN_TPID_SHIFT) |
+		   (ETH_P_8021AD << IPQESS_VLAN_CFG_SVLAN_TPID_SHIFT));
+
+	/* Configure the TX Queue bursting */
+	ipqess_w32(ess, IPQESS_REG_TXQ_CTRL,
+		   (IPQESS_TPD_BURST << IPQESS_TXQ_NUM_TPD_BURST_SHIFT) |
+		   (IPQESS_TXF_BURST << IPQESS_TXQ_TXF_BURST_NUM_SHIFT) |
+		   IPQESS_TXQ_CTRL_TPD_BURST_EN);
+
+	/* Set RSS type */
+	ipqess_w32(ess, IPQESS_REG_RSS_TYPE,
+		   IPQESS_RSS_TYPE_IPV4TCP | IPQESS_RSS_TYPE_IPV6_TCP |
+		   IPQESS_RSS_TYPE_IPV4_UDP | IPQESS_RSS_TYPE_IPV6UDP |
+		   IPQESS_RSS_TYPE_IPV4 | IPQESS_RSS_TYPE_IPV6);
+
+	/* Set RFD ring burst and threshold */
+	ipqess_w32(ess, IPQESS_REG_RX_DESC1,
+		   (IPQESS_RFD_BURST << IPQESS_RXQ_RFD_BURST_NUM_SHIFT) |
+		   (IPQESS_RFD_THR << IPQESS_RXQ_RFD_PF_THRESH_SHIFT) |
+		   (IPQESS_RFD_LTHR << IPQESS_RXQ_RFD_LOW_THRESH_SHIFT));
+
+	/* Set Rx FIFO
+	 * - threshold to start to DMA data to host
+	 */
+	ipqess_w32(ess, IPQESS_REG_RXQ_CTRL,
+		   IPQESS_FIFO_THRESH_128_BYTE | IPQESS_RXQ_CTRL_RMV_VLAN);
+
+	err = ipqess_rx_ring_alloc(ess);
+	if (err)
+		return err;
+
+	err = ipqess_tx_ring_alloc(ess);
+	if (err)
+		return err;
+
+	/* Load all of the ring base addresses above into the DMA engine */
+	ipqess_m32(ess, 0, BIT(IPQESS_LOAD_PTR_SHIFT), IPQESS_REG_TX_SRAM_PART);
+
+	/* Disable TX FIFO low watermark and high watermark */
+	ipqess_w32(ess, IPQESS_REG_TXF_WATER_MARK, 0);
+
+	/* Configure RSS indirection table.
+	 * 128 hash entries will be configured in the following
+	 * pattern: hash{0,1,2,3} = {Q0,Q2,Q4,Q6} respectively
+	 * and so on
+	 */
+	for (i = 0; i < IPQESS_NUM_IDT; i++)
+		ipqess_w32(ess, IPQESS_REG_RSS_IDT(i), IPQESS_RSS_IDT_VALUE);
+
+	/* Configure load balance mapping table.
+	 * 4 table entries will be configured according to the
+	 * following pattern: load_balance{0,1,2,3} = {Q0,Q1,Q3,Q4}
+	 * respectively.
+	 */
+	ipqess_w32(ess, IPQESS_REG_LB_RING, IPQESS_LB_REG_VALUE);
+
+	/* Configure Virtual queue for Tx rings */
+	ipqess_w32(ess, IPQESS_REG_VQ_CTRL0, IPQESS_VQ_REG_VALUE);
+	ipqess_w32(ess, IPQESS_REG_VQ_CTRL1, IPQESS_VQ_REG_VALUE);
+
+	/* Configure Max AXI Burst write size to 128 bytes */
+	ipqess_w32(ess, IPQESS_REG_AXIW_CTRL_MAXWRSIZE,
+		   IPQESS_AXIW_MAXWRSIZE_VALUE);
+
+	/* Enable TX queues */
+	ipqess_m32(ess, 0, IPQESS_TXQ_CTRL_TXQ_EN, IPQESS_REG_TXQ_CTRL);
+
+	/* Enable RX queues */
+	tmp = 0;
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++)
+		tmp |= IPQESS_RXQ_CTRL_EN(ess->rx_ring[i].idx);
+
+	ipqess_m32(ess, IPQESS_RXQ_CTRL_EN_MASK, tmp, IPQESS_REG_RXQ_CTRL);
+
+	return 0;
+}
+
+static void ipqess_mac_config(struct phylink_config *config, unsigned int mode,
+			      const struct phylink_link_state *state)
+{
+	/* Nothing to do, use fixed Internal mode */
+}
+
+static void ipqess_mac_link_down(struct phylink_config *config,
+				 unsigned int mode,
+				 phy_interface_t interface)
+{
+	/* Nothing to do, use fixed Internal mode */
+}
+
+static void ipqess_mac_link_up(struct phylink_config *config,
+			       struct phy_device *phy, unsigned int mode,
+			       phy_interface_t interface,
+			       int speed, int duplex,
+			       bool tx_pause, bool rx_pause)
+{
+	/* Nothing to do, use fixed Internal mode */
+}
+
+static struct phylink_mac_ops ipqess_phylink_mac_ops = {
+	.validate		= phylink_generic_validate,
+	.mac_config		= ipqess_mac_config,
+	.mac_link_up		= ipqess_mac_link_up,
+	.mac_link_down		= ipqess_mac_link_down,
+};
+
+static void ipqess_cleanup(struct ipqess *ess)
+{
+	ipqess_hw_stop(ess);
+	unregister_netdev(ess->netdev);
+
+	ipqess_tx_ring_free(ess);
+	ipqess_rx_ring_free(ess);
+
+	if (!IS_ERR_OR_NULL(ess->phylink))
+		phylink_destroy(ess->phylink);
+}
+
+static int ipqess_axi_probe(struct platform_device *pdev)
+{
+	struct device_node *np = pdev->dev.of_node;
+	struct net_device *netdev;
+	phy_interface_t phy_mode;
+	struct resource *res;
+	struct ipqess *ess;
+	int i, err = 0;
+
+	netdev = devm_alloc_etherdev_mqs(&pdev->dev, sizeof(struct ipqess),
+					 IPQESS_NETDEV_QUEUES,
+					 IPQESS_NETDEV_QUEUES);
+	if (!netdev)
+		return -ENOMEM;
+
+	ess = netdev_priv(netdev);
+	ess->netdev = netdev;
+	ess->pdev = pdev;
+	spin_lock_init(&ess->stats_lock);
+	SET_NETDEV_DEV(netdev, &pdev->dev);
+	platform_set_drvdata(pdev, netdev);
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	ess->hw_addr = devm_ioremap_resource(&pdev->dev, res);
+	if (IS_ERR(ess->hw_addr)) {
+		err = PTR_ERR(ess->hw_addr);
+		goto err_out;
+	}
+
+	err = of_get_phy_mode(np, &phy_mode);
+	if (err) {
+		dev_err(&pdev->dev, "incorrect phy-mode\n");
+		goto err_out;
+	}
+
+	ess->phylink_config.dev = &netdev->dev;
+	ess->phylink_config.type = PHYLINK_NETDEV;
+
+	__set_bit(PHY_INTERFACE_MODE_INTERNAL,
+		  ess->phylink_config.supported_interfaces);
+
+	ess->phylink = phylink_create(&ess->phylink_config,
+				      of_fwnode_handle(np), phy_mode,
+				      &ipqess_phylink_mac_ops);
+	if (IS_ERR(ess->phylink)) {
+		err = PTR_ERR(ess->phylink);
+		goto err_out;
+	}
+
+	for (i = 0; i < IPQESS_MAX_TX_QUEUE; i++) {
+		ess->tx_irq[i] = platform_get_irq(pdev, i);
+		scnprintf(ess->tx_irq_names[i], sizeof(ess->tx_irq_names[i]),
+			  "%s:txq%d", pdev->name, i);
+	}
+
+	for (i = 0; i < IPQESS_MAX_RX_QUEUE; i++) {
+		ess->rx_irq[i] = platform_get_irq(pdev, i + IPQESS_MAX_TX_QUEUE);
+		scnprintf(ess->rx_irq_names[i], sizeof(ess->rx_irq_names[i]),
+			  "%s:rxq%d", pdev->name, i);
+	}
+
+	netdev->netdev_ops = &ipqess_axi_netdev_ops;
+	netdev->features = NETIF_F_HW_CSUM | NETIF_F_RXCSUM |
+			   NETIF_F_HW_VLAN_CTAG_RX |
+			   NETIF_F_HW_VLAN_CTAG_TX |
+			   NETIF_F_TSO | NETIF_F_GRO | NETIF_F_SG;
+	/* feature change is not supported yet */
+	netdev->hw_features = 0;
+	netdev->vlan_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM |
+				NETIF_F_TSO |
+				NETIF_F_GRO;
+	netdev->watchdog_timeo = 5 * HZ;
+	netdev->base_addr = (u32)ess->hw_addr;
+	netdev->max_mtu = 9000;
+	netdev->gso_max_segs = IPQESS_TX_RING_SIZE / 2;
+
+	ipqess_set_ethtool_ops(netdev);
+
+	err = register_netdev(netdev);
+	if (err)
+		goto err_out;
+
+	err = ipqess_hw_init(ess);
+	if (err)
+		goto err_out;
+
+	for (i = 0; i < IPQESS_NETDEV_QUEUES; i++) {
+		int qid;
+
+		netif_tx_napi_add(netdev, &ess->tx_ring[i].napi_tx,
+				  ipqess_tx_napi, 64);
+		netif_napi_add(netdev,
+			       &ess->rx_ring[i].napi_rx,
+			       ipqess_rx_napi, 64);
+
+		qid = ess->tx_ring[i].idx;
+		err = devm_request_irq(&ess->netdev->dev, ess->tx_irq[qid],
+				       ipqess_interrupt_tx, 0,
+				       ess->tx_irq_names[qid],
+				       &ess->tx_ring[i]);
+		if (err)
+			goto err_out;
+
+		qid = ess->rx_ring[i].idx;
+		err = devm_request_irq(&ess->netdev->dev, ess->rx_irq[qid],
+				       ipqess_interrupt_rx, 0,
+				       ess->rx_irq_names[qid],
+				       &ess->rx_ring[i]);
+		if (err)
+			goto err_out;
+	}
+
+	return 0;
+
+err_out:
+	ipqess_cleanup(ess);
+	return err;
+}
+
+static int ipqess_axi_remove(struct platform_device *pdev)
+{
+	const struct net_device *netdev = platform_get_drvdata(pdev);
+	struct ipqess *ess = netdev_priv(netdev);
+
+	ipqess_cleanup(ess);
+
+	return 0;
+}
+
+static const struct of_device_id ipqess_of_mtable[] = {
+	{.compatible = "qcom,ipq4019-ess-edma" },
+	{}
+};
+MODULE_DEVICE_TABLE(of, ipqess_of_mtable);
+
+static struct platform_driver ipqess_axi_driver = {
+	.driver = {
+		.name    = "ipqess-edma",
+		.of_match_table = ipqess_of_mtable,
+	},
+	.probe    = ipqess_axi_probe,
+	.remove   = ipqess_axi_remove,
+};
+
+module_platform_driver(ipqess_axi_driver);
+
+MODULE_AUTHOR("Qualcomm Atheros Inc");
+MODULE_AUTHOR("John Crispin <john@phrozen.org>");
+MODULE_AUTHOR("Christian Lamparter <chunkeey@gmail.com>");
+MODULE_AUTHOR("Gabor Juhos <j4g8y7@gmail.com>");
+MODULE_AUTHOR("Maxime Chevallier <maxime.chevallier@bootlin.com>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/qualcomm/ipqess/ipqess.h b/drivers/net/ethernet/qualcomm/ipqess/ipqess.h
new file mode 100644
index 000000000000..f961cdad193c
--- /dev/null
+++ b/drivers/net/ethernet/qualcomm/ipqess/ipqess.h
@@ -0,0 +1,515 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR ISC) */
+/* Copyright (c) 2014 - 2016, The Linux Foundation. All rights reserved.
+ * Copyright (c) 2017 - 2018, John Crispin <john@phrozen.org>
+ * Copyright (c) 2018 - 2019, Christian Lamparter <chunkeey@gmail.com>
+ * Copyright (c) 2020 - 2021, Gabor Juhos <j4g8y7@gmail.com>
+ * Copyright (c) 2021 - 2022, Maxime Chevallier <maxime.chevallier@bootlin.com>
+ *
+ */
+
+#ifndef _IPQESS_H_
+#define _IPQESS_H_
+
+#define IPQESS_NETDEV_QUEUES	4
+
+#define IPQESS_TPD_EOP_SHIFT 31
+
+#define IPQESS_PORT_ID_SHIFT 12
+#define IPQESS_PORT_ID_MASK 0x7
+
+/* tpd word 3 bit 18-28 */
+#define IPQESS_TPD_PORT_BITMAP_SHIFT 18
+
+#define IPQESS_TPD_FROM_CPU_SHIFT 25
+
+#define IPQESS_RX_RING_SIZE 128
+#define IPQESS_RX_HEAD_BUFF_SIZE 1540
+#define IPQESS_TX_RING_SIZE 128
+#define IPQESS_MAX_RX_QUEUE 8
+#define IPQESS_MAX_TX_QUEUE 16
+
+/* Configurations */
+#define IPQESS_INTR_CLEAR_TYPE 0
+#define IPQESS_INTR_SW_IDX_W_TYPE 0
+#define IPQESS_FIFO_THRESH_TYPE 0
+#define IPQESS_RSS_TYPE 0
+#define IPQESS_RX_IMT 0x0020
+#define IPQESS_TX_IMT 0x0050
+#define IPQESS_TPD_BURST 5
+#define IPQESS_TXF_BURST 0x100
+#define IPQESS_RFD_BURST 8
+#define IPQESS_RFD_THR 16
+#define IPQESS_RFD_LTHR 0
+
+/* Flags used in transmit direction */
+#define IPQESS_DESC_LAST 0x1
+#define IPQESS_DESC_SINGLE 0x2
+#define IPQESS_DESC_PAGE 0x4
+
+struct ipqess_statistics {
+	u32 tx_q0_pkt;
+	u32 tx_q1_pkt;
+	u32 tx_q2_pkt;
+	u32 tx_q3_pkt;
+	u32 tx_q4_pkt;
+	u32 tx_q5_pkt;
+	u32 tx_q6_pkt;
+	u32 tx_q7_pkt;
+	u32 tx_q8_pkt;
+	u32 tx_q9_pkt;
+	u32 tx_q10_pkt;
+	u32 tx_q11_pkt;
+	u32 tx_q12_pkt;
+	u32 tx_q13_pkt;
+	u32 tx_q14_pkt;
+	u32 tx_q15_pkt;
+	u32 tx_q0_byte;
+	u32 tx_q1_byte;
+	u32 tx_q2_byte;
+	u32 tx_q3_byte;
+	u32 tx_q4_byte;
+	u32 tx_q5_byte;
+	u32 tx_q6_byte;
+	u32 tx_q7_byte;
+	u32 tx_q8_byte;
+	u32 tx_q9_byte;
+	u32 tx_q10_byte;
+	u32 tx_q11_byte;
+	u32 tx_q12_byte;
+	u32 tx_q13_byte;
+	u32 tx_q14_byte;
+	u32 tx_q15_byte;
+	u32 rx_q0_pkt;
+	u32 rx_q1_pkt;
+	u32 rx_q2_pkt;
+	u32 rx_q3_pkt;
+	u32 rx_q4_pkt;
+	u32 rx_q5_pkt;
+	u32 rx_q6_pkt;
+	u32 rx_q7_pkt;
+	u32 rx_q0_byte;
+	u32 rx_q1_byte;
+	u32 rx_q2_byte;
+	u32 rx_q3_byte;
+	u32 rx_q4_byte;
+	u32 rx_q5_byte;
+	u32 rx_q6_byte;
+	u32 rx_q7_byte;
+	u32 tx_desc_error;
+};
+
+struct ipqess_tx_desc {
+	__le16  len;
+	__le16  svlan_tag;
+	__le32  word1;
+	__le32  addr;
+	__le32  word3;
+} __aligned(16) __packed;
+
+struct ipqess_rx_desc {
+	u16 rrd0;
+	u16 rrd1;
+	u16 rrd2;
+	u16 rrd3;
+	u16 rrd4;
+	u16 rrd5;
+	u16 rrd6;
+	u16 rrd7;
+} __aligned(16) __packed;
+
+struct ipqess_buf {
+	struct sk_buff *skb;
+	dma_addr_t dma;
+	u32 flags;
+	u16 length;
+};
+
+struct ipqess_tx_ring {
+	struct napi_struct napi_tx;
+	u32 idx;
+	int ring_id;
+	struct ipqess *ess;
+	struct netdev_queue *nq;
+	struct ipqess_tx_desc *hw_desc;
+	struct ipqess_buf *buf;
+	dma_addr_t dma;
+	u16 count;
+	u16 head;
+	u16 tail;
+};
+
+struct ipqess_rx_ring {
+	struct napi_struct napi_rx;
+	u32 idx;
+	int ring_id;
+	struct ipqess *ess;
+	struct device *ppdev;
+	struct ipqess_rx_desc **hw_desc;
+	struct ipqess_buf *buf;
+	dma_addr_t dma;
+	u16 head;
+	u16 tail;
+	atomic_t refill_count;
+};
+
+struct ipqess_rx_ring_refill {
+	struct ipqess_rx_ring *rx_ring;
+	struct work_struct refill_work;
+};
+
+#define IPQESS_IRQ_NAME_LEN	32
+
+struct ipqess {
+	struct net_device *netdev;
+	void __iomem *hw_addr;
+
+	struct ipqess_rx_ring rx_ring[IPQESS_NETDEV_QUEUES];
+
+	struct platform_device *pdev;
+	struct phylink *phylink;
+	struct phylink_config phylink_config;
+	struct ipqess_tx_ring tx_ring[IPQESS_NETDEV_QUEUES];
+
+	struct ipqess_statistics ipqess_stats;
+
+	/* Protects stats */
+	spinlock_t stats_lock;
+	struct net_device_stats stats;
+
+	struct ipqess_rx_ring_refill rx_refill[IPQESS_NETDEV_QUEUES];
+	u32 tx_irq[IPQESS_MAX_TX_QUEUE];
+	char tx_irq_names[IPQESS_MAX_TX_QUEUE][IPQESS_IRQ_NAME_LEN];
+	u32 rx_irq[IPQESS_MAX_RX_QUEUE];
+	char rx_irq_names[IPQESS_MAX_RX_QUEUE][IPQESS_IRQ_NAME_LEN];
+};
+
+void ipqess_set_ethtool_ops(struct net_device *netdev);
+void ipqess_update_hw_stats(struct ipqess *ess);
+
+/* register definition */
+#define IPQESS_REG_MAS_CTRL 0x0
+#define IPQESS_REG_TIMEOUT_CTRL 0x004
+#define IPQESS_REG_DBG0 0x008
+#define IPQESS_REG_DBG1 0x00C
+#define IPQESS_REG_SW_CTRL0 0x100
+#define IPQESS_REG_SW_CTRL1 0x104
+
+/* Interrupt Status Register */
+#define IPQESS_REG_RX_ISR 0x200
+#define IPQESS_REG_TX_ISR 0x208
+#define IPQESS_REG_MISC_ISR 0x210
+#define IPQESS_REG_WOL_ISR 0x218
+
+#define IPQESS_MISC_ISR_RX_URG_Q(x) (1 << (x))
+
+#define IPQESS_MISC_ISR_AXIR_TIMEOUT 0x00000100
+#define IPQESS_MISC_ISR_AXIR_ERR 0x00000200
+#define IPQESS_MISC_ISR_TXF_DEAD 0x00000400
+#define IPQESS_MISC_ISR_AXIW_ERR 0x00000800
+#define IPQESS_MISC_ISR_AXIW_TIMEOUT 0x00001000
+
+#define IPQESS_WOL_ISR 0x00000001
+
+/* Interrupt Mask Register */
+#define IPQESS_REG_MISC_IMR 0x214
+#define IPQESS_REG_WOL_IMR 0x218
+
+#define IPQESS_RX_IMR_NORMAL_MASK 0x1
+#define IPQESS_TX_IMR_NORMAL_MASK 0x1
+#define IPQESS_MISC_IMR_NORMAL_MASK 0x80001FFF
+#define IPQESS_WOL_IMR_NORMAL_MASK 0x1
+
+/* Edma receive consumer index */
+#define IPQESS_REG_RX_SW_CONS_IDX_Q(x) (0x220 + ((x) << 2)) /* x is the queue id */
+
+/* Edma transmit consumer index */
+#define IPQESS_REG_TX_SW_CONS_IDX_Q(x) (0x240 + ((x) << 2)) /* x is the queue id */
+
+/* IRQ Moderator Initial Timer Register */
+#define IPQESS_REG_IRQ_MODRT_TIMER_INIT 0x280
+#define IPQESS_IRQ_MODRT_TIMER_MASK 0xFFFF
+#define IPQESS_IRQ_MODRT_RX_TIMER_SHIFT 0
+#define IPQESS_IRQ_MODRT_TX_TIMER_SHIFT 16
+
+/* Interrupt Control Register */
+#define IPQESS_REG_INTR_CTRL 0x284
+#define IPQESS_INTR_CLR_TYP_SHIFT 0
+#define IPQESS_INTR_SW_IDX_W_TYP_SHIFT 1
+#define IPQESS_INTR_CLEAR_TYPE_W1 0
+#define IPQESS_INTR_CLEAR_TYPE_R 1
+
+/* RX Interrupt Mask Register */
+#define IPQESS_REG_RX_INT_MASK_Q(x) (0x300 + ((x) << 2)) /* x = queue id */
+
+/* TX Interrupt mask register */
+#define IPQESS_REG_TX_INT_MASK_Q(x) (0x340 + ((x) << 2)) /* x = queue id */
+
+/* Load Ptr Register
+ * Software sets this bit after the initialization of the head and tail
+ */
+#define IPQESS_REG_TX_SRAM_PART 0x400
+#define IPQESS_LOAD_PTR_SHIFT 16
+
+/* TXQ Control Register */
+#define IPQESS_REG_TXQ_CTRL 0x404
+#define IPQESS_TXQ_CTRL_IP_OPTION_EN 0x10
+#define IPQESS_TXQ_CTRL_TXQ_EN 0x20
+#define IPQESS_TXQ_CTRL_ENH_MODE 0x40
+#define IPQESS_TXQ_CTRL_LS_8023_EN 0x80
+#define IPQESS_TXQ_CTRL_TPD_BURST_EN 0x100
+#define IPQESS_TXQ_CTRL_LSO_BREAK_EN 0x200
+#define IPQESS_TXQ_NUM_TPD_BURST_MASK 0xF
+#define IPQESS_TXQ_TXF_BURST_NUM_MASK 0xFFFF
+#define IPQESS_TXQ_NUM_TPD_BURST_SHIFT 0
+#define IPQESS_TXQ_TXF_BURST_NUM_SHIFT 16
+
+#define	IPQESS_REG_TXF_WATER_MARK 0x408 /* In 8-bytes */
+#define IPQESS_TXF_WATER_MARK_MASK 0x0FFF
+#define IPQESS_TXF_LOW_WATER_MARK_SHIFT 0
+#define IPQESS_TXF_HIGH_WATER_MARK_SHIFT 16
+#define IPQESS_TXQ_CTRL_BURST_MODE_EN 0x80000000
+
+/* WRR Control Register */
+#define IPQESS_REG_WRR_CTRL_Q0_Q3 0x40c
+#define IPQESS_REG_WRR_CTRL_Q4_Q7 0x410
+#define IPQESS_REG_WRR_CTRL_Q8_Q11 0x414
+#define IPQESS_REG_WRR_CTRL_Q12_Q15 0x418
+
+/* Weight round robin(WRR), it takes queue as input, and computes
+ * starting bits where we need to write the weight for a particular
+ * queue
+ */
+#define IPQESS_WRR_SHIFT(x) (((x) * 5) % 20)
+
+/* Tx Descriptor Control Register */
+#define IPQESS_REG_TPD_RING_SIZE 0x41C
+#define IPQESS_TPD_RING_SIZE_SHIFT 0
+#define IPQESS_TPD_RING_SIZE_MASK 0xFFFF
+
+/* Transmit descriptor base address */
+#define IPQESS_REG_TPD_BASE_ADDR_Q(x) (0x420 + ((x) << 2)) /* x = queue id */
+
+/* TPD Index Register */
+#define IPQESS_REG_TPD_IDX_Q(x) (0x460 + ((x) << 2)) /* x = queue id */
+
+#define IPQESS_TPD_PROD_IDX_BITS 0x0000FFFF
+#define IPQESS_TPD_CONS_IDX_BITS 0xFFFF0000
+#define IPQESS_TPD_PROD_IDX_MASK 0xFFFF
+#define IPQESS_TPD_CONS_IDX_MASK 0xFFFF
+#define IPQESS_TPD_PROD_IDX_SHIFT 0
+#define IPQESS_TPD_CONS_IDX_SHIFT 16
+
+/* TX Virtual Queue Mapping Control Register */
+#define IPQESS_REG_VQ_CTRL0 0x4A0
+#define IPQESS_REG_VQ_CTRL1 0x4A4
+
+/* Virtual QID shift, it takes queue as input, and computes
+ * Virtual QID position in virtual qid control register
+ */
+#define IPQESS_VQ_ID_SHIFT(i) (((i) * 3) % 24)
+
+/* Virtual Queue Default Value */
+#define IPQESS_VQ_REG_VALUE 0x240240
+
+/* Tx side Port Interface Control Register */
+#define IPQESS_REG_PORT_CTRL 0x4A8
+#define IPQESS_PAD_EN_SHIFT 15
+
+/* Tx side VLAN Configuration Register */
+#define IPQESS_REG_VLAN_CFG 0x4AC
+
+#define IPQESS_VLAN_CFG_SVLAN_TPID_SHIFT 0
+#define IPQESS_VLAN_CFG_SVLAN_TPID_MASK 0xffff
+#define IPQESS_VLAN_CFG_CVLAN_TPID_SHIFT 16
+#define IPQESS_VLAN_CFG_CVLAN_TPID_MASK 0xffff
+
+#define IPQESS_TX_CVLAN 16
+#define IPQESS_TX_INS_CVLAN 17
+#define IPQESS_TX_CVLAN_TAG_SHIFT 0
+
+#define IPQESS_TX_SVLAN 14
+#define IPQESS_TX_INS_SVLAN 15
+#define IPQESS_TX_SVLAN_TAG_SHIFT 16
+
+/* Tx Queue Packet Statistic Register */
+#define IPQESS_REG_TX_STAT_PKT_Q(x) (0x700 + ((x) << 3)) /* x = queue id */
+
+#define IPQESS_TX_STAT_PKT_MASK 0xFFFFFF
+
+/* Tx Queue Byte Statistic Register */
+#define IPQESS_REG_TX_STAT_BYTE_Q(x) (0x704 + ((x) << 3)) /* x = queue id */
+
+/* Load Balance Based Ring Offset Register */
+#define IPQESS_REG_LB_RING 0x800
+#define IPQESS_LB_RING_ENTRY_MASK 0xff
+#define IPQESS_LB_RING_ID_MASK 0x7
+#define IPQESS_LB_RING_PROFILE_ID_MASK 0x3
+#define IPQESS_LB_RING_ENTRY_BIT_OFFSET 8
+#define IPQESS_LB_RING_ID_OFFSET 0
+#define IPQESS_LB_RING_PROFILE_ID_OFFSET 3
+#define IPQESS_LB_REG_VALUE 0x6040200
+
+/* Load Balance Priority Mapping Register */
+#define IPQESS_REG_LB_PRI_START 0x804
+#define IPQESS_REG_LB_PRI_END 0x810
+#define IPQESS_LB_PRI_REG_INC 4
+#define IPQESS_LB_PRI_ENTRY_BIT_OFFSET 4
+#define IPQESS_LB_PRI_ENTRY_MASK 0xf
+
+/* RSS Priority Mapping Register */
+#define IPQESS_REG_RSS_PRI 0x820
+#define IPQESS_RSS_PRI_ENTRY_MASK 0xf
+#define IPQESS_RSS_RING_ID_MASK 0x7
+#define IPQESS_RSS_PRI_ENTRY_BIT_OFFSET 4
+
+/* RSS Indirection Register */
+#define IPQESS_REG_RSS_IDT(x) (0x840 + ((x) << 2)) /* x = No. of indirection table */
+#define IPQESS_NUM_IDT 16
+#define IPQESS_RSS_IDT_VALUE 0x64206420
+
+/* Default RSS Ring Register */
+#define IPQESS_REG_DEF_RSS 0x890
+#define IPQESS_DEF_RSS_MASK 0x7
+
+/* RSS Hash Function Type Register */
+#define IPQESS_REG_RSS_TYPE 0x894
+#define IPQESS_RSS_TYPE_NONE 0x01
+#define IPQESS_RSS_TYPE_IPV4TCP 0x02
+#define IPQESS_RSS_TYPE_IPV6_TCP 0x04
+#define IPQESS_RSS_TYPE_IPV4_UDP 0x08
+#define IPQESS_RSS_TYPE_IPV6UDP 0x10
+#define IPQESS_RSS_TYPE_IPV4 0x20
+#define IPQESS_RSS_TYPE_IPV6 0x40
+#define IPQESS_RSS_HASH_MODE_MASK 0x7f
+
+#define IPQESS_REG_RSS_HASH_VALUE 0x8C0
+
+#define IPQESS_REG_RSS_TYPE_RESULT 0x8C4
+
+#define IPQESS_HASH_TYPE_START 0
+#define IPQESS_HASH_TYPE_END 5
+#define IPQESS_HASH_TYPE_SHIFT 12
+
+#define IPQESS_RFS_FLOW_ENTRIES 1024
+#define IPQESS_RFS_FLOW_ENTRIES_MASK (IPQESS_RFS_FLOW_ENTRIES - 1)
+#define IPQESS_RFS_EXPIRE_COUNT_PER_CALL 128
+
+/* RFD Base Address Register */
+#define IPQESS_REG_RFD_BASE_ADDR_Q(x) (0x950 + ((x) << 2)) /* x = queue id */
+
+/* RFD Index Register */
+#define IPQESS_REG_RFD_IDX_Q(x) (0x9B0 + ((x) << 2)) /* x = queue id */
+
+#define IPQESS_RFD_PROD_IDX_BITS 0x00000FFF
+#define IPQESS_RFD_CONS_IDX_BITS 0x0FFF0000
+#define IPQESS_RFD_PROD_IDX_MASK 0xFFF
+#define IPQESS_RFD_CONS_IDX_MASK 0xFFF
+#define IPQESS_RFD_PROD_IDX_SHIFT 0
+#define IPQESS_RFD_CONS_IDX_SHIFT 16
+
+/* Rx Descriptor Control Register */
+#define IPQESS_REG_RX_DESC0 0xA10
+#define IPQESS_RFD_RING_SIZE_MASK 0xFFF
+#define IPQESS_RX_BUF_SIZE_MASK 0xFFFF
+#define IPQESS_RFD_RING_SIZE_SHIFT 0
+#define IPQESS_RX_BUF_SIZE_SHIFT 16
+
+#define IPQESS_REG_RX_DESC1 0xA14
+#define IPQESS_RXQ_RFD_BURST_NUM_MASK 0x3F
+#define IPQESS_RXQ_RFD_PF_THRESH_MASK 0x1F
+#define IPQESS_RXQ_RFD_LOW_THRESH_MASK 0xFFF
+#define IPQESS_RXQ_RFD_BURST_NUM_SHIFT 0
+#define IPQESS_RXQ_RFD_PF_THRESH_SHIFT 8
+#define IPQESS_RXQ_RFD_LOW_THRESH_SHIFT 16
+
+/* RXQ Control Register */
+#define IPQESS_REG_RXQ_CTRL 0xA18
+#define IPQESS_FIFO_THRESH_TYPE_SHIF 0
+#define IPQESS_FIFO_THRESH_128_BYTE 0x0
+#define IPQESS_FIFO_THRESH_64_BYTE 0x1
+#define IPQESS_RXQ_CTRL_RMV_VLAN 0x00000002
+#define IPQESS_RXQ_CTRL_EN_MASK			GENMASK(15, 8)
+#define IPQESS_RXQ_CTRL_EN(__qid)		BIT(8 + (__qid))
+
+/* AXI Burst Size Config */
+#define IPQESS_REG_AXIW_CTRL_MAXWRSIZE 0xA1C
+#define IPQESS_AXIW_MAXWRSIZE_VALUE 0x0
+
+/* Rx Statistics Register */
+#define IPQESS_REG_RX_STAT_BYTE_Q(x) (0xA30 + ((x) << 2)) /* x = queue id */
+#define IPQESS_REG_RX_STAT_PKT_Q(x) (0xA50 + ((x) << 2)) /* x = queue id */
+
+/* WoL Pattern Length Register */
+#define IPQESS_REG_WOL_PATTERN_LEN0 0xC00
+#define IPQESS_WOL_PT_LEN_MASK 0xFF
+#define IPQESS_WOL_PT0_LEN_SHIFT 0
+#define IPQESS_WOL_PT1_LEN_SHIFT 8
+#define IPQESS_WOL_PT2_LEN_SHIFT 16
+#define IPQESS_WOL_PT3_LEN_SHIFT 24
+
+#define IPQESS_REG_WOL_PATTERN_LEN1 0xC04
+#define IPQESS_WOL_PT4_LEN_SHIFT 0
+#define IPQESS_WOL_PT5_LEN_SHIFT 8
+#define IPQESS_WOL_PT6_LEN_SHIFT 16
+
+/* WoL Control Register */
+#define IPQESS_REG_WOL_CTRL 0xC08
+#define IPQESS_WOL_WK_EN 0x00000001
+#define IPQESS_WOL_MG_EN 0x00000002
+#define IPQESS_WOL_PT0_EN 0x00000004
+#define IPQESS_WOL_PT1_EN 0x00000008
+#define IPQESS_WOL_PT2_EN 0x00000010
+#define IPQESS_WOL_PT3_EN 0x00000020
+#define IPQESS_WOL_PT4_EN 0x00000040
+#define IPQESS_WOL_PT5_EN 0x00000080
+#define IPQESS_WOL_PT6_EN 0x00000100
+
+/* MAC Control Register */
+#define IPQESS_REG_MAC_CTRL0 0xC20
+#define IPQESS_REG_MAC_CTRL1 0xC24
+
+/* WoL Pattern Register */
+#define IPQESS_REG_WOL_PATTERN_START 0x5000
+#define IPQESS_PATTERN_PART_REG_OFFSET 0x40
+
+/* TX descriptor fields */
+#define IPQESS_TPD_HDR_SHIFT 0
+#define IPQESS_TPD_PPPOE_EN 0x00000100
+#define IPQESS_TPD_IP_CSUM_EN 0x00000200
+#define IPQESS_TPD_TCP_CSUM_EN 0x00000400
+#define IPQESS_TPD_UDP_CSUM_EN 0x00000800
+#define IPQESS_TPD_CUSTOM_CSUM_EN 0x00000C00
+#define IPQESS_TPD_LSO_EN 0x00001000
+#define IPQESS_TPD_LSO_V2_EN 0x00002000
+/* The VLAN_TAGGED bit is not used in the publicly available
+ * drivers. The definition has been stolen from the Atheros
+ * 'alx' driver (drivers/net/ethernet/atheros/alx/hw.h). It
+ * seems that it has the same meaning in regard to the EDMA
+ * hardware.
+ */
+#define IPQESS_TPD_VLAN_TAGGED 0x00004000
+#define IPQESS_TPD_IPV4_EN 0x00010000
+#define IPQESS_TPD_MSS_MASK 0x1FFF
+#define IPQESS_TPD_MSS_SHIFT 18
+#define IPQESS_TPD_CUSTOM_CSUM_SHIFT 18
+
+/* RRD descriptor fields */
+#define IPQESS_RRD_NUM_RFD_MASK 0x000F
+#define IPQESS_RRD_PKT_SIZE_MASK 0x3FFF
+#define IPQESS_RRD_SRC_PORT_NUM_MASK 0x4000
+#define IPQESS_RRD_SVLAN 0x8000
+#define IPQESS_RRD_FLOW_COOKIE_MASK 0x07FF
+
+#define IPQESS_RRD_PKT_SIZE_MASK 0x3FFF
+#define IPQESS_RRD_CSUM_FAIL_MASK 0xC000
+#define IPQESS_RRD_CVLAN 0x0001
+#define IPQESS_RRD_DESC_VALID 0x8000
+
+#define IPQESS_RRD_PRIORITY_SHIFT 4
+#define IPQESS_RRD_PRIORITY_MASK 0x7
+#define IPQESS_RRD_PORT_TYPE_SHIFT 7
+#define IPQESS_RRD_PORT_TYPE_MASK 0x1F
+
+#define IPQESS_RRD_PORT_ID_MASK 0x7000
+
+#endif
diff --git a/drivers/net/ethernet/qualcomm/ipqess/ipqess_ethtool.c b/drivers/net/ethernet/qualcomm/ipqess/ipqess_ethtool.c
new file mode 100644
index 000000000000..ad31fa910983
--- /dev/null
+++ b/drivers/net/ethernet/qualcomm/ipqess/ipqess_ethtool.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0 OR ISC
+/* Copyright (c) 2015 - 2016, The Linux Foundation. All rights reserved.
+ * Copyright (c) 2017 - 2018, John Crispin <john@phrozen.org>
+ * Copyright (c) 2021 - 2022, Maxime Chevallier <maxime.chevallier@bootlin.com>
+ *
+ */
+
+#include <linux/ethtool.h>
+#include <linux/netdevice.h>
+#include <linux/string.h>
+#include <linux/phylink.h>
+
+#include "ipqess.h"
+
+struct ipqess_ethtool_stats {
+	u8 string[ETH_GSTRING_LEN];
+	u32 offset;
+};
+
+#define IPQESS_STAT(m)    offsetof(struct ipqess_statistics, m)
+#define DRVINFO_LEN	32
+
+static const struct ipqess_ethtool_stats ipqess_stats[] = {
+	{"tx_q0_pkt", IPQESS_STAT(tx_q0_pkt)},
+	{"tx_q1_pkt", IPQESS_STAT(tx_q1_pkt)},
+	{"tx_q2_pkt", IPQESS_STAT(tx_q2_pkt)},
+	{"tx_q3_pkt", IPQESS_STAT(tx_q3_pkt)},
+	{"tx_q4_pkt", IPQESS_STAT(tx_q4_pkt)},
+	{"tx_q5_pkt", IPQESS_STAT(tx_q5_pkt)},
+	{"tx_q6_pkt", IPQESS_STAT(tx_q6_pkt)},
+	{"tx_q7_pkt", IPQESS_STAT(tx_q7_pkt)},
+	{"tx_q8_pkt", IPQESS_STAT(tx_q8_pkt)},
+	{"tx_q9_pkt", IPQESS_STAT(tx_q9_pkt)},
+	{"tx_q10_pkt", IPQESS_STAT(tx_q10_pkt)},
+	{"tx_q11_pkt", IPQESS_STAT(tx_q11_pkt)},
+	{"tx_q12_pkt", IPQESS_STAT(tx_q12_pkt)},
+	{"tx_q13_pkt", IPQESS_STAT(tx_q13_pkt)},
+	{"tx_q14_pkt", IPQESS_STAT(tx_q14_pkt)},
+	{"tx_q15_pkt", IPQESS_STAT(tx_q15_pkt)},
+	{"tx_q0_byte", IPQESS_STAT(tx_q0_byte)},
+	{"tx_q1_byte", IPQESS_STAT(tx_q1_byte)},
+	{"tx_q2_byte", IPQESS_STAT(tx_q2_byte)},
+	{"tx_q3_byte", IPQESS_STAT(tx_q3_byte)},
+	{"tx_q4_byte", IPQESS_STAT(tx_q4_byte)},
+	{"tx_q5_byte", IPQESS_STAT(tx_q5_byte)},
+	{"tx_q6_byte", IPQESS_STAT(tx_q6_byte)},
+	{"tx_q7_byte", IPQESS_STAT(tx_q7_byte)},
+	{"tx_q8_byte", IPQESS_STAT(tx_q8_byte)},
+	{"tx_q9_byte", IPQESS_STAT(tx_q9_byte)},
+	{"tx_q10_byte", IPQESS_STAT(tx_q10_byte)},
+	{"tx_q11_byte", IPQESS_STAT(tx_q11_byte)},
+	{"tx_q12_byte", IPQESS_STAT(tx_q12_byte)},
+	{"tx_q13_byte", IPQESS_STAT(tx_q13_byte)},
+	{"tx_q14_byte", IPQESS_STAT(tx_q14_byte)},
+	{"tx_q15_byte", IPQESS_STAT(tx_q15_byte)},
+	{"rx_q0_pkt", IPQESS_STAT(rx_q0_pkt)},
+	{"rx_q1_pkt", IPQESS_STAT(rx_q1_pkt)},
+	{"rx_q2_pkt", IPQESS_STAT(rx_q2_pkt)},
+	{"rx_q3_pkt", IPQESS_STAT(rx_q3_pkt)},
+	{"rx_q4_pkt", IPQESS_STAT(rx_q4_pkt)},
+	{"rx_q5_pkt", IPQESS_STAT(rx_q5_pkt)},
+	{"rx_q6_pkt", IPQESS_STAT(rx_q6_pkt)},
+	{"rx_q7_pkt", IPQESS_STAT(rx_q7_pkt)},
+	{"rx_q0_byte", IPQESS_STAT(rx_q0_byte)},
+	{"rx_q1_byte", IPQESS_STAT(rx_q1_byte)},
+	{"rx_q2_byte", IPQESS_STAT(rx_q2_byte)},
+	{"rx_q3_byte", IPQESS_STAT(rx_q3_byte)},
+	{"rx_q4_byte", IPQESS_STAT(rx_q4_byte)},
+	{"rx_q5_byte", IPQESS_STAT(rx_q5_byte)},
+	{"rx_q6_byte", IPQESS_STAT(rx_q6_byte)},
+	{"rx_q7_byte", IPQESS_STAT(rx_q7_byte)},
+	{"tx_desc_error", IPQESS_STAT(tx_desc_error)},
+};
+
+static int ipqess_get_strset_count(struct net_device *netdev, int sset)
+{
+	switch (sset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(ipqess_stats);
+	default:
+		netdev_dbg(netdev, "%s: Invalid string set\n", __func__);
+		return -EOPNOTSUPP;
+	}
+}
+
+static void ipqess_get_strings(struct net_device *netdev, u32 stringset,
+			       u8 *data)
+{
+	u8 *p = data;
+	u32 i;
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		for (i = 0; i < ARRAY_SIZE(ipqess_stats); i++) {
+			memcpy(p, ipqess_stats[i].string,
+			       min((size_t)ETH_GSTRING_LEN,
+				   strlen(ipqess_stats[i].string) + 1));
+			p += ETH_GSTRING_LEN;
+		}
+		break;
+	}
+}
+
+static void ipqess_get_ethtool_stats(struct net_device *netdev,
+				     struct ethtool_stats *stats,
+				     uint64_t *data)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+	u32 *essstats = (u32 *)&ess->ipqess_stats;
+	int i;
+
+	spin_lock(&ess->stats_lock);
+
+	ipqess_update_hw_stats(ess);
+
+	for (i = 0; i < ARRAY_SIZE(ipqess_stats); i++)
+		data[i] = *(u32 *)(essstats + (ipqess_stats[i].offset / sizeof(u32)));
+
+	spin_unlock(&ess->stats_lock);
+}
+
+static void ipqess_get_drvinfo(struct net_device *dev,
+			       struct ethtool_drvinfo *info)
+{
+	strscpy(info->driver, "qca_ipqess", DRVINFO_LEN);
+	strscpy(info->bus_info, "axi", ETHTOOL_BUSINFO_LEN);
+}
+
+static int ipqess_get_settings(struct net_device *netdev,
+			       struct ethtool_link_ksettings *cmd)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+
+	return phylink_ethtool_ksettings_get(ess->phylink, cmd);
+}
+
+static int ipqess_set_settings(struct net_device *netdev,
+			       const struct ethtool_link_ksettings *cmd)
+{
+	struct ipqess *ess = netdev_priv(netdev);
+
+	return phylink_ethtool_ksettings_set(ess->phylink, cmd);
+}
+
+static void ipqess_get_ringparam(struct net_device *netdev,
+				 struct ethtool_ringparam *ring,
+				 struct kernel_ethtool_ringparam *kernel_ering,
+				 struct netlink_ext_ack *extack)
+{
+	ring->tx_max_pending = IPQESS_TX_RING_SIZE;
+	ring->rx_max_pending = IPQESS_RX_RING_SIZE;
+}
+
+static const struct ethtool_ops ipqesstool_ops = {
+	.get_drvinfo = &ipqess_get_drvinfo,
+	.get_link = &ethtool_op_get_link,
+	.get_link_ksettings = &ipqess_get_settings,
+	.set_link_ksettings = &ipqess_set_settings,
+	.get_strings = &ipqess_get_strings,
+	.get_sset_count = &ipqess_get_strset_count,
+	.get_ethtool_stats = &ipqess_get_ethtool_stats,
+	.get_ringparam = ipqess_get_ringparam,
+};
+
+void ipqess_set_ethtool_ops(struct net_device *netdev)
+{
+	netdev->ethtool_ops = &ipqesstool_ops;
+}
-- 
2.35.1


^ permalink raw reply related

* [PATCH net-next 5/5] ARM: dts: qcom: ipq4019: Add description for the IPQESS Ethernet controller
From: Maxime Chevallier @ 2022-04-22 18:03 UTC (permalink / raw)
  To: davem, Rob Herring
  Cc: Maxime Chevallier, netdev, linux-kernel, devicetree,
	thomas.petazzoni, Andrew Lunn, Florian Fainelli, Heiner Kallweit,
	Russell King, linux-arm-kernel, Vladimir Oltean, Luka Perkov,
	Robert Marko
In-Reply-To: <20220422180305.301882-1-maxime.chevallier@bootlin.com>

The Qualcomm IPQ4019 includes an internal 5-port switch, which is
connected to the CPU through the internal IPQESS Ethernet controller.

This commit adds support for this internal interface, which is
internally connected to a modified version of the QCA8K Ethernet switch.

This Ethernet controller only supports a specific internal interface mode
for connection to the switch.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
 arch/arm/boot/dts/qcom-ipq4019.dtsi | 42 +++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/arm/boot/dts/qcom-ipq4019.dtsi b/arch/arm/boot/dts/qcom-ipq4019.dtsi
index cac92dde040f..52761c801258 100644
--- a/arch/arm/boot/dts/qcom-ipq4019.dtsi
+++ b/arch/arm/boot/dts/qcom-ipq4019.dtsi
@@ -38,6 +38,7 @@ aliases {
 		spi1 = &blsp1_spi2;
 		i2c0 = &blsp1_i2c3;
 		i2c1 = &blsp1_i2c4;
+		ethernet0 = &gmac;
 	};
 
 	cpus {
@@ -668,6 +669,47 @@ swport5: port@5 { /* MAC5 */
 			};
 		};
 
+		gmac: ethernet@c080000 {
+			compatible = "qcom,ipq4019-ess-edma";
+			reg = <0xc080000 0x8000>;
+			interrupts = <GIC_SPI  65 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  66 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  67 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  68 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  69 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  70 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  71 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  72 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  73 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  74 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  75 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  76 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  77 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  78 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  79 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI  80 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 240 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 241 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 242 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 243 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 244 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 245 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 246 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 247 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 248 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 249 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 250 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 251 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 252 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 253 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 254 IRQ_TYPE_EDGE_RISING>,
+				     <GIC_SPI 255 IRQ_TYPE_EDGE_RISING>;
+
+			status = "disabled";
+
+			phy-mode = "internal";
+		};
+
 		mdio: mdio@90000 {
 			#address-cells = <1>;
 			#size-cells = <0>;
-- 
2.35.1


^ permalink raw reply related

* [PATCH net-next 4/5] net: dt-bindings: Introduce the Qualcomm IPQESS Ethernet controller
From: Maxime Chevallier @ 2022-04-22 18:03 UTC (permalink / raw)
  To: davem, Rob Herring
  Cc: Maxime Chevallier, netdev, linux-kernel, devicetree,
	thomas.petazzoni, Andrew Lunn, Florian Fainelli, Heiner Kallweit,
	Russell King, linux-arm-kernel, Vladimir Oltean, Luka Perkov,
	Robert Marko
In-Reply-To: <20220422180305.301882-1-maxime.chevallier@bootlin.com>

Add the DT binding for the IPQESS Ethernet Controller. This is a simple
controller, only requiring the phy-mode, interrupts, clocks, and
possibly a MAC address setting.
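
For illustration only (not part of this patch): the order of the entries in
"interrupts" matters, because ipqess_axi_probe() in the driver patch appears
to read the TX interrupts first and the RX interrupts after them. The sketch
below restates that indexing. The two helper names are hypothetical; the
queue counts are copied from ipqess.h.

```c
#include <assert.h>

/* Values copied from ipqess.h in the driver patch */
#define IPQESS_MAX_TX_QUEUE 16
#define IPQESS_MAX_RX_QUEUE 8

/* Hypothetical helper (not in the driver): index of a TX queue's
 * interrupt within the "interrupts" property.
 */
static inline int ipqess_tx_irq_index(int txq)
{
	return txq;
}

/* Hypothetical helper (not in the driver): RX interrupts follow
 * the IPQESS_MAX_TX_QUEUE TX entries.
 */
static inline int ipqess_rx_irq_index(int rxq)
{
	return rxq + IPQESS_MAX_TX_QUEUE;
}
```

With the example node below, index 16 is the GIC_SPI 240 entry, so RX queue 0
would use that interrupt under this reading of the probe code.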

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
 .../devicetree/bindings/net/qcom,ipqess.yaml  | 94 +++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/qcom,ipqess.yaml

diff --git a/Documentation/devicetree/bindings/net/qcom,ipqess.yaml b/Documentation/devicetree/bindings/net/qcom,ipqess.yaml
new file mode 100644
index 000000000000..8fec5633692f
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/qcom,ipqess.yaml
@@ -0,0 +1,94 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/qcom,ipqess.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm IPQ ESS EDMA Ethernet Controller Device Tree Bindings
+
+allOf:
+  - $ref: "ethernet-controller.yaml#"
+
+maintainers:
+  - Maxime Chevallier <maxime.chevallier@bootlin.com>
+
+properties:
+  compatible:
+    const: qcom,ipq4019-ess-edma
+
+  reg:
+    maxItems: 1
+
+  interrupts:
+    minItems: 2
+    maxItems: 32
+    description: One interrupt per tx and rx queue, with up to 16 queues.
+
+  clocks:
+    maxItems: 1
+
+  phy-mode: true
+
+  fixed-link: true
+
+  mac-address: true
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - clocks
+  - phy-mode
+
+unevaluatedProperties: false
+
+examples:
+  - |
+    gmac: ethernet@c080000 {
+        compatible = "qcom,ipq4019-ess-edma";
+        reg = <0xc080000 0x8000>;
+        interrupts = <GIC_SPI  65 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  66 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  67 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  68 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  69 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  70 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  71 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  72 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  73 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  74 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  75 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  76 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  77 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  78 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  79 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI  80 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 240 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 241 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 242 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 243 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 244 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 245 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 246 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 247 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 248 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 249 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 250 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 251 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 252 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 253 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 254 IRQ_TYPE_EDGE_RISING>,
+                     <GIC_SPI 255 IRQ_TYPE_EDGE_RISING>;
+
+        status = "okay";
+
+        phy-mode = "internal";
+        fixed-link {
+            speed = <1000>;
+            full-duplex;
+            pause;
+            asym-pause;
+        };
+    };
+
+...
-- 
2.35.1


^ permalink raw reply related

* [PATCH net-next 0/5] net: ipqess: introduce Qualcomm IPQESS driver
From: Maxime Chevallier @ 2022-04-22 18:03 UTC (permalink / raw)
  To: davem, Rob Herring
  Cc: Maxime Chevallier, netdev, linux-kernel, devicetree,
	thomas.petazzoni, Andrew Lunn, Florian Fainelli, Heiner Kallweit,
	Russell King, linux-arm-kernel, Vladimir Oltean, Luka Perkov,
	Robert Marko

Hello everyone,

This series introduces a new driver, for the Qualcomm IPQESS Ethernet
Controller, found on the IPQ4019.

The driver itself is pretty straightforward, but has lived out-of-tree
for a while. I've done my best to clean up some outdated API calls, but
some might remain.

This controller is somewhat special, since it's part of the IPQ4019 SoC
which also includes a QCA8K switch, and uses the IPQESS controller for
the CPU port. The switch is so tightly integrated with the MAC that it
is connected to the MAC using an internal link (hence the fact that we
only support PHY_INTERFACE_MODE_INTERNAL), and this has some
consequences on the DSA side.

The tagging for the switch isn't done inband as most switches do, but
out-of-band, the DSA tag being included in the DMA descriptor.

So, this series also includes a new DSA tagging protocol, that sets the
DSA port index into skb->shinfo, so that the MAC driver can use it to
build the descriptor. This is definitely unusual, so I'm very open to
suggestions, comments and reviews on the tagging side of this series.
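
To make the out-of-band idea concrete, here is a minimal userspace
sketch, not the code from this series. Every name in it (oob_skb,
oob_tag_xmit, mac_build_tx_desc) and the descriptor bit layout are
assumptions made up for illustration. The real hardware layout and the
real skb fields may differ.

```c
/* Userspace model of out-of-band DSA tagging. Not kernel code.
 * All structures, names and bit positions below are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OOB_MAX_PORTS   6u                  /* assumed port count */
#define DESC_PORT_VALID (1u << 31)          /* assumed "port field valid" flag */
#define DESC_PORT_SHIFT 8u                  /* assumed position of the port field */
#define DESC_PORT_MASK  (0x3fu << DESC_PORT_SHIFT)

/* Stand-in for the per-packet shared info that travels with the skb. */
struct oob_shinfo {
	bool tag_valid;
	uint8_t dsa_port;
};

/* Stand-in for an skb: the frame bytes themselves are never modified. */
struct oob_skb {
	struct oob_shinfo shinfo;
	uint16_t len;
};

/* Tagger side: record the destination port next to the packet instead
 * of inserting tag bytes into the frame. */
static void oob_tag_xmit(struct oob_skb *skb, uint8_t port)
{
	assert(port < OOB_MAX_PORTS);
	skb->shinfo.dsa_port = port;
	skb->shinfo.tag_valid = true;
}

/* MAC driver side: fold the recorded port into the descriptor word.
 * An untagged packet yields a word with no port information. */
static uint32_t mac_build_tx_desc(const struct oob_skb *skb)
{
	uint32_t word = 0;

	if (skb->shinfo.tag_valid) {
		word |= DESC_PORT_VALID;
		word |= ((uint32_t)skb->shinfo.dsa_port << DESC_PORT_SHIFT) &
			DESC_PORT_MASK;
	}
	return word;
}
```

The point of the sketch is the split of responsibilities: the tagger
only annotates the packet metadata, the frame length is unchanged, and
the MAC driver is the one that turns the annotation into descriptor
bits.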

Thanks to the Sartura folks who worked on a base version of this driver,
and provided test hardware.

Best regards,

Maxime Chevallier

Maxime Chevallier (5):
  net: ipqess: introduce the Qualcomm IPQESS driver
  net: dsa: add out-of-band tagging protocol
  net: ipqess: Add out-of-band DSA tagging support
  net: dt-bindings: Introduce the Qualcomm IPQESS Ethernet controller
  ARM: dts: qcom: ipq4019: Add description for the IPQESS Ethernet
    controller

 .../devicetree/bindings/net/qcom,ipqess.yaml  |   94 ++
 MAINTAINERS                                   |    6 +
 arch/arm/boot/dts/qcom-ipq4019.dtsi           |   42 +
 drivers/net/ethernet/qualcomm/Kconfig         |   11 +
 drivers/net/ethernet/qualcomm/Makefile        |    2 +
 drivers/net/ethernet/qualcomm/ipqess/Makefile |    8 +
 drivers/net/ethernet/qualcomm/ipqess/ipqess.c | 1258 +++++++++++++++++
 drivers/net/ethernet/qualcomm/ipqess/ipqess.h |  515 +++++++
 .../ethernet/qualcomm/ipqess/ipqess_ethtool.c |  168 +++
 include/linux/skbuff.h                        |    7 +
 include/net/dsa.h                             |    2 +
 net/dsa/Kconfig                               |    7 +
 net/dsa/Makefile                              |    1 +
 net/dsa/tag_oob.c                             |   45 +
 14 files changed, 2166 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/qcom,ipqess.yaml
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/Makefile
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/ipqess.c
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/ipqess.h
 create mode 100644 drivers/net/ethernet/qualcomm/ipqess/ipqess_ethtool.c
 create mode 100644 net/dsa/tag_oob.c

-- 
2.35.1

^ permalink raw reply
