From: Mike Rapoport <rppt@linux.vnet.ibm.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: "Jesper Dangaard Brouer" <brouer@redhat.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
"Björn Töpel" <bjorn.topel@intel.com>,
"Karlsson, Magnus" <magnus.karlsson@intel.com>,
"Alexander Duyck" <alexander.duyck@gmail.com>,
"Mel Gorman" <mgorman@techsingularity.net>,
"Tom Herbert" <tom@herbertland.com>,
"Brenden Blanco" <bblanco@plumgrid.com>,
"Tariq Toukan" <tariqt@mellanox.com>,
"Saeed Mahameed" <saeedm@mellanox.com>,
"Jesse Brandeburg" <jesse.brandeburg@intel.com>,
"Kalman Meth" <METH@il.ibm.com>,
"Vladislav Yasevich" <vyasevich@gmail.com>
Subject: Re: Designing a safe RX-zero-copy Memory Model for Networking
Date: Tue, 13 Dec 2016 11:42:22 +0200
Message-ID: <20161213094222.GF19987@rapoport-lnx>
In-Reply-To: <584EB8DF.8000308@gmail.com>
On Mon, Dec 12, 2016 at 06:49:03AM -0800, John Fastabend wrote:
> On 16-12-12 06:14 AM, Mike Rapoport wrote:
> >>
> > We have not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and a virtual NIC, which is not the case with the more generic tap
> > device. It could be that the use of XDP will allow for a generic solution
> > for the virtio case as well.
>
> Interesting, this was one of the original ideas behind the macvlan
> offload mode. IIRC Vlad was also interested in this.
>
> I'm guessing this was used because of the ability to push macvlan onto
> its own queue?
Yes, with a queue dedicated to a virtual NIC we only need to ensure that
guest memory is used for RX buffers.
> >>
> >>> Have you considered using "push" model for setting the NIC's RX memory?
> >>
> >> I don't understand what you mean by a "push" model?
> >
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make a NIC use some
> > preallocated pages: either the NIC driver calls an API (probably different
> > from alloc_page) to obtain that memory, or there is an NDO API that
> > allows setting the NIC's RX buffers. I named the latter case "push".
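To make the distinction concrete, here is a minimal user-space sketch of the
two models. Every name in it (rx_ring, demo_pool, demo_netdev_ops,
set_rx_buffers) is hypothetical and exists only for illustration; none of it
is an existing kernel API, and a real driver would also have to deal with DMA
mapping and buffer recycling.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4

/* Toy RX ring: just a list of buffer pointers posted to the "NIC". */
struct rx_ring {
	void *buf[RING_SIZE];
	unsigned int count;
};

/* Toy pool of preallocated buffers (stands in for guest memory). */
struct demo_pool {
	char *base;
	size_t buf_size;
	unsigned int next;
	unsigned int total;
};

/* "Pull" model: the driver calls an allocation API for each buffer. */
static void *demo_pool_alloc(void *ctx)
{
	struct demo_pool *pool = ctx;

	if (pool->next >= pool->total)
		return NULL;
	return pool->base + (size_t)pool->next++ * pool->buf_size;
}

static int ring_refill_pull(struct rx_ring *ring,
			    void *(*alloc)(void *ctx), void *ctx)
{
	while (ring->count < RING_SIZE) {
		void *p = alloc(ctx);

		if (!p)
			return -1;	/* pool exhausted */
		ring->buf[ring->count++] = p;
	}
	return 0;
}

/* "Push" model: the buffer owner hands buffers to the driver via an op. */
struct demo_netdev_ops {
	int (*set_rx_buffers)(struct rx_ring *ring, void **bufs,
			      unsigned int n);
};

static int ring_set_rx_buffers(struct rx_ring *ring, void **bufs,
			       unsigned int n)
{
	unsigned int i;

	if (n > RING_SIZE - ring->count)
		return -1;	/* not enough free slots */
	for (i = 0; i < n; i++)
		ring->buf[ring->count++] = bufs[i];
	return 0;
}
```

In the pull variant the driver decides when to fetch memory; in the push
variant the owner of the memory decides what the queue receives, which is
closer to what we need when the buffers must come from a particular guest.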
>
> I prefer the ndo op. This matches up well with AF_PACKET model where we
> have "slots" and offload is just a transparent "push" of these "slots"
> to the driver. Below we have a snippet of our proposed API,
>
> (https://patchwork.ozlabs.org/patch/396714/ note the descriptor mapping
> bits will be dropped)
>
> + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> + * struct net_device *dev)
> + * Called to map queue pair range from split_queue_pairs into
> + * mmap region.
> +
>
> > +
> > +static int
> > +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> > +{
> > +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> > +	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> > +	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> > +	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> > +	unsigned long dummy_page_phy;
> > +	pgprot_t pre_vm_page_prot;
> > +	unsigned long start;
> > +	unsigned int i;
> > +	int err;
> > +
> > +	if (!dummy_page_buf) {
> > +		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> > +		if (!dummy_page_buf)
> > +			return -ENOMEM;
> > +
> > +		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> > +			dummy_page_buf[i] = 0xdeadbeef;
> > +	}
> > +
> > +	dummy_page_phy = virt_to_phys(dummy_page_buf);
> > +	pre_vm_page_prot = vma->vm_page_prot;
> > +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > +
> > +	/* assume the vm_start is 4K aligned address */
> > +	for (start = vma->vm_start;
> > +	     start < vma->vm_end;
> > +	     start += PAGE_SIZE_4K) {
> > +		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> > +			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> > +					      vma->vm_page_prot);
> > +			if (err)
> > +				return -EAGAIN;
> > +		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> > +			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> > +					      vma->vm_page_prot);
> > +			if (err)
> > +				return -EAGAIN;
> > +		} else {
> > +			unsigned long addr = dummy_page_phy >> PAGE_SHIFT;
> > +
> > +			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> > +					      pre_vm_page_prot);
> > +			if (err)
> > +				return -EAGAIN;
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +
>
> Any thoughts on something like the above? We could push it when net-next
> opens. One piece that fits naturally into vhost/macvtap is that the kicks
> and queue splicing are already there, so there is no need to implement
> them, which makes the above patch much simpler.
Sorry, but I don't quite follow you here. vhost does not use vma
mappings, it just sees a bunch of pages pointed to by the vring descriptors...
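
To illustrate what I mean, below is a rough user-space sketch of that view.
The descriptor layout mirrors the virtio split-ring descriptor (address,
length, flags, next index) with the endianness annotations dropped; the demo_
names and the chain_writable_bytes() helper are made up for this example and
are not vhost code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same fields as a virtio split-ring descriptor, simplified. */
struct demo_vring_desc {
	uint64_t addr;	/* guest-physical address of the buffer */
	uint32_t len;	/* buffer length in bytes */
	uint16_t flags;	/* DEMO_DESC_F_* bits */
	uint16_t next;	/* next descriptor index, valid if F_NEXT is set */
};

#define DEMO_DESC_F_NEXT	1	/* chain continues at 'next' */
#define DEMO_DESC_F_WRITE	2	/* device may write into this buffer */

/*
 * Hypothetical helper: follow the chain starting at 'head' and add up the
 * bytes the device is allowed to write, i.e. the RX space of that chain.
 * Returns 0 for a malformed chain (index out of range or a loop).
 */
static size_t chain_writable_bytes(const struct demo_vring_desc *table,
				   uint16_t num, uint16_t head)
{
	size_t total = 0;
	uint16_t i = head;
	uint16_t seen;

	for (seen = 0; seen < num; seen++) {
		if (i >= num)
			return 0;
		if (table[i].flags & DEMO_DESC_F_WRITE)
			total += table[i].len;
		if (!(table[i].flags & DEMO_DESC_F_NEXT))
			return total;
		i = table[i].next;
	}
	return 0;	/* walked more than 'num' entries: the chain loops */
}
```

Nothing here involves a vma: the consumer only ever sees addresses and
lengths taken from the descriptor table.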
> .John
--
Sincerely yours,
Mike.