From: Jason Wang <jasowang@redhat.com>
To: Dmitry Fleytman <dmitry@daynix.com>
Cc: Leonid Bloch <leonid.bloch@ravellosystems.com>,
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>,
Leonid Bloch <leonid@daynix.com>,
qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 13/13] net: Introduce e1000e device emulation
Date: Thu, 7 Apr 2016 15:24:08 +0800
Message-ID: <57060B18.4010808@redhat.com>
In-Reply-To: <7BD4DE03-17BA-4638-A751-13DDC5F9A79F@daynix.com>

On 04/06/2016 04:22 PM, Dmitry Fleytman wrote:
> Hi Jason,
>
> Please see my comments below.
>
>> On 8 Mar 2016, at 11:31 AM, Jason Wang <jasowang@redhat.com
>> <mailto:jasowang@redhat.com>> wrote:
>>
>>
>>
>> On 02/23/2016 01:37 AM, Leonid Bloch wrote:
>>> From: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com
>>> <mailto:dmitry.fleytman@ravellosystems.com>>
>>>
>>> This patch introduces emulation for the Intel 82574 adapter, AKA e1000e.
>>>
>>> This implementation is derived from the e1000 emulation code, and
>>> utilizes the TX/RX packet abstractions that were initially developed for
>>> the vmxnet3 device. Although some parts of the introduced code may be
>>> shared with e1000, the differences are substantial enough that the
>>> only resources shared between the two devices are the definitions in
>>> hw/net/e1000_regs.h.
>>>
>>> Similarly to vmxnet3, the new device uses virtio headers for task
>>> offloads (for backends that support virtio extensions). Usage of
>>> virtio headers may be forcibly disabled via a boolean device property
>>> "vnet" (which is enabled by default). In such case task offloads
>>> will be performed in software, in the same way it is done on
>>> backends that do not support virtio headers.
>>>
>>> The device code is split into two parts:
>>>
>>> 1. hw/net/e1000e.c: QEMU-specific code for a network device;
>>> 2. hw/net/e1000e_core.[hc]: Device emulation according to the spec.
>>>
>>> The new device name is e1000e.
>>>
>>> Intel specifications for the 82574 controller are available at:
>>> http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
[...]
>>> +Device properties:
>>> +
>>> ++----------------+--------------------------------+--------+---------+
>>> +| Property name  | Description                    | Type   | Default |
>>> ++----------------+--------------------------------+--------+---------+
>>> +| subsys_ven     | PCI device Subsystem Vendor ID | UINT16 | 0x8086  |
>>> +| subsys         | PCI device Subsystem ID        | UINT16 | 0       |
>>> +| vnet           | Use virtio headers or perform  | BOOL   | TRUE    |
>>> +|                | SW offloads emulation          |        |         |
>>> ++----------------+--------------------------------+--------+---------+
>>
>> You could use PropertyInfo, which has a "description" field, to do this
>> better (e.g. qdev_prop_vlan). Then there would be no need for this file.
>
> We replaced this file with source code comments.
> As for the PropertyInfo description, I can’t see any way to pass a
> description to the DEFINE_PROP_* macros. Could you please elaborate on this?
You could use DEFINE_PROP(), which accepts a PropertyInfo parameter.
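For instance, a rough sketch of what this could look like (the info
variable, the init hook, and the wiring below are illustrative, not taken
from the patch):

/* Sketch only: clone the stock uint16 PropertyInfo, attach a
 * description, and pass it to DEFINE_PROP() instead of
 * DEFINE_PROP_UINT16().  All names below are illustrative. */
static PropertyInfo e1000e_prop_subsys_info;

static void e1000e_init_prop_infos(void)
{
    e1000e_prop_subsys_info = qdev_prop_uint16;
    e1000e_prop_subsys_info.description = "PCI device Subsystem ID";
}

static Property e1000e_properties[] = {
    DEFINE_NIC_PROPERTIES(E1000EState, conf),
    DEFINE_PROP("subsys", E1000EState, subsys,
                e1000e_prop_subsys_info, uint16_t),
    DEFINE_PROP_END_OF_LIST(),
};

The description should then show up in "-device e1000e,help" without a
separate documentation file.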
>
>>
>>> diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
>>> index 527d264..4201115 100644
>>> --- a/hw/net/Makefile.objs
>>> +++ b/hw/net/Makefile.objs
[...]
>>
>>> + MemoryRegion io;
>>> + MemoryRegion msix;
>>> +
>>> + uint32_t ioaddr;
>>> +
>>> + uint16_t subsys_ven;
>>> + uint16_t subsys;
>>> +
>>> + uint16_t subsys_ven_used;
>>> + uint16_t subsys_used;
>>> +
>>> + uint32_t intr_state;
>>> + bool use_vnet;
>>> +
>>> + E1000ECore core;
>>
>> What are the advantages of introducing another indirection level here?
>> It looks like it causes a lot of extra code overhead.
>
> This was done intentionally.
> Unlike e1000 and e1000e, which are really different devices, e1000e and
> newer Intel devices like ixgb* are rather similar. Introducing the
> e1000e core abstraction will simplify possible code reuse in the future.
Ok, I see.
>
>>
>>> +
>>> +} E1000EState;
>>> +
[...]
>>> +static NetClientInfo net_e1000e_info = {
>>> + .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>> + .size = sizeof(NICState),
>>> + .can_receive = e1000e_nc_can_receive,
>>> + .receive = e1000e_nc_receive,
>>> + .receive_iov = e1000e_nc_receive_iov,
>>
>> All three of the above functions have a NetClientState parameter, so
>> there is probably no need to have "nc" in their names.
>
> The issue is that there are other functions with similar names but
> without _nc_...
>
Right.
>>
>>> + .link_status_changed = e1000e_set_link_status,
>>> +};
>>> +
[...]
>>> + for (i = 0; i < s->conf.peers.queues; i++) {
>>> + nc = qemu_get_subqueue(s->nic, i);
>>> + if (!nc->peer || !qemu_has_vnet_hdr(nc->peer)) {
>>> + s->core.has_vnet = false;
>>
>> So in fact we're probing for vnet support instead of doing vnet on our
>> own if the backend does not support it. It seems to me that there's no
>> need to have a "vnet" property.
>
> This property is intended for forcibly disabling virtio headers
> support. Possible use cases are verification of the SW offloads logic and
> working around theoretical interoperability problems.
Ok, so maybe we'd better rename it to "disable_vnet".
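I.e. something like this (a sketch; the state field name is illustrative):

    DEFINE_PROP_BOOL("disable_vnet", E1000EState, disable_vnet_hdr, false),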
>
>>
>>> + trace_e1000e_cfg_support_virtio(false);
>>> + return;
>>> + }
>>> + }
>>> +
>>> + VMSTATE_UINT8(core.tx[1].sum_needed, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].ipcss, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].ipcso, E1000EState),
>>> + VMSTATE_UINT16(core.tx[1].ipcse, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].tucss, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].tucso, E1000EState),
>>> + VMSTATE_UINT16(core.tx[1].tucse, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].hdr_len, E1000EState),
>>> + VMSTATE_UINT16(core.tx[1].mss, E1000EState),
>>> + VMSTATE_UINT32(core.tx[1].paylen, E1000EState),
>>> + VMSTATE_INT8(core.tx[1].ip, E1000EState),
>>> + VMSTATE_INT8(core.tx[1].tcp, E1000EState),
>>> + VMSTATE_BOOL(core.tx[1].tse, E1000EState),
>>> + VMSTATE_BOOL(core.tx[1].cptse, E1000EState),
>>> + VMSTATE_BOOL(core.tx[1].skip_cp, E1000EState),
>>
>> You can move the tx state into another structure and use VMSTATE_ARRAY()
>> here.
>
> Not sure I got your point. Could you please provide more details?
I mean, e.g., move skip_cp into e1000e_txd_props and move tx_pkt out of
e1000_tx. Then you can use VMSTATE_ARRAY to save and load the
e1000e_txd_props array.
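Roughly like this (a sketch only; the e1000e_txd_props layout and the
core.tx_props array are hypothetical names here, and VMSTATE_STRUCT_ARRAY
is the struct-array flavour I had in mind):

/* Sketch: group the per-queue offload context into one struct.  This
 * assumes E1000ECore gains a field
 *     struct e1000e_txd_props tx_props[E1000E_NUM_QUEUES];
 */
struct e1000e_txd_props {
    uint8_t  sum_needed;
    uint8_t  ipcss, ipcso;
    uint16_t ipcse;
    uint8_t  tucss, tucso;
    uint16_t tucse;
    uint8_t  hdr_len;
    uint16_t mss;
    uint32_t paylen;
    int8_t   ip, tcp;
    bool     tse, cptse, skip_cp;
};

static const VMStateDescription vmstate_e1000e_txd_props = {
    .name = "e1000e/txd_props",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT8(sum_needed, struct e1000e_txd_props),
        VMSTATE_UINT8(ipcss, struct e1000e_txd_props),
        VMSTATE_UINT8(ipcso, struct e1000e_txd_props),
        VMSTATE_UINT16(ipcse, struct e1000e_txd_props),
        VMSTATE_UINT8(tucss, struct e1000e_txd_props),
        VMSTATE_UINT8(tucso, struct e1000e_txd_props),
        VMSTATE_UINT16(tucse, struct e1000e_txd_props),
        VMSTATE_UINT8(hdr_len, struct e1000e_txd_props),
        VMSTATE_UINT16(mss, struct e1000e_txd_props),
        VMSTATE_UINT32(paylen, struct e1000e_txd_props),
        VMSTATE_INT8(ip, struct e1000e_txd_props),
        VMSTATE_INT8(tcp, struct e1000e_txd_props),
        VMSTATE_BOOL(tse, struct e1000e_txd_props),
        VMSTATE_BOOL(cptse, struct e1000e_txd_props),
        VMSTATE_BOOL(skip_cp, struct e1000e_txd_props),
        VMSTATE_END_OF_LIST()
    }
};

/* ...and the per-index lines above collapse into a single entry: */
VMSTATE_STRUCT_ARRAY(core.tx_props, E1000EState, E1000E_NUM_QUEUES, 1,
                     vmstate_e1000e_txd_props, struct e1000e_txd_props),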
>
>>
>>> + VMSTATE_END_OF_LIST()
>>> + }
>>> +};
>>> +
>>> +static Property e1000e_properties[] = {
>>> + DEFINE_NIC_PROPERTIES(E1000EState, conf),
>>> + DEFINE_PROP_BOOL("vnet", E1000EState, use_vnet, true),
>>> + DEFINE_PROP_UINT16("subsys_ven", E1000EState,
>>> + subsys_ven, PCI_VENDOR_ID_INTEL),
>>> + DEFINE_PROP_UINT16("subsys", E1000EState, subsys, 0),
>>
>> I'm not quite sure why we need subsys_ven and subsys here.
>
> Some vendors (like VMware) may replace these device IDs with their
> own. These parameters provide a way to make the device look exactly like
> theirs. This is useful in various V2V scenarios.
I get it.
Thanks
>
>>
>>> + DEFINE_PROP_END_OF_LIST(),
>>> +};
>>> +
[...]
>>> +
>>> +static void
>>> +e1000e_intrmgr_on_throttling_timer(void *opaque)
>>> +{
>>> + E1000IntrDelayTimer *timer = opaque;
>>> +
>>> + assert(!msix_enabled(timer->core->owner));
>>
>> Can this assert be triggered by the guest? If yes, let's remove it.
>
> No, this timer is armed for legacy interrupts only.
> This assertion would mean there is an internal logic error in the device,
> i.e. the device code did not check we’re working with legacy
> interrupts before arming it.
Ok.
>
>>
>>> +
>>> + timer->running = false;
>>> +
>>> + if (!timer->core->itr_intr_pending) {
>>> + trace_e1000e_irq_throttling_no_pending_interrupts();
>>> + return;
>>> + }
>>> +
>>> + if (msi_enabled(timer->core->owner)) {
>>> + trace_e1000e_irq_msi_notify_postponed();
>>> + e1000e_set_interrupt_cause(timer->core, 0);
>>> + } else {
>>> + trace_e1000e_irq_legacy_notify_postponed();
>>> + e1000e_set_interrupt_cause(timer->core, 0);
>>> + }
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_on_msix_throttling_timer(void *opaque)
>>> +{
>>> + E1000IntrDelayTimer *timer = opaque;
>>> + int idx = timer - &timer->core->eitr[0];
>>> +
>>> + assert(msix_enabled(timer->core->owner));
>>
>> Same question as above.
>
> The same. This timer is for MSI-X only.
Ok.
>
>>
>>> +
>>> + timer->running = false;
>>> +
>>> + if (!timer->core->eitr_intr_pending[idx]) {
>>> + trace_e1000e_irq_throttling_no_pending_vec(idx);
>>> + return;
>>> + }
>>> +
>>> + trace_e1000e_irq_msix_notify_postponed_vec(idx);
>>> + msix_notify(timer->core->owner, idx);
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_initialize_all_timers(E1000ECore *core, bool create)
>>> +{
>>> + int i;
>>> +
>>> + core->radv.delay_reg = RADV;
>>> + core->rdtr.delay_reg = RDTR;
>>> + core->raid.delay_reg = RAID;
>>> + core->tadv.delay_reg = TADV;
>>> + core->tidv.delay_reg = TIDV;
>>> +
>>> + core->radv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->rdtr.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->raid.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->tadv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->tidv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +
>>> + core->radv.core = core;
>>> + core->rdtr.core = core;
>>> + core->raid.core = core;
>>> + core->tadv.core = core;
>>> + core->tidv.core = core;
>>> +
>>> + core->itr.core = core;
>>
>> Couldn't we simply get the core pointer through container_of()?
>
> Unfortunately not. Timer abstraction functions are not aware of which
> specific timer they’re dealing with.
Right, I see.
>
>>
>>> + core->itr.delay_reg = ITR;
>>> + core->itr.delay_resolution_ns = E1000_INTR_THROTTLING_NS_RES;
>>> +
>>> + for (i = 0; i < E1000E_MSIX_VEC_NUM; i++) {
>>> + core->eitr[i].core = core;
>>> + core->eitr[i].delay_reg = EITR + i;
>>> + core->eitr[i].delay_resolution_ns =
>>> E1000_INTR_THROTTLING_NS_RES;
>>> + }
>>> +
>>> + if (!create) {
>>> + return;
>>> + }
>>> +
>>> + core->radv.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->radv);
>>> + core->rdtr.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->rdtr);
>>> + core->raid.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->raid);
>>> +
>>> + core->tadv.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->tadv);
>>> + core->tidv.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->tidv);
>>> +
>>> + core->itr.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> + e1000e_intrmgr_on_throttling_timer,
>>> + &core->itr);
>>
>> Should we stop/start the above timers during vm stop/start through a vm
>> state change handler?
>
> Not sure. Is there any reason for this?
Otherwise we may raise an irq while the vm is stopped?
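Something along these lines, perhaps (a sketch only; the pause/resume
helpers are hypothetical):

/* Register a VM state change handler at realize time and freeze the
 * delay/throttling timers while the VM is stopped. */
static void e1000e_vm_state_change(void *opaque, int running, RunState state)
{
    E1000ECore *core = opaque;

    if (running) {
        e1000e_intrmgr_resume_delay_timers(core);
    } else {
        e1000e_intrmgr_pause_delay_timers(core);
    }
}

/* ...and in the realize path:
 *     qemu_add_vm_change_state_handler(e1000e_vm_state_change, core);
 */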
[...]
>>> +
>>> +static inline void
>>> +e1000e_tx_ring_init(E1000ECore *core, E1000E_TxRing *txr, int idx)
>>> +{
>>> + static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> + { TDBAH, TDBAL, TDLEN, TDH, TDT, 0 },
>>> + { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 }
>>> + };
>>
>> Instead of using a static array inside a function, why not just use a
>> global array? Then there's no need for this function, and the caller can
>> access it directly.
>
> Anyway, there should be a function to combine the two assignments below.
> And since this array is used only by this function, it is better kept
> hidden.
I mean you can avoid the assignment before each transmission by just
introducing arrays like:
    int tdt[] = { TDT, TDT1 };
and then accessing them through qidx, e.g. core->mac[tdt[qidx]].
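For example (a sketch; the array and helper names are mine):

static const int tdt_regs[E1000E_NUM_QUEUES] = { TDT, TDT1 };

/* Read the tail pointer of TX queue qidx directly, with no per-call
 * ring-info initialisation. */
static inline uint32_t e1000e_tx_tail(E1000ECore *core, int qidx)
{
    return core->mac[tdt_regs[qidx]];
}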
>
>>
>>> +
>>> + assert(idx < ARRAY_SIZE(i));
>>> +
>>> + txr->i = &i[idx];
>>> + txr->tx = &core->tx[idx];
>>> +}
>>> +
>>> +typedef struct E1000E_RxRing_st {
>>> + const E1000E_RingInfo *i;
>>> +} E1000E_RxRing;
>>> +
>>> +static inline void
>>> +e1000e_rx_ring_init(E1000ECore *core, E1000E_RxRing *rxr, int idx)
>>> +{
>>> + static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> + { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 },
>>> + { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 }
>>> + };
>>
>> Similar issue with above.
>>
>>> +
>>> + assert(idx < ARRAY_SIZE(i));
>>> +
>>> + rxr->i = &i[idx];
>>> +}
>>> +
[...]
>>> + trace_e1000e_rx_start_recv();
>>> +
>>> + for (i = 0; i <= core->max_queue_num; i++) {
>>> +
>>> qemu_flush_queued_packets(qemu_get_subqueue(core->owner_nic, i));
>>
>> Is this the hardware behavior? I mean, will setting rdt of queue 0 also
>> flush queue 1? It looks like the correct behavior is to calculate the
>> queue index and flush only that queue.
>
> The issue is that there is no direct correspondence between virtio
> queues and device queues.
> There is an RSS mechanism in the middle that may re-route packets from
> virtio queue 0 to device queue 1 and vice versa,
> so we cannot know where each specific packet goes until it is read and
> processed.
Aha, I see.
>
>>
>>> + }
>>> +}
>>> +
>>> +int
>>> +e1000e_can_receive(E1000ECore *core)
>>> +{
>>> + int i;
>>> +
>>> + bool link_up = core->mac[STATUS] & E1000_STATUS_LU;
>>> + bool rx_enabled = core->mac[RCTL] & E1000_RCTL_EN;
>>> + bool pci_master = core->owner->config[PCI_COMMAND] &
>>> PCI_COMMAND_MASTER;
>>
>> Should we flush the queued packets when the guest enables bus mastering
>> (like what we did in e1000_write_config)?
>
> Yes, fixed.
>
>>
>>> +
>>> + if (!link_up || !rx_enabled || !pci_master) {
>>> + trace_e1000e_rx_can_recv_disabled(link_up, rx_enabled,
>>> pci_master);
>>> + return false;
>>> + }
>>> +
>>> + for (i = 0; i < E1000E_NUM_QUEUES; i++) {
>>> + E1000E_RxRing rxr;
>>> +
>>> + e1000e_rx_ring_init(core, &rxr, i);
>>> + if (e1000e_ring_enabled(core, rxr.i) &&
>>> + e1000e_has_rxbufs(core, rxr.i, 1)) {
>>> + trace_e1000e_rx_can_recv();
>>> + return true;
>>
>> Similar issue: this will return true if any one of the queues has an
>> available buffer. This seems wrong.
>
>
> The same as above. This is done due to the RSS mechanism of the device.
Right, this looks ok now, but there's also the case where, e.g., we want to
queue the packet to queue 0 but only queue 1 has an available buffer. We may
need to check this in the future.
[...]
>
> Thanks for review,
> Dmitry
>
You're welcome.