Re: [Qemu-devel] [PATCH v2 13/13] net: Introduce e1000e device emulation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jason Wang <jasowang@redhat.com>
To: Dmitry Fleytman <dmitry@daynix.com>
Cc: Leonid Bloch <leonid.bloch@ravellosystems.com>,
	Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>,
	Leonid Bloch <leonid@daynix.com>,
	qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 13/13] net: Introduce e1000e device emulation
Date: Thu, 7 Apr 2016 15:24:08 +0800	[thread overview]
Message-ID: <57060B18.4010808@redhat.com> (raw)
In-Reply-To: <7BD4DE03-17BA-4638-A751-13DDC5F9A79F@daynix.com>



On 04/06/2016 04:22 PM, Dmitry Fleytman wrote:
> Hi Jason,
>
> Please see my comments below.
>
>> On 8 Mar 2016, at 11:31 AM, Jason Wang <jasowang@redhat.com
>> <mailto:jasowang@redhat.com>> wrote:
>>
>>
>>
>> On 02/23/2016 01:37 AM, Leonid Bloch wrote:
>>> From: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com
>>> <mailto:dmitry.fleytman@ravellosystems.com>>
>>>
>>> This patch introduces emulation for the Intel 82574 adapter, AKA e1000e.
>>>
>>> This implementation is derived from the e1000 emulation code, and
>>> utilizes the TX/RX packet abstractions that were initially developed for
>>> the vmxnet3 device. Although some parts of the introduced code may be
>>> shared with e1000, the differences are substantial enough so that the
>>> only shared resources for the two devices are the definitions in
>>> hw/net/e1000_regs.h.
>>>
>>> Similarly to vmxnet3, the new device uses virtio headers for task
>>> offloads (for backends that support virtio extensions). Usage of
>>> virtio headers may be forcibly disabled via a boolean device property
>>> "vnet" (which is enabled by default). In such case task offloads
>>> will be performed in software, in the same way it is done on
>>> backends that do not support virtio headers.
>>>
>>> The device code is split into two parts:
>>>
>>>  1. hw/net/e1000e.c: QEMU-specific code for a network device;
>>>  2. hw/net/e1000e_core.[hc]: Device emulation according to the spec.
>>>
>>> The new device name is e1000e.
>>>
>>> Intel specifications for the 82574 controller are available at:
>>> http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf

[...]

>>> +Device properties:
>>> +
>>> ++-------------------------------------------------+--------+---------+
>>> +| Propery name   |         Description            |  Type  | Default |
>>> ++--------------------------------------------------------------------|
>>> +|  subsys_ven    | PCI device Subsystem Vendor ID | UINT16 | 0x8086  |
>>> +|                |                                |        |         |
>>> +|  subsys        | PCI device Subsystem ID        | UINT16 | 0       |
>>> +|                |                                |        |         |
>>> +|  vnet          | Use virtio headers or perform  | BOOL   | TRUE    |
>>> +|                | SW offloads emulation          |        |         |
>>> ++----------------+--------------------------------+--------+---------+
>>
>> You may using PropertyInfo which has a "description" field to do this
>> better (e.g qdev_prop_vlan). Then there's even no need for this file.
>
> We replaced this file with source code comments.
> As for PropertyInfo description I can’t see any way to pass
> description to DEFINE_PROP_* macros. Could you please elaborate on this?

You could use DEFINE_PROP() which can accept a PropertyInfo parameter.

>
>>
>>> diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
>>> index 527d264..4201115 100644
>>> --- a/hw/net/Makefile.objs
>>> +++ b/hw/net/Makefile.objs

[...]

>>
>>> +    MemoryRegion io;
>>> +    MemoryRegion msix;
>>> +
>>> +    uint32_t ioaddr;
>>> +
>>> +    uint16_t subsys_ven;
>>> +    uint16_t subsys;
>>> +
>>> +    uint16_t subsys_ven_used;
>>> +    uint16_t subsys_used;
>>> +
>>> +    uint32_t intr_state;
>>> +    bool use_vnet;
>>> +
>>> +    E1000ECore core;
>>
>> What's the advantages of introducing another indirection level here?
>> Looks like it causes lots of extra code overheads.
>
> This was done by intention.
> Unlike e1000 and e1000e devices which are really different, e1000e and
> newer Intel devices like ixgb* are rather similar. Introduction of
> e1000e core abstraction will simplify possible code reuse in the future.

Ok, I see.

>
>>
>>> +
>>> +} E1000EState;
>>> +

[...]

>>> +static NetClientInfo net_e1000e_info = {
>>> +    .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>> +    .size = sizeof(NICState),
>>> +    .can_receive = e1000e_nc_can_receive,
>>> +    .receive = e1000e_nc_receive,
>>> +    .receive_iov = e1000e_nc_receive_iov,
>>
>> All of the above three functions has a NetClientState parameter, so
>> probably no need to have "nc" in their name.
>
> The issue is there are other functions with similar names but without
> _nc_...
>

Right.

>>
>>> +    .link_status_changed = e1000e_set_link_status,
>>> +};
>>> +

[...]

>>> +    for (i = 0; i < s->conf.peers.queues; i++) {
>>> +        nc = qemu_get_subqueue(s->nic, i);
>>> +        if (!nc->peer || !qemu_has_vnet_hdr(nc->peer)) {
>>> +            s->core.has_vnet = false;
>>
>> So in fact we're probing the vnet support instead of doing vnet our own
>> if backend does not support it. It seems to me that there's no need to
>> having "vnet" property.
>
> This property is intended for forcible disable of virtio headers
> support. Possible use cases are verification of SW offloads logic and
> work around for theoretical interoperability problems.

Ok, so maybe we'd better rename it to "disable_vnet".

>
>>
>>> +            trace_e1000e_cfg_support_virtio(false);
>>> +            return;
>>> +        }
>>> +    }
>>> +



>>> +        VMSTATE_UINT8(core.tx[1].sum_needed, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].ipcss, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].ipcso, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].ipcse, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].tucss, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].tucso, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].tucse, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].hdr_len, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].mss, E1000EState),
>>> +        VMSTATE_UINT32(core.tx[1].paylen, E1000EState),
>>> +        VMSTATE_INT8(core.tx[1].ip, E1000EState),
>>> +        VMSTATE_INT8(core.tx[1].tcp, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].tse, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].cptse, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].skip_cp, E1000EState),
>>
>> You can move the tx state into another structure and use VMSTATE_ARRAY()
>> here.
>
> Not sure I got you point. Could you please provide more details?

I mean e.g, move skip_cp into e1000e_txd_props and move tx_pkt out of
e1000_tx. Then you can use VMSTATE_ARRAY to save and load
e1000_txd_props array.

>
>>
>>> +        VMSTATE_END_OF_LIST()
>>> +    }
>>> +};
>>> +
>>> +static Property e1000e_properties[] = {
>>> +    DEFINE_NIC_PROPERTIES(E1000EState, conf),
>>> +    DEFINE_PROP_BOOL("vnet", E1000EState, use_vnet, true),
>>> +    DEFINE_PROP_UINT16("subsys_ven", E1000EState,
>>> +                       subsys_ven, PCI_VENDOR_ID_INTEL),
>>> +    DEFINE_PROP_UINT16("subsys", E1000EState, subsys, 0),
>>
>> I'm not quite sure the reason we need subsys_ven and subsys here?
>
> Some vendors (like VMWare) may replace these device IDs with their
> own. These parameters provide a way to make device look exactly as
> those. Useful in various V2V scenarios.

I get it.

Thanks

>
>>
>>> +    DEFINE_PROP_END_OF_LIST(),
>>> +};
>>> +

[...]

>>> +
>>> +static void
>>> +e1000e_intrmgr_on_throttling_timer(void *opaque)
>>> +{
>>> +    E1000IntrDelayTimer *timer = opaque;
>>> +
>>> +    assert(!msix_enabled(timer->core->owner));
>>
>> Can this assert be triggered by guest? If yes, let's remove this.
>
> No, this timer is armed for legacy interrupts only.
> This assertion would mean there is an internal logic error in the device,
> i.e. the device code did not check we’re working with legacy
> interrupts before arming it.

Ok.

>
>>
>>> +
>>> +    timer->running = false;
>>> +
>>> +    if (!timer->core->itr_intr_pending) {
>>> +        trace_e1000e_irq_throttling_no_pending_interrupts();
>>> +        return;
>>> +    }
>>> +
>>> +    if (msi_enabled(timer->core->owner)) {
>>> +        trace_e1000e_irq_msi_notify_postponed();
>>> +        e1000e_set_interrupt_cause(timer->core, 0);
>>> +    } else {
>>> +        trace_e1000e_irq_legacy_notify_postponed();
>>> +        e1000e_set_interrupt_cause(timer->core, 0);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_on_msix_throttling_timer(void *opaque)
>>> +{
>>> +    E1000IntrDelayTimer *timer = opaque;
>>> +    int idx = timer - &timer->core->eitr[0];
>>> +
>>> +    assert(msix_enabled(timer->core->owner));
>>
>> Same question as above.
>
> The same. This timer is for MSI-X only.

Ok.

>
>>
>>> +
>>> +    timer->running = false;
>>> +
>>> +    if (!timer->core->eitr_intr_pending[idx]) {
>>> +        trace_e1000e_irq_throttling_no_pending_vec(idx);
>>> +        return;
>>> +    }
>>> +
>>> +    trace_e1000e_irq_msix_notify_postponed_vec(idx);
>>> +    msix_notify(timer->core->owner, idx);
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_initialize_all_timers(E1000ECore *core, bool create)
>>> +{
>>> +    int i;
>>> +
>>> +    core->radv.delay_reg = RADV;
>>> +    core->rdtr.delay_reg = RDTR;
>>> +    core->raid.delay_reg = RAID;
>>> +    core->tadv.delay_reg = TADV;
>>> +    core->tidv.delay_reg = TIDV;
>>> +
>>> +    core->radv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->rdtr.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->raid.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->tadv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->tidv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +
>>> +    core->radv.core = core;
>>> +    core->rdtr.core = core;
>>> +    core->raid.core = core;
>>> +    core->tadv.core = core;
>>> +    core->tidv.core = core;
>>> +
>>> +    core->itr.core = core;
>>
>> Couldn't we simply get core pointer through container_of() ?
>
> Unfortunately not. Timer abstraction functions are not aware of which
> specific timer they’re dealing with.

Yes, it is.

>
>>
>>> +    core->itr.delay_reg = ITR;
>>> +    core->itr.delay_resolution_ns = E1000_INTR_THROTTLING_NS_RES;
>>> +
>>> +    for (i = 0; i < E1000E_MSIX_VEC_NUM; i++) {
>>> +        core->eitr[i].core = core;
>>> +        core->eitr[i].delay_reg = EITR + i;
>>> +        core->eitr[i].delay_resolution_ns =
>>> E1000_INTR_THROTTLING_NS_RES;
>>> +    }
>>> +
>>> +    if (!create) {
>>> +        return;
>>> +    }
>>> +
>>> +    core->radv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->radv);
>>> +    core->rdtr.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->rdtr);
>>> +    core->raid.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->raid);
>>> +
>>> +    core->tadv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->tadv);
>>> +    core->tidv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->tidv);
>>> +
>>> +    core->itr.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> +                                   e1000e_intrmgr_on_throttling_timer,
>>> +                                   &core->itr);
>>
>> Should we stop/start the above timers during vm stop/start through vm
>> state change handler?
>
> Not sure. Is there any reason for this?

Otherwise we may raise irq during vm stop?

[...]

>>> +
>>> +static inline void
>>> +e1000e_tx_ring_init(E1000ECore *core, E1000E_TxRing *txr, int idx)
>>> +{
>>> +    static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> +        { TDBAH,  TDBAL,  TDLEN,  TDH,  TDT, 0 },
>>> +        { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 }
>>> +    };
>>
>> Instead of using static inside a function, why not just use a global
>> array and then there's no need for this function and caller can access
>> it directly?
>
> Anyway there should be a function to combine the two assignments below.
> And since this array is used by this function only it should better be
> hidden.

I mean you can avoid the assignment before each transmission by just
introducing arrays like:

int tdt[] = {TDT, TDT1};

And then access them through qidx, e.g: core->mac[tdt[qidx]].

>
>>
>>> +
>>> +    assert(idx < ARRAY_SIZE(i));
>>> +
>>> +    txr->i     = &i[idx];
>>> +    txr->tx    = &core->tx[idx];
>>> +}
>>> +
>>> +typedef struct E1000E_RxRing_st {
>>> +    const E1000E_RingInfo *i;
>>> +} E1000E_RxRing;
>>> +
>>> +static inline void
>>> +e1000e_rx_ring_init(E1000ECore *core, E1000E_RxRing *rxr, int idx)
>>> +{
>>> +    static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> +        { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 },
>>> +        { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 }
>>> +    };
>>
>> Similar issue with above.
>>
>>> +
>>> +    assert(idx < ARRAY_SIZE(i));
>>> +
>>> +    rxr->i      = &i[idx];
>>> +}
>>> +

[...]

>>> +    trace_e1000e_rx_start_recv();
>>> +
>>> +    for (i = 0; i <= core->max_queue_num; i++) {
>>> +
>>>        qemu_flush_queued_packets(qemu_get_subqueue(core->owner_nic, i));
>>
>> Is this the hardware behavior, I mean setting rdt of queue 0 will also
>> flush queue 1? Looks like the correct behavior is to calculate the queue
>> index and flush it.
>
> The issue is that there is no direct correspondence between virtio
> queues and device queues.
> There is RSS mechanism in the middle that may re-route packets from
> virtio queue 0 to device queue 1 and vice versa,
> so we cannot know where each specific packets goes until it read and
> processed.

Aha, I see.

>
>>
>>> +    }
>>> +}
>>> +
>>> +int
>>> +e1000e_can_receive(E1000ECore *core)
>>> +{
>>> +    int i;
>>> +
>>> +    bool link_up = core->mac[STATUS] & E1000_STATUS_LU;
>>> +    bool rx_enabled = core->mac[RCTL] & E1000_RCTL_EN;
>>> +    bool pci_master = core->owner->config[PCI_COMMAND] &
>>> PCI_COMMAND_MASTER;
>>
>> Should we flush the queue packets when guest enable bus master? (like
>> what we did in e1000_write_config)?
>
> Yes, fixed.
>
>>
>>> +
>>> +    if (!link_up || !rx_enabled || !pci_master) {
>>> +        trace_e1000e_rx_can_recv_disabled(link_up, rx_enabled,
>>> pci_master);
>>> +        return false;
>>> +    }
>>> +
>>> +    for (i = 0; i < E1000E_NUM_QUEUES; i++) {
>>> +        E1000E_RxRing rxr;
>>> +
>>> +        e1000e_rx_ring_init(core, &rxr, i);
>>> +        if (e1000e_ring_enabled(core, rxr.i) &&
>>> +            e1000e_has_rxbufs(core, rxr.i, 1)) {
>>> +            trace_e1000e_rx_can_recv();
>>> +            return true;
>>
>> Similar issue, this will return true if anyone of the queue has
>> available buffer. This seems wrong.
>
>
> The same as above. This is done due to RSS mechanisms of the device.

Right, looks ok now, but there's also a case e.g we want to queue the
packet to queue 0, but only queue 1 has the available buffer. May need
to check this in the future.

[...]

>
> Thanks for review,
> Dmitry
>

You're welcome

next prev parent reply	other threads:[~2016-04-07  7:24 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-02-22 17:37 [Qemu-devel] [PATCH v2 00/13] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 01/13] msix: make msix_clr_pending() visible for clients Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 02/13] pci: Introduce define for PM capability version 1.1 Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 03/13] pcie: Add support for PCIe CAP v1 Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 04/13] pcie: Introduce function for DSN capability creation Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 05/13] vmxnet3: Use generic function for DSN capability definition Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 06/13] net: Introduce Toeplitz hash calculator Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 07/13] net: Add macros for MAC address tracing Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 08/13] vmxnet3: Use common MAC address tracing macros Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 09/13] net_pkt: Name vmxnet3 packet abstractions more generic Leonid Bloch
2016-03-08  3:10   ` Jason Wang
2016-04-06  7:27     ` Dmitry Fleytman
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 10/13] rtl8139: Move more TCP definitions to common header Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 11/13] net_pkt: Extend packet abstraction as required by e1000e functionality Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 12/13] e1000_regs: Add definitions for Intel 82574-specific bits Leonid Bloch
2016-02-22 17:37 ` [Qemu-devel] [PATCH v2 13/13] net: Introduce e1000e device emulation Leonid Bloch
2016-03-08  9:31   ` Jason Wang
2016-04-06  8:22     ` Dmitry Fleytman
2016-04-06 13:23       ` Michael S. Tsirkin
2016-04-06 13:42         ` Dmitry Fleytman
2016-04-06 13:44           ` Michael S. Tsirkin
2016-04-06 13:45             ` Dmitry Fleytman
2016-04-07  7:24       ` Jason Wang [this message]
2016-04-10 15:08         ` Dmitry Fleytman
2016-03-03 10:02 ` [Qemu-devel] [PATCH v2 00/13] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
2016-03-04  1:22   ` Jason Wang
2016-03-08  3:01 ` Jason Wang
2016-04-06  7:25   ` Dmitry Fleytman
2016-04-06  7:27   ` Dmitry Fleytman
2016-03-08  9:51 ` Jason Wang
2016-04-06  7:23   ` Dmitry Fleytman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57060B18.4010808@redhat.com \
    --to=jasowang@redhat.com \
    --cc=dmitry@daynix.com \
    --cc=leonid.bloch@ravellosystems.com \
    --cc=leonid@daynix.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shmulik.ladkani@ravellosystems.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.