From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ross Lagerwall Subject: Re: [PATCH v2 09/13] xsplice: Implement support for applying/reverting/replacing patches. (v2) Date: Mon, 25 Jan 2016 11:43:10 +0000 Message-ID: <56A60A4E.3070107@citrix.com> References: <1452808031-706-1-git-send-email-konrad.wilk@oracle.com> <1452808031-706-10-git-send-email-konrad.wilk@oracle.com> <569E4AAC.9070100@citrix.com> <20160119165543.GD17737@char.us.oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: Received: from mail6.bemta14.messagelabs.com ([193.109.254.103]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1aNfYA-00058u-Ka for xen-devel@lists.xenproject.org; Mon, 25 Jan 2016 11:43:14 +0000 In-Reply-To: <20160119165543.GD17737@char.us.oracle.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Konrad Rzeszutek Wilk Cc: wei.liu2@citrix.com, ian.campbell@citrix.com, andrew.cooper3@citrix.com, ian.jackson@eu.citrix.com, mpohlack@amazon.com, stefano.stabellini@citrix.com, jbeulich@suse.com, sasha.levin@oracle.com, xen-devel@lists.xenproject.org List-Id: xen-devel@lists.xenproject.org On 01/19/2016 04:55 PM, Konrad Rzeszutek Wilk wrote: snip >>> +/* Must be holding the payload_list lock. */ >> >> payload lock? >> >>> +static int schedule_work(struct payload *data, uint32_t cmd) >>> +{ >>> + /* Fail if an operation is already scheduled. */ >>> + if ( xsplice_work.do_work ) >>> + return -EAGAIN; >> >> Hmm, I don't think EAGAIN is correct. It will cause xen-xsplice to poll for >> a status update, but the operation hasn't actually been submitted. > > -EBUSY -EDEADLK ? I would choose -EBUSY. >> >>> + >>> + xsplice_work.cmd = cmd; >>> + xsplice_work.data = data; >>> + atomic_set(&xsplice_work.semaphore, -1); >>> + atomic_set(&xsplice_work.irq_semaphore, -1); >>> + >>> + xsplice_work.ready = 0; >>> + smp_wmb(); >>> + xsplice_work.do_work = 1; >>> + smp_wmb(); >>> + >>> + return 0; >>> +} >>> + >>> +/* >>> + * Note that because of this NOP code the do_nmi is not safely patchable. >>> + * Also if we do receive 'real' NMIs we have lost them. >>> + */ >>> +static int mask_nmi_callback(const struct cpu_user_regs *regs, int cpu) >>> +{ >>> + return 1; >>> +} >>> + >>> +static void reschedule_fn(void *unused) >>> +{ >>> + smp_mb(); /* Synchronize with setting do_work */ >>> + raise_softirq(SCHEDULE_SOFTIRQ); >>> +} >>> + >>> +static int xsplice_do_wait(atomic_t *counter, s_time_t timeout, >>> + unsigned int total_cpus, const char *s) >>> +{ >>> + int rc = 0; >>> + >>> + while ( atomic_read(counter) != total_cpus && NOW() < timeout ) >>> + cpu_relax(); >>> + >>> + /* Log & abort. */ >>> + if ( atomic_read(counter) != total_cpus ) >>> + { >>> + printk(XENLOG_DEBUG "%s: %s %u/%u\n", xsplice_work.data->name, >>> + s, atomic_read(counter), total_cpus); >>> + rc = -EBUSY; >>> + xsplice_work.data->rc = rc; >>> + xsplice_work.do_work = 0; >>> + smp_wmb(); >>> + return rc; >>> + } >>> + return rc; >>> +} >>> + >>> +static void xsplice_do_single(unsigned int total_cpus) >>> +{ >>> + nmi_callback_t saved_nmi_callback; >>> + s_time_t timeout; >>> + struct payload *data, *tmp; >>> + int rc; >>> + >>> + data = xsplice_work.data; >>> + timeout = data->timeout ? data->timeout : MILLISECS(30); >> >> The design doc says that a timeout of 0 means infinity. > > True. Lets update the document. >> >>> + printk(XENLOG_DEBUG "%s: timeout is %"PRI_stime"ms\n", data->name, >>> + timeout / MILLISECS(1)); >>> + >>> + timeout += NOW(); >>> + >>> + if ( xsplice_do_wait(&xsplice_work.semaphore, timeout, total_cpus, >>> + "Timed out on CPU semaphore") ) >>> + return; >>> + >>> + /* "Mask" NMIs. */ >>> + saved_nmi_callback = set_nmi_callback(mask_nmi_callback); >>> + >>> + /* All CPUs are waiting, now signal to disable IRQs. */ >>> + xsplice_work.ready = 1; >>> + smp_wmb(); >>> + >>> + atomic_inc(&xsplice_work.irq_semaphore); >>> + if ( xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus, >>> + "Timed out on IRQ semaphore.") ) >>> + return; >>> + >>> + local_irq_disable(); >> >> As far as I can tell, the mechanics of how this works haven't changed, the >> code has just been reorganized. Which means the points that Martin raised >> about this mechanism are still outstanding. > > A bit. I added the extra timeout on both of the 'spin-around' and also > moved some of the barriers around. Also removed your spin-lock and used > the atomic_t mechanism to synchronize. > > But the one thing that I didn't do was the spin on the 'workers?' that > are just spinnig idly. They will do that forever if say the 'master' > hasn't gone to the IRQ semaphore part. > > My thinking was that the 'workers' could also use the timeout feature > but just multiple it by two? > After looking at this again, I remembered that the algorithm I used is the same as the one used by stop_machine_run(). That function runs without timeouts at all (seemingly without problems), so why shouldn't this one? (The only reason stop_machine_run() itself isn't used for patching is because we need to enter the function without a stack, i.e. not from a tasklet.) -- Ross Lagerwall