Re: [PATCH] prserv: don't wait until exit to sync

From: Richard Purdie <richard.purdie@linuxfoundation.org>
To: Gary Thomas <gary@mlbassoc.com>
Cc: bitbake-devel@lists.openembedded.org
Subject: Re: [PATCH] prserv: don't wait until exit to sync
Date: Tue, 04 Nov 2014 08:42:08 +0000
Message-ID: <1415090528.23396.4.camel@ted>
In-Reply-To: <5457C903.2030703@mlbassoc.com>

On Mon, 2014-11-03 at 11:27 -0700, Gary Thomas wrote:
> On 2014-11-03 10:30, Richard Purdie wrote:
> > On Mon, 2014-11-03 at 09:47 -0600, Ben Shelton wrote:
> >> On 11/02, Burton, Ross wrote:
> >>> On 27 October 2014 17:27, Ben Shelton <ben.shelton@ni.com> wrote:
> >>>
> >>>> In the commit 'prserv: Ensure data is committed', the PR server moved to
> >>>> only committing transactions to the database when the PR server is
> >>>> stopped.  This improves performance, but it means that if the machine
> >>>> running the PR server loses power unexpectedly or if the PR server
> >>>> process gets SIGKILL, the uncommitted package revision data is lost.
> >>>>
> >>>> To fix this issue, sync the database periodically, once per 30 seconds
> >>>> by default, if it has been marked as dirty.  To be safe, continue to
> >>>> sync the database at exit regardless of its status.
> >>>>
> >>>
> >>> This appears to be causing random problems for me where bitbake will
> >>> time out attempting to access the PR database; my hunch is that it's
> >>> blocking on disk I/O.  Are there any tricks we can do with sqlite to reduce
> >>> the overhead of committing (assuming that sqlite isn't causing a full
> >>> filesystem sync)?
> >>>
> >>> Ross
> >>
> >> After running a few large nightly builds, we've seen some issues with
> >> this as well.  It looks like the issue is in the PR server itself, which
> >> logs this error:
> >>
> >> "OperationalError: cannot start a transaction within a transaction"
> >>
> >> However, I'm confused as to why this is happening, since the only place
> >> new transactions are being created is in the sync() function ("BEGIN
> >> EXCLUSIVE TRANSACTION"), and AFAIK that's only called by a single
> >> thread.  Any ideas?
> >
> > Did the commit() fail and therefore there was already a transaction
> > open? It leads to another question of why the commit would fail (timeout
> > maybe?).
> >
> >> Would it make sense to revert the patch until we identify/fix the issue?
> >
> > You have flagged a valid issue that I would like to get to the bottom of
> > so perhaps not quite yet.
> >
> > I'm wondering if we can have some in-memory copy of the table which we
> > flush to disk in a separate thread which wouldn't influence the PR
> > service request responses, but it's a horrible idea to work around what
> > seems like a fundamental problem in sqlite :/.
> 
> I just got this error:
> ERROR: Can NOT get PRAUTO from remote PR service
> ERROR: Function failed: package_get_auto_pr
> ERROR: Logfile of failure stored in: /home/local/rpi-latest_2014-10-30/tmp/work/armv6-vfp-amltd-linux-gnueabi/usbutils/007-r0/temp/log.do_package.13260
> ERROR: Task 3204 (/home/local/poky-latest/meta/recipes-bsp/usbutils/usbutils_007.bb, do_package) failed with exit code '1'
> 
> Is it the same as what's being discussed above?

Yes.

>   Where can I
> look for more info on what happened?

We're still figuring out what is going on but it is roughly that:

a) The build generates a ton of IO
b) That IO builds up into a queue
c) The PR service decides it needs to sync to disk
d) The PR service hits an fsync() of some kind in sqlite whilst writing
e) The PR service is blocked for its clients until the sync() finishes
f) Connections to the PR service time out.

It would be nice if we could write the sqlite data in a separate thread
whilst the readers continue. There is an asynchronous module but it's
deprecated:

http://www.sqlite.org/asyncvfs.html

WAL is recommended instead:

http://www.sqlite.org/wal.html

so we probably need to look at that.
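
As a rough illustration of what WAL offers here, the sketch below uses
Python's sqlite3 module to switch a database to WAL mode and shows that a
second connection can still read while the first holds an uncommitted write
transaction. The table name, columns and file name are made up for the
example and are not the PR server's real schema. Setting synchronous=NORMAL
is an assumption about what might suit this workload: in WAL mode it defers
most syncing to checkpoints, at the cost that the most recent commits may be
lost (though the database is not corrupted) after a power failure.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a connection and switch the database to write-ahead logging."""
    # isolation_level=None: no implicit BEGIN, transactions are explicit.
    conn = sqlite3.connect(path, isolation_level=None)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # In WAL mode, NORMAL avoids a sync on most commits (assumed acceptable here).
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn, mode


def demo():
    # Hypothetical database file and table, for illustration only.
    path = os.path.join(tempfile.mkdtemp(), "prserv-demo.sqlite3")
    writer, mode = open_wal(path)
    writer.execute("CREATE TABLE pr (version TEXT, value INTEGER)")
    writer.execute("INSERT INTO pr VALUES ('usbutils-007', 0)")

    reader = sqlite3.connect(path, isolation_level=None)

    # The writer opens a write transaction and leaves it uncommitted.
    writer.execute("BEGIN IMMEDIATE")
    writer.execute("UPDATE pr SET value = 1 WHERE version = 'usbutils-007'")

    # The reader is not blocked; it sees the last committed snapshot.
    seen_during = reader.execute("SELECT value FROM pr").fetchall()[0][0]

    writer.execute("COMMIT")
    seen_after = reader.execute("SELECT value FROM pr").fetchall()[0][0]

    writer.close()
    reader.close()
    return mode, seen_during, seen_after


if __name__ == "__main__":
    print(demo())
```

Whether this removes the stall in practice would still need measuring on a
loaded build machine, since checkpoints still have to write to the same disk.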

> n.b. I just restarted my build and it seems happy to carry on
> where it left off.

No data is lost and this is a transient issue.

Why not revert the patch? The issue is that without it, the data in the PR
service, whilst in memory, *never* makes it to disk until process exit time.
This is bad if your build server loses power, for example. I would therefore
like to try and fix this rather than revert.
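
For readers who have not seen the patch, the idea of syncing periodically
only when the database is dirty can be sketched roughly as follows. This is
not the prserv code: the class name, method names and the use of a lock and
a timer thread are assumptions made for the example; only the 30 second
default and the final sync at exit come from the patch description.

```python
import sqlite3
import threading


class PeriodicSync:
    """Commit pending writes on a timer, but only if something changed."""

    def __init__(self, path, interval=30.0):
        # check_same_thread=False lets the timer thread use the connection;
        # self.lock serialises every access to it.
        self.conn = sqlite3.connect(path, check_same_thread=False)
        self.lock = threading.Lock()
        self.dirty = False
        self.interval = interval
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def write(self, sql, params=()):
        with self.lock:
            self.conn.execute(sql, params)
            self.dirty = True

    def sync_if_dirty(self):
        with self.lock:
            if not self.dirty:
                return False
            self.conn.commit()
            self.dirty = False
            return True

    def _run(self):
        # Event.wait() returns False on timeout and True once stop is set.
        while not self.stop.wait(self.interval):
            self.sync_if_dirty()

    def close(self):
        self.stop.set()
        if self.thread.is_alive():
            self.thread.join()
        with self.lock:
            self.conn.commit()  # final sync regardless of the dirty flag
            self.conn.close()
```

Note that write() has to wait for the lock while a commit is in progress, so
a slow commit stalls request handling in the same way as steps d) and e)
above; the sketch shows the durability side of the trade-off, not a cure for
the timeouts.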

Cheers,

Richard

Thread overview: 9+ messages
2014-10-27 17:27 [PATCH] prserv: don't wait until exit to sync Ben Shelton
2014-11-02 21:00 ` Burton, Ross
2014-11-03 15:47   ` Ben Shelton
2014-11-03 17:30     ` Richard Purdie
2014-11-03 18:27       ` Gary Thomas
2014-11-04  8:42         ` Richard Purdie [this message]
2014-11-04 10:51           ` Richard Purdie
2014-11-04 13:10             ` Burton, Ross
2014-11-04 14:13     ` Richard Purdie
