[uml-devel] Review needed for ubd fixes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [uml-devel] Review needed for ubd fixes
@ 2005-09-17 22:45 Jeff Dike
  2005-09-20 12:01 ` Blaisorblade
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Dike @ 2005-09-17 22:45 UTC (permalink / raw)
  To: Blaisorblade, Bodo Stroesser; +Cc: user-mode-linux-devel

[-- Attachment #1: Type: text/plain, Size: 5028 bytes --]

I've spent some time working on the ubd and AIO problems that have cropped up 
recently.  Patches are attached - a critical look at them would be appreciated.

I'm going to start with a problem that hasn't exactly cropped up, and move on
to the other ones from there.

That is that the ubd driver must issue I/O requests to the host even if there
is no memory available.  It must make progress, because the request for which
it can't get memory might be the swapping that will free some up.

The driver allocates memory for several purposes -
	a header-type buffer per request which holds the bitmap data
	one aio request per chunk - a request may be broken into chunks because
of COWing - some pieces from the backing file, some from the COW file

To allow the driver to always make progress, even when no memory is available,
I added a static instance of each buffer and made the kmalloc calls atomic.  
When kmalloc fails, then the static buffer is used.  When it is in use, the
user acquires a semaphore, which is released in the interrupt handler when the
buffer is freed.  This implies that future I/O submissions will sleep on the
semaphore, so we have changed the reason for sleeping rather than eliminating 
it.

Any I/O submissions which need memory will sleep on the semaphore until the
buffer is released, resulting in synchronous, one at a time, I/O submission
until the memory shortage has cleared up.

Now that it is established that we must sleep, the next problem is making sure
the requests get submitted to the host in the order that they were queued to
the driver.  Because we must sleep, we must drop the queue spinlock, which opens
the queue to any other processes that want to do I/O.  One of these can queue
to the allocation semaphore (or just kmalloc memory now that some is available)
and sneak ahead of a process handling an earlier request.

To solve this, I add two atomic variables and a wait queue.  One of the 
variables, started, counts the number of requests that have been submitted 
to the driver and the other, submitted, counts the number that have been 
submitted to the host.  started is incremented before any submissions, submitted
is incremented after all submissions for the request.

The driver gets the value of started and waits for submitted to catch up to
it.  When that happens, it is now that requests's turn, and it wakes up from 
the sequence wake queue.

I believe these three fixes - dropping the spinlock around host I/O submission,
adding the static buffers, and sequencing requests - cover the scheduling while
atomic problem and associated problems.

Next is a deadlock that I found during a kernel build loop.  The AIO thread
seems to be able to fill the pipe back to UML with I/O completions.  When this
happens, it sleeps in a write and stops calling io_getevents.  If UML keeps
submitting events, the AIO ring buffer will fill, and io_submit will start
returning -EAGAIN.  At this point, UML will try to write the error to itself
through the same pipe that the AIO thread has filled.  It will now sleep.
Since interrupts are off, that pipe can never be emptied since it is handled
by the driver's interrupt routine, and we are deadlocked.

Enabling interrups during the submission seems to help, but doesn't eliminate
the deadlock.  To fix it for real, I now have the -EAGAIN returned to the 
driver, which sleeps on a wait queue which is woken up by the interrupt 
routine.  Interrupts are enabled, as they must be during sleep, so the handler
can wake the queue after it has processed some completions, and the driver
can retry.

As for the interactions between the wait queues and the semaphores - the 
sequence wait queue controls access to everything else.  Once a request is
past that, it can only deadlock with itself.  The allocation semaphore is used
by a finishing request to wake up the next one.  The eagain wait queue is taken
in a loop with no sleeping on anything else.

The patches are attached.  They are:

ubd-drop-lock - drops the ubd_io_lock around the call to do_io and takes it
before dequeuing the next request.  It also removes the call of do_ubd_request
from the interrupt handler since the request handler now processes the entire 
queue in one call.

ubd-atomic - makes the kmalloc calls atomic and adds the static buffers, 
associated semaphores and related synchronization.  Unfinished - it doesn't
have a static version of the bitmap, since that is variable length, so it's not
so easy to have a static copy of it.

ubd-sequence - makes the requests go out in the right order.  Adds the sequence
wait queue which controls access to the actual submission code.

ubd-eagain - the -EAGAIN handling - waits in the eagain wait queue to be woken
up by the interrupt handler after it has processed some completions.

These are all against 2.6.14-rc1-mm1.  They should apply to 2.6.14-rc1, but I
haven't checked that.

				Jeff

P.S. exmh insisted on a crazy attachment order - the order is the one listed
above, not the order of attachment.

[-- Attachment #2: ubd-eagain --]
[-- Type: text/plain , Size: 2230 bytes --]

# We may submit more requests to the host than it is prepared to
# handle.  In this case, io_submit returns -EAGAIN.  When this
# happens, we return from submit_aio and sleep on the eagain_queue
# wait_queue.  We are woken from the interrupt handler - if some
# requests have come back from the host, presumably there's some room
# on the host to submit more requests.
#
# Interactions between this wait_queue and the sequence_queue and the
# allocation semaphore - the sequence queue and any allocations always
# happen first.  While looping on this wait_queue, we do nothing but
# try to resubmit to the host.
Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-17 17:40:53.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-17 17:41:03.000000000 -0400
@@ -574,6 +574,9 @@
 
 static int ubd_reply_fd = -1;
 
+DECLARE_WAIT_QUEUE_HEAD(eagain_queue);
+static int dequeued;
+
 static irqreturn_t ubd_intr(int irq, void *dev, struct pt_regs *unused)
 {
 	struct aio_thread_reply reply;
@@ -629,6 +632,9 @@
 	}
 	reactivate_fd(fd, UBD_IRQ);
 
+	dequeued = 1;
+	wake_up(&eagain_queue);
+
 	return(IRQ_HANDLED);
 }
 
@@ -1376,7 +1382,19 @@
                 if(aio->bitmap != NULL)
                         atomic_inc(&aio->bitmap->count);
 
-                err = submit_aio(&aio->aio);
+		while(1){
+			unsigned long flags;
+
+			err = submit_aio(&aio->aio);
+			if(err != -EAGAIN)
+				break;
+			dequeued = 0;
+			local_save_flags(flags);
+			local_irq_enable();
+			wait_event(eagain_queue, dequeued);
+			local_irq_restore(flags);
+		}
+
                 if(err){
                         printk("do_io - submit_aio failed, "
                                "err = %d\n", err);
Index: test/arch/um/os-Linux/aio.c
===================================================================
--- test.orig/arch/um/os-Linux/aio.c	2005-09-17 17:39:33.000000000 -0400
+++ test/arch/um/os-Linux/aio.c	2005-09-17 17:41:03.000000000 -0400
@@ -296,6 +296,9 @@
 	int err;
 
 	err = do_aio(ctx, aio);
+	if(err == -EAGAIN)
+		return err;
+
 	if(err){
 		reply = ((struct aio_thread_reply) { .data = aio,
 					             .err  = err });

[-- Attachment #3: ubd-sequence --]
[-- Type: text/plain , Size: 2571 bytes --]

# It is important that I/O requests be submitted to the host in the
# order that they are received by the driver, especially since the
# driver can sleep.  This patch adds two atomic counters, one for
# requests started and one for requests submitted to the host.  A new
# request can proceed when started (after being incremented) is one
# more than submitted.  submitted is incremented after all pieces of
# the request have been sent to the host.
# When a request is proceeding out of order, it will sleep on a wait
# queue until its number comes up.
#
# Interaction between this wait_queue and the emergency allocation
# semaphore - a request will only try to allocate data when it is
# its turn to go.  The wait_queue is finished before the semaphore is
# acquired, so there can't be a deadlock between one process holding
# the allocation semaphore and sleeping because it's out of order and
# one proceeding in order and sleeping on the allocation semaphore.
#
# This currently doesn't correctly handle the bitmap, which must be
# written after the data has been finished.  These must be sequenced
# in the same way as the data.
Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-17 17:40:34.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-17 17:40:53.000000000 -0400
@@ -1297,6 +1297,10 @@
 	return(err);
 }
 
+static atomic_t started = ATOMIC_INIT(0);
+static atomic_t submitted = ATOMIC_INIT(0);
+DECLARE_WAIT_QUEUE_HEAD(sequence_queue);
+
 void do_io(struct io_thread_req *req, struct request *r, unsigned long *bitmap)
 {
         struct ubd_aio *aio;
@@ -1304,14 +1308,18 @@
         char *buf;
         void *bitmap_buf = NULL;
         unsigned long len, sector;
-        int nsectors, start, end, bit, err;
+        int nsectors, start, end, bit, err, want;
         __u64 off;
 
-        if(req->bitmap_start != -1){
-		bitmap_io = alloc_bitmap_io();
+	want = atomic_add_return(1, &started);
+	wait_event(sequence_queue, want - 1 == atomic_read(&submitted));
 
+        if(req->bitmap_start != -1){
                 /* Round up to the nearest word */
                 int round = sizeof(unsigned long);
+
+		bitmap_io = alloc_bitmap_io();
+
                 len = (req->bitmap_end - req->bitmap_start +
                        round * 8 - 1) / (round * 8);
                 len *= round;
@@ -1378,4 +1386,7 @@
 
                 start = end;
         } while(start < nsectors);
+
+	atomic_inc(&submitted);
+	wake_up(&sequence_queue);
 }

[-- Attachment #4: ubd-drop-lock --]
[-- Type: text/plain , Size: 1820 bytes --]

# This patch drops the ubd_io_lock around do_io, where we might sleep,
# and picks it up afterwards.  This looks safe since it protects the
# queue, and we have already picked a request off it and are dealing
# with it without referring to the queue again.  It is picked up
# before dequeueing the next request.
#
# It also removes the chaining from the interrupt routine, where it
# calls do_ubd_request to start the next request.  This is unecessary
# since do_ubd_request now processes the entire queue before
# returning.
#
# This is needed because we currently do non-atomic allocations.
# These will be removed later, but we will also be introducing new
# ways to sleep.
Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-17 17:39:24.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-17 17:40:08.000000000 -0400
@@ -466,7 +466,6 @@
 );
 
 static void do_ubd_request(request_queue_t * q);
-static int in_ubd;
 
 /* Changed by ubd_handler, which is serialized because interrupts only
  * happen on CPU 0.
@@ -569,8 +568,6 @@
 	}
 	reactivate_fd(fd, UBD_IRQ);
 
-        do_ubd_request(ubd_queue);
-
 	return(IRQ_HANDLED);
 }
 
@@ -999,15 +996,13 @@
 	__u64 sector;
 	int err;
 
-	if(in_ubd)
-		return;
-	in_ubd = 1;
 	while((req = elv_next_request(q)) != NULL){
 		struct gendisk *disk = req->rq_disk;
 		struct ubd *dev = disk->private_data;
 		int n, i;
 
 		blkdev_dequeue_request(req);
+		spin_unlock(&ubd_io_lock);
 
 		sector = req->sector;
 		n = blk_rq_map_sg(q, req, dev->sg);
@@ -1024,8 +1019,8 @@
 			sector += sg->length >> 9;
 			do_io(&io_req, req, dev->cow.bitmap);
 		}
+		spin_lock(&ubd_io_lock);
 	}
-	in_ubd = 0;
 }
 
 static int ubd_ioctl(struct inode * inode, struct file * file,

[-- Attachment #5: ubd-atomic --]
[-- Type: text/plain , Size: 5035 bytes --]

# To ensure that I/O can always make progress, even when there is no
# memory, we provide static buffers which are to be used when dynamic
# ones can't be allocated.  These are protected by a semaphore when in
# use.  When one is freed, the semaphore is released, and one of the
# waiters gets to use it.  There is an allocation failure emulation
# mechanism here - setting fail_start and fail_end will cause
# allocations in that range (fail_start <= allocations < fail_end) to
# fail, invoking the emergency mechanism.
# When this is happening, I/O requests proceed one at a time,
# essentially synchronously, until allocations start succeeding again.
#
# This currently doesn't handle the bitmap array, since that can be of
# any length, so we can't have a static version of it at this point.
Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-17 17:40:08.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-17 17:40:34.000000000 -0400
@@ -511,6 +511,67 @@
         void *bitmap_buf;
 };
 
+static int allocations;
+static int fail_start, fail_end;
+
+static struct bitmap_io emergency_bitmap_io;
+DECLARE_MUTEX(bitmap_io_sem);
+
+static struct bitmap_io *alloc_bitmap_io(void)
+{
+	struct bitmap_io *ret;
+
+	allocations++;
+	ret = kmalloc(sizeof(*ret), GFP_KERNEL | GFP_ATOMIC);
+
+	if((allocations >= fail_start) && (allocations < fail_end)){
+		kfree(ret);
+		ret = NULL;
+	}
+
+	if(ret != NULL)
+		return ret;
+
+	down(&bitmap_io_sem);
+	return(&emergency_bitmap_io);
+}
+
+static void free_bitmap_io(struct bitmap_io *io)
+{
+	if(io == &emergency_bitmap_io)
+		up(&bitmap_io_sem);
+	else kfree(io);
+}
+
+static struct ubd_aio emergency_aio;
+DECLARE_MUTEX(aio_sem);
+
+static struct ubd_aio *alloc_ubd_aio(void)
+{
+	struct ubd_aio *ret;
+
+	allocations++;
+	ret = kmalloc(sizeof(*ret), GFP_KERNEL | GFP_ATOMIC);
+
+	if((allocations >= fail_start) && (allocations < fail_end)){
+		kfree(ret);
+		ret = NULL;
+	}
+
+	if(ret != NULL)
+		return ret;
+
+	down(&aio_sem);
+	return(&emergency_aio);
+}
+
+static void free_ubd_aio(struct ubd_aio *aio)
+{
+	if(aio == &emergency_aio)
+		up(&aio_sem);
+	else kfree(aio);
+}
+
 static int ubd_reply_fd = -1;
 
 static irqreturn_t ubd_intr(int irq, void *dev, struct pt_regs *unused)
@@ -541,7 +602,7 @@
 			   (atomic_dec_and_test(&aio->bitmap->count))){
                                 aio->aio = aio->bitmap->aio;
                                 aio->len = 0;
-                                kfree(aio->bitmap);
+				free_bitmap_io(aio->bitmap);
                                 aio->bitmap = NULL;
                                 submit_aio(&aio->aio);
 			}
@@ -554,16 +615,16 @@
 
                                 if(aio->bitmap_buf != NULL)
                                         kfree(aio->bitmap_buf);
-				kfree(aio);
+				free_ubd_aio(aio);
 			}
 		}
                 else if(n < 0){
                         ubd_finish(aio->req, n);
                         if(aio->bitmap != NULL)
-                                kfree(aio->bitmap);
+                                free_bitmap_io(aio->bitmap);
                         if(aio->bitmap_buf != NULL)
                                 kfree(aio->bitmap_buf);
-                        kfree(aio);
+                        free_ubd_aio(aio);
                 }
 	}
 	reactivate_fd(fd, UBD_IRQ);
@@ -1247,6 +1308,8 @@
         __u64 off;
 
         if(req->bitmap_start != -1){
+		bitmap_io = alloc_bitmap_io();
+
                 /* Round up to the nearest word */
                 int round = sizeof(unsigned long);
                 len = (req->bitmap_end - req->bitmap_start +
@@ -1256,18 +1319,11 @@
                 off = req->bitmap_start / (8 * round);
                 off *= round;
 
-                bitmap_io = kmalloc(sizeof(*bitmap_io), GFP_KERNEL);
-                if(bitmap_io == NULL){
-                        printk("Failed to kmalloc bitmap IO\n");
-                        req->error = 1;
-                        return;
-                }
-
                 bitmap_buf = kmalloc(len, GFP_KERNEL);
                 if(bitmap_buf == NULL){
                         printk("do_io : kmalloc of bitmap chunk "
                                "failed\n");
-                        kfree(bitmap_io);
+                        free_bitmap_io(bitmap_io);
                         req->error = 1;
                         return;
                 }
@@ -1300,12 +1356,7 @@
                 len = (end - start) * req->sectorsize;
                 buf = &req->buffer[start * req->sectorsize];
 
-                aio = kmalloc(sizeof(*aio), GFP_KERNEL);
-                if(aio == NULL){
-                        req->error = 1;
-                        return;
-                }
-
+		aio = alloc_ubd_aio();
                 *aio = ((struct ubd_aio)
                         { .aio		= INIT_AIO(req->op, req->fds[bit], buf,
                                                    len, off, ubd_reply_fd),

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-17 22:45 [uml-devel] Review needed for ubd fixes Jeff Dike
@ 2005-09-20 12:01 ` Blaisorblade
  2005-09-20 19:06   ` Jeff Dike
  0 siblings, 1 reply; 12+ messages in thread
From: Blaisorblade @ 2005-09-20 12:01 UTC (permalink / raw)
  To: user-mode-linux-devel; +Cc: Jeff Dike, Bodo Stroesser

On Sunday 18 September 2005 00:45, Jeff Dike wrote:
> I've spent some time working on the ubd and AIO problems that have cropped
> up recently.  Patches are attached - a critical look at them would be
> appreciated.
>
> I'm going to start with a problem that hasn't exactly cropped up, and move
> on to the other ones from there.

> That is that the ubd driver must issue I/O requests to the host even if
> there is no memory available.  It must make progress, because the request
> for which it can't get memory might be the swapping that will free some up.

> The driver allocates memory for several purposes -
> 	a header-type buffer per request which holds the bitmap data
> 	one aio request per chunk - a request may be broken into chunks because
> of COWing - some pieces from the backing file, some from the COW file

> To allow the driver to always make progress, even when no memory is
> available, I added a static instance of each buffer and made the kmalloc
> calls atomic. When kmalloc fails, then the static buffer is used.
Hmm, this kind of thing is exactly the one for which mempool's were created - 
have you looked at whether using them (which can be used for atomic purposes) 
would be better?
> When it 
> is in use, the user acquires a semaphore, which is released in the
> interrupt handler when the buffer is freed.  This implies that future I/O
> submissions will sleep on the semaphore, so we have changed the reason for
> sleeping rather than eliminating it.
I've not looked at the code, but have you tested that with 
sleep-inside-spinlock checking enabled?

However, ok, you do release spinlocks so you should be safe. However, in your 
custom allocation routines, you're going to sleep possibly, so why do you use 
GFP_ATOMIC? There's absolutely no need. If there's a need, you can't take the 
semaphore afterwards.

Also, GFP_KERNEL|GFP_ATOMIC is bogus - GFP_ATOMIC must be used alone, when 
needed.

Grep kernel/*.c to make sure, or look at the definition.
> Any I/O submissions which need memory will sleep on the semaphore until the
> buffer is released, resulting in synchronous, one at a time, I/O submission
> until the memory shortage has cleared up.
About AIO, I've read on http://lse.sourceforge.net/io/aio.html that, indeed, 
the host AIO code isn't really Asynchronous for buffered I/O, but only for 
O_DIRECT I/O (which we don't seem to use).

We're safe, sure, and the new code is anyhow better in using the new 
blockdevice model.
> Now that it is established that we must sleep, the next problem is making
> sure the requests get submitted to the host in the order that they were
> queued to the driver.  Because we must sleep, we must drop the queue
> spinlock, which opens the queue to any other processes that want to do I/O.
>  One of these can queue to the allocation semaphore (or just kmalloc memory
> now that some is available) and sneak ahead of a process handling an
> earlier request.

> To solve this, I add two atomic variables and a wait queue.  One of the
> variables, started, counts the number of requests that have been submitted
> to the driver and the other, submitted, counts the number that have been
> submitted to the host.  started is incremented before any submissions,
> submitted is incremented after all submissions for the request.

> The driver gets the value of started and waits for submitted to catch up to
> it.  When that happens, it is now that requests's turn, and it wakes up
> from the sequence wake queue.
These two atomics + one wait queue are very similar to a semaphore, even if 
not identical. The semaphore value would be submitted - started. The change 
is that the driver sleep at the first increment of "started" rather than at 
the last one, but it should be ok. And much less error-prone. If you keep 
your custom design, you should at least unify the two vars with the 
difference.

I'm not confident about the races between the two different variables, even if 
it may be perfectly safe (Tanenbaum uses something very similar, with even 
two vars, for the producer-consumer example). But it must be examined.

And yes, on a semaphore you can call up() as many times as you want (which is 
non-obvious maybe - I didn't realize well until one week ago). In fact, 
provisions are made to initialize a semaphore with initial count bigger than 
1.

However, I would like to note that you're not always forced to sequence 
requests - write barriers were recently implemented, so the filesystem 
explicitly serializes requests when needed, rather than ask for strictly 
sequential processing. It's not especially hard to do, when you use your 
custom semaphore code - you can say "down() but don't sleep".
> I believe these three fixes - dropping the spinlock around host I/O
> submission, adding the static buffers, and sequencing requests - cover the
> scheduling while atomic problem and associated problems.

> Next is a deadlock that I found during a kernel build loop.  The AIO thread
> seems to be able to fill the pipe back to UML with I/O completions.  When
> this happens, it sleeps in a write and stops calling io_getevents.  If UML
> keeps submitting events, the AIO ring buffer will fill, and io_submit will
> start returning -EAGAIN.  At this point, UML will try to write the error to
> itself through the same pipe that the AIO thread has filled.  It will now
> sleep. Since interrupts are off, that pipe can never be emptied since it is
> handled by the driver's interrupt routine, and we are deadlocked.

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade


	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it



-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-20 12:01 ` Blaisorblade
@ 2005-09-20 19:06   ` Jeff Dike
  2005-09-21 15:49     ` Blaisorblade
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Dike @ 2005-09-20 19:06 UTC (permalink / raw)
  To: Blaisorblade; +Cc: user-mode-linux-devel, Bodo Stroesser

On Tue, Sep 20, 2005 at 02:01:11PM +0200, Blaisorblade wrote:
> Hmm, this kind of thing is exactly the one for which mempool's were created -
> have you looked at whether using them (which can be used for atomic purposes)
> would be better?

Yeah, that would be worth looking into.

> I've not looked at the code, but have you tested that with 
> sleep-inside-spinlock checking enabled?
> 
> However, ok, you do release spinlocks so you should be safe. However, in your
> custom allocation routines, you're going to sleep possibly, so why do you use
> GFP_ATOMIC? There's absolutely no need. If there's a need, you can't take the
> semaphore afterwards.

?

GFP_ATOMIC doesn't always mean that you're in an interrupt.
Generally, it means not to sleep in kmalloc.  And here, if it fails,
I'll use the static buffers.

> Also, GFP_KERNEL|GFP_ATOMIC is bogus - GFP_ATOMIC must be used alone, when 
> needed.

OK.

> About AIO, I've read on http://lse.sourceforge.net/io/aio.html that, indeed, 
> the host AIO code isn't really Asynchronous for buffered I/O, but only for 
> O_DIRECT I/O (which we don't seem to use).

There's another patch, called o_direct, which I didn't send out, which
fixes this.

> These two atomics + one wait queue are very similar to a semaphore, even if 
> not identical. The semaphore value would be submitted - started. The change 
> is that the driver sleep at the first increment of "started" rather than at 
> the last one, but it should be ok. And much less error-prone. If you keep 
> your custom design, you should at least unify the two vars with the 
> difference.

The wait queue allows the correct thread to be woken up.  If I used a
semaphore, its value would be the same for all threads, and they would
all be woken up when that value goes to 0.  With a wait queue, each
thread has a different value of started, and they wait for submitted
to catch up to it.  Meanwhile, any other sleeping threads stay
sleeping because submitted hasn't caught up to their values of started.

> However, I would like to note that you're not always forced to sequence 
> requests - write barriers were recently implemented, so the filesystem 
> explicitly serializes requests when needed, rather than ask for strictly 
> sequential processing. It's not especially hard to do, when you use your 
> custom semaphore code - you can say "down() but don't sleep".

I'm concerned about the COW bitmap.  That's something that the upper
layers know nothing about, so requests could overlap without there
being barriers between that.

However, this is an artifact of the implementation, where the section
of bitmap that will be written out is copied into the aio request when
it is started.  If I grabbed the bitmap section and set the bits in it
just before it is written out, then the sequencing stuff might be able
to just go away.  We rely on the block layer to put write barriers
between requests with overlapping data.

				Jeff

-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-20 19:06   ` Jeff Dike
@ 2005-09-21 15:49     ` Blaisorblade
  2005-09-21 18:04       ` Jeff Dike
  0 siblings, 1 reply; 12+ messages in thread
From: Blaisorblade @ 2005-09-21 15:49 UTC (permalink / raw)
  To: Jeff Dike; +Cc: user-mode-linux-devel, Bodo Stroesser

On Tuesday 20 September 2005 21:06, Jeff Dike wrote:
> On Tue, Sep 20, 2005 at 02:01:11PM +0200, Blaisorblade wrote:
> > Hmm, this kind of thing is exactly the one for which mempool's were
> > created - have you looked at whether using them (which can be used for
> > atomic purposes) would be better?

> Yeah, that would be worth looking into.

> > I've not looked at the code, but have you tested that with
> > sleep-inside-spinlock checking enabled?

> > However, ok, you do release spinlocks so you should be safe. However, in
> > your custom allocation routines, you're going to sleep possibly, so why
> > do you use GFP_ATOMIC? There's absolutely no need. If there's a need, you
> > can't take the semaphore afterwards.

> ?

> GFP_ATOMIC doesn't always mean that you're in an interrupt.
> Generally, it means not to sleep in kmalloc.
Yes, but GFP_ATOMIC was needed in first place because we where in an atomic 
section (not for interrupts, but because we were under a spinlock).

> And here, if it fails, 
> I'll use the static buffers.
Since you're going to possibly sleep, it's better to allow kmalloc to sleep in 
first place!

> > These two atomics + one wait queue are very similar to a semaphore, even
> > if not identical. The semaphore value would be submitted - started. The
> > change is that the driver sleep at the first increment of "started"
> > rather than at the last one, but it should be ok. And much less
> > error-prone. If you keep your custom design, you should at least unify
> > the two vars with the difference.

> The wait queue allows the correct thread to be woken up.  If I used a
> semaphore, its value would be the same for all threads, and they would
> all be woken up when that value goes to 0.  With a wait queue, each
> thread has a different value of started, and they wait for submitted
> to catch up to it.
Ok, filtered wait queues. And I also see why you used two atomics rather than 
one. I'll have to think deep about this, though.
> Meanwhile, any other sleeping threads stay 
> sleeping because submitted hasn't caught up to their values of started.
[...]
> I'm concerned about the COW bitmap.  That's something that the upper
> layers know nothing about, so requests could overlap without there
> being barriers between that.

> However, this is an artifact of the implementation, where the section
> of bitmap that will be written out is copied into the aio request when
> it is started.  If I grabbed the bitmap section and set the bits in it
> just before it is written out, then the sequencing stuff might be able
> to just go away.
Is the COW bitmap mmap'ed? Because otherwise we'd need to queue up an 
additional write request, which creates problems.

Btw, msync() allows sync'ing only a specific region, which fsync() doesn't 
allow. Don't know whether we need this really, however (we probably don't do 
it, but we should).

Also, I kept forgetting to mention one thing: Device mapper has the support 
for COW volumes, like we do. E.g. when you create a snapshot, it is not a 
static immutable copy of the original, it's writable too!

Sure, less granular than COW devices, it doesn't copy 512 bytes at a time 
(which may help with performance though), but works on the host too!

I was waiting to let lists know of this after writing a conversion program, 
which I never finished.
> We rely on the block layer to put write barriers 
> between requests with overlapping data.
Don't think it's reasonable to expect this. I think that filesystems like ext2 
will never produce write barriers - they are needed for journaled FS's.
> 				Jeff

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it



-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-21 15:49     ` Blaisorblade
@ 2005-09-21 18:04       ` Jeff Dike
  2005-09-21 19:06         ` Blaisorblade
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Dike @ 2005-09-21 18:04 UTC (permalink / raw)
  To: Blaisorblade; +Cc: user-mode-linux-devel, Bodo Stroesser

On Wed, Sep 21, 2005 at 05:49:43PM +0200, Blaisorblade wrote:
> Since you're going to possibly sleep, it's better to allow kmalloc to sleep in 
> first place!

WRONG!  The first request in line can't sleep, that's the whole point.

If it does, the system can deadlock.  Subsequent requests can sleep, waiting
for previous ones to finish, but there absolutely can't be any sleeping
in kmalloc.

> Is the COW bitmap mmap'ed? Because otherwise we'd need to queue up an 
> additional write request, which creates problems.

No.  That creates ordering problems which I haven't thought about yet,  It
may be worth doing, but explicit writes make ordering easier to think about.
> 
> Btw, msync() allows sync'ing only a specific region, which fsync() doesn't 
> allow. Don't know whether we need this really, however (we probably don't do 
> it, but we should).

Or, the host may just flush out the mmapped stuff on its own.  This is my 
concern about ordering.

> Also, I kept forgetting to mention one thing: Device mapper has the support 
> for COW volumes, like we do. E.g. when you create a snapshot, it is not a 
> static immutable copy of the original, it's writable too!

Yeah, that's something to think about.

> Don't think it's reasonable to expect this. I think that filesystems like ext2 
> will never produce write barriers - they are needed for journaled FS's.

I thought your point was to rely on the block layer to order overlapping
writes for us.

It sounded reasonable to me :-)

				Jeff


-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-21 18:04       ` Jeff Dike
@ 2005-09-21 19:06         ` Blaisorblade
  2005-09-21 20:45           ` Jeff Dike
  0 siblings, 1 reply; 12+ messages in thread
From: Blaisorblade @ 2005-09-21 19:06 UTC (permalink / raw)
  To: Jeff Dike; +Cc: user-mode-linux-devel, Bodo Stroesser

On Wednesday 21 September 2005 20:04, Jeff Dike wrote:
> On Wed, Sep 21, 2005 at 05:49:43PM +0200, Blaisorblade wrote:
[...]
> > Don't think it's reasonable to expect this. I think that filesystems like
> > ext2 will never produce write barriers - they are needed for journaled
> > FS's.

> I thought your point was to rely on the block layer to order overlapping
> writes for us.
The block layer is supposed to merge as far as possible overlapping writes, 
that's reasonable, but not dependable. Say it gets the overlapping request 
*after* it sent the first one - we'll see both. Also, this is trivially true 
for output done through page cache, but for the rest I don't know if explicit 
merging is implemented.

For the buffer cache (i.e. fs metadata) it may still hold, but I'm not sure at 
all. Not that it matters for us - we can't depend on it anyway.

No, I was thinking to you enforcing ordering to comply to what journaled 
filesystems expect.

Btw: even when we aren't using COW, we're supposed to do writes in the order 
the fs passed them to us - at least when write barriers are explicitly sent 
(don't know if there's something to care normally - I know this as a LWN.net 
reader, not as an hacker in this area).

> It sounded reasonable to me :-)

> 				Jeff

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it



-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-21 19:06         ` Blaisorblade
@ 2005-09-21 20:45           ` Jeff Dike
  2005-09-22 20:51             ` Blaisorblade
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Dike @ 2005-09-21 20:45 UTC (permalink / raw)
  To: Blaisorblade; +Cc: user-mode-linux-devel, Bodo Stroesser

On Wed, Sep 21, 2005 at 09:06:38PM +0200, Blaisorblade wrote:
> The block layer is supposed to merge as far as possible overlapping writes, 
> that's reasonable, but not dependable. Say it gets the overlapping request 
> *after* it sent the first one - we'll see both. Also, this is trivially true 
> for output done through page cache, but for the rest I don't know if explicit 
> merging is implemented.

Yeah, I'm concerned about O_DIRECT.  Also, what happens when we write a page
to disk, and immediately afterwards, it gets dirty again and written again?
Is there a barrier between the two?

> No, I was thinking to you enforcing ordering to comply to what journaled 
> filesystems expect.

Yeah, but there's more ordering than that.

> Btw: even when we aren't using COW, we're supposed to do writes in the order 
> the fs passed them to us - at least when write barriers are explicitly sent 
> (don't know if there's something to care normally - I know this as a LWN.net 
> reader, not as an hacker in this area).

We get requests from the block layer, not the fs.  And physical disks are 
allowed to reorder requests, so I think there is some flexibility there.

				Jeff


-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-21 20:45           ` Jeff Dike
@ 2005-09-22 20:51             ` Blaisorblade
  2005-09-27 18:12               ` Jeff Dike
  0 siblings, 1 reply; 12+ messages in thread
From: Blaisorblade @ 2005-09-22 20:51 UTC (permalink / raw)
  To: Jeff Dike; +Cc: user-mode-linux-devel, Bodo Stroesser

On Wednesday 21 September 2005 22:45, Jeff Dike wrote:
> On Wed, Sep 21, 2005 at 09:06:38PM +0200, Blaisorblade wrote:
> > The block layer is supposed to merge as far as possible overlapping
> > writes, that's reasonable, but not dependable. Say it gets the
> > overlapping request *after* it sent the first one - we'll see both. Also,
> > this is trivially true for output done through page cache, but for the
> > rest I don't know if explicit merging is implemented.
>
> Yeah, I'm concerned about O_DIRECT.  Also, what happens when we write a
> page to disk, and immediately afterwards, it gets dirty again and written
> again? Is there a barrier between the two?
>
> > No, I was thinking to you enforcing ordering to comply to what journaled
> > filesystems expect.
>
> Yeah, but there's more ordering than that.
>
> > Btw: even when we aren't using COW, we're supposed to do writes in the
> > order the fs passed them to us - at least when write barriers are
> > explicitly sent (don't know if there's something to care normally - I
> > know this as a LWN.net reader, not as an hacker in this area).
>
> We get requests from the block layer, not the fs.  And physical disks are
> allowed to reorder requests, so I think there is some flexibility there.
I expect the FS to send the barriers - they don't make sense otherwise.

In fact, grep -i barrier fs/jbd/*.c gives some results, which point to 
BH_Ordered.

And after grepping it in reiserfs, I found that fs/buffer.c:submit_bh() cares 
about it (i.e. looks at the buffer head and sees if it is ordered), and sends 
it down to the elevator.
-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it



-------------------------------------------------------
SF.Net email is sponsored by:
Tame your development challenges with Apache's Geronimo App Server. Download
it for free - -and be entered to win a 42" plasma tv or your very own
Sony(tm)PSP.  Click here to play: http://sourceforge.net/geronimo.php
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-22 20:51             ` Blaisorblade
@ 2005-09-27 18:12               ` Jeff Dike
  2005-09-28 12:14                 ` Blaisorblade
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Dike @ 2005-09-27 18:12 UTC (permalink / raw)
  To: Blaisorblade; +Cc: user-mode-linux-devel, Bodo Stroesser

[-- Attachment #1: Type: text/plain, Size: 2074 bytes --]

Attached is my current set of I/O patches.  It's significantly different,
and simpler, than my previous set.

Now, there is no sleeping while running the queue, so the spinlock is never
dropped, and all of the extra synchronization is gone.

do_ubd_handler runs the queue until it's empty, or the host refuses to take
any more AIO requests.  If the queue not empty, then the current state
of the in-process request (the request itself and which sg entries have
not yet been sent to the host) is saved in the device structure.  When 
the interrupt handler empties out the host some, it will take the queue
lock and call the request handler to push some more requests to the host.

I also made the queues and locks per-device rather than having one for
all devices.  This means that when a device gets -EAGAIN from the host,
the interrupt handler needs to know which queues got stalled.  This is 
handled by having the request handler stick the device on a list when
this happems, and the interrupt handler walks that list when rerunning
request handlers.

The patch name is the same as before, and now quite misleading because
the queue lock is now not dropped.

ubd-atomic is much the same as before, except simple flags is used to
indicate whether the static buffers are available.  This is OK since 
any reading or writing of the flags happens under the queue spinlock.

init_aio_err is a simple error path cleanup patch.

aio-batching causes the AIO thread not to process completions until the
current batch of I/O is submitted.  I added this because I was seeing
a context switch between UML and the AIO thread on every AIO submission.
This keeps the AIO thread asleep until the current I/O is fully submitted 
and it may be able to process a bunch of completions at once.

o_direct adds O_DIRECT support to UML and makes the ubd driver use it.

aio-errors makes submit_aio_24 return -errno instead of -1.

I've given this a bunch of testing and it has survived overnight kernel
build loops on both x86 and x86_64.

The attached patches are against 2.6.14-rc2.

			Jeff

[-- Attachment #2: ubd-drop-lock --]
[-- Type: text/plain, Size: 10804 bytes --]

# This patch changes the ubd I/O submission process to avoid some sleeping.
# When the host returns -EAGAIN from io_submit, do_ubd_request returns to
# its caller, saving the current state of the request submission in the 
# struct ubd.  This state consists of the request structure and the range
# of sg entries which have not yet been submitted.  If the request queue is
# drained, then this state is reset to indicate that, the next time it is 
# called, a new request needs to be pulled from the request queue.
# When do_ubd_request returns because the host can handle no more requests,
# it is necessary to rerun the queue after some completions have been handled.
# This is done by adding the device to the restart list.  ubd_intr walks
# this list before returning, calling do_ubd_request for each device.
# In addition, the queues and queue locks are now per-device, rather than
# having a single queue and lock for all devices.
# Note that kmalloc is still called, and can sleep.  This is fixed in a 
# future patch.
Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-27 11:33:43.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-27 12:02:00.000000000 -0400
@@ -82,7 +82,7 @@
 			   unsigned long *bitmap_len_out,
 			   int *data_offset_out);
 extern int read_cow_bitmap(int fd, void *buf, int offset, int len);
-extern void do_io(struct io_thread_req *req, struct request *r,
+static int do_io(struct io_thread_req *req, struct request *r,
 		  unsigned long *bitmap);
 
 static inline int ubd_test_bit(__u64 bit, void *data)
@@ -112,7 +112,6 @@
 
 #define DRIVER_NAME "uml-blkdev"
 
-static DEFINE_SPINLOCK(ubd_io_lock);
 static DEFINE_SPINLOCK(ubd_lock);
 
 static int ubd_open(struct inode * inode, struct file * filp);
@@ -129,9 +128,6 @@
         .ioctl		= ubd_ioctl,
 };
 
-/* Protected by the queue_lock */
-static request_queue_t *ubd_queue;
-
 /* Protected by ubd_lock */
 static int fake_major = MAJOR_NR;
 
@@ -164,6 +160,7 @@
 #define MAX_SG 64
 
 struct ubd {
+	struct list_head restart;
 	char *file;
 	int count;
 	int fd;
@@ -174,6 +171,10 @@
 	struct cow cow;
 	struct platform_device pdev;
         struct scatterlist sg[MAX_SG];
+	struct request_queue *queue;
+	spinlock_t lock;
+	struct request *request;
+	int start_sg, end_sg;
 };
 
 #define DEFAULT_COW { \
@@ -193,6 +194,10 @@
 	.openflags =		OPEN_FLAGS, \
         .no_cow =               0, \
         .cow =			DEFAULT_COW, \
+	.lock =			SPIN_LOCK_UNLOCKED, \
+	.request =		NULL, \
+	.start_sg =		0, \
+	.end_sg =		0, \
 }
 
 struct ubd ubd_dev[MAX_DEV] = { [ 0 ... MAX_DEV - 1 ] = DEFAULT_UBD };
@@ -466,7 +471,6 @@
 );
 
 static void do_ubd_request(request_queue_t * q);
-static int in_ubd;
 
 /* Changed by ubd_handler, which is serialized because interrupts only
  * happen on CPU 0.
@@ -494,9 +498,11 @@
 
 static inline void ubd_finish(struct request *req, int bytes)
 {
-   	spin_lock(&ubd_io_lock);
+	struct ubd *dev = req->rq_disk->private_data;
+
+   	spin_lock(&dev->lock);
 	__ubd_finish(req, bytes);
-  	spin_unlock(&ubd_io_lock);
+  	spin_unlock(&dev->lock);
 }
 
 struct bitmap_io {
@@ -513,12 +519,16 @@
 };
 
 static int ubd_reply_fd = -1;
+static struct list_head restart = LIST_HEAD_INIT(restart);
 
 static irqreturn_t ubd_intr(int irq, void *dev, struct pt_regs *unused)
 {
 	struct aio_thread_reply reply;
 	struct ubd_aio *aio;
 	struct request *req;
+	struct ubd *ubd;
+	struct list_head *list, *next;
+	unsigned long flags;
 	int err, n, fd = (int) (long) dev;
 
 	while(1){
@@ -532,10 +542,10 @@
 		}
 
                 aio = container_of(reply.data, struct ubd_aio, aio);
+		req = aio->req;
                 n = reply.err;
 
 		if(n == 0){
-			req = aio->req;
 			req->nr_sectors -= aio->len >> 9;
 
 			if((aio->bitmap != NULL) &&
@@ -559,7 +569,7 @@
 			}
 		}
                 else if(n < 0){
-                        ubd_finish(aio->req, n);
+                        ubd_finish(req, n);
                         if(aio->bitmap != NULL)
                                 kfree(aio->bitmap);
                         if(aio->bitmap_buf != NULL)
@@ -567,10 +577,15 @@
                         kfree(aio);
                 }
 	}
-	reactivate_fd(fd, UBD_IRQ);
-
-        do_ubd_request(ubd_queue);
+	list_for_each_safe(list, next, &restart){
+		ubd = container_of(list, struct ubd, restart);
+		list_del_init(&ubd->restart);
+		spin_lock_irqsave(&ubd->lock, flags);
+		do_ubd_request(ubd->queue);
+		spin_unlock_irqrestore(&ubd->lock, flags);
+	}
 
+	reactivate_fd(fd, UBD_IRQ);
 	return(IRQ_HANDLED);
 }
 
@@ -693,7 +708,7 @@
 	}
 
 	disk->private_data = &ubd_dev[unit];
-	disk->queue = ubd_queue;
+	disk->queue = ubd_dev[unit].queue;
 	add_disk(disk);
 
 	*disk_out = disk;
@@ -719,10 +734,19 @@
 		goto out_close;
 
 	dev->size = ROUND_BLOCK(dev->size);
+	INIT_LIST_HEAD(&dev->restart);
+
+	err = -ENOMEM;
+	dev->queue = blk_init_queue(do_ubd_request, &dev->lock);
+	if (!dev->queue)
+		goto out_close;
 
+	blk_queue_max_hw_segments(dev->queue, MAX_SG);
+	dev->queue->queuedata = dev;
+		
 	err = ubd_new_disk(MAJOR_NR, dev->size, n, &ubd_gendisk[n]);
 	if(err) 
-		goto out_close;
+		goto out_cleanup;
  
 	if(fake_major != MAJOR_NR)
 		ubd_new_disk(fake_major, dev->size, n, 
@@ -738,6 +762,10 @@
 	ubd_close(dev);
 out:
 	return err;
+
+out_cleanup:
+	blk_cleanup_queue(dev->queue);
+	goto out_close;
 }
 
 static int ubd_config(char *str)
@@ -878,13 +906,6 @@
 	if (register_blkdev(MAJOR_NR, "ubd"))
 		return -1;
 
-	ubd_queue = blk_init_queue(do_ubd_request, &ubd_io_lock);
-	if (!ubd_queue) {
-		unregister_blkdev(MAJOR_NR, "ubd");
-		return -1;
-	}
-		
-	blk_queue_max_hw_segments(ubd_queue, MAX_SG);
 	if (fake_major != MAJOR_NR) {
 		char name[sizeof("ubd_nnn\0")];
 
@@ -956,22 +977,13 @@
         }
 }
 
-/* Called with ubd_io_lock held */
-static int prepare_request(struct request *req, struct io_thread_req *io_req,
-                           unsigned long long offset, int page_offset,
-                           int len, struct page *page)
+static void prepare_request(struct request *req, struct io_thread_req *io_req,
+			    unsigned long long offset, int page_offset,
+			    int len, struct page *page)
 {
 	struct gendisk *disk = req->rq_disk;
 	struct ubd *dev = disk->private_data;
 
-	/* This should be impossible now */
-	if((rq_data_dir(req) == WRITE) && !dev->openflags.w){
-		printk("Write attempted on readonly ubd device %s\n", 
-		       disk->disk_name);
-                ubd_end_request(req, 0, 0);
-		return(1);
-	}
-
 	io_req->fds[0] = (dev->cow.file != NULL) ? dev->cow.fd : dev->fd;
 	io_req->fds[1] = dev->fd;
 	io_req->offset = offset;
@@ -988,44 +1000,51 @@
 
         if((dev->cow.file != NULL) && (io_req->op == UBD_WRITE))
                 cowify_bitmap(io_req, dev->cow.bitmap);
-	return(0);
 }
 
-/* Called with ubd_io_lock held */
+/* Called with dev->lock held */
 static void do_ubd_request(request_queue_t *q)
 {
 	struct io_thread_req io_req;
 	struct request *req;
-	__u64 sector;
-	int err;
 
-	if(in_ubd)
-		return;
-	in_ubd = 1;
-	while((req = elv_next_request(q)) != NULL){
-		struct gendisk *disk = req->rq_disk;
-		struct ubd *dev = disk->private_data;
-		int n, i;
+	while(1){
+		struct ubd *dev = q->queuedata;
 
-		blkdev_dequeue_request(req);
+		if(dev->end_sg == 0){
+			struct request *req = elv_next_request(q);
+			if(req == NULL)
+				return;
+
+			dev->request = req;
+			blkdev_dequeue_request(req);
+			dev->start_sg = 0;
+			dev->end_sg = blk_rq_map_sg(q, req, dev->sg);
+		}
 
-		sector = req->sector;
-		n = blk_rq_map_sg(q, req, dev->sg);
+		req = dev->request;
 
-		for(i = 0; i < n; i++){
-			struct scatterlist *sg = &dev->sg[i];
+		while(dev->start_sg < dev->end_sg){
+			struct scatterlist *sg = &dev->sg[dev->start_sg];
 
-			err = prepare_request(req, &io_req, sector << 9,
+			err = prepare_request(req, &io_req, req->sector << 9,
 					      sg->offset, sg->length,
 					      sg->page);
 			if(err)
 				continue;
 
-			sector += sg->length >> 9;
-			do_io(&io_req, req, dev->cow.bitmap);
+			if(do_io(&io_req, req, dev->cow.bitmap) == -EAGAIN){
+				if(list_empty(&dev->restart))
+					list_add(&dev->restart, &restart);
+				return;
+			}
+
+			req->sector += sg->length >> 9;
+			dev->start_sg++;
 		}
+		dev->end_sg = 0;
+		dev->request = NULL;
 	}
-	in_ubd = 0;
 }
 
 static int ubd_ioctl(struct inode * inode, struct file * file,
@@ -1241,7 +1260,8 @@
 	return(err);
 }
 
-void do_io(struct io_thread_req *req, struct request *r, unsigned long *bitmap)
+static int do_io(struct io_thread_req *req, struct request *r, 
+		 unsigned long *bitmap)
 {
         struct ubd_aio *aio;
         struct bitmap_io *bitmap_io = NULL;
@@ -1265,7 +1285,7 @@
                 if(bitmap_io == NULL){
                         printk("Failed to kmalloc bitmap IO\n");
                         req->error = 1;
-                        return;
+                        return -ENOMEM;
                 }
 
                 bitmap_buf = kmalloc(len, GFP_KERNEL);
@@ -1274,7 +1294,7 @@
                                "failed\n");
                         kfree(bitmap_io);
                         req->error = 1;
-                        return;
+                        return -ENOMEM;
                 }
                 memcpy(bitmap_buf, &bitmap[off / sizeof(bitmap[0])], len);
 
@@ -1308,7 +1328,7 @@
                 aio = kmalloc(sizeof(*aio), GFP_KERNEL);
                 if(aio == NULL){
                         req->error = 1;
-                        return;
+                        return -ENOMEM;
                 }
 
                 *aio = ((struct ubd_aio)
@@ -1322,14 +1342,18 @@
                 if(aio->bitmap != NULL)
                         atomic_inc(&aio->bitmap->count);
 
-                err = submit_aio(&aio->aio);
+		err = submit_aio(&aio->aio);
                 if(err){
-                        printk("do_io - submit_aio failed, "
-                               "err = %d\n", err);
-                        req->error = 1;
-                        return;
+			if(err != -EAGAIN){
+				printk("do_io - submit_aio failed, "
+				       "err = %d\n", err);
+				req->error = 1;
+			}
+                        return err;
                 }
 
                 start = end;
         } while(start < nsectors);
+
+	return 0;
 }
Index: test/arch/um/os-Linux/aio.c
===================================================================
--- test.orig/arch/um/os-Linux/aio.c	2005-09-27 11:33:43.000000000 -0400
+++ test/arch/um/os-Linux/aio.c	2005-09-27 12:02:00.000000000 -0400
@@ -296,6 +296,9 @@
 	int err;
 
 	err = do_aio(ctx, aio);
+	if(err == -EAGAIN)
+		return err;
+
 	if(err){
 		reply = ((struct aio_thread_reply) { .data = aio,
 					             .err  = err });

[-- Attachment #3: ubd-atomic --]
[-- Type: text/plain, Size: 5219 bytes --]

# To ensure that I/O can always make progress, even when there is no
# memory, we provide static buffers which are to be used when dynamic
# ones can't be allocated.  These buffers are protected by flags which
# are set when they are currently in use.  The use of these flags is 
# protected by the queue lock, which is held for the duration of the 
# do_ubd_request call.
#
# There is an allocation failure emulation
# mechanism here - setting fail_start and fail_end will cause
# allocations in that range (fail_start <= allocations < fail_end) to
# fail, invoking the emergency mechanism.
# When this is happening, I/O requests proceed one at a time,
# essentially synchronously, until allocations start succeeding again.
#
# This currently doesn't handle the bitmap array, since that can be of
# any length, so we can't have a static version of it at this point.
Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-27 12:02:00.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-27 12:02:19.000000000 -0400
@@ -518,6 +518,73 @@
         void *bitmap_buf;
 };
 
+static int allocations;
+static int fail_start, fail_end;
+
+static struct bitmap_io emergency_bitmap_io;
+static int bitmap_io_taken = 0;
+
+static struct bitmap_io *alloc_bitmap_io(void)
+{
+	struct bitmap_io *ret;
+
+	allocations++;
+	ret = kmalloc(sizeof(*ret), GFP_ATOMIC);
+
+	if((allocations >= fail_start) && (allocations < fail_end)){
+		kfree(ret);
+		ret = NULL;
+	}
+
+	if(ret != NULL)
+		return ret;
+
+	if(bitmap_io_taken)
+		return ERR_PTR(-EAGAIN);
+
+	bitmap_io_taken = 1;
+	return(&emergency_bitmap_io);
+}
+
+static void free_bitmap_io(struct bitmap_io *io)
+{
+	if(io == &emergency_bitmap_io)
+		bitmap_io_taken = 0;
+	else kfree(io);
+}
+
+static struct ubd_aio emergency_aio;
+static int aio_taken = 0;
+
+static struct ubd_aio *alloc_ubd_aio(void)
+{
+	struct ubd_aio *ret;
+
+	allocations++;
+	ret = kmalloc(sizeof(*ret), GFP_ATOMIC);
+
+	if((allocations >= fail_start) && (allocations < fail_end)){
+		kfree(ret);
+		ret = NULL;
+	}
+
+	if(ret != NULL)
+		return ret;
+
+	if(aio_taken)
+		return ERR_PTR(-EAGAIN);
+
+	aio_taken = 1;
+	return(&emergency_aio);
+}
+
+static void free_ubd_aio(struct ubd_aio *aio)
+{
+	if(aio == &emergency_aio)
+		aio_taken = 0;
+	else kfree(aio);
+}
+
 static int ubd_reply_fd = -1;
 static struct list_head restart = LIST_HEAD_INIT(restart);
 
@@ -552,7 +619,7 @@
 			   (atomic_dec_and_test(&aio->bitmap->count))){
                                 aio->aio = aio->bitmap->aio;
                                 aio->len = 0;
-                                kfree(aio->bitmap);
+				free_bitmap_io(aio->bitmap);
                                 aio->bitmap = NULL;
                                 submit_aio(&aio->aio);
 			}
@@ -565,16 +632,16 @@
 
                                 if(aio->bitmap_buf != NULL)
                                         kfree(aio->bitmap_buf);
-				kfree(aio);
+				free_ubd_aio(aio);
 			}
 		}
                 else if(n < 0){
                         ubd_finish(req, n);
                         if(aio->bitmap != NULL)
-                                kfree(aio->bitmap);
+				free_bitmap_io(aio->bitmap);
                         if(aio->bitmap_buf != NULL)
                                 kfree(aio->bitmap_buf);
-                        kfree(aio);
+			free_ubd_aio(aio);
                 }
 	}
 	list_for_each_safe(list, next, &restart){
@@ -1274,6 +1341,10 @@
         if(req->bitmap_start != -1){
                 /* Round up to the nearest word */
                 int round = sizeof(unsigned long);
+		bitmap_io = alloc_bitmap_io();
+		if(IS_ERR(bitmap_io))
+			return PTR_ERR(bitmap_io);
+
                 len = (req->bitmap_end - req->bitmap_start +
                        round * 8 - 1) / (round * 8);
                 len *= round;
@@ -1281,18 +1352,11 @@
                 off = req->bitmap_start / (8 * round);
                 off *= round;
 
-                bitmap_io = kmalloc(sizeof(*bitmap_io), GFP_KERNEL);
-                if(bitmap_io == NULL){
-                        printk("Failed to kmalloc bitmap IO\n");
-                        req->error = 1;
-                        return -ENOMEM;
-                }
-
                 bitmap_buf = kmalloc(len, GFP_KERNEL);
                 if(bitmap_buf == NULL){
                         printk("do_io : kmalloc of bitmap chunk "
                                "failed\n");
-                        kfree(bitmap_io);
+			free_bitmap_io(bitmap_io);
                         req->error = 1;
                         return -ENOMEM;
                 }
@@ -1325,11 +1389,9 @@
                 len = (end - start) * req->sectorsize;
                 buf = &req->buffer[start * req->sectorsize];
 
-                aio = kmalloc(sizeof(*aio), GFP_KERNEL);
-                if(aio == NULL){
-                        req->error = 1;
-                        return -ENOMEM;
-                }
+		aio = alloc_ubd_aio();
+		if(IS_ERR(aio))
+			return PTR_ERR(aio);
 
                 *aio = ((struct ubd_aio)
                         { .aio		= INIT_AIO(req->op, req->fds[bit], buf,

[-- Attachment #4: init_aio_err --]
[-- Type: text/plain, Size: 808 bytes --]

Index: test/arch/um/os-Linux/aio.c
===================================================================
--- test.orig/arch/um/os-Linux/aio.c	2005-09-27 12:02:00.000000000 -0400
+++ test/arch/um/os-Linux/aio.c	2005-09-27 12:02:42.000000000 -0400
@@ -321,21 +321,23 @@
 		err = -errno;
                 printk("aio_thread failed to initialize context, err = %d\n",
                        errno);
-                return err;
+		goto out;
         }
 
         err = run_helper_thread(aio_thread, NULL,
                                 CLONE_FILES | CLONE_VM | SIGCHLD, &stack, 0);
         if(err < 0)
-                return err;
+		goto out;
 
         aio_pid = err;
+	err = 0;
 
 	printk("Using 2.6 host AIO\n");
 
 	submit_proc = submit_aio_26;
 
-        return 0;
+out:
+        return err;
 }
 
 #else

[-- Attachment #5: aio-batching --]
[-- Type: text/plain, Size: 5590 bytes --]

Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-27 12:02:19.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-27 12:03:00.000000000 -0400
@@ -1081,7 +1081,7 @@
 		if(dev->end_sg == 0){
 			struct request *req = elv_next_request(q);
 			if(req == NULL)
-				return;
+				goto out;
 
 			dev->request = req;
 			blkdev_dequeue_request(req);
@@ -1103,7 +1103,7 @@
 			if(do_io(&io_req, req, dev->cow.bitmap) == -EAGAIN){
 				if(list_empty(&dev->restart))
 					list_add(&dev->restart, &restart);
-				return;
+				goto out;
 			}
 
 			req->sector += sg->length >> 9;
@@ -1112,6 +1112,8 @@
 		dev->end_sg = 0;
 		dev->request = NULL;
 	}
+out:
+	finish_aio();
 }
 
 static int ubd_ioctl(struct inode * inode, struct file * file,
Index: test/arch/um/include/aio.h
===================================================================
--- test.orig/arch/um/include/aio.h	2005-09-27 11:33:43.000000000 -0400
+++ test/arch/um/include/aio.h	2005-09-27 12:03:00.000000000 -0400
@@ -36,5 +36,6 @@
 			   .next	= NULL }
 
 extern int submit_aio(struct aio_context *aio);
+extern int finish_aio(void);
 
 #endif
Index: test/arch/um/os-Linux/aio.c
===================================================================
--- test.orig/arch/um/os-Linux/aio.c	2005-09-27 12:02:42.000000000 -0400
+++ test/arch/um/os-Linux/aio.c	2005-09-27 12:03:00.000000000 -0400
@@ -80,6 +80,9 @@
  * that it now backs the mmapped area.
  */
 
+/* XXX Fix for SMP */
+static int pending_events;
+
 static int do_aio(aio_context_t ctx, struct aio_context *aio)
 {
         struct iocb iocb, *iocbp = &iocb;
@@ -115,8 +118,10 @@
         }
 
         err = io_submit(ctx, 1, &iocbp);
-        if(err > 0)
+        if(err > 0){
                 err = 0;
+		pending_events++;
+	}
 	else
 		err = -errno;
 
@@ -124,6 +129,21 @@
         return err;
 }
 
+static int aio_wakeup_r_fd;
+static int aio_wakeup_w_fd;
+
+static int finish_aio_26(void)
+{
+	int err;
+
+	err = write(aio_wakeup_w_fd, &pending_events, sizeof(pending_events));
+	err = (err != sizeof(pending_events)) ? errno : 0;
+
+	pending_events = 0;
+
+	return err;
+}
+
 static aio_context_t ctx = 0;
 
 static int aio_thread(void *arg)
@@ -131,35 +151,44 @@
         struct aio_thread_reply reply;
         struct aio_context *aio;
         struct io_event event;
-        int err, n;
+        int err, i, n, nevents;
 
         signal(SIGWINCH, SIG_IGN);
 
         while(1){
-                n = io_getevents(ctx, 1, 1, &event, NULL);
-                if(n < 0){
-                        if(errno == EINTR)
-                                continue;
-                        printk("aio_thread - io_getevents failed, "
-                               "errno = %d\n", errno);
-                }
-                else {
-			aio = (struct aio_context *) (long) event.data;
-			if(update_aio(aio, event.res)){
-				do_aio(ctx, aio);
-				continue;
+		n = read(aio_wakeup_r_fd, &nevents, sizeof(nevents));
+		if(n != sizeof(nevents)){
+			   printk("aio_thread - reading wakeup fd returned "
+				  "%d, errno = %d\n", n, errno);
+			   continue;
+		}
+
+		for(i = 0; i < nevents; i++){
+			n = io_getevents(ctx, 1, 1, &event, NULL);
+			if(n < 0){
+				if(errno == EINTR)
+					continue;
+				printk("aio_thread - io_getevents failed, "
+				       "errno = %d\n", errno);
 			}
-
-                        reply = ((struct aio_thread_reply)
-				{ .data = aio,
-				  .err	= aio->len });
-			err = os_write_file(aio->reply_fd, &reply,
-					    sizeof(reply));
-                        if(err != sizeof(reply))
-				printk("aio_thread - write failed, "
-				       "fd = %d, err = %d\n", aio->reply_fd,
-				       -err);
-                }
+			else {
+				aio = (struct aio_context *) (long) event.data;
+				if(update_aio(aio, event.res)){
+					do_aio(ctx, aio);
+					continue;
+				}
+
+				reply = ((struct aio_thread_reply)
+					{ .data = aio,
+							.err	= aio->len });
+				err = os_write_file(aio->reply_fd, &reply,
+						    sizeof(reply));
+				if(err != sizeof(reply))
+					printk("aio_thread - write failed, "
+					       "fd = %d, err = %d\n", 
+					       aio->reply_fd, -err);
+			}
+		}
         }
         return 0;
 }
@@ -251,6 +280,7 @@
 
 static int aio_pid = -1;
 static int (*submit_proc)(struct aio_context *aio);
+static int (*finish_proc)(void);
 
 static int init_aio_24(void)
 {
@@ -315,7 +345,7 @@
 static int init_aio_26(void)
 {
         unsigned long stack;
-        int err;
+        int err, wakeup_pipe[2];
 
         if(io_setup(256, &ctx)){
 		err = -errno;
@@ -324,10 +354,18 @@
 		goto out;
         }
 
+	if(pipe(wakeup_pipe) < 0){
+		err = -errno;
+		goto out;
+	}
+
+	aio_wakeup_r_fd = wakeup_pipe[0];
+	aio_wakeup_w_fd = wakeup_pipe[1];
+
         err = run_helper_thread(aio_thread, NULL,
                                 CLONE_FILES | CLONE_VM | SIGCHLD, &stack, 0);
         if(err < 0)
-		goto out;
+		goto out_close;
 
         aio_pid = err;
 	err = 0;
@@ -335,9 +373,15 @@
 	printk("Using 2.6 host AIO\n");
 
 	submit_proc = submit_aio_26;
+	finish_proc = finish_aio_26;
 
 out:
         return err;
+
+out_close:
+	close(wakeup_pipe[0]);
+	close(wakeup_pipe[1]);
+	goto out;
 }
 
 #else
@@ -350,6 +394,7 @@
 static int init_aio_26(void)
 {
 	submit_proc = submit_aio_26;
+	finish_proc = finish_aio_26;
         return -ENOSYS;
 }
 #endif
@@ -420,3 +465,11 @@
 {
 	return (*submit_proc)(aio);
 }
+
+int finish_aio(void)
+{
+	if(finish_proc == NULL)
+		return 0;
+
+	return (*finish_proc)();
+}

[-- Attachment #6: o_direct --]
[-- Type: text/plain, Size: 2846 bytes --]

Index: test/arch/um/drivers/ubd_kern.c
===================================================================
--- test.orig/arch/um/drivers/ubd_kern.c	2005-09-27 12:03:00.000000000 -0400
+++ test/arch/um/drivers/ubd_kern.c	2005-09-27 12:03:15.000000000 -0400
@@ -136,10 +136,10 @@
  
 #ifdef CONFIG_BLK_DEV_UBD_SYNC
 #define OPEN_FLAGS ((struct openflags) { .r = 1, .w = 1, .s = 1, .c = 0, \
-					 .cl = 1 })
+				         .cl = 1, .d = 1 })
 #else
 #define OPEN_FLAGS ((struct openflags) { .r = 1, .w = 1, .s = 0, .c = 0, \
-					 .cl = 1 })
+					 .cl = 1, .d = 1  })
 #endif
 
 /* Not protected - changed only in ubd_setup_common and then only to
@@ -809,6 +809,8 @@
 		goto out_close;
 
 	blk_queue_max_hw_segments(dev->queue, MAX_SG);
+	blk_queue_hardsect_size(dev->queue, PAGE_SIZE);
+
 	dev->queue->queuedata = dev;
 		
 	err = ubd_new_disk(MAJOR_NR, dev->size, n, &ubd_gendisk[n]);
@@ -1094,11 +1096,8 @@
 		while(dev->start_sg < dev->end_sg){
 			struct scatterlist *sg = &dev->sg[dev->start_sg];
 
-			err = prepare_request(req, &io_req, req->sector << 9,
-					      sg->offset, sg->length,
-					      sg->page);
-			if(err)
-				continue;
+			prepare_request(req, &io_req, req->sector << 9,
+					sg->offset, sg->length, sg->page);
 
 			if(do_io(&io_req, req, dev->cow.bitmap) == -EAGAIN){
 				if(list_empty(&dev->restart))
Index: test/arch/um/include/os.h
===================================================================
--- test.orig/arch/um/include/os.h	2005-09-27 11:33:43.000000000 -0400
+++ test/arch/um/include/os.h	2005-09-27 12:19:15.000000000 -0400
@@ -52,10 +52,12 @@
 	unsigned int a : 1;	/* O_APPEND */
 	unsigned int e : 1;	/* O_EXCL */
 	unsigned int cl : 1;    /* FD_CLOEXEC */
+        unsigned int d : 1;     /* O_DIRECT */
 };
 
 #define OPENFLAGS() ((struct openflags) { .r = 0, .w = 0, .s = 0, .c = 0, \
- 					  .t = 0, .a = 0, .e = 0, .cl = 0 })
+                                          .t = 0, .a = 0, .e = 0, .cl = 0, \
+                                          .d = 0 })
 
 static inline struct openflags of_read(struct openflags flags)
 {
@@ -117,6 +119,12 @@
 	return(flags); 
 }
   
+static inline struct openflags of_direct(struct openflags flags)
+{ 
+	flags.d = 1; 
+	return(flags); 
+}
+
 extern int os_stat_file(const char *file_name, struct uml_stat *buf);
 extern int os_stat_fd(const int fd, struct uml_stat *buf);
 extern int os_access(const char *file, int mode);
Index: test/arch/um/os-Linux/file.c
===================================================================
--- test.orig/arch/um/os-Linux/file.c	2005-09-27 11:33:43.000000000 -0400
+++ test/arch/um/os-Linux/file.c	2005-09-27 12:03:15.000000000 -0400
@@ -249,6 +249,7 @@
 	if(flags.c) f |= O_CREAT;
 	if(flags.t) f |= O_TRUNC;
 	if(flags.e) f |= O_EXCL;
+	if(flags.d) f |= O_DIRECT;
 
 	fd = open64(file, f, mode);
 	if(fd < 0)

[-- Attachment #7: aio-errors --]
[-- Type: text/plain, Size: 432 bytes --]

Index: test/arch/um/os-Linux/aio.c
===================================================================
--- test.orig/arch/um/os-Linux/aio.c	2005-09-27 12:03:00.000000000 -0400
+++ test/arch/um/os-Linux/aio.c	2005-09-27 12:03:29.000000000 -0400
@@ -274,6 +274,7 @@
         err = os_write_file(aio_req_fd_w, &aio, sizeof(aio));
         if(err == sizeof(aio))
                 err = 0;
+	else err = -errno;
 
         return err;
 }

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-27 18:12               ` Jeff Dike
@ 2005-09-28 12:14                 ` Blaisorblade
  2005-09-28 15:54                   ` Jeff Dike
  0 siblings, 1 reply; 12+ messages in thread
From: Blaisorblade @ 2005-09-28 12:14 UTC (permalink / raw)
  To: user-mode-linux-devel; +Cc: Jeff Dike, Bodo Stroesser

On Tuesday 27 September 2005 20:12, Jeff Dike wrote:
> Attached is my current set of I/O patches.  It's significantly different,
> and simpler, than my previous set.

> Now, there is no sleeping while running the queue, so the spinlock is never
> dropped, and all of the extra synchronization is gone.
Very, very nice.

What about the early removal of the request from the queue, before being sure 
we can complete it (sorry for not reading carefully everything below, I'm 
going to have lunch and afterwards back to study).
> o_direct adds O_DIRECT support to UML and makes the ubd driver use it.
>
> aio-errors makes submit_aio_24 return -errno instead of -1.
Reorder o_direct at the end - no time to read them yet, but guess that more 
intrusive goes at the end. And I'm not sure whether merging o_direct so late 
in 2.6.14 is nice (even if you've tested the full patch set, and without 
o_direct the thing makes less sense).

Let's hope for the better, but next time the debug should be done *before* 
merging, ok? Even because, for instance, if (say) the patch was ready for 
2.6.14 and merged into 2.6.15, I could have released a test tree against 
2.6.14 (maybe I could do this now).
> I've given this a bunch of testing and it has survived overnight kernel
> build loops on both x86 and x86_64.

> The attached patches are against 2.6.14-rc2.
Please, send them to Andrew for -mm, and say "2.6.14 can't go without them, 
but they might need further review".

And ask Jens Axboe if it can give a look to them (this time it's less 
necessary, since there are less dirty tricks).
> 			Jeff

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it



-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-28 12:14                 ` Blaisorblade
@ 2005-09-28 15:54                   ` Jeff Dike
  2005-09-28 16:47                     ` Blaisorblade
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Dike @ 2005-09-28 15:54 UTC (permalink / raw)
  To: Blaisorblade; +Cc: user-mode-linux-devel, Bodo Stroesser

On Wed, Sep 28, 2005 at 02:14:32PM +0200, Blaisorblade wrote:
> Very, very nice.
> 
> What about the early removal of the request from the queue, before being sure 
> we can complete it (sorry for not reading carefully everything below, I'm 
> going to have lunch and afterwards back to study).

I think that doesn't matter too much.  I have to get the sg information out
of it, and store that, plus the current position within the sg array if
we couldn't send it all to the host.  So, I don't refer to the request after
I get the sg information out of it, so I might as well remove it from the
queue.

> Reorder o_direct at the end - no time to read them yet, but guess that more 
> intrusive goes at the end. And I'm not sure whether merging o_direct so late 
> in 2.6.14 is nice (even if you've tested the full patch set, and without 
> o_direct the thing makes less sense).

Yeah, I wasn't planning on sending that for 2.6.14.  It was just part of that
set of patches.

> Let's hope for the better, but next time the debug should be done *before* 
> merging, ok? Even because, for instance, if (say) the patch was ready for 
> 2.6.14 and merged into 2.6.15, I could have released a test tree against 
> 2.6.14 (maybe I could do this now).

I thought I did.  AIO had been in my tree forever with no apparent problems.
I recently started running 24 hour/day stress tests on UML, and that turned
up some problems.  People turning on spinlock debugging turned up the sleeping
while atomic problem.  And me staring at the code as a result of that turned
up the possible deadlock.

> Please, send them to Andrew for -mm, and say "2.6.14 can't go without them, 
> but they might need further review".
> 
> And ask Jens Axboe if it can give a look to them (this time it's less 
> necessary, since there are less dirty tricks).

Yeah.  This still needs some work.  I need to deal with the bitmap array
properly.

				Jeff


-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [uml-devel] Review needed for ubd fixes
  2005-09-28 15:54                   ` Jeff Dike
@ 2005-09-28 16:47                     ` Blaisorblade
  0 siblings, 0 replies; 12+ messages in thread
From: Blaisorblade @ 2005-09-28 16:47 UTC (permalink / raw)
  To: Jeff Dike; +Cc: user-mode-linux-devel, Bodo Stroesser

On Wednesday 28 September 2005 17:54, Jeff Dike wrote:
> On Wed, Sep 28, 2005 at 02:14:32PM +0200, Blaisorblade wrote:
> > Very, very nice.

> > Let's hope for the better, but next time the debug should be done
> > *before* merging, ok? Even because, for instance, if (say) the patch was
> > ready for 2.6.14 and merged into 2.6.15, I could have released a test
> > tree against 2.6.14 (maybe I could do this now).

> I thought I did.  AIO had been in my tree forever with no apparent
> problems.
We're not currently getting a lot of testing on that. Apart the really current 
issue of the page not uptodate, users mostly ran that because of x86_64 
support, which is now reliable in mainline.

In fact, the SKAS0 compiler-dependant bug (the asm code was popping a fixed 
number of words, which was wrong with different compiler) only came out on 
the tree I released against 2.6.12...
> I recently started running 24 hour/day stress tests on UML, and 
> that turned up some problems.  People turning on spinlock debugging turned
> up the sleeping while atomic problem.  And me staring at the code as a
> result of that turned up the possible deadlock.

> > And ask Jens Axboe if it can give a look to them (this time it's less
> > necessary, since there are less dirty tricks).

> Yeah.  This still needs some work.  I need to deal with the bitmap array
> properly.

> 				Jeff

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

		
___________________________________ 
Yahoo! Messenger: chiamate gratuite in tutto il mondo 
http://it.messenger.yahoo.com



-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2005-09-28 16:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-09-17 22:45 [uml-devel] Review needed for ubd fixes Jeff Dike
2005-09-20 12:01 ` Blaisorblade
2005-09-20 19:06   ` Jeff Dike
2005-09-21 15:49     ` Blaisorblade
2005-09-21 18:04       ` Jeff Dike
2005-09-21 19:06         ` Blaisorblade
2005-09-21 20:45           ` Jeff Dike
2005-09-22 20:51             ` Blaisorblade
2005-09-27 18:12               ` Jeff Dike
2005-09-28 12:14                 ` Blaisorblade
2005-09-28 15:54                   ` Jeff Dike
2005-09-28 16:47                     ` Blaisorblade

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.