public inbox for linux-kernel@vger.kernel.org
* Re: poll() in 2.6 and beyond
@ 2004-03-03  3:06 linux
From: linux @ 2004-03-03  3:06 UTC (permalink / raw)
  To: linux-kernel, root

> I'm talking about the driver! When a open fd called poll() or select(),
> in user-mode code, the driver's poll() was called, and the driver's poll()
> would call poll_wait().  Poll_wait() used to NOT return until the driver
> executed wake_up_interruptible() on that wait-queue. When poll_wait()
> returned, the driver would return to the caller with the new poll-
> status.

poll_wait has ALWAYS, since it was introduced during the select/poll
changeover in 2.1 development, been a non-sleeping, immediately returning
function.

Its predecessor, select_wait(), has been a non-sleeping function since 1.0.

> So, if the poll_wait isn't a wait-function, but just some add-wakeup
> to the queue function, then its name probably should have been
> changed when it changed. At one time it did, truly, wait until
> it was awakened with wake_up_interruptible.

There is no "when it changed".  It has never changed.  Go look at
the 2.2.20 code on http://lxr.linux.no/.

    ***************************************************************
    *                                                             *
    *  poll_wait:                                                 *
    *  - DOES NOT sleep.                                          *
    *  - NEVER HAS slept, in any kernel version, EVER.            *
    *  - WOULD NOT WORK if it did sleep, for reasons that are     *
    *    so BLATANTLY OBVIOUS that arguing about it after it's    *
    *    been REPEATEDLY pointed out is a sign that the person    *
    *    arguing needs to go and visit the rest home with those   *
    *    nice, young men in their clean, white coats.             *
    *                                                             *
    ***************************************************************

It has always, since select_wait in Linux 1.0, been nothing more than
an "add-wakeup-to-the-queue" function.  The last time the code changed
significantly was the select/poll changeover in 2.1.x, and even then it
was very similar.

When a particular filp's poll method is called, there are two things
that have to get done:
1) Check if the wakeup conditions are already satisfied
   (the "no-wait" case), and
2) Schedule the task for wakeup when the filp's condition changes
   (the "wait" case)

Now, 2) only has to be done if 1) fails, but we can't do things in that
order, because there's a race condition: if the condition changes between
the two steps but doesn't change after that, we'll never wake up.

So we have to do 2), and THEN check for 1).  This is the fundamental
race condition of sleeping until a condition becomes true, so anyone who
entertains the remotest chance of ever writing functioning kernel code
should be excruciatingly familiar with it, and its solution.


For the terminally dim: everyone hold hands and follow me, like when we
read Kernighan & Ritchie aloud, and don't get lost...

In the 2.2.20 kernel, poll_wait is defined (include/linux/poll.h) as:

struct poll_table_entry {
        struct file * filp;
        struct wait_queue wait;
        struct wait_queue ** wait_address;
};

typedef struct poll_table_struct {
        struct poll_table_struct * next;
        unsigned int nr;
        struct poll_table_entry * entry;
} poll_table;

#define __MAX_POLL_TABLE_ENTRIES ((PAGE_SIZE - sizeof (poll_table)) / sizeof (struct poll_table_entry))

extern void __pollwait(struct file * filp, struct wait_queue ** wait_address, poll_table *p);

extern inline void poll_wait(struct file * filp, struct wait_queue ** wait_address, poll_table *p)
{
        if (p && wait_address)
                __pollwait(filp, wait_address, p);
}

This is a trivial wrapper around __pollwait().  Nothing else in the
function could possibly take more than a few clock cycles.

__pollwait is defined (fs/select.c) as:

void __pollwait(struct file * filp, struct wait_queue ** wait_address, poll_table *p)
{
        for (;;) {
                if (p->nr < __MAX_POLL_TABLE_ENTRIES) {
                        struct poll_table_entry * entry;
                        entry = p->entry + p->nr;
                        entry->filp = filp;
                        filp->f_count++;
                        entry->wait_address = wait_address;
                        entry->wait.task = current;
                        entry->wait.next = NULL;
                        add_wait_queue(wait_address,&entry->wait);
                        p->nr++;
                        return;
                }
                p = p->next;
        }
}

This does a little bit of bookkeeping and calls add_wait_queue().
Nothing else in the function could possibly take more than a few dozen
clock cycles.

Now look at add_wait_queue(), which is defined (include/linux/sched.h) as
nothing more than calls to three other functions:

extern inline void __add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
{
        wait->next = *p ? : WAIT_QUEUE_HEAD(p);
        *p = wait;
}

extern rwlock_t waitqueue_lock;

extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
{
        unsigned long flags;

        write_lock_irqsave(&waitqueue_lock, flags);
        __add_wait_queue(p, wait);
        write_unlock_irqrestore(&waitqueue_lock, flags);
}

write_lock_irqsave() and write_unlock_irqrestore() can get a little
bit complicated, so pay careful attention.

We're going to consider just the i386, non-SMP case.  SMP locking is
for big boys and girls who did all their homework and got As on their
tests.

They are defined (include/asm-i386/spinlock.h) as:
#define write_lock_irqsave(lock, flags) \
        do { save_flags(flags); cli(); } while (0)
#define write_unlock_irqrestore(lock, flags) \
        restore_flags(flags)

These, in turn, go through some wrappers in include/asm-i386/system.h:
#define cli() __cli()
#define sti() __sti()
#define save_flags(x) __save_flags(x)
#define restore_flags(x) __restore_flags(x)

and end up calling the primitive assembly routines in the same file:

/* interrupt control.. */
#define __sti() __asm__ __volatile__ ("sti": : :"memory")
#define __cli() __asm__ __volatile__ ("cli": : :"memory")
#define __save_flags(x) \
__asm__ __volatile__("pushfl ; popl %0":"=g" (x): /* no input */ :"memory")
#define __restore_flags(x) \
__asm__ __volatile__("pushl %0 ; popfl": /* no output */ :"g" (x):"memory")

These are one or two machine instructions each.

Finally, __add_wait_queue, above, is just two assignments.  Just to
be absolutely sure we've explored every function-like thing that
the most deluded paranoid could think might sleep, I'll mention that
WAIT_QUEUE_HEAD is defined in include/linux/wait.h as:

#define WAIT_QUEUE_HEAD(x) ((struct wait_queue *)((x)-1))

This, also, cannot possibly sleep.



To recap, looking back in the wayback machine to 2.2.20, the complete
call/macro graph of poll_wait is:

poll_wait
    __pollwait
        add_wait_queue
            write_lock_irqsave
                save_flags
                    __save_flags
                cli
                    __cli
            __add_wait_queue
                WAIT_QUEUE_HEAD
            write_unlock_irqrestore
                restore_flags
                    __restore_flags

and NONE OF THESE FUNCTIONS SLEEP.  Therefore, poll_wait HAS NEVER
SLEPT.

2.4 and later allocate the poll_table pages on demand (with GFP_KERNEL)
rather than preallocating everything, so *that* part can sleep, but it
doesn't depend on the filp being polled, and the rest CANNOT.

* poll() in 2.6 and beyond
@ 2004-03-02 18:21 Richard B. Johnson
From: Richard B. Johnson @ 2004-03-02 18:21 UTC (permalink / raw)
  To: Linux kernel


Poll in 2.6.0: when a driver routine calls poll_wait(),
it returns <<immediately>> to somewhere in the
kernel, which then waits for my wake_up_interruptible() before
returning control to a user sleeping in poll(). This means
that the user gets the wrong poll return value! It
doesn't get the value it was given as a result of the
interrupt, but the value that existed (0) before the
interrupt occurred.

Poll should not return from poll_wait() until it gets
a wake_up_interruptible() call. The wait variable,
info->pwait, below, has been initialized by executing
init_waitqueue_head(&info->pwait) in the initialization
code. This code works in 2.4.24.

What do I do to make it work in 2.6.0 and beyond? There
are no hints in the 2.6 drivers as they all seem to be
written like this and they presumably work.


/*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
/*
 *   The interrupt service routine.
 */
static void pci_isr(int irq, void *p, struct pt_regs *regs)
{
    spin_lock(&info->lock);
    DEB(printk(KERN_INFO"%s : Interrupt!\n", devname));
    info->poll_flag |= POLLIN|DEF_POLL;
    wake_up_interruptible(&info->pwait);
    spin_unlock(&info->lock);
}
/*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=*/
/*
 *  Device poll routine.
 */
static size_t poll(struct file *fp, struct poll_table_struct *wait)
{
    size_t poll_flag;
    size_t flags;
    DEB(printk(KERN_INFO"%s : poll called\n", devname));
    poll_wait(fp, &info->pwait, wait);
    lockit(TRUE, &flags);
    poll_flag = info->poll_flag;
    info->poll_flag = 0;
    lockit(FALSE, &flags);
    DEB(printk(KERN_INFO"%s : poll returns\n", devname));
    return poll_flag;
}

Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
            Note 96.31% of all statistics are fiction.



