Project Home
Project Home
Wiki
Wiki
Discussion Forums
Discussions
Project Information
Project Info
Forum Topic - possible bug in kern_sync.c: (7 Items)
   
possible bug in kern_sync.c  
Hello:

First, I'm not 100% sure this is a bug; but it's definitely a problem for me.  I'm also fairly convinced that my "fix" 
is incorrect; but it works for me (for now).

At any rate, here's the situation.  I "ported" a driver for the MOSCHIP 7830 USB ethernet part to "NetBSD".  What I 
really did was to take the example if_axe driver and modify it according to the OpenBSD driver that MOSCHIP provides.

At any rate, if I forced autonegotiation (by not connecting an ethernet cable), the stack would fail.  By looking at the
 dump, I found it was failing in endtsleep() (in kern_sync.c).  

Specifically, it was failing here:

    /* take out of fp->f_slpq  */
    if ((*l->l_fslback = l->l_fslforw) != NULL)
        l->l_fslforw->l_fslback = l->l_fslback;

because l->l_fslback was null.

The reason it was null was caused by the following code in ltsleep():

    if (l->l_fp) {
        /*
         * Put it on this list to make the handling of
         * unblock pulses faster.  The unblock callout
         * gives us a reference to p_fp.  Otherwise, we
         * have to search all the slpqs.
         */
        l->l_fslforw = l->l_fp->f_slpq;
        if (l->l_fslforw != NULL)
            l->l_fslforw->l_fslback = &l->l_fslforw;
        l->l_fp->f_slpq = l;
        l->l_fslback = &l->l_fp->f_slpq;
    }
    else {
        /*
         * Sleeping during connect message handling.
         * This can happen if a dlopen() results while
         * servicing a mount request: dlopen() gets mapped
         * to a blockop.
         *
         * Note: The resmgr layer will never call out to our
         * tcpip_unblock() func prior to the resmgr_open_bind()
         * so we better make sure this blocking state doesn't
         * linger.
         */
        l->l_fslback = NULL;
        l->l_fslforw = NULL;
    }

Specifically the last two lines.  In my case, at this time, timo was non-zero, so the following was called:

   callout_reset(&l->l_tsleep_ch, timo, endtsleep, l);

Now, my temporary fix for this is to set timo = in the case where we are setting l->l_fslback to NULL.

I don't necessarily believe this is a good thing, but it works (at least for me), and the stack seems stable (granted, I
 have not had the opportunity yet to transfer very much data across it).

My questions are:

1)  Is this the right fix for this bug?  or is there something else that should be going on here?

2)  If it's not the right fix, what is the right fix?

Thanks!!

-garyf
RE: possible bug in kern_sync.c  
Hi Gary:

	I've reproduced that here using the axe driver, so we'll dig in and
see what's going on.

Just to check this seems to require the following:

Plug in dongle (no cable)

io-pkt-v4 -daxe
ifconfig axe0 up

Now plug in cable (crash).  

Seems to be somewhat intermittent, but it is reproducible.

	Thanks very much for the report,
		Robert.


-----Original Message-----
From: Gary Faulkner [mailto:gary.faulkner@ucontrol.com] 
Sent: Thursday, March 06, 2008 9:27 AM
To: builds-networking
Subject: possible bug in kern_sync.c

Hello:

First, I'm not 100% sure this is a bug; but it's definitely a problem for
me.  I'm also fairly convinced that my "fix" is incorrect; but it works for
me (for now).

At any rate, here's the situation.  I "ported" a driver for the MOSCHIP 7830
USB ethernet part to "NetBSD".  What I really did was to take the example
if_axe driver and modify it according to the OpenBSD driver that MOSCHIP
provides.

At any rate, if I forced autonegotiation (by not connecting an ethernet
cable), the stack would fail.  By looking at the dump, I found it was
failing in endtsleep() (in kern_sync.c).  

Specifically, it was failing here:

    /* take out of fp->f_slpq  */
    if ((*l->l_fslback = l->l_fslforw) != NULL)
        l->l_fslforw->l_fslback = l->l_fslback;

because l->l_fslback was null.

The reason it was null was caused by the following code in ltsleep():

    if (l->l_fp) {
        /*
         * Put it on this list to make the handling of
         * unblock pulses faster.  The unblock callout
         * gives us a reference to p_fp.  Otherwise, we
         * have to search all the slpqs.
         */
        l->l_fslforw = l->l_fp->f_slpq;
        if (l->l_fslforw != NULL)
            l->l_fslforw->l_fslback = &l->l_fslforw;
        l->l_fp->f_slpq = l;
        l->l_fslback = &l->l_fp->f_slpq;
    }
    else {
        /*
         * Sleeping during connect message handling.
         * This can happen if a dlopen() results while
         * servicing a mount request: dlopen() gets mapped
         * to a blockop.
         *
         * Note: The resmgr layer will never call out to our
         * tcpip_unblock() func prior to the resmgr_open_bind()
         * so we better make sure this blocking state doesn't
         * linger.
         */
        l->l_fslback = NULL;
        l->l_fslforw = NULL;
    }

Specifically the last two lines.  In my case, at this time, timo was
non-zero, so the following was called:

   callout_reset(&l->l_tsleep_ch, timo, endtsleep, l);

Now, my temporary fix for this is to set timo = in the case where we are
setting l->l_fslback to NULL.

I don't necessarily believe this is a good thing, but it works (at least for
me), and the stack seems stable (granted, I have not had the opportunity yet
to transfer very much data across it).

My questions are:

1)  Is this the right fix for this bug?  or is there something else that
should be going on here?

2)  If it's not the right fix, what is the right fix?

Thanks!!

-garyf

_______________________________________________
Builds
http://community.qnx.com/sf/go/post5579
Re: RE: possible bug in kern_sync.c  
Just as a side note, I've create a PR to track this (56077) if you're looking for an update at any point in time.

   Robert.
Re: possible bug in kern_sync.c  
This was probably introduced in rev 210 of
sys/dev/lib/usb/usb_conv_qnx.c when 
static struct file usb_file; was removed.
I think the proper fix is to check for NULL
l_fslback and friends members on wakeup but
I'll look into it more thoroughly next week
when I have hardware.

-seanb

On Fri, Mar 07, 2008 at 02:08:27PM -0500, Robert Craig wrote:
> Hi Gary:
> 
> 	I've reproduced that here using the axe driver, so we'll dig in and
> see what's going on.
> 
> Just to check this seems to require the following:
> 
> Plug in dongle (no cable)
> 
> io-pkt-v4 -daxe
> ifconfig axe0 up
> 
> Now plug in cable (crash).  
> 
> Seems to be somewhat intermittent, but it is reproducible.
> 
> 	Thanks very much for the report,
> 		Robert.
> 
> 
> -----Original Message-----
> From: Gary Faulkner [mailto:gary.faulkner@ucontrol.com] 
> Sent: Thursday, March 06, 2008 9:27 AM
> To: builds-networking
> Subject: possible bug in kern_sync.c
> 
> Hello:
> 
> First, I'm not 100% sure this is a bug; but it's definitely a problem for
> me.  I'm also fairly convinced that my "fix" is incorrect; but it works for
> me (for now).
> 
> At any rate, here's the situation.  I "ported" a driver for the MOSCHIP 7830
> USB ethernet part to "NetBSD".  What I really did was to take the example
> if_axe driver and modify it according to the OpenBSD driver that MOSCHIP
> provides.
> 
> At any rate, if I forced autonegotiation (by not connecting an ethernet
> cable), the stack would fail.  By looking at the dump, I found it was
> failing in endtsleep() (in kern_sync.c).  
> 
> Specifically, it was failing here:
> 
>     /* take out of fp->f_slpq  */
>     if ((*l->l_fslback = l->l_fslforw) != NULL)
>         l->l_fslforw->l_fslback = l->l_fslback;
> 
> because l->l_fslback was null.
> 
> The reason it was null was caused by the following code in ltsleep():
> 
>     if (l->l_fp) {
>         /*
>          * Put it on this list to make the handling of
>          * unblock pulses faster.  The unblock callout
>          * gives us a reference to p_fp.  Otherwise, we
>          * have to search all the slpqs.
>          */
>         l->l_fslforw = l->l_fp->f_slpq;
>         if (l->l_fslforw != NULL)
>             l->l_fslforw->l_fslback = &l->l_fslforw;
>         l->l_fp->f_slpq = l;
>         l->l_fslback = &l->l_fp->f_slpq;
>     }
>     else {
>         /*
>          * Sleeping during connect message handling.
>          * This can happen if a dlopen() results while
>          * servicing a mount request: dlopen() gets mapped
>          * to a blockop.
>          *
>          * Note: The resmgr layer will never call out to our
>          * tcpip_unblock() func prior to the resmgr_open_bind()
>          * so we better make sure this blocking state doesn't
>          * linger.
>          */
>         l->l_fslback = NULL;
>         l->l_fslforw = NULL;
>     }
> 
> Specifically the last two lines.  In my case, at this time, timo was
> non-zero, so the following was called:
> 
>    callout_reset(&l->l_tsleep_ch, timo, endtsleep, l);
> 
> Now, my temporary fix for this is to set timo = in the case where we are
> setting l->l_fslback to NULL.
> 
> I don't necessarily believe this is a good thing, but it works (at least for
> me), and the stack seems stable (granted, I have not had the opportunity yet
> to transfer very much data across it).
> 
> My questions are:
> 
> 1)  Is this the right fix for this bug?  or is there something else that
> should be going on here?
> 
> 2)  If it's not the right fix, what is the...
Re: possible bug in kern_sync.c  
Thanks Sean!  I had thought about that solution; but wasn't sure of all of the ramifications (I did not thoroughly 
understand all of the intricacies of the thread sleeping models being used at this point in the code).  

FWIW, I am willing to try other fixes if it will help (as I'm still relatively early in product development at this 
point).

Thanks!!

-garyf

> 
> This was probably introduced in rev 210 of
> sys/dev/lib/usb/usb_conv_qnx.c when 
> static struct file usb_file; was removed.
> I think the proper fix is to check for NULL
> l_fslback and friends members on wakeup but
> I'll look into it more thoroughly next week
> when I have hardware.
> 
> -seanb
> 
Re: possible bug in kern_sync.c  
On Mon, Mar 10, 2008 at 02:15:25PM -0400, Gary Faulkner wrote:
> Thanks Sean!  I had thought about that solution; but wasn't sure of all of the ramifications (I did not thoroughly 
understand all of the intricacies of the thread sleeping models being used at this point in the code).  
> 
> FWIW, I am willing to try other fixes if it will help (as I'm still relatively early in product development at this 
point).
> 
> Thanks!!
> 
> -garyf

Should be fixed now.

-seanb
Re: RE: possible bug in kern_sync.c  
Hi Robert:

In my case, I did not have to plug the cable in for it to fail; it would just eventually crash (after a bunch of auto-
negotiation tried to happen).  This *could* be due to differences between my actual driver (mos) and the axe driver; but
 sounds like you've got the jist of the idea from what you've done.

Thanks for looking into this!!

-garyf

> Hi Gary:
> 
> 	I've reproduced that here using the axe driver, so we'll dig in and
> see what's going on.
> 
> Just to check this seems to require the following:
> 
> Plug in dongle (no cable)
> 
> io-pkt-v4 -daxe
> ifconfig axe0 up
> 
> Now plug in cable (crash).  
> 
> Seems to be somewhat intermittent, but it is reproducible.
> 
> 	Thanks very much for the report,
> 		Robert.