Forum Topic - interpretation:
   
interpretation  
What is going on when I see the following?

On the computer running pidin -n ... the log shows:

03103902(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 1941 tk 18443 ct 18445
03103902(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:1941
03103902(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 1990 tk 18446 ct 18448
03103902(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:1990
03103904(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 2697 tk 18457 ct 18459
03103904(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:2697
03103940(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3292 tk 18635 ct 18637
03103940(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3292
03103941(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3306 tk 18638 ct 18640
03103941(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3306
03103941(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3319 tk 18641 ct 18643
03103941(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3319
03103942(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3326 tk 18644 ct 18646
03103942(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3326
03103942(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3342 tk 18647 ct 18649
03103942(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3342
03103943(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3343 tk 18650 ct 18652
03103943(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3343
03103944(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3357 tk 18654 ct 18656
03103944(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3357
03103944(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3361 tk 18657 ct 18659
03103944(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3361
03104009(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3381 tk 18778 ct 18780
03104009(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3381
03104011(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3475 tk 18788 ct 18790
03104011(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3475
03104017(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3495 tk 18819 ct 18821
03104017(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3495
03104419(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 3611 tk 20030 ct 20032
03104419(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:3611

and on the computer being queried:

3224734(QOS): qos_verify_rx_conn_seq(): rxd old dup (3961/3961) nd 1  rx conn 1
3224734(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224735(QOS): qos_verify_rx_conn_seq(): rxd old dup (4210/4210) nd 1  rx conn 1
3224735(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224736(QOS): qos_verify_rx_conn_seq(): rxd old dup (4269/4269) nd 1  rx conn 1
3224736(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224736(QOS): qos_verify_rx_conn_seq(): rxd old dup (4308/4308) nd 1  rx conn 1
3224736(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224737(QOS): qos_verify_rx_conn_seq(): rxd old dup (4368/4368) nd 1  rx conn 1
3224737(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224738(QOS): qos_verify_rx_conn_seq(): rxd old dup (4378/4378) nd 1  rx conn 1
3224738(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224738(QOS): qos_verify_rx_conn_seq(): rxd old dup (4440/4440) nd 1  rx conn 1
3224738(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224739(QOS): qos_verify_rx_conn_seq(): rxd old dup (4466/4466) nd 1  rx conn 1
3224739(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224759(QOS): qos_verify_rx_conn_seq(): rxd old dup (4509/4509) nd 1  rx conn 1
3224759(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224759(QOS): qos_verify_rx_conn_seq(): rxd old dup (4519/4519) nd 1  rx conn 1
3224759(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224800(QOS): qos_verify_rx_conn_seq(): rxd old dup (4565/4565) nd 1  rx conn 1
3224800(QOS): rx_user_data(): bad conn/seq rxd from nd 1
3224801(QOS): qos_verify_rx_conn_seq(): rxd old dup (4572/4572) nd 1  rx conn 1
3224801(QOS):...
RE: interpretation  
Qnet is timing out too early on transmit, in your
network.  Try increasing the timeout, or try to 
get rid of the delay, which can be caused simply
by scheduling latency eg high cpu utilization at
high priority.

--
aboyd
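
For anyone reading logs like the ones in the first post, here is a minimal Python sketch that pulls the numeric fields out of the l4_tx_timeout() lines. The thread does not document what the fields mean; treating tk and ct as tick counters is an assumption, and parse_timeouts is a hypothetical helper name, not part of any QNX tool.

```python
import re

# Matches lines such as:
# 03103902(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 1941 tk 18443 ct 18445
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\d+)\(L4\): l4_tx_timeout\(\): timeout: "
    r"nd (?P<nd>\d+) sc (?P<sc>\d+) dc (?P<dc>\d+) "
    r"ss (?P<ss>\d+) tk (?P<tk>\d+) ct (?P<ct>\d+)"
)

def parse_timeouts(lines):
    """Return one dict per l4_tx_timeout() line; other lines are skipped."""
    records = []
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m is None:
            continue
        rec = {k: int(v) for k, v in m.groupdict().items() if k != "stamp"}
        rec["stamp"] = m.group("stamp")  # kept as text to preserve leading zeros
        records.append(rec)
    return records

sample = [
    "03103902(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 1941 tk 18443 ct 18445",
    "03103902(L4): l4_tx_rx_ack_r_nack(): unkn tx: nd:40 dc:1 seq:1941",
    "03103902(L4): l4_tx_timeout(): timeout: nd 40 sc 1 dc 1 ss 1990 tk 18446 ct 18448",
]

records = parse_timeouts(sample)
for rec in records:
    # Assumption: tk and ct are tick counts, so ct - tk is the elapsed ticks.
    print(rec["nd"], rec["ss"], rec["ct"] - rec["tk"])
```

In every timeout line of the first post, ct - tk comes out as 2, and each timeout is followed by an "unkn tx" line carrying the same sequence number.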
Re: RE: interpretation  
> 
> Qnet is timing out too early on transmit, in your
> network.  Try increasing the timeout, or try to 
> get rid of the delay, which can be caused simply
> by scheduling latency eg high cpu utilization at
> high priority.
> 
The two machines were connected via a switch. I changed it to a direct link with a crossover cable: same thing.  Can't do 
better than that for reducing the delay.

As for high CPU utilization, both machines are 2-core Xeons at 3 GHz, and while running pidin, CPU usage is 0.4% for the 
machine being queried and 1.2% for the machine running pidin.  Got the .kev file to prove it ;-)

Looking at the .kev file, I find it odd that interrupts on IRQ 3 (the one io-net is connected to) are happening only every 
100ms or so.  Only io-net is connected to IRQ 3 (and yes, the COM ports have been disabled).

I guess it sounds like I'll have to get in touch with our account manager and send you some hardware.  This is not 
defective hardware; I have 12 machines doing the same thing.





> --
> aboyd


Re: RE: interpretation  
> I guess it sounds like I'll have to get in touch with our account manager and 
> send you some hardware.    

By send you, I meant send QSSL ;-)

RE: RE: interpretation  
> I find it odd that interrupt 3 ( the one io-net is connected to) 
> are happening only every 100ms or so.  

Likely that is causing the qnet timeout.

Are you sure that this interrupt isn't being shared?  Sharing
interrupts can cause all sorts of undesirable latencies.

Also, some network cards can be programmed to limit the number
of interrupts per second - perhaps an erroneously (huge) delay
could have been introduced that way?

--
aboyd
Re: RE: RE: interpretation  
> 
> > I find it odd that interrupt 3 ( the one io-net is connected to) 
> > are happening only every 100ms or so.  
> 
> Likely that is causing the qnet timeout.
> 
> Are you sure that this interrupt isn't being shared?  Sharing
> interrupts can cause all sorts of undesirable latencies.

Well, there are latencies and there are latencies ;-)  Pidin takes about 6 seconds to run.

> 
> Also, some network cards can be programmed to limit the number
> of interrupts per second - perhaps an erroneously (huge) delay
> could have been introduced that way?

Don't know.  We are using Net.e1000 (built for 6.3.2), and Net.i82540 shows the same behavior.  Drivers are started with
 priority=100, that's all.

> 
> --
> aboyd


RE: RE: RE: interpretation  
FYI:  I think it's 80% probability that the delay 
is on the rx side (ie the packet is in the rx descr
ring, but the driver and hence qnet doesn't know 
about it until after the long-delayed interrupt) but ...

There is a 20% chance that the delay is coming
from the transmit side.  That is, the driver puts
the packet information in the tx descr entry, but
for some reason the transmitting nic does not learn
about it for quite some time, and that's where the
delay is coming from.

You've got a driver/hardware problem.

--
aboyd
Re: RE: RE: RE: interpretation  
> 
> FYI:  I think it's 80% probability that the delay 
> is on the rx side (ie the packet is in the rx descr
> ring, but the driver and hence qnet doesn't know 
> about it until after the long-delayed interrupt) but ...
> 
> There is a 20% chance that the delay is coming
> from the transmit side.  That is, the driver puts
> the packet information in the tx descr entry, but
> for some reason the transmitting nic does not learn
> about it for quite some time, and that's where the
> delay is coming from.
> 
> You've got a driver/hardware problem.

Tried with 6.4 M8 and same thing.  Thanks for the information.

> 
> --
> aboyd


Re: RE: RE: RE: interpretation  
> > 
> > FYI:  I think it's 80% probability that the delay 
> > is on the rx side (ie the packet is in the rx descr
> > ring, but the driver and hence qnet doesn't know 
> > about it until after the long-delayed interrupt) but ...
> > 
> > There is a 20% chance that the delay is coming
> > from the transmit side.  That is, the driver puts
> > the packet information in the tx descr entry, but
> > for some reason the transmitting nic does not learn
> > about it for quite some time, and that's where the
> > delay is coming from.
> > 
> > You've got a driver/hardware problem.
> 
> Tried with 6.4 M8 and same thing.  Thanks for the information.


Tried it with QNX4 and it works fine.

> 
> > 
> > --
> > aboyd
> 
> 


RE: RE: RE: RE: interpretation  
> > You've got a driver/hardware problem.
>
> Tried it with QNX4 and it works fine.

Sigh.  Sure looks like a driver mis-configuration
of your nic.  Problem is, there are over SIXTY
hardware variants of the i82544, and trying to
write a "golden" driver which perfectly initializes
all variants, can be quite difficult.

I'm surprised the recently ported Intel driver didn't 
work, either - you would have thought it should have
handled all of the possible variants correctly.

What is your DID?

--
aboyd
Re: RE: RE: RE: RE: interpretation  
> > > You've got a driver/hardware problem.
> >
> > Tried it with QNX4 and it works fine.
> 
> Sigh.  

I know the feeling ;-)

> Sure looks like a driver mis-configuration
> of your nic.  Problem is, there are over SIXTY
> hardware variants of the i82544, and trying to
> write a "golden" driver which perfectly initializes
> all variants, can be quite difficult.

I sympathize

> 
> I'm surprised the recently ported Intel driver didn't 
> work, either - you would have thought it should have
> handled all of the possible variants correctly.

Indeed, we've had problems with the Intel card and QNX6 since the beginning.  Unfortunately we have no real 
alternative for now.

> 
> What is your DID?

10b9. Hugh once told me he tested the driver with this same  DID ;-(

> 
> --
> aboyd


RE: RE: RE: RE: RE: interpretation  
That's really weird.  I've got that DID in my machine (it's a PCI
Express card) and there haven't been any problems with it.  There's a
little bit of me saying that it's unlikely to be a driver problem.

Can you post the output from "pidin irq" for us to have a look at?  If
the NIC isn't built in, can you move it to a different slot and see if
that has any effect?

	R.



-----Original Message-----
From: Mario Charest [mailto:community-noreply@qnx.com] 
Sent: Monday, October 06, 2008 4:53 PM
To: technology-networking
Subject: Re: RE: RE: RE: RE: interpretation

_______________________________________________
Technology
http://community.qnx.com/sf/go/post14550
RE: RE: RE: RE: RE: interpretation  
> I've got that DID in my machine (it's a PCI Express card) 

Ah ha.  It's PCI-X - I've experienced some really bizarre
and flaky problems with PCI-X on some motherboards.  Another 
thing to check is that you aren't running an ancient version 
of pci server.  Also, try running a NON PCI-X (ie a simple PCI)
i82544 card - I'll bet your problems completely go away.

P.S.  With Rob's setup - with the same PCI-X cards as you -
we are seeing superb performance numbers, with io-pkt.

--
aboyd
Re: RE: RE: RE: RE: RE: interpretation  
> 
> > I've got that DID in my machine (it's a PCI Express card) 
> 
> Ah ha.  It's PCI-X - I've experienced some really bizarre
> and flaky problems with PCI-X on some motherboards. 

We have that problem on various models of motherboard.

> Another 
> thing to check is that you aren't running an ancient version 
> of pci server. 

As I said the problem is present with M8.

> Also, try running a NON PCI-X (ie a simple PCI)
> i82544 card - I'll bet your problems completely go away.

I wish, but we can't...

> 
> P.S.  With Rob's setup - with the same PCI-X cards as you -
> we are seeing superb performance numbers, with io-pkt.
> 

You're a tease, lol!  Not that sustained operation is bad for us at all; it's for small transactions, like pidin, 
that it gets awfully slow.  Copying a file goes as fast as the HD will allow (50M/sec).


> --
> aboyd


RE: RE: RE: RE: RE: RE: interpretation  
> it's for small transactions, like pidin, that it 
> gets awfully slow.

Sure sounds like an rx interrupt problem  :-(

> Copying a file goes as fast as the HD will 
> allow (50M/sec).

With your card - the PCI-X i82544 - we're seeing 
950 Mbits/sec (pretty much wire rate) user TCP 
throughput with io-pkt, using a fraction of the
CPU, with only a 2K user transfer size - it's flat
after that, with bigger sizes.

--
aboyd
Re: RE: RE: RE: RE: RE: interpretation  
> That's really weird.  I've got that DID in my machine (it's a PCI
> Express card) and there haven't been any problems with it.  There's a
> little bit of me saying that it's unlikely to be a driver problem.
> 
> Can you post the output from "pidin irq" for us to have a look at? 

File attached.

> the NIC isn't built in, can you move it to a different slot and see if
> that has any effect?

No it's not built in. I will try that after I'm done with the bnx driver.

That being said, I'm working on sending QSS the hardware because I've spent a huge amount of time on this over the past 
months and it's now time we get to the bottom of this.  This is a very critical issue for us.



Attachment: Text pidin.txt 2.6 KB
RE: RE: RE: RE: RE: RE: interpretation  
> File attached.

Gosh, that sure looks clean for interrupts.

Shot in the dark: try running non-SMP.  Does
that make a difference?

--
aboyd  www.PoweredByQNX.com
RE: RE: RE: RE: RE: RE: interpretation  
There's certainly nothing obvious in the pidin. I look at that 0x3 IRQ
and immediately think serial port even though there's none running and
I'm sure it's disabled in the BIOS.  On top of Andrew's non-SMP, how
about preventing IRQ3 from being used in the BIOS (reserving it) and see
if that changes things.

  As you can tell, we're grasping at straws here (!).

One final question... Is this with io-net or io-pkt?  We've got some
tuning options in the io-pkt driver that we could play with that affect
the interrupt capabilities...

	Robert.

-----Original Message-----
From: Mario Charest [mailto:community-noreply@qnx.com] 
Sent: Tuesday, October 07, 2008 9:25 AM
To: technology-networking
Subject: Re: RE: RE: RE: RE: RE: interpretation


_______________________________________________
Technology
http://community.qnx.com/sf/go/post14590
Re: RE: RE: RE: RE: RE: RE: interpretation  
Mario: shot in the dark #2: try this experimental driver:

  www.pittspecials.com/etc/devnp-i82544.so

--
aboyd
Re: interpretation  
The acrobatics this team goes through to bring you a working driver! :-)

Andrew Boyd wrote:
> Mario: shot in the dark #2: try this experimental driver:
> 
>   www.pittspecials.com/etc/devnp-i82544.so
> 
> --
> aboyd
> 
> _______________________________________________
> Technology
> http://community.qnx.com/sf/go/post14603
> 

-- 
cburgess@qnx.com
RE: interpretation  
We fly through hoops to take care of our customers!

-----Original Message-----
From: Colin Burgess [mailto:community-noreply@qnx.com] 
Sent: Tuesday, October 07, 2008 10:23 AM
To: technology-networking
Subject: Re: interpretation


_______________________________________________
Technology
http://community.qnx.com/sf/go/post14605
RE: interpretation  
Sorry, I haven't a clue how/where to officially post
experimental binaries  (red face) so it's faster for
me to just use my website.

P.S.  My kid is really enamoured of this one:

  http://www.pittspecials.com/images/eric_l39.jpg

which is fun to fly, but a tad hard on fuel.

--
aboyd
Re: RE: RE: RE: RE: RE: RE: interpretation  
> Mario: shot in the dark #2: try this experimental driver:
> 
>   www.pittspecials.com/etc/devnp-i82544.so
> 

No change. I also tried non-SMP; that didn't help either.

> --
> aboyd


RE: RE: RE: RE: RE: RE: RE: interpretation  
What about a different interrupt?

--
aboyd

Re: RE: RE: RE: RE: RE: RE: interpretation  
> There's certainly nothing obvious in the pidin. I look at that 0x3 IRQ
> and immediately think serial port even though there's none running and
> I'm sure it's disabled in the BIOS.  On top of Andrew's non-SMP, how
> about preventing IRQ3 from being used in the BIOS (reserving it) and see
> if that changes things.

I already tried it by forcing the slots to IRQ5: no change.  The onboard Broadcom 5708 Ethernet devices, which are 
disabled, use the same interrupt as the PCI-e slot.  I have no control over that.

> 
>   As you can tell, we're grasping at straws here (!).


> 
> One final question... Is this with io-net or io-pkt?  We've got some
> tuning options in the io-pkt driver that we could play with that affects
> the interrupt capabilities...

With both.  However, my gut feeling tells me that trying various options will not fix the root of the problem.

> 
> 	Robert.


RE: RE: RE: RE: RE: RE: RE: interpretation  
OK...  Let's try something a bit different.  What's TCP performance
like?  If you do a flood ping to a board and take a kernel trace, does
your interrupt still only go off every 100ms?

	What type of board is it?

	Robert.


-----Original Message-----
From: Mario Charest [mailto:community-noreply@qnx.com] 
Sent: Tuesday, October 07, 2008 11:18 AM
To: technology-networking
Subject: Re: RE: RE: RE: RE: RE: RE: interpretation

Re: RE: RE: RE: RE: RE: RE: RE: interpretation  
> OK...  Let's try something a bit different.  What's TCP performance
> like?  If you do a flood ping to a board and take a kernel trace, does
> your interrupt still only go off every 100ms?
> 

Between two 6.4 machines (non-SMP kernel, and SMP kernel + experimental driver):
120us - 2.5us - 120us - 2.5us - 120us - 2.5us .....
Ping reports no packets lost and 9000 packets/sec. 

Between two 6.3 machines with the SMP kernel:
Ping reports no packets lost and 14000 packets/sec.

> 	What type of board is it?

IBM xSeries3650 server.  

On the 6.4 machines, while running some more ping commands, I noticed that the speed goes from 9000 packets per second to 
4500 (I let ping run for about 5 seconds).  It switches speed every ~30 sec.  This is getting stranger by the minute.


> 
> 	Robert.
> 

Re: RE: RE: RE: RE: RE: RE: RE: interpretation  
> > OK...  Let's try something a bit different.  What's TCP performance
> > like?  If you do a flood ping to a board and take a kernel trace, does
> > your interrupt still only go off every 100ms?
> > 
> 
> Between two 6.4, non SMP kernel & SMP Kernel + experimental driver.
> 120us - 2.5us - 120us - 2.5us - 120us - 2.5us .....
> Ping report no packet lost and 9000 packets/sec. 
> 
> Between two 6.3 SMP kernel 
> Ping report no packet lost and 14000 packets/sec
> 
> > 	What type of board is it?
> 
> IBM xSeries3650 server.  
> 
> On the 6.4 machines i running some more ping command, i notice that the speed 
> goes from 9000 packets per second to 4500 ( I let ping run for about 5 seconds
> ).  It switches speed every ~30 sec.  This is getting stranger by the minute.

I booted two machines from the 6.4 M8 CD and re-ran the ping test. I get 14000 packets/sec.  However, sloginfo still shows 
plenty of qnet-related errors. The issue with the 9000/4500 packets/sec is related either to the experimental version of 
the driver or to other stuff in my custom image.  I guess it can be dismissed.

I set the ping packet size to 1200 and the rate goes to 6900 packets/sec.  That's around 65 Mbits per sec in and 
65 Mbits per sec out. Not a single packet dropped; I ran ping for a minute (-w60).
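
As a quick check of the arithmetic on those figures, the short Python sketch below multiplies the reported packet rate by the ping payload size. It counts the payload only (Ethernet, IP, and ICMP headers are ignored), so the on-wire rate would be slightly higher.

```python
# Figures quoted in the post above.
packets_per_sec = 6900
payload_bytes = 1200  # ping packet size given on the command line

# Payload only; protocol headers are deliberately left out.
bytes_per_sec = packets_per_sec * payload_bytes
mbytes_per_sec = bytes_per_sec / 1e6
mbits_per_sec = bytes_per_sec * 8 / 1e6

print(f"{mbytes_per_sec:.2f} Mbytes/sec, {mbits_per_sec:.2f} Mbits/sec per direction")
# prints "8.28 Mbytes/sec, 66.24 Mbits/sec per direction"
```

So the rate is roughly 66 Mbits/sec (about 8 Mbytes/sec) in each direction, well below gigabit wire rate.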


> 
> 
> > 
> > 	Robert.
> > 
> 


RE: RE: RE: RE: RE: RE: RE: RE: interpretation  
> not a single (IP) packet dropped

N.B.  Qnet isn't complaining about lost packets - 
it's just timing out.  If you increased the qnet
tx timeout sufficiently to straddle the 100ms
interrupt latency, it would probably clean up
the qnet event logs.

P.S.  I have a really wild and unlikely shot in
the dark.  On the very slight offchance that the
problem is on the tx side - that is, the driver
puts the packet info into the descriptor ring
but for some reason the nic doesn't see it until
later, when perhaps another descriptor is stuffed - 
I can compile yet another test binary i82544 driver
which sets the PROT_NOCACHE bit in the descriptor
ring mmap() call.

See, on x86 (and only x86) we don't set that bit,
and we actually run with the rx and tx descriptors
cached, which sounds wild at first, but actually
improves performance because the descriptor accesses
should be much faster.

However.  The above is predicated upon a very sophisticated
bus snooping cache "doing the right thing", and let's
say that sometimes it doesn't on your motherboard,
and the nic doesn't see the updated tx descriptor
information, until later when the cache is flushed
by the cpu.

I think it's unlikely, but pessimism about
complicated systems has frequently rewarded me
in the past.  Let me know.

--
aboyd
Re: RE: RE: RE: RE: RE: RE: RE: RE: interpretation  
> 
> > not a single (IP) packet dropped
> 
> N.B.  Qnet isn't complaining about lost packets - 
> it's just timing out.  If you increased the qnet
> tx timeout sufficiently to straddle the 100ms
> interrupt latency, it would probably clean up
> the qnet event logs.

I tried a normal ping: the time is 0ms and no packets are lost. I wasn't expecting that. If qnet times out because of 
either TX or RX, wouldn't ping also show the same behavior?

What is strange is that while it's doing a ping, a pidin to that same machine still takes a lot of time. However, if I do 
a ping flood, then pidin is OK.

> 

[cut]
> I think it's unlikely, but in the past, pessimism
> about complicated systems has frequently rewarded me
> in the past.  Let me know.

Go for it.  These IBM servers are very messy.  I have to use the -n option of pci to see all of the buses.  I'm on a 
crusade to get rid of them.  But that will not solve all the networking issues, because I've seen that problem on other 
brands of PC.

> 
> --
> aboyd


RE: RE: RE: RE: RE: RE: RE: RE: RE: interpretation  
Sounding more and more like a "stuck packet" issue where the packet gets
into the HW queues and doesn't get released until something else comes
along and kicks it. Whether this is on the Tx or Rx front is something
else....

	R.
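
The "stuck packet" idea can be sketched with a toy model in Python. This is not a description of the actual NIC or driver: it simply assumes that a packet sits unnoticed until the next "kick", which is either the next packet arriving or a periodic event every 100 ms (standing in for the interrupt interval seen in the kernel trace). The function name delivery_latencies and the traffic patterns are made up for illustration.

```python
from statistics import median

def delivery_latencies(arrivals_ms, timer_ms=100.0):
    """Delay (ms) before each packet is noticed, under the toy 'kick' model."""
    latencies = []
    for i, t in enumerate(arrivals_ms):
        # Next periodic kick strictly after this arrival.
        next_timer = (int(t // timer_ms) + 1) * timer_ms
        # The next packet arrival, if there is one, also kicks the queue.
        if i + 1 < len(arrivals_ms):
            next_packet = arrivals_ms[i + 1]
        else:
            next_packet = float("inf")
        latencies.append(min(next_timer, next_packet) - t)
    return latencies

# Sparse traffic: one packet every 250 ms, like a lone small transaction.
sparse = [10.0 + 250.0 * n for n in range(4)]
# Dense traffic: one packet every 0.1 ms, like a flood ping.
dense = [10.0 + 0.1 * n for n in range(1000)]

print("sparse worst case (ms):", max(delivery_latencies(sparse)))
print("dense median (ms):", round(median(delivery_latencies(dense)), 3))
```

Under these assumptions, sparse traffic waits tens of milliseconds for the periodic kick, while dense traffic is released almost immediately by the packet behind it. That would be consistent with pidin being slow on its own but fine while a ping flood is running.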

-----Original Message-----
From: Mario Charest [mailto:community-noreply@qnx.com] 
Sent: Wednesday, October 08, 2008 9:01 AM
To: technology-networking
Subject: Re: RE: RE: RE: RE: RE: RE: RE: RE: interpretation


_______________________________________________
Technology
http://community.qnx.com/sf/go/post14674
RE: RE: RE: RE: RE: RE: RE: RE: RE: interpretation  
ok, try this x86 driver:

www.pittspecials.com/etc/devnp-i82544.so

it mmap()'s the tx and rx descriptor rings 
in as non-cached.

--
aboyd
Re: RE: RE: RE: RE: RE: RE: RE: RE: RE: interpretation  
> 
> ok, try this x86 driver:
> 
> www.pittspecials.com/etc/devnp-i82544.so
> 
> it mmap()'s the tx and rx descriptor rings 
> in as non-cached.
> 
> --
> aboyd

No change. 

Our sales contact is setting up the details; I should be making a trip to Kanata next week with the hardware.

Hmm, where did I put my "owe you a beer" list?




RE: RE: RE: RE: RE: RE: RE: RE: RE: RE: interpretation  
>> ok, try this x86 driver:
>
> No change. 

Argh.  Ok, the next thing we need to do, is set up
a packet sniffer (eg 10mbit hub) and correlate the
interrupt delay with the packet flow.

See, if the rx interrupt always fires immediately
after the packet arrives, that points the finger
at the tx side.

But if the rx interrupt doesn't fire immediately
after the packet arrives, that points the finger
at the rx side interrupt configuration.
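
A minimal sketch of that correlation, in Python. It assumes the sniffer capture and the driver's interrupt log can each be reduced to a sorted list of timestamps on a common clock (which is the hard part in practice). The function name, the 1 ms threshold and the sample numbers are hypothetical, chosen only for illustration.

```python
import bisect


def classify_rx_latency(arrivals_ms, rx_irqs_ms, threshold_ms=1.0):
    """Pair each packet seen on the wire with the first rx interrupt
    at or after its arrival, and report the delay.

    arrivals_ms: packet arrival times from the sniffer (hypothetical data)
    rx_irqs_ms:  rx interrupt times logged on the target (hypothetical data)
    Both lists must be sorted and share the same clock.
    """
    delays = []
    for t in arrivals_ms:
        i = bisect.bisect_left(rx_irqs_ms, t)
        if i == len(rx_irqs_ms):
            delays.append(None)          # no interrupt seen after this packet
        else:
            delays.append(rx_irqs_ms[i] - t)

    # Any packet whose interrupt is missing or late points at the rx side;
    # if every interrupt is prompt, the delay must be introduced elsewhere.
    late = [d for d in delays if d is None or d > threshold_ms]
    verdict = "suspect rx side" if late else "suspect tx side"
    return delays, verdict
```

For example, `classify_rx_latency([0, 10, 20], [0, 110])` pairs the second and third packets with the interrupt at 110 and returns the "suspect rx side" verdict. Interrupt moderation on the NIC would also show up as delay here, so the threshold has to be picked with the card's settings in mind.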

We need more data to focus our efforts.

> where did I put my "owe you a beer" list.

Heh - how about a tankful of jet-A for the L-39?  :-)

P.S.  I just paid 99.6 cents per liter for gas
here in Ottawa.

--
aboyd
Re: RE: RE: RE: RE: RE: RE: RE: RE: RE: RE: interpretation  
> 
> >> ok, try this x86 driver:
> >
> > No change. 
> 
> Argh.  Ok, the next thing we need to do, is set up
> a packet sniffer (eg 10mbit hub) and correlate the
> interrupt delay with the packet flow.
> 

I'll be bringing a 1Gig switch that can be set up to sniff activity on any given port. That way tests are done at 
real-life speed.

> 
> We need more data to focus our efforts.
> 
> > where did I put my "owe you a beer" list.
> 
> Heh - how about a tankful of jet-A for the L-39?  :-)

I'll have to get a loan first.

> 
> P.S.  I just paid 99.6 cents per liter for gas
> here in Ottawa.


> 
> --
> aboyd


Interpretation  
For those who are still interested...

What we've found is related to the receive path.  We get an interrupt from the card saying that a packet has been 
received and, in our interrupt handler, we go through and check to see if any receive descriptors are available for 
processing and then process them.  It appears that we quite frequently end up with the scenario where the Rx interrupt 
goes off BUT none of the descriptors are marked as ready to be processed, so, of course, we don't do anything.  Shortly 
after this, a descriptor gets updated as ready for processing but, of course, it's already too late since we've checked 
and not found any that are ready.  This means that we often end up with the situation in which packets are left in the 
receive queue, potentially for long periods of time (depending upon the network traffic patterns).
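
To make the timing concrete, here is a toy model of that race in Python. It is not the driver code, only a sketch under stated assumptions: the interrupt for a packet fires at its arrival time, its descriptor is only marked ready `writeback_delay_ms` later, and the handler runs once per interrupt and processes only descriptors already marked ready. The function name and all numbers are hypothetical.

```python
def simulate_rx(arrivals_ms, writeback_delay_ms):
    """Return, for each packet, the time it is handed up the stack
    (None if it is still sitting in the receive queue at the end).

    arrivals_ms: sorted packet arrival times; one rx interrupt per packet
    writeback_delay_ms: assumed lag between the interrupt and the
                        descriptor being marked ready
    """
    pending = []                          # (ready_time, packet_index)
    delivered = [None] * len(arrivals_ms)

    for idx, now in enumerate(arrivals_ms):
        # New packet: interrupt fires now, descriptor becomes ready later.
        pending.append((now + writeback_delay_ms, idx))

        # Interrupt handler: process only descriptors already marked ready.
        still_pending = []
        for ready, pkt in pending:
            if ready <= now:
                delivered[pkt] = now
            else:
                still_pending.append((ready, pkt))
        pending = still_pending

    return delivered
```

With packets 100 ms apart and a 1 ms lag, `simulate_rx([0, 100, 200], 1)` returns `[100, 200, None]`: each packet waits for the next interrupt, and the last one stays queued. With packets 2 ms apart, `simulate_rx([0, 2, 4, 6], 1)` returns `[2, 4, 6, None]`, so the wait shrinks to the packet spacing. In this model, at least, that is consistent with the earlier observation that pidin behaves while a ping flood is running.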

	We're still trying to figure out what to do about this.  We don't know if it's something fundamental to the way that 
PCI Express works or if it's a card configuration issue.  We're investigating :>


	Robert.
Re: Interpretation  
> For those who are still interested...
> 

For those who are still interested...

I was given a new driver to test and so far it's looking good; I can't reproduce the problem.  I will give it more mileage 
and report the results.


> What we've found is related to the receive path.  We get an interrupt from the
>  card saying that a packet has been received and, in our interrupt handler, we
>  go through and check to see if any receive descriptors are available for 
> processing and then process them.  It appears that we quite frequently end up 
> with the scenario where the Rx interrupt goes off BUT none of the descriptors 
> are marked as ready to be processed, so, of course, we don't do anything.  
> Shortly after this, a descriptor gets updated as ready for processing but, of 
> course, it's already too late since we've checked and not found any that are 
> ready.  This means that we often end up with the situation in which packets 
> are left in the receive queue, potentially for long periods of time (depending
>  upon the network traffic patterns).
> 
> 	We're still trying to figure out what to do about this.  We don't know if 
> it's something fundamental to the way that PCI Express works or if it's a card
>  configuration issue.  We're investigating :>
> 
> 
> 	Robert.