Forum Topic: very slow libsocket/io-pkt interface

very slow libsocket/io-pkt interface  
Hi,
I'm running io-pkt on an mpc8360, using the devn-ucc_ec driver (via devnp-shim) to
send data through a 1Gbit interface.

I have two tests to measure the throughput of the interface:
1. Use a raw socket to send one prepared IP packet (1496 bytes) in a loop
2. Use the bpf interface to send one prepared ethernet frame (1506 bytes) in a loop

Both tests yield a throughput of about 14Mb/s; sending larger IP packets (32k) yields
about 40Mb/s.
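
For reference, test 1 boils down to roughly the following (a trimmed-down sketch; the
destination address is a placeholder, and the real test fills the buffer with a valid,
pre-built IP header and payload):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    unsigned char pkt[1496];                    /* prepared IP packet: header + payload */
    struct sockaddr_in dst;
    int s = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);  /* we supply the IP header ourselves */

    if (s == -1) {
        perror("socket");
        return EXIT_FAILURE;
    }

    memset(pkt, 0, sizeof(pkt));                /* real test: build a valid IP header here */
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = inet_addr("10.0.0.2");     /* placeholder peer address */

    for (;;) {                                  /* blast the same packet as fast as possible */
        if (sendto(s, pkt, sizeof(pkt), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) == -1) {
            perror("sendto");
            break;
        }
    }

    close(s);
    return EXIT_SUCCESS;
}

The bpf test is the same loop, just writing the full ethernet frame to the bpf device
after binding it to the interface with BIOCSETIF.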

On the other hand, writing an lsm module for io-pkt which sends the prepared ethernet frame
from inside io-pkt yields maximum throughput (120Mb/s).

Why is there such a huge difference between the two approaches? The lsm inside io-pkt also uses the shim layer,
so I can only blame slow message passing between libsocket and the io-pkt manager.
RE: very slow libsocket/io-pkt interface  
Not sure about devn-ucc_ec, but on x86,
Robert was able to get wire rate on gige
with devnp-i82544.so and ttcp using only 
14% of the CPU at larger sizes.

The message passing of QNX only affected
the slope of the asymptotic curve.

--
aboyd
Re: RE: very slow libsocket/io-pkt interface  
> 
> Not sure about devn-ucc_ec, but on x86,
> Robert was able to get wire rate on gige
> with devnp-i82544.so and ttcp using only 
> 14% of the CPU at larger sizes.
> 
> The message passing of QNX only affected
> the slope of the asymptotic curve.
> 
> --
> aboyd

Hi,
A Core 2 Duo at 2GHz is not a player here :-). I have a single-core, 533MHz PPC.
It seems that either context switching or message passing is to blame.
Otherwise I can't explain the difference. I have no other running processes,
except for the sending one.

Yura.
RE: RE: very slow libsocket/io-pkt interface  
> 533Mhz PPC

Not a problem.  The first gige driver I ever got wire
rate on (well, 960 mbits/sec) was mpc85xx on PPC.
 
> either context switching or message passing is 
> to be blamed

No.  Despite them being convenient whipping boys,
they only come into play for small packets.  For
large transfers, the overhead is amortized.
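
Back-of-envelope: if every send costs a fixed overhead c
(message pass, context switch) plus n/B to move n bytes at
link bandwidth B, the achieved rate is roughly n / (c + n/B),
which approaches B as n grows.  The fixed cost only changes
how quickly you get there, not the ceiling.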

I am not familiar with the ucc_ec driver - are
there any hardware limitations eg bus bandwidth?

As an example, generic PCI is NOT capable of wire
rate gige - you need PCI express for that.
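
(Plain 32-bit/33MHz PCI tops out around 133 Mbytes/sec
theoretical, shared by everything on the bus, while
full-duplex gige needs about 125 Mbytes/sec each way.)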

Also, there may be driver limitations - a common
hack is to always defrag before tx, which is not good.

Just because a driver will ping does not mean that
it is capable of doing wire rate gige with low cpu
utilization.  It took a lot of work by Robert and
me and Sean (enabling TSO, hardware checksumming,
interrupt throttling, etc.) to get wire rate on x86
with ttcp and 14% cpu with devnp-i82544.so.

Another whipping boy: don't always blame the shim.
We did a test with 100mbit pcnet, and we got exactly
the same throughput with a shim pcnet driver, and
a native pcnet driver.

Hugh recently ported the devn-e1000 driver to io-pkt,
and he was disappointed to learn that the devnp-e1000
driver only reduced the CPU utilization by ONE PERCENT
during his benchmark testing.

Despite what many people think, creating a reliable,
fast, efficient, maintainable driver is not always a 
trivial task.

--
aboyd   www.pittspecials.com/movies/tumble.wmv
Re: RE: RE: very slow libsocket/io-pkt interface  
> No.  Despite them being convenient whipping boys,
> they only come into play for small packets.  For
> large transfers, the overhead is amortized.

Well, there has to be a problem somewhere...
The difference is quite measurable.

> 
> I am not familiar with the ucc_ec driver - are
> there any hardware limitations eg bus bandwidth?

The memory bus is 32-bit x 133MHz DDR, which gives
us 4*133*2 = 1064 Mbytes/second, about 8x the ethernet rate.


> Also, there may be driver limitations - a common
> hack is to always defrag before tx, which is not good.

This is neither a driver nor a shim limitation,
since both the io-pkt lsm path and the libsocket path use them.

> 
> Just because a driver will ping does not mean that
> it is capable of doing wire rate gige with low cpu
> utilization.

I'm capturing all the packets correctly on the client side
with wireshark. Also, I don't care about low CPU utilization;
pushing the data through ethernet is my main concern.
Given that the mpc8360 has two gigabit interfaces, I assume
it's powerful enough to fill both of them.

Thanks for the fast reply,

Yura.
RE: RE: RE: very slow libsocket/io-pkt interface  
I'll just comment quickly on the CPU being capable of filling two
gigabit-ethernet ports and say "don't count on it".  I seem to remember
having real problems with the 8313 (which is a similar chip) Ethernet
FIFOs being capable of filling the GigE pipe (I'd get FIFO underruns /
overruns, especially when using jumbo packets).  This may have only been
a problem with the particular chip that I had at the time, but who
knows?  In terms of raw CPU, the actual limiting factor is
packets-per-second, so it's unlikely that you'll be able to fill a
single pipe with 64-byte packets, but you may be able to fill both pipes
if you use jumbo packets.
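
(Back-of-envelope: counting the 20 bytes of preamble and inter-frame gap
per frame, minimum-size 64-byte frames at gigabit rate work out to roughly
1.49 million packets per second, whereas ~9000-byte jumbo frames are only
about 14 thousand packets per second.)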

	R.


RE: RE: RE: very slow libsocket/io-pkt interface  
> real problems with the 8313 

IIRC it used the low-rent tsec which had MUCH
less DMA bandwidth available than the etsec 
implementations, which worked MUCH better at
gige.

Gotta have both bus bandwidth, and low latency 
(re fifo underruns/overruns).

--
aboyd
RE: very slow libsocket/io-pkt interface  
Hmmm... Interesting one...

So BPF / Raw socket produces ~ 1/3 of the rate of the direct interface
through an LSM.

By hooking into the LSM layer, you're likely also bypassing quite a bit
of stack  / message processing in addition to removing the message pass
/ data copy associated with an external application.

Just to check, are you sure that you're CPU limited in both cases (very
likely so, but it doesn't hurt to ask)?

If you're looking for an external interface for generating raw packets,
you might also want to have a look at lsm-nraw.so.  This will bypass the
stack processing but leave the message passing interface in place.

Just to check... Are your measurements Mega Bits per second or Mega
Bytes per second?  I'm assuming megabytes since 14 megabits seems
extremely slow.

If you're really interested in solid numbers you can always get a kernel
trace to see what's happening.  You sometimes see "unexpected"
interactions when you look at the kernel trace.


	R.

Re: RE: very slow libsocket/io-pkt interface  
> Hmmm... Interesting one...
> 
> So BPF / Raw socket produces ~ 1/3 of the rate of the direct interface
> through an LSM.

> Just to check, are you sure that you're CPU limited in both cases (very
> likely so, but it doesn't hurt to ask)?

Yes.

> 
> If you're looking for an external interface for generating raw packets,
> you might also want to have a look at lsm-nraw.so.  This will bypass the
> stack processing but leave the message passing interface in place.
> 

Using nraw gives me about 18 Mbytes/sec at priority 10
and about 28 Mbytes/sec at priority 100.

> 
> If you're really interested in solid numbers you can always get a kernel
> trace to see what's happening.  You sometimes see "unexpected"
> interactions when you look at the kernel trace.

I've looked at the kernel trace and don't see any unusual interactions there.
I've attached the trace in case you want to have a look at it...

Yura.
Attachment: trace.kev.bz2 (2.41 MB)
Re: RE: very slow libsocket/io-pkt interface  
Hi,
I've looked at the kernel profiling info for both the io-pkt module and lsm-nraw usage.
In the first scenario I'm able to send 12Mb/sec using 20% of the CPU; in the second one, almost the
same amount of data (14Mb/sec) takes 100% of the CPU. When using the nraw interface, for
each 2msec spent inside io-pkt there's 1msec inside procnto-600. 1msec is about 500,000 cycles.
Wasting that much time inside procnto could be explained by running from uncacheable
memory or flash, but the entire ifs image is copied to DDR before running, and I suppose that one of the things procnto
does is to enable the dcache. Copying an entire ethernet packet (1500 bytes) takes
between 200 and 1200 cycles, which makes me wonder where the remaining ~490k cycles are spent.

Yura.
RE: RE: very slow libsocket/io-pkt interface  
> for each 2msec inside io-pkt there's 1msec inside 
> procnto-600. 

Shot in the dark:  the CACHE_FLUSH() during tx.

As a test, try commenting it out (yes, the packets
transmitted will be corrupted) and see if that changes
things.

Also, another thing that really kills benchmarks is
io-pkt allocating pages, which requires sending messages
to proc.  Make sure that during your test, when you do a 
"pidin mem", that io-pkt is NOT growing it's data.  Before 
the benchmark begins, you want to make sure it's steady-state - 
it's allocated all the clusters it's going to need.

--
aboyd
RE: RE: very slow libsocket/io-pkt interface  
What thread is being executed in procnto?  Is it the idle thread?

   R. 

Re: RE: RE: very slow libsocket/io-pkt interface  
> What thread is being executed in procnto?  Is it the idle thread?
> 
>    R. 
Oh, no... it's not

Yura.
RE: RE: RE: very slow libsocket/io-pkt interface  
What does the instrumented kernel show
for kernel calls, interrupts, etc?  It
can be really helpful for this kind of
system diagnosis.

--
aboyd

RE: RE: RE: very slow libsocket/io-pkt interface  
I'm surprised that procnto is involved at all.  Hmmm...  From the trace
can you see what causes the kernel thread to run?  

Re: RE: very slow libsocket/io-pkt interface  
Just to complete the picture: I've found that memory throughput to the PPC core (but not to the DMA) is about 100Mb/s
(one way!), so my theory about having 32-bit x 133MHz DDR2 = 1064Mb/s of memory throughput doesn't reflect the real state
of affairs :-). Copy the data twice with the CPU and you get 50Mb/s of throughput, which is almost what I achieve using large UDP
packets. I've tried playing with various bus arbiter registers but saw no improvement. I guess I'll have an FPGA
prepare the network packets for this, umm, network chip, and then use an io-pkt module to send them to the
network without copying.