Forum Topic - High-Performance Network I/O Requirements (11 items)
High-Performance Network I/O Requirements  
BACKGROUND:
A prospect in the audio broadcasting business has some stringent network I/O requirements.
They plan on using an Intel Atom with the US15W chipset. Today, they use RTP and their own implementation of PTP (not the standard IEEE 1588).

QUESTION:
They ask about the following scenario:
"
High-performance network I/O: We need to transfer around 100,000 small packets per second on the 100BT and/or Gbit network interface.
In RTAI, we turn off the interrupt per packet and service the network I/O with one interrupt every 250 usec. Our hardware (a programmable clock) generates this 4 kHz interrupt. Our signal processing happens within the same 250 usec interval and generates output packets scheduled for transmit by the end of the 250 usec interval. Network I/O packets that are not real-time audio are passed through to 'normal' Linux for control, web pages, etc.

    How would we accomplish the same mechanism using QNX and the QNX network I/O driver?
	... How do we disable the interrupt per packet?
	... How do we process the network packet data that has come in since the last 250 usec audio frame interrupt, with the real-time deadline of generating output packets by the end of the 250 usec interval?
	... How would we pass the non-audio packets through to non-real-time (i.e., simply a lower-priority thread?) processing?
	... How do we make sure that, for network output, all real-time audio packets get sent before any waiting low-priority packets?
"

Thank you,
Karim.
Re: High-Performance Network I/O Requirements  
Karim,

Lots of options...

Maybe I'm misunderstanding the questions, but it seems to me these are very open-ended questions, which would be addressed by the QNX Writing Drivers course, by looking at what existing network drivers do and how they work, and by the hardware spec for the particular network hardware they use.

In Linux, do they have their own (special) network driver? Do they have a user-space application or a kernel module to implement the RTP protocol and signal processing? Do they want to leverage their existing software? Do they have specialized hardware to do the signal processing?

E.g., see below...

> 	... How do we disable the interrupt per packet?

This would depend on how you program the hardware, right? Or you could simply ignore it? In general, the network driver can do that.
> 	... How do we process the network packet data that has come in since the last 250 usec audio frame interrupt, with the real-time deadline of generating output packets by the end of the 250 usec interval?

Not really sure what's being asked here. It sounds like a very wide-scope design question. There would be several ways to do this, I guess...
> 	... How would we pass the non-audio packets through to non-real-time (i.e., simply a lower-priority thread?) processing?

I guess you'd pass the (regular) non-audio packets to io-pkt and pass the (real-time) audio packets to the audio signal processing module, which could be located in the driver, in a different protocol module (lsm), or in a separate process (outside the stack). In the latter case, the packets could be passed via a specialized interface to the app, or via the io-pkt stack through a socket to the app. The choice would depend on a lot of things, and I might be making some valid or invalid assumptions suggesting these things...
> 	... How do we make sure that, for network output, all real-time audio packets get sent before any waiting low-priority packets?

There are several options, and the choice depends on the others. Examples:
- Your network driver could have two queues: one for audio packets, and the usual queue for packets going to regular io-pkt TCP/IP stack processing. Process the audio queue first, then the other; stop if you run out of time. (A rough sketch follows this list.)
- You could simply use the existing if queue and let the driver walk through it once to find and send the audio packets, then do the rest; again, stop if you run out of time doing the non-audio packets.
- You can use ALTQ to implement QoS queues, but with your deadlines you may want more control than that...
- You can maybe use PFIL hooks to filter out the audio packets and put them in the special queue...
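A rough sketch of the two-queue idea, in the shape of a NetBSD-style driver start routine (io-pkt drivers use the NetBSD driver model). Everything named dev_* and audio_q here is hypothetical, and a real driver would feed its descriptor ring instead of the dev_tx_one() placeholder:

    #include <sys/mbuf.h>
    #include <net/if.h>

    struct dev_softc {
        struct ifnet  *sc_ifp;     /* our interface                   */
        struct ifqueue audio_q;    /* driver-private audio tx queue   */
        /* ... registers, descriptor rings, etc. ... */
    };

    /* Hypothetical helper: push one mbuf chain onto the tx ring.
     * Returns 0 on success, -1 if the ring is full. */
    static int dev_tx_one(struct dev_softc *sc, struct mbuf *m);

    static void
    dev_start(struct ifnet *ifp)
    {
        struct dev_softc *sc = ifp->if_softc;
        struct mbuf *m;

        /* Drain the audio queue first: these carry the 250 usec deadline. */
        for (;;) {
            IF_DEQUEUE(&sc->audio_q, m);
            if (m == NULL)
                break;
            if (dev_tx_one(sc, m) != 0) {
                IF_PREPEND(&sc->audio_q, m);   /* ring full: retry later */
                return;
            }
        }

        /* Only then service the regular stack queue. */
        for (;;) {
            IFQ_DEQUEUE(&ifp->if_snd, m);
            if (m == NULL)
                break;
            if (dev_tx_one(sc, m) != 0) {
                m_freem(m);   /* or re-queue, depending on policy */
                return;
            }
        }
    }

The audio packets would be put on audio_q by whatever classifies them - the signal processing module, or a PFIL hook as in the last bullet.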

Again, lots of options...

/P
RE: High-Performance Network I/O Requirements  
> How do we disable the interrupt per packet?

Actually, it already is.  When the rx interrupt
fires after a packet arrives, the interrupt is
masked and left that way.  An io-pkt thread is
scheduled to run - generally when you exit the
kernel after the interrupt, pre-empting whatever
else was running at the time - and then the driver's
process_interrupt() callback is called by io-pkt.

This function generally loops, processing rx'd
packets, until there are no more.  Remember
that packets generally arrive in bursts, and most
new NICs have interrupt coalescing in hardware
to limit the number of interrupts per second you
take - this is always tunable.  This really helps
in amortizing the cost of the interrupt over more
than one packet.

Only when the driver has completed processing packets
is the rx interrupt enabled.  There are some tricks
you can use - e.g., the return code from process_interrupt(),
as well as grungier hacks in the driver - to try to
straddle brief intervals that you think occur in a
burst of rx'd packets in your application, to avoid
having to unmask the interrupt, take another
interrupt, go through the kernel again, and then
schedule the same thread to do the same thing again.
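To make that flow concrete, here is a minimal sketch of the rx side of a devnp driver's process_interrupt() callback. The shape follows the sample driver in the networking docs, but the dev_* names are hypothetical and the details vary per NIC:

    #include <sys/mbuf.h>
    #include <net/if.h>

    struct nw_work_thread;                                   /* from the io-pkt headers      */
    struct dev_softc;                                        /* driver state (hypothetical)  */
    static struct mbuf  *dev_read_frame(struct dev_softc *); /* NULL when the ring is empty  */
    static struct ifnet *dev_ifp(struct dev_softc *);        /* the driver's interface       */

    /* Called by io-pkt - with the interrupt still masked - after the
     * ISR schedules the stack thread. */
    int
    dev_process_interrupt(void *arg, struct nw_work_thread *wtp)
    {
        struct dev_softc *sc = arg;
        struct ifnet *ifp = dev_ifp(sc);
        struct mbuf *m;

        /* Drain the ring: the cost of one interrupt is amortized over
         * the whole burst. */
        while ((m = dev_read_frame(sc)) != NULL)
            (*ifp->if_input)(ifp, m);   /* hand each packet to the stack */

        /* 1 = "done": io-pkt then calls the driver's enable() callback,
         * which unmasks the rx interrupt.  0 would mean "call me again
         * soon, interrupt still masked". */
        return 1;
    }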

Looking at kernel event logs is really educational
about this sort of thing, or as someone said years
ago, all knowledge comes from the lab.

--
aboyd  www.pittspecials.com/movies/takeoff.wmv

Re: RE: High-Performance Network I/O Requirements  
Andrew,

I was just about to post a related question, but this seems to be as good a place as any.

We have also found that in our system, we can achieve better performance simply by dedicating a few CPU cores to polling for packets, i.e., running without interrupts at all.

As you mentioned, under heavy load the driver essentially becomes polled anyway; however, there are a couple of areas where it's still not as good as a polled architecture. In our world, real-time response is critical, so interrupt throttling is good for internal efficiency but not good for latency. Also, the CPUs we're using are fast enough that it takes a lot of network traffic to load them to the point that they will stay in the receive thread. In most cases, there is just enough traffic to cause it to go in and out of the thread for nearly every packet. The problem is that this incurs significant overhead in the kernel, which always happens on CPU 0, I believe, right? So even though we have plenty of CPU resources available, the system acts relatively sluggish.

Another problem has to do with shared interrupts. We have many interfaces and not many interrupt levels available, to the point that an interrupt on a particular level may invoke four or five interrupt routines - and not just in the network code; the USB ports and so on are also getting hit with spurious interrupts.

Anyway, long story short, we have built an e1000 driver on io-net that can run either in the traditional interrupt mode or with a configurable number of dedicated polling threads. The polled mode, even with just two CPUs, is already about 50% better than interrupt mode, plus the machine is completely responsive since we always have one or two CPUs doing almost nothing.

Which, finally, leads me to the real question/dilemma: we can't figure out how to do this on io-pkt. My knowledge of this is limited, but I get the impression that drivers aren't allowed to start threads, so how can we do something similar?

Thanks,
lew
RE: RE: High-Performance Network I/O Requirements  
Add my vote to have an option to go polling only.  We have more cores then we can use.  Shared interrupt is adding a 
very significant overhead, we have no control over that for the moment.

Re: RE: High-Performance Network I/O Requirements  
> Which, finally, leads me to the real question/dilemma: we can't figure out how
> to do this on io-pkt.  My knowledge of this is limited, but I get the
> impression that drivers aren't allowed to start threads, so how can we do
> something similar?

In order to avoid context switches, build your own lsm module which handles all of your application-specific tasks.
Such an lsm module is directly linked with the ethernet driver.
A good example of an lsm module is the lsm-raw module included in the network sources. (A bare-bones skeleton is sketched below.)
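From memory, the skeleton of such a module looks roughly like this. Treat the entry-symbol macros and header name as assumptions and check the lsm-raw source shipped with the networking project for the exact names:

    #include <sys/io-pkt.h>   /* header name per the driver sources (assumption) */

    /* Module entry point: called once when io-pkt loads the lsm,
     * e.g. with: mount -T io-pkt /lib/dll/audio-lsm.so */
    static int
    audio_lsm_entry(void *dll_hdl, struct _iopkt_self *iopkt, char *options)
    {
        /* Parse options, register pfil hooks, and set up the queues and
         * the interface to the signal-processing code here. */
        return 0;   /* 0 = success, or an errno value on failure */
    }

    /* Entry symbol that io-pkt looks up when the module is mounted. */
    struct _iopkt_lsm_entry IOPKT_LSM_ENTRY_SYM(audio) =
        IOPKT_LSM_ENTRY_SYM_INIT(audio_lsm_entry);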

With 100,000 packets per second you have to send one packet every 10 us. IMHO that's not possible without specialized ethernet hardware.  The best value we have measured was in the range of 15 us.

Armin Steinhoff

http://www.steinhoff-automation.com



Re: RE: High-Performance Network I/O Requirements  
> In order to avoid context switches, build your own lsm module which handles
> all of your application-specific tasks.  Such an lsm module is directly linked
> with the ethernet driver.
> A good example of an lsm module is the lsm-raw module included in the network
> sources.

Yes, we already have an lsm that takes the packets that we're interested in from a pfil hook, but in some cases, we have to give the packet to the TCP/IP stack.

I'm still not sure how this works around the claim that a driver cannot start any threads of its own, and the stack threads that call into the driver certainly are not designed for polled mode.

> With 100,000 packets per second you have to send one packet every 10 us. IMHO
> that's not possible without specialized ethernet hardware.  The best value we
> have measured was in the range of 15 us.

Obviously, that depends a lot on the available hardware resources and how much work you want to do on each packet.  Our per-packet code path is roughly 0.6 us, so we're reaching 1.5 Mpps without much trouble.  And that's including the io-net code in the path.

lew
Re: High-Performance Network I/O Requirements  
> Yes, we already have an lsm that takes the packets that we're interested in
> from a pfil hook, but in some cases, we have to give the packet to the TCP/IP
> stack.

http://community.qnx.com/sf/wiki/do/viewPage/projects.networking/wiki/Filtering_wiki_page
"The filter returns a non-zero value if the packet processing is to stop, or 0 if the processing is to continue."
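For reference, a minimal sketch of such a hook, using the NetBSD-style pfil signature that io-pkt inherits. is_audio_packet() and audio_enqueue() are hypothetical placeholders:

    #include <sys/param.h>
    #include <sys/socket.h>
    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/pfil.h>

    static int  is_audio_packet(struct mbuf *m);                   /* hypothetical classifier */
    static void audio_enqueue(struct mbuf *m, struct ifnet *ifp);  /* hand-off to audio path  */

    /* Called for every packet passing the hook point. */
    static int
    audio_pfil_hook(void *arg, struct mbuf **mp, struct ifnet *ifp, int dir)
    {
        if (dir == PFIL_IN && is_audio_packet(*mp)) {
            /* Claim the packet: take ownership of the mbuf and return
             * non-zero so the stack stops processing it. */
            audio_enqueue(*mp, ifp);
            *mp = NULL;
            return 1;
        }
        return 0;   /* not ours: let the TCP/IP stack continue */
    }

    /* Registration, e.g. from the lsm's entry point. */
    static int
    audio_hook_attach(void)
    {
        struct pfil_head *ph = pfil_head_get(PFIL_TYPE_AF, AF_INET);

        if (ph == NULL)
            return -1;
        return pfil_add_hook(audio_pfil_hook, NULL,
            PFIL_IN | PFIL_WAITOK, ph);
    }
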
Re: High-Performance Network I/O Requirements  
> > Yes, we already have an lsm that takes the packets that we're interested in
> > from a pfil hook, but in some cases, we have to give the packet to the TCP/IP
> > stack.
>
> http://community.qnx.com/sf/wiki/do/viewPage/projects.networking/wiki/Filtering_wiki_page
> "The filter returns a non-zero value if the packet processing is to stop,
> or 0 if the processing is to continue."


Thanks, we already have our filters converted from io-net to io-pkt, so that's not the issue right now.  The question is how a driver can operate 100% in polled mode.
RE: RE: High-Performance Network I/O Requirements  
> it takes a lot of network traffic to load them 
> to the point that they will stay in the receive thread.  

I need to write a little wiki article on this ...

You need to change the return code of process_interrupt().
 
This is a driver function which is called by io-pkt
after an interrupt fires.  Normally all the servicing
is completed and the function returns, but by altering
the return code you can tell io-pkt that you have not
finished processing - even if you really have - and
io-pkt will call you again soon, with the interrupt
still masked!
 
You can create a little state machine with a static
variable so that your driver thread stays active longer 
after an interrupt, to straddle the interval between
packets in a burst.
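Roughly, and assuming the sample-driver convention where returning 1 from process_interrupt() means "done, unmask" and 0 means "call me again, still masked" (dev_rx_process_all() is a hypothetical helper that drains the rx ring and returns how many packets it handled):

    #define LINGER_CALLS 4   /* how long to straddle; tune from kernel event logs */

    struct nw_work_thread;
    struct dev_softc;
    static int dev_rx_process_all(struct dev_softc *);   /* drain ring, return count */

    int
    dev_process_interrupt(void *arg, struct nw_work_thread *wtp)
    {
        static int linger;   /* the little state machine (put it in the
                              * softc if you drive more than one interface) */
        struct dev_softc *sc = arg;

        if (dev_rx_process_all(sc) > 0) {
            linger = LINGER_CALLS;
            return 0;        /* claim we're not done: called again, masked */
        }
        if (linger > 0) {
            linger--;
            return 0;        /* ring empty, but straddle the burst a while */
        }
        return 1;            /* really done: unmask the interrupt */
    }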

The above is the elegant way of doing it.  You may not
prefer it - are you a manual transmission kind of guy?

If so, ensure that io-pkt has enough threads - one is
created per CPU; you may wish to use the command-line
options to create at *least* as many threads as interfaces -
and in this case, you can "hog" the io-pkt thread by
not returning immediately from process_interrupt().
  
Instead, what you might do, after all your processing is
complete, is loop, perhaps calling SchedYield(), and then
read your hardware registers, looking for rx'd packets.

Seanb doesn't like this very much - he would prefer you return
from process_interrupt() asking to be called again - but
as long as you carefully configure your system with enough
threads for your number of cores and interfaces, spinning
in the process_interrupt() callback will work very well indeed.

If you've got cores to burn, skip the SchedYield() - that's
more for single-CPU configurations, to give other threads at
the same priority - which are ready to run - a chance at the CPU.
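The spinning variant might look like this, reusing dev_rx_process_all() from the sketch above. Note the callback simply never returns, so the interrupt stays masked and one io-pkt thread is permanently dedicated to this interface:

    #include <sys/neutrino.h>   /* SchedYield() */

    int
    dev_process_interrupt(void *arg, struct nw_work_thread *wtp)
    {
        struct dev_softc *sc = arg;

        for (;;) {
            if (dev_rx_process_all(sc) == 0)   /* nothing rx'd this pass */
                SchedYield();                  /* single-CPU courtesy only */
        }
        /* not reached */
        return 1;
    }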

Hope this helps,

--
aboyd  
Re: RE: RE: High-Performance Network I/O Requirements  
Hi, Andrew.

Thanks for the pointers.  I say what follows without any experience writing an io-pkt driver -- we've done quite a few for io-net, but none for io-pkt yet.

First of all, there is heavy interrupt sharing going on, with multiple types of devices, so it's not reasonable to leave the interrupt masked in the PIC.  For example, the USB ports share the same interrupts as the MACs.  So I presume, from what you're saying, that if we return from process_interrupt() with the interrupt still masked, the USB ports will never get serviced.

Another thing is that the network threads run at relatively high priority (22), so calling SchedYield() is of no benefit in terms of allowing other things to run.

On io-net, we wrote a completely independent module that can be called by any driver in lieu of InterruptAttachEvent().  The driver calls it with its event service function and interrupt level.

If there is only a single CPU, or if traditional interrupt mode is desired, the polling module creates a thread for that driver, uses InterruptAttachEvent() and InterruptWait(), and when an interrupt occurs, it calls back into the driver.  So in that mode, the driver works pretty much as it did before.

But if multiple CPUs are available, it creates a number of threads - usually the number of CPUs minus one, or half the CPUs, or whatever.  Each thread locks itself to a CPU (starting at the highest numbers) and hangs in a loop forever, calling the registered driver functions.  (All threads share the same pool of callbacks.)

In the latter case, i.e., when it's running in polled mode, the driver does not enable interrupts on its hardware MAC; it runs strictly in polling mode.  There are no kernel calls in the polling loop, so it's quite fast, completing a polling loop in roughly 50 nanoseconds.
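A condensed sketch of that kind of module under Neutrino. poll_register() and the fixed-size table are hypothetical simplifications, and registration is assumed to happen before the polling threads start, so the loop needs no locking:

    #include <pthread.h>
    #include <stdint.h>
    #include <sys/neutrino.h>

    #define MAX_CALLBACKS 16

    /* Registry shared by all polling threads; drivers register here
     * instead of calling InterruptAttachEvent(). */
    static struct {
        void (*func)(void *);   /* driver's event service function */
        void  *arg;
    } callbacks[MAX_CALLBACKS];
    static int ncallbacks;

    int
    poll_register(void (*func)(void *), void *arg)
    {
        if (ncallbacks >= MAX_CALLBACKS)
            return -1;
        callbacks[ncallbacks].func = func;
        callbacks[ncallbacks].arg  = arg;
        ncallbacks++;
        return 0;
    }

    /* One of these runs per dedicated core; 'param' is the CPU number. */
    static void *
    poll_thread(void *param)
    {
        unsigned runmask = 1u << (unsigned)(uintptr_t)param;
        int i;

        /* Pin this thread to its CPU. */
        ThreadCtl(_NTO_TCTL_RUNMASK, (void *)(uintptr_t)runmask);

        /* Poll forever: no kernel calls inside the loop. */
        for (;;) {
            for (i = 0; i < ncallbacks; i++)
                callbacks[i].func(callbacks[i].arg);
        }
        return NULL;
    }

Each thread would be started with something like pthread_create(NULL, NULL, poll_thread, (void *)(uintptr_t)cpu), one per dedicated core, starting at the highest-numbered CPU.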

It's all pretty straightforward, and this is the functionality we hoped to replicate on io-pkt.

Just out of curiosity, is there anything really preventing a driver from starting a thread of its own and not using the process_interrupt() callback?

Thanks,
lew