Karim Mouline(deleted)
|
High-Performance Network I/O Requirements
|
Karim Mouline(deleted)
03/30/2010 3:10 PM
post50854
|
High-Performance Network I/O Requirements
BACKGROUND:
A prospect in the audio-broadcasting business has some stringent network I/O requirements.
They plan to use an Intel Atom with the US15W chipset. Today they use RTP and their own implementation of PTP (not the
standard IEEE 1588).
QUESTION:
They ask about the following scenario:
"
High performance network I/O: We need to transfer around 100,000 small packets per second on the 100Bt and/or Gbit
network interface.
In RTAI, we turn off the interrupt per packet, and service the network I/O with one interrupt every 250usec.
Our hardware (a programmable clock) generates this 4 kHz interrupt. Our signal processing happens within the same 250usec
interval and generates output packets scheduled for transmit by the end of the 250usec interval. Network I/O packets
that are not real-time audio are passed through to 'normal' Linux for control, web pages, etc.
How would we accomplish the same mechanism using QNX and the QNX network i/o driver?
... How to disable the interrupt per packet?
... How to process the network packet data that has come in since the last 250usec audio frame interrupt, with the
real-time deadline of generating output packets by the end of the 250usec interval.
... How would we pass the non-audio packets through to non-real-time processing (i.e. simply a lower-priority thread?).
... How do we make sure that for network output, all real-time audio packets get sent before any waiting low-priority
packets?
"
Thank you,
Karim.
|
|
|
Patrik Lahti
|
Re: High-Performance Network I/O Requirements
|
Patrik Lahti
03/31/2010 5:02 PM
post51000
|
Re: High-Performance Network I/O Requirements
Karim,
Lots of options...
Maybe I'm misunderstanding the questions, but it seems to me these are
very open-ended questions which would be addressed by the QNX Writing
Drivers course, by looking at what existing network drivers do and how they
work, and by the hardware spec for the particular network hardware they use.
In Linux, do they have their own (special) network driver? Do they have
a user space application or a kernel module to implement RTP protocol
and signal processing? Do they want to leverage their existing software?
Do they have specialized hardware to do signal processing?
E.g. see below...
> ... How to disable the interrupt per packet?
>
This would depend on how you program the hardware, right? Or you could
simply ignore it. In general, the network driver can do that.
> ... How to process the network packet data that had come in since the last 250usec audio frame interrupt, with the
real time deadline of generating output packets by the end of the 250us interval.
>
Not really sure what's being asked here. It sounds like a very wide
scope design question. There would be several ways to do this, I guess...
> ... How would we pass the non-audio packets through to non-real time (i.e. simply lower priority thread?) processing.
>
I guess you'd pass the (regular) non-audio packets to io-pkt and pass
the (real-time) audio packets to the audio signal-processing module,
which could be located in the driver, in a different protocol module
(lsm), or in a separate process (outside the stack). In the latter case,
the packets could be passed via a specialized interface to the app, or
via the io-pkt stack through a socket to the app. The choice would
depend on a lot of things, and I might be making some valid or invalid
assumptions in suggesting these things...
> .. How do we make sure that for network output, all real time audio packets get sent before any waiting low priority
packets?
>
There are several options, and which is best depends on the other choices. Examples:
- Your network driver could have two queues - one for audio packets,
and the usual queue for packets going to regular io-pkt TCP/IP stack
processing. Process the audio queue first, then the other, stop if you
run out of time.
- You could simply use the existing if queue and let the driver walk
through it once to find and send the audio packets, then do the rest,
again stop if you run out of time doing the non-audio packets.
- You could use ALTQ to implement QoS queues, but with those deadlines you may
want more control than that...
- You could maybe use PFIL hooks to filter out the audio packets and put them
in the special queue...
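The first option above (two transmit queues with strict priority and a per-interval budget) can be sketched in plain C. This is an illustrative model only, with invented queue and packet types; a real io-pkt driver would drain its own descriptor rings and mbuf chains instead:

```c
#include <assert.h>

/* Sketch of the "two transmit queues" idea: drain the real-time audio
 * queue first, then the best-effort queue, stopping when the
 * per-interval budget runs out.  All types here are invented. */

#define QCAP 32

typedef struct { int id; } packet_t;

typedef struct {
    packet_t pkt[QCAP];
    int head, tail;     /* monotonically increasing indices */
} queue_t;

static int q_empty(const queue_t *q) { return q->head == q->tail; }

static void q_push(queue_t *q, int id)
{
    q->pkt[q->tail % QCAP].id = id;
    q->tail++;
}

static packet_t q_pop(queue_t *q)
{
    return q->pkt[q->head++ % QCAP];
}

/* Send up to 'budget' packets: all audio first, then best-effort.
 * Returns the leftover budget (0 means we ran out of time). */
static int tx_drain(queue_t *audio, queue_t *best_effort, int budget,
                    int *sent_ids, int *nsent)
{
    *nsent = 0;
    while (budget > 0 && !q_empty(audio)) {
        sent_ids[(*nsent)++] = q_pop(audio).id;
        budget--;
    }
    while (budget > 0 && !q_empty(best_effort)) {
        sent_ids[(*nsent)++] = q_pop(best_effort).id;
        budget--;
    }
    return budget;
}
```

With a budget of three and two audio plus two best-effort packets queued, both audio packets go out before the first best-effort one, and the last best-effort packet waits for the next interval.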
Again, lots of options...
/P
|
|
|
Andrew Boyd(deleted)
|
RE: High-Performance Network I/O Requirements
|
Andrew Boyd(deleted)
04/01/2010 10:15 AM
post51047
|
RE: High-Performance Network I/O Requirements
> How to disable the interrupt per packet?
Actually, it already is. When the rx interrupt
fires after a packet arrives, the interrupt is
masked and left that way. An io-pkt thread is
scheduled to run - generally when you exit the
kernel after the interrupt, pre-empting whatever
else was running at the time - and then the driver
process_interrupt() callback is called by io-pkt.
This function generally loops, processing received
packets, until there are no more. Remember
that packets generally arrive in bursts, and most
new NICs will have interrupt coalescing in hardware
to limit the number of interrupts per second you
take - this is always tunable. This really helps
in amortizing the cost of the interrupt over more
than one packet.
Only when the driver has completed processing packets
is the rx interrupt re-enabled. There are some tricks
you can use - e.g. the return code from process_interrupt(),
as well as grungier hacks in the driver - to try to
straddle brief gaps that you think occur in a
burst of received packets in your application, to avoid
having to unmask the interrupt, then take another
interrupt, go through the kernel again, and then
schedule the same thread to do the same thing again.
Looking at kernel event logs is really educational
about this sort of thing, or as someone said years
ago, all knowledge comes from the lab.
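The mask-and-defer pattern described above can be modelled in plain C. Everything here is a simulation under invented names (a mock rx ring stands in for the NIC); the real mechanism lives in the device registers, the ISR, and io-pkt's worker threads:

```c
#include <assert.h>

/* Simulation of masked-interrupt deferred processing: one interrupt
 * covers a whole burst; the worker drains the ring and only then
 * re-enables the interrupt source.  All names are invented. */

#define RING 64

static int rx_ring[RING];
static int rx_head, rx_tail;
static int irq_masked;
static int irq_count;   /* kernel entries taken */
static int processed;   /* packets handled by the driver callback */

/* "Hardware" side: a packet arrives; interrupt fires only if unmasked. */
static void hw_packet_arrives(int id)
{
    rx_ring[rx_tail++ % RING] = id;
    if (!irq_masked) {
        irq_masked = 1;   /* ISR masks the source and leaves it masked */
        irq_count++;      /* one kernel entry for the whole burst */
    }
}

static void count_rx(int id) { (void)id; processed++; }

/* Driver side: io-pkt would call this from a worker thread.  Loop until
 * the ring is empty, then unmask so the next packet raises a new IRQ. */
static int process_interrupt(void (*rx_handler)(int id))
{
    while (rx_head != rx_tail)
        rx_handler(rx_ring[rx_head++ % RING]);
    irq_masked = 0;       /* re-enable only when fully drained */
    return 1;             /* "done" - real drivers can ask to be called again */
}
```

Feeding a three-packet burst through this model costs a single "interrupt", which is exactly the amortization aboyd describes.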
--
aboyd www.pittspecials.com/movies/takeoff.wmv
|
|
|
Lewis Donzis
|
Re: RE: High-Performance Network I/O Requirements
|
Lewis Donzis
04/01/2010 3:53 PM
post51073
|
Re: RE: High-Performance Network I/O Requirements
Andrew,
I was just about to post a related question, and this seems as good a place as any.
We have also found that in our system, we can achieve better performance simply by dedicating a few CPU cores to polling
for packets, i.e., running without interrupts at all.
As you mentioned, under heavy load the driver essentially becomes polled anyway; however, there are a couple of areas
where it's still not as good as a polled architecture. In our world, real-time response is critical, so interrupt
throttling is good for internal efficiency but bad for latency. Also, the CPUs we're using are fast
enough that it takes a lot of network traffic to load them to the point where they will stay in the receive thread.
In most cases there is just enough traffic to make it go in and out of the thread for nearly every packet. The
problem is that this incurs significant overhead in the kernel, which always happens on CPU 0, I believe, right? So
even though we have plenty of CPU resources available, the system acts relatively sluggish.
Another problem has to do with shared interrupts. We have many interfaces and not many interrupt levels available, to
the point that an interrupt on a particular level may invoke four or five interrupt routines, and not just in the
network code: the USB ports and so on are also getting hit with spurious interrupts.
Anyway, long story short, we have built an e1000 driver on io-net that can run either in the traditional interrupt mode
or with a configurable number of dedicated polling threads. The polled mode, even with just two CPUs, is already about
50% better than interrupt mode, and the machine stays completely responsive since we always have one or two CPUs doing
almost nothing.
Which, finally, leads me to the real question/dilemma: we can't figure out how to do this on io-pkt. My knowledge of
this is limited, but I get the impression that drivers aren't allowed to start threads, so how can we do something
similar?
Thanks,
lew
|
|
|
Mario Charest
|
RE: RE: High-Performance Network I/O Requirements
|
Mario Charest
04/01/2010 4:19 PM
post51076
|
RE: RE: High-Performance Network I/O Requirements
Add my vote for an option to go polling-only. We have more cores than we can use. Shared interrupts add very
significant overhead, and we have no control over that for the moment.
|
|
|
Armin Steinhoff
|
Re: RE: High-Performance Network I/O Requirements
|
Armin Steinhoff
04/01/2010 4:21 PM
post51077
|
Re: RE: High-Performance Network I/O Requirements
> Which, finally, leads me to the real question/dilema: we can't figure out how
> to do this on io-pkt. My knowledge of this is limited, but I get the
> impression that drivers aren't allowed to start threads, so how can we do
> something similar?
In order to avoid context switches, build your own lsm module which handles all of your application-specific tasks.
Such an lsm module is linked directly into the Ethernet driver.
A good example of an lsm module is the lsm-raw module included in the network sources.
With 100,000 packets per second you have to send one packet every 10 us. IMHO that's not possible without specialized
Ethernet hardware. The best value we have measured was in the range of 15 us.
Armin Steinhoff
http://www.steinhoff-automation.com
|
|
|
Lewis Donzis
|
Re: RE: High-Performance Network I/O Requirements
|
Lewis Donzis
04/01/2010 6:54 PM
post51085
|
Re: RE: High-Performance Network I/O Requirements
> In order to avoid context switches, build your own lsm module which handles
> all of your application-specific tasks. Such an lsm module is linked directly
> into the Ethernet driver.
> A good example of an lsm module is the lsm-raw module included in the network
> sources.
Yes, we already have an lsm that takes the packets that we're interested in from a pfil hook, but in some cases, we have
to give the packet to the TCP/IP stack.
I'm still not sure how this works around the claim that a driver cannot start any threads of its own, and the stack
threads that call into the driver certainly are not designed for polled mode.
> With 100,000 packets per second you have to send one packet every 10 us. IMHO
> that's not possible without specialized Ethernet hardware. The best value
> we have measured was in the range of 15 us.
Obviously, that depends a lot on the available hardware resources and on how much work you want to do on each packet.
Our per-packet code path is roughly 0.6 us, so we're reaching 1.5 Mpps without much trouble, and that's including the
io-net code in the path.
lew
|
|
|
Patrik Lahti
|
Re: High-Performance Network I/O Requirements
|
Patrik Lahti
04/05/2010 9:42 AM
post51133
|
Re: High-Performance Network I/O Requirements
> Yes, we already have an lsm that takes the packets that we're interested in from a pfil hook, but in some cases, we
have to give the packet to the TCP/IP stack.
>
http://community.qnx.com/sf/wiki/do/viewPage/projects.networking/wiki/Filtering_wiki_page
"The filter returns a non-zero value if the packet processing is to
stop, or 0 if the processing is to continue."
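That return convention can be modelled in plain C. All names here are invented for illustration; the real hook is installed with pfil_add_hook() and receives mbufs from io-pkt:

```c
#include <assert.h>

/* Plain-C model of the pfil return convention quoted above: the hook
 * returns non-zero to claim the packet and stop stack processing, or 0
 * to let the TCP/IP stack continue with it.  The packet type and the
 * "stack" counters are invented. */

typedef struct { int is_audio; } pkt_t;

static int stack_rx;     /* packets that reached the regular stack */
static int diverted_rx;  /* packets claimed by the filter */

/* The hook: divert real-time audio packets, pass everything else. */
static int audio_filter(pkt_t *p)
{
    if (p->is_audio) {
        diverted_rx++;   /* hand off to the audio-processing path */
        return 1;        /* non-zero: stop stack processing */
    }
    return 0;            /* zero: continue into the stack */
}

/* Input path: run the filter first, the way io-pkt runs pfil hooks. */
static void input_path(pkt_t *p)
{
    if (audio_filter(p) != 0)
        return;          /* filter consumed the packet */
    stack_rx++;          /* otherwise normal stack processing */
}
```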
|
|
|
Lewis Donzis
|
Re: High-Performance Network I/O Requirements
|
Lewis Donzis
04/05/2010 11:10 AM
post51146
|
Re: High-Performance Network I/O Requirements
> > Yes, we already have an lsm that takes the packets that we're interested in
> from a pfil hook, but in some cases, we have to give the packet to the TCP/IP
> stack.
> >
> http://community.qnx.com/sf/wiki/do/viewPage/projects.networking/wiki/
> Filtering_wiki_page
> "The filter returns a non-zero value if the packet processing is to
> stop, or 0 if the processing is to continue."
Thanks, we already have our filters converted from io-net to io-pkt, so that's not the issue right now. The question
is how a driver can operate 100% in polled mode.
|
|
|
Andrew Boyd(deleted)
|
RE: RE: High-Performance Network I/O Requirements
|
Andrew Boyd(deleted)
04/04/2010 10:38 AM
post51114
|
RE: RE: High-Performance Network I/O Requirements
> it takes a lot of network traffic to load them
> to the point that they will stay in the receive thread.
I need to write a little wiki article on this ...
You need to change the return code of process_interrupt().
This is a driver function which is called by io-pkt
after an interrupt fires. Normally all the servicing
is completed and the function returns, but by altering
the return code you can tell io-pkt that you have not
finished processing - even if you really have - and
io-pkt will call you again soon, with the interrupt
still masked!
You can create a little state machine with a static
variable so that your driver thread stays active longer
after an interrupt, to straddle the interval between
packets in a burst.
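A hedged sketch of that state machine in plain C follows. LINGER_POLLS, the mock ring callbacks, and all other names are invented; in a real driver the return value goes back to io-pkt, which re-invokes the callback with the interrupt still masked:

```c
#include <assert.h>

/* "Lie about being done": after a genuine drain, keep returning
 * "not finished" for a few extra polls so the worker re-calls us with
 * the interrupt still masked, straddling gaps inside a packet burst. */

#define LINGER_POLLS 4

static int linger;    /* extra "call me again" returns remaining */
static int pending;   /* mock: packets waiting in the rx ring */
static int handled;   /* mock: packets processed so far */

static int  mock_pending(void) { return pending > 0; }
static void mock_handle(void)  { pending--; handled++; }

/* Returns non-zero while we want the stack to call us again. */
static int process_interrupt_linger(int (*rx_pending)(void),
                                    void (*rx_handle)(void))
{
    int did_work = 0;
    while (rx_pending()) {
        rx_handle();
        did_work = 1;
    }
    if (did_work)
        linger = LINGER_POLLS;   /* restart the dwell after any packet */
    if (linger > 0) {
        linger--;
        return 1;                /* pretend we're not done: poll again */
    }
    return 0;                    /* truly done: unmask and wait for IRQ */
}
```

After draining a two-packet burst, this version asks to be re-polled four more times before finally reporting "done", so a packet landing in that window is picked up without another interrupt.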
The above is the elegant way of doing it. You may not
prefer it - are you a manual transmission kind of guy?
If so, ensure that io-pkt has enough threads - one is
created per CPU, and you may wish to use the command-line
options to create at *least* as many threads as interfaces -
and in this case you can "hog" the io-pkt thread by
not returning immediately from process_interrupt().
Instead, what you might do after all your processing is
complete is loop, perhaps calling SchedYield() and then
reading your hardware registers, looking for received packets.
Seanb doesn't like this very much - he would prefer you return
from process_interrupt() asking to be called again - but
as long as you carefully configure your system with enough
threads for your number of cores and interfaces, spinning
in the process_interrupt() callback will work very well indeed.
If you've got cores to burn, skip the SchedYield() - that's
more for single-CPU configurations, to give other threads at
the same priority that are ready to run a chance at the CPU.
Hope this helps,
--
aboyd
|
|
|
Lewis Donzis
|
Re: RE: RE: High-Performance Network I/O Requirements
|
Lewis Donzis
04/04/2010 3:43 PM
post51115
|
Re: RE: RE: High-Performance Network I/O Requirements
Hi, Andrew.
Thanks for the pointers. I say what follows without any experience writing an io-pkt driver -- we've done quite a few
for io-net, but none for io-pkt yet.
First of all, there is heavy interrupt sharing going on, with multiple types of devices, so it's not reasonable to leave
the interrupt masked in the PIC. For example, the USB ports share the same interrupts as the MACs. So I presume,
from what you're saying, that if we return from process_interrupt() with the interrupt still masked, the USB ports will
never get serviced.
Another thing is that the network threads run at relatively high priority (22), so calling SchedYield() is of no benefit
in terms of allowing other things to run.
On io-net, we wrote a completely independent module that can be called by any driver in lieu of InterruptAttachEvent().
The driver calls it with its event-service function and interrupt level.
If there is only a single CPU or if traditional interrupt mode is desired, then the polling module creates a thread for
that driver, uses InterruptAttachEvent() and InterruptWait(), and when an interrupt occurs, it calls back to the driver.
So in that mode, the driver works pretty much as it did before.
But if multiple CPUs are available, it creates a number of threads: usually the number of CPUs minus one, or half the
CPUs, or whatever. Each thread locks itself to a CPU (starting at the highest-numbered) and loops forever,
calling the registered driver functions. (All threads share the same pool of callbacks.)
In the latter case, i.e. when it's running in polled mode, the driver does not enable interrupts on its MAC hardware;
it runs strictly in polling mode. There are no kernel calls in the polling loop, so it's quite fast, completing a
polling loop in roughly 50 nanoseconds.
It's all pretty straightforward, and this is the functionality we hoped to replicate on io-pkt.
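A rough plain-C sketch of such a polling module, under stated assumptions: all names are invented, and the QNX-specific pieces (pinning each thread to a core with ThreadCtl(_NTO_TCTL_RUNMASK, ...), and the InterruptAttachEvent()/InterruptWait() fallback for interrupt mode) are only mentioned in comments so the sketch stays portable:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Polling module sketch: dedicated threads spin over a shared table of
 * registered driver poll callbacks, with no kernel calls in the loop.
 * On QNX, each thread would also pin itself to a core with
 * ThreadCtl(_NTO_TCTL_RUNMASK, ...); that call is omitted here. */

#define MAX_DRIVERS 8

typedef void (*poll_fn)(void *arg);

static poll_fn    poll_cb[MAX_DRIVERS];
static void      *poll_arg[MAX_DRIVERS];
static int        npoll;
static atomic_int stop_polling;

/* Drivers register here instead of calling InterruptAttachEvent(). */
static void poll_register(poll_fn fn, void *arg)
{
    poll_cb[npoll]  = fn;
    poll_arg[npoll] = arg;
    npoll++;
}

/* Thread body (pthread entry signature): one per dedicated core. */
static void *poll_thread(void *unused)
{
    (void)unused;
    while (!atomic_load(&stop_polling)) {
        for (int i = 0; i < npoll; i++)
            poll_cb[i](poll_arg[i]);   /* e.g. check an rx descriptor ring */
    }
    return NULL;
}

/* Demo callback: count passes and stop the poller after a few. */
static atomic_int polls_seen;
static void mock_poll(void *arg)
{
    (void)arg;
    if (atomic_fetch_add(&polls_seen, 1) >= 2)
        atomic_store(&stop_polling, 1);
}
```

In a real deployment you would pthread_create() one poll_thread per dedicated core; the demo below just runs the loop inline.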
Just out of curiosity, is there anything really preventing a driver from starting a thread of its own, and not using the
process_interrupt() callback?
Thanks,
lew
|
|
|
|