Forum Topic - optimizing read/write call performance / contiguous RAM allocation: (19 Items)
   
optimizing read/write call performance / contiguous RAM allocation  
Hi,

I'm writing a device driver for a fast flash memory device (>90 MB/s). My driver uses DMA to do read and write accesses to the flash memory device (i.e. it is memory mapped).

Using the read and write primitives which I've written, I can read and write at 90 MB/s.

I simply pass my read and write functions a buffer and once the DMA generates an interrupt (saying transfer complete), 
my functions return. 

The user will access my device driver via POSIX-like open, read and write calls. But I'm finding that using the QNX read and write routines is killing my performance.

I'm down to 25 MB/s for reads and 16 MB/s for writes.

I've written io_read and io_write functions, which are called when the user issues a read or write. 

In io_read I need to allocate a physically contiguous buffer for the DMA transfer. I get the physical address of the 
buffer via mem_offset and then call my device driver to perform the DMA operation.
Upon completion, I return the buffer to the caller via:

MsgReply(rcvid, size, pContigBuffer, size)

This then takes >0.3 seconds to copy the data from the contiguous buffer which I created into the virtual address space of the buffer which the user passed to me.

This is where my performance is going.

It's a similar situation for write, but the performance hit is bigger, as I have to copy the buffer which I receive via io_write into a physically contiguous buffer which I need for DMA.
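
In code terms, the read path looks roughly like this (simplified sketch - the global buffer variables and the dma_read_and_wait() helper are just placeholder names for what my driver actually does):

#include <sys/types.h>
#include <sys/iofunc.h>
#include <sys/dispatch.h>
#include <sys/neutrino.h>

/* pre-allocated once at driver init: an mmap(MAP_PHYS|MAP_ANON) buffer plus
 * its physical address from mem_offset() */
static char    *pContigBuffer;
static off64_t  physAddr;
static size_t   contigBufSize;

/* placeholder for my real DMA call: starts the transfer and blocks until
 * the "transfer complete" interrupt arrives */
extern void dma_read_and_wait(off64_t dst_paddr, off64_t flash_off, size_t nbytes);

static int io_read(resmgr_context_t *ctp, io_read_t *msg, iofunc_ocb_t *ocb)
{
    size_t nbytes = msg->i.nbytes;

    if (nbytes > contigBufSize)
        nbytes = contigBufSize;

    dma_read_and_wait(physAddr, ocb->offset, nbytes);

    /* the expensive copy into the client's buffer happens inside this reply */
    MsgReply(ctp->rcvid, nbytes, pContigBuffer, nbytes);
    return _RESMGR_NOREPLY;
}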

The user code will not be written by me and will run in a separate process, so I can't access the physical addresses of the buffers directly, and I won't know whether the buffers are physically contiguous - is there maybe a way to force the memory manager to only ever allocate physically contiguous RAM?

And is there also a way of preventing QNX from doing buffer copying in the read/write routines, if the buffers are 
already physically contiguous?

Any other ideas?

Cheers, Mark.
Re: optimizing read/write call performance / contiguous RAM allocation  
For the performance of the memmgr.map_xfer functions some QNX gurus are needed, but why don't you just do direct I/O from your client side using devctl()?
Here you'd allocate the physically contiguous blocks on the client side and pass them to your driver.
If needed, you could set up a client lib that wraps this, so that users don't have to handle direct I/O on their own.
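
Roughly what I mean, from the client side (untested sketch - the devctl command code and the request struct are made up for illustration, not a real API):

#include <sys/mman.h>
#include <sys/types.h>
#include <devctl.h>
#include <errno.h>
#include <stdint.h>

/* hypothetical request layout - define whatever your driver needs */
struct my_dma_req {
    uint64_t paddr;          /* physical address of the client's buffer */
    uint32_t nbytes;         /* transfer length */
    uint32_t flash_offset;   /* device offset to read from */
};

#define MY_DCMD_READ  __DIOT(_DCMD_MISC, 1, struct my_dma_req)

int direct_read(int fd, void **vaddr_out, size_t nbytes, uint32_t flash_off)
{
    void *buf;
    off64_t paddr;
    struct my_dma_req req;

    /* MAP_PHYS | MAP_ANON gives physically contiguous anonymous memory */
    buf = mmap(NULL, nbytes, PROT_READ | PROT_WRITE | PROT_NOCACHE,
               MAP_PHYS | MAP_ANON, NOFD, 0);
    if (buf == MAP_FAILED)
        return -1;

    if (mem_offset64(buf, NOFD, nbytes, &paddr, NULL) == -1)
        return -1;

    req.paddr        = paddr;
    req.nbytes       = nbytes;
    req.flash_offset = flash_off;

    /* the driver DMAs straight into buf - no extra copy through MsgReply */
    if (devctl(fd, MY_DCMD_READ, &req, sizeof(req), NULL) != EOK)
        return -1;

    *vaddr_out = buf;
    return 0;
}
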
/hp

Re: optimizing read/write call performance / contiguous RAM allocation  
A couple of things...

Are you allocating buffers for every read?  Calling mmap() and mem_offset() are, relatively speaking, expensive
calls - they are both msgpasses to procnto.

You should probably cache your buffers, along with their physical addresses.
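
I.e. do the expensive mmap()/mem_offset() calls once when the resource manager starts and keep the results around - rough sketch, size and names purely illustrative:

#include <sys/mman.h>
#include <sys/types.h>

#define DMA_BUF_SIZE  (16 * 1024 * 1024)   /* whatever your largest transfer is */

static void    *dma_vaddr;
static off64_t  dma_paddr;

int init_dma_buffer(void)
{
    /* MAP_PHYS | MAP_ANON: physically contiguous anonymous memory */
    dma_vaddr = mmap(NULL, DMA_BUF_SIZE, PROT_READ | PROT_WRITE | PROT_NOCACHE,
                     MAP_PHYS | MAP_ANON, NOFD, 0);
    if (dma_vaddr == MAP_FAILED)
        return -1;

    /* cache the physical address so io_read/io_write never have to ask again */
    return mem_offset64(dma_vaddr, NOFD, DMA_BUF_SIZE, &dma_paddr, NULL);
}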

You don't mention which release you're running on.

I would strongly suggest taking a kernel trace to isolate unknown factors.

Regards,

Colin

-- 
cburgess@qnx.com
Re: optimizing read/write call performance / contiguous RAM allocation  
Thanks for the replies.

@Hans-Peter: I do have devctl calls to do this and I can achieve the desired performance - only I wish to offer read/write interfaces to the user, providing a similar level of performance.

@Colin: I have used pre-allocated buffers (and have cached the physical addresses) within my io_read and write routines, which has helped to gain a little more performance, but the main hit is where the read buffer is transferred back to the caller of the read routine, as there will be a copy from my pre-allocated physically contiguous buffer to the buffer in which the caller wants the data to be presented.

I'm running release 6.4.0 on a PowerPC platform.

I have tested the read routine without copying my contiguous buffer back to the caller's buffer (i.e. in MsgReply simply setting the size of the buffer to return to 0), and then I achieve >90 MB/s. So it seems that the performance is only lost at this stage.

Best regards and many thanks, Mark.
Re: optimizing read/write call performance / contiguous RAM allocation  
Most likely your DMA buffer is non-cacheable. Have you tried using cacheable memory, then flushing/invalidating the cache after copying from the client buffer to the DMA buffer, and then starting the DMA, etc.?
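
Something like this (sketch only - assumes the standard libcache interface and that you link with -lcache):

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/cache.h>   /* struct cache_ctrl, cache_init(), CACHE_FLUSH/CACHE_INVAL */

static struct cache_ctrl cinfo;
static void    *dma_vaddr;
static off64_t  dma_paddr;

int dma_buf_init_cacheable(size_t nbytes)
{
    /* note: no PROT_NOCACHE, so the CPU is allowed to cache the DMA buffer */
    dma_vaddr = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                     MAP_PHYS | MAP_ANON, NOFD, 0);
    if (dma_vaddr == MAP_FAILED)
        return -1;

    if (mem_offset64(dma_vaddr, NOFD, nbytes, &dma_paddr, NULL) == -1)
        return -1;

    /* one-time setup for the CACHE_FLUSH()/CACHE_INVAL() macros */
    return cache_init(0, &cinfo, NULL);
}

Then around each transfer you only flush (write path) or invalidate (read path) the range you actually touched.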
Re: optimizing read/write call performance / contiguous RAM allocation  
> Most likely your DMA buffer is non-cacheable. Have you tried using cacheable
> memory, then flushing/invalidating the cache after copying from the client
> buffer to the DMA buffer, and then starting the DMA, etc.?

Making the client buffers cacheable didn't improve the performance significantly, and on the read operation it actually ran a little slower with caching.

Cheers, Mark.
Re: optimizing read/write call performance / contiguous RAM allocation  
Your client buffer most likely is already cacheable; I was referring to your DMA buffer. How do you mmap the DMA buffer?
Re: optimizing read/write call performance / contiguous RAM allocation  
Hi Mark,

As an experiment, can you try varying the amount that you reply back with?  The kernel will use a slightly different
mapping strategy with messages up to 256 bytes.

Also, which PowerPC family are you running on?

-- 
cburgess@qnx.com
Re: optimizing read/write call performance / contiguous RAM allocation  
Hi Colin,

> As an experiment, can you try varying the amount that you reply back with?  
> The kernel will use a slightly different
> mapping strategy with messages up to 256 bytes.
> 
I've varied the buffer size between the minimum and maximum that I will be asked to read/write (min 2 KB -> max 16 MB).
I get the best performance from the 16 MB transfers and the poorest from the 2 KB transfers.

If I do a 16 MB read but tell MsgReply that the returned buffer is smaller, then I see a linear relationship between the reply size and the time taken to perform the MsgReply. There doesn't seem to be a point where the performance becomes significantly worse.

> Also, which powerpc family are you running on?
> 
It's the AMCC460ex family.

Many thanks, Mark.
Re: optimizing read/write call performance / contiguous RAM allocation  
Are the DMA buffers cacheable?

-- 
cburgess@qnx.com
Re: optimizing read/write call performance / contiguous RAM allocation  
> Are the DMA buffers cacheable?
> 

I'm not sure if the cache library is supported for my hardware, as I get an undefined reference to cache_init when I add
 this function to my startup code.

In my board support package, I find two library files for cache: libcache.a and libcacheS.a. When I recompile my code with the added libraries and add the gcc linker flag -lcache, I still get an undefined reference to cache_init (I have added sys/cache.h to my includes).

I'm not sure if caching is useful for my application, as my processor only has 32 KB of data cache and my DMA transfers are much larger than that (up to 16 MB).

Cheers, Mark.
Re: optimizing read/write call performance / contiguous RAM allocation  
It's definitely there...

$ ntoppc-nm /opt/qnx640/target/qnx6/ppcbe/usr/lib/libcache.a | grep init
cache_init.o:
00000000 T cache_init

And accessing the memory via cache would definitely make a difference... it allows the chip to do all sorts of optimisations. Definitely worth a try. Just make sure you invalidate the cache beforehand, so that stale data isn't accessed.

-- 
cburgess@qnx.com
Re: optimizing read/write call performance / contiguous RAM allocation  
Hi Colin,

> 
> And accessing the memory via cache would definitely make a difference... it
> allows the chip to do all sorts of optimisations. Definitely worth a try.
> Just make sure you invalidate the cache beforehand, so that stale data isn't accessed.
> 

I added the library by hand to my project and now I see the cache_init symbol.

However, now when I call CACHE_INVAL my CPU receives a SIGILL signal :-(

Here it is, to be precise (see >>>>>):

__cpu_cache_inval(struct cache_ctrl *cinfo,
    void *vaddr, uint64_t paddr, size_t len)
{
	unsigned linesize = cinfo->cache_line_size;
	unsigned __dst = (unsigned)vaddr & ~(linesize-1);
	int __nlines = (((unsigned)vaddr + len + linesize-1)-__dst)/linesize;

	while(__nlines) {	
>>>>>	__asm__ __volatile__("dcbi 0,%0;" : : "r" (__dst));
		__dst+=linesize;
		__nlines--;
	}

I call the CACHE_INVAL as follows:

CACHE_INVAL(cinfo, pVirtBuffer, phyAddr, bytes)

cinfo is a pointer to struct cache_ctrl
pVirtBuffer is a mmap'd buffer
phyAddr is the physical address of pVirtBuffer returned from mem_offset
bytes = size of pVirtBuffer in bytes.

If I call CACHE_FLUSH instead, this works no problem - looks like the CACHE_INVAL library function is using an 
instruction that my processor doesn't support :-(

Any ideas? 

The result is still quite good (as long as CACHE_FLUSH is also guaranteeing cache coherency???)

I now have a read performance of 57 MB/s. Writing still lags at 33 MB/s.

Cheers, Mark.
Re: optimizing read/write call performance / contiguous RAM allocation  
I don't see why dcbi would cause SIGILL for you - you'd better check your processor manual.

You don't want to use CACHE_FLUSH in this case (unless you are writing data via the cached mapping that you want to send out to main memory before you initiate the DMA).

Let's check a few things - you are root, you have IO privity, and the CACHE_INVAL is operating on a
cached, writeable mapping?

Colin

-- 
cburgess@qnx.com
Re: optimizing read/write call performance / contiguous RAM allocation  
Hi, Mark:

I suspect your processor does not support the dcbi instruction. You can try 'dcbf' instead, or check your processor manual as Colin suggested.

For the read case, you need to invalidate and then start the DMA; for the write case, you only need to flush.

Instead of going through the large while loop (if len is big), you can try flushing/invalidating all the cache lines at once.
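
In other words (sketch only; cinfo and the DMA buffer are set up elsewhere as discussed above, and the start_dma_*() calls are placeholders for your real driver calls):

#include <sys/types.h>
#include <sys/cache.h>
#include <string.h>

extern struct cache_ctrl cinfo;        /* from cache_init() at driver init */
extern void    *dma_vaddr;             /* cacheable, physically contiguous */
extern off64_t  dma_paddr;
extern void start_dma_read(off64_t paddr, size_t nbytes);   /* placeholder, blocks until done */
extern void start_dma_write(off64_t paddr, size_t nbytes);  /* placeholder, blocks until done */

void read_path(void *client_buf, size_t nbytes)
{
    CACHE_INVAL(&cinfo, dma_vaddr, dma_paddr, nbytes);   /* drop stale lines first */
    start_dma_read(dma_paddr, nbytes);                   /* device -> RAM */
    memcpy(client_buf, dma_vaddr, nbytes);               /* or MsgReply() the buffer */
}

void write_path(const void *client_buf, size_t nbytes)
{
    memcpy(dma_vaddr, client_buf, nbytes);               /* fill the DMA buffer */
    CACHE_FLUSH(&cinfo, dma_vaddr, dma_paddr, nbytes);   /* push the data out to RAM */
    start_dma_write(dma_paddr, nbytes);                  /* RAM -> device */
}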
Re: optimizing read/write call performance / contiguous RAM allocation  
> For the read case, you need to invalidate and then start the DMA; for the
> write case, you only need to flush.
> 
Ok, thanks.

> Instead of going through the large while loop (if len is big), you can try
> flushing/invalidating all the cache lines at once.
>
Which while loop do you mean? My code only allows a max. transfer of 16 MB, and all the data to be written (or read) will be held within one contiguous buffer, so there will only be one DMA transfer per call.

Cheers, Mark.

Re: optimizing read/write call performance / contiguous RAM allocation  
I was referring to this loop
while(__nlines) {

If len is 16 MB, __nlines is relatively big; you can check your processor manual to see if there is an easy way to flush/invalidate the whole cache at once, instead of going through the loop.
Re: optimizing read/write call performance / contiguous RAM allocation  
> I was referring to this loop
> while(__nlines) {
> [...]

Ah, ok. Will take a look into it.

Thanks, Mark.
Re: optimizing read/write call performance / contiguous RAM allocation  
Good morning guys,

> 
> Let's check a few things - you are root, you have IO privity, and the 
> CACHE_INVAL is operating on a
> cached, writeable mapping?
> 
Oops! My IO privilege was missing - I had forgotten to call ThreadCtl for the thread where I was doing the CACHE_INVAL - now it is working :o)
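
For the record, the missing piece was just the standard I/O privity call, made once early in the thread that does the cache operations:

    ThreadCtl(_NTO_TCTL_IO, NULL);   /* from <sys/neutrino.h>; requires root */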

Many thanks for the help.

Cheers, Mark.