Adam Mallory
08/18/2008 4:40 PM
post11986
|
During some investigation I've found that in 6.3.2 (and still in HEAD) we're flushing the cache at map time even for PROT_NOCACHE mappings that do not fall within the sysram areas. The problem is that these mappings tend to be the worst kind: large, mostly idle areas (flash arrays, device I/O, register windows).
For example, on the Jacinto we have a 64 MB flash array; devf does an mmap() to map it in, and the time spent in procnto/memmgr is 500ms+ just doing the cache flush, which is arguably a waste of time. Granted, it is possible to have previously cacheable mappings still in the cache, but those cases tend to be very rare, and it makes little sense to penalize every mmap() of PROT_NOCACHE|MAP_PHYS|MAP_SHARED, NOFD, <paddr>.
I've moved the cache flush for this case out to the munmap() code. This way we only attempt a flush if the mapping was done cached. That penalizes only the small set of software that maps those large areas cached, and the cost is paid at unmap time rather than at startup.
The diff is rough and doesn't distinguish ARM from other architectures (since I think I need to pass the paddr to CacheControl rather than the vaddr); it's purely to illustrate the idea and to hear from anyone with background on why we do it the other way (i.e. at map time, every time).
Thanks!
|
|
|
Sunil Kittur(deleted)
08/21/2008 10:13 AM
post12193
|
I agree with the basic principle of this change, and I think it looks OK
for non-RAM fake quanta. However, the RUSH3 comments in pq_map just above
the turning off of PROT_NOCACHE make me wonder if there's something more
tricky to take care of for RAM?
Sunil.
Adam Mallory wrote:
> During some investigation I've found that in 6.3.2 (and still in HEAD) we're flushing the cache at map time even
> for PROT_NOCACHE mappings that do not fall within the sysram areas. The problem is that these mappings tend to be
> the worst kind: large, mostly idle areas (flash arrays, device I/O, register windows).
>
> For example, on the Jacinto we have a 64 MB flash array; devf does an mmap() to map it in, and the time spent in
> procnto/memmgr is 500ms+ just doing the cache flush, which is arguably a waste of time. Granted, it is possible to
> have previously cacheable mappings still in the cache, but those cases tend to be very rare, and it makes little
> sense to penalize every mmap() of PROT_NOCACHE|MAP_PHYS|MAP_SHARED, NOFD, <paddr>.
>
> I've moved the cache flush for this case out to the munmap() code. This way we only attempt a flush if the mapping
> was done cached. That penalizes only the small set of software that maps those large areas cached, and the cost is
> paid at unmap time rather than at startup.
>
> The diff is rough and doesn't distinguish ARM from other architectures (since I think I need to pass the paddr to
> CacheControl rather than the vaddr); it's purely to illustrate the idea and to hear from anyone with background on
> why we do it the other way (i.e. at map time, every time).
>
> Thanks!
>
> _______________________________________________
> OSTech
> http://community.qnx.com/sf/go/post11986
>
>
> ------------------------------------------------------------------------
>
> Index: memmgr/mm_reference.c
> ===================================================================
> --- memmgr/mm_reference.c (revision 174993)
> +++ memmgr/mm_reference.c (working copy)
> @@ -221,7 +221,6 @@
> }
> }
>
> -
> static int
> pq_map(OBJECT *obp, off64_t off, struct pa_quantum *pq, unsigned num, void *d) {
> struct data *data = d;
> @@ -286,7 +285,7 @@
> //RUSH3: returns NULL. Maybe not - what about special purpose
> //RUSH3: ram that's outside the sysram region(s)?
> // We need to map cached initially so we can flush
> - mmap_flags &= ~PROT_NOCACHE;
> + // mmap_flags &= ~PROT_NOCACHE;
> }
>
> data->va = start + NQUANTUM_TO_LEN(num);
> @@ -295,6 +294,7 @@
> if(mmap_flags & PROT_WRITE) {
> pq->flags |= PAQ_FLAG_MODIFIED;
> }
> +#if 0
> if(!(mmap_flags & PROT_NOCACHE) && (mm->mmap_flags & PROT_NOCACHE)) {
> // We've done a cached mapping, but the actual flags were for
> // no-cache. We did this so that we can make sure there are
> @@ -303,6 +303,7 @@
> CPU_CACHE_CONTROL(or->adp, (void *)start, data->va - start, MS_INVALIDATE|MS_INVALIDATE_ICACHE);
> r = pte_map(or->adp, start, data->va - 1, mmap_flags | PROT_NOCACHE, or->obp, paddr, 0);
> }
> +#endif
> }
> } else {
> data->va = start + NQUANTUM_TO_LEN(num);
>
> fail1:
> Index: memmgr/vmm_munmap.c
> ===================================================================
> --- memmgr/vmm_munmap.c (revision 174993)
> +++ memmgr/vmm_munmap.c (working copy)
> @@ -58,6 +58,9 @@
> sync_off, sync_off + ((num << QUANTUM_BITS) - 1),
> data->prp, (void *)(mm->start + (uintptr_t)(off - mm->offset)));
> }
> + if(!(mm->mmap_flags & PROT_NOCACHE)) {
> + CPU_CACHE_CONTROL(data->prp->memory, (void *)mm->start, mm->end -...
|
|
|
Adam Mallory
08/21/2008 2:26 PM
post12212
|
> I agree with the basic principle of this change, and I think it looks OK
> for non-RAM fake quanta. However, the RUSH3 comments in pq_map just above
> the turning off of PROT_NOCACHE make me wonder if there's something more
> tricky to take care of for RAM?
The comments referred to specialized RAM not in the SYSRAM description of the syspage. I contend that such RAM is outside of the OS's control and that it's up to the people mapping it to do the 'right thing', and as such I'm not in agreement with the comment.
IMHO, if these special RAM areas are mapped all cacheable or all uncached, then life should continue to be fine. If mappings are mixed (i.e. cacheable once, unmapped, and remapped uncached later), those developers need to flush the cache themselves via the cache lib in those special cases. I also think the comment in the code suggests the optimization was meant to avoid the double PTE manipulation, not the cache flush itself (which is the much larger cost in this case), but I could be wrong.
The alternative option, if others in the OS group feel I'm wrong, would be to poison the cache outright rather than spend our time flushing everything line by line (which is also how other OSes do it). It's not free, but it bounds the worst case at a much more reasonable level (I think around 30-50ms). In this case, the board I'm testing on has a 16 KB L1 data cache; we could implement something like the following (obviously not exact, it's just illustrative):
#define MAX_FLUSH_SIZE KB(32)

void dcache_flush(int nlines) {
    static char poison_buff[MAX_FLUSH_SIZE]; /* could be a mapping virtualizing the range to one page */
    static volatile char dummy;              /* 'register' dropped: it can't be combined with 'static' */
    int i;

    if((nlines * const_linesize * const_nways) > MAX_FLUSH_SIZE) {
        for(i = 0; i < MAX_FLUSH_SIZE; i += const_linesize) {
            dummy = poison_buff[i];
        }
    } else {
        /* do it line by line as before */
    }
}
The MAX_FLUSH_SIZE is an empirical measure comparing the time to flush via lines versus poisoning. On my board the two are equal just below 32 KB worth of flushing, so 32 KB is a good crossover point; it will differ for every CPU/cache. I did look at doing this in the startup callouts, but I don't see any reasonable way to find a mapping big enough (and persistent) to provide the poison data. At least, I couldn't see how without a lot of contortions via the patcher routines or embedding a constant data array in the assembly callout itself (which is just awful, as it increases the callout size by a large amount).
|
|
|
Sunil Kittur(deleted)
08/21/2008 2:38 PM
post12214
|
I only mentioned that because I wasn't 100% sure what the comment was
really describing, so I wanted to see if anyone could step in and point
out some non-obvious flaw in your code.
I see nothing incorrect in flushing the cache for fake quanta at unmap
time vs. flushing at map time iff PROT_NOCACHE was on. The main benefit
of doing it the current way is to avoid the cache flush for repeated
cacheable mappings, but I would think that is pretty rare for fake quanta
since most of the time these refer to peripheral registers/memory that
would typically be mapped uncached.
Sunil.
Adam Mallory wrote:
>>I agree with the basic principle of this change, and I think it looks OK
>>for non-RAM fake quanta. However, the RUSH3 comments in pq_map just above
>>the turning off of PROT_NOCACHE make me wonder if there's something more
>>tricky to take care of for RAM?
>
>
> The comments referred to specialized RAM not in the SYSRAM description of the syspage. I contend that such RAM is
> outside of the OS's control and that it's up to the people mapping it to do the 'right thing', and as such I'm not
> in agreement with the comment.
>
> IMHO, if these special RAM areas are mapped all cacheable or all uncached, then life should continue to be fine.
> If mappings are mixed (i.e. cacheable once, unmapped, and remapped uncached later), those developers need to flush
> the cache themselves via the cache lib in those special cases. I also think the comment in the code suggests the
> optimization was meant to avoid the double PTE manipulation, not the cache flush itself (which is the much larger
> cost in this case), but I could be wrong.
>
> The alternative option, if others in the OS group feel I'm wrong, would be to poison the cache outright rather
> than spend our time flushing everything line by line (which is also how other OSes do it). It's not free, but it
> bounds the worst case at a much more reasonable level (I think around 30-50ms). In this case, the board I'm
> testing on has a 16 KB L1 data cache; we could implement something like the following (obviously not exact, it's
> just illustrative):
>
> #define MAX_FLUSH_SIZE KB(32)
>
> void dcache_flush(int nlines) {
>     static char poison_buff[MAX_FLUSH_SIZE]; /* could be a mapping virtualizing the range to one page */
>     static volatile char dummy;
>     int i;
>
>     if((nlines * const_linesize * const_nways) > MAX_FLUSH_SIZE) {
>         for(i = 0; i < MAX_FLUSH_SIZE; i += const_linesize) {
>             dummy = poison_buff[i];
>         }
>     } else {
>         /* do it line by line as before */
>     }
> }
>
> The MAX_FLUSH_SIZE is an empirical measure comparing the time to flush via lines versus poisoning. On my board the
> two are equal just below 32 KB worth of flushing, so 32 KB is a good crossover point; it will differ for every
> CPU/cache. I did look at doing this in the startup callouts, but I don't see any reasonable way to find a mapping
> big enough (and persistent) to provide the poison data. At least, I couldn't see how without a lot of contortions
> via the patcher routines or embedding a constant data array in the assembly callout itself (which is just awful,
> as it increases the callout size by a large amount).
>
>
> _______________________________________________
> OSTech
> http://community.qnx.com/sf/go/post12212
>
|
|
|
Adam Mallory
08/21/2008 2:50 PM
post12217
|
> I only mentioned that because I wasn't 100% sure what the comment was
> really describing, so I wanted to see if anyone could step in and point
> out some non-obvious flaw in your code.
We're both on the same wavelength here :) . I'm certainly not trying to
argue about anything with you specifically :)
> I see nothing incorrect in flushing the cache for fake quanta at unmap
> time vs. flushing at map time iff PROT_NOCACHE was on. The main benefit
> of doing it the current way is to avoid the cache flush for repeated
> cacheable mappings, but I would think that is pretty rare for fake
> quanta since most of the time these refer to peripheral registers/memory
> that would typically be mapped uncached.
Yep - shift the cost onto the less common case rather than make the more common case pay for it every time.
--
Cheers,
Adam
QNX Software Systems
[ amallory@harman.com ]
---------------------------------------------------
With a PC, I always felt limited by the software available.
On Unix, I am limited only by my knowledge.
--Peter J. Schoenster
|
|
|
|