Forum Topic - cache flushing: (5 Items)
cache flushing  
During some investigation I've found that in 6.3.2 (and still in HEAD) we're flushing the cache at map time even for
PROT_NOCACHE mappings that do not fall within the sysram areas.  The problem is that these mappings tend to be the
worst kind: large, underused areas (flash arrays, device I/O, register windows).

For example, on the Jacinto we have a 64 MB flash array; devf does an mmap() to map it in, and the time spent in
procnto/memmgr is 500 ms+ just doing the cache flush, which is arguably a waste of time.  Granted, it is possible to
have previously cacheable mappings still in the cache, but those cases tend to be very rare, and it makes little
sense to penalize every mmap() attempt of PROT_NOCACHE with MAP_PHYS|MAP_SHARED, NOFD, <paddr>.
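
To spell that out, the mapping in question is roughly the following (a minimal sketch - the flags are the ones above,
but FLASH_PADDR is a made-up base address and map_flash() is just an illustrative wrapper):

#include <sys/mman.h>    /* mmap(), PROT_NOCACHE, MAP_PHYS, NOFD */

#define FLASH_PADDR 0x04000000UL          /* hypothetical flash base paddr */
#define FLASH_SIZE  (64 * 1024 * 1024)    /* the 64 MB flash array */

/* Map the whole flash array uncached, straight from its physical address. */
static void *map_flash(void) {
    void *va = mmap(0, FLASH_SIZE,
                    PROT_READ | PROT_WRITE | PROT_NOCACHE,
                    MAP_PHYS | MAP_SHARED, NOFD, FLASH_PADDR);
    return (va == MAP_FAILED) ? NULL : va;
}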

I've moved the cache flush for this case out to the munmap() code.  This way we only attempt to flush if the mapping
was done cached, so we penalize only the small set of software that maps those large areas cached, and we don't pay
at startup time, only when the area is unmapped.

The diff is rough and doesn't distinguish ARM from the other architectures (since I think I need to pass the paddr to
CacheControl rather than the vaddr) - it's purely to illustrate the idea and to hear from someone who might have some
background on why we do it the other way (i.e. at map time, every time).

Thanks!
Attachment: Text d.diff 1.83 KB
Re: cache flushing  
I agree with the basic principle of this change, and I think it looks OK
for non-RAM fake quanta. However, the RUSH3 comments in pq_map just above
the turning off of PROT_NOCACHE make me wonder if there's something more
tricky to take care of for RAM?

	Sunil.

Adam Mallory wrote:
> During some investigation I've found that in 6.3.2 (and still in HEAD) we're flushing the cache at map time even
> for PROT_NOCACHE mappings that do not fall within the sysram areas.  The problem is that these mappings tend to be
> the worst kind: large, underused areas (flash arrays, device I/O, register windows).
> 
> For example, on the Jacinto we have a 64 MB flash array; devf does an mmap() to map it in, and the time spent in
> procnto/memmgr is 500 ms+ just doing the cache flush, which is arguably a waste of time.  Granted, it is possible
> to have previously cacheable mappings still in the cache, but those cases tend to be very rare, and it makes little
> sense to penalize every mmap() attempt of PROT_NOCACHE with MAP_PHYS|MAP_SHARED, NOFD, <paddr>.
> 
> I've moved the cache flush for this case out to the munmap() code.  This way we only attempt to flush if the
> mapping was done cached, so we penalize only the small set of software that maps those large areas cached, and we
> don't pay at startup time, only when the area is unmapped.
> 
> The diff is rough and doesn't distinguish ARM from the other architectures (since I think I need to pass the paddr
> to CacheControl rather than the vaddr) - it's purely to illustrate the idea and to hear from someone who might have
> some background on why we do it the other way (i.e. at map time, every time).
> 
> Thanks!
> 
> Index: memmgr/mm_reference.c
> ===================================================================
> --- memmgr/mm_reference.c	(revision 174993)
> +++ memmgr/mm_reference.c	(working copy)
> @@ -221,7 +221,6 @@
>  	}
>  }
>  
> -
>  static int
>  pq_map(OBJECT *obp, off64_t off, struct pa_quantum *pq, unsigned num, void *d) {
>  	struct data				*data = d;
> @@ -286,7 +285,7 @@
>  			//RUSH3: returns NULL. Maybe not - what about special purpose
>  			//RUSH3: ram that's outside the sysram region(s)?
>  			// We need to map cached initially so we can flush
> -			mmap_flags &= ~PROT_NOCACHE;
> +	//		mmap_flags &= ~PROT_NOCACHE;
>  		}
>  
>  		data->va = start + NQUANTUM_TO_LEN(num);
> @@ -295,6 +294,7 @@
>  			if(mmap_flags & PROT_WRITE) {
>  				pq->flags |= PAQ_FLAG_MODIFIED;
>  			}
> +#if 0
>  			if(!(mmap_flags & PROT_NOCACHE) && (mm->mmap_flags & PROT_NOCACHE)) {
>  				// We've done a cached mapping, but the actual flags were for
>  				// no-cache. We did this so that we can make sure there are
> @@ -303,6 +303,7 @@
>  				CPU_CACHE_CONTROL(or->adp, (void *)start, data->va - start, MS_INVALIDATE|MS_INVALIDATE_ICACHE);
>  				r = pte_map(or->adp, start, data->va - 1, mmap_flags | PROT_NOCACHE, or->obp, paddr, 0);
>  			}
> +#endif
>  		}
>  	} else { 
>  		data->va = start + NQUANTUM_TO_LEN(num);
>  
>  fail1:	
> Index: memmgr/vmm_munmap.c
> ===================================================================
> --- memmgr/vmm_munmap.c	(revision 174993)
> +++ memmgr/vmm_munmap.c	(working copy)
> @@ -58,6 +58,9 @@
>  					sync_off, sync_off + ((num << QUANTUM_BITS) - 1), 
>  					data->prp, (void *)(mm->start + (uintptr_t)(off - mm->offset)));
>  		}
> +		if(!(mm->mmap_flags & PROT_NOCACHE)) {
> +			CPU_CACHE_CONTROL(data->prp->memory, (void *)mm->start, mm->end -...
Re: cache flushing  
> I agree with the basic principle of this change, and I think it looks OK
> for non-RAM fake quanta. However, the RUSH3 comments in pq_map just above
> the turning off of PROT_NOCACHE make me wonder if there's something more
> tricky to take care of for RAM?

The comments referred to specialized RAM not in the SYSRAM description of the syspage.  I contend that such RAM is
outside of the OS's control and that it's up to the people mapping that special RAM to do the 'right thing', so I'm
not in agreement with the comment.

IMHO, if these special RAM areas are mapped all cacheable or all uncached, life should continue to be fine.  If
mappings are mixed (i.e. cacheable once, unmapped and remapped uncached later), those developers need to flush the
cache themselves via the cache lib in those special cases.  I also think the comment in the code suggests the
optimization was to avoid the double pte manipulation, not the cache flush itself (which is the much larger cost in
this case) - I could be wrong.
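
By 'via the cache lib' I mean something like the following (a rough sketch assuming the standard <sys/cache.h>
libcache interface; flush_special_ram() is just an illustrative name, and vaddr/paddr/len describe whatever the
mapping in question was):

#include <sys/cache.h>   /* cache_init(), CACHE_FLUSH(); link with -lcache */
#include <stdint.h>
#include <stddef.h>

/* Flush a cacheable view of special RAM before it gets remapped uncached. */
static int flush_special_ram(void *vaddr, uint64_t paddr, size_t len) {
    static struct cache_ctrl cinfo;

    if(cache_init(0, &cinfo, NULL) == -1) {
        return -1;
    }
    CACHE_FLUSH(&cinfo, vaddr, paddr, len);
    cache_fini(&cinfo);
    return 0;
}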

The alternative option here, if others in the OS group feel I'm wrong, would be to poison the cache outright rather
than spend our time flushing everything line by line (which is also how other OSes do it).  It's not free, but it
bounds the worst case at a much more reasonable level (I think around 30-50 ms).  In this case, the board I'm testing
on has a 16 KB L1 data cache - we could implement something like the following (obviously not exactly; it's just
illustrative):

#define KB(x) ((x) * 1024)
#define MAX_FLUSH_SIZE KB(32)  /* empirical crossover point - see below */

extern unsigned const_linesize;   /* cache line size, known at boot */
extern unsigned const_nways;      /* cache associativity, known at boot */

void dcache_flush(int nlines) {
  static char poison_buff[MAX_FLUSH_SIZE];  /* this could be a mapping virtualizing the range to one page perhaps */
  static volatile char dummy;   /* volatile so the reads below aren't optimized away */
  int i;

  if((nlines * const_linesize * const_nways) > MAX_FLUSH_SIZE) {
    /* Large flush: read through the poison buffer so the whole cache
     * is evicted (and written back) by line replacement. */
    for(i = 0; i < MAX_FLUSH_SIZE; i += const_linesize) {
      dummy = poison_buff[i];
    }
  } else {
    /* do it line by line as before */
  }
}

MAX_FLUSH_SIZE is an empirical measure comparing the time to flush via lines versus via poisoning.  On my board the
two are equal just below 32 KB worth of flushing, so 32 KB is a good point to choose one over the other; it will be
different for every CPU/cache.  I did look at doing this in the startup callouts, but I don't see any reasonable way
to find a mapping, big enough and persistent, to provide the poison data from.  At least, I couldn't see how to do it
without a lot of contortions via the patcher routines, or without putting a constant data array in the assembly
callout itself (which is just awful, as it increases the callout size by a large amount).
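
For what it's worth, the empirical comparison was nothing fancier than timing both approaches with ClockCycles()
(a rough sketch; flush_by_lines() and flush_by_poison() are illustrative stand-ins for the two paths above):

#include <sys/neutrino.h>  /* ClockCycles() */
#include <stdint.h>
#include <stdio.h>

extern void flush_by_lines(int nlines);  /* existing per-line flush */
extern void flush_by_poison(void);       /* the poison-read loop above */

/* Time both flush strategies for a given size; the crossover point is
 * where the two cycle counts meet (just below 32 KB on my board). */
static void compare_flush(int nlines) {
    uint64_t t0, t1, t2;

    t0 = ClockCycles();
    flush_by_lines(nlines);
    t1 = ClockCycles();
    flush_by_poison();
    t2 = ClockCycles();

    printf("%d lines: by-line=%llu cycles, poison=%llu cycles\n",
           nlines, (unsigned long long)(t1 - t0),
           (unsigned long long)(t2 - t1));
}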
Re: cache flushing  
I only mentioned that because I wasn't 100% sure what the comment was
really describing, so I wanted to see if anyone could step in and point
out some non-obvious flaw in your code.

I see nothing incorrect in flushing the cache for fake quanta at unmap
time vs. flushing at map time iff PROT_NOCACHE was on. The main benefit
of doing it the current way is to avoid the cache flush for repeated
cacheable mappings, but I would think that is pretty rare for fake quanta
since most of the time these refer to peripheral registers/memory that
would typically be mapped uncached.

	Sunil.

Adam Mallory wrote:
>>I agree with the basic principle of this change, and I think it looks OK
>>for non-RAM fake quanta. However, the RUSH3 comments in pq_map just above
>>the turning off of PROT_NOCACHE make me wonder if there's something more
>>tricky to take care of for RAM?
> 
> 
> The comments referred to specialized RAM not in the SYSRAM description of the syspage.  I contend that such RAM is
> outside of the OS's control and that it's up to the people mapping that special RAM to do the 'right thing', so I'm
> not in agreement with the comment.
> 
> IMHO, if these special RAM areas are mapped all cacheable or all uncached, life should continue to be fine.  If
> mappings are mixed (i.e. cacheable once, unmapped and remapped uncached later), those developers need to flush the
> cache themselves via the cache lib in those special cases.  I also think the comment in the code suggests the
> optimization was to avoid the double pte manipulation, not the cache flush itself (which is the much larger cost in
> this case) - I could be wrong.
> 
> The alternative option here, if others in the OS group feel I'm wrong, would be to poison the cache outright rather
> than spend our time flushing everything line by line (which is also how other OSes do it).  It's not free, but it
> bounds the worst case at a much more reasonable level (I think around 30-50 ms).  In this case, the board I'm
> testing on has a 16 KB L1 data cache - we could implement something like the following (obviously not exactly; it's
> just illustrative):
> 
> #define KB(x) ((x) * 1024)
> #define MAX_FLUSH_SIZE KB(32)  /* empirical crossover point - see below */
> 
> extern unsigned const_linesize;   /* cache line size, known at boot */
> extern unsigned const_nways;      /* cache associativity, known at boot */
> 
> void dcache_flush(int nlines) {
>   static char poison_buff[MAX_FLUSH_SIZE];  /* this could be a mapping virtualizing the range to one page perhaps */
>   static volatile char dummy;   /* volatile so the reads below aren't optimized away */
>   int i;
> 
>   if((nlines * const_linesize * const_nways) > MAX_FLUSH_SIZE) {
>     /* Large flush: read through the poison buffer so the whole cache
>      * is evicted (and written back) by line replacement. */
>     for(i = 0; i < MAX_FLUSH_SIZE; i += const_linesize) {
>       dummy = poison_buff[i];
>     }
>   } else {
>     /* do it line by line as before */
>   }
> }
> 
> MAX_FLUSH_SIZE is an empirical measure comparing the time to flush via lines versus via poisoning.  On my board the
> two are equal just below 32 KB worth of flushing, so 32 KB is a good point to choose one over the other; it will be
> different for every CPU/cache.  I did look at doing this in the startup callouts, but I don't see any reasonable way
> to find a mapping, big enough and persistent, to provide the poison data from.  At least, I couldn't see how to do
> it without a lot of contortions via the patcher routines, or without putting a constant data array in the assembly
> callout itself (which is just awful, as it increases the callout size by a large amount).
RE: cache flushing  
> I only mentioned that because I wasn't 100% sure what the comment was
> really describing, so I wanted to see if anyone could step in and point
> out some non-obvious flaw in your code.

We're both on the same wavelength here :).  I'm certainly not trying to
argue about anything with you specifically :)

> I see nothing incorrect in flushing the cache for fake quanta at unmap
> time vs. flushing at map time iff PROT_NOCACHE was on. The main benefit
> of doing it the current way is to avoid the cache flush for repeated
> cacheable mappings, but I would think that is pretty rare for fake quanta
> since most of the time these refer to peripheral registers/memory that
> would typically be mapped uncached.

Yep - shift the cost onto the rarer case rather than making the more
common case pay for it every time.



-- 
 Cheers,
    Adam

   QNX Software Systems
   [ amallory@harman.com ]
   ---------------------------------------------------
   With a PC, I always felt limited by the software available.
   On Unix, I am limited only by my knowledge.
       --Peter J. Schoenster