Forum Topic - Thoughts on flushing the dcache more efficiently: (18 Items)
   
Thoughts on flushing the dcache more efficiently  
After some feedback from certain customers who shall go nameless, I've
given some more thought to how to obtain the speedup we need - flushing
more efficiently without penalizing the unmap side of the equation.

To recap, we've seen regressions (starting in 6.3.2) in process startup
time on some drivers/applications (devf is a good example).  This slowdown
is due to additional cache flushing we introduced to ensure the dcache
was sane when making device mappings.  The reason is mostly that our
cache flushing callouts flush the entire region regardless of its size,
and device mappings can get quite large.  In previous measurements I saw
almost a full second for devf just to map its flash array (almost all of
which was spent flushing the 32 MB mapping, cache line by cache line).

The attached diff is only a concept and not really how I'd like to
implement it in its final form.  There are comments about where I think
the change is weak or inadequate, and I'm looking for feedback on how to
improve it.  We're flushing/dumping the dcache in about 50 ms, which is
better and more in line with the mmap() performance from 6.3.0.

Effectively, the change implements a shortcut that triggers if we're
attempting to flush the dcache for a region larger than 3x the size of
the cache (an arbitrary amount based on some quick numbers on an ARM).  In
that case, instead of doing the line-by-line flush, the kernel pollutes
the cache for the size of the dcache (which effectively flushes the
original data out of the entire cache).
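
For illustration, a rough C sketch of the shortcut (the names here are
invented for the example and are not the actual kernel code from the
attached diff; cache_flush_line() stands in for the existing per-line
callout path):

#include <stddef.h>
#include <stdint.h>

#define POLLUTE_FACTOR  3                       /* the arbitrary 3x threshold */

extern void cache_flush_line(uintptr_t vaddr);  /* hypothetical per-line flush */
extern volatile const char *poison_base;        /* cacheable region >= dcache size */

void dcache_flush(uintptr_t vaddr, size_t len,
                  size_t dcache_size, size_t line_size)
{
    if (len > POLLUTE_FACTOR * dcache_size) {
        /* Big region: read one full cache's worth of unrelated data so
         * every line in the dcache gets evicted (written back if dirty). */
        for (size_t off = 0; off < dcache_size; off += line_size) {
            (void)poison_base[off];
        }
    } else {
        /* Small region: keep the existing line-by-line flush. */
        for (size_t off = 0; off < len; off += line_size) {
            cache_flush_line(vaddr + off);
        }
    }
}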

As mentioned in the comments, I'd really like to use a mapping the size
of the cache that is backed by a single page instead of abusing RAM the
way I am now, so if anyone has a good idea (other than the ones I've
already commented on), I'm all ears.

Feedback in general is also welcome.  I always enjoy being told I'm
crazy or silly :)


-- 
 Cheers,
    Adam

   QNX Software Systems
   [ amallory@qnx.com ]
   ---------------------------------------------------
   With a PC, I always felt limited by the software available.
   On Unix, I am limited only by my knowledge.
       --Peter J. Schoenster 

Attachment: Text flush.diff 2.4 KB
RE: Thoughts on flushing the dcache more efficiently  
I think the idea can work.  We need to make sure that we can handle
caches larger than 4K correctly -- that means we need to make sure that
the memory we allocate is contiguous.  Given that we're only reading
from the memory, we could do a MAP_PHYS to an arbitrary piece of RAM.
Alternatively, if the SoC has some on-chip memory that is very fast to
access (but still goes through the cache), we could save a lot of memory
bus activity by using it.
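
For what it's worth, a minimal sketch of that allocation (written at user
level for clarity; it assumes QNX's MAP_PHYS | MAP_ANON behaviour of
handing back physically contiguous RAM):

#include <sys/mman.h>
#include <stddef.h>

#ifndef NOFD
#define NOFD (-1)   /* QNX "no file descriptor" constant, normally from the headers */
#endif

/* Allocate a cacheable, physically contiguous region to read through when
 * dirtying the dcache.  Read access is enough to pull lines in. */
void *alloc_poison_region(size_t dcache_size)
{
    void *p = mmap(NULL, dcache_size, PROT_READ,
                   MAP_PHYS | MAP_ANON, NOFD, 0);
    return (p == MAP_FAILED) ? NULL : p;
}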

This solution is very generic -- it tries to work for all processors.
We could probably do better with a separate implementation for each
processor family.  That would, of course, be more work -- both now and
in the future.

Doug

RE: Thoughts on flushing the dcache more efficiently  
hmmm...  will it work on all cache architectures?  Are there any
multi-way caches out there that will randomly choose a way to use,
instead of using an LRU algorithm or something similar? 

RE: Thoughts on flushing the dcache more efficiently  
Well, my understanding (and I could be wrong) is that we describe our cache purely in terms of # of lines and line 
size, not ways.  So if the cache is 32K, 4-way set associative (16-byte line size), that would mean we have 2048 lines 
in our cache description, not 512 lines in 4 ways.  I could be wrong since I haven't looked at each architecture and 
its cache setup.  My idea is that when we flush and it's > 3*2048 lines' worth of mapping, I just pollute the 2048 
lines of cache and call it a day instead of letting the callout do 6144 lines (or more if the mapping is bigger).
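
To put numbers on that (this just restates the arithmetic above):

#include <stdio.h>

int main(void)
{
    unsigned cache_size = 32 * 1024;                /* 32K cache            */
    unsigned line_size  = 16;                       /* 16-byte lines        */
    unsigned lines      = cache_size / line_size;   /* 2048 lines described */
    unsigned threshold  = 3 * lines * line_size;    /* 98304 bytes (96K)    */

    printf("lines=%u, pollute when the mapping is > %u bytes\n", lines, threshold);
    printf("line-by-line at the threshold would walk %u lines\n", 3 * lines);
    return 0;
}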

The multilevel cache logic is higher up, so we should be doing the right thing (at least as correct as it was before) 
and the castouts should generate snoop transactions in SMP.  I haven't completely convinced myself that preemption 
can't occur after we start flushing on one CPU and push us to another CPU to flush the rest (or whether it even 
matters), but if that were an issue it would already be an issue with the existing flushing mechanism.

The lack of MAP_PHYS is my blunder; the memory should be contiguous to support whatever cache size is being flushed, 
thanks for pointing that out.  You're also correct about the arbitrary RAM; as in my comments, I'd rather use some 
existing known location than allocate one, as that is wasteful.  My original idea was to provide a CPU-specific poison 
base address which each CPU or architecture could define.  That way, if there is a fast RAM or an architecture-specific 
spot we could exploit, the code wouldn't change.  I think it might even be possible to allow a hint in startup so it 
could be more board specific, but for the sake of discussion I did this simplistic mmap just to get the conversation 
going.

I do understand your point about being too generic, but for the sake of an initial implementation/proof of concept, 
can you blame me? :)

--
 Cheers,
    Adam

   QNX Software Systems
   [ amallory@qnx.com ]
   ---------------------------------------------------
   With a PC, I always felt limited by the software available.
   On Unix, I am limited only by my knowledge.
       --Peter J. Schoenster



Re: Thoughts on flushing the dcache more efficiently  
Your understanding of the 'lines' field is correct - in 6.4.0 we
did add a 'ways' field (not filled in by all the startups yet) so that
procnto could figure out the associativity.

Anyhow, I don't think you can do it this way - as Doug pointed out, with
a multi-way cache, there's no generic mechanism you can use to ensure that
all the ways have been flushed (e.g. a cache using random replacement).

I think the place for the solution is in the callouts. They know what
the replacement policy is and they'd know if there's a fast mechanism
to flush/invalidate the full cache. If they see a big request, they
can just do their quick thing and return zero - the upper levels treat
that as an indication that the whole cache is done and early out. The
downside is, of course, that we have to implement the fix multiple times
and in assembler, but I think it's the best way of making sure that the
cache is actually flushed/invalidated.
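
A tiny sketch of what that early-out could look like in the upper-level C
code (the names are illustrative, not the actual procnto source):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk callout: returns the number of bytes it handled,
 * or 0 to say "I flushed/invalidated the whole cache in one shot". */
extern size_t cache_callout(uintptr_t vaddr, uint64_t paddr, size_t len);

void cache_walk(uintptr_t vaddr, uint64_t paddr, size_t len)
{
    while (len != 0) {
        size_t done = cache_callout(vaddr, paddr, len);
        if (done == 0) {
            break;              /* whole cache done: early out */
        }
        vaddr += done;
        paddr += done;
        len   -= done;
    }
}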

A bit of a side issue, but there is an optimization to the upper layers
that could be done that might help in some cases. If it's just a flush
request (with no invalidation), the code should look to see if the 
CACHE_FLAG_WRITEBACK flag is on. If not, we don't have to do anything since
a writethrough cache will never have any dirty entries.
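
In code, that check might look something like this (it assumes the
CACHE_FLAG_WRITEBACK definition from <sys/syspage.h>; the rest is
illustrative):

#include <stdbool.h>
#include <sys/syspage.h>

/* A flush-only request (no invalidation) on a write-through cache has
 * nothing to write back, so the whole operation can be skipped. */
static bool flush_only_is_noop(unsigned cache_flags, bool invalidate)
{
    return !invalidate && (cache_flags & CACHE_FLAG_WRITEBACK) == 0;
}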

	Brian

RE: Thoughts on flushing the dcache more efficiently  
> Your understanding of the 'lines' field is correct - in 6.4.0 we
> did add a 'ways' field (not filled in by all the startups yet) so that
> procnto could figure out the associativity.

Good to know, thanks.

> Anyhow, I don't think you can do it this way - as Doug pointed out, with
> a multi-way cache, there's no generic mechanism you can use to ensure that
> all the ways have been flushed (e.g. a cache using random replacement).

From my experience (limited as it may be), that ability tends to be a
configuration option alongside others such as LRU.  Do we currently
have boards that do random replacement per way, and do we configure them
that way?

> I think the place for the solution is in the callouts. They know what
> the replacement policy is and they'd know if there's a fast mechanism
> to flush/invalidate the full cache. If they see a big request, they
> can just do their quick thing and return zero - the upper levels treat
> that as an indication that the whole cache is done and early out. The
> downside is, of course, that we have to implement the fix multiple times
> and in assembler, but I think it's the best way of making sure that the
> cache is actually flushed/invalidated.

Putting aside the fact that I don't agree with constantly writing code in
assembler and having to repeatedly do the same thing over and over,
slightly differently each time (with the inherent bugs that come with
doing it that way), I think we're limited in what the callouts can do.
For the sake of argument, say I wanted to pollute the cache as I've done
in the kernel (and I know the cache doesn't use random replacement).
What base address could I use to do this?  The startup writers wouldn't
know of a common area that could be 'abused', so they would be forced to
mmap the memory, which is wasteful, and it would have to be done via a
patcher routine, which is really unfortunate.

I guess if we could provide such an address to the callouts, they could
optionally use it.  I'll have to marinate on that a little longer.

IMHO, all of this startup work presents a barrier.  Is it
insurmountable?  No, but I think it's fair to say that giving startups
the option of implementing a 'flush all' code path or a new callout will
result in most of them not bothering, since things work either way.
Even if every new startup adopts the new flush, older and current
startups are unlikely to be reworked, and it will be left until a
customer complains about system startup performance of drivers etc.

I guess I agree from a pure technical standpoint that custom code
belongs in the startup arena, but the realist and practical side of me
hates doing things many times, and if something could benefit everyone
it should be done in the OS.

In any case, I'll think about it some more...

> A bit of a side issue, but there is an optimization to the upper layers
> that could be done that might help in some cases. If it's just a flush
> request (with no invalidation), the code should look to see if the
> CACHE_FLAG_WRITEBACK flag is on. If not, we don't have to do anything
> since a writethrough cache will never have any dirty entries.

Yeah, it might help in very limited cases.

--
 Cheers,
    Adam

   QNX Software Systems
   [amallory@qnx.com]
   ---------------------------------------------------
   With a PC, I always felt limited by the software available.
   On Unix, I am limited only by my knowledge.
       --Peter J. Schoenster
RE: Thoughts on flushing the dcache more efficiently  
We don't have to go all-or-nothing...  The common C code can support the
cache-dirty mechanism, and the callout can do a board-specific solution.
The startup can indicate what support the callout provides and whether
or not there is a good/cheap address to use for the cache-dirtying
solution, and if the startup indicates nothing (i.e. unoptimized
startup, or older startup) the kernel can fall back to a generic
cache-dirtying mechanism that uses a MAP_PHYS.
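
Roughly, the decision logic would be something like this (all names are
hypothetical):

enum flush_strategy {
    FLUSH_CALLOUT_FAST,      /* board-specific whole-cache op in the callout */
    FLUSH_POISON_AT_HINT,    /* cache-dirtying at the address startup gave us */
    FLUSH_POISON_MAP_PHYS    /* generic cache-dirtying via a MAP_PHYS region  */
};

enum flush_strategy pick_strategy(int callout_has_fast_path,
                                  int startup_gave_poison_addr)
{
    if (callout_has_fast_path) {
        return FLUSH_CALLOUT_FAST;
    }
    if (startup_gave_poison_addr) {
        return FLUSH_POISON_AT_HINT;
    }
    return FLUSH_POISON_MAP_PHYS;   /* unoptimized or older startup */
}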

I think this works, with whatever optimizations we provide for specific
boards, for everything but the hypothetical multi-way cache that does
random way selection.  Is there such a beast?

Doug


RE: Thoughts on flushing the dcache more efficiently  
> We don't have to go all-or-nothing...  The common C code can support the
> cache-dirty mechanism, and the callout can do a board-specific solution.
> The startup can indicate what support the callout provides and whether
> or not there is a good/cheap address to use for the cache-dirtying
> solution, and if the startup indicates nothing (i.e. unoptimized
> startup, or older startup) the kernel can fall back to a generic
> cache-dirtying mechanism that uses a MAP_PHYS.

I was thinking the same.  We could even allow startup to dictate the
replacement policy (perhaps look at improving our cache description in
the syspage), but I really dislike the idea of pushing the logic into
startup and into assembler.
 
> I think this works, with whatever optimizations we provide for specific
> boards, for everything but the hypothetical multi-way cache that does
> random way selection.  Is there such a beast?

There certainly are caches capable of doing this, and some of the boards
allow for it, but I really don't know of any particular board/setup that
actually USES random replacement over FIFO or LRU (or PLRU).  If there
are, they are certainly in the minority, and for those I'd be OK with a
cache description marking the replacement policy; in those few cases we
can default to the original line-by-line flush.

-Adam
RE: Thoughts on flushing the dcache more efficiently  
The problem with the hypothetical boards that use a random replacement
policy is that we need changes to procnto that work with old startups.
That means that if such boards exist, we either need gear in procnto to
recognize them and handle them differently, or we only introduce the new,
more efficient behaviour on boards with new startups.  Neither option is
desirable...

Re: Thoughts on flushing the dcache more efficiently  
On Fri, Nov 28, 2008 at 09:56:33AM -0500, Adam Mallory wrote:
> From my experience (limited as it may be) that ability tends to be an
> option in configuration amongst others such as LRU.  Do we currently
> have boards that do random replacement per way and we configure it that
> way?

We may never set it up that way, but we can't stop external BSP developers
from doing it - procnto should be able to deal with it.

> > I think the place for the solution is in the callouts. They know what
> > the replacement policy is and they'd know if there's a fast mechanism
> > to flush/invalidate the full cache. If they see a big request, they
> > can just do their quick thing and return zero - the upper levels treat
> > that as an indication that whole cache is done and early out. The
> > downside
> > is, of course, that we have to implement the fix multiple times and in
> > assembler but I think it's the best way of making sure that the cache
> > is actually flushed/invalidated.
> 
> Putting aside that I don't agree with the constant writing of code in
> assembler and having to repeatedly do the same thing over and over but
> slightly different each time (and the inherent bugs that come with doing
> it this way).  I think we're limited in what the callouts can do.  For
> the sake of argument say I wanted to pollute the cache as I've done in
> the kernel (and I know the cache is no random replacement).  What base
> address could I use to do this?  The startup writers wouldn't know of a
> common area that could be 'abused' so they would be forced to mmap the
> memory which is wasteful and it would have to be done via a patcher
> routine which is really unfortunate.

True, but there don't tend to be too many cache callouts written - it tends
to be one per architecture, aside from chip bug workarounds (ARM being
the main exception). That means that we can correct the callouts in our
library and everyone will benefit.

> I guess if we could provide such an address to the callouts, they could
> optionally use it.  I'll have to marinate on that a little longer.

You're assuming that the callouts would use the poison method.
They typically wouldn't - they'll have better mechanisms for flushing 
the cache. Even if they do need to do loads, on SH/PPC/MIPS they
already have a one-to-one mapping area that they can use for the load
address. On ARM, there's a special address range just for doing this
kind of thing. On X86, the cache is super smart and we don't have a
callout at all.

> IMHO, all of this startup work presents a barrier.  Is it
> insurmountable? No, but I think it's fair to say that given the option
> of implementing an optional 'flush all' code path or new callout will
> result in startups that likely don't bother since it works either way.
> Even if every new startup starts doing the new flush, older and current
> startups are unlikely to be reworked and it will be left until a
> customer complains about system startup performance of drivers etc.
> 
> I guess I agree from a pure technical standpoint that custom code should
> belong in the startup arena but the realist and practical side of me
> says I hate doing things many times and if it could benefit everyone it
> should be done in the OS.

I basically agree, but this isn't like the interrupt callouts where every
board has its own thing - there's a much more limited set that needs to be
dealt with. If it weren't for wacky hardware developers who did things
like writing custom cache controllers in ASICs, we probably wouldn't even
have a cache callout - everything would have been self-contained in
procnto.

	Brian

-- 
Brian Stecher (bstecher@qnx.com)        QNX Software Systems
phone: +1 (613) 591-0931 (voice)        175 Terence Matthews Cr.
       +1 (613) 591-3579 (fax)         ...
RE: Thoughts on flushing the dcache more efficiently  
> You're assuming that the callouts would use the poison method.
> They typically wouldn't - they'll have better mechanisms for flushing
> the cache. Even if they do need to do loads, on SH/PPC/MIPS they
> already have a one-to-one mapping area that they can use for the load
> address. On ARM, there's a special address range just for doing this
> kind of thing. On X86, the cache is super smart and we don't have a
> callout at all.

From my understanding, on SH4 we have to have a vaddr+paddr match in the
cache operation to flush the line, which means we'd have to iterate over
the entire mapping to be sure we got every vaddr+paddr combination.  I
don't see how you could avoid doing a poison in that case (I'm not
familiar enough with MIPS/ARM to speak to the limitations there).  The
cache is memory mapped as well on the SH, and from speaking to Renesas,
on some parts you could use those addresses to flush the cache line
without a vaddr/paddr match.  I recall the documentation indicating that
the memory-mapped cache wasn't always going to be present (meaning a
board-specific derivation from the lib), and I think even the OS does
this in some cases.

> I basically agree, but this isn't the interrupt callouts where every
> board has it's own thing - there's a much more limited set that need
> to be dealt with. If it weren't for wacky hardware developers who
> did things like writing custom cache controllers in ASIC's we probably
> wouldn't even have a cache callout - everything would have been self
> contained in procnto.

Yeah, those hardware developers are always trying to save a flipflop :)

Alright I'll keep thinking.

-Adam
RE: Thoughts on flushing the dcache more efficiently  
With the SH I think we can write to the memory-mapped cache table and
that will cause the cache line to flush if it's valid.  The SH-4A
software manual says (under "OC Address Array/OC address array write")
"When a write is performed to a cache line for which the U bit and V bit
are both 1, after write-back of that cache line, the tag, U bit and V
bit specified in the data field are written."  The same statement exists
in the 7751 hardware manual.  So, marking a cache line as invalid
through the address array should cause a write-back.  This would require
just as many operations as the cache-poison mechanism, but would be
faster since it wouldn't require any access to the memory bus.
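
As a rough illustration only - this assumes the classic SH7750-style
layout (16 KB operand cache, 32-byte lines, OC address array memory
mapped at 0xF4000000, V and U in bits 0 and 1 of the data word); the
constants need to be checked against the actual part's hardware manual,
and the loop has to run privileged from an uncached region:

#include <stdint.h>

#define OC_ADDR_ARRAY   0xF4000000u     /* assumed OC address array base */
#define OC_LINE_SIZE    32u
#define OC_ENTRIES      512u            /* 16 KB / 32-byte lines */

/* Non-associative write of 0 to every entry: a line with U=V=1 is written
 * back first, then the entry is left with tag=0, U=0, V=0 (invalid). */
static void sh4_oc_writeback_invalidate_all(void)
{
    for (uint32_t entry = 0; entry < OC_ENTRIES; entry++) {
        *(volatile uint32_t *)(OC_ADDR_ARRAY + entry * OC_LINE_SIZE) = 0;
    }
}

A flush that leaves the line valid would presumably have to read each
entry first, so the tag and V bit can be written back with only U cleared.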

I'm not familiar with any documentation that says this feature might not
exist in some cases.  The associative write to the OC address array has
been deprecated, but the non-associative form still functions.

Regardless, Brian is correct that placing the functionality in the
startup lib means that we can use platform-specific mechanisms, which
will be more efficient for some boards.  But, yeah, even writing 5
versions of the algorithm in assembler won't be fun...

I'm still thinking about a hybrid mechanism that has some support in
both the start and in procnto that will save us some effort...

Doug

RE: Thoughts on flushing the dcache more efficiently  
> I'm not familiar with any documentation that says this feature might not
> exist in some cases.  The associative write to the OC address array has
> been deprecated, but the non-associative form still functions.

That could be what I'm remembering; it was a while ago that I read that
section.  Thanks for pointing out the distinction.

> Regardless, Brian is correct that placing the functionality in the
> startup lib means that we can use platform-specific mechanisms, which
> will be more efficient for some boards.  But, yeah, even writing 5
> versions of the algorithm in assembler won't be fun...

Even more of a concern than the effort of writing it in assembler is the
readability and debuggability later on, when a problem occurs.

> 
> I'm still thinking about a hybrid mechanism that has some support in
> both the start and in procnto that will save us some effort...

If you have some ideas, even if they are rough, I'm quite curious to
hear them!  For me it always degenerates into a startup discussion, since
both you and Brian believe that we need some way to allow for
board-specific tweaks, and I tend to think we could offer something
generic (i.e. poisoning) which is better than what we have today anyway,
and perhaps still allow the cache callout to do something smarter if
anyone decides to tweak it (yet another flag in the cache attr?).

Thanks for all the feedback guys, it's much appreciated!!
-Adam
RE: Thoughts on flushing the dcache more efficiently  
 
This is what it comes down to, I think...  Your cache poisoning idea is
a good generic solution, but it has the problem that it might not work in
some situations (those hypothetical random-way-selection caches) and it
won't be as good as it could be in others.  I'm stuck on the fact
that we need to be able to identify boards or architectures that the
cache-poisoning mechanism won't work on, and we need to be able to do
that without any assistance from startup, since the startup might not be
up to date...

The safest bet would be to introduce a new cache attribute which says to
go ahead and use cache-poisoning.  Then a startup can do nothing and get
the current solution, could introduce smarter cache callouts, or could
tell procnto to use the cache-poisoning mechanism.  This has the problem
that we don't get anything for free -- that is, we have to touch the
startup before a board will pick up an improved strategy.

If we want to use the cache-poisoning mechanism by default, and only
turn it off if told to by the startup, then we get the cache-poisoning
mechanism for free on most boards -- we only need to touch the startup
code if we have a better cache callout, or if we need to disable the
cache-poisoning mechanism for some reason.  The problem here is that
then we'll get the cache-poisoning mechanism all the time with old
startups, even if it's not safe.

And I'm back to the fact that we need procnto to be able to identify
boards for which cache poisoning isn't safe...  How can we do that if
"safe" is controlled by how the cache is configured in the startup or
the IPL?

Or, it's just occurred to me, we could find another generic solution that
is safe on all boards...  I don't know what that generic solution might
be, but we shouldn't fixate on cache poisoning (yet, anyway).

I'll keep thinking...

Doug



Re: Thoughts on flushing the dcache more efficiently  
Sorry for jumping in so late - it's been an interesting discussion...

The cache poisoning approach probably won't work generically on ARM/Xscale.
Many of the ARM/Xscale caches are (pseudo) random so you have to be careful
when trying to poison the cache - typical approaches include flipping between
two different cache areas so you guarantee that each poison attempt always
causes line fills. Also, there are often cpu-specific cache operations that
allow flush by set/way or even flush whole cache - these would have to be in
startup callouts since they are cpu-specific.
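
Something like this, for example (purely illustrative; the two areas just
need to be cacheable, dcache-sized regions that don't overlap):

#include <stddef.h>

static volatile const char *poison_area[2];   /* two cacheable, dcache-sized buffers */
static unsigned next_area;

/* Alternate which area is read on each pass so that every load misses and
 * forces a line fill, even with (pseudo-)random replacement. */
void poison_dcache(size_t dcache_size, size_t line_size)
{
    volatile const char *p = poison_area[next_area];
    next_area ^= 1;
    for (size_t off = 0; off < dcache_size; off += line_size) {
        (void)p[off];
    }
}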

Since startup tells procnto what memory is on the board in the first place,
I'm not sure I see why it's a problem to have startup map the area to use in
a callout to poison the cache itself - the patcher and rw data mechanism seem
to be sufficient to allow whatever implementation is required.

The fact that this is a little awkward to write is a separate issue...

	Sunil.

RE: Thoughts on flushing the dcache more efficiently  
> The cache poisoning approach probably won't work generically on ARM/Xscale.
> Many of the ARM/Xscale caches are (pseudo) random so you have to be careful
> when trying to poison the cache - typical approaches include flipping between
> two different cache areas so you guarantee that each poison attempt always
> causes line fills. Also, there are often cpu-specific cache operations that
> allow flush by set/way or even flush whole cache - these would have to be in
> startup callouts since they are cpu-specific.

I've started to convince myself there really isn't any alternative
except doing it in the callouts.  It's unfortunate IMHO but that's the
way it seems to be.

I've attached an example (in C) of the SH4 callout with what I think
might be the start of how to implement the SH4 specific speed up.  I
didn't include the patcher part for now.

@Doug, if you see something wrong or I've misread the SH4 manual, can
you let me know?

> Since startup tells procnto what memory is on the board in the first place,
> I'm not sure I see why it's a problem to have startup map the area to use in
> a callout to poison the cache itself - the patcher and rw data mechanism seem
> to be sufficient to allow whatever implementation is required.

I think, technically, I would be hard pressed to find something that
wasn't possible to implement in the startup callout.  The callout
environment is just much more limited and more difficult to exploit.
One challenge I'm still trying to work out is whether I can map a vaddr
range the length of the cache onto a single page, rather than allocate
RAM that is 1:1 with the cache size (wasteful).  I think I'm out of luck
in making that kind of mapping AND making it persistent into the OS.  It
could be that poisoning just isn't needed once we implement all the
platform-specific methods.  I just won't know until we do all the work :)

> The fact that this is a little awkward to write is a separate issue...

I agree and disagree :).  At the start of this conversation, I thought
that if we could generically poison caches, we wouldn't have to push the
pain onto all the board-variant startup developers.  Now that I think we
just aren't that lucky, my complaint about doing it in startup is a
little academic, since there isn't much choice.

That said, I think writing cache callouts to do board-specific flushing
+ generic flushing + invalidation is more than just a little awkward,
it's a LOT awkward.  My beef comes from a maintenance point of view as
well as initial development.  I find deciphering the code later quite
time consuming just to understand it, let alone find bugs in it.  Perhaps
it's just me - I'm not an assembly god, so I find it painful.

From what I gather, the general consensus is that we need to do the work
in the callouts.  At least it was an interesting topic, and I learned
that random (or pseudo-random) replacement is still alive and kicking in
the CPU cache world.


-- 
 Cheers,
    Adam

   QNX Software Systems
   [ amallory@qnx.com ]
   ---------------------------------------------------
   With a PC, I always felt limited by the software available.
   On Unix, I am limited only by my knowledge.
       --Peter J. Schoenster
Attachment: Text cc.c 1.44 KB
Re: Thoughts on flushing the dcache more efficiently  
As an alternative to writing it all in callouts, anyone care for libmod_cachestuff.a?


-- 
cburgess@qnx.com
Re: Thoughts on flushing the dcache more efficiently  
Colin Burgess wrote:
> As an alternative to writing it all in callouts, anyone care for libmod_cachestuff.a?

Well, I've always wondered about the possibility of putting all the
board-specific support code into a .o or similar that gets linked into
procnto at build time. This has some possible advantages:
- can be written as "normal" code (easier to develop?)
- the generated procnto.sym contains _all_ the runtime code (easier to debug?)

The downside is that there would need to be separate sets of startup/BSP code:
- the bootstrap/init code required to build the startup binary
- the procnto runtime code required to be linked into procnto
- the runtime code would also need to be linked into other bootstrap executables (eg. kdebug)

There are probably a host of other good reasons why the callout approach
is the best on balance, but I just wondered what anyone thinks...

	Sunil.
