Forum Topic - ETFS 2048: hack workaround for read errors in spare area : (4 Items)
   
ETFS 2048: hack workaround for read errors in spare area  
Quality Engineering at STMicro and new data sheets from Numonyx require that the spare area be protected from read 
errors with ECC.  A read error here means that, after some period of time, a bit changes from the programmed state 
(0) to the unprogrammed state (1).  This new requirement was probably not anticipated by the ETFS design.

Recall that the data portion of the page already has ECC to correct this error condition.

The following code snippet from devio.c devio_readcluster() will prevent a DATAERR from occurring in the case of a 
single bit releasing from a 0 to a 1 in the spare area:

			trp->tacode = ETFS_TRANS_ERASED;
		else
			trp->tacode = ETFS_TRANS_FOXES;

	// hack >>>>
	else if(dev->crc32((uint8_t *) sp, sizeof(*sp) - sizeof(sp->crctrans)) != sp->crctrans) {
		// Brute-force contingency: walk a 0 through the spare area,
		// since the failure mode is a single bit releasing from 0 to 1.
		unsigned i, j;
		uint32_t mask;
		uint32_t *ptr32 = (uint32_t *) sp;

		// Iterate through the spare area 32 bits at a time.
		for(j = 0; j < sizeof(struct spare) / sizeof(uint32_t); j++, ptr32++) {
			for(i = 0; i < 8 * sizeof(uint32_t); i++) {
				// Unsigned constant: 1 << 31 on a signed int is undefined.
				mask = (uint32_t)1 << i;
				// Only need to try making 1's a 0.
				if((*ptr32 & mask) == mask) {
					*ptr32 = *ptr32 & ~mask;
					// Retry the CRC of the spare.
					if(dev->crc32((uint8_t *) sp, sizeof(*sp) - sizeof(sp->crctrans)) == sp->crctrans) {
						// Error in spare found and corrected.
						dev->log(_SLOG_ERROR, "readcluster trans DATAERR FORCED CORRECTION on cluster %d", cluster);
						trp->tacode = ETFS_TRANS_OK;
						goto spare_area_2048_corrected;
					} else {
						// This wasn't the bit in error, so undo the change.
						*ptr32 = *ptr32 | mask;
					}
				}
			}
		}
		// Falling through means we were not able to correct the problem.
		dev->log(_SLOG_ERROR, "readcluster trans DATAERR on cluster %d", cluster);
		trp->tacode = ETFS_TRANS_DATAERR;
	} else
		trp->tacode = ETFS_TRANS_OK;

	spare_area_2048_corrected:
	// hack <<<<

	// Build transaction data from data in the spare area.
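
For anyone who wants to try the idea outside the driver, here is a minimal standalone sketch of the same walking-zero search. It is not ETFS code: `crc32_calc()` is a plain bitwise CRC-32 standing in for `dev->crc32()`, and `walk_zero_repair()` is a hypothetical helper name. Unlike the snippet above, it walks only the CRC'd bytes and not the stored CRC word itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320).
 * Assumption: a stand-in for the driver's dev->crc32(). */
static uint32_t crc32_calc(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Try to repair one bit that released from 0 to 1 in buf[0..len).
 * Returns 0 if the CRC already matches, 1 if one bit was corrected,
 * and -1 if no single 1 -> 0 change makes the CRC match (buf is then
 * left exactly as it was passed in). */
static int walk_zero_repair(uint8_t *buf, size_t len, uint32_t want)
{
    assert(buf != NULL);
    if (crc32_calc(buf, len) == want)
        return 0;
    for (size_t i = 0; i < len; i++) {
        for (unsigned b = 0; b < 8; b++) {
            uint8_t mask = (uint8_t)(1u << b);
            if (!(buf[i] & mask))
                continue;              /* already 0, cannot be a released bit */
            buf[i] &= (uint8_t)~mask;  /* tentatively force the bit back to 0 */
            if (crc32_calc(buf, len) == want)
                return 1;              /* CRC matches again: corrected */
            buf[i] |= mask;            /* not the culprit, undo the change */
        }
    }
    return -1;
}
```

Call `walk_zero_repair()` with the spare bytes, their length, and the CRC read from flash. A return of -1 corresponds to the DATAERR path in the driver snippet.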
Re: ETFS 2048: hack workaround for read errors in spare area  
That's a potentially expensive operation, especially when there are multiple bit errors and you end up running 
several hundred CRC calculations for no gain.  I DO think there is a potential use for this idea in conjunction with 
an ECC across the spare as well.
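
To put a rough number on "several hundred": the search only retries bits that currently read as 1, so the retry count is bounded by the number of set bits in the region being walked. The sketch below uses a hypothetical helper, `worst_case_crc_retries()`, to count them. Assuming a 64-byte spare area (typical for 2048-byte-page NAND), the bound is 512 retries when every bit reads as 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the 1 bits in buf[0..len). A walking-zero search retries the CRC
 * at most once per 1 bit, so this is its worst-case retry count. */
static unsigned worst_case_crc_retries(const uint8_t *buf, size_t len)
{
    unsigned ones = 0;
    for (size_t i = 0; i < len; i++) {
        /* v &= v - 1 clears the lowest set bit on each pass */
        for (uint8_t v = buf[i]; v != 0; v &= (uint8_t)(v - 1))
            ones++;
    }
    return ones;
}
```

The worst case is hit in full whenever the spare has more than one bad bit, since every candidate is tried before the search gives up.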

Having a proper ECC covers you for single bit errors in the CRC'd portion of the spare, but the hole in the logic is 
a single bit error in the ECC itself.  This error would go unnoticed as long as the CRC check passed (checking the ECC 
as well on each read would be bad for performance).  That leaves open the possibility that we later also develop a 
single bit error within the CRC'd area.  If that happened we would be unable to repair either error, and the cluster 
would have to be considered bad.

If, as a last resort, we were to attempt a brute-force repair of the CRC'd area, we would be able to correct a single 
bit, which would in turn allow us to recalculate the ECC and repair the entire spare.  If the brute-force repair were 
unsuccessful, it would imply that there were multiple errors in the CRC'd area anyway and we were never going to have 
a chance at fixing it.

This is probably a pretty small window in the actual filesystem, since the wear leveling will likely move the data 
around before we hit this situation, but in the raw area where people may have boot images this is something we might 
want to take a look at.
Re: ETFS 2048: hack workaround for read errors in spare area  
Actually I did encounter this in the raw partition first.  I implemented the hack in our IPL to resurrect boards that 
would not scan the OS when the cluster was tossed out.  We got a "bad" batch of STMicro NAND parts.  STMicro was able 
to bound the read error to being only 1 bit per 256B per page.


Re: ETFS 2048: hack workaround for read errors in spare area  
I think the best way to do it would be using a proper ECC algorithm...

Currently there are 41 bytes used in the spare area, if I counted correctly.
This means that we have 23 bytes of room for an ECC for the spare area - this is plenty!
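
As a rough check on "plenty", and assuming a plain single-error-correcting Hamming code (the thread does not settle on an algorithm), the number of check bits r for m data bits is the smallest r with 2^r >= m + r + 1. The helper name `hamming_check_bits()` below is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest r such that 2^r >= data_bits + r + 1, i.e. the number of check
 * bits a single-error-correcting Hamming code needs for data_bits of data. */
static unsigned hamming_check_bits(size_t data_bits)
{
    unsigned r = 0;
    while (((size_t)1 << r) < data_bits + r + 1)
        r++;
    return r;
}
```

For 41 bytes (328 bits) this gives 9 check bits, or 10 with an extra overall parity bit for double-error detection. Either fits in 2 of the 23 free bytes.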

The only drawback is that there may be quite a performance penalty. This is bad. But I don't think one will get around 
this in the long run...?

Greetings,
 Marc