foundry27 : Post

Forum Topic - Disk Errors are cached?: (8 Items)

Andy Rhind

09/11/2016 10:24 PM

post116768

Hi:

I have a QNX4 system running Fsys 4.24Z and Fsys.atapi 4.25G.
Fsys is started with "Fsys -H disk80 -r 512 -c64M".

There is a problem with the motherboard and disk that occasionally causes them to throw an error 5 (I/O error).
Traceinfo says:
"Sept 11 11:45:02 2 00003001 Bad Block 00C1ECE9 on /dev/hd0 flagged in cache"

I stop the test program, and there is no other disk activity.
spatch /dev/hd0 c1ece9 reports a bad block.
dcheck -f 12709097 -b2 /dev/hd0 also reports the bad block.
Both consistently return the error for the next 30 minutes.
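
For readers cross-checking the numbers above: the block in the traceinfo line (00C1ECE9, hexadecimal) and the block given to dcheck -f (12709097, decimal) are the same block. A minimal C sketch of that conversion follows. The helper names are hypothetical, and the 512-byte block size used for the byte offset is an assumption, not something confirmed in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse a block number printed in hex, as in the
 * traceinfo line ("00C1ECE9") or the spatch argument ("c1ece9"). */
unsigned long parse_hex_block(const char *text)
{
    return strtoul(text, NULL, 16);
}

/* Hypothetical helper: byte offset of a block on the device, assuming
 * fixed-size blocks (512 bytes is an assumption here). */
unsigned long long block_to_byte_offset(unsigned long block, unsigned block_size)
{
    assert(block_size > 0);
    return (unsigned long long)block * block_size;
}
```

With these helpers, parse_hex_block("00C1ECE9") gives 12709097, which matches the value passed to dcheck -f.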

Then, when I run a dcheck that scans more than 64 MB of disk (see the -c64M in Fsys's startup parameters), the error goes
away. A full dcheck also reports all blocks as OK.
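
This observation suggests, but does not prove, that the flagged block only leaves the cache once enough other data has been read to push it out. A minimal sketch of forcing that from a program is to read more than the cache size from some path and discard the data. The function name is hypothetical, only standard open/read/close calls are used, and the idea that Fsys evicts older entries this way is an assumption drawn from the dcheck behaviour described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `want` bytes sequentially from `path`
 * and throw the data away. Returns the number of bytes actually read
 * (less than `want` if end of file is reached), or -1 if the open or a
 * read fails.
 */
long long read_and_discard(const char *path, long long want)
{
    static char buf[64 * 1024];   /* scratch buffer, contents ignored */
    long long total = 0;
    int fd;

    assert(want >= 0);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    while (total < want) {
        long long left = want - total;
        size_t chunk = left < (long long)sizeof buf ? (size_t)left : sizeof buf;
        ssize_t n = read(fd, buf, chunk);

        if (n < 0) {              /* read error: give up */
            close(fd);
            return -1;
        }
        if (n == 0)               /* end of file or device */
            break;
        total += n;
    }

    close(fd);
    return total;
}
```

A call such as read_and_discard(path, 65LL * 1024 * 1024) would read a little over 64 MB. Whether that is enough to clear the flagged block on a given system is untested here.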

When I check the disk's SMART stats, there are 1) no relocated sectors, 2) no reallocation events, 3) no Current Pending
sectors, and 4) no Un-correctable sectors. The disk looks perfect.

It seems I have two problems:
1) Something caused a false report of an I/O error (5) on write. This is probably a single occurrence.
2) The I/O error seems to have been cached, and when the operation is retried the cached failure is returned. This makes
recovery/retry code fail.

Yes, the I/O error is not a good thing and is being looked into, but that research is made much harder by the caching of
the error.

Does my analysis sound correct? If so, is there a way to stop Fsys/Fsys.atapi from caching the disk error?

Thanks,
Andy

Oleg Bolshakov

09/13/2016 7:22 AM

post116772

Re: Disk Errors are cached?

Hi Andy,

This may be an issue in Fsys. I'll look into it in depth. Can you post all the Fsys traceinfo messages here?

Respectfully,
Oleg

> _______________________________________________
> 
> QNX4 Community Support
> http://community.qnx.com/sf/go/post116768
> To cancel your subscription to this discussion, please e-mail qnx4-community-unsubscribe@community.qnx.com

Andy Rhind

09/13/2016 11:16 PM

post116777

Re: Disk Errors are cached?

Oleg:

Thanks for this. Attached are two screen pictures taken from traceinfo. The 7th of August is when the client's
application spotted the error. Since most of their system was still running, the bad block eventually (after 5 minutes)
disappeared.

The 11th of September one was caught using only a test program. It was taken 30 minutes after the initial error; the
trace lines are caused by my using spatch to examine the block.

So far I've looked for ways to flush or clear the cache, including sync and ioctl() calls.

Thanks,
Andy

Attachment:

20160912_104032.jpg 79.89 KB

20160807_182436.jpg 59.16 KB

Oleg Bolshakov

09/14/2016 3:43 AM

post116778

Re: Disk Errors are cached?

Hi Andy,

Did you get this issue on one particular disk/controller, or did it occur on different hardware? How can I
reproduce the issue?

Respectfully,
Oleg

Andy Rhind

09/15/2016 12:08 AM

post116790

Re: Disk Errors are cached?

Oleg:

I wish I could easily reproduce it. It seems to happen when their Raima database does a purge. I've tried to create
scripts that simulate the original fault but can't do it. The customer has a process that floods the db and causes it to
purge often. This produces the real I/O error after somewhere between 1 and 5 days.

Interestingly, this has only been happening since we changed the motherboard to a Commell AS-C74. The original
motherboard is now EOL. We also use an Accordance ARaid. These two together are necessary to produce the error.

We don't have an easy way to create the original error.

Andy

Oleg Bolshakov

09/16/2016 8:48 AM

post116801

Re: Disk Errors are cached?

Hi Andy,

It seems that this issue can't be explored without the hardware. It's too difficult to reproduce (it takes up to 5 days).
Can you replace the hardware as a workaround?

Respectfully,
Oleg

Andy Rhind

09/19/2016 2:44 AM

post116810

Re: Disk Errors are cached?

Oleg:

Yes, I agree the original problem is hard to reproduce, and 5 days is forever when testing. But 5 days is a very short
time in production. I believe the problem is with the Commell motherboard. Finding a QNX4-compatible replacement will
take time and money, and then it needs to be tested for the current problem. We have a possible replacement for the
ARaid, but testing is ongoing.

Sure, the original, temporary I/O error happens, is a bad thing, and needs to be stopped. I posted here to try to
understand the stickiness of the error, to understand its reason, and to find a way to get around it in the short term.

We have the source, and the error happens in one place. If we could invalidate the cache, retry, and move on (or not, if
it's a real error), this would help production and allow us time to get the best replacement.
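
The "invalidate, retry, and move on (or not)" idea can be sketched in C roughly as follows. This is only an illustration under stated assumptions: all names are hypothetical, the eviction step is a placeholder callback (this thread does not establish that Fsys offers any invalidation call; the only thing reported here to clear the stale error is reading more than the cache size of other disk data), and fake_write/fake_evict are stand-ins so the logic can be exercised without a disk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int  (*io_fn)(void *ctx);     /* returns 0 on success, -1 with errno set */
typedef void (*evict_fn)(void *ctx);  /* placeholder for "clear the stale error" */

/*
 * Hypothetical wrapper: attempt the I/O; on EIO, run the eviction step and
 * retry, at most max_retries times. Returns 0 on success. Returns -1 if the
 * error is not EIO, or if EIO persists after all retries (treated as real).
 */
int io_with_retry(io_fn do_io, evict_fn evict, void *ctx, int max_retries)
{
    int attempt;

    assert(do_io != NULL && max_retries >= 0);

    for (attempt = 0; ; attempt++) {
        if (do_io(ctx) == 0)
            return 0;
        if (errno != EIO || attempt == max_retries)
            return -1;
        if (evict != NULL)
            evict(ctx);
    }
}

/* Stand-in "disk" for illustration only: fails with EIO while a stale
 * error is flagged, and counts how often the eviction step ran. */
struct fake_disk {
    int stale_error;
    int evictions;
};

int fake_write(void *ctx)
{
    struct fake_disk *d = ctx;

    if (d->stale_error) {
        errno = EIO;
        return -1;
    }
    return 0;
}

void fake_evict(void *ctx)
{
    struct fake_disk *d = ctx;

    d->stale_error = 0;
    d->evictions++;
}
```

In real code, do_io would wrap the one place where the write fails, and a persistent EIO after the retries would be reported as a genuine disk error rather than retried forever.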

1) I see that dcheck has a 'disable drive error correction' option. Is this a suitable candidate?

2) Is there an ioctl() that will do the cache invalidation?

3) Is all this caused by Fsys/Fsys.atapi caching I/O errors when they shouldn't be cached? If so, can that behavior be
stopped?


Thanks,
Andy

Oleg Bolshakov

09/19/2016 4:54 AM

post116811

Re: Disk Errors are cached?

Hi Andy,

> 1) I see that dcheck has a 'disable drive error correction' option. Is this a suitable candidate?

It seems this option only makes sense for Fsys.ata (the old driver), to disable the ECC check.

> 2) Is there an ioctl() that will do the cache invalidation?

Did you try the sync utility to flush the caches? Did it help?

> 3) Is all this caused by Fsys/Fsys.atapi caching I/O errors when they shouldn't be cached? If so, can that behavior be
> stopped?

The disk cache is managed by Fsys, not by the disk driver. I don't know how to disable caching in this particular case.

> We have the source, and the error happens in one place. If we could invalidate the cache, retry, and move on (or not,
> if it's a real error), this would help production and allow us time to get the best replacement.

I understand this. Unfortunately, I can't debug the issue without being able to reproduce it. Moreover, the
debugging may take quite a long time.

Respectfully,
Oleg
