A methodology is proposed for explicitly measuring the increase in the risk of a processor error with increasing workload. By relating the occurrence of a CPU-related error to the system activity just prior to the error, the approach measures the dynamic CPU workload/failure relationship. The measurements show that the probability of a CPU-related error (the load hazard) increases nonlinearly with increasing workload; i.e., the CPU rapidly deteriorates as endpoints are reached. The load hazard is observed to be most sensitive to system CPU utilization, the I/O rate, and the interrupt rates. The results are significant because they indicate that it may not be useful to push a system close to its performance limits (the previously accepted operating goal), since the slight gain in performance is more than offset by the degradation in reliability. The results also indicate that conventional reliability models need to be reevaluated to take system workload explicitly into account.
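The idea of relating errors to the workload observed just before them can be illustrated with a simple binned conditional-probability estimate. The following is only a minimal sketch under stated assumptions, not the estimator used in the paper: the record layout (a utilization percentage paired with a flag saying whether a CPU-related error followed), the function name `empirical_load_hazard`, and the bin width are all hypothetical.

```python
from collections import Counter


def empirical_load_hazard(samples, bin_width=10):
    """Estimate P(error | workload bin) from observation intervals.

    samples: iterable of (utilization_percent, error_followed) pairs, where
    utilization_percent is in [0, 100] and error_followed is True if a
    CPU-related error occurred right after that interval.
    This record format is an assumption made for illustration.
    """
    intervals = Counter()  # number of intervals observed per workload bin
    errors = Counter()     # number of those intervals followed by an error
    last_bin = 100 // bin_width - 1

    for utilization, error_followed in samples:
        # Clamp so that 100% utilization falls in the top bin.
        b = min(int(utilization // bin_width), last_bin)
        intervals[b] += 1
        if error_followed:
            errors[b] += 1

    # Fraction of intervals in each bin that were followed by an error.
    return {b: errors[b] / intervals[b] for b in sorted(intervals)}


# Made-up records, purely to show the call; they are not measurements.
samples = [(5, False), (5, False), (55, False),
           (95, True), (95, False), (95, True)]
hazard = empirical_load_hazard(samples)
```

With real data, a curve of these per-bin fractions that rises steeply in the highest bins would correspond to the nonlinear growth of the load hazard described above.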
ASJC Scopus subject areas
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics