System health monitoring and corrective measure activation

System irregularities and overall performance degradation can be observed and measured in far greater detail thanks to the enabling technologies described in the other subprojects (trace abstraction, analysis and correlation, automated fault identification, adaptive fault probing, etc.). The objective of this research and development thread is to define and validate quantitative measures that can be used to assess global system health and to trigger appropriate corrective measures.

System health can be evaluated using an array of different, often complementary, approaches. A more traditional approach uses low-level trace metrics or statistics computed directly in the probes. Similar metrics may be built on higher-level abstracted events. Comparing correlated traces from redundant systems calls for techniques that extract the differences and measure their size and significance. Techniques such as edit distance, which is used to study the temporal evolution of source code in a project or to detect cloned source code blocks, may be applied.
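As an illustration of the edit-distance idea, here is a minimal sketch that compares the event sequences of two redundant systems. It assumes each trace has already been reduced to a list of event names; the function names `edit_distance` and `normalized_difference` and the sample events are hypothetical and not part of the project.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two event sequences.

    Counts the minimum number of event insertions, deletions and
    substitutions needed to turn trace `a` into trace `b`.
    """
    # prev[j] holds the distance between the first i-1 events of a
    # and the first j events of b.
    prev = list(range(len(b) + 1))
    for i, ev_a in enumerate(a, start=1):
        curr = [i]
        for j, ev_b in enumerate(b, start=1):
            cost = 0 if ev_a == ev_b else 1
            curr.append(min(prev[j] + 1,          # delete ev_a
                            curr[j - 1] + 1,      # insert ev_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def normalized_difference(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer trace length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical abstracted events from a primary system and its replica.
primary = ["open", "read", "read", "close"]
replica = ["open", "read", "write", "close"]
difference = normalized_difference(primary, replica)
```

Normalizing by the longer trace gives a size measure that is comparable across traces of different lengths; judging the significance of a given difference would still require a threshold chosen for the system at hand.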

A model of healthy behavior may be described explicitly or deduced from system characterization through the analysis of several traces. Any indication that the trace of a monitored system diverges from the healthy model is then tagged as suspicious. Additionally, strict access rules governing which processes may use which system resources may be defined and checked during trace monitoring; these access rules may be much more fine-grained, and thus more precise, than what the operating system supports. The pattern languages developed for automated fault identification and trace abstraction may be used to characterize important aspects of the system.
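One simple way to picture both ideas is sketched below. It assumes a healthy model can be approximated by the set of consecutive event pairs seen in known-good traces, and that access rules can be written as a table keyed by process and resource. This is one possible stand-in, not the project's pattern language; all names, events and rules here are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Bigram = Tuple[str, str]
Access = Tuple[str, str, str]  # (process, resource, operation)


def bigrams(trace: Sequence[str]) -> List[Bigram]:
    """Consecutive event pairs of a trace."""
    return list(zip(trace, trace[1:]))


def learn_healthy_model(traces: Iterable[Sequence[str]]) -> Set[Bigram]:
    """Collect every event pair observed in known-good traces."""
    model: Set[Bigram] = set()
    for trace in traces:
        model.update(bigrams(trace))
    return model


def divergence(model: Set[Bigram], trace: Sequence[str]) -> float:
    """Fraction of event pairs in `trace` never seen in the healthy model."""
    pairs = bigrams(trace)
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if pair not in model)
    return unseen / len(pairs)


def violations(accesses: Iterable[Access],
               rules: Dict[Tuple[str, str], Set[str]]) -> List[Access]:
    """Accesses whose operation is not allowed for that process and resource."""
    return [a for a in accesses if a[2] not in rules.get((a[0], a[1]), set())]


# Hypothetical known-good traces and a monitored trace.
model = learn_healthy_model([["open", "read", "close"],
                             ["open", "read", "read", "close"]])
suspicion = divergence(model, ["open", "read", "write", "close"])

# Hypothetical rule: the web server may only read its document root.
rules = {("httpd", "/var/www"): {"read"}}
flagged = violations([("httpd", "/var/www", "read"),
                      ("httpd", "/var/www", "write"),
                      ("httpd", "/etc/shadow", "read")], rules)
```

Because any access not listed in the table is treated as a violation, the rules default to denial, which matches the strict checking described above.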

The resulting system health or reliability measurement can be highly informative to the system administrator, who may use it to increase the level of surveillance or to activate additional protective and reactive measures, such as logging more information, modifying packet filtering rules, inserting and/or activating probes dynamically, modifying system behaviour, or simply shutting down some computers in order to protect them from hacking or malicious exploitation.
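The link between a health measurement and the activation of corrective measures can be sketched as a simple escalation policy. The indicator names, weights and thresholds below are illustrative assumptions only; the text does not prescribe any particular values or combination rule.

```python
from typing import Dict, List, Tuple

# Hypothetical escalation ladder: (minimum anomaly score, measure to activate).
ESCALATION: List[Tuple[float, str]] = [
    (0.2, "log more information"),
    (0.4, "activate additional probes"),
    (0.6, "tighten packet filtering rules"),
    (0.9, "shut down affected computers"),
]


def anomaly_score(indicators: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Weighted average of indicators, each assumed to lie in [0, 1].

    0 means fully healthy, 1 means maximally anomalous. Indicators
    missing from the input are counted as 0.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    weighted = sum(w * indicators.get(name, 0.0) for name, w in weights.items())
    return weighted / total


def corrective_measures(score: float) -> List[str]:
    """Every measure whose threshold the score has reached, mildest first."""
    return [measure for threshold, measure in ESCALATION if score >= threshold]


# Hypothetical indicator values for one monitored system.
score = anomaly_score({"trace_difference": 0.25, "model_divergence": 0.75},
                      {"trace_difference": 1.0, "model_divergence": 1.0})
measures = corrective_measures(score)
```

Keeping the ladder cumulative means that milder measures, such as extra logging, remain active while stronger ones are added, so the administrator keeps the additional diagnostic detail as the situation escalates.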

You can find project updates below and on this page.