The government, industry and the general public increasingly rely on a distributed, real-time, always-connected computing infrastructure (online servers and wireless computers, phones, handheld terminals, cameras and sensors) for tasks and activities ranging from entertainment and communication to management and security. The quality and efficiency of that infrastructure have a great impact on the speed and accuracy of execution, differentiating market leaders from lesser performers, or making a huge difference in the efficiency of defence operations. The complexity of that infrastructure is rapidly increasing now that each computing node contains several processors (cores); general purpose computers routinely contain 2 cores, newer game consoles include 3 to 9 cores, advanced graphics cards provide 128 cores and processors for handheld devices have been announced with 32 cores. The current generation of tracing and monitoring tools does not provide the functionality and performance needed to efficiently support the development and robust operation of these sophisticated systems. This results in much longer development time, poor performance and unreliability under heavy load (important event, natural disaster, battlefield).
The technical objectives of the project are to provide techniques and algorithms for: high-precision, low-overhead, low-disturbance software probing of online systems for tracing, monitoring, and adaptation to a changing environment; analyzing the combined traces coming from distributed networked multi-core nodes; using the combined traces to automatically measure the system health and identify a wide range of faults or more minor problems; correlating the events in the traces with other events and with the system models to determine properties such as the critical paths and resource bottlenecks, and to compute the expected performance if the system were updated (e.g. by adding disks or memory).
To achieve these technical objectives, the research work will focus on 6 complementary research and development threads: 1) adaptive fault probing, inserting or activating low-overhead probes in an online real-time system, even in interrupt context sections; 2) multi-level, multi-core distributed trace synchronization, synchronizing events from numerous, possibly huge, traces collected on distributed multi-core systems; 3) trace abstraction, analysis and correlation, abstracting low-level events, correlating events from multiple related traces and quantifying divergences; 4) automated fault identification, defining languages and structures for building fault symptom catalogs and detecting these symptoms; 5) system health monitoring and corrective measure activation, defining and measuring system health metrics, and studying the possible activation of protection, adaptation and optimization services; 6) trace-directed modeling, relating the events to a model of the traced system in order to provide higher-level answers, such as decomposing the time taken for a request or estimating the benefits of adding resources (e.g. disk or memory).
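As an illustration of thread 2 only, the sketch below merges per-node traces into a single time-ordered stream after mapping each node's local timestamps onto a common reference clock. It is a minimal sketch under stated assumptions, not the project's algorithm: the linear clock model (an offset and a drift factor per node, assumed to be known already), the `Event` record, and the function names are all hypothetical and chosen for this example.

```python
import heapq
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    """Hypothetical trace record; real trace formats carry far more fields."""
    timestamp: float  # time read from the clock of the recording node
    node: str
    name: str


def to_reference_time(event: Event, offset: float, drift: float) -> Event:
    # Assumed linear clock model: t_ref = drift * t_local + offset.
    return event._replace(timestamp=drift * event.timestamp + offset)


def merge_traces(
    traces: Dict[str, List[Event]],
    corrections: Dict[str, Tuple[float, float]],
) -> List[Event]:
    """Merge per-node traces (each sorted by local time) into one ordered list.

    `corrections` maps a node name to its (offset, drift) pair. How these
    values are estimated is outside the scope of this sketch.
    """
    corrected = []
    for node, events in traces.items():
        offset, drift = corrections[node]
        if drift <= 0:
            # A non-positive drift would reverse or collapse the event order.
            raise ValueError(f"drift for node {node!r} must be positive")
        corrected.append([to_reference_time(e, offset, drift) for e in events])
    # heapq.merge performs a k-way merge of already sorted inputs.
    return list(heapq.merge(*corrected))


# Usage: node B's clock is assumed to run 100 time units ahead of node A's.
traces = {
    "A": [Event(1.0, "A", "send"), Event(4.0, "A", "recv_ack")],
    "B": [Event(101.5, "B", "recv"), Event(102.5, "B", "send_ack")],
}
corrections = {"A": (0.0, 1.0), "B": (-100.0, 1.0)}
merged = merge_traces(traces, corrections)
```

Because a linear correction with positive drift preserves the order of events within each trace, a k-way merge is sufficient here and no global re-sort is needed, which matters when the traces are very large.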
The project will provide industry and the government with new, accurate, low-overhead algorithms and tools to study and monitor the behavior, performance and general health of complex multi-core online distributed systems. The project structure will be particularly efficient in achieving these goals, teaming academic groups that concentrate on the research issues outlined here with industrial groups that work on integrating the algorithms and techniques developed, and on presenting them graphically in an effective way, to better serve their developers. The structure will thus ensure a tight feedback loop and a continuous technology transfer to test the new algorithms and techniques in real industrial and governmental settings, and thus help select the best alternatives and improve the overall effectiveness of the newly developed tools. Industry and the government will therefore get more efficient and accurate tools to greatly improve the process of developing and operating new complex computing systems. The academic groups will get the best possible testbed and user community to get feedback on the effectiveness of the proposed techniques and algorithms.
The project brings together 5 faculty members: 3 senior professors with extensive industrial collaboration experience and 2 junior professors keenly interested in industrial applications. It will contribute to the training of 8 Master's students, 2 Ph.D. students, 1 research associate and 15 undergraduate summer students. This research team will interact directly with several research and development scientists and internal clients at Ericsson Canada and at Defence R&D Canada, National Defence Canada and Public Safety Canada, who are financially and technically supporting the project. These strong contributions to the Information Systems area, for industry and government facilities, and to the training of highly qualified personnel in universities, will bring direct and important benefits to Canada, given the strategic importance of this sector for the economy and security, worldwide and particularly in Canada.