Day 1 - Progress Report (Room M-2204)
9:45 - 10:00 Introduction
Mario Couture, Michel Dagenais and Dominique Toupin
Welcome and introduce the participants. Present the agenda.
10:00 - 11:00 - Adaptative Fault Probing
Adaptative fault probing is the base infrastructure to efficiently insert tracepoints at compilation and execution time, dynamically activate these tracepoints, and retrieve the tracing data. For whole system tracing, the tracepoints may be inserted in any layer, from hypervisor to operating system, virtual machine, system library and applications. The objective is the ability to trace with minimal disturbance any significant event in a distributed multi-core system: a low overhead, high throughput, whole system wait-free tracing infrastructure.
The presentation will describe the recent advances in tracing: the improvements to the LTTng kernel tracing infrastructure, the status of mainlining LTTng kernel tracing in the Linux kernel, the new user-space tracing library, and the links to complementary tracing technologies such as GDB tracepoints, Ftrace, Perf and SystemTap.
11:00 - 12:00 - Multi-level Multi-core Distributed Traces Synchronization
Trace synchronization is required to compute a common time base for all the traces collected on the multiple cores of multiple systems in a distributed multi-core system. Once all the events in these traces are brought to a common time base, whole system trace analysis becomes possible.
The new trace synchronization algorithm will be described along with recent optimizations. Ongoing extensions to synchronize traces in streaming mode will be presented. A new technique to efficiently synchronize virtual machine traces will also be presented. Finally, the architecture developed for integrating distributed, multi-level (kernel, physical and virtual machines, system libraries and applications) trace sources into a live trace visualization and analysis framework will be presented and discussed.
12:00 - 13:00 - Lunch Break
Lunch will be provided at Polytechnique.
13:00 - 14:00 - Trace abstraction, analysis and correlation
Abdelwahab Hamou-Lhadj and Waseem Fadel (slides)
The objective of trace abstraction is to replace several low-level events (e.g. disk blocks read requests, disk controller interrupts) by fewer high-level events (e.g. reading a file) in order to simplify the subsequent analysis of distributed multi-core execution traces. As a result, by abstracting low-level details, it should be easier to verify the correlation between two redundant systems executing the same commands, or between different releases of the same software. When significant differences are found, this may indicate an intrusion in one of the two redundant systems, or an error introduced in a software release.
A new framework for abstracting detailed operating system execution traces was developed and will be presented.
14:00 - 15:00 - Automated fault identification
Béchir Ktari and Hashem Mohamed-Waly (slides)
The objective of automated fault identification is to have an efficient system to describe fault patterns and automatically verify large execution traces against an extensible fault pattern dictionary. This may be used either to detect ongoing cyber-attacks or intrusions, or to rapidly identify common performance or programming problems.
A new language, used to describe different fault patterns visible in execution traces, will be presented. A new framework to define and search fault patterns in large execution traces will be described.
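As a rough illustration of checking a trace against a fault pattern dictionary, the sketch below models each pattern as a small finite-state machine over event names. This is an assumption made for illustration only: the `FaultPattern` class, the `scan` function, and the event names are hypothetical and do not reflect the actual pattern language or framework presented in this session.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class FaultPattern:
    """Hypothetical fault pattern: a finite-state machine over event names."""
    start: str
    accept: str
    transitions: Dict[Tuple[str, str], str]  # (state, event) -> next state


def scan(trace: Iterable[str], dictionary: Dict[str, FaultPattern]) -> List[Tuple[str, int]]:
    """Return (pattern name, event index) for every point where a pattern completes."""
    states = {name: pattern.start for name, pattern in dictionary.items()}
    hits: List[Tuple[str, int]] = []
    for index, event in enumerate(trace):
        for name, pattern in dictionary.items():
            # Events with no matching transition leave the state unchanged.
            nxt = pattern.transitions.get((states[name], event), states[name])
            if nxt == pattern.accept:
                hits.append((name, index))
                nxt = pattern.start  # restart the search after a match
            states[name] = nxt
    return hits


# Made-up pattern: three failed logins with no successful login in between.
dictionary = {
    "three_failed_logins": FaultPattern(
        start="s0",
        accept="alarm",
        transitions={
            ("s0", "login_fail"): "s1",
            ("s1", "login_fail"): "s2",
            ("s2", "login_fail"): "alarm",
            ("s1", "login_ok"): "s0",
            ("s2", "login_ok"): "s0",
        },
    )
}
trace = ["login_fail", "login_ok", "login_fail", "login_fail", "login_fail", "login_ok"]
hits = scan(trace, dictionary)
```

Because every pattern keeps only its current state, the trace is read once and new patterns can be added to the dictionary without changing the scanning loop.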
15:00 - 15:15 - Break
15:15 - 16:15 - Trace directed modeling
Timothy Lethbridge (slides)
The objective of this track is to connect high level models with low level tracing tools. Low level events represent actions which may be correlated with high level state transitions, thus enabling the display of the execution trace at the UML modeling level.
This presentation discusses the process of defining tracepoints in Papyrus UML models, generating executable code with these tracepoints, and relating the corresponding events at execution time back to the UML model.
16:15 - 16:45 - System health monitoring and corrective measure activation
Michel Dagenais and Alireza Shameli (pdf)
Once detailed execution traces for distributed multi-core systems are available, further processing abstracts low-level events into high-level events, measures different usage and performance metrics, detects known fault patterns, and looks for correlation or deviation from known good systems. As a result, high level information becomes available about the system health. The objective of system health monitoring is to determine and display the system health, trigger additional information collection through tracing if a problem in some area is suspected, and trigger corrective measures if a serious problem is found. Examples of corrective measures include limiting the resources consumed by some users to protect the quality of service for critical functions, adapting the firewall configuration when a system is under cyber-attack, or disconnecting a redundant system suspected of being compromised.
A new architecture for System Health Monitoring and Corrective Measure Activation will be presented. For each module (detection, risk assessment, decision, reaction) performance and accuracy are discussed.
16:45 - 17:15 - Tracing and Monitoring Framework Impact Prediction
The impact of activating sources of tracing data and reactive measures may or may not be acceptable and must therefore be estimated and monitored. To this end, it is necessary to have a model of the distributed system to trace and monitor.
This presentation will describe a new framework to monitor the tracing and monitoring data volume, and to model its impact. A new algorithm to store and index the system state data will also be presented.
Day 2 - In-progress implementation work demonstrations and Linux kernel tutorial (Room M-2204)
9:00 - 9:20 - Live tracing infrastructure
Alexandre Montplaisir, David Goulet, Yannick Brosseau
Kernel tracing provides an effective way of understanding system behavior and debugging problems in the kernel and in user-space applications. Tracing events that occur in application code can further help by providing access to application activity unknown to the kernel.
LTTng now provides a way of simultaneously tracing the kernel as well as the applications of several multi-core nodes in a distributed system. The kernel instrumentation and event collection facilities were adapted to user-space, yielding a similarly low overhead. An efficient algorithm for computing the clock differences between distributed nodes was implemented. It matches packet send and receive events on communicating nodes to estimate the clock differences and uses that information to align the different traces on a common time base. This presentation will demonstrate how to install the needed software from precompiled packages, and how to use LTTng for user-space and kernel tracing, for a posteriori and for live tracing. It will show examples of how correlating kernel and user-space events from distributed multi-core nodes can lead to successful debugging of complex problems. This work has been done in collaboration with Ericsson BNET, IBM Research, Novell, WindRiver and Fujitsu.
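The packet-matching idea can be sketched as follows. This is a simplified illustration, not the algorithm implemented in LTTng: it assumes a constant clock offset (no drift) and equal minimum network latency in both directions. The `Exchange` record, the function names, and the data are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical record of one matched packet:
# (send timestamp on the sender's clock, receive timestamp on the receiver's clock)
Exchange = Tuple[float, float]


def estimate_offset(a_to_b: Iterable[Exchange], b_to_a: Iterable[Exchange]) -> float:
    """Estimate (clock of B - clock of A), assuming symmetric minimum latency."""
    # For A -> B packets, recv - send = latency + offset.
    min_ab = min(recv - send for send, recv in a_to_b)
    # For B -> A packets, recv - send = latency - offset.
    min_ba = min(recv - send for send, recv in b_to_a)
    # The minimum latency cancels out under the symmetry assumption.
    return (min_ab - min_ba) / 2.0


def to_time_base_of_a(timestamps_b: Iterable[float], offset: float) -> List[float]:
    """Express timestamps taken on node B in node A's time base."""
    return [t - offset for t in timestamps_b]


# Made-up data: B's clock runs 100 units ahead of A's, minimum latency is 2.
a_to_b = [(10.0, 112.0), (20.0, 125.0)]
b_to_a = [(130.0, 32.0), (140.0, 47.0)]
offset = estimate_offset(a_to_b, b_to_a)
aligned = to_time_base_of_a([112.0, 130.0], offset)
```

Using the minimum observed difference in each direction keeps the estimate from being skewed by packets that were delayed in queues. A production algorithm would also have to account for clock drift over the duration of the trace.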
9:20 - 9:40 - System Health Monitoring
Alireza Shameli and Douglas Santos
The LTTng infrastructure, coupled with the Target Communication Framework, may be used to dynamically control the level of monitoring for a distributed system, extract detailed tracing data and aggregated status values, and remotely activate reactive measures.
9:40 - 10:00 - System State Display
Julien Desfossez, Rafik Fahem and Francis Giraldeau
Julien Desfossez will present an enhanced Control Flow View which models the interactions between a native Linux system and virtual Linux systems, using special states to represent when a native system delegates control to a virtual system and when a virtual system exits to the native system to serve a request. Rafik Fahem will show how GDB tracepoints can interact with user-space and kernel tracepoints. Francis Giraldeau will demonstrate recent enhancements to interactive views for statistics and system health metrics.
10:00 - 10:30 - Automated Fault Identification
Béchir Ktari and Hashem Mohamed-Waly
The detection of abnormal behavior is becoming essential in the surveillance of complex systems. A complete framework for the specification, detection and display of scenarios will be presented. For this purpose, a new scenario description language was created, with a full editor and type checker. We assume that our framework is generic enough to be used with any type of trace and any kind of scenario. The various parts of the framework will be presented along with the steps to install it.
10:30 - 10:40 - Break
10:40 - 11:00 - Trace abstraction, analysis and correlation
Abdelwahab Hamou-Lhadj and Waseem Fadel
The abstraction of system call traces is an essential step towards analyzing their content. In this presentation, we show the trace abstraction tool that we have built to abstract out the content of LTTng traces. The tool implements the research techniques that have been developed throughout the year, and which utilize a pattern library of Linux operations as the main mechanism for the abstraction process. We also show how new patterns can be easily added to the tool, making it flexible to extend and adapt.
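A minimal sketch of pattern-library-based abstraction is given below, assuming that a pattern is simply a fixed sequence of low-level event names mapped to one high-level event name. The event names, the `abstract_trace` function, and the greedy longest-match strategy are illustrative assumptions, not a description of the tool being demonstrated.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical pattern library: low-level event sequence -> high-level event name.
PatternLibrary = Dict[Tuple[str, ...], str]


def abstract_trace(events: Sequence[str], library: PatternLibrary) -> List[str]:
    """Replace each known low-level sequence with its high-level event."""
    # Try longer patterns first so that the most specific match wins.
    patterns = sorted(
        ((seq, name) for seq, name in library.items() if seq),  # ignore empty patterns
        key=lambda item: len(item[0]),
        reverse=True,
    )
    result: List[str] = []
    i = 0
    while i < len(events):
        for seq, name in patterns:
            if tuple(events[i:i + len(seq)]) == seq:
                result.append(name)
                i += len(seq)
                break
        else:
            # No pattern starts here: keep the low-level event unchanged.
            result.append(events[i])
            i += 1
    return result


library: PatternLibrary = {
    ("open", "read", "close"): "read_file",
    ("open", "write", "close"): "write_file",
}
low_level = ["open", "read", "close", "irq", "open", "write", "close"]
high_level = abstract_trace(low_level, library)
```

Adding a new pattern only requires a new entry in `library`, which mirrors the extensibility described above.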
11:00 - 12:00 - Updates on the Eclipse Debugging Service Framework and Tracing and Monitoring Framework
Marc Khouzam (slides), François Chouinard and Matthew Khouzam
The goal of this project is to provide an Eclipse-based framework to interact with the tracing, monitoring and system health assessment tools. In addition to controlling the tracing tools and activating tracepoints in kernel and user-space, the framework allows the retrieval, analysis, correlation and visualization of heterogeneous and arbitrarily large trace files. This work has been done at Ericsson Montreal in collaboration with the LTTng community, Eclipse Linux Tools community (RedHat), WindRiver, Ericsson PM&T and CPP.
12:00 - 13:00 - Lunch Break
Lunch will be provided at Polytechnique.
13:00 - 14:00 - Discussion
Linux kernel expert Jonathan Corbet will comment on the mid-project results and will provide advice and suggestions for the continuation of the project.
14:00 - 15:00 - Linux kernel report
Jonathan Corbet (Report)
Linux kernel expert Jonathan Corbet will describe the latest changes to the Linux kernel.
15:00 - 15:15 - Break
15:15 - 17:00 - Linux kernel tutorial
Linux kernel expert Jonathan Corbet will describe the Linux kernel architecture. He will detail the main kernel subsystems (scheduler, memory management, interrupts, I/O queues, networking...) and their interactions. For each subsystem, he will suggest the most interesting events and parameters to trace and monitor.
Day 3 - Linux kernel tutorial (continuation) (Room M-2204)