Background

The electronic revolution that started in the late 1960s has already profoundly impacted our daily lives, both at home and in the office. Online electronic services are now used intensively for communications, shopping, entertainment and information gathering. This revolution is accelerating and reaching much farther than anticipated. New services include always-online third- and fourth-generation mobile devices acting as telephone, Internet gateway, music player and geographical navigation system. As a result, the Advanced Communications and Information Systems industry is growing rapidly and is a key economic sector. Canada is traditionally well positioned in this area, with a number of major players in Advanced Communications, such as Ericsson, Bell, Videotron and Nortel, and a large number of smaller but significant players like QNX, Oz Communications, BlueSlice or Broadsoft. This is, however, a very global market where technical competitiveness is crucial. Information Systems are just as vital for the fast and efficient operation of the Defence infrastructure, both locally and abroad.

All these new services rely on an increasingly sophisticated infrastructure composed of powerful servers, numerous fixed or mobile clients, and the system and networking software. The enabling technologies for many of these new developments are the increasingly sophisticated processors, software libraries and protocols. Computer central processing units have evolved from simple processors to symmetric or asymmetric multi-processors (SMP/ASMP), non-uniform memory access (NUMA) SMPs and, more recently, multi-core (SMP/ASMP on a single chip) systems. Embedded soft and hard real-time multi-core multi-computer systems are exceedingly difficult to debug and tune. Many problems, often timing related, only show under real loads, when the hardware (cache, page tables, synchronization) and software (operating system, virtual machines, libraries, applications) are interacting in real time. The development time of distributed online applications is a major stumbling block for creating new services. Similarly, it is a significant challenge to efficiently operate and maintain these complex distributed systems under widely varying conditions (e.g. varying load, wireless communication problems, failing nodes, network attacks). Monitoring and tracing tools are thus needed to extract precise, globally ordered monitoring, debugging and performance data while minimizing the overhead on the systems under test. The tools must handle the potentially millions of significant events per minute to be expected on a 64-processor system running at several GHz. The lack of adequate tracing and debugging tools was mentioned as a critical challenge for the deployment of multi-core systems.
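
To make the overhead constraint concrete, the following C sketch shows the general shape of a low overhead probe: a fixed-size, timestamped record appended to a preallocated per-CPU ring buffer, so that the hot path costs roughly one clock read plus a few stores. This is a simplified illustration under assumed names (probe, percpu_buf, event), not the actual implementation of LTTng or any other tracer.

    /* Illustrative sketch only; names are hypothetical. A single static
     * buffer stands in for what would be one ring buffer per CPU. */
    #include <stdint.h>
    #include <time.h>

    #define BUF_ENTRIES 65536               /* power of two, for masking */

    struct event {
        uint64_t timestamp_ns;  /* monotonic clock, for global ordering   */
        uint32_t id;            /* event type: syscall entry, IRQ, ...    */
        uint32_t payload;       /* small fixed payload; real tracers vary */
    };

    struct percpu_buf {
        struct event ring[BUF_ENTRIES];
        uint32_t head;          /* next write position, wraps around      */
    };

    static struct percpu_buf buf;           /* one per CPU in a real tracer */

    static inline void probe(uint32_t id, uint32_t payload)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);

        struct event *e = &buf.ring[buf.head++ & (BUF_ENTRIES - 1)];
        e->timestamp_ns = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
        e->id = id;
        e->payload = payload;
    }

    int main(void)
    {
        probe(1, 42);                       /* e.g. syscall entry on fd 42 */
        return 0;
    }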

Interestingly, the low overhead system level monitoring and tracing infrastructure needed for performance analysis and debugging tools can at the same time provide extremely accurate information for system security monitoring. Defence R&D Canada is interested in building redundant architectures with mutual monitoring to greatly improve the resistance of critical infrastructures to cyber-attacks. Even if software/hardware redundancy has been used for decades to improve Fault Tolerance, little work has been done to extend the concept to Cyber-Attack Tolerance (i.e. to manage the risks of software defects being exploited). When properly designed, the "mutual surveillance" of redundant implementations could ensure the detection of hidden communications, suspicious database changes and other malicious behaviors that are normally very difficult to detect. This can be achieved by comparing execution traces, a much simpler process than abstracting long series of low-level CPU instructions.
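
As a rough sketch of the idea, the C fragment below compares the event streams recorded by two redundant replicas and reports the first point of divergence. The integer event ids and lock-step comparison are hypothetical simplifications; a real system would match richer trace records and tolerate benign differences such as scheduling-induced reordering.

    /* Illustrative sketch: flag the first event where two replicas'
     * traces disagree. Event ids and values are invented for the example. */
    #include <stdint.h>
    #include <stdio.h>

    /* Returns the index of the first differing event, or -1 if the two
     * traces agree over their common prefix of length n. */
    static long first_divergence(const uint32_t *trace_a,
                                 const uint32_t *trace_b, long n)
    {
        for (long i = 0; i < n; i++)
            if (trace_a[i] != trace_b[i])
                return i;
        return -1;
    }

    int main(void)
    {
        uint32_t a[] = { 1, 2, 3, 4, 5 };   /* replica A: expected events */
        uint32_t b[] = { 1, 2, 3, 9, 5 };   /* replica B: suspicious event */

        long i = first_divergence(a, b, 5);
        if (i >= 0)
            printf("traces diverge at event %ld: %u vs %u\n", i, a[i], b[i]);
        return 0;
    }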

A Tracing Summit was held in January 2008 to survey the tracing and monitoring field. Several of the most advanced players in the areas of Advanced Communications, Information Systems and Computer Safety discussed the state-of-the-art tools, their problems and their unmet needs. The most promising avenues for solutions were identified with the guidance of key industrial, governmental and academic researchers in the field. Representatives from Defence R&D Canada at Valcartier, Enea, Ericsson, Freescale, IBM, MontaVista, Nokia, Oracle, Rational, Red Hat, TimeSys, Wind River and Zealcore were present.

There were traditionally a number of different segments in the field. Embedded systems developers have been using system level tracing for real-time systems for a number of years. World leader Wind River offers two different embedded operating systems, using LTTng (developed at Polytechnique) for Linux and its own tracer for VxWorks. MontaVista uses LTTng as well, while Canadian leader QNX uses its own tracer for QNX Neutrino. In all three cases, the trace visualization tool is based on the Eclipse framework and works adequately for traces of a few megabytes. However, these viewing tools cannot handle the gigabytes of traces that will come with the newer, more sophisticated multimedia handheld devices containing high performance multi-core processors (e.g. for high definition video processing).

In the general purpose computing field, tools exist for logging and monitoring high level events (login and logout events, web requests...) or for studying detailed CPU usage through profiling. Telecommunication servers, on the other hand, have good specialized facilities for logging high level, high volume real-time events (e.g. cellular phone connections). Both types of servers, however, are evolving rapidly, now use multi-core processors, and often serve high volume online applications. Indeed, telecommunication servers are becoming more general purpose as numerous services are added, and next generation cellular phones are turning into networked handheld computers. Similarly, a large fraction of the new general purpose servers are dedicated to high volume online soft real-time services. As a consequence, in both cases the interaction between the operating system, the applications and the multi-core processors has a large impact on the real-time performance of these systems. System level tracing and monitoring is therefore needed to debug and tune these systems, just like for real-time embedded systems.

Efficient and mature system level tracing and monitoring tools are sorely lacking for both high performance embedded systems and new multi-core online servers. LTTng offers the base functionality with very efficient static probes. It is already used in a number of demanding applications, for instance at Autodesk Media and Entertainment, Google and IBM. More research and development is needed to extend LTTng to user-level tracing, integrate dynamic probes, develop more advanced algorithms optimized for massively multi-core systems, and build higher level monitoring and analysis tools. DTrace from Sun is very well regarded and offers adequate performance for low to medium volume monitoring through dynamic probing. SystemTap, from Red Hat, IBM and others, is a similar dynamic probing system for Linux, still under development and planned to complement LTTng.
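
One reason static probes can be made so cheap is that a disabled instrumentation point can compile down to a single well-predicted branch, with the event-recording slow path kept out of line. The C sketch below illustrates the pattern with hypothetical names (TRACEPOINT, trace_record); it is not the actual LTTng macro set, and __builtin_expect is a GCC-specific branch-prediction hint.

    /* Hypothetical names, not the actual LTTng macros. A disabled
     * tracepoint costs one test of a flag. */
    #include <stddef.h>
    #include <stdint.h>

    volatile int trace_enabled;              /* flipped by the tracer */

    static void trace_record(uint32_t id, uint32_t val)
    {
        /* out-of-line slow path: append to the trace buffer (elided) */
        (void)id; (void)val;
    }

    #define TRACEPOINT(id, val)                        \
        do {                                           \
            if (__builtin_expect(trace_enabled, 0))    \
                trace_record((id), (val));             \
        } while (0)

    static void handle_request(int fd, size_t len)
    {
        TRACEPOINT(42, (uint32_t)fd);  /* near zero cost when disabled */
        (void)len;                     /* ... application work ...     */
    }

    int main(void)
    {
        handle_request(3, 512);        /* disabled: one predicted branch */
        trace_enabled = 1;
        handle_request(3, 512);        /* enabled: event gets recorded   */
        return 0;
    }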

High performance parallel computing (HPC) systems, with thousands of computing nodes, have used tracing and monitoring with success for a long time. However, these systems execute long running CPU intensive tasks. High level libraries like the Message Passing Interface (MPI) are used to communicate between the computing nodes, dividing the task and exchanging data and results. At each node, the performance may be optimized by profiling the CPU usage. Monitoring is performed at a high level by tracing all MPI events (high level data exchanges over the network). The traces can then be used to understand the interactions between the computation nodes, and to optimize the division of labor and communication.
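
The usual mechanism behind such MPI-level tracing is the MPI profiling interface (PMPI): every MPI function also exists under a PMPI_ name, so a tool can interpose its own wrapper and forward to the real implementation. A minimal sketch follows; the printf stands in for writing a proper trace record, and the exact MPI_Send signature follows the MPI version in use (older versions declare buf as void * rather than const void *).

    #include <mpi.h>
    #include <stdio.h>

    /* Interposed MPI_Send: time the call, then forward to the real
     * implementation through its PMPI_ alias. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        printf("MPI_Send: %d element(s) to rank %d in %.6f s\n",
               count, dest, MPI_Wtime() - t0);
        return rc;
    }

    int main(int argc, char **argv)      /* run with: mpirun -np 2 ... */
    {
        int rank, x = 7;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }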

The tracing tools available for high performance computing are adequate for their intended purpose but would not be suitable for high volume online servers. They operate at a much larger granularity and do not provide system level information. Much more efficient algorithms are needed for probes instrumenting frequent events (interrupts, system calls...). The synchronization of events between traces is also much more challenging when events are less than a microsecond apart. There are, however, a number of areas in which HPC tools can serve as inspiration. High performance systems routinely produce huge multi-gigabyte traces on numerous networked multi-core systems. Some of their multi-level trace synchronization and visualization algorithms can therefore be a source of inspiration for new system level tracing and monitoring systems.
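
To illustrate the synchronization problem, the sketch below applies the classic NTP-style formula to one matched message exchange between nodes A and B, using timestamps taken from each node's local trace, to estimate B's clock offset relative to A. Real trace synchronization would use many such pairs (e.g. convex-hull or linear regression methods) and also account for clock drift; the timestamp values here are invented for the example.

    #include <stdio.h>

    /* s1: A sends (A's clock)     r1: B receives (B's clock)
     * s2: B sends (B's clock)     r2: A receives (A's clock)
     * Assuming roughly symmetric network delays, the offset of B's
     * clock relative to A's is ((r1 - s1) + (s2 - r2)) / 2. */
    static double offset_b_minus_a(double s1, double r1,
                                   double s2, double r2)
    {
        return ((r1 - s1) + (s2 - r2)) / 2.0;
    }

    int main(void)
    {
        /* timestamps in seconds, hypothetical values from two traces */
        double off = offset_b_minus_a(10.000000, 10.000150,
                                      10.000200, 10.000250);
        printf("estimated offset of B vs A: %.6f s\n", off);
        return 0;
    }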

In summary, the surge of interest in accurate system monitoring and tracing is explained by the explosion in the number of online services deployed, their complexity, their pervasiveness and the number of their users. This is the case for the telecommunication industry developing and deploying sophisticated multimedia, video capable, smart handheld phones and computers, for the government with its sophisticated information and operation support systems, and for general information or inventory tracking systems everywhere. Furthermore, the transition from mostly single processor clients and some multi-processor servers to heavy use of multi-core processors on both clients and servers represents a significant shift in the programming paradigm, for which the current tools are inadequate. This lack of adequate tools results in longer development cycles, operations and maintenance support nightmares, and poorly optimized systems.