Debugging a Complex System, the Long Way from Data to Knowledge

Debugging a complex system is a challenge: first, the monitoring information has to be collected; second, it has to be interpreted correctly. In this paper, we give a short example of a real-life debugging session in a Grid environment and argue for a monitoring system that delivers information across all systems and layers involved. Further, we discuss how Software Performance Engineering (SPE) can help to identify unintended behavior.

The infrastructure discussed here looks as follows: on a server installation with different types of compute nodes we operate a batch system that allows users to execute programs. These programs (jobs) are managed by a locally installed Grid middleware. The middleware provides uniform access to all servers for all Grid users, similar to a Cloud access layer. The Grid access is used by a workflow engine, which can split a complex research task (e.g., analyzing a microscope image) into a set of jobs (depending on the image's content). The workflow engine accesses job descriptions and executables and is controlled via a web-based interface.
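To make the layering concrete, the following minimal Python sketch (all names and IDs are hypothetical, not taken from the actual installation) models the chain from the web interface and workflow engine down to the batch system: one workflow is split into a set of jobs, and each job carries its own identifiers on the Grid and batch layers.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Job:
    executable: str          # program started on a compute node
    grid_job_id: str = ""    # identifier assigned by the Grid middleware
    batch_job_id: str = ""   # identifier assigned by the batch system

@dataclass
class Workflow:
    workflow_id: str         # identifier known to the web interface / workflow engine
    task: str                # e.g. "analyze a microscope image"
    jobs: List[Job] = field(default_factory=list)

# The workflow engine decides how many jobs are needed, e.g. depending on the image's content.
wf = Workflow("wf-4711", "analyze microscope image")
wf.jobs = [Job(executable="segment_tile") for _ in range(16)]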

In this scenario, an end user from the life-science domain complained about workflows running for weeks. The provider of the web frontend and the workflow engine identified one of the problematic long-running workflow executions for detailed debugging. The workflow provider found out that one job was re-executed many times and therefore contacted the program's developer, who found out that the program aborted on some servers. Thus, we as operators of the servers were contacted. We found out that the job was aborted by the batch system because it allocated too much main memory. A time-consuming task was to identify the same user's job and workflow on the different execution layers: IDs and the like needed to be mapped from one layer to the other, and monitoring had to be enabled or filtered. In total, the debugging took weeks, which was simply too much time and manpower.
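The following sketch (hypothetical IDs and data) illustrates what this correlation amounts to: only after joining per-layer records on their respective IDs can a single workflow execution be traced down to the batch system's verdict on each job; in our case, every one of these mappings had to be recovered manually from a different layer's logs or monitoring output.

# Each dictionary stands for information extracted from a different layer.
workflow_to_grid = {"wf-4711": ["grid-0815", "grid-0816"]}                # workflow engine
grid_to_batch = {"grid-0815": "batch-1234", "grid-0816": "batch-1235"}    # Grid middleware
batch_status = {"batch-1234": "aborted: memory limit exceeded",           # batch system
                "batch-1235": "completed"}

def trace(workflow_id):
    """Follow one workflow down to the batch system's verdict on each of its jobs."""
    for grid_id in workflow_to_grid[workflow_id]:
        batch_id = grid_to_batch[grid_id]
        print(workflow_id, grid_id, batch_id, batch_status[batch_id])

trace("wf-4711")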

Finally, we found out that the user had changed and increased the size of the input data, and therefore the demand for main memory increased. Depending on the specific server, the job failed, was aborted by the batch system, or was executed normally.

To reduce the debugging time, we introduced a monitoring system that shows all information about a job across all layers in a transparent way and allows jobs to be compared directly. For this purpose, AMon was used as the visualization tool and SLAte as the monitoring-data collection tool. By comparing an aborted job with a normally running one, we could directly observe the increased main-memory usage.
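As a purely illustrative example (this is not the actual interface of AMon or SLAte, and the numbers are made up), once the memory samples of both jobs are available in one place, the difference is immediately visible:

# Hypothetical main-memory samples (MB) of the two jobs over their runtime.
aborted_job_mem = [800, 1600, 3100, 6400, 12800]
regular_job_mem = [700, 900, 1100, 1200, 1150]

print("peak memory, aborted job:", max(aborted_job_mem), "MB")
print("peak memory, regular job:", max(regular_job_mem), "MB")
print("ratio:", max(aborted_job_mem) / max(regular_job_mem))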

However, faced with more than 10,000 jobs, it is not easy to identify a regular job for comparison. Therefore, AMon was extended with automatic analysis capabilities. This analysis categorizes jobs based on similarity functions. When a job with a new behavior is found, it is presented to an administrator to be analyzed in detail and is afterwards tagged as regular or faulty. All jobs that are similar to a regular one are subsequently accepted; faulty and new jobs are flagged as abnormal. This automatic analysis could clearly identify the job with increased main-memory usage as well as the aborted ones. Thus, the change in a job's behavior could be identified much more easily.
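The following is a minimal sketch of such a similarity-based categorization; the concrete similarity function, metrics, and threshold are assumptions for illustration and may differ from AMon's actual implementation.

import math

def similarity(a, b):
    """Cosine similarity between two job metric vectors (e.g. peak memory, CPU time, I/O)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

SIMILARITY_THRESHOLD = 0.98         # assumption; would be tuned per installation
known_regular = [[1200, 3600, 50]]  # metric vectors of jobs an administrator tagged as regular

def categorize(job_metrics):
    """Accept jobs similar to a known regular job; flag everything else as abnormal."""
    if any(similarity(job_metrics, ref) >= SIMILARITY_THRESHOLD for ref in known_regular):
        return "regular"
    return "abnormal"  # new or faulty behavior -> shown to an administrator for tagging

print(categorize([1250, 3500, 55]))    # close to the regular reference -> "regular"
print(categorize([12800, 3600, 50]))   # strongly increased main memory -> "abnormal"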

Finding out whether a new job represents regular or faulty behavior is not trivial, because the behavior depends on many parameters (in our use case, the input data) and many stakeholders need to be involved. Therefore, the solution we prefer is to annotate each job with a performance description and the parameters of its usage. Based on these descriptions and parameters, software performance engineering approaches can be used to determine the expected behavior of a job. Afterwards, the monitoring data of the job can be compared to this behavior expectation. This avoids the manual analysis of a job; faulty behavior can be identified directly. Additionally, based on the expectations it is possible to determine the resource demand and thus to select the right server.
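A sketch of this comparison could look as follows; the linear memory model and its coefficients are purely illustrative assumptions, not the performance description developed in the project.

def expected_memory_mb(input_size_mb):
    """Hypothetical performance description: main-memory demand as a function of input size."""
    return 500 + 2.5 * input_size_mb   # fixed overhead plus a per-MB working set

def check_job(input_size_mb, measured_peak_mb, tolerance=0.2):
    """Compare the measured peak against the expectation; deviations beyond `tolerance` are faulty."""
    expected = expected_memory_mb(input_size_mb)
    deviation = abs(measured_peak_mb - expected) / expected
    return "regular" if deviation <= tolerance else "faulty"

# The same expectation can guide server selection before the job is submitted:
print(expected_memory_mb(4000), "MB expected -> pick a node with sufficient main memory")
print(check_job(input_size_mb=400, measured_peak_mb=1500))    # matches the expectation
print(check_job(input_size_mb=400, measured_peak_mb=12800))   # far above the expectation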

In a further project, in cooperation with an (academic) computing center, we want to develop an SPE behavior description to adapt our monitoring system to current needs.