September | 2011 | Thoughts on software and systems engineering

I have argued elsewhere on this blog that reductionism is not an appropriate basis for software engineering when we are building complex coalitions of systems. This means that ‘more of the same’, in the sense of improving existing software engineering methods, simply will not allow us to tackle the development of the large-scale, complex software systems that are rapidly becoming a reality.

Rather, I believe that we need to completely rethink software engineering research. I propose the following top-10 research problems that I think need to be addressed:

1. How can we model and simulate the interactions between independent systems?

To help us understand and manage coalitions of systems we need dynamic models that are updated in real time with information from the actual system. We need these models to help us make rapid ‘what-if’ assessments of the consequences of system change options. This will require new performance and failure modelling techniques where the models can adapt automatically from system monitoring data.

2. How can we monitor coalitions of systems and what are the warning signs of problems?

To help avoid transitions to an unstable system state, we need to know which indicators provide information about the state of the coalition of systems, and how these indicators can be used both to provide early warnings of system problems and, if necessary, to switch to safe-mode operating conditions that stop damage occurring. To make effective use of this data, we need visualization techniques that reveal the subtleties of coalition operation and interactions to operators and users.
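As a rough illustration of what an indicator with an early-warning level and a safe-mode level might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the class name IndicatorMonitor, the choice of an error rate as the indicator, the thresholds and the rolling-mean smoothing. Which indicators and thresholds are actually meaningful for a coalition of systems is exactly the open research question.

```python
from collections import deque
from statistics import mean


class IndicatorMonitor:
    """Tracks one health indicator and classifies the observed state.

    Hypothetical example: the indicator is smoothed with a rolling mean
    so that a single noisy sample does not trigger safe mode on its own.
    """

    def __init__(self, warn_at, safe_mode_at, window=5):
        if not warn_at < safe_mode_at:
            raise ValueError("warn_at must be below safe_mode_at")
        self.warn_at = warn_at
        self.safe_mode_at = safe_mode_at
        self.samples = deque(maxlen=window)  # keeps only the latest samples

    def observe(self, value):
        """Record a sample and return 'ok', 'warning' or 'safe_mode'."""
        self.samples.append(value)
        level = mean(self.samples)
        if level >= self.safe_mode_at:
            return "safe_mode"
        if level >= self.warn_at:
            return "warning"
        return "ok"


# Assumed indicator: fraction of failed requests reported by one member system.
monitor = IndicatorMonitor(warn_at=0.05, safe_mode_at=0.20, window=3)
states = [monitor.observe(rate) for rate in [0.01, 0.02, 0.15, 0.30, 0.40]]
print(states)
```

The early warning appears before the safe-mode threshold is crossed, which gives operators time to intervene. A real monitor would have to combine many indicators from independently managed systems, which this sketch does not attempt.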

3. How can systems be designed to recover from failure?

As we construct coalitions of systems with independently-managed elements and negotiated requirements, it is increasingly impractical to avoid ‘failure’. Indeed, what seems to be a ‘failure’ for some users may not affect some others. Because some failures are ambiguous, automated systems cannot cope on their own. Human operators have to use information from the system and intervene to recover from the failure and restore the system. This means that we need to understand the socio-technical processes of failure recovery, the support that these operators need and how to design coalition members to be ‘good citizens’ and to support failure recovery.

4. How can we integrate socio-technical factors into systems and software engineering methods?

Software and systems engineering methods have been created to support the development of technical systems and, by and large, consider human, social and organisational issues to be outside the system boundary. However, these non-technical factors significantly affect the development, integration and operation of coalitions of systems. There is a considerable body of work on socio-technical systems but this has not been ‘industrialised’ and made accessible to practitioners.

5. To what extent can coalitions of systems be self-managing?

The coalitions of systems that will be created are complex and dynamic and it will be difficult to keep track of system operation and respond in a timely way to the monitoring and health measurement information that is provided. We need research into self-management so that systems can detect changes in both their own operation and in their operational environment and dynamically reconfigure themselves to cope with these changes. The danger is that reconfiguration will create further problems so a key requirement is for these techniques to operate in a safe, predictable and auditable way and to ensure that self-management does not conflict with ‘design for recovery’.

6. How can we manage complex, dynamically changing system configurations?

Coalitions of systems will be constructed by orchestration and configuration, and the desired system configurations will change dynamically in response to load, indicators of system health and the unavailability of components. We need ways of supporting construction by configuration, managing configuration changes and recording changes (including automated changes from the self-management system) in real time so that we have an audit trail recording what the configuration of the coalition was at any point in time.
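To make the idea of an audit trail concrete, here is a minimal Python sketch of an append-only change log that can be replayed to reconstruct the configuration at any point in time. The names ConfigChange and ConfigAuditLog, the flat key/value view of configuration and the example keys are all assumptions made for illustration; a real coalition would have to deal with distributed clocks and with changes made by independent owners.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigChange:
    timestamp: float  # seconds since an agreed epoch (assumed shared clock)
    key: str
    value: object     # None means the key was removed
    source: str       # e.g. "operator" or "self-management"


class ConfigAuditLog:
    """Append-only log of configuration changes (hypothetical design)."""

    def __init__(self):
        self._changes = []

    def record(self, change):
        """Append a change; changes must arrive in time order."""
        if self._changes and change.timestamp < self._changes[-1].timestamp:
            raise ValueError("changes must be recorded in time order")
        self._changes.append(change)

    def config_at(self, timestamp):
        """Replay all changes up to and including timestamp."""
        times = [c.timestamp for c in self._changes]
        upto = bisect_right(times, timestamp)
        config = {}
        for change in self._changes[:upto]:
            if change.value is None:
                config.pop(change.key, None)
            else:
                config[change.key] = change.value
        return config


log = ConfigAuditLog()
log.record(ConfigChange(10.0, "payments.replicas", 2, "operator"))
log.record(ConfigChange(15.0, "search.enabled", True, "operator"))
log.record(ConfigChange(20.0, "payments.replicas", 4, "self-management"))
log.record(ConfigChange(30.0, "search.enabled", None, "operator"))

print(log.config_at(25.0))
```

Because each entry records its source, automated changes made by the self-management system can be distinguished from operator changes when the trail is inspected later.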

7. How can we support the agile engineering of coalitions of systems?

The business environment changes incredibly quickly in response to economic circumstances, competition and business reorganization. The coalitions of systems that we create will have to change rapidly to reflect new business needs. A model of system change that relies on lengthy processes of requirements analysis and approval simply will not work.

Agile methods of programming have been successful for small to medium sized systems where the dominant activity is system development. For large and complex systems, development processes are often dominated by coordination activities involving multiple stakeholders and engineers who are not co-located. How can we evolve agile approaches that are effective for ‘systems development in the small’ to support multi-organization, global software development?

8. How should coalitions of systems be regulated and certified?

Many coalitions of systems will be critical systems whose failure could threaten individuals, organizations and economies. They may have to be certified by a regulator who will check that, as far as possible, the systems will not pose a threat to their operators or to the wider environment in which they operate. But certification is increasingly expensive. For some safety-critical systems the cost of certification may exceed the costs of development. These costs will continue to rise as systems become larger and more complex.

9. How can we do ‘probabilistic verification’ of systems?

Our current techniques of system testing and more formal analysis are based on the assumption that the system has a definitive specification and that behaviour which deviates from that specification can be recognized. Coalitions of systems will have no such specification nor will system behaviour be guaranteed to be deterministic. The key verification issue will not be ‘is the system correct’ but ‘what is the probability that it satisfies essential properties, such as safety, that take into account its probabilistic, real-time and non-deterministic behaviour’.
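One simple way to read ‘what is the probability that it satisfies essential properties’ is as a statistical estimate obtained from many simulated or observed runs. The Python sketch below is a toy under stated assumptions, not a proposal for how probabilistic verification should be done: simulate_run is a hypothetical stand-in for a non-deterministic coalition, the property is invented, and the error margin uses Hoeffding's inequality, which assumes the runs are independent.

```python
import math
import random


def simulate_run(rng):
    """Hypothetical coalition run: returns True if the safety property held.

    Toy model: three independent services, each slow with probability 0.1;
    the property is violated only if all three are slow in the same run.
    """
    return not all(rng.random() < 0.1 for _ in range(3))


def estimate_property(run, n, delta, seed=0):
    """Estimate P(property holds) from n independent runs.

    Returns (p_hat, low, high). By Hoeffding's inequality the true
    probability lies in [low, high] with probability at least 1 - delta.
    """
    rng = random.Random(seed)
    successes = sum(run(rng) for _ in range(n))
    p_hat = successes / n
    margin = math.sqrt(math.log(2 / delta) / (2 * n))
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


p_hat, low, high = estimate_property(simulate_run, n=20000, delta=0.01)
print(f"estimate {p_hat:.4f}, interval [{low:.4f}, {high:.4f}]")
```

In this toy model the true probability is 1 - 0.1^3 = 0.999, so the estimate can be checked by hand. The hard part for real coalitions is obtaining runs that are representative of actual behaviour, and that is where the research effort is needed.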

10. How should shared knowledge in a coalition of systems be represented?

We assume that the systems in a coalition will interact through service interfaces so there will not be any over-arching controller in the system. Information will be encoded in a standards-based representation. The key problem will therefore not be one of compatibility; it will be one of understanding what the information that systems exchange actually means.

Currently, we address this problem on a system-by-system basis, with negotiations taking place between system owners to clarify what shared information means. However, if we allow for dynamic coalitions with systems entering and leaving the coalition, this is no longer a practical approach. The key issue is developing a means of sharing the meaning of information, perhaps using ontologies as proposed in the work on the semantic web.

Thanks to Dave Cliff and Radu Calinescu for their input. More on this in our paper on Large Scale Complex IT Systems.

A research agenda for software engineering
