operation

Dr. Hans von Gersdorff - The Wounded Man (1517)

introduction

The operations teams that manage the LIC (Larg's Internet Cluster) ensure service levels are met, monitor performance and improve systems. Operational processes manage incidents, monitor performance, maintain facilities and so on.

A dozen operational principles form the foundation of a system that is fit for production use.

operational principles

Each business system contains many components, such as applications, machines and connections to other enterprise systems. Each system as a whole must provide a level of service that is fit for business purpose. A set of LIC operational principles describes the standards that must be met to achieve this service level. These principles describe the fundamental nature of the production service, and the production environment and applications are measured to track compliance with them. The operational principles are listed here. Each principle has a description section, which gives a detailed definition of the principle and what needs to be done, and a criteria section, which provides a tick list for each new system. All statements in the criteria must be true for the system.

  • 24 by 7 support including incident management
  • System recovery within 4 hours of a total outage including data recovery to the point of failure
  • Full DR (Disaster Recovery), with a “cut-over” to DR within 48 hours and the ability to cut-back to the primary site within 48 hours
  • Fully resilient system configurations with no single point of failure
  • End-to-end application and infrastructure monitoring

the system has been tested

The system has undergone full end-to-end testing, including stress, performance, DR and recovery testing.

description

This principle will be underpinned by the results of subsequent principles. At this stage it is assumed that the application has passed through the LIC test phases. Several LIC environments are provided for testing.

criteria

As a summary, testing should include the following.

  1. A disaster recovery test has taken place.
  2. Volume testing at predicted peak load has taken place.
  3. Stress testing at peak load plus 50% has taken place.
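As a worked illustration of criteria 2 and 3, the target loads follow directly from the predicted peak. The sketch below is hypothetical: the function name `target_loads` and the example peak of 400 transactions per second are assumptions, not LIC figures.

```python
def target_loads(predicted_peak: float) -> dict:
    """Return the target loads for volume and stress testing.

    Volume testing runs at the predicted peak load; stress testing
    runs at the predicted peak load plus 50%.
    """
    return {
        "volume": predicted_peak,
        "stress": predicted_peak * 1.5,
    }


# Hypothetical predicted peak of 400 transactions per second.
print(target_loads(400))  # {'volume': 400, 'stress': 600.0}
```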

the system is resilient

There are no single points of failure. The system remains available in error conditions. A system should be able to cope with component failures, user errors and data corruption.

description

  1. The application should continue to operate through component failures outside its core operating environment (e.g. network failure, database failure) while issuing appropriate error messages.
  2. When component failures are resolved, the application should recover fully and transparently.
  3. The system should issue error messages when a hardware, operating system, application or network failure occurs. These errors must be logged for diagnostic and monitoring purposes.
  4. The system should handle any form of user input error or data corruption without failure of the application. The error should be logged for monitoring purposes.
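One way to picture points 1 to 3 is a wrapper that logs a dependency failure, keeps the application running on a fallback, and resumes normal processing once the dependency returns. This is only a sketch under stated assumptions: `fetch_with_fallback`, `broken_database` and `healthy_database` are hypothetical names, and a real system would choose its own fallback behaviour.

```python
import logging

logger = logging.getLogger("lic.resilience")


def fetch_with_fallback(fetch, fallback):
    """Call a dependency; on failure, log the error and degrade gracefully."""
    try:
        return fetch()
    except (ConnectionError, TimeoutError) as exc:
        # Log for diagnostic and monitoring purposes, then keep running.
        logger.error("dependency unavailable: %s", exc)
        return fallback


def broken_database():
    """Hypothetical dependency that is currently down."""
    raise ConnectionError("database unreachable")


def healthy_database():
    """Hypothetical dependency after the failure has been resolved."""
    return ["row-1", "row-2"]


# While the component is down, the application continues with a fallback.
print(fetch_with_fallback(broken_database, fallback=[]))
# Once it is restored, normal processing resumes with no manual step.
print(fetch_with_fallback(healthy_database, fallback=[]))
```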

criteria

A system can be classed as resilient after meeting these criteria.

  1. The system issues appropriate messages, as defined by the infrastructure support area and/or Systems Operations, upon the failure of a non-critical component.
  2. The application resumes normal processing without manual intervention upon reinstatement of the ‘failed’ component.

the system is recoverable

There are current backups of the system. The system will have Disaster Recovery (DR) and contingency in place. Support arrangements will meet service level requirements. Appropriate recovery/restart points are built in.

description

backup and recovery
  1. Sufficient backups should be taken so that application, operating system and application data can be restored to a point agreed by the business or, as a minimum, within support standards.
  2. The restore should be achievable within the agreed timescales as defined by the SLA or, as a minimum, within support standards.
  3. Where recovery of the application data beyond two weeks is required, the application should have self-maintaining archiving, e.g. an XML-described data stream, an application archive table or images.
  4. Testing should include recovery of the application, application data, operating system and associated software infrastructure components.
  5. A full backup and recovery test has taken place. This must include recovery from last night’s backup and a backup that is more than 24 hours old. All recoveries should include the rolling forward of data to the point of failure.
  6. Wherever possible the recovery should avoid manual intervention e.g. auto install.
  7. Full documentation for recovery should be produced.
failover
  1. Systems should be designed to offer transparent fail-over where possible. Upon a terminal error on the active platform (usually identified by a heartbeat failure), the fail-over application infrastructure is automatically activated.
DR (Disaster Recovery)
  1. DR will be in place for the system. The DR site will be an authorised site and DR capability will have been tested and proved (signed off by the business or technical expert) before the system is promoted into the live environment.
batch
  1. Batch programs will be restartable.
  2. The batch should complete within 50% of the available batch window.
  3. Batch should be designed to co-exist with on-line processing in error situations and to accommodate bank holiday processing etc.
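The first two batch points can be sketched as follows: a batch that records a restart point after each item, so a rerun resumes where the failed run stopped, plus a check that the run time fits within 50% of the batch window. The checkpoint file format and the names `run_batch` and `fits_batch_window` are assumptions made for illustration only.

```python
import json
import os
import tempfile


def run_batch(items, process, checkpoint_path):
    """Process items in order, recording progress so a rerun can resume."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)["done"]
    for index in range(done, len(items)):
        process(items[index])
        # Persist the restart point after each completed item.
        with open(checkpoint_path, "w") as fh:
            json.dump({"done": index + 1}, fh)
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # Clean finish: no restart point needed.


def fits_batch_window(run_minutes, window_minutes):
    """True if the batch completes within 50% of the available window."""
    return run_minutes <= 0.5 * window_minutes


processed = []
fail_once = {"armed": True}


def process(item):
    """Hypothetical work step that fails once on item "c"."""
    if item == "c" and fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated failure")
    processed.append(item)


checkpoint = os.path.join(tempfile.mkdtemp(), "batch.checkpoint")
try:
    run_batch(["a", "b", "c", "d"], process, checkpoint)
except RuntimeError:
    pass  # The first run stops at item "c".
run_batch(["a", "b", "c", "d"], process, checkpoint)  # Resumes at "c".
print(processed)  # ['a', 'b', 'c', 'd']
```

Items "a" and "b" are processed once only, because the second run starts from the recorded restart point rather than from the beginning.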

criteria

A system can be classed as recoverable if it meets these criteria.

  1. Successful Disaster Recovery Capability Test
  2. End to end support structure produced/signed off as part of the Support Model
  3. Backup/recovery signed off by appropriate expert
  4. Batch signed off by appropriate expert
  5. Full recovery documentation in place
  6. DR test schedule in place (annually)
  7. Fail-over Test
  8. Sign off by Business Continuity

the system is reliable

The system will operate with no (or very few) operational failures. Individual hardware and software components (platforms, application and networks) are stable. The standard period of stability testing is fourteen days for key systems and seven days for non-key systems.

There may be windows of reliability. An intranet application may be available only during business hours. E-commerce applications for the general public must be available 24 hours a day.

e-commerce application availability
day            start   finish
monday         00:00   23:59
tuesday        00:00   23:59
wednesday      00:00   23:59
thursday       00:00   23:59
friday         00:00   23:59
saturday       00:00   23:59
sunday         00:00   23:59
bank holidays  00:00   23:59
christmas      00:00   23:59
new year       00:00   23:59

description

  1. The system will be capable of managing log files/datasets to ensure that availability of storage is not compromised by their growth.
  2. The system will manage memory, storage and processor resource in such a way as to avoid the cumulative consumption of such resources i.e. the system should terminate processes cleanly, avoid memory leaks and appropriately manage temporary files/datasets.
  3. The system should have demonstrated during stability testing that it is capable of operating without error.
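Point 1 is often met by rotating log files so that their total size is bounded. The sketch below uses `RotatingFileHandler` from the Python standard library; the size limit, the number of retained files and the logger name are illustrative assumptions rather than LIC settings.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

# Cap each file at roughly 1 KB and keep 3 old files, so total log
# storage stays bounded however long the system runs.
handler = RotatingFileHandler(log_path, maxBytes=1024, backupCount=3)
logger = logging.getLogger("lic.reliability")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

for i in range(500):
    logger.info("processed transaction %d", i)

handler.close()
log_files = sorted(os.listdir(log_dir))
print(log_files)
```

However many messages are written, the directory never holds more than the current log file plus three rotated copies.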

criteria

A system can be classed as reliable if it meets these criteria.

  1. During the standard period of stability testing (fourteen days for key systems, seven days for non-key systems) the system should demonstrate its reliability, e.g. no memory leaks, no fatal errors, acceptable log file growth and data growth in line with predictions. No operational failure should be caused by the application.

the system has capacity

The system has built-in growth for an agreed period and agreed headroom above peak demand on CPU, storage and memory; the system should not run continuously at greater than an agreed percentage of infrastructure capacity; the batch can run within a defined window with acceptable contingency available.

description

storage
  1. The system should have sufficient storage to cope with the predicted data storage requirements for the first year of the system’s life.
  2. The storage capacity of the system should be capable of being increased by 50% of the first year’s predicted requirement without wholesale re-design of the application or infrastructure. The costs for upgrading the storage capacity should be detailed.
load balancing
  1. Where appropriate, load balancing will be deployed to ensure flexible use of the infrastructure at peak periods.
performance
  1. The system has passed the stress-testing criteria. This test should be run on a fully populated application environment. Stress testing should include:
  2. The system should utilise CPU, memory and storage resources in the most efficient manner possible.

 

criteria

A system can be classed as having capacity after meeting these criteria.

  1. When testing 1 & 2 above the infrastructure and its components should perform within normal limits.
  2. The target infrastructure will have capacity to accept the application.

the system is scalable

The system should be designed with horizontal scaling in mind. Vertical scaling should be avoided, i.e. we should be able to take advantage of workload balancing instead of purchasing more powerful infrastructure.

description

  1. The system should be scalable by a factor of 100% without wholesale re-design of the infrastructure or application.

criteria

  1. The design should state how scalability would be achieved.

the system can be monitored

Throughout the 24 by 7 period, the system issues appropriate alerts and heartbeats, and system thresholds are set correctly.

description

All critical alerts should go to Systems Operations and reference the correct resolution document where one exists.

Monitoring should be part of the normal testing process e.g. failure analysis testing, stress testing etc. detailed in previous sections.

Each application should have the following monitoring in place.

Heartbeat functionality – End to End

The heartbeat mimics the customer experience at regular intervals (every minute). An alert should be issued if response times breach a threshold predetermined by the business, or if the heartbeat fails five times consecutively. If an end-to-end heartbeat is not appropriate then component heartbeats should be applied as above.

The object of the heartbeat is to prove that key business functionality is available and performing to an acceptable standard. Any failure of the system, at either the infrastructure or the application level, should either raise an alert directly or be detected by heartbeat functionality.
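The alerting rule described above can be sketched as a small state machine. This is a minimal illustration, not the LIC monitoring implementation: the class name `HeartbeatMonitor`, the message texts and the two-second threshold are assumptions, and the code treats a response slower than the threshold as a breach.

```python
class HeartbeatMonitor:
    """Track end-to-end heartbeat probes and decide when to alert."""

    def __init__(self, max_response_seconds, max_consecutive_failures=5):
        self.max_response_seconds = max_response_seconds
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, succeeded, response_seconds=None):
        """Record one probe result; return an alert message or None."""
        if not succeeded:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive_failures:
                return "ALERT: heartbeat failed %d times in a row" % (
                    self.consecutive_failures
                )
            return None
        # A successful probe resets the failure count.
        self.consecutive_failures = 0
        if (
            response_seconds is not None
            and response_seconds > self.max_response_seconds
        ):
            return "ALERT: response took %.1fs" % response_seconds
        return None


# Hypothetical business threshold of 2 seconds, probed once a minute.
monitor = HeartbeatMonitor(max_response_seconds=2.0)
alerts = [monitor.record(False) for _ in range(5)]
print(alerts[-1])                                  # alert on the fifth failure
print(monitor.record(True, response_seconds=3.5))  # alert on a slow response
print(monitor.record(True, response_seconds=0.4))  # None
```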

Environmental Monitoring

Environmental monitoring should also be in place, with alerts issued if agreed thresholds are exceeded, e.g. 80% disk utilisation, 80% CPU, 80% memory.
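A threshold check of this kind can be sketched in a few lines. The threshold table and the function name `breached` are hypothetical, and the 80% values simply mirror the example figures; real thresholds are whatever has been agreed for the system.

```python
# Agreed alert thresholds, as percentages (example values only).
THRESHOLDS = {"disk": 80.0, "cpu": 80.0, "memory": 80.0}


def breached(readings, thresholds=THRESHOLDS):
    """Return the names of resources whose utilisation exceeds its threshold."""
    return sorted(
        name for name, value in readings.items() if value > thresholds[name]
    )


# Disk is over its threshold; memory is exactly at it, so it does not alert.
print(breached({"disk": 91.5, "cpu": 42.0, "memory": 80.0}))  # ['disk']
```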

Security Monitoring

Where appropriate, security monitoring should be put in place e.g. systems which have external interfaces.

Event Monitoring

The system should monitor for all key events involved in the successful support of business processing. This should include:

  • Arrival of data feeds
  • Initiation and completion of batch elements
  • Initiation and completion of system backup

Performance Monitoring

Systems should have facilities to monitor and manage performance. These should include the ability to monitor overall resource utilisation of the hardware, operating system and storage subsystem, and should additionally provide detailed information on the resource utilisation and performance of individual transactions.

criteria

  1. Heartbeat monitoring in place
  2. Environmental monitoring in place
  3. Sign off by Systems Operations expert
  4. Sign off by the security expert (if appropriate)

the system is supportable and maintainable

The system has been designed to run on strategic platforms; all software and hardware infrastructure elements are supported by the LIC owner’s team or suppliers; essential maintenance can be performed without service disruption or within agreed windows of scheduled downtime.

A support team must be available to work on the system. For example, the OLA (Operating Level Agreement) for a business critical system will require 24/7 support.

Support team availability for a business critical application
day            start   finish
monday         00:00   23:59
tuesday        00:00   23:59
wednesday      00:00   23:59
thursday       00:00   23:59
friday         00:00   23:59
saturday       00:00   23:59
sunday         00:00   23:59
bank holidays  00:00   23:59
christmas      00:00   23:59
new year       00:00   23:59

description

  1. Any new infrastructure technology should pass through the Introduction Process.
  2. The system must have Architecture Board sign off.
  3. Every component, including each individual component of an externally supported solution, should be supported by a recognised support structure. The appropriate support area should support each component, and the component should be formally noted in the support team OLA.
  4. Maintenance of the components should be performed without disruption to service, which resilient system design makes possible. Otherwise, maintenance may be performed within an agreed outage as per the SLA.
  5. The system must have full remote control capability.
  6. The system components must all be underwritten by full support from their respective vendors via a current support contract.

criteria

  1. Sign off from Architecture Review Board (ARB)
  2. Relevant support contracts and OLAs in place
  3. Support documentation accepted by relevant expert support areas

the system is secure

The system will be compliant with the Group Information Security Policy, and the parameters defined in the appropriate internal and external standards. Security features include:

  • a dedicated team of security professionals watching the network for attacks at all times
  • intrusion detection monitoring
  • regular vulnerability scanning
  • log analysis
  • firewalls to prevent outside access to confidential information

description

  1. The system will go through a formal system assessment and appropriate controls will be employed to offset agreed risks. Any risk that is not addressed must be formally recognised through the Security Waiver Process.

criteria

  1. Sign off from relevant experts, specifically the security expert.

the system operational limits are known

The numbers of users, data feeds and transactions, and the transaction response times, beyond which the system fails to meet the SLA are known and monitored.

description

These can largely be derived from the stress testing results. They should detail maximum number of users (including concurrent), maximum database sizes, maximum network throughput and actual end-to-end transaction response times under stress testing conditions.

criteria

  1. A document should be produced by the project, detailing the above limits and handed over to the Service Manager.

the system has integrity

The system updates using logical units of work and commits changes appropriately. It should also report any integrity breaches.

description

  1. Changes will only be applied through recognised Change Management processes.
  2. Failure to complete a unit of work should not result in data corruption. Updates across multiple files within a single unit of work should be committed together; failure to commit all file updates should result in rollback of the updates to all files. Manual recovery should not be necessary.
  3. If appropriate, the system should check for and accommodate duplicate transactions e.g. by failing the transaction and reporting.
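Points 2 and 3 can be illustrated with a database transaction. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table layout, the `transfer` function and the transaction identifiers are hypothetical. Both balance updates and the transfer record form one unit of work, and a duplicate transaction identifier causes the whole unit to be rolled back and reported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE transfers (txn_id TEXT PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn, txn_id, source, target, amount):
    """Apply one transfer as a single unit of work; return True on success."""
    try:
        with conn:  # Commits on success, rolls back on any exception.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            # A repeated txn_id violates the primary key and raises here,
            # which undoes both balance updates above.
            conn.execute("INSERT INTO transfers VALUES (?, ?)", (txn_id, amount))
        return True
    except sqlite3.IntegrityError as exc:
        print("rejected duplicate transaction %s: %s" % (txn_id, exc))
        return False


print(transfer(conn, "t1", "a", "b", 30))  # True
print(transfer(conn, "t1", "a", "b", 30))  # False: duplicate, rolled back
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

After the rejected duplicate, the balances are the same as after the first transfer, with no manual recovery step.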

criteria

  1. Sign off from relevant experts

the system will operate within the SLA

The SLA (Service Level Agreement) is in place and has business and technical team agreement. The system has been designed to meet and will operate within the SLA requirements. The system is supported by appropriate OLAs.

For example, the availability target for Business Critical Systems is 99.7%.
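To make the target concrete, an availability percentage can be converted into a downtime allowance. The sketch assumes a 24 by 7 service window and a hypothetical function name; an SLA that measures availability over different hours or periods would give different figures.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Return the downtime, in hours, that an availability target permits."""
    return period_hours * (1 - availability_percent / 100)


# 99.7% availability over a 365-day year and over a 30-day month.
print(round(allowed_downtime_hours(99.7, 365 * 24), 2))  # 26.28
print(round(allowed_downtime_hours(99.7, 30 * 24), 2))   # 2.16
```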

description

  1. The SLA should detail:

criteria

  1. Sign off for Service Level targets agreed by service managers, technical experts and the service review board.
  2. Support is in place to meet SLA requirements and is detailed in the appropriate Operating Level Agreement (OLA).