Mr Buswell asks the Treasurer about closures of the Office of Shared Services in December 2006 and January 2007, and about a January hardware failure said to have resulted from an air-conditioning failure. The Treasurer's answer describes the hardware failure, its impact on testing and development, and the timeline for resolution.

Answered
QoN 1840
Legislative Assembly
Asked
20 March 2007
Portfolio
Treasurer

Question

I refer to the Office of Shared Services and ask -
(1) For what dates was the Office of Shared Services closed during the months of December 2006 and January 2007, and for what reasons?
(2) Can the Treasurer confirm that an air-conditioning failure in January resulted in a loss of hardware function, and if so -
(a) why did this occur;
(b) what was the nature of the disruption; and
(c) how long were the functions of the Office disrupted?

Answer

Answered
5 April 2007
Response time
16 days
(2) On the morning of 18 January 2007, part of the air conditioning systems supporting the Secondary Data Centre (SDC) failed, resulting in a significant temperature increase in that computer room facility. Because of the heat situation, the Data Centre vendor began shutting down non-critical systems and OSS were advised of the issue. That evening, vendor technical staff identified that the Hardware Management Console (HMC) had failed and that a number of related and dependent systems were shutting down. Over that evening all the SDC systems were shut down, and the HMC processors were replaced and dispatched to the original vendor in the United States (US) for analysis. The results of that analysis have stated that the failure was due to a "random intrinsic hardware failure". In considering the heat related issues the vendor also found that "there was no temperature errors logged on the system". The vendor was able to reproduce the error in their test lab in the US, further indicating that temperature was not a factor in the failure.
(a) The cause of the air conditioning failure is unknown. Failed components were replaced and the air conditioning system restarted. One of the two air conditioning units failed. The second unit continued to operate but is insufficient to maintain the desired temperature. The Data Centre vendor is reviewing this situation. The cause of the computing systems failure was a failure of the HMC and the subsequent inability of related computer units to connect to the back-up unit. Both units were returned to the US for analysis.
(b) The event occurred at the SDC, which houses OSS's development and testing systems and its production disaster recovery systems. The production systems continued to operate unaffected as they are housed at the Primary Data Centre at a different physical location. Until the problems were resolved and the development, testing and disaster recovery environments brought back on-line, all testing and development activities were suspended and a change freeze was imposed on the production systems until the disaster recovery systems were reinstated. The event occurred during a primary testing phase and resulted in a significant (four days) delay in that phase.
(c) In terms of its material impact to OSS development and testing activities, the event took four days to resolve, from Thursday evening (18 January 2007) until the morning of Tuesday 23 January 2007. Several critical environments were brought back into operation after a 72 hour outage, during Monday 22 January 2007. Together with the main outage and subsequent remedial action, the incident lasted for a total of 120 hours before being closed.
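
The dates and durations quoted in this record can be cross-checked with simple date arithmetic. The sketch below is illustrative only and is not part of the parliamentary answer. It works at whole-date level because the answer gives no clock times (only "morning" and "evening"), and all variable and function names are hypothetical.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_name(d: date) -> str:
    """Return the English weekday name (date.weekday() has Monday == 0)."""
    return DAY_NAMES[d.weekday()]


# Dates taken from the record's metadata.
asked = date(2007, 3, 20)
answered = date(2007, 4, 5)
response_days = (answered - asked).days

# Dates taken from parts (2) and (c) of the answer.
failure_day = date(2007, 1, 18)            # air conditioning and HMC failure
critical_restored_day = date(2007, 1, 22)  # several critical environments back
resolved_day = date(2007, 1, 23)           # material impact resolved

# Calendar days between failure and resolution, and the quoted
# 120-hour incident total expressed in 24-hour days.
calendar_span_days = (resolved_day - failure_day).days
incident_total_days = 120 / 24

print(response_days)                        # 16
print(weekday_name(failure_day))            # Thursday
print(weekday_name(critical_restored_day))  # Monday
print(weekday_name(resolved_day))           # Tuesday
print(calendar_span_days, incident_total_days)  # 5 5.0
```

At this level of detail the figures agree with the record: the response time is 16 days, the weekdays match those named in the answer, and the 120 hour incident total equals the five calendar days between 18 and 23 January 2007.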
