23.2. WS-BA Recovery
23.2.1. WS-BA Coordinator Crash Recovery
The WS-BA coordination service implementation tracks the status of each participant in an activity as the activity progresses through completion and closure. A transition point occurs during closure, once all CoordinatorCompletion participants receive a complete message and respond with a completed message. At this point, all ParticipantCompletion participants should have sent a completed message. The coordinator writes a log record storing the details of each participant, and indicating that the transaction is ready to close. If the coordinator service crashes after the log record is written, the close operation is still guaranteed to be successful. The coordinator checks the log after the system reboots and resends a close message to all participants. After all participants respond to the close with a closed message, the coordinator can safely delete the log entry.
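The log-then-close sequence can be pictured with a small simulation. This is only a sketch under stated assumptions, not the XTS implementation: the Participant and Coordinator classes, their method names, and the in-memory dictionary standing in for the durable log are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical stand-in for a remote WS-BA participant."""
    name: str
    state: str = "completed"
    close_count: int = 0

    def close(self) -> str:
        # Every delivery of close, including a redelivery, is answered with "closed".
        self.close_count += 1
        self.state = "closed"
        return "closed"


@dataclass
class Coordinator:
    """Toy coordinator; `log` stands in for a durable store that survives a crash."""
    log: dict = field(default_factory=dict)

    def prepare_close(self, activity_id: str, participants: list) -> None:
        # Written once every participant has reported "completed".
        self.log[activity_id] = list(participants)

    def send_close(self, activity_id: str) -> None:
        replies = [p.close() for p in self.log[activity_id]]
        if all(reply == "closed" for reply in replies):
            # The log entry is deleted only after every "closed" reply arrives.
            del self.log[activity_id]

    def recover(self) -> None:
        # After a restart, resend close for every activity still in the log.
        for activity_id in list(self.log):
            self.send_close(activity_id)


# Usage: the coordinator "crashes" after writing the log record.
durable_log = {}
participants = [Participant("hotel"), Participant("flight")]
Coordinator(durable_log).prepare_close("ba-1", participants)
restarted = Coordinator(durable_log)  # a new instance sees the same log
restarted.recover()
```

After recover() returns, both toy participants are in the closed state and the log entry for the activity has been removed.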
The coordinator does not need to track which close messages it sent before the crash, or how many times it has already resent them if it crashes repeatedly. The XTS participant implementation is resilient to redelivery of close messages. Assuming that the participant has implemented the recovery functions described below, the coordinator can even guarantee delivery of close messages if both it and one or more of the participant service hosts crash simultaneously.
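One way to picture resilience to redelivery is a close handler that performs its real work only on the first delivery and simply repeats the acknowledgement afterwards. The class below is a hypothetical illustration of that idea; it does not show how XTS itself implements it.

```python
class IdempotentParticipant:
    """Hypothetical participant whose close handler tolerates redelivery."""

    def __init__(self) -> None:
        self.state = "completed"
        self.side_effects = 0  # counts how often the real close work ran

    def on_close(self) -> str:
        if self.state == "completed":
            # First delivery: do the real work exactly once.
            self.side_effects += 1
            self.state = "closed"
        # Duplicates skip the work and repeat the acknowledgement.
        return "closed"
```

Because every delivery is answered with closed, a coordinator that resends after each restart still converges, however many duplicates the participant sees.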
If the coordination service crashes before it has written the log record, it does not need to explicitly compensate any completed participants. The presumed abort protocol ensures that all completed participants are eventually sent a compensate message. Recovery must be initiated from the participant side.
A log record does not need to be written when an activity is being canceled. If a participant does not respond to a cancel or compensate request, the coordinator logs a warning and continues. The combination of the presumed abort protocol and participant-led recovery ensures that all participants eventually get canceled or compensated, as appropriate, even if the participant host crashes.
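The cancel path can be sketched as a single pass with no log record and no retries. The function name, the (name, state) pairs, and the send callback are assumptions made for illustration only.

```python
import logging

logger = logging.getLogger("wsba.sketch")


def cancel_activity(participants, send):
    """Send cancel or compensate to each participant without writing a log record.

    participants: iterable of (name, state) pairs; state is "active" or "completed".
    send(name, message) returns True if the participant acknowledged, else False.
    Returns the names of participants that did not respond.
    """
    unresponsive = []
    for name, state in participants:
        # Completed work is compensated; work still in progress is canceled.
        message = "compensate" if state == "completed" else "cancel"
        if not send(name, message):
            # Warn and continue; participant-led recovery is relied on to finish.
            logger.warning("no response to %s from %s", message, name)
            unresponsive.append(name)
    return unresponsive
```

The coordinator in this sketch never blocks on a silent participant, which mirrors the reliance on presumed abort and participant-led recovery described above.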
If a completed participant receives no response from its coordinator after resending its completed message a suitable number of times, it switches to sending getstatus messages, to determine whether the coordinator still knows about it. If a crash occurs before writing the log record, the coordinator has no record of the participant when the coordinator restarts, and the getstatus request returns a fault. The participant recovery manager automatically compensates the participant in this situation, just as if the activity had been canceled by the client.
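The escalation from resending completed to probing with getstatus can be sketched as follows. The function, the UnknownActivityFault exception, the callback signatures, and the default of three resends are hypothetical; the text only says "a suitable number of times".

```python
class UnknownActivityFault(Exception):
    """Hypothetical fault: the coordinator has no record of the participant."""


def await_outcome(send_completed, send_getstatus, max_resends=3):
    """Return the next action for a completed participant.

    send_completed() returns "close", "compensate", or None if no reply arrives.
    send_getstatus() returns a status string or raises UnknownActivityFault.
    """
    for _ in range(max_resends):
        reply = send_completed()
        if reply is not None:
            return reply  # the coordinator answered with close or compensate
    try:
        send_getstatus()
    except UnknownActivityFault:
        # No record on the coordinator side: compensate, as for a canceled activity.
        return "compensate"
    # The coordinator still knows the participant: go back to resending completed.
    return "resend-completed"
```

A silent coordinator that then faults on getstatus yields "compensate", while one that still reports a status sends the participant back to resending completed.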
After a participant crash, the participant recovery manager detects the log entries for each completed participant. It sends getstatus messages to each participant's coordinator host, to determine whether the activity still exists. If the coordinator has not crashed and the activity is still running, the participant switches back to resending completed messages, and waits for a close or compensate response. If the coordinator has also crashed or the activity has been canceled, the participant is automatically compensated.
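The participant-side recovery scan can be sketched as a pass over the logged participants. The function name, the (participant_id, coordinator_host) pairs, and the boolean getstatus callback are assumptions; the compensate outcome follows the getstatus-fault case described earlier, where a completed participant is undone by compensation.

```python
def recover_participants(logged_participants, getstatus):
    """Decide an action for each completed participant found in the recovery log.

    logged_participants: iterable of (participant_id, coordinator_host) pairs.
    getstatus(host, participant_id) returns True if the activity still exists,
    and False if the coordinator faults or no longer knows the activity.
    """
    actions = {}
    for participant_id, host in logged_participants:
        if getstatus(host, participant_id):
            # Activity still running: resume resending completed and wait.
            actions[participant_id] = "resend-completed"
        else:
            # Coordinator crashed or activity canceled: undo the completed work.
            actions[participant_id] = "compensate"
    return actions
```

Each logged participant is resolved independently, so one unreachable coordinator host does not hold up recovery of participants enlisted with other coordinators.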