http://courses.cs.vt.edu/~cs5204/fall00/distributedDBMS/sreenu/3pc.html
http://wiki.c2.com/?TwoPhaseCommit
- Two-Phase Commit protocol - a blocking protocol
- Three-Phase Commit protocol - a non-blocking protocol
In the X/Open model, there are three parties: the application, the resource manager (RM), and the transaction manager (TM). An example of an RM might be a database (like Oracle, DB2, SQL Server) or a transactional message queue (like IBM MQSeries or Microsoft Message Queue). An example of an app is the code that performs the transaction's operations. The TM is often invisible to the app, but plays the role of director when multiple distributed RMs participate in a transaction.
The way it works:
- The app begins a transaction. The TM opens and maintains a context on behalf of the app.
- The app then contacts an RM and reads or writes within the context of that transaction. The app must communicate to the RM via a client library that is aware of the TM, and the RM itself must be aware of the TM context.
- The app can then contact other RMs similarly.
- When the app is finished, it can request a commit.
- The TM then contacts each RM that was involved in the transaction, and sends the "prepare" command. Essentially the TM is asking the RM, "the changes performed at your resource, on behalf of this transaction - can you make them permanent?"
- Each RM then must respond "commit" or "abort". This is sometimes called the "vote". If the RM votes to commit, it implicitly assures the TM that no changes will be lost, even in the face of failure (like a power or network failure). This generally means the RM must store its changes to durable media, like a disk.
- If the TM gets a unanimous "commit" vote, then the TM sends "commit" messages to each RM. If any RM votes to abort, then the TM sends an abort message to each RM.
- Each RM receives the TM's decision, then either rolls forward or back, and releases the locks held on behalf of the transaction. If the network is interrupted and an RM never gets the message, that RM never resolves the transaction; in this case administrator intervention may be required.
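The TM's side of the flow above can be sketched as a short loop. This is a minimal simulation, not the real X/Open XA API: the RM interface (`prepare`/`commit`/`abort`) and the `FakeRM` class are illustrative names.

```python
# Minimal sketch of the TM's commit sequence described above.
# The RM interface here is hypothetical, not the real XA API.

def tm_commit(rms):
    """Run two-phase commit over the resource managers in `rms`."""
    # Phase 1: ask every RM to vote on making its changes permanent.
    votes = [rm.prepare() for rm in rms]          # each returns "commit" or "abort"
    decision = "commit" if all(v == "commit" for v in votes) else "abort"
    # Phase 2: broadcast the unanimous decision to every RM.
    for rm in rms:
        if decision == "commit":
            rm.commit()    # roll forward and release locks
        else:
            rm.abort()     # roll back and release locks
    return decision

class FakeRM:
    """Stand-in resource manager that votes a fixed way."""
    def __init__(self, vote):
        self.vote, self.state = vote, "active"
    def prepare(self):
        # A "commit" vote implies the changes are already on durable media.
        return self.vote
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"
```

A single "abort" vote is enough to roll back every participant, which is exactly the unanimity rule from the bullets above.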
TwoPhaseCommit assumes reliable communications and tends to utilize locking (between receipt of 'Prepare' and TM's final 'Commit'). If the communications are unreliable (especially if the TM can go down in the middle of a commit) or if blocking is undesirable, ThreePhaseCommit will be required.
http://the-paper-trail.org/blog/consensus-protocols-two-phase-commit/
We can identify two steps in the process:
- Contact every participant, suggest a value and gather their responses
- If everyone agrees, contact every participant again to let them know. Otherwise, contact every participant to abort the consensus.
The first proposal phase involves proposing a value to every participant in the system and gathering responses. The second commit-or-abort phase communicates the result of the vote to the participants and tells them either to go ahead and decide or abort the protocol.
The process that proposes values is called the coordinator, and does not have to be specially elected – any node can act as the coordinator if they want to and therefore initiate a round of 2PC.
Observe that the consensus here is in regard to whether or not to accept the value proposed by the coordinator, not on the value itself. So the nodes are not achieving consensus about what that value should be; they are achieving consensus on whether or not to accept that value.
As discussed elsewhere on this blog, nodes can crash in several ways. Most simply, they can crash and never recover. This is the ‘fail-stop’ model of distributed system failure. Instead, nodes could crash and may at some later date recover from the failure and continue executing. This is the ‘fail-recover’ model. Most generally, a node could deviate from the protocol specification in completely arbitrary ways. This is called ‘Byzantine failure’, and designing protocols to cope with it is an active area of research, and rather difficult.
Now consider the protocol after some of the proposal messages have been sent, but not all of them. If the coordinator crashes at this point we’ll have some nodes that have received a proposal and are starting a 2PC round, and some nodes that are unaware that anything is going on. If the coordinator doesn’t recover for a long time, the nodes that received the proposal are going to be blocked waiting for the outcome of a protocol that might never be finished. This can mean that no other instances of the protocol can be successfully executed, as participants might have to take locks on resources when voting ‘commit’. These nodes will have sent back their votes to the coordinator – unaware that it has failed – and therefore can’t simply timeout and abort the protocol since there’s a possibility the coordinator might reawaken, see their ‘commit’ votes and start phase two of the protocol with a commit message.
The protocol is therefore blocked on the coordinator, and can’t make any progress. We can add some mechanisms to deal with this – and we’ll describe these below – but this problem of being stuck waiting for some participant to complete their part of the protocol is something that 2PC will never quite shrug off.
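The blocked-participant scenario above can be made concrete with a toy state machine. The state names are illustrative, not from any particular 2PC implementation:

```python
# Toy participant state machine showing why a "commit" vote blocks.

class Participant:
    def __init__(self):
        self.state = "init"

    def on_proposal(self, vote):
        # Vote "commit" only once changes are durable; then wait for the outcome.
        self.state = "uncertain" if vote == "commit" else "aborted"
        return vote  # the vote sent back to the coordinator

    def on_timeout(self):
        if self.state == "init":
            # No vote sent yet: aborting unilaterally is safe.
            self.state = "aborted"
        # If state is "uncertain", do nothing: the coordinator may still
        # recover and send a commit, so the participant must keep waiting
        # (and keep holding its locks) -- this is 2PC's blocking window.
```

Once `on_proposal("commit")` has run, no timeout can move the participant out of `"uncertain"`; that is precisely the "blocked on the coordinator" problem.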
2PC is still a very popular consensus protocol, because it has a low message complexity (although in the failure case, if every node decides to be the recovery node and contacts every other node, the complexity can go to O(n²)). A client that talks to the co-ordinator can have a reply in three message delays’ time. This low latency is very appealing for some applications.
However, the fact that 2PC can block on co-ordinator failure is a significant problem that dramatically hurts availability.
http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/
The fundamental difficulty with 2PC is that, once the decision to commit has been made by the co-ordinator and communicated to some replicas, the replicas go right ahead and act upon the commit statement without checking to see if every other replica got the message. Then, if a replica that committed crashes along with the co-ordinator, the system has no way of telling what the result of the transaction was (since only the co-ordinator and the replica that got the message know for sure). Since the transaction might already have been committed at the crashed replica, the protocol cannot pessimistically abort – as the transaction might have had side-effects that are impossible to undo. Similarly, the protocol cannot optimistically force the transaction to commit, as the original vote might have been to abort.
The idea is very simple. We break the second phase of 2PC – ‘commit’ – into two sub-phases. The first is the ‘prepare to commit’ phase. The co-ordinator sends this message to all replicas when it has received unanimous ‘yes’ votes in the first phase. On receipt of this message, replicas get into a state where they are able to commit the transaction – by taking necessary locks and so forth – but crucially do not do any work that they cannot later undo. They then reply to the co-ordinator telling it that the ‘prepare to commit’ message was received.
The purpose of this phase is to communicate the result of the vote to every replica so that the state of the protocol can be recovered no matter which replica dies.
The last phase of the protocol does almost exactly the same thing as the original ‘commit or abort’ phase in 2PC. If the co-ordinator receives confirmation of the delivery of the ‘prepare to commit’ message from all replicas, it is then safe to go ahead with committing the transaction. However, if delivery is not confirmed, the co-ordinator cannot guarantee that the protocol state will be recovered should it crash (if you are tolerating a fixed number of failures, the co-ordinator can go ahead once it has received confirmations). In this case, the co-ordinator will abort the transaction.
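The three phases described above can be sketched as a coordinator loop. The replica interface (`vote`/`prepare_to_commit`/`commit`/`abort`) and `FakeReplica` are hypothetical names for this sketch, which assumes the strict variant where any unconfirmed ‘prepare to commit’ causes an abort:

```python
# Sketch of the 3PC coordinator described above (hypothetical replica API).

def coordinator_3pc(replicas):
    # Phase 1: collect votes.
    if not all(r.vote() == "yes" for r in replicas):
        for r in replicas:
            r.abort()
        return "abort"
    # Phase 2: 'prepare to commit' -- replicas take locks but do nothing
    # irreversible, then acknowledge receipt.
    acks = [r.prepare_to_commit() for r in replicas]
    if not all(acks):
        # Delivery unconfirmed: protocol state may not be recoverable after
        # a coordinator crash, so (in this strict variant) abort.
        for r in replicas:
            r.abort()
        return "abort"
    # Phase 3: every replica knows the vote's outcome, so commit is safe.
    for r in replicas:
        r.commit()
    return "commit"

class FakeReplica:
    """Stand-in replica with a scripted vote and ack behaviour."""
    def __init__(self, vote="yes", ack=True):
        self._vote, self._ack, self.state = vote, ack, "init"
    def vote(self):
        return self._vote
    def prepare_to_commit(self):
        if self._ack:
            self.state = "prepared"
        return self._ack
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"
```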
If the co-ordinator should crash at any point, a recovery node can take over the transaction and query the state from any remaining replicas. If a replica that has committed the transaction has crashed, we know that every other replica has received a ‘prepare to commit’ message (otherwise the co-ordinator wouldn’t have moved to the commit phase), and therefore the recovery node will be able to determine that the transaction was able to be committed, and safely shepherd the protocol to its conclusion. If any replica reports to the recovery node that it has not received ‘prepare to commit’, the recovery node will know that the transaction has not been committed at any replica, and will therefore be able either to pessimistically abort or re-run the protocol from the beginning.
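The recovery-node reasoning in the paragraph above reduces to a small decision rule over the states of the replicas it can reach. The state names here are illustrative:

```python
# Recovery-node decision rule sketched from the paragraph above.
# replica_states: states reported by the replicas the recovery node can reach.

def recover(replica_states):
    if "committed" in replica_states:
        # Someone committed, so every replica must have received
        # 'prepare to commit': safe to roll the protocol forward.
        return "commit"
    if any(s != "prepared" for s in replica_states):
        # Some reachable replica never saw 'prepare to commit', so no
        # replica anywhere can have committed: safe to abort (or re-run).
        return "abort"
    # Every reachable replica is prepared: committing completes safely.
    return "commit"
```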
So does 3PC fix all our problems? Not quite, but it comes close. In the case of a network partition, the wheels rather come off – imagine that all the replicas that received ‘prepare to commit’ are on one side of the partition, and those that did not are on the other. Then both partitions will continue with recovery nodes that respectively commit or abort the transaction, and when the network merges the system will have an inconsistent state. So 3PC has potentially unsafe runs, as does 2PC, but will always make progress and therefore satisfies its liveness properties. The fact that 3PC will not block on single node failures makes it much more appealing for services where high availability is more important than low latencies.
The beginning of a distributed transaction looks just like any other transaction: the application issues a BEGIN statement (optional in PostgreSQL), followed by normal SQL statements. When the transaction manager is instructed to commit, it runs the first commit phase by saying “PREPARE TRANSACTION 'foo'” (where “foo” is some arbitrary identifier for this transaction) on each system involved in the distributed transaction. Each system does whatever it needs to do to determine whether or not this transaction can be committed and to make sure it can be committed even if the server crashes, and reports success or failure. If all systems succeed, the transaction manager follows up with “COMMIT PREPARED 'foo'”, and if a system reports failure, the transaction manager can roll back all the other systems using either ROLLBACK (for those transactions it hasn’t yet prepared), or “ROLLBACK PREPARED 'foo'”. Using two-phase commit is obviously slower than committing transactions on only one database, but sometimes the data integrity it provides justifies the extra cost.
Distributed transactions require a “transaction manager”, an application which handles the special semantics required to commit a distributed transaction. Second, the systems involved must support “two-phase commit” (which was added to PostgreSQL in version 8.1). Distributed transactions are committed using PREPARE TRANSACTION 'foo' (phase 1), and COMMIT PREPARED 'foo' or ROLLBACK PREPARED 'foo' (phase 2), rather than the usual COMMIT or ROLLBACK.
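The PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED sequence above can be sketched as a small driver. The SQL statements are real PostgreSQL syntax (8.1+), but `Node` here is a stand-in that records statements rather than a real database connection:

```python
# Sketch of a transaction manager issuing PostgreSQL's two-phase commands.
# Node is a fake that logs SQL instead of talking to a real server.

class Node:
    def __init__(self, prepare_ok=True):
        self.prepare_ok, self.log = prepare_ok, []
    def execute(self, sql):
        self.log.append(sql)
        if sql.startswith("PREPARE TRANSACTION") and not self.prepare_ok:
            raise RuntimeError("prepare failed")

def distributed_commit(nodes, txid):
    prepared = []
    for n in nodes:
        try:
            n.execute(f"PREPARE TRANSACTION '{txid}'")     # phase 1
            prepared.append(n)
        except RuntimeError:
            # Roll back: ROLLBACK PREPARED on nodes already prepared,
            # plain ROLLBACK on the failed and not-yet-prepared ones.
            for p in prepared:
                p.execute(f"ROLLBACK PREPARED '{txid}'")
            for u in nodes:
                if u not in prepared and u is not n:
                    u.execute("ROLLBACK")
            n.execute("ROLLBACK")
            return "abort"
    for n in nodes:                                        # phase 2
        n.execute(f"COMMIT PREPARED '{txid}'")
    return "commit"
```

Against a real cluster the `execute` calls would go over per-node connections, but the statement sequence is the same one the paragraph above describes.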
Full J2EE application servers typically come with a transaction manager component. For my examples I’ll use an open source, standalone transaction manager, called Bitronix. I’m not particularly fond of using Java for simple scripts, though, so I’ve used JRuby for this demonstration code.
Unlike 2PC, cohorts do not execute a transaction during the voting phase. Rather, they simply indicate if they are prepared to perform the transaction. If cohorts timeout during this phase or there are one or more “no” votes, the transaction is aborted. If the vote is unanimously “yes,” the coordinator moves on to the “prepare” phase, sending a message to its cohorts to acknowledge the transaction will be committed. Again, if an ack times out, the transaction is aborted. Once all cohorts have acknowledged the commit, we are guaranteed to be in a state where all cohorts have agreed to commit. At this point, if the commit message from the coordinator is not received in the third phase, the cohort will go ahead and commit anyway. This solves the deadlocking problems described earlier. However, 3PC is still susceptible to network partitions. If a partition occurs, the coordinator will timeout and progress will not be made.
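The third-phase rule described above is small enough to state directly as code. This is an illustrative sketch of the cohort's decision, not any particular implementation's API:

```python
# Third-phase cohort rule sketched from the description above: a cohort
# that has acknowledged 'prepare' commits even if the coordinator's final
# commit message never arrives, because at that point every cohort is
# known to have agreed to commit.

def cohort_third_phase(acked_prepare, commit_msg_received):
    if commit_msg_received:
        return "commit"
    # Timed out waiting for the coordinator: commit anyway if we acked
    # the prepare; otherwise nothing was promised and aborting is safe.
    return "commit" if acked_prepare else "abort"
```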