Article From:

I. Basic Concepts

  1. ACIDTheory: Transactions of relational databases satisfy the characteristics of ACID, and databases with ACID support strong consistency of data, which ensures that data itself will not be inconsistent.It is suitable for traditional single structure.
  2. CAPTheory: In a distributed system, there are three elements: Consistency, Availability, Partition tolerance, and they cannot be combined. Distributed system requires guarantee pointsZone fault tolerance can only be balanced between strong consistency (C) and availability (A), i.e., select CP or AP. For example, Zookeeper sacrifices some usability to ensure strong consistency for CP systems; Eureka sacrifices some consistency for high availability for AP systems. in additionIn CAP theory, network latency is ignored, that is, when a transaction is committed, the copy from node A to node B is obviously impossible in reality, so there will always be a certain amount of time is not consistent.Therefore, CAP is generally applicable to the theoretical basis of LAN system.
  3. BASETheory: Solve the unavailability and consistency of distributed system in CAP theory, and propose the final consistency. That is,The final data is consistent, rather than maintaining consistency in real time.。For example, if the payment is successful and the order is successful, but the increase points fail, the payment and the order should not be rolled back, but the points should be increased correctly through some compensation methods.

Two, solutions

programmeSolutionsApplicable sceneExplain
Local affairs
  1. ACID theory based on Database
  2. Log records based on undo and redo
  3. undoLog implementation rollback, redo log to achieve commit scene exception recovery
  1. Traditional single architecture
  2. Low requirements for distributed transactions
  1. What happens to distributed system scenarios?
  2. Log record – monitoring alarm – manual intervention repair
  3. Problem traceability, such as: maintenance orders can be created, but failure to invoke maintenance costs causes the entire transaction to roll back
    1. Possible maintenance cost problems, such as excessive performance pressure, cause the call failure to roll back when requested.
    2. Possible maintenance cost is successful, but returns.
Two stage submission
  1. Based on XA protocol, relying on TM and RM interaction, the ability to rely on database.
  2. TMThere is a single point of failure, lock resources occupy a longer time.
  1. For multi data sources or distributed database design (XA is essentially a specification between TM and RM).
  2. Architecture for multiple data sources
  3. MycatXA protocol is also implemented. Some companies use this scheme for distributed transactions, but the application layer is not a microservice architecture.
  4. It is suitable for core transaction business scenarios with short concurrency and short processing time.
  1. XAAgreement:
Three stage submission
  1. Based on TCC protocol
  2. Implementing transaction mechanism outside database to achieve final consistency.
  3. At the expense of application flexibility, you need to provide concrete implementations of Try, Confirm, and Cancel, and you need to be careful about idempotent operations
  1. Cross application, but the need to achieve TCC interface, invasion of existing systems larger, suitable for the new system.
  2. TCC is not a strong dependency on database properties.
  3. Reference implementation:
  1. TCCAgreement:
Reliable message mode
  1. Big business is changed into small business. The inconsistency between small things is compensated by additional training tasks.
  2. The idea was first put forward by Ebay: Id=1394128
  3. Can be divided into two models based on local events and external events.
  4. Business logic needs to ensure idempotency.
  1.  Suitable for core module modification, or completely based on message-driven architecture, otherwise the existing system intrusion is greater
  2. In addition, if you need to roll back, the scenario of more than two practical operations is complex, so this scenario needs to follow the ultimate consistency principle and failures do not roll until compensation is successful.
  3. Depending on the messaging system or database that has transaction functions, such as RabbitMQ, Kafka, RocketMQ, etc.


  1.  Based on local events:
  2. Based on external events:


 Reliable message variants
  1.  Packing the message queue functionality as a Rest service without relying on message queue communication masks the interface of the message queue
  2. Reduce the shortcomings of architecture and Application Intrusion Based on reliable message mode.
  1.  Maximum effort notification type
  2. For example, Alipay’s callback mechanism can set the index time to retry, refer to Ali to achieve:
  1. Downstream application polling
  2. For example, WeChat’s polling mechanism guarantees consistency from downstream applications.
  1. Based on workflow, principle:
  2. Define the process of sequential operation and rollback operation, and give transaction coordinator unified management.
  3. Some application frameworks implement this scheme, such as CQRS framework Axon framework: Framework/Axon Framework, and Huawei service comb:httPs://
  1. The application side defines the workflow and gives it to SAGA for management. Although this scheme is not hot, it has less intrusion to the application and conforms to the principle of layered design. Adding a composite layer to implement the process that requires distributed transactions separately is enough.


  1. SAGAWorkflow:


  1. The route to optimize the XA architecture is similar to that of XA, and the business invasion is small, adding annotations.
  2. GTSReference:
  3. Imitation GTS implementation:
  4. Similar to GTS: seems to be the most mature open source solution.
  1. It is applicable to Ali cloud program, and the dedicated line can also be accessed. The third party system can also access TCC.
Summing up suggestions
  1. If it is not necessary to introduce distributed transactions, each micro-service guarantees its own high availability and basically guarantees data consistency, except in extreme cases. –In fact, the microservice architecture BAT was in use ten years ago, and it’s the same without distributed transactions, because of the infrastructure, the availability of each microservice itselfRelatively high, so there is no need to introduce greater complexity.
  2. If necessary, the first step is to ensure data consistency for core services, such as transactions, using messaging, best effort notification, and polling schemes, which are all bookkeeping in nature and can be traced even if problems arise – this is usually done with the help of the capabilities of third-party payment systems.
  3. If only a small number of services require distributed transaction characteristics, you can use a reliable message-based solution locally, referring to, which requires a lot of detail and is theoretically possible for every link.Current network anomalies require corresponding measures to ensure, such as: if the establishment of an exponential time retry mechanism, downstream service interfaces need to ensure idempotency, which is equivalent to the business itself responsible for maintaining consistency.
  4. If a large number of businesses require distributed transactions, services like DelayMq can also be introduced to decouple and use this service to provide callback services to concatenate the service chains (messages contain callback Urls, parameters), but downstream service interfaces need to ensure idempotency — the PaaS platform can provide classesSimilar services, reference: The scheme needs to be able to accept part of the code refactoring.
  5. If a large number of businesses require distributed transactions, you can introduce a framework similar to GTS that has less intrusion into the business to avoid updating the architecture and code, and add the necessary annotations to the code, such as: api/tx-lcn — an open source solutionIt is recommended to be cautious after testing, and this ability can also be studied to see if the PaaS platform can be achieved
  6. Data consistency is a system engineering. It is not enough to solve it at the level of transaction framework only. It also needs supporting normative measures such as request request ID, link tracing, interface idempotency, log output specification, Key log record specification and so on.Let PaaS take over, provide link service, monitor alarm service, etc.
  7. Improving infrastructure and reducing the impact of network problems are important prerequisites. PaaS can provide DelayMq-like services for network exceptions when the actual invocation is successful and returns
  8. Perfect application monitoring and alarm facilities, such as API, access times, failure times monitoring, timely alarm – PaaS can provide a useful real-time monitoring and alarm capabilities

Three, reference materials

  • Then someone asks you about distributed transactions and throws this article to him:

  • GTSDemoIntroduction: Spm=a2c4g.11174283.3.5.6eea735d9NoIS6

  • ByteTCC:

  • GTSDeciphering the principle, architecture and characteristics of –GTS:

  • Distributed transaction series:

Four. GitHub related projects


Leave a Reply

Your email address will not be published. Required fields are marked *