Cross Internet Data Synchronization
Overview
This article briefly introduces how CloudCanal performs cross internet secure data synchronization, including the following features:
- The databases on both ends do not open public network ports at all
- Database schema is mappable on both ends.
- Based on HTTPS transport.
- It has a username and password authentication mechanism.
- Supports heterogeneous interworking of multiple databases.
- Does not rely on software such as message middleware.
Key Points
Tunnel DataSource
We connect CloudCanal DataJob through a virtual DataSource Tunnel.
The Tunnel DataSource itself is not an entity database, but a set of logical information, including:
- IP (or domain name)
- port
- User name
- password
- TLS certificate file and password
- schema
Through this virtual DataSource, we can use HTTP(S) or TCP to pull or receive data, while fully matching the CloudCanal business model to achieve functional integrity.
PUSH Model
For the data transfer mode PUSH or PULL, we choose the PUSH mode, that is, the client pushes data to the server ,the causes are:
- It mainly solves the interoperability problem, not the subscription problem
- Target-side synchronous writing data better matches CloudCanal's other target-side flavors
- There is no data persistence in the data channel, and there is no need to maintain a store to stage data
Of course, the PUSH mode also brings some problems, including:
- How to ensure that final data is written before submit position.
- Complex position backtracking (full and delta, inconsistent site formats for different DataSources).
For the above two problems, we use the Lazy Commit Position technique.
Lazy Commit Position
With the PUSH model, position management is a complex and dangerous job, and if the position is submitted early, data may be lost.
We implement the Lazy Commit Position technique, that is, every time the client writes to the server side, only the submitted positions is returned, and serialized into JSON strings.
Through this technology, we can ensure that the data before the position must have been written to the peer, and in some business scenarios, through the backtracking of the position of the client DataJob, we can achieve the purpose of repeatedly consuming data for a certain period of time.
Schema Mapping
We simulate the process of obtaining and migrating the schema for this DataSource, so that it is like a real database during DataJob creation and operation and maintenance.
The schema for a real Tunnel data source is as follows:
[
{
"db": "cc_virtual_db",
"schemas": [
{
"schema": "cc_virtual_schema",
"tables": [
{
"table": "WORKER_STATS",
"columns": [
{
"name": "ID",
"jdbcType": -5,
"typeName": "LONG",
"key": true
},
{
"name": "GMT_CREATE",
"jdbcType": 93,
"typeName": "TIMESTAMP",
"key": false
},
{
"name": "AUCRDT",
"jdbcType": 93,
"typeName": "TIMESTAMP",
"key": false
}
]
},
{
"table": "KBS_QUESTION",
"columns": [
{
"name": "ID",
"jdbcType": -5,
"typeName": "LONG",
"key": true
},
{
"name": "CATEGORY",
"jdbcType": 12,
"typeName": "STRING",
"key": false
}
]
}
]
}
]
}
]
We can change it with Schema Migration (MySQL/SQLServer/Oracle -> Tunnel) or modify it directly through **DataSource > More > View Configuration > dbsJson.
Example
This example uses Alibaba Cloud resources to simulate RDS for MySQL in Hangzhou to RDS for MySQL in Shenzhen, with no public network ports on either database, data going over the Internet, HTTPS transmission, and username and password authentication.
Environment Preparation
The Hangzhou environment deploys CloudCanal with RDS for MySQL as the source DataSource.
The Shenzhen environment deployed CloudCanal and targeted RDS for MySQL.
Because CloudCanal is a docker version, after the CloudCanal installation package in Shenzhen is decompressed, you need to modify the docker-compose.yml port mapping before installing/upgrading, and open the relevant ports of the ECS security group for remote connection.
In this example, port 18443 is used as the listening port for the Tunnel DataSource.
Initialize Schema for Target DataSource
- Since it is not possible to migrate the schema to the peer database through the Tunnel, you need to use tools such as mysqldump to initialize the peer database structure in advance.
Add a Tunnel DataSource
Configure the Tunnel DataSource in the source and target CloudCanal, respectively.
List of source side DataSources
List of target side DataSources
Initialize Schema for Tunnel
Create a MySQL -> Tunnel schema migration DataJob and wait to complete.
Copy the schema from the source side Tunnel and copy it to the target side.
Create Target DataJob
Select the Tunnel and target DataSource.
Select Incremental.
Select tables, columns.
The DataJob runs normally, listens on the port and prepares to receive data.
Create Source DataJob
Select the source DataSource and the Tunnel.
Select Incremental, and check Full Data option.
Data is continuously synchronized.
Verification
Create Loads
- In order to make it easy to create data, open the public network address of the source DataSource.
Data Validate
- Add source DataSource in the Shenzhen environment and perform data verification. The results show consistent.
FAQ
Which Connections Are Currently Supported?
MySQL/SQLServer/ORACLE -> MySQL, other interoperable additions as needed.
Can Tunnel Do Schema Migration To Peer Database?
Because the database schema requires high metadata accuracy, the intermediate structure of the Tunnel is mainly synchronous services, so it cannot constitute an accurate schema migration source at the metadata level. We recommend that you build a temporary instance (only dump the table structure), open the public network, and then use CloudCanal to migrate to solve the problem.
Tunnel Have A Schema, Can They Be Edited Dynamically?
The Tunnel simulates a database, and editing are naturally available. Add a table to edit the target DataJob first, and then edit the source DataJob, otherwise vice versa.
What Are The Remaining Problems?
- Field types such as blobs require further support and validation.
- Across the Internet, the performance level needs to be specially optimized.
- Only HTTPS certificate encryption is used with custom account passwords.
Summary
This article briefly introduces how CloudCanal performs cross internet secure data synchronization by introducing virtual DataSource, which has good landability.