
Cross Internet Data Synchronization

Overview

This article briefly introduces how CloudCanal performs secure cross-Internet data synchronization, which has the following features:

  • The databases on both ends do not open any public network ports
  • The database schema is mappable between the two ends
  • Transport is based on HTTPS
  • Username/password authentication is built in
  • Heterogeneous synchronization between multiple database types is supported
  • No dependency on middleware such as message queues

Key Points


Tunnel DataSource

We connect two CloudCanal DataJobs through a virtual DataSource called Tunnel.

The Tunnel DataSource is not a real database but a set of logical information, including:

  • IP (or domain name)
  • Port
  • Username
  • Password
  • TLS certificate file and password
  • Schema

Through this virtual DataSource, data can be pulled or received over HTTP(S) or TCP while fully fitting the CloudCanal business model, keeping the feature set complete.
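Since the Tunnel is just a bundle of logical attributes, it can be pictured as a plain configuration object. A minimal sketch in Python (field names are illustrative, not CloudCanal's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class TunnelDataSource:
    """Logical description of a virtual Tunnel DataSource (illustrative)."""
    host: str               # IP or domain name of the peer listener
    port: int               # listening port, e.g. 18443 in the example below
    username: str           # authentication user
    password: str           # authentication password
    tls_cert_file: str      # path to the TLS certificate file
    tls_cert_password: str  # password protecting the certificate
    schema: list = field(default_factory=list)  # virtual schema (dbsJson-like)

# Hypothetical Tunnel listening on port 18443, as in the example scenario
tunnel = TunnelDataSource(
    host="cc.example.internal", port=18443,
    username="sync_user", password="secret",
    tls_cert_file="/etc/cc/tunnel.p12", tls_cert_password="certpass",
)
print(tunnel.port)  # 18443
```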

PUSH Model

Between the two transfer modes, PUSH and PULL, we chose PUSH: the client pushes data to the server. The reasons are:

  • The feature mainly solves an interoperability problem, not a subscription problem
  • Writing data synchronously on the target side matches CloudCanal's other target-side implementations
  • The data channel holds no persistent data, so there is no store to maintain for staging data

Of course, the PUSH mode also brings some problems:

  • How to ensure that all data is written before its position is committed
  • Complex position backtracking (full and incremental phases, and inconsistent position formats across DataSources)

We address both problems with the Lazy Commit Position technique.

Lazy Commit Position

With the PUSH model, position management is complex and risky: if a position is committed too early, data may be lost.

With Lazy Commit Position, every time the client writes to the server side, the server returns only the positions that are now safe to commit, serialized as JSON strings.

This guarantees that all data before a committed position has been written to the peer. In some business scenarios, backtracking the client DataJob's position lets us re-consume data for a given time range.
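The push-then-lazy-commit handshake can be sketched as follows. This is a simplified model with hypothetical names (the real channel is HTTPS-based): the client pushes a batch together with its position, the server writes the data first and only then echoes back the position as safe to commit, and the client persists nothing but that echoed position.

```python
import json

class TunnelServer:
    """Target side: applies each batch, then reports the safe-to-commit position."""
    def __init__(self):
        self.rows = []
        self.committed = None

    def push(self, batch, position):
        self.rows.extend(batch)    # write data to the target first...
        self.committed = position  # ...only then is this position safe
        # serialize the committed position as a JSON string for the client
        return json.dumps({"position": self.committed})

class TunnelClient:
    """Source side: persists only positions echoed back by the server."""
    def __init__(self, server):
        self.server = server
        self.saved_position = None

    def send(self, batch, position):
        reply = json.loads(self.server.push(batch, position))
        self.saved_position = reply["position"]  # lazy commit

server = TunnelServer()
client = TunnelClient(server)
client.send([{"id": 1}, {"id": 2}], position="binlog:000001:4096")
print(client.saved_position)  # binlog:000001:4096
```

If the push fails, the client never advances its position, so on restart it resends from the last committed point; data before a committed position is therefore guaranteed to exist on the peer.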

Schema Mapping

We simulate schema retrieval and migration for this DataSource, so that during DataJob creation, operation, and maintenance it behaves like a real database.

The schema of an actual Tunnel DataSource looks like this:

[
  {
    "db": "cc_virtual_db",
    "schemas": [
      {
        "schema": "cc_virtual_schema",
        "tables": [
          {
            "table": "WORKER_STATS",
            "columns": [
              { "name": "ID", "jdbcType": -5, "typeName": "LONG", "key": true },
              { "name": "GMT_CREATE", "jdbcType": 93, "typeName": "TIMESTAMP", "key": false },
              { "name": "AUCRDT", "jdbcType": 93, "typeName": "TIMESTAMP", "key": false }
            ]
          },
          {
            "table": "KBS_QUESTION",
            "columns": [
              { "name": "ID", "jdbcType": -5, "typeName": "LONG", "key": true },
              { "name": "CATEGORY", "jdbcType": 12, "typeName": "STRING", "key": false }
            ]
          }
        ]
      }
    ]
  }
]

We can populate it with a Schema Migration DataJob (MySQL/SQLServer/Oracle -> Tunnel) or modify it directly via **DataSource > More > View Configuration > dbsJson**.
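Before pasting an edited copy back into dbsJson, it can help to sanity-check the document. Assuming dbsJson has the shape shown above, a short script (a sketch, not an official CloudCanal tool) can list each table and its key columns:

```python
import json

# Trimmed copy of the dbsJson structure shown above
dbs_json = """[
  {"db": "cc_virtual_db", "schemas": [
    {"schema": "cc_virtual_schema", "tables": [
      {"table": "WORKER_STATS", "columns": [
        {"name": "ID", "jdbcType": -5, "typeName": "LONG", "key": true},
        {"name": "GMT_CREATE", "jdbcType": 93, "typeName": "TIMESTAMP", "key": false}
      ]}
    ]}
  ]}
]"""

# Walk db -> schema -> table and report the primary-key columns of each table
for db in json.loads(dbs_json):
    for schema in db["schemas"]:
        for table in schema["tables"]:
            keys = [c["name"] for c in table["columns"] if c["key"]]
            print(f'{db["db"]}.{schema["schema"]}.{table["table"]} keys={keys}')
# → cc_virtual_db.cc_virtual_schema.WORKER_STATS keys=['ID']
```

A table with an empty `keys` list is worth a second look, since key columns drive how incremental changes are matched on the target side.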

Example

This example uses Alibaba Cloud resources to simulate synchronization from RDS for MySQL in Hangzhou to RDS for MySQL in Shenzhen, with no public network ports opened on either database, data traveling over the Internet, HTTPS transmission, and username/password authentication.

Environment Preparation

  • The Hangzhou environment runs CloudCanal with RDS for MySQL as the source DataSource.

  • The Shenzhen environment runs CloudCanal with RDS for MySQL as the target DataSource.

  • Because this CloudCanal deployment is the Docker edition, after decompressing the CloudCanal installation package in Shenzhen you need to modify the docker-compose.yml port mapping before installing/upgrading, and open the relevant ports in the ECS security group for remote connections.

  • In this example, port 18443 is used as the listening port for the Tunnel DataSource.

Initialize Schema for Target DataSource

  • Since the schema cannot be migrated to the peer database through the Tunnel, use a tool such as mysqldump to initialize the peer database structure in advance.
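One way to do this is to dump only the table structures from the source and replay them on the target. The hostnames, users, and database names below are placeholders for this example:

```shell
# Dump table structures only (no rows) from the source RDS instance
mysqldump --no-data --skip-triggers \
  -h source-rds.mysql.rds.aliyuncs.com -u sync_user -p \
  source_db > schema.sql

# Apply the structures to the target RDS instance
mysql -h target-rds.mysql.rds.aliyuncs.com -u sync_user -p target_db < schema.sql
```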

Add a Tunnel DataSource

  • Configure the Tunnel DataSource in the source and target CloudCanal, respectively.

  • List of source-side DataSources.

  • List of target-side DataSources.

Initialize Schema for Tunnel

  • Create a MySQL -> Tunnel schema migration DataJob and wait for it to complete.

  • Copy the schema from the source-side Tunnel to the target-side Tunnel.

Create Target DataJob

  • Select the Tunnel and the target DataSource.

  • Select Incremental.

  • Select tables and columns.

  • The DataJob runs normally, listening on the port and ready to receive data.

Create Source DataJob

  • Select the source DataSource and the Tunnel.

  • Select Incremental and check the Full Data option.

  • Data is synchronized continuously.

Verification

Create Loads

  • To make it easy to generate data, open the public network address of the source DataSource.

Data Validate

  • Add the source DataSource in the Shenzhen environment and perform data verification. The results are consistent.

FAQ

Which Connections Are Currently Supported?

MySQL/SQLServer/Oracle -> MySQL; other combinations can be added as needed.

Can Tunnel Do Schema Migration To Peer Database?

Because schema migration demands highly accurate metadata, and the Tunnel's intermediate structures are designed mainly for synchronization, the Tunnel cannot serve as an accurate schema migration source at the metadata level. We recommend building a temporary instance (dumping only the table structure), opening its public network access, and then using CloudCanal to migrate the schema.

The Tunnel Has A Schema. Can It Be Edited Dynamically?

Since the Tunnel simulates a database, editing its schema is naturally supported. When adding a table, edit the target DataJob first and then the source DataJob; when removing a table, reverse the order.

What Are The Remaining Problems?

  • Field types such as BLOBs require further support and validation.
  • Performance across the Internet needs dedicated optimization.
  • Security currently consists only of HTTPS certificate encryption plus custom username/password authentication.

Summary

This article briefly introduced how CloudCanal performs secure cross-Internet data synchronization through a virtual DataSource, an approach that works well in practice.