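FileSystem/JDBC/Kafka - How Flink's Three Major Connectors Work, with Examples
This article walks through the source-code implementation of three Flink connectors (the FileSystem Connector, the JDBC Connector, and the Kafka Connector) and gives example code for each.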
FileSystem Connector
Sink
Construct a FileSystemTableSink object and pass in the relevant options:
public TableSink<RowData> createTableSink(TableSinkFactory.Context context) {
Configuration conf = new Configuration();
context.getTable().getOptions().forEach(conf::setString);
return new FileSystemTableSink(
context.getObjectIdentifier(), // identifier of the table
context.isBounded(), // whether the stream is bounded
context.getTable().getSchema(), // table schema
getPath(conf), // file path
context.getTable().getPartitionKeys(), // partition keys
conf.get(PARTITION_DEFAULT_NAME), // default partition name
context.getTable().getOptions()); // table options
}
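FileSystemTableSink builds a DataStreamSink from the incoming DataStream. consumeDataStream does the following:
Constructs a RowDataPartitionComputer, which separates the indexes and types of the partition fields from those of the non-partition fields.
Creates an EmptyMetaStoreFactory, a no-op metastore implementation.
Generates the part-file prefix from a UUID.
Constructs the FileSystemFactory implementation.
Takes a different branch depending on whether the stream is bounded.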
public final DataStreamSink<RowData> consumeDataStream(DataStream<RowData> dataStream) {
RowDataPartitionComputer computer = new RowDataPartitionComputer(
defaultPartName,
schema.getFieldNames(),
schema.getFieldDataTypes(),
partitionKeys.toArray(new String[0]));
EmptyMetaStoreFactory metaStoreFactory = new EmptyMetaStoreFactory(path);
OutputFileConfig outputFileConfig = OutputFileConfig.builder()
.withPartPrefix("part-" + UUID.randomUUID().toString())
.build();
FileSystemFactory fsFactory = FileSystem::get;
if (isBounded) {
FileSystemOutputFormat.Builder<RowData> builder = new FileSystemOutputFormat.Builder<>();
builder.setPartitionComputer(computer);
builder.setDynamicGrouped(dynamicGrouping);
builder.setPartitionColumns(partitionKeys.toArray(new String[0]));
builder.setFormatFactory(createOutputFormatFactory());
builder.setMetaStoreFactory(metaStoreFactory);
builder.setFileSystemFactory(fsFactory);
builder.setOverwrite(overwrite);
builder.setStaticPartitions(staticPartitions);
builder.setTempPath(toStagingPath());
builder.setOutputFileConfig(outputFileConfig);
return dataStream.writeUsingOutputFormat(builder.build())
.setParallelism(dataStream.getParallelism());
} else {
Configuration conf = new Configuration();
properties.forEach(conf::setString);
Object writer = createWriter();
TableBucketAssigner assigner = new TableBucketAssigner(computer);
TableRollingPolicy rollingPolicy = new TableRollingPolicy(
!(writer instanceof Encoder),
conf.get(SINK_ROLLING_POLICY_FILE_SIZE).getBytes(),
conf.get(SINK_ROLLING_POLICY_ROLLOVER_INTERVAL).toMillis());
BucketsBuilder<RowData, String, ? extends BucketsBuilder<RowData, ?, ?>> bucketsBuilder;
if (writer instanceof Encoder) {
//noinspection unchecked
bucketsBuilder = StreamingFileSink.forRowFormat(
path, new ProjectionEncoder((Encoder<RowData>) writer, computer))
.withBucketAssigner(assigner)
.withOutputFileConfig(outputFileConfig)
.withRollingPolicy(rollingPolicy);
} else {
//noinspection unchecked
bucketsBuilder = StreamingFileSink.forBulkFormat(
path, new ProjectionBulkFactory((BulkWriter.Factory<RowData>) writer, computer))
.withBucketAssigner(assigner)
.withOutputFileConfig(outputFileConfig)
.withRollingPolicy(rollingPolicy);
}
return createStreamingSink(
conf,
path,
partitionKeys,
tableIdentifier,
overwrite,
dataStream,
bucketsBuilder,
metaStoreFactory,
fsFactory,
conf.get(SINK_ROLLING_POLICY_CHECK_INTERVAL).toMillis());
}
}
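Streaming jobs are normally unbounded, so they take the else branch:
Creates a writer object according to the format type; for parquet, for example, the writer comes from a BulkWriter factory.
Wraps the RowDataPartitionComputer in a TableBucketAssigner.
Constructs a TableRollingPolicy, which decides when part files are rolled; with a BulkWriter, files are rolled whenever a checkpoint is taken.
Constructs the BucketsBuilder object.
Calls createStreamingSink, which does the following:
Wraps the BucketsBuilder in a StreamingFileWriter, an operator that extends AbstractStreamOperator.
Appends this operator after inputStream; the main processing logic lives in this operator.
If the table is partitioned and sink.partition-commit.policy.kind is configured, adds another operator that handles partition commit, for example registering the partition in the metastore or writing a _success file.
Finally discards the records with a DiscardingSink function, because the operators above have already processed them.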
public static DataStreamSink<RowData> createStreamingSink(
Configuration conf,
Path path,
List<String> partitionKeys,
ObjectIdentifier tableIdentifier,
boolean overwrite,
DataStream<RowData> inputStream,
BucketsBuilder<RowData, String, ? extends BucketsBuilder<RowData, ?, ?>> bucketsBuilder,
TableMetaStoreFactory msFactory,
FileSystemFactory fsFactory,
long rollingCheckInterval) {
if (overwrite) {
throw new IllegalStateException("Streaming mode not support overwrite.");
}
StreamingFileWriter fileWriter = new StreamingFileWriter(
rollingCheckInterval,
bucketsBuilder);
DataStream<CommitMessage> writerStream = inputStream.transform(
StreamingFileWriter.class.getSimpleName(),
TypeExtractor.createTypeInfo(CommitMessage.class),
fileWriter).setParallelism(inputStream.getParallelism());
DataStream<?> returnStream = writerStream;
// save committer when we don't need it.
if (partitionKeys.size() > 0 && conf.contains(SINK_PARTITION_COMMIT_POLICY_KIND)) {
StreamingFileCommitter committer = new StreamingFileCommitter(
path, tableIdentifier, partitionKeys, msFactory, fsFactory, conf);
returnStream = writerStream
.transform(StreamingFileCommitter.class.getSimpleName(), Types.VOID, committer)
.setParallelism(1)
.setMaxParallelism(1);
}
//noinspection unchecked
return returnStream.addSink(new DiscardingSink()).setParallelism(1);
}
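PS: The code above uses a Java 8 functional interface, which can be confusing the first time you see it. An interface with exactly one abstract method is a functional interface. It can be implemented in several ways: traditionally with an anonymous inner class, or with a lambda or a method reference (FileSystem::get is a method reference, not a constructor reference). For example: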
FileSystemFactory fsFactory = FileSystem::get;
// equivalent to an anonymous class
FileSystemFactory fileSystemFactory = new FileSystemFactory() {
public FileSystem create(URI fsUri) throws IOException {
return FileSystem.get(fsUri);
}
};
// equivalent to a lambda
FileSystemFactory fileSystemFactory = uri -> FileSystem.get(uri);
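The three forms can be tried without any Flink dependency. The following is a minimal sketch; ResourceFactory and open are hypothetical stand-ins for FileSystemFactory and FileSystem.get, and they return a String instead of a FileSystem so that the example stays self-contained.

```java
// Hypothetical stand-in for FileSystemFactory: exactly one abstract method,
// so it is a functional interface.
@FunctionalInterface
interface ResourceFactory {
    String create(String uri);
}

class FunctionalInterfaceDemo {
    // Hypothetical stand-in for the static FileSystem.get(uri) method.
    static String open(String uri) {
        return "opened:" + uri;
    }

    // 1. Anonymous inner class.
    static final ResourceFactory ANONYMOUS = new ResourceFactory() {
        @Override
        public String create(String uri) {
            return open(uri);
        }
    };

    // 2. Lambda expression.
    static final ResourceFactory LAMBDA = uri -> open(uri);

    // 3. Method reference to the static method.
    static final ResourceFactory METHOD_REF = FunctionalInterfaceDemo::open;

    public static void main(String[] args) {
        // All three factories delegate to open(), so they print the same value.
        System.out.println(ANONYMOUS.create("file:///tmp"));
        System.out.println(LAMBDA.create("file:///tmp"));
        System.out.println(METHOD_REF.create("file:///tmp"));
    }
}
```

All three factories behave identically; the method reference is simply the shortest way to say "delegate to this existing method".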
Writing data to the file system
Records are processed in StreamingFileWriter#processElement:
public void processElement(StreamRecord<RowData> element) throws Exception {
helper.onElement(
element.getValue(),
getProcessingTimeService().getCurrentProcessingTime(),
element.hasTimestamp() ? element.getTimestamp() : null,
currentWatermark);
}
Before that, initializeState creates the Buckets through the BucketsBuilder and wraps them in a StreamingFileSinkHelper:
@Override
public void initializeState(StateInitializationContext context) throws Exception {
super.initializeState(context);
buckets = bucketsBuilder.createBuckets(getRuntimeContext().getIndexOfThisSubtask());
// Set listener before the initialization of Buckets.
inactivePartitions = new HashSet<>();
buckets.setBucketLifeCycleListener(new BucketLifeCycleListener<RowData, String>() {
@Override
public void bucketCreated(Bucket<RowData, String> bucket) {
}
@Override
public void bucketInactive(Bucket<RowData, String> bucket) {
inactivePartitions.add(bucket.getBucketId());
}
});
helper = new StreamingFileSinkHelper<>(
buckets,
context.isRestored(),
context.getOperatorStateStore(),
getRuntimeContext().getProcessingTimeService(),
bucketCheckInterval);
currentWatermark = Long.MIN_VALUE;
}
Back in processElement: if you follow the calls, you will find that the data is ultimately written to a file by Bucket#write:
void write(IN element, long currentTime) throws IOException {
// start a new part file if none is in progress (or if the rolling policy asks for a roll)
if (inProgressPart == null || rollingPolicy.shouldRollOnEvent(inProgressPart, element)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Subtask {} closing in-progress part file for bucket id={} due to element {}.",
subtaskIndex, bucketId, element);
}
inProgressPart = rollPartFile(currentTime);
}
inProgressPart.write(element, currentTime);
}
The actual write is finally delegated to the write method of a third-party library, such as hadoop, hive, parquet, or orc.
checkpoint
Checkpointing is done in the snapshotState method; the main logic is in the Buckets class:
public void snapshotState(
final long checkpointId,
final ListState<byte[]> bucketStatesContainer,
final ListState<Long> partCounterStateContainer) throws Exception {
Preconditions.checkState(
bucketWriter != null && bucketStateSerializer != null,
"sink has not been initialized");
LOG.info("Subtask {} checkpointing for checkpoint with id={} (max part counter={}).",
subtaskIndex, checkpointId, maxPartCounter);
bucketStatesContainer.clear();
partCounterStateContainer.clear();
snapshotActiveBuckets(checkpointId, bucketStatesContainer);
partCounterStateContainer.add(maxPartCounter);
}
private void snapshotActiveBuckets(
final long checkpointId,
final ListState<byte[]> bucketStatesContainer) throws Exception {
for (Bucket<IN, BucketID> bucket : activeBuckets.values()) {
final BucketState<BucketID> bucketState = bucket.onReceptionOfCheckpoint(checkpointId);
final byte[] serializedBucketState = SimpleVersionedSerialization
.writeVersionAndSerialize(bucketStateSerializer, bucketState);
bucketStatesContainer.add(serializedBucketState);
if (LOG.isDebugEnabled()) {
LOG.debug("Subtask {} checkpointing: {}", subtaskIndex, bucketState);
}
}
}
This takes a snapshot of every active Bucket:
BucketState<BucketID> onReceptionOfCheckpoint(long checkpointId) throws IOException {
prepareBucketForCheckpointing(checkpointId);
InProgressFileWriter.InProgressFileRecoverable inProgressFileRecoverable = null;
long inProgressFileCreationTime = Long.MAX_VALUE;
if (inProgressPart != null) {
inProgressFileRecoverable = inProgressPart.persist();
inProgressFileCreationTime = inProgressPart.getCreationTime();
this.inProgressFileRecoverablesPerCheckpoint.put(checkpointId, inProgressFileRecoverable);
}
return new BucketState<>(bucketId, bucketPath, inProgressFileCreationTime, inProgressFileRecoverable, pendingFileRecoverablesPerCheckpoint); // return the BucketState, which is then serialized
}
private void prepareBucketForCheckpointing(long checkpointId) throws IOException {
if (inProgressPart != null && rollingPolicy.shouldRollOnCheckpoint(inProgressPart)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Subtask {} closing in-progress part file for bucket id={} on checkpoint.", subtaskIndex, bucketId);
}
closePartFile();
}
if (!pendingFileRecoverablesForCurrentCheckpoint.isEmpty()) {
pendingFileRecoverablesPerCheckpoint.put(checkpointId, pendingFileRecoverablesForCurrentCheckpoint);
pendingFileRecoverablesForCurrentCheckpoint = new ArrayList<>(); // reset
}
}
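The core logic is in closePartFile, which runs when the rolling policy asks for a roll on checkpoint: it closes the in-progress file and flushes it from memory to the file system, obtains a PendingFileRecoverable object, and stores it in the pendingFileRecoverablesForCurrentCheckpoint list in preparation for the snapshot.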
private InProgressFileWriter.PendingFileRecoverable closePartFile() throws IOException {
InProgressFileWriter.PendingFileRecoverable pendingFileRecoverable = null;
if (inProgressPart != null) {
pendingFileRecoverable = inProgressPart.closeForCommit();
pendingFileRecoverablesForCurrentCheckpoint.add(pendingFileRecoverable);
inProgressPart = null; // reset to null
}
return pendingFileRecoverable;
}
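A file that is being written is in progress and cannot be read yet; when downstream consumers can read it depends on when the file is committed. The previous step wrote the data to the file but did not formally commit it. In the last step of a checkpoint, the CheckpointCoordinator calls the notifyCheckpointComplete method of every operator (see the earlier post on checkpointing if the steps are unfamiliar).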
public void notifyCheckpointComplete(long checkpointId) throws Exception {
super.notifyCheckpointComplete(checkpointId);
commitUpToCheckpoint(checkpointId);
}
public void commitUpToCheckpoint(final long checkpointId) throws IOException {
final Iterator<Map.Entry<BucketID, Bucket<IN, BucketID>>> activeBucketIt =
activeBuckets.entrySet().iterator();
LOG.info("Subtask {} received completion notification for checkpoint with id={}.", subtaskIndex, checkpointId);
while (activeBucketIt.hasNext()) {
final Bucket<IN, BucketID> bucket = activeBucketIt.next().getValue();
bucket.onSuccessfulCompletionOfCheckpoint(checkpointId);
if (!bucket.isActive()) { // after the cleanup above, the bucket may no longer be active
// We've dealt with all the pending files and the writer for this bucket is not currently open.
// Therefore this bucket is currently inactive and we can remove it from our state.
activeBucketIt.remove();
notifyBucketInactive(bucket);
}
}
}
Files are committed in Bucket#onSuccessfulCompletionOfCheckpoint:
void onSuccessfulCompletionOfCheckpoint(long checkpointId) throws IOException {
checkNotNull(bucketWriter);
Iterator<Map.Entry<Long, List<InProgressFileWriter.PendingFileRecoverable>>> it =
pendingFileRecoverablesPerCheckpoint.headMap(checkpointId, true)
.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<Long, List<InProgressFileWriter.PendingFileRecoverable>> entry = it.next();
for (InProgressFileWriter.PendingFileRecoverable pendingFileRecoverable : entry.getValue()) {
bucketWriter.recoverPendingFile(pendingFileRecoverable).commit();
}
it.remove();
}
cleanupInProgressFileRecoverables(checkpointId);
}
The commit method renames the file so that downstream consumers can read it. For example, the Hadoop implementation of commit:
@Override
public void commit() throws IOException {
final Path src = recoverable.tempFile();
final Path dest = recoverable.targetFile();
final long expectedLength = recoverable.offset();
final FileStatus srcStatus;
try {
srcStatus = fs.getFileStatus(src);
}
catch (IOException e) {
throw new IOException("Cannot clean commit: Staging file does not exist.");
}
if (srcStatus.getLen() != expectedLength) {
// something was done to this file since the committer was created.
// this is not the "clean" case
throw new IOException("Cannot clean commit: File has trailing junk data.");
}
try {
fs.rename(src, dest);
}
catch (IOException e) {
throw new IOException("Committing file by rename failed: " + src + " to " + dest, e);
}
}
Finally, some state of the in-progress files is cleaned up:
private void cleanupInProgressFileRecoverables(long checkpointId) throws IOException {
Iterator<Map.Entry<Long, InProgressFileWriter.InProgressFileRecoverable>> it =
inProgressFileRecoverablesPerCheckpoint.headMap(checkpointId, false)
.entrySet().iterator();
while (it.hasNext()) {
final InProgressFileWriter.InProgressFileRecoverable inProgressFileRecoverable = it.next().getValue();
// this check is redundant, as we only put entries in the inProgressFileRecoverablesPerCheckpoint map
// list when the requiresCleanupOfInProgressFileRecoverableState() returns true, but having it makes
// the code more readable.
final boolean successfullyDeleted = bucketWriter.cleanupInProgressFileRecoverable(inProgressFileRecoverable); // returns false for everything except S3
if (LOG.isDebugEnabled() && successfullyDeleted) {
LOG.debug("Subtask {} successfully deleted incomplete part for bucket id={}.", subtaskIndex, bucketId);
}
it.remove(); // remove the entry
}
}
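The write, snapshot, and commit steps above can be condensed into a small in-memory model. This is only a sketch under simplifying assumptions: MiniBucket is a hypothetical class, a "part file" is just a list of records, the part file is rolled on every checkpoint (as with bulk formats), and "commit" means moving the file into a committed list instead of renaming it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for Bucket; not Flink's implementation.
class MiniBucket {
    private List<String> inProgressPart; // null when no part file is open
    private List<List<String>> pendingForCurrentCheckpoint = new ArrayList<>();
    private final NavigableMap<Long, List<List<String>>> pendingPerCheckpoint = new TreeMap<>();
    private final List<List<String>> committed = new ArrayList<>();

    // Mirrors write(): open a new part file when none is in progress.
    void write(String element) {
        if (inProgressPart == null) {
            inProgressPart = new ArrayList<>();
        }
        inProgressPart.add(element);
    }

    // Mirrors onReceptionOfCheckpoint(): close the in-progress part and
    // remember it as pending under the checkpoint id.
    void onReceptionOfCheckpoint(long checkpointId) {
        if (inProgressPart != null) {
            pendingForCurrentCheckpoint.add(inProgressPart);
            inProgressPart = null;
        }
        if (!pendingForCurrentCheckpoint.isEmpty()) {
            pendingPerCheckpoint.put(checkpointId, pendingForCurrentCheckpoint);
            pendingForCurrentCheckpoint = new ArrayList<>();
        }
    }

    // Mirrors onSuccessfulCompletionOfCheckpoint(): commit every pending file
    // that belongs to this checkpoint or an earlier one (inclusive headMap).
    void onSuccessfulCompletionOfCheckpoint(long checkpointId) {
        Iterator<Map.Entry<Long, List<List<String>>>> it =
                pendingPerCheckpoint.headMap(checkpointId, true).entrySet().iterator();
        while (it.hasNext()) {
            committed.addAll(it.next().getValue());
            it.remove();
        }
    }

    List<List<String>> committedFiles() {
        return committed;
    }

    int pendingCheckpoints() {
        return pendingPerCheckpoint.size();
    }
}
```

The point of the model is the ordering: records written before checkpoint n only become visible after the completion notification for checkpoint n (or a later one) arrives.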
partition commit
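Partition commit consists of a trigger and a commit policy. The trigger is either process-time or partition-time. With process-time, the partitions to be committed by the current checkpoint are registered in the pendingPartitions map together with the current system time. At commit time, a partition is committed once its registration time plus the delay is earlier than the current system time. If delay=0 the partition is committed immediately, so late data may cause the partition to be committed too early. If the delay equals the partition size, the partitions registered at one checkpoint are committed roughly one checkpoint interval plus the delay later.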
@Override
public void addPartition(String partition) {
if (!StringUtils.isNullOrWhitespaceOnly(partition)) {
this.pendingPartitions.putIfAbsent(partition, procTimeService.getCurrentProcessingTime());
}
}
@Override
public List<String> committablePartitions(long checkpointId) {
List<String> needCommit = new ArrayList<>();
long currentProcTime = procTimeService.getCurrentProcessingTime();
Iterator<Map.Entry<String, Long>> iter = pendingPartitions.entrySet().iterator();
while (iter.hasNext()) {
Map.Entry<String, Long> entry = iter.next();
long creationTime = entry.getValue();
if (commitDelay == 0 || currentProcTime > creationTime + commitDelay) {
needCommit.add(entry.getKey());
iter.remove();
}
}
return needCommit;
}
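To see the effect of the delay, the process-time logic can be reproduced with an injectable clock. This is a sketch only: ProcTimeTriggerSketch is a hypothetical class, and the LongSupplier clock replaces Flink's ProcessingTimeService so that time can be controlled by hand.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical re-implementation of the process-time trigger shown above.
class ProcTimeTriggerSketch {
    private final Map<String, Long> pendingPartitions = new LinkedHashMap<>();
    private final LongSupplier clock;   // assumed replacement for ProcessingTimeService
    private final long commitDelayMs;

    ProcTimeTriggerSketch(LongSupplier clock, long commitDelayMs) {
        this.clock = clock;
        this.commitDelayMs = commitDelayMs;
    }

    // Registers the partition with the time it was first seen.
    void addPartition(String partition) {
        if (partition != null && !partition.trim().isEmpty()) {
            pendingPartitions.putIfAbsent(partition, clock.getAsLong());
        }
    }

    // Returns (and forgets) every partition whose delay has elapsed.
    List<String> committablePartitions() {
        List<String> needCommit = new ArrayList<>();
        long now = clock.getAsLong();
        Iterator<Map.Entry<String, Long>> iter = pendingPartitions.entrySet().iterator();
        while (iter.hasNext()) {
            Map.Entry<String, Long> entry = iter.next();
            if (commitDelayMs == 0 || now > entry.getValue() + commitDelayMs) {
                needCommit.add(entry.getKey());
                iter.remove();
            }
        }
        return needCommit;
    }
}
```

Because the comparison is strict, a partition registered at time t with delay d becomes committable only at the first checkpoint whose processing time is greater than t + d.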
The partition-time trigger commits a partition once the watermark has passed the partition time plus the delay.
@Override
public void addPartition(String partition) {
if (!StringUtils.isNullOrWhitespaceOnly(partition)) {
this.pendingPartitions.add(partition);
}
}
@Override
public List<String> committablePartitions(long checkpointId) {
if (!watermarks.containsKey(checkpointId)) {
throw new IllegalArgumentException(String.format(
"Checkpoint(%d) has not been snapshot. The watermark information is: %s.",
checkpointId, watermarks));
}
long watermark = watermarks.get(checkpointId);
watermarks.headMap(checkpointId, true).clear();
List<String> needCommit = new ArrayList<>();
Iterator<String> iter = pendingPartitions.iterator();
while (iter.hasNext()) {
String partition = iter.next();
LocalDateTime partTime = extractor.extract(
partitionKeys, extractPartitionValues(new Path(partition))); // extract the time from the path, e.g. partition='day=2020-12-01/hour=11/minute=11' becomes 2020-12-01 11:11:00
if (watermark > toMills(partTime) + commitDelay) {
needCommit.add(partition);
iter.remove();
}
}
return needCommit;
}
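The partition-time check can be illustrated in isolation. The sketch below is not Flink's extractor: PartitionTimeSketch is hypothetical, it assumes a dt=yyyy-MM-dd/hour=HH partition layout (as in the fs_table example of this article), and it interprets the partition time in UTC.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a partition-time commit check.
class PartitionTimeSketch {
    // Splits "dt=2020-05-20/hour=12" into {dt=2020-05-20, hour=12}.
    static Map<String, String> extractPartitionValues(String partitionPath) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : partitionPath.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }

    // Assumed layout: a "dt" key in ISO date format and an optional "hour" key.
    static LocalDateTime partitionTime(String partitionPath) {
        Map<String, String> values = extractPartitionValues(partitionPath);
        LocalDate day = LocalDate.parse(values.get("dt"));
        int hour = Integer.parseInt(values.getOrDefault("hour", "0"));
        return day.atTime(hour, 0);
    }

    // Same comparison as above: watermark > partition time + delay.
    static boolean isCommittable(String partitionPath, long watermarkMs, long commitDelayMs) {
        long partitionMs = partitionTime(partitionPath).toInstant(ZoneOffset.UTC).toEpochMilli();
        return watermarkMs > partitionMs + commitDelayMs;
    }
}
```

With hourly partitions and a delay of one hour, the partition for hour 12 becomes committable only after the watermark passes 13:00, which is when no more data for that hour is expected.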
Source
Reading data is simpler than writing it.
Create the FileSystemTableSource object:
public TableSource<RowData> createTableSource(TableSourceFactory.Context context) {
Configuration conf = new Configuration();
context.getTable().getOptions().forEach(conf::setString);
return new FileSystemTableSource(
context.getTable().getSchema(),
getPath(conf),
context.getTable().getPartitionKeys(),
conf.get(PARTITION_DEFAULT_NAME),
context.getTable().getProperties());
}
Construct the source function, passing in the input format used to read the source data:
public DataStream<RowData> getDataStream(StreamExecutionEnvironment execEnv) {
@SuppressWarnings("unchecked")
TypeInformation<RowData> typeInfo =
(TypeInformation<RowData>) TypeInfoDataTypeConverter.fromDataTypeToTypeInfo(getProducedDataType());
// Avoid using ContinuousFileMonitoringFunction
InputFormatSourceFunction<RowData> func = new InputFormatSourceFunction<>(getInputFormat(), typeInfo);
DataStreamSource<RowData> source = execEnv.addSource(func, explainSource(), typeInfo);
return source.name(explainSource());
}
The run method reads records in a loop and emits them to the downstream operators:
public void run(SourceContext<OUT> ctx) throws Exception {
try {
Counter completedSplitsCounter = getRuntimeContext().getMetricGroup().counter("numSplitsProcessed");
if (isRunning && format instanceof RichInputFormat) {
((RichInputFormat) format).openInputFormat();
}
OUT nextElement = serializer.createInstance();
while (isRunning) {
format.open(splitIterator.next());
// for each element we also check if cancel
// was called by checking the isRunning flag
while (isRunning && !format.reachedEnd()) {
nextElement = format.nextRecord(nextElement);
if (nextElement != null) {
ctx.collect(nextElement);
} else {
break;
}
}
format.close();
completedSplitsCounter.inc();
if (isRunning) {
isRunning = splitIterator.hasNext();
}
}
} finally {
format.close();
if (format instanceof RichInputFormat) {
((RichInputFormat) format).closeInputFormat();
}
isRunning = false;
}
}
A complete example:
Read from Kafka in streaming mode, write to the file system in streaming mode, and query the result from fs_table:
CREATE TABLE kafka_table (
user_id STRING,
order_amount DOUBLE,
log_ts TIMESTAMP(3),
WATERMARK FOR log_ts AS log_ts - INTERVAL '5' SECOND
) WITH (...);
CREATE TABLE fs_table (
user_id STRING,
order_amount DOUBLE,
dt STRING,
`hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
'connector'='filesystem',
'path'='...',
'format'='parquet',
'sink.partition-commit.delay'='1 h',
'sink.partition-commit.policy.kind'='success-file'
);
-- streaming sql, insert into file system table
INSERT INTO fs_table SELECT user_id, order_amount, DATE_FORMAT(log_ts, 'yyyy-MM-dd'), DATE_FORMAT(log_ts, 'HH') FROM kafka_table;
-- batch sql, select with partition pruning
SELECT * FROM fs_table WHERE dt='2020-05-20' and `hour`='12';
JDBC Connector
The entry point of the JDBC connector is JdbcDynamicTableFactory, which supports both source and sink.
Source
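The factory creates a JdbcDynamicTableSource through createDynamicTableSource and passes in all the required parameters. JDBC as a source serves two purposes: 1. Scan, when it is used as a data source; 2. Lookup, when it is used for dimension-table joins.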
Scan
Data is read through JdbcRowDataInputFormat, which also supports column pruning and limit pushdown.
Note: the scan source only supports batch (bounded) reads.
public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
// build the JdbcRowDataInputFormat and pass in the basic options
final JdbcRowDataInputFormat.Builder builder =
JdbcRowDataInputFormat.builder()
.setDrivername(options.getDriverName())
.setDBUrl(options.getDbURL())
.setUsername(options.getUsername().orElse(null))
.setPassword(options.getPassword().orElse(null))
.setAutoCommit(readOptions.getAutoCommit());
if (readOptions.getFetchSize() != 0) {
builder.setFetchSize(readOptions.getFetchSize());
}
final JdbcDialect dialect = options.getDialect(); // JDBC dialect, inferred from the URL; mysql, postgres and derby are currently supported
String query =
dialect.getSelectFromStatement(
options.getTableName(), physicalSchema.getFieldNames(), new String[0]); // build the SELECT statement
if (readOptions.getPartitionColumnName().isPresent()) { // partitioned parallel reads, to speed up scanning
long lowerBound = readOptions.getPartitionLowerBound().get();
long upperBound = readOptions.getPartitionUpperBound().get();
int numPartitions = readOptions.getNumPartitions().get();
builder.setParametersProvider(
new JdbcNumericBetweenParametersProvider(lowerBound, upperBound)
.ofBatchNum(numPartitions));
query +=
" WHERE "
+ dialect.quoteIdentifier(readOptions.getPartitionColumnName().get())
+ " BETWEEN ? AND ?"; // append the range predicate to the SQL
}
if (limit >= 0) { // a limit was specified
query = String.format("%s %s", query, dialect.getLimitClause(limit));
}
builder.setQuery(query);
final RowType rowType = (RowType) physicalSchema.toRowDataType().getLogicalType();
builder.setRowConverter(dialect.getRowConverter(rowType)); // converter from JDBC values to Flink's internal data types
builder.setRowDataTypeInfo(
runtimeProviderContext.createTypeInformation(physicalSchema.toRowDataType()));
return InputFormatProvider.of(builder.build());
}
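The partitioned read works by binding a different pair of values to the two BETWEEN placeholders for each split. The sketch below shows one way to divide [lowerBound, upperBound] into numPartitions ranges. It is an illustration under my own assumptions, not the code of JdbcNumericBetweenParametersProvider: RangeSplitSketch is a hypothetical class and its rounding may differ from Flink's.

```java
// Hypothetical sketch: split an inclusive numeric range into BETWEEN parameter pairs.
class RangeSplitSketch {
    // Returns up to numPartitions pairs {start, end}; sizes differ by at most one.
    static long[][] split(long lowerBound, long upperBound, int numPartitions) {
        if (lowerBound > upperBound || numPartitions <= 0) {
            throw new IllegalArgumentException("invalid range or partition count");
        }
        long count = upperBound - lowerBound + 1;
        // Never create more splits than there are values.
        int n = (int) Math.min(numPartitions, count);
        long base = count / n;
        long remainder = count % n;
        long[][] ranges = new long[n][2];
        long start = lowerBound;
        for (int i = 0; i < n; i++) {
            // The first `remainder` splits take one extra value.
            long size = base + (i < remainder ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = start + size - 1;
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(0, 9, 3)) {
            System.out.println("... WHERE id BETWEEN " + r[0] + " AND " + r[1]);
        }
    }
}
```

Each pair would be bound to one execution of the query above, so the splits can be read by parallel subtasks without overlapping rows.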
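The core methods of JdbcRowDataInputFormat are:
openInputFormat: initializes the JDBC connection and creates the PreparedStatement
open: executes the query and fetches the data in batches
reachedEnd: whether the end of the data has been reached
nextRecord: returns the next record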
Lookup
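For dimension-table joins, the main implementation is the JdbcRowDataLookupFunction class. The logic is similar, except that the query is restricted to the given keys and the query results can be cached to speed up lookups. The cache is a Guava cache with configurable size and expiration time.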
public void eval(Object... keys) {
RowData keyRow = GenericRowData.of(keys);
if (cache != null) {
List<RowData> cachedRows = cache.getIfPresent(keyRow);
if (cachedRows != null) {
for (RowData cachedRow : cachedRows) {
collect(cachedRow);
}
return;
}
}
for (int retry = 0; retry <= maxRetryTimes; retry++) {
try {
statement.clearParameters();
statement = lookupKeyRowConverter.toExternal(keyRow, statement);
try (ResultSet resultSet = statement.executeQuery()) {
if (cache == null) {
while (resultSet.next()) {
collect(jdbcRowConverter.toInternal(resultSet));
}
} else {
ArrayList<RowData> rows = new ArrayList<>();
while (resultSet.next()) {
RowData row = jdbcRowConverter.toInternal(resultSet);
rows.add(row);
collect(row);
}
rows.trimToSize();
cache.put(keyRow, rows);
}
}
break;
} catch (SQLException e) {
LOG.error(String.format("JDBC executeBatch error, retry times = %d", retry), e);
if (retry >= maxRetryTimes) {
throw new RuntimeException("Execution of JDBC statement failed.", e);
}
try {
if (!connectionProvider.isConnectionValid()) {
statement.close();
connectionProvider.closeConnection();
establishConnectionAndStatement();
}
} catch (SQLException | ClassNotFoundException excpetion) {
LOG.error(
"JDBC connection is not valid, and reestablish connection failed",
excpetion);
throw new RuntimeException("Reestablish JDBC connection failed", excpetion);
}
try {
Thread.sleep(1000 * retry);
} catch (InterruptedException e1) {
throw new RuntimeException(e1);
}
}
}
}
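The cache-then-query pattern in eval can be sketched without Guava or JDBC. In the sketch below everything is an assumption made for illustration: CachedLookupSketch is a hypothetical class, a LinkedHashMap in access order stands in for the Guava cache, a Function stands in for the prepared statement, and the clock is injectable so that expiry can be observed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch of a size-bounded, time-expiring lookup cache.
class CachedLookupSketch<K, V> {
    private static final class CachedRows<V> {
        final List<V> rows;
        final long loadedAtMs;

        CachedRows(List<V> rows, long loadedAtMs) {
            this.rows = rows;
            this.loadedAtMs = loadedAtMs;
        }
    }

    private final Map<K, CachedRows<V>> cache;
    private final Function<K, List<V>> backend; // stands in for the JDBC query
    private final long ttlMs;
    private final LongSupplier clock;
    private int backendCalls;

    CachedLookupSketch(Function<K, List<V>> backend, final int maxEntries, long ttlMs, LongSupplier clock) {
        this.backend = backend;
        this.ttlMs = ttlMs;
        this.clock = clock;
        // Access-ordered map that evicts the least recently used key.
        this.cache = new LinkedHashMap<K, CachedRows<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, CachedRows<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Serves from the cache when the entry is fresh, otherwise queries the backend.
    List<V> lookup(K key) {
        long now = clock.getAsLong();
        CachedRows<V> hit = cache.get(key);
        if (hit != null && now - hit.loadedAtMs < ttlMs) {
            return hit.rows;
        }
        List<V> rows = backend.apply(key);
        backendCalls++;
        cache.put(key, new CachedRows<>(rows, now));
        return rows;
    }

    int backendCalls() {
        return backendCalls;
    }
}
```

The trade-off is the usual one for dimension tables: a longer expiration time reduces load on the database but lets the join see stale rows for longer.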
Sink
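As a sink, JDBC supports two modes, append-only and upsert; which one is used depends on whether the DDL defines a primary key. The factory creates a JdbcDynamicTableSink, which in turn uses JdbcDynamicOutputFormatBuilder to create a JdbcBatchingOutputFormat that writes the data. As the class name suggests, writes are batched, and for good reason: Flink is a real-time stream processing engine, and writing every single record to the database individually would both hurt performance and put heavy pressure on the database.
The main write logic is in the following methods: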
@Override
public final synchronized void writeRecord(In record) throws IOException {
checkFlushException();
try {
addToBatch(record, jdbcRecordExtractor.apply(record));
batchCount++;
if (executionOptions.getBatchSize() > 0
&& batchCount >= executionOptions.getBatchSize()) {
flush();
}
} catch (Exception e) {
throw new IOException("Writing records to JDBC failed.", e);
}
}
protected void addToBatch(In original, JdbcIn extracted) throws SQLException {
jdbcStatementExecutor.addToBatch(extracted);
}
What happens next depends on the jdbcStatementExecutor implementation. For insert, the record is simply added to the statement's batch:
public void addToBatch(RowData record) throws SQLException {
converter.toExternal(record, st);
st.addBatch();
}
For upsert, the executor first checks whether the row exists; it inserts the row if it does not exist and updates it if it does:
public void addToBatch(RowData record) throws SQLException {
processOneRowInBatch(keyExtractor.apply(record), record);
}
private void processOneRowInBatch(RowData pk, RowData row) throws SQLException {
if (exist(pk)) {
updateSetter.toExternal(row, updateStatement);
updateStatement.addBatch();
} else {
insertSetter.toExternal(row, insertStatement);
insertStatement.addBatch();
}
}
A flush is triggered once the batch size is reached or the configured interval has elapsed; any buffered records are also flushed at checkpoint time.
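The size-based part of this behavior fits in a few lines. The following is a minimal sketch, not JdbcBatchingOutputFormat itself: BatchingWriterSketch is a hypothetical class, the Consumer stands in for executing the JDBC batch, and the interval-based flush (a scheduled task in the real connector) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of a size-triggered batching writer.
class BatchingWriterSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher; // stands in for executing the JDBC batch
    private final List<T> buffer = new ArrayList<>();

    BatchingWriterSketch(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    // Buffers the record and flushes once the batch is full.
    synchronized void writeRecord(T record) {
        buffer.add(record);
        if (batchSize > 0 && buffer.size() >= batchSize) {
            flush();
        }
    }

    // Also called explicitly, for example when a checkpoint is taken.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flusher.accept(new ArrayList<>(buffer)); // hand over a copy, then reset
        buffer.clear();
    }
}
```

Calling flush() at checkpoint time is what keeps buffered records from being lost on failure: everything received before the checkpoint has reached the database by the time the checkpoint completes.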
PS: The JDBC dialect can be extended to support other JDBC-based databases, such as clickhouse.
A complete example:
-- register a MySQL table 'users' in Flink SQL
CREATE TABLE MyUserTable (
id BIGINT,
name STRING,
age INT,
status BOOLEAN,
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://localhost:3306/mydatabase',
'table-name' = 'users'
);
-- write data into the JDBC table from the other table "T"
INSERT INTO MyUserTable
SELECT id, name, age, status FROM T;
-- scan data from the JDBC table
SELECT id, name, age, status FROM MyUserTable;
-- temporal join the JDBC table as a dimension table
SELECT * FROM myTopic
LEFT JOIN MyUserTable FOR SYSTEM_TIME AS OF myTopic.proctime
ON myTopic.key = MyUserTable.id;
Kafka Connector
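This article is based on Flink 1.12; as of this version it is no longer necessary to specify a particular Kafka version.
From the SQL perspective, it analyzes how Flink reads from and writes to Kafka once a Kafka table has been created.
Entry point
As before, the Kafka factory (KafkaDynamicTableFactory) is discovered through the SPI mechanism. Flink uses SPI extensively; a separate post on SPI in Flink may follow when time permits. On to the main topic.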
Source
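The createDynamicTableSource method creates the Kafka source. It mainly does the following:
Obtains the table DDL information from the context, such as the schema and the WITH options, and creates the TableFactoryHelper utility.
Discovers the key/value decoding formats from the key/value format options in the WITH clause.
Validates the options.
Constructs the KafkaDynamicSource object.
KafkaDynamicSource creates the corresponding deserialization schemas from the key/value formats, separates the metadata columns from the regular columns of the schema, and creates a FlinkKafkaConsumer wrapped in a SourceFunctionProvider.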
@Override
public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
final DeserializationSchema<RowData> keyDeserialization =
createDeserialization(context, keyDecodingFormat, keyProjection, keyPrefix);
final DeserializationSchema<RowData> valueDeserialization =
createDeserialization(context, valueDecodingFormat, valueProjection, null);
final TypeInformation<RowData> producedTypeInfo =
context.createTypeInformation(producedDataType);
final FlinkKafkaConsumer<RowData> kafkaConsumer =
createKafkaConsumer(keyDeserialization, valueDeserialization, producedTypeInfo);
return SourceFunctionProvider.of(kafkaConsumer, false);
}
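FlinkKafkaConsumer is what reads from Kafka. The core logic is in its parent class FlinkKafkaConsumerBase; the key methods are:
open: initializes the Kafka consumer objects, including the offset commit mode, dynamic partition discovery, the startup mode, and the deserializer
run: pulls data from Kafka through the KafkaFetcher
runWithPartitionDiscovery: runs dynamic partition discovery in a separate thread
snapshotState: snapshots the partition and offset information at checkpoint time, for failover
initializeState: restores that state when recovering from a checkpoint
notifyCheckpointComplete: commits the offsets to Kafka when a checkpoint completes
On dynamic partition discovery: open fetches all partitions of the topic once. Discovery then runs periodically; each run fetches all partitions again, uses the partition ids to determine which ones are new since the previous run, and decides which subtask subscribes to each new partition with the following assignment algorithm: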
public static int assign(KafkaTopicPartition partition, int numParallelSubtasks) {
int startIndex =
((partition.getTopic().hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;
// here, the assumption is that the id of Kafka partitions are always ascending
// starting from 0, and therefore can be used directly as the offset clockwise from the
// start index
return (startIndex + partition.getPartition()) % numParallelSubtasks;
}
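The effect of this formula is easiest to see by running it. The sketch below copies the arithmetic into a standalone method that takes the topic name and partition id directly instead of a KafkaTopicPartition; PartitionAssignerSketch is a hypothetical wrapper, and the topic name is taken from the example at the end of this article.

```java
// Standalone copy of the assignment arithmetic shown above, for experimentation.
class PartitionAssignerSketch {
    static int assign(String topic, int partition, int numParallelSubtasks) {
        // The topic hash picks a start index; the mask keeps it non-negative.
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;
        // Partitions are then spread round-robin from that start index.
        return (startIndex + partition) % numParallelSubtasks;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        for (int partition = 0; partition < 8; partition++) {
            System.out.println("partition " + partition + " -> subtask "
                    + assign("user_behavior", partition, parallelism));
        }
    }
}
```

Consecutive partitions of one topic land on consecutive subtasks, so a topic with at least as many partitions as the parallelism keeps every subtask busy, and a newly discovered partition has a deterministic owner without any coordination between subtasks.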
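KafkaFetcher consumes Kafka data through a consumer thread, KafkaConsumerThread, which internally uses Kafka's KafkaConsumer. The KafkaFetcher repeatedly calls pollNext on the Handover, while KafkaConsumerThread produces the records it has consumed into the Handover; the Handover plays the role of the blocking queue in a producer-consumer model.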
public void runFetchLoop() throws Exception {
try {
// kick off the actual Kafka consumer
consumerThread.start();
while (running) {
// this blocks until we get the next records
// it automatically re-throws exceptions encountered in the consumer thread
final ConsumerRecords<byte[], byte[]> records = handover.pollNext();
// get the records for each topic partition
for (KafkaTopicPartitionState<T, TopicPartition> partition :
subscribedPartitionStates()) {
List<ConsumerRecord<byte[], byte[]>> partitionRecords =
records.records(partition.getKafkaPartitionHandle());
partitionConsumerRecordsHandler(partitionRecords, partition);
}
}
} finally {
// this signals the consumer thread that no more work is to be done
consumerThread.shutdown();
}
// on a clean exit, wait for the runner thread
try {
consumerThread.join();
} catch (InterruptedException e) {
// may be the result of a wake-up interruption after an exception.
// we ignore this here and only restore the interruption state
Thread.currentThread().interrupt();
}
}
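The handover pattern can be approximated with a bounded queue from the JDK. This is a rough sketch, not Flink's Handover class: HandoverSketch is hypothetical, an ArrayBlockingQueue of capacity 1 stands in for the single-slot handover, plain string lists stand in for ConsumerRecords, and a sentinel list signals the end of input (the real fetch loop runs until it is cancelled).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the consumer-thread / fetch-loop handover.
class HandoverSketch {
    // Sentinel batch, compared by reference, that tells the fetch loop to stop.
    private static final List<String> END = new ArrayList<>();

    static List<String> run(List<List<String>> batches) {
        final BlockingQueue<List<String>> handover = new ArrayBlockingQueue<>(1);

        // Plays the role of KafkaConsumerThread: produces batches into the handover.
        Thread consumerThread = new Thread(() -> {
            try {
                for (List<String> batch : batches) {
                    handover.put(batch); // blocks while the slot is still occupied
                }
                handover.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "consumer-thread-sketch");
        consumerThread.start();

        // Plays the role of runFetchLoop: blocks until the next batch is available.
        List<String> emitted = new ArrayList<>();
        try {
            for (List<String> batch = handover.take(); batch != END; batch = handover.take()) {
                emitted.addAll(batch);
            }
            consumerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        }
        return emitted;
    }
}
```

Keeping the slot at a single batch gives natural backpressure: the polling thread cannot run ahead of the thread that emits records downstream.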
Sink
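The sink is similar. The createDynamicTableSink method creates the KafkaDynamicSink and is mainly responsible for:
The same steps as for the source, with one special case: if the format is avro-confluent or debezium-avro-confluent and schema-registry.subject is not set, it is filled in automatically.
Discovering the key/value encoding formats from the WITH options.
Validating the options.
Constructing the KafkaDynamicSink object.
KafkaDynamicSink#getSinkRuntimeProvider constructs the FlinkKafkaProducer and wraps it in a SinkFunctionProvider.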
public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
final SerializationSchema<RowData> keySerialization =
createSerialization(context, keyEncodingFormat, keyProjection, keyPrefix);
final SerializationSchema<RowData> valueSerialization =
createSerialization(context, valueEncodingFormat, valueProjection, null);
final FlinkKafkaProducer<RowData> kafkaProducer =
createKafkaProducer(keySerialization, valueSerialization);
return SinkFunctionProvider.of(kafkaProducer, parallelism);
}
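FlinkKafkaProducer writes data to Kafka. To support exactly-once semantics it extends TwoPhaseCommitSinkFunction, the two-phase-commit sink base class, and relies on Kafka transactions to guarantee that data is written exactly once (when the exactly-once semantic is configured).
The core methods of FlinkKafkaProducer are:
open: initializes the Kafka-related properties
invoke: processes a record: serializes the key and value into a ProducerRecord and sends it through Kafka's KafkaProducer API according to the partitioning strategy
beginTransaction: starts a transaction
preCommit: pre-commits the transaction
commit: performs the actual commit
snapshotState: snapshots the state at checkpoint time, mainly the transaction-related state
notifyCheckpointComplete: a parent-class method called back when a checkpoint completes; it commits the transaction
initializeState: initializes the state, restoring it when the job recovers from a checkpoint
The overall flow of sending data and committing transactions is:
initializeState (on startup or recovery from a checkpoint, starts the first transaction with beginTransaction) → invoke (processes records and sends them to Kafka) → snapshotState (pre-commits the current transaction with preCommit, records it in state, and starts the next transaction) → notifyCheckpointComplete (commits the pending transactions with commit)
If an error occurs along the way, the close method is eventually called to abort the transaction.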
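The lifecycle above can be modeled with a few collections. This is a toy sketch and not TwoPhaseCommitSinkFunction: TwoPhaseCommitSketch is a hypothetical class, a transaction is a list of records, pre-commit is a no-op (a real producer would flush here), and committing means copying the records into a list that represents what consumers reading only committed data can see.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of the two-phase-commit flow.
class TwoPhaseCommitSketch {
    private List<String> currentTransaction;
    private final NavigableMap<Long, List<String>> pendingCommit = new TreeMap<>();
    private final List<String> visible = new ArrayList<>();

    // On startup or recovery: open the first transaction.
    void initializeState() {
        currentTransaction = beginTransaction();
    }

    private List<String> beginTransaction() {
        return new ArrayList<>();
    }

    // Records are written into the open transaction and are not visible yet.
    void invoke(String record) {
        currentTransaction.add(record);
    }

    // Phase 1: pre-commit the current transaction, remember it under the
    // checkpoint id, and open the next transaction.
    void snapshotState(long checkpointId) {
        pendingCommit.put(checkpointId, currentTransaction);
        currentTransaction = beginTransaction();
    }

    // Phase 2: commit every transaction up to and including this checkpoint.
    void notifyCheckpointComplete(long checkpointId) {
        Iterator<Map.Entry<Long, List<String>>> it =
                pendingCommit.headMap(checkpointId, true).entrySet().iterator();
        while (it.hasNext()) {
            visible.addAll(it.next().getValue());
            it.remove();
        }
    }

    List<String> visibleRecords() {
        return visible;
    }
}
```

The model makes the latency cost visible: a record becomes readable only after the checkpoint that covers it has completed, so end-to-end latency is bounded below by the checkpoint interval.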
A complete example:
CREATE TABLE KafkaTable (
`user_id` BIGINT,
`item_id` BIGINT,
`behavior` STRING,
`ts` TIMESTAMP(3) METADATA FROM 'timestamp'
) WITH (
'connector' = 'kafka',
'topic' = 'user_behavior',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'scan.startup.mode' = 'earliest-offset',
'format' = 'csv'
)