So hello everyone!
Here I am, taking you on this journey of my attempt at creating a Flink BigQuery Source Connector.
For the sake of brevity, I will skip a lot of "What's Flink", "BigQuery: An introduction" etc. etc. and jump directly to the action.
FULL DISCLOSURE: this is an attempt to make things work. They work, badly, inefficiently, and in a very ugly way, but they work.
But let's begin with one important image, taken from the official Flink documentation.
From what we can understand, writing a Source Connector requires implementing at least the Data Source structure.
Let's start by looking at my project tree, just to get an overview:
lib/src
├── main
│ ├── java
│ │ └── com
│ │ └── antoniocali
│ │ ├── Library.java
│ │ ├── bq
│ │ │ ├── BigQueryClient.java
│ │ │ ├── BigQueryReadOptions.java
│ │ │ ├── BigQueryUtils.java
│ │ │ └── converters
│ │ │ ├── AvroToRowDataConverters.java
│ │ │ └── FieldValueListToRowDataConverters.java
│ │ └── sources
│ │ ├── BigQueryDataStreamSource.java
│ │ ├── emitter
│ │ │ └── BigQueryRecordEmitter.java
│ │ ├── reader
│ │ │ ├── BigQuerySourceReader.java
│ │ │ ├── BigQuerySplitEnumerator.java
│ │ │ └── BigQuerySplitReader.java
│ │ └── split
│ │ ├── BigQuerySourceSplit.java
│ │ ├── BigQuerySourceSplitSerializer.java
│ │ ├── BigQuerySourceSplitState.java
│ │ ├── BigQuerySplitEnumeratorState.java
│ │ └── BigQuerySplitEnumeratorStateSerializer.java
Wow, a lot of files. Yup.
I will probably go through each one, LOL, SO BE READY!
But let's take a step back and describe the image above and each of those components.
Split
So what's a Split? (the little yellow box in the graph above)
A Split is the base unit of "data" work in Flink. I would describe it more as metadata than as data itself, at least in my case, because it contains the information needed to retrieve the actual data from BigQuery.
Let's take a look at the code
// antoniocali/sources/split/BigQuerySourceSplit.java
import org.apache.flink.api.connector.source.SourceSplit;
import java.io.Serializable;
public class BigQuerySourceSplit implements SourceSplit, Serializable {
private final String splitName;
private final Long minTimestamp;
private final Long maxTimestamp;
public BigQuerySourceSplit(String splitName, Long minTimestamp, Long maxTimestamp) {
this.splitName = splitName;
this.maxTimestamp = maxTimestamp;
this.minTimestamp = minTimestamp;
}
@Override
public String splitId() {
return splitName;
}
public String getSplitName() {
return splitName;
}
public Long getMinTimestamp() {
return minTimestamp;
}
public Long getMaxTimestamp() {
return maxTimestamp;
}
@Override
public String toString() {
return "BigQuerySourceSplit{" + "splitName='" + splitName + '\'' + ", minTimestamp=" + minTimestamp + ", maxTimestamp=" + maxTimestamp + '}';
}
}
In this case a Split (which I called BigQuerySourceSplit) contains the information needed to later retrieve the data from BigQuery.
It's worth mentioning that this BigQuery Flink Source is designed to read a single BigQuery table, which is why the only information needed, in terms of useful metadata, is minTimestamp and maxTimestamp.
I do not need to carry the project.dataset.table information, because that is "fixed" metadata of the Source itself.
To clarify, when we later cover how to collect data from BigQuery using a Split, we can expect a query like
SELECT *
FROM project.dataset.table
WHERE ??? >= minTimestamp AND ??? < maxTimestamp
The ??? is filled in at runtime, based on a configuration option of my Source Connector that specifies a timestamp column of the table, used for "chunking" it (e.g. created_at).
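For example, assuming the configured column is created_at (a made-up column, just to make it concrete), the query resolved at runtime would look like
SELECT *
FROM project.dataset.table
WHERE created_at >= minTimestamp AND created_at < maxTimestamp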
Since we are talking about Splits, let's also look at the classes around them.
The very first, simple one is the SerDe class for the Split. Since Splits travel across the network (from the Job Manager to the Task Managers), we need a simple way to serialize and deserialize them.
// antoniocali/sources/split/BigQuerySourceSplitSerializer.java
import org.apache.flink.core.io.SimpleVersionedSerializer;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
public class BigQuerySourceSplitSerializer implements SimpleVersionedSerializer<BigQuerySourceSplit> {
public static final BigQuerySourceSplitSerializer INSTANCE = new BigQuerySourceSplitSerializer();
final int VERSION = 0;
@Override
public int getVersion() {
return VERSION;
}
@Override
public byte[] serialize(BigQuerySourceSplit bigQuerySourceSplit) throws IOException {
try (ByteArrayOutputStream baos = new ByteArrayOutputStream(); DataOutputStream out = new DataOutputStream(
baos)) {
out.writeUTF(bigQuerySourceSplit.getSplitName());
out.writeLong(bigQuerySourceSplit.getMinTimestamp());
out.writeLong(bigQuerySourceSplit.getMaxTimestamp());
out.flush();
return baos.toByteArray();
}
}
@Override
public BigQuerySourceSplit deserialize(int version, byte[] serialized) throws IOException {
if (getVersion() != version) {
throw new IllegalArgumentException(
String.format("The provided serializer version (%d) is not expected (expected : %s).", version,
VERSION));
}
try (ByteArrayInputStream bais = new ByteArrayInputStream(serialized); DataInputStream in = new DataInputStream(
bais)) {
switch (version) {
case VERSION:
String splitName = in.readUTF();
long minTimestamp = in.readLong();
long maxTimestamp = in.readLong();
return new BigQuerySourceSplit(splitName, minTimestamp, maxTimestamp);
default:
throw new IOException("Unknown version: " + version);
}
}
}
}
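Just to see the SerDe in action, here is a quick round-trip sketch (a made-up split, serialized and rebuilt the same way Flink would before shipping it between the Job Manager and a Task Manager):
public class SplitSerdeRoundTrip {
    public static void main(String[] args) throws java.io.IOException {
        BigQuerySourceSplit split =
                new BigQuerySourceSplit("InitialLoad", 0L, 1_700_000_000_000L);
        // Serialize the split the same way Flink does before sending it over the wire...
        byte[] bytes = BigQuerySourceSplitSerializer.INSTANCE.serialize(split);
        // ...and rebuild it on the receiving side, passing the serializer version along.
        BigQuerySourceSplit restored = BigQuerySourceSplitSerializer.INSTANCE.deserialize(
                BigQuerySourceSplitSerializer.INSTANCE.getVersion(), bytes);
        System.out.println(restored); // BigQuerySourceSplit{splitName='InitialLoad', ...}
    }
}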
A more interesting class is the State of a Split.
A State contains the "latest" snapshot of a Split; in my scenario it holds the latest "processed" timestamp.
A State is important when creating a Checkpoint, which is what allows a Split to be restored in case of failure.
// antoniocali/sources/split/BigQuerySourceSplitState.java
import lombok.Getter;
import lombok.NonNull;
import lombok.Setter;
import java.io.Serializable;
@Getter
@Setter
public class BigQuerySourceSplitState implements Serializable {
private String splitId;
private Long minCurrentTimestamp;
private Long maxCurrentTimestamp;
public BigQuerySourceSplitState(@NonNull String splitId, @NonNull Long minCurrentTimestamp,
@NonNull Long currentTimestamp) {
this.splitId = splitId;
this.minCurrentTimestamp = minCurrentTimestamp;
this.maxCurrentTimestamp = currentTimestamp;
}
public BigQuerySourceSplit toBigQuerySourceSplit() {
return new BigQuerySourceSplit(splitId, minCurrentTimestamp, maxCurrentTimestamp);
}
}
Good, now we have our split and we understand how it works. What's next?
Back to the graph!
SplitEnumerator
Ok, we have a Split, but now what? We need a way to create Splits, and the SplitEnumerator class comes to our help.
⚠️ Also pay attention to where these classes live: in Flink, the Master/Job Manager is in charge of creating Splits and assigning them to each task, while actually reading a Split is a Task Manager job.
The SplitEnumerator is one of the most important classes to understand, so be ready.
public class BigQuerySplitEnumerator implements SplitEnumerator<BigQuerySourceSplit, BigQuerySplitEnumeratorState> {
private final long DISCOVER_INTERVAL = 60_000L;
private final long INITIAL_DELAY = 0L;
private static final Logger LOG = LoggerFactory.getLogger(BigQuerySplitEnumerator.class);
protected Long maxCurrentTimetstamp;
private final SplitEnumeratorContext<BigQuerySourceSplit> enumContext;
private BigQuerySourceSplit currentSplit;
private boolean isAssigned;
private final BigQueryReadOptions readOptions;
private final TreeSet<Integer> readersAwaitingSplit;
public BigQuerySplitEnumerator(SplitEnumeratorContext<BigQuerySourceSplit> enumContext,
BigQueryReadOptions readOptions, BigQuerySplitEnumeratorState enumeratorState) {
this.enumContext = enumContext;
this.maxCurrentTimetstamp = enumeratorState == null ? 0L : enumeratorState.getCurrentMaxTimestamp();
this.readOptions = readOptions;
this.currentSplit = enumeratorState == null ? new BigQuerySourceSplit("0L", 0L,
Long.MAX_VALUE) : enumeratorState.getCurrentSplit();
this.readersAwaitingSplit = new TreeSet<>();
}
@Override
public void notifyCheckpointAborted(long checkpointId) throws Exception {
SplitEnumerator.super.notifyCheckpointAborted(checkpointId);
}
@Override
public void start() {
LOG.info("Starting BigQuery split enumerator");
this.scheduleNextSplit();
}
@Override
public void handleSplitRequest(int subtaskId, @Nullable String requesterHostname) {
if (!enumContext.registeredReaders().containsKey(subtaskId)) {
// reader failed between sending the request and now. skip this request.
return;
}
readersAwaitingSplit.add(subtaskId);
}
@Override
public void addSplitsBack(List<BigQuerySourceSplit> list, int subTaskId) {
if (!list.isEmpty()) {
this.currentSplit = list.remove(0);
}
this.isAssigned = false;
}
@Override
public void addReader(int i) {
}
@Override
public BigQuerySplitEnumeratorState snapshotState(long checkpointId) throws Exception {
return new BigQuerySplitEnumeratorState(this.currentSplit, this.isAssigned, this.maxCurrentTimetstamp);
}
@Override
public void close() throws IOException {
}
@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
SplitEnumerator.super.notifyCheckpointComplete(checkpointId);
}
@Override
public void handleSourceEvent(int subtaskId, SourceEvent sourceEvent) {
SplitEnumerator.super.handleSourceEvent(subtaskId, sourceEvent);
}
Optional<BigQuerySourceSplit> discoverNewSplit() {
try {
BigQuery bigquery = BigQueryClient.builder().setReadOptions(this.readOptions).build().getBigQuery();
var tableName = readOptions.getFullTableName(true);
var query = "SELECT MAX(" + readOptions.getColumnFetcher() + ") as max_timestamp FROM `" + tableName + "`";
LOG.info("Discovering new split - Query: {}", query);
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build();
TableResult result = bigquery.query(queryConfig);
var maxColumnFetcher = result.iterateAll().iterator().next().get("max_timestamp");
var maxTimestamp = convertFieldValueToEpochTime(maxColumnFetcher);
if (this.maxCurrentTimetstamp == 0L) {
LOG.info("Initial Load");
return Optional.of(new BigQuerySourceSplit("InitialLoad", 0L, maxTimestamp));
}
if (maxTimestamp > maxCurrentTimetstamp) {
LOG.info("Found a new split with new timestamp: {}", maxTimestamp);
return Optional.of(
new BigQuerySourceSplit(Long.toString(maxTimestamp), maxCurrentTimetstamp, maxTimestamp));
} else {
LOG.info("No new split found");
return Optional.empty();
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
void handleDiscoverNewSplitResult(Optional<BigQuerySourceSplit> newSplit, Throwable t) {
if (t != null) {
throw new RuntimeException(t);
}
if (newSplit.isEmpty()) {
// No new split
return;
}
this.currentSplit = newSplit.get();
this.maxCurrentTimetstamp = this.currentSplit.getMaxTimestamp();
this.assignSplit();
}
void assignSplit() {
final Iterator<Integer> awaitingReader = readersAwaitingSplit.iterator();
LOG.info("Assigning split to readers");
while (awaitingReader.hasNext()) {
int nextAwaiting = awaitingReader.next();
// if the reader that requested another split has failed in the meantime, remove
// it from the list of waiting readers
if (!enumContext.registeredReaders().containsKey(nextAwaiting)) {
awaitingReader.remove();
continue;
}
Optional<BigQuerySourceSplit> split = Optional.of(currentSplit);
final BigQuerySourceSplit bqSplit = split.get();
enumContext.assignSplit(bqSplit, nextAwaiting);
this.isAssigned = true;
awaitingReader.remove();
}
}
private void scheduleNextSplit() {
LOG.info("Scheduling Discovery next split");
this.enumContext.callAsync(this::discoverNewSplit, this::handleDiscoverNewSplitResult, INITIAL_DELAY,
DISCOVER_INTERVAL);
}
public long convertFieldValueToEpochTime(FieldValue fieldValue) {
if (fieldValue.isNull()) {
throw new IllegalArgumentException("FieldValue is null");
}
try {
return fieldValue.getTimestampValue();
} catch (NumberFormatException e) {
LocalDate date = LocalDate.parse(fieldValue.getStringValue());
return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
}
}
}
My BigQuerySplitEnumerator implements the SplitEnumerator interface, which takes two generic parameters: <SplitT extends SourceSplit, CheckpointT>.
In our case SplitT is the Split we created before - BigQuerySourceSplit - and CheckpointT - in my case BigQuerySplitEnumeratorState - is the State of the SplitEnumerator.
On the constructor side I've decided to pass a few things:
- A SplitEnumeratorContext that:
  - hosts the information necessary for the SplitEnumerator to make split assignment decisions;
  - accepts and tracks the split assignments from the enumerator;
  - provides a managed threading model, so the split enumerator does not need to create its own internal threads.
- A BigQueryReadOptions class that contains the information needed to connect to BigQuery.
- A BigQuerySplitEnumeratorState to recover the State from a Checkpoint.
Worth mentioning is the TreeSet<Integer> readersAwaitingSplit, which contains the Readers (i.e. task processes) that have requested a Split and are waiting for one to be assigned.
BIG BREAK
At this point I need to stop and mention how bad my implementation is.
The way I've coded the project, there is only ONE Split available and alive at a time.
This whole project was a way for me to understand the behind-the-scenes of Flink.
The result of this decision is that the SplitEnumerator will only generate one Split at a time, and that's why, for example, I have a boolean variable named isAssigned that tracks this single Split.
Let's move to the methods that are not self-explanatory.
A SplitEnumerator needs a way to discover new Splits, in other words, to create them.
To do so I've used the following three methods, all orchestrated by the scheduleNextSplit method:
- discoverNewSplit
- handleDiscoverNewSplitResult
- assignSplit
The overridden start method calls scheduleNextSplit, which schedules discoverNewSplit to run every minute.
Its return value is then passed to handleDiscoverNewSplitResult which, in case of a new Split, assigns it via the assignSplit method.
I know it's confusing, but the idea is to have a scheduled method that discovers and assigns Splits.
The discovery phase is based on a simple query
SELECT MAX(" + readOptions.getColumnFetcher() + ") as max_timestamp FROM `" + tableName + "`
Every minute, it retrieves the latest timestamp available in the table. If this timestamp is greater than the most recent one stored internally, a new split is created.
When a new Split is created, the SplitEnumerator checks whether any Readers are waiting for a Split, assigns the Split to the waiting Readers, and updates the internal State.
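To make the discovery loop concrete, here is a hypothetical timeline (the timestamps are invented and I'm assuming created_at as the configured column fetcher):
// Hypothetical discovery timeline, assuming columnFetcher = created_at
//
// t = 0s    maxCurrentTimestamp = 0
//           MAX(created_at) = 1_700_000_000_000
//           -> new BigQuerySourceSplit("InitialLoad", 0, 1_700_000_000_000)
//
// t = 60s   MAX(created_at) = 1_700_000_060_000   (new rows arrived)
//           -> new BigQuerySourceSplit("1700000060000", 1_700_000_000_000, 1_700_000_060_000)
//
// t = 120s  MAX(created_at) unchanged
//           -> Optional.empty(), nothing is assigned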
Let's take a break
Ok, let's stop for a second and try to recap what we have understood so far.
The SplitEnumerator creates Splits and assigns them to Readers that are sitting idle, waiting for work.
A Split contains important metadata needed to retrieve the actual data.
A State class, in general, is used for creating a checkpoint.
A SerDe class is needed for all those components that need to be serialized and deserialized.
Now that we know how a Split is created, we need to understand how to retrieve data based on the information stored on the Split.
SplitReader
Here we are.
We have an understanding of what a Split is.
We understood how to discover, create, and assign them.
Now we need a way to actually collect and read the data.
The SplitReader class is what we need.
Important: a single SplitReader could read from multiple Splits, but since we only have one Split available at a time, its implementation is simpler.
// antoniocali/sources/reader/BigQuerySplitReader.java
public class BigQuerySplitReader implements SplitReader<RowData, BigQuerySourceSplit> {
private static final Logger LOG = LoggerFactory.getLogger(BigQuerySplitReader.class);
private final Queue<BigQuerySourceSplit> assignedSplits = new ArrayDeque<>();
private final BigQueryReadOptions readOptions;
private Long currentTimestamp = 0L;
private Boolean closed = false;
public BigQuerySplitReader(BigQueryReadOptions readOptions) {
this.readOptions = readOptions;
}
// To check where to use
Long currentTimestampToFetch(BigQuerySourceSplit split) {
if (split.getMaxTimestamp() > 0) {
currentTimestamp = split.getMaxTimestamp();
}
return currentTimestamp;
}
private Iterator<FieldValueList> retrieveSplitData(BigQuerySourceSplit bqSplit) throws IOException, InterruptedException {
BigQueryClient bigQueryClient = BigQueryClient.builder().setReadOptions(readOptions).build();
String sqlQuery = BigQueryUtils.getSqlFromSplit(bqSplit, readOptions);
LOG.info("Retrieving Data for - {}", bqSplit);
LOG.info("SQL query: {}", sqlQuery);
BigQuery bigQuery = bigQueryClient.getBigQuery();
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(sqlQuery).setUseLegacySql(false).build();
TableResult tableResult = bigQuery.query(queryConfig);
return tableResult.iterateAll().iterator();
}
@Override
public RecordsWithSplitIds<RowData> fetch() throws IOException {
if (closed) {
throw new IllegalStateException("Can't fetch records from a closed split reader.");
}
RecordsBySplits.Builder<RowData> respBuilder = new RecordsBySplits.Builder<>();
var currentSplit = Optional.ofNullable(assignedSplits.poll());
if (currentSplit.isEmpty()) {
LOG.info("current split is empty");
return respBuilder.build();
}
BigQueryClient bigQueryClient = BigQueryClient.builder().setReadOptions(readOptions).build();
TableSchema tableSchema = bigQueryClient.getTableSchema();
Schema avroSchema = SchemaTransform.toGenericAvroSchema(readOptions.getFullTableName(true),
tableSchema.getFields());
RowType rowType = (RowType) FlinkAvroUtils.AvroSchemaToRowType(avroSchema.toString()).getTypeAt(0);
FieldValueListToRowDataConverters.FieldValueListToRowDataConverter fieldValueListToRowDataConverter = FieldValueListToRowDataConverters.createRowConverter(
rowType);
var actualSplit = currentSplit.get();
LOG.info("actual split is {}", actualSplit);
var read = 0L;
try {
var records = retrieveSplitData(actualSplit);
while (records.hasNext()) {
var record = records.next();
read++;
var converted = fieldValueListToRowDataConverter.convert(record);
respBuilder.add(actualSplit.splitId(), converted);
}
respBuilder.addFinishedSplit(actualSplit.splitId());
currentTimestamp = actualSplit.getMaxTimestamp();
LOG.info("Finish Reading {} - Total Records {}", actualSplit, read);
return respBuilder.build();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
@Override
public void handleSplitsChanges(SplitsChange<BigQuerySourceSplit> splitsChanges) {
LOG.debug("Handle split changes {}.", splitsChanges);
assignedSplits.addAll(splitsChanges.splits());
}
@Override
public void wakeUp() {
}
@Override
public void pauseOrResumeSplits(Collection<BigQuerySourceSplit> splitsToPause,
Collection<BigQuerySourceSplit> splitsToResume) {
SplitReader.super.pauseOrResumeSplits(splitsToPause, splitsToResume);
}
@Override
public void close() throws Exception {
if (!closed) {
closed = true;
currentTimestamp = 0L;
}
}
}
Let's focus on what's important - the fetch method.
I've used an internal class called RecordsBySplits, which gives us an easy way to push data into the split blocking queue.
We finally get to see the query against BigQuery used to retrieve the data.
It's performed by the call var records = retrieveSplitData(actualSplit); and the query it builds looks like this:
"SELECT * " +
" FROM " + tableName +
" WHERE " + columnNameConverted +
" BETWEEN " + bigQuerySourceSplit.getMinTimestamp() +
" AND " + bigQuerySourceSplit.getMaxTimestamp();
The fetch method also does a few extra steps.
My idea was for the connector to also be Table API compatible, so I've created a converter from FieldValueList to RowData - the internal data structure used by the Table API.
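The real FieldValueListToRowDataConverters in the project is schema-driven (it walks the RowType derived from the table's Avro schema). As a rough idea of what the conversion amounts to, here is a hard-coded two-column sketch (the columns id and name are made up):
import com.google.cloud.bigquery.FieldValueList;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;

public class SimpleFieldValueListConverter {
    // Assumes a table with two columns: id (INT64) and name (STRING).
    public RowData convert(FieldValueList row) {
        GenericRowData rowData = new GenericRowData(2);
        rowData.setField(0, row.get("id").getLongValue());
        rowData.setField(1, StringData.fromString(row.get("name").getStringValue()));
        return rowData;
    }
}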
New Break
We are starting to connect all the dots.
It's important to mention that the main job of a SplitReader is just to push records into the Reader's blocking queue, so Flink can pass them downstream.
If you're familiar with Flink, you know that usually a pipeline starts as env.fromSource(???).<Operators>.to()
So far we've looked at the individual components, but we are missing what is needed to fill in those ??? - a Source. And that's our next step, where all the pieces come together.
Let's take a look at what we need to create a Source in Flink.
We need to create two classes:
- a SourceReader class: or rather SourceReaderBase, a provided implementation of it that takes care of the synchronization between the mailbox main thread and the SourceReader's internal threads;
- a Source class: which acts as a factory that connects the SplitEnumerator and the SourceReader in one spot.
Let's look at the SourceReaderBase first, since the Source is the final glue for everything we've built so far.
If we go back to the very first image of the article, we can understand what a SourceReaderBase actually does.
We need a mechanism for the Reader to request a Split from the Job Manager and then process it.
In the following example I use a provided implementation of SourceReaderBase called SingleThreadMultiplexSourceReaderBase, which uses a single thread to perform the following steps:
- request a Split;
- read the data of the Split using a SplitReader.
public class BigQuerySourceReader
extends SingleThreadMultiplexSourceReaderBase<RowData, RowData, BigQuerySourceSplit, BigQuerySourceSplitState> {
private static final Logger LOG = LoggerFactory.getLogger(BigQuerySourceReader.class);
public BigQuerySourceReader(FutureCompletingBlockingQueue<RecordsWithSplitIds<RowData>> elementsQueue,
Supplier<SplitReader<RowData, BigQuerySourceSplit>> splitFetcherManager,
RecordEmitter<RowData, RowData, BigQuerySourceSplitState> recordEmitter, Configuration config,
SourceReaderContext context) {
super(elementsQueue, splitFetcherManager, recordEmitter, config, context);
}
@Override
public void start() {
if (getNumberOfCurrentlyAssignedSplits() == 0) {
context.sendSplitRequest();
}
}
@Override
protected void onSplitFinished(Map<String, BigQuerySourceSplitState> finishedSplitIds) {
for (BigQuerySourceSplitState splitState : finishedSplitIds.values()) {
BigQuerySourceSplit sourceSplit = splitState.toBigQuerySourceSplit();
LOG.info("Read for split {} is completed.", sourceSplit.splitId());
}
context.sendSplitRequest();
}
@Override
protected BigQuerySourceSplitState initializedState(BigQuerySourceSplit split) {
return new BigQuerySourceSplitState(split.splitId(), split.getMinTimestamp(), split.getMaxTimestamp());
}
@Override
protected BigQuerySourceSplit toSplitType(String splitId, BigQuerySourceSplitState sst) {
return new BigQuerySourceSplit(splitId, sst.getMinCurrentTimestamp(), sst.getMaxCurrentTimestamp());
}
}
When the Reader starts, it first checks whether it has an assigned split. If not, it sends a request to obtain one.
It's important to mention that SingleThreadMultiplexSourceReaderBase requires some extra pieces - look at the constructor.
⚠️ It requires a RecordEmitter: a class that takes a record from the SplitReader, updates the state of the Split, and finally emits the record downstream. You can see my (minimal) implementation below.
public class BigQueryRecordEmitter implements RecordEmitter<RowData, RowData, BigQuerySourceSplitState>, Serializable {
@Override
public void emitRecord(RowData rowData, SourceOutput<RowData> sourceOutput,
BigQuerySourceSplitState bigQuerySourceSplitState) throws Exception {
sourceOutput.collect(rowData);
}
}
ALERT: I've never implemented a way to update the State of the Split!
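For completeness, here is a hedged sketch of what a state-updating emitter could look like. This is NOT in the actual project (as just admitted above), and the timestampFieldIndex field is hypothetical - it assumes we know at which position of the RowData the configured fetch column ends up:
import org.apache.flink.api.connector.source.SourceOutput;
import org.apache.flink.connector.base.source.reader.RecordEmitter;
import org.apache.flink.table.data.RowData;

public class StatefulBigQueryRecordEmitter
        implements RecordEmitter<RowData, RowData, BigQuerySourceSplitState> {

    // Hypothetical: position of the configured timestamp column inside the RowData.
    private final int timestampFieldIndex;

    public StatefulBigQueryRecordEmitter(int timestampFieldIndex) {
        this.timestampFieldIndex = timestampFieldIndex;
    }

    @Override
    public void emitRecord(RowData record, SourceOutput<RowData> output,
                           BigQuerySourceSplitState splitState) throws Exception {
        // Track the highest timestamp emitted so far, so a checkpoint taken mid-split
        // would restore from the last emitted record instead of re-reading everything.
        long ts = record.getLong(timestampFieldIndex);
        if (ts > splitState.getMaxCurrentTimestamp()) {
            splitState.setMaxCurrentTimestamp(ts);
        }
        output.collect(record);
    }
}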
⚠️ It requires a FutureCompletingBlockingQueue: a custom blocking queue implementation used to hand data over from the producing thread to the consuming thread (the same blocking queue I mentioned when talking about the SplitReader).
RECAP before the grand finale.
We have seen a few things so far:
- what a Split is (metadata information);
- a SplitEnumerator that discovers Splits and assigns them to Readers (on the Job/Master Manager);
- a SplitReader that collects the data based on a Split's metadata (on the Task Manager);
- a SourceReaderBase that requests a Split when its Reader doesn't have any assigned (on the Task Manager).
And finally we can move on to the last class - the Source.
As mentioned above, it is the glue for all the Lego bricks we have built so far.
Let's take a look.
public class BigQueryDataStreamSource implements Source<RowData, BigQuerySourceSplit, BigQuerySplitEnumeratorState> {
private static final Logger LOG = LoggerFactory.getLogger(BigQueryDataStreamSource.class);
final BigQueryReadOptions readOptions;
public BigQueryDataStreamSource(BigQueryReadOptions readOptions) {
this.readOptions = readOptions;
}
@Override
public Boundedness getBoundedness() {
return Boundedness.CONTINUOUS_UNBOUNDED;
}
@Override
public SplitEnumerator<BigQuerySourceSplit, BigQuerySplitEnumeratorState> restoreEnumerator(
SplitEnumeratorContext<BigQuerySourceSplit> splitEnumeratorContext,
BigQuerySplitEnumeratorState bigQuerySplitEnumeratorState) throws Exception {
LOG.info("Restoring Enumerator with following State: {}", bigQuerySplitEnumeratorState);
return new BigQuerySplitEnumerator(splitEnumeratorContext, readOptions, bigQuerySplitEnumeratorState);
}
@Override
public SplitEnumerator<BigQuerySourceSplit, BigQuerySplitEnumeratorState> createEnumerator(
SplitEnumeratorContext<BigQuerySourceSplit> splitEnumeratorContext) throws Exception {
LOG.info("Creating new Enumerator");
return new BigQuerySplitEnumerator(splitEnumeratorContext, readOptions, null);
}
@Override
public SimpleVersionedSerializer<BigQuerySourceSplit> getSplitSerializer() {
return BigQuerySourceSplitSerializer.INSTANCE;
}
@Override
public SimpleVersionedSerializer<BigQuerySplitEnumeratorState> getEnumeratorCheckpointSerializer() {
return BigQuerySplitEnumeratorStateSerializer.INSTANCE;
}
@Override
public SourceReader<RowData, BigQuerySourceSplit> createReader(SourceReaderContext sourceReaderContext) throws
Exception {
FutureCompletingBlockingQueue<RecordsWithSplitIds<RowData>> elementsQueue = new FutureCompletingBlockingQueue<>();
BigQueryRecordEmitter recordEmitter = new BigQueryRecordEmitter();
Supplier<SplitReader<RowData, BigQuerySourceSplit>> splitReaderSupplier = () -> new BigQuerySplitReader(
readOptions);
return new BigQuerySourceReader(elementsQueue, splitReaderSupplier, recordEmitter, new Configuration(),
sourceReaderContext);
}
}
Let's dive in:
- getBoundedness describes the Source: whether it is unbounded (streaming) or bounded (batch);
- createEnumerator glues the Source to a SplitEnumerator;
- createReader glues the Source to a SourceReader.
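Before the final recap, here is a minimal usage sketch of the finished Source. The BigQueryReadOptions builder methods below are hypothetical (I'm only guessing the option names from how readOptions is used throughout the connector); everything else is standard Flink API:
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;

public class BigQuerySourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical builder - the real option names live in BigQueryReadOptions.
        BigQueryReadOptions readOptions = BigQueryReadOptions.builder()
                .setProject("my-project")
                .setDataset("my_dataset")
                .setTable("my_table")
                .setColumnFetcher("created_at") // timestamp column used for chunking
                .build();

        DataStream<RowData> stream = env.fromSource(
                new BigQueryDataStreamSource(readOptions),
                WatermarkStrategy.noWatermarks(),
                "BigQuery Source");

        stream.print();
        env.execute("BigQuery source demo");
    }
}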
Here we are, at the end of our journey.
I know I've already done a few recaps, but one very last one is needed, since I've also forgotten to mention a few things.
A Split serves as the fundamental unit of data processing in Flink. It primarily provides the information to retrieve the required data. Each Split is always associated with a SerDe and a State — used for checkpointing.
A SplitEnumerator is responsible for discovering and creating new Splits, as well as assigning them to Readers.
It is also associated with a State, as checkpointing is crucial to track which Splits have been created and read, as well as whether any Splits are currently assigned to Readers and to which ones.
Given that the State of a SplitEnumerator is often complex, a SerDe class is required to handle its serialization and deserialization.
The SplitEnumerator lives in the Job/Master process.
A SplitReader is responsible for reading the data of a Split and pushing it into the blocking queue of records. It operates within the Task Manager process.
A SourceReader is responsible for requesting new Splits.
[Optional] A RecordEmitter: a straightforward component that takes records from the SplitReader, updates the Split's State, and emits them downstream.
A Source: the glue connecting all the components. It brings together a SplitEnumerator and a SourceReader into a unified interface.