Lightweight Data Engineering Platform
Building a serverless data lakehouse platform where users can run lightweight workloads on GCC
Booth DA8
By DataMonkey Pte Ltd
Members
- Colin Ng (CEP/DEDIP)
- Jerome Goh (CEP/DEDIP)
- Joshua Na (Data Engineering Practice)
- Raymond Harris (GTO / FDT (SSG))
- Daniel Yuen (CEP/DEDIP)
Problem Statement
Govt devs often face hurdles provisioning their own GCC environment, which hinders their ability to leverage cloud-based services for development. This not only stifles engineering productivity and innovation, but more importantly, hinders agencies from leveraging modern technology to deliver services efficiently across the public sector.
Our platform solves this challenge by offering an intuitive, integrated environment for govt developers to easily create and deploy engineering pipelines. It supports everything from lightweight serverless experiments to robust, containerised production workloads, allowing for the seamless integration of GCC and WOG services. In doing so, we aim to lower barriers to cloud adoption, accelerating digital transformation across the govt and empowering developers to deliver impactful, data-driven solutions.
Problem Statement Ideation
Our problem statement ideation started with three hypotheses / problem statements, drawn from our backgrounds across the various teams:
- (GTO / FDT) Devs in govt agencies may face issues deploying cloud workloads for a variety of reasons, e.g. the agency's system might be managed by vendor IT teams, rely on legacy systems, or the devs may simply lack the know-how. Sometimes their use case is too small or niche in the agency's view, so they can only self-serve instead.
- (DEDIP) The product we aim to build targets a missing layer within the current CEP / DEDIP stack: a data-engineering-centric solution focusing on data storage + compute + ETL.
- (Data Engineering Practice) How do we choose the right tools / technology / architecture to run the above workloads in a composable, headless way, while allowing us to showcase the latest data engineering trends and tooling?
Analysis
Existing products and solutions are at the wrong level of abstraction.
GCC or GCC Devlab accounts target users who want full control over their GCC environments
- Inappropriate level of abstraction, as users might not know how to set up a fully compliant environment end to end, and may not want to do so in an agile development process.
Analytics.gov and Maestro
- The closest to what we want to build, but with a key difference: MAESTRO is AI/ML-ops focused and not truly optimised for data engineering workloads, while AG focuses mostly on IDEs like Jupyter Notebook running on EKS rather than integrating easily with cloud-native tools and other platforms.
Cstack / Airbase
- A great option for hosting frontend apps, but data engineering and storage containers still need to be managed by data engineers themselves, which they may not be comfortable with or want to do.
Databricks and Snowflake
- The closest in terms of what we want to offer as a platform, but these are usually geared towards larger, enterprise-scale workloads.
- Also, configuring Snowflake / Databricks at WOG scale is difficult (we have plenty of experience at the DEDIP and Data Engineering Practice level with such projects). Furthermore, we are of the opinion that solutions from these providers should be deployed at the agency level instead.
- Lastly, we can always look into implementing these for future features / use cases as the platform grows, but for now we are looking for lightweight alternatives that are easier to implement.
Key Value Prop to focus on:
- Simple data engineering tooling (we handle the networking, security, compliance, and integration with other GCC platforms/products)
- Instead of workspaces, we are aiming for cloud-native, production-ready data engineering workloads
- Focused on the earlier part of the data journey, namely these components: storage, compute, orchestration, exploitation
Solution
The idea is simple:
- Give users access to a baseline set of cloud-native resources such as S3, Lambdas/containers, IAM and other relevant tooling.
- Abstract away complex setup like networking, security groups and IAM policies, and just let people deploy easily onto cloud resources.
- Expose the services as APIs, e.g.:
- User uploads files/data to S3 via APIs
- User uploads a script to Lambda and triggers it
- User creates an EventBridge cron job or event trigger for their S3 / Lambda combination
- Integrate with other CEP platforms (Airbase, CFT, MAESTRO, Gitlab, Cloak) to open up more use cases for users
- Give users the necessary tools and let them build around these