Technology
We do all the heavy lifting for you so you can focus on what brings value to your business.
- 1. Qluster is designed so non-engineers can do data ingestion, cleaning, and unification at scale.
- 2. You don't need to know about the internal workings of Qluster to use it.
- 3. The more your data is prone to human error, the more value Qluster will bring.
Qluster is Cloud Native
Qluster runs on top of Kubernetes and has no dependency on any cloud provider. All it needs is Kubernetes and a Postgres database for storing its settings.
Qluster utilizes Kubernetes to scale up and down automatically based on the resources needed to handle your volume of data.
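For illustration only (this is not Qluster's actual configuration), autoscaling of this kind can be expressed with the official Kubernetes Python client; the deployment name, namespace, and thresholds below are hypothetical.

```python
# Illustrative sketch: autoscale a hypothetical worker deployment on CPU usage.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="qluster-workers"),   # hypothetical name
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="qluster-worker"),
        min_replicas=1,                        # scale down when idle
        max_replicas=20,                       # scale up under heavy ingestion load
        target_cpu_utilization_percentage=70,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="qluster", body=hpa)             # hypothetical namespace
```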
Data Security as a First Class Citizen
Qluster is committed to security and focused on keeping your data safe. We adhere to industry-leading standards while connecting, processing, and loading data from all of your data sources.
- 1. The entire data flow lifecycle stays within your virtual private cloud in the on-prem deployment.
- 2. In the hosted version, a process may generate ephemeral data specific to a data source, e.g. in the form of logs. This data is essential to the ingestion process and is retained for debugging for up to a few days, depending on the client's requirements.
- 3. In the hosted version, we can also let you host the settings and logs in your own infrastructure. Then there will be absolutely no trace of your data in ours.
Encryption
Encryption At Rest
The first thing Qluster does before processing a file is use GnuPG to encrypt the file and push it to the backup storage, e.g. AWS S3 or Google Cloud Storage. That way, if anything fails, we have a fallback. Nothing is ever lost!
Qluster uses industry-standard Fernet (secret-key) authenticated cryptography to securely store secrets in the settings database.
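The snippet below is a minimal sketch of these two layers, not Qluster's internal code. It assumes GnuPG is installed and a backup recipient key already exists in the local keyring; the file name and key address are illustrative.

```python
# Sketch of the two encryption layers described above.
import subprocess
from cryptography.fernet import Fernet

# 1) Encrypt a file with GnuPG before pushing it to backup storage
#    (recipient key assumed to exist; names are illustrative).
subprocess.run(
    ["gpg", "--batch", "--yes", "--encrypt",
     "--recipient", "backups@example.com",
     "--output", "orders.csv.gpg", "orders.csv"],
    check=True,
)

# 2) Store a secret (e.g. a database password) with Fernet authenticated
#    encryption before writing it to the settings database.
key = Fernet.generate_key()            # held in a secrets manager in practice
fernet = Fernet(key)
token = fernet.encrypt(b"db-password")
assert fernet.decrypt(token) == b"db-password"
```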
Encryption In Transit
Data is encrypted when transmitted across networks to protect against eavesdropping of network traffic by unauthorized users.
Data Governance: Role Based Access Control (RBAC)
Each user is associated with one or more data sources and can only act on behalf of those data sources.
Users can resolve data issues associated with their own data, or decide to delete the erroneous rows instead.
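Conceptually, the access check looks something like the sketch below, a simplification rather than Qluster's actual data model: every action is authorized against the set of data sources assigned to the user.

```python
# Simplified sketch of the RBAC check: users may only act on their own data sources.
USER_DATA_SOURCES = {                     # illustrative assignments, not real data
    "vendor-a@example.com": {"vendor_a_orders", "vendor_a_returns"},
    "vendor-b@example.com": {"vendor_b_orders"},
}

def can_act_on(user: str, data_source: str) -> bool:
    """Return True only if the data source is assigned to the user."""
    return data_source in USER_DATA_SOURCES.get(user, set())

assert can_act_on("vendor-a@example.com", "vendor_a_orders")
assert not can_act_on("vendor-b@example.com", "vendor_a_orders")
```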
Qluster Cloud
🚀 Qluster is officially a part of the Google for Startups Cloud Program, Amazon Web Services Startups Activate Program, and Microsoft for Startups Founders Hub! 🚀
The devil is in the details
Data ingestion pipelines are brittle. Things can go wrong in a million places. We take care of all the tiny details so you don't have to.
Your Data Safety is our number one commitment
- Files are encrypted and backed up before we process them. If anything breaks, the original file will be pulled from the secure backup.
- You own the backup storage. We don't hold on to your files. You can use AWS S3 or Google Cloud Storage as the backup medium (a minimal sketch of this backup step follows this list).
- We avoid locking your destination database at all costs. We use a combination of triggers and advanced code to run database schema migrations on live databases.
- You can even host the settings database that powers Qluster while the pipelines still run within Qluster's infrastructure. That way, nothing from your data reaches Qluster except logs, and we can even stream the logs to your logging infrastructure.
- Data governance: You can assign users to specific data sources, e.g. restricting 3rd-party vendors to only their respective data.
- Incoming data is never lost. Bad data makes its way into the quarantine tables for you to review and resolve.
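As a rough sketch of the backup step (your bucket, your keys; the names below are hypothetical, not Qluster internals), the encrypted file could be pushed to a customer-owned bucket with boto3 before any processing starts:

```python
# Sketch: push the GnuPG-encrypted file to a customer-owned S3 bucket before processing.
import boto3

s3 = boto3.client("s3")  # credentials resolved from the environment / IAM role
s3.upload_file(
    Filename="orders.csv.gpg",                # encrypted upstream, see the Encryption section
    Bucket="customer-owned-backup-bucket",    # hypothetical bucket name
    Key="backups/2024-01-15/orders.csv.gpg",  # hypothetical key layout
)
```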
Data Firewall
- Qluster is the first layer of defense against bad data.
- The mentality of "get all the data in, deal with the problems later" will hurt you down the line. We detect and handle data issues upstream, before bad data even makes its way into your infrastructure.
- If some rows of data can't be imported into the database, we quarantine the bad rows so you can take a look at them, and we provide tooling to resolve the issue (a minimal sketch of this quarantine flow follows this list).
- Once you have enough data ingested through Qluster, we can enable anomaly detection.
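The sketch below illustrates the quarantine idea in plain Python, a simplification rather than Qluster's implementation; the field names and validation rules are hypothetical. Rows that fail validation are routed to a quarantine table instead of being dropped or blocking the load.

```python
# Simplified sketch: split incoming rows into a clean batch and a quarantine batch.
def validate(row: dict) -> list:
    """Return a list of issues found in a row; an empty list means the row is clean."""
    issues = []
    if not row.get("order_id"):
        issues.append("missing order_id")
    if not str(row.get("quantity", "")).isdigit():
        issues.append("quantity is not a non-negative integer")
    return issues

def partition(rows):
    """Route clean rows to the load step and bad rows to a quarantine table."""
    clean, quarantined = [], []
    for row in rows:
        issues = validate(row)
        if issues:
            quarantined.append({**row, "issues": issues})
        else:
            clean.append(row)
    return clean, quarantined

clean, quarantined = partition([
    {"order_id": "A-1", "quantity": "3"},
    {"order_id": "", "quantity": "oops"},   # ends up in the quarantine table
])
print(len(clean), len(quarantined))         # 1 1
```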
Observability
- We give both you and your 3rd-party vendors observability into what is happening to the data, and we guide users to make the right choices when things break.
- Just like you, your vendors can see and get notified about what has gone wrong, but only about issues specific to their own data.
- The Qluster Slack bot can be installed in your Slack channels to provide immediate notifications for data issues.
Data Lineage
- Qluster tracks how data is modified as it gets cleaned. Every value modification is recorded, along with who changed it and why (a minimal sketch of such a record follows this list).
- With the click of a button, you can see the full history of any value.
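Conceptually, each change could be captured as a record like the one sketched below; the field names are illustrative and not Qluster's actual schema.

```python
# Illustrative sketch of a lineage record: what changed, who changed it, and why.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ValueChange:
    table: str
    row_id: str
    column: str
    old_value: str
    new_value: str
    changed_by: str       # user or automated rule that made the change
    reason: str           # e.g. "normalized datetime format"
    changed_at: datetime

change = ValueChange(
    table="orders", row_id="A-1", column="order_date",
    old_value="01/15/24", new_value="2024-01-15",
    changed_by="qluster:datetime-normalizer", reason="normalized datetime format",
    changed_at=datetime.now(timezone.utc),
)
```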
File Formats
- Automatic file-encoding detection. Is this a big-endian or little-endian file? UTF-16 or UTF-32? There are many flavors of encoding, and we've got you covered: no more strange characters showing up in your output (a sketch of this step follows this list).
- Field-name transliteration. We automatically transliterate the field names in your files into English to make sure your destination database can handle them.
- Header row not on the first line? No problem. Some text above the header? No problem. We detect all of that automatically.
- Automatic handling of compressed files such as ZIP, TAR, GZIP, and Snappy.
- Handling of Excel files, from modern XLSX files all the way back to the original XLS-XML files from 1995.
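For illustration, encoding detection and field-name transliteration can be approximated with off-the-shelf libraries such as chardet and unidecode; this is a sketch under those assumptions, not Qluster's internals, and the sample data is made up.

```python
# Sketch: detect a file's encoding and transliterate its header into database-safe names.
import chardet                      # third-party: pip install chardet
from unidecode import unidecode     # third-party: pip install unidecode

raw = "Straße,Café,Prix\n1,2,3\n".encode("utf-16")   # stand-in for a vendor file

detected = chardet.detect(raw)                        # e.g. {'encoding': 'UTF-16', ...}
text = raw.decode(detected["encoding"] or "utf-8")

header = text.splitlines()[0].split(",")
safe_fields = [unidecode(name).strip().lower().replace(" ", "_") for name in header]
print(safe_fields)                                    # ['strasse', 'cafe', 'prix']
```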
AI Driven Data Cleaning
- When we see a new file with fields we did not expect, we use AI (e.g. Named Entity Recognition), fuzzy matching, and statistics to recommend a column mapping for you (a rough sketch of the fuzzy-matching idea follows this list).
- We use semi-supervised anomaly detection algorithms to detect anomalies in your incoming data. The model is auto-trained based on your most recent data.
- We automatically take care of the majority of common data issues, such as datetime formats, fat-finger errors, and field renames. If Qluster can't resolve an issue automatically, it will recommend solutions to you. Both the admin and the vendors can confirm the recommended solution so the issue can be resolved.
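As a rough illustration of the fuzzy-matching part only (not the full NER-plus-statistics pipeline), unexpected field names can be matched against an expected schema with standard-library fuzzy matching; the column names below are hypothetical.

```python
# Sketch: map unexpected incoming field names to an expected schema with fuzzy matching.
from difflib import get_close_matches

EXPECTED_COLUMNS = ["order_id", "order_date", "customer_name", "quantity"]

def suggest_mapping(incoming_fields):
    """Suggest an expected column for each incoming field (None if no good match)."""
    mapping = {}
    for field in incoming_fields:
        matches = get_close_matches(field.lower(), EXPECTED_COLUMNS, n=1, cutoff=0.6)
        mapping[field] = matches[0] if matches else None
    return mapping

print(suggest_mapping(["Order ID", "order-dt", "cust_name", "qty"]))
# {'Order ID': 'order_id', 'order-dt': 'order_date', 'cust_name': 'customer_name', 'qty': None}
```

Matches below the cutoff (like "qty" above) stay unmapped so a human, or a stronger model, can decide.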
Cloud Independent
- All Qluster needs to run is Kubernetes and Postgres. It does not have a hard dependency on any cloud provider.
- Qluster can be deployed on your own VPC (Virtual Private Cloud). That way no data ever gets out of your infrastructure.
Highly Scalable
- Qluster is distributed. It can scale as big as you need it to be.
- Qluster uses Kubernetes primitives and is cloud native.
Extendable
- Qluster's SDK can be used to build custom validation and data transformation logic.
- Qluster's REST API can be used by engineers to interface with Qluster.
- Since each action in Qluster corresponds to a Kubernetes job, we can run custom validation or ingestion code as part of your pipeline as long as it can run as a Docker image. Custom code can only be deployed in enterprise self-hosted deployments. A rough sketch of this pattern follows.
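For illustration only (the image, namespace, and job names below are hypothetical, and this is not Qluster's own orchestration code), running a custom Docker image as a one-off Kubernetes job looks roughly like this with the official Python client:

```python
# Sketch: run a custom validation image as a one-off Kubernetes job.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="custom-validation"),   # hypothetical name
    spec=client.V1JobSpec(
        backoff_limit=2,                                       # retry twice on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="validator",
                    image="registry.example.com/acme/custom-validator:1.0",  # your image
                    args=["--input", "s3://customer-owned-backup-bucket/orders.csv.gpg"],
                )],
            ),
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="qluster", body=job)  # hypothetical namespace
```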
Don't re-invent the wheel
We've got you covered for file data ingestion.
Want to hear more?