Persistent Storage
Persistent storage refers to any method of storing data that remains intact and accessible even after a system is powered off, restarted, or experiences a crash.
In the context of Windmill, the question is: where should the data manipulated by Windmill (ETL, data ingestion and preprocessing, data migration and sync, etc.) be stored?
There are 4 kinds of persistent storage in Windmill:
- Small data that is relevant between script/flow executions and can be persisted on Windmill itself.
- Big structured SQL data that is critical to your services, stored externally in an SQL database or data warehouse.
- Object storage for large data, such as S3.
- NoSQL and document databases such as MongoDB, and key-value stores.
You already have your own database
Then there is not much to add.
If your service provider is already part of our list of integrations, just add your database as a resource.
Otherwise, create access to your service provider through a new resource type (and if you want, share the schema on our Hub).
Within Windmill: not recommended
Windmill is not designed to store heavy data that outlives the execution of a script or flow. Each computation may be executed by a different worker than the previous one, so the data would have to be retrieved from another location anyway.
Instead, Windmill is very convenient to use alongside data storage providers to manipulate large amounts of data.
There are however internal methods to persist data between executions of jobs.
Internal States and Resources
Within Windmill, you can use Internal States and Resources as a way to store transient state that can be represented as a small JSON object.
States
States are actually resources (but excluded from the Workspace tab for clarity). They are used by scripts to keep data persistent between runs of the same script by the same trigger (schedule or user).
An internal state is just a state which is meant to persist across distinct executions of the same script. This is what enables Flows to watch for changes in most event watching scenarios. The pattern is as follows:
- Retrieve the last internal state or, if undefined, assume it is the first execution.
- Retrieve the current state in the external system you are watching, e.g. the list of users having starred your repo or the maximum ID of posts on Hacker News.
- Calculate the difference between the current state and the last internal state. This difference is what you will want to act upon.
- Set the new internal state as the current state so that you do not process the elements you just processed.
- Return the differences calculated previously so that you can process them in the next steps. You will likely want to for-loop over the items and trigger one Flow per item. This is exactly the pattern used when your Flow is in the "Watching changes regularly" mode.
The convenience functions to do this in TypeScript are:
- `getState()`: retrieves an object of any type (internally a simple Resource) at a path determined by `getStatePath`, which is unique to the user currently executing the Script, the Flow in which it is currently getting called (if any), and the path of the Script.
- `setState(value: any)`: sets the new state.
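As a sketch, the watch-for-changes pattern above can look like the following. Note the assumptions: in a real Windmill script, `getState`/`setState` would come from the Windmill client, while here they are in-memory stand-ins so the example is self-contained, and `fetchMaxId` is a hypothetical external system returning the current maximum post ID:

```typescript
// Stand-ins for the Windmill client's getState/setState (in-memory only,
// so this sketch runs outside of Windmill).
let stored: unknown = undefined;
const getState = async (): Promise<unknown> => stored;
const setState = async (value: unknown): Promise<void> => {
  stored = value;
};

// Hypothetical external system being watched, e.g. max post ID on Hacker News.
async function fetchMaxId(): Promise<number> {
  return 42;
}

export async function main(): Promise<number[]> {
  // 1. Retrieve the last internal state (undefined on the first run).
  const lastId = ((await getState()) as number | undefined) ?? 0;
  // 2. Retrieve the current state of the external system.
  const maxId = await fetchMaxId();
  // 3. Compute the difference: the items to act upon.
  const newIds: number[] = [];
  for (let id = lastId + 1; id <= maxId; id++) newIds.push(id);
  // 4. Persist the new state so these items are not reprocessed.
  await setState(maxId);
  // 5. Return the differences for the next steps to process.
  return newIds;
}
```

On a second run, `main` returns an empty array, since the internal state has caught up with the external system.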
Resources
States are actually just a specialization of resources where the type is `state` and the path is automatically calculated for you based on the schedule path (if any) and the script path. In some cases, you may want to set the path arbitrarily and/or use a different type than `state`. In this case, you can use the `setResource` and `getResource` functions. The same resource can be used across different scripts and flows.
- `setResource(value: any, path?: string, initializeToTypeIfNotExist?: string)`: sets a resource at a given path. This is equivalent to `setState` but allows you to set an arbitrary path and choose a type other than `state` if wanted. See the API documentation.
- `getResource(path: string)`: gets a resource at a given path. See the API documentation.
The states can be seen in the Resources section with a Resource Type of `state`.
Variables are similar to resources but have no types, can be tagged as `secret` (in which case they are encrypted by the workspace key) and can only store strings. In some situations, you may prefer `setVariable`/`getVariable` to resources.
In conclusion, `setState` and `setResource` are convenient ways to persist JSON between multiple script executions.
Shared Directory
Flows on Windmill are result-based by default: a step takes as inputs the results of previous steps. This works fine for lightweight automation.
For heavier ETLs, you might want to use the Shared Directory to share data between steps. Steps share a folder at `./shared` in which they can store heavier data and pass it to the next step.
Beware that the `./shared` folder is not preserved across suspends and sleeps. The directory is temporary and only exists for the duration of the execution.
To enable the shared directory, go to the `Settings` menu, then `Shared Directory`, and toggle on `Shared Directory on './shared'`.
To use the shared directory, write outputs to `./shared/${path}` and read them from there in the following steps.
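As an illustration, two steps of a flow can exchange a heavy intermediate result through the shared folder. This is a sketch: the folder is created explicitly only so the snippet runs outside of a flow, and the file name is arbitrary:

```typescript
import * as fs from "fs";

// In a Windmill flow, ./shared is provided by the runtime once the
// Shared Directory setting is toggled on; it is created here only so
// the sketch runs standalone.
fs.mkdirSync("./shared", { recursive: true });

// Step A: write a heavy intermediate result to the shared folder
// instead of returning it as a step result.
const rows = [
  { id: 1, name: "alice" },
  { id: 2, name: "bob" },
];
fs.writeFileSync("./shared/rows.json", JSON.stringify(rows));

// Step B (a later step of the same flow, on the same worker):
// read the intermediate result back from the shared folder.
const loaded = JSON.parse(fs.readFileSync("./shared/rows.json", "utf8"));
```
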
Although we recommend using the Shared Directory for persistent state within a flow, be aware that with the Shared Directory all steps are executed on the same worker.
This method is therefore strictly ephemeral to the flow.
Structured Databases: Postgres (Supabase, Neon.tech)
For Postgres databases (best for structured data storage and retrieval, where you can define schema and relationships between entities), we recommend using Supabase or Neon.tech.
Supabase
Supabase is an open-source alternative to Firebase, providing a backend-as-a-service platform that offers a suite of tools, including real-time subscriptions, authentication, storage, and a PostgreSQL-based database.
Get a connection string:
- Go to the `Settings` section.
- Click `Database`.
- Find your Connection Info and Connection String. Direct connections are on port 5432.
From Windmill, add your Supabase connection string as a PostgreSQL resource and execute queries. Tip: you might need to set the `sslmode` to "disable".
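For illustration, the parsed resource could look like this. The host and user below are placeholders for a direct Supabase connection, and per the tip above `sslmode` is set to "disable":

```json
{
  "host": "db.<project-ref>.supabase.co",
  "port": 5432,
  "user": "postgres",
  "dbname": "postgres",
  "sslmode": "disable",
  "password": "<password>"
}
```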
You can also integrate Supabase directly through its API.
Neon.tech
Neon.tech is an open-source cloud database platform that provides fully managed PostgreSQL databases with high availability and scalability.
Get a connection string. You can obtain the connection string from the Connection Details widget on the Neon Dashboard: select a branch, a role, and the database you want to connect to, and a connection string will be constructed for you.
From Windmill, add your Neon.tech connection string as a PostgreSQL resource and execute queries.
Adding the connection string as a PostgreSQL resource requires parsing it.
For example, for `psql postgres://daniel:<password>@ep-restless-rice.us-east-2.aws.neon.tech/neondb`, that would be:
```json
{
  "host": "ep-restless-rice.us-east-2.aws.neon.tech",
  "port": 5432,
  "user": "daniel",
  "dbname": "neondb",
  "sslmode": "require",
  "password": "<password>"
}
```
Note that the `sslmode` should be "require" and that Neon uses the default PostgreSQL port, 5432.
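The parsing can also be done programmatically. Below is a minimal sketch using the standard `URL` class; defaulting `sslmode` to "require" and the port to 5432 are assumptions based on the example above, and real connection strings may carry query parameters (e.g. `?sslmode=...`) that this sketch ignores:

```typescript
// Parse a postgres:// connection string into the fields expected by a
// Windmill PostgreSQL resource.
function parsePostgresUrl(connectionString: string) {
  const url = new URL(connectionString);
  return {
    host: url.hostname,
    // Neon uses the default PostgreSQL port when none is given.
    port: url.port ? Number(url.port) : 5432,
    user: decodeURIComponent(url.username),
    dbname: url.pathname.replace(/^\//, ""),
    sslmode: "require",
    password: decodeURIComponent(url.password),
  };
}

const resource = parsePostgresUrl(
  "postgres://daniel:secret@ep-restless-rice.us-east-2.aws.neon.tech/neondb"
);
```
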
Large Data Files: S3, R2, MinIO
For heavier data objects and unstructured data storage, Amazon S3 (Simple Storage Service) and its alternatives Cloudflare R2 and MinIO are highly scalable and durable object storage services that provide secure, reliable, and cost-effective storage for a wide range of data types and use cases.
Amazon S3, Cloudflare R2 and MinIO all follow the same API schema and therefore have a common Windmill resource type.
Amazon S3
Amazon S3 (Simple Storage Service) is a scalable and durable object storage service offered by Amazon Web Services (AWS), designed to provide developers and businesses with an effective way to store and retrieve any amount of data from anywhere on the web.
Create a bucket on S3.
Integrate it to Windmill by filling the resource type details for S3 APIs.
Make sure the user associated with the resource has the right policies allowed in AWS Identity and Access Management (IAM).
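As an illustration, a minimal IAM policy granting read/write access to a single bucket might look like the following. The bucket name is a placeholder; adjust the actions to your needs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-windmill-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-windmill-bucket/*"
    }
  ]
}
```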
You can find examples and premade S3 scripts on Windmill Hub.
Cloudflare R2
Cloudflare R2 is a cloud-based storage service that provides developers and businesses with a cost-effective and secure way to store and access their data.
Create a bucket on R2.
Integrate it to Windmill by filling the resource type details for S3 APIs.
MinIO
MinIO is an open-source, high-performance, and scalable object storage server that is compatible with the Amazon S3 API, designed for building private and public cloud storage solutions. For best performance, install MinIO locally.
Then from Windmill, just fill the S3 resource type.
Key-Value Stores: MongoDB Atlas, Redis, Upstash
Key-value stores are a popular choice for managing non-structured data, providing a flexible and scalable solution for various data types and use cases. In the context of Windmill, you can use MongoDB Atlas, Redis, and Upstash to store and manipulate non-structured data effectively.
MongoDB Atlas
MongoDB Atlas is a managed database-as-a-service platform that provides an efficient way to deploy, manage, and optimize MongoDB instances. As a document-oriented NoSQL database, MongoDB is well-suited for handling large volumes of unstructured data. Its dynamic schema enables the storage and retrieval of JSON-like documents with diverse structures, making it a suitable option for managing non-structured data.
To use MongoDB Atlas with Windmill, integrate it by filling the resource type details.
You can find examples and premade MongoDB scripts on Windmill Hub.
Redis
Redis is an open-source, in-memory key-value store that can be used for caching, message brokering, and real-time analytics. It supports a variety of data structures such as strings, lists, sets, and hashes, providing flexibility for non-structured data storage and management. Redis is known for its high performance and low-latency data access, making it a suitable choice for applications requiring fast data retrieval and processing.
To use Redis with Windmill, integrate it by filling the resource type details, following the same schema as MongoDB Atlas.
Upstash
Upstash is a serverless, edge-optimized key-value store designed for low-latency access to non-structured data. It is built on top of Redis, offering similar performance benefits and data structure support while adding serverless capabilities, making it easy to scale your data storage needs.
To use Upstash with Windmill, integrate it by filling the resource type details, following the same schema as MongoDB Atlas.