Using Data Pools in Services
This guide explains how to use the DataPool feature to work with datasets and file collections within your services.
What is a Data Pool?
A Data Pool is a managed collection of files, similar to a directory or a folder, that can be attached to your service at runtime. It provides a simple and efficient way to access large datasets, pre-trained models, or any other file-based resources without having to include them directly in your service's deployment package.
When you use a Data Pool, the platform mounts the specified file collection into your service's runtime environment. The qhub-commons library provides a convenient DataPool abstraction to interact with these mounted files.
Data Pool Limits
Data Pools are designed to handle large datasets, but there are some limits to keep in mind:
- The maximum size of a single file in a Data Pool is 500 MB.
- The files are mounted using a blob storage technology, which means performance may vary based on the size and number of files.
How to Use the DataPool Class
To use a Data Pool in your service, you simply need to declare a parameter of type DataPool in your run method. The runtime will automatically detect this and inject a DataPool object that corresponds to the mounted file collection.
The DataPool Object
The DataPool object, found in qhub.commons.datapool, provides the following methods to interact with the files in the mounted directory:
- `list_files() -> Dict[str, str]`: Returns a dictionary of all files in the Data Pool, where the keys are the file names and the values are their absolute paths.
- `open(file_name: str, mode: str = "r")`: Opens a specific file within the Data Pool and returns a file handle, similar to Python's built-in `open()` function.
- `path`: A property that returns the absolute path to the mounted Data Pool directory.
- `name`: A property that returns the name of the Data Pool (which corresponds to the parameter name in your `run` method).
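For local experiments and unit tests, this interface can be mirrored by a minimal stand-in backed by a plain directory. Note that this `LocalDataPool` class is an illustrative sketch, not part of `qhub-commons`; on the platform you always receive the real `DataPool` object.

```python
import os


class LocalDataPool:
    """Minimal stand-in mirroring the documented DataPool interface,
    backed by a plain local directory. For illustration and testing only."""

    def __init__(self, path: str, name: str = "my_dataset"):
        self._path = os.path.abspath(path)
        self._name = name

    @property
    def path(self) -> str:
        # Absolute path to the (simulated) mounted Data Pool directory
        return self._path

    @property
    def name(self) -> str:
        # Name of the Data Pool (the parameter name in your run method)
        return self._name

    def list_files(self) -> dict:
        # Map file names to their absolute paths, as described above
        return {
            f: os.path.join(self._path, f)
            for f in os.listdir(self._path)
            if os.path.isfile(os.path.join(self._path, f))
        }

    def open(self, file_name: str, mode: str = "r"):
        # Open a file inside the pool, like the built-in open()
        return open(os.path.join(self._path, file_name), mode)
```

Such a stand-in lets you test code that consumes the documented methods without the platform's mounting system.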
Tutorial: Building a Service with a Data Pool
Let's walk through an example of a service that reads data from a Data Pool.
1. Initialize a New Project
If you haven't already, create a new service project. You can use the CLI to set up a new service:
```bash
qhubctl init
cd [user_code]
uv venv
source .venv/bin/activate
uv sync
```

For the rest of this guide, we assume that you created your service in a directory named `user_code`, with the main code in `user_code/src/`.
2. Update the run Method
In your program.py, define a run method that accepts a DataPool parameter. The name of the parameter (e.g., my_dataset) is important, as it will be used to identify the Data Pool in the API call.
```python
# user_code/src/program.py
from qhub.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read: str


def run(data: InputData, my_dataset: DataPool) -> str:
    """
    Reads the content of a specified file from a Data Pool.
    """
    try:
        # Use the open() method to read a file from the Data Pool
        with my_dataset.open(data.file_to_read) as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"File '{data.file_to_read}' not found in the Data Pool."
```

In this example, the `run` method expects a Data Pool to be provided for the `my_dataset` parameter.
3. Local Testing with Data Pools
When developing and testing your service locally, you don't have access to the platform's Data Pool mounting system. However, you can easily simulate this by creating a local directory and passing it to your run method.
Steps for Local Testing
1. Create a local directory for your Data Pool. This directory should be placed inside the `user_code/input` directory. The name of this directory can be anything, but for this example, we'll name it `my_dataset` to match the parameter in the `run` method.
2. Populate the directory with your test files. Place any files you need for your test inside this directory (e.g., `user_code/input/my_dataset/hello.txt`), and add the value `Hello` to the `hello.txt` file.
3. Update the `__main__.py` file. Modify your main entrypoint to manually create a `DataPool` instance and pass it to the `run` function. You will create the `DataPool` object with a relative path to your local Data Pool directory.
4. Run your service. Now you can run your service directly without setting any environment variables.

```bash
# Run your service's main entrypoint
cd user_code
python -m src
```
Example
Let's assume your project has the following structure:
```
user_code
├── src/
│   ├── __main__.py
│   └── program.py
└── input/
    ├── data.json
    └── my_dataset/
        └── hello.txt
```

And `user_code/input/data.json` contains:
```json
{
  "file_to_read": "hello.txt"
}
```

Update your `user_code/src/__main__.py` to look like this:
```python
# user_code/src/__main__.py
import json
import os

from qhub.commons.constants import OUTPUT_DIRECTORY_ENV
from qhub.commons.datapool import DataPool
from qhub.commons.json import any_to_json
from qhub.commons.logging import init_logging

from .program import InputData, run

init_logging()

# This file is executed if you run `python -m src` from the project root.
# Use it to test your program locally: read the input data from the `input`
# directory and map it to the respective parameters of the `run()` function.

# Redirect the platform's output directory for local testing
directory = "./out"
os.makedirs(directory, exist_ok=True)
os.environ[OUTPUT_DIRECTORY_ENV] = directory

with open("./input/data.json") as file:
    data = InputData.model_validate(json.load(file))

result = run(data, my_dataset=DataPool("./input/my_dataset"))
print(any_to_json(result))
```

The `__main__.py` script now manually creates the `DataPool` object and passes it to your `run` function, simulating the behavior of the platform and allowing you to test your `run` method's logic with local files.
Now run the service:
```bash
python -m src
```

4. Use Multiple Data Pools
If your service needs to work with multiple Data Pools, you can simply add more parameters of type DataPool to your run method.
```python
import os

from qhub.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read_from_my_dataset: str
    file_to_read_from_another_dataset: str


def run(data: InputData, my_dataset: DataPool, another_dataset: DataPool, output_datapool: DataPool) -> str:
    """
    Combines the content of the specified files from two different Data Pools in a third Data Pool.
    """
    try:
        with my_dataset.open(data.file_to_read_from_my_dataset) as f1:
            content1 = f1.read()
        with another_dataset.open(data.file_to_read_from_another_dataset) as f2:
            content2 = f2.read()

        # You can also write to the output Data Pool if needed
        concatenated_file = os.path.join(output_datapool.path, "concatenated_output.txt")
        with open(concatenated_file, "w") as out_file:
            out_file.write(content1)
            out_file.write(content2)

        return f"Content from my_dataset: {content1}\nContent from another_dataset: {content2}"
    except FileNotFoundError as e:
        return str(e)
```

For local development, you would create two directories in your input folder (e.g., `my_dataset` and `another_dataset`) and pass them as separate parameters in the `__main__.py`:
```python
# user_code/src/__main__.py
# ... same as above ...
result = run(
    data,
    my_dataset=DataPool("./input/my_dataset"),
    another_dataset=DataPool("./input/another_dataset"),
    output_datapool=DataPool("./input/output_datapool"),
)
# ... same as above ...
```

Create the `another_dataset` and `output_datapool` directories in your input folder. Then, create the file `input/another_dataset/world.txt` with the content `World`.
Before running the service, update the data in input/data.json to include the new file names:
```json
{
  "file_to_read_from_my_dataset": "hello.txt",
  "file_to_read_from_another_dataset": "world.txt"
}
```

After running the service, you should see the concatenated output file in the `input/output_datapool` directory.
```bash
python -m src
```

5. Configuring the Data Pool in the API Call
When you execute this service via the platform API, you need to specify which Data Pool to mount. This is done by providing a special JSON object in the request body. The key of this object must match the DataPool parameter name in your run method (my_dataset in our example).
The JSON object has two fields:
- `id`: The unique identifier (UUID) of the Data Pool you want to use.
- `ref`: A static value that must be `"DATAPOOL"`.
Here is an example of a request body for our first version of the service:
```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

When the service is executed with this input, the platform will:
- Identify that the `my_dataset` parameter is a Data Pool reference.
- Mount the Data Pool with the specified `id`.
- Instantiate a `DataPool` object pointing to the mounted directory.
- Inject this `DataPool` object into the `run` method as the `my_dataset` argument.
Your code can then use the my_dataset object to interact with the files in the mounted Data Pool.
Here is an example of a request body for our second version of the service:
```json
{
  "data": {
    "file_to_read_from_my_dataset": "hello.txt",
    "file_to_read_from_another_dataset": "world.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "another_dataset": {
    "id": "b1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "output_datapool": {
    "id": "c1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

OpenAPI Specification for Data Pools
When you generate an OpenAPI specification for a service that uses a DataPool, the qhubctl openapi library automatically creates the correct schema for the Data Pool parameter. Instead of showing the internal structure of the DataPool class, it generates a schema that reflects the expected API input format.
For the my_dataset: DataPool parameter, the generated OpenAPI schema will look like this:
```yaml
my_dataset:
  type: object
  properties:
    id:
      type: string
      format: uuid
      description: UUID of the Data Pool to mount
    ref:
      type: string
      enum: [DATAPOOL]
      description: Reference type indicating this is a Data Pool
  required:
    - id
    - ref
  additionalProperties: false
```

This ensures that the API documentation accurately represents how to use the service and provides a clear contract for API clients.
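A client can enforce this contract before sending a request. The following is a small sketch using only the standard library; the helper function name is illustrative, not part of any qhub library.

```python
import uuid


def is_valid_datapool_ref(obj: dict) -> bool:
    """Check a Data Pool reference against the generated schema:
    exactly the keys 'id' (a UUID string) and 'ref' equal to 'DATAPOOL'."""
    if set(obj) != {"id", "ref"}:   # required: id, ref; additionalProperties: false
        return False
    if obj["ref"] != "DATAPOOL":    # enum: [DATAPOOL]
        return False
    try:
        uuid.UUID(obj["id"])        # format: uuid
    except (ValueError, TypeError, AttributeError):
        return False
    return True
```

This mirrors the schema's constraints client-side so malformed references fail fast instead of being rejected by the API.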
Data Pool Access Grants
By default, the platform checks whether the application creator has direct access to a data pool before mounting it during a service execution. Data Pool Access Grants provide an alternative: short-lived, JWT-based tokens that authorize access to a data pool without sharing it permanently with the service owner.
This is designed for cross-organization workflows where a third-party service needs to read from or write to your data pool during execution, but you don't want to grant the service owner permanent access.
How Access Grants Work
- User A owns a data pool and an application that is subscribed to a service owned by User B.
- User A requests a grant by calling the grant endpoint, specifying their application ID and the required permission level (`VIEW` or `MODIFY`).
- The platform returns a signed JWT token scoped to that specific data pool, application, and tenant. The token expires after 15 minutes by default.
- User A attaches the grant token to the volume mount reference when creating a service execution. The platform validates the token at execution time instead of checking whether the service owner has permanent access to the data pool.
Creating a Grant
To create an access grant, send a POST request to the grant endpoint for the target data pool:
Endpoint: POST /datapools/{datapoolId}/grants
Request body:
```json
{
  "applicationId": "<uuid>",
  "permission": "VIEW"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| applicationId | UUID | Yes | The ID of your application that will be used for execution. |
| permission | String | Yes | Access level: `VIEW` (read-only) or `MODIFY` (read-write). |
Response (200 OK):
```json
{
  "token": "eyJhbGciOiJIUzI1NiJ9..."
}
```

Authorization Requirements
The requesting user must have at least VIEWER role on the data pool to create a VIEW grant, or MAINTAINER role to create a MODIFY grant.
Error responses:
| Status | Condition |
|---|---|
| 400 | Invalid permission value (must be VIEW or MODIFY) |
| 403 | Insufficient permission on the data pool |
| 404 | Data pool not found |
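Putting the endpoint and fields together, a client-side helper might assemble the grant request like this. This is a sketch: the function name and the `base_url` parameter are assumptions, and the request still has to be sent with an HTTP client of your choice.

```python
def build_grant_request(base_url: str, datapool_id: str,
                        application_id: str, permission: str):
    """Assemble the POST request for a Data Pool access grant.
    Rejects invalid permission values client-side (mirrors the 400 case)."""
    if permission not in ("VIEW", "MODIFY"):
        raise ValueError("permission must be VIEW or MODIFY")
    # POST /datapools/{datapoolId}/grants
    url = f"{base_url}/datapools/{datapool_id}/grants"
    body = {"applicationId": application_id, "permission": permission}
    return url, body
```

A 200 response to this request carries the grant token in the `token` field, as shown above.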
Using a Grant in a Service Execution
When creating a service execution, include the grant field on the Data Pool reference object. Using the example from above, the request body would look like this:
```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL",
    "grant": "eyJhbGciOiJIUzI1NiJ9..."
  }
}
```

Behavior:
- If a `grant` token is present, the platform validates the token (signature, expiration, claim matching) instead of checking whether the application creator has direct permission on the data pool.
- The `writeable` flag on the mount is automatically set based on the grant's permission level: `MODIFY` results in a writeable mount, `VIEW` results in a read-only mount.
- If no `grant` is provided, the existing behavior applies: the platform checks whether the application creator has direct access to the data pool.
Token Validation
At execution time, the platform rejects the grant token if:
- The token signature is invalid or has been tampered with
- The token has expired
- The `datapoolId` in the token does not match the volume mount reference
- The `applicationId` in the token does not match the application running the execution
- The `tenantId` in the token does not match the application creator
A rejected token results in a 403 Forbidden response.
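The checks above can be sketched with the standard library. Note that this is an illustration of the validation rules, not the platform's implementation: the claim names (`datapoolId`, `applicationId`, `tenantId`, `exp`) and both helper functions are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))


def sign_grant(claims: dict, secret: bytes) -> str:
    """Create an HS256 grant token (illustrative stand-in for the platform)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"


def validate_grant(token: str, secret: bytes, datapool_id: str,
                   application_id: str, tenant_id: str) -> bool:
    """Apply the rejection rules above; False corresponds to 403 Forbidden."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return False  # not a three-part JWT
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False  # signature invalid or tampered
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) < time.time():
        return False  # token expired
    # Claim matching: data pool, application, and tenant must all line up
    return (claims.get("datapoolId") == datapool_id
            and claims.get("applicationId") == application_id
            and claims.get("tenantId") == tenant_id)
```

Constant-time comparison (`hmac.compare_digest`) matters here so that signature checks do not leak timing information.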
Grant Token Properties
| Property | Value |
|---|---|
| Algorithm | HMAC-SHA256 (HS256) |
| Expiration | 15 minutes (not configurable at the moment) |
| Scoped to | A specific data pool ID, application ID, and tenant ID |
| Permission | VIEW or MODIFY, matching what was requested at creation |
| Reusable | Yes — the token can be used multiple times until it expires |
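Because the token is reusable until it expires, a client may want to inspect its payload (without verifying the signature, which only the platform can do) to decide whether a cached grant is still usable. A minimal sketch, assuming the expiry lives in the standard `exp` claim:

```python
import base64
import json


def peek_claims(token: str) -> dict:
    """Decode the JWT payload segment without signature verification.
    Only useful client-side, e.g. to check 'exp' before reusing a grant."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(padded))
```

Never treat the decoded claims as trusted; only the platform's signature check establishes their authenticity.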
Example Workflow
The following example illustrates a typical cross-organization workflow using access grants:
1. User B owns Service "quantum-optimizer"
2. User A owns Data Pool "experiment-results"
3. User A creates Application "my-quantum-run", subscribed to User B's Service
4. User A requests a grant:
POST /datapools/{experiment-results-id}/grants
{
"applicationId": "{my-quantum-run-id}",
"permission": "VIEW"
}
5. User A receives:
{ "token": "eyJ..." }
6. User A creates a service execution, attaching the grant token:
POST /service-executions
{
"data": { ... },
"my_dataset": {
"id": "{experiment-results-id}",
"ref": "DATAPOOL",
"grant": "eyJ..."
}
}
7. The platform validates the grant and mounts User A's data pool
as read-only for User B's Service during execution
