Using Data Pools in Services
This guide explains how to use the DataPool feature to work with datasets and file collections within your services.
What is a Data Pool?
A Data Pool is a managed collection of files, similar to a directory or a folder, that can be attached to your service at runtime. It provides a simple and efficient way to access large datasets, pre-trained models, or any other file-based resources without having to include them directly in your service's deployment package.
When you use a Data Pool, the platform mounts the specified file collection into your service's runtime environment. The qhub-commons library provides a convenient DataPool abstraction to interact with these mounted files.
Data Pool Limits
Data Pools are designed to handle large datasets, but there are some limits to keep in mind:
- The maximum size of a single file in a Data Pool is 500 MB.
- The files are mounted using a blob storage technology, which means performance may vary based on the size and number of files.
How to Use the DataPool Class
To use a Data Pool in your service, you simply need to declare a parameter of type DataPool in your run method. The runtime will automatically detect this and inject a DataPool object that corresponds to the mounted file collection.
The DataPool Object
The DataPool object, found in qhub.commons.datapool, provides the following methods to interact with the files in the mounted directory:
- `list_files() -> Dict[str, str]`: Returns a dictionary of all files in the Data Pool, where the keys are the file names and the values are their absolute paths.
- `open(file_name: str, mode: str = "r")`: Opens a specific file within the Data Pool and returns a file handle, similar to Python's built-in `open()` function.
- `path`: A property that returns the absolute path to the mounted Data Pool directory.
- `name`: A property that returns the name of the Data Pool (which corresponds to the parameter name in your `run` method).
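For local experiments and unit tests, this interface can be mirrored by a minimal stand-in backed by a plain directory. Note that this `LocalDataPool` class is an illustrative sketch, not part of `qhub-commons`; on the platform you always receive the real `DataPool` object.

```python
import os


class LocalDataPool:
    """Minimal stand-in mirroring the documented DataPool interface,
    backed by a plain local directory. For illustration and testing only."""

    def __init__(self, path: str, name: str = "my_dataset"):
        self._path = os.path.abspath(path)
        self._name = name

    @property
    def path(self) -> str:
        # Absolute path to the (simulated) mounted Data Pool directory
        return self._path

    @property
    def name(self) -> str:
        # Name of the Data Pool (the parameter name in your run method)
        return self._name

    def list_files(self) -> dict:
        # Map file names to their absolute paths, as described above
        return {
            f: os.path.join(self._path, f)
            for f in os.listdir(self._path)
            if os.path.isfile(os.path.join(self._path, f))
        }

    def open(self, file_name: str, mode: str = "r"):
        # Open a file inside the pool, like the built-in open()
        return open(os.path.join(self._path, file_name), mode)
```

Such a stand-in lets you test code that consumes the documented methods without the platform's mounting system.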
Tutorial: Building a Service with a Data Pool
Let's walk through an example of a service that reads data from a Data Pool.
1. Initialize a New Project
If you haven't already, create a new service project. You can use the CLI to set up a new service:
```bash
qhubctl init
cd [user_code]
uv venv
source .venv/bin/activate
uv sync
```

For the rest of this guide, we assume that you created your service in a directory named `user_code`, with the main code in `user_code/src/`.
2. Update the run Method
In your program.py, define a run method that accepts a DataPool parameter. The name of the parameter (e.g., my_dataset) is important, as it will be used to identify the Data Pool in the API call.
```python
# user_code/src/program.py
from qhub.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read: str


def run(data: InputData, my_dataset: DataPool) -> str:
    """
    Reads the content of a specified file from a Data Pool.
    """
    try:
        # Use the open() method to read a file from the Data Pool
        with my_dataset.open(data.file_to_read) as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"File '{data.file_to_read}' not found in the Data Pool."
```

In this example, the `run` method expects a Data Pool to be provided for the `my_dataset` parameter.
3. Local Testing with Data Pools
When developing and testing your service locally, you don't have access to the platform's Data Pool mounting system. However, you can easily simulate this by creating a local directory and passing it to your run method.
Steps for Local Testing
1. Create a local directory for your Data Pool. This directory should be placed inside the `user_code/input` directory. The name of this directory can be anything, but for this example, we'll name it `my_dataset` to match the parameter in the `run` method.
2. Populate the directory with your test files. Place any files you need for your test inside this directory (e.g., `user_code/input/my_dataset/hello.txt`), and add the value `Hello` to the `hello.txt` file.
3. Update the `__main__.py` file. Modify your main entrypoint to manually create a `DataPool` instance and pass it to the `run` function. You will create the `DataPool` object with a relative path to your local Data Pool directory.
4. Run your service. Now you can run your service directly without setting any environment variables.

```bash
# Run your service's main entrypoint
cd user_code
python -m src
```
Example
Let's assume your project has the following structure:
```
user_code
├── src/
│   ├── __main__.py
│   └── program.py
└── input/
    ├── data.json
    └── my_dataset/
        └── hello.txt
```

And `user_code/input/data.json` contains:
```json
{
  "file_to_read": "hello.txt"
}
```

Update your `user_code/src/__main__.py` to look like this:
```python
# user_code/src/__main__.py
import json
import os

from qhub.commons.constants import OUTPUT_DIRECTORY_ENV
from qhub.commons.datapool import DataPool
from qhub.commons.json import any_to_json
from qhub.commons.logging import init_logging

from .program import InputData, run

init_logging()

# This file is executed if you run `python -m src` from the project root.
# Use it to test your program locally: read the input data from the `input`
# directory and map it to the respective parameters of the `run()` function.

# Redirect the platform's output directory for local testing
directory = "./out"
os.makedirs(directory, exist_ok=True)
os.environ[OUTPUT_DIRECTORY_ENV] = directory

with open("./input/data.json") as file:
    data = InputData.model_validate(json.load(file))

result = run(data, my_dataset=DataPool("./input/my_dataset"))
print(any_to_json(result))
```

The `__main__.py` script now manually creates the `DataPool` object and passes it to your `run` function, simulating the behavior of the platform and allowing you to test your `run` method's logic with local files.
Now run the service:
```bash
python -m src
```

4. Use Multiple Data Pools
If your service needs to work with multiple Data Pools, you can simply add more parameters of type DataPool to your run method.
```python
import os

from qhub.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read_from_my_dataset: str
    file_to_read_from_another_dataset: str


def run(data: InputData, my_dataset: DataPool, another_dataset: DataPool, output_datapool: DataPool) -> str:
    """
    Combines the content of the specified files from two different Data Pools in a third Data Pool.
    """
    try:
        with my_dataset.open(data.file_to_read_from_my_dataset) as f1:
            content1 = f1.read()
        with another_dataset.open(data.file_to_read_from_another_dataset) as f2:
            content2 = f2.read()

        # You can also write to the output Data Pool if needed
        concatenated_file = os.path.join(output_datapool.path, "concatenated_output.txt")
        with open(concatenated_file, "w") as out_file:
            out_file.write(content1)
            out_file.write(content2)

        return f"Content from my_dataset: {content1}\nContent from another_dataset: {content2}"
    except FileNotFoundError as e:
        return str(e)
```

For local development, you would create two directories in your input folder (e.g., `my_dataset` and `another_dataset`) and pass them as separate parameters in the `__main__.py`:
```python
# user_code/src/__main__.py
# ... same as above ...
result = run(
    data,
    my_dataset=DataPool("./input/my_dataset"),
    another_dataset=DataPool("./input/another_dataset"),
    output_datapool=DataPool("./input/output_datapool"),
)
# ... same as above ...
```

Create the `another_dataset` and `output_datapool` directories in your input folder. Then, create the file `input/another_dataset/world.txt` with the content `World`.
Before running the service, update the data in input/data.json to include the new file names:
```json
{
  "file_to_read_from_my_dataset": "hello.txt",
  "file_to_read_from_another_dataset": "world.txt"
}
```

After running the service, you should see the concatenated output file in the `input/output_datapool` directory.
```bash
python -m src
```

5. Configuring the Data Pool in the API Call
When you execute this service via the platform API, you need to specify which Data Pool to mount. This is done by providing a special JSON object in the request body. The key of this object must match the DataPool parameter name in your run method (my_dataset in our example).
The JSON object has two fields:
- `id`: The unique identifier (UUID) of the Data Pool you want to use.
- `ref`: A static value that must be `"DATAPOOL"`.
Here is an example of a request body for our first version of the service:
```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

When the service is executed with this input, the platform will:
- Identify that the `my_dataset` parameter is a Data Pool reference.
- Mount the Data Pool with the specified `id`.
- Instantiate a `DataPool` object pointing to the mounted directory.
- Inject this `DataPool` object into the `run` method as the `my_dataset` argument.
Your code can then use the my_dataset object to interact with the files in the mounted Data Pool.
Here is an example of a request body for our second version of the service:
```json
{
  "data": {
    "file_to_read_from_my_dataset": "hello.txt",
    "file_to_read_from_another_dataset": "world.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "another_dataset": {
    "id": "b1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "output_datapool": {
    "id": "c1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

OpenAPI Specification for Data Pools
When you generate an OpenAPI specification for a service that uses a DataPool, the qhubctl openapi library automatically creates the correct schema for the Data Pool parameter. Instead of showing the internal structure of the DataPool class, it generates a schema that reflects the expected API input format.
For the my_dataset: DataPool parameter, the generated OpenAPI schema will look like this:
```yaml
my_dataset:
  type: object
  properties:
    id:
      type: string
      format: uuid
      description: UUID of the Data Pool to mount
    ref:
      type: string
      enum: [DATAPOOL]
      description: Reference type indicating this is a Data Pool
  required:
    - id
    - ref
  additionalProperties: false
```

This ensures that the API documentation accurately represents how to use the service and provides a clear contract for API clients.
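A client can enforce this contract before sending a request. The following is a small sketch using only the standard library; the helper function name is illustrative, not part of any qhub library.

```python
import uuid


def is_valid_datapool_ref(obj: dict) -> bool:
    """Check a Data Pool reference against the generated schema:
    exactly the keys 'id' (a UUID string) and 'ref' equal to 'DATAPOOL'."""
    if set(obj) != {"id", "ref"}:   # required: id, ref; additionalProperties: false
        return False
    if obj["ref"] != "DATAPOOL":    # enum: [DATAPOOL]
        return False
    try:
        uuid.UUID(obj["id"])        # format: uuid
    except (ValueError, TypeError, AttributeError):
        return False
    return True
```

This mirrors the schema's constraints client-side so malformed references fail fast instead of being rejected by the API.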
Data Pool Access Grants
By default, the platform checks whether the application creator has direct access to a data pool before mounting it during a service execution. Data Pool Access Grants provide an alternative: short-lived, JWT-based tokens that authorize access to a data pool without sharing it permanently with the service owner.
This is designed for cross-organization workflows where a third-party service needs to read from or write to your data pool during execution, but you don't want to grant the service owner permanent access.
How Access Grants Work
- User A owns a data pool and an application that is subscribed to a service owned by User B.
- User A requests a grant by calling the grant endpoint, specifying their application ID and the required permission level (`VIEW` or `MODIFY`).
- The platform returns a signed JWT token scoped to that specific data pool, application, and tenant. The token expires after 15 minutes by default.
- User A attaches the grant token to the volume mount reference when creating a service execution. The platform validates the token at execution time instead of checking whether the service owner has permanent access to the data pool.
Creating a Grant
To create an access grant, send a POST request to the grant endpoint for the target data pool:
Endpoint: POST /datapools/{datapoolId}/grants
Request body:
```json
{
  "applicationId": "<uuid>",
  "permission": "VIEW"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| applicationId | UUID | Yes | The ID of your application that will be used for execution. |
| permission | String | Yes | Access level: `VIEW` (read-only) or `MODIFY` (read-write). |
Response (200 OK):
```json
{
  "token": "eyJhbGciOiJIUzI1NiJ9..."
}
```

Authorization Requirements
The requesting user must have at least VIEWER role on the data pool to create a VIEW grant, or MAINTAINER role to create a MODIFY grant.
Error responses:
| Status | Condition |
|---|---|
| 400 | Invalid permission value (must be VIEW or MODIFY) |
| 403 | Insufficient permission on the data pool |
| 404 | Data pool not found |
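Putting the endpoint and fields together, a client-side helper might assemble the grant request like this. This is a sketch: the function name and the `base_url` parameter are assumptions, and the request still has to be sent with an HTTP client of your choice.

```python
def build_grant_request(base_url: str, datapool_id: str,
                        application_id: str, permission: str):
    """Assemble the POST request for a Data Pool access grant.
    Rejects invalid permission values client-side (mirrors the 400 case)."""
    if permission not in ("VIEW", "MODIFY"):
        raise ValueError("permission must be VIEW or MODIFY")
    # POST /datapools/{datapoolId}/grants
    url = f"{base_url}/datapools/{datapool_id}/grants"
    body = {"applicationId": application_id, "permission": permission}
    return url, body
```

A 200 response to this request carries the grant token in the `token` field, as shown above.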
Using a Grant in a Service Execution
When creating a service execution, include the grant field on the Data Pool reference object. Using the example from above, the request body would look like this:
```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL",
    "grant": "eyJhbGciOiJIUzI1NiJ9..."
  }
}
```

Behavior:
- If a `grant` token is present, the platform validates the token (signature, expiration, claim matching) instead of checking whether the application creator has direct permission on the data pool.
- The `writeable` flag on the mount is automatically set based on the grant's permission level: `MODIFY` results in a writeable mount, `VIEW` results in a read-only mount.
- If no `grant` is provided, the existing behavior applies: the platform checks whether the application creator has direct access to the data pool.
Token Validation
At execution time, the platform rejects the grant token if:
- The token signature is invalid or has been tampered with
- The token has expired
- The `datapoolId` in the token does not match the volume mount reference
- The `applicationId` in the token does not match the application running the execution
- The `tenantId` in the token does not match the application creator
A rejected token results in a 403 Forbidden response.
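The checks above can be sketched with the standard library. Note that this is an illustration of the validation rules, not the platform's implementation: the claim names (`datapoolId`, `applicationId`, `tenantId`, `exp`) and both helper functions are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))


def sign_grant(claims: dict, secret: bytes) -> str:
    """Create an HS256 grant token (illustrative stand-in for the platform)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"


def validate_grant(token: str, secret: bytes, datapool_id: str,
                   application_id: str, tenant_id: str) -> bool:
    """Apply the rejection rules above; False corresponds to 403 Forbidden."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return False  # not a three-part JWT
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False  # signature invalid or tampered
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) < time.time():
        return False  # token expired
    # Claim matching: data pool, application, and tenant must all line up
    return (claims.get("datapoolId") == datapool_id
            and claims.get("applicationId") == application_id
            and claims.get("tenantId") == tenant_id)
```

Constant-time comparison (`hmac.compare_digest`) matters here so that signature checks do not leak timing information.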
Grant Token Properties
| Property | Value |
|---|---|
| Algorithm | HMAC-SHA256 (HS256) |
| Expiration | 15 minutes (not configurable at the moment) |
| Scoped to | A specific data pool ID, application ID, and tenant ID |
| Permission | VIEW or MODIFY, matching what was requested at creation |
| Reusable | Yes — the token can be used multiple times until it expires |
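Because the token is reusable until it expires, a client may want to inspect its payload (without verifying the signature, which only the platform can do) to decide whether a cached grant is still usable. A minimal sketch, assuming the expiry lives in the standard `exp` claim:

```python
import base64
import json


def peek_claims(token: str) -> dict:
    """Decode the JWT payload segment without signature verification.
    Only useful client-side, e.g. to check 'exp' before reusing a grant."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(padded))
```

Never treat the decoded claims as trusted; only the platform's signature check establishes their authenticity.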
Example Workflow
The following example illustrates a typical cross-organization workflow using access grants:
1. User B owns Service "quantum-optimizer"
2. User A owns Data Pool "experiment-results"
3. User A creates Application "my-quantum-run", subscribed to User B's Service
4. User A requests a grant:
POST /datapools/{experiment-results-id}/grants
{
"applicationId": "{my-quantum-run-id}",
"permission": "VIEW"
}
5. User A receives:
{ "token": "eyJ..." }
6. User A creates a service execution, attaching the grant token:
POST /service-executions
{
"data": { ... },
"my_dataset": {
"id": "{experiment-results-id}",
"ref": "DATAPOOL",
"grant": "eyJ..."
}
}
7. The platform validates the grant and mounts User A's data pool
as read-only for User B's Service during execution
