---
url: /services/managed/datapool.md
description: >-
  Attach Data Pools to Managed Services and use the qhub-commons DataPool class
  to read mounted files at runtime.
---

# Using Data Pools in Services

This guide explains how to use the `DataPool` feature to work with datasets and file collections within your services.

## What is a Data Pool?

A Data Pool is a managed collection of files, similar to a directory or a folder, that can be attached to your service at runtime.
It provides a simple and efficient way to access large datasets, pre-trained models, or any other file-based resources without having to include them directly in your service's deployment package.

When you use a Data Pool, the platform mounts the specified file collection into your service's runtime environment.
The `qhub-commons` library provides a convenient `DataPool` abstraction to interact with these mounted files.

::: tip Data Pool Limits

Data Pools are designed to handle large datasets, but there are some limits to keep in mind:

* The maximum size of a single file in a Data Pool is 500 MB.
* Files are mounted via blob storage, so access performance may vary with the size and number of files.

:::
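As a sanity check before uploading, you can scan a local directory against the 500 MB per-file limit. The following is a minimal sketch using only the standard library; the directory path is a placeholder you would supply.

```python
from pathlib import Path

MAX_FILE_SIZE = 500 * 1024 * 1024  # 500 MB per-file limit


def files_over_limit(directory: str) -> list[str]:
    """Return relative paths of files that exceed the per-file size limit."""
    root = Path(directory)
    return [
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and p.stat().st_size > MAX_FILE_SIZE
    ]
```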

## How to Use the `DataPool` Class

To use a Data Pool in your service, you simply need to declare a parameter of type `DataPool` in your `run` method.
The runtime will automatically detect this and inject a `DataPool` object that corresponds to the mounted file collection.

### The `DataPool` Object

The `DataPool` object, found in `qhub.commons.datapool`, provides the following methods to interact with the files in the mounted directory:

* `list_files() -> Dict[str, str]`: Returns a dictionary of all files in the Data Pool, where the keys are the file names and the values are their absolute paths.
* `open(file_name: str, mode: str = "r")`: Opens a specific file within the Data Pool and returns a file handle, similar to Python's built-in
  `open()` function.
* `path`: A property that returns the absolute path to the mounted Data Pool directory.
* `name`: A property that returns the name of the Data Pool (which corresponds to the parameter name in your `run` method).
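For intuition, the interface above can be mimicked with a small local stand-in built on `os`. This is a sketch for experimentation only, not the actual `qhub.commons.datapool` implementation:

```python
import os


class LocalDataPool:
    """Minimal local stand-in mirroring the documented DataPool interface."""

    def __init__(self, path: str):
        self._path = os.path.abspath(path)

    @property
    def path(self) -> str:
        # Absolute path to the mounted directory
        return self._path

    @property
    def name(self) -> str:
        # Directory name, analogous to the run() parameter name
        return os.path.basename(self._path)

    def list_files(self) -> dict:
        # File name -> absolute path, as documented for list_files()
        return {
            entry: os.path.join(self._path, entry)
            for entry in os.listdir(self._path)
            if os.path.isfile(os.path.join(self._path, entry))
        }

    def open(self, file_name: str, mode: str = "r"):
        # Open a file inside the pool, like Python's built-in open()
        return open(os.path.join(self._path, file_name), mode)
```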

### Tutorial: Building a Service with a Data Pool

Let's walk through an example of a service that reads data from a Data Pool.

#### 1. Initialize a New Project

If you haven't already, create a new service project.
You can use the CLI to set up a new service:

```bash
qhubctl init

cd user_code

uv venv
source .venv/bin/activate
uv sync
```

For the rest of this guide, we assume that you created your service in a directory named `user_code`, with the main code in `user_code/src/`.

#### 2. Update the `run` Method

In your `program.py`, define a `run` method that accepts a `DataPool` parameter.
The name of the parameter (e.g., `my_dataset`) is important, as it will be used to identify the Data Pool in the API call.

```python
# user_code/src/program.py

from qhub.commons.datapool import DataPool
from pydantic import BaseModel

class InputData(BaseModel):
    file_to_read: str

def run(data: InputData, my_dataset: DataPool) -> str:
    """
    Reads the content of a specified file from a Data Pool.
    """
    try:
        # Use the open() method to read a file from the Data Pool
        with my_dataset.open(data.file_to_read) as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"File '{data.file_to_read}' not found in the Data Pool."
```

In this example, the `run` method expects a Data Pool to be provided for the `my_dataset` parameter.

#### 3. Local Testing with Data Pools

When developing and testing your service locally, you don't have access to the platform's Data Pool mounting system.
However, you can easily simulate this by creating a local directory and passing it to your `run` method.

##### Steps for Local Testing

1. **Create a local directory for your Data Pool.**
   This directory should be placed inside the `user_code/input` directory.
   The name of this directory can be anything, but for this example, we'll name it `my_dataset` to match the parameter in the `run` method.

2. **Populate the directory with your test files.**
   Place any files you need for your test inside this directory.
   For this example, create `user_code/input/my_dataset/hello.txt` with the content `Hello`.

3. **Update the `__main__.py` file.**
   Modify your main entrypoint to manually create a `DataPool` instance and pass it to the `run` function.
   You will create the `DataPool` object with a relative path to your local Data Pool directory.

4. **Run your service.**
   Now you can run your service directly without setting any environment variables.

   ```bash
   # Run your service's main entrypoint
   cd user_code
   python -m src
   ```

##### Example

Let's assume your project has the following structure:

```
user_code
├── src/
│   ├── __main__.py
│   └── program.py
└── input/
    ├── data.json
    └── my_dataset/
        └── hello.txt
```

And `user_code/input/data.json` contains:

```json
{
  "file_to_read": "hello.txt"
}
```

Update your `user_code/src/__main__.py` to look like this:

```python
# user_code/src/__main__.py
import json
import os

from qhub.commons.constants import OUTPUT_DIRECTORY_ENV
from qhub.commons.datapool import DataPool
from qhub.commons.json import any_to_json
from qhub.commons.logging import init_logging

from .program import InputData, run

init_logging()

# This file is executed if you run `python -m src` from the project root. Use this file to test your program locally.
# You can read the input data from the `input` directory and map it to the respective parameter of the `run()` function.

# Redirect the platform's output directory for local testing
directory = "./out"
os.makedirs(directory, exist_ok=True)
os.environ[OUTPUT_DIRECTORY_ENV] = directory

with open("./input/data.json") as file:
    data = InputData.model_validate(json.load(file))

result = run(data, my_dataset=DataPool("./input/my_dataset"))

print(any_to_json(result))
```

The `__main__.py` script now manually creates the `DataPool` object and passes it to your
`run` function, simulating the behavior of the platform and allowing you to test your `run` method's logic with local files.

Now run the service:

```bash
python -m src
```

#### 4. Use Multiple Data Pools

If your service needs to work with multiple Data Pools, you can simply add more parameters of type `DataPool` to your `run` method.

```python
import os

from qhub.commons.datapool import DataPool
from pydantic import BaseModel

class InputData(BaseModel):
    file_to_read_from_my_dataset: str
    file_to_read_from_another_dataset: str

def run(data: InputData, my_dataset: DataPool, another_dataset: DataPool, output_datapool: DataPool) -> str:
    """
    Combines the content of the specified files from two Data Pools
    and writes the result to a third Data Pool.
    """
    try:
        with my_dataset.open(data.file_to_read_from_my_dataset) as f1:
            content1 = f1.read()
        
        with another_dataset.open(data.file_to_read_from_another_dataset) as f2:
            content2 = f2.read()
        
        # You can also write to the output Data Pool if needed
        concatenated_file = os.path.join(output_datapool.path, "concatenated_output.txt")
        with open(concatenated_file, "w") as out_file:
            out_file.write(content1)
            out_file.write(content2)
        
        return f"Content from my_dataset: {content1}\nContent from another_dataset: {content2}"
    except FileNotFoundError as e:
        return str(e)
```

For local development, you would create two directories in your `input` folder (e.g., `my_dataset` and
`another_dataset`) and pass them as separate parameters in the `__main__.py`:

```python
# user_code/src/__main__.py

# ... same as above ...

result = run(
    data,
    my_dataset=DataPool("./input/my_dataset"),
    another_dataset=DataPool("./input/another_dataset"),
    output_datapool=DataPool("./input/output_datapool"),
)

# ... same as above ...
```

Create the `another_dataset` and `output_datapool` directories in your `input` folder.
Then, create the file `input/another_dataset/world.txt` with the content `World`.

Before running the service, update the data in `input/data.json` to include the new file names:

```json
{
  "file_to_read_from_my_dataset": "hello.txt",
  "file_to_read_from_another_dataset": "world.txt"
}
```

Run the service again:

```bash
python -m src
```

Afterwards, you should find the concatenated output file `concatenated_output.txt` in the `input/output_datapool` directory.

#### 5. Configuring the Data Pool in the API Call

When you execute this service via the platform API, you need to specify which Data Pool to mount.
This is done by providing a special JSON object in the request body.
The key of this object must match the `DataPool` parameter name in your `run` method (`my_dataset` in our example).

The JSON object has two fields:

* `id`: The unique identifier (UUID) of the Data Pool you want to use.
* `ref`: A static value that must be `"DATAPOOL"`.

Here is an example of a request body for our first version of the service:

```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```
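Assembling such a request body programmatically is straightforward; the sketch below uses the placeholder UUID from the example above and only builds the JSON payload (how you send it depends on your HTTP client):

```python
import json


def datapool_ref(datapool_id: str) -> dict:
    """Build a Data Pool reference object in the format the API expects."""
    return {"id": datapool_id, "ref": "DATAPOOL"}


body = {
    "data": {"file_to_read": "hello.txt"},
    "my_dataset": datapool_ref("a1b2c3d4-e5f6-7890-1234-567890abcdef"),
}
payload = json.dumps(body)
```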

When the service is executed with this input, the platform will:

1. Identify that the `my_dataset` parameter is a Data Pool reference.
2. Mount the Data Pool with the specified `id`.
3. Instantiate a `DataPool` object pointing to the mounted directory.
4. Inject this `DataPool` object into the `run` method as the `my_dataset` argument.

Your code can then use the `my_dataset` object to interact with the files in the mounted Data Pool.

Here is an example of a request body for our second version of the service:

```json
{
  "data": {
    "file_to_read_from_my_dataset": "hello.txt",
    "file_to_read_from_another_dataset": "world.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "another_dataset": {
    "id": "b1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  },
  "output_datapool": {
    "id": "c1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```

## OpenAPI Specification for Data Pools

When you generate an OpenAPI specification for a service that uses a `DataPool`, the
`qhubctl openapi` command automatically generates the correct schema for the Data Pool parameter.
Instead of showing the internal structure of the `DataPool` class, it generates a schema that reflects the expected API input format.

For the `my_dataset: DataPool` parameter, the generated OpenAPI schema will look like this:

```yaml
my_dataset:
  type: object
  properties:
    id:
      type: string
      format: uuid
      description: UUID of the Data Pool to mount
    ref:
      type: string
      enum: [DATAPOOL]
      description: Reference type indicating this is a Data Pool
  required:
    - id
    - ref
  additionalProperties: false
```
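A client-side check matching this schema can be sketched with the standard library; the schema itself is what the generated specification enforces on the server side:

```python
import uuid


def is_valid_datapool_ref(obj: dict) -> bool:
    """Check an object against the documented Data Pool schema:
    exactly the keys 'id' (a UUID string) and 'ref' (the literal 'DATAPOOL')."""
    if set(obj) != {"id", "ref"}:  # required fields + additionalProperties: false
        return False
    if obj["ref"] != "DATAPOOL":  # enum: [DATAPOOL]
        return False
    try:
        uuid.UUID(obj["id"])  # format: uuid
    except (ValueError, TypeError, AttributeError):
        return False
    return True
```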

This ensures that the API documentation accurately represents how to use the service and provides a clear contract for API clients.

## Data Pool Access Grants

By default, the platform checks whether the application creator has direct access to a data pool before mounting it during a service execution.
**Data Pool Access Grants** provide an alternative: short-lived, JWT-based tokens that authorize access to a data pool without sharing it permanently with the service owner.

This is designed for **cross-organization workflows** where a third-party service needs to read from or write to your data pool during execution, but you don't want to grant the service owner permanent access.

### How Access Grants Work

1. **User A** owns a data pool and an application that is subscribed to a service owned by **User B**.
2. **User A** requests a grant by calling the grant endpoint, specifying their application ID and the required permission level (`VIEW` or `MODIFY`).
3. The platform returns a **signed JWT token** scoped to that specific data pool, application, and tenant. The token expires after 15 minutes by default.
4. **User A** attaches the grant token to the volume mount reference when creating a service execution.
   The platform validates the token at execution time instead of checking whether the service owner has permanent access to the data pool.

### Creating a Grant

To create an access grant, send a `POST` request to the grant endpoint for the target data pool:

**Endpoint:** `POST /datapools/{datapoolId}/grants`

**Request body:**

```json
{
  "applicationId": "<uuid>",
  "permission": "VIEW"
}
```

| Field           | Type   | Required | Description                                                 |
|:----------------|:-------|:---------|:------------------------------------------------------------|
| `applicationId` | UUID   | Yes      | The ID of your application that will be used for execution. |
| `permission`    | String | Yes      | Access level: `VIEW` (read-only) or `MODIFY` (read-write).  |

**Response (200 OK):**

```json
{
  "token": "eyJhbGciOiJIUzI1NiJ9..."
}
```

::: tip Authorization Requirements
The requesting user must have at least `VIEWER` role on the data pool to create a `VIEW` grant, or `MAINTAINER` role to create a `MODIFY` grant.
:::

**Error responses:**

| Status | Condition                                             |
|:-------|:------------------------------------------------------|
| 400    | Invalid permission value (must be `VIEW` or `MODIFY`) |
| 403    | Insufficient permission on the data pool              |
| 404    | Data pool not found                                   |
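A helper that issues this request could look as follows. The base URL and the Bearer authorization scheme are assumptions about your deployment; the `opener` parameter is a hypothetical injection point so the function can be exercised without a live server:

```python
import json
import urllib.request


def create_grant(base_url: str, datapool_id: str, application_id: str,
                 permission: str, auth_token: str,
                 opener=urllib.request.urlopen) -> str:
    """POST /datapools/{datapoolId}/grants and return the grant token.
    base_url and the Bearer auth scheme are assumptions; adjust to your setup."""
    if permission not in ("VIEW", "MODIFY"):
        raise ValueError("permission must be VIEW or MODIFY")
    req = urllib.request.Request(
        f"{base_url}/datapools/{datapool_id}/grants",
        data=json.dumps({"applicationId": application_id,
                         "permission": permission}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {auth_token}"},
        method="POST",
    )
    with opener(req) as resp:
        return json.load(resp)["token"]
```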

### Using a Grant in a Service Execution

When creating a service execution, include the `grant` field on the Data Pool reference object.
Using the example from above, the request body would look like this:

```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL",
    "grant": "eyJhbGciOiJIUzI1NiJ9..."
  }
}
```

**Behavior:**

* If a `grant` token is present, the platform validates the token (signature, expiration, claim matching) instead of checking whether the application creator has direct permission on the data pool.
* The `writeable` flag on the mount is automatically set based on the grant's permission level: `MODIFY` results in a writeable mount, `VIEW` results in a read-only mount.
* If no `grant` is provided, the existing behavior applies: the platform checks whether the application creator has direct access to the data pool.

### Token Validation

At execution time, the platform rejects the grant token if:

* The token signature is invalid or has been tampered with
* The token has expired
* The `datapoolId` in the token does not match the volume mount reference
* The `applicationId` in the token does not match the application running the execution
* The `tenantId` in the token does not match the application creator

A rejected token results in a `403 Forbidden` response.
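These checks amount to standard HS256 JWT validation plus claim matching. The sketch below uses only the standard library; the claim names (`datapoolId`, `applicationId`, `tenantId`, `exp`) follow the fields listed above, but the platform's exact token layout is internal and may differ:

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(part: str) -> bytes:
    # Restore stripped base64url padding before decoding
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def validate_grant(token: str, secret: bytes, datapool_id: str,
                   application_id: str, tenant_id: str) -> dict:
    """Validate an HS256 grant token; raise PermissionError on any failure."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("invalid signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise PermissionError("token expired")
    for key, value in (("datapoolId", datapool_id),
                       ("applicationId", application_id),
                       ("tenantId", tenant_id)):
        if claims.get(key) != value:
            raise PermissionError(f"claim mismatch: {key}")
    return claims
```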

### Grant Token Properties

| Property   | Value                                                       |
|:-----------|:------------------------------------------------------------|
| Algorithm  | HMAC-SHA256 (HS256)                                         |
| Expiration | 15 minutes (not configurable at the moment)                 |
| Scoped to  | A specific data pool ID, application ID, and tenant ID      |
| Permission | `VIEW` or `MODIFY`, matching what was requested at creation |
| Reusable   | Yes — the token can be used multiple times until it expires |

### Example Workflow

The following example illustrates a typical cross-organization workflow using access grants:

```
1. User B owns Service "quantum-optimizer"
2. User A owns Data Pool "experiment-results"
3. User A creates Application "my-quantum-run", subscribed to User B's Service

4. User A requests a grant:
     POST /datapools/{experiment-results-id}/grants
     {
       "applicationId": "{my-quantum-run-id}",
       "permission": "VIEW"
     }

5. User A receives:
     { "token": "eyJ..." }

6. User A creates a service execution, attaching the grant token:
     POST /service-executions
     {
       "data": { ... },
       "my_dataset": {
         "id": "{experiment-results-id}",
         "ref": "DATAPOOL",
         "grant": "eyJ..."
       }
     }

7. The platform validates the grant and mounts User A's data pool
   as read-only for User B's Service during execution
```
