File Cleaning and Archiving

This module provides a utility for archiving labeled data and cleaning the workspace after each labeling cycle. It ensures that the project directory remains organized while preserving important folders such as label_studio.


Function: clean_pipeline_workspace

clean_pipeline_workspace(data_pipeline_dir: Path, master_dataset_dir: Path)

Description

Archives labeled results and cleans the data pipeline workspace.

  • Copies .json label files from data_pipeline/labeled/ to a timestamped archive under master_dataset/.

  • Copies matching image files from data_pipeline/input/ into the same archive (based on file stem).

  • Empties every folder in data_pipeline/ except label_studio, which keeps all of its contents.

Arguments

Name                Type  Description
data_pipeline_dir   Path  Path to the root of the data_pipeline/ directory
master_dataset_dir  Path  Path to the master_dataset/ directory for saving archives


Folder Structure After Cleaning

After cleaning, only label_studio/ keeps its contents. Every other folder remains in place but is emptied:

data_pipeline/
├── input/           # Emptied
├── labeled/         # Emptied
├── label_studio/    # Preserved with all contents
└── ...              # All other folders are emptied
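
A minimal sketch of this cleaning step, not the module's actual code. The helper name empty_workspace, its keep parameter, and its return value are hypothetical. It assumes that "emptied" means each folder stays in place while its contents are deleted, and that files sitting directly in data_pipeline/ are left untouched.

```python
import shutil
from pathlib import Path
from typing import List


def empty_workspace(data_pipeline_dir: Path, keep: str = "label_studio") -> List[str]:
    """Empty every subfolder of data_pipeline_dir except `keep`.

    Hypothetical helper for illustration; returns the names of the
    folders that were emptied.
    """
    emptied = []
    for folder in sorted(data_pipeline_dir.iterdir()):
        # Assumption: only top-level folders are cleaned, files are skipped.
        if not folder.is_dir() or folder.name == keep:
            continue
        for entry in folder.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # remove nested folders recursively
            else:
                entry.unlink()  # remove files and symlinks
        emptied.append(folder.name)
    return emptied
```

Deleting the contents rather than the folders themselves keeps input/ and labeled/ ready for the next labeling cycle.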

Archive Output Example

A successful run creates an archive folder like this:

master_dataset/
└── labeled_20250624_140501/
    ├── labels/
    │   ├── image1.json
    │   └── image2.json
    └── images/
        ├── image1.png
        └── image2.jpg
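
The layout above could be produced along these lines. This is a sketch under assumptions: the helper name archive_labels and its now parameter are hypothetical, and the timestamp format %Y%m%d_%H%M%S is inferred from the example folder name rather than confirmed by the module. Copying the matching images is left out here.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_labels(labeled_dir: Path, master_dataset_dir: Path,
                   now: Optional[datetime] = None) -> Path:
    """Copy *.json label files into a new timestamped archive folder.

    Hypothetical helper for illustration; `now` exists only so the
    folder name can be fixed in tests.
    """
    # Assumed naming scheme, e.g. labeled_20250624_140501
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    archive_dir = master_dataset_dir / f"labeled_{stamp}"
    labels_dir = archive_dir / "labels"
    images_dir = archive_dir / "images"

    # exist_ok=False: refuse to write into an archive that already exists.
    labels_dir.mkdir(parents=True, exist_ok=False)
    images_dir.mkdir()

    for label_file in sorted(labeled_dir.glob("*.json")):
        shutil.copy2(label_file, labels_dir / label_file.name)
    return archive_dir
```

Because the labels are copied rather than moved, the originals stay in data_pipeline/labeled/ until the cleaning step runs.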

Notes

  • Matching images are copied based on the label file’s stem (e.g., image1.json matches image1.png, image1.jpg, etc.).

  • If a matching image is not found, a warning is printed.

  • Each archive keeps the same labels/ and images/ folder structure, so it can be used for future uploads to Label Studio or for audit tracking.
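
The stem-based matching and the warning described in these notes can be sketched as follows. The helper name find_matching_images is hypothetical, as is the exact warning text. The sketch matches on the stem alone and does not filter by extension, since the source does not list which image extensions are accepted.

```python
from pathlib import Path
from typing import List


def find_matching_images(label_file: Path, input_dir: Path) -> List[Path]:
    """Return files in input_dir whose stem equals the label file's stem.

    Hypothetical helper for illustration; prints a warning when no
    match is found.
    """
    matches = sorted(
        candidate
        for candidate in input_dir.iterdir()
        if candidate.is_file() and candidate.stem == label_file.stem
    )
    if not matches:
        # Assumed wording; the module's actual message may differ.
        print(f"Warning: no matching image found for {label_file.name}")
    return matches
```

Returning a list rather than a single path covers the case where the same stem exists with more than one extension, such as image1.png and image1.jpg.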