File Cleaning and Archiving

This page outlines the unit test coverage for the clean_pipeline.py module, which handles archiving labeled data and cleaning up the data pipeline workspace.


Coverage Overview

The tests confirm that the cleaning process:

  • Archives each labeled JSON file and its associated image into a timestamped folder under the master_dataset/ directory

  • Empties every pipeline folder except label_studio/

  • Preserves the folder structure by keeping directories but removing their contents
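
The three behaviors above can be sketched in a few lines of Python. This is a minimal illustration, not the code in clean_pipeline.py: the function name archive_and_clean, its parameters, the timestamp format, the image suffix list, and the rule that an image matches a label by sharing its file stem are all assumptions made for the example.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Assumption: which file suffixes count as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def archive_and_clean(pipeline_dir: Path, master_dir: Path,
                      keep: tuple = ("label_studio",)) -> Path:
    """Hypothetical sketch of the archive-then-clean behavior."""
    # Assumption: archive folders are named by a sortable timestamp.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    archive = master_dir / stamp
    labels_out = archive / "labels"
    images_out = archive / "images"
    labels_out.mkdir(parents=True)
    images_out.mkdir(parents=True)

    # Copy each label, plus any image in input/ that shares its stem.
    for label in sorted((pipeline_dir / "labeled").glob("*.json")):
        shutil.copy2(label, labels_out / label.name)
        for image in (pipeline_dir / "input").glob(label.stem + ".*"):
            if image.suffix.lower() in IMAGE_SUFFIXES:
                shutil.copy2(image, images_out / image.name)

    # Empty every top-level folder except the preserved ones,
    # keeping the folders themselves in place.
    for folder in list(pipeline_dir.iterdir()):
        if not folder.is_dir() or folder.name in keep:
            continue
        for child in list(folder.iterdir()):
            if child.is_dir():
                shutil.rmtree(child)
            else:
                child.unlink()
    return archive
```

Archiving before cleaning means a failure during the copy step leaves the workspace untouched, which is one reason to order the steps this way.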


Fixtures

setup_clean_test_dirs

Creates a temporary mock workspace with:

  • A data_pipeline/ structure containing:

    • labeled/: with a sample JSON label

    • input/: with a matching image

    • label_studio/: preserved during cleanup

  • A master_dataset/ directory to hold archived results
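
A workspace like the one described above could be built with a small helper. This is a sketch under assumptions: the helper name build_clean_test_dirs and the file names sample.json, sample.jpg, and keep.txt are illustrative and do not come from the actual fixture.

```python
import json
from pathlib import Path


def build_clean_test_dirs(root: Path) -> tuple[Path, Path]:
    """Create a mock workspace under root (all file names are hypothetical)."""
    pipeline = root / "data_pipeline"
    master = root / "master_dataset"
    for name in ("labeled", "input", "label_studio"):
        (pipeline / name).mkdir(parents=True)
    master.mkdir()

    # A sample label and an image that shares its file stem.
    label = {"label": "example"}
    (pipeline / "labeled" / "sample.json").write_text(json.dumps(label))
    (pipeline / "input" / "sample.jpg").write_bytes(b"placeholder image bytes")

    # A marker file: if it survives cleanup, label_studio/ was preserved.
    (pipeline / "label_studio" / "keep.txt").write_text("keep me")
    return pipeline, master
```

If the test suite uses pytest, a fixture could wrap this helper by passing it the built-in tmp_path fixture and returning the two paths, so each test gets a fresh directory tree.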


Function Tests

test_clean_pipeline_creates_archive_and_cleans

  • Verifies the creation of a timestamped archive folder containing:

    • labels/: holds the archived .json label file

    • images/: holds the image file associated with that label

  • Confirms that:

    • label_studio/ folder and its contents remain untouched

    • All other folders (labeled/, input/, etc.) are retained but emptied
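
The checks listed above can be expressed as plain assertions over the directory tree. The helper below is a hedged sketch: its name check_cleanup_result is hypothetical, and it assumes that a single cleanup run produces exactly one archive folder directly under master_dataset/.

```python
from pathlib import Path


def check_cleanup_result(pipeline: Path, master: Path,
                         preserved: str = "label_studio") -> None:
    """Assert the post-cleanup state (hypothetical helper)."""
    # Assumption: one run yields exactly one timestamped archive folder.
    archives = [p for p in master.iterdir() if p.is_dir()]
    assert len(archives) == 1, "expected exactly one archive folder"
    archive = archives[0]

    # The archive holds at least one label and one image.
    assert any((archive / "labels").glob("*.json")), "no label archived"
    assert any((archive / "images").iterdir()), "no image archived"

    # Every pipeline folder still exists; only the preserved one has contents.
    for folder in pipeline.iterdir():
        if not folder.is_dir():
            continue
        has_contents = any(folder.iterdir())
        if folder.name == preserved:
            assert has_contents, f"{folder.name}/ should be untouched"
        else:
            assert not has_contents, f"{folder.name}/ should be empty"
```

Checking only that the archive folder exists, rather than matching its exact name, keeps the test independent of the timestamp format.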


Summary

This test confirms that cleanup resets the workspace as intended: labeled content is backed up to the master_dataset/ archive, label_studio/ is left untouched, and every other pipeline folder is kept but emptied.