PendingTestRun entities stuck perpetually if results-processor fails with unhandled exceptions #4813

@DanielRyanSmith

Description

A large number of runs are appearing as 'Pending' in https://wpt.fyi/api/status/pending despite no active tasks remaining in the results-arrival Cloud Task queue. This indicates that these runs have become 'orphaned' in Datastore without a corresponding processing task.

Root Cause Analysis:

  1. Status Update Desync: When the results-processor starts a task, it updates the run stage to WPTFYI_PROCESSING (700).
  2. Unhandled Exceptions: The current implementation in results-processor/processor.py only catches WPTReportError. Any other exception (e.g., GCS connectivity issues, Datastore timeouts, or unexpected Python errors) results in a 500 error returned to the Task Queue.
  3. Retry Exhaustion: The results-arrival queue has a task_age_limit of 14 days. If a task fails persistently, Cloud Tasks eventually drops it.
  4. No Terminal Status: Because the exception was unhandled, the processor never calls update_status to move the run to a terminal state (INVALID, EMPTY, VALID, etc.).
  5. Persistence: Since there is no TTL or automatic cleanup for PendingTestRun entities, they remain in the 'Pending' list forever.
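The failure mode described above can be reproduced in a minimal sketch. This is not the actual code in results-processor/processor.py: `process`, `handle_task`, the dict-based `run`, and the string stage names are hypothetical stand-ins, and the "500" is simulated by an outer wrapper playing the role of the HTTP layer.

```python
class WPTReportError(Exception):
    """Stand-in for the report-validation error the processor catches."""


def process(run, work):
    """Hypothetical task body: only WPTReportError leads to a terminal stage."""
    run["stage"] = "WPTFYI_PROCESSING"  # stage 700 in the issue
    try:
        work()
    except WPTReportError as err:
        run["stage"] = "INVALID"
        run["error"] = str(err)
        return
    run["stage"] = "VALID"


def handle_task(run, work):
    """Stand-in for the HTTP layer: an escaping exception becomes a 500."""
    try:
        process(run, work)
    except Exception:
        return 500  # the queue sees a failure and retries until it gives up
    return 200


def bad_report():
    raise WPTReportError("malformed report")


def gcs_timeout():
    raise TimeoutError("upload_raw timed out")
```

With `bad_report` the run ends in INVALID, but with `gcs_timeout` the handler returns 500 on every attempt and the run never leaves WPTFYI_PROCESSING, which matches the orphaned state once the task is dropped.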

Steps to Reproduce (Theoretical):

  1. Upload a result that causes a non-WPTReportError exception in the processor (e.g., a network timeout during upload_raw).
  2. Observe the run move to WPTFYI_PROCESSING.
  3. Allow the task to fail and retry until it reaches the queue's age limit or retry limit.
  4. The task is deleted from the queue, but the PendingTestRun remains in stage 700.

Suggested Fix:

  • Processor Level: Wrap the core logic in results-processor/main.py or processor.py in a global try...except block.
  • Failure Callback: In the event of an unhandled exception, attempt to update the run status to INVALID with a summary of the error before allowing the task to fail/retry. This ensures that even if the task is eventually dropped, the UI reflects a failed state rather than a pending one.
  • Cleanup Logic: Consider adding a cron job or a TTL mechanism to automatically mark very old PendingTestRun entities (e.g., older than 14 days) as INVALID or STALE.
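A rough sketch of the suggested fix, under the same assumptions as the rest of this report: `process_with_fallback`, `mark_stale`, the `update_status(run, stage, error)` callback signature, and the dict-based runs are all hypothetical and do not reflect the real processor or Datastore API. The first function records INVALID on any unhandled exception and then re-raises so the task still fails and retries. The second is what a cron-style sweep might do with runs older than 14 days.

```python
import datetime as dt
import logging

TERMINAL_STAGES = {"VALID", "INVALID", "EMPTY"}  # the terminal stages named above


def process_with_fallback(run, work, update_status):
    """Run `work`; on any exception, try to record INVALID, then re-raise."""
    update_status(run, "WPTFYI_PROCESSING", None)
    try:
        work()
    except Exception as err:
        summary = f"{type(err).__name__}: {err}"
        try:
            update_status(run, "INVALID", summary)
        except Exception:
            # The status backend may be what is failing; do not mask `err`.
            logging.exception("could not mark run as INVALID")
        raise  # keep the task failing so the queue can still retry it


def mark_stale(runs, now, max_age=dt.timedelta(days=14)):
    """Mark non-terminal runs older than `max_age` as INVALID; return them."""
    marked = []
    for run in runs:
        if run["stage"] not in TERMINAL_STAGES and now - run["updated"] > max_age:
            run["stage"] = "INVALID"
            run["error"] = "stale: no terminal status within the age limit"
            marked.append(run)
    return marked
```

One trade-off of this design is that a run can show INVALID briefly and then flip back to WPTFYI_PROCESSING when a retry starts. That seems acceptable, since a later successful retry would still move it to its real terminal state.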

Affected Files:

  • results-processor/main.py
  • results-processor/processor.py
  • api/pending_test_runs.go
