Open
Description
A large number of runs are appearing as 'Pending' in https://wpt.fyi/api/status/pending despite no active tasks remaining in the results-arrival Cloud Task queue. This indicates that these runs have become 'orphaned' in Datastore without a corresponding processing task.
Root Cause Analysis:
- Status Update Desync: When the results-processor starts a task, it updates the run stage to WPTFYI_PROCESSING (700).
- Unhandled Exceptions: The current implementation in results-processor/processor.py catches only WPTReportError. Any other exception (e.g., GCS connectivity issues, Datastore timeouts, or unexpected Python errors) propagates out of the handler, which returns a 500 error to the Task Queue.
- Retry Exhaustion: The results-arrival queue has a task_age_limit of 14 days. If a task fails persistently, Cloud Tasks eventually drops it.
- No Terminal Status: Because the exception was unhandled, the processor never calls update_status to move the run to a terminal state (INVALID, EMPTY, VALID, etc.).
- Persistence: Since there is no TTL or automatic cleanup for PendingTestRun entities, they remain in the 'Pending' list indefinitely.
Steps to Reproduce (Theoretical):
- Upload a result that causes a non-WPTReportError exception in the processor (e.g., a network timeout during upload_raw).
- Observe the run move to WPTFYI_PROCESSING.
- Allow the task to fail and retry until it reaches the queue's age limit or retry limit.
- The task is deleted from the queue, but the PendingTestRun remains in stage 700.
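The sequence above can be simulated without any Cloud infrastructure. The following is a minimal sketch, not the actual wpt.fyi code: FakeRun, handle_task, simulate_queue, the "RECEIVED" placeholder stage, and the retry count are all hypothetical stand-ins; only the stage value 700 and the WPTReportError name come from this report.

```python
# Hypothetical simulation of the orphaning sequence; not the real processor code.
WPTFYI_PROCESSING = 700  # stage value quoted in this issue


class WPTReportError(Exception):
    """Stand-in for the only exception type the processor currently catches."""


class FakeRun:
    """Stand-in for a PendingTestRun entity."""

    def __init__(self):
        self.stage = "RECEIVED"  # placeholder for whatever stage precedes 700


def upload_raw():
    # Simulates a non-WPTReportError failure, e.g. a network timeout.
    raise TimeoutError("simulated GCS timeout")


def handle_task(run):
    """Mimics the narrow error handling described in the root cause analysis."""
    run.stage = WPTFYI_PROCESSING
    try:
        upload_raw()
    except WPTReportError:
        run.stage = "INVALID"
        return 200
    run.stage = "VALID"
    return 200


def simulate_queue(run, max_attempts=3):
    """Retries the task until it succeeds or the (assumed) limit is reached."""
    for _ in range(max_attempts):
        try:
            status = handle_task(run)
        except Exception:
            status = 500  # an unhandled exception surfaces as a 500
        if status == 200:
            return "done"
    return "dropped"  # the queue gives up; nobody resets the run's stage


run = FakeRun()
outcome = simulate_queue(run)
```

After the simulated queue gives up, `outcome` is "dropped" while `run.stage` is still 700, which matches the orphaned state seen in the pending list.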
Suggested Fix:
- Processor Level: Wrap the core logic in results-processor/main.py or processor.py in a global try...except block.
- Failure Callback: In the event of an unhandled exception, attempt to update the run status to INVALID with a summary of the error before allowing the task to fail/retry. This ensures that even if the task is eventually dropped, the UI reflects a failed state rather than a pending one.
- Cleanup Logic: Consider adding a cron job or a TTL mechanism to automatically mark very old PendingTestRun entities (e.g., older than 14 days) as INVALID or STALE.
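A rough sketch of the first two suggestions plus the cleanup selection might look like the following. All names here (run_with_failure_callback, find_stale, the process and update_status callables, and the "updated" field) are hypothetical and only illustrate the shape of the fix; the real signatures in results-processor/processor.py and the PendingTestRun schema may differ.

```python
from datetime import datetime, timedelta, timezone


def run_with_failure_callback(run_id, process, update_status):
    """Runs process(run_id); on any exception, tries to record INVALID first.

    process and update_status are injected callables (assumptions), so this
    sketch does not depend on the real processor API.
    """
    try:
        return process(run_id)
    except Exception as exc:
        summary = f"{type(exc).__name__}: {exc}"
        try:
            update_status(run_id, "INVALID", summary)
        except Exception:
            # The status update may fail for the same reason as the task
            # (e.g. an outage); do not mask the original error.
            pass
        raise  # still fail the task so the queue can retry it


def find_stale(pending_runs, now, max_age=timedelta(days=14)):
    """Returns pending runs whose last update is older than max_age.

    Each run is assumed to be a dict with an 'updated' datetime field.
    """
    return [r for r in pending_runs if now - r["updated"] > max_age]


# Usage example with fake collaborators.
recorded = []


def failing_process(run_id):
    raise TimeoutError("simulated timeout")


try:
    run_with_failure_callback(42, failing_process,
                              lambda *args: recorded.append(args))
    reraised = False
except TimeoutError:
    reraised = True

now = datetime(2024, 1, 31, tzinfo=timezone.utc)
runs = [
    {"id": 1, "updated": now - timedelta(days=20)},
    {"id": 2, "updated": now - timedelta(days=1)},
]
stale_ids = [r["id"] for r in find_stale(runs, now)]
```

One design point to settle: because the exception is re-raised, a later retry may succeed, so the status update path would need to allow a run already marked INVALID to move to VALID. Whether the current API permits that transition is not something this sketch assumes either way.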
Affected Files:
- results-processor/main.py
- results-processor/processor.py
- api/pending_test_runs.go