fix(worker): per-commit concurrency limit for upload_finisher tasks#755

Open
thomasrockhu-codecov wants to merge 4 commits into main from fix/finisher-commit-concurrency-gate

Conversation


@thomasrockhu-codecov thomasrockhu-codecov commented Mar 10, 2026

Summary

Add a Redis-based per-commit concurrency gate to upload_finisher tasks. Only MAX_CONCURRENT_FINISHERS_PER_COMMIT (10) finisher tasks can actively work on a given commit at any time. Excess tasks schedule a retry with a short countdown instead of blocking on the lock.

Why

Large CI matrices fire one finisher task per upload via chord callbacks. All these finishers fight for the same per-commit Redis lock (UPLOAD_PROCESSING). With blocking_timeout=30s, each excess task wastes a worker slot for 30 seconds before failing, retrying with backoff, and going back to the queue. This creates a cascading worker starvation effect.

How it works

```python
def run_impl(self, ...):
    key = f"upload_finisher_active:{repoid}:{commitid}"
    active_count = redis.incr(key)
    redis.expire(key, 660)  # safety TTL, slightly beyond soft_time_limit

    if active_count > MAX_CONCURRENT_FINISHERS_PER_COMMIT:
        redis.decr(key)
        raise self.retry(countdown=10)  # re-queued with short delay

    try:
        ...  # existing logic
    finally:
        redis.decr(key)
```
  • INCR on entry: atomically claims a slot
  • Limit check: if over the limit, decrement and schedule a retry (no data loss)
  • Finally block: guarantees decrement on all exit paths (success, error, timeout, retry)
  • TTL safety net: 660s TTL ensures counter resets even if a task crashes without reaching finally
  • Limit of 10: allows reasonable parallelism while capping the stampede
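The gate described above can be exercised end to end with an in-memory stand-in for Redis. This is a minimal sketch, assuming the key layout and constant names from the PR description; `FakeRedis` and `RetryScheduled` are illustrative stand-ins, not the worker's real Redis client or Celery's `Retry` exception:

```python
MAX_CONCURRENT_FINISHERS_PER_COMMIT = 10
COUNTER_TTL_SECONDS = 660  # safety net, slightly beyond the soft time limit


class FakeRedis:
    """In-memory stand-in for the subset of Redis the gate uses."""

    def __init__(self):
        self.counters = {}

    def incr(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

    def decr(self, key):
        self.counters[key] = self.counters.get(key, 0) - 1
        return self.counters[key]

    def expire(self, key, ttl):
        pass  # TTL bookkeeping elided in this sketch


class RetryScheduled(Exception):
    """Stands in for Celery's Retry exception."""


def run_finisher(redis, repoid, commitid, work):
    key = f"upload_finisher_active:{repoid}:{commitid}"
    active_count = redis.incr(key)  # atomically claims a slot
    redis.expire(key, COUNTER_TTL_SECONDS)
    if active_count > MAX_CONCURRENT_FINISHERS_PER_COMMIT:
        redis.decr(key)  # give the slot back before re-queueing
        raise RetryScheduled("over per-commit limit; re-queued with delay")
    try:
        return work()
    finally:
        redis.decr(key)  # guaranteed on success, error, or timeout
```

With this stub, the first ten callers for a commit run `work()` and the eleventh raises instead of blocking, which is the behavior the bullets above describe.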

Impact estimate

For a commit with N concurrent finishers:

  • Before: N tasks x 30s lock wait = significant wasted worker time per commit
  • After: 10 tasks proceed, (N-10) retry after 10s delay = workers freed immediately
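The before/after comparison can be made concrete with a back-of-envelope calculation using the PR's numbers (30s lock wait, limit of 10); the helper names are illustrative:

```python
def wasted_worker_seconds_before(n_finishers, lock_wait_s=30):
    # Before: each finisher blocks a worker slot for the full
    # blocking_timeout before failing and re-queueing with backoff.
    return n_finishers * lock_wait_s


def fast_retries_after(n_finishers, limit=10):
    # After: up to `limit` tasks proceed; the rest exit in milliseconds
    # and are re-queued with a short countdown, freeing their slots.
    return max(0, n_finishers - limit)


# e.g. a 100-upload CI matrix:
before = wasted_worker_seconds_before(100)  # 3000 blocked worker-seconds
after = fast_retries_after(100)             # 90 quick retries instead
```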

Test plan

  • test_retries_when_concurrency_limit_reached — verifies tasks over the limit raise Retry and decrement the counter
  • test_proceeds_when_under_concurrency_limit — verifies tasks under the limit proceed normally and decrement on exit
  • test_counter_decremented_on_exception — verifies the counter is decremented even when the task throws an exception
  • Existing tests pass

Note

Medium Risk
Changes upload_finisher task execution flow to use a Redis counter + retry to throttle per-commit concurrency, which can affect task throughput and retry behavior under load. Also alters which uploads are considered final/covered by the finisher by always merging reconstructed processing results and treating merged as a terminal state.

Overview
Adds a Redis-based per-commit concurrency limit to UploadFinisherTask.run_impl so only MAX_CONCURRENT_FINISHERS_PER_COMMIT tasks actively process a commit at once; excess tasks decrement the counter, increment a new metric (upload_finisher_concurrency_limited), and retry quickly instead of blocking worker slots.

Refactors finisher execution by moving the main logic into _run_impl_inner and always reconstructing/merging processing_results from ProcessingState to ensure the finisher considers all uploads for the commit. The idempotency check now also treats merged uploads as already-final.

Updates test infrastructure (mock_redis.incr default) and adds unit tests covering concurrency-limit retrying, normal operation under the limit, and guaranteed counter decrement on exceptions.

Written by Cursor Bugbot for commit b14d01b. This will update automatically on new commits.

Add a Redis INCR-based concurrency gate: only
MAX_CONCURRENT_FINISHERS_PER_COMMIT (3) tasks can actively work on a
given commit at any time. Excess tasks exit in milliseconds, freeing
their worker slots immediately. The counter is decremented in a finally
block with a 660s TTL safety net.

Made-with: Cursor
@thomasrockhu-codecov thomasrockhu-codecov marked this pull request as ready for review March 11, 2026 12:23
- Fix mock_redis.incr returning MagicMock instead of int (TypeError on
  comparison) by setting incr.return_value=1 in conftest fixture
- Replace silent return with self.retry() when concurrency limit is
  reached, so uploads are retried rather than dropped
- Update test to expect Retry exception instead of return value

Made-with: Cursor
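The `mock_redis.incr` fix in the first bullet is easy to reproduce: a bare `MagicMock` does not support ordering against an `int`, so the gate's limit check raises `TypeError` until the fixture pins an integer return value. A minimal sketch (the key name and `over_limit` helper are illustrative, not the PR's code):

```python
from unittest.mock import MagicMock


def over_limit(redis, limit=10):
    # The gate's limit check: compares incr()'s return value with an int.
    return redis.incr("upload_finisher_active:1:abc") > limit


# Broken fixture: incr() returns a bare MagicMock, whose comparison magic
# methods default to NotImplemented, so `MagicMock > int` raises TypeError.
broken = MagicMock()
try:
    over_limit(broken)
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Fixed fixture, as in the commit: pin incr's return value to an int.
fixed = MagicMock()
fixed.incr.return_value = 1
under = over_limit(fixed)
```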

sentry bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.26%. Comparing base (42aa7cc) to head (b14d01b).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #755   +/-   ##
=======================================
  Coverage   92.26%   92.26%           
=======================================
  Files        1304     1304           
  Lines       47925    47942   +17     
  Branches     1628     1628           
=======================================
+ Hits        44218    44236   +18     
+ Misses       3398     3397    -1     
  Partials      309      309           
| Flag | Coverage | Δ |
|------|----------|---|
| workerintegration | 58.74% <80.95%> | +0.18% ⬆️ |
| workerunit | 90.39% <100.00%> | +0.01% ⬆️ |

Flags with carried forward coverage won't be shown.



codecov-notifications bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


Align the per-commit finisher cap with the intended limit and make the retry path explicitly raise so counter bookkeeping cannot fall through.

Made-with: Cursor
Comment on lines +285 to +286:

```python
inc_counter(UPLOAD_FINISHER_CONCURRENCY_LIMITED_COUNTER)
raise self.retry(countdown=FINISHER_BASE_RETRY_COUNTDOWN_SECONDS)
```

Bug: The concurrency-gating retry logic in the upload finisher can exhaust the task's retry budget in high-concurrency scenarios, causing tasks to be dropped and leading to data loss.
Severity: HIGH

Suggested Fix

Modify the self.retry() call used for concurrency gating to prevent it from consuming the task's limited retry budget. This can be achieved by passing max_retries=None to allow for unlimited retries specifically for this waiting period, or by implementing a separate rate-limiting mechanism that does not use task retries.
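The budget semantics behind the suggested fix can be modeled with a simple counter. This is a sketch, not Celery's implementation; in real code the exempt path corresponds to calling `self.retry(countdown=..., max_retries=None)` for the gate only, while genuine failures keep the default budget:

```python
class TaskState:
    """Models a Celery-style task with a bounded retry budget."""

    def __init__(self, max_retries=5):
        self.max_retries = max_retries
        self.retries = 0


def try_retry(task, counts_against_budget=True):
    """Return True if the task is re-queued, False if it would be dropped."""
    if not counts_against_budget:
        return True  # analogous to self.retry(max_retries=None)
    if task.retries >= task.max_retries:
        return False  # budget exhausted: Celery drops the task
    task.retries += 1
    return True


# Current behavior: 5 gate retries at 10s each ~= 50s, then the task is lost.
budget_limited = TaskState(max_retries=5)
outcomes = [try_retry(budget_limited) for _ in range(6)]

# Suggested fix: gate retries never draw down the budget.
exempt = TaskState(max_retries=5)
gate_outcomes = [try_retry(exempt, counts_against_budget=False) for _ in range(100)]
```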

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: apps/worker/tasks/upload_finisher.py#L285-L286

Potential issue: In scenarios with a large number of concurrent finishers for a single commit, the concurrency-limiting mechanism uses the task's standard retry logic. The retry call at line 286 consumes one of the five available retry attempts. Tasks that are repeatedly blocked by the concurrency gate will exhaust their retry budget after approximately 50 seconds (5 attempts with a 10-second countdown). Once the retries are exhausted, Celery will drop the task, causing it to fail permanently. This results in silent data loss, as the corresponding coverage reports are never processed and merged.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

```python
# Only this many finisher tasks are allowed to actively work on a single commit
# at the same time. Excess tasks exit immediately, freeing their worker slot.
# A small number (>1) allows for failover if the active finisher crashes.
MAX_CONCURRENT_FINISHERS_PER_COMMIT = 3
```

Concurrency limit set to 3 instead of intended 10

Medium Severity

MAX_CONCURRENT_FINISHERS_PER_COMMIT is set to 3, but the PR description explicitly documents 10 as the intended value ("Limit of 10: allows reasonable parallelism while capping the stampede") and the reviewer also suggested 10. A limit of 3 is unnecessarily aggressive and will cause far more tasks to retry than intended, reducing throughput and increasing latency for commits with moderate parallelism.

