
Add manage command to resync preprint dois v1#11617

Open
Vlad0n20 wants to merge 1 commit into CenterForOpenScience:feature/pbs-26-2 from Vlad0n20:fix/ENG-9044

Conversation


@Vlad0n20 Vlad0n20 commented Mar 2, 2026

Ticket

Purpose

Changes

Side Effects

QE Notes

CE Notes

Documentation


@cslzchen cslzchen left a comment


Looks good overall. In addition to my questions/comments:

  • Can we add the log output from your local run?
  • We should also work with CE to test this command with a copy of production DB.


logger = logging.getLogger(__name__)

RATE_LIMIT_SLEEP = 60 * 5

Nit-picking: let's put a comment mentioning this is 5 min and what this rate limit does.
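A possible version of the constant with the suggested comment (the stated reason for the limit is my assumption about intent, not something confirmed in the PR):

```python
import logging

logger = logging.getLogger(__name__)

# Sleep for 5 minutes (300 s) between rate-limited chunks so repeated
# DOI update requests do not hammer the external DOI service.
RATE_LIMIT_SLEEP = 60 * 5
```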

return qs


def resync_preprint_dois_v1(dry_run=True, batch_size=0, rate_limit=100, provider_id=None):

The default batch_size should not be 0. If we have a huge queryset and run this without providing a batch size, it may take a long time or even get stuck and killed.

queued += 1
continue

if rate_limit and not record_number % rate_limit:

Curious about the reason that led us to rate limit every 100 (default) items?

In addition, should the batch size always be larger than, and a multiple of, the rate limit?
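For reference, the check under review pauses after every `rate_limit`-th record regardless of the batch size. A runnable sketch of that behavior (the function name and return value are mine, for illustration; the sleep is zeroed out so the sketch runs instantly):

```python
import time

RATE_LIMIT_SLEEP = 0  # zeroed here for the sketch; the command uses 60 * 5


def walk_records(records, rate_limit=100):
    """Mirror the modulo check: pause after every `rate_limit`-th record.

    Returns how many pauses happened, so the cadence is easy to verify.
    """
    pauses = 0
    for record_number, _record in enumerate(records, 1):
        # ... per-record work would go here ...
        if rate_limit and not record_number % rate_limit:
            time.sleep(RATE_LIMIT_SLEEP)
            pauses += 1
    return pauses
```

Note that with `rate_limit=0` the condition is always false, so the loop never sleeps.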

)

if batch_size:
preprints_iterable = preprints_to_update[:batch_size]

So if we are doing it in batches, should we have another for loop to iterate over each batch? Or does the batch here just process the first batch_size items, so we have to manually run this command again for the rest?
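If looping over every batch is the intent, a minimal sketch of what that could look like (`iter_batches` is a hypothetical helper, not something in the PR):

```python
def iter_batches(queryset, batch_size):
    """Yield successive slices of `queryset` so a single run covers all
    records, rather than stopping after the first `batch_size` items."""
    start = 0
    while True:
        batch = list(queryset[start:start + batch_size])
        if not batch:
            break
        yield batch
        start += batch_size
```

With a real Django queryset, the ordering should be fixed (e.g. `.order_by('pk')`) so that successive slices are stable.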


queued = 0
skipped = 0
for record_number, preprint in enumerate(preprints_iterable, 1):

Are there any exceptions that we can catch so we continue the loop instead of erroring out and quitting?

I suggest adding errored = 0 to track errored ones.
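A minimal sketch of the suggested try/except plus `errored` counter (`update_doi` is a stand-in for whatever the real per-preprint update call is):

```python
import logging

logger = logging.getLogger(__name__)


def resync_loop(preprints_iterable, update_doi):
    """Count successes and failures instead of letting one bad record
    abort the whole run."""
    queued = 0
    errored = 0
    for record_number, preprint in enumerate(preprints_iterable, 1):
        try:
            update_doi(preprint)
        except Exception:
            # Log the traceback and keep going with the next record.
            logger.exception('Failed to resync record %s', record_number)
            errored += 1
        else:
            queued += 1
    return queued, errored
```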

