chore(migration): Migrate code from googleapis/python-documentai-toolbox into packages/google-cloud-documentai-toolbox by parthea · Pull Request #16010 · googleapis/google-cloud-python

parthea · 2026-03-02T18:04:19Z

This PR should be merged with a merge-commit, not a squash-commit, in order to preserve the git history.

* chore: add unit test for Entity

* chore: fix to_dataframe header issue

* chore(main): release 0.1.0 * feat: change release version to alpha * changed changelog version Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

* chore: updated readme * changed rst text * added disclaimer * update readme * update readme * update readme * update readme * Update README.rst * Update README.rst Co-authored-by: Anthonios Partheniou <partheniou@google.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com> Co-authored-by: Anthonios Partheniou <partheniou@google.com>

* chore: added tests for page.py * added test fixture for page.py tests * changed fixture name

Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.9.24 to 2022.12.7. - [Release notes](https://github.com/certifi/python-certifi/releases) - [Commits](certifi/python-certifi@2022.09.24...2022.12.07) --- updated-dependencies: - dependency-name: certifi dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: update repo-metadata.json * syntax

…a.json (#31) * chore: update client_documentation and issue_tracker in .repo-metadata.json * remove client_documentation value

* chore: changed gcs_prefix pattern comment * changed functions using gcs_prefix * fixed failing tests * updated comments * updated comments

* chore: documentation changes * addressed comments

* chore: updated testing constraints * updates 3.7 constraints * updated test constraints * updated constrainst 3.7 * updated setup.py deps * removed setup.py dep * removed google-common-proto * updated pandas deps * changes pandas in setup.py * revertes setup changes added deps to3.7constraints * updated storage deps * removed numpy from 3.7 constraint * added numpy * changed constraints * added numpy * fixed dependency error * changed setup.py * updated numpy constraints * changed min dep for numpy * changed api core * fixed lint issue * removed get_bytes test * testing changes * removed test issue

* docs: fix docs arrangement * updated documentation * lint fix --------- Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

* chore: minor refactoring of GCS Functions in document wrapper - Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors) - Added constants for reused values - Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates * chore: minor refactoring of GCS Functions in document wrapper - Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors) - Added constants for reused values - Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates * chore: Fix to allow tests to pass

* chore: update docs url * update README

* chore: fixed documentation devsite issues * fixed comments * Update README.rst * updated readme

* added init files * added wrap_documents samples * changed file name for test * Delete wrap_document_samples_test.py * added test requirements * Delete wrap_document_samples_test.py * updated sample * updated region tags * changed tags * changed region tag --------- Co-authored-by: Anthonios Partheniou <partheniou@google.com>

* chore: changed test file name for quickstart sample * Updated Quickstart Sample - Added noxfile config for testing - Updated test file for formatting & type annotation * Added noxfile.py to samples directory * updated noxfile * fixed quickstart_sample --------- Co-authored-by: Holt Skinner <holtskinner@google.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>

- Fixed Type mismatch on `_table_wrapper_from_documentai_table` - Simplified boolean checks - Removed Extra local variables - Updated function names/docs for clarity Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

* fix: Updated Pip install name in README - Added Toolbox to requirements.txt for Samples * fix: Update requirements and setup to avoid version conflict

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

* chore(deps): update dependency google-cloud-documentai to v2 * Update requirements.txt --------- Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>

* feat: Added Support for Form Fields * docs: Updated spacing in `_trim_text()` docstring * Added return type for `get_form_field_by_name()` --------- Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

* samples: Added Table Sample * test: Add Table Sample Test * Update test_table_sample.py * samples: Updated Sample variables * Updated comment to specify local document path --------- Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

* feat: Add PDF Splitter * fix: Updated setup.py syntax * fix: Fixed initializer error * Updated Test to include a multi-page split * formatting fix * Add Pdf Split Example * Adjusted mkdir in tests * Added pikepdf to test dependencies --------- Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

…ent` wrapper (#50)

* feat: Add `entities_to_dict()` and `entities_to_bigquery()` to Document wrapper
  - Uploads entities to an existing dataset, creates new table if it doesn't already exist.

## Example Output Table

| supplier_iban | purchase_order | supplier_email | freight_amount | supplier_address | receiver_address | total_amount | supplier_name | total_tax_amount | payment_terms | line_item | receiver_name | receiver_email | due_date | invoice_date | invoice_id | currency | receiver_tax_id | net_amount | vat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 1 | user@companyabc.com | 600 | 111 Main Street Anytown, USA | 222 Main Street Anytown, USA | 2140 | Company ABC | 140 | 6 month contract | [Tool A 500 1.00 500.00,Service B 1 900.00 900.00,Resource C 50 12.00 600.00] | John Doe | johndoe@email.com | 2025-01-01 | 1970-01-01 | NO. 001 | $ | 1 | 2000 | 140 |

* Removed unneeded test code
* Added bigquery library to setup.py
* Updated Docstrings
* Fixed Test import linter error
* Added bigQuery Library to Testing Constraints
* Added handling of Nested Entities (properties)
* Dependency Update for Tests
* Update Dependencies
* Fixed Test Output
* Updated DatasetReference based on Deprecation Warning
* samples: Added Entities to BigQuery Sample Code
* Added Required tag to `entities_to_bigquery()` arguments
* Fixed Issues from merge conflict
* Fixed numpy import

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

* chore(main): release 0.2.0-alpha * Update CHANGELOG.md --------- Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

gemini-code-assist · 2026-03-02T18:09:26Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request migrates the Document AI Toolbox code from the googleapis/python-documentai-toolbox repository into the packages/google-cloud-documentai-toolbox directory of the googleapis/google-cloud-python repository. This involves moving the source code, tests, and documentation, as well as setting up the necessary infrastructure for continuous integration, linting, and release management. The goal is to consolidate the codebase and improve maintainability.

Highlights

Code Migration: Migrated the googleapis/python-documentai-toolbox code into the packages/google-cloud-documentai-toolbox directory.
Dependency Management: Introduced new files for dependency management, linting, testing, and release automation.
Repository Configuration: Added various configuration files for GitHub actions, code owners, issue templates, and repository synchronization.

Activity

The PR description requests a merge-commit to preserve git history.
The PR introduces a large number of new files, primarily for configuration and automation.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This PR migrates the python-documentai-toolbox library into this repository. The changes consist of adding a large number of files for the new package, including source code, tests, samples, and configuration. The code is generally well-structured and tested. My feedback focuses on making the code more robust and maintainable, addressing a potential bug in file handling, and improving code clarity and type correctness.

Note: Security Review is unavailable for this PR.

gemini-code-assist · 2026-03-02T18:25:04Z

...ages/google-cloud-documentai-toolbox/google/cloud/documentai_toolbox/converters/converter.py

+    try:
+        for blob in blobs:
+            if blob.name.endswith("/"):
+                continue
+            file_name = os.path.basename(blob.name)
+            if annotation_file_prefix in file_name:
+                annotation_blob = blob
+            elif config_file_prefix in file_name:
+                metadata_blob = blob
+            elif constants.PDF_EXTENSION in file_name:
+                doc_blob = blob
+
+        if config_path:
+            metadata_blob = gcs_utilities.get_blob(config_path)
+
+        directory_name = os.path.basename(gcs_uri)
+        print(f"Downloaded: {directory_name}", end="\r")
+
+        return (
+            annotation_blob.download_as_bytes(),
+            doc_blob.download_as_bytes(),
+            metadata_blob.download_as_bytes(),
+            directory_name,
+        )
+    except Exception as e:
+        raise e


The _get_bytes function does not handle cases where one of the expected files (annotation, config, or PDF) is missing in the GCS directory. If a file is not found, an UnboundLocalError will be raised when trying to access variables like annotation_blob, metadata_blob, or doc_blob. It would be more robust to initialize these variables to None before the loop and add a check to ensure all required files are found before proceeding. This would provide a clearer FileNotFoundError.
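The suggestion above can be sketched as follows. This is a minimal illustration, not the toolbox's implementation: `FakeBlob`, `select_blobs`, and the `pdf_extension` default are hypothetical names introduced here so the example runs without Google Cloud Storage.

```python
import os
from typing import Iterable, Optional, Tuple


class FakeBlob:
    """Hypothetical stand-in for a GCS blob, used only in this sketch."""

    def __init__(self, name: str, data: bytes = b"") -> None:
        self.name = name
        self._data = data

    def download_as_bytes(self) -> bytes:
        return self._data


def select_blobs(
    blobs: Iterable[FakeBlob],
    annotation_file_prefix: str,
    config_file_prefix: str,
    pdf_extension: str = ".pdf",  # assumed value for this sketch
) -> Tuple[FakeBlob, FakeBlob, FakeBlob]:
    """Return (annotation, document, metadata) blobs or raise FileNotFoundError."""
    # Initialize to None so a missing file is detected explicitly
    # instead of surfacing later as an UnboundLocalError.
    annotation_blob: Optional[FakeBlob] = None
    doc_blob: Optional[FakeBlob] = None
    metadata_blob: Optional[FakeBlob] = None

    for blob in blobs:
        if blob.name.endswith("/"):
            continue  # skip directory placeholders
        file_name = os.path.basename(blob.name)
        if annotation_file_prefix in file_name:
            annotation_blob = blob
        elif config_file_prefix in file_name:
            metadata_blob = blob
        elif pdf_extension in file_name:
            doc_blob = blob

    found = {
        "annotation": annotation_blob,
        "document": doc_blob,
        "metadata": metadata_blob,
    }
    missing = [label for label, blob in found.items() if blob is None]
    if missing:
        raise FileNotFoundError(
            f"Missing required file(s) in GCS directory: {', '.join(missing)}"
        )
    return annotation_blob, doc_blob, metadata_blob
```

With this shape, the bare `try`/`except Exception as e: raise e` wrapper in the hunk is no longer needed, since the only expected failure is reported directly.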

gemini-code-assist · 2026-03-02T18:25:04Z

...loud-documentai-toolbox/google/cloud/documentai_toolbox/converters/config/bbox_conversion.py

+        y_min = _convert_bbox_units(
+            block.bounding_box[f"{block.bounding_y}"],
+            input_bbox_units=block.bounding_unit,
+            width=block.page_height,
+            multiplier=y_multiplier,
+        )


In the call to _convert_bbox_units for the y_min coordinate, the width parameter is used to pass block.page_height. This is confusing as the function signature for _convert_bbox_units includes a height parameter. Using the height parameter would make the code more readable and less prone to errors.

Suggested change

         y_min = _convert_bbox_units(
             block.bounding_box[f"{block.bounding_y}"],
             input_bbox_units=block.bounding_unit,
-            width=block.page_height,
+            height=block.page_height,
             multiplier=y_multiplier,
         )
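To illustrate why the keyword matters for readability, here is a hypothetical helper with separate `width` and `height` parameters. The name `convert_bbox_units_sketch`, the `"pixel"` unit label, and the scaling logic are assumptions made for this example; the real `_convert_bbox_units` may behave differently.

```python
from typing import Optional


def convert_bbox_units_sketch(
    coordinate: float,
    input_bbox_units: str,
    width: Optional[float] = None,
    height: Optional[float] = None,
    multiplier: float = 1.0,
) -> float:
    """Hypothetical helper: scale a coordinate into the 0..1 range."""
    if input_bbox_units == "normalized":
        return coordinate * multiplier
    # x coordinates are divided by the page width, y coordinates by the page height.
    if width is not None:
        return coordinate / width * multiplier
    if height is not None:
        return coordinate / height * multiplier
    raise ValueError("width or height is required for non-normalized units")


# Each call names the dimension it actually passes.
x_min = convert_bbox_units_sketch(200.0, "pixel", width=800.0)
y_min = convert_bbox_units_sketch(300.0, "pixel", height=1200.0)
```

In a helper like this, passing the page height through `width` would produce the same number, which is exactly why the mismatch is easy to miss and worth correcting at the call site.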

gemini-code-assist · 2026-03-02T18:25:04Z

...google-cloud-documentai-toolbox/google/cloud/documentai_toolbox/converters/vision_helpers.py

+    """
+    entity_annotations: List[EntityAnnotation] = []
+    for token in page_info.page.tokens:
+        v: vision.Vertex = []


The type hint for the v variable is vision.Vertex, but it is initialized as a list []. This is incorrect as vision.Vertex represents a single vertex object, not a list of them. To improve code clarity and correctness for static analysis tools, the type hint should be changed to List[Dict[str, int]] or List[vision.Vertex].

Suggested change

-        v: vision.Vertex = []
+        v: List[Dict[str, int]] = []
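A minimal sketch of the annotated pattern, assuming the list collects pixel vertices as plain dictionaries. `to_pixel_vertices` and its arguments are hypothetical and only stand in for the loop in `vision_helpers.py`.

```python
from typing import Dict, List, Sequence, Tuple


def to_pixel_vertices(
    normalized_vertices: Sequence[Tuple[float, float]],
    page_width: float,
    page_height: float,
) -> List[Dict[str, int]]:
    """Convert normalized (x, y) pairs into a list of pixel-vertex dicts."""
    # The annotation now matches the value: a list of vertex dicts,
    # not a single vertex object.
    vertices: List[Dict[str, int]] = []
    for x, y in normalized_vertices:
        vertices.append({"x": int(x * page_width), "y": int(y * page_height)})
    return vertices


box = to_pixel_vertices(
    [(0.0, 0.0), (0.5, 0.0), (0.5, 0.25), (0.0, 0.25)],
    page_width=800,
    page_height=1200,
)
```

A static checker such as mypy would accept this declaration, whereas annotating the same empty list as a single vertex type is a type error.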

snippet-bot · 2026-03-05T20:41:20Z

No region tags are edited in this PR.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add the snippet-bot:force-run label.

dizcology and others added 30 commits November 15, 2022 15:04

chore: add unit test for Entity (#19)

e73ce4a

* chore: add unit test for Entity

chore: fixed to_dataframe header issue (#20)

75c82ec

* chore: fix to_dataframe header issue

chore: release 0.1.0-alpha (#22)

2842d09

* chore(main): release 0.1.0 * feat: change release version to alpha * changed changelog version Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

chore: added tests for page.py (#23)

6de8028

* chore: added tests for page.py * added test fixture for page.py tests * changed fixture name

chore: update repo-metadata.json (#29)

cf7432e

* chore: update repo-metadata.json * syntax

chore: update client_documentation and issue_tracker in .repo-metadata.json (#31)

0128ccf

* chore: update client_documentation and issue_tracker in .repo-metadata.json * remove client_documentation value

chore: changed gcs_prefix pattern comment (#21)

7b63e71

* chore: changed gcs_prefix pattern comment * changed functions using gcs_prefix * fixed failing tests * updated comments * updated comments

chore: documentation changes (#33)

86b5f04

* chore: documentation changes * addressed comments

docs: fix docs arrangement (#35)

4daab05

* docs: fix docs arrangement * updated documentation * lint fix --------- Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

chore(main): release 0.1.0-alpha (#25)

b809c1e

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

chore: fix readme release issues (#38)

eeb1f98

chore: update docs url (#41)

ac3df1c

* chore: update docs url * update README

chore: documentation fixes (#44)

03587a0

* chore: fixed documentation devsite issues * fixed comments * Update README.rst * updated readme

chore(deps): update all dependencies (#46)

6d64547

fix: Updated Pip install name in README (#52)

7800f73

* fix: Updated Pip install name in README - Added Toolbox to requirements.txt for Samples * fix: Update requirements and setup to avoid version conflict

chore(main): release 0.1.1-alpha (#45)

319a11d

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

chore(deps): update dependency google-cloud-documentai to v2 (#49)

3855ca9

* chore(deps): update dependency google-cloud-documentai to v2 * Update requirements.txt --------- Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>

chore(deps): update dependency google-cloud-documentai to v2.12.0 (#55)

4567ee3

feat: Added Support for Form Fields (#48)

3ad4af8

* feat: Added Support for Form Fields * docs: Updated spacing in `_trim_text()` docstring * Added return type for `get_form_field_by_name()` --------- Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>

chore(main): release 0.2.0-alpha (#56)

29d62af

* chore(main): release 0.2.0-alpha * Update CHANGELOG.md --------- Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com> Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>

daniel-sanche and others added 7 commits December 11, 2025 09:51

feat: Add support for Python 3.14 (#390)

fa8038f

chore(main): release 0.15.0-alpha (#384)

b810dd8

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

fix: Update storage.Blob.from_string() to from_uri() (#385)

ce0b4c5

chore(main): release 0.15.1-alpha (#393)

5cab8dc

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>

Merge remote-tracking branch 'remote.googleapis/python-documentai-toolbox/main' into migration.python-documentai-toolbox.migration.2026-03-02_16-59-45.migrate

eabdf35

Trigger owlbot post-processor

71a7a89

build: google-cloud-documentai-toolbox migration: adjust owlbot-related files

729e243

gemini-code-assist bot reviewed Mar 2, 2026

parthea self-assigned this Mar 2, 2026

parthea added the "do not merge" label (Indicates a pull request not ready for merge, due to either quality or timing.) Mar 2, 2026

parthea and others added 13 commits March 2, 2026 21:29

tests: fix build

784a580

chore: delete unused files

7328538

chore: delete unused files, part 2

781d121

chore: update repo URLs and references and check for Ruff

7f67650

updates librarian state.yaml with id and metadata

d729e86

chore: remove unused file

8672d7e

tests: update default python to 3.14

527a60b

tests: skip mypy

964992d

tests: filter warnings related to EOL Python

4bf77dc

chore: lint

bba17be

tests: skip core_deps_from_source

e4312ec

tests: fix docs

1c0cc15

tests: fix docs

482d576

parthea marked this pull request as ready for review March 5, 2026 20:41

parthea requested review from a team as code owners March 5, 2026 20:41

docs: remove samples

a129b70

daniel-sanche approved these changes Mar 5, 2026

chore(migration): Migrate code from googleapis/python-documentai-toolbox into packages/google-cloud-documentai-toolbox#16010
parthea wants to merge 275 commits into main from migration.python-documentai-toolbox.migration.2026-03-02_16-59-45.migrate

11 participants
