chore(migration): Migrate code from googleapis/python-documentai-toolbox into packages/google-cloud-documentai-toolbox#16010

Open
parthea wants to merge 275 commits into main from migration.python-documentai-toolbox.migration.2026-03-02_16-59-45.migrate

Conversation

@parthea
Contributor

@parthea parthea commented Mar 2, 2026

See #11026.

This PR should be merged with a merge-commit, not a squash-commit, in order to preserve the git history.

dizcology and others added 30 commits November 15, 2022 15:04
* chore: add unit test for Entity
* chore: fix to_dataframe header issue
* chore(main): release 0.1.0

* feat: change release version to alpha

* changed changelog version

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
* chore: updated readme

* changed rst text

* added disclaimer

* update readme

* update readme

* update readme

* update readme

* Update README.rst

* Update README.rst

Co-authored-by: Anthonios Partheniou <partheniou@google.com>

Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
Co-authored-by: Anthonios Partheniou <partheniou@google.com>
* chore: added tests for page.py

* added test fixture for page.py tests

* changed fixture name
Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.9.24 to 2022.12.7.
- [Release notes](https://github.com/certifi/python-certifi/releases)
- [Commits](certifi/python-certifi@2022.09.24...2022.12.07)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* chore: update repo-metadata.json

* syntax
…a.json (#31)

* chore: update client_documentation and issue_tracker in .repo-metadata.json

* remove client_documentation value
* chore: changed gcs_prefix pattern comment

* changed functions using gcs_prefix

* fixed failing tests

* updated comments

* updated comments
* chore: documentation changes

* addressed comments
* chore: updated testing constraints

* updates 3.7 constraints

* updated test constraints

* updated constrainst 3.7

* updated setup.py deps

* removed setup.py dep

* removed google-common-proto

* updated pandas deps

* changes pandas in setup.py

* revertes setup changes added deps to3.7constraints

* updated storage deps

* removed numpy from 3.7 constraint

* added numpy

* changed constraints

* added numpy

* fixed dependency error

* changed setup.py

* updated numpy constraints

* changed min dep for numpy

* changed api core

* fixed lint issue

* removed get_bytes test

* testing changes

* removed test issue
* docs: fix docs arrangement

* updated documentation

* lint fix

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
* chore: minor refactoring of GCS Functions in document wrapper
- Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors)
- Added constants for reused values
- Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates

* chore: minor refactoring of GCS Functions in document wrapper
- Simplified `print_gcs_document_tree` for readaibility/maintainability (And to resolve linter errors)
- Added constants for reused values
- Added `ignore_unknown_values` to `Document.from_json()` to avoid exceptions with new Document Proto versions between client library updates

* chore: Fix to allow tests to pass
* chore: update docs url

* update README
* chore: fixed documentation devsite issues

* fixed comments

* Update README.rst

* updated readme
* added init files

* added wrap_documents samples

* changed file name for test

* Delete wrap_document_samples_test.py

* added test requirements

* Delete wrap_document_samples_test.py

* updated sample

* updated region tags

* changed tags

* changed region tag

---------

Co-authored-by: Anthonios Partheniou <partheniou@google.com>
* chore: changed test file name for quickstart sample

* Updated Quickstart Sample
- Added noxfile config for testing
- Updated test file for formatting & type annotation

* Added noxfile.py to samples directory

* updated noxfile

* fixed quickstart_sample

---------

Co-authored-by: Holt Skinner <holtskinner@google.com>
Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
- Fixed Type mismatch on `_table_wrapper_from_documentai_table`
- Simplified boolean checks
- Removed Extra local variables
- Updated function names/docs for clarity

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
* fix: Updated Pip install name in README

- Added Toolbox to requirements.txt for Samples

* fix: Update requirements and setup to avoid version conflict
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
* chore(deps): update dependency google-cloud-documentai to v2

* Update requirements.txt

---------

Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
* feat: Added Support for Form Fields

* docs: Updated spacing in `_trim_text()` docstring

* Added return type for `get_form_field_by_name()`

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
* samples: Added Table Sample

* test: Add Table Sample Test

* Update test_table_sample.py

* samples: Updated Sample variables

* Updated comment to specify local document path

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
* feat: Add PDF Splitter

* fix: Updated setup.py syntax

* fix: Fixed initializer error

* Updated Test to include a multi-page split

* formatting fix

* Add Pdf Split Example

* Adjusted mkdir in tests

* Added pikepdf to test dependencies

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
…ent` wrapper (#50)

* feat: Add `entities_to_dict()` and `entities_to_bigquery()` to Document wrapper

- Uploads entities to an existing dataset, creates new table if it doesn't already exist.

## Example Output Table

| supplier_iban | purchase_order | supplier_email      | freight_amount | supplier_address             | receiver_address             | total_amount | supplier_name | total_tax_amount | payment_terms    | line_item                                                                     | receiver_name | receiver_email    | due_date   | invoice_date | invoice_id | currency | receiver_tax_id | net_amount | vat |
|---------------|----------------|---------------------|----------------|------------------------------|------------------------------|--------------|---------------|------------------|------------------|-------------------------------------------------------------------------------|---------------|-------------------|------------|--------------|------------|----------|-----------------|------------|-----|
|            50 |              1 | user@companyabc.com |            600 | 111 Main Street Anytown, USA | 222 Main Street Anytown, USA |         2140 | Company ABC   |              140 | 6 month contract | [Tool A 500 1.00 500.00,Service B 1 900.00 900.00,Resource C 50 12.00 600.00] | John Doe      | johndoe@email.com | 2025-01-01 |   1970-01-01 | NO. 001    | $        |               1 |       2000 | 140 |

* Removed unneeded test code

* Added bigquery library to setup.py

* Updated Docstrings

* Fixed Test import linter error

* Added bigQuery Library to Testing Constraints

* Added handling of Nested Entities (properties)

* Dependency Update for Tests

* Update Dependencies

* Fixed Test Output

* Updated DatasetReference based on Deprecation Warning

* samples: Added Entities to BigQuery Sample Code

* Added Required tag to `entities_to_bigquery()` arguments

* Fixed Issues from merge conflict

* Fixed numpy import

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
* chore(main): release 0.2.0-alpha

* Update CHANGELOG.md

---------

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: Holt Skinner <13262395+holtskinner@users.noreply.github.com>
daniel-sanche and others added 7 commits December 11, 2025 09:51
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
…lbox/main' into migration.python-documentai-toolbox.migration.2026-03-02_16-59-45.migrate
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request migrates the Document AI Toolbox code from the googleapis/python-documentai-toolbox repository into the packages/google-cloud-documentai-toolbox directory of this repository. The migration moves the source code, tests, and documentation, and sets up the infrastructure for continuous integration, linting, and release management. The goal is to consolidate the codebase and improve maintainability.

Highlights

  • Code Migration: Migrated the googleapis/python-documentai-toolbox code into the packages/google-cloud-documentai-toolbox directory.
  • Dependency Management: Introduced new files for dependency management, linting, testing, and release automation.
  • Repository Configuration: Added various configuration files for GitHub Actions, code owners, issue templates, and repository synchronization.

Activity

  • The PR description requests a merge-commit to preserve git history.
  • The PR introduces a large number of new files, primarily for configuration and automation.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature              | Command               | Description                                                                                   |
|----------------------|-----------------------|-----------------------------------------------------------------------------------------------|
| Code Review          | `/gemini review`      | Performs a code review for the current pull request in its current state.                     |
| Pull Request Summary | `/gemini summary`     | Provides a summary of the current pull request in its current state.                          |
| Comment              | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help                 | `/gemini help`        | Displays a list of available commands.                                                        |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR migrates the python-documentai-toolbox library into this repository. The changes consist of adding a large number of files for the new package, including source code, tests, samples, and configuration. The code is generally well-structured and tested. My feedback focuses on making the code more robust and maintainable, addressing a potential bug in file handling, and improving code clarity and type correctness.

Note: Security Review is unavailable for this PR.

Comment on lines +233 to +258
try:
    for blob in blobs:
        if blob.name.endswith("/"):
            continue
        file_name = os.path.basename(blob.name)
        if annotation_file_prefix in file_name:
            annotation_blob = blob
        elif config_file_prefix in file_name:
            metadata_blob = blob
        elif constants.PDF_EXTENSION in file_name:
            doc_blob = blob

    if config_path:
        metadata_blob = gcs_utilities.get_blob(config_path)

    directory_name = os.path.basename(gcs_uri)
    print(f"Downloaded: {directory_name}", end="\r")

    return (
        annotation_blob.download_as_bytes(),
        doc_blob.download_as_bytes(),
        metadata_blob.download_as_bytes(),
        directory_name,
    )
except Exception as e:
    raise e

high

The _get_bytes function does not handle cases where one of the expected files (annotation, config, or PDF) is missing in the GCS directory. If a file is not found, an UnboundLocalError will be raised when trying to access variables like annotation_blob, metadata_blob, or doc_blob. It would be more robust to initialize these variables to None before the loop and add a check to ensure all required files are found before proceeding. This would provide a clearer FileNotFoundError.
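
One way to apply this suggestion is sketched below. It is a minimal illustration, not the toolbox's implementation: `FakeBlob` stands in for a Cloud Storage blob, and `select_blobs`, its parameters, and the error message are hypothetical names chosen for the example.

```python
import os
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class FakeBlob:
    """Hypothetical stand-in for a Cloud Storage blob; only `name` is used here."""

    name: str


def select_blobs(
    blobs: Iterable[FakeBlob],
    annotation_file_prefix: str,
    config_file_prefix: str,
    pdf_extension: str = ".pdf",
) -> Dict[str, FakeBlob]:
    """Pick the annotation, config, and PDF blobs, failing clearly if any is missing."""
    # Initialize every slot to None so a missing file cannot cause UnboundLocalError.
    found: Dict[str, Optional[FakeBlob]] = {
        "annotation": None,
        "config": None,
        "pdf": None,
    }
    for blob in blobs:
        if blob.name.endswith("/"):
            continue  # Skip directory placeholders.
        file_name = os.path.basename(blob.name)
        if annotation_file_prefix in file_name:
            found["annotation"] = blob
        elif config_file_prefix in file_name:
            found["config"] = blob
        elif pdf_extension in file_name:
            found["pdf"] = blob

    # Check that all required files were found before anything is downloaded.
    missing = [kind for kind, blob in found.items() if blob is None]
    if missing:
        raise FileNotFoundError(f"Missing required file(s): {', '.join(missing)}")
    return {kind: blob for kind, blob in found.items() if blob is not None}
```

With this shape, the download calls would only run after every required blob is known to exist, and a missing file surfaces as a `FileNotFoundError` that names what is absent.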

Comment on lines +233 to +238
y_min = _convert_bbox_units(
    block.bounding_box[f"{block.bounding_y}"],
    input_bbox_units=block.bounding_unit,
    width=block.page_height,
    multiplier=y_multiplier,
)

medium

In the call to _convert_bbox_units for the y_min coordinate, the width parameter is used to pass block.page_height. This is confusing as the function signature for _convert_bbox_units includes a height parameter. Using the height parameter would make the code more readable and less prone to errors.

Suggested change
- y_min = _convert_bbox_units(
-     block.bounding_box[f"{block.bounding_y}"],
-     input_bbox_units=block.bounding_unit,
-     width=block.page_height,
-     multiplier=y_multiplier,
- )
+ y_min = _convert_bbox_units(
+     block.bounding_box[f"{block.bounding_y}"],
+     input_bbox_units=block.bounding_unit,
+     height=block.page_height,
+     multiplier=y_multiplier,
+ )
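
The sketch below illustrates the readability point behind this suggestion. The body of `_convert_bbox_units` is not shown in this thread, so `convert_bbox_units_sketch`, its normalization logic, and the unit strings `"normalized"` and `"pxl"` are assumptions made purely for illustration; only the `width`, `height`, `input_bbox_units`, and `multiplier` parameter names come from the comment and snippet above.

```python
from typing import Optional


def convert_bbox_units_sketch(
    coordinate: float,
    input_bbox_units: str,
    width: Optional[float] = None,
    height: Optional[float] = None,
    multiplier: float = 1.0,
) -> float:
    """Hypothetical converter: scale a coordinate by the matching page dimension."""
    if input_bbox_units == "normalized":
        # Already in the 0..1 range; only the multiplier applies.
        return coordinate * multiplier
    # Exactly one of width/height should be given: width for x, height for y.
    if (width is None) == (height is None):
        raise ValueError("Pass exactly one of width (x-axis) or height (y-axis).")
    dimension = width if width is not None else height
    return coordinate / dimension * multiplier


# A y-axis call site that passes height= documents itself.
y_min = convert_bbox_units_sketch(50.0, "pxl", height=200.0)
```

In a function shaped like this, passing the page height through `width=` could produce the same number, which is why the issue is rated as a clarity problem: the call site stops telling the reader which axis is being converted.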

"""
entity_annotations: List[EntityAnnotation] = []
for token in page_info.page.tokens:
    v: vision.Vertex = []

medium

The type hint for the v variable is vision.Vertex, but it is initialized as a list []. This is incorrect as vision.Vertex represents a single vertex object, not a list of them. To improve code clarity and correctness for static analysis tools, the type hint should be changed to List[Dict[str, int]] or List[vision.Vertex].

Suggested change
- v: vision.Vertex = []
+ v: List[Dict[str, int]] = []
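
As a small illustration of the corrected annotation, the sketch below builds a list of vertex dicts with the suggested type. How `v` is actually populated is not shown in this thread, so `bounding_box_vertices` and its corner ordering are hypothetical.

```python
from typing import Dict, List


def bounding_box_vertices(
    x_min: int, y_min: int, x_max: int, y_max: int
) -> List[Dict[str, int]]:
    """Return the four corners of a box, clockwise from top-left, as plain dicts."""
    # The annotation now matches the value: a list of x/y mappings, not one vertex.
    v: List[Dict[str, int]] = []
    for x, y in ((x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)):
        v.append({"x": x, "y": y})
    return v
```

A static type checker would accept this, whereas annotating an empty list as a single vertex object would be reported as a mismatch.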

@parthea parthea self-assigned this Mar 2, 2026
@parthea parthea added the "do not merge" label Mar 2, 2026
@parthea parthea marked this pull request as ready for review March 5, 2026 20:41
@parthea parthea requested review from a team as code owners March 5, 2026 20:41
@snippet-bot

snippet-bot bot commented Mar 5, 2026

No region tags are edited in this PR.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add snippet-bot:force-run label or use the checkbox below:

  • Refresh this comment

Labels

do not merge Indicates a pull request not ready for merge, due to either quality or timing.
