[Repo Assist] Add schema.org microdata support to HtmlProvider #1676

Open
github-actions[bot] wants to merge 9 commits into main from repo-assist/fix-issue-611-schema-org-microdata-901d33ea78032988

@github-actions

🤖 This PR was created by Repo Assist, an automated AI assistant.

Implements the feature requested in #611 — adds schema.org microdata inference to HtmlProvider, generating a typed Schemas container from HTML elements with itemscope/itemtype/itemprop attributes.

Closes #611

What's changed

New types (in FSharp.Data.Runtime)

  • HtmlSchemaItem: a single microdata item, { Properties: Map<string, string>; Html: HtmlNode }
  • HtmlSchemaGroup: all items of a given schema type, { Name; TypeUrl; Items: HtmlSchemaItem[]; Properties: string[] }
  • HtmlObjectDescription gains a new SchemaGroup case

Runtime parsing (HtmlRuntime.fs)

  • HtmlRuntime.getSchemas : HtmlDocument -> HtmlSchemaGroup list — discovers all itemscope/itemtype elements, groups by type URL, extracts itemprop values per the HTML microdata spec:
    • <meta> → content attribute
    • <a>, <link> → href attribute (falling back to inner text)
    • <img>, <audio>, <video>, <source> → src attribute
    • <time> → datetime attribute (falling back to inner text)
    • All others → content attribute (falling back to inner text)
    • Nested itemscope elements are not traversed
  • HtmlDocument.GetSchema(id) added for runtime schema lookup
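
The per-element rules above can be sketched as a small pure function. This is only an illustration, not the code in HtmlRuntime.fs: the name itempropValue is hypothetical, the element is modelled as a tag name plus an attribute map rather than an HtmlNode, and returning an empty string when <meta> or a media element lacks its attribute is an assumption, since the description does not say what happens in that case.

```fsharp
// Hypothetical helper (not from the PR): choose the itemprop value for one element.
// The element is modelled as tag name + attribute map + inner text.
let itempropValue (tag: string) (attrs: Map<string, string>) (innerText: string) : string =
    // Look up an attribute, using the given fallback when it is absent.
    let attrOr (name: string) (fallback: string) =
        attrs |> Map.tryFind name |> Option.defaultValue fallback

    match tag.ToLowerInvariant() with
    | "meta" -> attrOr "content" "" // assumption: empty string when content is missing
    | "a"
    | "link" -> attrOr "href" innerText
    | "img"
    | "audio"
    | "video"
    | "source" -> attrOr "src" "" // assumption: empty string when src is missing
    | "time" -> attrOr "datetime" innerText
    | _ -> attrOr "content" innerText

// Usage: an <a> with an href yields the href; a <span> yields its text.
let linkValue = itempropValue "a" (Map.ofList [ "href", "https://example.com/jane" ]) "Homepage"
let spanValue = itempropValue "span" Map.empty "Jane Smith"
```

Keeping the rule independent of the HTML tree makes it easy to unit-test each element kind separately.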

Type generation (HtmlGenerator.fs)

  • createSchemaItemType — generates one provided type per schema group (erased to HtmlSchemaItem), with one string property per discovered itemprop name (Pascal-cased)
  • generateTypes handles SchemaGroup case — adds a Schemas container with one property per schema type, returning SchemaItemType[]
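
As a rough idea of what Pascal-casing an itemprop name involves, here is a simplified sketch. The function pascalCase and its separator list are assumptions made for illustration; the provider is expected to use FSharp.Data's own name normalization, which may handle more cases than this.

```fsharp
open System

// Hypothetical, simplified Pascal-casing (not the library's implementation):
// split on a few common separators and upper-case the first letter of each part.
let pascalCase (name: string) : string =
    name.Split([| '-'; '_'; ' '; ':'; '.' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.map (fun part -> string (Char.ToUpperInvariant(part.[0])) + part.Substring(1))
    |> String.concat ""

// Usage: the itemprop names from the sample map to the property names used on the items.
let jobTitleProperty = pascalCase "jobTitle"
let urlProperty = pascalCase "url"
```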

Usage example

[<Literal>]
let sampleHtml = """<html><body>
  <div itemscope itemtype="(schema.org/redacted)">
    <span itemprop="name">Jane Smith</span>
    <span itemprop="jobTitle">Software Engineer</span>
    <a itemprop="url" href="(example.com/redacted)">Homepage</a>
  </div>
</body></html>"""

type MyPage = HtmlProvider<sampleHtml>

let doc = MyPage.GetSample()
for person in doc.Schemas.Person do
    printfn "%s, %s (%s)" person.Name person.JobTitle person.Url

Test Status

  • ✅ Build: succeeded (19 pre-existing warnings, 0 errors)
  • FSharp.Data.Core.Tests HTML/schema unit tests: 2862 passed (6 new schema tests added)
  • FSharp.Data.Tests HtmlProvider integration tests: 38 passed (3 new schema integration tests added)
  • ✅ Fantomas format check: passed

Generated by Repo Assist

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@afb00b92a9514fee9a14c583f059a03d05738f70

- Parse itemscope/itemtype/itemprop HTML microdata attributes at design time
- Generate a typed 'Schemas' container on HtmlProvider documents
- Each schema type (e.g. http://schema.org/Person) becomes a property
  returning an array of typed items with one property per itemprop name
- Items are erased to HtmlSchemaItem at runtime
- Property values follow the HTML microdata spec: content attr, href,
  src, datetime, or inner text depending on element type
- Nested itemscope elements are not traversed (correct per spec)
- 6 unit tests (HtmlRuntime.getSchemas) + 3 integration tests (HtmlProvider)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dsyme

dsyme commented Feb 28, 2026

/repo-assist Please investigate CI failures and fix.

@github-actions

Commit pushed: 753081b


github-actions bot and others added 2 commits February 28, 2026 12:23
The HtmlProvider now generates schema types for HTML elements with
itemscope/itemtype/itemprop attributes. Update the expected signature
snapshots for the 4 HTML test files that contain microdata
(zoopla.html, zoopla2.html, ebay_cars.htm, imdb_chart.htm) so the
DesignTime tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dsyme

dsyme commented Feb 28, 2026

/repo-assist please try this out on some real websites: develop some scripts that use Wikipedia, IMDB, maybe eBay

I've added these to the network firewall

Post the samples back here and put a couple in the docs.

Update the docs too (you haven't done that yet), and review all the HtmlProvider docs and improve them.

…ema.org microdata section

- Update introduction to clearly explain table naming, column type inference,
  and when to use the provider
- Add Wikipedia S&P 500 companies example demonstrating groupBy analysis
- Add schema.org microdata section showing ProductCatalog and mixed-page samples
- Update NuGet stats example with improved regex and formatting
- Rename Doctor Who groupBy variable for clarity
- Add note about JSON-LD on JS-rendered sites (IMDB, eBay)
- Remove outdated 'Introducing the provider' framing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Commit pushed: 9da8e6c



@dsyme

dsyme commented Feb 28, 2026

/repo-assist OK, looks like we should be supporting JSON-LD as well. Could you add that to this PR and redo more samples? Find out what Wikipedia actually supports for structured data schemas too (not just tables), develop samples around that similarly, and get back to me.

Adds native JSON-LD support to HtmlProvider: when an HTML document contains
<script type="application/ld+json"> blocks, the provider now generates a
typed .JsonLd container (e.g. doc.JsonLd.Article) with one strongly-typed
string property per top-level scalar field discovered in the sample.

This mirrors the .Schemas container added for HTML microdata and is especially
useful for Wikipedia pages (and many modern websites) which embed schema.org
Article/WebPage/Person JSON-LD in the <head> element for SEO purposes.

Changes:
- Add HtmlJsonLdItem and HtmlJsonLdGroup types to HtmlRuntime.fs
- Add JsonLdGroup case to HtmlObjectDescription
- Add getJsonLd parser (finds script[type=application/ld+json] elements,
  parses JSON, groups by @type, flattens scalar properties into Map<string,string>)
- Add GetJsonLd(id) method to HtmlDocument
- Update HtmlGenerator to generate a typed JsonLd container
- Add FSharp.Data.Json.Core project reference to Html.Core
- Add 5 runtime tests + 6 integration tests
- Rewrite HtmlProvider.fsx with Wikipedia JSON-LD examples and
  summary table of all three formats (tables/microdata/JSON-LD)
- Update RELEASE_NOTES.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

Commit pushed: c4f229d


@github-actions

🤖 This is an automated response from Repo Assist.

Done! I've added JSON-LD support to this PR and revised the documentation with Wikipedia-focused examples. Here's a summary of what was added:

JSON-LD support (doc.JsonLd.*)

When an HTML page contains <script type="application/ld+json"> blocks, HtmlProvider now generates a typed .JsonLd container alongside .Tables and .Schemas. For example:

// Wikipedia embeds schema.org Article JSON-LD on every article page
type WikiPage = HtmlProvider<"(en.wikipedia.org/redacted)">

let doc = WikiPage.Load(...)
let article = doc.JsonLd.Article.[0]
printfn "%s — published %s" article.Name article.DatePublished

Each @type group gets its own strongly-typed property with Pascal-cased string accessors for every top-level scalar field. A .Raw property gives the raw JSON for accessing complex nested values (images, authors, etc.) with JsonValue.Parse.
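
To make "top-level scalar field" concrete, the following sketch flattens one JSON-LD object into a string map. It is not the PR's getJsonLd parser: it uses System.Text.Json instead of FSharp.Data's JsonValue so that it runs without extra references, the name flattenScalars is hypothetical, and it assumes the block's root is a single JSON object (arrays and @graph wrappers are not handled).

```fsharp
open System.Text.Json

// Hypothetical sketch (not the PR's parser): keep only top-level scalar
// properties of a JSON-LD object, rendered as strings. Nested objects,
// arrays and nulls are skipped. Throws if the root is not a JSON object.
let flattenScalars (json: string) : Map<string, string> =
    use doc = JsonDocument.Parse(json)

    doc.RootElement.EnumerateObject()
    |> Seq.choose (fun (p: JsonProperty) ->
        match p.Value.ValueKind with
        | JsonValueKind.String -> Some(p.Name, p.Value.GetString())
        | JsonValueKind.Number
        | JsonValueKind.True
        | JsonValueKind.False -> Some(p.Name, p.Value.GetRawText())
        | _ -> None)
    |> Map.ofSeq // built eagerly, before the document is disposed

// Usage with a made-up block: "image" is nested, so it is dropped.
let sampleBlock =
    """{"@type":"Article","name":"F Sharp","position":3,"image":{"url":"x"}}"""

let scalars = flattenScalars sampleBlock
```

In this sketch the @type value ends up in the map like any other string field, which is enough to group items by type afterwards; nested values such as image would still need the raw JSON.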

What Wikipedia actually uses for structured data

Wikipedia serves three formats — all now supported by HtmlProvider:

Format | Access | What it contains
HTML tables | .Tables.* | Statistics, event lists, comparisons
HTML microdata | .Schemas.* | Inline entity markup (products, events)
JSON-LD | .JsonLd.* | Article metadata: name, headline, description, url, datePublished, dateModified, license, inLanguage

Wikipedia embeds a schema.org/Article JSON-LD block in <head> on every article page. Person and place pages additionally carry a mainEntity link to the Wikidata item, plus sameAs cross-references.

Changes in this push

  • HtmlRuntime.fs: HtmlJsonLdItem, HtmlJsonLdGroup types; getJsonLd parser; GetJsonLd(id) on HtmlDocument
  • HtmlGenerator.fs: createJsonLdItemType; JsonLdGroup case generating a JsonLd container
  • FSharp.Data.Html.Core.fsproj: added FSharp.Data.Json.Core project reference (for JsonValue.Parse)
  • HtmlProvider.fsx: full rewrite of the structured-data sections with four Wikipedia-based examples and a format summary table
  • Tests: 5 new runtime unit tests + 6 new HtmlProvider integration tests (all green)

Test status: DesignTime 489 ✅ · Integration 285 ✅ · Core 2867 ✅ · Fantomas format check ✅

Generated by Repo Assist for issue #1676


@dsyme dsyme marked this pull request as ready for review March 1, 2026 13:39

Development

Successfully merging this pull request may close these issues.

[Html Provider] Use schema.org attributes if present on a page to generate an API, in addition to the .Tables member
