Skip to content

Conversation

@ubik2
Copy link
Contributor

@ubik2 ubik2 commented May 7, 2025

This is mostly plumbing to pass classification data through the system and enforce it.

As a concrete demonstration, the tokens used for OAuth are marked "secret" so they cannot be accessed over the network by a schema query that doesn't include the necessary classification claims.


Summary by mrge

Added classification enforcement to the data system, so that data marked as "secret" or "confidential" cannot be accessed without the correct claims. Updated schema handling, storage, and tests to support classification tags and authorization checks.

  • New Features

    • Enforced classification tags in schema queries and storage.
    • Added support for "secret" and "confidential" labels in schemas and data.
    • Introduced a CLI tool for reading and writing classified data.
    • Updated OAuth and Gmail recipes to mark tokens as secret.
  • Bug Fixes

    • Fixed schema traversal to correctly handle classification and additionalProperties.
    • Improved test coverage for classified data access and authorization errors.

ubik2 added 23 commits April 18, 2025 13:47
Update a few "satisfies JSONSchema" to use as const.
Tag the AuthSchema for google-oauth
add labels field to StorageValue to track classification tags
better boolean schema handling in cfc and general api cleanup
break out loadDocFacts from selectSchema function to avoid megafunction
made SchemaQueryArgs its own type
fixed schema empty check
- Handle classification tags in selectSchema
- Pass AuthorizationErrors through
- Removed includedQueryView helper, since RemoteStorageProvider no longer exists
- Added classification tests.
…ved or true if it should be retained.

Copy the set of subscribers before iterating, since a subscriber could trigger another subscription.
Fix an issue in provider where we weren't including the space in our fact address, and weren't matching some records.
Fixed a potential deadlock in provider where our subsequent querySchema while in the perform loop seemed to deadlock. Bypassing the perform queue here addresses this (we're already in the perform block).
Minor change to subscription to avoid shadowing a variable name
Added a comment to remind me to follow up and check that we are getting the subscription updates properly on the client.
- opaque-ref.ts
  - Use ContextualFlowControl to walk schemas, since this needs to track any ifc tags encountered.
    - Rename one of the schema variables to nestedSchema to avoid shadowing the other.
    - Use our values of path and rootSchema when exporting, rather than those of the store.
- recipe.ts: When constructing an opaqueRef, if our argumentSchema is just a string, don't pass that as the schema.
- types.ts: Added the definitions field for future $ref implementation
- opaque-ref-schema-test.ts: Changed provided schema to exclude additional properties. The JSONSchema default is to allow arbitrary keys on an object, but I wasn't previously handling this correctly.
- recipe.test.ts: Correctly expect the ifc tags in the schema
- iterate.ts: Changed scrubbed schemas to have additionalProperties false to disable other keys.
- space-schema.ts: Moved isTrueSchema into ContextualFlowControl and use that class for this check.
- gmail.tsx: include the Classification constants here directly
- cell.ts
  - Use ContextualFlowControl class to descend the schema
  - Pass schema and rootSchema around much more
- cfc.ts
  - Moved the isTruSchema and isInternalSchemaKey methods here
  - Added a getSchemaAtPath method, which better matches the expected interface for the rest of the system.
  - Allow lubSchema to return undefined if there is no classification requirement.
  - Corrected behavior when we encounter properties of an object which aren't included in the properties and when we don't have the additionalProperties property.
  - Handle the $ref of `#`, which requires passing the rootSchema in as well.
- doc.ts
  - Added JSONSchema to setAtPath method, since we need that for the logs and callbacks.
  - Changed the callback signature for the updates, to have an optional labels argument (which contains any tags we store in the label).
  - In setAtPath, if the object changed, generate the appropriate labels prior to invoking callbacks.
- schema.ts
  - Use ContextualFlowControl to detect the true schema, instead of checking for the empty object.
  - Pass schemas to resolveLinkToAlias and resolveLinks
  - Fix some cases where we didn't have the correct schema when generating cells in validateAndTransform.
  - Altered some places where we log reads to include more of the ref, since we may want to know about the schema.
  - Use ContextualFlowControl classes instead of walking the schema directly when walking the path.
- storage.ts
  - Added Labels class
  - Changed _batchForStorage to take an optional labels parameter, which will then be included in the StorageValue.
  - Changed _subscribeToChanges so our callback gets the labels.
- utils.ts
  - Added schema and rootSchema parameters to resolveLinkToValues, resolveLinkToAliases, and resolvePath.
  - Use ContextualFlowControl for schema walking.
  - Pass more complete CellLinks to the read log
  - Removed export of ChangeSet type, since this wasn't used elsewhere.
  - Generally track schema and rootSchema more in CellLinks.
  - When generating the change for an array length alteration, include any of the classification values that would have applied to the array.
- base.ts: Changed labels to be the Labels type instead of just a string[], to be more extensible in the future.
- cache.ts
  - Include deepEqual to compare label changes
  - If a StorageValue has any labels, generate a write to the `application/label+json` attribute of that entity.
- schema-lineage.test.ts: Provide the additionalProperties: false schema entry when we want to exclude unexpected keys.
schema.test.ts: Changed a couple tests to account for the schema in the result.
utils.test.ts: Changed a test to account for a schema in the result.
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mrge found 18 issues across 35 files. Review them in mrge.io

@ubik2
Copy link
Contributor Author

ubik2 commented May 8, 2025

Integration tests are failing, but I think this is unrelated.

Error: Browser Console [error]: Search charms error: HTTP error! status: 400, body: {"error":"No response from LLM"}

Edit: Also, I wrote this error off because there were a bunch of other PRs with red status, but those aren't running into the same error as me, so I may have something else broken. Investigating more.
Edit2: Thanks to Jordan for tracking this down. This is because a schema change meant we weren't able to use cached LLM responses, so those needed to be regenerated. Jordan took care of that, and things are working again.

@jsantell
Copy link
Collaborator

jsantell commented May 8, 2025

Integration tests are failing, but I think this is unrelated.

Error: Browser Console [error]: Search charms error: HTTP error! status: 400, body: {"error":"No response from LLM"}

This is error is in previous (successful) CI runs FWIW, not from this PR

* add log message to debug on server (works locally)

* log before and after the llm request

* Update LLM cache

* Remove logging that was added to debug

* Better error message in integration tests when workflow fails.

---------

Co-authored-by: Jordan Santell <jordan@common.tools>
? {
...data.schema,
properties: scrubbed,
additionalProperties: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't additionalProperties: false the default? I saw you added that here and in a few other places, curious why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't the default, though I thought it was prior to this patch.
The text in the doc isn't very direct, but leaving it out is the same as the empty schema (which always matches).
https://json-schema.org/draft/2020-12/json-schema-core#section-10.3.2.3-5

Omitting this keyword has the same assertion behavior as an empty schema.

The guide page is a bit more direct:
https://json-schema.org/understanding-json-schema/reference/object#additionalproperties

By default any additional properties are allowed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh!, thanks for clarifying, I did indeed have it the other way around. Sigh. Makes schemas even longer by default…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, or maybe we should flip it in our case, at least for the query case: true would mean that we want to always read all other fields, which maximizes taint unless this is specified. On write it means that any other property is accepted, which is less problematic, but also not a great default behavior.

We need to start documenting our version of schema and the differences to json schema and openapi schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to align with the standard here, even though it's more verbose.
We do have the scrubbed versions of our schema, and if we wanted to use that as the transformer between ours and the standard, we could, However, I think it's likely to cause confusion, especially as we use the LLMs to generate schema.

@ubik2 ubik2 merged commit cd72178 into main May 12, 2025
6 checks passed
@ubik2 ubik2 deleted the feat/cfc2 branch May 12, 2025 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants