Raphael Krupinski edited this page Nov 10, 2024 · 1 revision

JSON Schema

Representing JSON Schema as python types.

The examples below use YAML notation of JSON Schema version draft-wright-json-schema-00.

Set theory

JSON is a data format representing a limited collection of basic data types.

JSON Schema is a language expressing a set of constraints on JSON values, where an empty schema means any JSON value, and set theory can be used to manipulate these constraints.

Python model types, while able to represent the same basic types as JSON, describe what can be stored in memory; an empty model represents no data, which is the inverse of how JSON Schema describes values.

!!! note By default, object instances in python are just dictionaries with OOP syntax, but IDEs and type-checking tools treat them more akin to C structs - if a field is not declared, it's not there.

Simplified, JSON Schemas can be transformed to python model with the following formula

python model = any JSON type - declared JSON Schema constraints

type

  1. Since the empty Schema object validates any JSON value, let's consider it a Union type:

     {}
    

    =>

     dict | list | float | int | str | bool
    

Type-specific constraints

Most of the constraints are type-specific (they apply only to values of a single type).

The exceptions are:

  • nullable: extends the allowed types with null, but only if type is specified in the same schema
  • enum: only allows the values specified in the list
  • numeric constraints, which apply to both number and integer

That means most constraints can be processed separately, which is useful when they occur together with allOf, anyOf, oneOf and not.

nullable

  1. When type is present and nullable is true, the allowed types are extended with null. Three cases are possible:

    1. any type but null

       {}
      
    2. single type

       type: integer
      

      =>

       int
      
    3. single type or null

       type: integer
       nullable: true
      

      =>

       int | None
      
  2. Any combination of types is possible with anyOf/oneOf.

     anyOf:
     -   type: string
     -   type: integer
         nullable: true
    

    =>

     str | int | None
    

enum

  1. If enum is in anyOf sub-schemas, the values are summed as sets.

  2. If enum is in oneOf sub-schemas, only the values that occur once can be validated.

  3. If enum is in allOf sub-schemas, only the common values can be validated.
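The three rules above can be sketched with plain set operations, assuming hashable scalar enum values (function names are illustrative):

```python
from collections import Counter

def enum_anyof(*enums: list) -> set:
    # anyOf: the values are summed as sets
    return set().union(*map(set, enums))

def enum_allof(*enums: list) -> set:
    # allOf: only the common values validate
    return set.intersection(*map(set, enums))

def enum_oneof(*enums: list) -> set:
    # oneOf: only values appearing in exactly one sub-schema validate
    counts = Counter(v for e in enums for v in set(e))
    return {v for v, n in counts.items() if n == 1}
```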

enum as Literal

  1. A scalar enum could be translated literally:

     enum:
     - true
     - false
     - FileNotFound
    

    =>

     Literal[True, False, 'FileNotFound']
    

    or grouped by type:

     Union[
         Literal[True, False],
         StrLiteral['FileNotFound'],
     ]
    

enum as enum.Enum

  1. Both scalar and non-scalar literals could be translated as python enums, but that would require names

     type: object
     enum:
     - key: value
    

    =>

     class ${schema}Enum(Enum):
         elem${idx} = $schema(key='value')
    

    This solution could introduce unintentional breaking changes when simply changing the order of enum elements, unless the elements were named with some extension keyword.

    It would work for scalar values

     enum:
     - true
     - false
     - FileNotFound
    

    =>

     class ${schema}Enum(Enum):
         value_true = True
         value_false = False
         value_FileNotFound = 'FileNotFound'
    

Non-scalar enum values

Non-scalar enum values don't have natural names, but a hash of stringified value could be used.

Also, creating an arbitrary number of objects that might never be used would be expensive and wasteful, so factory methods could be used:

enum:
-   id: 1
    name: LoL
    slug: league-of-legends

=>

def value_5d9b08cdd67689d128f7c30f885f273c():
    return CurrentVideogame(
        id=1,
        name='LoL',
        slug='league-of-legends',
    )

The problem with this solution is that the name changes when keys or any value changes, which may or may not be desirable from the user-developer perspective.
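The hash-based naming can be sketched with hashlib and a canonical JSON dump. Using md5 of the stringified value is an assumption here; the specific digest shown in the example above is not reproduced:

```python
import hashlib
import json

def enum_member_name(value: object) -> str:
    """Derive a deterministic member name from a non-scalar enum value."""
    # sort_keys makes the name independent of key order in the source document,
    # but, as noted above, renaming a key or changing a value changes the name
    canonical = json.dumps(value, sort_keys=True)
    return 'value_' + hashlib.md5(canonical.encode()).hexdigest()
```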

Scalar constraints

  1. Constraint keywords can be grouped by type (both numeric types together) and processed as such.

     maximum: 10
     maxLength: 10
     anyOf:
     - type: string
     - type: integer
    

    =>

     anyOf:
     - type: string
       maxLength: 10
     - type: integer
       maximum: 10
    

    =>

     Union[
         Annotated[str, Field(max_length=10)],
         Annotated[int, Field(le=10)]
     ]
    
  2. There might be more than one element for a given type:

     anyOf:
     - type: integer
       maximum: 10
     - type: integer
       minimum: 20
    

    =>

     Union[
         Annotated[int, Field(le=10)],
         Annotated[int, Field(ge=20)],
     ]
    
  3. The above is different from this, which is a bottom type (no value can validate against it, since no number is both at least 20 and at most 10):

     type: integer
     maximum: 10
     minimum: 20
    
  4. As somewhat of a special case, this schema is alright:

     maximum: 10
     minimum: 20
    

    =>

     str | bool | dict | list
    
  5. allOf applies the most restrictive set of constraints.

     allOf:
     - maximum: 10
     - maximum: 20
    

    =>

     maximum: 10
    

Object constraints

  1. JSON type object could be mapped to dict or, with some limitations, to TypedDict or a model in one of the data modelling libraries like dataclasses, pydantic, msgspec, etc. Here the choice falls on pydantic, which seems the most feature-complete.

  2. In the most trivial case (from python's perspective), properties can be translated to instance fields in a model class:

     additionalProperties: false
     properties:
       name:
         type: string
     required:
     - name
    

    =>

     class $name(BaseModel):
       name: str
    
  3. In case of an empty schema, it's impossible to say anything about its possible contents.

    It could be mapped to dict, but then adding a property would cause an incompatible change in the python code. Instead, it can be translated to an empty model class with extra = 'allow':

     {}
    

    =>

     class $name(BaseModel):
         model_config = pydantic.ConfigDict(
             extra='allow'
         )
    

    The problem with this form is that there's no way to know what to do with the extra object values. An empty schema has the default:

     additionalProperties: true
    

    which is the same as

     additionalProperties: {}
    

    which means such a definition is infinitely recursive. We could model it as a simple dict (in a union with other types) or as a common model class

     class AnyObject(BaseModel):
         model_config = pydantic.ConfigDict(
             extra='allow'
         )
    

    but in either case adding a property (particularly a non-required property, which is a compatible change in the schema) leads to an incompatible change in the python model.

additionalProperties

The value is processed as a JSON Schema.

  1. true

    Allows extra fields of any type. See above.

  2. false

    Forbids extra fields

     class $name(BaseModel):
         model_config = pydantic.ConfigDict(
             extra='forbid'
         )
    
  3. A schema definition.

    Allows extra fields and, if a non-empty schema is used, generates a type for them:

     type: object
     additionalProperties:
         type: integer
    

    =>

     class $name(BaseModel):
         model_config = pydantic.ConfigDict(
             extra='allow'
         )
    
         __extra__: dict[str, int]
    

patternProperties

The keyword is similar to additionalProperties, except keys must match a regular expression.

This could be implemented as a simple extra field with pre-validation, except when both additionalProperties and patternProperties are present.

The example below describes an object whose digit-only keys (positive integers as strings) have string values, while all other keys have integer values:

type: object
additionalProperties:
    type: integer
patternProperties:
    "\\d+":
        type: string

This could translate to a pydantic model:

class $name(BaseModel):
    model_config = pydantic.ConfigDict(
        extra='allow'
    )

    __extra__: dict[str, int | str]

    _handle_pattern_props = validate(handle_pattern_props)({"\\d+": str})

or an alternative with synthetic fields that group each pattern and schema ('model_' prefix added to decrease chances of name clashes):

class $name(BaseModel):
    model_config = pydantic.ConfigDict(
        extra='allow'
    )

    __extra__: dict[str, int]

    model_pattern_props_xxd: dict[str, str]

    _handle_pattern_props = validate(handle_pattern_props)("\\d+", str)

The second form would offer stricter validation at the cost of introducing synthetic fields and potential name clashes, while the first form would make a model more closely resembling the JSON object, but would allow invalid instances.
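The routing described above can be sketched with plain functions standing in for the pydantic validators. Note that for simplicity this anchors the pattern with re.fullmatch, while JSON Schema patterns are formally unanchored searches; all names here are illustrative:

```python
import re

def split_extras(extras: dict, patterns: dict[str, type], additional: type):
    """Group extra fields by matching pattern; the rest must fit `additional`."""
    by_pattern: dict[str, dict] = {p: {} for p in patterns}
    rest: dict = {}
    for key, value in extras.items():
        for pattern, expected in patterns.items():
            if re.fullmatch(pattern, key):
                if not isinstance(value, expected):
                    raise TypeError(f'{key!r}: expected {expected.__name__}')
                by_pattern[pattern][key] = value
                break
        else:
            # no pattern matched: the additionalProperties schema applies
            if not isinstance(value, additional):
                raise TypeError(f'{key!r}: expected {additional.__name__}')
            rest[key] = value
    return by_pattern, rest
```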

anyOf

  1. When the anyOf keyword is used, the instance validates as long as it validates against at least one of the child schemas, while the validation results against the other child schemas are ignored.

  2. Scalar types can be implemented as Union

     type: integer
     anyOf:
     - maximum: 10
     - minimum: 20
    

    =>

     Union[
         Annotated[int, Field(le=10)],
         Annotated[int, Field(ge=20)],
     ]
    
  3. Objects should probably be implemented as synthetic Union fields instead of plain unions, because the parent schema can declare properties of its own that must live alongside the anyOf alternatives:

     properties:
         length:
             type: integer
     anyOf:
     -   properties:
             height:
                 type: integer
     -   properties:
             width:
                 type: integer
    

    =>

     class $nameAnyOf1(BaseModel):
         height: int
    
     class $nameAnyOf2(BaseModel):
         width: int
    
     class $name(BaseModel):
         length: int
         model_prop_any_of: $nameAnyOf1 | $nameAnyOf2
    

oneOf

  1. Per the specification, the oneOf keyword validates when the value validates against exactly one child schema.

    In practice this is often not the case: oneOf is used as a type union, typically with disjoint sub-schemas, but sometimes erroneously with overlapping ones.

    For this reason it could be processed just like anyOf, although separately: values must validate against one of the oneOf children and one of the anyOf children.

  2. With only the type constraint, anyOf and oneOf are equivalent, since any value can be of only one type:

     oneOf:
     - type: integer
     - type: string
    
     anyOf:
     - type: integer
     - type: string
    

    =>

     int | str
    
  3. With more than one constraint for the same type, interpreting them (also in python) gets more complex

     type: integer
     oneOf:
     - maximum: 20
     - minimum: 10
    

    is equivalent to:

     type: integer
     oneOf:
     - allOf:
       - maximum: 20
       # not minimum: 10
       - maximum: 10
         exclusiveMaximum: true
     - allOf:
       - minimum: 10
       # not maximum: 20
       - minimum: 20
         exclusiveMinimum: true
    

    Since 10 < 20, it reduces to:

     type: integer
     oneOf:
     # matches minimum: 10 but not maximum: 20
     - minimum: 20
       exclusiveMinimum: true
     # matches maximum: 20 but not minimum: 10
     - maximum: 10
       exclusiveMaximum: true
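The equivalence of the expanded and reduced forms can be checked at runtime: `validates_oneof` counts matching children per the exactly-one rule, and `validates_reduced` is the derived form (function names are illustrative):

```python
def validates_oneof(x: int) -> bool:
    """Value matches exactly one of `maximum: 20` and `minimum: 10`."""
    matches = sum([x <= 20, x >= 10])
    return matches == 1

def validates_reduced(x: int) -> bool:
    """The reduced schema: minimum 20 exclusive, or maximum 10 exclusive."""
    return x > 20 or x < 10
```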
    

type and enum

When enum and type are both used, any value present in enum whose type is not listed in type won't validate. Similarly, any value of an allowed type that is absent from enum won't validate.

The allowed values are therefore the set intersection of the two keywords.
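The intersection can be sketched as filtering enum values by JSON type. The bool check matters because in python bool is a subclass of int, while in JSON boolean is not a number (names are illustrative):

```python
# maps a JSON Schema type name to the python types it accepts
JSON_TYPES = {
    'string': str,
    'integer': int,
    'number': (int, float),
    'boolean': bool,
    'object': dict,
    'array': list,
}

def allowed_values(enum: list, json_type: str) -> list:
    """Keep only enum values whose JSON type matches `json_type`."""
    expected = JSON_TYPES[json_type]
    return [
        v for v in enum
        if isinstance(v, expected)
        and not (isinstance(v, bool) and json_type != 'boolean')
    ]
```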

type and constraints

If type is defined, constraints for types not listed are discarded.

The default type value (not defined) means all types are valid and all constraints are considered.

type and anyOf or oneOf

  1. When evaluating type with either oneOf or anyOf, the two are equivalent, since no JSON value can be of more than one type

     anyOf:
     - type: integer
     - type: string
    

    or

     oneOf:
     - type: integer
     - type: string
    

    =>

     int | str
    

type and allOf

allOf applies a set intersection to type:

# implied type: $any
allOf:
- type: integer

=>

type: integer

This is a bottom type:

allOf:
- type: integer
- type: string

type: object and allOf

  1. Since instances must validate against the parent schema and each of the allOf child schemas, the resulting set of properties is the union of the properties from each schema, with the schemas of shared properties merged.

     allOf:
     - properties:
         length:
             type: integer
     - properties:
         width:
             type: integer
    

    =>

     properties:
         length:
             type: integer
         width:
             type: integer
    

    An exception to this rule occurs when the parent schema or any of the child schemas has additionalProperties: false, in which case the resulting set of properties is the intersection of the properties defined in those schemas.

     allOf:
     -   properties:
             length:
                 type: integer
     -   properties:
             width:
                 type: integer
             height:
                 type: integer
         additionalProperties: false
     -   properties:
             length:
                 type: integer
             height:
                 type: integer
         additionalProperties: false
    

    =>

     properties:
         height:
             type: integer
    
  2. In cases where the same property appears in more than one of the allOf sub-schemas, the rules apply as if allOf was defined for that property:

     allOf:
     - properties:
         size:
             maximum: 10
     - properties:
         size:
             maximum: 20
    

    =>

     properties:
         size:
             allOf:
             -   maximum: 10
             -   maximum: 20
    

    =>

     properties:
         size:
             maximum: 10
    

Conflicting schemas

  1. There are many ways of declaring schemas that no value can validate against. Such schemas aren't invalid, but the part describing the single type the keywords apply to must be discarded.

     minimum: 20
     maximum: 10
    

    =>

     oneOf:
     -   type: boolean
     -   type: string
     -   type: object
     -   type: array
    

    but this schema never validates, so it can't be translated to a type:

     type: integer
     minimum: 20
     maximum: 10
    
  2. Conflicting enum values

    The enum keyword applies to all types, so a conflict here makes the schema impossible to translate into a type:

     allOf:
     -   enum: ["red"]
     -   enum: ["green"]
    

References

  1. https://apis.guru/ - a directory of OpenAPI/swagger descriptions.
  2. https://www.learnjsonschema.com/2019-09/ - an extended explanation of JSON Schema keywords. Wrong version, but close enough.