File Based Lineage
This plugin pulls lineage metadata from a YAML-formatted file. An example of such a file is located in the examples directory here.
CLI based Ingestion
Install the Plugin
pip install 'acryl-datahub[datahub-lineage-file]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: datahub-lineage-file
  config:
    # Coordinates
    file: /path/to/file_lineage.yml
    # Whether we want to query datahub-gms for upstream data
    preserve_upstream: False
sink:
# sink configs
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
| Field [Required] | Type | Description | Default | Notes |
|---|---|---|---|---|
| file [✅] | string | Path to lineage file to ingest. This may also be in the form of a URL. | | |
| preserve_upstream | boolean | Whether we want to query datahub-gms for upstream data. False means it will hard replace upstream data for a given entity. True means it will query the backend for existing upstreams and include it in the ingestion run | True | |
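As a hedged sketch of how the two options combine, the recipe below reads the lineage file from a URL and keeps existing upstreams by setting preserve_upstream to true. The URL and the sink server address are placeholders assumed for illustration, not values taken from this guide.

```yaml
source:
  type: datahub-lineage-file
  config:
    # Hypothetical URL; the file option accepts a local path or a URL
    file: https://example.com/lineage/file_lineage.yml
    # Query datahub-gms for existing upstreams and include them in this run
    preserve_upstream: true
sink:
  # Assumed sink for illustration; replace with your own sink configuration
  type: datahub-rest
  config:
    server: http://localhost:8080
```

Setting preserve_upstream to false instead would hard replace the upstream data for each entity listed in the file.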
The JSONSchema for this configuration is inlined below.
{
  "title": "LineageFileSourceConfig",
  "type": "object",
  "properties": {
    "file": {
      "title": "File",
      "description": "Path to lineage file to ingest. This may also be in the form of a URL.",
      "type": "string"
    },
    "preserve_upstream": {
      "title": "Preserve Upstream",
      "description": "Whether we want to query datahub-gms for upstream data. False means it will hard replace upstream data for a given entity. True means it will query the backend for existing upstreams and include it in the ingestion run",
      "default": true,
      "type": "boolean"
    }
  },
  "required": [
    "file"
  ],
  "additionalProperties": false
}
Lineage File Format
The lineage source file should be a .yml file with the following top-level keys:
version: the version of the lineage file format that the file conforms to. Currently, the only released version is 1.
lineage: the top-level key of the lineage file, containing a list of EntityNodeConfig objects
EntityNodeConfig:
- entity: EntityConfig object
- upstream: (optional) list of child EntityNodeConfig objects
EntityConfig:
- name: identifier of the entity. Typically the name or GUID, as used in constructing the entity URN.
- type: type of the entity (only dataset is supported as of now)
- env: the environment of this entity. Should match the values in the table here
- platform: a valid platform like kafka, snowflake, etc.
- platform_instance: optional string specifying the platform instance of this entity
For example, if the dataset URN is urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.customer_table,DEV), then the EntityConfig will look like:
name: userdb.public.customer_table
type: dataset
env: DEV
platform: redshift
You can also view an example lineage file checked in here.
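To show how the keys described in this section fit together, here is a minimal sketch of a complete lineage file. The upstream dataset names are made up for illustration; only the key names and the redshift example entity come from the format described above.

```yaml
version: 1
lineage:
  # One EntityNodeConfig: a downstream entity plus its direct upstreams
  - entity:
      name: userdb.public.customer_table
      type: dataset
      env: DEV
      platform: redshift
    upstream:
      - entity:
          name: customer_events  # hypothetical kafka topic
          type: dataset
          env: DEV
          platform: kafka
      - entity:
          name: userdb.public.raw_customers  # hypothetical redshift table
          type: dataset
          env: DEV
          platform: redshift
```

Under these assumptions, ingesting the file is expected to record the two listed datasets as upstreams of userdb.public.customer_table.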
Code Coordinates
- Class Name: datahub.ingestion.source.metadata.lineage.LineageFileSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for File Based Lineage, feel free to ping us on our Slack.