# Data Modeling

> **Advanced concept.** This page assumes familiarity with entities, features, metrics, and relationships. If you're just getting started, read [Entities](/concepts/concepts/entities.md) first. For the basic feature-chaining mechanic (a single metric feature that pulls from a related entity), see [Metrics](/concepts/concepts/metrics.md). This page covers what happens when chains span multiple entities.

Feature chaining is Lynk's mechanism for building features on top of features from other entities. When used across multiple entities, it creates a full data pipeline — values that flow through layers of your semantic graph, each layer building on the one before it.

This page explains how that works and when to use each pattern.

***

## Chain Links Across Entities

Any feature type that sources from another entity is a chain link. That includes:

* **`metric`** — aggregates rows from a related entity (`total_spend_usd` on `player` from `purchase.sum_net_revenue_usd`)
* **`first_last`** — pulls the first or last value from a related entity's rows (`first_purchase_date` on `player` from `purchase`, ordered by `purchase_date`)
* **`field`** — pulls a column from a related entity or source table (`store` on `player` from the player's most recent purchase record)

What makes chaining a pipeline is that a feature on entity B — regardless of type — can be used as input to a metric or formula on entity B, which entity C can then reference. Each layer computes once; downstream entities reference it.

Two components are required at each step:

1. A relationship connecting the two entities
2. A feature on the receiving entity whose `source:` field names the related entity it pulls from (`source: <entity_name>`)

**`formula` features don't chain across entities** — they compute from other features on the same entity, so they can't source from a related entity. But they are essential to data modeling: once a `metric` or `first_last` feature has pulled a value onto an entity, a formula can derive new meaning from it. On `player`, after `total_spend_usd` is chained in from `purchase`, a formula can tier each player:

```yaml
- type: formula
  name: player_segment
  data_type: string
  description: Spend-based segment — 'whale', 'dolphin', or 'minnow'
  sql: >
    CASE
      WHEN {total_spend_usd} > 100 THEN 'whale'
      WHEN {total_spend_usd} > 20  THEN 'dolphin'
      ELSE 'minnow'
    END
```

The formula references `total_spend_usd`, which was itself chained in from `purchase`. That's the pattern: chain the raw value in, derive business meaning with a formula.
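
To make the tier boundaries concrete, here is a minimal Python sketch of the same `CASE` logic. It is an illustration only, not how Lynk evaluates formulas. The function name and the `None` handling are assumptions for the example: a missing value is treated like SQL `NULL`, which fails both comparisons and falls through to `ELSE`.

```python
from typing import Optional


def player_segment(total_spend_usd: Optional[float]) -> str:
    """Mirror the CASE expression: strict '>' comparisons, checked top to bottom."""
    # Assumption: None stands in for SQL NULL. A NULL comparison is never true,
    # so the row reaches the ELSE branch.
    if total_spend_usd is None:
        return "minnow"
    if total_spend_usd > 100:
        return "whale"
    if total_spend_usd > 20:
        return "dolphin"
    return "minnow"


print(player_segment(250.0))  # whale
print(player_segment(100.0))  # dolphin: 100 is not > 100
print(player_segment(20.0))   # minnow: 20 is not > 20
print(player_segment(None))   # minnow
```

Because both comparisons are strict, a player who has spent exactly 100 lands in `dolphin` and one who has spent exactly 20 lands in `minnow`.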

***

## Example: Three Entities, Two Patterns

Arcadia tracks player behavior through three entities: `purchase`, `player`, and `player_cohort`. Each `player_cohort` row represents a group of players who installed on the same date — pre-calculated in the warehouse.

The example below uses `metric` features to show chaining, since aggregations make the pipeline logic most visible. The same patterns apply to `first_last` and `field` features — any feature that sources from another entity participates in the chain.

The relationships are:

* `player` ↔ `purchase`: one player has many purchases
* `player_cohort` ↔ `player`: one cohort has many players
* `player_cohort` ↔ `purchase`: one cohort's players have many purchases (direct)

Two chain patterns come out of these three entities: a linear chain that goes through `player`, and a direct chain that skips it.

### Relationships

```yaml
relationships:
  player-purchase:
    relationship: one_to_many
    description: A player's purchase history
    joins:
      - name: player_to_purchase
        default: true
        join_type: left
        type: sql
        sql: '{source}.{player_id} = {destination}.{player_id}'

  player_cohort-player:
    relationship: one_to_many
    description: All players who belong to a cohort by install date
    joins:
      - name: cohort_to_player
        default: true
        join_type: left
        type: sql
        sql: '{source}.{install_date} = {destination}.{install_date}'

  player_cohort-purchase:
    relationship: one_to_many
    description: All purchases made by players in a cohort — direct path, no player intermediary
    joins:
      - name: cohort_to_purchase
        default: true
        join_type: left
        type: sql
        sql: '{source}.{install_date} = {destination}.{cohort_install_date}'
```

***

### Pattern 1 — Linear Chain (purchase → player → player\_cohort)

Use this when the intermediate aggregation matters. Here, you want cohort-level spend, but averaged *per player* — not summed across all purchases directly. You need the player layer to compute the per-player value first.

**Step 1 — `purchase` defines the base metric.**

```yaml
# purchase entity
metrics:
  - name: sum_net_revenue_usd
    description: Total net revenue from purchases, in USD
    sql: SUM({net_revenue_usd})
```

**Step 2 — `player` chains from `purchase` and defines a new metric.**

```yaml
# player entity
features:
  - type: metric
    name: total_spend_usd
    description: Total net revenue from all purchases by this player, in USD
    data_type: number
    source: purchase
    metric: sum_net_revenue_usd

metrics:
  - name: avg_spend_usd
    description: Average spend per player, in USD
    sql: AVG({total_spend_usd})
```

`total_spend_usd` is a metric feature — it pulls from `purchase`. `avg_spend_usd` is an entity metric — it aggregates over players using that feature as input.

**Step 3 — `player_cohort` chains from `player`.**

```yaml
# player_cohort entity
features:
  - type: metric
    name: avg_player_spend_usd
    description: Average spend per player within this cohort, in USD
    data_type: number
    source: player
    metric: avg_spend_usd
```

`avg_player_spend_usd` on `player_cohort` reflects the whole chain: it evaluates the `player` metric `avg_spend_usd` over the players in each cohort, which averages `total_spend_usd`, which in turn is each player's sum of `net_revenue_usd` from `purchase` (via `sum_net_revenue_usd`). Three entities, one logical pipeline.
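
The arithmetic behind the linear chain can be sketched in plain Python. This is not Lynk code: the sample rows, IDs, and dates are made up for illustration, and the sketch assumes each purchase row carries `player_id` and `net_revenue_usd` and each player row carries `install_date`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data (not from the Arcadia dataset).
purchases = [
    {"player_id": "p1", "net_revenue_usd": 10.0},
    {"player_id": "p1", "net_revenue_usd": 30.0},
    {"player_id": "p1", "net_revenue_usd": 20.0},
    {"player_id": "p2", "net_revenue_usd": 6.0},
    {"player_id": "p3", "net_revenue_usd": 50.0},
]
# player_id -> install_date (the cohort key)
players = {"p1": "2024-01-01", "p2": "2024-01-01", "p3": "2024-01-02"}

# Layer 1 (purchase -> player): total_spend_usd = SUM(net_revenue_usd) per player.
total_spend_usd = defaultdict(float)
for row in purchases:
    total_spend_usd[row["player_id"]] += row["net_revenue_usd"]

# Layer 2 (player -> player_cohort): AVG(total_spend_usd) over each cohort's players.
# Every sample player has at least one purchase; how players without purchases
# are treated (NULL vs. 0) is outside the scope of this sketch.
spend_by_cohort = defaultdict(list)
for player_id, install_date in players.items():
    spend_by_cohort[install_date].append(total_spend_usd[player_id])

avg_player_spend_usd = {cohort: mean(values) for cohort, values in spend_by_cohort.items()}

print(avg_player_spend_usd)  # {'2024-01-01': 33.0, '2024-01-02': 50.0}
```

For the first cohort, averaging the four purchase rows directly would give 66 / 4 = 16.5, a per-purchase average. Summing per player first (60 and 6) and then averaging gives 33.0, which is the per-player figure the intermediate `player` layer exists to produce.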

***

### Pattern 2 — Direct Chain (purchase → player\_cohort)

Use this when you don't need the intermediate layer. Cohort total revenue is a simple sum — no per-player averaging required. Going through `player` would add an unnecessary aggregation step.

```yaml
# player_cohort entity
features:
  - type: metric
    name: cohort_total_revenue_usd
    description: Total net revenue from all purchases made by players in this cohort, in USD
    data_type: number
    source: purchase
    metric: sum_net_revenue_usd
```

Same source metric (`purchase.sum_net_revenue_usd`), different path. The result is different too: `cohort_total_revenue_usd` is a cohort-level sum, not a per-player average.
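
A minimal Python sketch of the direct path, using made-up rows. It assumes each purchase row carries a `cohort_install_date` column, as the `cohort_to_purchase` join implies; it illustrates the arithmetic only and is not Lynk code.

```python
from collections import defaultdict

# Hypothetical purchase rows; cohort_install_date is the key used by cohort_to_purchase.
purchases = [
    {"cohort_install_date": "2024-01-01", "net_revenue_usd": 10.0},
    {"cohort_install_date": "2024-01-01", "net_revenue_usd": 30.0},
    {"cohort_install_date": "2024-01-01", "net_revenue_usd": 20.0},
    {"cohort_install_date": "2024-01-01", "net_revenue_usd": 6.0},
    {"cohort_install_date": "2024-01-02", "net_revenue_usd": 50.0},
]

# Single layer (purchase -> player_cohort): SUM(net_revenue_usd) per cohort.
cohort_total_revenue_usd = defaultdict(float)
for row in purchases:
    cohort_total_revenue_usd[row["cohort_install_date"]] += row["net_revenue_usd"]

print(dict(cohort_total_revenue_usd))  # {'2024-01-01': 66.0, '2024-01-02': 50.0}
```

No player-level grouping appears anywhere in the computation: a sum over purchases gives the same cohort total whether or not it is first split by player, which is why the intermediate layer can be skipped here.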

***

## Linear vs. Direct — Which to Use

|                          | Linear chain                                           | Direct chain                                    |
| ------------------------ | ------------------------------------------------------ | ----------------------------------------------- |
| **Use when**             | The intermediate entity's aggregation is what you need | You need a simple rollup, no intermediate logic |
| **Arcadia example**      | Average spend per player, rolled up to cohort          | Total revenue for all players in a cohort       |
| **Relationships needed** | purchase↔player, player↔player\_cohort                 | player\_cohort↔purchase                         |

Both patterns can coexist on the same entity. `player_cohort` in the example above defines both `avg_player_spend_usd` (linear) and `cohort_total_revenue_usd` (direct) — two different aggregations, two different paths, from the same underlying data.

***

## Related Reference

* [Metrics](/concepts/concepts/metrics.md) — entity metrics, metric features, what `sql:` accepts, metric-over-metric composition
* [Entities](/concepts/concepts/entities.md) — entity anatomy and the full feature type reference
* [Relationships YAML Reference](/file-types-reference/file-types/relationships-yaml.md) — how to define joins and named join paths
* [Entity YAML Reference](/file-types-reference/file-types/entity-yaml.md) — full field reference for defining features and metrics

