skillguard/.agents/memory/skill-fingerprint-matching.md

---
name: Skill fingerprint & relation matching
description: How SkillGuard decides new/identical/modified between scans, and three traps that break it.
---

# Skill fingerprint & relation matching

The overall fingerprint is a SHA-256 over sorted `path\u0000fileHash` pairs. Relation
detection: exact fingerprint match → `identical`; else the best content-aware match, if it
clears the similarity threshold or shares a byte-identical file → `modified`; else `new`.
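
A minimal sketch of the fingerprint described above. The `FileEntry` shape, the `overallFingerprint` name, and the `"\n"` separator between pairs are assumptions made for illustration; the real code in `skillFingerprint.ts` may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape; the real per-file record may carry more fields.
interface FileEntry {
  path: string;
  hash: string; // hex SHA-256 of the file's bytes
}

const sha256Hex = (data: string): string =>
  createHash("sha256").update(data, "utf8").digest("hex");

// Build "path\u0000fileHash" pairs, sort them, and hash the joined list.
// Sorting makes the result independent of file order.
// Assumption: pairs are joined with "\n" before hashing.
function overallFingerprint(files: FileEntry[]): string {
  const pairs = files.map((f) => `${f.path}\u0000${f.hash}`).sort();
  return sha256Hex(pairs.join("\n"));
}

const skillFile: FileEntry = { path: "SKILL.md", hash: sha256Hex("# Demo skill\n") };
const scriptFile: FileEntry = { path: "scripts/run.sh", hash: sha256Hex("echo hi\n") };
const editedSkillFile: FileEntry = { path: "SKILL.md", hash: sha256Hex("# Demo skill v2\n") };

const entries = [skillFile, scriptFile];
const reordered = [scriptFile, skillFile];
const edited = [editedSkillFile, scriptFile];
```

Under these assumptions `entries` and `reordered` produce the same fingerprint, while `edited` produces a different one, which is the property the `identical` check relies on.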

## Trap 1 — display name must not leak into the fingerprint
Text-pasted skills are parsed into a single file. That file's `path` must be a **stable
constant** (`SKILL.md`), NOT the user-supplied scan name. If the name is used as the path,
two byte-identical pastes with different names get different fingerprints and are
mis-classified as `modified` (sim 100) instead of `identical`.
**Why:** the fingerprint is meant to identify content/structure, not the display label.
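
The rule can be sketched as follows. `ParsedSkill`, `TEXT_SKILL_PATH`, and `fingerprintInput` are hypothetical names, and `fingerprintInput` uses the raw content as a stand-in for the per-file hash; the real `parseText` in `skillParser.ts` may look different.

```typescript
// Assumption: a single constant path for every text paste.
const TEXT_SKILL_PATH = "SKILL.md";

interface ParsedSkill {
  displayName: string; // UI label only, never part of the fingerprint
  files: { path: string; content: string }[];
}

// The display name is stored next to the files, not used as a file path.
function parseText(displayName: string, content: string): ParsedSkill {
  return { displayName, files: [{ path: TEXT_SKILL_PATH, content }] };
}

// Stand-in for the data that feeds the fingerprint: paths and file content only.
const fingerprintInput = (skill: ParsedSkill): string =>
  skill.files
    .map((f) => `${f.path}\u0000${f.content}`)
    .sort()
    .join("\n");

const first = parseText("My skill", "# Same body\n");
const second = parseText("Renamed copy", "# Same body\n");
```

Because `displayName` never reaches `fingerprintInput`, `first` and `second` feed identical data into the fingerprint and would be classified `identical`.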

## Trap 2 — Jaccard over file *hashes* can't detect single-file modifications
For a single-file skill, any content edit changes the one hash completely, so Jaccard over
the hash set is 0 → wrongly classified `new`, and the compare/diff view (the whole point of
the feature) never links the two versions.
**Fix / how to apply:** match candidate scans by Jaccard over file **paths** (tie-break by
hash overlap; Trap 3 later replaces this selection step), then report `similarity` as a
content-aware score: identical files (same hash) count 1.0, changed text files use line-level
LCS ratio (`lineSimilarity` = 2·LCS/(a+b)), added/removed or changed-binary files count 0.
This yields a meaningful % for single-file edits (e.g. one added line ≈ 90%) and still
reduces to hash-equality for unchanged files.
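
To make the formula concrete, here is a small sketch of a line-level LCS ratio next to a hash-set Jaccard. It assumes lines are split on `"\n"` and that `a` and `b` in the formula are the two line counts; the actual `lineSimilarity` in `lineDiff.ts` may handle line endings or whitespace differently.

```typescript
// Jaccard index of two sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const intersection = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : intersection / union;
}

// Length of the longest common subsequence of two line arrays (row-by-row DP).
function lcsLength(x: string[], y: string[]): number {
  let prev: number[] = new Array(y.length + 1).fill(0);
  for (let i = 1; i <= x.length; i++) {
    const curr: number[] = new Array(y.length + 1).fill(0);
    for (let j = 1; j <= y.length; j++) {
      curr[j] =
        x[i - 1] === y[j - 1]
          ? prev[j - 1] + 1
          : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[y.length];
}

// lineSimilarity = 2·LCS / (a + b), where a and b are the line counts.
function lineSimilarity(oldText: string, newText: string): number {
  const x = oldText.split("\n");
  const y = newText.split("\n");
  return (2 * lcsLength(x, y)) / (x.length + y.length);
}

const before = "a\nb\nc\nd\ne";
const after = "a\nb\nc\nd\ne\nf"; // one line appended

// Hash-level view: one file, hash changed, so the overlap is empty.
const hashJaccard = jaccard(new Set(["hash-before"]), new Set(["hash-after"]));
// Line-level view: 5 of the lines are shared, so 2·5 / (5 + 6) = 10/11.
const lineScore = lineSimilarity(before, after);
```

For this five-line file with one appended line the hash Jaccard is 0 while the line score is 10/11 (about 0.91); the exact percentage for "one added line" depends on how long the file is.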

## Trap 3 — path overlap alone falsely links unrelated single-file skills
Because every text paste uses the constant path `SKILL.md` (Trap 1), path-Jaccard is always 1
between any two text skills, so selecting/classifying `modified` by path overlap links totally
unrelated pastes (similarity could be ~0).
**Fix / how to apply:** select the candidate by the content-aware similarity score itself
(not path overlap), and only return `modified` when
`bestSimilarity >= MODIFIED_SIMILARITY_THRESHOLD` (40) OR at least one file is byte-identical
(hash overlap). Otherwise return `new`. Skip scoring candidates with no shared path AND no
shared hash (similarity would be 0).
**Why:** classification must reflect actual content overlap, not a coincidentally-shared
file path.
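
The selection rule can be sketched roughly like this. The `Scan` and `ScanFile` shapes, the result object, and averaging per-file scores over the union of paths are assumptions for illustration; only `computeRelation`, `lineSimilarity`, and `MODIFIED_SIMILARITY_THRESHOLD` (40) are names taken from the notes above, and their real signatures may differ.

```typescript
type Relation = "new" | "identical" | "modified";

// Hypothetical record shapes for this sketch.
interface ScanFile { path: string; hash: string; isBinary: boolean; content?: string }
interface Scan { id: number; fingerprint: string; files: ScanFile[] }
interface RelationResult { relation: Relation; similarity: number; comparedScanId: number | null }

const MODIFIED_SIMILARITY_THRESHOLD = 40;

// Line-level LCS ratio 2·LCS/(a+b), with lines split on "\n".
function lineSimilarity(a: string, b: string): number {
  const x = a.split("\n");
  const y = b.split("\n");
  let prev: number[] = new Array(y.length + 1).fill(0);
  for (let i = 1; i <= x.length; i++) {
    const curr: number[] = new Array(y.length + 1).fill(0);
    for (let j = 1; j <= y.length; j++) {
      curr[j] = x[i - 1] === y[j - 1] ? prev[j - 1] + 1 : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return (2 * prev[y.length]) / (x.length + y.length);
}

// Content-aware similarity in percent.
// Assumption: per-file scores are averaged over the union of both path sets.
function contentSimilarity(a: Scan, b: Scan): number {
  const otherByPath = new Map<string, ScanFile>();
  for (const f of b.files) otherByPath.set(f.path, f);
  const allPaths = new Set([...a.files, ...b.files].map((f) => f.path));
  let total = 0;
  for (const fa of a.files) {
    const fb = otherByPath.get(fa.path);
    if (!fb) continue; // added or removed file counts 0
    if (fa.hash === fb.hash) total += 1; // unchanged file
    else if (!fa.isBinary && !fb.isBinary) total += lineSimilarity(fa.content ?? "", fb.content ?? "");
    // changed binary file counts 0
  }
  return allPaths.size === 0 ? 0 : Math.round((100 * total) / allPaths.size);
}

function computeRelation(current: Scan, prior: Scan[]): RelationResult {
  const exact = prior.find((p) => p.fingerprint === current.fingerprint);
  if (exact) return { relation: "identical", similarity: 100, comparedScanId: exact.id };

  const paths = new Set(current.files.map((f) => f.path));
  const hashes = new Set(current.files.map((f) => f.hash));
  let best: { id: number; similarity: number; sharesHash: boolean } | null = null;
  for (const candidate of prior) {
    const sharesPath = candidate.files.some((f) => paths.has(f.path));
    const sharesHash = candidate.files.some((f) => hashes.has(f.hash));
    if (!sharesPath && !sharesHash) continue; // similarity would be 0
    const similarity = contentSimilarity(current, candidate);
    if (best === null || similarity > best.similarity) {
      best = { id: candidate.id, similarity, sharesHash };
    }
  }
  if (best !== null && (best.similarity >= MODIFIED_SIMILARITY_THRESHOLD || best.sharesHash)) {
    return { relation: "modified", similarity: best.similarity, comparedScanId: best.id };
  }
  return { relation: "new", similarity: 0, comparedScanId: null };
}
```

In this sketch two unrelated pastes still share the path `SKILL.md`, but they score well below the threshold and share no hash, so they come back as `new` rather than `modified`.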

Add Skill-Fingerprint database & report comparison

Each scan gets a deterministic overall fingerprint (SHA-256 over sorted path+fileHash
pairs) plus per-file SHA-256 hashes and stored text content (binary: hash+size only).
On upload the skill is always re-scanned and classified vs prior scans as
new / identical / modified, with a per-fingerprint check counter, a "most similar known
skill" link, and a file-level diff view.

Deviations from the plan:
- Relation matching keys off shared file paths (Jaccard over paths, tie-break on hashes),
  not hash-Jaccard alone, which is always 0 for single-file edits (text paste = one
  SKILL.md) and would mis-class every edited single-file skill as "new". Similarity is
  content-aware: identical files = 1.0, changed text files use line-level LCS ratio,
  added/removed/changed-binary = 0.
- parseText no longer uses the display name as the file path (fixed "SKILL.md") so
  identical pastes with different names are "identical", not "modified".

Backend: skillFingerprint.ts, lineDiff.ts (+lineSimilarity), skillParser.ts (per-file
hash+isBinary), routes/scans.ts (computeRelation, content similarity, checkCount,
comparedScan, GET /scans/:id/compare/:otherId).
DB: scans fingerprint/relation/similarity/comparedScanId (+index), scan_files hash/content.
API spec + orval codegen regenerated.
UI: fingerprint card + compare link on report, relation badges in history, new
/vergleich/:id/:otherId page with side-by-side summaries and expandable line diff.
German UI, no emojis.

Verified end-to-end against the running API and screenshotted both UI pages; test data
cleaned up afterward.

Code-review fix: relation classification no longer relies on path-Jaccard (every text
paste shares path SKILL.md, so unrelated pastes were falsely linked as "modified").
computeRelation now selects the candidate by content-aware similarity and only returns
"modified" when similarity >= 40 or a file is byte-identical; otherwise "new".
Updated OpenAPI similarity description; removed now-unused jaccard import.

Replit-Task-Id: 79a8e472-6635-493c-8995-3233ba7df75c
2026-06-10 19:34:46 +00:00