skillguard/.agents/memory/skill-fingerprint-matching.md
amertensreplit ba9788a93c Add Skill-Fingerprint database & report comparison
Each scan gets a deterministic overall fingerprint (SHA-256 over sorted
path+fileHash pairs) plus per-file SHA-256 hashes and stored text content
(binary: hash+size only). On upload the skill is always re-scanned and
classified vs prior scans as new / identical / modified, with a per-fingerprint
check counter, a "most similar known skill" link, and a file-level diff view.

Deviations from the plan:
- Relation matching keys off shared file *paths* (Jaccard over paths, tie-break
  on hashes), not hash-Jaccard alone. Hash-Jaccard is always 0 for single-file
  edits (text paste = one SKILL.md) and would misclassify every edited
  single-file skill as "new". Similarity is content-aware: identical files = 1.0,
  changed text files use line-level LCS ratio, added/removed/changed-binary = 0.
- parseText no longer uses the display name as the file path; the path is now
  the fixed "SKILL.md", so identical pastes with different names are classified
  "identical", not "modified".

Backend: skillFingerprint.ts, lineDiff.ts (+lineSimilarity), skillParser.ts
(per-file hash+isBinary), routes/scans.ts (computeRelation, content similarity,
checkCount, comparedScan, GET /scans/:id/compare/:otherId). DB: scans
fingerprint/relation/similarity/comparedScanId (+index), scan_files hash/content.
API spec + orval codegen regenerated. UI: fingerprint card + compare link on
report, relation badges in history, new /vergleich/:id/:otherId page with
side-by-side summaries and expandable line diff. German UI, no emojis.

Verified end-to-end against the running API and screenshotted both UI pages;
test data cleaned up afterward.

Code-review fix: relation classification no longer relies on path-Jaccard
(every text paste shares path SKILL.md, so unrelated pastes were falsely
linked as "modified"). computeRelation now selects the candidate by
content-aware similarity and only returns "modified" when similarity >= 40
or a file is byte-identical; otherwise "new". Updated OpenAPI similarity
description; removed now-unused jaccard import.

Replit-Task-Id: 79a8e472-6635-493c-8995-3233ba7df75c
2026-06-10 19:34:46 +00:00


name: Skill fingerprint & relation matching
description: How SkillGuard decides new/identical/modified between scans, and two traps that break it.

Skill fingerprint & relation matching

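The overall fingerprint is a SHA-256 over the sorted path\u0000fileHash pairs, i.e. each file's path and hash joined by a NUL character. Relation detection: exact fingerprint match → identical; else the most similar prior scan, if it clears the conditions described under Trap 2 → modified; else new.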
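The fingerprint rule above can be sketched as follows. This is a minimal illustration, not the code in skillFingerprint.ts: the `FileEntry` shape, the helper names, and the newline used to join the sorted pairs are assumptions; the source only fixes SHA-256, the sort, and the NUL between path and hash.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape; the real per-file record may carry more fields.
interface FileEntry {
  path: string;
  hash: string; // hex SHA-256 of the file content
}

function sha256Hex(data: string): string {
  return createHash("sha256").update(data, "utf8").digest("hex");
}

// SHA-256 over sorted "path\u0000fileHash" pairs.
// Sorting makes the result independent of the order in which files were read.
// Joining the pairs with "\n" is an assumption made for this sketch.
function overallFingerprint(files: FileEntry[]): string {
  const pairs = files.map((f) => `${f.path}\u0000${f.hash}`).sort();
  return sha256Hex(pairs.join("\n"));
}
```

Because the NUL character cannot appear in ordinary file paths, it keeps a path from running into its hash when the pairs are concatenated.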

Trap 1 — display name must not leak into the fingerprint

Text-pasted skills are parsed into a single file. That file's path must be a stable constant (SKILL.md), NOT the user-supplied scan name. If the name is used as the path, two byte-identical pastes with different names get different fingerprints and are misclassified as modified (with similarity 100) instead of identical. Why: the fingerprint is meant to identify content and structure, not the display label.
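A minimal sketch of the rule, assuming a hypothetical `parseText(displayName, text)` signature and `ParsedFile` shape (the real skillParser.ts may differ). The display name is kept as a label only and never reaches the file list that feeds the fingerprint.

```typescript
import { createHash } from "node:crypto";

// The constant path named in the note above.
const TEXT_SKILL_PATH = "SKILL.md";

// Hypothetical record shape for illustration.
interface ParsedFile {
  path: string;
  hash: string;
  content: string;
  isBinary: boolean;
}

interface ParsedSkill {
  name: string; // display label only
  files: ParsedFile[];
}

function parseText(displayName: string, text: string): ParsedSkill {
  const hash = createHash("sha256").update(text, "utf8").digest("hex");
  return {
    name: displayName,
    // The path is the constant, not displayName, so identical text
    // always produces an identical file list.
    files: [{ path: TEXT_SKILL_PATH, hash, content: text, isBinary: false }],
  };
}
```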

Trap 2 — Jaccard over file hashes can't detect single-file modifications

For a single-file skill, any content edit changes its only hash, so Jaccard over the hash set is 0. The edited skill is then wrongly classified new, and the compare/diff view (the whole point of the feature) never links the two versions. Fix / how to apply: do not match on hash-Jaccard. The first version matched candidate scans by Jaccard over file paths (tie-break by hash overlap); the next paragraph replaces that selection step. In both versions, similarity is reported as a content-aware score: identical files (same hash) count 1.0, changed text files use the line-level LCS ratio (lineSimilarity = 2·LCS/(a+b), where a and b are the two line counts), and added, removed, or changed-binary files count 0. This yields a meaningful percentage for single-file edits (e.g. one line added to a 5-line file gives 2·5/(5+6) ≈ 91%) and still reduces to hash equality for unchanged files.
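The scoring rule can be sketched like this. `lineSimilarity` follows the formula given above; the `StoredFile` shape and the choice to average per-file scores over the union of paths are assumptions for illustration, since the note does not say how per-file scores are combined.

```typescript
// Length of the longest common subsequence of two line arrays (rolling-row DP).
function lcsLength(a: string[], b: string[]): number {
  let prev = new Array<number>(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array<number>(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// lineSimilarity = 2·LCS/(a+b). split() always yields at least one element,
// so the denominator is never zero.
function lineSimilarity(textA: string, textB: string): number {
  const a = textA.split("\n");
  const b = textB.split("\n");
  return (2 * lcsLength(a, b)) / (a.length + b.length);
}

// Hypothetical stored-file shape; binary files carry no content.
interface StoredFile {
  path: string;
  hash: string;
  isBinary: boolean;
  content?: string;
}

function fileScore(a?: StoredFile, b?: StoredFile): number {
  if (!a || !b) return 0;                 // added or removed
  if (a.hash === b.hash) return 1;        // identical
  if (a.isBinary || b.isBinary) return 0; // changed binary
  return lineSimilarity(a.content ?? "", b.content ?? "");
}

// Assumption: overall score = mean of per-file scores over the union of paths.
function contentSimilarity(left: StoredFile[], right: StoredFile[]): number {
  const l = new Map<string, StoredFile>();
  const r = new Map<string, StoredFile>();
  const paths = new Set<string>();
  for (const f of left) { l.set(f.path, f); paths.add(f.path); }
  for (const f of right) { r.set(f.path, f); paths.add(f.path); }
  if (paths.size === 0) return 1; // two empty skills: treated as equal here
  let total = 0;
  paths.forEach((p) => { total += fileScore(l.get(p), r.get(p)); });
  return total / paths.size;
}
```

Multiplying the result by 100 and rounding would give a percentage comparable to the threshold of 40 used for classification.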

Because every text paste uses the constant path SKILL.md (Trap 1), path-Jaccard is always 1 between any two text skills. Selecting or classifying modified by path overlap therefore links totally unrelated pastes, whose similarity could be close to 0. Fix / how to apply: select the candidate by the content-aware similarity score itself (not path overlap), and only return modified when bestSimilarity >= MODIFIED_SIMILARITY_THRESHOLD (40) OR at least one file is byte-identical (hash overlap). Otherwise return new. Skip scoring candidates with no shared path AND no shared hash, since their similarity would be 0. Why: classification must reflect actual content overlap, not a coincidentally shared file path.
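The decision rule can be sketched as below. This is an illustration under stated assumptions, not the computeRelation in routes/scans.ts: the `ScanSummary` and `RelationResult` shapes are hypothetical, the similarity function is passed in as a parameter returning a 0 to 100 score, and the byte-identical check is applied to the best-scoring candidate.

```typescript
type Relation = "new" | "identical" | "modified";

// Threshold named in the note above, on a 0-100 scale.
const MODIFIED_SIMILARITY_THRESHOLD = 40;

// Hypothetical shapes for illustration.
interface ScanSummary {
  id: string;
  fingerprint: string;
  files: { path: string; hash: string }[];
}

interface RelationResult {
  relation: Relation;
  comparedScanId: string | null;
  similarity: number | null;
}

function computeRelation(
  current: ScanSummary,
  prior: ScanSummary[],
  similarityPercent: (a: ScanSummary, b: ScanSummary) => number,
): RelationResult {
  // 1. Exact fingerprint match wins outright.
  for (const p of prior) {
    if (p.fingerprint === current.fingerprint) {
      return { relation: "identical", comparedScanId: p.id, similarity: 100 };
    }
  }

  const paths = new Set(current.files.map((f) => f.path));
  const hashes = new Set(current.files.map((f) => f.hash));

  // 2. Pick the candidate with the highest content-aware similarity.
  let best: ScanSummary | null = null;
  let bestSimilarity = -1;
  let bestSharesHash = false;
  for (const cand of prior) {
    const sharesPath = cand.files.some((f) => paths.has(f.path));
    const sharesHash = cand.files.some((f) => hashes.has(f.hash));
    if (!sharesPath && !sharesHash) continue; // similarity would be 0
    const sim = similarityPercent(current, cand);
    if (sim > bestSimilarity) {
      best = cand;
      bestSimilarity = sim;
      bestSharesHash = sharesHash;
    }
  }

  // 3. A shared path alone is not enough to call it "modified".
  if (best && (bestSimilarity >= MODIFIED_SIMILARITY_THRESHOLD || bestSharesHash)) {
    return { relation: "modified", comparedScanId: best.id, similarity: bestSimilarity };
  }
  return { relation: "new", comparedScanId: null, similarity: null };
}
```

With this rule, two unrelated text pastes that share only the path SKILL.md score below the threshold, share no hash, and come out as new.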