---
name: Skill fingerprint & relation matching
description: How SkillGuard decides new/identical/modified between scans, and three traps that break it.
---

# Skill fingerprint & relation matching

The overall fingerprint is a SHA-256 over the sorted `path\u0000fileHash` pairs, one pair per file.
Relation detection runs in order: exact fingerprint match → `identical`; else the best file-tree
overlap → `modified`; else `new`.
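A minimal sketch of the fingerprint, written in Python purely for illustration. The names `file_hash` and `skill_fingerprint` are hypothetical, and joining the sorted pairs with a newline before hashing is an assumption: the notes only say the hash covers the sorted pairs.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """SHA-256 hex digest of one file's bytes."""
    return hashlib.sha256(content).hexdigest()


def skill_fingerprint(files: dict[str, bytes]) -> str:
    """SHA-256 over the sorted path + NUL + fileHash pairs of a skill."""
    pairs = sorted(f"{path}\u0000{file_hash(data)}" for path, data in files.items())
    # Assumption: pairs are newline-joined; the real separator is not specified.
    return hashlib.sha256("\n".join(pairs).encode("utf-8")).hexdigest()


# The same files in a different insertion order give the same fingerprint,
# because the pairs are sorted before hashing.
fp_one = skill_fingerprint({"SKILL.md": b"hello", "lib/x.py": b"y"})
fp_two = skill_fingerprint({"lib/x.py": b"y", "SKILL.md": b"hello"})
```

Sorting makes the fingerprint independent of file enumeration order, and the NUL separator cannot appear in a path or a hex digest, so a pair cannot be confused with a different path/hash split.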
## Trap 1 — display name must not leak into the fingerprint
Text-pasted skills are parsed into a single file. That file's `path` must be a **stable
constant** (`SKILL.md`), NOT the user-supplied scan name. If the name is used as the path,
two byte-identical pastes with different names get different fingerprints and are
misclassified as `modified` (similarity 100) instead of `identical`.
**Why:** the fingerprint is meant to identify content/structure, not the display label.
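The trap can be reproduced in a few lines. This is a sketch under assumptions: `parse_pasted_skill` and `fingerprint` are hypothetical helpers, and the newline join inside `fingerprint` is an illustrative choice, not the confirmed format.

```python
import hashlib

STABLE_PASTE_PATH = "SKILL.md"  # constant path for every text-pasted skill


def fingerprint(files: dict[str, bytes]) -> str:
    """SHA-256 over sorted path + NUL + fileHash pairs (newline join assumed)."""
    pairs = sorted(
        f"{path}\u0000{hashlib.sha256(data).hexdigest()}"
        for path, data in files.items()
    )
    return hashlib.sha256("\n".join(pairs).encode("utf-8")).hexdigest()


def parse_pasted_skill(text: str, scan_name: str) -> dict[str, bytes]:
    """Wrap pasted text as a single-file skill.

    scan_name is accepted but deliberately NOT used as the file path.
    """
    return {STABLE_PASTE_PATH: text.encode("utf-8")}


body = "# My skill\nDo the thing.\n"

# Correct: the display name never reaches the fingerprint.
good_a = fingerprint(parse_pasted_skill(body, scan_name="first try"))
good_b = fingerprint(parse_pasted_skill(body, scan_name="renamed copy"))

# Buggy variant: the scan name is used as the path, so identical content diverges.
buggy_a = fingerprint({"first try": body.encode("utf-8")})
buggy_b = fingerprint({"renamed copy": body.encode("utf-8")})
```

Here `good_a == good_b`, while `buggy_a != buggy_b` even though the pasted bytes are the same.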
## Trap 2 — Jaccard over file *hashes* can't detect single-file modifications
For a single-file skill, any content edit changes the one hash completely, so Jaccard over
the hash set is 0. The scan is wrongly classified `new`, and the compare/diff view (the whole
point of the feature) never links the two versions.
**Fix / how to apply:** match candidate scans by Jaccard over file **paths** (tie-break by
hash overlap), then report `similarity` as a content-aware score: identical files (same hash)
count 1.0, changed text files use the line-level LCS ratio (`lineSimilarity` = 2·LCS/(a+b),
where a and b are the two line counts), and added/removed or changed-binary files count 0.
This yields a meaningful % for single-file edits (e.g. one line added to a short file ≈ 90%)
and still reduces to hash-equality for unchanged files. Trap 3 replaces the path-based
candidate selection; the similarity score itself stays.
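The scoring rule can be sketched as follows. Several details are assumptions made for the example: the per-file scores are averaged over the union of paths, byte equality stands in for hash equality, and a file that fails UTF-8 decoding is treated as binary. All function names are hypothetical Python spellings (the notes refer to `lineSimilarity`).

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def line_similarity(old: str, new: str) -> float:
    """2 * LCS / (a + b), where a and b are the line counts."""
    a, b = old.splitlines(), new.splitlines()
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def skill_similarity(old: dict[str, bytes], new: dict[str, bytes]) -> float:
    """Content-aware similarity as a percentage (mean over the union of paths)."""
    paths = set(old) | set(new)
    if not paths:
        return 100.0
    total = 0.0
    for path in paths:
        if path not in old or path not in new:
            continue  # added or removed file counts 0
        if old[path] == new[path]:
            total += 1.0  # same bytes, hence same hash
            continue
        try:
            total += line_similarity(old[path].decode("utf-8"), new[path].decode("utf-8"))
        except UnicodeDecodeError:
            pass  # changed binary file counts 0
    return 100.0 * total / len(paths)


old = b"a\nb\nc\nd\ne\n"
new = b"a\nb\nc\nd\ne\nf\n"
hash_overlap = jaccard({hashlib.sha256(old).hexdigest()}, {hashlib.sha256(new).hexdigest()})
score = skill_similarity({"SKILL.md": old}, {"SKILL.md": new})
```

For this five-line file with one line appended, `hash_overlap` is 0.0 while `score` is 100 · 2·5/11, roughly 90.9.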
## Trap 3 — path overlap alone falsely links unrelated single-file skills
Because every text paste uses the constant path `SKILL.md` (Trap 1), path-Jaccard is always 1
between any two text skills. Selecting or classifying `modified` by path overlap therefore
links totally unrelated pastes, whose similarity can be close to 0.
**Fix / how to apply:** select the candidate by the content-aware similarity score itself
(not path overlap), and only return `modified` when
`bestSimilarity >= MODIFIED_SIMILARITY_THRESHOLD` (40) OR at least one file is byte-identical
(hash overlap). Otherwise return `new`. Skip scoring candidates with no shared path AND no
shared hash (their similarity would be 0).
**Why:** classification must reflect actual content overlap, not a coincidentally shared
file path.
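A sketch of the resulting decision logic, under stated assumptions. `classify`, `digests`, and `toy_similarity` are hypothetical names. `toy_similarity` is only a stand-in for the real score: it uses `difflib.SequenceMatcher.ratio()`, which has the same 2·M/T shape as `lineSimilarity` but does not guarantee a true LCS. Checking hash overlap on the best-scoring candidate only is also an assumption.

```python
import difflib
import hashlib
from typing import Callable, Optional

MODIFIED_SIMILARITY_THRESHOLD = 40  # percent

Files = dict[str, bytes]  # path -> file content


def digests(files: Files) -> dict[str, str]:
    """Map each path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def toy_similarity(a: Files, b: Files) -> float:
    """Stand-in score: mean per-path difflib ratio over the union of paths, in percent."""
    paths = set(a) | set(b)
    total = 0.0
    for p in paths:
        if p in a and p in b:
            total += difflib.SequenceMatcher(
                None, a[p].decode("utf-8").splitlines(), b[p].decode("utf-8").splitlines()
            ).ratio()
    return 100.0 * total / len(paths) if paths else 100.0


def classify(
    scan: Files,
    previous: list[Files],
    similarity: Callable[[Files, Files], float],
) -> tuple[str, Optional[int]]:
    """Return (relation, index of the matched previous scan or None)."""
    scan_digests = digests(scan)
    best = None  # (score, index, shares_hash)
    for index, candidate in enumerate(previous):
        cand_digests = digests(candidate)
        if cand_digests == scan_digests:
            # Same (path, hash) pairs, so the fingerprints would match.
            return "identical", index
        shares_path = bool(scan_digests.keys() & cand_digests.keys())
        shares_hash = bool(set(scan_digests.values()) & set(cand_digests.values()))
        if not shares_path and not shares_hash:
            continue  # similarity would be 0, so skip scoring
        score = similarity(scan, candidate)
        if best is None or score > best[0]:
            best = (score, index, shares_hash)
    if best is not None and (best[0] >= MODIFIED_SIMILARITY_THRESHOLD or best[2]):
        return "modified", best[1]
    return "new", None


history = [{"SKILL.md": b"alpha\nbeta\ngamma\ndelta\n"}]
edited = classify({"SKILL.md": b"alpha\nbeta\ngamma\ndelta\nepsilon\n"}, history, toy_similarity)
unrelated = classify({"SKILL.md": b"one\ntwo\nthree\n"}, history, toy_similarity)
same = classify({"SKILL.md": b"alpha\nbeta\ngamma\ndelta\n"}, history, toy_similarity)
```

With this history, `edited` is `("modified", 0)`, `unrelated` is `("new", None)` despite the shared `SKILL.md` path, and `same` is `("identical", 0)`.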