De-duplicating records based on title
This is a draft!
I have a large database of metadata records, harvested from repositories using the OAI-PMH. We want to couple that with a database of records from the Thomson ISI Web of Knowledge database, in order to answer the question of what proportion of the ISI records are also available in a repository. (Whether there is an attached full-text will be a later question ...)
The technological problem is matching 700,000 records from ISI against 29,000,000 harvested from repositories. The common fields we have are title, date and author. Each ISI record has a unique ISI-assigned numeric identifier. Each OAI record has a unique numeric identifier assigned by me.
For now, I'm going to ignore the date and author and instead concentrate on matching titles. The technique I'm using to test equivalence is the Jaccard index. The elements from which the intersection and union for Jaccard are calculated are w-shingles. The shingles are generated by normalising and tokenizing the input title and calculating the digest of each 4-term sequence e.g.
"She sells sea shells, by the sea shore"
is normalised to:
"she sells sea shells by the sea shore"
which is tokenized (split) into sequences of 4 terms each:
{she, sells, sea, shells}
{sells, sea, shells, by}
{sea, shells, by, the}
{shells, by, the, sea}
{by, the, sea, shore}
from which the MD5 digest is calculated for each shingle, resulting in 5 unique shingles:
{8864a21dca39c33f878dbee52a68e0ed, 9c63cb078bdb03930b244383bc283d74, 67e4f723c2e752bc84da38835aed8790, 84868ccce9218519b8829b4f087ba87c, 06542f2681a2e8fc0bab470f638992aa}
To reduce the search space only the last 4 bytes of the digest is used. A database table is written that consists of pairs of the 4 byte shingle identifier plus the numeric identifier of the source record. Once all the records are processed we have a sorted table of shingles plus the list of records that have that common shingle.
|shingle|id|
2a68e0ed|1
bc283d74|1
5aed8790|1
087ba87c|1
638992aa|1
Calculating the overlap between records can be calculated using a self-table join:
SELECT a.id a, b.id b FROM shingles a INNER JOIN shingles b ON a.shingle=b.shingle AND a.id < b.id;
Counting the number of distinct a.id/b.id pairs from the above will tell us the number of shared (intersection) shingles:
SELECT x.a, x.b, COUNT(*) c FROM (SELECT a.id a, b.id b...) x GROUP BY x.a, x.b;
The count of the union of shingles between two records can be calculated by:
SELECT COUNT(DISTINCT shingle)) FROM shingles WHERE id=a.id OR id=b.id;
Once a likely set of duplicates has been identified (Jp>.5), a more accurate matching can be performed on the candidate titles, authors and dates.
I have a large database of metadata records, harvested from repositories using the OAI-PMH. We want to couple that with a database of records from the Thomson ISI Web of Knowledge database, in order to answer the question of what proportion of the ISI records are also available in a repository. (Whether there is an attached full-text will be a later question ...)
The technological problem is matching 700,000 records from ISI against 29,000,000 harvested from repositories. The common fields we have are title, date and author. Each ISI record has a unique ISI-assigned numeric identifier. Each OAI record has a unique numeric identifier assigned by me.
For now, I'm going to ignore the date and author and instead concentrate on matching titles. The technique I'm using to test equivalence is the Jaccard index. The elements from which the intersection and union for Jaccard are calculated are w-shingles. The shingles are generated by normalising and tokenizing the input title and calculating the digest of each 4-term sequence e.g.
"She sells sea shells, by the sea shore"
is normalised to:
"she sells sea shells by the sea shore"
which is tokenized (split) into sequences of 4 terms each:
{she, sells, sea, shells}
{sells, sea, shells, by}
{sea, shells, by, the}
{shells, by, the, sea}
{by, the, sea, shore}
from which the MD5 digest is calculated for each shingle, resulting in 5 unique shingles:
{8864a21dca39c33f878dbee52a68e0ed, 9c63cb078bdb03930b244383bc283d74, 67e4f723c2e752bc84da38835aed8790, 84868ccce9218519b8829b4f087ba87c, 06542f2681a2e8fc0bab470f638992aa}
To reduce the search space only the last 4 bytes of the digest is used. A database table is written that consists of pairs of the 4 byte shingle identifier plus the numeric identifier of the source record. Once all the records are processed we have a sorted table of shingles plus the list of records that have that common shingle.
|shingle|id|
2a68e0ed|1
bc283d74|1
5aed8790|1
087ba87c|1
638992aa|1
Calculating the overlap between records can be calculated using a self-table join:
SELECT a.id a, b.id b FROM shingles a INNER JOIN shingles b ON a.shingle=b.shingle AND a.id < b.id;
Counting the number of distinct a.id/b.id pairs from the above will tell us the number of shared (intersection) shingles:
SELECT x.a, x.b, COUNT(*) c FROM (SELECT a.id a, b.id b...) x GROUP BY x.a, x.b;
The count of the union of shingles between two records can be calculated by:
SELECT COUNT(DISTINCT shingle)) FROM shingles WHERE id=a.id OR id=b.id;
Once a likely set of duplicates has been identified (Jp>.5), a more accurate matching can be performed on the candidate titles, authors and dates.