DPD: Delphion Search Algorithm

Delphion Search Algorithm

To generate the DNA Patent Database collection, the process is as follows:

Run the algorithm below on the Delphion Patent Database. This algorithm is explained in plain English below.
Review patent titles and claims, and reject patents where the mention of a nucleic acid term is merely incidental (for example, as one of many examples in a subordinate claim).

Delphion Search Algorithm

((047???* OR 119* OR 260???* OR 426* OR 435* OR 514* OR 536022* OR 5360231 OR 536024* OR 536025* OR 800*) <in> NC)

AND

((antisense OR <case><wildcard>cDNA* OR centromere OR deoxyoligonucleotide OR deoxyribonucleic OR deoxyribonucleotide OR <case><wildcard>DNA* OR exon OR "gene" OR "genes" OR genetic OR genome OR genomic OR genotype OR haplotype OR intron OR <case><wildcard>mtDNA* OR nucleic OR nucleotide OR oligonucleotide OR oligodeoxynucleotide OR oligoribonucleotide OR plasmid OR polymorphism OR polynucleotide OR polyribonucleotide OR ribonucleotide OR ribonucleic OR "recombinant DNA" OR <case><wildcard>RNA* OR <case><wildcard>mRNA* OR <case><wildcard>rRNA* OR <case><wildcard>siRNA* OR <case><wildcard>snRNA* OR <case><wildcard>tRNA* OR ribonucleoprotein OR <case><wildcard>hnRNP* OR <case><wildcard>snRNP* OR <case><wildcard>SNP*) <in> CLAIMS)
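The classification half of the query above can be approximated outside Delphion. The following Python sketch assumes that each patent's national classification codes are available as undelimited strings (for example, "5360231" for class 536, subclass 23.1), that "?" stands for exactly one character, and that "*" stands for any run of characters. These are assumptions made for illustration, not documented Delphion behavior, and `in_target_classes` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Class patterns copied from the <in> NC clause of the query above.
CLASS_PATTERNS = [
    "047???*", "119*", "260???*", "426*", "435*", "514*",
    "536022*", "5360231", "536024*", "536025*", "800*",
]


def in_target_classes(class_codes):
    """Return True if any classification code matches any class pattern.

    Assumes codes are plain strings such as "435006" (hypothetical format).
    fnmatchcase treats "?" as one character and "*" as any run of characters.
    """
    return any(
        fnmatchcase(code, pattern)
        for code in class_codes
        for pattern in CLASS_PATTERNS
    )
```

Under these assumptions, a code such as "5360232" is not selected, because only the exact code "5360231" is listed from the 536/23 range.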

Translation of Delphion Search Algorithm

As of mid-2015, the US Patent and Trademark Office is discontinuing use of US patent classification codes in favor of the international Cooperative Patent Classification (CPC) system. Based on the sensitivity and specificity tests that generated the algorithm, leaving out this "patent class" specification step reduces specificity by only a few percent. Although omitting it vastly expands the number of patents searched, modern search algorithms can easily accommodate the expansion. Moreover, the CPC codes do not map cleanly to nucleic acid entities, so searches should now probably avoid the patent classification restriction step.
Search US Patent classes 047 (plant husbandry), 119 (animal husbandry), 260 (organic chemistry), 426 (food), 435 (molecular biology and microbiology), 514 (drug, bio-affecting and body treating compositions), 536/subclasses 22 through 23.1 (nucleic acids, genes, etc., but not peptides or proteins), subclasses 24 and 25 (various nucleic acids, variants, and related methods), and class 800 (multicellular organisms).
Select patents from that group that include one or more of the following terms in their claims:

antisense
cDNA (exact case match only, with or without following letters such as cDNAs)
centromere
deoxyoligonucleotide
deoxyribonucleic
deoxyribonucleotide
DNA (all upper case only, with or without following letters such as DNAs)
exon
gene or genes (exact match only)
genetic
genome
genomic
genotype
haplotype
intron
mtDNA (exact case match only, with or without following letters such as mtDNAs)
nucleic
nucleotide
oligonucleotide
oligodeoxynucleotide
oligoribonucleotide
plasmid
polymorphism
polynucleotide
polyribonucleotide
ribonucleotide
ribonucleic
recombinant DNA (exact match for case and words only)
RNA (all upper case only, with or without following letters such as RNAs)
mRNA (exact case match only, with or without following letters such as mRNAs)
rRNA (exact case match only, with or without following letters such as rRNAs)
siRNA (exact case match only, with or without following letters such as siRNAs)
snRNA (exact case match only, with or without following letters such as snRNAs)
tRNA (exact case match only, with or without following letters such as tRNAs)
ribonucleoprotein
hnRNP (exact case match only, with or without following letters such as hnRNPs)
snRNP (exact case match only, with or without following letters such as snRNPs)
SNP (exact case match only, with or without following letters such as SNPs)
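The term list above lends itself to a regular-expression sketch. The Python example below is one possible reading, not the query engine's actual behavior: it assumes that plain terms match case-insensitively as whole words with no stemming (so "plasmids" would not match "plasmid"), and that the case-sensitive terms match at the start of a word with any trailing letters. The names `PLAIN_TERMS`, `CASE_PREFIX_TERMS`, and `claims_match` are hypothetical.

```python
import re

# Terms matched as whole words, ignoring case (assumption: no stemming).
PLAIN_TERMS = [
    "antisense", "centromere", "deoxyoligonucleotide", "deoxyribonucleic",
    "deoxyribonucleotide", "exon", "gene", "genes", "genetic", "genome",
    "genomic", "genotype", "haplotype", "intron", "nucleic", "nucleotide",
    "oligonucleotide", "oligodeoxynucleotide", "oligoribonucleotide",
    "plasmid", "polymorphism", "polynucleotide", "polyribonucleotide",
    "ribonucleotide", "ribonucleic", "ribonucleoprotein",
]

# Terms matched with exact case, optionally followed by more letters.
# "recombinant DNA" needs no separate entry: any text containing it
# already matches the case-sensitive "DNA" pattern.
CASE_PREFIX_TERMS = [
    "cDNA", "DNA", "mtDNA", "RNA", "mRNA", "rRNA", "siRNA", "snRNA",
    "tRNA", "hnRNP", "snRNP", "SNP",
]

PLAIN_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, PLAIN_TERMS)) + r")\b",
    re.IGNORECASE,
)
CASE_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, CASE_PREFIX_TERMS)) + r")\w*"
)


def claims_match(claims_text):
    """Return True if the claims text contains at least one listed term."""
    return bool(PLAIN_RE.search(claims_text) or CASE_RE.search(claims_text))
```

With this reading, "general" does not count as a match for "gene", and lower-case "dna" or "snp" is ignored, which mirrors the exact-match and exact-case notes in the list.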

As of 2014, we also recommend adding:

CRISPR or "clustered regularly interspaced short palindromic repeats"
Cas9 or Cas-9
cfDNA or "cell free DNA"
cffDNA or "cell free fetal DNA"
ctDNA or "circulating tumor DNA"
"plasma DNA"

We have not performed the same term-by-term specificity and sensitivity analysis of these text strings as we did for the list above, but they have good face validity as highly specific to nucleic acid constructs and are important subjects of patents and patent applications.
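For readers applying these supplementary terms with regular expressions, a minimal Python sketch follows. It assumes case-insensitive matching and accepts either a space or a hyphen in "cell free"; neither choice comes from the list above, and `EXTRA_RE` and `extra_terms_match` are hypothetical names.

```python
import re

# Supplementary terms recommended as of 2014.
# Assumptions: case-insensitive, "cell free" may also be written "cell-free".
EXTRA_RE = re.compile(
    r"\bCRISPR\b"
    r"|\bclustered regularly interspaced short palindromic repeats\b"
    r"|\bCas-?9\b"
    r"|\bc(?:f|ff|t)DNA\b"
    r"|\bcell[- ]free (?:fetal )?DNA\b"
    r"|\bcirculating tumor DNA\b"
    r"|\bplasma DNA\b",
    re.IGNORECASE,
)


def extra_terms_match(claims_text):
    """Return True if the claims text contains a supplementary term."""
    return bool(EXTRA_RE.search(claims_text))
```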
Last updated: June 2015
This will likely be the final update for the DNA Patent Database, as the grant supporting it (P50 HG003391) expired on March 31, 2015.

A Note on the History of the Algorithm

Original Martinell algorithm:

This algorithm was based on an original algorithm developed by USPTO Senior Examiner James Martinell in response to a 1993 request from the Office of Technology Assessment, U.S. Congress.

((435* OR 800* OR 530* OR 536/23*) <in> NC)

AND

((sequenc* OR (atga* OR atgc* OR atgg* OR atgt*) OR cDNA? OR deoxyribo* OR deoxynuclei* OR deoxynucle* OR dna? OR gene? OR nucle* OR nucleotide OR oligonucle* OR oligodeoxy*) <in> CLAIMS)

Translation of original Martinell algorithm:

This algorithm searches classes 435, 800, 530, or 536/23 for patents whose claims contain any of the following:

any word starting with "sequenc" (such as sequence or sequences)
any word starting with "atga", "atgc", "atgg", or "atgt" (to capture DNA sequences starting with an "ATG" sequence, which includes many complementary DNA [gene] patents)
cDNA with just one more letter
any word starting with "deoxyribo", "deoxynuclei", or "deoxynucle"
DNA with just one more letter (such as DNAs)
any five-letter word starting with "gene" (including "genes")
any word starting with "nucle"
nucleotide
any word starting with "oligonucle"
any word starting "oligodeoxy"
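The translation above can be checked against sample text with a short Python sketch. It takes the translation literally ("?" is exactly one more letter, "*" is any number of further letters) and assumes case-insensitive matching on alphabetic words; both are assumptions about the original search system, and `martinell_claims_match` is a hypothetical name.

```python
import re
from fnmatch import fnmatchcase

# Claim terms from the original Martinell query, lower-cased for matching.
MARTINELL_TERMS = [
    "sequenc*", "atga*", "atgc*", "atgg*", "atgt*", "cdna?",
    "deoxyribo*", "deoxynuclei*", "deoxynucle*", "dna?", "gene?",
    "nucle*", "nucleotide", "oligonucle*", "oligodeoxy*",
]


def martinell_claims_match(claims_text):
    """Return True if any word in the claims matches a Martinell term.

    Words are runs of letters, compared in lower case (assumption).
    """
    words = re.findall(r"[A-Za-z]+", claims_text.lower())
    return any(
        fnmatchcase(word, pattern)
        for word in words
        for pattern in MARTINELL_TERMS
    )
```

Read this literally, a claim that mentions only the bare word "DNA" or "gene" is not matched, since "dna?" and "gene?" each require one additional letter.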

Process for modifying the algorithm

The algorithm was systematically modified from the original Martinell algorithm to the algorithm used until expiration of the grant supporting the DNA Patent Database (NHGRI grant P50 HG003391) in March 2015. The algorithm evolved as follows: Individual terms were tested for "sensitivity" (whether a term identified all the patents we believed it should) and "specificity" (whether it selected only those patents and not patents lacking DNA- or RNA-based claims). The starting point for testing sensitivity and specificity was a set of patents previously read and coded by hand. This work was done by Bi Ade, a research assistant at Georgetown's Kennedy Institute of Ethics, under the supervision of Bob Cook-Deegan, so the result is sometimes termed the "AB/BCD" or "Ade/Cook-Deegan" algorithm.
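As an illustration of the two measures, the Python sketch below computes sensitivity and specificity for one search term against a hand-coded reference set. It uses the standard definitions (true positives over all DNA/RNA-based patents, and true negatives over all other patents), which is an assumption about how the tests were scored; the function name and the use of sets of patent identifiers are hypothetical.

```python
def sensitivity_specificity(retrieved, dna_patents, all_patents):
    """Score one search term against a hand-coded reference set.

    retrieved   -- set of patent IDs the term selected
    dna_patents -- set of IDs hand-coded as DNA- or RNA-based
    all_patents -- set of every hand-coded patent ID
    Returns (sensitivity, specificity); a value is None when undefined.
    """
    retrieved = retrieved & all_patents          # ignore uncoded patents
    negatives = all_patents - dna_patents        # hand-coded as not DNA/RNA

    true_pos = len(retrieved & dna_patents)
    false_neg = len(dna_patents - retrieved)
    false_pos = len(retrieved & negatives)
    true_neg = len(negatives - retrieved)

    sensitivity = (
        true_pos / (true_pos + false_neg) if dna_patents else None
    )
    specificity = (
        true_neg / (true_neg + false_pos) if negatives else None
    )
    return sensitivity, specificity
```

For example, if 4 of 10 hand-coded patents are DNA-based and a term selects 3 of those 4 plus 1 of the other 6, sensitivity is 3/4 and specificity is 5/6.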

To expand the algorithm, we gathered a set of all patents assigned to companies known to do primarily genomic research (Human Genome Sciences and Incyte). Those patents were read and coded by hand, rejecting patents not based on DNA or RNA (e.g., each company had some protein and peptide patents). Patents that contained DNA-based claims but were not captured by the Martinell algorithm were then reviewed to identify nucleic-acid-specific terms that would identify them. Those terms were added to the list (e.g., "polynucleotide" was added this way). Finally, all USPTO patents were searched for terms specific to nucleic acids, and all or a sample of those patents were read to verify that they were based on DNA or RNA.

We eliminated terms that did not improve either sensitivity or specificity (using the "ATG" terms, for example, did not identify any patents not already identified). In particular, we rejected class 530 (protein and peptides) and 536/23.2-23.74, and added classes that contained some DNA-based patents that were not included in the Martinell algorithm. We added many new terms specific to nucleic acids, and retained terms that retrieved more than 4 previously unidentified patents (all years), after verifying that the newly identified patents included at least one DNA- or RNA-based claim. The term that introduced the most spurious (non-DNA-based) patents was "sequenc*" (words starting with "sequenc").
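The retention rule described above can be sketched in Python as a simple set comparison. The sketch assumes each candidate term is compared against a fixed set of already identified patents (whether that set grew as terms were added is not stated), and it cannot represent the manual verification step. `retained_terms` and its parameters are hypothetical names, and the identifiers in the usage note are toy data.

```python
def retained_terms(term_hits, already_identified, threshold=4):
    """Keep candidate terms that retrieve more than `threshold` new patents.

    term_hits          -- dict mapping a term to the set of patent IDs it hits
    already_identified -- set of patent IDs found by the existing algorithm
    Returns a dict mapping each retained term to its newly found patent IDs.
    Manual review of the new patents is still required and is not modeled.
    """
    kept = {}
    for term, hits in term_hits.items():
        new_hits = hits - already_identified
        if len(new_hits) > threshold:
            kept[term] = new_hits
    return kept
```

With toy data, a term that adds five unseen patents is kept, while terms that add four or none are dropped.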

The results of searches on the USPTO's EAST and WEST software (performed on site at the USPTO in Crystal City, Virginia) were compared to Delphion search results for replicability before shifting to the Delphion search system. The Delphion patent database was originally developed by IBM, and then became the basis for Thomson Reuters' database, which in turn was retired and replaced by Thomson Innovation in 2014. Mark Hakkarinen updated the search algorithm to use a PHP-based regular expression approach on Google XML full-text patent data files. One process would identify all matching patents within a weekly file, and a second process would download the data and prepare it for upload into the DPD. Please contact Mark Hakkarinen if you are interested in this approach.

Last updated on June 9th, 2015

Kennedy Institute of Ethics, Georgetown University

For comments, suggestions, information, or questions you may contact:

Bob Cook-Deegan or Mark Hakkarinen


Duke Institute for Genome Science and Policy