Delphion Search Algorithm
The DNA Patent Database collection is generated with the following search:
((047???* OR 119* OR 260???* OR 426* OR 435* OR 514* OR 536022* OR 5360231 OR 536024* OR 536025* OR 800*) <in> NC)
((antisense OR <case><wildcard>cDNA* OR centromere OR deoxyoligonucleotide OR deoxyribonucleic OR deoxyribonucleotide OR <case><wildcard>DNA* OR exon OR "gene" OR "genes" OR genetic OR genome OR genomic OR genotype OR haplotype OR intron OR <case><wildcard>mtDNA* OR nucleic OR nucleotide OR oligonucleotide OR oligodeoxynucleotide OR oligoribonucleotide OR plasmid OR polymorphism OR polynucleotide OR polyribonucleotide OR ribonucleotide OR ribonucleic OR "recombinant DNA" OR <case><wildcard>RNA* OR <case><wildcard>mRNA* OR <case><wildcard>rRNA* OR <case><wildcard>siRNA* OR <case><wildcard>snRNA* OR <case><wildcard>tRNA* OR ribonucleoprotein OR <case><wildcard>hnRNP* OR <case><wildcard>snRNP* OR <case><wildcard>SNP*) <in> CLAIMS)
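The first clause of the query above restricts the search to a list of class-code patterns. A minimal sketch of how those patterns could be checked in Python follows. It assumes that `?` stands for exactly one character and `*` for any run of characters (the same meaning Python's `fnmatch` gives them), and that class codes are stored as plain digit strings such as the hypothetical "435006". The function name `class_matches` is illustrative and is not part of Delphion.

```python
from fnmatch import fnmatchcase

# Class-code patterns copied from the first clause of the query above.
CLASS_PATTERNS = [
    "047???*", "119*", "260???*", "426*", "435*", "514*",
    "536022*", "5360231", "536024*", "536025*", "800*",
]


def class_matches(class_code: str) -> bool:
    """Return True if the class code matches any pattern in the list.

    Assumption: '?' is one character and '*' is zero or more characters.
    """
    return any(fnmatchcase(class_code, pattern) for pattern in CLASS_PATTERNS)


if __name__ == "__main__":
    # Hypothetical class codes, used only to exercise the patterns.
    for code in ["435006", "5360231", "53602305", "530350"]:
        print(code, class_matches(code))
```

Under these assumptions a code beginning with 435 passes, while a code beginning with 530 or with 536023 followed by anything other than a single 1 does not.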
Last updated: June 2015
This will likely be the final update for the DNA Patent Database, as the grant supporting it (P50 HG003391) expired on March 31, 2015.
A Note on the History of the Algorithm
Original Martinell algorithm:
The DNA Patent Database algorithm is based on an original algorithm developed by USPTO Senior Examiner James Martinell in response to a 1993 request from the Office of Technology Assessment, U.S. Congress.
((435* OR 800* OR 530* OR 536/23*) <in> NC)
((sequenc* OR (atga* OR atgc* OR atgg* OR atgt*) OR cDNA? OR deoxyribo* OR deoxynuclei* OR deoxynucle* OR dna? OR gene? OR nucle* OR nucleotide OR oligonucle* OR oligodeoxy*) <in> CLAIMS)
Translation of original Martinell algorithm:
This algorithm searches classes 435, 800, 530, or 536/23 and, within those classes, selects patents whose claims contain any of the terms listed above.
Process for modifying the algorithm
The algorithm was systematically modified from the original Martinell algorithm into the version used until the grant supporting the DNA Patent Database (NHGRI grant P50 HG003391) expired in March 2015. It evolved as follows. Individual terms were tested for "sensitivity" (whether a term identified all the patents we believed it should) and "specificity" (whether it selected only those patents, and not patents lacking DNA- or RNA-based claims). The starting point for testing sensitivity and specificity was a set of patents previously read and coded by hand. This work was done by Bi Ade, a research assistant at Georgetown's Kennedy Institute of Ethics, under the supervision of Bob Cook-Deegan, so the result is sometimes termed the "AB/BCD" or "Ade/Cook-Deegan" algorithm.
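As an illustration of the term testing described above, the sketch below scores one search term against a hand-coded reference set. It assumes the standard definitions (sensitivity as the share of DNA- or RNA-based patents the term retrieves, specificity as the share of other patents it correctly leaves out); the page does not state the exact formulas used, and the function name and patent identifiers are hypothetical.

```python
def sensitivity_specificity(retrieved, relevant, all_coded):
    """Score one search term against a hand-coded reference set.

    retrieved: patent IDs the term returns
    relevant:  hand-coded patents with DNA- or RNA-based claims
    all_coded: every patent in the hand-coded set
    """
    retrieved = set(retrieved) & set(all_coded)
    relevant = set(relevant)
    not_relevant = set(all_coded) - relevant

    true_pos = len(retrieved & relevant)
    false_neg = len(relevant - retrieved)
    false_pos = len(retrieved & not_relevant)
    true_neg = len(not_relevant - retrieved)

    # Guard against empty groups rather than dividing by zero.
    sensitivity = true_pos / (true_pos + false_neg) if relevant else float("nan")
    specificity = true_neg / (true_neg + false_pos) if not_relevant else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical identifiers standing in for hand-coded patents.
    coded = {"A", "B", "C", "D", "E"}
    dna_based = {"A", "B", "C"}
    term_hits = {"A", "B", "D"}
    print(sensitivity_specificity(term_hits, dna_based, coded))
```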
To expand the algorithm, we gathered a set of all patents assigned to companies known to do primarily genomic research (Human Genome Sciences and Incyte). Those patents were read and coded by hand, rejecting patents not based on DNA or RNA (e.g., each company had some protein and peptide patents). Patents that contained DNA-based claims but were not captured by the Martinell algorithm were then reviewed to identify nucleic-acid-specific terms that would capture them. Those terms were added to the list (e.g., "polynucleotide" was added this way). Finally, all USPTO patents were searched for terms specific to nucleic acids, and all or a sample of those patents were read to verify that they were based on DNA or RNA.
We eliminated terms that did not improve either sensitivity or specificity (the "ATG" terms atga*, atgc*, atgg*, and atgt*, for example, did not identify any patents not already identified). In particular, we rejected class 530 (proteins and peptides) and 536/23.2-23.74, and added classes that contained some DNA-based patents but were not included in the Martinell algorithm. We added many new terms specific to nucleic acids, and retained terms that retrieved more than 4 previously unidentified patents (all years), after verifying that the newly identified patents included at least one DNA- or RNA-based claim. The term that introduced the most spurious (non-DNA-based) patents was "sequenc*" (words starting with "sequenc").
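The retention rule described above can be sketched as a small set operation. The sketch assumes that only newly retrieved patents confirmed by hand to contain a DNA- or RNA-based claim count toward the threshold; the page does not spell out how unverified hits were treated, and the function name, parameters, and identifiers are hypothetical.

```python
def keep_term(term_hits, already_identified, verified_dna_based, threshold=4):
    """Decide whether a candidate term is worth adding to the algorithm.

    term_hits:          patent IDs retrieved by the candidate term
    already_identified: patent IDs found by the algorithm so far
    verified_dna_based: patent IDs confirmed by hand to have a DNA/RNA claim
    A term is kept when it adds more than `threshold` verified new patents.
    """
    new_hits = set(term_hits) - set(already_identified)
    verified_new = new_hits & set(verified_dna_based)
    return len(verified_new) > threshold


if __name__ == "__main__":
    # Hypothetical identifiers: five verified new patents pass the threshold.
    hits = {1, 2, 3, 4, 5, 6, 7}
    known = {1, 2}
    verified = {3, 4, 5, 6, 7}
    print(keep_term(hits, known, verified))
```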
The results of searches on the USPTO's EAST and WEST software (searches performed on site at the USPTO in Crystal City, Virginia) were compared to Delphion search results for replicability before shifting to the Delphion search system. The Delphion patent database was originally developed by IBM, and then became the basis for Thomson Reuters' database, which in turn was retired and replaced by Thomson Innovation in 2014. Mark Hakkarinen updated the search algorithm to use a PHP-based, regular-expression approach using Google XML full-text patent data files. This process would identify all matching patents within a weekly file, and a second process would download the data and prepare it for upload into the DPD. Please contact Mark Hakkarinen if you are interested in this approach.
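To give a rough idea of what a regular-expression version of the claims clause might look like, here is a minimal sketch in Python. It is not Mark Hakkarinen's PHP implementation, which is not reproduced on this page. It assumes that terms marked `<case><wildcard>` are case-sensitive word prefixes and that all other terms are case-insensitive whole words or phrases; the function name `claims_match` is hypothetical, and parsing of the weekly XML files is left out.

```python
import re

# Terms marked <case><wildcard> in the query: case-sensitive word prefixes.
CASED_PREFIXES = [
    "cDNA", "DNA", "mtDNA", "RNA", "mRNA", "rRNA", "siRNA", "snRNA",
    "tRNA", "hnRNP", "snRNP", "SNP",
]

# Remaining terms; assumed to be case-insensitive whole words or phrases.
PLAIN_TERMS = [
    "antisense", "centromere", "deoxyoligonucleotide", "deoxyribonucleic",
    "deoxyribonucleotide", "exon", "gene", "genes", "genetic", "genome",
    "genomic", "genotype", "haplotype", "intron", "nucleic", "nucleotide",
    "oligonucleotide", "oligodeoxynucleotide", "oligoribonucleotide",
    "plasmid", "polymorphism", "polynucleotide", "polyribonucleotide",
    "ribonucleotide", "ribonucleic", "recombinant DNA", "ribonucleoprotein",
]

# A prefix must start at a word boundary and may continue with word characters.
CASED_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, CASED_PREFIXES)) + r")\w*"
)
# Plain terms must match as complete words, ignoring case.
PLAIN_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, PLAIN_TERMS)) + r")\b",
    re.IGNORECASE,
)


def claims_match(claims_text: str) -> bool:
    """Return True if the claims text contains any term from the query."""
    return bool(CASED_RE.search(claims_text) or PLAIN_RE.search(claims_text))


if __name__ == "__main__":
    print(claims_match("1. An isolated polynucleotide encoding a kinase."))
    print(claims_match("1. A general purpose gear assembly."))
```

Requiring a word boundary after each plain term keeps "gene" from matching inside unrelated words such as "general".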
Last updated on June 9th, 2015