
jsaiyyp 14 2021-08-31 Machine Learning



1. promoter instances: C. Harley (CHARLEY '@' McMaster.CA) and R. Reynolds

2. non-promoter instances and domain theory: M. Noordewier
-- (non-promoters derived from work of lab of Prof. Tom Record, University of Wisconsin Biochemistry Department)


M. Noordewier and J. Shavlik, {noordewi,shavlik}

Data Set Information:

This dataset was developed to help evaluate a "hybrid" learning algorithm ("KBANN") that uses examples to inductively refine preexisting knowledge. Under a "leave-one-out" methodology, various ML algorithms produced the following error counts (misclassified instances out of 106). (See Towell, Shavlik, & Noordewier, 1990, for details.)

System -- Errors -- Comments
KBANN -- 4/106 -- a hybrid ML system
BP -- 8/106 -- std backprop with one hidden layer
O'Neill -- 12/106 -- ad hoc technique from the bio. lit.
Near-Neigh -- 13/106 -- a nearest-neighbor algo (k=3)
ID3 -- 19/106 -- Quinlan's decision-tree builder
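
As an illustration of the leave-one-out methodology behind these counts, here is a minimal Python sketch. Each instance is held out in turn, a classifier is built from the remaining instances, and the held-out instance is scored. The classifier is a 3-nearest-neighbor vote, and the Hamming distance is an assumption made for this sketch; the source does not say which distance the Near-Neigh entry used. The toy sequences are invented and are not taken from the dataset.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Majority label among the k training sequences closest to the query."""
    nearest = sorted(train, key=lambda item: hamming(item[1], query))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_errors(data, k=3):
    """Hold out each instance in turn and count the misclassified ones."""
    errors = 0
    for i, (label, seq) in enumerate(data):
        train = data[:i] + data[i + 1:]  # everything except instance i
        if knn_predict(train, seq, k) != label:
            errors += 1
    return errors


# Toy data, invented for illustration only (hypothetical sequences).
toy = [
    ("+", "ttgaca"), ("+", "ttgacg"), ("+", "ttgact"),
    ("-", "gcgcgc"), ("-", "gcgcga"), ("-", "gcgcgt"),
]
errors = leave_one_out_errors(toy)
print(f"{errors}/{len(toy)} errors ({errors / len(toy):.1%})")
```

On the real data the same loop would run 106 times, once per instance, which is why every entry in the table is reported out of 106.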

Type of domain: non-numeric, nominal (one of A, G, T, C)
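
Because the attributes are nominal rather than numeric, a learner that expects numeric inputs needs some encoding of the four symbols. A one-hot encoding is one common choice, sketched below in Python. The source does not say which encoding the listed systems used, so this is an assumption, and the name `one_hot` is hypothetical.

```python
ALPHABET = "agtc"  # the four nominal values, in an arbitrary fixed order


def one_hot(seq):
    """Encode a nucleotide string as a flat list with 4 indicator values per position."""
    vec = []
    for base in seq.lower():
        if base not in ALPHABET:
            raise ValueError(f"unexpected symbol: {base!r}")
        vec.extend(1 if base == sym else 0 for sym in ALPHABET)
    return vec


print(one_hot("ag"))
```

Under this encoding a 57-position sequence becomes a vector of 57 x 4 = 228 values.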

Note: DNA nucleotides can be grouped into a hierarchy, as shown below:

              X (any)
            /   \
(purine)  R       Y (pyrimidine)
         / \     / \
        A   G   T   C

Here is that hierarchy in a text-friendly format:

X (any)
. R (purine)
. . A
. . G
. Y (pyrimidine)
. . T
. . C
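
The hierarchy can be written as a child-to-parent map, which makes it easy to find the most specific symbol covering two nucleotides. The Python sketch below is only an illustration; the names `PARENT`, `ancestors`, and `generalize` are hypothetical and do not come from the dataset documentation.

```python
# Parent of each symbol in the hierarchy; "X" (any) is the root.
PARENT = {"A": "R", "G": "R", "T": "Y", "C": "Y", "R": "X", "Y": "X"}


def ancestors(symbol):
    """Path from a symbol up to the root, e.g. A -> R -> X."""
    path = [symbol.upper()]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def generalize(a, b):
    """Most specific symbol in the hierarchy that covers both a and b."""
    seen = ancestors(a)
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError(f"symbols not in the hierarchy: {a!r}, {b!r}")


print(generalize("a", "g"))  # two purines
print(generalize("a", "t"))  # a purine and a pyrimidine
```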

Attribute Information:

1. One of {+/-}, indicating the class ("+" = promoter).
2. The instance name (non-promoters named by position in the 1500-long nucleotide sequence provided by T. Record).
3-59. The remaining 57 fields are the sequence, starting at position -50 (p-50) and ending at position +7 (p7). Each of these fields is filled by one of {a, g, t, c}.
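
The attribute layout above can be read with a few lines of Python. The sketch rests on two assumptions that the description does not state. First, it assumes each instance is one comma-separated line holding the class, the name, and the 57 nucleotides as one contiguous string, possibly padded with whitespace. Second, it assumes the position numbering skips 0, which is what makes -50 through +7 come to exactly 57 positions. The function name and the example line are hypothetical.

```python
SEQ_LEN = 57
# Assumed numbering: -50..-1 followed by +1..+7, with no position 0.
POSITIONS = list(range(-50, 0)) + list(range(1, 8))


def parse_instance(line):
    """Split one instance line into class, name, and a position-indexed sequence."""
    label, name, seq = (field.strip() for field in line.strip().split(","))
    seq = seq.lower()
    if label not in {"+", "-"}:
        raise ValueError(f"unexpected class: {label!r}")
    if len(seq) != SEQ_LEN or set(seq) - set("agtc"):
        raise ValueError(f"expected {SEQ_LEN} symbols from a, g, t, c")
    return {
        "is_promoter": label == "+",
        "name": name,
        "sequence": dict(zip(POSITIONS, seq)),  # e.g. sequence[-50], sequence[7]
    }


# Hypothetical example line; the name and sequence are invented.
example = "+,EXAMPLE,\t\t" + "acgt" * 14 + "a"
instance = parse_instance(example)
print(instance["name"], instance["is_promoter"], instance["sequence"][-50])
```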

Relevant Papers:

Harley, C. and Reynolds, R. 1987. "Analysis of E. coli Promoter Sequences." Nucleic Acids Research, 15:2343-2361.

Towell, G., Shavlik, J. and Noordewier, M. 1990. "Refinement of Approximate Domain Theories by Knowledge-Based Artificial Neural Networks." In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90).


