Neural network databases
- Data Sets
- Some links to datasets at the Center for Biomedical Modeling Research (CBMR).
- DELVE
- Data for Evaluating Learning in Valid Experiments. DELVE is a
collection of datasets from many sources and a standardised
environment within which this data can be used to assess the
performance of procedures that learn relationships based primarily
on such empirical data.
- Face Detection Test Set at CMU
- 42 CompuServe GIF format gray-scale images with 169 labeled face locations.
Our work concentrated on finding frontal views of faces in scenes,
so the list of faces includes mainly those faces looking
towards the camera; extreme side views are ignored. These faces
vary in size from ~20 to several hundred pixels. Image
lighting/noise/graininess/contrast also varies; images were taken from
CCD cameras, scanned photographs, scanned magazine & newspaper
pictures, and hand-drawn images. To get the images:
http://www.ius.cs.cmu.edu/IUS/dylan_usr0/har/faces/test/index.html.
A demo of the system and papers are at
http://www.cs.cmu.edu/~baluja.
- The HCRC Map Task Corpus
- The Map Task Corpus is a set of 8
CD-ROMs containing linked audio and transcriptions of a total of about
18 hours of spontaneous speech that was recorded from 128 two-person
conversations according to a detailed experimental design.
In Europe please contact
Henry Thompson, HCRC, or
Dawn Griesbach, ELSNET. Outside Europe please contact
Elizabeth Hodas,
Linguistic Data Consortium.
- UCI KDD Archive
- The UC Irvine Knowledge Discovery in Databases (KDD) Archive is a
repository of large datasets that encompasses a
wide variety of data types, analysis tasks, and application areas.
The primary role of this repository is to serve as a benchmark
testbed to enable researchers in knowledge discovery and data
mining to scale existing and future data analysis algorithms to
very large and complex data sets.
- LEXA: Corpus processing software
- Lexa, a set of programs for lexical data processing, written by
Raymond Hickey, is now available from the Norwegian Computing Centre
for the Humanities for about 100 USD. For more information and an
order form, send the following line to
FILESERV@HD.UIB.NO:
send icame lexa.info. This file can also be fetched
with FTP.
- Machine Learning Data Sets
- Several zip-encoded data sets.
- Netlib
- Netlib is a collection of mathematical software, papers, and databases.
- NIST databases
- For NIST ordering information contact:
srdata@enh.nist.gov. For
further information contact Craig Watson at
craig@magi.ncsl.nist.gov.
- Database 2 -- Structured Forms Reference Set (SFRS)
- The NIST database of structured forms contains 5,590 full page
images of simulated tax forms completed using machine print.
- Database 3 -- Binary Images of Handwritten Segmented
Characters (HWSC)
- The NIST database of handwritten segmented characters
contains 313,389 isolated character images segmented from
the 2,100 full-page images distributed with "NIST Special
Database 1".
- Database 4 -- 8-Bit Gray Scale Images of Fingerprint
Image Groups (FIGS)
- The NIST database of fingerprint images contains 2000 8-bit
gray scale fingerprint image pairs.
- Database 18 -- Mugshot Identification Database (MID)
- There are images of 1573 individuals (cases), 1495 male and 78
female. The database contains both front and side (profile) views
when available.
- Database 19 -- Handprinted Forms and Characters Database
- NIST's entire corpus of training materials for handprinted
document and character recognition. It supersedes Special
Databases 3 and 7. "Final" accumulation of NIST's handprinted
sample data.
- Database 20 -- Technical Document Image Database
- 23,468 high-resolution binary images
obtained from copyright-expired scientific and technical journals
and books. The images contain a very rich set of graphic elements
such as graphs, tables, equations, two column text, maps,
pictures, footnotes, annotations, and arrays of such elements.
Special Database 20 is available as a set of four 5.25-inch CD-ROMs
in the ISO-9660 format. Price: $1000.00 US.
- Solar and Upper Atmospheric Data Services
- The Solar-Terrestrial Physics division of the National Geophysical Data
Center is the focal point for data pertaining to solar activity, the
ionosphere, and geomagnetic variations.
- StatLib at CMU
- StatLib is a system for distributing statistical software, datasets,
and information by electronic mail, FTP, gopher, and WWW. There are
many datasets, plus software and statistics algorithms.
- Tulips1 AV Database
- Tulips1 is a small Audio-Visual database useful for simple projects on
audio-visual speech recognition. Tulips1 includes 12 subjects saying
the first four digits in English. The audio part is in .au format;
the visual part was digitized at 30 fps and is in .pgm format.
Available through anonymous FTP at ergo.ucsd.edu.
- UNIPEN project of data exchange and recognizer benchmarks
- UNIPEN is a project of data exchange and benchmarks for on-line
handwriting recognition, started at the initiative of the technical
committee 11 of the IAPR. An experimental
FTP setup
is currently being tested at the Nijmegen Institute for
Cognition and Information (NICI).
- U.S. Bureau of the Census
- Main Data Bank.
- WordNet Version 1.3
- Available by FTP.
Last modified: Thu Jul 15 14:14:15 EDT 1999