Neural network databases


Data Sets
Links to datasets at the Center for Biomedical Modeling Research (CBMR).

DELVE
Data for Evaluating Learning in Valid Experiments. DELVE is a collection of datasets from many sources and a standardised environment within which these data can be used to assess the performance of procedures that learn relationships based primarily on such empirical data.

Face Detection Test Set at CMU
42 CompuServe GIF format gray-scale images with 169 labeled face locations. Our work concentrated on finding frontal views of faces in scenes, so the list of faces includes mainly those faces looking towards the camera; extreme side views are ignored. These faces vary in size from ~20 to several hundred pixels. Image lighting, noise, graininess, and contrast also vary; images were taken from CCD cameras, scanned photographs, scanned magazine and newspaper pictures, and hand-drawn images. To get the images: http://www.ius.cs.cmu.edu/IUS/dylan_usr0/har/faces/test/index.html. A demo of the system and papers are at http://www.cs.cmu.edu/~baluja.

The HCRC Map Task Corpus
The Map Task Corpus is a set of 8 CD-ROMs containing linked audio and transcriptions of about 18 hours of spontaneous speech, recorded from 128 two-person conversations according to a detailed experimental design. In Europe, please contact Henry Thompson, HCRC, or Dawn Griesbach, ELSNET. Outside Europe, please contact Elizabeth Hodas, Linguistic Data Consortium.

UCI KDD Archive
UC Irvine Knowledge Discovery in Databases (KDD) Archive, a collection of large datasets that encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to serve as a benchmark testbed to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.

LEXA: Corpus processing software
Lexa, a set of programs for lexical data processing written by Raymond Hickey, is now available from the Norwegian Computing Centre for the Humanities for about 100 USD. For more information and an order form, send the following line to FILESERV@HD.UIB.NO: send icame lexa.info. This file can also be fetched with FTP.

Machine Learning Data Sets
Several zip-compressed data sets.

Netlib
Netlib is a collection of mathematical software, papers, and databases.

NIST databases
For NIST ordering information contact: srdata@enh.nist.gov. For further information contact Craig Watson at craig@magi.ncsl.nist.gov.
Database 2 -- Structured Forms Reference Set (SFRS)
The NIST database of structured forms contains 5,590 full-page images of simulated tax forms completed using machine print.
Database 3 -- Binary Images of Handwritten Segmented Characters (HWSC)
The NIST database of handwritten segmented characters contains 313,389 isolated character images segmented from the 2,100 full-page images distributed with "NIST Special Database 1".
Database 4 -- 8-Bit Gray Scale Images of Fingerprint Image Groups (FIGS)
The NIST database of fingerprint images contains 2,000 8-bit gray scale fingerprint image pairs.
Database 18 -- Mugshot Identification Database (MID)
There are images of 1,573 individuals (cases), 1,495 male and 78 female. The database contains both front and side (profile) views when available.
Database 19 -- Handprinted Forms and Characters Database
NIST's entire corpus of training materials for handprinted document and character recognition. It supersedes Special Databases 3 and 7 and is the "final" accumulation of NIST's handprinted sample data.
Database 20 -- Technical Document Image Database
23,468 high-resolution binary images obtained from copyright-expired scientific and technical journals and books. The images contain a very rich set of graphic elements such as graphs, tables, equations, two-column text, maps, pictures, footnotes, annotations, and arrays of such elements. Special Database 20 is available as a set of four 5.25-inch CD-ROMs in the ISO-9660 format. Price: $1000.00 US.

Solar and Upper Atmospheric Data Services
The Solar-Terrestrial Physics division of the National Geophysical Data Center is the focal point for data pertaining to solar activity, the ionosphere, and geomagnetic variations.

StatLib at CMU
StatLib is a system for distributing statistical software, datasets, and information by electronic mail, FTP, gopher, and WWW. There are many datasets, plus software and statistics algorithms.

Tulips1 AV Database
Tulips1 is a small audio-visual database useful for simple projects on audio-visual speech recognition. Tulips1 includes 12 subjects saying the first four digits in English. The audio is in .au format; the video was digitized at 30 fps and is stored in .pgm format. Available through anonymous FTP at ergo.ucsd.edu.
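
For readers who want to inspect the .pgm frames without extra libraries, here is a minimal sketch of a PGM reader in Python. It assumes the frames are standard Netpbm PGM files with a maximum gray value of at most 255; whether Tulips1 uses the binary (P5) or ASCII (P2) variant is not stated above, so the sketch handles both. The function name parse_pgm is hypothetical and not part of the Tulips1 distribution.

```python
def parse_pgm(data: bytes):
    """Parse a Netpbm PGM image (P2 ASCII or P5 binary, maxval <= 255).

    Returns (width, height, maxval, rows), where rows is a list of
    per-row lists of integer gray values.
    """
    pos = 0
    tokens = []
    # The header holds four whitespace-separated tokens:
    # magic number, width, height, maxval. '#' starts a comment line.
    while len(tokens) < 4:
        while pos < len(data) and (data[pos:pos + 1].isspace()
                                   or data[pos:pos + 1] == b"#"):
            if data[pos:pos + 1] == b"#":
                # Skip the comment up to (not including) the newline.
                while pos < len(data) and data[pos:pos + 1] != b"\n":
                    pos += 1
            else:
                pos += 1
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        if start == pos:
            raise ValueError("truncated PGM header")
        tokens.append(data[start:pos])

    magic = tokens[0]
    width, height, maxval = (int(t) for t in tokens[1:])
    if maxval > 255:
        raise ValueError("16-bit PGM is not handled in this sketch")

    if magic == b"P5":
        pos += 1  # exactly one whitespace byte separates header and raster
        raster = data[pos:pos + width * height]
        pixels = list(raster)
    elif magic == b"P2":
        # ASCII variant: remaining tokens are decimal gray values
        # (comments inside the raster are not handled here).
        pixels = [int(t) for t in data[pos:].split()]
    else:
        raise ValueError("not a PGM file")

    if len(pixels) != width * height:
        raise ValueError("pixel count does not match header dimensions")

    rows = [pixels[r * width:(r + 1) * width] for r in range(height)]
    return width, height, maxval, rows
```

To read a frame from disk, pass the file contents as bytes, for example parse_pgm(open(path, "rb").read()), where path is whatever local file name the frame was saved under.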

UNIPEN project of data exchange and recognizer benchmarks
UNIPEN is a project of data exchange and benchmarks for on-line handwriting recognition, started at the initiative of Technical Committee 11 of the IAPR. An experimental FTP setup is currently being tested at the Nijmegen Institute for Cognition and Information (NICI).

U.S. Bureau of the Census
Main Data Bank.

WordNet Version 1.3
Available by FTP.

Last modified: Thu Jul 15 14:14:15 EDT 1999