ADHO DH 2020 long paper - The Semantics of Structure in Large Historical Corpora
- Author(s):
- Rik Hoekstra, Marijn Koolen
- Date:
- 2020
- Group(s):
- DH2020
- Subject(s):
- Digital humanities, Text data mining
- Item Type:
- Abstract
- Tag(s):
- digital historical corpora, information extraction, text analysis, Digital history, Text analytics
- Permanent URL:
- http://dx.doi.org/10.17613/m1rp-6460
- Abstract:
- Large historical corpora that are too big to process manually can be structured in two ways. The first is an inductive method that extracts implicit entities and meaning from textual (and sometimes visual) content. With the help of AI or manually compiled (existing) lists of entities, the entities are converted into information. The second, which Colavizza (2019) calls referential information systems, takes existing reference systems (such as archival indexes) and uses them to contextualize individual documents. Both methods are used to turn corpora into computer-accessible information systems. Ideally, combining both approaches would yield a more complete information system, but in practice they are hard to bridge because of a number of different problems. This paper presents an approach that addresses those problems and combines inductive methods of automated text analysis and information extraction with knowledge of the referential information systems to add rich semantic layers of information to large historical corpora.
- Status:
- Published
- Last Updated:
- 3 years ago
- License:
- All Rights Reserved
Downloads
Item Name: the-semantics-of-structure-in-large-historical-corpora.pdf