Our technology analyzes the whole structure of a document and identifies tables, paragraphs, lists, headlines, text sizes and attributes, footnotes, headers, margins, footers, shapes and figures.
Each document is broken down into its semantically and linguistically correct pieces, independent of its original format: it is segmented into pages, paragraphs, sentences, multi-word tokens and single words.
It also extracts named entities and their relations to each other:
Numbers of any type and form, dates, events, buildings and facilities, countries, cities and states, named languages, references to laws, locations and territories, monetary values, nationalities, religious or political groups, named ordinals, companies and institutions, percentages, phone numbers, standards like IBAN or ISSN, emails and web links, real and fictional people, non-service products like objects, vehicles, food etc., measurements like weight or distance, time values and works of art.
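
To make the segmentation and entity extraction described above more concrete, here is a minimal Python sketch. It is not our actual implementation: the names `Sentence`, `Paragraph`, `Entity`, `segment` and `find_entities` are hypothetical, the splitting rules are deliberately naive, and only two of the entity types (emails and percentages) are covered with simple regular expressions.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    text: str
    words: List[str] = field(default_factory=list)


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Entity:
    label: str  # hypothetical label, e.g. "EMAIL" or "PERCENT"
    text: str


# Illustrative patterns only; a real system would cover many more types.
ENTITY_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?\s?%"),
}


def segment(text: str) -> List[Paragraph]:
    """Split plain text into paragraphs, sentences and single words."""
    paragraphs = []
    # Assumption: paragraphs are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        sentences = []
        # Assumption: a sentence ends at ., ! or ? followed by whitespace.
        for raw in re.split(r"(?<=[.!?])\s+", block.strip()):
            if raw:
                sentences.append(Sentence(raw, re.findall(r"\w+", raw)))
        paragraphs.append(Paragraph(sentences))
    return paragraphs


def find_entities(text: str) -> List[Entity]:
    """Return every match of the illustrative patterns, grouped by label."""
    return [
        Entity(label, match.group())
        for label, pattern in ENTITY_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
```

Calling `segment` on a text with two blank-line-separated blocks yields two `Paragraph` objects, each holding its sentences and their words, and `find_entities` lists the emails and percentages found in the same text.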