Type: | Package |
Title: | Extracting Order Position Tables from PDF-Based Order Documents |
Version: | 1.0.0 |
Date: | 2024-12-11 |
Maintainer: | Michael Scholz <michael.scholz@th-deg.de> |
Description: | Functions for extracting text and tables from PDF-based order documents. It provides an n-gram-based approach for identifying the language of an order document. It furthermore uses R-package 'pdftools' to extract the text from an order document. In the case that the PDF document is only including an image (because it is scanned document), R package 'tesseract' is used for OCR. Furthermore, the package provides functionality for identifying and extracting order position tables in order documents based on a clustering approach. |
License: | GPL-3 |
SystemRequirements: | Tesseract >= 5.0.0, libtesseract-dev (deb), tesseract-devel (rpm), libleptonica-dev (deb), leptonica-devel (rpm), tesseract-ocr-eng (deb), libpoppler-cpp-dev (deb), poppler-cpp-devel (rpm), poppler-data (rpm/deb), libxml2-dev (deb), libxml2-devel (rpm) |
Depends: | R(≥ 4.3.0), tidyselect |
Imports: | data.table, dplyr, matrixcalc, quanteda, rlist, stringr, tibble, tidyr, utils, purrr, digest, lubridate |
Suggests: | pdftools, tesseract, xml2 |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2024-12-11 16:46:01 UTC; mscholz |
Author: | Michael Scholz [cre, aut], Joerg Bauer [aut] |
Repository: | CRAN |
Date/Publication: | 2024-12-12 15:20:02 UTC |
Extracting order position tables from PDF-based order documents
Description
This packages provides functions for extracting text and order-position-tables from PDF-based order documents.
Details
Package: | orderanalyzer |
Type: | Package |
Version: | 1.0.0 |
Date: | 2024-12-11 |
License: | GPL-3 |
Depends: | R (>= 4.3.0) |
Author(s)
Michael Scholz michael.scholz@th-deg.de
Joerg Bauer joerg.bauer@th-deg.de
Extract tables from a given words-dataframe
Description
This function extracts order-position-tables from PDF-based order documents. It tries to identify table rows based on a clustering approach and thereafter identifies the column structure. A table row can consist of multiple text rows and the text rows can span different columns. This function furthermore tries to identify the meaning of the columns (position, articleID, description, quantity, quanity unit, unit price, total price, currency, date).
Usage
extractTables(text, minCols = 3, maxDistance = 20, entityNames = NA)
Arguments
text |
List including several representations of text extracted from a PDF file. This list is generated by the function extractText. |
minCols |
Number of columns a table must minimal consist of |
maxDistance |
Number of text lines that can maximally exist between the start of two table rows |
entityNames |
A list of four name vectors (currencyUnits, quantityUnits, headerNames, noTableNames). Each vector contains strings that correspond to currency units, quantity units, header names or names of entities not being a table. |
Value
List of lists describing the tables. Each sublist includes a data frame (data) which is the identified table, the position of text lines that constitute the table and the position of the significant lines.
Examples
file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer")
text <- extractText(file)
# Extracting order tables without any further information
tables <- extractTables(text)
tables[[1]]$data
# Extracting order tables with further information
tables <- extractTables(text,
entityNames = list(currencyUnits = enc2utf8(c("eur", "euro", "\u20AC")),
quantityUnits = enc2utf8(c("pcs", "pcs.")),
headerNames = enc2utf8(c("pos", "item", "quantity")),
noTableNames = enc2utf8(c("order total", "supplier number")))
)
tables[[1]]$data
# Extracting order tables from a German document
file <- system.file("extdata", "OrderDocument_de.pdf", package = "orderanalyzer")
text <- extractText(file)
tables <- extractTables(text)
tables[[1]]$data
Extracts the text from a PDF file
Description
This function extracts text from PDF documents and returns the text as a string, as a list of lines and as a list of words. It uses 'pdftools' to extract the content from textual PDF files and 'tesseract' to extract the content from image-based PDF-files.
Usage
extractText(file)
Arguments
file |
Path to the PDF file |
Value
List including the extracted text, a data table including the lines, a data table including the words, the type and language of the document.
Examples
file <- system.file("extdata", "OrderDocument_en.pdf", package = "orderanalyzer")
text <- extractText(file)
text$words
Identifies the language of a given text based on frequent trigrams
Description
This function identifies the language of a given string based on the most frequent trigrams in different languages. Supported languages are Czech, Dutch, English, French, German, Spanish, Latvian and Lithuanian.
Usage
identifyLanguage(text)
Arguments
text |
String for which the language should be identified |
Value
Name of the detected language.
Examples
text <- "The tea in the cup still is hot."
language <- identifyLanguage(text)
language