Tesseract.tess_alto — Function
tess_alto(
    inst::TessInst,
    page::Integer = Int32(1)
)::Union{String, Nothing}
Extract the text from the image in ALTO XML format. Returns nothing if an error occurs.
Arguments:
| T | Name | Default | Description |
|---|---|---|---|
| R | inst | | The instance to grab the text from. |
| O | page | Int32(1) | The page to extract the ALTO text for. |
Details:
This method calls tess_recognize() if it has not already been called for the image. The current ALTO specification is available at https://github.com/altoxml/documentation/wiki/Versions.
Example:
using Tesseract
download_languages()
instance = TessInst()
pix = sample_pix()
tess_image(instance, pix)
tess_resolution(instance, 72)
alto = tess_alto(instance)
for line in split(alto, '\n'; keepempty = false)[1:5]
println(strip(line))
end
# output
<Page WIDTH="500" HEIGHT="600" PHYSICAL_IMG_NR="0" ID="page_0">
<PrintSpace HPOS="0" VPOS="0" WIDTH="500" HEIGHT="600">
<ComposedBlock ID="cblock_0" HPOS="10" VPOS="9" WIDTH="479" HEIGHT="514">
<TextBlock ID="block_0" HPOS="11" VPOS="9" WIDTH="406" HEIGHT="14">
<TextLine ID="line_0" HPOS="11" VPOS="9" WIDTH="406" HEIGHT="14">
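Because tess_alto can return nothing, and a short document may contain fewer than five lines, calling code may want to guard against both cases before splitting the result. The following is a minimal sketch; alto_lines is a hypothetical helper written for illustration and is not part of Tesseract.jl.

```julia
# Hypothetical helper (not part of Tesseract.jl): return up to `n` non-empty,
# stripped lines of an ALTO string, or an empty vector when the extraction
# failed and `nothing` was returned.
function alto_lines(alto::Union{AbstractString, Nothing}, n::Integer = 5)
    alto === nothing && return String[]
    lines = split(alto, '\n'; keepempty = false)
    # Clamp the range so short outputs do not raise a BoundsError.
    last_index = min(n, length(lines))
    return [String(strip(line)) for line in lines[1:last_index]]
end

# Works on any string, so it can be tried without running OCR.
for line in alto_lines("  <Page ID=\"page_0\">\n\n  <PrintSpace>\n")
    println(line)
end
```

With a prepared instance, the call would be alto_lines(tess_alto(instance)); an empty result then signals either an extraction error or empty output.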
See also: tess_text, tess_hocr, tess_tsv, tess_parsed_tsv, tess_confidences