Tesseract.tess_hocrFunction
tess_hocr(
    inst::TessInst,
    page::Integer = Int32(1)
)::Union{String, Nothing}

Extract the text in hOCR format from the image. Returns nothing if there is an error.

Arguments:

TNameDefaultDescription
RinstThe instance to grab the text from.
OpageInt32(1)The page to extract the hOCR text for.

Details:

This method will call tess_recognize() if it has not been called yet for the image. The current hOCR spec can be accessed at http://kba.cloud/hocr-spec/1.2/.

Example:

using Tesseract

download_languages()

instance = TessInst()
pix = sample_pix()

tess_image(instance, pix)
tess_resolution(instance, 72)

hocr = tess_hocr(instance)

for line in split(hocr, '\n'; keepempty = false)[1:5]
    println(strip(line))
end

# output

<div class='ocr_page' id='page_1' title='image "unknown"; bbox 0 0 500 600; ppageno 0; scan_res 72 72'>
<div class='ocr_carea' id='block_1_1' title="bbox 10 9 489 523">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 11 9 417 23">
<span class='ocr_line' id='line_1_1' title="bbox 11 9 417 23; baseline 0 -3; x_size 22.717392; x_descenders 5.5; x_ascenders 5.7391305">
<span class='ocrx_word' id='word_1_1' title='bbox 11 9 25 20; x_wconf 95'>No</span>

See also: tess_text, tess_alto, tess_tsv, tess_parsed_tsv, tess_confidences

source