Process multiple images
This workflow is typically used to OCR every page of a book.
using Tesseract
# Generate some pages to load.
write("page01.tiff", sample_tiff())
write("page02.tiff", sample_tiff())
write("page03.tiff", sample_tiff())
download_languages() # Make sure we have the data files.
instance = TessInst()
pipeline = TessPipeline(instance)
text = tess_pipeline_text(pipeline)
hocr = tess_pipeline_hocr(pipeline)
tsv = tess_pipeline_tsv(pipeline)
tess_run_pipeline(pipeline, "My First Book") do add
    add(pix_read("page01.tiff"), 72)
    add(pix_read("page02.tiff"), 72)
    add(pix_read("page03.tiff"), 72)
end
println("Text size: $(length(text[]))")
println("HOCR size: $(length(hocr[]))")
println("TSV size: $(length(tsv[]))")
# output
Text size: 4430
HOCR size: 91120
TSV size: 34934
To process multiple pages and combine them into a single document, use the TessPipeline object. First, a TessInst object is created to handle the OCR; it is then passed to TessPipeline during initialization.
You can generate multiple document types simultaneously. In the example above, tess_pipeline_text generates plain-text (TXT) output, tess_pipeline_hocr generates an hOCR XML document, and tess_pipeline_tsv reports details about the OCR in tab-separated values (TSV) format.
Finally, tess_run_pipeline is called to generate the documents, with an optional title. Some output formats ignore the title; others add it to the output. tess_run_pipeline takes a callback, which receives a function used to specify the images to decode along with their resolution. To indicate an error, your callback can return false, which in turn causes tess_run_pipeline to return false as well.
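The error path described above can be sketched as follows. This is only a minimal sketch under stated assumptions: the `pages` list and the `isfile` check are illustrative and not part of the library, and ending the callback with an explicit `return true` is an assumption about what counts as success (the example above simply ends with the last `add` call).

```julia
using Tesseract

download_languages()  # Make sure we have the data files.
instance = TessInst()
pipeline = TessPipeline(instance)
text = tess_pipeline_text(pipeline)

# Illustrative page list; these are the files written in the example above.
pages = ["page01.tiff", "page02.tiff", "page03.tiff"]

ok = tess_run_pipeline(pipeline, "My First Book") do add
    for page in pages
        # Returning false from the callback signals an error to the pipeline.
        isfile(page) || return false
        add(pix_read(page), 72)
    end
    return true  # Assumption: an explicit true marks success.
end

if ok === false
    @warn "Pipeline aborted; check that every page file exists."
else
    println("Text size: $(length(text[]))")
end
```

Checking the return value of tess_run_pipeline before reading `text[]` avoids treating a partially generated document as complete.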
In the example above the documents are created in memory and can be accessed by applying the [] operation to the returned objects. In general, each pipeline output function (e.g. tess_pipeline_text) supports three output modes. The first writes the output to a file when you specify a file name. The second, used above, keeps the results in memory. The third takes a callback function that is called with each line of text, allowing you to process the data as it is generated.