PDF OCR - reading content of a PDF - VisualCron


chris.hill@adsigroup.co.uk
Free support Topic Starter

2019-02-22T16:08:24Z

Hi

Has anyone had any luck using VC, perhaps with a third-party bit of software, to read (OCR) a PDF file?

I need to extract details from the PDF and rename the file based on those values. We currently use a package that sits outside VisualCron, but for the benefit of the overall process it would be great to handle one file at a time within my VC loop.

Thanks

Edited by moderator 2019-02-26T16:10:06Z | Reason: Not specified

Gary_W
Free support

2019-02-22T16:25:13Z

Does the tool you currently use have a command-line interface? If so, you could call it from VC.

chris.hill@adsigroup.co.uk
Free support Topic Starter

2019-02-22T16:46:49Z

It does, but we seem to have some license issues as it's installed on another server. It's not the greatest software anyway, so if we can get a smaller utility that can be used by VC, or some .NET code within VC, that would be cool.

Support
Official support

2019-02-22T19:28:05Z

Did you test the PDF Tasks? How far did you get?

Henrik
Support
http://www.visualcron.com
Please like VisualCron on facebook!

thomas
Free support

2019-02-25T08:20:19Z

We use iTextSharp in .NET code to extract text from PDFs. I believe iText7 is the new version of this library, but I haven't tested it yet.
https://www.nuget.org/packages/iTextSharp/

We have the code in a .NET assembly and call it from VC.

chris.hill@adsigroup.co.uk
Free support Topic Starter

2019-02-25T08:34:56Z

Originally Posted by: Support

Did you test the PDF Tasks? How far did you get?

Hi, which of the PDF Tasks would actually do OCR? I couldn't see which one, if any, could do that.

chris.hill@adsigroup.co.uk
Free support Topic Starter

2019-02-25T08:36:47Z

Originally Posted by: thomas

We use iTextSharp in .NET code to extract text from PDFs. I believe iText7 is the new version of this library, but I haven't tested it yet.
https://www.nuget.org/packages/iTextSharp/

We have the code in a .NET assembly and call it from VC.

Thanks for that. Does it let you target specific areas of the PDF and look for placeholders and the values that follow them? For example, we want to look in the top right of a page, say a 400 x 400 pixel area, for a string like "Account:" and then read the value after it on the same line.

thomas
Free support

2019-02-25T09:20:21Z

To be honest, I am not sure. The trouble with PDF is that it depends a lot on how the file was created. We receive PDFs that were created from Excel, i.e. the users have a spreadsheet and convert it to PDF. The lines and columns will not always be aligned the same way every time (if the user changes the Excel sheet). It basically sucks to extract anything from PDF. So I tend to convert to a text file, and use C# and regex to extract what I need.

If it is a well-made PDF, it could work with iTextSharp. Here is an example where they retrieve the fields (columns) from a PDF. You should be able to use the column position to get the value in the next line.

https://stackoverflow.co...rm-data-using-itextsharp

chris.hill@adsigroup.co.uk
Free support Topic Starter

2019-02-25T10:39:56Z

Originally Posted by: thomas

So I tend to convert to a text file, and use C# and regex to extract what I need.

Ah, I hadn't thought of that. It could be a great route to try. I will give it a go and see.

Do you have any examples of your regex that you could share?

Many Thanks

Support
Official support

2019-02-25T11:01:10Z

#10

Originally Posted by: chris.hill@adsigroup.co.uk

Originally Posted by: Support

Did you test the PDF Tasks? How far did you get?

Hi, which of the PDF Tasks would actually do OCR? I couldn't see which one, if any, could do that.

We do not use OCR, but we can extract text or images. Do you have an example document that shows what you are trying to do?

Henrik

thomas
Free support

2019-02-25T11:34:05Z

#11

Here is some C# code to read PDF content into a list of lines. It's very simple, as you can see. As far as regex is concerned, I am not an expert. There are many on this site who can help you if you give them some concrete examples.

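using System.Collections.Generic;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// SourceFile is assumed to be a member holding the path to the PDF (not shown).
private List<string> ReadPdfFile()
{
    var result = new List<string>();
    using (var reader = new PdfReader(SourceFile))
    {
        // iTextSharp page numbers start at 1.
        for (var page = 1; page <= reader.NumberOfPages; page++)
        {
            // Extract the text of the page and split it into lines.
            var pageText = PdfTextExtractor.GetTextFromPage(reader, page);
            result.AddRange(Regex.Split(pageText, "\n"));
        }
    }

    return result;
}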
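
On the regex part, here is a minimal sketch of pulling out a labelled value. It assumes the extracted text contains a line such as "Account: AB-1234"; the label, the pattern and the AccountExtractor name are made up for illustration and would need adjusting to the real documents. It scans lines like the ones ReadPdfFile returns and captures the first run of non-whitespace characters after the label on the same line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AccountExtractor
{
    // Hypothetical pattern: the label "Account:", optional whitespace,
    // then the first run of non-whitespace characters on the same line.
    private static readonly Regex AccountPattern =
        new Regex(@"Account:\s*(\S+)", RegexOptions.IgnoreCase);

    // Returns the first account value found, or null when no line matches.
    public static string FindAccount(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            var match = AccountPattern.Match(line);
            if (match.Success)
            {
                return match.Groups[1].Value;
            }
        }

        return null;
    }
}
```

Usage would be something like AccountExtractor.FindAccount(ReadPdfFile()). Values that contain spaces would need a different capture group.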

Edited by user 2019-02-25T11:35:00Z | Reason: Not specified

chris.hill@adsigroup.co.uk
Free support Topic Starter

2019-02-26T14:05:56Z

#12

Thanks all. I'm getting good results just converting to a txt file and then extracting the values using some C# code.
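
For anyone following the same route, the rename step could look roughly like the sketch below. It is not the exact code used here: the PdfRenamer class and its method names are hypothetical, and it assumes the extracted value should become the new file name in the same folder, with characters that are invalid in file names replaced by an underscore.

```csharp
using System;
using System.IO;

public static class PdfRenamer
{
    // Builds a target path in the same folder, named after the extracted value.
    public static string BuildTargetPath(string sourcePath, string value)
    {
        // Replace characters that are not valid in file names with '_'.
        foreach (var c in Path.GetInvalidFileNameChars())
        {
            value = value.Replace(c, '_');
        }

        var folder = Path.GetDirectoryName(sourcePath) ?? string.Empty;
        return Path.Combine(folder, value + Path.GetExtension(sourcePath));
    }

    // Renames the file and returns the new path.
    // File.Move throws if a file with the target name already exists.
    public static string Rename(string sourcePath, string value)
    {
        var target = BuildTargetPath(sourcePath, value);
        File.Move(sourcePath, target);
        return target;
    }
}
```

Keeping the path building separate from the move makes it easy to log or check the target name inside the loop before touching the file.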

Support
Official support

2019-02-26T16:09:52Z

#13

Originally Posted by: chris.hill@adsigroup.co.uk

Thanks all. I'm getting good results just converting to a txt file and then extracting the values using some C# code.

So basically you want to get the raw text on one or more pages?

Henrik

thomas
Free support

2019-02-27T08:56:36Z

#14

Good point. There is already a Task for converting PDF to txt. I hadn't seen that.

Support
Official support

2020-05-22T14:47:53Z

#15

In 9.2.0 we introduced the Scan document Task, which can automate this. See the video here: https://www.visualcron.com/tutorials.aspx

Henrik