How to Automatically OCR Scanned PDF Docs

I moved to a paperless office system a few years back. This means that any and all statements, invoices, bills and receipts get scanned, OCR’d, stored on a secure drive and then shredded.

The OCR aspect of this is important as it means that the scanned documents are indexed and searchable by Spotlight on my Mac. I now have my system configured so that each document is automatically OCR’d as soon as it is scanned. Here’s how.

I use the following to achieve this:

  • HP OfficeJet Pro 8600A. This is a very nice all-in-one printer/scanner that allows duplex scanning; it’s pretty quick too.
  • Hazel. This software helps automate processes such as cleaning out your Downloads folder, emptying your Trash, etc. You can use Automator to achieve this, but I scan my documents to a network drive (NAS) and folder actions are not available on external drives. Hazel monitors the folder for me instead.
  • PDFpen. This is a great little PDF editor and includes the OCR functionality that we are going to use – and what’s more, it provides the ability to automate this process with AppleScript support.
  • Tags. I use tagging to distinguish between files that need to be OCR’d and those already processed. This was a late addition to this process, but it has made all of the difference.

So once you have Hazel and PDFpen installed (Hazel is optional, but PDFpen is going to do the heavy lifting here and so you really need it), let’s create the process to automatically OCR any scanned documents.

1. Watch Folder. I have a shared folder on my NAS called “SCANS”. Add the folder you want to watch in Hazel by pressing the “+” button in the folders section.

OCR Scanned Docs 1

2. Create Rule Conditions. You are then going to create a new rule in Hazel that has the following two conditions:

  • Kind is PDF. You want to only OCR PDF files that you scan in.
  • Not already processed. We are going to tag all the files we process with “ocr”, so we only want to process files that do not have this tag.

It should look like this:

OCR Scanned Docs 2
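The two conditions amount to a simple test that both must pass. Here is a minimal Python sketch of that logic, assuming the file extension stands in for Hazel's "Kind is PDF" check and that the file's tags are supplied by the caller. The function name `needs_ocr` is hypothetical and is not part of Hazel.

```python
from __future__ import annotations

from pathlib import PurePath

PROCESSED_TAG = "ocr"  # the tag the rule adds once a file has been OCR'd


def needs_ocr(filename: str, tags: set[str]) -> bool:
    """Return True only when both rule conditions hold for a file."""
    # Condition 1: the file is a PDF (approximated here by its extension).
    is_pdf = PurePath(filename).suffix.lower() == ".pdf"
    # Condition 2: the file has not been tagged as processed yet.
    already_processed = PROCESSED_TAG in tags
    return is_pdf and not already_processed
```

A freshly scanned PDF with no tags passes both checks. The same file is skipped once it carries the “ocr” tag, which is what stops the rule from processing it repeatedly.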

3. Create Rule Actions. Now we are going to do two things – run an AppleScript that automates PDFpen and then tag the processed file with the “ocr” tag so that it doesn’t get processed again.

  • Run AppleScript. Select Run AppleScript from the drop down menu.

OCR Scanned Docs 3

The next box will automatically change to “embedded script”. Click on the “Edit script” text to the right of it.

OCR Scanned Docs 5

Into this box, copy and paste the following script:

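-- Change "PDFpen 6" to match the name of the PDFpen version you have installed
tell application "PDFpen 6"
	-- Wait a few seconds before opening the newly scanned file
	delay 5
	-- theFile is the file that matched the Hazel rule
	open theFile as alias
	tell document 1
		-- Start OCR, then poll once a second until it has finished
		ocr
		repeat while performing ocr
			delay 1
		end repeat
		delay 1
		-- Save the OCR result and close the document
		close with saving
	end tell
end tell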

So that it now looks like this:

OCR Scanned Docs 6

  • Add tags. Now that we have OCR’d the PDF file, we want to tag it so that it doesn’t get done again. So let’s add the “Add tags” action and enter “ocr” in the box to the right, like so:

OCR Scanned Docs 4

I also add a “scanned” tag so that I can track the source of this document.

Once you have done that, press OK.

4. Make sure that your new rule is enabled (checkbox/tick to the left of the name). It should now look like this:

OCR Scanned Docs 7

That’s it. Scan something (make sure that you configure your scanner to scan to PDF and not JPG by default) and check that it all works OK.

You’ve now created a process that will automatically OCR your scanned documents, which will then be indexed by Spotlight, the search engine built into your Mac, so that you can always find the text in these documents (e.g. search by account number).
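Since Hazel is optional, the same PDFpen script can in principle be run without it. Below is a minimal Python sketch, assuming macOS with the standard `osascript` command-line tool. It wraps the script from step 3 in an `on run argv` handler so the file path can be passed as an argument. The wrapper and the helper names `build_ocr_script` and `ocr_pdf` are my own additions and are hypothetical. The application name is a parameter, because it has to match whichever PDFpen version you have installed. This has not been tested against any particular PDFpen release.

```python
import subprocess

# AppleScript template. The body mirrors the script from step 3; the
# "on run argv" wrapper is an addition so the file path can be passed in.
SCRIPT_TEMPLATE = '''on run argv
	set theFile to (POSIX file (item 1 of argv)) as alias
	tell application "{app_name}"
		open theFile
		tell document 1
			ocr
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		end tell
	end tell
end run'''


def build_ocr_script(app_name: str = "PDFpen 6") -> str:
    """Return the AppleScript source, targeting the given application name."""
    # Refuse names that would break out of the quoted AppleScript string.
    if '"' in app_name or "\\" in app_name:
        raise ValueError("application name must not contain quotes or backslashes")
    return SCRIPT_TEMPLATE.format(app_name=app_name)


def ocr_pdf(pdf_path: str, app_name: str = "PDFpen 6") -> None:
    """Run the script through osascript (macOS only); raises if it fails."""
    subprocess.run(
        ["osascript", "-e", build_ocr_script(app_name), pdf_path],
        check=True,
    )
```

Calling `ocr_pdf("/Volumes/SCANS/statement.pdf", app_name="PDFpen")` would OCR a single file; the path here is only an example. Tagging the file afterwards is not covered by this sketch, so you would need another way to avoid reprocessing.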

5 Comments

  1. I am not 100% certain, but I think it will still work. You just need to change the application line in the script to call whatever version you have.
