Processing large files using Dropbox
The API described in this section is for testing purposes only. Its parameters, functionality, and availability are not guaranteed in the future.
This section gives detailed instructions on how to process large image files with the Cloud OCR service, using the Dropbox service to host the files.
The Cloud OCR SDK method used in this tutorial is processRemoteImage. With this method you can process large images; however, the size of a single image must still not exceed 200 MB.
In addition, please note that the Dropbox service limits the total amount of traffic that can go through a user's public links. For free accounts, the limit is 20 GB, and for paid accounts it is 200 GB. See this article in the Dropbox Help.
Prerequisites for following this tutorial
- you have an account and an active application with the Cloud OCR service (visit this link to register and create an application)
- you have a Dropbox account (go to https://www.dropbox.com/ to create one)
- you are working on a Linux operating system
- you have installed Java, cURL, and Git (see the help sections on the Java, cURL, and Git websites)
Follow the steps described below to process a batch of image files.
Create a folder with images
Put the images you are going to process into a separate folder in your Dropbox folder. We have called the folder OcrSdk.
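Since a single image must not exceed 200 MB, it can help to check the folder before processing. The following is a minimal sketch, not part of the Cloud OCR SDK or the Dropbox tools: the helper name list_oversized is hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the example path assumes your Dropbox folder is ~/Dropbox.

```shell
# Per-image limit taken from this tutorial: 200 MB.
# Assumption: 1 MB is counted as 1024 * 1024 bytes.
MAX_BYTES=$((200 * 1024 * 1024))

# list_oversized DIR [LIMIT_BYTES]
# Prints the path of every regular file in DIR that is larger than the limit.
# (Hypothetical helper for illustration only.)
list_oversized() {
    local dir="$1" limit="${2:-$MAX_BYTES}" f size
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip subfolders and unmatched globs
        size=$(wc -c < "$f")             # file size in bytes
        if [ "$size" -gt "$limit" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example (the path is an assumption about where Dropbox is mounted):
#   list_oversized ~/Dropbox/OcrSdk
```

If the command prints nothing, no file in the folder is over the limit.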
Set up the connection to Dropbox
Now you need to obtain public URLs for your images, so that the Cloud OCR service can download them from the Dropbox server. Please note that the "default" link to your file in Dropbox, which looks like "https://www.dropbox.com/s/some-file-name", is not a direct download link. The link you need looks like "https://dl.dropboxusercontent.com/s/some-file-name". See https://www.dropbox.com/help/201/en for details.
Retrieving a list of download links manually can be impractical if you are processing many files. Instead, you can use a script that creates a list of public direct download links for all files in the specified folder. Here are the instructions:
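For a single file, the difference between the two link forms can be illustrated with a small helper. This is only a sketch under the assumption that swapping the host part of the link, as in the two example links above, is all that is needed; the function name to_direct_link is hypothetical.

```shell
# to_direct_link URL
# Rewrites a default Dropbox share link into the direct-download form
# shown in this tutorial; any other URL is printed unchanged.
# (Hypothetical helper; it only swaps the host part of the link.)
to_direct_link() {
    local url="$1"
    local share_prefix='https://www.dropbox.com/'
    local direct_prefix='https://dl.dropboxusercontent.com/'
    case "$url" in
        "$share_prefix"*)
            printf '%s\n' "${direct_prefix}${url#"$share_prefix"}" ;;
        *)
            printf '%s\n' "$url" ;;
    esac
}

to_direct_link 'https://www.dropbox.com/s/some-file-name'
# prints https://dl.dropboxusercontent.com/s/some-file-name
```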
- Create a Dropbox Platform application. Visit https://www.dropbox.com/developers/apps/create and create a Dropbox API app with the following settings:
- On the next screen, make a note of App key and App secret. You will need them to let the script access your Dropbox folder.
- Download the script to your computer. Open the Git shell and type
git clone https://github.com/abbyysdk/GetDropboxLinks
- Set up the connection to Dropbox. Type
cd GetDropboxLinks
./dropbox_uploader.sh
- You will be asked to enter the App key and App secret. These are the codes you noted in step 2. For the access level, enter Full Dropbox (type the letter f).
- Check and confirm the data you entered when the script asks for it. Type the letter y if everything is correct.
- The script will prompt you to visit a link. Open the link and click Allow on the screen that appears.
Obtain a list of download URLs
Now you are ready to execute the script which generates public download links. The folder into which you put the images for processing is called OcrSdk. Type
Save the links you receive to a text file. Let us call it dropbox_files.txt and put it in your home folder, one level above the script folder.
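Before running the sample, it may be worth checking that every line of the saved file looks like a direct download link. The sketch below assumes the file holds one link per line and that direct links start with https://dl.dropboxusercontent.com/, as described earlier; the function name check_link_list is hypothetical.

```shell
# check_link_list FILE
# Prints every non-blank line of FILE that does not start with the
# direct-download host. Returns 0 if all lines look fine, 1 if any line
# was reported, 2 if FILE cannot be read.
# (Hypothetical helper; assumes one link per line.)
check_link_list() {
    local list="$1" bad
    [ -r "$list" ] || { printf 'cannot read %s\n' "$list" >&2; return 2; }
    # -v inverts the match: keep lines that are neither direct links nor blank.
    bad=$(grep -v -e '^https://dl\.dropboxusercontent\.com/' \
                  -e '^[[:space:]]*$' "$list" || true)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}

# Example:
#   check_link_list ~/dropbox_files.txt && echo "link list looks fine"
```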
Run the Java sample to process the images
You can now run the Java sample (located on GitHub), which will process the files listed in the URL file.
- Download the code samples and go to the Java sample folder:
cd ..
git clone https://github.com/abbyysdk/ocrsdk.com
- Modify the ClientSettings.java file to include your Cloud OCR Application ID and password in the following lines:
// Name of application you created
public static final String APPLICATION_ID = "";
// Password should be sent to your e-mail after application was created
public static final String PASSWORD = "";
- Compile the sample:
- Run the sample:
java ProcessManyFiles remote --lang=English --format=txt ~/dropbox_files.txt
- Wait until the sample application finishes. The results are then available in the result_dir folder, in text format.
See the source code of the sample for implementation details. The method used for processing remote files is described in the API Reference section of the documentation: processRemoteImage Method.
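To give a rough idea of what a single processRemoteImage request might look like, here is a sketch that only builds the request URL. Everything about the request shape is an assumption not stated in this tutorial and should be checked against the API Reference: the endpoint http://cloud.ocrsdk.com/processRemoteImage, the query parameters source, language and exportFormat, and the use of HTTP Basic authentication. The helper names urlencode and build_request_url are hypothetical.

```shell
# Assumptions (not stated in this tutorial, check the API Reference):
#   - the service endpoint is http://cloud.ocrsdk.com/processRemoteImage
#   - it accepts the query parameters source, language and exportFormat
#   - requests use HTTP Basic authentication (Application ID and password)
OCR_ENDPOINT='http://cloud.ocrsdk.com/processRemoteImage'

# urlencode STRING
# Percent-encodes everything except unreserved characters. ASCII input only.
urlencode() {
    local rest="$1" out='' c
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}   # first character of the remaining input
        rest=${rest#?}          # drop that character
        case "$c" in
            [a-zA-Z0-9.~_-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
    done
    printf '%s\n' "$out"
}

# build_request_url IMAGE_URL [LANGUAGE] [FORMAT]
# Defaults mirror the options used with the Java sample (English, txt).
build_request_url() {
    local image_url="$1" language="${2:-English}" format="${3:-txt}"
    printf '%s?source=%s&language=%s&exportFormat=%s\n' \
        "$OCR_ENDPOINT" "$(urlencode "$image_url")" \
        "$(urlencode "$language")" "$(urlencode "$format")"
}

# Usage sketch (performs a network request; credentials are placeholders):
#   curl --user "$APPLICATION_ID:$PASSWORD" \
#        "$(build_request_url 'https://dl.dropboxusercontent.com/s/some-file-name')"
```

The sketch stops at sending the request; for what happens afterwards, refer to the source code of the Java sample.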