
PDF Converter

The PDF Converter node extracts text from PDF documents and returns it as arrays of text lines that later workflow steps can process. It is particularly useful for logistics documents such as bills of lading, invoices, shipping manifests, and customs forms, which commonly arrive as PDFs.

Overview

Use the PDF node for tasks such as:

  • Extract document data - Pull text from bills of lading, invoices, and shipping documents
  • Process forms - Extract information from customs forms and regulatory documents
  • Parse reports - Convert PDF reports into processable text data
  • Document automation - Automate processing of PDF-based logistics workflows
  • Data migration - Convert legacy PDF documents to structured data
  • Content analysis - Analyze text content from PDF communications

Configuration

Input Source

  • File upload - Upload PDF files directly to workflow
  • Previous node output - Use PDF files from earlier workflow steps
  • Dynamic file selection - Use workflow data to specify PDF file path

Page Selection

  • All pages - Extract text from entire document (default)
  • Specific pages - Define individual pages (e.g., 1, 3, 5)
  • Page ranges - Specify ranges (e.g., 1-3, 5-10)
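
To make the page selection syntax concrete, here is a minimal Python sketch of how a spec such as "1-3, 5-10" could be expanded into page numbers. The function name `parse_page_spec` is hypothetical and this is not the node's actual implementation; it only illustrates the notation described above.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1-3, 5" into sorted, unique 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_page_spec("1, 3, 5"))    # [1, 3, 5]
print(parse_page_spec("1-3, 5-10"))  # [1, 2, 3, 5, 6, 7, 8, 9, 10]
```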

Output Format Options

  • Preserve page breaks - Maintain page structure in output
  • Ignore page breaks - Combine all text into continuous format

Output Formats

String Array (string[])

When page breaks are ignored, the output is a single flat array of text lines:

[
"BILL OF LADING",
"Shipper: ABC Logistics Inc",
"Consignee: XYZ Distribution",
"Tracking Number: BL123456789",
"Date: 2024-01-15"
]

Two-Dimensional Array (string[][])

When page breaks are preserved, the output is an array of pages, each of which is an array of text lines:

[
[ // Page 1
"BILL OF LADING",
"Shipper: ABC Logistics Inc",
"Consignee: XYZ Distribution"
],
[ // Page 2
"ITEM DETAILS",
"Product: Electronics",
"Quantity: 100 units"
]
]
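
The two output shapes are easy to convert between in a downstream step. The sketch below, in Python with hypothetical helper names, flattens page-grouped output into a single list and, alternatively, tags each line with its page number so that page context is not lost.

```python
from itertools import chain


def flatten_pages(pages: list[list[str]]) -> list[str]:
    """Turn page-grouped output (string[][]) into one flat list (string[])."""
    return list(chain.from_iterable(pages))


def lines_with_page_numbers(pages: list[list[str]]) -> list[tuple[int, str]]:
    """Pair every line with its 1-based page number."""
    return [
        (page_no, line)
        for page_no, lines in enumerate(pages, start=1)
        for line in lines
    ]


pages = [
    ["BILL OF LADING", "Shipper: ABC Logistics Inc", "Consignee: XYZ Distribution"],
    ["ITEM DETAILS", "Product: Electronics", "Quantity: 100 units"],
]
flat = flatten_pages(pages)
tagged = lines_with_page_numbers(pages)
```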

Example Usage & Common Use Cases

Bill of Lading Processing

Document Processing:
Receive BOL PDF → Extract text → Parse shipping details → Update TMS

Configuration:
Input: bill_of_lading.pdf
Pages: All pages
Ignore Page Breaks: true
Output Format: string[]

Processing:
- Extract shipper/consignee information
- Parse tracking numbers and dates
- Identify cargo details

Output: Array of text lines for further processing
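
As an illustration of the parsing step, the following Python sketch turns "Label: value" lines like those in the sample output into a dictionary. It assumes each field sits on its own line with a colon separator, which real bills of lading may not follow; `parse_labeled_lines` is a hypothetical helper, not part of the node.

```python
def parse_labeled_lines(lines: list[str]) -> dict[str, str]:
    """Collect "Label: value" lines into a dict; lines without a value are skipped."""
    fields: dict[str, str] = {}
    for line in lines:
        # Split on the first colon only, so values may themselves contain colons.
        label, separator, value = line.partition(":")
        if separator and value.strip():
            fields[label.strip()] = value.strip()
    return fields


bol_lines = [
    "BILL OF LADING",
    "Shipper: ABC Logistics Inc",
    "Consignee: XYZ Distribution",
    "Tracking Number: BL123456789",
    "Date: 2024-01-15",
]
fields = parse_labeled_lines(bol_lines)
print(fields["Tracking Number"])  # BL123456789
```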

Invoice Data Extraction

Accounts Payable Process:
Receive vendor invoice → Extract text → Parse amounts → Validate → Process payment

Configuration:
Input: vendor_invoice.pdf
Pages: 1-2 (first two pages only)
Preserve Page Breaks: false

Processing:
- Extract vendor information
- Parse line items and amounts
- Identify payment terms

Output: Array of text lines (string[]) for invoice parsing
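
Parsing amounts from the extracted lines might look like the Python sketch below. The sample lines and the dollar-prefixed amount format are assumptions made for illustration; real invoices vary widely, so the pattern would need adjusting per vendor.

```python
import re
from decimal import Decimal

# Assumed format: "$" followed by digits, optional thousands commas, optional cents.
AMOUNT_RE = re.compile(r"\$\s?((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)")


def extract_amounts(lines: list[str]) -> list[Decimal]:
    """Return every dollar amount found in the lines, in document order."""
    amounts: list[Decimal] = []
    for line in lines:
        for match in AMOUNT_RE.finditer(line):
            amounts.append(Decimal(match.group(1).replace(",", "")))
    return amounts


# Hypothetical invoice lines, not taken from a real document.
invoice_lines = [
    "Freight charge $1,250.00",
    "Fuel surcharge $75.50",
    "Payment terms: Net 30",
]
amounts = extract_amounts(invoice_lines)
total = sum(amounts, Decimal("0"))
```

Using `Decimal` rather than `float` avoids rounding surprises when the amounts are later summed or compared against a stated total.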

Customs Documentation

Import Processing:
Receive customs forms → Extract text → Validate compliance → Submit to authorities

Configuration:
Input: customs_declaration.pdf
Pages: All pages
Preserve Page Breaks: true
Output Format: string[][]

Processing:
- Extract commodity codes
- Parse declared values
- Verify documentation completeness

Output: Page-separated text arrays for compliance checking
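
With page-separated output, a compliance check can report which page each value came from. The Python sketch below searches for commodity codes page by page. The dotted code format in the pattern (for example 8471.30) and the sample lines are assumptions; adjust the pattern to the notation your forms actually use.

```python
import re

# Assumed notation: four digits, a dot, two digits, and an optional dotted suffix.
CODE_RE = re.compile(r"\b\d{4}\.\d{2}(?:\.\d{2,4})?\b")


def find_codes_by_page(pages: list[list[str]]) -> list[tuple[int, str]]:
    """Return (page_number, code) pairs for every code found, pages numbered from 1."""
    hits: list[tuple[int, str]] = []
    for page_no, lines in enumerate(pages, start=1):
        for line in lines:
            for match in CODE_RE.finditer(line):
                hits.append((page_no, match.group(0)))
    return hits


# Hypothetical page-grouped output (string[][]).
customs_pages = [
    ["CUSTOMS DECLARATION", "HS Code: 8471.30"],
    ["HS Code: 8517.12.0000", "Declared Value: 5000 USD"],
]
hits = find_codes_by_page(customs_pages)
```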

Shipping Manifest Analysis

Warehouse Operations:
Receive manifest PDF → Extract contents → Update inventory → Generate pick lists

Configuration:
Input: shipping_manifest.pdf
Pages: 2-10 (skip cover page)
Ignore Page Breaks: true

Processing:
- Extract product SKUs
- Parse quantities and locations
- Identify special handling requirements

Output: Clean text array for inventory processing

Multi-Document Processing

Batch Document Processing:
Process multiple PDFs → Extract text from each → Combine results → Generate summary

Loop Configuration:
For each PDF file:
- Extract text content
- Parse relevant data fields
- Collect structured information

Output: Combined text data from all processed documents
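
The loop above can be sketched in Python as follows. The `extract` callable stands in for whatever performs the extraction (in a workflow, the PDF node itself); it is a placeholder, not a real API. Failures are collected per file so that one bad document does not stop the batch.

```python
from typing import Callable


def process_batch(
    paths: list[str],
    extract: Callable[[str], list[str]],
) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Run extract on each path; return (results by path, error messages by path)."""
    results: dict[str, list[str]] = {}
    errors: dict[str, str] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # keep going; record the failure for the summary
            errors[path] = str(exc)
    return results, errors
```

A summary step can then report `len(results)` successes alongside the entries in `errors`.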

Report Data Mining

Analytics Process:
Receive carrier reports → Extract performance data → Analyze trends → Generate insights

Configuration:
Input: carrier_performance_report.pdf
Pages: 3-15 (data pages only)
Preserve Page Breaks: false

Processing:
- Extract performance metrics
- Parse delivery statistics
- Identify trend indicators

Output: Text data for analytics processing

Text Processing Features

Page Break Handling

  • Preserve structure - Maintain document page organization
  • Continuous text - Merge all pages into single text stream
  • Selective pages - Process only relevant document sections

Extraction Quality

  • OCR fallback - Handle scanned PDFs with optical character recognition
  • Font handling - Process various fonts and text styles
  • Layout preservation - Maintain relative text positioning when possible
  • Table detection - Identify and preserve tabular data structure

Best Practices

Document Preparation

  • File validation - Verify PDF files are not corrupted or password-protected
  • Size optimization - Consider file size limits for processing
  • Quality assessment - Ensure PDFs have extractable text (not just images)
  • Version compatibility - Test with various PDF versions and creators
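
A cheap pre-check can catch obviously unusable files before they reach the PDF node. The Python sketch below looks only at raw bytes: the `%PDF-` header, the `%%EOF` end marker, and the presence of an `/Encrypt` entry. It is a rough heuristic under those assumptions, not a full validator, and `quick_pdf_check` is a hypothetical helper.

```python
def quick_pdf_check(data: bytes) -> list[str]:
    """Return a list of suspected problems; an empty list means nothing obvious."""
    problems: list[str] = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # The end-of-file marker normally appears near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker")
    # An /Encrypt entry suggests the file is password-protected.
    if b"/Encrypt" in data:
        problems.append("possibly encrypted")
    return problems
```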

Page Selection Strategy

  • Identify relevant pages - Skip cover pages and other pages without relevant content
  • Consistent structure - Understand document layout patterns
  • Dynamic selection - Use metadata to determine which pages to process
  • Error handling - Handle documents with varying page counts
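
One way to handle varying page counts is to trim the requested pages to what the document actually contains and report the rest. This Python sketch assumes the page count is known from an earlier step; `clamp_pages` is a hypothetical helper.

```python
def clamp_pages(requested: list[int], page_count: int) -> tuple[list[int], list[int]]:
    """Split requested 1-based pages into (pages that exist, pages that do not)."""
    valid = [page for page in requested if 1 <= page <= page_count]
    skipped = [page for page in requested if not 1 <= page <= page_count]
    return valid, skipped


# A "2-10" selection applied to a six-page manifest.
valid, skipped = clamp_pages(list(range(2, 11)), page_count=6)
print(valid)    # [2, 3, 4, 5, 6]
print(skipped)  # [7, 8, 9, 10]
```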

Output Processing

  • Data validation - Verify extracted text quality and completeness
  • Pattern matching - Use regular expressions to find specific data patterns
  • Error detection - Identify extraction failures or poor-quality text
  • Fallback strategies - Handle cases where text extraction fails
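
A simple quality gate on the extracted lines can route poor results to a fallback path. The Python sketch below classifies output as empty, too short, garbled, or ok. The thresholds are arbitrary assumptions to be tuned against your own documents.

```python
def assess_extraction(
    lines: list[str],
    min_chars: int = 20,
    min_clean_ratio: float = 0.9,
) -> str:
    """Classify extracted text as "empty", "too_short", "garbled", or "ok"."""
    text = "".join(lines).strip()
    if not text:
        return "empty"
    if len(text) < min_chars:
        return "too_short"
    # Count characters that are printable and not the Unicode replacement character.
    clean = sum(1 for ch in text if ch.isprintable() and ch != "\ufffd")
    if clean / len(text) < min_clean_ratio:
        return "garbled"
    return "ok"
```

An "empty" result often points to an image-only or protected PDF, while "garbled" suggests encoding or font problems.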

Performance Optimization

  • Selective processing - Only extract from necessary pages
  • Batch processing - Process multiple documents efficiently
  • Memory management - Handle large PDF files appropriately
  • Caching - Store processed results to avoid re-extraction
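
Caching can be keyed on the file's content hash plus the page selection, so the same document is not extracted twice. The Python sketch below keeps results in memory; the `extract` callable is a placeholder for the real extraction step, and a production setup would likely use persistent storage instead.

```python
import hashlib
from typing import Callable


class ExtractionCache:
    """In-memory cache keyed by (SHA-256 of the PDF bytes, page selection)."""

    def __init__(self, extract: Callable[[bytes, str], list[str]]) -> None:
        self._extract = extract
        self._store: dict[tuple[str, str], list[str]] = {}

    def get(self, pdf_bytes: bytes, page_spec: str = "all") -> list[str]:
        key = (hashlib.sha256(pdf_bytes).hexdigest(), page_spec)
        if key not in self._store:
            # Cache miss: extract once and remember the result.
            self._store[key] = self._extract(pdf_bytes, page_spec)
        return self._store[key]
```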

Integration Patterns

With Text Processing

PDF Node → Code Node (parse text) → Set Node (structure data) → Database Update

With Conditional Logic

PDF Node → If Node (validate content) → Different processing paths → Results

With Loops

File List → Loop → PDF Node (extract each) → Collect Results → Summary Report

With External APIs

PDF Node → Code Node (format data) → HTTP Request (submit to API) → Response Processing

Troubleshooting

Common Issues

  • Empty output - PDF may be image-based or password-protected
  • Garbled text - Character encoding or font issues
  • Missing content - Page selection may be incorrect
  • Performance problems - Large files or complex layouts

Debugging Tips

  • Test with simple PDFs - Verify functionality with basic text documents
  • Check page counts - Ensure page selection matches document structure
  • Validate file integrity - Confirm PDF files are not corrupted
  • Monitor output quality - Review extracted text for accuracy

Quality Issues

  • OCR accuracy - Scanned documents may have text recognition errors
  • Layout preservation - Complex layouts may not extract cleanly
  • Special characters - Unicode or special symbols may not extract properly
  • Table formatting - Tabular data may lose structure during extraction
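
When table structure is lost, columns can sometimes be recovered from spacing. The Python sketch below assumes that extracted table rows keep two or more spaces between columns, which depends entirely on the source PDF and may not hold; the sample row is hypothetical.

```python
import re


def split_columns(line: str) -> list[str]:
    """Split a row on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


row = split_columns("SKU-1001   Blue Widget   100")
print(row)  # ['SKU-1001', 'Blue Widget', '100']
```

Single spaces inside a cell, as in "Blue Widget", are preserved because only wider gaps are treated as column boundaries.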

The PDF node provides essential document processing capabilities for logistics workflows, enabling automated extraction and processing of text content from PDF-based shipping documents, invoices, and regulatory forms.