PDF Converter
The PDF Converter node extracts text from PDF documents and returns it as arrays of text lines for use in later workflow steps. It is intended for logistics documents that commonly arrive as PDFs, such as bills of lading, invoices, shipping manifests, and customs forms.
Overview
Use the PDF Converter node when you need to:
- Extract document data - Pull text from bills of lading, invoices, and shipping documents
- Process forms - Extract information from customs forms and regulatory documents
- Parse reports - Convert PDF reports into processable text data
- Document automation - Automate processing of PDF-based logistics workflows
- Data migration - Convert legacy PDF documents to structured data
- Content analysis - Analyze text content from PDF communications
Configuration
Input Source
- File upload - Upload PDF files directly to the workflow
- Previous node output - Use PDF files from earlier workflow steps
- Dynamic file selection - Use workflow data to specify the PDF file path
Page Selection
- All pages - Extract text from the entire document (default)
- Specific pages - Define individual pages (e.g., 1, 3, 5)
- Page ranges - Specify ranges (e.g., 1-3, 5-10)
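How the node parses a page selection internally is not documented here. As a minimal sketch, the following shows how a selection such as `1, 3, 5` or `1-3, 5-10` can be expanded into concrete page numbers, for example when building a selection dynamically in a code step. The function name `expand_page_spec` is hypothetical and not part of the node.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a page selection such as "1-3, 5" into sorted 1-based page numbers.

    Hypothetical helper for illustration; the node's own parser may differ.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(expand_page_spec("1, 3, 5"))  # [1, 3, 5]
```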
Output Format Options
- Preserve page breaks - Maintain page structure in output
- Ignore page breaks - Combine all text into continuous format
Output Formats
String Array (string[])
When page breaks are ignored, the output is a single array with one string per line of text:
[
  "BILL OF LADING",
  "Shipper: ABC Logistics Inc",
  "Consignee: XYZ Distribution",
  "Tracking Number: BL123456789",
  "Date: 2024-01-15"
]
Two-Dimensional Array (string[][])
When page breaks are preserved, the output contains one inner array per page:
[
  [ // Page 1
    "BILL OF LADING",
    "Shipper: ABC Logistics Inc",
    "Consignee: XYZ Distribution"
  ],
  [ // Page 2
    "ITEM DETAILS",
    "Product: Electronics",
    "Quantity: 100 units"
  ]
]
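The two formats carry the same rows and differ only in grouping. As a small sketch, a downstream code step that receives the page-grouped form can flatten it into the continuous form like this (the function name `flatten_pages` is illustrative, not a node API):

```python
def flatten_pages(pages: list[list[str]]) -> list[str]:
    """Merge page-grouped rows (string[][]) into one continuous list (string[])."""
    return [row for page in pages for row in page]


pages = [
    ["BILL OF LADING", "Shipper: ABC Logistics Inc", "Consignee: XYZ Distribution"],
    ["ITEM DETAILS", "Product: Electronics", "Quantity: 100 units"],
]
rows = flatten_pages(pages)
print(len(rows))  # 6
```

The reverse is not possible: once page breaks are ignored, page boundaries cannot be recovered from the flat array, so preserve them whenever a later step needs per-page logic.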
Example Usage & Common Use Cases
Bill of Lading Processing
Document Processing:
Receive BOL PDF → Extract text → Parse shipping details → Update TMS
Configuration:
Input: bill_of_lading.pdf
Pages: All pages
Ignore Page Breaks: true
Output Format: string[]
Processing:
- Extract shipper/consignee information
- Parse tracking numbers and dates
- Identify cargo details
Output: Array of text lines for further processing
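The parsing step happens after the node, typically in a code step. As a rough sketch, and assuming the bill of lading uses `Label: value` lines like the sample output shown earlier, the fields can be collected into a dictionary. The helper name `parse_labeled_lines` is hypothetical.

```python
def parse_labeled_lines(rows: list[str]) -> dict[str, str]:
    """Collect "Label: value" rows into a dictionary; other rows are ignored."""
    fields: dict[str, str] = {}
    for row in rows:
        # Split on the first colon only, so values may contain colons themselves.
        label, separator, value = row.partition(":")
        if separator and label.strip() and value.strip():
            fields[label.strip()] = value.strip()
    return fields


rows = [
    "BILL OF LADING",
    "Shipper: ABC Logistics Inc",
    "Consignee: XYZ Distribution",
    "Tracking Number: BL123456789",
    "Date: 2024-01-15",
]
fields = parse_labeled_lines(rows)
print(fields["Tracking Number"])  # BL123456789
```

Real documents often spread a field over several lines or place the value on the line below the label, so treat this as a starting point rather than a complete parser.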
Invoice Data Extraction
Accounts Payable Process:
Receive vendor invoice → Extract text → Parse amounts → Validate → Process payment
Configuration:
Input: vendor_invoice.pdf
Pages: 1-2 (first two pages only)
Preserve Page Breaks: false
Processing:
- Extract vendor information
- Parse line items and amounts
- Identify payment terms
Output: Array of text lines (string[]) for invoice parsing
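To illustrate the "parse amounts" step, the sketch below pulls dollar amounts out of extracted rows with a regular expression. The invoice lines are invented sample data, and the pattern assumes amounts are written with a leading `$` and two decimal places; adjust it to the vendor's actual layout.

```python
import re
from decimal import Decimal

# Assumed format: "$1,250.00" or "$87.50" (leading dollar sign, two decimals).
AMOUNT_PATTERN = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2})")


def extract_amounts(rows: list[str]) -> list[Decimal]:
    """Return every dollar amount found in the rows, in reading order."""
    amounts: list[Decimal] = []
    for row in rows:
        for match in AMOUNT_PATTERN.finditer(row):
            # Decimal avoids the rounding surprises of float for money values.
            amounts.append(Decimal(match.group(1).replace(",", "")))
    return amounts


rows = [  # hypothetical invoice lines
    "Freight charge  $1,250.00",
    "Fuel surcharge  $87.50",
    "Payment terms: Net 30",
]
amounts = extract_amounts(rows)
print(sum(amounts))  # 1337.50
```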
Customs Documentation
Import Processing:
Receive customs forms → Extract text → Validate compliance → Submit to authorities
Configuration:
Input: customs_declaration.pdf
Pages: All pages
Preserve Page Breaks: true
Output Format: string[][]
Processing:
- Extract commodity codes
- Parse declared values
- Verify documentation completeness
Output: Page-separated text arrays for compliance checking
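Because this configuration preserves page breaks, a compliance check can report which page each finding came from. The following is a sketch only: the sample rows are invented, and the pattern assumes commodity codes are printed in dotted form (four digits, a dot, two digits, and an optional further group), which may not match every form.

```python
import re

# Assumed dotted commodity-code layout, e.g. "8471.30" or "8517.62.0090".
CODE_PATTERN = re.compile(r"\b\d{4}\.\d{2}(?:\.\d{2,4})?\b")


def find_codes_by_page(pages: list[list[str]]) -> list[tuple[int, str]]:
    """Return (page_number, code) pairs, with pages numbered from 1."""
    hits: list[tuple[int, str]] = []
    for page_number, rows in enumerate(pages, start=1):
        for row in rows:
            for match in CODE_PATTERN.finditer(row):
                hits.append((page_number, match.group(0)))
    return hits


pages = [  # hypothetical string[][] output
    ["CUSTOMS DECLARATION", "Declared value: 12,500.00 USD"],
    ["Commodity code: 8471.30", "Commodity code: 8517.62.0090"],
]
print(find_codes_by_page(pages))  # [(2, '8471.30'), (2, '8517.62.0090')]
```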
Shipping Manifest Analysis
Warehouse Operations:
Receive manifest PDF → Extract contents → Update inventory → Generate pick lists
Configuration:
Input: shipping_manifest.pdf
Pages: 2-10 (skip cover page)
Ignore Page Breaks: true
Processing:
- Extract product SKUs
- Parse quantities and locations
- Identify special handling requirements
Output: Clean text array for inventory processing
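Manifest rows are often tabular. As an illustration, and assuming the extracted text keeps at least two spaces between columns (which depends on the PDF and is not guaranteed), rows can be split back into cells. The column order SKU, quantity, location and the sample rows are assumptions made for this sketch.

```python
import re


def split_columns(row: str) -> list[str]:
    """Split a text row into cells wherever two or more whitespace characters occur."""
    return [cell for cell in re.split(r"\s{2,}", row.strip()) if cell]


def parse_manifest_rows(rows: list[str]) -> list[dict]:
    """Keep rows that look like "SKU  quantity  location"; skip everything else."""
    items = []
    for row in rows:
        cells = split_columns(row)
        # A numeric second cell filters out the header row and free-text notes.
        if len(cells) == 3 and cells[1].isdigit():
            sku, quantity, location = cells
            items.append({"sku": sku, "quantity": int(quantity), "location": location})
    return items


rows = [  # hypothetical manifest lines
    "SKU          QTY   LOCATION",
    "SKU-1001     24    A-12",
    "SKU-2040     100   B-03",
    "Handle with care: fragile items on pallet 2",
]
items = parse_manifest_rows(rows)
print(len(items))  # 2
```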
Multi-Document Processing
Batch Document Processing:
Process multiple PDFs → Extract text from each → Combine results → Generate summary
Loop Configuration:
For each PDF file:
- Extract text content
- Parse relevant data fields
- Collect structured information
Output: Combined text data from all processed documents
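The collection step of such a loop might look like the sketch below. It assumes each loop iteration has already produced a string[] for one file and that the results are gathered in a mapping from file name to rows; the function name and the result shape are illustrative choices, not a node output format.

```python
def combine_documents(extracted: dict[str, list[str]]) -> dict:
    """Merge per-file rows into one list, tagging each row with its source file."""
    combined = []
    skipped = []
    for filename, rows in extracted.items():
        cleaned = [row.strip() for row in rows if row.strip()]
        if not cleaned:
            skipped.append(filename)  # nothing extracted, e.g. an image-only PDF
            continue
        combined.extend({"source": filename, "text": row} for row in cleaned)
    return {
        "rows": combined,
        "documents": len(extracted) - len(skipped),
        "skipped": skipped,
    }


extracted = {  # hypothetical loop results
    "bol_001.pdf": ["BILL OF LADING", "Shipper: ABC Logistics Inc"],
    "scan_002.pdf": [],
    "bol_003.pdf": ["BILL OF LADING", "  "],
}
summary = combine_documents(extracted)
print(summary["documents"], summary["skipped"])  # 2 ['scan_002.pdf']
```

Keeping the source file name on every row makes it possible to trace a parsed value back to the document it came from when a later step reports an error.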
Report Data Mining
Analytics Process:
Receive carrier reports → Extract performance data → Analyze trends → Generate insights
Configuration:
Input: carrier_performance_report.pdf
Pages: 3-15 (data pages only)
Preserve Page Breaks: false
Processing:
- Extract performance metrics
- Parse delivery statistics
- Identify trend indicators
Output: Text data for analytics processing
Text Processing Features
Page Break Handling
- Preserve structure - Maintain document page organization
- Continuous text - Merge all pages into a single text stream
- Selective pages - Process only relevant document sections
Extraction Quality
- OCR fallback - Handle scanned PDFs with optical character recognition
- Font handling - Process various fonts and text styles
- Layout preservation - Maintain relative text positioning when possible
- Table detection - Identify and preserve tabular data structure
Best Practices
Document Preparation
- File validation - Verify PDF files are not corrupted or password-protected
- Size optimization - Consider file size limits for processing
- Quality assessment - Ensure PDFs have extractable text (not just images)
- Version compatibility - Test with various PDF versions and creators
Page Selection Strategy
- Identify relevant pages - Skip cover pages, appendices, and other pages without useful content
- Consistent structure - Understand document layout patterns
- Dynamic selection - Use metadata to determine which pages to process
- Error handling - Handle documents with varying page counts
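One way to handle varying page counts is to compare the requested pages with the actual page count before extraction. This sketch assumes the page count is available from an earlier step (how to obtain it is not covered here), and the function name is hypothetical.

```python
def select_existing_pages(requested: list[int], page_count: int) -> tuple[list[int], list[int]]:
    """Split requested 1-based page numbers into those that exist and those that do not."""
    available = [page for page in requested if 1 <= page <= page_count]
    missing = [page for page in requested if not 1 <= page <= page_count]
    return available, missing


# A manifest expected to have 10 pages turns out to have only 6.
available, missing = select_existing_pages(list(range(2, 11)), page_count=6)
print(available)  # [2, 3, 4, 5, 6]
print(missing)    # [7, 8, 9, 10]
```

A workflow can then extract the available pages and route documents with missing pages to a review path instead of failing outright.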
Output Processing
- Data validation - Verify extracted text quality and completeness
- Pattern matching - Use regular expressions to find specific data patterns
- Error detection - Identify extraction failures or poor-quality text
- Fallback strategies - Handle cases where text extraction fails
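The validation and error-detection points above can be approximated with a simple heuristic on the extracted rows. The thresholds below (`min_chars`, `min_readable_ratio`) are arbitrary assumptions chosen for illustration and should be tuned against real documents.

```python
def assess_extraction(rows: list[str], min_chars: int = 20, min_readable_ratio: float = 0.8) -> str:
    """Classify extracted text as "empty", "garbled", or "ok" using rough heuristics."""
    # Drop all whitespace so layout does not influence the counts.
    stripped = "".join("".join(rows).split())
    if len(stripped) < min_chars:
        return "empty"  # likely image-only or protected PDF
    readable = sum(ch.isalnum() or ch in ".,:;-/$%#()&'\"@" for ch in stripped)
    if readable / len(stripped) < min_readable_ratio:
        return "garbled"  # likely encoding or font mapping problems
    return "ok"


print(assess_extraction([]))  # empty
print(assess_extraction(["BILL OF LADING", "Shipper: ABC Logistics Inc"]))  # ok
print(assess_extraction(["\ufffd" * 30]))  # garbled
```

The returned label can drive an If Node: continue on "ok", and send "empty" or "garbled" documents to a fallback path such as manual review.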
Performance Optimization
- Selective processing - Only extract from necessary pages
- Batch processing - Process multiple documents efficiently
- Memory management - Handle large PDF files appropriately
- Caching - Store processed results to avoid re-extraction
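Caching can be sketched as a lookup keyed by the file content and the page selection, so that the same PDF with the same settings is extracted only once. The class below is an in-memory illustration, not a feature of the node; `extract` stands in for whatever performs the extraction.

```python
import hashlib


class ExtractionCache:
    """In-memory cache keyed by PDF content hash plus page selection."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key_for(pdf_bytes: bytes, page_spec: str) -> str:
        # Hash the content rather than the file name, so renamed copies still hit.
        return f"{hashlib.sha256(pdf_bytes).hexdigest()}:{page_spec}"

    def get_or_extract(self, pdf_bytes: bytes, page_spec: str, extract):
        key = self.key_for(pdf_bytes, page_spec)
        if key not in self._store:
            self._store[key] = extract(pdf_bytes, page_spec)
        return self._store[key]


calls = []


def fake_extract(pdf_bytes, page_spec):
    """Stand-in extractor that records how often it is called."""
    calls.append(page_spec)
    return ["BILL OF LADING"]


cache = ExtractionCache()
first = cache.get_or_extract(b"%PDF-sample", "1-2", fake_extract)
second = cache.get_or_extract(b"%PDF-sample", "1-2", fake_extract)
print(len(calls))  # 1
```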
Integration Patterns
With Text Processing
PDF Node → Code Node (parse text) → Set Node (structure data) → Database Update
With Conditional Logic
PDF Node → If Node (validate content) → Different processing paths → Results
With Loops
File List → Loop → PDF Node (extract each) → Collect Results → Summary Report
With External APIs
PDF Node → Code Node (format data) → HTTP Request (submit to API) → Response Processing
Troubleshooting
Common Issues
- Empty output - PDF may be image-based or password-protected
- Garbled text - Character encoding or font issues
- Missing content - Page selection may be incorrect
- Performance problems - Large files or complex layouts
Debugging Tips
- Test with simple PDFs - Verify functionality with basic text documents
- Check page counts - Ensure page selection matches document structure
- Validate file integrity - Confirm PDF files are not corrupted
- Monitor output quality - Review extracted text for accuracy
Quality Issues
- OCR accuracy - Scanned documents may have text recognition errors
- Layout preservation - Complex layouts may not extract cleanly
- Special characters - Unicode or special symbols may not extract properly
- Table formatting - Tabular data may lose structure during extraction
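Some special-character problems can be reduced after extraction. As a sketch, Unicode NFKC normalization from Python's standard library replaces typographic ligatures and non-breaking spaces with their plain equivalents; it does not repair text that was extracted with the wrong characters in the first place. The function name `normalize_row` is illustrative.

```python
import unicodedata


def normalize_row(row: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace to single spaces."""
    normalized = unicodedata.normalize("NFKC", row)
    return " ".join(normalized.split())


# "\ufb01" is the "fi" ligature and "\u00a0" is a non-breaking space.
print(normalize_row("Con\ufb01rmed\u00a0weight:  12\u00a0kg"))  # Confirmed weight: 12 kg
```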
The PDF Converter node automates text extraction from PDF-based shipping documents, invoices, and regulatory forms so that logistics workflows can process their content.