PDF Converter

The PDF Converter node extracts text content from PDF documents, converting them into structured text data for processing in workflows. This is essential for processing logistics documents like bills of lading, invoices, shipping manifests, and customs forms that are commonly received as PDFs.

Overview

Use the PDF Converter node when you need to:

Extract document data - Pull text from bills of lading, invoices, and shipping documents
Process forms - Extract information from customs forms and regulatory documents
Parse reports - Convert PDF reports into processable text data
Document automation - Automate processing of PDF-based logistics workflows
Data migration - Convert legacy PDF documents to structured data
Content analysis - Analyze text content from PDF communications

Configuration

Input Source

File upload - Upload PDF files directly to the workflow
Previous node output - Use PDF files from earlier workflow steps
Dynamic file selection - Use workflow data to specify the PDF file path

Page Selection

All pages - Extract text from entire document (default)
Specific pages - Define individual pages (e.g., 1, 3, 5)
Page ranges - Specify ranges (e.g., 1-3, 5-10)
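
It can help to validate a page selection before handing it to the node. The sketch below shows one way a selection such as "1-3, 5-10" could be expanded into individual page numbers. The function name parse_page_selection is hypothetical, and the node's actual parsing rules (for example, how it treats overlapping or out-of-order ranges) are not documented here, so treat this as an illustration rather than the node's behavior.

```python
def parse_page_selection(spec: str) -> list[int]:
    """Expand a selection such as "1-3, 5" into sorted, de-duplicated 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Malformed input raises ValueError, so a workflow can reject a bad selection before any extraction is attempted.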

Output Format Options

Preserve page breaks - Maintain page structure in output
Ignore page breaks - Combine all text into continuous format

Output Formats

String Array (string[])

When ignoring page breaks, output is a single array of text rows:

[
  "BILL OF LADING",
  "Shipper: ABC Logistics Inc",
  "Consignee: XYZ Distribution",
  "Tracking Number: BL123456789",
  "Date: 2024-01-15"
]

Two-Dimensional Array (string[][])

When preserving page breaks, output is grouped by pages:

[
  [  // Page 1
    "BILL OF LADING",
    "Shipper: ABC Logistics Inc",
    "Consignee: XYZ Distribution"
  ],
  [  // Page 2
    "ITEM DETAILS",
    "Product: Electronics",
    "Quantity: 100 units"
  ]
]
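
If a later step needs a flat list after pages were preserved, the page-grouped output can be collapsed downstream. The sketch below assumes the flat format is simply the pages concatenated in order, which the two examples above suggest but do not guarantee. The function names are hypothetical.

```python
def flatten_pages(pages: list[list[str]]) -> list[str]:
    """Collapse page-grouped output (string[][]) into a single list of lines (string[])."""
    return [line for page in pages for line in page]


def lines_with_page_numbers(pages: list[list[str]]) -> list[tuple[int, str]]:
    """Pair each line with its 1-based page number so the page context is not lost."""
    return [
        (page_number, line)
        for page_number, page in enumerate(pages, start=1)
        for line in page
    ]
```

Keeping the page number alongside each line is useful when a downstream check needs to report where a value was found.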

Example Usage & Common Use Cases

Bill of Lading Processing

Document Processing:
  Receive BOL PDF → Extract text → Parse shipping details → Update TMS

Configuration:
  Input: bill_of_lading.pdf
  Pages: All pages
  Ignore Page Breaks: true
  Output Format: string[]

Processing:
  - Extract shipper/consignee information
  - Parse tracking numbers and dates
  - Identify cargo details

Output: Array of text lines for further processing
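
The parsing step for a bill of lading might look like the sketch below, which turns "Label: value" lines into a dictionary. It assumes the extracted lines follow the labeled layout shown in the string array example. Real documents vary, and parse_labeled_lines is a hypothetical helper, not part of the node.

```python
def parse_labeled_lines(lines: list[str]) -> dict[str, str]:
    """Collect "Label: value" lines into a dictionary, skipping lines without a label."""
    fields: dict[str, str] = {}
    for line in lines:
        # partition splits on the first colon only, so values may contain colons
        label, separator, value = line.partition(":")
        label, value = label.strip(), value.strip()
        if separator and label and value:
            fields[label] = value
    return fields


bol_lines = [
    "BILL OF LADING",
    "Shipper: ABC Logistics Inc",
    "Consignee: XYZ Distribution",
    "Tracking Number: BL123456789",
    "Date: 2024-01-15",
]
bol_fields = parse_labeled_lines(bol_lines)
```

Title lines such as "BILL OF LADING" carry no colon and are ignored. If a label repeats, the last value wins.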

Invoice Data Extraction

Accounts Payable Process:
  Receive vendor invoice → Extract text → Parse amounts → Validate → Process payment

Configuration:
  Input: vendor_invoice.pdf
  Pages: 1-2 (first two pages only)
  Preserve Page Breaks: false

Processing:
  - Extract vendor information
  - Parse line items and amounts
  - Identify payment terms

Output: Array of text lines (string[]) for invoice processing
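
Amount parsing can be sketched with a regular expression. The example below assumes amounts are printed with two decimal places and optional thousands separators (for example "$1,250.00"). Other locales and layouts would need a different pattern. The sample invoice lines and the name extract_amounts are made up for illustration.

```python
import re
from decimal import Decimal

# Matches 1,250.00 or 87.50, with an optional leading dollar sign.
AMOUNT_PATTERN = re.compile(r"\$?\s*(\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2})")


def extract_amounts(lines: list[str]) -> list[Decimal]:
    """Return every two-decimal amount found in the lines, in document order."""
    amounts: list[Decimal] = []
    for line in lines:
        for match in AMOUNT_PATTERN.finditer(line):
            amounts.append(Decimal(match.group(1).replace(",", "")))
    return amounts


invoice_lines = [
    "Invoice INV-1001",
    "Freight charge: $1,250.00",
    "Fuel surcharge: $87.50",
    "Total due: $1,337.50",
]
invoice_amounts = extract_amounts(invoice_lines)
```

Decimal is used instead of float so that currency values are not subject to binary rounding during validation.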

Customs Documentation

Import Processing:
  Receive customs forms → Extract text → Validate compliance → Submit to authorities

Configuration:
  Input: customs_declaration.pdf
  Pages: All pages
  Preserve Page Breaks: true
  Output Format: string[][]

Processing:
  - Extract commodity codes
  - Parse declared values
  - Verify documentation completeness

Output: Page-separated text arrays for compliance checking
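
With page-separated output, a compliance check can report which page a value came from. The sketch below searches each page for commodity codes, assuming they are printed in the dotted form often used for Harmonized System codes (such as "8471.30"). That format, the sample text, and the name find_codes_by_page are assumptions for illustration.

```python
import re

# Four digits, a dot, two digits, and an optional further dotted group.
HS_CODE_PATTERN = re.compile(r"\b\d{4}\.\d{2}(?:\.\d{2,4})?\b")


def find_codes_by_page(pages: list[list[str]]) -> dict[int, list[str]]:
    """Map each 1-based page number to the commodity codes found on that page."""
    found: dict[int, list[str]] = {}
    for page_number, lines in enumerate(pages, start=1):
        codes = [code for line in lines for code in HS_CODE_PATTERN.findall(line)]
        if codes:
            found[page_number] = codes
    return found


customs_pages = [
    ["CUSTOMS DECLARATION", "Importer: XYZ Distribution"],
    ["Commodity code: 8471.30", "Declared value: 12,500.00 USD"],
]
codes_by_page = find_codes_by_page(customs_pages)
```

Pages with no matches are left out of the result, so an empty dictionary signals that no codes were found anywhere in the document.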

Shipping Manifest Analysis

Warehouse Operations:
  Receive manifest PDF → Extract contents → Update inventory → Generate pick lists

Configuration:
  Input: shipping_manifest.pdf
  Pages: 2-10 (skip cover page)
  Ignore Page Breaks: true

Processing:
  - Extract product SKUs
  - Parse quantities and locations
  - Identify special handling requirements

Output: Clean text array for inventory processing

Multi-Document Processing

Batch Document Processing:
  Process multiple PDFs → Extract text from each → Combine results → Generate summary

Loop Configuration:
  For each PDF file:
    - Extract text content
    - Parse relevant data fields
    - Collect structured information

Output: Combined text data from all processed documents
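
The loop above can be sketched as follows. Because the node's programmatic interface is not described here, the extraction step is passed in as a callable. extract_lines and process_batch are hypothetical names standing in for whatever produces the node's output in a given workflow.

```python
from typing import Callable


def process_batch(
    paths: list[str],
    extract_lines: Callable[[str], list[str]],
) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Extract text from each file, keeping failures separate from results."""
    results: dict[str, list[str]] = {}
    failures: dict[str, str] = {}
    for path in paths:
        try:
            results[path] = extract_lines(path)
        except Exception as error:
            # One unreadable file should not stop the rest of the batch.
            failures[path] = str(error)
    return results, failures
```

Returning failures alongside results lets the summary step report which documents need manual review.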

Report Data Mining

Analytics Process:
  Receive carrier reports → Extract performance data → Analyze trends → Generate insights

Configuration:
  Input: carrier_performance_report.pdf
  Pages: 3-15 (data pages only)
  Preserve Page Breaks: false

Processing:
  - Extract performance metrics
  - Parse delivery statistics
  - Identify trend indicators

Output: Text data for analytics processing

Text Processing Features

Page Break Handling

Preserve structure - Maintain document page organization
Continuous text - Merge all pages into single text stream
Selective pages - Process only relevant document sections

Extraction Quality

OCR fallback - Handle scanned PDFs with optical character recognition
Font handling - Process various fonts and text styles
Layout preservation - Maintain relative text positioning when possible
Table detection - Identify and preserve tabular data structure

Best Practices

Document Preparation

File validation - Verify PDF files are not corrupted or password-protected
Size optimization - Consider file size limits for processing
Quality assessment - Ensure PDFs have extractable text (not just images)
Version compatibility - Test with various PDF versions and creators

Page Selection Strategy

Identify relevant pages - Skip cover pages, headers, and irrelevant content
Consistent structure - Understand document layout patterns
Dynamic selection - Use metadata to determine which pages to process
Error handling - Handle documents with varying page counts
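
One way to handle varying page counts is to compare the requested pages against the document's actual length before extraction. The sketch below assumes the page count is available from workflow data. clamp_pages is a hypothetical helper.

```python
def clamp_pages(requested: list[int], page_count: int) -> tuple[list[int], list[int]]:
    """Split requested 1-based pages into those that exist and those that do not."""
    valid = [page for page in requested if 1 <= page <= page_count]
    skipped = [page for page in requested if not 1 <= page <= page_count]
    return valid, skipped
```

For example, requesting pages 2 through 10 of a six-page manifest yields pages 2 to 6 as valid and 7 to 10 as skipped, which a workflow can log instead of failing.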

Output Processing

Data validation - Verify extracted text quality and completeness
Pattern matching - Use regular expressions to find specific data patterns
Error detection - Identify extraction failures or poor-quality text
Fallback strategies - Handle cases where text extraction fails
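
A simple quality gate can catch empty or garbled extractions before they reach later steps. The heuristic below is a rough sketch. The thresholds and the set of characters treated as readable are arbitrary assumptions and would need tuning against real documents.

```python
READABLE_PUNCTUATION = set(".,:;-/()$%#&'\"@+")


def assess_extraction(
    lines: list[str],
    min_chars: int = 20,
    min_readable_ratio: float = 0.8,
) -> str:
    """Classify extracted text as "empty", "garbled", or "ok"."""
    # Remove all whitespace so blank lines and padding do not count as content.
    stripped = "".join("".join(lines).split())
    if len(stripped) < min_chars:
        return "empty"
    readable = sum(
        1 for ch in stripped if ch.isalnum() or ch in READABLE_PUNCTUATION
    )
    if readable / len(stripped) < min_readable_ratio:
        return "garbled"
    return "ok"
```

An "empty" or "garbled" result can route the document to a fallback path, such as manual review, rather than passing unreliable text downstream.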

Performance Optimization

Selective processing - Only extract from necessary pages
Batch processing - Process multiple documents efficiently
Memory management - Handle large PDF files appropriately
Caching - Store processed results to avoid re-extraction
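
Caching can be keyed on file content so that a renamed copy of the same PDF is not extracted twice. The sketch below keeps results in memory and takes the extraction step as a callable. ExtractionCache and its extract parameter are hypothetical, and a production workflow would more likely use persistent storage.

```python
import hashlib
from typing import Callable


class ExtractionCache:
    """Cache extracted lines by the SHA-256 digest of the PDF bytes."""

    def __init__(self, extract: Callable[[bytes], list[str]]) -> None:
        self._extract = extract
        self._store: dict[str, list[str]] = {}

    def get(self, pdf_bytes: bytes) -> list[str]:
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key not in self._store:
            # Only run the (potentially slow) extraction on a cache miss.
            self._store[key] = self._extract(pdf_bytes)
        return self._store[key]
```

Note that the cache key ignores page selection and page-break settings. If those vary between runs, they should be included in the key.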

Integration Patterns

With Text Processing

PDF Node → Code Node (parse text) → Set Node (structure data) → Database Update

With Conditional Logic

PDF Node → If Node (validate content) → Different processing paths → Results

With Loops

File List → Loop → PDF Node (extract each) → Collect Results → Summary Report

With External APIs

PDF Node → Code Node (format data) → HTTP Request (submit to API) → Response Processing

Troubleshooting

Common Issues

Empty output - PDF may be image-based or password-protected
Garbled text - Character encoding or font issues
Missing content - Page selection may be incorrect
Performance problems - Large files or complex layouts

Debugging Tips

Test with simple PDFs - Verify functionality with basic text documents
Check page counts - Ensure page selection matches document structure
Validate file integrity - Confirm PDF files are not corrupted
Monitor output quality - Review extracted text for accuracy

Quality Issues

OCR accuracy - Scanned documents may have text recognition errors
Layout preservation - Complex layouts may not extract cleanly
Special characters - Unicode or special symbols may not extract properly
Table formatting - Tabular data may lose structure during extraction

The PDF Converter node automates the extraction of text from PDF-based shipping documents, invoices, and regulatory forms so that logistics workflows can process it.
