One service to download them all

Working with patent data often begins with a familiar headache: how to efficiently collect it. With so many data sources offering data in myriad formats—and these formats constantly evolving—it’s a challenge to maintain a reliable system. For instance, the USPTO is transitioning to its new Open Data Portal, requiring developers to update their software to continue downloading data automatically.

Wouldn’t it be great to have one robust, unified tool to handle this ever-changing landscape, one designed to adapt as sources change and to spare both developers and data providers unnecessary friction?

Here’s my wishlist for what such a tool should offer:


Built for Versatility and Modern Platforms

  • Written in Go for efficiency and cross-platform compatibility.
  • Available for popular platforms and as a Docker image.

Extensible and Scalable

  • Modular architecture to support an ever-growing number of data sources.
  • Capable of running multiple processes in parallel for faster performance.

Robust Tracking and Management

  • Uses an SQLite database to track progress and store configurations.
  • Includes a JSON API for programmatic access to status and configuration management.
  • Features an integrated web UI for visualization and ease of management.

Smart Notification and Event Handling

  • Configurable webhooks to push events and status updates to other services, reducing reliance on polling.
  • Alerting capabilities via email and Sentry integration for real-time issue monitoring.

Flexible Storage Options

  • Adapters for object storage (e.g., S3, MinIO) and traditional file systems.

Efficient Handling of Diverse Data Scales

  • Capable of handling both massive datasets (e.g., multi-gigabyte ZIP files) and millions of small files (just a few KB each), addressing the unique challenges posed by both scenarios.

Comprehensive Support for Related File Types

  • Beyond the core data (e.g., XML files), it also downloads the associated auxiliary files, such as DTDs and documentation, that are needed to work effectively with the main dataset.

Seamless Asset Lifecycle Management

  • Clean, abstracted tracking from the appearance of a new asset to download completion, including error handling.
  • Retries failed downloads with sensible limits and delays, to avoid putting unnecessary load on providers.
  • Allows skipping of specific assets as needed.
  • Leverages checksums where available to ensure data integrity.

Source Management and Change Awareness

  • Maintains versioned and documented connections to all supported data sources, with detailed changelogs.
  • Sends email notifications for source changes requiring updates.

Imagine This…

A tool so simple, you could:

  • Spin up a Docker container, log into a sleek web interface, and configure it to download patent data from multiple sources.
  • Receive notifications about updates or issues automatically.
  • Add new data sources or update existing ones with ease—no headaches, no delays.

And all of this with the peace of mind that the software is maintained, up-to-date, and built for long-term reliability.

What’s Next?

I’d love to hear your thoughts:

  • Does this sound like a tool you’d want in your workflow?
  • What features or capabilities would you add to this wishlist?

In the meantime, I’ll continue supporting companies as they adapt their custom solutions to changes such as the new USPTO Open Data Portal (ODP). But wouldn’t it be great if such a tool existed? After all, the real value lies in working with the data (analyzing, understanding, and applying it), not in the tedious task of downloading it or constantly ensuring that the download processes keep functioning.

It’s worth noting that existing tools like Apache Airflow or Luigi already offer robust workflow orchestration and scheduling capabilities. While these tools can be adapted to manage downloading and processing tasks, they often require significant customization to handle the unique challenges posed by dynamic data sources like patent databases. Additionally, they may lack some of the more specialized features on the wishlist, such as integrated notifications, seamless handling of auxiliary files, or adapters for different storage backends. As powerful as they are, their general-purpose nature can add complexity and overhead to tasks that are better served by a dedicated solution.
