gotcha

Never trust the data you receive.

Wolfgang Braun

Nov 8, 2024 — 2 min read

When someone applies for a patent, they embark on a long journey through filing, examination, grant, and maintenance. This process creates a massive trail of data, inputted at each step by people in multiple systems. This data is processed, transformed, and passed through various networks, databases, and software systems, leading to even more data layers generated by each program in the chain. At every step, there are countless opportunities for the data to get distorted, misplaced, or corrupted—whether due to technical issues, human error, or system limitations.

Efforts like the EPO’s DOCDB database aim to harmonize patent data globally. They put tremendous effort into standardizing and correcting data, creating a powerful resource for patent information. But no matter how diligent these teams are, it’s impossible to eliminate every error. In fact, the harmonization process itself can introduce new inconsistencies.

Take the example of DOCDB, where the data often appears well-organized, following XML definitions and coherent formatting. But as the old adage goes, “Don’t assume.” Simply because the data looks structured doesn’t mean it’s accurate. Always remember to approach these data sets with a critical eye.

Let’s get concrete. Consider this snippet from the DOCDB XML file
DOCDB-202407-020-TN-0217.xml:

This abstract, labeled as Arabic with lang="ar", looks consistent with its content.

<abstract lang="ar" data-format="docdba" abstract-source="national office">
<p>جهازلتحسين مردودية المقاومات الحرارية المستعملة في تسخين الماء و تسخين المنازل. يتكون الاختراع من : هيكل ذو شكل مسطح لايتجاوز سمكه بعض السنتيمترات 1 يحتوي على مقاوم حراري 4 مرتبط بمثبت له المحشور في قاعدة2 . يتم ربط الأسلاك الكهربائية و تثبيتها ببراغي5 . يتم مسك المقاوم الحراري لمنعه من الارتخاء بماسك المقاوم. الجهاز حسب الاختراع موجه أساسا لتسخين المسطحات المائية و تسخين المنازل</p>
</abstract>

However, here’s another abstract from a different entry:

<abstract lang="ar" data-format="docdba" abstract-source="national office">
<p>TAPIS DE PRIERE TRANSPORTABLE ERGONOMIQUE A ORIENTATION ORTHOPEDIQUE. C'EST UN TAPIS DE PRIERE PLIABLE, TRANSPORTABLE SOUS FORME DE SAC A L'AIDE D'UNE BANDOULIERE, AVEC LA PARTICULARITE D'ABSORBER LE CHOC DE LA PRIERE GRآCE A LA MOUSSE UTILISE. IL EST EGALEMENT ORNE DE 2 POCHES PERMETTANT A L'UTILISATEUR DE TRANSPORTER DES EFFETS PERSONNELS.</p>
</abstract>

489016392

The lang="ar" attribute here still specifies Arabic, but the content is actually in French. If you’re building software to parse and rely on these language tags, this inconsistency could easily slip through, leading to processing errors or misclassification.

This is not to say DOCDB and similar resources aren’t invaluable—they are. But remember, even the best tools come with their quirks. So, take a critical approach, validate your assumptions, and always test your software against unexpected anomalies.

Never trust the data you receive.

Wolfgang Braun

Read more

Game-Changer: Key EPO Patent Datasets Are Now Free!

Launch: USPTO Open Data Portal

EPO’s New AI Chatbot for Navigating Patent Guidelines

Launch: EPO Technology Intelligence Platform