Unable to read in xml file #33

sophiatabchouri · 2020-09-18T04:40:19Z

USpatenttest.xml.zip

I'm having trouble reading in this XML file with the generic XMLReader. It was downloaded from the WIPO PATENTSCOPE site.

I run:
from chemdataextractor import Document
f = open('USpatenttest.xml', 'rb')
doc=Document.from_file(f)

And I get the error:
  File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
    root = self._css(self.root_css, root)[0]
IndexError: list index out of range

Any advice is greatly appreciated! Thanks


maddenfederico · 2020-09-20T02:44:41Z

Each reader has a detect() method that decides whether it should be the one to parse a given file. My guess is that the US patent XML reader isn't recognizing your file. Note the comment left in its detect() method:

        if b'us-patent-grant' in fstring:
            return True
        # TODO: Other DTDs

So you probably have to subclass UsptoXmlReader, override its detect() method so that it accepts your file, and then pass an instance of that subclass in the readers parameter of Document.from_file().
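
A minimal sketch of that idea, not tested against your file. The class name PatentXmlReader, the helper looks_like_patent_xml and the second marker in PATENT_MARKERS are made up for illustration; only b'us-patent-grant' comes from the library's own check, so look at the root element of your file and adjust. The detection logic sits in a plain function so it can be tried without ChemDataExtractor installed.

```python
# Byte markers that identify the file as patent XML. Only the first one comes
# from the library's detect(); the second is an assumed example, so check the
# root element of your own file and adjust.
PATENT_MARKERS = (b'us-patent-grant', b'us-patent-application')


def looks_like_patent_xml(fstring, fname=None, markers=PATENT_MARKERS):
    """Return True if the raw bytes contain one of the accepted markers."""
    if fname and not fname.lower().endswith('.xml'):
        return False
    return any(marker in fstring for marker in markers)


def read_patent(path):
    """Parse `path` with a UsptoXmlReader subclass that uses the wider check."""
    # Imported here so the helper above can be tried without ChemDataExtractor.
    from chemdataextractor import Document
    from chemdataextractor.reader import UsptoXmlReader

    class PatentXmlReader(UsptoXmlReader):  # hypothetical name
        # root_css may also need changing if the root element of the file
        # differs from what the parent class expects; this sketch leaves it
        # untouched.
        def detect(self, fstring, fname=None):
            return looks_like_patent_xml(fstring, fname)

    with open(path, 'rb') as f:
        return Document.from_file(f, readers=[PatentXmlReader()])
```

Usage would be doc = read_patent('USpatenttest.xml'). If detect() now accepts the file but parsing still fails with the same IndexError, the root selector is the next thing to check.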

lameturkey · 2020-09-22T13:16:59Z

I am trying to parse an XML file using the generic XMLReader and I am getting this error too. When I call lxml.etree.fromstring directly, the file parses fine. My XML isn't a US patent, so I can't use the patent-specific reader.

It seems that when I change the root_css query from html to :root, my document parses successfully.
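
Here is a rough illustration of why that helps, using only the standard library rather than the library's real selector code. select_root, SAMPLE and AnyRootXmlReader are names invented for this sketch. The first function mimics the failing line (take the first match for a tag, or the document root itself); the second shows how the root_css change could be kept in a subclass instead of editing the installed package.

```python
import xml.etree.ElementTree as ET

# A small stand-in document whose root element is not <html>.
SAMPLE = b"<article><body><p>Benzene melts at 5.5 C.</p></body></article>"


def select_root(xml_bytes, tag=None):
    """Return the first element named `tag`, or the document root if tag is None."""
    root = ET.fromstring(xml_bytes)
    matches = [root] if tag is None else list(root.iter(tag))
    # Same pattern as the failing line: indexing an empty match list
    # raises IndexError.
    return matches[0]


def make_generic_reader():
    """Build an XmlReader whose root selector matches any root element."""
    # Imported here so select_root() can be tried without ChemDataExtractor.
    from chemdataextractor.reader import XmlReader

    class AnyRootXmlReader(XmlReader):  # hypothetical name
        root_css = ':root'

    return AnyRootXmlReader()
```

Calling select_root(SAMPLE, 'html') fails the same way as the traceback in this issue, while select_root(SAMPLE) returns the article element. The reader from make_generic_reader() can then be passed via the readers parameter of Document.from_file().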

fmoorhof · 2023-09-21T14:25:46Z

We had a similar issue: a valid .xml file, but an IndexError on parsing.
Inspired by the unit tests, we wrote a small manual parser for PMC files using NlmXmlReader. For other formats you can swap in the reader that fits your use case; see http://chemdataextractor.org/docs/reading.

from chemdataextractor import Document
from chemdataextractor.reader import NlmXmlReader

def read_xml_file(fname: str) -> Document:
    """Parse an XML file with NlmXmlReader and return the resulting Document."""
    reader = NlmXmlReader()
    # Read the raw bytes and hand them straight to the chosen reader.
    with open(fname, 'rb') as f:
        content = f.read()

    return reader.readstring(content)

fname = 'Your/Path/file.xml'
doc = read_xml_file(fname=fname)
