Unable to read in xml file #33

Open
sophiatabchouri opened this issue Sep 18, 2020 · 3 comments

Comments


sophiatabchouri commented Sep 18, 2020

USpatenttest.xml.zip

I'm having trouble reading this XML file with the generic XmlReader. It was downloaded from the WIPO PATENTSCOPE site.

I run:
from chemdataextractor import Document
f = open('USpatenttest.xml', 'rb')
doc=Document.from_file(f)

And I get the error:

          File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
            root = self._css(self.root_css, root)[0]
        IndexError: list index out of range

Any advice is greatly appreciated! Thanks

@maddenfederico

Each reader has a detect() method that determines whether it should be the one to parse a given file. My guess is that the US patent XML reader isn't recognizing your file. Note the comment left in its detect() method:

        if b'us-patent-grant' in fstring:
            return True
        # TODO: Other DTDs

So you probably have to make a subclass of UsptoXmlReader and override the detect() method to accept your file, then pass an instance of that subclass in the readers parameter of Document.from_file().
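
The detect-and-override idea above can be sketched without installing the library. The classes below are stand-ins written for illustration, not ChemDataExtractor's own code: `StandInUsptoReader` only mirrors the quoted `detect()` check, `PermissiveUsptoReader` and `pick_reader` are hypothetical names, and the extra marker `b'us-patent-application'` is an assumption; check which root element your own file actually uses.

```python
class StandInUsptoReader:
    """Stand-in for the library's reader; only the quoted detect() check is mirrored."""

    def detect(self, fstring, fname=None):
        # Same test as the snippet quoted from the library's detect() method.
        if b'us-patent-grant' in fstring:
            return True
        return False


class PermissiveUsptoReader(StandInUsptoReader):
    """Hypothetical subclass that also accepts files with another marker."""

    # Assumed marker, for illustration only.
    extra_markers = (b'us-patent-application',)

    def detect(self, fstring, fname=None):
        if super().detect(fstring, fname=fname):
            return True
        return any(marker in fstring for marker in self.extra_markers)


def pick_reader(fstring, readers, fname=None):
    """Return the first reader whose detect() accepts the bytes, else None."""
    for reader in readers:
        if reader.detect(fstring, fname=fname):
            return reader
    return None


sample = b'<?xml version="1.0"?><us-patent-application></us-patent-application>'

# The stand-in for the stock reader rejects the file; the subclass accepts it.
default_choice = pick_reader(sample, [StandInUsptoReader()])
custom_choice = pick_reader(sample, [PermissiveUsptoReader(), StandInUsptoReader()])
```

With the real library, the equivalent step would presumably be to subclass UsptoXmlReader in the same way and call Document.from_file(f, readers=[YourReader()]).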


lameturkey commented Sep 22, 2020

I am trying to parse an XML file using the generic XmlReader and I am getting the same error. When I call lxml.etree.fromstring directly, the file parses fine. My XML isn't a US patent, so I can't use the specific reader for this.

It seems that when I change the root_css query from html to :root, my document can be parsed successfully.
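
A plausible reading of this is that the traceback's `self._css(self.root_css, root)[0]` finds no match when the selector is `html` and the document has no such element, so indexing the empty result raises the IndexError. The sketch below reproduces only that shape of failure. It uses the standard library's xml.etree.ElementTree rather than the lxml and CSS selector machinery the library relies on, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# A small, valid XML document that is not HTML (made-up sample).
xml_bytes = b'<?xml version="1.0"?><record><title>Example</title></record>'
root = ET.fromstring(xml_bytes)

# Looking for an <html> element in a non-HTML document finds nothing.
html_matches = list(root.iter('html'))

# Indexing the empty result fails the same way as the traceback above.
try:
    html_matches[0]
    failed = False
except IndexError:
    failed = True

# Taking the document's own root element always succeeds,
# which is what a :root selector amounts to.
root_tag = root.tag
```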


fmoorhof commented Sep 21, 2023

We had a similar issue: valid .xml, but an IndexError on parsing.
Inspired by the unit tests, we wrote a manual parser for PMC using NlmXmlReader. For other formats you can swap in the reader that fits your use case; see http://chemdataextractor.org/docs/reading:

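import io
import os
from chemdataextractor.reader import NlmXmlReader

def read_xml_file(fname: str):
    """Read an XML file with an explicitly chosen reader and return the parsed document."""
    r = NlmXmlReader()
    # Resolve the path relative to this script; an absolute fname is used as-is.
    path = os.path.join(os.path.dirname(__file__), fname)
    # The reader expects bytes, so open the file in binary mode.
    with io.open(path, 'rb') as body:
        content = body.read()

    return r.readstring(content)

fname = 'Your/Path/file.xml'
doc = read_xml_file(fname=fname)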
