Error when trying to use nlp.pipe with n_process > 1 #179

Open
DayalStrub opened this issue Jul 7, 2021 · 3 comments
Labels: bug, enhancement, good first issue, help wanted

Comments

@DayalStrub
Contributor

Intro

I am getting TypeError: can not serialize 'BaseTextRank' object when trying to use spaCy's multiprocessing in nlp.pipe with a textrank pipeline component.

Sorry if this is a known/expected feature/limitation - I couldn't find anything by searching the repo. I generally find (spaCy's) multiprocessing a bit temperamental anyhow, but this seems to just not work.

PS. thanks for all the great work on the package!

Environment

Ubuntu 18.X (AWS DL AMI), Python 3.8 (via conda/mamba), pytextrank installed via pip, through conda - do let me know if you need more info.

Reproducible example - hopefully

import spacy
import pytextrank

import en_core_web_sm

nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);

txt = """
The Old Testament of the King James Bible
The First Book of Moses:  Called Genesis
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.
1:5 And God called the light Day, and the darkness he called Night.
And the evening and the morning were the first day.
...
"""

data = []
for i in range(50):
    data.append((txt, {"doc_id": i}))

keys = []

for doc, context in nlp.pipe(data, as_tuples=True, n_process=-1):  # NOTE: throws the error, then hangs; works with n_process=1
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    keys.append(out)
# pd.DataFrame(keys).head()

keys
@ceteri
Collaborator

ceteri commented Jul 7, 2021

Thank you @DayalStrub -
This is good. I don't recall that we've had any cases using the multi-processor option in spaCy previously.

To confirm, when running Language.pipe() with n_process set to anything other than its default value of 1,

import pytextrank
import spacy
import en_core_web_sm

txt = """To return to my trees. This, as you know, is something that I do often. But sometimes, I even surprise myself with how powerful the pull of trees can be. Take this latest tree. I walked out onto this huge expanse of hard sand and then headed directly across to where there was this amazing old fir tree whose growth seems to have split the sandstone, its top is blown off, and its roots getting salted with every winter storm. I could not easily capture its grandness in one image so I pieced a few together and relied mostly on a short video for painting references. After all the little plein air paintings, this is my first studio painting from Hornby Island. Well, let’s see what we have shall we?"""

nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);
doc = nlp(txt)

data = [
    (txt, {"doc_id": i})
    for i in range(5)
    ]

## `n_process=-1` throws exception
## `n_process=1` works

for doc, context in nlp.pipe(data, as_tuples=True, n_process=1): 
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    print(out)

Then pytextrank causes an exception to be thrown:

Process Process-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/language.py", line 2007, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/language.py", line 2007, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "spacy/tokens/doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
  File "spacy/tokens/doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/util.py", line 1134, in to_dict
    serialized[key] = getter()
  File "spacy/tokens/doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/srsly/msgpack/__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'BaseTextRank' object

So we need to make the pytextrank base class, and the subclasses for each algorithm, serializable.
This would also be needed if we ever wanted to run distributed, say on a Ray cluster.
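
As a rough illustration of what "serializable" could mean here, the sketch below reduces a rich result object to plain built-in types before serializing it, then rebuilds it afterwards. Everything in it is an assumption made for illustration: `RankedPhrase` and `PhraseResult` are hypothetical stand-ins rather than pytextrank classes, and the standard-library `json` module is swapped in for the msgpack packer that appears in the traceback above, since both reject arbitrary Python objects with a TypeError.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class RankedPhrase:
    """Hypothetical stand-in for a ranked keyphrase (not pytextrank's class)."""
    text: str
    rank: float


class PhraseResult:
    """Hypothetical stand-in for a rich result object attached to a document."""

    def __init__(self, phrases: List[RankedPhrase]) -> None:
        self.phrases = phrases

    def to_plain(self) -> Dict[str, Any]:
        """Reduce the object to dicts, lists, strings, and floats."""
        return {"phrases": [asdict(p) for p in self.phrases]}

    @classmethod
    def from_plain(cls, data: Dict[str, Any]) -> "PhraseResult":
        """Rebuild the object from its plain representation."""
        return cls([RankedPhrase(**p) for p in data["phrases"]])


result = PhraseResult(
    [RankedPhrase("old fir tree", 0.12), RankedPhrase("hard sand", 0.08)]
)

# Serializing the rich object directly fails with a TypeError,
# analogous to the "can not serialize" error in the traceback.
try:
    json.dumps(result)
    direct_ok = True
except TypeError:
    direct_ok = False

# Serializing the plain representation works and round-trips.
payload = json.dumps(result.to_plain())
restored = PhraseResult.from_plain(json.loads(payload))
```

Whether pytextrank should expose something like `to_plain`/`from_plain`, or instead avoid storing the component object on the Doc at all, is an open design question; the sketch only shows the general pattern.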

@ceteri self-assigned this Jul 7, 2021
@ceteri added the bug label Jul 7, 2021
@ceteri added the enhancement, help wanted, and good first issue labels Apr 11, 2022
@ceteri
Collaborator

ceteri commented Nov 6, 2023

This appears to be happening in several cases in spaCy, and some of the GH issues point to using srsly https://github.com/explosion/srsly to resolve serialization issues.

@elirannrich

Any update on this bug?
Happy to help if needed.
