Encoding helpers #52

Open
JocelynDelalande opened this issue Dec 23, 2022 · 3 comments

@JocelynDelalande
Contributor

JocelynDelalande commented Dec 23, 2022

I kind of struggle with EDIFACT encoding, but here is what I came up with:

data:

# https://blog.sandro-pereira.com/2009/08/15/edifact-encoding-edi-character-set-support/
# https://www.truugo.com/edifact/d09a/cl0001/
# A bit unsure of how 10646-1 maps exactly to utf-8
EDIFACT_ENCODINGS = {
    "UNOA": "ascii",  # ISO 646
    "UNOB": "ascii",  # ISO 646
    "UNOC": "iso-8859-1",
    "UNOD": "iso-8859-2",
    "UNOE": "iso-8859-5",
    "UNOF": "iso-8859-7",
    "UNOG": "iso-8859-3",
    "UNOH": "iso-8859-4",
    "UNOI": "iso-8859-6",
    "UNOJ": "iso-8859-8",
    "UNOK": "iso-8859-9",
    "UNOW": "utf-8",  # ISO 10646-1
    "UNOX": "iso-2022-jp",  # ISO 2022 / ISO 2375
    "UNOY": "utf-8",  # ISO 10646-1
}

deserializing helper:

def guess_edifact_encoding(stream):
    # Parser and ParseError are not defined in this snippet; they are
    # expected to be imported by the surrounding module.
    # Scan line by line until the UNB segment or the end of the stream.
    unb_line = b"\n"
    eof_marker = b""
    while not unb_line.startswith(b"UNB") and unb_line != eof_marker:
        unb_line = stream.readline()

    if not unb_line.startswith(b"UNB"):
        raise ParseError("Missing UNB segment")

    # Must be ASCII-only
    unb_line_s = unb_line.decode("ascii")
    parser = Parser()
    unb_segment = list(parser.parse(unb_line_s))[0]
    # Ignore version, always v1…
    encoding_element = unb_segment.elements[0][0]
    try:
        return EDIFACT_ENCODINGS[encoding_element]
    except KeyError:
        raise ParseError(f"Wrong encoding spec: {encoding_element}")

I wonder what pydifact could include in its scope, in terms of:

  • helper data (such as the mapping above)
  • a serialization helper (like an Interchange.serialize_to_bytes() helper with automatic encoding selection based on the syntax identifier?)
  • deserialization from bytes, handling decoding with a guesser like the one I wrote
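
To make the serialization idea concrete, here is a minimal sketch that encodes an already-serialized interchange string with the codec matching its syntax identifier. It does not use pydifact at all: `encode_interchange` is a hypothetical name, not an existing pydifact function, and the mapping is only a subset of the table above.

```python
# Hypothetical helper, NOT part of pydifact's API.
# Subset of the syntax identifier -> Python codec mapping from this issue.
EDIFACT_ENCODINGS = {
    "UNOA": "ascii",
    "UNOB": "ascii",
    "UNOC": "iso-8859-1",
    "UNOY": "utf-8",
}


def encode_interchange(text: str, syntax_identifier: str) -> bytes:
    """Encode a serialized interchange using the codec of its syntax identifier."""
    try:
        codec = EDIFACT_ENCODINGS[syntax_identifier]
    except KeyError:
        raise ValueError(
            f"Unknown syntax identifier: {syntax_identifier!r}"
        ) from None
    # Strict mode: fail loudly if the text contains characters that the
    # declared character set cannot represent.
    return text.encode(codec, errors="strict")
```

Encoding strictly means a `UnicodeEncodeError` surfaces when the content does not fit the declared character set, instead of silently writing a file whose header lies about its encoding.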

Any thoughts appreciated :-).

@nerdoc
Member

nerdoc commented Dec 29, 2022

Hm. I am just dealing with ISO 8859-1 encoding in my files. But yes, the header specifies the encoding for all of them.
That is one thing I don't really understand, and frankly I ignore it most of the time, as I'd like to see everything in UTF-8 - other encodings are stupid, and don't exist... ...and the earth is flat.
OMG.

OK. AFAIU, it would suffice to read the 4-byte syntax identifier at the start of the interchange's UNB segment, decide which encoding to use, and then read the rest of the stream in that encoding.
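
A rough sketch of that idea, working directly on the first bytes of the file and without the parser. `sniff_syntax_identifier` is a hypothetical name, and the sketch assumes that an optional UNA header is exactly 9 bytes long and that the syntax identifier is exactly 4 ASCII characters directly after `UNB` and the data element separator.

```python
def sniff_syntax_identifier(head: bytes) -> str:
    """Return the 4-character syntax identifier (e.g. 'UNOC') from the
    first bytes of an interchange. Hypothetical helper, not pydifact API."""
    head = head.lstrip()
    if head.startswith(b"UNA"):
        # UNA is the tag plus 6 service characters: skip all 9 bytes.
        head = head[9:].lstrip()
    if not head.startswith(b"UNB") or len(head) < 8:
        raise ValueError("Missing UNB segment")
    # head[3] is the data element separator; head[4:8] is the identifier.
    return head[4:8].decode("ascii")
```

The returned identifier can then be looked up in a table like `EDIFACT_ENCODINGS`, and the whole stream decoded with the resulting codec.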

I have files starting with "UNB+ANSI:1+ME123456" - mostly without a UNA header, and with none of those UNO[x] specifiers. An example is in the test data files.
How to deal with that?
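
One possible way to cope, sketched under loud assumptions: let the caller pass a fallback codec that is used whenever the identifier is not a known UNO[x] value. `resolve_codec` and `KNOWN_ENCODINGS` are hypothetical names, and which codec "ANSI" really means (`cp1252` in the usage below is only a guess) is something the caller has to know about their own files.

```python
import codecs

# Subset of the standard identifiers; anything else is non-standard.
KNOWN_ENCODINGS = {
    "UNOA": "ascii",
    "UNOC": "iso-8859-1",
    "UNOY": "utf-8",
}


def resolve_codec(identifier, fallback=None):
    """Map a syntax identifier to a Python codec name, using a
    caller-supplied fallback for non-standard identifiers."""
    codec = KNOWN_ENCODINGS.get(identifier, fallback)
    if codec is None:
        raise ValueError(f"Unknown syntax identifier: {identifier!r}")
    # Raises LookupError early if the codec name itself is invalid.
    codecs.lookup(codec)
    return codec


# Usage: the fallback is an assumption about this particular data source.
codec_for_ansi = resolve_codec("ANSI", fallback="cp1252")
```

Keeping the fallback explicit means standard files are still handled strictly, while non-standard ones only work when the caller opts in.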

@JocelynDelalande
Contributor Author

JocelynDelalande commented Dec 30, 2022 via email

@nerdoc
Member

nerdoc commented May 7, 2023

It is definitely part of what I want to cope with, because I need to deal with that kind of file... But I'm afraid this is EDIFACT. It's a file I got myself (I just changed the names to pseudonymize them) - but here in medical systems, many companies don't care about standards...
