How to overcome RAM issue with parsing MOS-MIX Large #834

meteoDaniel · 2022-12-29T14:26:07Z

I hope you enjoyed your Christmas holidays!
You have worked on parsing MOS-MIX data before.
I am currently using the KMLReader to open a MOS-MIX file that covers all stations. Unfortunately, the RAM usage explodes.

Do you know how to reduce the memory usage during parsing?

It takes approx. 20 GB of RAM.

Best regards and thanks for your support!

meteoDaniel · 2022-12-29T14:26:45Z

Author

PS: My goal is to load the data into an xarray object and store it in a Zarr archive.

amotl · 2022-12-30T21:10:57Z

Maintainer

Hi Daniel,

based on some research and the implementation by @niclashoyer (see references below), we think the processing of MOSMIX XML files got way more efficient than before, by using stream-based XML parsing.

It would be sad if there is still something wrong with the implementation, and the RAM explodes, according to your observations. So, I think we will have to revisit the implementation.

To make investigations easier, can I humbly ask you to share the code you are using?

With kind regards,
Andreas.

References

meteoDaniel · 2022-12-31T09:35:27Z

Author

Dear @amotl, thanks for your reply.

I have only made some minor adaptations to the code: it now uses a local KMZ archive, and I removed tqdm. Otherwise it is essentially the same. I have tried different iterparse implementations, e.g. the cElementTree one, but all of them overflow the memory.

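import logging
from io import BytesIO
from os.path import basename

import pandas as pd
from fsspec.implementations.zip import ZipFileSystem
from lxml.etree import iterparse
from pandas import DatetimeIndex

# wetterdienst helper; import path assumed from the upstream module
from wetterdienst.util.io import read_in_chunks

log = logging.getLogger(__name__)


class KMLReader: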
    def __init__(
            self,
    ) -> None:
        self.metadata = {}
        self.timesteps = []
        self.nsmap = None
        self.iter_elems = None
        self.station_ids = None
        self.parameters = []

    @staticmethod
    def open(local_file_path: str) -> BytesIO:
        """Open kml file as bytes """
        buffer = BytesIO()
        with open(local_file_path, "rb") as file:
            for data in read_in_chunks(file, chunk_size=1024):
                _ = buffer.write(data)
        return buffer

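    def fetch(self, local_file_path: str) -> bytes:
        """Fetch the MOSMIX file (zipped XML) from a local KMZ archive."""
        # use the local `open` helper above; `download` no longer exists in this adaptation
        buffer = self.open(local_file_path)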
        zfs = ZipFileSystem(buffer, "r")
        return zfs.open(zfs.glob("*")[0]).read()

    def read(self, local_file_path: str):
        """
        Read a DWD XML weather forecast file of type KML from a local KMZ archive.
        """

        log.info(f"Reading KMZ file {basename(local_file_path)}")
        kml = self.fetch(local_file_path)

        log.info("Parsing KML data")
        # iterparse yields (event, element) pairs while it parses the document
        self.iter_elems = iterparse(BytesIO(kml), events=("start", "end"), resolve_entities=False)

        prod_items = {
            "issuer": "Issuer",
            "product_id": "ProductID",
            "generating_process": "GeneratingProcess",
            "issue_time": "IssueTime",
        }

        nsmap = None

        # Get Basic Metadata
        prod_definition = None
        prod_definition_tag = None
        for event, element in self.iter_elems:
            if event == "start":
                # get namespaces from root element
                if nsmap is None:
                    nsmap = element.nsmap
                    prod_definition_tag = f"{{{nsmap['dwd']}}}ProductDefinition"
            elif event == "end":
                if element.tag == prod_definition_tag:
                    prod_definition = element
                    # stop processing after head
                    # leave forecast data for iteration
                    break

        self.metadata = {k: prod_definition.find(f"{{{nsmap['dwd']}}}{v}").text for k, v in prod_items.items()}
        self.metadata["issue_time"] = pd.Timestamp(self.metadata["issue_time"])

        # Get time steps.
        timesteps = prod_definition.findall(
            "dwd:ForecastTimeSteps",
            nsmap,
        )[0]
        self.timesteps = DatetimeIndex([pd.Timestamp(i.text) for i in timesteps.getchildren()])

        # save namespace map for later iteration
        self.nsmap = nsmap

    def iter_items(self):
        clear = True
        placemark_tag = f"{{{self.nsmap['kml']}}}Placemark"
        for event, element in self.iter_elems:
            if event == "start":
                if element.tag == placemark_tag:
                    clear = False
            elif event == "end":
                if element.tag == placemark_tag:
                    station_id = element.find("kml:name", self.nsmap).text
                    if (self.station_ids is None) or station_id in self.station_ids:
                        yield element
                    clear = True
                if clear:
                    element.clear()

1 reply

amotl Dec 31, 2022
Maintainer

Hi Daniel,

thank you for sharing more details. First of all, I recognize that you are aiming to read a local archive file using KMLReader. It would be sweet if the component would support that scenario, so that corresponding implementations will not diverge.

Secondly, I also had problems recently on another project, where I believed I did everything correctly, but the memory would still explode when reading a large XML file. After another iteration, it works perfectly well now, chugging through large XML files with only a few MB of RAM usage ¹.

Now, back to KMLReader. Without running the code, I can see two spots where I think the optimization is blocked by reading the whole data into memory, so I am kindly asking you to change them and report back about any improvements.

wetterdienst/wetterdienst/provider/dwd/mosmix/access.py

Line 80 in 40e823d

return zfs.open(zfs.glob("*")[0]).read()

wetterdienst/wetterdienst/provider/dwd/mosmix/access.py

Line 91 in 40e823d

self.iter_elems = iterparse(BytesIO(kml), events=("start", "end"), resolve_entities=False)

What happens if you omit the .read() call on the first line, and the wrapping into BytesIO() on the second one? As said, without running the code yet, I would think it may unlock the streaming-like reading and parsing of data. If it doesn't, I think we will have to refine the iter_items method.
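A minimal, self-contained sketch of that idea, which assumes nothing about the actual wetterdienst code: it uses only the standard library (`zipfile` and `xml.etree.ElementTree`) in place of `ZipFileSystem` and lxml, and the demo archive as well as the names `build_demo_kmz` and `stream_station_names` are hypothetical, made up for illustration. The zip member is handed to `iterparse` as a file-like object, so it is decompressed and parsed chunk by chunk instead of being materialized first. lxml's `iterparse` also accepts a file-like object as its source, so the same shape should carry over.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_demo_kmz() -> io.BytesIO:
    """Create a tiny in-memory zip archive with one XML member (stand-in for a KMZ)."""
    xml = b"<Document>" + b"".join(
        b"<Placemark><name>%d</name></Placemark>" % i for i in range(3)
    ) + b"</Document>"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("demo.kml", xml)
    buffer.seek(0)
    return buffer


def stream_station_names(kmz) -> list:
    """Collect Placemark names while streaming the first archive member."""
    names = []
    with zipfile.ZipFile(kmz) as archive:
        member = archive.namelist()[0]
        # archive.open() returns a file-like object that decompresses lazily.
        # No .read() and no BytesIO() wrapper: iterparse pulls the data in chunks.
        with archive.open(member) as stream:
            for _event, element in ET.iterparse(stream, events=("end",)):
                if element.tag == "Placemark":
                    names.append(element.findtext("name"))
                    element.clear()  # release the station's children once handled
    return names


station_names = stream_station_names(build_demo_kmz())
print(station_names)
```

Whether this change alone removes the 20 GB peak is untested here; it only takes the two full in-memory copies out of the picture.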

With kind regards,
Andreas.

¹ https://github.com/ip-tools/patzilla/blob/peds/patzilla/util/xml/reader.py -- please ignore to_dict and to_json, and focus on read_xml and fast_iter instead.
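The linked `fast_iter` is not reproduced here. What follows is only a sketch of the general "clear and detach" idiom that such helpers are commonly built on, under the assumption that this is the relevant technique: after an element has been handled, it is emptied and also removed from its parent, so the partially built tree does not keep growing. It is written against the standard library's `xml.etree.ElementTree`, which has no `getparent()`, so a stack of open elements tracks the parent; with lxml the same effect is usually achieved through `getparent()` and `getprevious()`. The name `iter_elements` and the demo document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(stream, tag):
    """Yield each finished `tag` element, then empty it and detach it from the tree."""
    open_elements = []  # stack of elements whose end tag has not been seen yet
    for event, element in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            open_elements.append(element)
            continue
        open_elements.pop()  # `element` itself has just been closed
        if element.tag == tag:
            yield element  # the consumer must extract its data before resuming
            element.clear()  # drop children, text and attributes
            if open_elements:
                # also drop the parent's reference to the emptied element
                open_elements[-1].remove(element)


# Demo input only; in real use `stream` would be a lazily read file-like object.
demo_xml = b"<kml><Document>" + b"".join(
    b"<Placemark><name>%d</name></Placemark>" % i for i in range(1000)
) + b"</Document></kml>"

station_names = [
    placemark.findtext("name")
    for placemark in iter_elements(io.BytesIO(demo_xml), "Placemark")
]
print(len(station_names), station_names[:3])
```

Compared with the `iter_items` method shown earlier, the difference is the `remove()` call: `element.clear()` on its own leaves one emptied element per station attached to the parent.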
