How to overcome RAM issue with parsing MOS-MIX Large #834
Replies: 3 comments 1 reply
-
PS: My goal is to load the data into an xarray object and store it in a Zarr archive.
-
Hi Daniel,

based on some research and the implementation by @niclashoyer (see references below), we think the processing of MOSMIX XML files has become much more efficient than before, thanks to stream-based XML parsing. It would be a pity if something were still wrong with the implementation and RAM usage explodes, as you observed. In that case, we will have to revisit the implementation.

To make the investigation easier, could you please share the code you are using?

With kind regards,

References
-
Dear @amotl, thanks for your reply. I have only made minor adaptations to the code: it now uses a local KMZ archive, and I removed tqdm. Otherwise it is essentially the same. I have tried different variants of iterparse, e.g.:

    import logging
    from io import BytesIO
    from os.path import basename

    import pandas as pd
    from fsspec.implementations.zip import ZipFileSystem
    from lxml.etree import iterparse
    from pandas import DatetimeIndex

    log = logging.getLogger(__name__)


    def read_in_chunks(file, chunk_size=1024):
        """Yield successive chunks from a file object (helper not shown in the original post; assumed)."""
        while True:
            data = file.read(chunk_size)
            if not data:
                break
            yield data


    class KMLReader:
        def __init__(
            self,
        ) -> None:
            self.metadata = {}
            self.timesteps = []
            self.nsmap = None
            self.iter_elems = None
            self.station_ids = None
            self.parameters = []

        @staticmethod
        def open(local_file_path: str) -> BytesIO:
            """Open the local KMZ file and return its content as a bytes buffer."""
            buffer = BytesIO()
            with open(local_file_path, "rb") as file:
                for data in read_in_chunks(file, chunk_size=1024):
                    _ = buffer.write(data)
            return buffer

        def fetch(self, local_file_path: str) -> bytes:
            """Fetch weather MOSMIX file (zipped XML)."""
            # The adapted reader defines `open`, not `download`.
            buffer = self.open(local_file_path)
            zfs = ZipFileSystem(buffer, "r")
            # Note: this reads the whole decompressed KML document into memory.
            return zfs.open(zfs.glob("*")[0]).read()

        def read(self, local_file_path: str) -> None:
            """
            Read a DWD XML weather forecast file of type KML from a local KMZ archive.
            """
            log.info(f"Reading KMZ file {basename(local_file_path)}")
            kml = self.fetch(local_file_path)
            log.info("Parsing KML data")
            # iterparse step makes
            self.iter_elems = iterparse(BytesIO(kml), events=("start", "end"), resolve_entities=False)
            prod_items = {
                "issuer": "Issuer",
                "product_id": "ProductID",
                "generating_process": "GeneratingProcess",
                "issue_time": "IssueTime",
            }
            nsmap = None
            # Get basic metadata.
            prod_definition = None
            prod_definition_tag = None
            for event, element in self.iter_elems:
                if event == "start":
                    # get namespaces from root element
                    if nsmap is None:
                        nsmap = element.nsmap
                        prod_definition_tag = f"{{{nsmap['dwd']}}}ProductDefinition"
                elif event == "end":
                    if element.tag == prod_definition_tag:
                        prod_definition = element
                        # stop processing after head
                        # leave forecast data for iteration
                        break
            self.metadata = {k: prod_definition.find(f"{{{nsmap['dwd']}}}{v}").text for k, v in prod_items.items()}
            self.metadata["issue_time"] = pd.Timestamp(self.metadata["issue_time"])
            # Get time steps.
            timesteps = prod_definition.findall(
                "dwd:ForecastTimeSteps",
                nsmap,
            )[0]
            self.timesteps = DatetimeIndex([pd.Timestamp(i.text) for i in timesteps])
            # save namespace map for later iteration
            self.nsmap = nsmap

        def iter_items(self):
            clear = True
            placemark_tag = f"{{{self.nsmap['kml']}}}Placemark"
            for event, element in self.iter_elems:
                if event == "start":
                    if element.tag == placemark_tag:
                        # keep the children of a Placemark until it is complete
                        clear = False
                elif event == "end":
                    if element.tag == placemark_tag:
                        station_id = element.find("kml:name", self.nsmap).text
                        if (self.station_ids is None) or station_id in self.station_ids:
                            yield element
                        clear = True
                    if clear:
                        element.clear()
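One direction that might be worth trying, offered only as a sketch: stream the KML member straight out of the archive instead of reading it into a bytes object first, and detach each Placemark from its parent once it has been consumed, since `clear()` empties an element but leaves it attached to the tree. The example below uses only the standard library (`zipfile` and `xml.etree.ElementTree`) in place of fsspec and lxml. `iter_placemark_names` is a hypothetical helper, and the KML namespace URI and the single-member archive layout are assumptions. Whether this resolves the RAM growth described in this thread is untested.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the file uses the standard KML 2.2 namespace.
KML_NS = "http://www.opengis.net/kml/2.2"


def iter_placemark_names(kmz_file):
    """Yield the <kml:name> text of every Placemark in a KMZ archive.

    Hypothetical helper: streams the first archive member and detaches each
    Placemark from the tree after it has been handed to the caller.
    """
    placemark_tag = f"{{{KML_NS}}}Placemark"
    name_tag = f"{{{KML_NS}}}name"
    with zipfile.ZipFile(kmz_file) as archive:
        member = archive.namelist()[0]  # assumption: one KML file per archive
        # archive.open() returns a file-like object that decompresses on
        # demand, so the XML is never held as one large bytes object.
        with archive.open(member) as stream:
            ancestors = []  # currently open elements, innermost last
            for event, element in ET.iterparse(stream, events=("start", "end")):
                if event == "start":
                    ancestors.append(element)
                    continue
                ancestors.pop()  # this is the element that just ended
                if element.tag == placemark_tag:
                    yield element.findtext(name_tag)
                    # Remove the finished Placemark from its parent so the
                    # partially built tree does not keep growing.
                    if ancestors:
                        ancestors[-1].remove(element)
```

Usage would look like `for name in iter_placemark_names("some_local_file.kmz"): ...`, where the file name is a placeholder. Anything needed from a Placemark has to be extracted before the generator resumes, because the element is detached at that point.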
-
Dear @amotl,
I hope you enjoyed your Christmas holidays!
You have worked on parsing MOSMIX data.
I am currently using the KMLReader to open a MOS-MIX all-stations file. Unfortunately, RAM usage is exploding: it takes approx. 20 GB of RAM. Do you know how to reduce the RAM usage during parsing?
Best regards and thanks for your support!