alejandrofdez-us/DataCenter-Traces-Datasets

DataCenter-Traces-Datasets

Public datasets organized for machine learning or artificial intelligence usage. The following datasets can be used:

Alibaba 2018 machine usage

Processed from the original files found at: https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018

The machine usage dataset in this repository includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field            | Type       | Label | Comment                                            |
+--------------------------------------------------------------------------------------------+
| cpu_util_percent | bigint     |       | [0, 100]                                           |
| mem_util_percent | bigint     |       | [0, 100]                                           |
| net_in           | double     |       | normalized incoming network traffic, [0, 100]      |
| net_out          | double     |       | normalized outgoing network traffic, [0, 100]      |
| disk_io_percent  | double     |       | [0, 100], abnormal values are -1 or 101            |
+--------------------------------------------------------------------------------------------+

Three sampled datasets are provided: the average value of each column grouped every 10 seconds (the original sampling interval), plus versions downsampled to 30 seconds and 300 seconds. Every column holds the average utilization across the whole data center.
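As an illustration of how the 10-second series relates to the coarser ones, here is a minimal sketch of mean downsampling. It assumes the 30-second and 300-second files are equivalent to averaging consecutive 10-second rows, which this README does not confirm. The helper `downsample_mean` and the sample values are hypothetical and not part of this repository.

```python
from statistics import fmean


def downsample_mean(values, factor):
    """Average consecutive, non-overlapping groups of `factor` samples."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    # Drop a trailing partial group so every output covers a full window.
    usable = len(values) - len(values) % factor
    return [fmean(values[i:i + factor]) for i in range(0, usable, factor)]


# Hypothetical 10-second CPU utilization samples (percent), not real trace data.
cpu_10s = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

# 3 x 10 s = 30 s windows; a factor of 30 would give 300 s windows.
cpu_30s = downsample_mean(cpu_10s, 3)
print(cpu_30s)  # [20.0, 50.0]
```

Dropping the trailing partial window is a design choice made for this sketch. Averaging it instead would produce a final point that covers less time than the others.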

Figures

Some figures were generated from these datasets:

Figure: CPU utilization sampled every 10 seconds (days 1 to 8)
Figure: Memory utilization sampled every 300 seconds (days 1 to 8)
Figure: Net in sampled every 300 seconds (days 1 to 8)
Figure: Net out sampled every 300 seconds (days 1 to 8)
Figure: Disk I/O sampled every 300 seconds (days 1 to 8)

Google 2019 instance usage

Processed from the original dataset, which was queried using BigQuery. More information available at: https://research.google/tools/datasets/google-cluster-workload-traces-2019/

The instance usage dataset in this repository includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field                         | Type       | Label | Comment                               |
+--------------------------------------------------------------------------------------------+
| avg_cpu                       | double     |       | [0, 1]                                |
| avg_mem                       | double     |       | [0, 1]                                |
| avg_assigned_mem              | double     |       | [0, 1]                                |
| avg_cycles_per_instruction    | double     |       | [0, _]                                |
+--------------------------------------------------------------------------------------------+

One sampled dataset is provided: the average value of each column grouped every 300 seconds, which is the original sampling interval. Every column holds the average utilization across the whole data center.

Figures

Some figures were generated from these datasets:

Figure: CPU usage day 26 sampled every 300 seconds
Figure: Mem usage day 26 sampled every 300 seconds
Figure: Assigned mem day 26 sampled every 300 seconds
Figure: Cycles per instruction day 26 sampled every 300 seconds

Azure v2 virtual machine workload

Processed from the original dataset. More information available at: https://github.com/Azure/AzurePublicDataset/blob/master/AzurePublicDatasetV2.md

The virtual machine usage dataset in this repository includes the following columns:

+--------------------------------------------------------------------------------------------+
| Field                         | Type       | Label | Comment                               |
+--------------------------------------------------------------------------------------------+
| cpu_usage                     | double     |       | [0, _]                                |
| assigned_mem                  | double     |       | [0, _]                                |
+--------------------------------------------------------------------------------------------+

One sampled dataset is provided: the sum of each column grouped every 300 seconds, which is the original sampling interval. CPU usage was computed using the core_count usage of each virtual machine. Every column holds the total consumption of all virtual machines in the data center. Each file comes in two versions: one with a timestamp column (from 0 to 2591700, in 300-second steps) and one without.
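The timestamp range stated above implies a fixed number of rows per file. The following sketch works that out and shows one way to rebuild timestamps for the version without them. It assumes rows are in chronological order with no gaps, which this README does not state explicitly. The helper `attach_timestamps` is hypothetical and not part of this repository.

```python
STEP_SECONDS = 300
LAST_TIMESTAMP = 2_591_700  # final timestamp stated in the README

# Every timestamp from 0 to LAST_TIMESTAMP inclusive, in 300-second steps.
timestamps = list(range(0, LAST_TIMESTAMP + STEP_SECONDS, STEP_SECONDS))

num_samples = len(timestamps)
days_covered = num_samples * STEP_SECONDS / 86_400  # 86,400 seconds per day

print(num_samples)   # 8640
print(days_covered)  # 30.0


def attach_timestamps(values, step=STEP_SECONDS):
    """Pair each row of a timestamp-free series with its implied timestamp."""
    return [(i * step, value) for i, value in enumerate(values)]


# Hypothetical values, not real trace data.
print(attach_timestamps([1.5, 2.5]))  # [(0, 1.5), (300, 2.5)]
```

So each series should contain 8640 rows covering 30 days, at 288 rows per day.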

Figures

Some figures were generated from these datasets:

Figure: CPU total usage by virtual machines over one month, sampled every 300 seconds.
Figure: Total assigned memory for virtual machines over one month, sampled every 300 seconds.
