Pysharkfeat is a TLS encrypted traffic feature extraction tool from pcaps written in Python by using Wireshark's command line tshark
.
Pysharkfeat is derived from an academic research on malicious encrypted traffic analysis. Compare with other feature extraction tools such as Flowmeter and Joy, Pysharkfeat is easier to setup and use while providing rich features.
- Parse a single pcap or directory to generate meta and statistical features
- Export features in JSON files
- Support logging
Traffic features include:
- Meta: 5-tuple(src ip, src port, dest ip, dest port, timestamp), duration, stream index
- Statistical:
- Bidirectional packet len and inter-arrival-time sum/max/min/mean/std
- SPLT(Markov sequence of pkt len and time)
- Byte distribution, payload std and entropy.
- TLS: todo.
Full features can be found in feat.py
or feature JSON file.
- Operating systems: Linux, Mac OSX. (Windows is not supported in current version, see Known issues)
- Language: Python3.8, 3.9
- Dependence: Wireshark
Install pysharkfeat from pip
pip3 install pysharkfeat
Install Wireshark(tshark)
- Windows/Mac: https://www.wireshark.org/#download
- Centos:
sudo yum install wireshark
- Ubuntu:
sudo apt-get install wireshark
Test tshark
tshark --version
For Windows, make sure tshark can be called by command line by adding tshark
to the environment path.
Pysharkfeat can be used for machine learning research and threat analysis.
There are several feature files in tests/output
generated from pcaps at Malware Traffic Analysis, and you can immediately start analyzing them.
This code snippet can be found in tests/demo.py
.
from pysharkfeat.featextractor import FeatureExtractor
import json, os
# specify pcaps and output dir
pcap_dir = "./pcaps/2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap"
output_dir="./output"
extractor = FeatureExtractor(pcap_path=pcap_dir, output_dir=output_dir)
summary = extractor.main_extract_pcaps_feat()
print(summary)
# read feature files
feat_file = os.path.join(output_dir, "2021-01-04-Emotet-infection-with-Trickbot-traffic.json")
f = open(feat_file)
stream_feats = json.load(f)
for feat in stream_feats :
print("%s, stream_index:%s, byte dist entropy:%s" % (feat["pcap_name"], feat["stream_index"], feat["bd_entropy"]))
# display stream index and byte distribution entropy features, and bd entropies are very close.
2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap, stream_index:3, byte dist entropy:7.999464797314957
2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap, stream_index:7, byte dist entropy:7.903172099500442
2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap, stream_index:9, byte dist entropy:7.9876935373284805
...
Pysharkfeat is built on tshark, which may incur substantial overhead. The following table shows some test results for meta freatures on OSX(CPU i5, 16GB RAM).
pcap name | pcap size | num of TLS streams | time(sec) |
---|---|---|---|
2021-01-04-Emotet-infection-with-Trickbot-traffic.pcap | 5.4MB | 10 | 10.8 |
2021-01-05-PurpleFox-EK-and-post-infection-traffic.pcap | 9.5MB | 8 | 11.5 |
2021-01-15-Emotet-epoch-1-infection-traffic.pcap | 5.9MB | 40 | 38.2 |
2021-02-24-Qakbot-infection-with-spambot-traffic.pcap | 21.1MB | 94 | 213.9 |
20 pcaps collected from MalwareTrafficAnalsysis | 1503 | 3342.3 |
In the last row of the table, we tested Pysharfeat with 20 pcaps from MalwareTrafficAnalysis (from Jan 2021 to Feb 2021) on a workstation using Centos8.5, CPU i7 and 32GBM RAM. It took 55 minutes to extract features.
The feature file of a single TLS stream has approximately 16KB. If a pcap has 100 TLS streams, the storage will be roughly 1.6MB.
You are welcome to post a issue or feature request, or send email to the author [email protected].
Pysharkfeat is open source and free to use under GPL V3 license. See LICENSE for more details.
- Running Pysharkfeat on Windows may encounter syntax parsing error with tshark, probably Windows has a different usage of ""(quotes).