BERT for Vietnamese, trained on more than 20 GB of news data

Applied to the sentiment analysis task using AIViVN's comments dataset

The model achieved 0.90268 on the public leaderboard (the winner's score is 0.90087). Bert4news is also used in ViNLPtoolkit (https://github.com/bino282/ViNLP), a Vietnamese toolkit for segmentation and Named Entity Recognition.

***************New Mar 11, 2020 ***************

BERT (from Google) was released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

We use word SentencePiece with the basic BERT tokenization, and the same config as BERT base, with lowercase = False.

You can download the trained model:

tensorflow.
pytorch.

Use with huggingface/transformers

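import torch
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")

line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
# Convert the sentence to token ids, including the special tokens
input_id = tokenizer.encode(line, add_special_tokens=True)
# 1 for real tokens, 0 for padding (relies on the padding id being 0)
att_mask = [int(token_id > 0) for token_id in input_id]
# Add a batch dimension of size 1
input_ids = torch.tensor([input_id])
att_masks = torch.tensor([att_mask])
# Run the model without tracking gradients, since this is inference only
with torch.no_grad():
    features = bert_model(input_ids, attention_mask=att_masks)

print(features)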
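
The attention mask above is derived from `token_id > 0`, which only matters once sequences are padded and which relies on the padding id being 0. For a batch of sentences with different lengths, the padding and the mask can be built from the sequence lengths instead. The following is a minimal sketch, not part of this repository: `pad_batch` is a hypothetical helper, the token ids are made up for illustration, and `pad_id=0` is an assumption that should be checked against `tokenizer.pad_token_id`.

```python
from typing import List, Tuple


def pad_batch(
    batch: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id lists to a common length and build attention masks.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    max_len = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Made-up token ids, not real vocabulary entries
ids, masks = pad_batch([[5, 8, 13, 6], [5, 21, 6]])
print(ids)    # [[5, 8, 13, 6], [5, 21, 6, 0]]
print(masks)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Both lists can then be wrapped in `torch.tensor(...)` and passed to the model as `input_ids` and `attention_mask`.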

Run training with base config

python train_pytorch.py \
  --model_path=bert4news.pytorch \
  --max_len=200 \
  --batch_size=16 \
  --epochs=6 \
  --lr=2e-5
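
How `train_pytorch.py` reads these flags is not shown here. As a rough illustration of the expected types only, a parser that accepts the same flags could look like the sketch below. `build_parser` is a hypothetical name, and the default values simply mirror the command above; they are assumptions, not the script's actual defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags of the training command."""
    parser = argparse.ArgumentParser(description="Fine-tuning options (illustrative)")
    parser.add_argument("--model_path", type=str, required=True,
                        help="path to the pretrained model")
    parser.add_argument("--max_len", type=int, default=200,
                        help="maximum sequence length in tokens")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="number of examples per batch")
    parser.add_argument("--epochs", type=int, default=6,
                        help="number of passes over the training data")
    parser.add_argument("--lr", type=float, default=2e-5,
                        help="learning rate")
    return parser


# Parse the same arguments as the command above
args = build_parser().parse_args([
    "--model_path=bert4news.pytorch",
    "--max_len=200",
    "--batch_size=16",
    "--epochs=6",
    "--lr=2e-5",
])
print(args.model_path, args.max_len, args.batch_size, args.epochs, args.lr)
```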

Contact information

For personal communication related to this project, please contact Nha Nguyen Van ([email protected]).
