bino282/bert4news

BERT for Vietnamese, trained on a news dataset of more than 20 GB

Applied to sentiment analysis using AIViVN's comments dataset

The model achieved 0.90268 on the public leaderboard (the winner's score is 0.90087). Bert4news is also used in ViNLP (https://github.com/bino282/ViNLP), a Vietnamese toolkit for segmentation and named entity recognition.

*************** New Mar 11, 2020 ***************

BERT (from Google) was released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

We use word SentencePiece with basic BERT tokenization, and the same config as BERT base with lowercase = False.
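To illustrate what basic BERT tokenization with lowercase = False means in practice, here is a minimal sketch in plain Python. It is not the tokenizer shipped with the model: the names `is_punctuation` and `basic_tokenize` are hypothetical, and the function only splits on whitespace and punctuation while leaving case and Vietnamese diacritics untouched. The real pipeline additionally maps the resulting words to vocabulary entries.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    """Return True for ASCII symbols and Unicode punctuation characters."""
    cp = ord(ch)
    # Treat every non-alphanumeric ASCII symbol as punctuation.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def basic_tokenize(text: str) -> List[str]:
    """Split on whitespace and punctuation without lowercasing (cased)."""
    tokens = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if is_punctuation(ch):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)  # each punctuation mark is its own token
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(basic_tokenize("Tôi là sinh viên trường Bách Khoa Hà Nội."))
```

Because nothing is lowercased or stripped of accents, words such as "Bách" and "Hà" keep both their capital letter and their tone marks.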

You can download the trained model:

Use with huggingface/transformers

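import torch
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NlpHUST/vibert4news-base-cased")
bert_model = AutoModel.from_pretrained("NlpHUST/vibert4news-base-cased")

line = "Tôi là sinh viên trường Bách Khoa Hà Nội ."
# Encode the sentence, adding the special tokens the model expects
input_id = tokenizer.encode(line, add_special_tokens=True)
# Attention mask: 1 for real tokens, 0 for padding (assumes the padding id is 0)
att_mask = [int(token_id > 0) for token_id in input_id]
input_ids = torch.tensor([input_id])
att_masks = torch.tensor([att_mask])
# Inference only, so gradients are not needed
with torch.no_grad():
    features = bert_model(input_ids, attention_mask=att_masks)

print(features)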
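The attention mask in the example above marks every non-zero token id as a real token, which matters once several sentences of different lengths are padded into one batch. The sketch below shows that padding step in plain Python. `pad_batch` is a hypothetical helper, the token ids are made up for illustration, and it assumes the padding id is 0, as the `token_id > 0` test implies.

```python
from typing import List, Tuple


def pad_batch(
    batch: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id lists to a common length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Made-up token ids standing in for two encoded sentences.
input_ids, att_masks = pad_batch([[2, 15, 27, 3], [2, 8, 3]])
print(input_ids)
print(att_masks)
```

Both lists can be converted with `torch.tensor` and passed to the model in place of the single-sentence tensors.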

Run training with base config

python train_pytorch.py \
  --model_path=bert4news.pytorch \
  --max_len=200 \
  --batch_size=16 \
  --epochs=6 \
  --lr=2e-5
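The source of train_pytorch.py is not shown here, so the following is only a sketch of how a script could accept the flags used in the command above, assuming Python's standard `argparse`. The function name `parse_args` and the default values (copied from the command) are assumptions, not the script's actual code.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the training flags from the command above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Fine-tuning flags (sketch)")
    parser.add_argument("--model_path", type=str, default="bert4news.pytorch",
                        help="path to the pretrained model")
    parser.add_argument("--max_len", type=int, default=200,
                        help="maximum sequence length in tokens")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=6)
    parser.add_argument("--lr", type=float, default=2e-5,
                        help="learning rate")
    return parser.parse_args(argv)


# Parse a subset of the flags; the rest fall back to the assumed defaults.
args = parse_args(["--model_path=bert4news.pytorch", "--max_len=200", "--lr=2e-5"])
print(args.model_path, args.max_len, args.batch_size, args.epochs, args.lr)
```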

Contact information

For personal communication related to this project, please contact Nha Nguyen Van ([email protected]).
