
Named Entity Recognition (NER) with kashgari + BERT

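I recently helped a senior labmate run comparison experiments on trigger-word identification, framed as NER with BIO tagging, and used a very handy framework called kashgari. It is written on top of tf.keras and is easy to use: a baseline for text classification or named entity recognition can be set up in a few minutes. For NER, the framework ships with BiLSTM, BiGRU, BiLSTM+CRF, BiGRU+CRF, and CNN_LSTM models, and it can load both pretrained word embeddings and BERT embeddings.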

Environment setup

# Python 3.6
pip install tensorflow==1.14
pip install kashgari-tf

Data preparation

NER labels use the BIO tagging scheme.
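If the annotations start out as entity (or trigger-word) spans rather than per-character tags, they first have to be converted to BIO. Below is a minimal sketch of that conversion; the function name `spans_to_bio` and the `(start, end, type)` span format with an exclusive end index are my own assumptions for illustration, not part of kashgari.

```python
from typing import List, Tuple


def spans_to_bio(chars: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, type) spans into one BIO label per character.

    `end` is exclusive. The span format is a hypothetical convention
    chosen for this example.
    """
    labels = ["O"] * len(chars)
    for start, end, etype in spans:
        if not 0 <= start < end <= len(chars):
            raise ValueError("span out of range: %r" % ((start, end, etype),))
        if any(label != "O" for label in labels[start:end]):
            raise ValueError("overlapping span: %r" % ((start, end, etype),))
        labels[start] = "B-" + etype          # first character of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + etype          # remaining characters
    return labels


chars = list("立法院成立刺激团体")
print(spans_to_bio(chars, [(0, 3, "ORG")]))
# ['B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O']
```

Everything outside a span gets `O`, so a sentence with no entities is simply a list of `O` labels.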

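train_x and train_y, as well as test_x and test_y, are all Python lists. Each element is one sentence: a list of characters in x and a list of labels of the same length in y.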

# train_x: [[char1, char2, char3, ...], ...]
# train_y: [[label1, label2, label3, ...], ...]
# e.g. character-level BIO data for BERT
train_x = [['立','法','院','成','立','刺','激','团','体'],...]
train_y = [['B-ORG','I-ORG','I-ORG','O','O','O','O','O','O'],...]
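BIO data is often stored in a CoNLL-style text file, one "character label" pair per line with a blank line between sentences. The post does not say how the data was stored, so the following is only a sketch under that assumption; `parse_bio_lines` and `read_bio_file` are hypothetical helper names.

```python
from typing import Iterable, List, Tuple


def parse_bio_lines(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Parse "char label" lines into parallel lists of sentences and label sequences.

    Assumes whitespace-separated columns and blank lines between sentences.
    """
    xs, ys = [], []
    chars, labels = [], []
    for line in lines:
        parts = line.split()
        if not parts:                      # blank line closes the current sentence
            if chars:
                xs.append(chars)
                ys.append(labels)
                chars, labels = [], []
            continue
        if len(parts) < 2:
            raise ValueError("expected 'char label', got: %r" % line)
        chars.append(parts[0])             # first column: the character
        labels.append(parts[-1])           # last column: the BIO label
    if chars:                              # file may not end with a blank line
        xs.append(chars)
        ys.append(labels)
    return xs, ys


def read_bio_file(path: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Read a CoNLL-style file from disk (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as f:
        return parse_bio_lines(f)
```

With a file laid out this way, `train_x, train_y = read_bio_file(path)` yields lists in the shape shown above, where `path` is wherever your training file lives.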

Importing the model and loading pretrained embeddings

Pretrained resources: BERT-Base, Chinese (the chinese_L-12_H-768_A-12 checkpoint) and Chinese-Word-Vectors.

import kashgari
from kashgari.tasks.labeling import BiLSTM_CRF_Model, BiGRU_CRF_Model
from kashgari.embeddings import BERTEmbedding, WordEmbedding

# Directory of the unpacked BERT-Base, Chinese checkpoint
BERT_PATH = './data/chinese_L-12_H-768_A-12'
bert_embedding = BERTEmbedding(BERT_PATH, task=kashgari.LABELING, sequence_length=100)
# To load pretrained word vectors instead:
# word_embedding = WordEmbedding(embed_path, task=kashgari.LABELING, sequence_length=100)
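The `sequence_length=100` setting caps how many positions each sentence gets, so it is worth checking how much of the data actually fits before training. Here is a small sketch of such a check. The helper `length_coverage` is hypothetical, and the default `reserved=2` assumes that two positions are taken by BERT's [CLS] and [SEP] tokens; adjust it if your setup differs.

```python
from typing import List


def length_coverage(samples: List[List[str]], sequence_length: int, reserved: int = 2) -> float:
    """Return the fraction of samples that fit within sequence_length.

    `reserved` is the number of positions assumed to be used by special
    tokens (an assumption, not something read from kashgari).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    budget = sequence_length - reserved
    fits = sum(1 for sample in samples if len(sample) <= budget)
    return fits / len(samples)


# Toy data: two sentences fit into 100 positions, one does not
samples = [["字"] * 10, ["字"] * 98, ["字"] * 120]
print(length_coverage(samples, sequence_length=100))
```

If the coverage is low, a larger `sequence_length` (at the cost of memory and speed) or splitting long sentences may be worth considering.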

Model training

# Pass in the embedding created above
model = BiLSTM_CRF_Model(bert_embedding)

model.fit(train_x, train_y, x_validate=val_x, y_validate=val_y,
          epochs=40, batch_size=128)
# Without a development set:
# model.fit(train_x, train_y, epochs=40, batch_size=128)

model.evaluate(test_x, test_y)
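NER results are usually compared at the entity level: a predicted entity counts as correct only if its type and both boundaries match the gold annotation. The sketch below shows one common way to compute entity-level precision, recall, and F1 from BIO label sequences. It is written from scratch for illustration and is not necessarily how `model.evaluate` computes its report; `extract_entities` and `entity_f1` are hypothetical names.

```python
from typing import List, Set, Tuple


def extract_entities(labels: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (type, start, end) spans from one BIO sequence, end exclusive.

    An I- tag that does not continue a matching B- tag is ignored.
    """
    entities = set()
    start, etype = None, None
    for i, label in enumerate(list(labels) + ["O"]):   # sentinel flushes the last span
        continues = label.startswith("I-") and etype == label[2:]
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return entities


def entity_f1(y_true: List[List[str]], y_pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged entity-level precision, recall, and F1 with exact span matching."""
    tp = n_true = n_pred = 0
    for gold, pred in zip(y_true, y_pred):
        gold_set, pred_set = extract_entities(gold), extract_entities(pred)
        tp += len(gold_set & pred_set)
        n_true += len(gold_set)
        n_pred += len(pred_set)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [['B-ORG', 'I-ORG', 'I-ORG', 'O', 'B-PER']]
pred = [['B-ORG', 'I-ORG', 'O', 'O', 'B-PER']]   # ORG boundary is wrong, PER is right
print(entity_f1(gold, pred))                      # (0.5, 0.5, 0.5)
```

Exact matching is strict: the truncated ORG span above earns no partial credit, which is why entity-level scores are typically lower than per-character accuracy.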

Results and code

In my experiments, BERT+BiLSTM+CRF gave the best results among the models compared. The full code will be uploaded to GitHub.