
How to make an English large language model support Chinese? (Part 1) Building your own tokenization

Code: https://github.com/taishan1994/sentencepiece_chinese_bpe

Part 1: Preface

Large language models are currently growing at an explosive pace, and models based on the LLaMA family account for a large share of them. However, the original LLaMA model does not handle Chinese very well. This article explains how to extend the vocab with new tokens so that Chinese text can be tokenized properly.



Part 2: Data Preprocessing

Preprocess the 《斗破苍穹》 (Battle Through the Heavens) corpus so that each line contains one or more sentences.

withopen("data/《斗破苍穹》.txt","r",encoding="utf-8")asfp:data=fp.read().strip().split("\n")sentences=[]fordindata:d=d.strip()if"==="indorlen(d)==0ord=="《斗破苍穹》来自:":continuesentences.append(d)withopen("data/corpus.txt","w",encoding="utf-8")asfp:fp.write("\n".join(sentences))

This produces corpus.txt.

Part 3: sentencepiece

First, we need to build a Chinese vocabulary. The mainstream approach today is to train one with sentencepiece. Installation is simple: pip install sentencepiece. Then we prepare the corpus; here we use the novel 《斗破苍穹》.

Here is the code:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/corpus.txt",
    model_prefix="tokenizer",
    vocab_size=50000,
    user_defined_symbols=["foo", "bar"],
    character_coverage=1.0,
    model_type="bpe",
)

Here is what each parameter does:

input: path(s) to the input text file(s); multiple input files can be passed. Each line may contain one or more sentences.
model_prefix: prefix for the saved model files (here tokenizer, which yields tokenizer.model and tokenizer.vocab).
vocab_size: the size of the vocabulary to build.
user_defined_symbols: user-defined symbols. Each of them is treated as a single token and will never be split into subwords. The point of this parameter is to add user-defined special symbols to the generated vocabulary as whole units so that downstream models can use them; here we only add foo and bar as a simple test.
model_type: the model type; the options are unigram, bpe, char and word.
character_coverage: the fraction of characters the model must cover, which can be understood as limiting the size of the character set; 1.0 covers every character in the corpus.
unk_id: the integer ID assigned to the unknown token in the vocabulary. The default is 0.
bos_id: the integer ID assigned to the beginning-of-sentence token. The default is 1.
eos_id: the integer ID assigned to the end-of-sentence token. The default is 2.
pad_id: the integer ID assigned to the padding token. The default is -1, meaning no padding token is used.

Running this produces two files, tokenizer.model and tokenizer.vocab.

Let's see what tokenizer.vocab contains:

<unk>	0
<s>	0
</s>	0
foo	0
bar	0
萧炎	-0
..	-1
▁“	-2
也是	-3
便是	-4
了一	-5
。”	-6

Apart from the special symbols and our user-defined foo and bar, the remaining pieces were learned by BPE training; we will not go into the details of the BPE algorithm here.
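As a quick sanity check, the trained model can be loaded directly with sentencepiece and used to segment text. This is a minimal sketch, assuming the tokenizer.model produced above is in the working directory; the sample sentences are only illustrative:

import sentencepiece as spm

# Load the freshly trained BPE model.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Segment a sentence from the corpus domain into pieces and into ids.
print(sp.encode("萧炎微微一笑。", out_type=str))
print(sp.encode("萧炎微微一笑。", out_type=int))

# The user-defined symbols come out as whole pieces instead of being split.
print(sp.encode("foo bar", out_type=str))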

Part 4: How to load the sentencepiece model with the transformers library

Here is the code:

import os

os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm
from tokenization import ChineseTokenizer

chinese_sp_model_file = "sentencepisece_tokenizer/tokenizer.model"

# load
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)

chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

## Save
output_dir = "./transformers_tokenizer/chinese/"
os.makedirs(output_dir, exist_ok=True)
with open(output_dir + "chinese.model", "wb") as f:
    f.write(chinese_spm.SerializeToString())
tokenizer = ChineseTokenizer(vocab_file=output_dir + "chinese.model")
tokenizer.save_pretrained(output_dir)
print(f"Chinese tokenizer has been saved to {output_dir}")

# Test
chinese_tokenizer = ChineseTokenizer.from_pretrained(output_dir)
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)
print(tokenizer.special_tokens_map)
text = """白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including"""
print("Test text:\n", text)
print(f"Tokenized by Chinese-LLaMA tokenizer: {chinese_tokenizer.tokenize(text)}")

The output:

Chinese tokenizer has been saved to ./transformers_tokenizer/chinese/
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
Test text:
 白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including
Tokenized by Chinese-LLaMA tokenizer: ['▁', '白日', '依', '山', '尽', ',', '黄', '河', '入', '海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一层', '楼', '。', '▁', 'T', 'h', 'e', '▁', 'p', 'r', 'i', 'm', 'a', 'r', 'y', '▁', 'u', 's', 'e', '▁', 'o', 'f', '▁', 'LL', 'a', 'MA', '▁i', 's', '▁', 'r', 'e', 's', 'e', 'a', 'r', 'ch', '▁', 'o', 'n', '▁', 'l', 'a', 'r', 'g', 'e', '▁', 'l', 'an', 'g', 'u', 'a', 'g', 'e', '▁', 'm', 'o', 'd', 'e', 'l', 's', ',', '▁i', 'n', 'c', 'lu', 'd', 'i', 'ng']

The ChineseTokenizer used here follows the tokenizer implementation from the LLaMA model, with some minor modifications:

# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tokenization classes for LLaMA."""
import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple

import sentencepiece as spm

from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
from transformers.utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}

# PRETRAINED_VOCAB_FILES_MAP = {
#     "vocab_file": {
#         "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model",
#     },
#     "tokenizer_file": {
#         "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json",
#     },
# }
# PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
#     "hf-internal-testing/llama-tokenizer": 2048,
# }


class ChineseTokenizer(PreTrainedTokenizer):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    # pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    # max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token=None,
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            sp_model_kwargs=self.sp_model_kwargs,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None
        return state

    def __setstate__(self, d):
        self.__dict__ = d
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(self.vocab_file)

    @property
    def vocab_size(self):
        """Returns vocab size"""
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        """Returns vocab as a dict"""
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def _tokenize(self, text):
        """Returns a tokenized string."""
        return self.sp_model.encode(text, out_type=str)

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for i, token in enumerate(tokens):
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special and i != 0:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string

    def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = bos_token_id + token_ids_0 + eos_token_id

        if token_ids_1 is not None:
            output = output + bos_token_id + token_ids_1 + eos_token_id

        return output

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        bos_token_id = [1] if self.add_bos_token else []
        eos_token_id = [1] if self.add_eos_token else []

        if token_ids_1 is None:
            return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
        return (
            bos_token_id
            + ([0] * len(token_ids_0))
            + eos_token_id
            + bos_token_id
            + ([0] * len(token_ids_1))
            + eos_token_id
        )

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
        sequence pair mask has the following format:

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

        if token_ids_1 is None, only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (`List[int]`):
                List of ids.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)

        if token_ids_1 is not None:
            output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)

        return output

It is easy to see that it simply wraps a number of sentencepiece functions.
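To verify that the wrapper behaves like a normal Hugging Face tokenizer, here is a minimal round-trip sketch, assuming the tokenizer was saved to ./transformers_tokenizer/chinese/ as in the script above:

from tokenization import ChineseTokenizer

# Load the saved tokenizer the same way as any Hugging Face tokenizer.
tokenizer = ChineseTokenizer.from_pretrained("./transformers_tokenizer/chinese/")

# Encoding prepends the BOS id because add_bos_token defaults to True.
encoded = tokenizer("白日依山尽,黄河入海流。")
print(encoded["input_ids"])

# Decoding goes back through the underlying sentencepiece model.
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))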

Part 5: How to merge the English and Chinese vocabularies?

Here is the code:

import os

os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm

llama_tokenizer_dir = "transformers_tokenizer/llama/tokenizer.model"
chinese_sp_model_file = "sentencepisece_tokenizer/tokenizer.model"

# load
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)

llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

# print number of tokens
print(len(llama_tokenizer), len(chinese_sp_model))
print(llama_tokenizer.all_special_tokens)
print(llama_tokenizer.all_special_ids)
print(llama_tokenizer.special_tokens_map)

## Add Chinese tokens to LLaMA tokenizer
llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces)
print(len(llama_spm_tokens_set))
print(f"Before:{len(llama_spm_tokens_set)}")
for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)
print(f"New model pieces: {len(llama_spm.pieces)}")

## Save
output_sp_dir = "transformers_tokenizer/llama_chinese"
output_hf_dir = "transformers_tokenizer/llama_chinese"  # the path to save Chinese-LLaMA tokenizer
os.makedirs(output_sp_dir, exist_ok=True)
with open(output_sp_dir + "/chinese_llama.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir + "/chinese_llama.model")
tokenizer.save_pretrained(output_hf_dir)
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")

# Test
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)
print(tokenizer.special_tokens_map)
text = """白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including"""
print("Test text:\n", text)
print(f"Tokenized by LLaMA tokenizer: {llama_tokenizer.tokenize(text)}")
print(f"Tokenized by Chinese-LLaMA tokenizer: {chinese_llama_tokenizer.tokenize(text)}")

The core part is this block:

for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)

That is, every piece that is not already in the original vocabulary gets appended to it.

Finally, the result:

32000 50000
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
32000
Before:32000
New model pieces: 81163
Chinese-LLaMA tokenizer has been saved to transformers_tokenizer/llama_chinese
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
Test text:
 白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including
Tokenized by LLaMA tokenizer: ['▁', '白', '日', '<0xE4>', '<0xBE>', '<0x9D>', '山', '<0xE5>', '<0xB0>', '<0xBD>', ',', '黄', '河', '入', '海', '流', '。', '<0xE6>', '<0xAC>', '<0xB2>', '<0xE7>', '<0xA9>', '<0xB7>', '千', '里', '目', ',', '更', '上', '一', '<0xE5>', '<0xB1>', '<0x82>', '<0xE6>', '<0xA5>', '<0xBC>', '。', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']
Tokenized by Chinese-LLaMA tokenizer: ['▁白', '日', '依', '山', '尽', ',', '黄', '河', '入', '海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一层', '楼', '。', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']

After adding our vocabulary, the merged tokenizer can indeed segment Chinese properly.
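One way to quantify the gain is to compare how many tokens the two tokenizers need for the same Chinese text; a small sketch, assuming llama_tokenizer and chinese_llama_tokenizer are loaded as in the merge script above:

text = "白日依山尽,黄河入海流。欲穷千里目,更上一层楼。"

# Fewer tokens for the same text means shorter sequences and therefore a
# longer effective context window for Chinese input.
print(len(llama_tokenizer.tokenize(text)))          # mostly byte-level fallback pieces
print(len(chinese_llama_tokenizer.tokenize(text)))  # mostly whole Chinese characters/words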

Part 6: How to use the modified vocabulary?

If we retrain the model from scratch, using it is straightforward:

config = AutoConfig.from_pretrained(...)
tokenizer = LlamaTokenizer.from_pretrained(...)
model = LlamaForCausalLM.from_pretrained(..., config=config)
model_vocab_size = model.get_output_embeddings().weight.size(0)
model.resize_token_embeddings(len(tokenizer))

But if we want to keep the original model's embedding parameters, we can do the following:

1. Find the mapping between the IDs in the new vocabulary and the IDs in the old vocabulary.
2. For tokens of the new vocabulary that already exist in the old one, reuse the original model's embeddings.
3. For tokens that do not appear in the old vocabulary, initialize their embeddings first and then assign them (a sketch of the whole procedure follows the snippet below). For example, this is how LLaMA in the transformers library initializes its weights:
def _init_weights(self, module):
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
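Putting the three steps together, here is a minimal sketch with hypothetical variable names: it assumes old_tokenizer, new_tokenizer and the original LlamaForCausalLM (as model) are already loaded, and the output embeddings / lm_head would need the same treatment:

# 1. Build the token -> id mappings for the old and new vocabularies.
old_vocab = old_tokenizer.get_vocab()
new_vocab = new_tokenizer.get_vocab()

old_embeddings = model.get_input_embeddings().weight.data
std = model.config.initializer_range

# 2./3. Copy rows for tokens that already existed in the old vocabulary and
#       randomly initialize (as _init_weights does) rows for brand-new tokens.
new_embeddings = old_embeddings.new_empty((len(new_vocab), old_embeddings.size(1)))
new_embeddings.normal_(mean=0.0, std=std)
for token, new_id in new_vocab.items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_embeddings[new_id] = old_embeddings[old_id]

# Resize the model to the new vocabulary and write the rebuilt matrix back.
model.resize_token_embeddings(len(new_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_embeddings)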

For a concrete implementation you can refer to: https://github.com/yangjianxin1/LLMPruner

Part 7: Summary

By this point, we have learned how to:

1. Train a Chinese vocabulary with sentencepiece.
2. Load the sentencepiece model with transformers.
3. Merge the Chinese and English vocabularies and use the merged vocabulary with transformers.
4. Use the new vocabulary in the model.
Part 8: References

https://github.com/ymcui/Chinese-LLaMA-Alpaca

https://github.com/yangjianxin1/LLMPruner

https://github.com/huggingface/transformers

