Code: https://github.com/taishan1994/sentencepiece_chinese_bpe
Part1 Preface
Large language models are currently growing at an explosive pace, and models from the LLaMA family account for a large share of them. However, the original LLaMA does not handle Chinese well. This article explains how to expand its vocabulary so that Chinese text can be tokenized properly.
Part2 Data preprocessing
Preprocess the Dou Po Cang Qiong (《斗破苍穹》) corpus so that each line contains one or more sentences.
withopen("data/《斗破苍穹》.txt","r",encoding="utf-8")asfp:data=fp.read().strip().split("\n")sentences=[]fordindata:d=d.strip()if"==="indorlen(d)==0ord=="《斗破苍穹》来自:":continuesentences.append(d)withopen("data/corpus.txt","w",encoding="utf-8")asfp:fp.write("\n".join(sentences))
This produces corpus.txt.
Part3 sentencepiece
First, we need to build the Chinese vocabulary. The mainstream approach is to train it with sentencepiece, which is easy to install:

pip install sentencepiece

Then we prepare the training corpus; here it is the Dou Po Cang Qiong novel. The training code is as follows:
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/corpus.txt",
    model_prefix="tokenizer",
    vocab_size=50000,
    user_defined_symbols=["foo", "bar"],
    character_coverage=1.0,
    model_type="bpe",
)
Here is what each parameter does:
- input: path to the training corpus, one sentence (or a few sentences) per line.
- model_prefix: prefix of the output files; training produces <prefix>.model and <prefix>.vocab.
- vocab_size: size of the vocabulary to learn, 50,000 here.
- user_defined_symbols: tokens that are always added to the vocabulary as-is; foo and bar are included purely as a demonstration.
- character_coverage: the fraction of characters in the corpus that the model must cover; 1.0 keeps every character that appears in the corpus.
- model_type: the segmentation algorithm (unigram, bpe, char, or word); bpe is used here.
After running, you will get two files: tokenizer.model and tokenizer.vocab.
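If you want to sanity-check the trained model before wiring it into transformers, you can load it directly with sentencepiece. A minimal sketch (the test sentence is the same one used later in this post):

import sentencepiece as spm

# load the model produced by the training step above
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

print(sp.get_piece_size())                                 # 50000
print(sp.encode("白日依山尽,黄河入海流。", out_type=str))    # pieces as strings
print(sp.encode("白日依山尽,黄河入海流。", out_type=int))    # piece ids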
Let's take a look at what is inside tokenizer.vocab:
<unk>	0
<s>	0
</s>	0
foo	0
bar	0
萧炎	-0
..	-1
▁“	-2
也是	-3
便是	-4
了一	-5
。”	-6
Besides the special symbols and our user-defined foo and bar, the remaining pieces were learned by BPE training. What exactly the BPE algorithm is will not be covered here.
Part4 How to load the sentencepiece model with the transformers library
Here is the code:
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm
from tokenization import ChineseTokenizer

chinese_sp_model_file = "sentencepisece_tokenizer/tokenizer.model"

# load
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

## Save
output_dir = "./transformers_tokenizer/chinese/"
os.makedirs(output_dir, exist_ok=True)
with open(output_dir + "chinese.model", "wb") as f:
    f.write(chinese_spm.SerializeToString())
tokenizer = ChineseTokenizer(vocab_file=output_dir + "chinese.model")
tokenizer.save_pretrained(output_dir)
print(f"Chinese tokenizer has been saved to {output_dir}")

# Test
chinese_tokenizer = ChineseTokenizer.from_pretrained(output_dir)
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)
print(tokenizer.special_tokens_map)
text = """白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including"""
print("Test text:\n", text)
print(f"Tokenized by Chinese-LLaMA tokenizer: {chinese_tokenizer.tokenize(text)}")
The output:
Chinese tokenizer has been saved to ./transformers_tokenizer/chinese/
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
Test text:
 白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including
Tokenized by Chinese-LLaMA tokenizer: ['▁', '白日', '依', '山', '尽', ',', '黄', '河', '入', '海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一层', '楼', '。', '▁', 'T', 'h', 'e', '▁', 'p', 'r', 'i', 'm', 'a', 'r', 'y', '▁', 'u', 's', 'e', '▁', 'o', 'f', '▁', 'LL', 'a', 'MA', '▁i', 's', '▁', 'r', 'e', 's', 'e', 'a', 'r', 'ch', '▁', 'o', 'n', '▁', 'l', 'a', 'r', 'g', 'e', '▁', 'l', 'an', 'g', 'u', 'a', 'g', 'e', '▁', 'm', 'o', 'd', 'e', 'l', 's', ',', '▁i', 'n', 'c', 'lu', 'd', 'i', 'ng']
The ChineseTokenizer class used here follows the tokenizer implementation of the LLaMA model, with a few minor modifications:
# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tokenization classes for LLaMA."""
import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple

import sentencepiece as spm

from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
from transformers.utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}

# PRETRAINED_VOCAB_FILES_MAP = {
#     "vocab_file": {
#         "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model",
#     },
#     "tokenizer_file": {
#         "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json",
#     },
# }
# PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
#     "hf-internal-testing/llama-tokenizer": 2048,
# }


class ChineseTokenizer(PreTrainedTokenizer):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    # pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    # max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token=None,
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        add_bos_token=True,
        add_eos_token=False,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            sp_model_kwargs=self.sp_model_kwargs,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
        self.vocab_file = vocab_file
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)

    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None
        return state

    def __setstate__(self, d):
        self.__dict__ = d
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(self.vocab_file)

    @property
    def vocab_size(self):
        """Returns vocab size"""
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        """Returns vocab as a dict"""
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def _tokenize(self, text):
        """Returns a tokenized string."""
        return self.sp_model.encode(text, out_type=str)

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for i, token in enumerate(tokens):
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special and i != 0:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string

    def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = bos_token_id + token_ids_0 + eos_token_id

        if token_ids_1 is not None:
            output = output + bos_token_id + token_ids_1 + eos_token_id

        return output

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        bos_token_id = [1] if self.add_bos_token else []
        eos_token_id = [1] if self.add_eos_token else []

        if token_ids_1 is None:
            return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
        return (
            bos_token_id
            + ([0] * len(token_ids_0))
            + eos_token_id
            + bos_token_id
            + ([0] * len(token_ids_1))
            + eos_token_id
        )

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
        sequence pair mask has the following format:

        ```
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
        ```

        if token_ids_1 is None, only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (`List[int]`):
                List of ids.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)

        if token_ids_1 is not None:
            output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)

        return output
It is not hard to see that this class is mostly a thin wrapper around sentencepiece functions.
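To make that mapping explicit, the sketch below calls the same underlying sentencepiece functions directly (the model path follows the save step above):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("./transformers_tokenizer/chinese/chinese.model")

pieces = sp.encode("白日依山尽", out_type=str)   # what _tokenize() returns
ids = [sp.piece_to_id(p) for p in pieces]        # what _convert_token_to_id() does
back = [sp.IdToPiece(i) for i in ids]            # what _convert_id_to_token() does
print(pieces, ids, back)
print(sp.decode(pieces))                         # what convert_tokens_to_string() builds on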
Part5 How to merge the English and Chinese vocabularies?
Here is the code:
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm

llama_tokenizer_dir = "transformers_tokenizer/llama/tokenizer.model"
chinese_sp_model_file = "sentencepisece_tokenizer/tokenizer.model"

# load
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)

llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

# print number of tokens
print(len(llama_tokenizer), len(chinese_sp_model))
print(llama_tokenizer.all_special_tokens)
print(llama_tokenizer.all_special_ids)
print(llama_tokenizer.special_tokens_map)

## Add Chinese tokens to LLaMA tokenizer
llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces)
print(len(llama_spm_tokens_set))
print(f"Before: {len(llama_spm_tokens_set)}")
for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)
print(f"New model pieces: {len(llama_spm.pieces)}")

## Save
output_sp_dir = "transformers_tokenizer/llama_chinese"
output_hf_dir = "transformers_tokenizer/llama_chinese"  # the path to save Chinese-LLaMA tokenizer
os.makedirs(output_sp_dir, exist_ok=True)
with open(output_sp_dir + "/chinese_llama.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir + "/chinese_llama.model")
tokenizer.save_pretrained(output_hf_dir)
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")

# Test
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)
print(tokenizer.special_tokens_map)
text = """白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including"""
print("Test text:\n", text)
print(f"Tokenized by LLaMA tokenizer: {llama_tokenizer.tokenize(text)}")
print(f"Tokenized by Chinese-LLaMA tokenizer: {chinese_llama_tokenizer.tokenize(text)}")
The core part is this block:
for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)
That is, any piece that is not already in the original vocabulary gets appended to it.
Finally, the result:
32000 50000
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
32000
Before: 32000
New model pieces: 81163
Chinese-LLaMA tokenizer has been saved to transformers_tokenizer/llama_chinese
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}
Test text:
 白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including
Tokenized by LLaMA tokenizer: ['▁', '白', '日', '<0xE4>', '<0xBE>', '<0x9D>', '山', '<0xE5>', '<0xB0>', '<0xBD>', ',', '黄', '河', '入', '海', '流', '。', '<0xE6>', '<0xAC>', '<0xB2>', '<0xE7>', '<0xA9>', '<0xB7>', '千', '里', '目', ',', '更', '上', '一', '<0xE5>', '<0xB1>', '<0x82>', '<0xE6>', '<0xA5>', '<0xBC>', '。', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']
Tokenized by Chinese-LLaMA tokenizer: ['▁白', '日', '依', '山', '尽', ',', '黄', '河', '入', '海', '流', '。', '欲', '穷', '千里', '目', ',', '更', '上一层', '楼', '。', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']
As you can see, after adding our vocabulary the merged tokenizer segments Chinese into whole words instead of falling back to individual characters and bytes.
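A simple way to quantify the gain is to compare how many tokens each tokenizer produces for the same Chinese text; fewer tokens means a more compact encoding. A small sketch reusing the paths from the merge script above:

from transformers import LlamaTokenizer

llama_tokenizer = LlamaTokenizer.from_pretrained("transformers_tokenizer/llama/tokenizer.model")
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained("transformers_tokenizer/llama_chinese")

text = "白日依山尽,黄河入海流。欲穷千里目,更上一层楼。"
# the merged tokenizer should need noticeably fewer tokens for Chinese text
print("original LLaMA tokenizer:", len(llama_tokenizer.tokenize(text)), "tokens")
print("merged tokenizer:        ", len(chinese_llama_tokenizer.tokenize(text)), "tokens")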
Part6 How do we use the modified vocabulary?
If we are training a model from scratch, using it is straightforward:
from transformers import AutoConfig, LlamaTokenizer, LlamaForCausalLM

config = AutoConfig.from_pretrained(...)
tokenizer = LlamaTokenizer.from_pretrained(...)
model = LlamaForCausalLM.from_pretrained(..., config=config)
model_vocab_size = model.get_output_embeddings().weight.size(0)
model.resize_token_embeddings(len(tokenizer))
But if we want to keep the original model's pretrained embedding parameters, we can do the following:
# the _init_weights method that transformers uses to initialize new parameters
def _init_weights(self, module):
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
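In other words, resize_token_embeddings preserves the pretrained rows, and only the newly added rows are filled in by _init_weights. Below is a hedged sketch of that idea in which the new rows are instead initialized with the mean of the pretrained embeddings; the model path is a placeholder, and this is not the exact procedure used by the repository referenced next:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("transformers_tokenizer/llama_chinese")
model = LlamaForCausalLM.from_pretrained("path/to/original-llama")  # placeholder path

old_vocab_size = model.get_input_embeddings().weight.size(0)  # 32000 for the original LLaMA
model.resize_token_embeddings(len(tokenizer))                 # pretrained rows are kept as-is

with torch.no_grad():
    input_emb = model.get_input_embeddings().weight
    output_emb = model.get_output_embeddings().weight
    # initialize the new rows with the mean of the pretrained rows
    # instead of the random normal init from _init_weights
    input_emb[old_vocab_size:] = input_emb[:old_vocab_size].mean(dim=0, keepdim=True)
    output_emb[old_vocab_size:] = output_emb[:old_vocab_size].mean(dim=0, keepdim=True)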
For the details, you can refer to: https://github.com/yangjianxin1/LLMPruner
Part7 Summary
By this point we have learned how to preprocess a corpus, train a Chinese BPE vocabulary with sentencepiece, load it through the transformers library, merge it with the original LLaMA vocabulary, and use the expanded vocabulary with a model.

References:
https://github.com/ymcui/Chinese-LLaMA-Alpaca
https://github.com/yangjianxin1/LLMPruner
https://github.com/huggingface/transformers