Eyes - A Public Opinion Monitoring System for Taiwan
Source Code: https://github.com/r05323028/eyes
Recently, on a whim, I decided to rethink the public opinion monitoring system I built during my first year at my company, and rewrite it as a new open-source project.
Architecture

Training the NLP models, running predictions, and similar tasks are handled mainly with spaCy. The main reason for choosing this framework is its mature ecosystem: a rich set of pretrained models, thorough documentation, a clean and simple ML lifecycle, and performance that holds its own against the other mainstream frameworks.
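For instance, loading the pretrained Chinese pipeline and extracting entities takes only a few lines (a minimal sketch; the sample sentence is made up for illustration):

import spacy

# zh_core_web_md ships with tok2vec, tagger, parser, and ner components
nlp = spacy.load("zh_core_web_md")
doc = nlp("台灣的半導體產業在全球佔有重要地位。")

for ent in doc.ents:
    print(ent.text, ent.label_)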
For the ETL pipeline, I use Argo Workflows + Celery in a Producer-Consumer architecture: Argo Workflows schedules tasks on a regular basis, and Celery workers execute them. On Kubernetes, the Celery workers can easily be auto-scaled, which makes far more efficient use of compute resources.
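Conceptually, the split looks something like this (a minimal sketch, assuming a Redis broker; the task name and module layout are illustrative, not Eyes' actual code):

from celery import Celery

app = Celery("eyes", broker="redis://redis:6379/0")

@app.task
def crawl_ptt_posts(n_days: int) -> None:
    # Consumer side: a Celery worker pops this task off the queue and runs the crawler.
    ...

# Producer side: an Argo Workflows step only needs to enqueue the task and exit;
# scaling the pool of workers is then purely a Kubernetes concern.
# crawl_ptt_posts.delay(n_days=1)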
Finally, for data visualization, the web frontend and the API are built with React + FastAPI.
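On the API side, exposing the pipeline's monthly summaries can be as small as this (a minimal sketch; the route and response shape are assumptions, not the project's actual schema):

from fastapi import FastAPI

app = FastAPI(title="Eyes API")

@app.get("/stats/entities/{year}/{month}")
def entity_monthly_summary(year: int, month: int) -> dict:
    # In the real project this would query the summaries produced by the ETL pipeline.
    return {"year": year, "month": month, "entities": []}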
spaCy

Since v3, spaCy lets you define the whole training workflow with a simple config file. A typical NLP pipeline consists of several stages; in this project they are tok2vec, tagger, parser, and ner.
Managing each of these components by hand would be a real hassle; spaCy instead lets you manage every component with the syntax below (using the model from Eyes as the example):
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "zh"
pipeline = ["tok2vec","tagger","parser","ner"]
batch_size = 32
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["ORTH","SHAPE"]
rows = [5000,2500]
include_static_vectors = false

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.parser]
source = "zh_core_web_md"

[components.tagger]
source = "zh_core_web_md"

[components.tok2vec]
source = "zh_core_web_md"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = ["tagger","parser","tok2vec"]
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.33
dep_uas = 0.17
dep_las = 0.17
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0
ents_f = 0.33
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
pkuseg_model = null
pkuseg_user_dict = "default"
This makes it easy to declare which pretrained model to use, which components to inherit from an existing model, which components to freeze, the optimizer's hyperparameters, and so on. The model in Eyes only trains the NER component, so we inherit the embeddings, POS tagging, and dependency parsing components from the original model and freeze them (see frozen_components above).
The model can then be trained with a single command:
spacy train config/spacy/zh_core_eyes_md.cfg --paths.train [TRAIN_FILE] --paths.dev [DEV_FILE] --output [OUTPUT_DIR] --initialize.vectors zh_core_web_md
For a fuller introduction, see https://spacy.io/usage/training
Argo Workflows

Argo Workflows has a simple, convenient graphical UI for checking how each cron job ran, managing task invocations, and so on. I picked it because I had planned from the start to deploy the whole system on Kubernetes, and Argo Workflows integrates with Kubernetes very well. As for the syntax, anyone who can write Kubernetes API manifests can pick it up after a quick look at the docs. Take the PTT daily pipeline as an example:
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: ptt-posts-pipeline
spec:
  schedule: "0 8 * * *"
  timezone: "Asia/Taipei"
  workflowSpec:
    entrypoint: pipeline
    templates:
      - name: pipeline
        dag:
          tasks:
            - name: crawl-top-board-posts
              template: crawl-top-board-posts
              arguments:
                parameters:
                  - name: n_days
                    value: "{{ .Values.n_days }}"
            - name: transform-ptt-posts
              template: transform-ptt-posts-to-spacy
              depends: "crawl-top-board-posts"
              arguments:
                parameters:
                  - name: year
                    value: "{{ `{{ workflow.creationTimestamp.Y }}` }}"
                  - name: month
                    value: "{{ `{{ workflow.creationTimestamp.m }}` }}"
            - name: ptt-monthly-summary
              depends: "transform-ptt-posts"
              template: stats-ptt-summaries
              arguments:
                parameters:
                  - name: year
                    value: "{{ `{{ workflow.creationTimestamp.Y }}` }}"
                  - name: month
                    value: "{{ `{{ workflow.creationTimestamp.m }}` }}"
            - name: entity-monthly-summary
              depends: "transform-ptt-posts"
              template: stats-entity-summaries
              arguments:
                parameters:
                  - name: year
                    value: "{{ `{{ workflow.creationTimestamp.Y }}` }}"
                  - name: month
                    value: "{{ `{{ workflow.creationTimestamp.m }}` }}"

      - name: crawl-top-board-posts
        inputs:
          parameters:
            - name: n_days
        container:
          image: eyes/base
          imagePullPolicy: Never
          command: ["/bin/bash", "-c", "--"]
          envFrom:
            - configMapRef:
                name: eyes-config-prod
          args:
            [
              "poetry run eyes job dispatch --job_type CRAWL_PTT_TOP_BOARD_POSTS --n_days {{ `{{ inputs.parameters.n_days }}` }}",
            ]

      - name: transform-ptt-posts-to-spacy
        inputs:
          parameters:
            - name: year
            - name: month
        container:
          image: eyes/base
          imagePullPolicy: Never
          command: ["/bin/bash", "-c", "--"]
          envFrom:
            - configMapRef:
                name: eyes-config-prod
          args:
            [
              "poetry run eyes job dispatch --job_type PTT_SPACY_PIPELINE --year {{ `{{ inputs.parameters.year }}` }} --month {{ `{{ inputs.parameters.month }}` }}",
            ]

      - name: stats-ptt-summaries
        inputs:
          parameters:
            - name: year
            - name: month
        container:
          image: eyes/base
          imagePullPolicy: Never
          command: ["/bin/bash", "-c", "--"]
          envFrom:
            - configMapRef:
                name: eyes-config-prod
          args:
            [
              "poetry run eyes job dispatch --job_type PTT_MONTHLY_SUMMARY --year {{ `{{ inputs.parameters.year }}` }} --month {{ `{{ inputs.parameters.month }}` }}",
            ]

      - name: stats-entity-summaries
        inputs:
          parameters:
            - name: year
            - name: month
        container:
          image: eyes/base
          imagePullPolicy: Never
          command: ["/bin/bash", "-c", "--"]
          envFrom:
            - configMapRef:
                name: eyes-config-prod
          args:
            [
              "poetry run eyes job dispatch --job_type ENTITY_MONTHLY_SUMMARY --year {{ `{{ inputs.parameters.year }}` }} --month {{ `{{ inputs.parameters.month }}` }}",
            ]
It makes it trivial to spin up containers from Docker images and compose the individual steps into a DAG, and deploying a pipeline works just like any other Kubernetes resource, which is very convenient.
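Since a CronWorkflow is just another Kubernetes custom resource, a rendered manifest goes out the same way as everything else (the file name here is assumed; the manifest above is templated for Helm, so in the repo it would be rendered by Helm first):

kubectl apply -f ptt-posts-pipeline.yaml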
Weak Supervision

Weak supervision is the technique we use to cut down the cost of manual annotation. Rule-based Labeling Functions are defined to produce candidate labels, and the final label is decided by an aggregation model; see skweak for the details. In Eyes, the crawlers bring back well over ten thousand entities, so producing training data by manual annotation alone is simply not feasible, which is why we adopted a weakly supervised approach.
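To make the idea concrete, here is a toy sketch of the mechanism (deliberately simplified, and not skweak's actual API, which aggregates with a generative model rather than a majority vote): several rule-based labeling functions each vote on a span, and an aggregation step picks the final label.

from collections import Counter
from typing import Optional

# Three toy labeling functions: each returns a label for a span, or None to abstain.
def lf_party_suffix(span: str) -> Optional[str]:
    return "ORG" if span.endswith("黨") else None

def lf_known_orgs(span: str) -> Optional[str]:
    return "ORG" if span in {"立法院", "行政院"} else None

def lf_person_title(span: str) -> Optional[str]:
    return "PERSON" if span.endswith("委員") else None

LABELING_FUNCTIONS = [lf_party_suffix, lf_known_orgs, lf_person_title]

def aggregate(span: str) -> Optional[str]:
    # Majority vote over the non-abstaining labeling functions.
    votes = [label for lf in LABELING_FUNCTIONS if (label := lf(span)) is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

print(aggregate("民進黨"))  # -> ORG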
Closing Thoughts

This was my first time seriously building a side project, and going through the whole process gave me a much deeper understanding of system architecture and the frameworks I use all the time. That said, it was also several times more tiring than my day job XD