Skip to main content

Eyes - 台灣輿情監測系統

· 5 min read
Sean Chang
Senior Data Scientist @ KKLab, KKBOX Group

Brand

Source Code: https://github.com/r05323028/eyes

近期心血來潮想說把在公司第一年建制的輿情監測系統重新構思,然後重寫成一個新的開源專案。

架構#

Architecture

訓練 NLP 模型、預測等任務,主要是利用 spaCy 來完成。會利用這個框架主要的原因是因為他擁有很完整的生態系,豐富的 pretrained models 、完整的文件、簡單明瞭的 ML lifecycle ,在效能上也不輸給其他主流的框架。

Flower

ETL pipeline 的部分,會使用 Argo Workflows + celery ,利用 Producer-Consumer 的架構,由 Argo Workflows 定期調配任務,並經由 celery worker 來執行。在 Kubernetes 上,可以輕鬆地針對 celery worker 做 auto scaling ,可以更有效率的利用計算資源。

Web

最後資料視覺化的部分使用 React + FastAPI 來建置網頁與 API 。

SpaCy#

SpaCy 在 v3 以後,提供了利用 config 檔輕鬆定義訓練流程的方法。一般來說 NLP 的 pipeline 會有這幾個流程

graph LR; A(Tokenize) --> B(Embedding); B --> C(POS Tagging); C --> D(Dependency Parsing); D --> E(NER); E --> F(Text Classification)

不同的 component 如果要自己管理的話非常的麻煩,而 spaCy 利用了以下的語法(以 Eyes 內用到的模型為例),讓你輕鬆的管理每個 component

[paths]train = nulldev = nullvectors = nullinit_tok2vec = null
[system]gpu_allocator = nullseed = 0
[nlp]lang = "zh"pipeline = ["tok2vec","tagger","parser","ner"]batch_size = 32disabled = []before_creation = nullafter_creation = nullafter_pipeline_creation = null
[nlp.tokenizer]@tokenizers = "spacy.zh.ChineseTokenizer"segmenter = "char"
[components]
[components.ner]factory = "ner"incorrect_spans_key = nullmoves = nullupdate_with_oracle_cut_size = 100
[components.ner.model]@architectures = "spacy.TransitionBasedParser.v2"state_type = "ner"extra_state_tokens = falsehidden_width = 64maxout_pieces = 2use_upper = truenO = null
[components.ner.model.tok2vec]@architectures = "spacy.Tok2Vec.v2"
[components.ner.model.tok2vec.embed]@architectures = "spacy.MultiHashEmbed.v2"width = 96attrs = ["ORTH","SHAPE"]rows = [5000,2500]include_static_vectors = false
[components.ner.model.tok2vec.encode]@architectures = "spacy.MaxoutWindowEncoder.v2"width = 96depth = 4window_size = 1maxout_pieces = 3
[components.parser]source = "zh_core_web_md"
[components.tagger]source = "zh_core_web_md"
[components.tok2vec]source = "zh_core_web_md"
[corpora]
[corpora.dev]@readers = "spacy.Corpus.v1"path = ${paths.dev}max_length = 0gold_preproc = falselimit = 0augmenter = null
[corpora.train]@readers = "spacy.Corpus.v1"path = ${paths.train}max_length = 0gold_preproc = falselimit = 0augmenter = null
[training]dev_corpus = "corpora.dev"train_corpus = "corpora.train"seed = ${system.seed}gpu_allocator = ${system.gpu_allocator}dropout = 0.1accumulate_gradient = 1patience = 1600max_epochs = 0max_steps = 20000eval_frequency = 200frozen_components = ["tagger","parser","tok2vec"]annotating_components = []before_to_disk = null
[training.batcher]@batchers = "spacy.batch_by_words.v1"discard_oversize = falsetolerance = 0.2get_length = null
[training.batcher.size]@schedules = "compounding.v1"start = 100stop = 1000compound = 1.001t = 0.0
[training.logger]@loggers = "spacy.ConsoleLogger.v1"progress_bar = false
[training.optimizer]@optimizers = "Adam.v1"beta1 = 0.9beta2 = 0.999L2_is_weight_decay = trueL2 = 0.01grad_clip = 1.0use_averages = falseeps = 0.00000001learn_rate = 0.001
[training.score_weights]tag_acc = 0.33dep_uas = 0.17dep_las = 0.17dep_las_per_type = nullsents_p = nullsents_r = nullsents_f = 0.0ents_f = 0.33ents_p = 0.0ents_r = 0.0ents_per_type = null
[pretraining]
[initialize]vectors = ${paths.vectors}init_tok2vec = ${paths.init_tok2vec}vocab_data = nulllookups = nullbefore_init = nullafter_init = null
[initialize.components]
[initialize.tokenizer]pkuseg_model = nullpkuseg_user_dict = "default"

可以輕鬆的定義,要使用哪個 pretrained model 、哪些 components 是要從原先的模型繼承、哪些 components 的參數要凍結、優化器的參數等等。在 Eyes 的模型中我們只訓練 NER 的模型,所以我們繼承了原先模型的 Embeddings 、 POS Tagging 、 Dependency Parsing ,並將其凍結。

然後可以使用簡單的指令來訓練模型

spacy train config/spacy/zh_core_eyes_md.cfg --paths.train [TRAIN_FILE] --paths.dev [DEV_FILE] --output [OUTPUT_DIR] --initialize.vectors zh_core_web_md

更多的介紹可以參照 https://spacy.io/usage/training

Argo Workflows#

Argo workflows 擁有簡單便利的圖形化介面,方便查看每個 Cronjob 執行的狀況、管理任務的調用等等。當初會選用這個的原因,是因為原本就打算將整個系統利用 Kubernetes 來部署,而 Argo workflows 又跟 Kubernetes 很好的整合在一起了。在語法方面,幾乎是只要會寫 Kubernetes 的 API ,稍微看一下文件馬上就能上手,以 PTT 的 Daily pipeline 為例:

apiVersion: argoproj.io/v1alpha1kind: CronWorkflowmetadata:  name: ptt-posts-pipelinespec:  schedule: "0 8 * * *"  timezone: "Asia/Taipei"  workflowSpec:    entrypoint: pipeline    templates:      - name: pipeline        dag:          tasks:            - name: crawl-top-board-posts              template: crawl-top-board-posts              arguments:                parameters:                  - name: n_days                    value: "{{ .Values.n_days }}"            - name: transform-ptt-posts              template: transform-ptt-posts-to-spacy              depends: "crawl-top-board-posts"              arguments:                parameters:                  - name: year                    value: "{{ `{{ workflow.creationTimestamp.Y }}` }}"                  - name: month                    value: "{{ `{{ workflow.creationTimestamp.m }}` }}"            - name: ptt-monthly-summary              depends: "transform-ptt-posts"              template: stats-ptt-summaries              arguments:                parameters:                  - name: year                    value: "{{ `{{ workflow.creationTimestamp.Y }}` }}"                  - name: month                    value: "{{ `{{ workflow.creationTimestamp.m }}` }}"            - name: entity-monthly-summary              depends: "transform-ptt-posts"              template: stats-entity-summaries              arguments:                parameters:                  - name: year                    value: "{{ `{{ workflow.creationTimestamp.Y }}` }}"                  - name: month                    value: "{{ `{{ workflow.creationTimestamp.m }}` }}"
      - name: crawl-top-board-posts        inputs:          parameters:            - name: n_days        container:          image: eyes/base          imagePullPolicy: Never          command: ["/bin/bash", "-c", "--"]          envFrom:            - configMapRef:                name: eyes-config-prod          args:            [              "poetry run eyes job dispatch --job_type CRAWL_PTT_TOP_BOARD_POSTS --n_days {{ `{{ inputs.parameters.n_days }}` }}",            ]
      - name: transform-ptt-posts-to-spacy        inputs:          parameters:            - name: year            - name: month        container:          image: eyes/base          imagePullPolicy: Never          command: ["/bin/bash", "-c", "--"]          envFrom:            - configMapRef:                name: eyes-config-prod          args:            [              "poetry run eyes job dispatch --job_type PTT_SPACY_PIPELINE --year {{ `{{ inputs.parameters.year }}` }} --month {{ `{{ inputs.parameters.month }}` }}",            ]
      - name: stats-ptt-summaries        inputs:          parameters:            - name: year            - name: month        container:          image: eyes/base          imagePullPolicy: Never          command: ["/bin/bash", "-c", "--"]          envFrom:            - configMapRef:                name: eyes-config-prod          args:            [              "poetry run eyes job dispatch --job_type PTT_MONTHLY_SUMMARY --year {{ `{{ inputs.parameters.year }}` }} --month {{ `{{ inputs.parameters.month }}` }}",            ]
      - name: stats-entity-summaries        inputs:          parameters:            - name: year            - name: month        container:          image: eyes/base          imagePullPolicy: Never          command: ["/bin/bash", "-c", "--"]          envFrom:            - configMapRef:                name: eyes-config-prod          args:            [              "poetry run eyes job dispatch --job_type ENTITY_MONTHLY_SUMMARY --year {{ `{{ inputs.parameters.year }}` }} --month {{ `{{ inputs.parameters.month }}` }}",            ]

他可以很輕鬆的利用 Docker image 來跑起 Container ,並把每個步驟組成 DAG ,部署 Pipeline 的方式也跟 Kubernetes 的其他元件一樣,非常便利。

Weak Supervision#

Weak supervision 是我們拿來減少人工標記成本的一種手段,透過類似 Rule-based 的方法來定義 Labeling Functions ,並利用這些函數來進行標記,最後產出的 Label 會透過一個 Aggregation model 來決定,細節的部分可以參考 skweak ,因為在 Eyes 裡面,我們透過爬蟲爬回來的 Entity 就有上萬個,不大可能利用人工標記的方法來產生訓練資料,所以採用了弱監督的方法。

graph LR; A(LF1) --> B(Aggregate); C(LF2) --> B; B --> D(Final Label); E(LF3) --> B;

結語#

第一次認真地開發 Side project ,發現自己走過一遍以後對於系統架構、各種常用的框架有了更深入的了解。但疲累程度也是工作時的數倍 XD