Data Curation
This post covers Korean data curation. Data curation includes not only building and generating data but every activity that raises the practical value of data. All of the data discussed in this article can be downloaded by people outside Korea. For more detailed descriptions of the data, see https://github.com/ko-nlp/Open-korean-corpora and https://corpus.korean.go.kr/main/requestMain.do. For how foreign nationals can apply for access, see the following document.

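1. How Korean corpus construction has changed
In 2019, when the first round of Korean data curation was carried out through Open-korean-corpora, most of the available data consisted of syntactic parsing data, similar-sentence data, and parallel corpora.
<Figure 1> General uses of the data and the institutions providing it
As the following image shows, this is because the data built at that time was mainly intended for extracting the features of morphemes and sentences and for processing the information that was needed.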

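<Figure 2> Text data analysis methods and the subdivision of natural language processing tasks
From 2020 to the present (2023), data on hate speech has increased, along with data on a variety of other topics (grouped under "other topics"). Overall, data related to semantic classification is widely used in both research and industry.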
Building Korean Corpora (1)
An introduction to Korean corpora
宋永淑 (ソン・ヨンソク) / ML Researcher
data
large language model
corpus
The role of table data
• Gus Eggert (2023), who released a table dataset 627 MB in size, describes tables as playing a role like the "sensory organs" of reasoning.
History of table data
(1) Early datasets were mostly built for specific domains. Examples include the Rotowire dataset (Wiseman et al, 2017) for basketball, KBGen (Banik et al, 2013) for biology, the Wikibio dataset (Lebret et al, 2016), which covers Wikipedia biographies, and E2E (Novikova et al, 2016, 2017) for restaurant reservations and similar tasks.
(2) Work on generating text from tables includes Puduppully,R.(2018), Ankur Parikh et al(2020), and Jonathan et al(2020). Of these, this article describes ToTTo: A Controlled Table-To-Text Generation Dataset.

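The process of creating table-based text generation data in ToTTo
• (1) The title, subtitle, and table information are extracted from sources in a variety of formats, and the key table information is then highlighted in yellow-brown.
• (2) From the sentence collected together with the table (Original text in the accompanying image), anything unrelated to the table's contents is deleted (text after deletion). The final sentence is then written, which raises the accuracy of text generation.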
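The two annotation steps above can be sketched in code. This is a minimal illustration, not ToTTo's actual schema or tooling: the class `TableToTextExample`, its field names, the helper `linearize_highlights`, and all of the sample data are hypothetical and were made up for this example. It only shows how highlighted cells and the successive sentence revisions might be stored together and turned into model input.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TableToTextExample:
    """Hypothetical container for one table-to-text annotation (illustrative field names)."""
    page_title: str
    section_title: str
    table: List[List[str]]                    # rows of cell text; row 0 is the header
    highlighted_cells: List[Tuple[int, int]]  # (row, column) indices of highlighted cells
    original_text: str                        # sentence collected together with the table
    text_after_deletion: str                  # sentence with table-unrelated parts removed
    final_text: str                           # final rewritten sentence


def linearize_highlights(example: TableToTextExample) -> str:
    """Flatten the titles and highlighted cells into a 'header: value' string."""
    header = example.table[0]
    parts = [f"title: {example.page_title}", f"section: {example.section_title}"]
    for row, col in example.highlighted_cells:
        parts.append(f"{header[col]}: {example.table[row][col]}")
    return " | ".join(parts)


# Made-up data, used only to show the shape of the annotation steps.
example = TableToTextExample(
    page_title="Example City FC",
    section_title="Season results",
    table=[
        ["Season", "League", "Position"],
        ["2018", "Second Division", "3rd"],
        ["2019", "First Division", "1st"],
    ],
    highlighted_cells=[(2, 0), (2, 2)],
    original_text="After a long rebuild, the club finished 1st in 2019, which delighted its fans.",
    text_after_deletion="the club finished 1st in 2019",
    final_text="Example City FC finished 1st in the 2019 season.",
)

source = linearize_highlights(example)
print(source)              # model input built from the highlighted cells
print(example.final_text)  # target sentence for the generator
```

Pairing only the highlighted cells with the final sentence keeps the target text grounded in the table, which is the point of the deletion step described above.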

Text generation based on table data
An introduction to data analysis methodology for tables
宋永淑 (ソン・ヨンソク) / ML Researcher
Table
Generation



