
South Korea Unveils Massive AI Learning Data Sets

Source: AI Hub

South Korea's Ministry of Science and ICT and the National Information Society Agency have announced the sequential release of 310 artificial intelligence (AI) learning data sets via the "AI Hub" from today through July.

The Data Construction Project for Artificial Intelligence Learning, which built these data sets, is an essential national initiative to advance AI technology and promote intelligent services in various fields, from specialized areas to everyday life.

All Koreans interested in AI development can use the learning data through the AI Hub. Since 2020, the Ministry of Science and ICT and the National Information Society Agency have built about 200 data types yearly. In July 2022, the AI Hub surpassed 1 million annual visitors.

This year, 310 data types will be released as the initiative expands from six to 14 major fields, including manufacturing, robotics, education, finance, and sports.

With this year's release, users will have access to a total of 691 data types and approximately 2.6 billion records.

The first batch of open data will include approximately 70 data types related to natural language and AI vision, such as raw language-based query, search, and generation data; optical character recognition (OCR) data; indoor and outdoor crowd characteristics data; and three-dimensional (3D) object data on firefighter behavior.

Data published through the AI Hub will meet international quality standards, and personally identifiable information will be de-identified.

To ensure validity and accuracy, usage checks will be conducted by training AI models that companies and institutions use on the data.

In addition, for about three months after the data is opened, user feedback on data quality requirements and errors will be incorporated through a supplementary revision process to improve data quality.

The AI Policy Officer said, "The AI industry is evolving rapidly with the emergence of super-scale AI such as ChatGPT."

To help companies and researchers secure new data, the official added that the existing program, which has centered on labeled data, would be restructured to secure large-scale unlabeled data for training super-scale AI as well as multi-task labeled data that can be used to train on multiple types simultaneously.