Using statistical machine learning, we mine enormous archives of data for “information of interest” and retrieve knowledge that people have overlooked.
Mining for Value Concealed in Data
People have long recorded their thoughts and acquired knowledge in documents, and libraries have been the place to collect, archive, and read them. The University of Tokyo Library is no exception: research papers and large amounts of reference material, valuable to academic research and education, accumulate every day. These documents are open to university members and the public for viewing, and are also being converted into digital format. Meanwhile, exponentially growing amounts of machine-generated data appear daily on websites, forming a new repository of information. In 2009, the Academic Information Science Research Division began work on an integrated navigation interface and tools to make practical use of this ever-expanding data.
How do we find what we need in a huge, continually growing pile of data, and how do we simplify the extraction of knowledge embedded in data of various formats? We apply statistical machine learning techniques to deliver a more human-friendly way to search for information, and combine them with data mining methods to make information more accessible.
The value of data analysis using machine learning methods has increased with the rise of massive volumes of data. Research on technologies that efficiently process large amounts of data in parallel and build petascale databases has therefore become as vital as theoretical research on how to create statistical models for data mining applications.
Search Features Friendly to the Human Thought Process
The “Library Information Navigator” integrates Wikipedia classification categories with traditional library classification categories. The system uses information on the web to present related keywords, so that users become aware of related works, and guides them to references and journals archived in the library. The system has been deployed not only on the University of Tokyo campus but has also been adopted by libraries at other universities and by the Japanese National Diet Library.
“Deep” Data Analysis to Meet Human Needs
Search engine results do not always prioritize the information we are looking for. Our research aims to supply that “something not quite there” by performing a deeper analysis of the web-based information being searched. When research is published by different scientists with similar names, the “Nayose” (meaning aggregation by names) system sorts the data belonging to each scientist according to context. Similarly, “GENSEN web” (meaning carefully selected) displays the terminology used in journals of a specific area of study, in order of importance within that area of research.
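To make the idea of sorting same-name records by context more concrete, here is a minimal sketch. It is not the actual “Nayose” implementation, which is not described here. It assumes each record has already been reduced to a set of context words, and it uses a simple greedy grouping by Jaccard similarity; the function names, the threshold, and the sample records are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of context words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_by_context(records, threshold=0.2):
    """Greedily group records that share one author name.

    records: list of (title, context_words) pairs.
    A record joins the most similar existing group if the similarity
    reaches the (assumed) threshold; otherwise it starts a new group.
    """
    groups = []  # each group: {"titles": [...], "words": set()}
    for title, words in records:
        words = set(words)
        best, best_score = None, threshold
        for group in groups:
            score = jaccard(words, group["words"])
            if score >= best_score:
                best, best_score = group, score
        if best is None:
            groups.append({"titles": [title], "words": set(words)})
        else:
            best["titles"].append(title)
            best["words"] |= words  # pool the vocabulary of the group
    return [group["titles"] for group in groups]


# Hypothetical records that all carry the same author name.
sample = [
    ("Kernel methods for text classification",
     {"kernel", "text", "classification", "learning"}),
    ("Protein folding dynamics", {"protein", "folding", "simulation"}),
    ("Learning to rank text documents", {"learning", "text", "ranking"}),
    ("Simulation of protein binding", {"protein", "simulation", "binding"}),
]
for titles in group_by_context(sample):
    print(titles)
```

With these sample records, the two text-mining papers end up in one group and the two protein papers in another. A real system would need richer context (co-authors, affiliations, venues) and a more robust clustering method than this greedy pass.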
Data-Driven Intelligence
We are researching ways to apply statistical machine learning methods to extract topics from massive volumes of data and classify them automatically. The value here is not only in the automation that relieves people of the task, but also in finding new correlations among data attributes that people may have overlooked.
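As one illustration of automatic topic classification with a statistical model, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing. This is a standard textbook technique chosen for brevity, not necessarily the method used by the division; the training texts and topic labels are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per topic. labeled_docs is a list of (topic, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for topic, text in labeled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the topic with the highest log-posterior for the given text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        # Log prior: fraction of training documents with this topic.
        score = math.log(doc_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[topic][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Invented training data for illustration only.
model = train([
    ("library", "catalog search journal archive"),
    ("library", "journal reference catalog"),
    ("medicine", "hospital infection patient record"),
    ("medicine", "patient epidemic infection"),
])
print(classify("infection record of a patient", model))
print(classify("journal catalog", model))
```

The per-topic word counts that the model learns are also a simple place to look for word and topic associations that a person reading the documents one by one might miss.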
Data Mining with Privacy in Consideration
We are looking into methods for integrated data mining of customer data across multiple corporations, without any company disclosing its customer information to any other member. As research in this field progresses, applications will become possible such as linking hospital records to trace how an infection spread during an epidemic, while still protecting the privacy of personal medication information.
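One well-known building block for this kind of joint analysis is a secure sum based on additive secret sharing, sketched below. It is offered only as an illustration and is not necessarily the protocol studied by the division. The sketch assumes that every party follows the protocol honestly, that parties do not collude, and that the true total is smaller than a public modulus; the function names and the sample counts are hypothetical, and the whole exchange is simulated in a single process.

```python
import secrets

# Public modulus, assumed to be larger than any possible total.
MODULUS = 2**61 - 1


def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate a secure sum among len(private_values) parties."""
    n = len(private_values)
    # Row i holds the shares created by party i; share j is sent to party j.
    distributed = [make_shares(value, n) for value in private_values]
    # Party j adds the shares it received and publishes only that subtotal.
    subtotals = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # The published subtotals reveal the total but not any single input.
    return sum(subtotals) % MODULUS


# Hypothetical per-organization counts of some event.
counts = [120, 45, 300]
print(secure_sum(counts))  # matches sum(counts) as long as it is below MODULUS
```

Each individual share looks like a uniformly random number, so a party that sees only the shares sent to it learns nothing about another party's count. Real deployments also need authenticated channels and protection against colluding or dishonest participants, which this sketch leaves out.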