Professor Joshua Zhexue Huang (Big Data Institute, Shenzhen University, China) will give an open tutorial lecture, "Approximate Computing for Big Data Analysis".
The lecture will be held online on December 8, from 17:00 to 19:45. You are welcome to take part!
Join the Zoom meeting: https://us02web.zoom.us/j/83189630550?pwd=S0dmTTNkNnFGcjdlV0trZWprQk9mQT09 (ID: 831 8963 0550, passcode: 984871).
In the era of big data, datasets with millions of objects and thousands of features have become commonplace in many organizations. Such datasets, often hundreds of gigabytes or even terabytes in size, can easily exceed the memory of a cluster system, creating computing problems in big data analysis. How to process and analyze terabyte-scale data effectively with limited resources is therefore both a theoretical and a technical challenge in current big data research.
In this tutorial, we will discuss the issues of distributed data computing, with a particular focus on approximate computing for big data. I will start with a general introduction to big data and the challenges in big data analysis, and continue with a discussion of the technologies currently used in big data analysis and their shortcomings. Then, I will introduce approximate computing for big data and a new method that uses multiple random samples to compute approximate results for big data. Finally, I will present the new technologies and algorithms that enable approximate computing, including the random sample partition (RSP) data model, the LMGI computing framework, and the algorithm for generating RSP data models from HDFS big data files. LMGI is a non-MapReduce framework that allows serial algorithms to be executed independently on local nodes or virtual machines without data communication among the nodes. The new technologies offer the following breakthroughs in big data computing: analyzing big data without a memory limit, executing serial algorithms directly in distributed computing, and extending the scalability of data analysis to terabytes on small clusters.
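To give a rough feel for the idea of computing approximate results from multiple random samples, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the RSP or LMGI implementation presented in the lecture: the function names `random_sample_partition` and `approximate_mean` are hypothetical, the data fits in memory, and the statistic is a plain mean. The data is shuffled and split into blocks so that each block is a random sample of the whole, a serial computation runs on a few blocks independently, and the per-block results are combined.

```python
import random
import statistics


def random_sample_partition(records, num_blocks, seed=0):
    """Shuffle the records and deal them into num_blocks blocks.

    Hypothetical helper: after shuffling, every block is a random
    sample of the full dataset, which is the property this sketch relies on.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Block i takes every num_blocks-th record starting at offset i.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def approximate_mean(blocks, num_selected, seed=0):
    """Estimate the dataset mean from a few randomly chosen blocks.

    Each block is processed on its own by a serial function (here, fmean),
    with no data exchanged between blocks; the block results are averaged.
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    block_means = [statistics.fmean(block) for block in selected]
    return statistics.fmean(block_means)


if __name__ == "__main__":
    # Synthetic data standing in for a large file (assumption for the demo).
    data_rng = random.Random(42)
    data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

    blocks = random_sample_partition(data, num_blocks=100)
    estimate = approximate_mean(blocks, num_selected=5)
    exact = statistics.fmean(data)
    print(f"approximate mean: {estimate:.3f}, exact mean: {exact:.3f}")
```

In this sketch only 5 of the 100 blocks are read to produce the estimate, and each block could in principle be processed on a different node, since no block needs data from another.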