Highly Scalable Algorithm For Distributed Real-Time Text Indexing

Cited by: 1
Authors
Narang, Ankur [1]
Agarwal, Vikas [1]
Kedia, Monu [1]
Garg, Vijay K. [1]
Affiliation
[1] IBM India Res Lab, New Delhi, India
DOI
10.1109/HIPC.2009.5433193
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Stream computing research is moving from terascale to petascale levels. It aims to rapidly analyze data as it streams in from many sources and to make decisions with high speed and accuracy in fields as diverse as security surveillance and financial services, including stock trading. We specifically consider real-time text indexing and search with high input data rates (10 GB/s or more) and a small index age-off (expiry) time. This requires maximal indexing rates for large volumes of data and minimal indexing latency (the time between the start of indexing for a document and its availability for search), while maintaining very low search response time. In addition, future massively parallel architectures with storage class memories will enable high-speed in-memory real-time indexing, where the index can be stored completely in a high-capacity storage class memory. In this paper, we present the design of distributed data structures and a distributed real-time text indexing algorithm for parallel systems with a large number (thousands to a hundred thousand) of cores/processors, while simultaneously providing acceptable search performance [1]. The inherent trade-offs among index space, indexing throughput, and search response time make this problem particularly challenging. Our algorithm uses group-based index construction and leverages novel index data structures that reduce load imbalance and make the text indexing and merge process more scalable and efficient. We show analytically that the asymptotic parallel time complexity of our distributed indexing algorithm is at least an Omega(log(P)) factor better than that of typical indexing approaches, where P is the number of indexing nodes in a group. We further demonstrate the performance and scalability of our distributed indexing algorithm on an MPP architecture (Blue Gene/L) using actual IBM intranet data. We achieved a high indexing throughput of around 312 GB/min on an 8K-node Blue Gene/L machine. Compared with parallel indexing implemented using typical approaches such as CLucene [2], this is 3x to 7x better. To the best of our knowledge, this is the first published result on indexing throughput at such a large scale with sustained search performance. We further show that our approach is scalable to 128K nodes, giving an estimated indexing throughput of 5 TB/min. We also achieved indexing latency that is around 10x better than typical indexing approaches.
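The throughput figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is not from the paper: it assumes that "8K" and "128K" mean 8 x 1024 and 128 x 1024 nodes and that throughput scales linearly with node count, and the function names are hypothetical. Under those assumptions, the 128K-node estimate follows directly from the measured 8K-node result.

```python
# Back-of-the-envelope check of the abstract's throughput figures.
# Assumptions (not stated in the source): "8K" = 8 * 1024 nodes,
# "128K" = 128 * 1024 nodes, and throughput scales linearly with nodes.

MEASURED_GB_PER_MIN = 312.0      # reported throughput on the 8K-node machine
MEASURED_NODES = 8 * 1024        # assumed meaning of "8K"
TARGET_NODES = 128 * 1024        # assumed meaning of "128K"


def per_node_throughput(total_gb_per_min: float, nodes: int) -> float:
    """Average indexing throughput contributed by one node, in GB/min."""
    return total_gb_per_min / nodes


def extrapolate_linear(total_gb_per_min: float, nodes: int, target_nodes: int) -> float:
    """Projected aggregate throughput at target_nodes, assuming linear scaling."""
    return per_node_throughput(total_gb_per_min, nodes) * target_nodes


estimate_gb_per_min = extrapolate_linear(MEASURED_GB_PER_MIN, MEASURED_NODES, TARGET_NODES)
estimate_gb_per_s = estimate_gb_per_min / 60.0

print(f"per-node throughput: {per_node_throughput(MEASURED_GB_PER_MIN, MEASURED_NODES) * 1000:.1f} MB/min")
print(f"projected at 128K nodes: {estimate_gb_per_min:.0f} GB/min ({estimate_gb_per_s:.1f} GB/s)")
```

With these assumptions the projection comes to 4992 GB/min, which is consistent with the "5 TB/min" estimate and, at roughly 83 GB/s, is above the 10 GB/s input rate the abstract targets.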
Pages: 332-341
Page count: 10