Highly Scalable Algorithm For Distributed Real-Time Text Indexing

Cited by: 1
Authors
Narang, Ankur [1 ]
Agarwal, Vikas [1 ]
Kedia, Monu [1 ]
Garg, Vijay K. [1 ]
Affiliations
[1] IBM India Res Lab, New Delhi, India
DOI
10.1109/HIPC.2009.5433193
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Stream computing research is moving from terascale to petascale levels. It aims to rapidly analyze data as it streams in from many sources and make decisions with high speed and accuracy in fields as diverse as security surveillance and financial services, including stock trading. We specifically consider real-time text indexing and search with high input data rates (10 GB/s or more) along with a small index age-off (expiry) time. This makes it necessary to have maximal indexing rates for large volumes of data as well as minimal indexing latency (the time between the start of indexing for a document and its availability for search) while maintaining very low search response time. In addition, future massively parallel architectures with storage class memories will enable high-speed in-memory real-time indexing, where the index can be completely stored in a high-capacity storage class memory. In this paper, we present the design of distributed data structures and a distributed real-time text indexing algorithm for parallel systems having a large number (thousands to a hundred thousand) of cores/processors, while simultaneously providing acceptable search performance [1]. The inherent trade-offs involved in index space, indexing throughput and search response time make this problem particularly challenging. Our algorithm uses group-based index construction and leverages novel index data structures that reduce load imbalance and make the text indexing and merge process more scalable and efficient. We show analytically that the asymptotic parallel time complexity of our distributed indexing algorithm is at least an Omega(log(P)) factor better than typical indexing approaches, where P is the number of indexing nodes in a group. We further demonstrate the performance and scalability of our distributed indexing algorithm on an MPP architecture (Blue Gene/L-1) using actual IBM intranet data. We achieved high indexing throughput of around 312 GB/min on an 8K node Blue Gene/L machine. In comparison with parallel indexing implemented using typical approaches like CLucene (2), this is 3x-7x better. To the best of our knowledge, this is the first published result on indexing throughput at such a large scale, with sustained search performance. We further show that our approach is scalable to 128K nodes, giving an estimated indexing throughput of 5 TB/min. We also achieved indexing latency that is around 10x better than typical indexing approaches.
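As a rough consistency check on the figures quoted in the abstract, the sketch below extrapolates the measured 8K-node throughput to 128K nodes. It assumes perfectly linear scaling with node count, which is an assumption made here for illustration; the abstract does not state how the 5 TB/min estimate was derived. The function name `extrapolate_throughput` and the decimal unit conversion (1 TB = 1000 GB) are likewise illustrative choices, not taken from the paper.

```python
def extrapolate_throughput(measured_gb_per_min: float,
                           measured_nodes: int,
                           target_nodes: int) -> float:
    """Scale a measured throughput linearly with node count (assumed ideal scaling)."""
    return measured_gb_per_min * (target_nodes / measured_nodes)


# Figures quoted in the abstract.
measured = 312.0            # GB/min on an 8K-node Blue Gene/L machine
nodes_8k = 8 * 1024
nodes_128k = 128 * 1024

estimate_gb = extrapolate_throughput(measured, nodes_8k, nodes_128k)
estimate_tb = estimate_gb / 1000.0   # assumes decimal units: 1 TB = 1000 GB

print(f"{estimate_gb:.0f} GB/min ~ {estimate_tb:.2f} TB/min")
```

Under the linear-scaling assumption, a 16x increase in node count gives about 4992 GB/min, which is consistent with the estimate of roughly 5 TB/min quoted in the abstract.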
Pages: 332 - 341
Number of pages: 10
Related papers
50 in total
  • [31] VOLAP: A Scalable Distributed System for Real-Time OLAP with High Velocity Data
    Dehne, Frank
    Robillard, David
    Rau-Chaplin, Andrew
    Burke, Neil
    2016 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2016, : 354 - 363
  • [32] Propeller: A Scalable Real-Time File-Search Service in Distributed Systems
    Xu, Lei
    Jiang, Hong
    Tian, Lei
    Huang, Ziling
    2014 IEEE 34TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2014), 2014, : 378 - 388
  • [33] R-Store: A Scalable Distributed System for Supporting Real-time Analytics
    Li, Feng
    Oezsu, M. Tamer
    Chen, Gang
    Ooi, Beng Chin
    2014 IEEE 30TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2014, : 40 - 51
  • [34] Exploring Scalable, Distributed Real-Time Anomaly Detection for Bridge Health Monitoring
    Moallemi, Amirhossein
    Burrello, Alessio
    Brunelli, Davide
    Benini, Luca
    IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (18) : 17660 - 17674
  • [35] A genetic algorithm for scheduling tasks in a real-time distributed system
    Monnier, Y
    Beauvais, JP
    Deplanche, AM
    24TH EUROMICRO CONFERENCE - PROCEEDING, VOLS 1 AND 2, 1998, : 708 - 714
  • [36] A distributed backoff algorithm to support real-time traffic on Ethernet
    Gupta, Vijay
    Operating Systems Review (ACM), 2001, 35 (03) : 43 - 54
  • [37] A Parametric Nonconvex Decomposition Algorithm for Real-Time and Distributed NMPC
    Hours, Jean-Hubert
    Jones, Colin N.
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2016, 61 (02) : 287 - 302
  • [38] A STATIC SCHEDULING ALGORITHM FOR DISTRIBUTED HARD REAL-TIME SYSTEMS
    VERHOOSEL, JPC
    LUIT, EJ
    HAMMER, DK
    JANSEN, E
    REAL-TIME SYSTEMS, 1991, 3 (03) : 227 - 246
  • [39] A multivariables Algorithm for Dynamic Reconfiguration of real-time Distributed Systems
    Soidridine, Moussa Moindze
    Karim, Konate
    PROCEEDINGS OF 2016 IEEE ADVANCED INFORMATION MANAGEMENT, COMMUNICATES, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (IMCEC 2016), 2016, : 960 - 968
  • [40] A distributed real-time control algorithm for energy storage sharing
    Zhu, Hailing
    Ouahada, Khmaies
    ENERGY AND BUILDINGS, 2021, 230