World Journal of Advanced Research and Reviews

eISSN: 2581-9615 || CODEN: WJARAI || Impact Factor 8.2


Designing highly resilient AI fabrics: Networking architectures for large-scale model training


Oluwatosin Oladayo Aramide *

NetApp Ireland Limited, Ireland.
 
Research Article
World Journal of Advanced Research and Reviews, 2024, 23(03), 3291-3303
Article DOI: 10.30574/wjarr.2024.23.3.2632
DOI url: https://doi.org/10.30574/wjarr.2024.23.3.2632
 
Received on 18 July 2024; revised on 21 September 2024; accepted on 27 September 2024
 
The rapid development of large AI models, particularly large language models (LLMs), has created unprecedented challenges for networking infrastructure. As training scales to hundreds and even thousands of GPUs in distributed systems, the resilience, efficiency, and performance of AI fabrics become paramount to sustained throughput and reliability. This paper discusses architectural design concepts and emerging technologies for building resilient AI fabrics for large-scale model training. We discuss the use of high-bandwidth interconnects such as RoCEv2 and 800G/1.6T Ethernet, examine topology-aware routing schemes, and evaluate how network-level fault-tolerance mechanisms contribute to resilience. Drawing on case studies and benchmarking, we identify both good and bad practices in existing AI training networks. Our results offer guidance for designing future-proof AI networking architectures that scale with the complexity of next-generation models.
 
AI Fabric; Distributed Training; RoCEv2; Resilient Networking; 800G Ethernet; High Performance Data Center Computing; Network Fault Tolerance; Large Language Models; Smart-NIC; Data Center Interconnects
 
https://wjarr.com/sites/default/files/fulltext_pdf/WJARR-2024-2632.pdf


Oluwatosin Oladayo Aramide. Designing highly resilient AI fabrics: Networking architectures for large-scale model training. World Journal of Advanced Research and Reviews, 2024, 23(3), 3291-3303. Article DOI: https://doi.org/10.30574/wjarr.2024.23.3.2632

Copyright © Author(s). All rights reserved. This article is published under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and source, a link to the license is provided, and any changes made are indicated.


All statements, opinions, and data contained in this publication are solely those of the individual author(s) and contributor(s). The journal, editors, reviewers, and publisher disclaim any responsibility or liability for the content, including accuracy, completeness, or any consequences arising from its use.
