Programmable networking hardware creates new opportunities for infusing intelligence into the network. This raises a fundamental question: what kinds of computation should be delegated to the network?
To answer this question, we turn our attention to modern machine learning workloads. Efficiently training complex machine learning models at scale requires high performance at the infrastructure level. With large models, communication among multiple workers becomes a scalability bottleneck due to limited network bandwidth.
We propose to address this problem by redesigning communication in distributed machine learning to take advantage of programmable network data planes. Our key insight is to reduce the volume of exchanged data by performing in-network computation that aggregates the model's parameter updates as they are being transferred. However, in-network computation tasks must be judiciously crafted to match the limitations of the constrained machine architecture of programmable network devices. Guided by experiments on machine learning workloads, we identify aggregation functions as an opportunity to exploit the limited computation power of networking hardware to lessen network congestion and improve overall application performance. As a proof of concept, we propose DAIET, a system that performs in-network data aggregation. Experimental results with an initial prototype show a large data reduction ratio (86.9%-89.3%) and a similar decrease in the workers' computation time.
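To make the aggregation idea concrete, the following minimal Python sketch mimics how an in-network aggregation point could sum parameter updates from several workers packet by packet, so that only the aggregated stream travels upstream. It is an illustration under stated assumptions, not the DAIET implementation; all names and parameters (Packet, aggregate_streams, NUM_WORKERS, CHUNK, MODEL_SIZE) are hypothetical.

```python
# Illustrative sketch of in-network aggregation of parameter updates.
# NOT the DAIET implementation; all names and sizes are assumptions for this example.

from dataclasses import dataclass
from typing import List

NUM_WORKERS = 4      # number of training workers (assumed)
MODEL_SIZE = 1024    # parameters per model update (assumed)
CHUNK = 64           # parameters carried by one packet (assumed)

@dataclass
class Packet:
    offset: int        # start index of the chunk inside the update vector
    values: List[int]  # fixed-point parameter updates (integer math, as on switch hardware)

def shard(update: List[int]) -> List[Packet]:
    """Split one worker's update into packets of CHUNK parameters."""
    return [Packet(i, update[i:i + CHUNK]) for i in range(0, len(update), CHUNK)]

def aggregate_streams(streams: List[List[Packet]]) -> List[Packet]:
    """Element-wise sum of packets sharing the same offset; forward one
    aggregated packet per offset once all workers have contributed."""
    slots = {}  # offset -> (running sums, contributor count)
    out = []
    for stream in streams:
        for pkt in stream:
            sums, seen = slots.get(pkt.offset, ([0] * len(pkt.values), 0))
            sums = [a + b for a, b in zip(sums, pkt.values)]
            seen += 1
            if seen == NUM_WORKERS:  # all contributions received: emit the aggregate
                out.append(Packet(pkt.offset, sums))
                slots.pop(pkt.offset, None)
            else:
                slots[pkt.offset] = (sums, seen)
    return out

if __name__ == "__main__":
    updates = [[w + 1] * MODEL_SIZE for w in range(NUM_WORKERS)]  # toy integer updates
    aggregated = aggregate_streams([shard(u) for u in updates])
    sent = NUM_WORKERS * (MODEL_SIZE // CHUNK)
    forwarded = len(aggregated)
    print(f"packets entering the network: {sent}, packets forwarded upstream: {forwarded}")
    print(f"data reduction: {1 - forwarded / sent:.1%}")  # ~75% with 4 workers in this toy setup
```

With four workers, the toy setup forwards one aggregated packet for every four that enter, a 75% reduction; the reduction grows with the number of workers, which is the effect the in-network aggregation design aims to exploit.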