A dynamic distribution of the input buffer for on-chip routers

Fan Lifang, Zhang Xingming, Chen Ting

Institute of Information Engineering, PLA Information Engineering University, Zhengzhou 450002, China

Keywords: NoC, buffer, dynamic allocation, performance.

Abstract. Each port has a same buffer in typical network on chip input ports. Under the condition of the imbalance load in the work, there will be a port all buffer used and some other ports has many idle buffer. It will reduce the NoC buffer utilization, then influence the NoC overall performance. In this paper, an algorithm of dynamically allocation port buffer based on load is proposed. This algorithm allocates buffer to each port according to the load of each port. It is improve the buffer utilization effectively. Under the 90nm process, we achieve the on chip router that can allocate input buffer dynamically. The experimental results show that the router’s information delay reduced up to 34.7% and throughput increased by 13.4% in the hot mode. The router’s area is smaller than the typical router overhead, so it can meet the application for the network on chip.

1. Introduction

With the continuous development and progress of semiconductor technology, network based on packet switching, Network on Chip (NoC) has become a effective solution for large-scale on-chip multi-core interconnection[1]. Typical Network on Chip is composed of resource nodes, routers and two-way link. Resource nodes connect to a router through the local network interface, thus can make all sorts of different interface protocols in network layer be blocked. Buffer occupies most area of a router and it consumes more than 50% power consumption of a router[2], so it has a vital significance for the research of network buffer.

Buffer plays an important role in NoC, it directly affects NoC’s overall performance. Large buffer can improve the performance of the system[3], but NoC’s area is limited and buffer has the characteristic of large power consumption, this limits the buffer’s large capacity design. A study found that when the router’s FIFO depth is increased from 2 to 3, the router’s area is increased by 30%, but the NoC’s performance wasn’t improved[4]. So how to reasonable use of buffer resource is a key point of NoC research.

2. Correlation study

DAMQ (Dynamically-allocated multi-queue) was a idea which was first put forward on the dynamic adjustment of buffer structure. It can improve NoC’s buffer resource utilization[5], but it still has queue head congestion problem. Paper[6] proposed a DAMQ with multiple packets (DAMQ - MP) algorithm based on DAMQ. DAMQ - MP allows multiple packets exits at the same time in one virtual channel. A packet can export from any exit if the exit is leisure, otherwise it has to be exported from the corresponding exit. The algorithm improve throughput by the rate of 24.52% compared with DAMQ. Paper[7] proposed a virtual channel dynamic allocation structure (Vichar), it can distribute the number of input ports VC dynamically according to the actual network load. In the same buffer resources, Vichar’s performance increase by 25% than typical router. Paper[8] proposed a dynamic buffer structure based on Vichar which can improve network congestion. It is improve NoC’s performance further.

The above algorithms all research a single port that has a fixed buffer. They regulate each port in the internal. but different applications have different communication requirements and...
communication features, so the network communication load of each port is different[9]. In the load imbalance network, if each port has the same buffer, there will be some port’s buffer are inadequate but some are free, thus will cause buffer resource utilization lower.

3. Load dynamically allocation virtual channel (BDVC)

Tamir and Gregory first mentioned DAMQ (dynamically allocated multi - queue) algorithm. The algorithm links every buffer with a linked list, then allocates buffer for each virtual channel according to the load[10]. This paper uses the idea in multi-port.

3.1 NoC typical router input port structure

In typical router, each input port has fixed buffer space and virtual channel. VC storage unit adopts SRAM, as shown in figure 1, the router adopts wormhole and virtual channel technology. Assuming each input port contains m virtual channels, each virtual channel stores k flit. Router structure has P input port, P output port, routing computation, virtual channel allocator, switch allocator and cross switch. This paper research two-dimensional mesh structure, so P is 5.

3.2 BDVC introduction

Because of this paper is based on two-dimensional mesh structure, so it has five input ports, each port has m VC, a router total has 5m VC. The basic idea of BDVC is: the linked list links all buffer by unified management, when a packet or a flit is still at the front router, it sends a request signal to the next router in advance, request signal includes packet length, destination port and so on, when a router receives a request signal, it applies to the linked list a VC immediately, when a packet or a flit leaves, the router releases the VC. Figure 2 is the design of improved port E.

We can see from the figure 2, port E has other ports’s VC, in the same way, other ports also contain port E’s VC. Because of the linked list links all VC, when a port sends a request signal, the list distribute the current pointer node to the port, then the pointer point to the next node. Because of a port’s load isn’t balance over a period, it may be that at a period time there are many packet, it also may be that at a period time there are less packet. So it can be appear the situation above.

When a adjacent superior router sends a request signal, the input controller will send a Gnt_up
signal to the linked list. The signal has two function, on the one hand it applies to the list for a VC, on the other hand it sends feedback to the superior router, then input controller sends internal request signal Req_in_E to the output port. When packet or flit transmits to the next router, the input controller release the VC. The VC released are connected to the list’s tail.

The list has two pointers. The head pointer points to the VC that will be distributed, the tail pointer point to the tail of the list. When a VC is assigned to a port, the head pointer plus 1, when a VC is released, the tail pointer plus 1. The list is a circular linked list.

3.3 VC state recording strategy

Figure 3 is VC state recording block diagram. Virtual channel Recorder (VC Recorder) is used to record the usage of virtual channel. It includes three modules: VC Status Table, Port Occupied Table and Regulation Table.

![Fig.3 VC state recording block diagram](image)

VC State Table is used to record the current state of each VC, occupied or free, as shown in figure 4(a). Port Occupied Table records the situation that each port takes up VC, as shown in figure 4(b), each line in the figure represents input port, state equal to 1 indicating the VC is occupied by a input port, 0 indicating isn’t occupied. If a VC column is full of 0, it is saying that the VC is idle, not occupied. Regulation Table can regulate VC State Table and Port Occupied Table dynamically according to the change of linked list.

<table>
<thead>
<tr>
<th>VC id</th>
<th>status</th>
<th>owner</th>
</tr>
</thead>
<tbody>
<tr>
<td>P1_VC1</td>
<td>Occupied</td>
<td>E</td>
</tr>
<tr>
<td>......</td>
<td>......</td>
<td>......</td>
</tr>
<tr>
<td>P1_VCm</td>
<td>Occupied</td>
<td>W</td>
</tr>
<tr>
<td>......</td>
<td>......</td>
<td>......</td>
</tr>
<tr>
<td>P5_VC1</td>
<td>free</td>
<td>Null</td>
</tr>
<tr>
<td>......</td>
<td>......</td>
<td>......</td>
</tr>
<tr>
<td>P5_VCm</td>
<td>free</td>
<td>Null</td>
</tr>
</tbody>
</table>

(a)VC Status Table (b)Port Occupied Table

![Fig.4 VC regulation data structure dynamically](image)

The work of VC recorder includes: 1) allocation VC dynamically according to VC Status Table and Port Occupied Table; 2) updating linked list according to adjustment result; 3) updating adjacent VC Status Table according to Port Occupied Table; 4) when packet head flit arrives at VC, taking VC...
state table corresponding VC is not available and Port Occupied Table is 1; 5) when packet tail flit leaves VC, updating VC Status Table and Port Occupied Table, also.

4. Simulation result and analysis

4.1 performance compare

This paper uses SystemC to simulate[11]. It is based on two-dimensional mesh topology structure of $5 \times 5$. Each port’s VC is equal in typical router, BDVC algorithm allocates VC dynamically according to network’s load condition. The simulation uses store-and-forward technology and XY order routing algorithm, in hot and balanced two modes.

Figure 5 is BDVC algorithm designing router and typical router simulation comparison of average delay and throughput in hot mode, figure 6 is in balanced mode. We can see from the diagram, when the network load rate is low, the two router’s performance is very close. But with the increase of load, the performance of BDVC algorithm improves obviously.

![Fig.5 performance in hot mode](image)

![Fig.6 performance in balanced mode](image)

When load rate is low, the contention is rare, so the two router’s performance is basically same. In high load rate situation, typical router will reach saturation soon, but BDVC designing router reach saturation at a higher load factor. Especially in hot mode, the highest average delay is reduced by 34.7%, the highest throughput increased by 13.4%. Because in hot mode, each port’s load is imbalance, BDVC can allocate VC to each port dynamically according to port’s request.

4.2 Area overhead

Table 1 is BDVC designing router’s area and typical router’s area. Because of BDVC regulates all VC dynamically, it’s control logic is larger. BDVC designing router increases virtual channel recorder, which also increases the router’s area. In general, BDVC designing router’s area is increased by 9.1%, it is in a reasonable range.
Table 1 BDVC’s area and typical router’s area comparing

<table>
<thead>
<tr>
<th></th>
<th>Typical router</th>
<th>BDVC designing router</th>
</tr>
</thead>
<tbody>
<tr>
<td>Router logic</td>
<td>1.3</td>
<td>1.3</td>
</tr>
<tr>
<td>Buffer</td>
<td>42.27</td>
<td>42.27</td>
</tr>
<tr>
<td>VC control logic</td>
<td>9.7</td>
<td>13.2</td>
</tr>
<tr>
<td>others</td>
<td>46.7</td>
<td>52.3</td>
</tr>
<tr>
<td>total</td>
<td>99.97</td>
<td>109.07</td>
</tr>
</tbody>
</table>

5. Summary

This paper proposes an algorithm that can allocate port buffer according to network load. The algorithm allocate and release buffer for each port dynamically. It can solve resource contention problem due to port’s load imbalance. Simulation results show that BDVC router’s average delay and throughput has an obvious increase in hot mode compared with typical router. But BDVC router require additional data structure and register, so it’s area increased by 9.1%. In the future, it is necessary to improve BDVC algorithm to decrease it’s router’s area.

Acknowledgments

This work was financially supported by the national “863” (2009AA012201).

References


