Commit 39b4bd00 authored by Zsolt Istvan

initial commit (copied from internal commit fbd7e12c0440)

# Caribou
Caribou [1] is **smart distributed storage** built with FPGAs. Each node stores key-value pairs in main memory and exposes a simple interface over TCP/IP [2] that software clients can connect to.
It is **smart** because it is possible to offload filtering into the storage nodes. The nodes can also perform scans on the data. In this design filtering is a combination of regular expression matching and predicate evaluation. Different types of processing can, however, easily be added to the processing pipeline.
It is **distributed** because it runs on multiple FPGAs that replicate the data using a leader-based consensus protocol [3] that is both low latency and high throughput.
It is **storage** because it stores key-value pairs in a Cuckoo hash table and implements slab-based memory allocation. The current design uses DRAM to store data, as an exploration of solutions that will work well with emerging non-volatile memory technologies.
#### Referenced articles:
[1] Caribou: Intelligent Distributed Storage. Zs. Istvan, D. Sidler, G. Alonso. To appear in VLDB 2017, Munich, Germany. https://people.inf.ethz.ch/zistvan/doc/vldb17-caribou.pdf
[2] Low-Latency TCP/IP Stack for Data Center Applications. D. Sidler, Zs. Istvan, G. Alonso. 26th International Conference on Field Programmable Logic and Applications (FPL'16), Lausanne, Switzerland, September 2016. http://davidsidler.ch/files/fpl16-lowlatencytcpip.pdf
[3] Consensus in a Box: Inexpensive Coordination in Hardware. Zs. Istvan, D. Sidler, G. Alonso, M. Vukolic. 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI '16), March 2016. https://people.inf.ethz.ch/zistvan/doc/nsdi16-istvan-rev1.pdf
A word on replacing modules
---------------------------
The project is fairly modular, so it should be relatively easy to swap out modules. In the following I will outline the most obvious options for "tweaking":
* If you want to change the key-value store implementation it should be easy because all functionality is encapsulated in a single wrapper (nukv_Top_Module_v2). The interfaces are simple, the only tricky part might be handling the different opcodes coming in.
* It is possible to modify the near-data processing by removing the modules between nukv_Value_Set and nukv_Value_Get, and adding your own. Ideally there should be a single "drop" bit that gets passed on to the Value_Get module to indicate that the data has not matched the filter and should therefore be dropped.
* Playing around with the consensus logic is also possible, just replace the files prefixed with zk_control. These implement the actual decisions. If you want to change headers, etc., the zk_data modules will need adjustments as well.
* Using a networking stack other than TCP. Even though the atomic broadcast requires ordered reliable transport (in essence TCP) to function correctly, it is possible to introduce your own module instead of the TCP stack we have. The consensus logic and KVS make very few assumptions about the networking protocol: 1) for requests coming from clients they will send a response on the same socket (on the same socket-ID that has been provided to the logic by the TCP stack), 2) to send messages between FPGAs, the consensus logic will use "mono-directional" sockets that each node opens to all other nodes. The socket-ID of this connection will be saved inside the consensus logic and is assumed to stay valid unless the FPGA on the other end "dies".
Differences to published papers
-------------------------------
The code in this repository differs slightly from the system presented in the VLDB17 paper. Some of the pipeline stages have been rewritten for clarity or to remove bugs (mostly the files with _v2 marking).
* Insert acts as both Insert and Replace in the KVS.
* The log of the replication engine has been moved to BRAM instead of DRAM. This has no impact on performance or correctness, but should help to meet timing in the Vivado project without any tweaks.
The memory holding the log and the log header entries can be sized to hold several thousand requests. Beyond this, recovery is done as a bulk copy of the hash table and bitmap state. While this is sub-optimal in terms of recovery time, it should provide the needed functionality.
* The relative sizes of different data structures are different. See the code for the bit-width of addresses to each portion of the memory.
* This version of the memory allocator uses a single tablespace in the bitmaps (but making it parameterizable for multiple tablespaces again should be straightforward).
* The regular expression matchers have been moved to a faster clock domain (312MHz), thus their number can be halved while still achieving the same bandwidth.
Known issues / limitations
--------------------------
* In the current version of the code (June 2017) there is a mix between ZAB opcodes and KVS opcodes. Most notably, the ZAB write request will map internally to a KVS Insert when successfully replicated. At this point deletes, etc., have to be done "outside" the replication logic. This is a momentary limitation and originates from the fact that we re-use some parts of the header for both parts of the logic. In the future the KVS code shall be held inside the replicated package (which is anyway passed on verbatim to the KVS once the ZAB header has been stripped out).
* Scans and GETs are not compatible at the same time. This means that while a scan is executing no other operations can be serviced. This is a result of how memory access is arbitrated between the memory allocator (that performs the scan) and the rest of the pipeline. In future iterations this shall be fixed for better flexibility.
SW Environment and Setup
========================
To build Caribou we used Ubuntu 14.04LTS.
The IDEs and tools are as follows:
* Vivado 2014.3
* Modelsim SE-64 10.1c
* ChipScope Analyzer 14.7
* Java 1.7
* Golang 1.6
**While everything might work with newer versions of the tools, we have not tested it (especially IP core major version could have changed in newer versions of Vivado).**
The source code is organized in the following way:

    ./hw
        /src -- actual source code of Caribou (toplevel file = zookeeper_nkv_fpga_para.v, toplevel for simulation = zk_toplevel_nukv_TB.vhd)
        /ip -- IP cores and various DCPs
        /constraints -- XDC constraints to use with the project
    ./sw -- software clients
        /ClusterManagement -- Java code to initialize/modify zookeeper groups
        /client-scan-demo -- Client written in Go that showcases the scan feature, among other operations

HW Environment and Setup
========================
Boards
------
To build and test Caribou we used the Xilinx VC709 evaluation boards (Virtex-7 VX690T).
While we have included DCPs (binaries) for the DDR3 memory controller, SmartCam for network sessions and the TCP/IP implementation, please visit the repository below for more up to date versions:
https://github.com/fpgasystems/fpga-network-stack
The code is fairly easily portable to the Alpha Data ADM-PCIE-7V3 board, and binaries for the memory controller and XDC constraints should be available at the previously mentioned address.
Network setup
-------------
By default the boards have an IP address of 10.1.212.209; the first and last byte can each be incremented by up to 16 using the switches on the device (see picture).
In the provided code all network traffic happens through interface no.0 of the boards, so these have to be connected to the switch or to another 10Gbps NIC in a machine. We used a 10Gbps Switch (Intel 82599ES).
Contact
=======
For questions about Caribou please feel free to email Zsolt (zsolt.istvan@inf.ethz.ch)
Booting
=======
### Flushing
Caribou can be used in two different modes (or a mix of these): replicated or node-local.
Regardless of the use, after programming the FPGA it needs to be reset (essentially a flush command to each FPGA):

    echo -n 'FFFF000001000108F00BA20000000000f00f00f00f00f00f' | xxd -r -p | nc $FPGAIP_0 2888 -q 2
    echo -n 'FFFF000001000108F00BA20000000000f00f00f00f00f00f' | xxd -r -p | nc $FPGAIP_1 2888 -q 2
    echo -n 'FFFF000001000108F00BA20000000000f00f00f00f00f00f' | xxd -r -p | nc $FPGAIP_2 2888 -q 2
    ...

Once this has been done, the FPGA is ready to serve get/put requests or to have the Zookeeper Atomic Broadcast subsystem configured.
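
The same flush can also be sent from a script. The following is a minimal Python sketch and not part of the repository: it reuses the exact packet bytes from the commands above. The names `flush_packet` and `flush_node`, and the choice to keep the connection open for two seconds before closing (mirroring `nc -q 2`), are assumptions made for illustration.

```python
import socket
import sys
import time

# Flush packet copied verbatim from the shell commands above (48 hex digits = 24 bytes).
FLUSH_HEX = "FFFF000001000108F00BA20000000000f00f00f00f00f00f"


def flush_packet() -> bytes:
    """Return the raw bytes of the flush command."""
    return bytes.fromhex(FLUSH_HEX)


def flush_node(host: str, port: int = 2888, linger: float = 2.0) -> None:
    """Send the flush packet to one FPGA and hold the socket open briefly.

    The linger time imitates `nc -q 2`; whether it is required is an assumption.
    """
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall(flush_packet())
        time.sleep(linger)


if __name__ == "__main__":
    # Usage: python flush.py <FPGA IP> [<FPGA IP> ...]
    for fpga_ip in sys.argv[1:]:
        flush_node(fpga_ip)
```

Passing the IP addresses of all boards on the command line flushes them one after another.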
### Initial ZAB config
Nodes need to be told that they will participate in the replication group and who the first leader is. This can be done either "manually" using a script, or with the code in the ./sw/ClusterManagement project by running the CommandLineInterface class:

    CommandLineInterface $FPGAIP_0:2888;$FPGAIP_1:2888;$FPGAIP_2:2888

To add nodes later, run the same class with the original group as first argument, and the additional node as second argument:

    CommandLineInterface $FPGAIP_0:2888;$FPGAIP_1:2888;$FPGAIP_2:2888 $FPGAIP_new:2888

Sending requests
================
In the current setup there are two ways to execute commands on Caribou.
### Replicated
From the client's perspective the only important operation is "replicated set" that will replicate the given key and value to all nodes. (Other operations and their code can be found in zk_control_CentralSM.vhdl.)
These operations are formatted as follows:

    FFFFxxCCPPPP0000
    EEEEEEEEEEEEEEEE
    KKKKKKKKKKKKKKKK
    LLLLVVVVVVVVVVVV
    ...
    VVVVVVVVVVVVVVVV

Legend:
* x [1B] = reserved to encode node id
* C [1B] = opcode of the operation
* P [2B] = payload (key + value) size in 64bit words. E.g. 4=4*64bit
* E [8B] = reserved to encode epoch, zxid
* K [8B] = key (64 bits)
* L [2B] = length of value (including these two bytes) in bytes
* V [variable] = value (if no value is needed for the operation, stop at K)
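
As an illustration of the layout above, here is a hedged Python sketch that assembles a replicated request. It is not taken from the repository. It makes three assumptions: the two-byte fields P and L are little-endian (inferred from the hex test vectors in this commit, where a payload of 7 words appears as `0700`), the value is zero-padded to a 64-bit boundary, and L counts the unpadded value plus its own two bytes. The function name `build_replicated` and its parameters are hypothetical; the opcode is left to the caller, since the actual values live in zk_control_CentralSM.vhdl.

```python
import struct

WORD = 8  # one 64-bit word, in bytes


def build_replicated(opcode: int, key: bytes, value: bytes = b"",
                     node_id: int = 0,
                     epoch_zxid: bytes = b"\x00" * WORD) -> bytes:
    """Assemble FFFF xx CC PPPP 0000 | E | K | L V... as described in the legend."""
    if len(key) != WORD:
        raise ValueError("key must be exactly 8 bytes (one 64-bit word)")
    if len(epoch_zxid) != WORD:
        raise ValueError("epoch/zxid field must be exactly 8 bytes")

    payload = key
    if value:
        # L includes its own two bytes (assumed little-endian).
        length_and_value = struct.pack("<H", len(value) + 2) + value
        # Zero-pad so the payload is a whole number of 64-bit words (assumption).
        padding = b"\x00" * (-len(length_and_value) % WORD)
        payload += length_and_value + padding

    payload_words = len(payload) // WORD  # P = key + value size in 64-bit words
    header = (b"\xff\xff" + bytes([node_id, opcode])
              + struct.pack("<H", payload_words) + b"\x00\x00")
    return header + epoch_zxid + payload


if __name__ == "__main__":
    # Opcode 0x02 is used here only as a placeholder value.
    packet = build_replicated(0x02, b"10foobar", b"hello", node_id=1)
    print(packet.hex())
```

With a 46-byte value the payload is 1 key word plus 6 value words, so P is 7, which matches the example given in the legend for how P is counted.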
### Node-local
To perform operations that are local to the node, we use a similar format of the packets as above, but with extra information in bytes 4-7 (see nukv_ht_write_v2.v for opcodes):

    FFFF0000PPPPkkQQ
    0000000000000000
    KKKKKKKKKKKKKKKK
    LLLLVVVVVVVVVVVV
    ...
    VVVVVVVVVVVVVVVV

* P [2B] = payload (key + value) size in 64bit words. E.g. 4=4*64bit
* k [1B] = length of key in 64 bit words (can be 01 or 02).
* Q [1B] = node-local command code
* K [8B/16B] = key (64 or 128 bits)
* L [2B] = length of value (including these two bytes) in bytes
* V [variable] = value (if no value is needed for the operation, stop at K)
In-code examples of these operations can be found in the Go client in /src/
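
For readers who do not use Go, the node-local layout can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not code from the repository: P and L are assumed to be little-endian, the value is assumed to be zero-padded to a 64-bit boundary, and the second 64-bit word is exposed as a hypothetical `reserved` parameter because the flush command shown earlier carries non-zero bytes there. The function name `build_node_local` is likewise made up for this example; command codes are found in nukv_ht_write_v2.v.

```python
import struct

WORD = 8  # one 64-bit word, in bytes


def build_node_local(command: int, key: bytes, value: bytes = b"",
                     reserved: bytes = b"\x00" * WORD) -> bytes:
    """Assemble FFFF 0000 PPPP kk QQ | reserved | K | L V... for a node-local op."""
    if len(key) not in (WORD, 2 * WORD):
        raise ValueError("key must be 8 or 16 bytes (kk = 01 or 02)")
    if len(reserved) != WORD:
        raise ValueError("second word must be exactly 8 bytes")

    payload = key
    if value:
        # L includes its own two bytes (assumed little-endian).
        length_and_value = struct.pack("<H", len(value) + 2) + value
        # Zero-pad to a whole number of 64-bit words (assumption).
        payload += length_and_value + b"\x00" * (-len(length_and_value) % WORD)

    header = (b"\xff\xff" + b"\x00\x00"
              + struct.pack("<H", len(payload) // WORD)  # P, in 64-bit words
              + bytes([len(key) // WORD, command]))      # kk, QQ
    return header + reserved + payload


if __name__ == "__main__":
    # Rebuild the flush packet from the Booting section out of its fields.
    flush = build_node_local(0x08, bytes.fromhex("f00f00f00f00f00f"),
                             reserved=bytes.fromhex("F00BA20000000000"))
    print(flush.hex())
```

Under these assumptions, the call in the `__main__` block reproduces the flush packet byte for byte, which is a useful sanity check on the field order.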
### Go Client
To populate run:

    ./caribou -host "$LEADER_IP:2888" -populate -time 120 (-replicate) (-flush)

To do some mixed ops (50% writes) for 10 seconds:

    ./caribou -host "$LEADER_IP:2888" -setp 0.5 -time 10

...
0050
0000
0FFFF021100000000
10000000000000000
0020
0000
0FFFF011200000000
1F10A0000CED4010A
0020
0000
0FFFF031200000000
1F00A0000CED4010A
0020
0000
0FFFF011400000000
10000000000000000
0040
0000
0FFFF000001000108
0F00BA20000000000
11000000000000001
0400
0000
0FFFF010207000000
00100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00200000001000000
03130666f6f626173
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00300000001000000
03130666f6f626174
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010400000000
00100000001000000
0FFFF010400000000
00200000001000000
0FFFF010400000000
00300000001000000
0FFFF010207000000
00400000001000000
03130666f6f626175
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00500000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
13233343536373839
0060
0000
0FFFF010207000000
00600000001000000
03130666f6f626173
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00700000001000000
03130666f6f626174
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00800000001000000
03130666f6f626175
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00900000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00a00000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00b00000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00c00000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00d00000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
13233343536373839
0055
0000
0FFFF010207000000
00e00000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00e00000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
00f00000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
01000000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
01100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
01200000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
01300000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF010207000000
01400000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
13233343536373839
0200
0000
0FFFF010400000000
00400000001000000
0FFFF010400000000
00500000001000000
0FFFF010400000000
00600000001000000
0FFFF010400000000
00600000001000000
0FFFF010400000000
00700000001000000
0FFFF010400000000
00800000001000000
0FFFF010400000000
00900000001000000
0FFFF010400000000
00a00000001000000
0FFFF010400000000
00b00000001000000
0FFFF010400000000
00c00000001000000
0FFFF010400000000
00d00000001000000
0FFFF010400000000
00e00000001000000
0FFFF010400000000
00f00000001000000
0FFFF010400000000
01000000001000000
0FFFF010400000000
01100000001000000
0FFFF010400000000
01200000001000000
0FFFF010400000000
11300000001000000
1000
0000
0050
0000
0FFFF021100000000
10000000000000000
0020
0000
0FFFF011200000000
1F10A0000CED4010A
0020
0000
0FFFF031200000000
1F00A0000CED4010A
0020
0000
0FFFF021400000000
10000000000000000
0040
0000
0FFFF000001000108
0F00BA20000000000
11000000000000001
0400
0000
0FFFF000007000101
00100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000007000101
00100000001000000
03130666f6f626174
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000007000101
00100000001000000
03130666f6f626175
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000007000101
00100000001000000
03130666f6f626173
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
13233343536373839
0300
0000
0FFFF000107000000
00100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626173
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626174
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF030300000000
00100000001000000
0FFFF030300000000
00200000001000000
0FFFF030300000000
00300000001000000
0FFFF000107000000
00100000001000000
03130666f6f626175
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
13233343536373839
0060
0000
0FFFF000107000000
00100000001000000
03130666f6f626173
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626174
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626175
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000
00100000001000000
03130666f6f626172
03000303132333435
06172203020302032
034380d0af6003031
03233343536373839
0200034123412ffff
03233343536373839
0FFFF000107000000