# GlusterFS Build Steps

## Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Build Document Setup](#build-document-setup)
- [Node Prep](#node-prep)
- [Configure /etc/hosts and iptables](#configure-etchosts-and-iptables)
- [Granular iptables](#granular-iptables)
- [Install Packages](#install-packages)
- [Prepare Bricks](#prepare-bricks)
- [GlusterFS Setup](#glusterfs-setup)
- [Start glusterfsd daemon](#start-glusterfsd-daemon)
- [Build Peer Group](#build-peer-group)
- [Volume Creation](#volume-creation)
- [Replicated Volume](#replicated-volume)
- [Distributed-Replicated Volume](#distributed-replicated-volume)
- [Volume Deletion](#volume-deletion)
- [Clearing Bricks](#clearing-bricks)
- [Adding Bricks](#adding-bricks)
- [Volume Options](#volume-options)
- [Client Mounts](#client-mounts)
- [FUSE Client](#fuse-client)
- [NFS Client](#nfs-client)

## Overview

Prior to starting work, a fundamental decision must be made: what type of Volume(s) will be used for the given scenario. While six volume types exist, two are used most often to achieve different results:

- **Replicated**: This type of volume provides file replication across multiple bricks. It is the best choice for environments where high availability and high reliability are CRITICAL, and when you wish to self-mount the volume on every node, such as with a webserver DocumentRoot, where the GlusterFS nodes are their own clients.
  - Files are copied to each brick in the volume, similar to RAID-1; however, you can have three or more bricks, including an odd number. Usable space is the size of one brick, and all files written to one brick are replicated to all others. This makes the most sense if you are going to self-mount the GlusterFS volume, for instance as the web docroot (/var/www) or similar, where all files must reside on that node. The value passed to `replica` is the same as the number of nodes in the volume.
- **Distributed-Replicated**: In this scenario files are distributed across replicated sets of bricks in the volume. Use this type of volume where the requirement is to scale storage while keeping high availability. Volumes of this type also offer improved read performance in most environments, and are the most common type of volume used when clients are external to the GlusterFS nodes themselves.
  - Somewhat like RAID-10, an even number of bricks must be used; usable space is the combined size of all bricks divided by the `replica` value. For example, if there are **4 bricks of 20G** and you pass `replica 2` to the creation, your files distribute across 2 replica sets (40G usable) and replicate within each pair. With **6 bricks of 20G** and `replica 3`, files distribute across 2 sets (40G usable) with 3-way replication; with `replica 2`, they distribute across 3 sets (60G usable) and replicate in pairs. This would be used when your clients are external to the cluster, not local self-mounts.

All the fundamental work in this document is the same except for the single step where the Volume is created, as outlined above with the `replica` keyword. Striped volumes are not covered here.

## Prerequisites

1. Two or more servers with separate storage
2. A private network between the servers

## Build Document Setup

This build document uses the following setup, which can be stood up easily; using Cloud block devices is no different than VMware vDisks, SAN/DAS LUNs, iSCSI, etc.

- 4x Performance 1 Tier 2 Rackspace Cloud servers
- A 20G /dev/xvde ready to use for each brick
- 1x Cloud Private Network on 192.168.3.0/24 for GlusterFS communication
- GlusterFS 3.7 installed from the vendor package repository

## Node Prep

- Configure /etc/hosts and iptables
- Install base toolset(s)
- Install GlusterFS software
- Connect GlusterFS nodes

### Configure /etc/hosts and iptables

In lieu of using DNS, we prepare /etc/hosts on every machine and ensure the nodes can talk to each other.
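Once the hosts file is populated, a quick loop can confirm that every peer name actually resolves before continuing. The helper below is an illustrative sketch, not part of the original steps (`resolve_all` is my name for it):

```shell
# resolve_all: report whether each hostname argument resolves via the
# local resolver (/etc/hosts or DNS). Illustrative helper only.
resolve_all() {
    for name in "$@"; do
        if getent hosts "$name" > /dev/null; then
            echo "$name ok"
        else
            echo "$name FAILED"
        fi
    done
}

# Against this document's node names:
resolve_all glus1 glus2 glus3 glus4
```

Any `FAILED` line means the /etc/hosts entry (or DNS) for that node still needs attention.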
All servers have the name `gluster`_N_ as a hostname, so we'll use `glus`_N_ for our private communication layer between nodes.

```
# vi /etc/hosts
192.168.3.2 glus1
192.168.3.4 glus2
192.168.3.1 glus3
192.168.3.3 glus4

# ping -c2 glus1; ping -c2 glus2; ping -c2 glus3; ping -c2 glus4

## Red Hat oriented:
# vi /etc/sysconfig/iptables
-A INPUT -s 192.168.3.0/24 -j ACCEPT
# service iptables restart

## Debian oriented:
# vi /etc/iptables/rules.v4
-A INPUT -s 192.168.3.0/24 -j ACCEPT
# service iptables-persistent restart
```

#### Granular iptables

The generic iptables rule above opens all ports to the subnet; if a more granular setup is required, open the following:

- **111** - portmap / rpcbind
- **24007** - GlusterFS Daemon
- **24008** - GlusterFS Management
- **38465** to **38467** - Required for the GlusterFS NFS service
- **24009** to +X - GlusterFS versions below 3.4, OR
- **49152** to +X - GlusterFS versions 3.4 and later

Each brick for every volume on the host requires its own port. For every new brick, one new port is used, starting at **24009** for GlusterFS versions below 3.4 and at **49152** for version 3.4 and above.

**Example**: If you have one volume with two bricks, you need to open 24009 - 24010, or 49152 - 49153.

### Install Packages

1. Install the basic packages for partitioning, LVM2 and XFS
2. Install the GlusterFS repository and glusterfs\* packages
3. Disable automatic updates of gluster\* packages

Some of the required packages may already be installed on the cluster nodes.
```
## YUM/RPM based:
# yum -y install parted lvm2 xfsprogs
# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
# yum -y install glusterfs glusterfs-fuse glusterfs-server

## Ubuntu based (the default Ubuntu repo has glusterfs 3.4; here's how to install 3.7):
# apt-get install lvm2 xfsprogs python-software-properties
# add-apt-repository ppa:gluster/glusterfs-3.7
# apt-get update
# apt-get install glusterfs-server
```

Ensure that the gluster\* packages are filtered out of automatic updates; upgrading GlusterFS while it is running can crash the bricks.

```
# grep ^exclude /etc/yum.conf
exclude=kernel* gluster*

## Ubuntu method:
# apt-mark hold glusterfs*
```

### Prepare Bricks

1. Partition block devices
2. Create the LVM foundation
3. Prepare volume bricks

The underlying bricks are a standard filesystem and mount point. However, make sure to mount each brick in such a way as to discourage any user from changing into the directory and writing to the underlying bricks themselves.

**Writing directly to a brick will corrupt your Volume!**

The bricks must be unique per node, and there should be a directory within the mount to use in volume creation. Attempting to create a replicated volume using the top level of the mounts results in an error with instructions to use a subdirectory.
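One way to avoid the corruption scenario above is to verify that the brick filesystem is actually mounted before using its subdirectory; `mountpoint` from util-linux performs the check. This is a minimal sketch of my own, not part of the original procedure (the helper name and paths are illustrative):

```shell
# brick_ready: succeed only if MOUNT is a real mount point and the
# brick subdirectory exists beneath it. Illustrative guard against
# accidentally writing to the unmounted underlying directory.
brick_ready() {
    mnt=$1 sub=$2
    path="${mnt%/}/$sub"
    if ! mountpoint -q "$mnt"; then
        echo "$mnt is not mounted; refusing to touch $path" >&2
        return 1
    fi
    if [ ! -d "$path" ]; then
        echo "missing brick directory $path" >&2
        return 1
    fi
    echo "$path ready"
}

# e.g. on a node from this document:
#   brick_ready /data/gluster/gvol0 brick1
```

Wiring a check like this into provisioning scripts keeps a failed mount from silently turning into writes on the root filesystem.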
```
all nodes:
# parted -s -- /dev/xvde mktable gpt
# parted -s -- /dev/xvde mkpart primary 2048s 100%
# parted -s -- /dev/xvde set 1 lvm on
# partx -a /dev/xvde
# pvcreate /dev/xvde1
# vgcreate vgglus1 /dev/xvde1

Logical Volumes
---------------
Standard LVM:
# lvcreate -l 100%VG -n gbrick1 vgglus1

For GlusterFS snapshot support:
# lvcreate -l 100%FREE --thinpool lv_thin vgglus1
# lvcreate --thin -V $(lvdisplay /dev/vgglus1/lv_thin | awk '/LV\ Size/ { print $3 }')G -n gbrick1 vgglus1/lv_thin

Filesystems for bricks
----------------------
For XFS bricks (recommended):
# mkfs.xfs -i size=512 /dev/vgglus1/gbrick1
# echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 xfs inode64,nobarrier 0 0' >> /etc/fstab
# mkdir -p /data/gluster/gvol0
# mount /data/gluster/gvol0

For ext4 bricks:
# mkfs.ext4 /dev/vgglus1/gbrick1
# echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 ext4 defaults,user_xattr,acl 0 0' >> /etc/fstab
# mkdir -p /data/gluster/gvol0
# mount /data/gluster/gvol0

glus1:
# mkdir -p /data/gluster/gvol0/brick1
glus2:
# mkdir -p /data/gluster/gvol0/brick1
glus3:
# mkdir -p /data/gluster/gvol0/brick1
glus4:
# mkdir -p /data/gluster/gvol0/brick1
```

## GlusterFS Setup

### Start glusterfsd daemon

The daemon can be restarted at runtime as well:

```
## Red Hat based:
# service glusterd start
# chkconfig glusterd on
```

### Build Peer Group

This is what is known as a **Trusted Storage Pool** in the GlusterFS world. Note that since early releases of version 3, you only need to probe all other nodes from glus1; the peer list is then automatically distributed to all peers from there.
```
glus1:
# gluster peer probe glus2
# gluster peer probe glus3
# gluster peer probe glus4
# gluster peer status

[root@gluster1 ~]# gluster pool list
UUID                                    Hostname        State
734aea4c-fc4f-4971-ba3d-37bd5d9c35b8    glus4           Connected
d5c9e064-c06f-44d9-bf60-bae5fc881e16    glus3           Connected
57027f23-bdf2-4a95-8eb6-ff9f936dc31e    glus2           Connected
e64c5148-8942-4065-9654-169e20ed6f20    localhost       Connected
```

### Volume Creation

We will set up basic auth restrictions allowing only our private subnet, because by default glusterd NFS allows global read/write during volume creation: glusterd automatically starts NFSd on each server and exports the volume through it from each of the nodes.

The reason for this behaviour is that in order to use the native client (FUSE) for mounting the volume, the clients have to run exactly the same version of the GlusterFS packages. If the versions differ, there might be differences in the hashing algorithms used by servers and clients, and the clients won't be able to connect.

#### Replicated Volume

This example creates replication across all 4 nodes - each node contains a copy of all data, and the size of the Volume is the size of a single brick. Notice how the info shows `1 x 4 = 4` in the output.
```
one node only:
# gluster volume create gvol0 replica 4 transport tcp \
  glus1:/data/gluster/gvol0/brick1 \
  glus2:/data/gluster/gvol0/brick1 \
  glus3:/data/gluster/gvol0/brick1 \
  glus4:/data/gluster/gvol0/brick1
# gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1
# gluster volume set gvol0 nfs.disable off
# gluster volume set gvol0 nfs.addr-namelookup off
# gluster volume set gvol0 nfs.export-volumes on
# gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.*
# gluster volume set gvol0 performance.io-thread-count 32
# gluster volume start gvol0

[root@gluster1 ~]# gluster volume info gvol0

Volume Name: gvol0
Type: Replicate
Volume ID: 65ece3b3-a4dc-43f8-9b0f-9f39c7202640
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.3.*
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
performance.io-thread-count: 32
```

#### Distributed-Replicated Volume

This example creates distributed replication across 2x2 nodes - each pair of nodes holds one copy of the data, and the size of the Volume is the size of two bricks. Notice how the info shows `2 x 2 = 4` in the output.
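The brick arithmetic described in the Overview reduces to one formula: usable capacity is brick size times the number of bricks divided by the `replica` value. A quick sketch (the function name is mine; the 20G brick size matches this document's setup):

```shell
# usable_gb: usable capacity in GB of a volume, given the number of
# bricks, the replica value, and the per-brick size in GB.
# (bricks / replica) is the number of distribution sets; each set
# contributes one brick's worth of usable space.
usable_gb() {
    bricks=$1 replica=$2 brick_gb=$3
    echo $(( bricks / replica * brick_gb ))
}

usable_gb 4 4 20   # replicated 1x4          -> 20
usable_gb 4 2 20   # dist-replicated 2x2     -> 40
usable_gb 6 2 20   # dist-replicated 3x2     -> 60
```

This mirrors the `N x M = total` line that `gluster volume info` prints: N distribution sets of M replicas each.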
```
one node only:
# gluster volume create gvol0 replica 2 transport tcp \
  glus1:/data/gluster/gvol0/brick1 \
  glus2:/data/gluster/gvol0/brick1 \
  glus3:/data/gluster/gvol0/brick1 \
  glus4:/data/gluster/gvol0/brick1
# gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1
# gluster volume set gvol0 nfs.disable off
# gluster volume set gvol0 nfs.addr-namelookup off
# gluster volume set gvol0 nfs.export-volumes on
# gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.*
# gluster volume set gvol0 performance.io-thread-count 32
# gluster volume start gvol0

[root@gluster1 ~]# gluster volume info gvol0

Volume Name: gvol0
Type: Distributed-Replicate
Volume ID: d883f891-e38b-4565-8487-7e50ca33dbd4
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.3.*
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
performance.io-thread-count: 32
```

## Volume Deletion

After ensuring that no clients (either local or remote) are mounting the Volume, stop the Volume and delete it:

```
# gluster volume stop gvol0
# gluster volume delete gvol0
```

### Clearing Bricks

If bricks were used in a volume and need to be removed, there is an attribute that GlusterFS has set on the brick subdirectories. It must be cleared before the bricks can be reused - or the subdirectory can be deleted and recreated.
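Because the same three commands repeat on every node, the cleanup can also be scripted. The sketch below is a dry run that only prints each node's commands; to execute for real, you could feed each command to `ssh "$h"` instead of `echo` (the helper name is mine, and passwordless ssh between nodes is assumed for that variant):

```shell
# clear_brick_cmds: print (dry run) the attribute-clearing commands
# for one brick path on each named node. Swap the echo for
#   ssh "$h" "$cmd"
# to actually run them remotely.
clear_brick_cmds() {
    brick=$1; shift
    for h in "$@"; do
        for cmd in \
            "setfattr -x trusted.glusterfs.volume-id $brick" \
            "setfattr -x trusted.gfid $brick" \
            "rm -rf $brick/.glusterfs"; do
            echo "$h: $cmd"
        done
    done
}

clear_brick_cmds /data/gluster/gvol0/brick1 glus1 glus2 glus3 glus4
```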
```
glus1:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus2:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus3:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus4:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs

...or just delete all the data:

glus1:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
glus2:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
glus3:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
glus4:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
```

### Adding Bricks

Additional bricks can be added to a running Volume easily:

```
# gluster volume add-brick gvol0 glus5:/data/gluster/gvol0/brick1
```

The add-brick command can also be used to change the layout of your volume, for example to change a 2-node Distributed volume into a 4-node Distributed-Replicated volume. After such an operation, you **must rebalance** your volume: new files will be created on the new nodes automatically, but the old ones will not be moved.

```
# gluster volume add-brick gvol0 replica 2 \
  glus5:/data/gluster/gvol0/brick1 \
  glus6:/data/gluster/gvol0/brick1
# gluster volume rebalance gvol0 start
# gluster volume rebalance gvol0 status

## If needed (something didn't work right):
# gluster volume rebalance gvol0 stop
```

> When expanding distributed replicated and distributed striped volumes, you must add a number of bricks that is a multiple of the replica or stripe count.
> For example, to expand a distributed replicated volume with a replica count of 2, you need to add bricks in multiples of 2 (such as 4, 6, 8, etc.):
>
> ```
> # gluster volume add-brick gvol0 \
>   glus5:/data/gluster/gvol0/brick1 \
>   glus6:/data/gluster/gvol0/brick1
> ```

### Volume Options

To view configured volume options:

```
# gluster volume info gvol0

Volume Name: gvol0
Type: Replicate
Volume ID: bcbfc645-ebf9-4f83-b9f0-2a36d0b1f6e3
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
performance.cache-size: 1073741824
performance.io-thread-count: 64
cluster.choose-local: on
nfs.rpc-auth-allow: 192.168.3.*,127.0.0.1
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
```

To set an option for a volume, use the `set` keyword:

```
# gluster volume set gvol0 performance.write-behind off
volume set: success
```

To reset an option to its default, use the `reset` keyword:

```
# gluster volume reset gvol0 performance.read-ahead
volume reset: success: reset volume successful
```

## Client Mounts

From a client perspective, the GlusterFS Volume can be mounted in two fundamental ways:

1. FUSE Client
2. NFS Client

### FUSE Client

The FUSE client allows the mount to happen with a GlusterFS "round robin" style connection. In /etc/fstab, the name of one node is used; however, internal mechanisms allow that node to fail and the clients to roll over to other connected nodes in the Trusted Storage Pool. The performance is slightly lower than the NFS method in testing, though not drastically so - the gain is automatic HA client failover, which is typically worth the performance hit.
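The failover behaviour relies on the `backup-volfile-servers` mount option, which accepts a colon-separated list, so every remaining node can be listed rather than just one. The sketch below only prints a candidate fstab line; the helper name is mine, while the volume and mount point come from this document's setup:

```shell
# fuse_fstab_line: print an fstab entry for a GlusterFS FUSE mount
# with all remaining nodes listed as backup volfile servers
# (colon-separated, as the mount option expects).
fuse_fstab_line() {
    primary=$1 vol=$2 mnt=$3
    shift 3
    backups=$(IFS=:; echo "$*")   # join remaining args with ':'
    echo "$primary:/$vol $mnt glusterfs defaults,_netdev,backup-volfile-servers=$backups 0 0"
}

fuse_fstab_line glus1 gvol0 /mnt/gluster/gvol0 glus2 glus3 glus4
```

Appending the printed line to /etc/fstab gives the client three fallback nodes instead of the single backup shown below.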
```
## RPM based:
# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
# yum -y install glusterfs glusterfs-fuse

## Ubuntu based (glusterfs-client 3.4 works with glusterfs-server 3.5, but for the most recent version do this):
# add-apt-repository ppa:gluster/glusterfs-3.7
# apt-get update
# apt-get install glusterfs-client

## Common:
# vi /etc/hosts
192.168.3.2 glus1
192.168.3.4 glus2
192.168.3.1 glus3
192.168.3.3 glus4

# modprobe fuse
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 glusterfs defaults,_netdev,backup-volfile-servers=glus2 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
```

### NFS Client

The standard Linux NFSv3 client tools are used to mount one of the GlusterFS nodes. Performance is typically a little better than with the FUSE client; however, the downside is that the connection is 1-to-1 - if the GlusterFS node goes down, the client will not round-robin out to another node. A separate solution such as HAProxy/keepalived or a load balancer has to be added to provide a floating IP in this use case.

```
## RPM based:
# yum -y install rpcbind nfs-utils
# service rpcbind restart; chkconfig rpcbind on
# service nfslock restart; chkconfig nfslock on

## Ubuntu:
# apt-get install nfs-common

## Common:
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 nfs rsize=4096,wsize=4096,hard,intr 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
```