GlusterFS Build Steps
Contents
- Overview
- Prerequisites
- Build Document Setup
- Node Prep
- GlusterFS Setup
- Volume Deletion
- Client Mounts
Overview
Prior to starting work, a fundamental decision must be made - what type of Volume(s) need to be used for the given scenario. While 6 methods exist, two are used most often to achieve different results:
- Replicated: This type of Volume provides file replication across multiple bricks. It is the best choice for environments where High Availability and High Reliability are critical, and for cases where you self-mount the volume on every node, such as a webserver DocumentRoot; the GlusterFS nodes are then their own clients.
- Files are copied to each brick in the volume, similar to a RAID-1; however, you can have 3 or more bricks, including an odd number. Usable space is the size of one brick, and all files written to one brick are replicated to all others. This makes the most sense if you are going to self-mount the GlusterFS volume, for instance as the web docroot (/var/www) or similar where all files must reside on that node. The value passed to replica is the same as the number of nodes in the volume.
- Distributed-Replicated: In this scenario files are distributed across replicated sets of bricks in the volume. Use this type of volume where you need to scale storage while maintaining high availability. Volumes of this type also offer improved read performance in most environments, and they are the most common type of volume used when clients are external to the GlusterFS nodes themselves.
- Somewhat like a RAID-10, an even number of bricks must be used; usable space is the combined size of all bricks divided by the replica value. For example, with 4 bricks of 20G and replica 2, files distribute across 2 replica pairs (40G usable) and each file is replicated to 2 nodes. With 6 bricks of 20G and replica 3, files distribute across 2 sets (40G usable) and replicate to 3 nodes; with replica 2 they distribute across 3 pairs (60G usable) and replicate to 2 nodes. Use this type when your clients are external to the cluster, not local self-mounts.
All of the fundamental work in this document is identical for both types; the only difference is the replica value passed in the single volume-creation step, as illustrated below. Stripe-based volumes are not covered here.
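As a quick illustration of that one differing step (the full creation commands appear in the GlusterFS Setup section), only the replica value changes for the same four bricks:
## Replicated (usable space = one brick):
# gluster volume create gvol0 replica 4 transport tcp \
glus1:/data/gluster/gvol0/brick1 glus2:/data/gluster/gvol0/brick1 \
glus3:/data/gluster/gvol0/brick1 glus4:/data/gluster/gvol0/brick1
## Distributed-Replicated (usable space = two bricks):
# gluster volume create gvol0 replica 2 transport tcp \
glus1:/data/gluster/gvol0/brick1 glus2:/data/gluster/gvol0/brick1 \
glus3:/data/gluster/gvol0/brick1 glus4:/data/gluster/gvol0/brick1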
Prerequisites
- 2 or more servers with separate Storage
- Private network between servers
Build Document Setup
This build document uses the following setup, which can be stood up easily; using Cloud block devices is no different from VMware vDisks, SAN/DAS LUNs, iSCSI, etc.
- 4x Performance 1 Tier 2 Rackspace Cloud servers - a 20G /dev/xvde ready to use for each brick
- 1x Cloud Private Network on 192.168.3.0/24 for GlusterFS communication
- GlusterFS 3.7 installed from Vendor package repository
Node Prep
- Configure /etc/hosts and iptables
- Install base toolset(s)
- Install GlusterFS software
- Connect GlusterFS nodes
Configure /etc/hosts and iptables
In lieu of using DNS, we prepare /etc/hosts on every machine to ensure the nodes can talk to each other. All servers have glusterN as their hostname, so we'll use glusN for the private communication layer between nodes.
# vi /etc/hosts
192.168.3.2 glus1
192.168.3.4 glus2
192.168.3.1 glus3
192.168.3.3 glus4
# ping -c2 glus1; ping -c2 glus2; ping -c2 glus3; ping -c2 glus4
## Red Hat oriented:
# vi /etc/sysconfig/iptables
-A INPUT -s 192.168.3.0/24 -j ACCEPT
# service iptables restart
## Debian oriented:
# vi /etc/iptables/rules.v4
-A INPUT -s 192.168.3.0/24 -j ACCEPT
# service iptables-persistent restart
Granular iptables
The above generic iptables rule opens all ports to the subnet; if more granular setup is required:
- 111 - portmap / rpcbind
- 24007 - GlusterFS Daemon
- 24008 - GlusterFS Management
- 38465 to 38467 - Required for GlusterFS NFS service
- 24009 to +X - GlusterFS versions less than 3.4, OR
- 49152 to +X - GlusterFS versions 3.4 and later
Each brick for every volume on the host requires its own port. For every new brick, one new port is used, starting at 24009 for GlusterFS versions below 3.4 and at 49152 for version 3.4 and above.
Example: If you have one volume with two bricks, you will need to open 24009 - 24010, or 49152 - 49153.
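As a sketch for this document's layout (GlusterFS 3.7 with one brick per node, so a single brick port at 49152; adjust the range to your brick count), the granular Red Hat-style rules might look like this, mirroring the port list above:
# vi /etc/sysconfig/iptables
-A INPUT -p tcp -s 192.168.3.0/24 --dport 111 -j ACCEPT
-A INPUT -p udp -s 192.168.3.0/24 --dport 111 -j ACCEPT
-A INPUT -p tcp -s 192.168.3.0/24 --dport 24007:24008 -j ACCEPT
-A INPUT -p tcp -s 192.168.3.0/24 --dport 38465:38467 -j ACCEPT
-A INPUT -p tcp -s 192.168.3.0/24 --dport 49152 -j ACCEPT
# service iptables restart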
Install Packages
- Install the basic packages for partitioning, LVM2 and XFS
- Install the GlusterFS repository and glusterfs* packages
- Disable automatic updates of gluster* packages
Some of the required packages may already be installed on the cluster nodes.
## YUM/RPM Based:
# yum -y install parted lvm2 xfsprogs
# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
# yum -y install glusterfs glusterfs-fuse glusterfs-server
## Ubuntu based (Default Ubuntu repo has glusterfs 3.4, here's how to install 3.7):
# apt-get install lvm2 xfsprogs python-software-properties
# add-apt-repository ppa:gluster/glusterfs-3.7
# apt-get update
# apt-get install glusterfs-server
Ensure that the gluster* packages are filtered out of automatic updates; upgrading them while the service is running can crash the bricks.
# grep ^exclude /etc/yum.conf
exclude=kernel* gluster*
## Ubuntu method:
# apt-mark hold glusterfs*
Prepare Bricks
- Partition block devices
- Create LVM foundation
- Prepare volume bricks
The underlying bricks are a standard filesystem and mount point. However, mount each brick in such a way as to discourage anyone from changing into the directory and writing to the underlying brick filesystem directly. Writing directly to a brick will corrupt your Volume!
The bricks must be unique per node, and there should be a directory within each mount point to use in volume creation. Attempting to create a replicated volume using the top level of the mount points results in an error with instructions to use a subdirectory.
all nodes:
# parted -s -- /dev/xvde mktable gpt
# parted -s -- /dev/xvde mkpart primary 2048s 100%
# parted -s -- /dev/xvde set 1 lvm on
# partx -a /dev/xvde
# pvcreate /dev/xvde1
# vgcreate vgglus1 /dev/xvde1
Logical Volumes
---------------
Standard LVM:
# lvcreate -l 100%VG -n gbrick1 vgglus1
For GlusterFS snapshot support:
# lvcreate -l 100%FREE --thinpool lv_thin vgglus1
# lvcreate --thin -V $(lvdisplay /dev/vgglus1/lv_thin | awk '/LV\ Size/ { print $3 }')G -n gbrick1 vgglus1/lv_thin
Filesystems for bricks
----------------------
For XFS bricks: (recommended)
# mkfs.xfs -i size=512 /dev/vgglus1/gbrick1
# echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 xfs inode64,nobarrier 0 0' >> /etc/fstab
# mkdir -p /data/gluster/gvol0
# mount /data/gluster/gvol0
For ext4 bricks:
# mkfs.ext4 /dev/vgglus1/gbrick1
# echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 ext4 defaults,user_xattr,acl 0 0' >> /etc/fstab
# mkdir -p /data/gluster/gvol0
# mount /data/gluster/gvol0
glus1:
# mkdir -p /data/gluster/gvol0/brick1
glus2:
# mkdir -p /data/gluster/gvol0/brick1
glus3:
# mkdir -p /data/gluster/gvol0/brick1
glus4:
# mkdir -p /data/gluster/gvol0/brick1
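Before moving on, it is worth confirming on each node that the brick filesystem is mounted and the subdirectory exists (an optional check):
# df -h /data/gluster/gvol0
# ls -ld /data/gluster/gvol0/brick1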
GlusterFS Setup
Start glusterd daemon
Start the glusterd management daemon on every node and enable it at boot; it can also be safely restarted at runtime:
## Red Hat based:
# service glusterd start
# chkconfig glusterd on
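On Debian/Ubuntu the equivalent is sketched below; the service name assumed here is glusterfs-server, which may vary by package version:
## Debian/Ubuntu based:
# service glusterfs-server start
# update-rc.d glusterfs-server defaults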
Build Peer Group
This is what's known as a Trusted Storage Pool in the GlusterFS world. Note that since the early 3.x releases, you only need to probe all of the other nodes from glus1; the peer list is then distributed automatically to every peer.
glus1:
# gluster peer probe glus2
# gluster peer probe glus3
# gluster peer probe glus4
# gluster peer status
[root@gluster1 ~]# gluster pool list
UUID Hostname State
734aea4c-fc4f-4971-ba3d-37bd5d9c35b8 glus4 Connected
d5c9e064-c06f-44d9-bf60-bae5fc881e16 glus3 Connected
57027f23-bdf2-4a95-8eb6-ff9f936dc31e glus2 Connected
e64c5148-8942-4065-9654-169e20ed6f20 localhost Connected
Volume Creation
By default, glusterd starts an NFS server on each node at Volume creation and exports the volume through it with global read/write access, so we set basic auth restrictions to allow only our private subnet. The reason glusterd also exports the volume over NFS is that mounting with the native (FUSE) client requires the clients to run exactly the same version of the GlusterFS packages as the servers; if the versions differ, the hashing algorithms used by servers and clients may not match and the clients won't be able to connect.
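A quick way to confirm that servers and any FUSE clients run the same GlusterFS version (an optional check, not part of the original steps):
# glusterfs --version
## or query the package manager:
# rpm -qa 'glusterfs*'      ## RPM based
# dpkg -l 'glusterfs*'      ## Ubuntu based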
Replicated Volume
This example will create replication to all 4 nodes - each node contains a copy of all data and the size of the Volume is the size of a single brick. Notice how the info shows 1 x 4 = 4 in the output.
one node only:
# gluster volume create gvol0 replica 4 transport tcp \
glus1:/data/gluster/gvol0/brick1 \
glus2:/data/gluster/gvol0/brick1 \
glus3:/data/gluster/gvol0/brick1 \
glus4:/data/gluster/gvol0/brick1
# gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1
# gluster volume set gvol0 nfs.disable off
# gluster volume set gvol0 nfs.addr-namelookup off
# gluster volume set gvol0 nfs.export-volumes on
# gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.*
# gluster volume set gvol0 performance.io-thread-count 32
# gluster volume start gvol0
[root@gluster1 ~]# gluster volume info gvol0
Volume Name: gvol0
Type: Replicate
Volume ID: 65ece3b3-a4dc-43f8-9b0f-9f39c7202640
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.3.*
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
performance.io-thread-count: 32
Distributed-Replicated Volume
This example will create distributed replication across 2x2 nodes - each pair of nodes contains the data, and the size of the Volume is the size of two bricks. Notice how the info shows 2 x 2 = 4 in the output.
one node only:
# gluster volume create gvol0 replica 2 transport tcp \
glus1:/data/gluster/gvol0/brick1 \
glus2:/data/gluster/gvol0/brick1 \
glus3:/data/gluster/gvol0/brick1 \
glus4:/data/gluster/gvol0/brick1
# gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1
# gluster volume set gvol0 nfs.disable off
# gluster volume set gvol0 nfs.addr-namelookup off
# gluster volume set gvol0 nfs.export-volumes on
# gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.*
# gluster volume set gvol0 performance.io-thread-count 32
# gluster volume start gvol0
[root@gluster1 ~]# gluster volume info gvol0
Volume Name: gvol0
Type: Distributed-Replicate
Volume ID: d883f891-e38b-4565-8487-7e50ca33dbd4
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.3.*
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
performance.io-thread-count: 32
Volume Deletion
After ensuring that no clients (either local or remote) are mounting the Volume, stop the Volume and delete it.
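Before stopping, you can confirm that no clients remain connected (an optional check; the clients subcommand is available in recent 3.x releases):
# gluster volume status gvol0 clients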
# gluster volume stop gvol0
# gluster volume delete gvol0
Clearing Bricks
If bricks were used in a volume and need to be reused, note that GlusterFS sets extended attributes on the brick subdirectories. These must be cleared before the bricks can be reused, or the subdirectory can simply be deleted and recreated.
glus1:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus2:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus3:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus4:
# setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
# setfattr -x trusted.gfid /data/gluster/gvol0/brick1
# rm -rf /data/gluster/gvol0/brick1/.glusterfs
...or just deleting all data:
glus1:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
glus2:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
glus3:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
glus4:
# rm -rf /data/gluster/gvol0/brick1
# mkdir /data/gluster/gvol0/brick1
Adding Bricks
Additional bricks can be added to a running Volume easily:
# gluster volume add-brick gvol0 glus5:/data/gluster/gvol0/brick1
The add-brick command can also be used to change the layout of your volume, for example to change a 2-node Distributed volume into a 4-node Distributed-Replicated volume. After such an operation, you must rebalance the volume. New files are automatically created on the new nodes, but old files are not moved until a rebalance is run.
# gluster volume add-brick gvol0 replica 2 \
glus5:/data/gluster/gvol0/brick1 \
glus6:/data/gluster/gvol0/brick1
# gluster volume rebalance gvol0 start
# gluster volume rebalance gvol0 status
## If needed (something didn't work right):
# gluster volume rebalance gvol0 stop
When expanding distributed replicated and distributed striped volumes, you must add a number of bricks that is a multiple of the replica or stripe count. For example, to expand a distributed replicated volume with a replica count of 2, you need to add bricks in multiples of 2 (such as 4, 6, 8, etc.):
# gluster volume add-brick gvol0 \
glus5:/data/gluster/gvol0/brick1 \
glus6:/data/gluster/gvol0/brick1
Volume Options
To view configured volume options:
# gluster volume info gvol0
Volume Name: gvol0
Type: Replicate
Volume ID: bcbfc645-ebf9-4f83-b9f0-2a36d0b1f6e3
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
performance.cache-size: 1073741824
performance.io-thread-count: 64
cluster.choose-local: on
nfs.rpc-auth-allow: 192.168.3.*,127.0.0.1
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
To set an option for a volume, use the set keyword like so:
# gluster volume set gvol0 performance.write-behind off
volume set: success
To clear an option to a Volume back to defaults, use the reset keyword like so:
# gluster volume reset gvol0 performance.read-ahead
volume reset: success: reset volume successful
Client Mounts
From a client perspective the GlusterFS Volume can be mounted in two fundamental ways:
- FUSE Client
- NFS Client
FUSE Client
The FUSE client allows the mount to happen with a GlusterFS "round robin" style connection; in /etc/fstab the name of only one node is used, but internal mechanisms allow that node to fail and the client rolls over to other connected nodes in the Trusted Storage Pool. Performance is slightly lower than the NFS method in testing, though not drastically so; the gain is automatic HA failover on the client side, which is typically worth the small performance hit.
## RPM based:
# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
# yum -y install glusterfs glusterfs-fuse
## Ubuntu based (the stock glusterfs-client 3.4 works with glusterfs-server 3.5, but for the most recent version do this):
# add-apt-repository ppa:gluster/glusterfs-3.7
# apt-get update
# apt-get install glusterfs-client
##
## Common:
# vi /etc/hosts
192.168.3.2 glus1
192.168.3.4 glus2
192.168.3.1 glus3
192.168.3.3 glus4
# modprobe fuse
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 glusterfs defaults,_netdev,backup-volfile-servers=glus2 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
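To confirm the FUSE mount is active (an optional check; the filesystem type should show as fuse.glusterfs):
# df -hT /mnt/gluster/gvol0
# mount | grep glusterfs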
NFS Client
The standard Linux NFSv3 client tools are used to mount one of the GlusterFS nodes; performance is typically a little better than with the FUSE client, but the downside is that the connection is 1-to-1: if that GlusterFS node goes down, the client will not round-robin out to another node. A separate solution such as HAProxy/keepalived or a load balancer has to be added to provide a floating IP in front of the nodes for this use case.
## RPM based:
# yum -y install rpcbind nfs-utils
# service rpcbind restart; chkconfig rpcbind on
# service nfslock restart; chkconfig nfslock on
## Ubuntu:
# apt-get install nfs-common
##
## Common:
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 nfs rsize=4096,wsize=4096,hard,intr 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
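To verify the export and the resulting mount (optional; assumes nfs.disable is off, as set during volume creation):
# showmount -e glus1
# df -hT /mnt/gluster/gvol0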