diff --git a/INDEX.md b/INDEX.md
new file mode 100644
index 0000000..ce0ab91
--- /dev/null
+++ b/INDEX.md
@@ -0,0 +1,49 @@
+# Index
+
+ - [Active Directory with Winbind](md/active_directory_with_winbind.md)
+ - [Arch UEFI Installation](md/arch_uefi_installation.md)
+ - [CIFS Client Setup](md/cifs_client_setup.md)
+ - [Compose Key Sequences](md/compose_key_sequences.md)
+ - [Conversion Table](md/conversion_table.md)
+ - [Debian BOINC Client](md/debian_boinc_client.md)
+ - [Debian Server Setup](md/debian_server_setup.md)
+ - [Debian Tor Relay](md/debian_tor_relay.md)
+ - [Device Mapper Mechanics](md/device_mapper_mechanics.md)
+ - [Device Mapper Multipath](md/device_mapper_multipath.md)
+ - [DRBD Build Steps](md/drbd_build_steps.md)
+ - [Fonts and Linux](md/fonts_and_linux.md)
+ - [GlusterFS Build Steps](md/glusterfs_build_steps.md)
+ - [Grub 2 Info](md/grub_2_info.md)
+ - [Jumbo Frames](md/jumbo_frames.md)
+ - [Kernel Module Weak Updates](md/kernel_module_weak_updates.md)
+ - [Linux Partitioning](md/linux_partitioning.md)
+ - [Linux x86 Storage](md/linux_x86_storage.md)
+ - [LVM Mechanics](md/lvm_mechanics.md)
+ - [LVM Snapshot Merging](md/lvm_snapshot_merging.md)
+ - [MongoDB Basics](md/mongodb_basics.md)
+ - [Network Quick Reference](md/network_quick_reference.md)
+ - [NFS Debugging](md/nfs_debugging.md)
+ - [NFS Setup](md/nfs_setup.md)
+ - [Oracle Environment](md/oracle_environment.md)
+ - [RAID Penalties](md/raid_penalties.md)
+ - [Reducing the root LV](md/reducing_the_root_lv.md)
+ - [RHCS Mechanics](md/rhcs_mechanics.md)
+ - [RHEL7 Networking](md/rhel7_networking.md)
+ - [SCSI Sense Data](md/scsi_sense_data.md)
+ - [SMTP Relay](md/smtp_relay.md)
+ - [Stunnel Setup](md/stunnel_setup.md)
+ - [System Backup](md/system_backup.md)
+ - [systemd Mechanics](md/systemd_mechanics.md)
+ - [Tomcat Configuration](md/tomcat_configuration.md)
+ - [Tomcat Logging](md/tomcat_logging.md)
+ - [Tomcat Mechanics](md/tomcat_mechanics.md)
+ - [Tomcat Packaging](md/tomcat_packaging.md)
+ - [Tuning nf conntrack](md/tuning_nf_conntrack.md)
+ - [Understanding Swap Use](md/understanding_swap_use.md)
+ - [Vsftpd Setup](md/vsftpd_setup.md)
+ - [XFS Info](md/xfs_info.md)
+
+# Licenses
+
+ * SPDX-License-Identifier: CC-BY-SA-4.0
+ * SPDX-License-Identifier: MIT
diff --git a/md/active_directory_with_winbind.md b/md/active_directory_with_winbind.md
new file mode 100644
index 0000000..ed12506
--- /dev/null
+++ b/md/active_directory_with_winbind.md
@@ -0,0 +1,313 @@
+# Active Directory with Winbind
+
+## Contents
+
+ - [Prerequisites](#prerequisites)
+ - [AD Setup Information](#ad-setup-information)
+ - [Implementation](#implementation)
+ - [Install RPMs](#install-rpms)
+ - [DNS Configuration](#dns-configuration)
+ - [Configure Kerberos](#configure-kerberos)
+ - [Get a Kerberos ticket](#get-a-kerberos-ticket)
+ - [List the ticket provided](#list-the-ticket-provided)
+ - [Destroy the ticket](#destroy-the-ticket)
+ - [Samba Configuration](#samba-configuration)
+ - [Join the domain](#join-the-domain)
+ - [Configure winbind authentication](#configure-winbind-authentication)
+ - [PAM Configuration](#pam-configuration)
+ - [RHEL5 and RHEL6](#rhel5-and-rhel6)
+ - [RHEL6 Only](#rhel6-only)
+ - [Parent Home Directory](#parent-home-directory)
+ - [Testing](#testing)
+ - [Cached Logins](#cached-logins)
+ - [User crontabs](#user-crontabs)
+ - [References](#references)
+
+
+## Prerequisites
+
+### AD Setup Information
+
+Needed information:
+
+ - NetBIOS name of one or more domain controllers
+ - DNS IPs of the same servers, which resolve 
lookups
+ - Admin level user already in the AD
+
+Examples used in this article:
+
+ - AD1.DOMAIN.LOCAL, AD2.DOMAIN.LOCAL
+ - 192.168.100.10, 192.168.100.20
+ - 'admin'
+
+
+## Implementation
+
+### Install RPMs
+
+Standard YUM install:
+
+```
+RHEL6:
+# yum install samba-winbind samba-winbind-clients krb5-workstation krb5-libs
+
+RHEL5:
+# yum install samba3x-winbind samba3x-client krb5-workstation krb5-libs
+```
+
+Notes:
+
+ - krb5-workstation adds `/usr/kerberos/bin` to your `$PATH`; you may need to log in again so that `kinit` and the other tools are found
+ - RHEL5 `winbind` is not Windows 2008R2+ friendly. Use `winbind3x` (samba3x) RPMs instead
+
+### DNS Configuration
+
+```
+/etc/resolv.conf
+
+nameserver 192.168.100.10
+nameserver 192.168.100.20
+search DOMAIN.LOCAL
+```
+
+### Configure Kerberos
+
+```
+/etc/krb5.conf
+
+[logging]
+ default = FILE:/var/log/krb5libs.log
+ kdc = FILE:/var/log/krb5kdc.log
+ admin_server = FILE:/var/log/kadmind.log
+
+[libdefaults]
+ default_realm = DOMAIN.LOCAL
+ dns_lookup_realm = false
+ dns_lookup_kdc = false
+ ticket_lifetime = 24h
+ forwardable = yes
+
+[realms]
+ DOMAIN.LOCAL = {
+  kdc = AD1.DOMAIN.LOCAL:88
+  kdc = AD2.DOMAIN.LOCAL:88
+  admin_server = AD1.DOMAIN.LOCAL:749
+  admin_server = AD2.DOMAIN.LOCAL:749
+ }
+
+[domain_realm]
+ .domain.local = DOMAIN.LOCAL
+ domain.local = DOMAIN.LOCAL
+
+[appdefaults]
+ pam = {
+  debug = false
+  ticket_lifetime = 36000
+  renew_lifetime = 36000
+  forwardable = true
+  krb4_convert = false
+ }
+```
+
+#### Get a Kerberos ticket
+
+```
+# kinit admin@DOMAIN.LOCAL
+```
+
+#### List the ticket provided
+
+```
+# klist
+```
+
+#### Destroy the ticket
+
+```
+# kdestroy
+```
+
+
+### Samba Configuration
+
+```
+/etc/samba/smb.conf
+
+[global]
+ workgroup = DOMAIN
+ interfaces = 127.0.0.1 eth0
+ bind interfaces only = true
+ security = ads
+ passdb backend = tdbsam
+ template shell = /bin/bash
+ template homedir = /home/%D/%U
+ realm = DOMAIN.LOCAL
+ password server = AD1.DOMAIN.LOCAL, AD2.DOMAIN.LOCAL
+ winbind use default domain = yes
+ winbind enum users = yes
+ winbind enum groups = yes
+ winbind refresh tickets = yes
+ idmap uid = 16777216-33554431
+ idmap gid = 16777216-33554431
+ printing = cups
+ printcap name = cups
+ load printers = no
+```
+
+If required you can assign a name to the server. This is useful since NT limits server names to 15 characters. Just add:
+
+```
+netbios name = MYNTNAME
+```
+
+#### Join the domain
+
+```
+# net ads join -U admin
+```
+
+Example session:
+
+```
+# net ads join -U admin
+ Enter admin's password:
+ Using short domain name -- DOMAIN
+ Joined 'MYSERVER' to realm 'domain.local'
+ [2012/03/04 06:06:06.123456, 0] libads/kerberos.c:333(ads_kinit_password)
+ kerberos_kinit_password MYSERVER$@DOMAIN.LOCAL failed: Client not found in Kerberos database
+ DNS update failed!
+```
+
+> This error message is expected: the server joined the domain, but the AD DNS record for your server was not updated.
+
+#### Configure winbind authentication
+
+```
+# authconfig-tui
+```
+
+1. Select **Use Winbind** under the User Information section
+2. Select **Use MD5 Passwords** under the Authentication section
+3. Select **Use Shadow Passwords** under the Authentication section
+4. Select **Use Winbind Authentication** under the Authentication section
+5. Select **Local Authentication is sufficient** under the Authentication section
+6. Click **Next**
+7. Click **OK** (*not* Join Domain\!) 
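+
+The same selections can also be applied non-interactively, which is handy for scripted builds; a sketch using the `authconfig` command line (flag names per RHEL5/RHEL6):
+
+```
+# mirrors the authconfig-tui choices above in one shot
+authconfig --enablewinbind --enablewinbindauth --enablemd5 --enableshadow \
+           --enablelocauthorize --update
+```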
+
+
+### PAM Configuration
+
+The system may need to be updated to make two configuration changes; it's possible, however, that one or both of these are already taken care of. The first change is to update the existing line for `pam_winbind.so` and add extra config; the second is to add/update the `pam_mkhomedir.so` line to have the user's home directory create itself.
+
+> Whenever editing PAM config files, ALWAYS test logins in a second terminal before you log out of the editing session. Breaking a PAM config file can cause _root_ to be locked out and require single-user mode to rescue.
+
+#### RHEL5 and RHEL6
+
+This is a pseudo diff of the changes to be made; examine the existing file and apply only the needed values as shown.
+
+```
+/etc/pam.d/system-auth
+
+< auth sufficient pam_winbind.so use_first_pass
+---
+> auth sufficient pam_winbind.so krb5_auth krb5_ccache_type=FILE use_first_pass
+
+> session required pam_mkhomedir.so skel=/etc/skel umask=0022
+```
+
+#### RHEL6 Only
+
+**Configure**
+
+In RHEL5/CentOS5 all the various other PAM configuration files _sub-include_ system-auth; in RHEL6 this was split out into two different files; some sub-include `system-auth` (like sudo), some sub-include `password-auth` (like sshd). Changing both files is required.
+
+```
+/etc/pam.d/password-auth
+
+make the exact same changes as outlined above
+```
+
+### Parent Home Directory
+
+All DOMAIN homedirs will be created below this dir by pam\_mkhomedir.so (via smb.conf `template homedir` variable):
+
+```
+# mkdir /home/DOMAIN
+# chcon --reference=/home /home/DOMAIN
+```
+
+
+## Testing
+
+Test the basics:
+
+```
+# wbinfo -u
+# wbinfo -g
+# ssh DOMAIN\\admin@localhost
+```
+
+> Due to an interesting conflict between the presence of local user 'admin' in `/etc/passwd` (with /home/admin defined) and the attempt to use /home/DOMAIN/admin during a DOMAIN login you can get curious permission denied results. It's best to test DOMAIN logins with a username _other_ than one that exists in /etc/passwd on the local machine to avoid the DOMAIN login conflict with pam\_mkhomedir.
+
+
+## Cached Logins
+
+The pam\_winbind.so module supports cached logins - this can be handy if the Active Directory server(s) become unavailable: you'll still be able to log into Linux. It is very useful to set an explicit cache time, otherwise the cache seems _not_ to be updated regardless of the default value (300 secs). The `winbind cache time` parameter specifies the number of seconds winbindd will cache user and group information before querying an AD server again.
+
+In the same global section as defined above, add a new directive as shown:
+
+```
+/etc/samba/smb.conf
+
+[global]
+ ...
+ winbind offline logon = yes
+ winbind cache time = 600
+ ...
+```
+
+It's possible that this file may not exist; create it if needed:
+
+```
+/etc/security/pam_winbind.conf
+
+[global]
+ cached_login = yes
+```
+
+Perform a standard Winbind restart and test things out:
+
+```
+# service winbind restart
+# smbcontrol winbind offline
+
+# wbinfo --online-status
+ BUILTIN : online
+ MYSERVER : online
+ DOMAIN : offline
+
+# ssh DOMAIN\\username@localhost
+ Domain Controller unreachable, using cached credentials instead. Network resources may be unavailable
+ ...
+
+# smbcontrol winbind online
+```
+
+Some items - such as groups - don't get added to the cache until there is a successful login when things are in online mode; this may affect tools like sudo or sshd if they are configured to allow/restrict access based on group level membership. This can be fixed with the use of `winbind cache time` as noted above; your exact situation will determine any further tweaks needed to fully support offline access in an emergency. If required, the cache can be deleted by removing the `/var/lib/samba/*.tdb` files.
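+
+A minimal sketch of that cache reset, assuming the default tdb location (stop winbind first so the files are not in use; they are regenerated on start):
+
+```
+service winbind stop
+rm -f /var/lib/samba/*.tdb
+service winbind start
+```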
+
+
+## User crontabs
+
+There is a problem with the vixie-cron (RHEL5) and cronie \<= 1.4.7 (RHEL6) packages and crontabs which belong to remote network users; when CROND starts up at boot it cannot "see" these remote users when scanning the `/var/spool/cron/` crontabs as networking is not online yet; as a consequence it places each unmatched crontab in an "Orphan" list and never checks again. Restarting CROND after Winbind/LDAP/NIS/etc. are up will work correctly, so one possible solution is to place a 'service crond restart' in `/etc/rc.d/rc.local` if you must use this type of crontab.
+
+This issue was fixed in the 1.4.8 release of cronie; if it is not yet available any RPM upgrade will have to be manually rebuilt from the Koji system (Fedora packaging) to obtain a newer release.
+
+ - Git commit: 
+ - Koji package: 
+
+The cronie package in RHEL6 replaces the vixie-cron and anacron packages from RHEL5. Compiling (rebuilding) cronie for RHEL5 and doing a manual package swap may work but is untested. It would be best to create a local user account to run the crontabs instead of using a remote network user until (and if) Red Hat releases packages which address this issue.
+
+
+## References
+
+ - 
diff --git a/md/arch_uefi_installation.md b/md/arch_uefi_installation.md
new file mode 100644
index 0000000..c1ea677
--- /dev/null
+++ b/md/arch_uefi_installation.md
@@ -0,0 +1,120 @@
+# Arch UEFI Installation
+
+## Contents
+
+ - [Overview](#overview)
+ - [Process](#process)
+
+
+## Overview
+
+A concise example of using a UEFI system with GPT disk partitioning on [Arch](https://www.archlinux.org); the intent is to demonstrate how the EFI partition works, how it's mounted, and how [GRUB](http://www.gnu.org/software/grub/) is configured.
+
+This example was designed using a [VirtualBox](https://www.virtualbox.org/) host with the guest VM in EFI mode (`VM -> Settings -> System -> Enable EFI`), then booting the standard [Arch ISO](https://www.archlinux.org/download/) and installing in UEFI mode.
+
+
+## Process
+
+First, partition your device in GPT format (`gdisk` or `parted`) like so:
+
+> Note that gdisk and parted display the EFI System Partition (`/dev/sda1` below) in different ways -- gdisk shows it as an ESP (EFI System Partition), parted likes to show it instead with flags "boot,esp" -- this is all the same, it's type ef00 under the hood on a GPT disk. 
+
+```
+/dev/sda1 size:200M type:ef00 "EFI System Partition" or "boot,esp"
+/dev/sda2 size:500M type:8300 "Linux Filesystem" or "Linux"
+/dev/sda3 size:rest type:8e00 "Linux LVM"
+```
+
+Now create `/dev/sda3` as LVM:
+
+```
+pvcreate /dev/sda3
+vgcreate vglocal00 /dev/sda3
+lvcreate -L 1G -n swap00 vglocal00
+lvcreate -l 100%FREE -n root00 vglocal00
+```
+
+Make your swap space, the vfat filesystem for EFI, and the ext4 filesystems:
+
+```
+mkswap /dev/vglocal00/swap00
+mkfs.vfat /dev/sda1
+mkfs.ext4 /dev/sda2
+mkfs.ext4 /dev/vglocal00/root00
+```
+
+Mount everything - notice how `/dev/sda1` is a /boot/efi VFAT (aka FAT32) partition type:
+
+```
+swapon /dev/vglocal00/swap00
+mount /dev/vglocal00/root00 /mnt
+mkdir /mnt/boot
+mount /dev/sda2 /mnt/boot
+mkdir /mnt/boot/efi
+mount /dev/sda1 /mnt/boot/efi
+```
+
+Pacstrap the core packages and chroot into the mount:
+
+```
+pacstrap /mnt base
+genfstab -p /mnt >> /mnt/etc/fstab
+arch-chroot /mnt
+```
+
+Prep the system with all the usual things - adjust to your locale as desired:
+
+```
+export LANG="en_US.UTF-8"
+echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen
+locale-gen
+
+cat << EOF > /etc/locale.conf
+LANG="en_US.UTF-8"
+LC_COLLATE="C"
+EOF
+
+cat << EOF > /etc/vconsole.conf
+KEYMAP="us"
+FONT="eurlatgr"
+EOF
+
+ln -s /usr/share/zoneinfo/America/Chicago /etc/localtime
+hwclock --systohc --utc
+echo "toolbox" > /etc/hostname
+hostname "toolbox"
+```
+
+Install grub, kernel headers, os-prober (so grub can see Windows, etc.), the Intel microcode and the UEFI tools:
+
+```
+pacman -Sy --noconfirm
+pacman -S --noconfirm grub linux-headers os-prober intel-ucode dosfstools efibootmgr
+```
+
+Add the mkinitcpio hook for LVM:
+
+```
+sed -i.bak -r 's/^HOOKS=(.*)block(.*)/HOOKS=\1block lvm2\2/g' /etc/mkinitcpio.conf
+mkinitcpio -p linux
+```
+
+Install grub2 in UEFI mode and add the hack for some UEFI firmware implementations which expect the boot bits in the fallback path `EFI/boot/bootx64.efi`:
+
+```
+grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=arch_grub --recheck --debug
+grub-mkconfig -o /boot/grub/grub.cfg
+mkdir /boot/efi/EFI/boot
+cp /boot/efi/EFI/arch_grub/grubx64.efi /boot/efi/EFI/boot/bootx64.efi
+```
+
+Finally, set the password for root, back your way out and reboot:
+
+```
+passwd root
+exit
+umount -R /mnt
+reboot
+```
+
+The guest VM is now a UEFI + GPT based Linux system; from here the process can be expanded to work with real live devices and systems, custom partitions and so forth. 
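+
+After the reboot, the result can be verified from the running system; a quick sketch using `efibootmgr` (installed above) - the `arch_grub` entry matches the `--bootloader-id` passed to grub-install:
+
+```
+# the firmware boot entries should now include 'arch_grub'
+efibootmgr -v
+# and the ESP should be mounted at /boot/efi
+findmnt /boot/efi
+```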
diff --git a/md/cifs_client_setup.md b/md/cifs_client_setup.md
new file mode 100644
index 0000000..2556a83
--- /dev/null
+++ b/md/cifs_client_setup.md
@@ -0,0 +1,546 @@
+# CIFS Client Setup
+
+## Contents
+
+ - [Overview](#overview)
+ - [Conventions Used](#conventions-used)
+ - [Software Installation](#software-installation)
+ - [Share Testing](#share-testing)
+ - [Client Configuration](#client-configuration)
+ - [ACL Testing](#acl-testing)
+ - [Kerberos Authentication](#kerberos-authentication)
+ - [Linux Packages](#linux-packages)
+ - [Domain Controller](#domain-controller)
+ - [Test the Domain](#test-the-domain)
+ - [Generate the Keytab](#generate-the-keytab)
+ - [Client KRB5 Setup](#client-krb5-setup)
+ - [Connect and Test](#connect-and-test)
+ - [CIFS Debugging](#cifs-debugging)
+ - [Packaging Bugs](#packaging-bugs)
+ - [Troubleshooting](#troubleshooting)
+ - [RHEL5 Kerberos Fails](#rhel5-kerberos-fails)
+ - [Kerberos Upcall Fails](#kerberos-upcall-fails)
+ - [Additional Reading](#additional-reading)
+
+
+## Overview
+
+Common Internet File System (**CIFS**) is the Windows analog to Network File System (**NFS**); to [quote Microsoft](http://technet.microsoft.com/en-us/library/cc939973.aspx):
+
+> The _Common Internet File System_ (CIFS) is the standard way that computer users share files across corporate intranets and the Internet. An enhanced version of the Microsoft open, cross-platform Server Message Block (SMB) protocol, CIFS is a native file-sharing protocol in Windows 2000. CIFS defines a series of commands used to pass information between networked computers.
+
+Linux users are used to the [Samba project](http://www.samba.org/) and its suite of utilities which provide the client side connection to CIFS shares on the Windows server platform, allowing a Linux server to mount the remote Windows share for use.
+
+
+## Conventions Used
+
+Setting up a Windows server is not covered herein; this document is using these conventions:
+
+ - **Release**: Windows 2012 R2 Server
+ - **Domain**: CIFSGROUP
+ - **Server**: CIFSSERVER, 192.168.5.5
+ - **CIFS User**: cifsuser / p@ssw0rd
+ - **Test User**: testuser / p@ssw0rd, used for testing additional ACLs
+ - **CIFS Share**: c:\\cifsdata\\ as 'cifsdata', full ownership by user 'cifsuser'
+
+The above values are used in all examples below. A dummy file `server-file.txt` has been created in `c:\cifsdata\` and the _cifsuser_ given ownership; this will allow for initial testing.
+
+
+## Software Installation
+
+The filesystem kernel module `cifs.ko` is already provided by the kernel on all distros, however the userspace utilities need to be installed. These primarily consist of `mount.cifs`, `cifs.idmap`, `setcifsacl` and `getcifsacl`. While not absolutely required, installing the `smbclient` utility is also recommended for debugging and troubleshooting. 
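+
+Since `cifs.ko` ships with the distro kernel, a quick check that it is actually available never hurts before installing the tools; a sketch:
+
+```
+# prints module metadata if cifs.ko exists for the running kernel
+modinfo cifs | head -n 3
+```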
+ +Install the required packages: + +``` +# RHEL5 / CentOS5 +# Note: RHEL/CentOS 5 do not provide setcifsacl/getcifsacl +yum install samba3x-client + +# RHEL6 / CentOS6 / RHEL7 / CentOS7 +yum install cifs-utils samba-client + +# Debian / Ubuntu +apt-get update; apt-get install cifs-utils smbclient + +# Arch +pacman -Sy; pacman -S cifs-utils smbclient + +# openSUSE +zypper install cifs-utils samba-client +``` + +Select distros need a special service enabled to mount network filesystems at boot: + +``` +# RHEL5 / CentOS5 / RHEL6 / CentOS6 +chkconfig netfs on + +# Debian 7 - ignore the name, it works for CIFS too +insserv mountnfs.sh +``` + +Distros using **systemd** and **upstart** require no special service to be enabled. Some distros (such as Debian) may automatically start and enable the _Winbind_ daemon which is not being used, disable it: + +``` +insserv -r winbind +service winbind stop +``` + + +## Share Testing + +Before setting up the CIFS mount with the kernel, use `smbclient` to test that the share can be listed: + +``` +# smbclient -W CIFSGROUP -U CIFSSERVER\\cifsuser -L //192.168.5.5 +Enter CIFSSERVER\cifsuser's password: +Domain=[CIFSSERVER] OS=[Windows Server 2012 R2 Standard 9600] Server=[Windows Server 2012 R2 Standard 6.3] + + Sharename Type Comment + --------- ---- ------- + ADMIN$ Disk Remote Admin + C$ Disk Default share + cifsdata Disk + IPC$ IPC Remote IPC +``` + +If this is working correctly, completely connect to the share and ensure the `server-file.txt` can be seen: + +``` +# smbclient -W CIFSGROUP -U CIFSSERVER\\cifsuser //192.168.5.5/cifsdata +Enter CIFSSERVER\cifsuser's password: +Domain=[CIFSSERVER] OS=[Windows Server 2012 R2 Standard 9600] Server=[Windows Server 2012 R2 Standard 6.3] +smb: \> ls + . D 0 Sun Sep 21 14:54:42 2014 + .. D 0 Sun Sep 21 14:54:42 2014 + server-file.txt A 0 Sun Sep 21 14:54:25 2014 + + 40957 blocks of size 1048576. 23832 blocks available +smb: \> exit +``` + + +## Client Configuration + +First, create a file to hold the _user_ and _password_ values - the _domain_ can also be specified, however experience has shown that the _domain_ setting in this file doesn't always work depending on which release of the software is present. We will specify the _domain_ in the `/etc/fstab` config instead. + +``` +/etc/cifspw + +user=cifsuser +password=p@ssw0rd +``` + +Secure the file against prying eyes: + +``` +chmod 0600 /etc/cifspw +``` + +Next, add an entry to `/etc/fstab` for the mount - CIFS and POSIX permissions/attributes differ greatly, so notice that we are going to set a handful of values that maps the remote CIFS share to Linux friendly values. + +> Remember that the user permissions actually used are those on the remote server used to mount the share (_cifsuser_ in these examples); these uid/gid mappings are for the Linux side only to provide POSIX friendly uid/gid usage as a client. Regardless of what is specified here, the remote end will always use the mount user from `/etc/cifspw`. 
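+
+Before committing an entry to `/etc/fstab`, the credentials file can be sanity-checked with a one-off manual mount; a sketch (same share, minimal options, torn down immediately):
+
+```
+mount -t cifs //192.168.5.5/cifsdata /mnt -o domain=CIFSGROUP,credentials=/etc/cifspw
+ls /mnt
+umount /mnt
+```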
+
+Options:
+
+ - **uid**: user to map
+ - **gid**: group to map
+ - **domain**: remote CIFS domain
+ - **credentials**: name of the file with user/pass
+ - **iocharset**: which character set to use locally
+ - **file\_mode**: default file permissions
+ - **dir\_mode**: default directory permissions
+ - **\_netdev**: on startup ensure the network is up first
+ - **soft**: if the CIFS share goes away, don't hang Linux
+
+The user _nobody_ exists on all distros, however on Debian/Ubuntu the group is _nogroup_ whereas on all other distros the group is _nobody_. These are being used as a generic mapping example - they can be any local Linux user/group needed to accomplish the mission, for example _apache_ or _www-data_ if configuring a webserver that needs to write to the share.
+
+```
+/etc/fstab
+
+# every distro except Debian/Ubuntu
+//192.168.5.5/cifsdata /data cifs uid=nobody,gid=nobody,domain=CIFSGROUP,credentials=/etc/cifspw,iocharset=utf8,file_mode=0644,dir_mode=0755,_netdev,soft 0 0
+
+# Debian/Ubuntu only
+//192.168.5.5/cifsdata /data cifs uid=nobody,gid=nogroup,domain=CIFSGROUP,credentials=/etc/cifspw,iocharset=utf8,file_mode=0644,dir_mode=0755,_netdev,soft 0 0
+```
+
+Mount the share and test that basic create and delete privileges are working as expected, while examining the permissions and uid/gid mappings:
+
+```
+mkdir /data
+mount /data
+touch /data/test-file
+ls -l /data/
+
+ -rw-r--r-- 0 nobody nobody 0 Sep 21 14:54 server-file.txt
+ -rw-r--r-- 1 nobody nobody 0 Sep 21 14:59 test-file
+```
+
+
+## ACL Testing
+
+The `setcifsacl` and `getcifsacl` userspace tools allow setting/getting the ACLs present on the remote end. If the client is connected to the domain [via Winbind](active_directory_with_winbind.md) then the remote user name can be used and the `cifs.idmap` infrastructure will handle the details.
+
+However, for these examples we're not connected via Winbind, so we need to obtain the [Security Identifier](http://en.wikipedia.org/wiki/Security_Identifier) (**SID**) of the user on the domain. By far the easiest way to accomplish this is by [downloading PsTools](http://technet.microsoft.com/en-us/sysinternals/bb897417.aspx) and using `psgetsid.exe` from PowerShell.
+
+Retrieve the SID of _testuser_:
+
+```
+PS C:\PSTools> .\PsGetsid.exe testuser
+SID for CIFSSERVER\testuser:
+S-1-5-21-762712803-3572108623-4099884218-1003
+```
+
+From the client, attempt to set an ACL for _testuser_ and check it; you can also use the standard Windows properties Security tab to verify it worked as intended:
+
+```
+setcifsacl -a "ACL:S-1-5-21-762712803-3572108623-4099884218-1003:ALLOWED/I/FULL" /data/test-file
+
+getcifsacl /data/test-file
+
+ REVISION:0x1
+ CONTROL:0x8004
+ OWNER:S-1-5-21-762712803-3572108623-4099884218-1001
+ GROUP:S-1-5-21-762712803-3572108623-4099884218-513
+ ACL:S-1-5-21-762712803-3572108623-4099884218-500:ALLOWED/I/FULL
+ ACL:S-1-5-32-544:ALLOWED/I/FULL
+ ACL:S-1-5-21-762712803-3572108623-4099884218-1001:ALLOWED/I/FULL
+ ACL:S-1-5-18:ALLOWED/I/FULL
+ ACL:S-1-5-21-762712803-3572108623-4099884218-1003:ALLOWED/I/FULL
+```
+
+If the Linux client is connected with Winbind, the use of the remote domain\\username is possible, see the `setcifsacl(1)` man page for further information.
+
+
+## Kerberos Authentication
+
+If a Windows Active Directory domain is available - or you can create one - Kerberos (**KRB5**) authentication with a randomized password can be used in place of the credentials file (`/etc/cifspw`) for additional security. 
First, we need to redefine a convention:
+
+ - **Domain**: _CIFSDOMAIN_ (NetBIOS name), _cifsdomain.local_ (AD name)
+
+Note that after this section is complete the usage of `smbclient` is different - since we will use a random password in the Kerberos principal, Kerberos authentication with smbclient must also be used:
+
+```
+smbclient -k -U cifsuser@CIFSDOMAIN.LOCAL //cifsserver/cifsdata
+```
+
+**The IP address cannot be used**; it has to be the name of the CIFS server as shown and described below, as the Kerberos infrastructure will now be using names, not IPs. The `smbclient` command can be used while the share is mounted; the two methods work together without issue.
+
+### Linux Packages
+
+Two additional packages are required along with the ones installed previously; they may or may not already be installed. Some distros may ask for KRB5 configuration post-install, just accept the defaults - we'll overwrite them later.
+
+```
+# RHEL / CentOS
+yum install keyutils krb5-workstation
+
+# Debian / Ubuntu
+apt-get update; apt-get install keyutils krb5-user
+
+# Arch
+pacman -Sy; pacman -S keyutils krb5
+
+# openSUSE
+zypper install keyutils krb5-client
+```
+
+### Domain Controller
+
+> If the server is already part of a domain, skip this step and use the existing domain. There is no going back from this section, back up your server first as required.
+
+If an AD domain is not present, the server will need to have the Active Directory Domain Services (**AD DS**) services installed and the server itself promoted to Domain Controller. This is required to create the Key Distribution Center (**KDC**) on the domain, responsible for supplying session tickets and temporary session keys.
+
+Assuming Windows 2012 R2:
+
+1. Open _Server Manager_, go to _Local Server_ on the left, then scroll all the way to the bottom _Roles and Features_
+2. Click the drop-down _Tasks_ on the right side of the _Roles and Features_ block, choose _Add Roles and Features_
+3. Under _Installation Type_ in the Wizard choose the _Role-based_ option then Next
+4. Under _Server Selection_ choose our server, CIFSSERVER then Next
+5. Under _Server Roles_ choose **Active Directory Domain Services**, accept the popup, then Next
+6. Under _Features_ and _AD DS_ accept the defaults then Next and start the install
+
+Almost done, leave that dialog there even though it says you can close it and wait. Now that the bits are installed, you need to **Promote this server to a Domain Controller** -- the option is listed in blue text on the finish screen from the above steps, click it.
+
+1. On the initial screen choose the last option _Add a Forest_ to start fresh
+2. For _Root domain name_ enter the domain (_cifsdomain.local_ herein), then Next
+3. The next screen asks for a Functional level and DNS, accept the defaults
+4. Enter a password of your choosing for _Directory Services Restore Mode_, then Next
+5. You will most likely get an error about DNS Delegation, click Next
+6. The NetBIOS name (_CIFSDOMAIN_ herein) should automatically fill in; if not, enter it then Next
+7. Accept all other defaults for _Paths_, click Next to review and Next to begin the Prerequisites check
+8. More warnings about DNS Delegation show up, ignore them
+9. Click Install and go get coffee, this takes a while
+
+**The server will reboot automatically** when finished. Give it some time, even after the reboot it's doing things that cause the login to take quite a while. 
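+
+Back on the Linux side, it is worth confirming that the new KDC is reachable before continuing; a minimal sketch, assuming the `nc` (netcat) utility is installed:
+
+```
+# the KDC listens on 88/tcp; a refused or timed out connection points at DC problems
+nc -vz 192.168.5.5 88
+```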
+
+### Test the Domain
+
+As above, first test the Domain for basic functionality -- this will ensure that something has not gone wrong if you had to promote this server to be a Domain Controller. Simply swap the old **CIFSGROUP** for **CIFSDOMAIN** in the `smbclient` command:
+
+```
+smbclient -W CIFSDOMAIN -U CIFSDOMAIN\\cifsuser //192.168.5.5/cifsdata
+
+Enter CIFSDOMAIN\cifsuser's password:
+Domain=[CIFSDOMAIN] OS=[Windows Server 2012 R2 Standard 9600] Server=[Windows Server 2012 R2 Standard 6.3]
+smb: \> exit
+```
+
+If this is no longer working, correct it before continuing.
+
+### Generate the Keytab
+
+On the Windows server, the KRB5 keytab file needs to be generated to map the user to the principal at the same time. Open an Administrator PowerShell prompt, change to the shared directory and create it like so:
+
+```
+PS C:\Users\Administrator> cd C:\cifsdata
+PS C:\cifsdata> ktpass.exe /princ cifsuser@CIFSDOMAIN.LOCAL /ptype KRB5_NT_PRINCIPAL /out krb5.keytab +rndPass /crypto AES256-SHA1 /mapuser CIFSDOMAIN\cifsuser
+```
+
+If this works successfully, a message should look like:
+
+```
+Targeting domain controller: CIFSSERVER.cifsdomain.local
+Using legacy password setting method
+Failed to set property 'servicePrincipalName' to 'cifsuser' on Dn 'CN=cifsuser,CN=Users,DC=cifsdomain,DC=local': 0x13.
+WARNING: Unable to set SPN mapping data.
+If cifsuser already has an SPN mapping installed for cifsuser, this is no cause for concern.
+Key created.
+Output keytab to krb5.keytab:
+Keytab version: 0x502
+keysize 75 cifsuser@CIFSDOMAIN.LOCAL ptype 1 (KRB5_NT_PRINCIPAL) vno 2 etype 0x12 (AES256-SHA1) keylength 32 (0x3053927eb10407491db5a4cd05849cca3ac96f7ce1bad32269e174efa50439f9)
+```
+
+Over on the Linux side, use `smbclient` to connect and download the file; this keytab file can be used on multiple clients, so make sure a backup copy is stored in a secure, non-public location.
+
+> **SECURITY ALERT**: Do not leave this file lying around on the public share\! This file should be protected and accessible to only the Windows _Administrator_ or Linux _root_ users.
+
+```
+# smbclient -W CIFSDOMAIN -U CIFSDOMAIN\\cifsuser //192.168.5.5/cifsdata
+smb: \> get krb5.keytab
+smb: \> rm krb5.keytab
+```
+
+Move it into `/etc/` and secure it from prying eyes, then test that you can read the principal:
+
+```
+mv krb5.keytab /etc/krb5.keytab
+chown root:root /etc/krb5.keytab
+chmod 0600 /etc/krb5.keytab
+
+# klist -ke
+Keytab name: FILE:/etc/krb5.keytab
+KVNO Principal
+---- --------------------------------------------------------------------------
+ 2 cifsuser@CIFSDOMAIN.LOCAL (aes256-cts-hmac-sha1-96)
+```
+
+> On RHEL5/CentOS5 use `/usr/kerberos/bin/klist` as it's not in `$PATH` until you log out and back in again.
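+
+Once the `/etc/krb5.conf` described in the next section is in place, the keytab itself can be sanity-checked from the client by requesting a ticket with it; a small sketch using the standard MIT krb5 tools:
+
+```
+# request a TGT using the keytab instead of a password, list it, then clean up
+kinit -k -t /etc/krb5.keytab cifsuser@CIFSDOMAIN.LOCAL
+klist
+kdestroy
+```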
+
+### Client KRB5 Setup
+
+The client needs a few files configured to utilize a transparent ticket via _upcall_ methods to the server. Some of these may already be present, or possibly configured in different files as there are several ways to do it. Debian and Ubuntu for instance configure these in `/etc/request-key.conf` by default, whereas RHEL/CentOS and openSUSE place them in separate files in `/etc/request-key.d/`:
+
+```
+/etc/request-key.d/cifs.idmap.conf
+
+create cifs.idmap * * /usr/sbin/cifs.idmap %k
+
+/etc/request-key.d/cifs.spnego.conf
+
+create cifs.spnego * * /usr/sbin/cifs.upcall %k
+
+/etc/request-key.d/dns_resolver.conf
+
+create dns_resolver * * /usr/sbin/cifs.upcall %k
+```
+
+Next, we need to tell the client Kerberos libraries how to contact the upstream KRB5 infrastructure via the normal configuration in `/etc/krb5.conf` - this is only a basic template, adjust if other settings are already present:
+
+```
+/etc/krb5.conf
+
+[logging]
+ default = FILE:/var/log/krb5libs.log
+ kdc = FILE:/var/log/krb5kdc.log
+ admin_server = FILE:/var/log/kadmind.log
+
+[libdefaults]
+ default_realm = CIFSDOMAIN.LOCAL
+ dns_lookup_realm = false
+ dns_lookup_kdc = false
+ ticket_lifetime = 24h
+ renew_lifetime = 7d
+ forwardable = true
+
+[realms]
+ CIFSDOMAIN.LOCAL = {
+  kdc = cifsserver.cifsdomain.local
+  admin_server = cifsserver.cifsdomain.local
+  default_domain = cifsdomain.local
+ }
+
+[domain_realm]
+ .cifsdomain.local = CIFSDOMAIN.LOCAL
+ cifsdomain.local = CIFSDOMAIN.LOCAL
+```
+
+Finally, add an entry to `/etc/hosts` to map the names to IP of the server and domain:
+
+```
+/etc/hosts
+
+192.168.5.5 cifsserver cifsserver.cifsdomain.local cifsdomain.local
+```
+
+> Be careful using DNS (`/etc/resolv.conf`) instead - it's possible the Windows server - which is now a DNS server - will return public IPs for a DNS query. We are ensuring the client always uses the private network IP by using a local /etc/hosts configuration.
+
+### Connect and Test
+
+Using the identical style from the non-KRB5 mount, the options are changed slightly to indicate the **krb5i** security type in `/etc/fstab`:
+
+```
+# every distro except Debian/Ubuntu
+//cifsserver/cifsdata /data cifs user=cifsuser,sec=krb5i,uid=nobody,gid=nobody,iocharset=utf8,file_mode=0644,dir_mode=0755,_netdev,soft 0 0
+
+# Debian/Ubuntu only
+//cifsserver/cifsdata /data cifs user=cifsuser,sec=krb5i,uid=nobody,gid=nogroup,iocharset=utf8,file_mode=0644,dir_mode=0755,_netdev,soft 0 0
+```
+
+Connect to the share, test it out:
+
+```
+mount /data
+touch /data/test-krb5.txt
+ls -l /data/test-krb5.txt
+rm -f /data/test-krb5.txt
+```
+
+At this point everything should be working as expected, just as with the non-Kerberos mount. The `smbclient` command must be adjusted to use Kerberos as well; no password prompt should now occur:
+
+```
+smbclient -k -U cifsuser@CIFSDOMAIN.LOCAL //cifsserver/cifsdata
+```
+
+See the debugging section below if something is not working to your satisfaction.
+
+
+## CIFS Debugging
+
+Various parts of the setup - particularly Kerberos - may not work as expected and require a bit of debugging. Fortunately a nice interface is present in the kernel module to activate on the fly.
+
+First, enable the debug mode of the cifs module **after** it's been loaded into the kernel (which creates this interface). 
The default is **0** (no debugging), set it to **9** for max verbosity: + +``` +echo 9 > /proc/fs/cifs/cifsFYI +``` + +Next add an output destination for the debug info; for example, if using **rsyslog** on RHEL / CentOS 6: + +``` +echo '*.debug /var/log/cifs-debug.log' >> /etc/rsyslog.conf +touch /var/log/cifs-debug.log +service rsyslog restart +``` + +Exact configuration will depend on the distribution and client setup already in place, as the logger in use may be **syslog**, **rsyslog**, **syslog-ng**, **systemd-journald**, etc. Don't forget to disable the debugging interface when complete: + +``` +echo 0 > /proc/fs/cifs/cifsFYI +``` + +...otherwise performance will suffer as everything is being logged to a very high degree. + + +## Packaging Bugs + +Some distributions may have slightly broken packages of _cifs-utils_ for which the symlink of idmap is missing. I've reported these two bugs on Ubuntu 14 and Arch, Red Hat already had a bug report and it was fixed: + + - + - + - + +In general, the fix is very easy until these packages are updated or if you cannot use the latest packages: + +``` +# Ubuntu 14, cifs-utils 2:6.0-1ubuntu2 +mkdir /etc/cifs-utils +ln -s /usr/lib/x86_64-linux-gnu/cifs-utils/idmapwb.so /etc/cifs-utils/idmap-plugin + +# Arch, cifs-utils 6.3-2 +mkdir /etc/cifs-utils +ln -s /usr/lib/cifs-utils/idmapwb.so /etc/cifs-utils/idmap-plugin +``` + +For versions of cifs-utils less than 6.2 an error will occur if it's missing: + +``` +ERROR: unable to initialize idmapping plugin: /etc/cifs-utils/idmap-plugin: cannot open shared object file: No such file or directory +``` + +For cifs-utils 6.2 and above, it's a warning instead: + +``` +WARNING: unable to initialize idmapping plugin. Only "raw" SID strings will be accepted: /etc/cifs-utils/idmap-plugin: cannot open shared object file: No such file or directory +``` + +All the other distros have this symlink present and work as expected. + + +## Troubleshooting + +### RHEL5 Kerberos Fails + +When configuring Kerberos authentication, an error may present like so when the `mount` command is issued: + +``` +mount error(126): Required key not available +Refer to the mount.cifs(8) manual page (e.g. man mount.cifs) +``` + +Enabling debugging reveals lines such as these in the log: + +``` +kernel: CIFS VFS: Send error in SessSetup = -126 +kernel: fs/cifs/connect.c: CIFS VFS: leaving cifs_mount (xid = 2) rc = -126 +kernel: CIFS VFS: cifs_mount failed w/return code = -126 +``` + +Please refer to upstream [bugzilla\#574750](https://bugzilla.redhat.com/show_bug.cgi?id=574750) for full information; in a nutshell there's an issue with how the KRB5 credentials are obtained and used with the older code (RHEL5 doesn't use upstream `cifs-utils`). Adjust the `/etc/fstab` mount line to use `uid=0` per that bug report - be warned though this means only the root user can actually write to the share when it's a system level mount. + +### Kerberos Upcall Fails + +When configuring Kerberos authentication, an error may present like so when the `mount` command is issued: + +``` +mount error(38): Function not implemented +Refer to the mount.cifs(8) manual page (e.g. man mount.cifs) +``` + +Enabling debugging reveals lines such as these in the log: + +``` +kernel: fs/cifs/sess.c: sess setup type 5 +kernel: CIFS VFS: Kerberos negotiated but upcall support disabled! 
+kernel: CIFS VFS: Send error in SessSetup = -38
+kernel: CIFS VFS: cifs_mount failed w/return code = -38
+```
+
+This indicates that `CONFIG_CIFS_UPCALL` is disabled in the kernel; it can be checked like so:
+
+```
+gzip -dc /proc/config.gz | grep CONFIG_CIFS_UPCALL
+```
+
+This feature is enabled on all major upstream kernels, however some providers (such as custom cloud images) may have their own kernel. This feature must be enabled for Kerberos to work with CIFS.
+
+
+## Additional Reading
+
+ - 
+ - 
+ - 
+ - 
+ - 
diff --git a/md/compose_key_sequences.md b/md/compose_key_sequences.md
new file mode 100644
index 0000000..3662e3a
--- /dev/null
+++ b/md/compose_key_sequences.md
@@ -0,0 +1,443 @@
+# Compose Key Sequences
+
+## Contents
+
+ - [Overview](#overview)
+ - [Basic Configuration](#basic-configuration)
+ - [Advanced Configuration](#advanced-configuration)
+ - [Key Sequences Chart](#key-sequences-chart)
+ - [References](#references)
+
+
+## Overview
+
+A **compose key** must be defined for the input method (keyboard); each desktop has its quirks on how it can be used - for instance in MATE if the _Right-Alt_ is defined you can only use the _Left-Shift_ in combination with it. As an example, to type `¡` the key sequence is _Right-Alt_ + _Left-Shift_ held down while typing `!!` -- using the Right-Shift with this sequence doesn't work. Each desktop will have its own quirks.
+
+
+## Basic Configuration
+
+GNOME, KDE and MATE (and maybe others) have a graphical way to set the compose key:
+
+ - **GNOME** and **MATE**
+
+1. Menu
+2. System
+3. Preferences
+4. Keyboard
+5. Layout (tab)
+6. Options (button)
+7. Compose key position
+
+ - **KDE4**
+
+1. System Settings
+2. Regional & Language
+3. Keyboard Layout
+4. Enable keyboard layouts
+5. Advanced (tab)
+6. Compose key position
+
+ - **KDE3**
+
+1. System Settings
+2. Regional & Language
+3. Keyboard Layout
+4. Xkb Options (tab)
+5. Compose key position
+
+
+## Advanced Configuration
+
+Other desktops like XFCE rely on using more basic X11 configuration to set the compose key. There are several ways it might be implemented on a given desktop. 
The list of compose keys used below can be obtained via `grep compose /usr/share/X11/xkb/rules/xorg.lst` and generally looks something like:
+
+| Option | Compose Key |
+| ------------- | ----------- |
+| compose:ralt | Right Alt |
+| compose:lwin | Left Win |
+| compose:rwin | Right Win |
+| compose:menu | Menu |
+| compose:lctrl | Left Ctrl |
+| compose:rctrl | Right Ctrl |
+| compose:caps | Caps Lock |
+| compose:paus | Pause |
+| compose:prsc | PrtSc |
+| compose:sclk | Scroll Lock |
+
+
+If the `/etc/default/keyboard` file exists and/or is in use, set the value:
+
+```
+XKBOPTIONS="compose:ralt"
+```
+
+This will take effect after a new login; to change it right away in a shell use:
+
+```
+$ setxkbmap -option compose:ralt
+```
+
+Other locations where the `setxkbmap` command could be added:
+
+ - ~/.xinit
+ - ~/.xsession
+ - ~/.config/autostart/somefile.desktop
+
+An autostart file for desktops like XFCE generally looks like:
+
+```
+[Desktop Entry]
+Encoding=UTF-8
+Version=0.9.4
+Type=Application
+Name=Set compose key
+Comment=Set compose key
+Exec=/usr/bin/setxkbmap -option compose:ralt
+StartupNotify=false
+Terminal=false
+Hidden=false
+```
+
+
+## Key Sequences Chart
+
+| Unicode | Char | Compose | Comment |
+| ------- | ---- | ------------------------------- | ------------------------------------------ |
+| U00a0 | | `" "` | NO-BREAK SPACE |
+| U00a1 | `¡` | `"!!"` | INVERTED EXCLAMATION MARK |
+| U00a2 | `¢` | `"\|c" "c\|" "c/" "/c"` | CENT SIGN |
+| U00a3 | `£` | `"L-" "-L"` | POUND SIGN |
+| U00a4 | `¤` | `"ox" "xo"` | CURRENCY SIGN |
+| U00a5 | `¥` | `"Y=" "=Y"` | YEN SIGN |
+| U00a6 | `¦` | `"!^"` | BROKEN BAR |
+| U00a7 | `§` | `"so" "os"` | SECTION SIGN |
+| U00a9 | `©` | `"oc" "oC" "Oc" "OC"` | COPYRIGHT SIGN |
+| U00aa | `ª` | `"^_a"` | FEMININE ORDINAL INDICATOR |
+| U00ab | `«` | `"<<"` | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
+| U00ac | `¬` | `",-" "-,"` | NOT SIGN |
+| U00ae | `®` | `"or" "oR" "Or" "OR"` | REGISTERED SIGN |
+| U00b0 | `°` | `"oo"` | DEGREE SIGN |
+| U00b1 | `±` | `"+-"` | PLUS-MINUS SIGN |
+| U00b2 | `²` | `"^2"` | SUPERSCRIPT TWO |
+| U00b3 | `³` | `"^3"` | SUPERSCRIPT THREE |
+| U00b5 | `µ` | `"mu"` | MICRO SIGN |
+| U00b6 | `¶` | `"p!" "P!" 
"PP"` | PILCROW SIGN | +| U00b7 | `·` | `".."` | MIDDLE DOT | +| U00b8 | `¸` | `", " " ,"` | CEDILLA | +| U00b9 | `¹` | `"^1"` | SUPERSCRIPT ONE | +| U00ba | `º` | `"^_o"` | MASCULINE ORDINAL INDICATOR | +| U00bb | `»` | `">>"` | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK | +| U00bc | `¼` | `"14"` | VULGAR FRACTION ONE QUARTER | +| U00bd | `½` | `"12"` | VULGAR FRACTION ONE HALF | +| U00be | `¾` | `"34"` | VULGAR FRACTION THREE QUARTERS | +| U00bf | `¿` | `"??"` | INVERTED QUESTION MARK | +| U00c0 | `À` | ``"`A"`` | LATIN CAPITAL LETTER A WITH GRAVE | +| U00c1 | `Á` | `"'A"` | LATIN CAPITAL LETTER A WITH ACUTE | +| U00c2 | `Â` | `"^A"` | LATIN CAPITAL LETTER A WITH CIRCUMFLEX | +| U00c3 | `Ã` | `"~A"` | LATIN CAPITAL LETTER A WITH TILDE | +| U00c4 | `Ä` | `""A"` | LATIN CAPITAL LETTER A WITH DIAERESIS | +| U00c5 | `Å` | `"oA"` | LATIN CAPITAL LETTER A WITH RING ABOVE | +| U00c6 | `Æ` | `"AE"` | LATIN CAPITAL LETTER AE | +| U00c7 | `Ç` | `",C"` | LATIN CAPITAL LETTER C WITH CEDILLA | +| U00c8 | `È` | ``"`E"`` | LATIN CAPITAL LETTER E WITH GRAVE | +| U00c9 | `É` | `"'E"` | LATIN CAPITAL LETTER E WITH ACUTE | +| U00ca | `Ê` | `"^E"` | LATIN CAPITAL LETTER E WITH CIRCUMFLEX | +| U00cb | `Ë` | `""E"` | LATIN CAPITAL LETTER E WITH DIAERESIS | +| U00cc | `Ì` | ``"`I"`` | LATIN CAPITAL LETTER I WITH GRAVE | +| U00cd | `Í` | `"'I"` | LATIN CAPITAL LETTER I WITH ACUTE | +| U00ce | `Î` | `"^I"` | LATIN CAPITAL LETTER I WITH CIRCUMFLEX | +| U00cf | `Ï` | `""I"` | LATIN CAPITAL LETTER I WITH DIAERESIS | +| U00d0 | `Ð` | `"DH"` | LATIN CAPITAL LETTER ETH | +| U00d1 | `Ñ` | `"~N"` | LATIN CAPITAL LETTER N WITH TILDE | +| U00d2 | `Ò` | ``"`O"`` | LATIN CAPITAL LETTER O WITH GRAVE | +| U00d3 | `Ó` | `"'O"` | LATIN CAPITAL LETTER O WITH ACUTE | +| U00d4 | `Ô` | `"^O"` | LATIN CAPITAL LETTER O WITH CIRCUMFLEX | +| U00d5 | `Õ` | `"~O"` | LATIN CAPITAL LETTER O WITH TILDE | +| U00d6 | `Ö` | `""O"` | LATIN CAPITAL LETTER O WITH DIAERESIS | +| U00d7 | `×` | `"xx"` | MULTIPLICATION SIGN | +| U00d8 | `Ø` | `"/O"` | LATIN CAPITAL LETTER O WITH STROKE | +| U00d9 | `Ù` | ``"`U"`` | LATIN CAPITAL LETTER U WITH GRAVE | +| U00da | `Ú` | `"'U"` | LATIN CAPITAL LETTER U WITH ACUTE | +| U00db | `Û` | `"^U"` | LATIN CAPITAL LETTER U WITH CIRCUMFLEX | +| U00dc | `Ü` | `""U"` | LATIN CAPITAL LETTER U WITH DIAERESIS | +| U00dd | `Ý` | `"'Y"` | LATIN CAPITAL LETTER Y WITH ACUTE | +| U00de | `Þ` | `"TH"` | LATIN CAPITAL LETTER THORN | +| U00df | `ß` | `"ss"` | LATIN SMALL LETTER SHARP S | +| U00e0 | `à` | ``"`a"`` | LATIN SMALL LETTER A WITH GRAVE | +| U00e1 | `á` | `"'a"` | LATIN SMALL LETTER A WITH ACUTE | +| U00e2 | `â` | `"^a"` | LATIN SMALL LETTER A WITH CIRCUMFLEX | +| U00e3 | `ã` | `"~a"` | LATIN SMALL LETTER A WITH TILDE | +| U00e4 | `ä` | `""a"` | LATIN SMALL LETTER A WITH DIAERESIS | +| U00e5 | `å` | `"oa"` | LATIN SMALL LETTER A WITH RING ABOVE | +| U00e6 | `æ` | `"ae"` | LATIN SMALL LETTER AE | +| U00e7 | `ç` | `",c"` | LATIN SMALL LETTER C WITH CEDILLA | +| U00e8 | `è` | ``"`e"`` | LATIN SMALL LETTER E WITH GRAVE | +| U00e9 | `é` | `"'e"` | LATIN SMALL LETTER E WITH ACUTE | +| U00ea | `ê` | `"^e"` | LATIN SMALL LETTER E WITH CIRCUMFLEX | +| U00eb | `ë` | `""e"` | LATIN SMALL LETTER E WITH DIAERESIS | +| U00ec | `ì` | ``"`i"`` | LATIN SMALL LETTER I WITH GRAVE | +| U00ed | `í` | `"'i"` | LATIN SMALL LETTER I WITH ACUTE | +| U00ee | `î` | `"^i"` | LATIN SMALL LETTER I WITH CIRCUMFLEX | +| U00ef | `ï` | `""i"` | LATIN SMALL LETTER I WITH DIAERESIS | +| U00f0 | `ð` | `"dh"` | LATIN SMALL LETTER ETH | +| U00f1 | `ñ` | 
`"~n"` | LATIN SMALL LETTER N WITH TILDE | +| U00f2 | `ò` | ``"`o"`` | LATIN SMALL LETTER O WITH GRAVE | +| U00f3 | `ó` | `"'o"` | LATIN SMALL LETTER O WITH ACUTE | +| U00f4 | `ô` | `"^o"` | LATIN SMALL LETTER O WITH CIRCUMFLEX | +| U00f5 | `õ` | `"~o"` | LATIN SMALL LETTER O WITH TILDE | +| U00f6 | `ö` | `""o"` | LATIN SMALL LETTER O WITH DIAERESIS | +| U00f7 | `÷` | `":-" "-:"` | DIVISION SIGN | +| U00f8 | `ø` | `"/o"` | LATIN SMALL LETTER O WITH STROKE | +| U00f9 | `ù` | ``"`u"`` | LATIN SMALL LETTER U WITH GRAVE | +| U00fa | `ú` | `"'u"` | LATIN SMALL LETTER U WITH ACUTE | +| U00fb | `û` | `"^u"` | LATIN SMALL LETTER U WITH CIRCUMFLEX | +| U00fc | `ü` | `""u"` | LATIN SMALL LETTER U WITH DIAERESIS | +| U00fd | `ý` | `"'y"` | LATIN SMALL LETTER Y WITH ACUTE | +| U00fe | `þ` | `"th"` | LATIN SMALL LETTER THORN | +| U00ff | `ÿ` | `""y"` | LATIN SMALL LETTER Y WITH DIAERESIS | +| U0100 | `Ā` | `"_A"` | LATIN CAPITAL LETTER A WITH MACRON | +| U0101 | `ā` | `"_a"` | LATIN SMALL LETTER A WITH MACRON | +| U0102 | `Ă` | `"UA" "bA"` | LATIN CAPITAL LETTER A WITH BREVE | +| U0103 | `ă` | `"Ua" "ba"` | LATIN SMALL LETTER A WITH BREVE | +| U0104 | `Ą` | `";A"` | LATIN CAPITAL LETTER A WITH OGONEK | +| U0105 | `ą` | `";a"` | LATIN SMALL LETTER A WITH OGONEK | +| U0106 | `Ć` | `"'C"` | LATIN CAPITAL LETTER C WITH ACUTE | +| U0107 | `ć` | `"'c"` | LATIN SMALL LETTER C WITH ACUTE | +| U0108 | `Ĉ` | `"^C"` | LATIN CAPITAL LETTER C WITH CIRCUMFLEX | +| U0109 | `ĉ` | `"^c"` | LATIN SMALL LETTER C WITH CIRCUMFLEX | +| U010c | `Č` | `"cC"` | LATIN CAPITAL LETTER C WITH CARON | +| U010d | `č` | `"cc"` | LATIN SMALL LETTER C WITH CARON | +| U010e | `Ď` | `"cD"` | LATIN CAPITAL LETTER D WITH CARON | +| U010f | `ď` | `"cd"` | LATIN SMALL LETTER D WITH CARON | +| U0110 | `Đ` | `"-D" "/D"` | LATIN CAPITAL LETTER D WITH STROKE | +| U0111 | `đ` | `"-d" "/d"` | LATIN SMALL LETTER D WITH STROKE | +| U0112 | `Ē` | `"_E"` | LATIN CAPITAL LETTER E WITH MACRON | +| U0113 | `ē` | `"_e"` | LATIN SMALL LETTER E WITH MACRON | +| U0114 | `Ĕ` | `"UE" "bE"` | LATIN CAPITAL LETTER E WITH BREVE | +| U0115 | `ĕ` | `"Ue" "be"` | LATIN SMALL LETTER E WITH BREVE | +| U0118 | `Ę` | `";E"` | LATIN CAPITAL LETTER E WITH OGONEK | +| U0119 | `ę` | `";e"` | LATIN SMALL LETTER E WITH OGONEK | +| U011a | `Ě` | `"cE"` | LATIN CAPITAL LETTER E WITH CARON | +| U011b | `ě` | `"ce"` | LATIN SMALL LETTER E WITH CARON | +| U011c | `Ĝ` | `"^G"` | LATIN CAPITAL LETTER G WITH CIRCUMFLEX | +| U011d | `ĝ` | `"^g"` | LATIN SMALL LETTER G WITH CIRCUMFLEX | +| U011e | `Ğ` | `"UG" "bG"` | LATIN CAPITAL LETTER G WITH BREVE | +| U011f | `ğ` | `"Ug" "bg"` | LATIN SMALL LETTER G WITH BREVE | +| U0122 | `Ģ` | `",G"` | LATIN CAPITAL LETTER G WITH CEDILLA | +| U0123 | `ģ` | `",g"` | LATIN SMALL LETTER G WITH CEDILLA | +| U0124 | `Ĥ` | `"^H"` | LATIN CAPITAL LETTER H WITH CIRCUMFLEX | +| U0125 | `ĥ` | `"^h"` | LATIN SMALL LETTER H WITH CIRCUMFLEX | +| U0126 | `Ħ` | `"/H"` | LATIN CAPITAL LETTER H WITH STROKE | +| U0127 | `ħ` | `"/h"` | LATIN SMALL LETTER H WITH STROKE | +| U0128 | `Ĩ` | `"~I"` | LATIN CAPITAL LETTER I WITH TILDE | +| U0129 | `ĩ` | `"~i"` | LATIN SMALL LETTER I WITH TILDE | +| U012a | `Ī` | `"_I"` | LATIN CAPITAL LETTER I WITH MACRON | +| U012b | `ī` | `"_i"` | LATIN SMALL LETTER I WITH MACRON | +| U012c | `Ĭ` | `"UI" "bI"` | LATIN CAPITAL LETTER I WITH BREVE | +| U012d | `ĭ` | `"Ui" "bi"` | LATIN SMALL LETTER I WITH BREVE | +| U012e | `Į` | `";I"` | LATIN CAPITAL LETTER I WITH OGONEK | +| U012f | `į` | `";i"` | LATIN SMALL LETTER I WITH OGONEK | +| 
U0131 | `ı` | `"i."` | LATIN SMALL LETTER DOTLESS I | +| U0134 | `Ĵ` | `"^J"` | LATIN CAPITAL LETTER J WITH CIRCUMFLEX | +| U0135 | `ĵ` | `"^j"` | LATIN SMALL LETTER J WITH CIRCUMFLEX | +| U0136 | `Ķ` | `",K"` | LATIN CAPITAL LETTER K WITH CEDILLA | +| U0137 | `ķ` | `",k"` | LATIN SMALL LETTER K WITH CEDILLA | +| U0138 | `ĸ` | `"kk"` | LATIN SMALL LETTER KRA | +| U0139 | `Ĺ` | `"'L"` | LATIN CAPITAL LETTER L WITH ACUTE | +| U013a | `ĺ` | `"'l"` | LATIN SMALL LETTER L WITH ACUTE | +| U013b | `Ļ` | `",L"` | LATIN CAPITAL LETTER L WITH CEDILLA | +| U013c | `ļ` | `",l"` | LATIN SMALL LETTER L WITH CEDILLA | +| U013d | `Ľ` | `"cL"` | LATIN CAPITAL LETTER L WITH CARON | +| U013e | `ľ` | `"cl"` | LATIN SMALL LETTER L WITH CARON | +| U0141 | `Ł` | `"/L"` | LATIN CAPITAL LETTER L WITH STROKE | +| U0142 | `ł` | `"/l"` | LATIN SMALL LETTER L WITH STROKE | +| U0143 | `Ń` | `"'N"` | LATIN CAPITAL LETTER N WITH ACUTE | +| U0144 | `ń` | `"'n"` | LATIN SMALL LETTER N WITH ACUTE | +| U0145 | `Ņ` | `",N"` | LATIN CAPITAL LETTER N WITH CEDILLA | +| U0146 | `ņ` | `",n"` | LATIN SMALL LETTER N WITH CEDILLA | +| U0147 | `Ň` | `"cN"` | LATIN CAPITAL LETTER N WITH CARON | +| U0148 | `ň` | `"cn"` | LATIN SMALL LETTER N WITH CARON | +| U014a | `Ŋ` | `"NG"` | LATIN CAPITAL LETTER ENG | +| U014b | `ŋ` | `"ng"` | LATIN SMALL LETTER ENG | +| U014c | `Ō` | `"_O"` | LATIN CAPITAL LETTER O WITH MACRON | +| U014d | `ō` | `"_o"` | LATIN SMALL LETTER O WITH MACRON | +| U014e | `Ŏ` | `"UO" "bO"` | LATIN CAPITAL LETTER O WITH BREVE | +| U014f | `ŏ` | `"Uo" "bo"` | LATIN SMALL LETTER O WITH BREVE | +| U0150 | `Ő` | `"=O"` | LATIN CAPITAL LETTER O WITH DOUBLE ACUTE | +| U0151 | `ő` | `"=o"` | LATIN SMALL LETTER O WITH DOUBLE ACUTE | +| U0152 | `Œ` | `"OE"` | LATIN CAPITAL LIGATURE OE | +| U0153 | `œ` | `"oe"` | LATIN SMALL LIGATURE OE | +| U0154 | `Ŕ` | `"'R"` | LATIN CAPITAL LETTER R WITH ACUTE | +| U0155 | `ŕ` | `"'r"` | LATIN SMALL LETTER R WITH ACUTE | +| U0156 | `Ŗ` | `",R"` | LATIN CAPITAL LETTER R WITH CEDILLA | +| U0157 | `ŗ` | `",r"` | LATIN SMALL LETTER R WITH CEDILLA | +| U0158 | `Ř` | `"cR"` | LATIN CAPITAL LETTER R WITH CARON | +| U0159 | `ř` | `"cr"` | LATIN SMALL LETTER R WITH CARON | +| U015a | `Ś` | `"'S"` | LATIN CAPITAL LETTER S WITH ACUTE | +| U015b | `ś` | `"'s"` | LATIN SMALL LETTER S WITH ACUTE | +| U015c | `Ŝ` | `"^S"` | LATIN CAPITAL LETTER S WITH CIRCUMFLEX | +| U015d | `ŝ` | `"^s"` | LATIN SMALL LETTER S WITH CIRCUMFLEX | +| U015e | `Ş` | `",S"` | LATIN CAPITAL LETTER S WITH CEDILLA | +| U015f | `ş` | `",s"` | LATIN SMALL LETTER S WITH CEDILLA | +| U0160 | `Š` | `"cS"` | LATIN CAPITAL LETTER S WITH CARON | +| U0161 | `š` | `"cs"` | LATIN SMALL LETTER S WITH CARON | +| U0162 | `Ţ` | `",T"` | LATIN CAPITAL LETTER T WITH CEDILLA | +| U0163 | `ţ` | `",t"` | LATIN SMALL LETTER T WITH CEDILLA | +| U0164 | `Ť` | `"cT"` | LATIN CAPITAL LETTER T WITH CARON | +| U0165 | `ť` | `"ct"` | LATIN SMALL LETTER T WITH CARON | +| U0166 | `Ŧ` | `"/T"` | LATIN CAPITAL LETTER T WITH STROKE | +| U0167 | `ŧ` | `"/t"` | LATIN SMALL LETTER T WITH STROKE | +| U0168 | `Ũ` | `"~U"` | LATIN CAPITAL LETTER U WITH TILDE | +| U0169 | `ũ` | `"~u"` | LATIN SMALL LETTER U WITH TILDE | +| U016a | `Ū` | `"_U"` | LATIN CAPITAL LETTER U WITH MACRON | +| U016b | `ū` | `"_u"` | LATIN SMALL LETTER U WITH MACRON | +| U016c | `Ŭ` | `"UU" "bU"` | LATIN CAPITAL LETTER U WITH BREVE | +| U016d | `ŭ` | `"Uu" "bu"` | LATIN SMALL LETTER U WITH BREVE | +| U016e | `Ů` | `"oU"` | LATIN CAPITAL LETTER U WITH RING ABOVE | +| U016f | `ů` | `"ou"` | LATIN 
SMALL LETTER U WITH RING ABOVE | +| U0170 | `Ű` | `"=U"` | LATIN CAPITAL LETTER U WITH DOUBLE ACUTE | +| U0171 | `ű` | `"=u"` | LATIN SMALL LETTER U WITH DOUBLE ACUTE | +| U0172 | `Ų` | `";U"` | LATIN CAPITAL LETTER U WITH OGONEK | +| U0173 | `ų` | `";u"` | LATIN SMALL LETTER U WITH OGONEK | +| U0174 | `Ŵ` | `"^W"` | LATIN CAPITAL LETTER W WITH CIRCUMFLEX | +| U0175 | `ŵ` | `"^w"` | LATIN SMALL LETTER W WITH CIRCUMFLEX | +| U0176 | `Ŷ` | `"^Y"` | LATIN CAPITAL LETTER Y WITH CIRCUMFLEX | +| U0177 | `ŷ` | `"^y"` | LATIN SMALL LETTER Y WITH CIRCUMFLEX | +| U0178 | `Ÿ` | `""Y"` | LATIN CAPITAL LETTER Y WITH DIAERESIS | +| U0179 | `Ź` | `"'Z"` | LATIN CAPITAL LETTER Z WITH ACUTE | +| U017a | `ź` | `"'z"` | LATIN SMALL LETTER Z WITH ACUTE | +| U017d | `Ž` | `"cZ"` | LATIN CAPITAL LETTER Z WITH CARON | +| U017e | `ž` | `"cz"` | LATIN SMALL LETTER Z WITH CARON | +| U017f | `ſ` | `"fs" "fS"` | LATIN SMALL LETTER LONG S | +| U0180 | `ƀ` | `"/b"` | LATIN SMALL LETTER B WITH STROKE | +| U0197 | `Ɨ` | `"/I"` | LATIN CAPITAL LETTER I WITH STROKE | +| U01b5 | `Ƶ` | `"/Z"` | LATIN CAPITAL LETTER Z WITH STROKE | +| U01b6 | `ƶ` | `"/z"` | LATIN SMALL LETTER Z WITH STROKE | +| U01cd | `Ǎ` | `"cA"` | LATIN CAPITAL LETTER A WITH CARON | +| U01ce | `ǎ` | `"ca"` | LATIN SMALL LETTER A WITH CARON | +| U01cf | `Ǐ` | `"cI"` | LATIN CAPITAL LETTER I WITH CARON | +| U01d0 | `ǐ` | `"ci"` | LATIN SMALL LETTER I WITH CARON | +| U01d1 | `Ǒ` | `"cO"` | LATIN CAPITAL LETTER O WITH CARON | +| U01d2 | `ǒ` | `"co"` | LATIN SMALL LETTER O WITH CARON | +| U01d3 | `Ǔ` | `"cU"` | LATIN CAPITAL LETTER U WITH CARON | +| U01d4 | `ǔ` | `"cu"` | LATIN SMALL LETTER U WITH CARON | +| U01e4 | `Ǥ` | `"/G"` | LATIN CAPITAL LETTER G WITH STROKE | +| U01e5 | `ǥ` | `"/g"` | LATIN SMALL LETTER G WITH STROKE | +| U01e6 | `Ǧ` | `"cG"` | LATIN CAPITAL LETTER G WITH CARON | +| U01e7 | `ǧ` | `"cg"` | LATIN SMALL LETTER G WITH CARON | +| U01e8 | `Ǩ` | `"cK"` | LATIN CAPITAL LETTER K WITH CARON | +| U01e9 | `ǩ` | `"ck"` | LATIN SMALL LETTER K WITH CARON | +| U01ea | `Ǫ` | `";O"` | LATIN CAPITAL LETTER O WITH OGONEK | +| U01eb | `ǫ` | `";o"` | LATIN SMALL LETTER O WITH OGONEK | +| U01f0 | `ǰ` | `"cj"` | LATIN SMALL LETTER J WITH CARON | +| U01f4 | `Ǵ` | `"'G"` | LATIN CAPITAL LETTER G WITH ACUTE | +| U01f5 | `ǵ` | `"'g"` | LATIN SMALL LETTER G WITH ACUTE | +| U01f8 | `Ǹ` | ``"`N"`` | LATIN CAPITAL LETTER N WITH GRAVE | +| U01f9 | `ǹ` | ``"`n"`` | LATIN SMALL LETTER N WITH GRAVE | +| U021e | `Ȟ` | `"cH"` | LATIN CAPITAL LETTER H WITH CARON | +| U021f | `ȟ` | `"ch"` | LATIN SMALL LETTER H WITH CARON | +| U0228 | `Ȩ` | `",E"` | LATIN CAPITAL LETTER E WITH CEDILLA | +| U0229 | `ȩ` | `",e"` | LATIN SMALL LETTER E WITH CEDILLA | +| U0232 | `Ȳ` | `"_Y"` | LATIN CAPITAL LETTER Y WITH MACRON | +| U0233 | `ȳ` | `"_y"` | LATIN SMALL LETTER Y WITH MACRON | +| U0259 | `ə` | `"ee"` | LATIN SMALL LETTER SCHWA | +| U0268 | `ɨ` | `"/i"` | LATIN SMALL LETTER I WITH STROKE | +| U1e10 | `Ḑ` | `",D"` | LATIN CAPITAL LETTER D WITH CEDILLA | +| U1e11 | `ḑ` | `",d"` | LATIN SMALL LETTER D WITH CEDILLA | +| U1e20 | `Ḡ` | `"_G"` | LATIN CAPITAL LETTER G WITH MACRON | +| U1e21 | `ḡ` | `"_g"` | LATIN SMALL LETTER G WITH MACRON | +| U1e26 | `Ḧ` | `""H"` | LATIN CAPITAL LETTER H WITH DIAERESIS | +| U1e27 | `ḧ` | `""h"` | LATIN SMALL LETTER H WITH DIAERESIS | +| U1e28 | `Ḩ` | `",H"` | LATIN CAPITAL LETTER H WITH CEDILLA | +| U1e29 | `ḩ` | `",h"` | LATIN SMALL LETTER H WITH CEDILLA | +| U1e30 | `Ḱ` | `"'K"` | LATIN CAPITAL LETTER K WITH ACUTE | +| U1e31 | `ḱ` | `"'k"` | LATIN SMALL 
LETTER K WITH ACUTE | +| U1e3e | `Ḿ` | `"'M"` | LATIN CAPITAL LETTER M WITH ACUTE | +| U1e3f | `ḿ` | `"'m"` | LATIN SMALL LETTER M WITH ACUTE | +| U1e54 | `Ṕ` | `"'P"` | LATIN CAPITAL LETTER P WITH ACUTE | +| U1e55 | `ṕ` | `"'p"` | LATIN SMALL LETTER P WITH ACUTE | +| U1e7c | `Ṽ` | `"~V"` | LATIN CAPITAL LETTER V WITH TILDE | +| U1e7d | `ṽ` | `"~v"` | LATIN SMALL LETTER V WITH TILDE | +| U1e80 | `Ẁ` | ``"`W"`` | LATIN CAPITAL LETTER W WITH GRAVE | +| U1e81 | `ẁ` | ``"`w"`` | LATIN SMALL LETTER W WITH GRAVE | +| U1e82 | `Ẃ` | `"'W"` | LATIN CAPITAL LETTER W WITH ACUTE | +| U1e83 | `ẃ` | `"'w"` | LATIN SMALL LETTER W WITH ACUTE | +| U1e84 | `Ẅ` | `""W"` | LATIN CAPITAL LETTER W WITH DIAERESIS | +| U1e85 | `ẅ` | `""w"` | LATIN SMALL LETTER W WITH DIAERESIS | +| U1e8c | `Ẍ` | `""X"` | LATIN CAPITAL LETTER X WITH DIAERESIS | +| U1e8d | `ẍ` | `""x"` | LATIN SMALL LETTER X WITH DIAERESIS | +| U1e90 | `Ẑ` | `"^Z"` | LATIN CAPITAL LETTER Z WITH CIRCUMFLEX | +| U1e91 | `ẑ` | `"^z"` | LATIN SMALL LETTER Z WITH CIRCUMFLEX | +| U1e97 | `ẗ` | `""t"` | LATIN SMALL LETTER T WITH DIAERESIS | +| U1e98 | `ẘ` | `"ow"` | LATIN SMALL LETTER W WITH RING ABOVE | +| U1e99 | `ẙ` | `"oy"` | LATIN SMALL LETTER Y WITH RING ABOVE | +| U1ebc | `Ẽ` | `"~E"` | LATIN CAPITAL LETTER E WITH TILDE | +| U1ebd | `ẽ` | `"~e"` | LATIN SMALL LETTER E WITH TILDE | +| U1ef2 | `Ỳ` | ``"`Y"`` | LATIN CAPITAL LETTER Y WITH GRAVE | +| U1ef3 | `ỳ` | ``"`y"`` | LATIN SMALL LETTER Y WITH GRAVE | +| U1ef8 | `Ỹ` | `"~Y"` | LATIN CAPITAL LETTER Y WITH TILDE | +| U1ef9 | `ỹ` | `"~y"` | LATIN SMALL LETTER Y WITH TILDE | +| U2008 | ` ` | `" ."` | PUNCTUATION SPACE | +| U2013 | `–` | `"--."` | EN DASH | +| U2014 | `—` | `"---"` | EM DASH | +| U2018 | `‘` | `"<'" "'<"` | LEFT SINGLE QUOTATION MARK | +| U2019 | `’` | `">'" "'>"` | RIGHT SINGLE QUOTATION MARK | +| U201a | `‚` | `",'" "',"` | SINGLE LOW-9 QUOTATION MARK | +| U201c | `“` | `"<"" ""<"` | LEFT DOUBLE QUOTATION MARK | +| U201d | `”` | `">"" "">"` | RIGHT DOUBLE QUOTATION MARK | +| U201e | `„` | `","" "","` | DOUBLE LOW-9 QUOTATION MARK | +| U2030 | `‰` | `"%o"` | PER MILLE SIGN | +| U2039 | `‹` | `".<"` | SINGLE LEFT-POINTING ANGLE QUOTATION MARK | +| U203a | `›` | `".>"` | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | +| U2070 | `⁰` | `"^0"` | SUPERSCRIPT ZERO | +| U2071 | `ⁱ` | `"^_i"` | SUPERSCRIPT LATIN SMALL LETTER I | +| U2074 | `⁴` | `"^4"` | SUPERSCRIPT FOUR | +| U2075 | `⁵` | `"^5"` | SUPERSCRIPT FIVE | +| U2076 | `⁶` | `"^6"` | SUPERSCRIPT SIX | +| U2077 | `⁷` | `"^7"` | SUPERSCRIPT SEVEN | +| U2078 | `⁸` | `"^8"` | SUPERSCRIPT EIGHT | +| U2079 | `⁹` | `"^9"` | SUPERSCRIPT NINE | +| U207a | `⁺` | `"^+"` | SUPERSCRIPT PLUS SIGN | +| U207c | `⁼` | `"^="` | SUPERSCRIPT EQUALS SIGN | +| U207d | `⁽` | `"^("` | SUPERSCRIPT LEFT PARENTHESIS | +| U207e | `⁾` | `"^)"` | SUPERSCRIPT RIGHT PARENTHESIS | +| U207f | `ⁿ` | `"^_n"` | SUPERSCRIPT LATIN SMALL LETTER N | +| U2080 | `₀` | `"_0"` | SUBSCRIPT ZERO | +| U2081 | `₁` | `"_1"` | SUBSCRIPT ONE | +| U2082 | `₂` | `"_2"` | SUBSCRIPT TWO | +| U2083 | `₃` | `"_3"` | SUBSCRIPT THREE | +| U2084 | `₄` | `"_4"` | SUBSCRIPT FOUR | +| U2085 | `₅` | `"_5"` | SUBSCRIPT FIVE | +| U2086 | `₆` | `"_6"` | SUBSCRIPT SIX | +| U2087 | `₇` | `"_7"` | SUBSCRIPT SEVEN | +| U2088 | `₈` | `"_8"` | SUBSCRIPT EIGHT | +| U2089 | `₉` | `"_9"` | SUBSCRIPT NINE | +| U208a | `₊` | `"_+"` | SUBSCRIPT PLUS SIGN | +| U208c | `₌` | `"_="` | SUBSCRIPT EQUALS SIGN | +| U208d | `₍` | `"_("` | SUBSCRIPT LEFT PARENTHESIS | +| U208e | `₎` | `"_)"` | SUBSCRIPT RIGHT PARENTHESIS 
| +| U20a0 | `₠` | `"CE"` | EURO-CURRENCY SIGN | +| U20a1 | `₡` | `"C/" "/C"` | COLON SIGN | +| U20a2 | `₢` | `"Cr"` | CRUZEIRO SIGN | +| U20a3 | `₣` | `"Fr"` | FRENCH FRANC SIGN | +| U20a4 | `₤` | `"L=" "=L"` | LIRA SIGN | +| U20a5 | `₥` | `"m/" "/m"` | MILL SIGN | +| U20a6 | `₦` | `"N=" "=N"` | NAIRA SIGN | +| U20a7 | `₧` | `"Pt"` | PESETA SIGN | +| U20a8 | `₨` | `"Rs"` | RUPEE SIGN | +| U20a9 | `₩` | `"W=" "=W"` | WON SIGN | +| U20ab | `₫` | `"d-"` | DONG SIGN | +| U20ac | `€` | `"C=" "=C" "c=" "=c" "E=" "=E"` | EURO SIGN | +| U2120 | `℠` | `"^SM"` | SERVICE MARK | +| U2122 | `™` | `"^TM"` | TRADE MARK SIGN | +| U301d | `〝` | `""\"` | REVERSED DOUBLE PRIME QUOTATION MARK | +| U301e | `〞` | `""/"` | DOUBLE PRIME QUOTATION MARK | + + +## References + + - + - + - + - + - + - diff --git a/md/conversion_table.md b/md/conversion_table.md new file mode 100644 index 0000000..b4c88fd --- /dev/null +++ b/md/conversion_table.md @@ -0,0 +1,132 @@ +# Conversion Table + +| **Dec** | **Hex** | **Oct** | `Bin` | | **Dec** | **Hex** | **Oct** | `Bin` | +| ------- | ------- | ------- | ---------- | --- | ------- | ------- | ------- | ---------- | +| **0** | 0 | 000 | `00000000` | | **128** | 80 | 200 | `10000000` | +| **1** | 1 | 001 | `00000001` | | **129** | 81 | 201 | `10000001` | +| **2** | 2 | 002 | `00000010` | | **130** | 82 | 202 | `10000010` | +| **3** | 3 | 003 | `00000011` | | **131** | 83 | 203 | `10000011` | +| **4** | 4 | 004 | `00000100` | | **132** | 84 | 204 | `10000100` | +| **5** | 5 | 005 | `00000101` | | **133** | 85 | 205 | `10000101` | +| **6** | 6 | 006 | `00000110` | | **134** | 86 | 206 | `10000110` | +| **7** | 7 | 007 | `00000111` | | **135** | 87 | 207 | `10000111` | +| **8** | 8 | 010 | `00001000` | | **136** | 88 | 210 | `10001000` | +| **9** | 9 | 011 | `00001001` | | **137** | 89 | 211 | `10001001` | +| **10** | a | 012 | `00001010` | | **138** | 8a | 212 | `10001010` | +| **11** | b | 013 | `00001011` | | **139** | 8b | 213 | `10001011` | +| **12** | c | 014 | `00001100` | | **140** | 8c | 214 | `10001100` | +| **13** | d | 015 | `00001101` | | **141** | 8d | 215 | `10001101` | +| **14** | e | 016 | `00001110` | | **142** | 8e | 216 | `10001110` | +| **15** | f | 017 | `00001111` | | **143** | 8f | 217 | `10001111` | +| **16** | 10 | 020 | `00010000` | | **144** | 90 | 220 | `10010000` | +| **17** | 11 | 021 | `00010001` | | **145** | 91 | 221 | `10010001` | +| **18** | 12 | 022 | `00010010` | | **146** | 92 | 222 | `10010010` | +| **19** | 13 | 023 | `00010011` | | **147** | 93 | 223 | `10010011` | +| **20** | 14 | 024 | `00010100` | | **148** | 94 | 224 | `10010100` | +| **21** | 15 | 025 | `00010101` | | **149** | 95 | 225 | `10010101` | +| **22** | 16 | 026 | `00010110` | | **150** | 96 | 226 | `10010110` | +| **23** | 17 | 027 | `00010111` | | **151** | 97 | 227 | `10010111` | +| **24** | 18 | 030 | `00011000` | | **152** | 98 | 230 | `10011000` | +| **25** | 19 | 031 | `00011001` | | **153** | 99 | 231 | `10011001` | +| **26** | 1a | 032 | `00011010` | | **154** | 9a | 232 | `10011010` | +| **27** | 1b | 033 | `00011011` | | **155** | 9b | 233 | `10011011` | +| **28** | 1c | 034 | `00011100` | | **156** | 9c | 234 | `10011100` | +| **29** | 1d | 035 | `00011101` | | **157** | 9d | 235 | `10011101` | +| **30** | 1e | 036 | `00011110` | | **158** | 9e | 236 | `10011110` | +| **31** | 1f | 037 | `00011111` | | **159** | 9f | 237 | `10011111` | +| **32** | 20 | 040 | `00100000` | | **160** | a0 | 240 | `10100000` | +| **33** | 21 | 041 | `00100001` | | **161** | a1 | 
241 | `10100001` | +| **34** | 22 | 042 | `00100010` | | **162** | a2 | 242 | `10100010` | +| **35** | 23 | 043 | `00100011` | | **163** | a3 | 243 | `10100011` | +| **36** | 24 | 044 | `00100100` | | **164** | a4 | 244 | `10100100` | +| **37** | 25 | 045 | `00100101` | | **165** | a5 | 245 | `10100101` | +| **38** | 26 | 046 | `00100110` | | **166** | a6 | 246 | `10100110` | +| **39** | 27 | 047 | `00100111` | | **167** | a7 | 247 | `10100111` | +| **40** | 28 | 050 | `00101000` | | **168** | a8 | 250 | `10101000` | +| **41** | 29 | 051 | `00101001` | | **169** | a9 | 251 | `10101001` | +| **42** | 2a | 052 | `00101010` | | **170** | aa | 252 | `10101010` | +| **43** | 2b | 053 | `00101011` | | **171** | ab | 253 | `10101011` | +| **44** | 2c | 054 | `00101100` | | **172** | ac | 254 | `10101100` | +| **45** | 2d | 055 | `00101101` | | **173** | ad | 255 | `10101101` | +| **46** | 2e | 056 | `00101110` | | **174** | ae | 256 | `10101110` | +| **47** | 2f | 057 | `00101111` | | **175** | af | 257 | `10101111` | +| **48** | 30 | 060 | `00110000` | | **176** | b0 | 260 | `10110000` | +| **49** | 31 | 061 | `00110001` | | **177** | b1 | 261 | `10110001` | +| **50** | 32 | 062 | `00110010` | | **178** | b2 | 262 | `10110010` | +| **51** | 33 | 063 | `00110011` | | **179** | b3 | 263 | `10110011` | +| **52** | 34 | 064 | `00110100` | | **180** | b4 | 264 | `10110100` | +| **53** | 35 | 065 | `00110101` | | **181** | b5 | 265 | `10110101` | +| **54** | 36 | 066 | `00110110` | | **182** | b6 | 266 | `10110110` | +| **55** | 37 | 067 | `00110111` | | **183** | b7 | 267 | `10110111` | +| **56** | 38 | 070 | `00111000` | | **184** | b8 | 270 | `10111000` | +| **57** | 39 | 071 | `00111001` | | **185** | b9 | 271 | `10111001` | +| **58** | 3a | 072 | `00111010` | | **186** | ba | 272 | `10111010` | +| **59** | 3b | 073 | `00111011` | | **187** | bb | 273 | `10111011` | +| **60** | 3c | 074 | `00111100` | | **188** | bc | 274 | `10111100` | +| **61** | 3d | 075 | `00111101` | | **189** | bd | 275 | `10111101` | +| **62** | 3e | 076 | `00111110` | | **190** | be | 276 | `10111110` | +| **63** | 3f | 077 | `00111111` | | **191** | bf | 277 | `10111111` | +| **64** | 40 | 100 | `01000000` | | **192** | c0 | 300 | `11000000` | +| **65** | 41 | 101 | `01000001` | | **193** | c1 | 301 | `11000001` | +| **66** | 42 | 102 | `01000010` | | **194** | c2 | 302 | `11000010` | +| **67** | 43 | 103 | `01000011` | | **195** | c3 | 303 | `11000011` | +| **68** | 44 | 104 | `01000100` | | **196** | c4 | 304 | `11000100` | +| **69** | 45 | 105 | `01000101` | | **197** | c5 | 305 | `11000101` | +| **70** | 46 | 106 | `01000110` | | **198** | c6 | 306 | `11000110` | +| **71** | 47 | 107 | `01000111` | | **199** | c7 | 307 | `11000111` | +| **72** | 48 | 110 | `01001000` | | **200** | c8 | 310 | `11001000` | +| **73** | 49 | 111 | `01001001` | | **201** | c9 | 311 | `11001001` | +| **74** | 4a | 112 | `01001010` | | **202** | ca | 312 | `11001010` | +| **75** | 4b | 113 | `01001011` | | **203** | cb | 313 | `11001011` | +| **76** | 4c | 114 | `01001100` | | **204** | cc | 314 | `11001100` | +| **77** | 4d | 115 | `01001101` | | **205** | cd | 315 | `11001101` | +| **78** | 4e | 116 | `01001110` | | **206** | ce | 316 | `11001110` | +| **79** | 4f | 117 | `01001111` | | **207** | cf | 317 | `11001111` | +| **80** | 50 | 120 | `01010000` | | **208** | d0 | 320 | `11010000` | +| **81** | 51 | 121 | `01010001` | | **209** | d1 | 321 | `11010001` | +| **82** | 52 | 122 | `01010010` | | **210** | d2 | 322 | `11010010` | +| 
**83** | 53 | 123 | `01010011` | | **211** | d3 | 323 | `11010011` | +| **84** | 54 | 124 | `01010100` | | **212** | d4 | 324 | `11010100` | +| **85** | 55 | 125 | `01010101` | | **213** | d5 | 325 | `11010101` | +| **86** | 56 | 126 | `01010110` | | **214** | d6 | 326 | `11010110` | +| **87** | 57 | 127 | `01010111` | | **215** | d7 | 327 | `11010111` | +| **88** | 58 | 130 | `01011000` | | **216** | d8 | 330 | `11011000` | +| **89** | 59 | 131 | `01011001` | | **217** | d9 | 331 | `11011001` | +| **90** | 5a | 132 | `01011010` | | **218** | da | 332 | `11011010` | +| **91** | 5b | 133 | `01011011` | | **219** | db | 333 | `11011011` | +| **92** | 5c | 134 | `01011100` | | **220** | dc | 334 | `11011100` | +| **93** | 5d | 135 | `01011101` | | **221** | dd | 335 | `11011101` | +| **94** | 5e | 136 | `01011110` | | **222** | de | 336 | `11011110` | +| **95** | 5f | 137 | `01011111` | | **223** | df | 337 | `11011111` | +| **96** | 60 | 140 | `01100000` | | **224** | e0 | 340 | `11100000` | +| **97** | 61 | 141 | `01100001` | | **225** | e1 | 341 | `11100001` | +| **98** | 62 | 142 | `01100010` | | **226** | e2 | 342 | `11100010` | +| **99** | 63 | 143 | `01100011` | | **227** | e3 | 343 | `11100011` | +| **100** | 64 | 144 | `01100100` | | **228** | e4 | 344 | `11100100` | +| **101** | 65 | 145 | `01100101` | | **229** | e5 | 345 | `11100101` | +| **102** | 66 | 146 | `01100110` | | **230** | e6 | 346 | `11100110` | +| **103** | 67 | 147 | `01100111` | | **231** | e7 | 347 | `11100111` | +| **104** | 68 | 150 | `01101000` | | **232** | e8 | 350 | `11101000` | +| **105** | 69 | 151 | `01101001` | | **233** | e9 | 351 | `11101001` | +| **106** | 6a | 152 | `01101010` | | **234** | ea | 352 | `11101010` | +| **107** | 6b | 153 | `01101011` | | **235** | eb | 353 | `11101011` | +| **108** | 6c | 154 | `01101100` | | **236** | ec | 354 | `11101100` | +| **109** | 6d | 155 | `01101101` | | **237** | ed | 355 | `11101101` | +| **110** | 6e | 156 | `01101110` | | **238** | ee | 356 | `11101110` | +| **111** | 6f | 157 | `01101111` | | **239** | ef | 357 | `11101111` | +| **112** | 70 | 160 | `01110000` | | **240** | f0 | 360 | `11110000` | +| **113** | 71 | 161 | `01110001` | | **241** | f1 | 361 | `11110001` | +| **114** | 72 | 162 | `01110010` | | **242** | f2 | 362 | `11110010` | +| **115** | 73 | 163 | `01110011` | | **243** | f3 | 363 | `11110011` | +| **116** | 74 | 164 | `01110100` | | **244** | f4 | 364 | `11110100` | +| **117** | 75 | 165 | `01110101` | | **245** | f5 | 365 | `11110101` | +| **118** | 76 | 166 | `01110110` | | **246** | f6 | 366 | `11110110` | +| **119** | 77 | 167 | `01110111` | | **247** | f7 | 367 | `11110111` | +| **120** | 78 | 170 | `01111000` | | **248** | f8 | 370 | `11111000` | +| **121** | 79 | 171 | `01111001` | | **249** | f9 | 371 | `11111001` | +| **122** | 7a | 172 | `01111010` | | **250** | fa | 372 | `11111010` | +| **123** | 7b | 173 | `01111011` | | **251** | fb | 373 | `11111011` | +| **124** | 7c | 174 | `01111100` | | **252** | fc | 374 | `11111100` | +| **125** | 7d | 175 | `01111101` | | **253** | fd | 375 | `11111101` | +| **126** | 7e | 176 | `01111110` | | **254** | fe | 376 | `11111110` | +| **127** | 7f | 177 | `01111111` | | **255** | ff | 377 | `11111111` | diff --git a/md/debian_boinc_client.md b/md/debian_boinc_client.md new file mode 100644 index 0000000..b5bba2a --- /dev/null +++ b/md/debian_boinc_client.md @@ -0,0 +1,385 @@ +# Debian BOINC Client + +## Contents + + - [Installation](#server-installation) + - [User Setup](#user-setup) + - 
[Disable root Login](#disable-root-login)
+ - [Server Hardening](#server-hardening)
+ - [fail2ban Setup](#fail2ban-setup)
+ - [BOINC Client Setup](#boinc-client-setup)
+ - [BOINC Project Setup](#boinc-project-setup)
+ - [Final Checks](#final-checks)
+ - [References](#references)
+
+
+## Server Installation
+
+### Physical Device
+
+Install **Debian 9 (Stretch) 64bit or newer** onto the device using the Minimal setup option plus the SSH server option during the final steps of the installation. BOINC does not require enough disk space to warrant custom partitioning; the default installation options should work in most cases.
+
+### Cloud Server
+
+Spin up a basic **Debian 9 (Stretch) 64bit or newer** cloud instance at the provider of your choice; note that a [BOINC](https://boinc.berkeley.edu/) client is CPU and RAM intensive, so be careful which provider you choose. This guide is written for a 1 CPU, 512M to 1G RAM, 20G disk cloud server with 1TB transfer/month which is billed at a flat monthly rate by the provider; tuning is provided to avoid noisy-neighbor behavior and keep resource usage within reasonable limits.
+
+> **Be Careful of Costs** - Cloud providers typically charge for time spent running (uptime) _plus_ bandwidth charges. Research costs carefully and ensure the client is configured to meet your budget. The below tuning is only a guide and may need to be tweaked for specific circumstances.
+
+The below instructions have been tested on various standard Debian instances supplied by cloud providers. The actual CPU resource limits vary depending on the type of provider/instance, however - be sure to pay attention to the CPU usage tuning details.
+
+
+## User Setup
+
+> **Use a very secure password** - at a minimum use `pwgen -sB 15`, strong password security encouraged!
+
+Set up a non-root user and add it to the `sudo` group, then add a password. Use this user for SSH access and become root once logged in with sudo; if you have used SSH keys to log in as root, copy them to this new user's setup as well if needed:
+
+```
+apt-get update
+apt-get install sudo
+
+export MYUSER="frankthetank"
+useradd -m -d /home/${MYUSER} -s /bin/bash -g users -G sudo ${MYUSER}
+passwd ${MYUSER}
+```
+
+If you are unable to use `ssh-copy-id` from your workstation to add a new SSH key, perform the work manually:
+
+```
+mkdir /home/${MYUSER}/.ssh
+cp /root/.ssh/authorized_keys /home/${MYUSER}/.ssh/
+chmod 0700 /home/${MYUSER}/.ssh
+chmod 0600 /home/${MYUSER}/.ssh/authorized_keys
+chown -R ${MYUSER}:users /home/${MYUSER}/.ssh
+```
+
+> **SSH in as this user and test `sudo` several times** - log out completely between tests
+
+### Disable root Login
+
+**If the above is successful** and you are capable of gaining full root privileges via the non-root SSH session using sudo, now disable root logins in SSH from the outside world for an additional security layer. The `root` account still remains usable, just not via _direct_ SSH access.
+
+The task is to set `PermitRootLogin no` - the setting varies from one provider to another, sometimes it's already set (either yes or no), sometimes it's commented out.
This small scriptlet should handle these 2 most common cases, **be careful** and investigate for yourself:
+
+```
+_SCFG="/etc/ssh/sshd_config"
+if grep -iEq '^PermitRootLogin[[:space:]]+yes' "${_SCFG}"; then
+    sed -i.bak -e 's/^PermitRootLogin.*/PermitRootLogin no/gi' "${_SCFG}"
+else
+    sed -i.bak -e 's/^#PermitRootLogin.*/PermitRootLogin no/gi' "${_SCFG}"
+fi
+```
+
+**After confirming the change is correct**, restart the SSH core daemon (it will not log you out):
+
+```
+systemctl restart sshd
+```
+
+**Test logging in again** to ensure the changes are as expected. Do not log out of the active, working SSH session as root until you've confirmed in _another_ session you can log in as your non-root user and still gain `sudo` to root.
+
+
+## Server Hardening
+
+**1.** The Debian default vimrc (`set mouse=a`, `/usr/share/vim/vim80/defaults.vim`) breaks middle-mouse click paste when working remotely via SSH; override the setting to just disable the mouse:
+
+```
+echo 'set mouse=' >> /root/.vimrc
+```
+
+**2.** Install a few basic packages to make life a little nicer; typically the minimal install / cloud instances are stripped down and need a few things added, both for security and ease of use. Adjust as needed, at a minimum ensure the below are in place:
+
+```
+apt-get update
+echo "iptables-persistent iptables-persistent/autosave_v4 boolean true" | debconf-set-selections
+echo "iptables-persistent iptables-persistent/autosave_v6 boolean true" | debconf-set-selections
+echo "unattended-upgrades unattended-upgrades/enable_auto_updates boolean true" | debconf-set-selections
+apt-get install sysstat unattended-upgrades iptables-persistent man less vim
+```
+
+The `smem` package will pull in a lot of X dependencies due to an embedded recommendation, install it while disabling that feature. This utility can be used to quickly query memory usage (including swap) on the memory constrained cloud server:
+
+```
+apt-get install smem --no-install-recommends
+```
+
+**3.** Enable `journald` to store logs on disk instead of just in RAM. By default, the `journald` system is in automatic mode - on boot it keeps the journal in the ephemeral tmpfs under `/run` (in RAM), and will _only_ transition to storing the journal persistently on disk if this directory exists.
+
+```
+mkdir /var/log/journal
+```
+
+**4.** Enable _sysstat_ for ongoing statistics capture of your instance (use `sar` to view):
+
+```
+sed -i.bak -e 's|^ENABLED=".*"|ENABLED="true"|g' /etc/default/sysstat
+```
+
+**5.** Enable _unattended-upgrades_ to ensure that all Security updates are applied:
+
+```
+cat << 'EOF' > /etc/apt/apt.conf.d/02periodic
+APT::Periodic::Enable "1";
+APT::Periodic::Update-Package-Lists "1";
+APT::Periodic::Download-Upgradeable-Packages "1";
+APT::Periodic::AutocleanInterval "5";
+APT::Periodic::Unattended-Upgrade "1";
+EOF
+```
+
+**6.** Enable the basic _iptables_ rules to allow only port 22:
+
+```
+cat << 'EOF' > /etc/iptables/rules.v4
+*filter
+:INPUT ACCEPT [0:0]
+:FORWARD ACCEPT [0:0]
+:OUTPUT ACCEPT [0:0]
+-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
+-A INPUT -p icmp -j ACCEPT
+-A INPUT -i lo -j ACCEPT
+-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
+-A INPUT -j REJECT --reject-with icmp-host-prohibited
+-A FORWARD -j REJECT --reject-with icmp-host-prohibited
+COMMIT
+EOF
+
+cat << 'EOF' > /etc/iptables/rules.v6
+*filter
+:INPUT ACCEPT [0:0]
+:FORWARD ACCEPT [0:0]
+:OUTPUT ACCEPT [0:0]
+-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
+-A INPUT -p ipv6-icmp -j ACCEPT
+-A INPUT -i lo -j ACCEPT
+-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
+-A INPUT -j REJECT --reject-with icmp6-adm-prohibited
+-A FORWARD -j REJECT --reject-with icmp6-adm-prohibited
+COMMIT
+EOF
+```
+
+**7.** If using a cloud server, add a bit of swap if needed - using swap is not bad in and of itself; the Linux kernel will push small, rarely used bits of data there if it's available. Using swap in place of real RAM is bad, however - the tuning below will avoid actual application swapping.
+
+```
+# 512M file, probably overkill
+dd if=/dev/zero of=/swap.file bs=4096 count=128000
+chmod 0600 /swap.file
+mkswap /swap.file
+echo '/swap.file none swap defaults 0 0' >> /etc/fstab
+swapon /swap.file
+```
+
+**8.** Finally, ensure all the services are enabled and apply all outstanding updates; reboot as needed for a new kernel. If you don't reboot here, you'll need to `service` _foo_ `restart` each one individually (just reboot, it's easier):
+
+```
+systemctl disable remote-fs.target
+systemctl enable sysstat unattended-upgrades netfilter-persistent
+
+apt-get dist-upgrade -y
+
+reboot
+```
+
+### fail2ban Setup
+
+Optional: configure fail2ban to keep an eye on the SSH port for brute force attacks.
+
+> **Note**: `fail2ban` tends to consume a fair amount of memory the longer it runs; if the cloud server is memory constrained, you may wish to skip this step or disable the service later. Use `smem` to monitor it periodically.
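+
+A quick way to check its footprint from time to time is `smem`'s process filter (`-t` adds a totals row, `-k` prints human-readable sizes):
+
+```
+# the filter is a regex against the command line, so the smem call itself may also appear
+smem -tk -P fail2ban
+```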
+
+```
+apt-get install fail2ban sqlite3
+
+cat << 'EOF' > /etc/fail2ban/jail.local
+[DEFAULT]
+ignoreip = 127.0.0.1/8
+bantime = 600
+maxretry = 3
+backend = auto
+destemail = root@localhost
+
+[ssh]
+enabled = true
+port = ssh
+filter = sshd
+logpath = /var/log/auth.log
+maxretry = 6
+EOF
+
+systemctl enable --now fail2ban
+```
+
+Additionally, add a weekly `cron` task to purge the database of old IPs (bug in the 0.9.x series) and to restart the daemon to free up its RAM usage:
+
+```
+cat << 'EOF' > /etc/fail2ban/dbpurge.sql
+delete from bans where timeofban <= strftime('%s', date('now', '-7 days'));
+vacuum;
+.quit
+EOF
+
+cat << 'EOF' > /etc/cron.weekly/f2b-cleanup
+#!/bin/sh
+if [ -x /usr/bin/sqlite3 ]; then
+    sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 < /etc/fail2ban/dbpurge.sql
+fi
+systemctl restart fail2ban.service
+EOF
+
+chown root:root /etc/cron.weekly/f2b-cleanup
+chmod 0755 /etc/cron.weekly/f2b-cleanup
+```
+
+
+## BOINC Client Setup
+
+> **Note**: Debian 9 uses `/var/lib/boinc-client` and Debian 10 uses `/var/lib/boinc` as the default data directory. If you installed using 9 and `dist-upgrade` to 10, the upgrade process will create a symlink connecting the old name to the new name.
+
+The BOINC client is basically a wrapper application; it will "phone home" periodically and download new work units from the upstream project(s) and manage running those work units, then submit the results of the work to the upstream projects.
+
+**1.** Install the `boinc-client` and `boinctui` software; the client will automatically start and be enabled as part of the package installation process. As of this writing, errors appeared in the logfile about missing directories, which are created here:
+
+```
+apt-get install boinc-client boinctui
+
+# Debian 9
+mkdir /var/lib/boinc-client/{slots,locale}
+chown boinc:boinc /var/lib/boinc-client/{slots,locale}
+
+# Debian 10
+mkdir /var/lib/boinc/{slots,locale}
+chown boinc:boinc /var/lib/boinc/{slots,locale}
+
+systemctl restart boinc-client
+```
+
+
+**2.** Enable the client to always run, as this is a dedicated researching instance:
+
+```
+boinccmd --set_run_mode always
+boinccmd --set_network_mode always
+```
+
+**3.** Cloud server: tune the client settings to fit within the resources, and keep it from being a noisy-neighbor; as this is a dedicated instance, most RAM and disk will be allocated, but the CPU thresholds are reduced. As an example of two very different platforms, if you're using a Google Cloud `f1-micro` VM which has only 0.2 (20%) of a vCPU, a good throttle is 19%. But if you're using a full vCPU instance, 49% is an acceptable limit for most providers to avoid being a noisy neighbor. You will need to research the exact right throttle for your specific circumstances.
+
+> **1 TB/month** is roughly **400 KB/s** sustained bandwidth
+
+**3.1** First, use `systemd` to control the CPU throttle, not the settings in the BOINC client. The BOINC client uses CPU idle detection and frequently pauses the process as it detects other things happening, which is undesirable in this specific setup. Instead, we will use `systemd` to create a _cgroup_ (resources control group) to confine the BOINC process - and its children, the _work units_ - to a smaller percentage of the real CPU.
+
+> The `CPUQuota` option was added in systemd version 213 and requires `CONFIG_CFS_BANDWIDTH=y` to be configured in the active kernel. Debian 9 meets both of the requirements.
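+
+To verify both of those requirements on a given host before relying on the override, a quick check can be run (this assumes the running kernel's config file is available under `/boot`, as it is for stock Debian kernels):
+
+```
+systemctl --version | head -n1                        # needs to report 213 or later
+grep CONFIG_CFS_BANDWIDTH= /boot/config-$(uname -r)   # needs to print ...=y
+```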
Use the built-in edit capability of `systemctl` to create the unit override setting(s); it will create `/etc/systemd/system/boinc-client.service.d/override.conf` and place you into edit mode:
+
+```
+EDITOR=vi systemctl edit boinc-client
+
+# Add the below to the new file being edited and save
+[Service]
+CPUQuota=49%
+```
+
+Inform `systemd` of the new content just created, then restart the BOINC client to activate:
+
+```
+systemctl daemon-reload
+systemctl restart boinc-client
+```
+
+
+**3.2** Next, set up the BOINC client to use 100% of what it thinks is the whole CPU and to only pause when it thinks other processes are using 100%, such that it basically never stops running at all while confined to its custom _cgroup_. Remember that on the outside, it's only actually using 49% of the CPU in this throttled configuration.
+
+> Notice that we still set the real disk space and RAM thresholds; we are only telling it to use 100% of CPU as that's the only item we throttled from the outside. If you cannot use the above systemd throttle, you must set the CPU settings below to their actual lower values desired!
+
+```
+# Debian 9
+vi /var/lib/boinc-client/global_prefs_override.xml
+
+# Debian 10
+vi /var/lib/boinc/global_prefs_override.xml
+
+<global_preferences>
+   <max_ncpus_pct>100.000000</max_ncpus_pct>
+   <cpu_usage_limit>100.000000</cpu_usage_limit>
+   <disk_max_used_pct>75.000000</disk_max_used_pct>
+   <ram_max_used_busy_pct>70.000000</ram_max_used_busy_pct>
+   <ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
+   <suspend_cpu_usage>50.000000</suspend_cpu_usage>
+   <run_on_batteries>1</run_on_batteries>
+   <run_if_user_active>1</run_if_user_active>
+   <run_gpu_if_user_active>1</run_gpu_if_user_active>
+   <leave_apps_in_memory>1</leave_apps_in_memory>
+   <daily_xfer_limit_mb>8000.000000</daily_xfer_limit_mb>
+   <daily_xfer_period_days>30</daily_xfer_period_days>
+</global_preferences>
+
+boinccmd --read_global_prefs_override
+```
+
+> **Note**: the above step 3 instructions can also be followed on a Physical device to limit the CPU use, this is not unique to a cloud server. Throttling a physical device to 95% of the CPU power leaves a little room for other things (nightly updates, fail2ban, ssh sessions, etc.) to breathe a little easier.
+
+## BOINC Project Setup
+
+The easiest way to manage BOINC projects is to use an online Account Manager; much like the BOINC client manages downloading and running work units, the Account Manager handles which projects you wish to join, as well as helping connect all those projects together easily in one view. One of the most popular Account Managers is BOINC Account Manager (BAM!); it's already integrated with the `boinctui` and `boinc-client` software and has a handy setup guide:
+
+ - 
+
+Follow the above set of steps, then when reaching step 8 "Attach the BOINC client to BAM!" return here and follow these steps. There are multiple ways to do this, using `boinctui` simply makes everything a lot easier.
+
+> **Tip**: When making your accounts, use the same username / password combination for the various projects that you use with BAM! setup. They all interconnect with each other, using the same credentials leads to a better experience and is easier to manage. However, be sure to use a unique password for this setup, do _not_ use an already known password!
+
+**1.** As your _non-privileged user_ (not _root_!) start the application: `boinctui`. The very first time it starts, a TUI dialog will appear asking where to connect for the BOINC client. Accept the defaults, `127.0.0.1` and no password.
+
+> **Note**: a password can be added by root to `/etc/boinc-client/gui_rpc_auth.cfg`, then it's used for the above first-time connection dialog. This will add one more layer of security and is recommended; do _not_ use the BAM! password, just generate something random. I recommend `pwgen -sB 15` at a minimum.
+
+**2.** Within the `boinctui` TUI now on screen, press `F9` to bring up the menu along the top.
Arrow over to the **Projects** menu item, then choose the option to **start using an account manager** and press Enter.
+
+**3.** From the list of account managers now shown, select the one near the top named **BAM!** and a new dialog will appear asking for your login and password. This is the same login created above.
+
+At this point a few actions will be happening upstream, be patient - in general, the following is happening when you first join a new server to your BAM! account:
+
+ - A new host record is created in BAM!
+ - A new host record is created in the actual project(s)
+ - The new host records are tied together inside BAM!
+ - BAM! returns the list of projects to use back to your server
+ - Your server "phones home" to each project and initializes
+ - Your server runs first-time performance benchmarks
+ - Your server starts running its first work units
+
+After just a bit of time (it varies per project; as a rule of thumb, wait at least 5 minutes) perform two more first-time only actions to get everything aligned:
+
+ - Within `boinctui`, press `F9` for the Menu, scroll to **Projects**, select the project(s) and press Enter, a new submenu will appear. Select **Update project** -- this triggers your server to contact the project to sort of "cement its existence" by uploading statistics that it's working hard for the money (so hard for the money).
+ - Within `boinctui`, press `F9` again, **Projects**, then **Synchronize with manager** -- this triggers BAM! to query the project you just updated above, so now BAM! knows your server is working hard and is connected.
+
+The above two steps can be skipped and they will eventually happen on their own; they are normal regular tasks performed by the BOINC client wrapper software. Performing them manually the first time just helps make sure everything is working as expected. After a day or so (some projects only update statistics once a day) you will be able to track the performance of your server in both BAM! and the specific projects themselves.
+
+
+## Final Checks
+
+Note that it may take a bit of time (not long) for tasks to be assigned to your new cloud client for processing; you should be able to get an immediate confirmation the client is connected and working with a general overview:
+
+```
+boinccmd --get_state
+```
+
+Once the Projects attach and start sending tasks, use `top` and you should see the active binaries crunching data using the CPU and RAM; more detail can be observed with `boinccmd` (see the man page for all options):
+
+```
+boinccmd --get_simple_gui_info
+boinccmd --get_project_status
+boinccmd --get_tasks
+```
+
+The `boinctui` application is a very nice alternative, it presents a nice windowed interface which can fully control the client options with menus, and presents all the information cleanly. Pretty much anything `boinccmd` can do, `boinctui` can do as well.
+
+```
+boinctui
+```
+
+See the above section on configuring `boinctui` on its first run.
+
+
+## References
+
+ - 
+ - 
diff --git a/md/debian_server_setup.md b/md/debian_server_setup.md
new file mode 100644
index 0000000..6e46eab
--- /dev/null
+++ b/md/debian_server_setup.md
@@ -0,0 +1,373 @@
+# Debian Server Setup
+
+## Contents
+
+ - [Server Installation](#server-installation)
+ - [Server User Setup](#server-user-setup)
+ - [Disable root Login](#disable-root-login)
+ - [Server Hardening](#server-hardening)
+ - [fail2ban Setup](#fail2ban-setup)
+ - [Apache Webserver](#apache-webserver)
+ - [Apache iptables Ports](#apache-iptables-ports)
+ - [Apache Default Template](#apache-default-template)
+ - [Apache 80 Template](#apache-80-template)
+ - [Apache 443 Template](#apache-443-template)
+
+
+## Server Installation
+
+Install **Debian 10** using the Minimal setup method; add the SSH server option during the final steps of the installation. This is the default image delivered from many cloud providers; it may use the default hostname `localhost` - if desired, set a new one:
+
+```
+hostnamectl set-hostname myhostname
+```
+
+Ensure the hostname resolves locally - it does not have to be `127.0.0.1` (localhost) nor a FQDN, for example this works:
+
+```
+127.0.0.1 localhost
+127.0.1.1 myhostname
+```
+
+Adjust as needed based on how `/etc/hosts` is already configured from the installation.
+
+
+## Server User Setup
+
+> **Use a very secure password** - at a minimum use `pwgen -sB 15`, strong password security encouraged!
+
+Set up a non-root user and add it to the `sudo` group, then add a password. Use this user for SSH access and become root once logged in with sudo; if you have used SSH keys to log in as root, copy them to this new user's setup as well if needed:
+
+```
+apt-get update
+apt-get install sudo
+
+export MYUSER="frankthetank"
+useradd -m -d /home/${MYUSER} -s /bin/bash -g users -G sudo ${MYUSER}
+passwd ${MYUSER}
+```
+
+If you are unable to use `ssh-copy-id` from your workstation to add a new SSH key, perform the work manually:
+
+```
+mkdir /home/${MYUSER}/.ssh
+cp /root/.ssh/authorized_keys /home/${MYUSER}/.ssh/
+chmod 0700 /home/${MYUSER}/.ssh
+chmod 0600 /home/${MYUSER}/.ssh/authorized_keys
+chown -R ${MYUSER}:users /home/${MYUSER}/.ssh
+```
+
+> **SSH in as this user and test `sudo` several times** - log out completely between tests
+
+### Disable root Login
+
+**If the above is successful** and you are capable of gaining full root privileges via the non-root SSH session using sudo, now disable root logins in SSH from the outside world for an additional security layer. The `root` account still remains usable, just not via _direct_ SSH access.
+
+The task is to set `PermitRootLogin no` - the setting varies from one provider to another, sometimes it's already set (either yes or no), sometimes it's commented out. This small scriptlet should handle these 2 most common cases, **be careful** and investigate for yourself:
+
+```
+_SCFG="/etc/ssh/sshd_config"
+if grep -iEq '^PermitRootLogin[[:space:]]+yes' "${_SCFG}"; then
+    sed -i.bak -e 's/^PermitRootLogin.*/PermitRootLogin no/gi' "${_SCFG}"
+else
+    sed -i.bak -e 's/^#PermitRootLogin.*/PermitRootLogin no/gi' "${_SCFG}"
+fi
+```
+
+**After confirming the change is correct**, restart the SSH core daemon (it will not log you out):
+
+```
+systemctl restart sshd
+```
+
+**Test logging in again** to ensure the changes are as expected. Do not log out of the active, working SSH session as root until you've confirmed in _another_ session you can log in as your non-root user and still gain `sudo` to root.
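+
+A minimal verification sequence from your workstation might look like the following (using the example username and hostname from above - substitute your own):
+
+```
+ssh frankthetank@myhostname      # non-root login must succeed
+sudo -i                          # must produce a root shell; 'exit' twice to log back out
+ssh root@myhostname              # run from the workstation again - must now be refused
+```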
+
+
+## Server Hardening
+
+**1.** The Debian default vimrc (`set mouse=a`, `/usr/share/vim/vim81/defaults.vim`) breaks middle-mouse click paste when working remotely via SSH; override the setting to just disable the mouse:
+
+```
+echo 'set mouse=' >> ~/.vimrc
+```
+
+**2.** Install a few basic packages to make life a little nicer; typically the minimal install / cloud instances are stripped down and need a few things added, both for security and ease of use. Adjust as desired:
+
+```
+apt-get update
+echo "iptables-persistent iptables-persistent/autosave_v4 boolean true" | debconf-set-selections
+echo "iptables-persistent iptables-persistent/autosave_v6 boolean true" | debconf-set-selections
+echo "unattended-upgrades unattended-upgrades/enable_auto_updates boolean true" | debconf-set-selections
+apt-get install sysstat unattended-upgrades iptables-persistent man less vim rsync bc net-tools git strace
+```
+
+The `smem` package will pull in a lot of X dependencies due to an embedded recommendation, install it while disabling that feature. This utility can be used to quickly query memory usage (including swap) on the memory constrained cloud server:
+
+```
+apt-get install smem --no-install-recommends
+```
+
+**3.** Enable `journald` to store logs on disk instead of just in RAM. By default, the `journald` system is in automatic mode - on boot it keeps the journal in the ephemeral tmpfs under `/run` (in RAM), and will _only_ transition to storing the journal persistently on disk if this directory exists.
+
+```
+mkdir /var/log/journal
+```
+
+**4.** Enable _sysstat_ for ongoing statistics capture of your instance (use `sar` to view):
+
+```
+sed -i.bak -e 's|^ENABLED=".*"|ENABLED="true"|g' /etc/default/sysstat
+```
+
+**5.** Enable _unattended-upgrades_ to ensure that all Security updates are applied:
+
+```
+cat << 'EOF' > /etc/apt/apt.conf.d/02periodic
+APT::Periodic::Enable "1";
+APT::Periodic::Update-Package-Lists "1";
+APT::Periodic::Download-Upgradeable-Packages "1";
+APT::Periodic::AutocleanInterval "5";
+APT::Periodic::Unattended-Upgrade "1";
+EOF
+```
+
+**6.** Enable the basic _iptables_ rules to allow only port 22:
+
+```
+cat << 'EOF' > /etc/iptables/rules.v4
+*filter
+:INPUT ACCEPT [0:0]
+:FORWARD ACCEPT [0:0]
+:OUTPUT ACCEPT [0:0]
+-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
+-A INPUT -p icmp -j ACCEPT
+-A INPUT -i lo -j ACCEPT
+-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
+-A INPUT -j REJECT --reject-with icmp-host-prohibited
+-A FORWARD -j REJECT --reject-with icmp-host-prohibited
+COMMIT
+EOF
+
+cat << 'EOF' > /etc/iptables/rules.v6
+*filter
+:INPUT ACCEPT [0:0]
+:FORWARD ACCEPT [0:0]
+:OUTPUT ACCEPT [0:0]
+-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
+-A INPUT -p ipv6-icmp -j ACCEPT
+-A INPUT -i lo -j ACCEPT
+-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
+-A INPUT -j REJECT --reject-with icmp6-adm-prohibited
+-A FORWARD -j REJECT --reject-with icmp6-adm-prohibited
+COMMIT
+EOF
+```
+
+**7.** Add a bit of swap if needed - using swap is not bad in and of itself; the Linux kernel will push small, rarely used bits of data there if it's available. A cloud instance may not be delivered with any swap configured.
+
+```
+# 128M swap file
+dd if=/dev/zero of=/swap.file bs=1024 count=128000
+chmod 0600 /swap.file
+mkswap /swap.file
+echo '/swap.file none swap defaults 0 0' >> /etc/fstab
+swapon /swap.file
+```
+
+**8.** Finally, ensure all the services are enabled and apply all outstanding updates; reboot as needed for a new kernel.
If you don't reboot here, you'll need to `service` _foo_ `restart` each one individually (just reboot, it's easier):
+
+```
+systemctl disable remote-fs.target
+systemctl enable sysstat unattended-upgrades netfilter-persistent
+
+apt-get dist-upgrade -y
+
+reboot
+```
+
+### fail2ban Setup
+
+Recommended: configure fail2ban to keep an eye on the SSH port for brute force attacks.
+
+> **Note**: `fail2ban` tends to consume a fair amount of memory the longer it runs; if the cloud server is memory constrained, you may wish to skip this step or disable the service later. Use `smem` to monitor it periodically.
+
+```
+apt-get install fail2ban sqlite3
+
+cat << 'EOF' > /etc/fail2ban/jail.local
+[DEFAULT]
+ignoreip = 127.0.0.1/8
+bantime = 600
+maxretry = 3
+backend = auto
+destemail = root@localhost
+EOF
+
+systemctl enable --now fail2ban
+```
+
+Additionally, add a weekly `cron` task to purge the database of old IPs (bug in the 0.9.x series) and to restart the daemon to free up its RAM usage:
+
+```
+cat << 'EOF' > /etc/fail2ban/dbpurge.sql
+delete from bans where timeofban <= strftime('%s', date('now', '-7 days'));
+vacuum;
+.quit
+EOF
+
+cat << 'EOF' > /etc/cron.weekly/f2b-cleanup
+#!/bin/sh
+if [ -x /usr/bin/sqlite3 ]; then
+    sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 < /etc/fail2ban/dbpurge.sql
+fi
+systemctl restart fail2ban.service
+EOF
+
+chown root:root /etc/cron.weekly/f2b-cleanup
+chmod 0755 /etc/cron.weekly/f2b-cleanup
+```
+
+
+## Apache Webserver
+
+Optional: adding a webserver might be desired; the method of obtaining the SSL certificate is not covered here.
+
+### Apache Installation
+
+The Debian package includes the SSL libraries; a few extra modules need to be enabled to support the extra security tuning in the templates.
+
+```
+apt-get update
+apt-get install apache2
+a2enmod ssl
+a2enmod reqtimeout
+a2enmod rewrite
+a2enmod headers
+a2enmod expires
+```
+
+### Apache iptables Ports
+
+Ensure the ports for 80 and 443 are added to `/etc/iptables/rules.v4` and `/etc/iptables/rules.v6`, typically near where the SSH port has been opened:
+
+```
+-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT
+-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
+```
+
+Restart the daemon: `systemctl restart netfilter-persistent`
+
+
+### Apache Default Template
+
+This is the main template setting up parameters for all virtualhosts; the choice to include the virtual hosts in this template is not required, it is only a stylistic choice of the author. Save this to `/etc/apache2/sites-available/00_main.conf` (or use a symlink):
+
+```
+Timeout 60
+KeepAlive Off
+MaxKeepAliveRequests 100
+KeepAliveTimeout 15
+ServerName localhost
+ServerTokens OS
+TraceEnable off
+
+<IfModule mpm_prefork_module>
+    StartServers 3
+    MinSpareServers 2
+    MaxSpareServers 4
+    ServerLimit 9
+    MaxClients 9
+    MaxRequestsPerChild 2000
+</IfModule>
+
+<IfModule reqtimeout_module>
+    RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500
+</IfModule>
+
+# NOTE: scope this Directory block to your actual docroot parent
+<Directory /path/to/www>
+    AllowOverride None
+    Require all granted
+</Directory>
+
+# Port 80
+Include /path/to/port_80.conf
+
+# Port 443
+Include /path/to/port_443.conf
+```
+
+Disable the Debian default website and enable the new one created above:
+
+```
+a2dissite 000-default
+a2ensite 00_main
+```
+
+...or just manually change symlinks in `/etc/apache2/sites-enabled/` as desired.
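+
+Whichever way the sites are switched, it may be worth validating the configuration before restarting the daemon; note the two `Include`d port templates below must already exist for the syntax check to pass:
+
+```
+apache2ctl configtest
+systemctl restart apache2
+```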
+
+
+### Apache 80 Template
+
+Included above as `/path/to/port_80.conf`
+
+```
+<VirtualHost *:80>
+    ServerName example.com
+    ServerAlias www.example.com
+    ServerAdmin root@example.com
+    ErrorLog /var/log/apache2/example-error.log
+    CustomLog /var/log/apache2/example-access.log combined
+
+    DocumentRoot /path/to/www/html
+    <Directory /path/to/www/html>
+        Options FollowSymLinks
+        AllowOverride All
+        Require all granted
+    </Directory>
+</VirtualHost>
+```
+
+### Apache 443 Template
+
+Included above as `/path/to/port_443.conf`
+
+```
+<VirtualHost *:443>
+    ServerName example.com
+    ServerAlias www.example.com
+    ServerAdmin root@example.com
+    ErrorLog /var/log/apache2/example-error.log
+    CustomLog /var/log/apache2/example-access.log combined
+
+    SSLEngine on
+    SSLHonorCipherOrder on
+    SSLProtocol all -SSLv2 -SSLv3 -TLSv1 -TLSv1.1
+    SSLCipherSuite ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256
+    SSLCompression off
+    SSLSessionTickets off
+
+    SSLCertificateFile /path/to/sslkeys/2020-example.crt
+    SSLCertificateKeyFile /path/to/sslkeys/2020-example.key
+    SSLCACertificateFile /path/to/sslkeys/2020-ssl-issuer-CA.pem
+
+    Header always set Strict-Transport-Security "max-age=15768000"
+
+    <FilesMatch "\.(cgi|shtml|phtml|php)$">
+        SSLOptions +StdEnvVars
+    </FilesMatch>
+
+    SetEnvIf User-Agent ".*MSIE.*" \
+        nokeepalive ssl-unclean-shutdown \
+        downgrade-1.0 force-response-1.0
+
+    DocumentRoot /path/to/www/html
+    <Directory /path/to/www/html>
+        Options FollowSymLinks
+        AllowOverride All
+        Require all granted
+    </Directory>
+</VirtualHost>
+```
+
+Note the above 443 template does not enable HSTS on all subdomains by design, add as required.
diff --git a/md/debian_tor_relay.md b/md/debian_tor_relay.md
new file mode 100644
index 0000000..bef0e31
--- /dev/null
+++ b/md/debian_tor_relay.md
@@ -0,0 +1,195 @@
+# Debian Tor Relay
+
+## Contents
+
+ - [Server Installation](#server-installation)
+ - [Server Hardening](#server-hardening)
+ - [Tor Installation](#tor-installation)
+ - [Tor Setup](#tor-setup)
+ - [Tor Backup](#tor-backup)
+ - [Final Checks](#final-checks)
+ - [References](#references)
+
+
+## Server Installation
+
+Spin up a basic **Debian 8 (Jessie) 64bit** cloud instance; inexpensive cloud instances from Digital Ocean are perfect for this type of project. Only basic networking with minimal disk and memory is required; these pre-prepared cloud installations of Debian 8 are ready to go with only a minor bit of work.
+
+> **Be Careful of Costs** - Cloud providers typically charge for time spent running (uptime) _plus_ bandwidth charges. Research costs carefully and ensure the _RelayBandwidthRate_ is configured to meet your budget. Shop around cloud providers to get the best bang for your buck - low uptime and low bandwidth charges are the key factors for a tor node.
+
+The below instructions have been tested on a Digital Ocean standard Debian 8 instance.
+
+
+## Server Hardening
+
+**1.** Install a few basic packages to make life a little nicer; typically the cloud instances are stripped down and need a few things added, both for security and ease of use.
Adjust as needed, at a minimum ensure the below are in place: + +``` +apt-get update +apt-get install sysstat unattended-upgrades iptables-persistent fail2ban chrony vim-nox iftop sudo -y +``` + +**2.** Enable _sysstat_ for ongoing statistics capture of your instance (use `sar` to view): + +``` +sed -i.bak -e 's|^ENABLED=".*"|ENABLED="true"|g' /etc/default/sysstat +``` + +**3.** Enable _unattended-upgrades_ to ensure that all Security updates are applied: + +``` +cat << 'EOF' > /etc/apt/apt.conf.d/02periodic +APT::Periodic::Enable "1"; +APT::Periodic::Update-Package-Lists "1"; +APT::Periodic::Download-Upgradeable-Packages "1"; +APT::Periodic::AutocleanInterval "5"; +APT::Periodic::Unattended-Upgrade "1"; +EOF +``` + +**4.** Enable the basic _iptables_ rules to allow only ports 22, 80 and 443: + +``` +cat << 'EOF' > /etc/iptables/rules.v4 +*filter +:INPUT ACCEPT [0:0] +:FORWARD ACCEPT [0:0] +:OUTPUT ACCEPT [0:0] +-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT +-A INPUT -p icmp -j ACCEPT +-A INPUT -i lo -j ACCEPT +-A INPUT -p tcp -m state --state NEW -m tcp --dport 443 -j ACCEPT +-A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT +-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT +-A INPUT -j REJECT --reject-with icmp-host-prohibited +-A FORWARD -j REJECT --reject-with icmp-host-prohibited +COMMIT +EOF + +cat << 'EOF' > /etc/iptables/rules.v6 +*filter +:INPUT ACCEPT [0:0] +:FORWARD ACCEPT [0:0] +:OUTPUT ACCEPT [0:0] +-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT +-A INPUT -p ipv6-icmp -j ACCEPT +-A INPUT -i lo -j ACCEPT +-A INPUT -p tcp -m state --state NEW -m tcp --dport 443 -j ACCEPT +-A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT +-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT +-A INPUT -j REJECT --reject-with icmp6-adm-prohibited +-A FORWARD -j REJECT --reject-with icmp6-adm-prohibited +COMMIT +EOF +``` + +**5.** Configure fail2ban to keep an eye on the SSH port for brute force attacks: + +``` +cat << 'EOF' > /etc/fail2ban/jail.local +[DEFAULT] +ignoreip = 127.0.0.1/8 +bantime = 600 +maxretry = 3 +backend = auto +destemail = root@localhost + +[ssh] +enabled = true +port = ssh +filter = sshd +logpath = /var/log/auth.log +maxretry = 6 +EOF +``` + +**6.** Finally, ensure all the services are enabled and apply all outstanding updates; reboot as needed for a new kernel. If you don't reboot here, you'll need to `service` _foo_ `restart` each one individually: + +``` +systemctl disable remote-fs.target +systemctl enable sysstat unattended-upgrades iptables-persistent fail2ban chrony + +apt-get upgrade -y + +reboot +``` + + +## Tor Installation + +Add the upstream repository to the server, install the GPG key and tor itself. The `tor-arm` package provides an interesting console interface for the daemon. (run `arm` later on to see it) + +``` +echo "deb http://deb.torproject.org/torproject.org jessie main" > \ + /etc/apt/sources.list.d/tor.list + +gpg --keyserver keys.gnupg.net --recv 886DDD89 +gpg --export A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89 | sudo apt-key add - + +apt-get update +apt-get install deb.torproject.org-keyring -y +apt-get install tor tor-arm -y + +systemctl stop tor +``` + + +## Tor Setup + +Edit the `/etc/tor/torrc` configuration to set up the basic parameters; this config file's comments are parsed by the `arm` utility, so don't be tempted to just replace it with the below - hand edit is recommended to preserve the comments. 
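+
+As a quick sanity check of the rule of thumb quoted below: 1 TB spread evenly across a 30 day month is 10^12 bytes / (30 × 86400 s) ≈ 385 KB/s, so roughly 400 KB/s of sustained transfer consumes about 1 TB/month - the 300 KB/s rate configured below stays safely under that budget.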
+
+> **1 TB/month** is roughly **400 KB/s** sustained bandwidth
+
+We will configure bandwidth to 300 KB/s normal and 350 KB/s burst to keep our cloud bandwidth charges in check, and use ports 443 and 80 for maximum compatibility for persons in locations with strict ACLs on their network traffic. Choose **Nickname** wisely, it's how others will refer to your node in public. Be careful with ContactInfo and protect yourself from spammers!
+
+```
+# egrep -v "^(#|$)" /etc/tor/torrc
+RunAsDaemon 1
+ORPort 443
+Address 
+Nickname 
+RelayBandwidthRate 300 KB
+RelayBandwidthBurst 350 KB
+ContactInfo 
+DirPort 80
+DirPortFrontPage /etc/tor/index.html
+ExitPolicy reject *:*
+```
+
+Copy over the HTML man page to display on port 80 (see _DirPortFrontPage_ above), ensure it's set to start on reboot and get it running:
+
+```
+cp /usr/share/doc/tor/tor.html /etc/tor/index.html
+systemctl enable tor
+systemctl restart tor
+```
+
+
+## Tor Backup
+
+Preserve a copy of your Tor node information; this is needed if you have to rebuild or move the node and want to retain the same history in the community:
+
+```
+cp /var/lib/tor/fingerprint /root/tor.fingerprint
+cp /var/lib/tor/keys/secret_id_key /root/tor.secret_id_key
+```
+
+Download those two files from the cloud instance and put them in a safe place in your normal backups. The first has one line (nickname and 40-hex char ID), the second is an RSA key.
+
+
+## Final Checks
+
+Wait an hour or two, then use one (or both) of the below links to search for your relay's nickname:
+
+ - 
+ - 
+
+Once it's showing up as expected and you're happy with the results, submit your relay to the EFF Tor Challenge and sign up via Tor Weather to keep an eye on it:
+
+ - 
+ - 
+
+
+## References
+
+ - 
diff --git a/md/device_mapper_mechanics.md b/md/device_mapper_mechanics.md
new file mode 100644
index 0000000..2a79485
--- /dev/null
+++ b/md/device_mapper_mechanics.md
@@ -0,0 +1,337 @@
+# Device Mapper Mechanics
+
+## Contents
+
+ - [Overview](#overview)
+ - [Basic Usage](#basic-usage)
+ - [Linear Target](#linear-target)
+ - [Striped Target](#striped-target)
+ - [Mirror Target](#mirror-target)
+ - [References](#references)
+
+
+## Overview
+
+Concisely, [device mapper](https://en.wikipedia.org/wiki/Device_mapper) is a method for the Linux kernel to map physical block devices into logical ("virtual") devices for further use; it is most notably implemented via [LVM2](lvm_mechanics.md), [Multipath](device_mapper_multipath.md) and [dm-crypt](https://en.wikipedia.org/wiki/Dm-crypt) (LUKS), all in widespread use. Each implementation is known as a _target_ to the device-mapper subsystem within the kernel.
+
+The fundamental tool used is named `dmsetup` and provided by the `device-mapper` package on most (if not all) Linux distributions. In a sense, it's analogous to common [partitioning](linux_partitioning.md) - using it requires a start, end and size component of a device. By default, when creating a logical device with `dmsetup` the human-readable name ends up symlinked in `/dev/mapper/` as evidenced with higher level tools like `lvcreate`.
+
+> This article presents low-level examples for building targets; when a higher level subsystem such as _dm-raid_, _dm-crypt_, _LVM_ and _dm-multipath_ exists it should always be used. The higher level subsystems provide a wealth of additional features required for a production level situation.
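+
+To get a quick feel for what device-mapper is already managing on an existing host (any LVM volumes, LUKS containers or multipaths will show up), the live maps can be inspected non-destructively:
+
+```
+dmsetup ls --tree    # logical devices and the physical block devices beneath them
+dmsetup table        # the active mapping table for every target
+dmsetup info         # state, open count and UUID for each map
+```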
+
+A secondary tool named `dmstats` is also delivered which allows for collecting statistics about the underlying _regions_ and _areas_ of the device map (similar to what one might see from `sar` and the `sysstat` package). This can be useful for examining the performance characteristics of combined physical block devices and looking for deltas.
+
+
+## Basic Usage
+
+In this example we have a single 75G physical block device named `/dev/xvdb` (a Xen virtual disk). Using `dmsetup` it will be split into two virtual devices without using a partition table (aka a "raw" device). Each virtual device is then formatted and mounted as normal with any block device. Key to all the work is understanding the [table mapping format](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Logical_Volume_Manager_Administration/device_mapper.html#dm-mappings) as listed in the [man page](http://linux.die.net/man/8/dmsetup), as it can be a bit confusing at first.
+
+First, we need to determine what our starting and ending sectors on the disk will be for use; for this [a small one-liner shell script](https://gitlab.com/snippets/1720498) can do all the math for us. Note that we ensure a 2048s offset into the device for the standard performance alignment across all tiers - this is unique per situation, used here as a Best Practice.
+
+```
+DISK="/dev/xvdb"; OFFSET=2048
+parted ${DISK} unit s print 2>/dev/null | \
+    grep "^Disk ${DISK}" | \
+    awk -v OFF=${OFFSET} '{gsub(/s$/,"",$3); \
+    printf "STA1=%s\nEND1=%s\nLEN1=%s\nSTA2=%s\nEND2=%s\nLEN2=%s\n",
+    OFF,(($3/2)-OFF),((($3/2)-OFF)-OFF),
+    ((($3/2)-OFF)+1),$3,($3-((($3/2)-OFF)+1))
+    }'
+
+STA1=2048
+END1=78641152
+LEN1=78639104
+STA2=78641153
+END2=157286400
+LEN2=78645247
+```
+
+Given the start, end and size of each part of the disk we can use `dmsetup` to build the virtual maps; exactly as may be familiar from previous LVM, LUKS and Multipath work, the _real_ devices are `/dev/dm-?` and the kernel uses symlinks to inform the user of which map is which logical name for further use. We will also create a statistics area to see how that looks.
+
+```
+// Table format for linear target used below:
+// <logical start sector> <num sectors> linear <destination device> <start sector>
+
+# dmsetup create xyzzy1 --table "0 78639104 linear /dev/xvdb 2048"
+# dmsetup create xyzzy2 --table "0 78645247 linear /dev/xvdb 78641153"
+
+# ls -og /dev/mapper/xyzzy*
+lrwxrwxrwx. 1 7 Jan 10 18:32 /dev/mapper/xyzzy1 -> ../dm-0
+lrwxrwxrwx. 1 7 Jan 10 18:32 /dev/mapper/xyzzy2 -> ../dm-1
+
+# dmsetup table
+xyzzy1: 0 78639104 linear 202:16 2048
+xyzzy2: 0 78645247 linear 202:16 78641153
+
+# dmstats create /dev/mapper/xyzzy1
+xyzzy1: Created new region with 1 area(s) as region ID 0
+# dmstats create /dev/mapper/xyzzy2
+xyzzy2: Created new region with 1 area(s) as region ID 0
+
+# dmstats list
+Name   RgID RgSta RgSize #Areas ArSize ProgID
+xyzzy1 0    0     37.50g 1      37.50g dmstats
+xyzzy2 0    0     37.50g 1      37.50g dmstats
+
+# dmstats report
+Name   RgID ArID ArStart ArSize RMrg/s WMrg/s R/s  W/s  RSz/s WSz/s AvgRqSz QSize Util% AWait RdAWait WrAWait
+xyzzy1 0    0    0       37.50g 0.00   0.00   0.00 0.00 0     0     0       0.00  0.00  0.00  0.00    0.00
+xyzzy2 0    0    0       37.50g 0.00   0.00   0.00 0.00 0     0     0       0.00  0.00  0.00  0.00    0.00
+```
+
+Now it's just the same work as usual using these two new names; a simple `dd` is used below for testing.
+
+```
+# dd if=/dev/zero of=/dev/mapper/xyzzy1 bs=1024 count=100
+100+0 records in
+100+0 records out
+102400 bytes (102 kB) copied, 0.0371653 s, 2.8 MB/s
+
+# dd if=/dev/zero of=/dev/mapper/xyzzy2 bs=1024 count=100
+100+0 records in
+100+0 records out
+102400 bytes (102 kB) copied, 0.0024724 s, 41.4 MB/s
+
+# dmstats report
+Name   RgID ArID ArStart ArSize RMrg/s WMrg/s R/s   W/s    RSz/s   WSz/s   AvgRqSz QSize Util% AWait RdAWait WrAWait
+xyzzy1 0    0    0       37.50g 0.00   0.00   88.00 25.00  556.00k 100.00k 5.50k   0.18  12.30 1.58  1.47    2.00
+xyzzy2 0    0    0       37.50g 0.00   0.00   63.00 200.00 455.50k 100.00k 2.00k   0.51  10.60 1.94  1.75    2.00
+
+# dmstats delete /dev/mapper/xyzzy2 --allregions
+# dmstats delete /dev/mapper/xyzzy1 --allregions
+# dmsetup remove /dev/mapper/xyzzy2
+# dmsetup remove /dev/mapper/xyzzy1
+```
+
+
+## Linear Target
+
+The [linear target](https://www.kernel.org/doc/Documentation/device-mapper/linear.txt) is the most basic as shown above; however in a slightly more complex example we can build our own LVM-like single filesystem that spans two physical block devices. The LVM subsystem at its core uses this linear methodology by default, however it contains many additional features (mapping UUIDs, maintaining block device lists and the tables, checksumming, management, etc.) which make it desirable in daily use.
+
+Two 75G block devices are presented to the host; to add a bit more complication to exemplify the math, each block device has an empty GPT partition table to simulate not being able to use the end of the disk (the [backup GPT partition table](linux_x86_storage.md) is kept in the last 34s).
+
+```
+# parted /dev/xvdb mktable gpt
+# parted /dev/xvdc mktable gpt
+```
+
+Next, we need to get the last _usable_ sector of the disk and subtract our performance-oriented beginning 2048s offset from it to get the size of the fully usable disk area; for this the `sgdisk` utility (part of the `gdisk` package) is preferred:
+
+```
+# sgdisk -p /dev/xvdb | grep "last usable sector" | awk '{print $NF-2048}'
+157284318
+
+# sgdisk -p /dev/xvdc | grep "last usable sector" | awk '{print $NF-2048}'
+157284318
+```
+
+Because this requires two lines to feed `dmsetup` (one line for each disk), we create the mapping in a text file:
+
+```
+// Table format used below is the same as the Basic example, but notice that the
+// virtual start of the second disk is the same as the ending of the first -
+// remember, 0 offset not 1
+
+# cat linear.table
+0 157284318 linear /dev/xvdb 2048
+157284318 157284318 linear /dev/xvdc 2048
+```
+
+Now it's just a matter of creating the map using the table and testing it out by making a filesystem and writing a file larger than any one single physical device (below, 120G is used):
+
+```
+# dmsetup create xyzzy linear.table
+
+# mkfs.ext4 -v /dev/mapper/xyzzy
+# mkdir /mnt/xyzzy
+# mount /dev/mapper/xyzzy /mnt/xyzzy/
+
+# dd if=/dev/zero of=/mnt/xyzzy/testfile bs=512M count=240
+240+0 records in
+240+0 records out
+128849018880 bytes (129 GB) copied, 367.6 s, 351 MB/s
+
+# umount /mnt/xyzzy
+# dmsetup remove xyzzy
+# dmsetup create foobar linear.table
+# mount /dev/mapper/foobar /mnt/xyzzy/
+# ls -og /mnt/xyzzy/
+total 125829148
+drwx------. 2 16384 Jan 10 21:16 lost+found
+-rw-r--r--. 1 128849018880 Jan 10 21:25 testfile
+```
+
+Notice that the maps are disassembled and recreated as part of the testing to simulate what will happen when the server is rebooted - **device maps are in memory only** so in real use startup/shutdown scripts would be required to implement the above correctly.
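+
+As a rough illustration of what such a script could look like, a oneshot `systemd` unit along these lines would reassemble the map early at boot (a minimal sketch only - the unit name, map name and table location `/etc/dm-tables/linear.table` are hypothetical and must be adapted):
+
+```
+# /etc/systemd/system/dm-xyzzy.service (hypothetical example)
+[Unit]
+Description=Assemble the xyzzy device-mapper table
+DefaultDependencies=no
+Before=local-fs.target
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+# same invocation style used throughout this article: dmsetup create <name> <table file>
+ExecStart=/sbin/dmsetup create xyzzy /etc/dm-tables/linear.table
+ExecStop=/sbin/dmsetup remove xyzzy
+
+[Install]
+WantedBy=local-fs.target
+```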
We also tested giving the device map a different name the second time around.
+
+
+## Striped Target
+
+The [striped target](https://www.kernel.org/doc/Documentation/device-mapper/striped.txt) is the basis of software RAID0 and can be used with LVM. Using the same techniques as the linear target, we'll build a simple striped target of our two physical block devices with the intent of increasing performance (so we'll add `dmstats`).
+
+First, we have to do a bit of math; when using striping, each group of data is written in a _chunk_ whose size is a power of 2 and typically optimized for the data. A chunk of 256k is very common for physical RAID controllers; here each disk gets a 128k (256 sector) chunk, so a full stripe across both disks is 256k (512 sectors). Given that, 512 sectors is our divisor and we must ensure that **an entire stripe can be written**.
+
+In order to determine the largest size we can make the striped target, we take the usable size of the disk (in sectors), divide it by 512 and then get the floor() of that value re-multiplied times 512 (in laypersons' terms, divide the size by 512, throw away the remainder and re-multiply by 512 to get the perfect multiple). For this we'll use a `bc` function:
+
+```
+# bc
+
+// add both usable sizes to get one large size for striping
+157284318*2
+314568636
+
+// now divide by 512, throw away the remainder, re-multiply by 512
+// (inside floor(), '.=' assigns to bc's 'last' variable, which keeps xx-- from printing)
+define floor(x) {
+  auto os,xx;os=scale;scale=0
+  xx=x/1;if(xx>x).=xx--
+  scale=os;return(xx)
+}
+floor(314568636/512)*512
+314568192
+```
+
+Armed with this perfect multiple of 512 sectors (ergo 256k), build a striped map. Create the device as before and this time we'll create 2 `dmstats` areas (one for each physical disk's sectors used) so that we can contrast/compare the performance of each one. Notice that because we have two identically sized devices the `dmstats --areas 2` usage perfectly splits it for us so we don't have to define each area by hand:
+
+```
+// Table format used:
+// <start> <length> striped <#stripes> <chunk size> <device 1> <offset 1> <device 2> <offset 2>
+
+# cat striped.table
+0 314568192 striped 2 256 /dev/xvdb 2048 /dev/xvdc 2048
+
+# dmsetup create xyzzy striped.table
+
+# dmstats create xyzzy --areas 2
+xyzzy: Created new region with 2 area(s) as region ID 0
+
+# dmstats list
+Name  RgID RgSta RgSize  #Areas ArSize ProgID
+xyzzy 0    0     150.00g 2      75.00g dmstats
+
+# dmstats report
+Name  RgID ArID ArStart ArSize RMrg/s WMrg/s R/s  W/s  RSz/s WSz/s AvgRqSz QSize Util% AWait RdAWait WrAWait
+xyzzy 0    0    0       75.00g 0.00   0.00   0.00 0.00 0     0     0       0.00  0.00  0.00  0.00    0.00
+xyzzy 0    1    75.00g  75.00g 0.00   0.00   0.00 0.00 0     0     0       0.00  0.00  0.00  0.00    0.00
+```
+
+Now that we have the statistics gathering readied, create the filesystem and write data as per our normal testing plan:
+
+```
+# mkfs.ext4 -v /dev/mapper/xyzzy
+# mount /dev/mapper/xyzzy /mnt/xyzzy/
+# dd if=/dev/zero of=/mnt/xyzzy/testfile bs=512M count=240
+240+0 records in
+240+0 records out
+128849018880 bytes (129 GB) copied, 224.258 s, 575 MB/s
+
+# dmstats report
+Name  RgID ArID ArStart ArSize RMrg/s WMrg/s R/s    W/s        RSz/s   WSz/s  AvgRqSz QSize    Util%  AWait RdAWait WrAWait
+xyzzy 0    0    0       75.00g 0.00   0.00   185.00 1786193.00 1.13m   71.35g 41.50k  23550.26 100.00 13.18 30.79   13.18
+xyzzy 0    1    75.00g  75.00g 0.00   0.00   61.00  1257156.00 244.00k 51.14g 42.50k  17752.53 100.00 14.12 35.92   14.12
+
+# umount /mnt/xyzzy
+# dmsetup remove xyzzy
+# dmsetup create xyzzy striped.table
+# mount /dev/mapper/xyzzy /mnt/xyzzy
+# ls -og /mnt/xyzzy
+total 125829148
+drwx------. 2 16384 Jan 10 22:28 lost+found
Based on our data above, it appears that we're getting better performance from `/dev/xvdb` than from `/dev/xvdc`; this is a public cloud instance, so it's expected that our two data block volumes are served from different back-end cloud hosts via iSCSI. This exemplifies the risk of combining such objects under LVM in the cloud: performance characteristics will vary from one data block device to another in this environment. What we did see, though, was raw write performance go up roughly 1.6x over linear (575 vs. 351 MB/s).


## Mirror Target

The mirror target is arguably the most difficult to construct; in essence it's a RAID1 (and, again, it can also be used by LVM), but it requires a log device - similar to a classic journal in a filesystem, if you will: a space to record metadata about writes. In order to construct this example, we're going to create disk partitions and use a technique generally known as "mirroring the mirror".

First we'll prep 2 partitions on each physical block device: one to store the log data on disk - so that when rebooting/recreating, the mirror has its data on each _leg_ and no bootstrap is required again - and one to store data. They will be exactly the same on both disks, as this is a mirror configuration.

```
# sgdisk -Z /dev/xvdb
# parted /dev/xvdb mktable gpt
# parted /dev/xvdb mkpart primary ext3 2048s 18432s
# parted /dev/xvdb mkpart primary ext3 20480s 100%

// Note that we chose some arbitrarily sized numbers; an 8M log partition is plenty
# sgdisk -Z /dev/xvdc
# parted /dev/xvdc mktable gpt
# parted /dev/xvdc mkpart primary ext3 2048s 18432s
# parted /dev/xvdc mkpart primary ext3 20480s 100%

# parted /dev/xvdb unit s print
[...]
Number Start End Size File system Name Flags
 1 2048s 18432s 16385s primary
 2 20480s 157284351s 157263872s primary
```

Now we need to build the virtual maps for these devices; in a sense it's like connecting two linear targets together, but with some additional parameters as outlined in the `dmsetup` man page. The _core_ log type is kept in memory only, while the _disk_ log type pushes the log regions out to a block device - which is why the data map below logs to our (itself mirrored) log device. Notice that since partitions are already defined, the offset is 0 (beginning of partition) for each virtual map.
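As a sanity check, the length used for the data map below comes straight from the partition print above - sectors 20480 through 157284351 inclusive:

```
# echo $((157284351 - 20480 + 1))
157263872
```

That value becomes the size field of `mirror-data.table`.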
```
// Table format used:
// <start> <size> mirror core <#log args> <region size>
//     <#mirrors> <dev1> <offset1> <dev2> <offset2> <#features> <feature...>

# cat mirror-log.table
0 8192 mirror core 1 1024 2 /dev/xvdb1 0 /dev/xvdc1 0 1 handle_errors

// Table format used:
// <start> <size> mirror disk <#log args> <log device> <region size>
//     <#mirrors> <dev1> <offset1> <dev2> <offset2> <#features> <feature...>

# cat mirror-data.table
0 157263872 mirror disk 2 /dev/mapper/xyzzy-log 1024 2 /dev/xvdb2 0 /dev/xvdc2 0 1 handle_errors
```

Now we create the device maps using these tables, ensuring the _log_ is created first:

```
# dmsetup create xyzzy-log mirror-log.table
# dmsetup create xyzzy-data mirror-data.table

# mkfs.ext4 -v /dev/mapper/xyzzy-data
# mount /dev/mapper/xyzzy-data /mnt/xyzzy/
# dmstats create xyzzy-log
# dmstats create xyzzy-data
# dd if=/dev/zero of=/mnt/xyzzy/testfile bs=512M count=120
120+0 records in
120+0 records out
64424509440 bytes (64 GB) copied, 235.429 s, 274 MB/s

# dmstats report
Name RgID ArID ArStart ArSize RMrg/s WMrg/s R/s W/s RSz/s WSz/s AvgRqSz QSize Util% AWait RdAWait WrAWait
xyzzy-log 0 0 0 4.00m 0.00 0.00 0.00 35357.00 0 690.57m 20.00k 42.16 100.00 1.19 0.00 1.19
xyzzy-data 0 0 0 74.99g 0.00 0.00 47.00 1486089.00 188.00k 60.23g 42.00k 163854.02 100.00 110.26 66.28 110.26

# umount /mnt/xyzzy
# dmsetup remove xyzzy-data
# dmsetup remove xyzzy-log
# dmsetup create xyzzy-log mirror-log.table
# dmsetup create xyzzy-data mirror-data.table
# mount /dev/mapper/xyzzy-data /mnt/xyzzy/
# ls -og /mnt/xyzzy/
total 62914584
drwx------. 2 16384 Jan 10 23:39 lost+found
-rw-r--r--. 1 64424509440 Jan 10 23:45 testfile

# umount /mnt/xyzzy/
# dmsetup remove xyzzy-data
# dmsetup remove xyzzy-log

# mount /dev/xvdb2 /mnt/xyzzy/
# ls -og /mnt/xyzzy/
total 62914584
drwx------. 2 16384 Jan 10 23:39 lost+found
-rw-r--r--. 1 64424509440 Jan 10 23:45 testfile

# umount /mnt/xyzzy/
# mount /dev/xvdc2 /mnt/xyzzy/
# ls -og /mnt/xyzzy/
total 62914584
drwx------. 2 16384 Jan 10 23:39 lost+found
-rw-r--r--. 1 64424509440 Jan 10 23:45 testfile
```

Notice that we also test destroying the device maps and manually mounting and verifying each raw block device, as might be needed in a real situation where one member of the mirror goes offline.


## References

 - 
 - 
 - 

diff --git a/md/device_mapper_multipath.md b/md/device_mapper_multipath.md
new file mode 100644
index 0000000..5faadd9
--- /dev/null
+++ b/md/device_mapper_multipath.md
@@ -0,0 +1,363 @@

# Device Mapper Multipath

## Contents

 - [Overview](#overview)
 - [Initial Setup](#initial-setup)
 - [DAS Config](#das-config)
 - [NAS iSCSI Config](#nas-iscsi-config)
 - [Multipath Names](#multipath-names)
 - [Administrating Multipaths](#administrating-multipaths)
 - [Partitioning Multipaths](#partitioning-multipaths)
 - [Clustered Multipaths](#clustered-multipaths)
 - [Renaming Multipaths](#renaming-multipaths)
 - [Multipath Ownership](#multipath-ownership)
 - [RHEL5 / CentOS5](#rhel5--centos5)
 - [RHEL6 / CentOS6](#rhel6--centos6)
 - [References](#references)


## Overview

The `device-mapper-multipath` (a sub-component of `device-mapper`) subsystem is the native way of configuring 2 or more individual paths to the same storage LUN, typically used in an HA (failover) capacity. If one underlying path fails, the system transfers I/O to another path; higher-level consumers (such as LVM) use the single multipath pseudo device and are abstracted from the underlying physical links.


## Initial Setup

A standard setup requires 2 RPMs, which provide the `multipathd` service and the udev rules for naming the multipaths:

1. device-mapper
2. 
device-mapper-multipath + +For a Dell DAS such as the MD32xx 2 more packages are required, typically from the vendor install media: + +1. dkms (Dynamic Kernel Module Support - framework required for the below RPM) +2. scsi\_dh\_rdac (Dell custom version, the [kernel also contains one](https://github.com/torvalds/linux/blob/master/drivers/scsi/device_handler/scsi_dh_rdac.c)) + +The `multipathd` service is what pulls it all together. + + +### DAS Config + +A well formed Dell MD32xx DAS deployed config might look like: + +``` +# DAS /etc/multipath.conf + +blacklist { + device { + vendor "*" + product "Universal Xport" + } + device { + vendor "*" + product "MD3000" + } + device { + vendor "*" + product "MD3000i" + } + device { + vendor "*" + product "Virtual Disk" + } + device { + vendor "*" + product "PERC|Perc" + } +} +defaults { + user_friendly_names yes + max_fds 8192 + polling_interval 5 +} +devices { + device { + vendor "DELL" + product "MD32xxi" + path_grouping_policy group_by_prio + prio rdac + path_checker rdac + path_selector "round-robin 0" + hardware_handler "1 rdac" + failback immediate + features "2 pg_init_retries 50" + no_path_retry 30 + rr_min_io 100 + } + device { + vendor "DELL" + product "MD32xx" + path_grouping_policy group_by_prio + prio rdac + path_checker rdac + path_selector "round-robin 0" + hardware_handler "1 rdac" + failback immediate + features "2 pg_init_retries 50" + no_path_retry 30 + rr_min_io 100 + } + device { + vendor "DELL" + product "MD36xxi" + path_grouping_policy group_by_prio + prio rdac + path_checker rdac + path_selector "round-robin 0" + hardware_handler "1 rdac" + failback immediate + features "2 pg_init_retries 50" + no_path_retry 30 + rr_min_io 100 + } + device { + vendor "DELL" + product "MD36xxf" + path_grouping_policy group_by_prio + prio rdac + path_checker rdac + path_selector "round-robin 0" + hardware_handler "1 rdac" + failback immediate + features "2 pg_init_retries 50" + no_path_retry 30 + rr_min_io 100 + } +} +``` + + +### NAS iSCSI Config + +An example config for a Netapp NAS iSCSI might look like: + +``` +# NAS iSCSI /etc/multipath.conf + +blacklist { + device { + vendor "*" + product "PERC|Perc" + } + device { + vendor "*" + product "Universal Xport" + } + device { + vendor "*" + product "Virtual Disk" + } +} + +defaults { + user_friendly_names yes + max_fds max + queue_without_daemon no +} + +devices { + device { + vendor "NETAPP" + product "LUN" + getuid_callout "/sbin/scsi_id -g -u -s /block/%n" + # + # RHEL5 style + prio_callout "/sbin/mpath_prio_ontap /dev/%n" + # RHEL6 style + # prio ontap + # + features "1 queue_if_no_path" + hardware_handler "0" + path_grouping_policy group_by_prio + failback immediate + rr_weight uniform + rr_min_io 128 + path_checker directio + flush_on_last_del yes + } +} +``` + + +## Multipath Names + +By default in RHEL/CentOS, the names of the multipath will be in `/dev/mapper/` and begin with "mpath" and be followed by a number (v5) or a letter (v6). A partition within that path will then have "p" followed by it's number. These are controlled by `udev` and a config file installed by the `device-mapper-multipath` RPM; for example on RHEL6/CentOS6 it's named `/lib/udev/rules.d/40-multipath.rules`. + +Examples: + +``` +/dev/mapper/mpath1p2 - 2nd partition on path #1 (1) (v5) +/dev/mapper/mpathbp1 - 1st partition on path #2 (b) (v6) +``` + +These are a human-friendly format of the WWID triggered by the setting `user_friendly_names yes` in the config file. 
These can be changed to suit your needs - it's easy, and can save a lot of confusion later if a dozen LUNs are used as RAW devices (such as in an Oracle RAC).


## Administrating Multipaths

The main tool for administering multipaths is called `multipath` and is normally found in /sbin/ (root only). The primary day-to-day use will be the -l or -ll flags, to simply list multipaths and their associated 'real' SCSI devices (paths). Using this tool you can examine the health of the (multi)paths and all associated information.

Example:

```
## DAS multipath
# multipath -l
[...]
VOTING5 (3690b11c0001b99ba0000098f5192345e) dm-5 DELL,MD32xx
size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:0:4 sdw 65:96 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:0:4 sdf 8:80 active undef running
```

The system knows which SCSI devices belong together by the WWID (aka WWN, UUID) presented from the storage host - if the WWIDs match, the devices belong together in a multipath. For the example LUN above, using the -v3 flag will show that they match:

```
## DAS WWIDs (WWN/UUID)
# multipath -v3
[...]
uuid hcil dev dev_t pri dm_st chk_st vend/p
3690b11c0001b99ba0000098f5192345e 2:0:0:4 sdw 65:96 14 undef ready DELL,M
3690b11c0001b99ba0000098f5192345e 1:0:0:4 sdf 8:80 9 undef ready DELL,M
```

The WWID 3690b11c0001b99ba0000098f5192345e matches on both SCSI devices, so the multipath daemon knows they belong together and creates a pseudo device for us to work with. If one underlying path (device) fails, I/O rolls over to the other one without any manual intervention. Magic.

There are other uses of the multipath tool, such as the -f/-F flags (flush paths) and -p (change policies) -- be careful using these on a live server. Check the man page for detailed information, and know there is a -d (dry run) option to test things before committing. Depending on what you're doing (such as renaming - see below), it's sometimes easier to restart the multipathd daemon instead.


### Partitioning Multipaths

The tool `kpartx` is what an administrator will use to have the kernel re-examine newly partitioned multipaths and create new device entries for us; it's the equivalent of using `partx` on normal devices.

```
## Normal SCSI device

# parted /dev/sdb (create new partition 1)
# partx -a /dev/sdb
# ls -1 /dev/sdb*
/dev/sdb
/dev/sdb1

## Multipath device

# parted /dev/mapper/mpathb (create new partition 1)
# kpartx -a /dev/mapper/mpathb
# ls -1 /dev/mapper/mpathb*
/dev/mapper/mpathb
/dev/mapper/mpathbp1
```

The device /dev/mapper/mpathbp1 is now used just like /dev/sdb1 would be with any other tools (mkfs, pvcreate, vgextend, etc.) -- the multipath daemon takes care of routing the actual SCSI commands out to the active device (path) in the multipath to storage.


### Clustered Multipaths

When a host group of LUNs is presented to 2 or more servers, the WWIDs described above let you verify that the multipaths line up across the nodes. **The mapping of a WWID to a multipath on one node must match on all other nodes**, otherwise you're writing to different storage areas on different nodes. If your examination finds they do not match, you may need to rename them manually - see below.

> Always double-check that the WWID to multipath mappings match on all nodes in a cluster! This may not be quick, but it's extremely important that the time be spent doing this work. Never assume it's "just right" on a new build.
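One hedged way to do that comparison: run the same listing on every node and diff the results. With the `multipath -ll` output format shown above, the header line of each map carries the alias and the WWID:

```
## run on each node, then compare the files from node to node
# multipath -ll | awk '$2 ~ /^\(/ {print $1, $2}' | sort > /tmp/mpaths.$(hostname -s)
# diff /tmp/mpaths.node1 /tmp/mpaths.node2
```

(The /tmp/mpaths.* file names are just for this example.)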
### Renaming Multipaths

Renaming them is easy - add a new stanza to the bottom of multipath.conf that has a grouping, then rename each one. The setting `user_friendly_names yes` is required in multipath.conf for this to work as expected. For example, here is a rename of a shared Oracle RAC voting LUN from its generated name into something that makes sense for use inside Oracle as a RAW device:

```
multipaths {
    multipath {
        wwid 3690b11c0001b99ba0000098f5192345e
        alias VOTING5
    }
}
```

Restart the `multipathd` service and now the multipath is named like so:

```
# ls -1 /dev/mapper/VOTING5
/dev/mapper/VOTING5
```

The partitions within a renamed multipath follow the same convention, 'p' followed by a number. You would expect names like `/dev/mapper/VOTING5p1`, `/dev/mapper/VOTING5p2`, etc. if you partitioned this LUN for use as a normal filesystem.


## Multipath Ownership

One of the other common desires is to set the UID, GID and mode on the multipaths; alas, there's a different method for RHEL/CentOS v5 and v6.


### RHEL5 / CentOS5

This is done in the same block schema as the renaming, like so:

```
multipaths {
    multipath {
        wwid 3690b11c0001b99ba0000098f5192345e
        alias VOTING5
        uid 503
        gid 503
        mode 755
    }
}
```

Note that the system requires the numerical UID/GID and octal mode as shown above.


### RHEL6 / CentOS6

The above method was deprecated in RHEL6 in favor of udev rules - Red Hat's article on how to set it up is a wee bit lacking; use a ruleset like this instead of their official doc:

```
/etc/udev/rules.d/12-dm-permissions.rules

ENV{DM_NAME}=="VOTING5", OWNER:="oracle", GROUP:="oinstall", MODE:="660"
```

This builds on the multipath rename outlined above; to get the value of `DM_NAME` for the multipath you are targeting, use the `udevadm` tool to query the raw device-mapper node.

 - Get the raw node name with a simple ls:

```
# ls -l /dev/mapper/VOTING5
lrwxrwxrwx 1 root root 7 May 30 22:41 /dev/mapper/VOTING5 -> ../dm-5
```

 - Use that dm-?? number against the sysfs interface for it:

```
# udevadm info --query=all --path=/devices/virtual/block/dm-5/
P: /devices/virtual/block/dm-5
N: dm-5
S: mapper/VOTING5
S: disk/by-id/dm-name-VOTING5
S: disk/by-id/dm-uuid-mpath-3690b11c0001b99ba0000098f5192345e
S: block/253:5
E: UDEV_LOG=3
E: DEVPATH=/devices/virtual/block/dm-5
E: MAJOR=253
E: MINOR=5
E: DEVNAME=/dev/dm-5
E: DEVTYPE=disk
E: SUBSYSTEM=block
E: DM_SBIN_PATH=/sbin
E: DM_UDEV_PRIMARY_SOURCE_FLAG=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=VOTING5
E: DM_UUID=mpath-3690b11c0001b99ba0000098f5192345e
E: DM_SUSPENDED=0
E: MPATH_SBIN_PATH=/sbin
E: DEVLINKS=/dev/mapper/VOTING5 /dev/disk/by-id/dm-name-VOTING5 /dev/disk/by-id/dm-uuid-mpath-3690b11c0001b99ba0000098f5192345e /dev/block/253:5
```

Use any line item that begins with "E: " as the match clause in your udev rule; `DM_NAME` seems the most obvious, however your situation may require using one of the others.
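To apply a new or edited rule without a reboot, something like the following should work with RHEL6-era udev (dm-5 matches the example above); re-check the ownership afterwards:

```
# udevadm control --reload-rules
# udevadm trigger --sysname-match=dm-5
# ls -l /dev/dm-5
```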
## References

 - 
 - 
 - 
 - 

diff --git a/md/drbd_build_steps.md b/md/drbd_build_steps.md
new file mode 100644
index 0000000..9b0b989
--- /dev/null
+++ b/md/drbd_build_steps.md
@@ -0,0 +1,455 @@

# DRBD Build Steps

## Contents

 - [Overview](#overview)
 - [Conventions](#conventions)
 - [Group 1 - CentOS](#group-1---centos)
 - [Group 2 - Debian](#group-2---debian)
 - [Node Prep](#node-prep)
 - [Hosts file](#hosts-file)
 - [CentOS](#centos)
 - [Debian](#debian)
 - [IPTables](#iptables)
 - [CentOS](#centos-1)
 - [Debian](#debian-1)
 - [Software Installation](#software-installation)
 - [CentOS](#centos-2)
 - [Debian](#debian-2)
 - [Storage Prep](#storage-prep)
 - [DRBD Resource Prep](#drbd-resource-prep)
 - [Common Settings](#common-settings)
 - [Cloud Example](#cloud-example)
 - [Resource Settings](#resource-settings)
 - [CentOS](#centos-3)
 - [Debian](#debian-3)
 - [DRBD Resource Init](#drbd-resource-init)
 - [CentOS](#centos-4)
 - [Debian](#debian-4)
 - [Filesystem Build](#filesystem-build)
 - [Testing](#testing)
 - [CentOS](#centos-5)
 - [Debian](#debian-5)
 - [HA Failover](#ha-failover)
 - [References](#references)


## Overview

From the DRBD project's own description:

> "DRBD® refers to block devices designed as a building block to form high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as network based raid-1."

DRBD is the concept of taking two similar block storage devices and performing a network RAID-1 between them for HA redundancy. The block devices can be CBS (Cloud Block Storage), VMware vDisks, local in-chassis RAID arrays and so forth. The only requirements are that they be unique and have a (preferably) private network between them for the replication. As with a traditional RAID-1 array, only one of the block devices is usable (live) - DRBD performs a block-level replication between the two devices in one direction only. However, DRBD allows for making either one of the devices Master (live) and shifting back and forth dynamically with a few commands.

## Conventions

This article will use 2 groups of cloud servers as examples:

```
[root@drbd1 ~]# ip a | grep "inet " | grep 192
    inet 192.168.5.4/24 brd 192.168.5.255 scope global eth2
[root@drbd2 ~]# ip a | grep "inet " | grep 192
    inet 192.168.5.2/24 brd 192.168.5.255 scope global eth2

root@drbd3:~# ip a | grep "inet " | grep 192
    inet 192.168.5.1/24 brd 192.168.5.255 scope global eth2
root@drbd4:~# ip a | grep "inet " | grep 192
    inet 192.168.5.3/24 brd 192.168.5.255 scope global eth2
```

### Group 1 - CentOS

 - drbd1 and drbd2
 - CentOS 6.5
 - 20G /dev/xvde block devices

### Group 2 - Debian

 - drbd3 and drbd4
 - Debian 7 Stable
 - 20G /dev/xvde block devices


## Node Prep

A working DRBD setup requires at a minimum:

 - 2x servers with similar block devices
 - DRBD kernel module and userspace utilities
 - Private network between the servers
 - iptables port 7788 open between the servers on the private network
 - /etc/hosts configured
 - NTP synchronized

For future growth, LVM should be used underneath the DRBD implementation; the underlying PV/VG/LV can then be grown and the DRBD device ("resource") resized online with the [drbdadm resize resource](http://www.drbd.org/users-guide-8.3/s-resizing.html#s-growing-online) command.

> Timing is critical to proper operation - ensure NTP is configured properly.
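A quick check that both nodes are actually synchronized before building anything (Debian names the service `ntp` rather than `ntpd`); the selected peer carries a `*` tally mark in the `ntpq` listing:

```
# service ntpd status
# ntpq -pn
```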
+ +### Hosts file + +We need to ensure that the 2 servers can find each other on the private network as typical with any type of cluster build. When initializing the resource below, the drbdadm tool uses the hostname to match what's in the resource configuration file so it's important they align. In our examples the servers are in the domain .local as shown below, their FQDN hostnames are properly configured with drbdX.local as expected. + +#### CentOS + +``` +/etc/hosts + +192.168.5.4 drbd1.local +192.168.5.2 drbd2.local +``` + +#### Debian + +``` +/etc/hosts + +192.168.5.1 drbd3.local +192.168.5.3 drbd4.local +``` + +### IPTables + +We'll add a basic rule to allow all communication on the private 192.168.5.0/24 subnet between the nodes. This can be tuned to be more granular as required. + +#### CentOS + +``` +# vi /etc/sysconfig/iptables + +... +-A INPUT -s 192.168.5.0/24 -j ACCEPT +... + +# service iptables restart +``` + +#### Debian + +``` +# apt-get update; apt-get install iptables-persistent +# vi /etc/iptables/rules.v4 + +... +-A INPUT -s 192.168.5.0/24 -j ACCEPT +... + +# service iptables-persistent restart +# insserv iptables-persistent +``` + +### Software Installation + +#### CentOS + +CentOS requires the use of the RPM packages; this provides the DKMS-based kernel module and userspace toolset. + +``` +rpm -Uvh http://www.elrepo.org/elrepo-release-6-6.el6.elrepo.noarch.rpm +yum repolist +yum install drbd83-utils kmod-drbd83 dkms lvm2 ntp ntpdate +service ntpd restart && chkconfig ntpd on +reboot +``` + +Note that this installation will pull in kernel-devel and gcc (for DKMS) and a few device-mapper packages (for LVM2). + +#### Debian + +The Debian 7 kernel includes the drbd.ko module as a stock item; all that's needed is to install the userspace toolset on all nodes. + +``` +apt-get update +apt-get install --no-install-recommends drbd8-utils lvm2 ntp ntpdate +service ntp restart && insserv -v ntp +reboot +``` + +Note that without --no-install-recommends apt will install perl and other tools. + + +## Storage Prep + +Create a single volume group and logical volume from the storage on each node in the cluster, but do not create a filesystem - that comes later. + +``` +parted -s -- /dev/xvde mktable gpt +parted -s -- /dev/xvde mkpart primary ext3 2048s 100% +parted -s -- /dev/xvde set 1 lvm on + +pvcreate /dev/xvde1 +vgcreate vgdata00 /dev/xvde1 +lvcreate -l 100%VG -n drbd00 vgdata00 +``` + + +## DRBD Resource Prep + +### Common Settings + +The file `/etc/drbd.d/global_common.conf` exists on both nodes; as the default content will vary from release to release it's best to edit the file provided instead of creating a new one overtop – in general you most likely want to disable the usage-count for performance and set the syncer rate – changes made to this file and the default options should be researched to provide optimum settings for the platform DRBD is being deployed on. + +> TODO: provide common default configurations of the global settings for various scenarios + +#### Cloud Example + +``` +/etc/drbd.d/global_common.conf + +global { usage-count no; } +common { + syncer { rate 10M; } +} +``` + +### Resource Settings + +We create configuration files on both nodes that ties the two servers together with their new storage - note that the name of the file should be the name of the resource as a Best Practice. As we're building two different clusters on the same IP subnet, we'll be careful to name them uniquely to prevent any chance of collision at runtime. 
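The shared secret used in the resource files below is just a random 12-character string; one way to generate such a value, assuming the `pwgen` utility is available:

```
# pwgen -s 12 1
m9bTmbsK4quE
```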
#### CentOS

```
/etc/drbd.d/cent00.res

resource cent00 {
    protocol C;
    startup { wfc-timeout 0; degr-wfc-timeout 120; }
    disk { on-io-error detach; }
    net { cram-hmac-alg "sha1"; shared-secret "m9bTmbsK4quE"; }
    on drbd1.local {
        device /dev/drbd0;
        disk /dev/vgdata00/drbd00;
        meta-disk internal;
        address 192.168.5.4:7788;
    }
    on drbd2.local {
        device /dev/drbd0;
        disk /dev/vgdata00/drbd00;
        meta-disk internal;
        address 192.168.5.2:7788;
    }
}
```

#### Debian

```
/etc/drbd.d/deb00.res

resource deb00 {
    protocol C;
    startup { wfc-timeout 0; degr-wfc-timeout 120; }
    disk { on-io-error detach; }
    net { cram-hmac-alg "sha1"; shared-secret "m9bTmbsK4quE"; }
    on drbd3.local {
        device /dev/drbd0;
        disk /dev/vgdata00/drbd00;
        meta-disk internal;
        address 192.168.5.1:7788;
    }
    on drbd4.local {
        device /dev/drbd0;
        disk /dev/vgdata00/drbd00;
        meta-disk internal;
        address 192.168.5.3:7788;
    }
}
```


## DRBD Resource Init

On both nodes, the drbdadm tool is used to initialize the resource. After the initialization and service start, on one node only we start the synchronization process. We then track the progress of the init - in our example, we'll use drbd1 as the CentOS primary and drbd4 as the Debian primary to show how it works from either node.

### CentOS

Create the resource, start the service and start the sync:

```
[root@drbd1 ~]# drbdadm create-md cent00
[root@drbd2 ~]# drbdadm create-md cent00

[root@drbd1 ~]# service drbd start; chkconfig drbd on
[root@drbd2 ~]# service drbd start; chkconfig drbd on

[root@drbd1 ~]# drbdadm -- --overwrite-data-of-peer primary cent00
```

Check progress:

```
[root@drbd1 ~]# cat /proc/drbd
version: 8.3.16 (api:88/proto:86-97)
GIT-hash: a798fa7e274428a357657fb52f0ecf40192c1985 build by phil@Build64R6, 2013-09-27 16:00:43
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:1124352 nr:0 dw:0 dr:1125016 al:0 bm:68 lo:0 pe:1 ua:0 ap:0 ep:1 wo:f oos:19842524
        [>...................] sync'ed: 5.4% (19376/20472)M
        finish: 0:31:21 speed: 10,536 (10,312) K/sec

[root@drbd1 ~]# drbdadm -- status cent00
[ XML status output was not preserved in this copy; the attributes of interest are cs, ro1/ro2 and ds1/ds2 ]
```

### Debian

Create the resource, start the service and start the sync:

```
root@drbd3:~# drbdadm create-md deb00
root@drbd4:~# drbdadm create-md deb00

root@drbd3:~# service drbd start; insserv drbd
root@drbd4:~# service drbd start; insserv drbd

root@drbd4:~# drbdadm -- --overwrite-data-of-peer primary deb00
```

Check progress:

```
root@drbd4:~# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: F937DCB2E5D83C6CCE4A6C9
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:808960 nr:0 dw:0 dr:809624 al:0 bm:49 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:20157788
        [>....................] sync'ed: 3.9% (19684/20472)M
        finish: 0:32:40 speed: 10,264 (10,112) K/sec

root@drbd4:~# drbdadm -- status deb00
[ XML status output was not preserved in this copy; the attributes of interest are cs, ro1/ro2 and ds1/ds2 ]
```

## Filesystem Build

> Remember this is **not** a shared filesystem; you can only format/mount/etc. on the Primary node

It's best to wait until the initial synchronization is complete; use one of the methods below to watch for 100% completion. The sync can take a while depending on the size of the block storage and the speed of the network between them.

```
cat /proc/drbd
drbdadm -- status <resource>    ## 'cent00' or 'deb00' in these examples
```
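For hands-off monitoring, either command can simply be wrapped in `watch` - a convenience, not a requirement:

```
# watch -n5 cat /proc/drbd
```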
+ +``` +cat /proc/drbd +drbdadm -- status ('cent00' or 'deb00' in these examples) +``` + +A proper looking sync shows **both datastores** (ds1, ds2) as **UpToDate** like so: + +``` +[root@drbd1 ~]# drbdadm -- status cent00 + + + + + + +root@drbd4:~# drbdadm -- status deb00 + + + + + +``` + +After that's complete, normal methodology is use to format and mount the device node as defined in the resource config. Typically - as listed in the config - the first resource is `/dev/drbd0`, the second `/dev/drbd1` and so forth. If you've lost track the devfs tree can help you with a simple ls: + +``` +[root@drbd1 ~]# ls -l /dev/drbd/by-res/ +total 0 +lrwxrwxrwx 1 root root 11 Jul 18 19:03 cent00 -> ../../drbd0 + +root@drbd4:~# ls -l /dev/drbd/by-res/ +total 0 +lrwxrwxrwx 1 root root 11 Jul 18 19:03 deb00 -> ../../drbd0 +``` + +Double-check who is the Primary – this is the **ro1** and **ro2** information shown with the `drbdadm status` command as shown above. Notice in this example that drbd1 shows r01=Primary and drbd2 shows ro1=Secondary – we know we should be formatting, mounting, etc. on drbd1 for this work once synchronization is complete. On the Debian nodes, we see that drbd4 shows r01=Primary as expected. + +``` +[root@drbd1 ~]# drbdadm -- status cent00 + + + + + + +[root@drbd2 ~]# drbdadm -- status cent00 + + + + + +``` + +We'll use a standard ext4 filesystem for this build on both CentOS and Debian: + +``` +mkfs.ext4 -v -m0 /dev/drbd0 +mkdir /data +mount /dev/drbd0 /data +df -h +``` + +They look like what you'd expect: + +``` +[root@drbd1 ~]# df -h /data +Filesystem Size Used Avail Use% Mounted on +/dev/drbd0 20G 172M 20G 1% /data + +root@drbd4:~# df -h /data +Filesystem Size Used Avail Use% Mounted on +/dev/drbd0 20G 172M 20G 1% /data +``` + + +## Testing + +Test that the resource can be made active and mounted on the Secondary node. We'll write a test file, unmount, demote the Primary to Secondary, mount on the partner node and check the test file. 
### CentOS

```
[root@drbd1 ~]# touch /data/test.file
[root@drbd1 ~]# umount /data
[root@drbd1 ~]# drbdadm secondary cent00

[root@drbd2 ~]# drbdadm primary cent00
[root@drbd2 ~]# mkdir /data
[root@drbd2 ~]# mount /dev/drbd0 /data
[root@drbd2 ~]# ls -l /data/
total 16
drwx------ 2 root root 16384 Jul 18 19:41 lost+found
-rw-r--r-- 1 root root 0 Jul 18 19:47 test.file
```

### Debian

```
root@drbd4:~# touch /data/test.file
root@drbd4:~# umount /data
root@drbd4:~# drbdadm secondary deb00

root@drbd3:~# drbdadm primary deb00
root@drbd3:~# mkdir /data
root@drbd3:~# mount /dev/drbd0 /data
root@drbd3:~# ls -l /data/
total 16
drwx------ 2 root root 16384 Jul 18 19:41 lost+found
-rw-r--r-- 1 root root 0 Jul 18 19:47 test.file
```


## HA Failover

> TODO: Build this section some day


## References

 - 
 - 

diff --git a/md/fonts_and_linux.md b/md/fonts_and_linux.md
new file mode 100644
index 0000000..d72b57c
--- /dev/null
+++ b/md/fonts_and_linux.md
@@ -0,0 +1,228 @@

# Fonts and Linux

## Contents

 - [Configuration Files](#configuration-files)
 - [Remapping Fonts](#remapping-fonts)
 - [Whitelisting and Blacklisting](#whitelisting-and-blacklisting)
 - [Disable Hinting](#disable-hinting)
 - [LCD Optimization](#lcd-optimization)
 - [Local Fonts](#local-fonts)
 - [Luxi Sans](#luxi-sans)
 - [gVim Fonts](#gvim-fonts)
 - [LightDM Fonts](#lightdm-fonts)
 - [References](#references)


## Configuration Files

The exact location of the configuration depends on the release of the distro and the Desktop Environment being used.

| Distro           | Location                        |
| ---------------- | ------------------------------- |
| Fedora 17-       | ~/.fonts.conf                   |
| Fedora 18+, Arch | ~/.config/fontconfig/fonts.conf |


## Remapping Fonts

This will remap Courier to Liberation Mono and Caladea to Carlito - Google Fonts have a habit of "stealing" these typefaces, resulting in unexpected fonts when browsing the general web. Substitute **DejaVu Sans Mono** for Liberation Mono for a similar result, depending on your desktop.

```
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>

  <match target="pattern">
    <test qual="any" name="family"><string>Courier</string></test>
    <edit name="family" mode="assign" binding="same"><string>Liberation Mono</string></edit>
  </match>

  <alias>
    <family>Courier</family>
    <prefer><family>Liberation Mono</family></prefer>
  </alias>

  <match target="pattern">
    <test qual="any" name="family"><string>Caladea</string></test>
    <edit name="family" mode="assign" binding="same"><string>Carlito</string></edit>
  </match>

</fontconfig>
```


## Whitelisting and Blacklisting

The element `<selectfont>` is used in conjunction with the `<acceptfont>` and `<rejectfont>` elements to selectively whitelist or blacklist fonts from the resolve list and match requests. The simplest and most typical use case is to reject a single font that needs to remain installed but keeps getting matched for a generic font query, causing problems within application user interfaces.

First obtain the Family name as listed in the font itself:

```
$ fc-scan .fonts/lklug.ttf --format='%{family}\n'
LKLUG
```

Then use that Family name in a `<rejectfont>` stanza:

```
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <selectfont>
    <rejectfont>
      <pattern>
        <patelt name="family"><string>LKLUG</string></patelt>
      </pattern>
    </rejectfont>
  </selectfont>
</fontconfig>
```

Typically when both elements are combined, `<rejectfont>` is first used on a more general matching glob to reject a large group (such as a whole directory), then `<acceptfont>` is used after it to whitelist individual fonts out of the larger blacklisted group.
+ +``` + + + /usr/share/fonts/OTF/* + + + + + Monaco + + + + +``` + + +## Disable Hinting + +Chrome in XFCE can have issues with hinting resulting in bad display for instance, it might be handy to disable hinting: + +``` + + + + false + + +``` + + +## LCD Optimization + +Assuming a standard RGB subpixel ordering: + +``` + + + + + true + + + false + + + true + + + hintslight + + + rgb + + + lcddefault + + + false + + + +``` + + +## Local Fonts + +Keeping fonts locally in your home directory: + + - Make a new directory in your home called `.fonts` (note leading `.`) + - Copy the downloaded TTF file into this directory + - Change directory to .fonts (`cd ~/.fonts/`) + - Run the command: `mkfontscale` (creates fonts.scale) + - Run the command: `mkfontdir` (creates fonts.dir) + - Run the command: `fc-cache -fv ~/.fonts` (rebuilds local cache) + - Test with: `fc-cache` + + +## Luxi Sans + +This disappeared with Fedora 8 due to [licensing issues](https://fedoraproject.org/wiki/Luxi_fonts), Google for this file and unpack it to get the TTF files you can install in your [home directory](#Local_Fonts). + + - `xorg-x11-fonts-truetype-7.2-3.fc8.noarch.rpm` + + +## gVim Fonts + +A snippet for your `~/.vimrc` to map the editing font in [gVim](http://www.vim.org/): + +``` +~/.vimrc + +" GVIM preferences +if has("gui_running") + let os=substitute(system('uname'), '\n', '', '') + if os == 'Darwin' || os == 'Mac' + set guifont=Menlo:h16 + elseif os == 'Linux' + set guifont=Monospace\ 12 + endif +endif +``` + + +## LightDM Fonts + +The typical config file is `/etc/lightdm/lightdm-gtk-greeter.conf`: + +``` +/etc/lightdm/lightdm-gtk-greeter.conf + +[greeter] +font-name=Luxi Sans 12 +xft-antialias=true +xft-hintstyle=hintnone +xft-rgba=rgb +``` + + +## References + + - diff --git a/md/glusterfs_build_steps.md b/md/glusterfs_build_steps.md new file mode 100644 index 0000000..a5a2ac3 --- /dev/null +++ b/md/glusterfs_build_steps.md @@ -0,0 +1,483 @@ +# GlusterFS Build Steps + +## Contents + + - [Overview](#overview) + - [Prerequisites](#prerequisites) + - [Build Document Setup](#build-document-setup) + - [Node Prep](#node-prep) + - [Configure /etc/hosts and iptables](#configure-etchosts-and-iptables) + - [Granular iptables](#granular-iptables) + - [Install Packages](#install-packages) + - [Prepare Bricks](#prepare-bricks) + - [GlusterFS Setup](#glusterfs-setup) + - [Start glusterfsd daemon](#start-glusterfsd-daemon) + - [Build Peer Group](#build-peer-group) + - [Volume Creation](#volume-creation) + - [Replicated Volume](#replicated-volume) + - [Distributed-Replicated Volume](#distributed-replicated-volume) + - [Volume Deletion](#volume-deletion) + - [Clearing Bricks](#clearing-bricks) + - [Adding Bricks](#adding-bricks) + - [Volume Options](#volume-options) + - [Client Mounts](#client-mounts) + - [FUSE Client](#fuse-client) + - [NFS Client](#nfs-client) + - [References](#references) + + +## Overview + +Prior to starting work, a fundamental decision must be made - what type of Volume(s) need to be used for the given scenario. While 6 methods exist, two are used most often to achieve different results: + + - **Replicated**: This type of Volume provides a file replication across multiple bricks, it is a best choice for environments where High Availability and High Reliability are CRITICAL, as well as if you wish to self-mount the volume on every node such as with a webserver DocumentRoot - the GlusterFS nodes are their own clients. 
   - Files are copied to each brick in the volume, similar to a RAID-1; however, you can have 3+ bricks, including an odd number. Usable space is the size of one brick, and all files written to one brick are replicated to all others. This makes the most sense if you are going to self-mount the GlusterFS volume, for instance as the web docroot (/var/www) or similar where all files must reside on that node. The value passed to `replica` is the same as the number of nodes in the volume.
 - **Distributed-Replicated**: In this scenario files are distributed across replicated bricks in the volume. You can use this type of volume in environments where the requirement is to scale storage as well as to have high availability. Volumes of this type also offer improved read performance in most environments, and are the most common type of volume used when clients are external to the GlusterFS nodes themselves.
   - Somewhat like a RAID-10, an even number of bricks must be used; usable space is the size of the combined bricks passed to the `replica` value. For example, if there are **4 bricks of 20G** and you pass `replica 2` to the creation, your files will distribute to 2 nodes (40G) and replicate to 2 nodes. With **6 bricks of 20G** and `replica 3`, it would distribute to 3 nodes (60G) and replicate to 3 nodes; but if you used `replica 2`, it would distribute to 2 nodes (40G) and replicate to 4 nodes in pairs. This is the type to use when your clients are external to the cluster, not local self-mounts.

All the fundamental work in this document is the same except for the one step where the Volume is created, as outlined above with the `replica` keyword. Striped-based volumes are not covered here.


## Prerequisites

1. 2 or more servers with separate storage
2. Private network between the servers

## Build Document Setup

This build document will use the following setup, which can be stood up easily; using Cloud block devices is no different than VMware vDisks, SAN/DAS LUNs, iSCSI, etc.

 - 4x Performance 1 Tier 2 Rackspace Cloud servers - each with a 20G /dev/xvde ready to use for its brick
 - 1x Cloud Private Network on 192.168.3.0/24 for GlusterFS communication
 - GlusterFS 3.7 installed from the vendor package repository

## Node Prep

 - Configure /etc/hosts and iptables
 - Install base toolset(s)
 - Install GlusterFS software
 - Connect GlusterFS nodes

### Configure /etc/hosts and iptables

In lieu of using DNS, we prepare /etc/hosts on every machine and ensure the nodes can talk to each other. All the servers have a `gluster`_N_ hostname, so we'll use `glus`_N_ names for our private communication layer between nodes.
+ +``` +# vi /etc/hosts + 192.168.3.2 glus1 + 192.168.3.4 glus2 + 192.168.3.1 glus3 + 192.168.3.3 glus4 + +# ping -c2 glus1; ping -c2 glus2; ping -c2 glus3; ping -c2 glus4 + +## Red Hat oriented: +# vi /etc/sysconfig/iptables + -A INPUT -s 192.168.3.0/24 -j ACCEPT +# service iptables restart + +## Debian oriented +# vi /etc/iptables/rules.v4 + -A INPUT -s 192.168.3.0/24 -j ACCEPT +# service iptables-persistent restart +``` + +#### Granular iptables + +The above generic iptables rule opens all ports to the subnet; if more granular setup is required: + + - **111** - portmap / rpcbind + - **24007** - GlusterFS Daemon + - **24008** - GlusterFS Management + - **38465** to **38467** - Required for GlusterFS NFS service + - **24009** to +X - GlusterFS versions less than 3.4, OR + - **49152** to +X - GlusterFS versions 3.4 and later + +Each brick for every volume on the host requires it’s own port. For every new brick, one new port will be used starting at **24009** for GlusterFS versions below 3.4 and **49152** for version 3.4 and above. + +**Example**: If you have one volume with two bricks, you will need to open 24009 - 24010, or 49152 - 49153. + +### Install Packages + +1. Install the basic packages for partitioning, LVM2 and XFS +2. Install the GlusterFS repository and glusterfs\* packages +3. Disable automatic updates of gluster\* packages + +Some of the required packages may already be installed on the cluster nodes. + +``` +## YUM/RPM Based: +# yum -y install parted lvm2 xfsprogs +# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo +# yum -y install glusterfs glusterfs-fuse glusterfs-server + +## Ubuntu based (Default Ubuntu repo has glusterfs 3.4, here's how to install 3.7): +# apt-get install lvm2 xfsprogs python-software-properties +# add-apt-repository ppa:gluster/glusterfs-3.7 +# apt-get update +# apt-get install glusterfs-server +``` + +Ensure that the gluster\* packages are filtered out of automatic updates; upgrades while it's running can crash the bricks. + +``` +# grep ^exclude /etc/yum.conf +exclude=kernel* gluster* + +## Ubuntu method: +# apt-mark hold glusterfs* +``` + +### Prepare Bricks + +1. Partition block devices +2. Create LVM foundation +3. Prepare volume bricks + +The underlying bricks are a standard filesystem and mount point. However, make sure to mount each brick in such a way so as to discourage any use from changing to the directory and writing to the underlying bricks themselves. **Writing directly to a Brick will corrupt your Volume\!** + +The bricks must be unique per node, and there should be a directory within the mount to use in volume creation. Attempting to create a replicated volume using the top-level of the mounts results in an error with instructions to use a subdirectory. 
+ +``` +all nodes: + # parted -s -- /dev/xvde mktable gpt + # parted -s -- /dev/xvde mkpart primary 2048s 100% + # parted -s -- /dev/xvde set 1 lvm on + # partx -a /dev/xvde + # pvcreate /dev/xvde1 + # vgcreate vgglus1 /dev/xvde1 + +Logical Volumes +--------------- + Standard LVM: + # lvcreate -l 100%VG -n gbrick1 vgglus1 + + For GlusterFS snapshot support: + # lvcreate -l 100%FREE --thinpool lv_thin vgglus1 + # lvcreate --thin -V $(lvdisplay /dev/vgglus1/lv_thin | awk '/LV\ Size/ { print $3 }')G -n gbrick1 vgglus1/lv_thin + +Filesystems for bricks +---------------------- + For XFS bricks: (recommended) + # mkfs.xfs -i size=512 /dev/vgglus1/gbrick1 + # echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 xfs inode64,nobarrier 0 0' >> /etc/fstab + # mkdir -p /data/gluster/gvol0 + # mount /data/gluster/gvol0 + + For ext4 bricks: + # mkfs.ext4 /dev/vgglus1/gbrick1 + # echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 ext4 defaults,user_xattr,acl 0 0' >> /etc/fstab + # mkdir -p /data/gluster/gvol0 + # mount /data/gluster/gvol0 + +glus1: + # mkdir -p /data/gluster/gvol0/brick1 + +glus2: + # mkdir -p /data/gluster/gvol0/brick1 + +glus3: + # mkdir -p /data/gluster/gvol0/brick1 + +glus4: + # mkdir -p /data/gluster/gvol0/brick1 +``` + + +## GlusterFS Setup + +### Start glusterfsd daemon + +The daemon can be restarted at runtime as well: + +``` +## Red Hat based: +# service glusterd start +# chkconfig glusterd on +``` + +### Build Peer Group + +This is what's known as a **Trusted Storage Pool** in the GlusterFS world. Note that as of early release of version 3, you only need to probe all other nodes from glus1. The peer list is then automatically distributed to all peers from there. + +``` +glus1: + # gluster peer probe glus2 + # gluster peer probe glus3 + # gluster peer probe glus4 + # gluster peer status + +[root@gluster1 ~]# gluster pool list +UUID Hostname State +734aea4c-fc4f-4971-ba3d-37bd5d9c35b8 glus4 Connected +d5c9e064-c06f-44d9-bf60-bae5fc881e16 glus3 Connected +57027f23-bdf2-4a95-8eb6-ff9f936dc31e glus2 Connected +e64c5148-8942-4065-9654-169e20ed6f20 localhost Connected +``` + +### Volume Creation + +We will set up basic auth restrictions to only our private subnet as by default glusterd NFS allows global read/write during Volume creation. glusterd automatically starts NFSd on each server and exports the volume through it from each of the nodes. The reason for this behaviour is that in order to use native client (FUSE) for mounting the volume on clients, the clients have to run exactly same version of GlusterFS packages. If the versions are different there might be differences in the hashing algorithms used by servers and clients and the clients won't be able to connect. + +#### Replicated Volume + +This example will create replication to all 4 nodes - each node contains a copy of all data and the size of the Volume is the size of a single brick. Notice how the info shows `1 x 4 = 4` in the output. 
+ +``` +one node only: + # gluster volume create gvol0 replica 4 transport tcp \ + glus1:/data/gluster/gvol0/brick1 \ + glus2:/data/gluster/gvol0/brick1 \ + glus3:/data/gluster/gvol0/brick1 \ + glus4:/data/gluster/gvol0/brick1 + # gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1 + # gluster volume set gvol0 nfs.disable off + # gluster volume set gvol0 nfs.addr-namelookup off + # gluster volume set gvol0 nfs.export-volumes on + # gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.* + # gluster volume set gvol0 performance.io-thread-count 32 + # gluster volume start gvol0 + +[root@gluster1 ~]# gluster volume info gvol0 +Volume Name: gvol0 +Type: Replicate +Volume ID: 65ece3b3-a4dc-43f8-9b0f-9f39c7202640 +Status: Started +Number of Bricks: 1 x 4 = 4 +Transport-type: tcp +Bricks: +Brick1: glus1:/data/gluster/gvol0/brick1 +Brick2: glus2:/data/gluster/gvol0/brick1 +Brick3: glus3:/data/gluster/gvol0/brick1 +Brick4: glus4:/data/gluster/gvol0/brick1 +Options Reconfigured: +nfs.rpc-auth-allow: 192.168.3.*,127.0.0.1 +nfs.export-volumes: on +nfs.addr-namelookup: off +nfs.disable: off +auth.allow: 192.168.3.* +performance.io-thread-count: 32 +``` + +#### Distributed-Replicated Volume + +This example will create distributed replication to 2x2 nodes - each pair of nodes contains the data and the size of the Volume is the size of a two bricks. Notice how the info shows `2 x 2 = 4` in the output. + +``` +one node only: + # gluster volume create gvol0 replica 2 transport tcp \ + glus1:/data/gluster/gvol0/brick1 \ + glus2:/data/gluster/gvol0/brick1 \ + glus3:/data/gluster/gvol0/brick1 \ + glus4:/data/gluster/gvol0/brick1 + # gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1 + # gluster volume set gvol0 nfs.disable off + # gluster volume set gvol0 nfs.addr-namelookup off + # gluster volume set gvol0 nfs.export-volumes on + # gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.* + # gluster volume set gvol0 performance.io-thread-count 32 + # gluster volume start gvol0 + +[root@gluster1 ~]# gluster volume info gvol0 +Volume Name: gvol0 +Type: Distributed-Replicate +Volume ID: d883f891-e38b-4565-8487-7e50ca33dbd4 +Status: Started +Number of Bricks: 2 x 2 = 4 +Transport-type: tcp +Bricks: +Brick1: glus1:/data/gluster/gvol0/brick1 +Brick2: glus2:/data/gluster/gvol0/brick1 +Brick3: glus3:/data/gluster/gvol0/brick1 +Brick4: glus4:/data/gluster/gvol0/brick1 +Options Reconfigured: +nfs.rpc-auth-allow: 192.168.3.* +nfs.export-volumes: on +nfs.addr-namelookup: off +nfs.disable: off +auth.allow: 192.168.3.*,127.0.0.1 +performance.io-thread-count: 32 +``` + +## Volume Deletion + +After ensure that no clients (either local or remote) are mounting the Volume, stop the Volume and delete it. + +``` +# gluster volume stop gvol0 +# gluster volume delete gvol0 +``` + +### Clearing Bricks + +If brick(s) were used in a volume and they need to be removed, there's an attribute that GlusterFS had set on the brick subdirectories. This needs to be cleared before they can be reused - or the subdir can be deleted and recreated. 
+ +``` +glus1: + # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/ + # setfattr -x trusted.gfid /data/gluster/gvol0/brick1 + # rm -rf /data/gluster/gvol0/brick1/.glusterfs +glus2: + # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/ + # setfattr -x trusted.gfid /data/gluster/gvol0/brick1 + # rm -rf /data/gluster/gvol0/brick1/.glusterfs +glus3: + # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/ + # setfattr -x trusted.gfid /data/gluster/gvol0/brick1 + # rm -rf /data/gluster/gvol0/brick1/.glusterfs +glus4: + # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/ + # setfattr -x trusted.gfid /data/gluster/gvol0/brick1 + # rm -rf /data/gluster/gvol0/brick1/.glusterfs + +...or just deleting all data: + +glus1: + # rm -rf /data/gluster/gvol0/brick1 + # mkdir /data/gluster/gvol0/brick1 +glus2: + # rm -rf /data/gluster/gvol0/brick1 + # mkdir /data/gluster/gvol0/brick1 +glus3: + # rm -rf /data/gluster/gvol0/brick1 + # mkdir /data/gluster/gvol0/brick1 +glus4: + # rm -rf /data/gluster/gvol0/brick1 + # mkdir /data/gluster/gvol0/brick1 +``` + +### Adding Bricks + +Additional bricks can be added to a running Volume easily: + +``` +# gluster volume add-brick gvol0 glus5:/data/gluster/gvol0/brick1 +``` + +The add-brick command can also be used to change the LAYOUT of your volume. For example, to change a 2 node Distributed volume into a 4 node Distributed-Replicated Volume. After such an operation you **must rebalance** your volume. New files will be automatically created on the new nodes, but the old ones will not get moved. + +``` +# gluster volume add-brick gvol0 replica 2 \ + glus5:/data/gluster/gvol0/brick1 \ + glus6:/data/gluster/gvol0/brick1 +# gluster rebalance gvol0 start +# gluster rebalance gvol0 status + +## If needed (something didn't work right) +# gluster rebalance gvol0 stop +``` + +> When expanding distributed replicated and distributed striped volumes, you must add a number of bricks that is a multiple of the replica or stripe count. For example, to expand a distributed replicated volume with a replica count of 2, you need to add bricks in multiples of 2 (such as 4, 6, 8, etc.): +> +> ``` +> # gluster volume add-brick gvol0 \ +> glus5:/data/gluster/gvol0/brick1 \ +> glus6:/data/gluster/gvol0/brick1 +> ``` + +### Volume Options + +To view configured volume options: + +``` +# gluster volume info gvol0 + +Volume Name: gvol0 +Type: Replicate +Volume ID: bcbfc645-ebf9-4f83-b9f0-2a36d0b1f6e3 +Status: Started +Number of Bricks: 1 x 4 = 4 +Transport-type: tcp +Bricks: +Brick1: glus1:/data/gluster/gvol0/brick1 +Brick2: glus2:/data/gluster/gvol0/brick1 +Brick3: glus3:/data/gluster/gvol0/brick1 +Brick4: glus4:/data/gluster/gvol0/brick1 +Options Reconfigured: +performance.cache-size: 1073741824 +performance.io-thread-count: 64 +cluster.choose-local: on +nfs.rpc-auth-allow: 192.168.3.*,127.0.0.1 +nfs.export-volumes: on +nfs.addr-namelookup: off +nfs.disable: off +auth.allow: 192.168.3.*,127.0.0.1 +``` + +To set an option for a volume, use the `set` keyword like so: + +``` +# gluster volume set gvol0 performance.write-behind off +volume set: success +``` + +To clear an option to a Volume back to defaults, use the `reset` keyword like so: + +``` +# gluster volume reset gvol0 performance.read-ahead +volume reset: success: reset volume successful +``` + + +## Client Mounts + +From a client perspective the GlusterFS Volume can be mounted in two fundamental ways: + +1. FUSE Client +2. 
NFS Client

### FUSE Client

The FUSE client allows the mount to happen with a GlusterFS "round robin" style connection; in /etc/fstab the name of one node is used, however internal mechanisms allow that node to fail and the clients to roll over to other connected nodes in the Trusted Storage Pool. The performance is slightly lower than the NFS method based on tests, however not drastically so - the gain is automatic HA client failover, which is typically worth the performance hit.

```
## RPM based:
# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
# yum -y install glusterfs glusterfs-fuse

## Ubuntu based (glusterfs-client 3.4 works with glusterfs-server 3.5, but for the most recent version do this):
# add-apt-repository ppa:gluster/glusterfs-3.7
# apt-get update
# apt-get install glusterfs-client
##

## Common:
# vi /etc/hosts
  192.168.3.2 glus1
  192.168.3.4 glus2
  192.168.3.1 glus3
  192.168.3.3 glus4

# modprobe fuse
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 glusterfs defaults,_netdev,backup-volfile-servers=glus2 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
```

### NFS Client

The standard Linux NFSv3 client tools are used to mount one of the GlusterFS nodes; the performance is typically a little better than the FUSE client, however the downside is that the connection is 1-to-1 – if the GlusterFS node goes down, the client will not round-robin out to another node. A different solution, such as HAProxy/keepalived, a load balancer, etc., has to be added in order to provide a floating IP proxy in this use case.

```
## RPM based:
# yum -y install rpcbind nfs-utils
# service rpcbind restart; chkconfig rpcbind on
# service nfslock restart; chkconfig nfslock on

## Ubuntu:
# apt-get install nfs-common
##

## Common:
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 nfs rsize=4096,wsize=4096,hard,intr 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
```


## References

 - 
 - 
 - 
 - 

diff --git a/md/grub_2_info.md b/md/grub_2_info.md
new file mode 100644
index 0000000..ebaa7b9
--- /dev/null
+++ b/md/grub_2_info.md
@@ -0,0 +1,205 @@

# Grub 2 Info

## Contents

 - [Overview](#overview)
 - [Default Menu Entry](#default-menu-entry)
 - [Vendor Defaults](#vendor-defaults)
 - [RHEL Kernels](#rhel-kernels)
 - [Simple Text Mode](#simple-text-mode)
 - [Boot Options](#boot-options)
 - [Disable Recovery Menus](#disable-recovery-menus)
 - [Use Backup Config](#use-backup-config)
 - [References](#references)


## Overview

Grub v2 is a radical departure from grub v1 (aka "grub legacy"); at first it seems daunting, especially if you've spent years working with lilo and grub v1 and their simplistic, single-file index-based configuration. These are some of the frequent tasks encountered when working with the new grub v2 environment; be sure to refer to the References links for more detailed and in-depth information. Grub2 is used in RHEL / CentOS 7 and above, Fedora 19 and above, and Ubuntu 14 and above.
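If you're ever unsure which generation a given server is running, checking the config-builder's version is a quick tell (command names differ per distro, as below):

```
# grub2-mkconfig --version    ## Fedora/RHEL/CentOS
# grub-mkconfig --version     ## Ubuntu/Debian/Mint
```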
+ + +## Default Menu Entry + +Choose the default boot target via the `GRUB_DEFAULT` setting in `/etc/default/grub`: + + - `GRUB_DEFAULT=`_n_ : specify the _n_th entry in `/boot/grub2/grub.cfg`; first entry is `0` + - `GRUB_DEFAULT=`_string_ : choose an entry by name + - `GRUB_DEFAULT=saved` : use the boot target specified in `/boot/grub/grubenv` + +### Vendor Defaults + +Vendors are using different out of the box defaults for the kernel to boot: + +| **Distro** | **Default** | +| ------------- | -------------------- | +| RHEL / CentOS | `GRUB_DEFAULT=saved` | +| Ubuntu | `GRUB_DEFAULT=0` | + +> Note that Ubuntu by default is using the 0-index method to boot a kernel; in real terms, this means it always tries to boot the top-level entry named "Ubuntu" that has no discrete kernel name listed. You must first set `GRUB_DEFAULT=saved` in `/etc/default/grub` and run `update-grub` to convert it over to using named kernels as outlined below. Failure to do these steps will result in an outcome you did not expect. + +First, make sure that your setup is using a special setting in `/etc/default/grub` - add or replace as needed: + +``` +/etc/default/grub + +GRUB_DEFAULT=saved +``` + +Rebuild your grub config file after making a backup. You will notice we create a backup **in the boot directory** as the Grub2 bootloader has the capability to use an alternate backup file\! So if your changes end up with a mistake, at a `grub>` menu you can tell it to boot your backup file out of the `/boot` directory. This provides an emergency rollback scenario should things go south during the reboot; see the "Use Backup Config" section in this article. + +``` +Fedora/RHEL/CentOS: +# cp -a /boot/grub2/grub.cfg{,.bak} +# grub2-mkconfig -o /boot/grub2/grub.cfg + +Ubuntu/Debian/Mint: +# cp -a /boot/grub/grub.cfg{,.bak} +# update-grub +``` + +Get a list of the boot menu items with a simple grep - note that RHEL style systems use a flat top-level menu, whereas Ubuntu uses a top-level single menu "Ubuntu" with several submenus of named kernels; this is just a different implementation by the vendors, but it does require a slightly different grep to find the menu entry names. + +``` +Fedora/RHEL/CentOS: +# grep "^menuentry" /boot/grub2/grub.cfg | cut -d "'" -f2 + +Red Hat Enterprise Linux Server (3.10.0-229.1.2.el7.x86_64) 7.1 (Maipo) +Red Hat Enterprise Linux Server (3.10.0-229.1.2.el7.x86_64) 7.1 (Maipo) with debugging +Red Hat Enterprise Linux Server (3.10.0-229.4.2.el7.x86_64) 7.1 (Maipo) +Red Hat Enterprise Linux Server (3.10.0-229.4.2.el7.x86_64) 7.1 (Maipo) with debugging +Red Hat Enterprise Linux Server 7.1 (Maipo), with Linux 3.10.0-229.el7.x86_64 +Red Hat Enterprise Linux Server 7.1 (Maipo), with Linux 0-rescue-2c9acae4aae44399a33ff8405cdfda12 + +Ubuntu/Debian/Mint: +# egrep "^[[:space:]]?(submenu|menuentry)" /boot/grub/grub.cfg | cut -d "'" -f2 + +Ubuntu +Advanced options for Ubuntu +Ubuntu, with Linux 4.4.0-75-generic +Ubuntu, with Linux 4.4.0-72-generic +Ubuntu, with Linux 4.4.0-43-generic +``` + +> Ubuntu uses submenus. This means that the line above "Advanced options for Ubuntu" is the top level item, then the lines below it are children; visually: +> +> - Ubuntu +> - Advanced options for Ubuntu +> - Ubuntu, with Linux 4.4.0-75-generic +> - Ubuntu, with Linux 4.4.0-72-generic +> - Ubuntu, with Linux 4.4.0-43-generic +> +> Below you will prepend "Advanced options for Ubuntu", a '\>' symbol, then the name to make it work correctly. 
Use one of these lines to set the default with `grub2-set-default` / `grub-set-default` - a 0-based index can be used instead of a name, however on RHEL `grubby` uses the long name, so stick to that. `grub(2)-set-default` is just a fancy shell script that runs `grub(2)-editenv` to unset old entries and set `saved_entry` to your new choice. It updates the `/boot/grub(2)/grubenv` file:

```
Fedora/RHEL/CentOS:
# grub2-set-default "Red Hat Enterprise Linux Server (3.10.0-229.1.2.el7.x86_64) 7.1 (Maipo)"

# grub2-editenv list
saved_entry=Red Hat Enterprise Linux Server (3.10.0-229.1.2.el7.x86_64) 7.1 (Maipo)

# grep saved_entry /boot/grub2/grubenv
saved_entry=Red Hat Enterprise Linux Server (3.10.0-229.1.2.el7.x86_64) 7.1 (Maipo)

===

Ubuntu/Debian/Mint:
# grub-set-default 'Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-72-generic'

# grub-editenv list
saved_entry=Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-72-generic

# grep saved_entry /boot/grub/grubenv
saved_entry=Advanced options for Ubuntu>Ubuntu, with Linux 4.4.0-72-generic
```


## RHEL Kernels

A note about how Red Hat builds their kernel packages: in the above examples `grub2-mkconfig` is used to build new config files; be aware that the RHEL kernel package installs do not use it. Instead, they use a shim `/sbin/new-kernel-pkg` which in turn uses a program called `grubby`. The critical differences:

 - **grubby does not use /etc/default/grub** – it uses the as-booted kernel parameters to build a new command line for the new kernel. If you have custom kernel arguments, be sure to first update `/etc/default/grub`, use `grub2-mkconfig`, and reboot to activate the new options, then upgrade to the new kernel package. This is most commonly done on a freshly installed server before a `yum update` is issued.
 - **the grubby menu entries are different** – the Title used in its menu entries is a different text string than the one produced by `grub2-mkconfig`; this means you should always check - and if required, set - the default kernel to boot when manipulating the grub config file. A quick example:

```
# Build the new config
grub2-mkconfig -o /boot/grub2/grub.cfg

# Get the name of the one you want, let's call it "Red Hat FOOBAR" here
grep "^menuentry" /boot/grub2/grub.cfg | cut -d "'" -f2

# Update the grub environment file /boot/grub2/grubenv saved_entry
grub2-set-default "Red Hat FOOBAR"
grub2-editenv list
```

The Red Hat kernel packages will automatically reconfigure the default kernel to boot as part of the `grubby` process; when simply doing a standard kernel package upgrade or downgrade, the environment file will be updated to that package's version.


## Simple Text Mode

This is handy for servers - set up grub2 to use basic text mode and show all kernels by default:

```
# vi /etc/default/grub

GRUB_DISABLE_SUBMENU=true
GRUB_GFXMODE=1024x768
GRUB_GFXPAYLOAD_LINUX=keep

Comment this out if present:
#GRUB_TERMINAL_OUTPUT="console"

# cp -a /boot/grub2/grub.cfg{,.bak}
# grub2-mkconfig -o /boot/grub2/grub.cfg
```


## Boot Options

The config variable `GRUB_CMDLINE_LINUX` will be applied to all the auto-detected Linux kernels when you rebuild the config; typically the config already has some options, so just add to the end as needed. Some common options might be disabling IPv6, setting the LANG variable or blacklisting initrd modules (such as SAN/DAS HBAs).
```
# vi /etc/default/grub

GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vglocal00/swap00 rd.lvm.lv=vglocal00/root00 net.ifnames=0 biosdevname=0 rdblacklist=bfa nomodeset"

# cp -a /boot/grub2/grub.cfg{,.bak}
# grub2-mkconfig -o /boot/grub2/grub.cfg
```


## Disable Recovery Menus

You may wish to disable the "(recovery mode)" (single-user mode) menu entries by setting `GRUB_DISABLE_RECOVERY` and rebuilding your config. In reality all the menu entry does is add the boot option `single`, so there's no real value in cluttering up your grub menus.

```
# vi /etc/default/grub

GRUB_DISABLE_RECOVERY="true"

# cp -a /boot/grub2/grub.cfg{,.bak}
# grub2-mkconfig -o /boot/grub2/grub.cfg
```


## Use Backup Config

It's possible something might go wrong during `grub2-mkconfig` and leave you with a corrupt config file, but you won't notice it until you've rebooted. Typically grub will drop you to the shell with an error (or just empty menus). If that happens and you've made a backup, use the `configfile` command to load it - as soon as the command is issued the menu should appear.

```
grub> ls
(hd0) (hd0,msdos3) (hd0,msdos2) (hd0,msdos1)
grub> configfile (hd0,1)/boot/grub2/grub.cfg.bak
```
diff --git a/md/jumbo_frames.md b/md/jumbo_frames.md
new file mode 100644
index 0000000..05d4cad
--- /dev/null
+++ b/md/jumbo_frames.md
@@ -0,0 +1,197 @@
# Jumbo Frames

## Contents

  - [Overview](#overview)
  - [Configuration](#configuration)
      - [RHEL/CentOS](#rhelcentos)
  - [Basic Testing](#basic-testing)
      - [Node A to B](#node-a-to-b)
      - [Node B to A](#node-b-to-a)
  - [Performance Testing](#performance-testing)
      - [Node A to B](#node-a-to-b-1)
      - [Node B to A](#node-b-to-a-1)
  - [Protocol Overhead](#protocol-overhead)


## Overview

[Jumbo frames](http://en.wikipedia.org/wiki/Jumbo_frames) open up the Ethernet frame to a larger MTU so that large packets can be pushed without fragmenting; this is a common need on a private GigE switched network for an Oracle RAC Interconnect between nodes. The accepted standard MTU is 9000 (large enough for an 8k payload plus packet overhead). Once your GigE switches have been configured for the new MTU, configure and test your servers.


## Configuration

**Example Setup**:

  - Node **A**: 192.168.100.101 (bond1, eth1 + eth5)
  - Node **B**: 192.168.100.102 (bond1, eth1 + eth5)


### RHEL/CentOS

On both nodes:

```
# ip link set dev bond1 mtu 9000
# vi /etc/sysconfig/network-scripts/ifcfg-bond1
  add: MTU=9000
```


## Basic Testing

### Node A to B

```
# ifenslave -c bond1 eth5
# ip route get 192.168.100.102
# tracepath -n 192.168.100.102
# ping -c 5 -s 8972 -M do 192.168.100.102

# ifenslave -c bond1 eth1
# ip route get 192.168.100.102
# tracepath -n 192.168.100.102
# ping -c 5 -s 8972 -M do 192.168.100.102
```

### Node B to A

```
# ifenslave -c bond1 eth5
# ip route get 192.168.100.101
# tracepath -n 192.168.100.101
# ping -c 5 -s 8972 -M do 192.168.100.101

# ifenslave -c bond1 eth1
# ip route get 192.168.100.101
# tracepath -n 192.168.100.101
# ping -c 5 -s 8972 -M do 192.168.100.101
```


## Performance Testing

Use `iperf` (available via EPEL) for throughput measurements.
+ +### Node A to B + +``` +Node B: (receiver) + # ifenslave -c bond1 eth5 + # iperf -B 192.168.100.102 -s -u -l 8972 -w 768k + +Node A: (sender) + # ifenslave -c bond1 eth5 + # iperf -B 192.168.100.101 -c 192.168.100.102 -u \ + -b 10G -l 8972 -w 768k -i 2 -t 30 + + +Node B: (receiver) + # ifenslave -c bond1 eth1 + # iperf -B 192.168.100.102 -s -u -l 8972 -w 768k + +Node A: (sender) + # ifenslave -c bond1 eth1 + # iperf -B 192.168.100.101 -c 192.168.100.102 -u \ + -b 10G -l 8972 -w 768k -i 2 -t 30 +``` + +### Node B to A + +``` +Node A: (receiver) + # ifenslave -c bond1 eth5 + # iperf -B 192.168.100.101 -s -u -l 8972 -w 768k + +Node B: (sender) + # ifenslave -c bond1 eth5 + # iperf -B 192.168.100.102 -c 192.168.100.101 -u \ + -b 10G -l 8972 -w 768k -i 2 -t 30 + + +Node A: (receiver) + # ifenslave -c bond1 eth1 + # iperf -B 192.168.100.101 -s -u -l 8972 -w 768k + +Node B: (sender) + # ifenslave -c bond1 eth1 + # iperf -B 192.168.100.102 -c 192.168.100.101 -u \ + -b 10G -l 8972 -w 768k -i 2 -t 30 +``` + + +## Protocol Overhead + +Reference: [Theoretical Maximums and Protocol Overhead](http://sd.wareonearth.com/~phil/net/overhead/): + +``` +Theoretical maximum TCP throughput on GigE using jumbo frames: + + (9000-20-20-12)/(9000+14+4+7+1+12)*1000000000/1000000 = 990.042 Mbps + | | | | | | | | | | | | + MTU | | | MTU | | | | | GigE Mbps + | | | | | | | | + IP | | Ethernet | | | | InterFrame Gap (IFG), aka + Header | | Header | | | | InterPacket Gap (IPG), is + | | | | | | a minimum of 96 bit times + TCP | FCS | | | from the last bit of the + Header | | | | FCS to the first bit of + | Preamble | | the preamble + TCP | | + Options Start | + (Timestamp) Frame | + Delimiter | + (SFD) | + | + Inter + Frame + Gap + (IFG) + +Theoretical maximum UDP throughput on GigE using jumbo frames: + (9000-20-8)/(9000+14+4+7+1+12)*1000000000/1000000 = 992.697 Mbps + +Theoretical maximum TCP throughput on GigE without using jumbo frames: + (1500-20-20-12)/(1500+14+4+7+1+12)*1000000000/1000000 = 941.482 Mbps + +Theoretical maximum UDP throughput on GigE without using jumbo frames: + (1500-20-8)/(1500+14+4+7+1+12)*1000000000/1000000 = 957.087 Mbps + +Ethernet frame format: + * 6 byte dest addr + * 6 byte src addr + * [4 byte optional 802.1q VLAN Tag] + * 2 byte length/type + * 46-1500 byte data (payload) + * 4 byte CRC + +Ethernet overhead bytes: + 12 gap + 8 preamble + 14 header + 4 trailer = 38 bytes/packet w/o 802.1q + 12 gap + 8 preamble + 18 header + 4 trailer = 42 bytes/packet with 802.1q + +Ethernet Payload data rates are thus: + 1500/(38+1500) = 97.5293 % w/o 802.1q tags + 1500/(42+1500) = 97.2763 % with 802.1q tags + +TCP over Ethernet: + Assuming no header compression (e.g. 
  not PPP)
  Add 20 IPv4 header or 40 IPv6 header (no options)
  Add 20 TCP header
  Add 12 bytes optional TCP timestamps
  Max TCP Payload data rates over ethernet are thus:
   (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
   (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps
   (1500-52)/(42+1500) = 93.9040 %  802.1q, IPv4, TCP timestamps
   (1500-60)/(38+1500) = 93.6281 %  IPv6, minimal headers
   (1500-72)/(38+1500) = 92.8479 %  IPv6, TCP timestamps
   (1500-72)/(42+1500) = 92.6070 %  802.1q, IPv6, TCP timestamps

UDP over Ethernet:
  Add 20 IPv4 header or 40 IPv6 header (no options)
  Add 8 UDP header
  Max UDP Payload data rates over ethernet are thus:
   (1500-28)/(38+1500) = 95.7087 %  IPv4
   (1500-28)/(42+1500) = 95.4604 %  802.1q, IPv4
   (1500-48)/(38+1500) = 94.4083 %  IPv6
   (1500-48)/(42+1500) = 94.1634 %  802.1q, IPv6
```
diff --git a/md/kernel_module_weak_updates.md b/md/kernel_module_weak_updates.md
new file mode 100644
index 0000000..af16c7d
--- /dev/null
+++ b/md/kernel_module_weak_updates.md
@@ -0,0 +1,318 @@
# Kernel Module Weak Updates

## Contents

  - [Overview](#overview)
  - [Module Loading](#module-loading)
  - [Functional Examples](#functional-examples)
      - [Example: bfa](#example-bfa)
      - [Example: lpfc](#example-lpfc)
  - [Compatible Modules](#compatible-modules)
      - [Kernel Symbols](#kernel-symbols)
      - [Module Symbols](#module-symbols)
  - [Methodology](#methodology)
      - [Dependencies](#dependencies)
      - [Weak Modules](#weak-modules)
  - [Incompatible Example](#incompatible-example)
  - [Caveats](#caveats)


## Overview

The Red Hat oriented Linux kernel architecture has a method for 3rd party entities to provide a kernel module for an entire family of kernel releases, based on the fundamental understanding that the kernel's entry tables and module interface do not change within that family. This document goes over the basic design behind the solution.

The use of this methodology is popular amongst 3rd party vendors who provide a pre-compiled kernel module for their hardware and allow that same binary module to work for a number of compatible kernels. Shipping a new binary module for each and every Red Hat kernel release is therefore not required, reducing the complexity of producing the module and its runtime maintenance on a server.

In common usage, these types of modules are delivered in packages named `kmod-<foo>`, where `<foo>` is the name of the existing stock kernel module as shipped by the distribution. The overall compatibility is referred to as **kABI** or _Kernel Application Binary Interface_.


## Module Loading

Key to understanding the method is how the kernel will look for the modules to load.
It varies by distribution, but generally speaking the modules are searched for in this order:

  - `/lib/modules/(kernel-version)/updates`
      - manually controlled area for use by sysadmins to insert a module by hand and override everything
  - `/lib/modules/(kernel-version)/extra`
      - override everything shipped with the kernel and weak-updates (see below)
  - `/lib/modules/(kernel-version)/*`
      - stock kernel modules (usually in a subdirectory `kernel`) and other named directories; a vendor may choose to have a top-level directory here, such as the EMC PowerPath software using `/lib/modules/(kernel-version)/powerpath` as its standard location
  - `/lib/modules/(kernel-version)/weak-updates`
      - kernel modules compatible with this kernel, but actually compiled against another similar kernel in the family

The concept named _weak-updates_ works in tandem with the extra module location; typically the original module is installed in the `extra/` directory of the kernel it was compiled against, and a symlink to it exists in the `weak-updates/` directory of the other kernels.


## Functional Examples

Using a Red Hat Enterprise Linux (RHEL) 7 system, we first note that only one kernel is installed:

```
# rpm -qa | grep ^kernel-3
kernel-3.10.0-327.49.1.el7.x86_64
```

In Red Hat's versioning scheme, this is read as two parts of a design:

  - 3.10.0-327.\* - the suite or "family" of kernel releases
  - .49.1.el7 - the specific patched release of this kernel within the family

This design indicates that any kernel module built for the 3.10.0-327.\* family of kernels _should be_ compatible with any specific kernel in the family; but as it's not possible to 100% guarantee this ahead of time, safety checks exist (more on this below). On this server, we have a kmod kernel module that is replacing one of the stock ones.

### Example: bfa

This package `kmod-bfa` was obtained from a 3rd party provider for the Brocade fiber channel adapters.

```
/lib/modules/3.10.0-327.el7.x86_64/extra/bfa:
-rw-r--r--. 1 root root 23431886 Apr 22  2016 bfa.ko

/lib/modules/3.10.0-327.49.1.el7.x86_64/weak-updates/bfa:
lrwxrwxrwx. 1 root root 51 Feb 14 12:54 bfa.ko -> /lib/modules/3.10.0-327.el7.x86_64/extra/bfa/bfa.ko
```

Notice that the real file is in a directory for a kernel that is not installed; it's located in the `extra/` directory of the "base" kernel for the family (the first one released in the family, in this case RHEL 7.2), and our running kernel has a symlink from its `weak-updates/` directory back to the module. This module is compatible for weak-updates; it was compiled against kernel 3.10.0-327 but functionally works with kernel 3.10.0-327.49.1 as is, no modifications needed.

### Example: lpfc

This package `kmod-lpfc` is provided in the main RHEL7 software repository by Red Hat, providing newer upstream code for Emulex fiber channel adapters.

```
/lib/modules/3.10.0-327.el7.x86_64/extra/lpfc:
-rw-r--r--. 1 root root 1180268 Sep  5 02:51 lpfc.ko

/lib/modules/3.10.0-327.49.1.el7.x86_64/weak-updates/lpfc:
lrwxrwxrwx. 1 root root 53 Feb 16 15:09 lpfc.ko -> /lib/modules/3.10.0-327.el7.x86_64/extra/lpfc/lpfc.ko
```

The design is exactly that of the previous example; this module is compatible for weak-updates; it was compiled against kernel 3.10.0-327 but functionally works with kernel 3.10.0-327.49.1 as is, no modifications needed.
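A quick way to confirm which file the running kernel will actually load for a module is `modinfo -n`, which resolves the path recorded in the kernel's module index; on a system set up as above, the output should point through the weak-updates symlink (a hedged illustration - `lpfc` is simply the example module from above, and your paths will vary):

```
# modinfo -n lpfc
/lib/modules/3.10.0-327.49.1.el7.x86_64/weak-updates/lpfc/lpfc.ko

# modinfo -F vermagic lpfc
3.10.0-327.el7.x86_64 SMP mod_unload modversions
```

The `vermagic` field shows the module was built against the family's base kernel even though the running kernel is a later patch release - exactly the weak-updates arrangement described above.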
## Compatible Modules

Ensuring that a module compiled for one version of a kernel is compatible with another kernel is key to the system working correctly; the topic deals a great deal with compilers, assemblers and linkers, which provide the needed data to compare for compatibility. When a binary is compiled it has a _symbol table_ which basically indicates the structural location of all usable functions; this applies both to the kernel itself and to any modules trying to load themselves into that kernel.

The addresses of all kernel functions a module expects to use are embedded in the module itself, as well as what it exports for others (imagine a module using a module) - at its simplest, the process asks the target kernel whether the map the module knows about has changed or not. If nothing has changed, it's compatible, so long as some sort of internal change has not happened that is not visible to the outside world. This is the **kABI** in effect: the module is kernel ABI compatible between several compiled kernels.

### Kernel Symbols

The kernel(s) ship with a pre-exported symbol table stored in the /boot directory next to the kernel:

```
# ls -l /boot/symvers-3.10.0-327.49.1.el7.x86_64.gz
-rw-r--r--. 1 root root 252731 Jan 25 11:37 /boot/symvers-3.10.0-327.49.1.el7.x86_64.gz

# zgrep blk_queue_init_tags /boot/symvers-3.10.0-327.49.1.el7.x86_64.gz
0x00a006aa	blk_queue_init_tags	vmlinux	EXPORT_SYMBOL
```

The output above shows the function `blk_queue_init_tags` is exported for all to use (`EXPORT_SYMBOL`) by the binary `vmlinux` (the kernel) at address `0x00a006aa`. This is a general function being used for example purposes herein; there are many more in use.

Due to specifically how the kernel operates, the shipped `vmlinuz` file (a compressed, stripped copy of `vmlinux`) typically does not contain the symbols; hence, they are extracted while the kernel package is being compiled and packaged, then saved as a separate file for use in userspace. The `symvers` file contains all the symbols of every module as well as the main kernel itself, making it quite a large set of data. If a symbol is exported by a module, the module's name will be located where `vmlinux` is shown above.

Also note that some exports are for GPL compliant module use only; they have the `EXPORT_SYMBOL_GPL` type and can only be used by GPL compliant modules.

### Module Symbols

A module is nothing more fancy than a standard library (shared object) designed specifically to work with the kernel. As such, all the normal commands to deal with symbol tables can be used, like so with GNU `nm`:

```
# nm /lib/modules/3.10.0-327.el7.x86_64/extra/bfa/bfa.ko | grep blk_queue_init_tags
                 U blk_queue_init_tags

# nm /lib/modules/3.10.0-327.el7.x86_64/extra/lpfc/lpfc.ko | grep blk_queue_init_tags
                 U blk_queue_init_tags
```

You'll notice that this information is not super useful as shown; how a binary is assembled is more complex and requires a bit of work to get the data required into a format which makes sense.
The `modprobe` tool with a bit of `sed` can be used to reassemble the data in a way that makes more sense for the task at hand, namely comparisons of addresses to names:

```
# modprobe --dump-modversions /lib/modules/3.10.0-327.el7.x86_64/extra/bfa/bfa.ko \
    | sed -r -e 's:^(0x[0]*[0-9a-f]{8}\t.*):\1:' | grep blk_queue_init_tags

0x00a006aa	blk_queue_init_tags

# modprobe --dump-modversions /lib/modules/3.10.0-327.el7.x86_64/extra/lpfc/lpfc.ko \
    | sed -r -e 's:^(0x[0]*[0-9a-f]{8}\t.*):\1:' | grep blk_queue_init_tags

0x00a006aa	blk_queue_init_tags
```

This shows us the kernel is exporting the function `blk_queue_init_tags` at `0x00a006aa`, and the _weak module_ (compiled for another kernel) is expecting to find this same function at address `0x00a006aa` - this is a compatible function entry point, nothing has changed. From here, all that's left is to ensure each and every function the module uses or exports undergoes the same scrutiny for kABI compatibility.


## Methodology

There are several steps to ensuring a weak-updates kernel module is integrated well with the system and is compatible with a given target kernel. Each kernel is checked on its own, so it is possible to have one kernel in a family using the kernel module (a symlink exists from it to the older file) and another not using it (no symlink exists).

### Dependencies

The dependencies first must be taken care of; on the chance that the module being inserted as a _weak-update_ is used by another module, the system needs to know about the symbols in the weak-update version, as they may have changed (otherwise causing a cascaded incompatibility by accident).

The entity shipping the module creates a file in `/etc/depmod.d/` with the override, like so:

```
# cat /etc/depmod.d/bfa.conf
override bfa 3.10.0-* weak-updates/bfa

# cat /etc/depmod.d/lpfc.conf
override lpfc 3.10.0-327.* weak-updates/lpfc
```

This is telling the system to use the `weak-updates/bfa` module version for all kernels in the 3.10.0-\* suite (which is all of RHEL 7 in this example) if it is found for bfa; for `lpfc`, as an alternate example, the wildcard is more refined to only apply to 3.10.0-327.\* kernels.

The entity shipping the module then runs this command after the module has been added (via RPM post-install, etc.); in this example, the module was compiled for 3.10.0-327, so the `depmod` command is using that version to update the symbols:

```
# depmod -aeF "/boot/System.map-3.10.0-327.el7.x86_64" "3.10.0-327.el7.x86_64"
```

As might be inferred from the above, this reads the symbols from `/boot/System.map-3.10.0-327.el7.x86_64` and rebuilds that kernel's module dependency maps to include the new file in the `extra/` directory.

### Weak Modules

The second step is to now create all the compatibility symlinks in the `weak-updates/` subdirectories of all kernels installed on the system which are in fact 100% compatible with this new module. From the outside, it's all built into a script that can just be used by the entity shipping the module (again, in their package post-install):

```
# weak-modules --add-modules /lib/modules/3.10.0-327.el7.x86_64/extra/bfa/bfa.ko
...or:
# weak-modules --add-modules /lib/modules/3.10.0-327.el7.x86_64/extra/lpfc/lpfc.ko
```

> The `weak-modules` script will also rebuild the _initramfs_ files in `/boot` for all the kernels found, inserting the new module setup for stage 1 boot.
> When installing a kmod package this is the perceived lag: after the RPM has placed the bits down, the script is updating all initramfs files for the kernels it adjusted.

The process inside the script can be broken down into these basic steps:

1. Take the kernel symbols file and massage it into a format that works with `diff` and `join` later (loops for every kernel found):

```
  # krel=$(uname -r)

  # zcat /boot/symvers-$krel.gz \
      | sed -r -ne 's:^(0x[0]*[0-9a-f]{8}\t[0-9a-zA-Z_.]+)\t.*:\1:p' \
      > symvers-$krel
```

2. If required (the kernel may not have any), extract and prepare the same information from any `extra/` modules in the target kernel (this will loop for every installed kernel). Notice that we're only extracting data from the installed kernels, and only if they have something in `extra/` - this file may be zero bytes if none are there:

```
  # krel=$(uname -r)

  # find /lib/modules/$krel/extra -name '*.ko' \
      | xargs nm \
      | sed -nre 's:^[0]*([0-9a-f]{8}) A __crc_(.*):0x\1 \2:p' \
      > addon-symvers-$krel
```

3. Do the same action as the above, but **specifically for the kernel the module was built against**, known as `vermagic` within the module's data:

```
  # modinfo -F vermagic bfa lpfc
  3.10.0-327.el7.x86_64 SMP mod_unload modversions
  3.10.0-327.el7.x86_64 SMP mod_unload modversions

  # module_krel=3.10.0-327.el7.x86_64

  # find /lib/modules/$module_krel/extra -name '*.ko' \
      | xargs nm \
      | sed -nre 's:^[0]*([0-9a-f]{8}) A __crc_(.*):0x\1 \2:p' \
      > extra-symvers-$module_krel
```

4. Take the data from the above steps and simply combine and sort it for use:

```
  # sort -u symvers-$krel \
      extra-symvers-$module_krel \
      addon-symvers-$krel \
      > all-symvers-$krel-$module_krel
```

5. Now take the new module physically being added to the system and extract its symbols as well:

```
  # module="/lib/modules/3.10.0-327.el7.x86_64/extra/bfa/bfa.ko"
  ...or:
  # module="/lib/modules/3.10.0-327.el7.x86_64/extra/lpfc/lpfc.ko"

  # /sbin/modprobe --dump-modversions "$module" \
      | sed -r -e 's:^(0x[0]*[0-9a-f]{8}\t.*):\1:' \
      | sort -u \
      > modvers
```

6. Last, use the `join` command in reverse mode (think `grep -v`) to tell us if any lines from all the known symbols provided do **not** match the symbols the new module is expecting:

```
  join -j 1 -v 2 all-symvers-$krel-$module_krel modvers
```

This set of steps tells us if the incoming module is identical in symbols to what is actually running and expected on the system; any output from the last step indicates that something was found that differs in either address or availability, and the module is not compatible. No output means it is fully compatible and can be symlinked to the target kernel safely.


## Incompatible Example

Using the above methodology, we can examine a different kernel module which is incompatible. This specific version of `kmod-bna` has 4 occurrences of incompatible function addresses with a specific (older) kernel that has been installed.
Each of the items above is covered in order:

```
The setup:
# rpm -qa | egrep "^(kernel-3|kmod-bna)"
kernel-3.10.0-229.20.1.el7.x86_64
kernel-3.10.0-327.49.1.el7.x86_64
kmod-bna-3.2.7.0-0.el7.x86_64

Step 1:
# krel=3.10.0-229.20.1.el7.x86_64
# zcat /boot/symvers-$krel.gz \
> | sed -r -ne 's:^(0x[0]*[0-9a-f]{8}\t[0-9a-zA-Z_.]+)\t.*:\1:p' \
> > symvers-$krel

Step 2:
# find /lib/modules/$krel/extra -name '*.ko' \
> | xargs nm \
> | sed -nre 's:^[0]*([0-9a-f]{8}) A __crc_(.*):0x\1 \2:p' \
> > addon-symvers-$krel

Step 3:
# modinfo -F vermagic bna | cut -f1 -d' '
3.10.0-327.el7.x86_64
# module_krel=3.10.0-327.el7.x86_64
# find /lib/modules/$module_krel/extra -name '*.ko' \
> | xargs nm \
> | sed -nre 's:^[0]*([0-9a-f]{8}) A __crc_(.*):0x\1 \2:p' \
> > extra-symvers-$module_krel

Step 4:
# sort -u symvers-$krel \
> extra-symvers-$module_krel \
> addon-symvers-$krel \
> > all-symvers-$krel-$module_krel

Step 5:
# module="/lib/modules/3.10.0-327.el7.x86_64/extra/bna/bna.ko"
# /sbin/modprobe --dump-modversions "$module" \
> | sed -r -e 's:^(0x[0]*[0-9a-f]{8}\t.*):\1:' \
> | sort -u \
> > modvers

Step 6:
# join -j 1 -v 2 all-symvers-$krel-$module_krel modvers
0x7efd609f __netif_napi_add
0x905307be napi_complete_done
0xd93737a0 napi_disable
0xe1d1af76 __dev_kfree_skb_any
```

The methodology is showing us there are 4 addresses which do not match up between the older kernel and this newer module, making them incompatible for use together.


## Caveats

A compatible kernel module as determined by the _weak-updates_ methodology is an observation of symbol addresses from the outside only; there is no way to transparently test that the module functions at runtime, only that it can be inserted into the target kernel without error. It is entirely possible for an internal coding error to surface and the module not work; the kernel engineers patching a given kernel may have changed something which causes breakage.

Testing a newly updated kernel against any existing weak module must be performed to ensure all functionality is retained.
diff --git a/md/linux_partitioning.md b/md/linux_partitioning.md
new file mode 100644
index 0000000..142ff24
--- /dev/null
+++ b/md/linux_partitioning.md
@@ -0,0 +1,333 @@
# Linux Partitioning

## Contents

  - [Overview](#overview)
  - [MBR vs. GPT](#mbr-vs-gpt)
  - [MBR Extended/Logical Partitions](#mbr-extendedlogical-partitions)
  - [Partition Alignment](#partition-alignment)
  - [Common Tools](#common-tools)
      - [fdisk](#fdisk)
      - [parted](#parted)
      - [gdisk](#gdisk)
      - [partx](#partx)
  - [Usage Comparison](#usage-comparison)
  - [Resizing Partitions](#resizing-partitions)
  - [Advanced Examples](#advanced-examples)
      - [Expanding MBR Primary](#expanding-mbr-primary)
      - [Adding MBR Logical](#adding-mbr-logical)
      - [Zapping Devices](#zapping-devices)
  - [Citations](#citations)


## Overview

After understanding the design of how [x86 Storage works](linux_x86_storage.md), the next logical step is learning and using the various utilities to manipulate the MBR/GPT and the partitions themselves. We'll focus herein on the most common utilities and tasks, how the tools differ, and their use with both partition types.


## MBR vs. GPT
This topic is covered in detail in the [Linux x86 Storage](linux_x86_storage.md) article; a quick recap of the basics:

| **Purpose**    | **MBR**                                               | **GPT** |
| -------------- | ----------------------------------------------------- | ------- |
| Max Partitions | 4 Primary or 3 Primary, 1 Extended, Unlimited Logical | 128     |
| Max Size       | 2 TiB                                                 | 8 ZiB   |

Extrapolating from this table we can then make a set of rules to live by:

  - MBR format is limited to 2 TiB
  - GPT format is limited to 8 ZiB
  - Use GPT format if your storage is greater than 2 TiB
  - Use GPT if your storage could grow larger than 2 TiB
  - If a 4th MBR partition is marked Primary, you cannot use Extended
  - If 3 MBR Primary partitions exist, make the 4th one Extended using the rest of the device
  - Logical MBR partitions 5+ live inside the 4th Extended partition
  - Adding a new Logical partition requires growing the Extended partition first


## MBR Extended/Logical Partitions

Having only 4 primary partitions is a limit of the original MBR design - as such, an extension was invented called the [extended partition](http://en.wikipedia.org/wiki/Extended_partition) with a very wide open design. The 4th primary partition is created as type _extended_, which points to the first Extended Boot Record (EBR) of the first logical partition within. The extended partition is normally created with the rest of the disk as its size; it will contain the logical partitions inside.

Each EBR contains a pointer to the next EBR (along with its logical partition info), which allows chaining EBRs together. The number of logical partitions within an extended partition is limited only by the amount of available disk space in the extended partition; all of this work is handled by the various tools and does not need to be manually manipulated by the end user.


## Partition Alignment

With the traditional design of a 512-byte sector, starting the first partition at LBA 63 poses no problem in alignment of the physical device (sometimes called the _radial geometry_) with the logical use by the CPU. Modern devices are starting to use 4096-byte (4k) sector sizes [(1)][c1] - and in the case of SSD possibly 8192-bytes (8k), sometimes referred to as _pages_ to match the CPU terminology. Externally they may emulate 512-byte sectors for compatibility (called **512e** mode) but internally these devices are working with 4096.[(2)][c2]

Starting a partition at LBA 63 with 4k pages is problematic -- mathematically this is one 512-byte sector short of a natural 4096 boundary of the physical geometry. If history had worked out better, using LBA 64 would have worked out mathematically:

```
(63*512)/4096 = 7.8750
(63*512)/8192 = 3.9375

(64*512)/4096 = 8.0000
(64*512)/8192 = 4.0000
```

To overcome this issue, in modern use the first partition is started at LBA 2048 - with a 512-byte sector this is 1 MiB, or 256 pages of a 4096-byte sector. This conservative approach allows for future changes that are not foreseen today, leaving enough space while performing meticulous alignment on a classic power-of-two boundary. SSD drives use multiples of 128 KiB, 256 KiB or 512 KiB depending on the device, again creating a natural alignment mechanism.[(3)][c3]

On a modern Linux distribution the userspace tools described below understand these alignment needs and default to a 2048-sector offset when creating the first partition on a drive.


## Common Tools

### fdisk

The fdisk utility is the classic partitioning tool.
Its usage is based around a menu-driven text interface; all changes performed are held in memory until a command is issued to write them out to disk. This technique allows one to cancel/exit without changes if desired, making it an enticing tool to use. The fdisk binary itself is part of the `util-linux` (or `util-linux-ng`) package of tools that also includes utilities like `fsck`, `mount` and `umount`.

One of the most important things to note about using fdisk: the version shipped on most enterprise class distributions (RHEL/CentOS for instance) tends to be an older but stable version. The downside is that these older stable releases only support the MBR (aka "dos" or "msdos") style and partition design, limiting your disk to 2 TiB. The most recent fdisk releases are beginning to support GPT style, but as they are not yet packaged for these distros, another tool must be used.

> Use the `-c` and `-u` options to fdisk to disable DOS compatibility mode (C/H/S) and use Sector mode. These options can also be toggled while in fdisk with the `c` and `u` commands.


### parted

The parted utility works the opposite of fdisk: as commands are issued the changes are made, whether it's used in pure commandline-only mode or when using the menu driven interface. This concept tends to make the usage of parted a bit scary to the beginning tech, and rightly so - once you do it, it's done. No backing out of a mistake (easily) like with fdisk.

The draw to parted tends to come in two parts: it's easily commandline scriptable and it fully supports GPT style disks (as well as MBR). This makes it the current _de-facto_ utility for dealing with storage larger than 2 TiB and GPT partitions (required for UEFI). The parted package also includes the `partprobe` tool that many like.


### gdisk

The `gdisk` utility (sometimes called _gptdisk_ or _GPT fdisk_) is a newer tool designed to combine the two worlds of fdisk and parted; it provides a bit of commandline and a bit of menu driven interface. It allows all the GPT functionality but with the capability of staging changes in memory first before writing to disk, allowing for an exit without writing as needed. The gdisk package includes complementary targeted tools such as `cgdisk`, `sgdisk` and `fixparts` which are designed for GPT/MBR manipulation.

One of the largest draws to gdisk is its ability to repair (or attempt to repair) corrupted or broken MBR and GPT partition style types. It can extract and back up partition tables, load them from backups, repair the primary GPT from the secondary copy, convert GPT to MBR, and perform all sorts of advanced features. The use of gdisk in certain situations can achieve results that fdisk and parted cannot.

At this time the gdisk package is hosted via the EPEL project for RHEL/CentOS distributions and is not present in the standard vendor repositories. Conversely, the Ubuntu LTS repositories do contain it, however they may have an older version (0.8.1 for 12.04 LTS). If working on GPT disks, be sure to use a stable, relatively bug-free version.

### partx

The partx utility is useful for triggering the kernel subsystems to add/remove device nodes based on the partition map of a block device. It's akin to the `partprobe` tool however it works a bit differently; in general it's a better choice than partprobe due to its design and integration with the kernel. The most typical usage is to add new partition device nodes after new partitions were created, or delete in the reverse scenario.
Normally these functions are handled by fdisk/parted/gdisk, but in some cases they are not.

For example - if a device has partitions and device nodes such as /dev/xvdb1, /dev/xvdb2, etc. and you utilize the `zap` feature of gdisk to zero out the disk, this does not trigger cleanup of the device nodes. Running `partx -d /dev/xvdb` will trigger a kernel level cleanup and remove them at runtime. Likewise, it's common for making a new partition with `fdisk /dev/sda` to result in an ioctl error when saving; you can use `partx -a /dev/sda` to trigger a kernel level refresh to create your new /dev/sdaX device node entry.


## Usage Comparison

The fdisk and gdisk menu-driven interfaces are nearly identical; while parted has a menu interface, the commands typed there are the same as in commandline scripted mode. A quick comparison of the most common operations being performed:

| **Purpose**                | **fdisk**                                     | **parted**                                       | **gdisk**                                      |
| -------------------------- | --------------------------------------------- | ------------------------------------------------ | ---------------------------------------------- |
| **List Partitions**        | fdisk -cul /dev/xvdb                          | parted /dev/xvdb unit s print                    | gdisk -l /dev/xvdb                             |
| **Create MBR or GPT**      | fdisk -cu /dev/xvdb; o, w                     | parted -s -- /dev/xvdb mklabel gpt               | gdisk /dev/xvdb; o, w                          |
| **Create First Partition** | fdisk -cu /dev/xvdb; n, p, 1, 2048, _size_, w | parted -s -- /dev/xvdb mkpart primary 2048s 100% | gdisk /dev/xvdb; n, 1, 2048, _size_, _type_, w |
| **Set Partition to LVM**   | fdisk -cu /dev/xvdb; t, 1, 8e, w              | parted -s -- /dev/xvdb set 1 lvm on              | gdisk /dev/xvdb; t, 1, 8e00, w                 |
| **Delete Partition**       | fdisk -cu /dev/xvdb; d, 1, w                  | parted -s -- /dev/xvdb rm 1                      | gdisk /dev/xvdb; d, 1, w                       |


## Resizing Partitions

Resizing partitions is typically encountered when a disk is grown (expanded) or shrunk (reduced); remember that a higher level infrastructure sits on top of the partitions themselves in almost all cases. Given that, remember to take precautions such as:

  - If reducing, unmount the filesystem or make it inactive as required for safety
  - If reducing, resize the filesystem (ext3, ext4, etc.) **before** the container/partition
  - If reducing, reduce the container (LVM, etc.) **before** the partition
  - If expanding, expand the container (LVM, etc.) **after** the partition
  - If expanding, resize the filesystem (ext3, ext4, XFS, etc.) **after** the container/partition
  - Not all filesystem types can be reduced or expanded! XFS for example cannot be reduced

In practice, `parted` is the best tool for the job; however, remember that parted is a one shot tool - **changes are immediate**. Use `gdisk` for GPT or `fdisk` for MBR disks if you need to stage the work and double check it before executing. Do not use these tools to resize the filesystem itself! For example the parted tool has a _resize_ command that works on some filesystem types and not others, with no way to ignore the filesystem.


## Advanced Examples

### Expanding MBR Primary

> Warning: all data could be lost if this is done incorrectly! Pay very close attention to your start/end sectors and math.

  - **Scenario**: MBR, 3 Primary partitions
  - **Task**: Increase partition 3 size +10G

Unmount the partition and record the layout; for this example only, I'll first show the disk in GiB to help in understanding the sector math; we'll unmount it and fsck it to ensure the filesystem is OK before starting.
```
# df -h | grep xvdb3
/dev/xvdb3      9.9G  151M  9.2G   2% /mnt

# umount /mnt

# parted -s -- /dev/xvdb unit GiB print
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdb: 100GiB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start    End      Size     Type     File system  Flags
 1      0.00GiB  10.0GiB  10.0GiB  primary
 2      10.0GiB  20.0GiB  10.0GiB  primary
 3      20.0GiB  30.0GiB  10.0GiB  primary  ext4

# fsck -fC /dev/xvdb3
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/xvdb3: 11/655360 files (0.0% non-contiguous), 79663/2621440 blocks
```

Now let's do the work - first, the geometry needs to be recorded **in sector mode** as we must precisely recreate the starting sector of the partition:

```
# parted -s -- /dev/xvdb unit s print
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdb: 209715200s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start      End        Size       Type     File system  Flags
 1      2048s      20973568s  20971521s  primary
 2      20973569s  41945089s  20971521s  primary
 3      41945090s  62916610s  20971521s  primary  ext4
```

We use the built-in functionality of parted to simply change the starting and ending sectors. We will add 20971520 sectors (10 GiB @ 512-byte sector size) to the end of partition 3. We then tell parted to use the same starting sector (41945090s) and the new ending sector (83888130s) we just computed.

Note that the numerical value **ends in s** to denote sectors! Do not forget your trailing s. Lastly, we resize (grow) the ext4 filesystem on top of it to the new size.

```
# echo "62916610 + 20971520" | bc -l
83888130

# parted -s -- /dev/xvdb rm 3
# parted -s -- /dev/xvdb mkpart primary 41945090s 83888130s
# resize2fs /dev/xvdb3
# mount /dev/xvdb3 /mnt

# parted /dev/xvdb unit s print
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdb: 209715200s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start      End        Size       Type     File system  Flags
 1      2048s      20973568s  20971521s  primary
 2      20973569s  41945089s  20971521s  primary
 3      41945090s  83888130s  41943041s  primary  ext4

# df -h | grep mnt
/dev/xvdb3       20G  156M   19G   1% /mnt
```

### Adding MBR Logical

> Warning: all data could be lost if this is done incorrectly! Pay very close attention to your start/end sectors and math.

  - **Scenario**: MBR, 3 Primary partitions, 1 Extended partition, 1 Logical partition
  - **Task**: Increase Extended partition, add new Logical partition

This builds upon the basic expansion of a primary partition concept; however, the `parted resize` command is our saving grace. If it were done 100% manually (say, with fdisk) the process would go like this:

1. Record geometry
2. Delete Logical partition 1
3. Delete Extended partition
4. Recreate Extended partition larger
5. Recreate Logical partition 1
6. Add new Logical partition 2

Using parted however allows us to magically resize the Extended partition without having to perform all of the above steps. Once again, pay attention to your exact geometry and math!
```
# parted /dev/xvdb unit s print
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdb: 209715200s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start      End         Size       Type      File system  Flags
 1      2048s      20973568s   20971521s  primary
 2      20973569s  41945089s   20971521s  primary
 3      41945090s  62916610s   20971521s  primary   ext4
 4      62916611s  83888131s   20971521s  extended
 5      62918659s  83888131s   20969473s  logical   ext4
```

Notice the Start of 4 (Extended) and 5 (Logical) do not match - watch out for this, especially if you intend to resize 5 after growing 4. We will add 10G (20971520s) to the end of 4, then add a new 6 (Logical) using this new space starting after the existing 5. Note that I am adding 2048 to the ending sector of 5 to determine my start of 6 for optimal performance (in theory - this disk is misaligned starting at partition 2 already):

```
# echo "83888131 + 20971520" | bc -l
104859651

# parted -s -- /dev/xvdb resize 4 62916611s 104859651s

# parted /dev/xvdb unit s print
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdb: 209715200s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start      End         Size       Type      File system  Flags
 1      2048s      20973568s   20971521s  primary
 2      20973569s  41945089s   20971521s  primary
 3      41945090s  62916610s   20971521s  primary   ext4
 4      62916611s  104859651s  41943041s  extended
 5      62918659s  83888131s   20969473s  logical   ext4

# parted -s -- /dev/xvdb mkpart logical 83890179s 104859651s

# parted /dev/xvdb unit s print
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdb: 209715200s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start      End         Size       Type      File system  Flags
 1      2048s      20973568s   20971521s  primary
 2      20973569s  41945089s   20971521s  primary
 3      41945090s  62916610s   20971521s  primary   ext4
 4      62916611s  104859651s  41943041s  extended
 5      62918659s  83888131s   20969473s  logical   ext4
 6      83890179s  104859651s  20969473s  logical
```

### Zapping Devices

Zapping is the concept of completely wiping all traces of a MBR and/or GPT to result in a clean device, as if it were brand new. All data is lost - use with caution! Use `gdisk` for this need:

```
# gdisk /dev/xvdb
GPT fdisk (gdisk) version 0.8.10

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): x

Expert command (? for help): z
About to wipe out GPT on /dev/xvdb. Proceed? (Y/N): Y
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Blank out MBR? (Y/N): Y
```


## Citations

  1. <http://www.anandtech.com/show/2888>
  2. <https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-iolimits.html>
  3. <http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/>

[c1]: http://www.anandtech.com/show/2888
[c2]: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-iolimits.html
[c3]: http://blog.nuclex-games.com/2009/12/aligning-an-ssd-on-linux/
diff --git a/md/linux_x86_storage.md b/md/linux_x86_storage.md
new file mode 100644
index 0000000..08d106d
--- /dev/null
+++ b/md/linux_x86_storage.md
@@ -0,0 +1,400 @@
# Linux x86 Storage

## Contents

  - [Overview](#overview)
      - [BIOS vs. UEFI](#bios-vs-uefi)
      - [4096 Everywhere](#4096-everywhere)
  - [Drive Geometry](#drive-geometry)
      - [CHS Addressing](#chs-addressing)
      - [ZBR Tracks](#zbr-tracks)
      - [LBA Addressing](#lba-addressing)
  - [Boot Sector](#boot-sector)
      - [MBR Format](#mbr-format)
      - [Bootstrap Code Area](#bootstrap-code-area)
      - [Partition Table Entry](#partition-table-entry)
      - [GPT Format](#gpt-format)
      - [Legacy MBR Sector](#legacy-mbr-sector)
      - [Partition Table Entry](#partition-table-entry-1)
  - [OS Compatibility](#os-compatibility)
  - [GRUB Bootstrap](#grub-bootstrap)
      - [LVM Boot Volumes](#lvm-boot-volumes)
  - [Scanning Devices](#scanning-devices)
      - [Scan for new Devices](#scan-for-new-devices)
      - [Rescanning and Deleting Devices](#rescanning-and-deleting-devices)
  - [The udev Subsystem](#the-udev-subsystem)
      - [HBA Blacklisting](#hba-blacklisting)
  - [References](#references)
  - [Citations](#citations)


## Overview

This article focuses on a number of core concepts combined to provide an overview of the process without going down the rabbit hole too far in any one subject area. In general, the areas of focus:

  - [IA-32/x86/x86-64 Architecture CPU](http://en.wikipedia.org/wiki/X86)
  - [Basic Input/Output System](http://en.wikipedia.org/wiki/BIOS) (BIOS)
  - [Unified Extensible Firmware Interface](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) (UEFI)
  - [Master Boot Record boot sector](http://en.wikipedia.org/wiki/Master_boot_record) (MBR)
  - [GUID Partition Table](http://en.wikipedia.org/wiki/GUID_Partition_Table) (GPT)
  - [GNU/Linux Operating System](http://en.wikipedia.org/wiki/Linux) (GNU/Linux)
  - [GNU GRand Unified Bootloader](http://en.wikipedia.org/wiki/GRUB) (GRUB)

The use of partitioned mass storage devices (such as fixed disks) is our target medium; discussing non-partitioned devices (such as floppies) is not considered as part of this article. This article also focuses on traditional rotating magnetic media as storage - newer devices such as SSD drives are still in thrall to the older design mechanisms, so understanding the basic mechanics is key.


### BIOS vs. UEFI

BIOS is still the most widely deployed firmware - modern systems are embracing UEFI, but until it becomes the _de-facto_ standard we must adhere to the design and limitations of the BIOS firmware infrastructure in order to address storage and boot operating systems. Most of this article discusses traditional BIOS design.

BIOS is only capable of reading the boot sector of the disk and executing its code, while recognizing the MBR partition format itself. BIOS has no concept of filesystem types and only executes the first 440 bytes before relinquishing control. This leads to the use of secondary (_multi-staged_ or _chained_) boot loaders such as GRUB.

UEFI by contrast does not execute boot sector code; instead there exists an **EFI System Partition** (ESP) in which the required firmware files are loaded (called a _UEFI Application_); this partition is typically a FAT32 formatted 512 MiB space and supports multiple UEFI Applications at the same time. The firmware itself has a boot menu embedded that defines the disks and partitions to be launched for the applications, effectively acting like Stage 1 of GRUB.

BIOS-GPT is a hybrid; it allows BIOS to load the boot code of the protective MBR on a GPT partition and execute it.
Typically this requires a [BIOS boot partition](http://en.wikipedia.org/wiki/BIOS_Boot_partition) around 1 MiB in size, as the 440-byte bootstrap area is not large enough and the extra sectors where the boot loader is typically located in MBR do not exist.


### 4096 Everywhere

The traditional storage sector size is 512-bytes; with the introduction of the [Advanced Format](http://en.wikipedia.org/wiki/Advanced_Format) 4096-byte sector size this brings to light a question - why is everything based on 4kb? The Linux kernel memory page size, the largest Linux filesystem block size that can be used, and the disk sector size are all a max of 4kb - this is all based on the [classic x86 MMU design](http://en.wikipedia.org/wiki/Memory_management_unit#IA-32_.2F_x86).

The classic [x86 MMU architecture](http://wiki.osdev.org/Paging) contains two [page tables](http://en.wikipedia.org/wiki/Page_table) of 1024 4-byte entries, making each one 4kb in size. One table is called the _paging directory_ and the other one the _paging table_; they work together to provide virtual-to-physical access to the memory within the system. If you're doing the math, that limits us to 4 GiB (32 bits) of memory - hence the introduction of [PAE](http://en.wikipedia.org/wiki/Physical_Address_Extension) to allow up to 64 GiB (36 bits) to be addressed. The [x86\_64 platform](http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details) further increases this - current 48-bit implementations provide 256 TiB of virtual address space (of which Linux gives 128 TiB to userspace), with room in the spec for more.

Modern CPUs are starting to offer larger page tables ("huge pages") but they are not the norm - as such, the alignment of the x86 MMU 4kb page to other parts of the system is why we don't see 8192-byte sectors or 16kb filesystem blocks possible at this time with most Linux infrastructure, even though the code itself supports it (such as XFS block sizes > 4096).

The Linux kernel has various extensions such as [transparent huge pages](https://lwn.net/Articles/188056/) to perform virtual mappings, however at the end of the day performance is limited to the specific x86 CPU that is being used at the time. As such the 4096-byte sector size, block size and memory page size are our daily use scenarios.


## Drive Geometry

The root of everything is the radial geometry of the physical drive itself, how it is segregated and, most importantly, how we "find" specific data at any given moment on the magnetic medium.


### CHS Addressing

At the core of our x86 history is the [cylinder-head-sector](http://en.wikipedia.org/wiki/Cylinder-head-sector) addressing scheme that was invented to determine where on a platter the needed information lives on a storage device. This design is what has imposed the 2 TiB limit for storage using the Master Boot Record (MBR) format, as the spec has limits (more on that below). CHS is expressed as a _tuple_ - 0/0/1, 12/9/17, etc. - to refer to a finite physical location of data.

  - **Platter**
      - The platter is the thin piece of magnetic storage medium; it has two sides that can be used. All the platters are stacked on top of each other with heads in between
  - **Head**
      - A head is the little "arm" that moves over the platters to read the magnetic information, so for each platter you have 2 heads (one on each side).
The maximum is 256, however an old bug limits this to 255 in use
  - **Track**
      - A track is one of the concentric rings on a platter; they start with 0 at the outer edge and increase numerically inwards
  - **Cylinder**
      - A cylinder is the set of "stacked" tracks of all platters in the same physical location; they start at 0 along with the tracks. The address of a cylinder is always the same as an individual track since it's just a stack of them
  - **Sector**
      - A sector is the segregated block within a track, maximum 63 sectors/track with 512-bytes per sector. **Sectors start counting at 1, not 0**

Where this all comes into play is the concept that BIOS will look specifically at location 0/0/1 (first cylinder/track, first head, first sector) to load the initial [machine language](http://en.wikipedia.org/wiki/Machine_code) boot code. This creates an absolute physical location for every storage device to boot from, and has carried forth into the more modern LBA addressing mechanism.

Note that the maximums for tracks, cylinders and sectors evolved over time and ended with the ATA-5 specification of 16383/16/63, requiring a full 24-bit value.

```
16383 cylinders * 16 heads * 63 sectors = 16514064 sectors
16514064 sectors / 2 (two 512-byte sectors per KiB) = 8257032 KiB = ~ 8 GiB
```

The [INT 13H](http://en.wikipedia.org/wiki/INT_13) BIOS EXT extension is what permits us to read beyond the original CHS limit; INT-13H CHS is 24 bits and the ATA spec is 28 bits, and BIOS routines exist to translate between the two for full compatibility. The translation between the ATA 16:4:8 bit scheme and the 10:8:6 bit scheme used by INT 13H routines is what allows mapping up to 8 GiB.


#### ZBR Tracks

Initially the design was that all tracks contained the same number of sectors (MFM and RLL drives) - this was updated with a newer technique called [zone bit recording](http://en.wikipedia.org/wiki/Zone_bit_recording) in ATA/IDE drives that allowed more sectors on the outer (larger) tracks and fewer moving inwards. This technique, however, created a problem - the physical geometry of the drive no longer matched the CHS addressing.

Because data (such as a partition) needs to start/end on a track/cylinder boundary, this leaves surplus sectors at the end of the drive less than 1 cylinder in size, since they almost never line up perfectly. This is why when making partitions in tools like fdisk or parted you will see unused sectors even though you specified using the whole drive - the tools are translating your request into cylinder boundaries and discarding any surplus sectors as unusable since they are not aligned.


### LBA Addressing

The limitations of the CHS design were quickly encountered; as such a more extensive format was introduced called [Logical Block Addressing](http://en.wikipedia.org/wiki/Logical_block_addressing). Now that CHS has been defined, understanding LBA becomes easy and is best explained with a simple table.

| **LBA Value** | **CHS _Tuple_** |
| ------------- | --------------- |
| 0             | 0 / 0 / 1       |
| 62            | 0 / 0 / 63      |
| 1008          | 1 / 0 / 1       |
| 1070          | 1 / 0 / 63      |
| 16,514,063    | 16382 / 15 / 63 |

As exemplified, LBA addressing simply starts at 0 and increases by 1 for each CHS tuple. The original LBA was native 28-bit (see the CHS mapping above); the current ATA-6 spec is a 48-bit wide LBA, allowing addressing of up to 128 PiB of storage. As might be obvious, there is a cutoff after 8 GiB of being able to translate CHS to LBA for backwards compatibility.
Modern INT 13H extensions allow native LBA access, thereby negating any need to use CHS style structures.

Our **CHS tuple 0/0/1 and LBA value 0 are aligned**, however - this is what we care about most for booting the system.


## Boot Sector

Now that we understand CHS and LBA addressing, let's look at what's going on once the BIOS reads the first 512-byte sector of the drive to get going. This breaks down into two formats - the traditional [Master Boot Record](http://en.wikipedia.org/wiki/Master_boot_record) (MBR) format, and the [GUID Partition Table](http://en.wikipedia.org/wiki/GUID_Partition_Table) (GPT) format. The Wikipedia pages on both are fantastic; I highly recommend reading both to gain a deeper understanding.

There are two kinds of basic boot sectors:

  - [Master Boot Record](http://en.wikipedia.org/wiki/Master_boot_record) (MBR) is the first sector of the partitioned storage
  - [Volume Boot Record](http://en.wikipedia.org/wiki/Volume_Boot_Record) (VBR) is the first sector of an individual partition

We are used to thinking of the boot sector as the MBR, but in fact there are two present in our x86 partitioned storage. GPT contains a 512-byte protective MBR mechanism for backwards compatibility. Essentially a MBR and VBR are the same thing, just located at different locations for different purposes. A non-partitioned device like a floppy disk uses only a VBR at the beginning, whereas a partitioned device typically uses a MBR (which may then load a VBR later).


### MBR Format

The MBR is at minimum the first 512-byte sector of the storage. There are two basic structures in use for our purposes as detailed in the tables; of most import are the bootstrap code area and partition table entries.

**Classic Generic MBR Structure**

| **Offset** | **Description**          | **Size(bytes)** |
| ---------- | ------------------------ | --------------- |
| +0         | Bootstrap code area      | 446             |
| +446       | PTE #1                   | 16              |
| +462       | PTE #2                   | 16              |
| +478       | PTE #3                   | 16              |
| +494       | PTE #4                   | 16              |
| +510       | Boot signature (55h AAh) | 2               |

**Modern Standard MBR Structure**

| **Offset** | **Description**              | **Size(bytes)** |
| ---------- | ---------------------------- | --------------- |
| +0         | Bootstrap code area (part 1) | 218             |
| +218       | Disk timestamp               | 6               |
| +224       | Bootstrap code area (part 2) | 216             |
| +440       | Disk signature               | 6               |
| +446       | PTE #1                       | 16              |
| +462       | PTE #2                       | 16              |
| +478       | PTE #3                       | 16              |
| +494       | PTE #4                       | 16              |
| +510       | Boot signature               | 2               |


The first data partition does not start until sector 63 (for historical reasons), leaving a 62-sector "MBR gap" present on the system. This gap of unused sectors is typically used for Stage 1.5 chained boot managers, low level device utilities and so forth.


#### Bootstrap Code Area

This area of the sector is pure [machine language code](http://en.wikipedia.org/wiki/Machine_code) run in real mode; think of it like a BASIC program: it's read line by line, and each of those lines executes sequentially to manipulate CPU registers (more or less - it's complicated!). This mechanism allows the CPU to execute arbitrary code without understanding anything about the higher level filesystem or storage design.

Notice that we have an extremely limited amount of space (440 bytes) - this is nowhere near enough room to run a fancy modern boot manager like [GNU GRUB](http://en.wikipedia.org/wiki/GRUB) full of graphics, features and whatnot. Hence we have the concept of _staged_ (or _chained_) boot managers.
This area represents **Stage 1** of the bootloader process and serves simply to provide instructions on where to physically load the next bit of code from. More on that in the GRUB section.


#### Partition Table Entry

We come finally to the [Achilles' Heel](http://en.wikipedia.org/wiki/Achilles%27_heel) of the MBR design - the partition table design and its relation to the CHS addressing format. As each PTE is only 16 bytes, we have a finite limit on what can be stored; extrapolating this is where our limit is created in how much disk can be addressed, leading to the 2 TiB limit of an MBR-based storage disk.

**16-byte PTE**

| **Length** | **Description**                |
| ---------- | ------------------------------ |
| 1          | Status (active/inactive)       |
| 3          | CHS address of partition start |
| 1          | Partition type                 |
| 3          | CHS address of partition end   |
| 4          | LBA address of partition start |
| 4          | Total sectors in partition     |

Given this design, at most 4 bytes (32 bits) can store the number of sectors in LBA mode - 2^32 sectors * 512 bytes/sector = 2 TiB, hence the limitation discussed above. These are referred to as the Primary partitions of the disk, and the above exemplifies why only 4 of them exist when using tools like fdisk and parted.

### GPT Format

The GUID Partition Table format was invented to solve the whole mess of CHS, MBR and 32-bit LBA limitations. It's actually part of the [Unified Extensible Firmware Interface](http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) (UEFI) designed to replace the aging [Basic Input/Output System](http://en.wikipedia.org/wiki/BIOS) (BIOS) design; however, due to its widespread use to address more than 2 TiB of storage, it's often considered its own project.

| **Offset** | **Description**                                       | **Size(bytes)** |
| ---------- | ----------------------------------------------------- | --------------- |
|            | **LBA 0 (Legacy MBR)**                                |                 |
| +0         | Bootstrap code, Disk timestamp and signature          | 446             |
| +446       | PTE #1 Type 0xEE (EFI GPT)                            | 16              |
| +462       | PTE #2 (unused)                                       | 16              |
| +478       | PTE #3 (unused)                                       | 16              |
| +494       | PTE #4 (unused)                                       | 16              |
| +510       | Boot Signature (55h AAh)                              | 2               |
|            | **LBA 1 (Primary GPT Header)**                        |                 |
| +512       | Definition of usable blocks on disk, PTEs, GUID, etc. | 512             |
|            | **LBA 2-33 (Primary Partition Table Entries)**        |                 |
| +1024      | 128x 128-byte PTEs                                    | 16384           |
|            | **LBA 34+ (Partitions)**                              |                 |
| +17408     | Actual partitions                                     | n/a             |
|            | **LBA -33 to -2 (Secondary Partition Table Entries)** |                 |
| -16896     | 128x 128-byte PTEs                                    | 16384           |
|            | **LBA -1 (Secondary GPT Header)**                     |                 |
| -512       | Definition of usable blocks on disk, PTEs, GUID, etc. | 512             |

LBA is used exclusively; there is no CHS mapping. While it's possible to use LBA 34 to start partitions, due to the prevalence of MBR track boundary requirements the first partition often starts at LBA 63. This allows chained bootloaders such as GRUB to store their Stage 1.5 images prior to sector 63, similar to the MBR technique.

#### Legacy MBR Sector

Notice the Legacy MBR is clearly defined per the specification; this allows booting a GPT-based storage medium using BIOS techniques, as it contains the same area for bootstrap code and PTEs in the same disk locations.

The bootstrap code area remains, only the first PTE is used, and it denotes a type of EFI. This sufficiently protects the disk from tools which do not understand EFI, as they should simply report a partition of type "unknown" in the worst case scenario.
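To see the protective MBR in the wild, the type byte of PTE #1 can be read straight off the first sector (a small sketch - `/dev/sda` here is assumed to be some GPT-formatted disk): per the table above the PTE starts at offset 446, and its partition type byte sits 4 bytes into the entry, at offset 450:

```
# dd if=/dev/sda bs=1 skip=450 count=1 2>/dev/null | xxd
00000000: ee                                       .
```

The `ee` matches the "PTE #1 Type 0xEE (EFI GPT)" entry in the table; an MBR-only tool reading this disk sees one large partition of that unfamiliar type and (ideally) leaves it alone.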

#### Partition Table Entry

The PTE format of GPT is very similar to the MBR style and should come as no surprise; most notable in the structure is the use of 8-byte (64-bit) values for the LBA addresses. Much like MBR, this defines a hard limit on the maximum LBA value that can be addressed as useful storage within GPT.

**128-byte PTE**

| **Length** | **Description**                         |
| ---------- | --------------------------------------- |
| 16         | Partition type GUID                     |
| 16         | Unique partition GUID                   |
| 8          | First LBA (little endian)               |
| 8          | Last LBA (inclusive, usually odd)       |
| 8          | Attribute flags                         |
| 72         | Partition name (36 UTF-16LE code units) |


### OS Compatibility

Userspace tools such as fdisk (2.17.1+) and parted contain checks and balances for this more modern approach - one must be sure not to use "DOS Compatibility Mode" and to use Sectors mode inside a utility like fdisk or parted to achieve the desired alignment. Additionally, the LVM subsystem starting with 2.02.73 will align to this 1 MiB boundary - previous versions used a 64 KiB alignment, akin to the LBA 63 offset. The same goes for software RAID - as long as it's using the modern Superblock Metadata v1, it will align to 1 MiB.[(1)][c1][(2)][c2][(3)][c3]


## GRUB Bootstrap

Understanding how GRUB loads becomes fairly straightforward once the mechanics of the MBR/GPT world are understood. The installation of GRUB onto the MBR (or, optionally, the VBR) consists of three primary parts, the first two of which are concrete in design.

 - **Stage 1**
   - The `boot.img` 440-byte code is loaded into the bootstrap area as defined in the MBR design, and is coded to load the first sector of core.img (the next stage) using LBA48 addressing.
 - **Stage 1.5 MBR**
   - The `core.img` ~30 KiB code is loaded into the 62 empty sectors between the end of the MBR and the beginning of the first partition (sector 63). This code contains the ability to recognize filesystems to read the stage 2 configuration.
 - **Stage 1.5 GPT**
   - The `core.img` ~30 KiB code is loaded starting at sector 34, after the GPT structure. This code contains the ability to recognize filesystems to read the stage 2 configuration.
 - **Stage 2**
   - This stage reads the configurations by file/path names under /boot/grub to build the TUI and present choices. The majority of userspace code and configuration is located here.

Once stage 2 is loaded, this is where the higher level GRUB magic begins; the most visible example is the user interface allowing selection among multiple boot choices on multiple partitions. UEFI is similar; however, instead of core.img a different piece of code (grub\*.efi) is copied to the EFI System Partition and acts as the UEFI Application as outlined above.


### LVM Boot Volumes

The default behavior of `pvcreate` is to use the second 512-byte sector of the device (or partition) to hold its metadata structure; the LVM subsystem will, however, scan the first 4 sectors for its label. The physical volume label begins with the string **LABELONE** and contains 4 basic items of information:

 - Physical volume UUID
 - Size of block device in bytes
 - NULL-terminated list of data area locations
 - NULL-terminated lists of metadata area locations

Metadata locations are stored as offset and size (in bytes).
There is room in the label for about 15 locations, but the LVM tools currently use 3: a single data area plus up to two metadata areas.[(4)][c4]

Historically the lack of LVM-capable boot loaders (such as LILO and GRUB1) required the /boot/ filesystem to reside at the beginning of the disk (yet another CHS legacy issue) or be of a more basic filesystem format such as ext2 or ext3. With the advent of GRUB2 (GRUB version 2) the ability exists to read from more complex filesystems such as ext4, LVM and RAID.[(5)][c5]

In the default installation of many server-oriented Linux distributions, such as Ubuntu 14.04 LTS and RHEL/CentOS 7, the /boot/ partition is still non-LVM and the first partition of the disk for maximum backwards compatibility, even though they use GRUB2.


## Scanning Devices

When expanding an existing device or adding a new one, a rescan of the underlying controller(s) needs to be performed. There are two separate interfaces into the kernel to perform this work, each a little different from the other.

### Scan for new Devices

Given either a single controller or multiple controllers for the same storage (in the case of high availability), we need to issue a scan request to those controllers to look for newly presented devices and create /dev device nodes for the ones found. `host0` is always the local controller; `host1` and above tend to be add-in controllers to external storage, for example.

```
# echo "- - -" > /sys/class/scsi_host/hostX/scan (where X is your HBA)
```

Local controller (which includes VMware vDisks):

```
# echo "- - -" > /sys/class/scsi_host/host0/scan
```

Add-on HBA (Host Bus Adapter) cards of some sort:

```
# echo "- - -" > /sys/class/scsi_host/host1/scan
# echo "- - -" > /sys/class/scsi_host/host2/scan
```


### Rescanning and Deleting Devices

The scenario: an existing block device is already presented (i.e. /dev/sda) and it has been expanded upstream of the OS - for example, the VMware vDisk was grown or the SAN/DAS LUN was expanded. In this case every block device that comprises that piece of storage has to be rescanned -- for a single controller it's only one device, but for HA situations (using Multipath for instance) all individual path devices need to be rescanned.

```
# echo 1 > /sys/block/sda/device/rescan
```

Multiple paths to the same storage (Multipath, etc.):

```
# echo 1 > /sys/block/sdb/device/rescan
# echo 1 > /sys/block/sdc/device/rescan
```

Deleting those block device entries from the Linux kernel maps is just as easy -- the devices have to be completely unused and released from the OS itself first, and **do not force it** - a kernel panic may (and most probably will) ensue if you try to force a block device delete while the kernel still thinks it's in use.

```
# echo 1 > /sys/block/sdb/device/delete
# echo 1 > /sys/block/sdc/device/delete
```


## The udev Subsystem

Udev is the device manager for the Linux 2.6 kernel that creates/removes device nodes in the /dev directory dynamically. It is the successor of devfs and hotplug. It runs in userspace, and the user can change device names using udev rules.[(6)][c6] The udev subsystem allows for a very wide variety of user control over devices, whether they be storage, network, UI (keyboard/mouse) or others - one of the common uses of udev is to name network interfaces.

When it comes to Linux storage, this can have subtle yet extremely important implications on how the server finds and uses its boot devices.
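
udev also maintains persistent symlinks under /dev/disk/ that identify storage by serial, WWN, UUID or label rather than by discovery order; referencing these avoids depending on an unstable `sdX` name. An illustrative listing (the model and serial values here are made up):

```
# ls -l /dev/disk/by-id/ | grep -v part
lrwxrwxrwx 1 root root 9 May  8 19:54 ata-EXAMPLE_DISK_S1D60ABC -> ../../sda
lrwxrwxrwx 1 root root 9 May  8 19:54 wwn-0x5000c5006a1234ef -> ../../sda
```
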
When the kernel initializes, it and the udev subsystem scan the bus and create device nodes for the storage devices found. Logistically, this means that if a supported HBA (Host Bus Adapter - a PCI-based Fibre/SAS card, for instance) is found before the internal SCSI controller, it's highly possible (and in practice and experience **does happen**) that a device from outside the chassis (SAN/DAS) consumes device node "sda" instead of the internal disk or RAID array.

Care should be taken when researching modern udev - in some distributions it has been subsumed by systemd and is no longer a discrete entity within the Linux ecosphere, and the specific methodology has changed for certain parts of the process. For example, in traditional udev the bootstrap process initialized /dev from the pre-prepared data in the /lib/udev/devices tree; the systemd implementation reads from /etc/udev/hwdb.bin instead.


### HBA Blacklisting

One graceful solution to the boot-from-HBA problem is to simply blacklist the kernel module **from the initrd only**, preventing the kernel from having the device driver at boot so it doesn't find the HBA controllers. Once the kernel switches to the real root filesystem and releases the initrd, it has already assigned "sda" to the expected internal array; it can then load the HBA driver at runtime, initialize the controllers and find the storage.

The `rdblacklist` mechanism is used on the kernel boot line of your GRUB configuration - just append as needed with the specific HBA to blacklist:

```
Blacklist the Brocade HBAs:

    rdblacklist=bfa

Blacklist the QLogic HBAs:

    rdblacklist=qla2xxx
```

The kernel will then respect the `/etc/modprobe.d/*.conf` entries to load the appropriate module once it has switched to the real root filesystem, discovering the devices during scan.


## Citations

 1. <http://www.thomas-krenn.com/en/wiki/Partition_Alignment>
 2. <http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/#benchmarks>
 3. <http://www.rodsbooks.com/gdisk/advice.html>
 4. <https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/lvm_metadata.html>
 5. <http://www.gnu.org/software/grub/manual/grub.html#Changes-from-GRUB-Legacy>
 6. <http://www.linux.com/news/hardware/peripherals/180950-udev>

[c1]: http://www.thomas-krenn.com/en/wiki/Partition_Alignment
[c2]: http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/#benchmarks
[c3]: http://www.rodsbooks.com/gdisk/advice.html
[c4]: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/lvm_metadata.html
[c5]: http://www.gnu.org/software/grub/manual/grub.html#Changes-from-GRUB-Legacy
[c6]: http://www.linux.com/news/hardware/peripherals/180950-udev

diff --git a/md/lvm_mechanics.md b/md/lvm_mechanics.md
new file mode 100644
index 0000000..eee93fc
--- /dev/null
+++ b/md/lvm_mechanics.md
@@ -0,0 +1,399 @@

# LVM Mechanics

## Contents

 - [Overview](#overview)
 - [Best Practices](#best-practices)
 - [Device Mapper](#device-mapper)
 - [kpartx](#kpartx)
 - [LVM Components](#lvm-components)
 - [Physical Volumes](#physical-volumes)
 - [Whole Device vs Partition](#whole-device-vs-partition)
 - [Volume Groups](#volume-groups)
 - [Migrating Volume Groups](#migrating-volume-groups)
 - [Logical Volumes](#logical-volumes)
 - [LV Sizing Methods](#lv-sizing-methods)
 - [LVM Filters](#lvm-filters)
 - [LVM Snapshots](#lvm-snapshots)
 - [Reverting Snapshots](#reverting-snapshots)
 - [Selected Examples](#selected-examples)
 - [Expand VG and LV](#expand-vg-and-lv)
 - [Migrate PVs](#migrate-pvs)
 - [LVM Metadata Example](#lvm-metadata-example)


## Overview

After understanding the [x86 storage design](linux_x86_storage.md) and [partitioning](linux_partitioning.md), the next step is LVM - Logical Volume Management. Generically, LVM is the concept of taking individual pieces of storage (whole disks or individual partitions) and combining them together with a layer of software to make the group appear as one single entity. This single entity can then be sub-divided into logical filesystems and treated individually, even though they share the same physical storage.

While we use the short phrase LVM in practice, _technically_ we are referring to **LVM2** as opposed to LVM1. LVM1 hasn't shipped in modern distros for quite some time, having been superseded once the Device Mapper infrastructure was introduced to the kernel. LVM2 utilizes device-mapper fully, unlike LVM1 - only LVM2 is discussed as LVM here.

LVM has many configuration options; this article is an introduction to the overall LVM world and does not cover all the ways `lvm.conf` can be tuned. See `man 5 lvm.conf` for more information.


## Best Practices

A few rules to live by in the LVM world:

 - Name the Volume Group with a name that represents the design - for example _vglocal01_, _vgiscsi05_, _vgsan00_, _vgraid5_, etc.
 - Never combine two disparate objects together - for example, do not combine local (in the chassis) storage with remote (iSCSI/SAN/DAS) storage
 - Never combine different performance tiers - for example, do not combine a RAID-1 array and RAID-5 array in the same group
 - Never combine non-partitioned and partitioned devices - this could lead to performance issues or end-user confusion in the future


## Device Mapper

At the heart of the system is the [Device Mapper](device_mapper_mechanics.md) kernel-level infrastructure introduced with kernel 2.6.x; it's a kernel framework for mapping block devices to virtual block devices. Not only is it the underpinning of LVM2 but also RAID (_dm-raid_), cryptsetup (_dm-crypt_) and others like _dm-cache_. It's also the component that provides the snapshots feature for LVM.
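
The mapping itself can be inspected with the `dmsetup` utility introduced just below; as a minimal illustration (the volume name, sector count and device numbers are assumptions), a linear LV is simply a range of virtual sectors pointed at a physical device and offset:

```
# dmsetup table
vglocal-lvtest: 0 104857600 linear 202:17 2048
```

Read as: sectors 0 through 104857599 of the virtual device map to the block device with major:minor 202:17, starting at sector 2048 on that device.
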

Normally ioctls (I/O controls) are sent to the block device itself; within the DM world there is a special endpoint, `/dev/mapper/control`, used instead - all ioctls are sent to this device node. The userspace tool `dmsetup` can be used to manually investigate and manipulate the device-mapper subsystem; `dmsetup ls` is commonly used by techs to quickly review device maps.

With LVM this is exactly what we're doing - creating virtual block devices on top of physical block devices.

### kpartx

While not technically LVM related, a mention of `kpartx` should be made here - where the tool `partx` is designed to read partition maps and create the proper device nodes, the `kpartx` tool reads partition tables and creates device maps over the partition segments detected. This tends to come into play more with Multipath than LVM; just be aware there are two tools performing these functions, each in its own discrete fashion.


## LVM Components

LVM is the name/acronym of the entire puzzle; there are three discrete components that make it all work:

 - **Physical Volume (PV)**
   - A physical volume is the actual storage space itself as a single item. It can be the whole object (entire local drive, entire SAN LUN, etc.) or the individual partitions on that storage device. In the latter case each partition is its own discrete physical volume.
 - **Volume Group (VG)**
   - A volume group is a collection of physical volumes as a single entity. It is used as a single block of storage to carve up into logical volumes. Space from one logical volume can be transferred to another logical volume within the same group, filesystem type permitting.
 - **Logical Volume (LV)**
   - A logical volume is an area of space - akin to a partition on a drive - that is used to hold the filesystem. A logical volume cannot span volume groups, and is the object which is manipulated with userspace tools like mount, umount, cryptsetup, etc.

### Physical Volumes

A PV is the most direct part of the LVM puzzle, and the most critical. The PV first has to be created using the command _pvcreate_ -- this writes a metadata block beginning at the 2nd sector (and occupying up to 1 MiB) of the partition on a partitioned device, or of the device itself if it is not partitioned. When scanning for metadata, the LVM subsystem reads this data to determine all the information needed - for example, on a 512-byte sector drive, 2048 sectors might be scanned to try and locate the PV metadata.

The pvcreate utility provides basic protection - if a partition table is present, even with no partitions defined, it will refuse to run; the partition table (whether MBR or GPT) must be zapped before a whole device can be used as a PV without partitions. Otherwise a partition must be created and pvcreate used on that partition - but **be careful**: if a device is _already_ a PV, it is still possible to use a tool like fdisk to create a partition table in its first 512-byte sector afterwards\!

The metadata area starts with the string `LABELONE` followed by several groups of data:

 - Physical volume UUID
 - Size of block device in bytes
 - NULL-terminated list of data area locations
 - NULL-terminated lists of metadata area locations

What's important to understand is that the higher level VG and LV definitions are stored in this PV metadata.
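
To see this label in place, the metadata area of a PV can be dumped directly - a read-only sketch, assuming /dev/xvdb1 is an existing PV (output abbreviated; the strings match the metadata example at the end of this page):

```
# dd if=/dev/xvdb1 bs=512 count=2048 2>/dev/null | strings | head -4
LABELONE
LVM2 001wTDfgkU6aRyAwCheopo1LeCEFWWodQbd
vglocal {
id = "7PHX1A-PJ0n-fgdv-qRup-In2G-dah1-iOgWm4"
```
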
Storing a complete view on every PV protects the VG and LV from activating if one of the PVs in the group is missing - each PV knows about all the objects, making the group self-referential and self-sufficient. Because this metadata also stores the location of data as it's written, it's critical that all items be present before the group becomes available for use.

This design makes the PVs themselves portable as a group - a group of disks can be removed from one server and presented to another, and so long as all PVs are present and the metadata is intact, the LVM group can simply be activated on the new server with ease.

An example of the PV-stored LVM metadata can be found at the end of this page.

#### Whole Device vs Partition

The decision whether or not to use partitions under a PV has only one concrete advantage, but several practical disadvantages. So long as the alignment to storage is correct (radial geometry), there is no performance gain or loss with either method. The concrete advantage of using an entire device for the PV is the ability to expand that PV with `pvresize` at a later date; this simplifies the work of expanding the underlying storage and increasing the PV/VG/LV sizes.

The practical disadvantages of using an entire device all revolve around visibility. For example, when a tech uses fdisk/parted/gdisk and does not see a partition, they may be inclined to think the drive is unused; this can result in a partition table being added to a device already in a VG by accident. The boot device, which does need an MBR/GPT to operate, cannot be used as a whole disk, so if other PVs will be added later it's considered bad practice to combine partitioned and non-partitioned PVs in the same VG. While not technical disadvantages, these considerations should be weighed when setting up the PV.

### Volume Groups

A volume group is the abstraction layer that sits between the PV and the LV -- its role is to combine and hide the physical block devices, presenting one picture of unified storage to the LVs on top. The VG operates on **Physical Extents** (PEs) - think of these as blocks of data of a given size, 4 MiB being the default when creating the VG. Much like the physical sectors of a disk, the PE is treated as a unit -- data is read/written to it as one chunk, and it is moved as the same chunk. During vgcreate a different size can be chosen depending on expected workload - 1 KiB minimum, and it must be a power of 2.

The default VG allocation policy when writing PEs is _normal_ mode -- this has some basic intelligence built in, preventing parallel stripes from being placed on the same PV for instance. This can be changed to other methods - for example, the _contiguous_ policy requires new PEs to be placed right after the existing PEs; the _cling_ policy places new PEs on the same PV as existing PEs in the same stripe of the LV. Note this is not the same as the LV _type_ of segments.

A single VG can span many, many PVs; however, a VG cannot be combined with another VG - ergo, a VG has a finite size defined by the PVs underneath it and how they're used by the LVs on top. The VG can be expanded or reduced by adding/removing PVs, or by expanding/reducing the existing PVs.

#### Migrating Volume Groups

Volume Groups are independent of the system itself, provided the VG is not the container for the root filesystem of the server. They can be exported from one system, physically moved, then imported to another system.
The `vgexport` command clears the VG metadata hostname and deactivates the group on the current system, while the `vgimport` command sets the VG metadata hostname to the current system and reactivates it. The VG should first be deactivated with `vgchange` to ensure it's unmounted and not in use.

```
# vgs vglocal
  VG      #PV #LV #SN Attr   VSize  VFree
  vglocal   1   1   0 wz--n- 50.00g    0

# vgexport vglocal
  Volume group "vglocal" has active logical volumes

# vgchange -an vglocal
  0 logical volume(s) in volume group "vglocal" now active

# vgexport vglocal
  Volume group "vglocal" successfully exported

# vgimport vglocal
  Volume group "vglocal" successfully imported
```

### Logical Volumes

The logical volume is the top-most container, segmenting a given amount of space from the underlying VG; an LV is restricted to the single VG it sits on top of - an LV cannot span 2 or more VGs for increased space. To increase space in an LV, the underlying VG has to be increased first. The LV acts as the final virtual block device endpoint of the device mapper design -- this container is what is used with tools like `mkfs.ext4`, `mount` and so forth. It acts and reacts just like a real block device for all intents and purposes, save that it is more like a single partition than a whole device (it doesn't use an MBR/GPT table).

The LV can be manipulated in 2 primary ways - by the /dev/_vgname_/_lvname_ symlink or the /dev/mapper/_vgname-lvname_ symlink. Using either is fine, since they point to the same actual, real device mapper node entry in the /dev/ tree that corresponds to the virtual block device:

```
# ls -l /dev/vglocal/lvtest /dev/mapper/vglocal-lvtest
lrwxrwxrwx 1 root root 7 May  8 19:54 /dev/mapper/vglocal-lvtest -> ../dm-0
lrwxrwxrwx 1 root root 7 May  8 19:54 /dev/vglocal/lvtest -> ../dm-0

# lvdisplay | awk '/LV Name/{n=$3} /Block device/{d=$3;sub(".*:","dm-",d);print d,n;}'
dm-0 lvtest
```

So our actual DM node is /dev/dm-0 -- this should never be used directly, as it could change after a reboot, for instance; always use the symlink names instead for maximum resilience to change. These have the major character node type `253` in Linux; the minor number is simply the position at which it was added by the kernel. These can be examined with the `dmsetup` tool as outlined above:

```
# dmsetup ls
vglocal-lvtest (253:0)

# dmsetup info vglocal-lvtest
Name: vglocal-lvtest
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 253, 0
Number of targets: 1
UUID: LVM-c1MITBqcCORe5icvRwAhlAQUJvVceVDfQXSRQz0T42vcwnPKbotggmXwxrTWB1l5
```

Logical volumes can be created in a number of different modes that might look familiar: linear, striped and mirrored are the three most common. The default mode is linear - use the space from beginning to end as a whole. Striped and mirrored are exactly like basic RAID - both require a minimum of 2 PVs and write across them like RAID-0 and RAID-1. Other modes exist, one of which is the snapshot -- striped and mirrored LVs are not covered here as they tend to be specific use-case solutions.

#### LV Sizing Methods

A note here about specifying the size of the logical volume to be created, extended or reduced: two command-line options exist that perform the same work but tend to be confusing.
Think of them this way:

| **Flag** | **Usage**                                      |
| -------- | ---------------------------------------------- |
| `-l`     | dynamic math - "100%VG", "+90%VG" and so forth |
| `-L`     | absolute math - "100G", "+30G" and so forth    |

Thinking of the flags in this manner aids their usage later - the same operation can often be expressed with either one, but may be easier with one or the other depending on the exact situation. With no unit qualifier, `-l` counts PEs - handy if you need to move an exact number of physical extents around for the need at hand.


## LVM Filters

One of the more critical parts of using LVM in environments with [multiple HA paths](device_mapper_multipath.md) to the storage is setting up LVM filters to ignore the individual paths and only respect the meta (pseudo) path to the storage, whether that be SAN, DAS or iSCSI in nature. If the filters are not set correctly, the underpinnings of LVM will use a single path by name - if that path dies, LVM dies.

The way filters are written is simple - "accept these, reject everything else" in nature. Looking at a few examples reveals the concepts used in `/etc/lvm/lvm.conf`:

```
# A single /dev/sda internal device, PowerPath devices:
filter = [ "a|^/dev/sda[0-9]+$|", "a|^/dev/emcpower|", "r|.*|" ]

# Two internal devices, /dev/sda and /dev/sdb, and PowerPath devices:
filter = [ "a|^/dev/sd[ab][0-9]+$|", "a|^/dev/emcpower|", "r|.*|" ]

# Two internal devices, /dev/sda and /dev/sdb, and Device Mapper Multipath devices:
filter = [ "a|^/dev/sd[ab][0-9]+$|", "a|^/dev/mapper/mpath|", "r|.*|" ]

# Two internal devices, Device Mapper Multipath and PowerPath devices all at once:
filter = [ "a|^/dev/sd[ab][0-9]+$|", "a|^/dev/mapper/mpath|", "a|^/dev/emcpower|", "r|.*|" ]
```

Note in the above examples that regular expressions are used for the configuration.


## LVM Snapshots

A snapshot is a special type of LV using a method called Copy on Write (CoW) to store a point-in-time view of the source LV plus all changes since that point. A source LV and the name of the new snapshot LV are specified during creation; the snapshot LV **must** exist in the same VG as the source LV. The snapshot LV only needs to be large enough to hold the changes made since it was taken - it does not need to be the same size as the source. The VG must have this space free - it cannot be in use by any LV already.

When creating the snapshot LV, essentially a copy of the inode table is taken - hence the need for the source and snapshot LVs to exist within the same VG. From that point all new changes are recorded in the CoW table, to either be discarded or applied depending on usage. However, be aware that a lot of magic happens under the hood to support this\! It's **not** a simple LV; let's take a look:

```
# lvremove /dev/mapper/vglocal-lvtest
# lvcreate -l 50%VG -n lvtest vglocal
# lvcreate -L 10G -s -n lvsnap /dev/vglocal/lvtest

# lvs
  LV     VG      Attr       LSize  Pool Origin Data%
  lvsnap vglocal swi-a-s--- 10.00g      lvtest   0.00
  lvtest vglocal owi-a-s--- 50.00g

# dmsetup ls
vglocal-lvsnap-cow (253:3)
vglocal-lvsnap (253:1)
vglocal-lvtest (253:0)
vglocal-lvtest-real (253:2)
```

Notice the two additional device maps - "vglocal-lvsnap-cow" and "vglocal-lvtest-real" - used behind the scenes to store and work with the CoW changes **to the source volume** that occur while the snapshot is alive.
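
The CoW area's fill level is visible in the `Data%` column of `lvs` and should be monitored while a snapshot is alive (illustrative figures):

```
# lvs vglocal
  LV     VG      Attr       LSize  Pool Origin Data%
  lvsnap vglocal swi-a-s--- 10.00g      lvtest  42.17
  lvtest vglocal owi-a-s--- 50.00g
```
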
If the snapshot fills up with changes and flips to read-only mode, it can be a bit of an ordeal to get the snapshot fully released if something goes wrong within LVM - so plan to remove the snapshot in a timely fashion, or plan for its expected growth.

### Reverting Snapshots

It is possible to roll back changes made to the original logical volume (lvtest) by merging the original LV extents from the CoW volume back onto the origin volume, provided that the _snapshot-merge_ target is available.

```
Check if supported by kernel

# dmsetup targets | grep snapshot-merge
snapshot-merge v1.1.0
```

This operation is seamless to the user and starts automatically when the origin (lvtest) and snapshot (lvsnap) volumes are activated but not opened. If either the origin or snapshot volume is open, the merge operation is deferred until the next time both volumes are activated. As soon as the merge operation starts, the origin volume can be opened and the filesystem within it mounted.

From this point, all read and write operations to the origin volume are seamlessly routed to the correct logical extents (at the start of the merge operation, these would be the original extents on lvsnap-cow and the unchanged extents on lvtest-real) until the merge is complete. The lvsnap-cow, lvsnap and lvtest-real volumes are then removed from the system.

Following the lvtest/lvsnap example, this command would start the merge/rollback operation:

```
# lvconvert --merge /dev/vglocal/lvsnap
```


## Selected Examples

### Expand VG and LV

One of the more common scenarios: your boot disk has two partitions, 1 and 2; 1 is a non-LVM /boot and 2 is the LVM-based / (root) filesystem. You have run out of space and wish to add more -- the new space can be either a new partition on the same storage device that was just expanded (SAN/DAS LUN, VMware vDisk, etc.) or a new device and partition entirely.

After creating the new partition and using `pvcreate` on it, review the mission goal - we're adding the new space from xvdb3 to the VG, growing the LV and resizing the ext4 filesystem.

```
# pvs; echo; vgs; echo; lvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvdb2 vglocal lvm2 a--  10.00g      0
  /dev/xvdb3         lvm2 a--  20.00g 20.00g

  VG      #PV #LV #SN Attr   VSize  VFree
  vglocal   1   1   0 wz--n- 10.00g    0

  LV     VG      Attr       LSize
  lvroot vglocal -wi-a----- 10.00g
```

First, add the new PV into the VG. Then grow the LV with the newly added space. Lastly grow the ext4 filesystem itself:

```
# vgextend vglocal /dev/xvdb3
  Volume group "vglocal" successfully extended

# lvextend -l +100%FREE /dev/vglocal/lvroot
  Extending logical volume lvroot to 29.99 GiB
  Logical volume lvroot successfully resized

# resize2fs /dev/vglocal/lvroot
  Resizing the filesystem on /dev/vglocal/lvroot to 7862272 (4k) blocks.
  The filesystem on /dev/vglocal/lvroot is now 7862272 blocks long.
```

Check our work again:

```
# pvs; echo; vgs; echo; lvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvdb2 vglocal lvm2 a--  10.00g 0
  /dev/xvdb3 vglocal lvm2 a--  20.00g 0

  VG      #PV #LV #SN Attr   VSize  VFree
  vglocal   2   1   0 wz--n- 29.99g    0

  LV     VG      Attr       LSize
  lvroot vglocal -wi-a----- 29.99g
```

### Migrate PVs

The scenario: an existing LV contains a PV we wish to replace - this could be for migrating from one type of storage to another, or for replacing several small PVs with one large PV for better performance on the storage side.
The `pvmove` command is used, and the PV being added must be at least as large as the one being removed\!

The existing LV has one PV in VG "vglocal", 9.31 GiB in size:

```
# pvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvdb1 vglocal lvm2 a--   9.31g      0
  /dev/xvdb2         lvm2 a--  10.00g 10.00g
```

We will replace xvdb1 with xvdb2 - note how it's 10 GiB, at least as large as the one being replaced. After the VG is extended to add the second PV, we check again and see that it has been added, but all the PEs (_PFree_) from xvdb2 are still unused. **Do not extend the LV** on top of the VG - the new PV must show as free in order to be used as a migration target.

```
# vgextend vglocal /dev/xvdb2

# pvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvdb1 vglocal lvm2 a--   9.31g      0
  /dev/xvdb2 vglocal lvm2 a--  10.00g 10.00g
```

Now we move all the PEs from xvdb1 to xvdb2, with a few command-line options to show verbose info and a progress update every 5 seconds. After all the PEs have been moved to xvdb2 we do a quick check again, then if all looks kosher we remove the old PV:

```
# pvmove -v -i5 /dev/xvdb1 /dev/xvdb2

# pvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvdb1 vglocal lvm2 a--   9.31g   9.31g
  /dev/xvdb2 vglocal lvm2 a--  10.00g 704.00m

# vgreduce vglocal /dev/xvdb1

# pvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvdb1         lvm2 a--   9.31g   9.31g
  /dev/xvdb2 vglocal lvm2 a--  10.00g 704.00m
```


## LVM Metadata Example

Using the `lvmdump -m` command is the easiest way to extract the metadata from all the PVs on the system; here is an example of the data with basic formatting (spaces/indents, etc.) added for easier readability. Note that the metadata area stores rolling revisions of the changes made - it might be useful in a given situation to determine what has transpired.

```
LABELONE LVM2 001wTDfgkU6aRyAwCheopo1LeCEFWWodQbd

vglocal {
    id = "7PHX1A-PJ0n-fgdv-qRup-In2G-dah1-iOgWm4"
    seqno = 2
    format = "lvm2" # informational
    status = ["RESIZEABLE", "READ", "WRITE"]
    flags = []
    extent_size = 8192
    max_lv = 0
    max_pv = 0
    metadata_copies = 0

    physical_volumes {
        pv0 {
            id = "wTDfgk-U6aR-yAwC-heop-o1Le-CEFW-WodQbd"
            device = "/dev/xvdb1"
            status = ["ALLOCATABLE"]
            flags = []
            dev_size = 209711104
            pe_start = 2048
            pe_count = 25599
        }
    }

    logical_volumes {
        lvtest {
            id = "ZSHT4d-K4lc-pUma-6UtB-vJ9e-9jox-hRTibF"
            status = ["READ", "WRITE", "VISIBLE"]
            flags = []
            creation_host = "r7rc-ha"
            creation_time = 1399583935
            segment_count = 1
            segment1 {
                start_extent = 0
                extent_count = 12800
                type = "striped"
                stripe_count = 1 # linear
                stripes = [
                    "pv0", 0
                ]
            }
        }
    }
}

# Generated by LVM2 version 2.02.100(2)-RHEL6 (2013-10-23): Thu May 8 21:18:55 2014
contents = "Text Format Volume Group"
version = 1
description = ""
creation_host = "localhost" # Linux localhost 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC 2014 x86_64
creation_time = 1399583935 # Thu May 8 21:18:55 2014
```

diff --git a/md/lvm_snapshot_merging.md b/md/lvm_snapshot_merging.md
new file mode 100644
index 0000000..10c9e33
--- /dev/null
+++ b/md/lvm_snapshot_merging.md
@@ -0,0 +1,218 @@

# LVM Snapshot Merging

## Contents

 - [Overview](#overview)
 - [General Setup](#general-setup)
 - [Preparing a Scenario](#preparing-a-scenario)
 - [Merging a Snapshot](#merging-a-snapshot)
 - [Advanced Usage](#advanced-usage)
 - [Data Retention](#data-retention)


## Overview

Within the LVM2 Device Mapper infrastructure, a method and kernel module exist to merge the contents of a snapshot back into its source using `lvconvert`. The typical use case for many snapshots is to back up and discard; this article outlines an alternate use, where the need is to re-merge the changes back into the source instead.


## General Setup

In order to work with the `lvconvert` merging process:

 - The **LVM2** packages for the distro must be installed
 - The kernel module `dm-snapshot.ko` must be loaded
 - A snapshot to merge must exist

Check for the snapshot-merge feature using `dmsetup targets` and load the module as needed:

```
# dmsetup targets
mirror v1.14.0
striped v1.5.6
linear v1.1.0
error v1.2.0

# modprobe -v dm-snapshot
insmod /lib/modules/2.6.32-642.4.2.el6.x86_64/kernel/drivers/md/dm-bufio.ko
insmod /lib/modules/2.6.32-642.4.2.el6.x86_64/kernel/drivers/md/dm-snapshot.ko

# dmsetup targets
snapshot-merge v1.3.6
snapshot-origin v1.9.6
snapshot v1.13.6
mirror v1.14.0
striped v1.5.6
linear v1.1.0
error v1.2.0
```


## Preparing a Scenario

For instructional purposes, we need to prepare a scenario:

 - An LV with a snapshot in the VG is mounted and in use
 - The snapshot is active/mounted and we add new data to it

The setup looks like:

```
Create a new snapshot and add dummy data:

# lvcreate -s /dev/vgcbs00/lvdata -L 10G -n lvdata_snap
# mount /dev/vgcbs00/lvdata_snap /snap
# for ii in 5 6 7 8; do dd if=/dev/zero of=/snap/data.$ii bs=4M count=10$ii; done

Examine the results:

# ls /data/ /snap/
/data/:
data.1 data.2 data.3 data.4
/snap/:
data.1 data.2 data.3 data.4 data.5 data.6 data.7 data.8

# df -h /data/ /snap/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vgcbs00-lvdata
 37G 1.7G 34G 5% /data
/dev/mapper/vgcbs00-lvdata_snap
 37G 3.4G 32G 10% /snap
```

So here we have the origin (source) LV `lvdata` and a snapshot of it, `lvdata_snap`; 4 additional data files were added to the mounted snapshot `/snap/` in order to simulate a snapshot which has had data written to it since it was instantiated. We see there is now twice as much data in the snapshot, including the additional files we want to merge back to the origin.


## Merging a Snapshot

To merge the snapshot back to its origin:

1. Ensure the origin has enough space to hold the contents of the additional snapshot data
2. Ensure the **origin and snapshot are unmounted** from the filesystem
3. Use the `lvconvert` command to merge the snapshot to the origin

> It is possible to perform the merge while the origin is online; see the **Advanced Usage** section for this scenario.

Using the `lvconvert` command is very straightforward; be sure to use the `-i` flag to set an update interval for progress output. Be aware that the **snapshot will be deleted after the merge**, so ensure this is the expected outcome:

```
# umount /snap
# umount /data

# lvconvert --merge -i 2 vgcbs00/lvdata_snap
  Merging of volume lvdata_snap started.
  lvdata: Merged: 83.4%
  lvdata: Merged: 84.8%
  ...lots of status...
  lvdata: Merged: 100.0%
  Merge of snapshot into logical volume lvdata has finished.
  Logical volume "lvdata_snap" successfully removed

# mount /dev/vgcbs00/lvdata /data
# ls /data/
data.1 data.2 data.3 data.4 data.5 data.6 data.7 data.8
```

You have now successfully merged the snapshot content back to its origin.


## Advanced Usage

It is possible to reduce the offline time by scheduling the merge to occur when the LV is next activated. This still requires a **complete deactivation** of the volume group at some point, so it cannot be the home of the root filesystem or any other volume which cannot be deactivated while the server is online.
The process is similar to the basic usage - you simply leave the origin mounted, issue the commands, then wait for the merge to complete in the background.

Using the same scenario prepared above:

```
# umount /snap

# lvconvert --merge vgcbs00/lvdata_snap
  Logical volume vgcbs00/lvdata contains a filesystem in use.
  Can't merge over open origin volume.
  Merging of snapshot vgcbs00/lvdata_snap will occur on next activation of vgcbs00/lvdata.

(stop service using /data)
# umount /data
# vgchange -an vgcbs00
# vgchange -ay vgcbs00
# mount /dev/vgcbs00/lvdata /data
(start service using /data)
```

At this point you can `ls` immediately and see the changed data on the source; however, be aware it's still merging in the background. Status is checked by using `lvs -a` to watch the `Data%` column decrease its way to zero; when the merge is complete, the snapshot LV is **deleted on its own**.

```
# lvs -a
  LV            VG      Attr       LSize  Pool Origin Data%
  lvdata        vgcbs00 Owi-aos--- 37.50g
  [lvdata_snap] vgcbs00 Swi-a-s--- 10.00g      lvdata   4.76

...

# lvs -a
  LV            VG      Attr       LSize  Pool Origin Data%
  lvdata        vgcbs00 Owi-aos--- 37.50g
  [lvdata_snap] vgcbs00 Swi-a-s--- 10.00g      lvdata   3.17
```

Eventually the `Data%` column reaches 0.00, and the snapshot is then removed.


## Data Retention

Simply put, the data changes on the snapshot overwrite the data on the origin volume.

It does not matter whether a file is new or changed - the process will not attempt to merge the actual contents of a file (in the manner of diff and patch) but instead just replaces the data blocks on the origin volume with those from the snapshot. File deletions are handled the same way - if a file is deleted on the snapshot, upon merging the file will be removed from the origin volume. The process does not differentiate between binary and text files; all data is treated the same during the merge process.

Starting with a few data files:

```
/data/test-d1.txt
This line was edited on /data before snapshot creation

/data/test-d2.txt
This line was edited on /data before snapshot creation

/data/test-s2.txt
This line was edited on /data after snapshot creation

====

/snap/test-d1.txt
This line was edited on /data before snapshot creation
This line was edited on /data after snapshot creation

/snap/test-d2.txt
This line was edited on /data before snapshot creation
This line was edited on /snap after snapshot creation

/snap/test-s1.txt
This line was edited on /snap after snapshot creation

/snap/test-s2.txt
This line was edited on /snap after snapshot creation
```

We then run the `lvconvert` process outlined above to merge the data and end up with:

```
/data/test-d1.txt
This line was edited on /data before snapshot creation
This line was edited on /data after snapshot creation

/data/test-d2.txt
This line was edited on /data before snapshot creation
This line was edited on /snap after snapshot creation

/data/test-s1.txt
This line was edited on /snap after snapshot creation

/data/test-s2.txt
This line was edited on /snap after snapshot creation
```

As the content of the test text files shows, the snapshot data blocks simply replace the origin's, whether a file was edited in one place or both - there is no attempt to compare timestamps or perform otherwise more intelligent data merging; it's an all-or-nothing approach to the merge process.

diff --git a/md/mongodb_basics.md b/md/mongodb_basics.md
new file mode 100644
index 0000000..c2f09e6
--- /dev/null
+++ b/md/mongodb_basics.md
@@ -0,0 +1,360 @@

# MongoDB Basics

## Contents

 - [Introduction](#introduction)
 - [Fundamentals](#fundamentals)
 - [Installation and Updates](#installation-and-updates)
 - [RHEL / CentOS](#rhel--centos)
 - [Ubuntu / Debian](#ubuntu--debian)
 - [User Management](#user-management)
 - [Network Connectivity](#network-connectivity)
 - [MongoDB Default Ports](#mongodb-default-ports)
 - [System Level Software Configuration](#system-level-software-configuration)
 - [Red Hat / CentOS](#red-hat--centos)
 - [Ubuntu / Debian](#ubuntu--debian-1)
 - [Intermediate Troubleshooting](#intermediate-troubleshooting)
 - [Kill errant MongoDB Thread](#kill-errant-mongodb-thread)
 - [Check Replica Status](#check-replica-status)
 - [Check Sharding Status](#check-sharding-status)


## Introduction

MongoDB is a document-oriented database; it uses JSON constructs not only to store the data but also to interact with the system itself. Many commands may look a little odd coming from a MySQL background, but in general the concept of what you're trying to do is somewhat the same. The 10gen website has a great page detailing how to apply your MySQL knowledge to the MongoDB world.

| **SQL Terms/Concepts** | **MongoDB Terms/Concepts**     |
| ---------------------- | ------------------------------ |
| database               | database                       |
| table                  | collection                     |
| row                    | document or BSON document      |
| column                 | field                          |
| index                  | index                          |
| table joins            | embedded documents and linking |
| primary key            | primary key                    |
| aggregation (group by) | aggregation pipeline           |

A few select examples from the linked website:

| **SQL Select Statements**                                     | **MongoDB find() Statements**                     |
| ------------------------------------------------------------ | ------------------------------------------------- |
| `SELECT * FROM users`                                         | `db.users.find()`                                 |
| `SELECT id, user_id, status FROM users`                       | `db.users.find({},{user_id:1,status:1})`          |
| `SELECT * FROM users WHERE status="A" ORDER BY user_id DESC`  | `db.users.find({status:"A"}).sort({user_id:-1})`  |

> Using the `.explain()` method in MongoDB **runs the query**, which is exactly the **opposite of MySQL**. Be very careful not to use .explain() with any sort of data-altering command (think UPDATE / INSERT / DELETE in MySQL)


## Fundamentals

### Installation and Updates

#### RHEL / CentOS

Utilize the standard Yum repository style configuration:

```
# vi /etc/yum.repos.d/10gen.repo
[10gen]
name=10gen Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64
gpgcheck=0
enabled=1
# yum install mongo-10gen mongo-10gen-server
```


#### Ubuntu / Debian

Utilize the standard APT sources style configuration:

```
# apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
# echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' >> /etc/apt/sources.list.d/10gen.list
# apt-get update
# apt-get install mongodb-10gen
```


### User Management

MongoDB uses role-based access control at the database level; the system.users collection contains the data, which correlates roughly to the mysql.user table in MySQL - however, it is not manipulated quite the same way.
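
As a minimal sketch of the 2.4-era shell helper (the user name, password and role here are assumptions for illustration):

```
$ mongo admin
> db.addUser({user: "dbadmin", pwd: "superSecret", roles: ["userAdminAnyDatabase"]})
```
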
The 10gen website has great introductory material on Access Control.

> **Authentication is disabled by default** in an out-of-the-box installation\! Refer to the 10gen documentation and tutorials for basic user administration tasks, should they need to be set up or repaired; most production-level configurations will have had security practices applied.


### Network Connectivity

#### MongoDB Default Ports

**27017**

 - default port for mongod and mongos instances
 - change the port with --port / port
 - bind with --bind\_ip / bind\_ip
 - define the Replica set with --replSet / replSet
 - set the DB datadir with --dbpath / dbpath

**27018**

 - default port when running with --shardsvr / shardsvr

**27019**

 - default port when running with --configsvr / configsvr

**28017**

 - default port for the web status page
 - always accessible at the base port + 1000
 - disable with --nohttpinterface / nohttpinterface
 - no authentication by default
 - enable REST interface with --rest / rest


### System Level Software Configuration

Vendor packages place the default configurations, service scripts and data directories in the standard location methodologies. Subtle differences exist between the platforms:

#### Red Hat / CentOS

 - /etc/mongod.conf
 - /etc/sysconfig/mongod
 - /etc/rc.d/init.d/mongod
 - /var/log/mongo/mongod.log
 - /var/lib/mongo/
 - ~/.mongorc.js

#### Ubuntu / Debian

 - /etc/mongodb.conf
 - /etc/init/mongodb.conf
 - /etc/init.d/mongodb
 - /var/log/mongodb/mongodb.log
 - /var/lib/mongodb/
 - ~/.mongorc.js


## Intermediate Troubleshooting

### Kill errant MongoDB Thread

Killing an errant thread in MongoDB is directly analogous to killing one in MySQL - you examine the list of running operations, find the one in question and issue a command to terminate it.

> Do not kill threads which are compacting databases or any background threads which are indexing data - this can lead to database corruption

First, use the `db.currentOp()` mongo shell command to list your threads; this is analogous to `show full processlist` in MySQL.

```
$ mongo
MongoDB shell version: 2.4.5
connecting to: test
> db.currentOp()
{
    "inprog" : [
        {
            "opid" : 2506233,
            "active" : true,
            "secs_running" : 140,
            "op" : "update",
            "ns" : "generators.sensor_readings",
            "query" : {
                "$where" : "function(){sleep(500);return false;}"
            },
            "client" : "127.0.0.1:51773",
            "desc" : "conn20",
            "threadId" : "0x7f694753d700",
            "connectionId" : 20,
            "locks" : {
                "^" : "w",
                "^generator" : "W"
            },
            "waitingForLock" : false,
            "numYields" : 279,
            "lockStats" : {
                "timeLockedMicros" : {
                    "r" : NumberLong(0),
                    "w" : NumberLong(280242564)
                },
                "timeAcquiringMicros" : {
                    "r" : NumberLong(0),
                    "w" : NumberLong(140420592)
                }
            }
        },
        {
            "opid" : 2507691,
            "active" : false,
            "op" : "query",
            "ns" : "",
            "query" : {
            },
            "client" : "127.0.0.1:51772",
            "desc" : "conn19",
            "threadId" : "0x7f6962e4a700",
            "connectionId" : 19,
            "locks" : {
                "^generator" : "R"
            },
            "waitingForLock" : true,
            "numYields" : 0,
            "lockStats" : {
                "timeLockedMicros" : {
                },
                "timeAcquiringMicros" : {
                }
            }
        }
    ]
}

```

In the example above we see two threads; the keys to look for are the `waitingForLock`, `secs_running`, and `op` fields. The thread we're looking for is the first one, `opid` 2506233, as it's the one locking up our database; notice it holds `W` (a write lock) in the `locks` subdocument.
We kill it with the `db.killOp()` command only if we're sure the data it's writing can be lost - this is a dangerous operation to perform and should be examined carefully. Read operations are generally safe to kill in an emergency.

```
> db.killOp(2506233);
{ "info" : "attempting to kill op" }
> db.currentOp()
{ "inprog" : [ ] }
```


### Check Replica Status

Somewhat similar to MySQL, replication is based on two configurations working together; the core `mongod` process must be started with a config file or command-line flag telling it which replica set it belongs to. This is the `replSet` keyword and can be any string, so long as all instances (processes) share the same name. For example, here are three processes started on the same server for testing a replica set:

```
# mongod --dbpath 1 --port 27001 --smallfiles --oplogSize 50 \
    --logpath 1.log --logappend --fork --replSet w4
# mongod --dbpath 2 --port 27002 --smallfiles --oplogSize 50 \
    --logpath 2.log --logappend --fork --replSet w4
# mongod --dbpath 3 --port 27003 --smallfiles --oplogSize 50 \
    --logpath 3.log --logappend --fork --replSet w4
```

Once the replica set is initialized and configured (using the `rs.initiate()` and `rs.add()` / `rs.reconfig()` commands), checking the status is done from any member of the set using the `rs.status()` command:

```
$ mongo --port 27002
MongoDB shell version: 2.4.5
connecting to: 127.0.0.1:27002/test
w4:PRIMARY> rs.status()
{
    "set" : "w4",
    "date" : ISODate("2013-08-19T18:53:23Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 1,
            "name" : "mongo1c:27002",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 586,
            "optime" : Timestamp(1376937880, 1),
            "optimeDate" : ISODate("2013-08-19T18:44:40Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "mongo1c:27003",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 584,
            "optime" : Timestamp(1376937880, 1),
            "optimeDate" : ISODate("2013-08-19T18:44:40Z"),
            "lastHeartbeat" : ISODate("2013-08-19T18:53:21Z"),
            "lastHeartbeatRecv" : ISODate("2013-08-19T18:53:21Z"),
            "pingMs" : 0,
            "syncingTo" : "mongo1c:27002"
        },
        {
            "_id" : 3,
            "name" : "mongo1c:27001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 523,
            "optime" : Timestamp(1376937880, 1),
            "optimeDate" : ISODate("2013-08-19T18:44:40Z"),
            "lastHeartbeat" : ISODate("2013-08-19T18:53:23Z"),
            "lastHeartbeatRecv" : ISODate("2013-08-19T18:53:21Z"),
            "pingMs" : 0,
            "syncingTo" : "mongo1c:27002"
        }
    ],
    "ok" : 1
}

```

Notice how the `stateStr` field identifies the PRIMARY (writer) of the set; unlike MySQL, the PRIMARY role can move around on the fly - whether automatically by voting, or via manual actions (such as taking a node offline for maintenance work). Actions such as `rs.freeze()`, `rs.stepDown()` and `rs.remove()` exist to manipulate the replica set. Note that you can always query the instance you are logged into with the `db.isMaster()` command to get another view of who the PRIMARY writer is.

```
w4:PRIMARY> db.isMaster()
{
    "setName" : "w4",
    "ismaster" : true,
    "secondary" : false,
    "hosts" : [
        "mongo1c:27002",
        "mongo1c:27001",
        "mongo1c:27003"
    ],
    "primary" : "mongo1c:27002",
    "me" : "mongo1c:27002",
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "localTime" : ISODate("2013-08-19T18:58:40.488Z"),
    "ok" : 1
}

```

### Check Sharding Status

Connecting to the shard server (mongos) to view the configuration:

```
mongo localhost:27108/admin -u admin -p
mongos> sh.status()
 --- Sharding Status ---
 sharding version: { "_id" : 1, "version" : 3 }
shards:
{ "_id" : "db1", "host" : "db1:27001,db2:27001,db3:27001" }
{ "_id" : "db2", "host" : "db3:27002,db1:27002,db2:27002" }
{ "_id" : "db3", "host" : "db2:27003,db3:27003,db1:27003" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "generators", "partitioned" : true, "primary" : "db1" }

generators.sensor_readings chunks:

    db3 3
    db2 6
    db1 6
```

diff --git a/md/network_quick_reference.md b/md/network_quick_reference.md
new file mode 100644
index 0000000..46c345f
--- /dev/null
+++ b/md/network_quick_reference.md
@@ -0,0 +1,51 @@

# Network Quick Reference


## Class Quick Reference

| **Class** | **Bits** | **Start** | **End**         | **Networks** | **Default Mask** | **CIDR** |
| --------- | -------- | --------- | --------------- | ------------:| ---------------- | -------- |
| A         | 0        | 0.0.0.0   | 127.255.255.255 | 128          | 255.0.0.0        | /8       |
| B         | 10       | 128.0.0.0 | 191.255.255.255 | 16384        | 255.255.0.0      | /16      |
| C         | 110      | 192.0.0.0 | 223.255.255.255 | 2097152      | 255.255.255.0    | /24      |
| D (mcast) | 1110     | 224.0.0.0 | 239.255.255.255 | _undef_      | _undef_          | _undef_  |
| E (rsrvd) | 1111     | 240.0.0.0 | 255.255.255.255 | _undef_      | _undef_          | _undef_  |


## Netmask Quick Reference

| **Bits** | **Total Hosts**   | **Usable** | **Netmask**     | **Inverse Mask** |
| -------- | -----------------:| ---------- | --------------- | ----------------:|
| `/0`     | 4294967296 (2^32) | 4294967294 | 0.0.0.0         | 255.255.255.255  |
| `/1`     | 2147483648 (2^31) | 2147483646 | 128.0.0.0       | 127.255.255.255  |
| `/2`     | 1073741824 (2^30) | 1073741822 | 192.0.0.0       | 63.255.255.255   |
| `/3`     | 536870912 (2^29)  | 536870910  | 224.0.0.0       | 31.255.255.255   |
| `/4`     | 268435456 (2^28)  | 268435454  | 240.0.0.0       | 15.255.255.255   |
| `/5`     | 134217728 (2^27)  | 134217726  | 248.0.0.0       | 7.255.255.255    |
| `/6`     | 67108864 (2^26)   | 67108862   | 252.0.0.0       | 3.255.255.255    |
| `/7`     | 33554432 (2^25)   | 33554430   | 254.0.0.0       | 1.255.255.255    |
| `/8`     | 16777216 (2^24)   | 16777214   | 255.0.0.0       | 0.255.255.255    |
| `/9`     | 8388608 (2^23)    | 8388606    | 255.128.0.0     | 0.127.255.255    |
| `/10`    | 4194304 (2^22)    | 4194302    | 255.192.0.0     | 0.63.255.255     |
| `/11`    | 2097152 (2^21)    | 2097150    | 255.224.0.0     | 0.31.255.255     |
| `/12`    | 1048576 (2^20)    | 1048574    | 255.240.0.0     | 0.15.255.255     |
| `/13`    | 524288 (2^19)     | 524286     | 255.248.0.0     | 0.7.255.255      |
| `/14`    | 262144 (2^18)     | 262142     | 255.252.0.0     | 0.3.255.255      |
| `/15`    | 131072 (2^17)     | 131070     | 255.254.0.0     | 0.1.255.255      |
| `/16`    | 65536 (2^16)      | 65534      | 255.255.0.0     | 0.0.255.255      |
| `/17`    | 32768 (2^15)      | 32766      | 255.255.128.0   | 0.0.127.255      |
| `/18`    | 16384 (2^14)      | 16382      | 255.255.192.0   | 0.0.63.255       |
| `/19`    | 8192 (2^13)       | 8190       | 255.255.224.0   | 0.0.31.255       |
| `/20`    | 4096 (2^12)       | 4094       | 255.255.240.0   | 0.0.15.255       |
| `/21`    | 2048 (2^11)       | 2046       | 255.255.248.0   | 0.0.7.255        |
| `/22`    | 1024 (2^10)       | 1022       | 255.255.252.0   | 0.0.3.255        |
| `/23`    | 512 (2^9)         | 510        | 255.255.254.0   | 0.0.1.255        |
| `/24`    | 256 (2^8)         | 254        | 255.255.255.0   | 0.0.0.255        |
| `/25`    | 128 (2^7)         | 126        | 255.255.255.128 | 0.0.0.127        |
| `/26`    | 64 (2^6)          | 62         | 255.255.255.192 | 0.0.0.63         |
| `/27`    | 32 (2^5)          | 30         | 255.255.255.224 | 0.0.0.31         |
| `/28`    | 16 (2^4)          | 14         | 255.255.255.240 | 0.0.0.15         |
| `/29`    | 8 (2^3)           | 6          | 255.255.255.248 | 0.0.0.7          |
| `/30`    | 4 (2^2)           | 2          | 255.255.255.252 | 0.0.0.3          |
| `/31`    | 2 (2^1)           | 0          | 255.255.255.254 | 0.0.0.1          |
| `/32`    | 1 (2^0)           | 1          | 255.255.255.255 | 0.0.0.0          |

diff --git a/md/nfs_debugging.md b/md/nfs_debugging.md
new file mode 100644
index 0000000..eb7159d
--- /dev/null
+++ b/md/nfs_debugging.md
@@ -0,0 +1,248 @@

# NFS Debugging

## Contents

 - [Userspace Tools](#userspace-tools)
 - [Kernel Interfaces](#kernel-interfaces)
 - [NFSD debug flags](#nfsd-debug-flags)
 - [NFS debug flags](#nfs-debug-flags)
 - [NLM debug flags](#nlm-debug-flags)
 - [RPC debug flags](#rpc-debug-flags)
 - [General Notes](#general-notes)


## Userspace Tools

Using `rpcdebug` is the easiest way to manipulate the kernel interfaces, in place of echoing bitmasks to `/proc`.

| **Option**  | **Description**                               |
| ----------- | --------------------------------------------- |
| `-c`        | Clear the given debug flags                   |
| `-s`        | Set the given debug flags                     |
| `-m module` | Specify which module's flags to set or clear. |
| `-v`        | Increase the verbosity of rpcdebug's output   |
| `-h`        | Print a help message and exit                 |
| `-vh`       | Print the available debug flags               |

For the `-m` option, the available modules are:

| **Module** | **Description**                                                |
| ---------- | -------------------------------------------------------------- |
| nfsd       | The NFS server                                                 |
| nfs        | The NFS client                                                 |
| nlm        | The Network Lock Manager, in either an NFS client or server    |
| rpc        | The Remote Procedure Call module, either NFS client or server  |

Examples:

```
rpcdebug -m rpc -s all    # sets all debug flags for RPC
rpcdebug -m rpc -c all    # clears all debug flags for RPC

rpcdebug -m nfsd -s all   # sets all debug flags for NFS Server
rpcdebug -m nfsd -c all   # clears all debug flags for NFS Server
```


## Kernel Interfaces

A bitmask of the debug flags can be echoed into the interface to enable output to syslog; 0 is the default:

```
/proc/sys/sunrpc/nfsd_debug
/proc/sys/sunrpc/nfs_debug
/proc/sys/sunrpc/nlm_debug
/proc/sys/sunrpc/rpc_debug
```

Sysctl controls are registered for these interfaces, so they can be used instead of echo:

```
sysctl -w sunrpc.rpc_debug=1023
sysctl -w sunrpc.rpc_debug=0

sysctl -w sunrpc.nfsd_debug=1023
sysctl -w sunrpc.nfsd_debug=0
```

At runtime the server holds information that can be examined:

```
grep . /proc/net/rpc/*/content
cat /proc/fs/nfs/exports
cat /proc/net/rpc/nfsd
ls -l /proc/fs/nfsd
```

A rundown of `/proc/net/rpc/nfsd` (the userspace tool `nfsstat` pretty-prints this info):

```
/proc/net/rpc/nfsd

* rc (reply cache):
- hits: a client is retransmitting
- misses: an operation that requires caching
- nocache: an operation that does not require caching

* fh (filehandle):
- stale: file handle errors
- total-lookups, anonlookups, dir-not-in-cache, nodir-not-in-cache
  . always seem to be zeros

* io (input/output):
- bytes-read: bytes read directly from disk
- bytes-written: bytes written to disk

* th (threads): <10%-20%> <20%-30%> ... <90%-100%> <100%>
<90%-100%> <100%> +- threads: number of nfsd threads +- fullcnt: number of times that the last 10% of threads are busy +- 10%-20%, 20%-30% ... 90%-100%: 10 numbers representing 10-20%, 20-30% to 100% + . Counts the number of times a given interval are busy + +* ra (read-ahead): <10%> <20%> ... <100%> +- cache-size: always the double of number threads +- 10%, 20% ... 100%: how deep it found what was looking for +- not-found: not found in the read-ahead cache + +* net: +- netcnt: counts every read +- netudpcnt: counts every UDP packet it receives +- nettcpcnt: counts every time it receives data from a TCP connection +- nettcpconn: count every TCP connection it receives + +* rpc: +- rpccnt: counts all rpc operations +- rpcbadfmt: counts if while processing a RPC it encounters the following errors: + . err_bad_dir, err_bad_rpc, err_bad_prog, err_bad_vers, err_bad_proc, err_bad +- rpcbadauth: bad authentication + . does not count if you try to mount from a machine that it's not in your exports file +- rpcbadclnt: unused + +* procN (N = vers): +- vs_nproc: number of procedures for NFS version + . v2: nfsproc.c, 18 + . v3: nfs3proc.c, 22 + - v4, nfs4proc.c, 2 +- statistics: generated from NFS operations at runtime + +* proc4ops: +- ops: the definition of LAST_NFS4_OP, OP_RELEASE_LOCKOWNER = 39, plus 1 (so 40); defined in nfs4.h +- x..y: the array of nfs_opcount up to LAST_NFS4_OP (nfsdstats.nfs4_opcount[i]) +``` + + +## NFSD debug flags + +``` +/usr/include/linux/nfsd/debug.h (kernel 3.13.5) + +/* + * knfsd debug flags + */ +#define NFSDDBG_SOCK 0x0001 +#define NFSDDBG_FH 0x0002 +#define NFSDDBG_EXPORT 0x0004 +#define NFSDDBG_SVC 0x0008 +#define NFSDDBG_PROC 0x0010 +#define NFSDDBG_FILEOP 0x0020 +#define NFSDDBG_AUTH 0x0040 +#define NFSDDBG_REPCACHE 0x0080 +#define NFSDDBG_XDR 0x0100 +#define NFSDDBG_LOCKD 0x0200 +#define NFSDDBG_ALL 0x7FFF +#define NFSDDBG_NOCHANGE 0xFFFF +``` + + +## NFS debug flags + +``` +/usr/include/linux/nfs_fs.h (kernel 3.13.5) + +/* + * NFS debug flags + */ +#define NFSDBG_VFS 0x0001 +#define NFSDBG_DIRCACHE 0x0002 +#define NFSDBG_LOOKUPCACHE 0x0004 +#define NFSDBG_PAGECACHE 0x0008 +#define NFSDBG_PROC 0x0010 +#define NFSDBG_XDR 0x0020 +#define NFSDBG_FILE 0x0040 +#define NFSDBG_ROOT 0x0080 +#define NFSDBG_CALLBACK 0x0100 +#define NFSDBG_CLIENT 0x0200 +#define NFSDBG_MOUNT 0x0400 +#define NFSDBG_FSCACHE 0x0800 +#define NFSDBG_PNFS 0x1000 +#define NFSDBG_PNFS_LD 0x2000 +#define NFSDBG_STATE 0x4000 +#define NFSDBG_ALL 0xFFFF +``` + + +## NLM debug flags + +``` +/usr/include/linux/lockd/debug.h (kernel 3.13.5) + +/* + * Debug flags + */ +#define NLMDBG_SVC 0x0001 +#define NLMDBG_CLIENT 0x0002 +#define NLMDBG_CLNTLOCK 0x0004 +#define NLMDBG_SVCLOCK 0x0008 +#define NLMDBG_MONITOR 0x0010 +#define NLMDBG_CLNTSUBS 0x0020 +#define NLMDBG_SVCSUBS 0x0040 +#define NLMDBG_HOSTCACHE 0x0080 +#define NLMDBG_XDR 0x0100 +#define NLMDBG_ALL 0x7fff +``` + + +## RPC debug flags + +``` +/usr/include/linux/sunrpc/debug.h (kernel 3.13.5) + +/* + * RPC debug facilities + */ +#define RPCDBG_XPRT 0x0001 +#define RPCDBG_CALL 0x0002 +#define RPCDBG_DEBUG 0x0004 +#define RPCDBG_NFS 0x0008 +#define RPCDBG_AUTH 0x0010 +#define RPCDBG_BIND 0x0020 +#define RPCDBG_SCHED 0x0040 +#define RPCDBG_TRANS 0x0080 +#define RPCDBG_SVCXPRT 0x0100 +#define RPCDBG_SVCDSP 0x0200 +#define RPCDBG_MISC 0x0400 +#define RPCDBG_CACHE 0x0800 +#define RPCDBG_ALL 0x7fff +``` + + +## General Notes + + - While the number of threads can be increased at runtime via an echo to `/proc/fs/nfsd/threads`, the cache size (double the 
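+
+As a worked example of combining these masks (an illustrative sketch; the flag names follow `rpcdebug(8)` and the hex values above):
+
+```
+# RPCDBG_XPRT | RPCDBG_CALL | RPCDBG_TRANS = 0x0001 | 0x0002 | 0x0080
+# = 0x0083 = 131 decimal, written via the sysctl interface
+sysctl -w sunrpc.rpc_debug=131
+
+# the equivalent rpcdebug invocation, then clearing the flags again
+rpcdebug -m rpc -s xprt call trans
+rpcdebug -m rpc -c all
+```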
+
+
+## General Notes
+
+ - While the number of threads can be increased at runtime via an echo to `/proc/fs/nfsd/threads`, the cache size (double the threads, see the `ra` line of /proc/net/rpc/nfsd) is not dynamic. The NFS daemon must be restarted with the new thread count configured at initialization (`/etc/sysconfig/nfs` on RHEL and CentOS) in order for the thread cache to properly adjust.
diff --git a/md/nfs_setup.md b/md/nfs_setup.md
new file mode 100644
index 0000000..ccbe290
--- /dev/null
+++ b/md/nfs_setup.md
@@ -0,0 +1,1097 @@
+# NFS Setup
+
+## Contents
+
+ - [Overview](#overview)
+ - [TCP Wrappers](#tcp-wrappers)
+ - [Firewalls and IPtables](#firewalls-and-iptables)
+ - [Client systemd NFS Mounts](#client-systemd-nfs-mounts)
+ - [Client noatime NFS Mounts](#client-noatime-nfs-mounts)
+ - [NFSv3 Server](#nfsv3-server)
+   - [RHEL / CentOS](#rhel--centos)
+   - [Debian / Ubuntu](#debian--ubuntu)
+ - [NFSv3 Client](#nfsv3-client)
+   - [RHEL / CentOS](#rhel--centos-1)
+   - [Debian / Ubuntu](#debian--ubuntu-1)
+ - [NFSv4 Server](#nfsv4-server)
+   - [RHEL / CentOS](#rhel--centos-2)
+   - [Debian / Ubuntu](#debian--ubuntu-2)
+ - [NFSv4 Client](#nfsv4-client)
+   - [RHEL / CentOS](#rhel--centos-3)
+   - [Debian / Ubuntu](#debian--ubuntu-3)
+ - [Additional Reading](#additional-reading)
+
+
+## Overview
+
+Network File System (**NFS**) is a file and directory sharing mechanism native to Linux. It requires a Server and a Client configuration. The Server presents a named directory to be shared out, with specific permissions as to which clients can access it (IP addresses) and what capabilities they can use (such as read-write or read-only). The configuration can use both TCP and UDP at the same time depending on the needs of the specific implementation. The process utilizes the Remote Procedure Call (**RPC**) infrastructure to facilitate communication and operation.
+
+Within the modern world of NFS, there are two major levels of functionality in use today - **NFSv3** (version 3) and **NFSv4** (version 4). While there are many differences between the two, some of the more important changes:
+
+ - **NFSv3**
+   - Transports - TCP and UDP
+   - Authentication - IP only
+   - Protocols - Stateless. Several protocols required such as MOUNT, LOCK, STATUS
+   - Locking - NLM and the lock protocol
+   - Security - Traditional UNIX permissions, no ACL support
+   - Communication - One RPC call per operation
+ - **NFSv4**
+   - Transports - TCP
+   - Authentication - Kerberos support (optional)
+   - Protocols - Stateful. Single protocol within the stack with security auditing
+   - Locking - Lease based locking within the protocol
+   - Security - Kerberos and ACL support
+   - Communication - Several operations supported by one RPC call
+
+Both NFSv3 and NFSv4 services are handled by the same software package installations as outlined in this article; a system can use NFSv3 only, NFSv4 only, or a mixture of the two at the same time depending on the configuration and implementation. Older versions such as NFSv2 are considered obsolete and are not covered herein; NFSv3 support has been stable for over a decade.
+
+> **NFSv4**: The use of **idmapd** with `sec=sys` (system level, not Kerberos) may not always produce the results expected with permissions.
+>
+> NFSv4 uses UTF8 string principals between the client and server; when using Kerberos (`sec=krb`) it's sufficient to use the same user names and NFSv4 domain on client and server while the UIDs differ. However, with `AUTH_SYS` (`sec=sys`) the RPC requests will use the UIDs and GIDs from the client host. In a general sense, this means that you still have to manually align the UIDs/GIDs among all hosts in the NFS environment, just as with traditional NFSv3 - **idmapd** does not do what you think. For detailed information read [this thread](http://thread.gmane.org/gmane.linux.nfsv4/7103/focus=7105) on the linux-nfsv4 mailing list.
+>
+> Red Hat has a comprehensive article detailing this subject.
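+
+Given the `sec=sys` caveat above, it is worth verifying UID/GID alignment before exporting anything; a hedged illustrative check (hostnames and the user name are placeholders):
+
+```
+# the UID and primary GID should match on every host in the NFS environment
+for h in nfs-server nfs-client1 nfs-client2; do
+    echo -n "$h: "; ssh "$h" 'id appuser'
+done
+```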
+
+
+## TCP Wrappers
+
+The concept and software package TCP Wrappers was invented back in 1990, before the modern ecosphere of pf (BSD), ipchains (older Linux) and iptables (modern Linux) existed. In general, a software author could link their application against a shared library (libwrap.so) that provided a common mechanism for network-level access control (ACL) without each author having to implement it independently. In modern Linux installs, the use of iptables instead of TCP Wrappers is standard and preferred.
+
+The design has two main configuration files of note:
+
+ - `/etc/hosts.allow`
+ - `/etc/hosts.deny`
+
+Ensure that these files are examined for any lines that begin with keywords like portmap, nfs, or rpc, and comment them out in both files. Unless there is a very specific need and use case, using TCP Wrappers with NFS is to be avoided in favor of iptables. No restarts of any software are required after editing these files; the changes are dynamic. The NFS subsystem is still linked to this library, so it cannot be uninstalled - instead, ensure the configuration ignores it.
+
+A common question is "How do I know if an application uses TCP Wrappers?" - the binary itself will be dynamically linked against `libwrap.so`, which `ldd` can be used to check; for example:
+
+```
+[root@nfs-server ~]# ldd /sbin/rpcbind | grep libwrap
+        libwrap.so.0 => /lib64/libwrap.so.0 (0x00007fbee9b8e000)
+```
+
+By linking to the `libwrap.so` library as provided by the `tcp_wrappers` package, we know it can be affected by the `/etc/hosts.allow` and `/etc/hosts.deny` configuration.
+
+
+## Firewalls and IPtables
+
+Standard network communication is required between the server and client; exact ports and protocols will be discussed below. In this document we will use the private network `192.168.5.0/24` between the servers, and have added a generic IPtables rule to open all communication carte blanche for this subnet:
+
+```
+-A INPUT -s 192.168.5.0/24 -j ACCEPT
+```
+
+The exact configuration and what may need to be changed to allow communication could involve IPtables, firewalld, UFW, firewall ACLs and so forth - the exact environment will dictate how the network security needs to be adjusted to allow cross-server communication.
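+
+As one illustration for firewalld-based systems (RHEL7), a possible equivalent of the blanket subnet rule above - an assumption to adapt, not a required step:
+
+```
+firewall-cmd --permanent --zone=trusted --add-source=192.168.5.0/24
+firewall-cmd --reload
+```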
+
+If using the standard `/etc/sysconfig/nfs` pre-defined ports for RHEL, this set of iptables rules should suffice - the Debian / Ubuntu sections will also use these same ports for compatibility:
+
+```
+-A INPUT -p tcp -m tcp --dport 111 -j ACCEPT
+-A INPUT -p udp -m udp --dport 111 -j ACCEPT
+-A INPUT -p tcp -m tcp --dport 662 -j ACCEPT
+-A INPUT -p udp -m udp --dport 662 -j ACCEPT
+-A INPUT -p tcp -m tcp --dport 892 -j ACCEPT
+-A INPUT -p udp -m udp --dport 892 -j ACCEPT
+-A INPUT -p tcp -m tcp --dport 2049 -j ACCEPT
+-A INPUT -p udp -m udp --dport 2049 -j ACCEPT
+-A INPUT -p tcp -m tcp --dport 32803 -j ACCEPT
+-A INPUT -p udp -m udp --dport 32769 -j ACCEPT
+```
+
+
+## Client systemd NFS Mounts
+
+With the advent of **systemd**, a service like `netfs` is no longer required or present - the mounts are still configured in `/etc/fstab`, but how they work is mechanically completely different. The new systemd methodology of mounting filesystems at boot time uses the _generator_ infrastructure on RHEL / CentOS 7, Debian 8, Arch and others.
+
+Upon boot, `systemd-fstab-generator` examines the configured mounts and writes each one out as a systemd Unit in the runtime directory, `/run/systemd/generator`; these are then "started" as if they were traditional systemd unit files. These Unit files have the normal dependency chains listed for that particular mount point; using the example mount point in this article from fstab:
+
+```
+/etc/fstab
+
+192.168.5.1:/data /data nfs vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,noatime 0 0
+```
+
+...the generator creates this file on boot, `data.mount`:
+
+```
+/run/systemd/generator/data.mount
+
+# Automatically generated by systemd-fstab-generator
+
+[Unit]
+SourcePath=/etc/fstab
+Before=remote-fs.target
+
+[Mount]
+What=192.168.5.1:/data
+Where=/data
+Type=nfs
+Options=vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,noatime
+```
+
+These units are then instantiated by the `remote-fs.target` Unit, which is Wanted by the `multi-user.target` Unit. This methodology ensures that the networking layer is up and running, the rpcbind/nfs-lock services are running, and the local filesystem is ready to perform the mount, in a fully automated fashion. No longer does a tech need to remember to enable a service like netfs to get NFS shares mounted at boot.
+
+The use of Generators goes beyond just NFS mounts - please see the official documentation for a full overview.
+
+
+## Client noatime NFS Mounts
+
+There tends to be a misconception about the use of `noatime` on NFS client mounts; in short, using `noatime` on the client mount has no real effect. The NFS server should mount its source data directory using `noatime` instead, then export that to the clients. Red Hat has written an Access article detailing the process that happens under the covers. An excerpt from the article:
+
+> "... Because of this caching behavior, the Linux NFS client does not support generic atime-related mount options. See `mount(8)` for details on these options. In particular, the atime/noatime, diratime/nodiratime, relatime/norelatime, and strictatime/nostrictatime mount options have no effect on NFS mounts."
+
+However, it should be noted that specifying `noatime` in the mount options will reduce the amount of traffic sent to the server, even if conceptually the data is inaccurate. When mounting with `noatime`, the client can make full use of its local cache for the `GETATTR` NFS calls, at the expense of possibly having outdated information about these attributes from the server. Without `noatime` on the client mount, the client system will essentially bypass its cache to ensure its local view of the attributes is up to date - which may not actually matter in practical use.
+
+Therefore the best solution is to ensure that **both** the server **and** the client mount the filesystem with `noatime` to increase performance on both ends of the connection; the server mounts the real filesystem with noatime first and exports the share, then the client mounts with noatime in /etc/fstab. The noatime option is not used in the server's /etc/exports options at all - it is an invalid setting in that location.
+
+
+## NFSv3 Server
+
+### RHEL / CentOS
+
+Install the required software packages for the specific release of RHEL / CentOS; basically, `rpcbind` replaced `portmap` from RHEL5 to RHEL6:
+
+```
+# RHEL5 / CentOS5
+yum -y install portmap nfs-utils
+
+# RHEL6 / CentOS6 / RHEL7 / CentOS7
+yum -y install rpcbind nfs-utils
+```
+
+By default the NFSv3 methodology will assign random ephemeral ports to each of the required daemons upon start; this needs to be changed so that the NFS server is firewall/VLAN ACL friendly and always uses static ports on every start. Red Hat provides a configuration file ready to use.
+
+The default configuration only enables 8 NFS threads, a holdover from two decades ago. Set `RPCNFSDCOUNT` to a value that the specific server can handle based on its available resources. In practice, 32 or 64 threads work for most servers, since at that point the disk I/O probably can't keep up if they get exhausted. This can be changed at runtime as well.
+
+Edit `/etc/sysconfig/nfs` and uncomment the lines that have `_PORT` in them to use the predefined static ports, and set the threads. For RHEL5 and RHEL6 all the ports are pre-defined for NFSv3; for RHEL7 some of the configuration must be added manually.
+
+```
+## RHEL5 / CentOS5 / RHEL6 / CentOS6
+# egrep "(PORT|COUNT)" /etc/sysconfig/nfs
+RPCNFSDCOUNT=64
+RQUOTAD_PORT=875
+LOCKD_TCPPORT=32803
+LOCKD_UDPPORT=32769
+MOUNTD_PORT=892
+STATD_PORT=662
+STATD_OUTGOING_PORT=2020
+
+## RHEL7 / CentOS7
+# egrep -v "(^(#|$)|=\"\")" /etc/sysconfig/nfs
+RPCRQUOTADOPTS="-p 875"
+LOCKD_TCPPORT=32803
+LOCKD_UDPPORT=32769
+RPCNFSDCOUNT=64
+RPCMOUNTDOPTS="-p 892"
+STATDARG="-p 662 -o 2020"
+GSS_USE_PROXY="no"
+```
+
+Next, define the specific directory to be shared along with its permissions and capabilities in `/etc/exports`. Be careful and understand that a space between the IP/name and the opening parenthesis is usually incorrect – the parsing of this file treats the space as a field separator for the entire permissions object\! In this example, we share out the /data directory to all servers in the subnet 192.168.5.x:
+
+```
+/etc/exports
+
+/data 192.168.5.0/24(rw,no_root_squash)
+```
+
+Notice the format is '(_what_) (_who_)(_how_)' in nature, with no space between the _who_ and _how_; an entry on this line without _who_ (or a \*) applies global permissions\!
Only one _what_ allowed per line, several _who_(_how_) combinations can be used like so: + +``` +/etc/exports + +Correct, no space between /who/ and (how): + + /data 192.168.1.4(rw) 192.168.1.9(rw,no_root_squash) *(ro) + /opt 10.11.12.0/24(rw) 10.11.13.0/24(rw,no_root_squash) + +Incorrect, spaces between /who/ and (how): + + /data 192.168.1.4 (rw) 192.168.1.9 (rw,no_root_squash) * (ro) + /opt 10.11.12.0/24 (rw) 10.11.13.0/24 (rw,no_root_squash) +``` + +Start the portmap/rpcbind, nfslock, and nfs services. Enable them to start at boot. + +``` +# RHEL5 / CentOS5 +service portmap start; chkconfig portmap on +service nfslock start; chkconfig nfslock on +service nfs start; chkconfig nfs on + +# RHEL6 / CentOS6 +service rpcbind start; chkconfig rpcbind on +service nfslock start; chkconfig nfslock on +service nfs start; chkconfig nfs on + +# RHEL7 / CentOS7 +systemctl start rpcbind nfs-lock nfs-server +systemctl enable rpcbind nfs-lock nfs-server +``` + +Finally, check the server configuration locally - `rpcinfo` is used to check the protocols, daemons and ports, `showmount` is used to check the exported share. + +``` +# rpcinfo -p + program vers proto port + 100000 2 tcp 111 portmapper + 100000 2 udp 111 portmapper + 100024 1 udp 662 status + 100024 1 tcp 662 status + 100003 2 udp 2049 nfs + 100003 3 udp 2049 nfs + 100003 4 udp 2049 nfs + 100021 1 udp 32769 nlockmgr + 100021 3 udp 32769 nlockmgr + 100021 4 udp 32769 nlockmgr + 100021 1 tcp 32803 nlockmgr + 100021 3 tcp 32803 nlockmgr + 100021 4 tcp 32803 nlockmgr + 100003 2 tcp 2049 nfs + 100003 3 tcp 2049 nfs + 100003 4 tcp 2049 nfs + 100005 1 udp 892 mountd + 100005 1 tcp 892 mountd + 100005 2 udp 892 mountd + 100005 2 tcp 892 mountd + 100005 3 udp 892 mountd + 100005 3 tcp 892 mountd + +# showmount -e +Export list for nfs-server.local: +/data 192.168.5.0/24 +``` + +### Debian / Ubuntu + +Install the required software packages: + +``` +apt-get update +apt-get install rpcbind nfs-common nfs-kernel-server +``` + +**IMMEDIATELY** stop the auto-started services so that we can unload the `lockd` kernel module\! In order to set static ports the module has to be unloaded, which can be troublesome if it's not done right away – if you do not perform these actions now, you might have to reboot later which is undesirable. + +``` +# Debian 7 +service nfs-kernel-server stop +service nfs-common stop +service rpcbind stop +modprobe -r nfsd nfs lockd + +# Ubuntu 14 +service nfs-kernel-server stop +service statd stop +service idmapd stop +service rpcbind stop +modprobe -r nfsd nfs lockd +``` + +By default the NFSv3 methodology will assign random ephemeral ports for each one of the required daemons upon start; this needs to be changed so that the NFS server is firewall/VLAN ACL friendly and always use static ports on every start. The ports are configured in two different files differently that Red Hat; they can be configured to use the same static port numbers which is recommended for maximum compatibility. 
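+
+Before editing anything, an optional hedged sanity check (assumes the net-tools package provides `netstat`): confirm nothing else is already bound to the static ports about to be assigned.
+
+```
+# 111 rpcbind, 662 statd, 892 mountd, 2049 nfsd, 32769/32803 lockd
+netstat -tulpn | egrep ':(111|662|892|2049|32769|32803) ' || echo "ports are free"
+```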
+ +First, edit `/etc/default/nfs-common` to define that you want to run `STATD` and set the ports; notice on Ubuntu that we need to disable `rpc.idmapd` using an Upstart override: + +``` +## Debian 7 +# egrep -v "^(#|$)" /etc/default/nfs-common +NEED_STATD=yes +STATDOPTS="-p 662 -o 2020" +NEED_IDMAPD=no +NEED_GSSD=no + +## Ubuntu 14 +# egrep -v "^(#|$)" /etc/default/nfs-common +NEED_STATD=yes +STATDOPTS="-p 662 -o 2020" +NEED_GSSD=no + +## Ubuntu 14 Only +echo "manual" > /etc/init/idmapd.override +``` + +The default configuration only enables 8 NFS threads, a holdover from 2 decades ago. Set `RPCNFSDCOUNT` to a value that the specific server can handle based on it's available resources. In practice, 32 or 64 threads works for most servers since at this point your disk I/O probably can't keep up if they get exhausted. This can be changed at runtime as well. + +Second, edit `/etc/default/nfs-kernel-server` to set the `RPCNFSDCOUNT` higher than the default 8 threads and set the `MOUNTD` ports: + +``` +# egrep -v "^(#|$)" /etc/default/nfs-kernel-server +RPCNFSDCOUNT=64 +RPCNFSDPRIORITY=0 +RPCMOUNTDOPTS="--manage-gids -p 892" +NEED_SVCGSSD=no +RPCSVCGSSDOPTS= +``` + +Last, create `/etc/modprobe.d/nfs-lockd.conf'` with this content to set the ports for `LOCKD`: + +``` +echo "options lockd nlm_udpport=32769 nlm_tcpport=32803" > /etc/modprobe.d/nfs-lockd.conf +``` + +Next, define the specific directory to be shared along with it's permissions and capabilities in `/etc/exports`. Be careful and understand that a space between the IP/name and the opening parenthesis is usually incorrect – the parsing of this file treats the space as a field separator for the entire permissions object\! In this example, we share out the /data directory to all servers in the subnet 192.168.5.x: + +``` +/etc/exports + +/data 192.168.5.0/24(rw,no_root_squash,no_subtree_check) +``` + +Notice the format is '(_what_) (_who_)(_how_)' in nature, with no space between the _who_ and _how_; an entry on this line without _who_ (or a \*) applies global permissions\! Only one _what_ allowed per line, several _who_(_how_) combinations can be used like so: + +``` +/etc/exports + +Correct, no space between /who/ and (how): + + /data 192.168.1.4(rw) 192.168.1.9(rw,no_root_squash) *(ro) + /opt 10.11.12.0/24(rw) 10.11.13.0/24(rw,no_root_squash) + +Incorrect, spaces between /who/ and (how): + + /data 192.168.1.4 (rw) 192.168.1.9 (rw,no_root_squash) * (ro) + /opt 10.11.12.0/24 (rw) 10.11.13.0/24 (rw,no_root_squash) +``` + +Start the rpcbind, nfs-common and nfs-kernel-server services. Enable them to start at boot - this is normally automatic on Debian/Ubuntu, however to be 100% safe run the commands to verify it's configured. + +``` +# Debian 7 +service rpcbind start; insserv rpcbind +service nfs-common start; insserv nfs-common +service nfs-kernel-server start; insserv nfs-kernel-server + +# Ubuntu 14 - Upstart controlled rpcbind/statd +service rpcbind start +service statd start +service nfs-kernel-server start; update-rc.d nfs-kernel-server enable +``` + +Finally, check the server configuration locally - `rpcinfo` is used to check the protocols, daemons and ports, `showmount` is used to check the exported share. 
+ +``` +# rpcinfo -p + program vers proto port service + 100000 4 tcp 111 portmapper + 100000 3 tcp 111 portmapper + 100000 2 tcp 111 portmapper + 100000 4 udp 111 portmapper + 100000 3 udp 111 portmapper + 100000 2 udp 111 portmapper + 100024 1 udp 662 status + 100024 1 tcp 662 status + 100003 2 tcp 2049 nfs + 100003 3 tcp 2049 nfs + 100003 4 tcp 2049 nfs + 100227 2 tcp 2049 + 100227 3 tcp 2049 + 100003 2 udp 2049 nfs + 100003 3 udp 2049 nfs + 100003 4 udp 2049 nfs + 100227 2 udp 2049 + 100227 3 udp 2049 + 100021 1 udp 32769 nlockmgr + 100021 3 udp 32769 nlockmgr + 100021 4 udp 32769 nlockmgr + 100021 1 tcp 32803 nlockmgr + 100021 3 tcp 32803 nlockmgr + 100021 4 tcp 32803 nlockmgr + 100005 1 udp 892 mountd + 100005 1 tcp 892 mountd + 100005 2 udp 892 mountd + 100005 2 tcp 892 mountd + 100005 3 udp 892 mountd + 100005 3 tcp 892 mountd + +# showmount -e +Export list for nfs-server.local: +/data 192.168.5.0/24 +``` + + +## NFSv3 Client + +### RHEL / CentOS + +Install the required software packages for the specific release of RHEL / CentOS; basically, `rpcbind` replaced `portmap` from RHEL5 to RHEL6: + +``` +# RHEL5 / CentOS5 +yum -y install portmap nfs-utils + +# RHEL6 / CentOS6 / RHEL7 / CentOS7 +yum -y install rpcbind nfs-utils +``` + +Start the `portmap`/`rpcbind` and `nfslock` services; note that the `nfs` service is not required, that is for the server only. On the client the `netfs` service is enabled, which mounts network filesystem from `/etc/fstab` after networking is started during the boot process. The netfs service is not typically started by hand after the server is online, only run at boot. + +``` +# RHEL5 / CentOS5 +service portmap start; chkconfig portmap on +service nfslock start; chkconfig nfslock on +chkconfig netfs on + +# RHEL6 / CentOS6 +service rpcbind start; chkconfig rpcbind on +service nfslock start; chkconfig nfslock on +chkconfig netfs on + +# RHEL7 / CentOS7 +systemctl start rpcbind nfs-lock nfs-client.target +systemctl enable rpcbind nfs-lock nfs-client.target +``` + +From the client, query the server to check the RPC ports and the available exports - compare against the server configuration for validity: + +``` +# rpcinfo -p 192.168.5.1 + program vers proto port + 100000 2 tcp 111 portmapper + 100000 2 udp 111 portmapper + 100024 1 udp 662 status + 100024 1 tcp 662 status + 100003 2 udp 2049 nfs + 100003 3 udp 2049 nfs + 100003 4 udp 2049 nfs + 100021 1 udp 32769 nlockmgr + 100021 3 udp 32769 nlockmgr + 100021 4 udp 32769 nlockmgr + 100021 1 tcp 32803 nlockmgr + 100021 3 tcp 32803 nlockmgr + 100021 4 tcp 32803 nlockmgr + 100003 2 tcp 2049 nfs + 100003 3 tcp 2049 nfs + 100003 4 tcp 2049 nfs + 100005 1 udp 892 mountd + 100005 1 tcp 892 mountd + 100005 2 udp 892 mountd + 100005 2 tcp 892 mountd + 100005 3 udp 892 mountd + 100005 3 tcp 892 mountd + +# showmount -e 192.168.5.1 +Export list for 192.168.5.1: +/data 192.168.5.0/24 +``` + +Make the destination directory for the mount, and test a simple mount with no advanced options: + +``` +# showmount -e 192.168.5.1 +# mkdir /data +# mount -t nfs -o vers=3 192.168.5.1:/data /data +# df -h /data +# umount /data +``` + +Now that it's confirmed working, add it to `/etc/fstab` as a standard mount. This is where the Best Practices and performance options can be applied; in general the recommended set of options is typically based on using TCP or UDP, which will depend on the environment in question. See the man page `nfs(5)` for a full list of everything that can be tuned. 
+ +``` +/etc/fstab + +# TCP example +192.168.5.1:/data /data nfs vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,noatime 0 0 + +# UDP example +192.168.5.1:/data /data nfs vers=3,proto=udp,hard,intr,rsize=32768,wsize=32768,noatime 0 0 +``` + +Finally, test the mount and check the desired options were applied. + +``` +# mount /data +# touch /data/test-file +# df -h /data +# grep /data /proc/mounts +``` + +Performance testing should typically be performed at this point (possibly with a tool such as `fio`) to determine if any of these options could be adjusted for better results. + + +### Debian / Ubuntu + +Install the required software packages: + +``` +apt-get update +apt-get install rpcbind nfs-common +``` + +The service should automatically be started; if not, start them and ensure they're enabled on boot. Note that the `nfs-kernel-server` service is not required, that is for the server only. On the client the `mountnfs.sh` service is enabled, which mounts network filesystem from `/etc/fstab` after networking is started during the boot process. The `mountnfs.sh` service is not typically started by hand after the server is online, only run at boot. + +``` +# Debian 7 +service rpcbind start; insserv rpcbind +service nfs-common start; insserv nfs-common +insserv mountnfs.sh + +# Ubuntu 14 - Upstart controlled rpcbind/statd/mountnfs +service rpcbind start +service statd start +service idmapd stop +echo "manual" > /etc/init/idmapd.override +``` + +From the client, query the server to check the RPC ports and the available exports - compare against the server configuration for validity: + +``` +# rpcinfo -p 192.168.5.1 + program vers proto port service + 100000 4 tcp 111 portmapper + 100000 3 tcp 111 portmapper + 100000 2 tcp 111 portmapper + 100000 4 udp 111 portmapper + 100000 3 udp 111 portmapper + 100000 2 udp 111 portmapper + 100024 1 udp 662 status + 100024 1 tcp 662 status + 100003 2 tcp 2049 nfs + 100003 3 tcp 2049 nfs + 100003 4 tcp 2049 nfs + 100227 2 tcp 2049 + 100227 3 tcp 2049 + 100003 2 udp 2049 nfs + 100003 3 udp 2049 nfs + 100003 4 udp 2049 nfs + 100227 2 udp 2049 + 100227 3 udp 2049 + 100021 1 udp 32769 nlockmgr + 100021 3 udp 32769 nlockmgr + 100021 4 udp 32769 nlockmgr + 100021 1 tcp 32803 nlockmgr + 100021 3 tcp 32803 nlockmgr + 100021 4 tcp 32803 nlockmgr + 100005 1 udp 892 mountd + 100005 1 tcp 892 mountd + 100005 2 udp 892 mountd + 100005 2 tcp 892 mountd + 100005 3 udp 892 mountd + 100005 3 tcp 892 mountd + +# showmount -e 192.168.5.1 +Export list for 192.168.5.1: +/data 192.168.5.0/24 +``` + +Make the destination directory for the mount, and test a simple mount with no advanced options: + +``` +# showmount -e 192.168.5.1 +# mkdir /data +# mount -t nfs -o vers=3 192.168.5.1:/data /data +# df -h /data +# umount /data +``` + +Now that it's confirmed working, add it to `/etc/fstab` as a standard mount. This is where the Best Practices and performance options can be applied; in general the recommended set of options is typically based on using TCP or UDP, which will depend on the environment in question. See the man page `nfs(5)` for a full list of everything that can be tuned. + +``` +/etc/fstab + +# TCP example +192.168.5.1:/data /data nfs vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,noatime 0 0 + +# UDP example +192.168.5.1:/data /data nfs vers=3,proto=udp,hard,intr,rsize=32768,wsize=32768,noatime 0 0 +``` + +Finally, test the mount and check the desired options were applied. 
+ +``` +# mount /data +# touch /data/test-file +# df -h /data +# grep /data /proc/mounts +``` + +Performance testing should typically be performed at this point (possibly with a tool such as `fio`) to determine if any of these options could be adjusted for better results. + + +## NFSv4 Server + +> If the server is supporting both NFSv3 **and** NFSv4, be sure to combine the setup steps above with these below to set static ports. Kerberos support is explicitly not configured within these instructions, traditional `sec=sys` (security = system) mode is being used. + +### RHEL / CentOS + +Install the required software packages: + +``` +# RHEL5 / CentOS5 +yum -y install portmap nfs-utils nfs4-acl-tools + +# RHEL6 / CentOS6 / RHEL7 / CentOS7 +yum -y install rpcbind nfs-utils nfs4-acl-tools +``` + +Edit `/etc/idmapd.conf` to set the `Domain` – all the servers and clients must be on the same domain: + +``` +/etc/idmapd.conf + +[General] +Domain = example.com +``` + +Edit `/etc/sysconfig/nfs` to set the `RPCNFSDCOUNT` higher than the default 8 threads: + +``` +# egrep "(PORT|COUNT)" /etc/sysconfig/nfs +RPCNFSDCOUNT=64 +``` + +Unlike NFSv3, NFSv4 has a concept of a "root file system" under which all the actual desired directories are to be exposed; there are many ways of doing this (such as using bind mounts) which may or may not work for the given situation. These are defined in `/etc/exports` just like NFSv3 but with special options; the textbook method for setting up the parent/child relationship is to first make an empty directory to be used as `fsid=0` (root/parent) - `/exports` is the commonly used name: + +``` +mkdir /exports +``` + +Now bind mount the desired data directories into it - for example, the real directory `/data` will be bind-mounted to `/exports/data` like so: + +``` +touch /data/test-file +echo '/data /exports/data none bind 0 0' >> /etc/fstab +mkdir -p /exports/data +mount /exports/data +ls -l /exports/data/test-file +``` + +Now we build the `/etc/exports` listing this special parent first with `fsid=0` and `crossmnt` in the options, then the children using their bind-mounted home: + +``` +/etc/exports + +/exports 192.168.5.0/24(ro,no_subtree_check,fsid=0,crossmnt) +/exports/data 192.168.5.0/24(rw,no_subtree_check,no_root_squash) +``` + +Start the required services: + +``` +# RHEL5 / CentOS5 +service portmap start; chkconfig portmap on +service rpcidmapd start; chkconfig rpcidmapd on +service nfs start; chkconfig nfs on + +# RHEL6 / CentOS6 +service rpcbind start; chkconfig rpcbind on +service rpcidmapd start; chkconfig rpcidmapd on +service nfs start; chkconfig nfs on + +# RHEL7 / CentOS7 +systemctl start rpcbind nfs-idmap nfs-server +systemctl enable rpcbind nfs-idmap nfs-server +``` + + +Finally, check the local exports with `showmount`: + +``` +# showmount -e +Export list for nfs-server.local: +/exports 192.168.5.0/24 +/exports/data 192.168.5.0/24 +``` + +### Debian / Ubuntu + +Install the required software packages: + +``` +apt-get update +apt-get install rpcbind nfs-common nfs4-acl-tools nfs-kernel-server +``` + +Stop the auto-started services so we can configure them: + +``` +# Debian 7 +service nfs-kernel-server stop +service nfs-common stop +service rpcbind stop +modprobe -r nfsd nfs lockd + +# Ubuntu 14 +service nfs-kernel-server stop +service statd stop +service idmapd stop +service rpcbind stop +modprobe -r nfsd nfs lockd +``` + +Edit `/etc/idmapd.conf` to set the `Domain` – all the servers and clients must be on the same domain: + +``` +/etc/idmapd.conf + 
+[General] +Domain = example.com +``` + +Edit `/etc/default/nfs-common` to indicate that _idmapd_ is required and _statd_ is not, along with _rpc.gssd_: + +``` +## Debian 7 +# egrep -v "^(#|$)" /etc/default/nfs-common +NEED_STATD=no +STATDOPTS= +NEED_IDMAPD=yes +NEED_GSSD=no + +## Ubuntu 14 +# egrep -v "^(#|$)" /etc/default/nfs-common +NEED_STATD=no +STATDOPTS= +NEED_GSSD=no +``` + +Edit `/etc/default/nfs-kernel-server` to set the `RPCNFSDCOUNT` higher than the default 8 threads: + +``` +# egrep -v "^(#|$)" /etc/default/nfs-kernel-server +RPCNFSDCOUNT=64 +RPCNFSDPRIORITY=0 +RPCMOUNTDOPTS=--manage-gids +NEED_SVCGSSD=no +RPCSVCGSSDOPTS= +``` + +Unlike NFSv3, NFSv4 has a concept of a "root file system" under which all the actual desired directories are to be exposed; there are many ways of doing this (such as using bind mounts) which may or may not work for the given situation. These are defined in `/etc/exports` just like NFSv3 but with special options; the textbook method for setting up the parent/child relationship is to first make an empty directory to be used as `fsid=0` (root/parent) - `/exports` is the commonly used name: + +``` +mkdir /exports +``` + +Now bind mount the desired data directories into it - for example, the real directory `/data` will be bind-mounted to `/exports/data` like so: + +``` +touch /data/test-file +echo '/data /exports/data none bind 0 0' >> /etc/fstab +mkdir -p /exports/data +mount /exports/data +ls -l /exports/data/test-file +``` + +Now we build the `/etc/exports` listing this special parent first with `fsid=0` and `crossmnt` in the options, then the children using their bind-mounted home: + +``` +/etc/exports + +/exports 192.168.5.0/24(ro,no_subtree_check,fsid=0,crossmnt) +/exports/data 192.168.5.0/24(rw,no_subtree_check,no_root_squash) +``` + +Start the required services and enable at boot - this is normally automatic on Debian/Ubuntu, however to be 100% safe run the commands to verify it's configured. + +``` +# Debian 7 +service rpcbind start; insserv rpcbind +service nfs-common start; insserv nfs-common +service nfs-kernel-server start; insserv nfs-kernel-server + +# Ubuntu 14 - Upstart controlled rpcbind/statd +service rpcbind start +service idmapd start +service nfs-kernel-server start; update-rc.d nfs-kernel-server enable +``` + +Finally, check the local exports with `showmount`: + +``` +# showmount -e +Export list for nfs-server.local: +/exports 192.168.5.0/24 +/exports/data 192.168.5.0/24 +``` + + +## NFSv4 Client + +> If the client is supporting both NFSv3 **and** NFSv4, be sure to combine the setup steps above with these below to set static ports. Kerberos support is explicitly not configured within these instructions, traditional `sec=sys` (security = system) mode is being used. + +### RHEL / CentOS + +Install the required software packages: + +``` +# RHEL5 / CentOS5 +yum -y install portmap nfs-utils nfs4-acl-tools + +# RHEL6 / CentOS6 / RHEL7 / CentOS7 +yum -y install rpcbind nfs-utils nfs4-acl-tools +``` + +Some releases of the `nfs-utils` package may have buggy behaviour trying to load gssd incorrectly, blacklist the module as a workaround if required. This usually manifests in a ~15 second delay when the `mount` command is issued until it completes. + +``` +modprobe -r rpcsec_gss_krb5 +echo "blacklist rpcsec_gss_krb5" > /etc/modprobe.d/blacklist-nfs-gss-krb5.conf +``` + +See [this bug](https://bugzilla.redhat.com/show_bug.cgi?id=1001934) and [this patch](http://article.gmane.org/gmane.linux.nfs/60081) for further details. 
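+
+A quick illustrative check that the blacklist took effect (not a required step): the module should no longer be loaded, and modprobe should report the blacklist entry.
+
+```
+lsmod | grep rpcsec_gss_krb5 || echo "rpcsec_gss_krb5 not loaded"
+modprobe -c | grep rpcsec_gss_krb5
+```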
+ + +Set a static callback port for NFSv4 4.0; the server will initiate and use this port to communicate with the client. The nfsv4.ko kernel module is loaded when the share is mounted, so there should be no need to unload the module first. + +``` +echo 'options nfs callback_tcpport=4005' > /etc/modprobe.d/nfsv4_callback_port.conf +``` + +> The above is no longer necessary with NFSv4 4.1, as the client will initiate the outgoing channel for callbacks instead of the server instantiating the connection. + +Next, edit `/etc/idmapd.conf` to set the `Domain` – all the servers and clients must be on the same domain, so set this to what the server has configured: + +``` +/etc/idmapd.conf + +[General] +Domain = example.com +``` + +Start the required services: + +``` +# RHEL5 / CentOS5 +service portmap start; chkconfig portmap on +service rpcidmapd start; chkconfig rpcidmapd on + +# RHEL6 / CentOS6 +service rpcbind start; chkconfig rpcbind on +service rpcidmapd start; chkconfig rpcidmapd on + +# RHEL7 / CentOS7 +systemctl start rpcbind nfs-idmap nfs-client.target +systemctl enable rpcbind nfs-idmap nfs-client.target +``` + +Make the destination directory for the mount, and test a simple mount with no advanced options: + +``` +# showmount -e 192.168.5.1 +# mkdir /data +# mount -t nfs4 192.168.5.1:/data /data +# df -h /data +# umount /data +``` + +Now that it's confirmed working, add it to `/etc/fstab` as a standard mount. + +``` +/etc/fstab + +192.168.5.1:/data /data nfs4 sec=sys,noatime 0 0 +``` + +Test the mount and check the desired options were applied. + +> On RHEL5 a warning about rpc.gssd not running may occur; since we are not using Kerberos this can be ignored + +``` +# mount /data +# touch /data/test-file +# df -h /data +# grep /data /proc/mounts +``` + +With NFSv4 and the use of _idmapd_ additional testing should be performed to ensure the user mapping is performing as expected if possible. Make a user account on the server and client with the same UIDs, then test setting ownership on one side is mapped to the same user on the other: + +``` +# NFSv4 server +useradd -u 5555 test + +# NFSv4 client +useradd -u 5555 test + +# NFSv4 server +chown test /exports/data/test-file +ls -l /exports/data/test-file + +# NFSv4 client - should show 'test' owns the file +ls -l /data/test-file +``` + +Additionally, test that the `nfs4_setfacl` and `nfs4_getfacl` commands seem to perform as expected (the format is not the same as setfacl/getfacl), see `nfs4_acl(5)` man page for details. Note that the principal (user) listed is in the format `user@nfs.domain` - the same domain that was used in `/etc/idmapd.conf` which may not necessarily be the same has the hostname domain. 
+ +``` +# Add user 'test' to have read +nfs4_setfacl -a A::test@example.com:r /data/test-file + +# Look for expected result - some distros list user@domain, some the UID +nfs4_getfacl /data/test-file + + A::OWNER@:rwatTcCy + A::test@example.com:rtcy (or: A::5555:rtcy) + A::GROUP@:rtcy + A::EVERYONE@:rtcy + +# From the *server*, try a standard 'getfacl' and it should also show up +getfacl /exports/data/test-file + + # file: exports/data/test-file + # owner: test + # group: root + user::rw- + user:test:r-- + group::r-- + mask::r-- + other::r-- +``` + +### Debian / Ubuntu + +Install the required software packages: + +``` +apt-get update +apt-get install rpcbind nfs-common nfs4-acl-tools +``` + +Stop the auto-started services so we can configure them: + +``` +# Debian 7 +service nfs-common stop +service rpcbind stop +modprobe -r nfsd nfs lockd + +# Ubuntu 14 +service statd stop +service idmapd stop +service rpcbind stop +modprobe -r nfsd nfs lockd +``` + +Some releases of the `nfs-common` package may have buggy behaviour trying to load gssd incorrectly; blacklist the `rpcsec_gss_krb5` module as a workaround if this problem is encountered. The problem usually manifests in a ~15 second delay when the `mount` command is issued until it completes, and has `RPC: AUTH_GSS upcall timed out` in dmesg output. If these symptoms are encountered, use this method: + +``` +modprobe -r rpcsec_gss_krb5 +echo "blacklist rpcsec_gss_krb5" > /etc/modprobe.d/blacklist-nfs-gss-krb5.conf +``` + +See [this bug](https://bugzilla.redhat.com/show_bug.cgi?id=1001934) and [this patch](http://article.gmane.org/gmane.linux.nfs/60081) for further details. + + +Set a static callback port for NFSv4 4.0; the server will initiate and use this port to communicate with the client. The nfsv4.ko kernel module is loaded when the share is mounted, so there should be no need to unload the module first. + +``` +echo 'options nfs callback_tcpport=4005' > /etc/modprobe.d/nfsv4_callback_port.conf +``` + +> The above is no longer necessary with NFSv4 4.1, as the client will initiate the outgoing channel for callbacks instead of the server instantiating the connection. + +Next, edit `/etc/idmapd.conf` to set the `Domain` – all the servers and clients must be on the same domain, so set this to what the server has configured: + +``` +/etc/idmapd.conf + +[General] +Domain = example.com +``` + +Edit `/etc/default/nfs-common` to indicate that _idmapd_ is required and _statd_ is not, along with _rpc.gssd_: + +``` +## Debian 7 +# egrep -v "^(#|$)" /etc/default/nfs-common +NEED_STATD=no +STATDOPTS= +NEED_IDMAPD=yes +NEED_GSSD=no + +## Ubuntu 14 +# egrep -v "^(#|$)" /etc/default/nfs-common +NEED_STATD=no +STATDOPTS= +NEED_GSSD=no +``` + +Start the required services: + +``` +# Debian 7 +service rpcbind start; insserv rpcbind +service nfs-common start; insserv nfs-common + +# Ubuntu 14 - Upstart controlled rpcbind/statd +service rpcbind start +service idmapd start +``` + +Make the destination directory for the mount, and test a simple mount with no advanced options: + +``` +# showmount -e 192.168.5.1 +# mkdir /data +# mount -t nfs4 192.168.5.1:/data /data +# df -h /data +# umount /data +``` + +Now that it's confirmed working, add it to `/etc/fstab` as a standard mount. + +``` +/etc/fstab + +192.168.5.1:/data /data nfs4 sec=sys,noatime 0 0 +``` + +Test the mount and check the desired options were applied. 
+
+```
+# mount /data
+# touch /data/test-file
+# df -h /data
+# grep /data /proc/mounts
+```
+
+With NFSv4 and the use of _idmapd_, additional testing should be performed to ensure the user mapping performs as expected, if possible. Make a user account on the server and client with the same UID, then test that setting ownership on one side is mapped to the same user on the other:
+
+```
+# NFSv4 server
+useradd -u 5555 test
+
+# NFSv4 client
+useradd -u 5555 test
+
+# NFSv4 server
+chown test /exports/data/test-file
+ls -l /exports/data/test-file
+
+# NFSv4 client - should show 'test' owns the file
+ls -l /data/test-file
+```
+
+Additionally, test that the `nfs4_setfacl` and `nfs4_getfacl` commands perform as expected (the format is not the same as setfacl/getfacl); see the `nfs4_acl(5)` man page for details. Note that the principal (user) listed is in the format `user@nfs.domain` - the same domain that was used in `/etc/idmapd.conf`, which may not necessarily be the same as the hostname domain.
+
+```
+# Add user 'test' to have read
+nfs4_setfacl -a A::test@example.com:r /data/test-file
+
+# Look for expected result - some distros list user@domain, some the UID
+nfs4_getfacl /data/test-file
+
+    A::OWNER@:rwatTcCy
+    A::test@example.com:rtcy    (or: A::5555:rtcy)
+    A::GROUP@:rtcy
+    A::EVERYONE@:rtcy
+
+# From the *server*, try a standard 'getfacl' and it should also show up
+getfacl /exports/data/test-file
+
+    # file: exports/data/test-file
+    # owner: test
+    # group: root
+    user::rw-
+    user:test:r--
+    group::r--
+    mask::r--
+    other::r--
+```
+
+
+## Additional Reading
+
+ - [NFS Debugging](nfs_debugging.md)
diff --git a/md/oracle_environment.md b/md/oracle_environment.md
new file mode 100644
index 0000000..0e2514c
--- /dev/null
+++ b/md/oracle_environment.md
@@ -0,0 +1,128 @@
+# Oracle Environment
+
+## Contents
+
+ - [Kernel Parameters](#kernel-parameters)
+ - [Userspace Setup](#userspace-setup)
+ - [Automatic Storage Management (ASM)](#automatic-storage-management-asm)
+ - [Real Application Cluster (RAC)](#real-application-cluster-rac)
+   - [RAC Networking](#rac-networking)
+   - [Shared Storage](#shared-storage)
+   - [Virtual IP Setup](#virtual-ip-setup)
+
+
+## Kernel Parameters
+
+Oracle Global Customer Support officially recommends a maximum for SHMMAX of "1/2 of physical RAM".
+
+The maximum size of a shared memory segment is limited by the size of the available user address space; on 64-bit systems, this is a theoretical 2^64 bytes. So the theoretical limit for SHMMAX is the amount of physical RAM that you have. However, actually attempting to use such a value could potentially lead to a situation where no system memory is available for anything else. Therefore a more realistic physical limit for SHMMAX would probably be "physical RAM - 2G".
+
+In an Oracle RDBMS application, this physical limit still leaves inadequate system memory for other necessary functions. Therefore, the common Oracle maximum for SHMMAX that you will often see is "1/2 of physical RAM". Operators may erroneously think that setting SHMMAX as recommended limits the total SGA, which is untrue. Setting SHMMAX as recommended only causes a few more shared memory segments to be used for whatever total SGA you subsequently configure in Oracle.
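+
+As a hedged illustration of turning those rules of thumb into concrete values (the SHMALL line approximates the "sum of all SGAs" with half of RAM, matching the SHMMAX guidance - adjust to the real SGA total):
+
+```
+# half of physical RAM in bytes, and the matching page count for shmall
+mem_bytes=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
+echo "kernel.shmmax = $(( mem_bytes / 2 ))"
+echo "kernel.shmall = $(( mem_bytes / 2 / $(getconf PAGE_SIZE) ))"
+```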
+
+Modify your kernel settings in /etc/sysctl.conf as follows. If the current value for any parameter is higher than the value listed in this table, do not change the value of that parameter. Range values (such as net.ipv4.ip\_local\_port\_range) must match exactly.
+
+```
+kernel.shmall = physical RAM size / pagesize
+kernel.shmmax = 1/2 of physical RAM
+kernel.shmmni = 4096
+kernel.sem = 250 32000 100 128
+fs.file-max = 512 x processes (for example 6815744 for 13312 processes)
+fs.aio-max-nr = 1048576
+net.ipv4.ip_local_port_range = 9000 65500
+net.core.rmem_default = 262144
+net.core.rmem_max = 4194304
+net.core.wmem_default = 262144
+net.core.wmem_max = 1048576
+```
+
+Set shmall equal to the sum of all the SGAs on the system, divided by the page size. The SGA values can be calculated with a one-line script:
+
+```
+# su - oracle
+$ SGA=`echo "show sga"|sqlplus -s / as sysdba|grep "^Total System"|awk '{print $5}'`; PAGE=`getconf PAGE_SIZE`; echo "$SGA/$PAGE" | bc
+```
+
+
+## Userspace Setup
+
+Oracle groups and user(s): normally two groups ('oinstall' and 'dba') and one user ('oracle'). The 'oracle' user has a hefty custom environment configured with all the variables needed. This is why we 'su - oracle' and not just 'su oracle' when needing to run sqlplus - a full login shell with all variables initialized is required.
+
+Add the following settings to `/etc/security/limits.conf` for the 'oracle' user (adjust as needed):
+
+```
+oracle soft nproc 2047
+oracle hard nproc 16384
+oracle soft nofile 1024
+oracle hard nofile 65536
+oracle soft stack 10240
+```
+
+
+## Automatic Storage Management (ASM)
+
+Oracle supplies its own kernel module for ASM use, which the DBA team will install; ASM can be summarized by three main points:
+
+ - Direct I/O to storage (bypasses kernel buffering)
+ - Solves the 4k block size limitation of ext3
+ - Cluster-aware filesystem on raw devices (RAC)
+
+Unlike other kernel modules using DKMS, Oracle provides pre-compiled binaries for very specific versions of the Red Hat Enterprise kernel, which can be found via their landing page below. They also have a great intro article on learning more about how it actually works.
+
+ - [Oracle ASMLib Downloads](http://www.oracle.com/technetwork/topics/linux/asmlib/index-101839.html)
+ - [Introduction to Automatic Storage Management](http://docs.oracle.com/cd/B28359_01/server.111/b31107/asmcon.htm)
+
+
+## Real Application Cluster (RAC)
+
+An Oracle RAC is typically 2+ machines in a cluster with shared storage, although it is possible to configure a single-node RAC.
+
+### RAC Networking
+
+Oracle RAC needs a private network between the servers, in some cases using Jumbo Frames and/or 10G switches. The Oracle RAC nodes use this private network link for inter-node communication of large amounts of data over UDP; terabytes of both RX and TX traffic per month is not uncommon for highly active cluster nodes. It is common for the RAC nodes to use 8k UDP packets on this private network as they pass table data back and forth to stay in sync.
+
+**Server A**
+
+ - bond0 (eth0 / eth1) - 172.16.10.5
+ - bond1 (eth2 / eth3) - 10.10.10.5 (RAC Interconnect)
+
+**Server B**
+
+ - bond0 (eth0 / eth1) - 172.16.10.6
+ - bond1 (eth2 / eth3) - 10.10.10.6 (RAC Interconnect)
+
+
+### Shared Storage
+
+The shared storage should have a minimum of 7 LUNs presented to the servers: 5x 1G control and 2+ xxxG data. More may be used to further spread out the data for better performance; RAID-10 is the suggested design, across as many spindles as possible.
+
+ - 5x 1G Raw Control LUNs
+   - 2x LUNs for OCR
+     - Oracle Cluster Registry: the OCR stores the details of the cluster configuration, including the names and current status of the database, associated instances, services, and node applications such as the listener process.
+   - 3x LUNs for Voting
+     - CSS Voting Disks are used to determine which nodes are currently available within the cluster. An odd number is always used.
+ - 2+ xxxG ASM Data LUNs
+   - 1 LUN has one set of data, control, redo
+   - 1 LUN has one set of data, control, redo, archivelog
+
+It is common that the Server Parameter File (SPFILE) is stored on the ASM disks, and there is no Flash Recovery Area (FRA) unless specifically requested. The FRA is typically twice as large as the Data LUNs.
+
+
+### Virtual IP Setup
+
+A RAC requires 5 additional IP addresses from the same subnet as the NAT IPs to be used by Oracle; they are not configured on the servers in the traditional fashion.
+
+ - The two **VIP** addresses are considered legacy for 11gR2; their use is deprecated. It's possible that an Oracle client may have an older JDBC driver that talks only "VIP" style.
+ - The three SCAN addresses are round-robin returned by a standard NS resolve on the client end; one node's listener ("TNS Listener") has two of the IPs configured, the other node has one IP. When an Oracle client does a NS lookup on the SCAN DNS name, it connects to the TNS Listener on that IP; the TNS Listener is a load balancer and may actually hand communication to another node in the cluster (so not always its local node).
+ - The placement of how the client resolves the listener IPs should be done in such a way that we don't insert a point of failure. For instance, it's a bad idea to host the DNS record for the SCAN across a site-to-site VPN link; if that link goes down then no clients can connect\! It is a best practice that the DNS record for the SCAN be hosted in such a way that it's redundant and reachable via multiple paths.
+
+During configuration the IPs need to be in DNS:
+
+ - Forward and reverse list the Linux server hostnames in DNS with their **primary public IPs**.
+ - DNS list the SCAN name with all three SCANx IP addresses; you're in effect creating a round-robin lookup in DNS for the same single-named record, not creating unique records for each IP.
+
+```
+scan01.mydomain.com IN A 172.16.30.52
+scan01.mydomain.com IN A 172.16.30.53
+scan01.mydomain.com IN A 172.16.30.54
+```
diff --git a/md/raid_penalties.md b/md/raid_penalties.md
new file mode 100644
index 0000000..cdbe424
--- /dev/null
+++ b/md/raid_penalties.md
@@ -0,0 +1,47 @@
+# RAID Penalties
+
+## Contents
+
+ - [Write Penalty](#write-penalty)
+ - [Penalty Calculation](#penalty-calculation)
+
+
+## Write Penalty
+
+The write penalty reflects how the RAID level handles the stripe (parity, mirroring, etc.), where 1 means no penalty.
+
+| **RAID Level** | **Write Penalty** |
+| -------------- | ----------------- |
+| 0 | 1 |
+| 1 | 2 |
+| 5 | 4 |
+| 10 | 2 |
+
+
+## Penalty Calculation
+
+ - **raw IOPS** = disk speed IOPS \* number of disks
+ - **functional IOPS** = (raw IOPS \* write% / write penalty) + (raw IOPS \* read%)
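+
+A small illustrative shell translation of the functional-IOPS formula above (values chosen to match the RAID-5 example that follows):
+
+```
+raw=900; writes=0.16; penalty=4    # 5x 15k SAS disks, RAID-5
+awk -v r=$raw -v w=$writes -v p=$penalty \
+    'BEGIN { printf "functional IOPS: %.0f\n", (r*w/p) + (r*(1-w)) }'
+# -> functional IOPS: 792
+```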
+
+Given 5x 15k SAS drives @ 900 IOPS (180/disk) as an average, a sample calculation for RAID-5, with varying levels of write vs read percentage:
+
+```
+16% writes: (900*.16/4)+(900*.84) = 792 / 900 = 88.0% efficiency
+10% writes: (900*.10/4)+(900*.90) = 832 / 900 = 92.4% efficiency
+ 5% writes: (900*.05/4)+(900*.95) = 866 / 900 = 96.2% efficiency
+```
+
+Reducing the write percentage from 16% to 10% would yield an efficiency gain of 4.4%; from 16% to 5%, an 8.2% gain. Writing to RAID-5 and calculating parity is very costly - and we haven't even talked about linear writes vs. scattered block writes. As an exercise, if the same 5-disk RAID-5 were converted to an 8-disk RAID-10 (to achieve the same capacity for data):
+
+```
+8*180 = 1440 raw IOPs
+(1440*.16/2)+(1440*.84) = 1324 / 792 = 167% efficiency
+```
+
+This is all very basic math that doesn't take into account real-world load, and it deals with theoretical maximums based on published standards. Use a tool such as `fio` to obtain real-world performance.
diff --git a/md/reducing_the_root_lv.md b/md/reducing_the_root_lv.md
new file mode 100644
index 0000000..113a425
--- /dev/null
+++ b/md/reducing_the_root_lv.md
@@ -0,0 +1,119 @@
+# Reducing the root LV
+
+## Contents
+
+ - [Overview](#overview)
+ - [Procedure](#procedure)
+   - [Without lvresize -r flag](#without-lvresize--r-flag)
+   - [With lvresize -r flag](#with-lvresize--r-flag)
+
+
+## Overview
+
+Reducing the root logical volume requires booting into a rescue environment that has the LVM utilities.
+
+> Some rescue images ship a version of the LVM utilities that is missing a critical shim needed to resize filesystems using the `-r` flag to `lvresize`. It may or may not be necessary to resize the filesystem as a separate step; both procedures are outlined below.
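+
+A hedged probe to help pick a procedure (assumes the rescue shell provides the `lvm` binary; treat a positive result as a hint, since the shim noted above can still be absent even when the flag is advertised):
+
+```
+lvm lvresize --help 2>&1 | grep -q -- --resizefs \
+    && echo "lvresize -r advertised: try the -r procedure" \
+    || echo "no -r support: use the separate resize2fs procedure"
+```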
+ + +## Procedure + +Given this current configuration of a 250G boot volume: + +| **Mount** | **Size** | **VG / LV** | +| --------- | --------- | ---------------- | +| /boot | 250M | n/a | +| /tmp | 2G | vglocal / lvtmp | +| swap | 2G | vglocal / lvswap | +| / | remainder | vglocal / lvroot | + + +We'll reduce the root (/) volume and grow swap and /tmp to end up with: + +| **Mount** | **Size** | **VG / LV** | +| --------- | --------- | ---------------- | +| /boot | 250M | n/a | +| /tmp | 4G | vglocal / lvtmp | +| swap | 32G | vglocal / lvswap | +| / | remainder | vglocal / lvroot | + + +### Without lvresize -r flag + +``` +## +## DO NOT MOUNT ANY FILESYSTEMS DURING RESCUE BOOT +## + +# activate all LVM +lvm vgchange -a y + +# fsck the filesystems; it's normal to get a message about a time error that needs fixed +fsck -fC /dev/vglocal/lvroot +fsck -fC /dev/vglocal/lvtmp + +# shrink the ext3/4 root far below what we need +resize2fs -p /dev/vglocal/lvroot 200G + +# reduce the root LV a bit above what we just resized (+5GB) +lvm lvresize /dev/vglocal/lvroot --size 205G + +# increase swap and tmp LVs +lvm lvresize /dev/vglocal/lvswap --size 32G +lvm lvresize /dev/vglocal/lvtmp --size 4G + +# re-grow the root LV back to max space +lvm lvresize -l +100%FREE /dev/vglocal/lvroot + +# re-grow the / and /tmp ext3/4 to fill the increased LVs +resize2fs -p /dev/vglocal/lvroot +resize2fs -p /dev/vglocal/lvtmp + +# fsck the filesystems again +fsck -fC /dev/vglocal/lvroot +fsck -fC /dev/vglocal/lvtmp + +# rescue image 'mkswap' is sometimes not able to see a large swap +reboot + +# finally, make a new swap signature that sees the whole LV +swapoff /dev/vglocal/lvswap +mkswap /dev/vglocal/lvswap +swapon /dev/vglocal/lvswap +``` + +### With lvresize -r flag + +``` +## +## DO NOT MOUNT ANY FILESYSTEMS DURING RESCUE BOOT +## + +# activate all LVM +lvm vgchange -a y + +# fsck the filesystems; it's normal to get a message about a time error that needs fixed +fsck -fC /dev/vglocal/lvroot +fsck -fC /dev/vglocal/lvtmp + +# reduce the size of LV root a tad more than we need +lvm lvresize -r /dev/vglocal/lvroot --size 200G + +# increase swap and tmp LVs +lvm lvresize /dev/vglocal/lvswap --size 32G +lvm lvresize /dev/vglocal/lvtmp --size 4G + +# re-grow the root LV back to max space +lvm lvresize -r -l +100%FREE /dev/vglocal/lvroot + +# fsck the filesystems again +fsck -fC /dev/vglocal/lvroot +fsck -fC /dev/vglocal/lvtmp + +# rescue image 'mkswap' is sometimes not able to see a large swap +reboot + +# finally, make a new swap signature that sees the whole LV +swapoff /dev/vglocal/lvswap +mkswap /dev/vglocal/lvswap +swapon /dev/vglocal/lvswap +``` diff --git a/md/rhcs_mechanics.md b/md/rhcs_mechanics.md new file mode 100644 index 0000000..d9847dd --- /dev/null +++ b/md/rhcs_mechanics.md @@ -0,0 +1,478 @@ +# RHCS Mechanics + +## Contents + + - [Acronyms](#acronyms) + - [Configuration Files](#configuration-files) + - [Filesystem Locations](#filesystem-locations) + - [Operational Commands](#operational-commands) + - [Cluster Components](#cluster-components) + - [Operational Examples](#operational-examples) + - [Configuration Validation](#configuration-validation) + - [Status Check](#status-check) + - [Service Manipulation](#service-manipulation) + - [Configuration Examples](#configuration-examples) + - [Standard LVM and PgSQL Initscript](#standard-lvm-and-pgsql-initscript) + - [HA-LVM and MySQL Object](#ha-lvm-and-mysql-object) + - [Standard LVM, MySQL script and NFS](#standard-lvm-mysql-script-and-nfs) + - [HA-LVM and 
NFS Object](#ha-lvm-and-nfs-object)
 - [References](#references)


## Acronyms

 - **AIS**: Application Interface Specification
 - **AMF**: Availability Management Framework
 - **CCS**: Cluster Configuration System
 - **CLM**: Cluster Membership
 - **CLVM**: Cluster Logical Volume Manager
 - **CMAN**: Cluster Manager
 - **DLM**: Distributed Lock Manager
 - **GFS2**: Global File System 2
 - **GNBD**: Global Network Block Device
 - **STONITH**: Shoot The Other Node In The Head
 - **TOTEM**: Group communication algorithm for reliable group messaging among cluster members


## Configuration Files

 - `/etc/cluster/cluster.conf` - The main cluster configuration file
 - `/etc/lvm/lvm.conf` - The LVM configuration file; typically `locking_type` and a `filter` are configured here


## Filesystem Locations

 - `/usr/share/cluster/` - The main directory of code used for cluster objects
 - `/var/log/cluster/` - The main logging directory (**RHEL6**)


## Operational Commands

**Graphical Cluster Configuration**

 - `luci` - Cluster Management Web Interface primarily used with **RHEL6**
 - `system-config-cluster` - Cluster Management X11/Motif Interface primarily used with **RHEL5**

**RGManager** - Resource Group Manager

 - `clustat` - Display the status of the cluster, including node membership and running services
 - `clusvcadm` - Manually enable, disable, relocate, and restart user services in a cluster
 - `rg_test` - Debug and test services and resource ordering

**CCS** - Cluster Configuration System

 - `ccs_config_validate` - Verify a configuration; can validate the running config or a named file (**RHEL6**)
 - `ccs_config_dump` - Generate XML output of the running configuration (**RHEL6**)
 - `ccs_sync` - Synchronize the cluster configuration file to one or more machines in a cluster (**RHEL6**)
 - `ccs_update_schema` - Update the cluster relaxng schema that validates cluster.conf (**RHEL6**)
 - `ccs_test` - Diagnostic and testing command used to retrieve information from configuration files via **ccsd**
 - `ccs_tool` - Make online updates of CCS configuration files - **considered obsolete**

**CMAN** - Cluster Manager

 - `cman_tool` - The administrative front end to CMAN; starts and stops the CMAN infrastructure and can perform changes
 - `group_tool` - Get a list of groups related to fencing, DLM, and GFS, and gather debug information
 - `fence_XXXX` - Fence agent for the XXXX type of device; for example `fence_drac` (Dell DRAC), `fence_ipmilan` (IPMI) and `fence_ilo` (HP iLO)
 - `fence_check` - Test the fence configuration for each node in the cluster
 - `fence_node` - Perform I/O fencing on a single node
 - `fence_tool` - Join and leave the fence domain
 - `dlm_tool` - Utility for the `dlm` and `dlm_controld` daemons
 - `gfs_control` - Utility for the `gfs_controld` daemon

**GFS2** - Global File System 2

 - `mkfs.gfs2` - Creates a GFS2 file system on a storage device
 - `mount.gfs2` - Mounts a GFS2 file system; normally not invoked by the user directly
 - `fsck.gfs2` - Repairs an unmounted GFS2 file system
 - `gfs2_grow` - Grows a mounted GFS2 file system
 - `gfs2_jadd` - Adds journals to a mounted GFS2 file system
 - `gfs2_quota` - Manages quotas on a mounted GFS2 file system
 - `gfs2_tool` - Configures, tunes, and gathers information about a GFS2 file system

**Quorum Disk**

 - `mkqdisk` - Cluster Quorum Disk Utility


## Cluster Components

**RGManager** - Resource Group Manager

 - `rgmanager` - Daemon that handles user service requests, including service start, disable, relocate, and restart; **RHEL6**
 - `clurgmgrd` - Daemon that handles user service requests, including service start, disable, relocate, and restart; **RHEL5**
 - `cpglockd` - Uses the extended virtual synchrony features of Corosync to implement a simple, distributed lock server for rgmanager

**CLVM** - Cluster Logical Volume Manager

 - `clvmd` - The daemon that distributes LVM metadata updates around a cluster; requires `cman` to be running first

**CCS** - Cluster Configuration System

 - `ricci` - CCS daemon that runs on all cluster nodes and provides configuration file data to the cluster software; **RHEL6**
 - `ccsd` - CCS daemon that runs on all cluster nodes and provides configuration file data to the cluster software; **RHEL5**

**CMAN** - Cluster Manager

 - `cman` - Cluster initscript used to start/stop all the CMAN daemons
 - `corosync` - Corosync cluster communications infrastructure daemon using TOTEM; **RHEL6**
 - `aisexec` - OpenAIS cluster communications infrastructure daemon using TOTEM; **RHEL5**
 - `fenced` - Fences cluster nodes that have failed (fencing generally means rebooting)
 - `dlm_controld` - Daemon that configures DLM according to cluster events
 - `gfs_controld` - Daemon that coordinates GFS mounts and recovery
 - `groupd` - Compatibility daemon for `fenced`, `dlm_controld` and `gfs_controld`
 - `qdiskd` - Talks to CMAN and provides a mechanism for determining node fitness in a cluster environment
 - `cmannotifyd` - Talks to CMAN and provides a mechanism to notify external entities about cluster changes


## Operational Examples

The man pages for `clustat` and `clusvcadm` explain the options shown here in more depth and document additional options that are not shown.

### Configuration Validation

RHEL5 does not ship the `ccs_config_validate` utility, but the configuration can still be validated against the cluster schema with `xmllint`:

```
xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng /etc/cluster/cluster.conf
```

When run, this prints the well-formed XML file followed by a final message about validation.

### Status Check

Use the `clustat` command to check the cluster status:

```
# clustat
Cluster Status for cluster1 @ Fri Jan 17 16:49:45 2014
Member Status: Quorate

 Member Name                         ID   Status
 ------ ----                         ---- ------
 node1                               1    Online, Local, rgmanager
 node2                               2    Online, rgmanager

 Service Name                        Owner (Last)        State
 ------- ----                        ----- ------        -----
 service:pgsql-svc                   node1               started
```

### Service Manipulation

Use the `clusvcadm` command to manipulate services:

```
# Restart PostgreSQL in place on the same server
clusvcadm -R pgsql-svc

# Relocate PostgreSQL to a specific node (node2 here)
clusvcadm -r pgsql-svc -m node2

# Disable PostgreSQL
clusvcadm -d pgsql-svc

# Enable PostgreSQL
clusvcadm -e pgsql-svc

# Freeze PostgreSQL on the current node
clusvcadm -Z pgsql-svc

# Unfreeze PostgreSQL after it was frozen
clusvcadm -U pgsql-svc
```


## Configuration Examples

### Standard LVM and PgSQL Initscript

This example uses a single standard LVM mount from SAN (as opposed to HA-LVM) and a stock initscript to start the service. An IP on a secondary backup network is included, as are the Dell DRAC fencing devices on the same VLAN; the listings follow below.
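Before deploying a configuration like the one below, `rg_test` can show how rgmanager will parse and order the resource tree. A brief sketch, reusing the `pgsql-svc` service name from the examples above:

```
# validate resource rules and display the parsed resource tree
rg_test test /etc/cluster/cluster.conf

# show the actions rgmanager would take to start the service, without running them
rg_test noop /etc/cluster/cluster.conf start service pgsql-svc
```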
+

```
/etc/hosts

127.0.0.1      localhost localhost.localdomain
10.11.12.10    pgdb1.example.com pgdb1
10.11.12.11    pgdb2.example.com pgdb2
10.11.12.20    pgdb1-drac
10.11.12.21    pgdb2-drac

/etc/cluster/cluster.conf

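<?xml version="1.0"?>
<!--
  Minimal two-node sketch for this layout. Node names and DRAC addresses
  come from /etc/hosts above; the VG/LV names (vgsan/lvpgsql), mount point,
  service IPs, and DRAC credentials are illustrative placeholders only.
-->
<cluster name="pgsqlcluster" config_version="1">
  <!-- two-node special case: quorum with a single vote -->
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="pgdb1.example.com" nodeid="1">
      <fence>
        <method name="1">
          <device name="pgdb1-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="pgdb2.example.com" nodeid="2">
      <fence>
        <method name="1">
          <device name="pgdb2-drac"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_drac" name="pgdb1-drac" ipaddr="10.11.12.20" login="root" passwd="CHANGEME"/>
    <fencedevice agent="fence_drac" name="pgdb2-drac" ipaddr="10.11.12.21" login="root" passwd="CHANGEME"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="pgsql-fd" ordered="0" restricted="1">
        <failoverdomainnode name="pgdb1.example.com"/>
        <failoverdomainnode name="pgdb2.example.com"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <!-- standard (non-HA-LVM) mount: the fs resource sits directly on the SAN LV -->
      <fs name="pgsql-fs" device="/dev/vgsan/lvpgsql" mountpoint="/var/lib/pgsql" fstype="ext4" force_unmount="1"/>
      <ip address="10.11.12.30" monitor_link="1"/>
      <!-- floating IP on the secondary backup network (placeholder subnet) -->
      <ip address="192.168.50.30" monitor_link="1"/>
      <script name="pgsql-init" file="/etc/init.d/postgresql"/>
    </resources>
    <service name="pgsql-svc" domain="pgsql-fd" autostart="1" recovery="relocate">
      <fs ref="pgsql-fs"/>
      <ip ref="10.11.12.30"/>
      <ip ref="192.168.50.30"/>
      <script ref="pgsql-init"/>
    </service>
  </rm>
</cluster>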