Introduction to Git
These are notes on the DataCamp course taught by Greg Wilson, co-founder of Software Carpentry.
Basic workflow
Git is a modern version control tool created by Linus Torvalds in 2005, and it is now very popular with data scientists and software developers. It can keep track of changes to files, notice conflicts between changes made by different people, and synchronize files between different computers.
A repository is the combination of two parts: the project's files and directories, and the historical information Git records about them, which is stored in a directory called `.git` located in the repository's root directory.
- `git status`: check the status of your repository
- `git diff filename`: show the changes to a file that have not been staged yet
- `git add filename`: add a file to the staging area
- `git diff -r HEAD path/to/file`: compare the current state of a file, including staged changes, with the most recent commit. The course describes the `-r` flag as "compare to a particular revision", and `HEAD` means "the most recent commit"; plain `git diff HEAD path/to/file` produces the same comparison.
- `nano filename`: use Nano to edit `filename`
- `git commit -m "some message in quotes"`: commit the changes in the staging area with a log message; `git commit --amend -m "new message"` changes the message of the most recent commit
- `git log`: view a repository's history; `git log -3 filename` shows the last three commits involving a specific file
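The commands above can be strung together into one short session. This is a minimal sketch, not part of the course material: the temporary directory, the placeholder identity, and the file name `report.md` are all made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for this repo only
git config user.email "user@example.com"
export GIT_PAGER=cat                       # print output directly instead of paging

echo "first draft" > report.md
git status --short                         # "?? report.md": the file is untracked
git add report.md                          # stage the new file
git commit -q -m "Add report"              # record the first commit

echo "second line" >> report.md
git diff report.md                         # unstaged change vs. the staging area
git add report.md
git diff HEAD report.md                    # staged change vs. the most recent commit
git commit -q -m "Extend report"

# Reword the most recent commit, then look at the history of one file.
git commit -q --amend -m "Extend report with a second line"
git log --oneline -3 report.md
```

Amending replaces the last commit rather than adding a new one, so the history still contains two commits afterwards.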
Repositories
Git uses a three-level structure for information stored by each commit.
- A commit contains metadata such as the author, the commit message, and the time the commit happened.
- A tree tracks the names and locations of the files in the repository when that commit happened.
- A blob (short for binary large object) contains a compressed snapshot of the contents of the file when the commit happened.
In the course's diagram, the oldest (top) commit tracks two files. `report.md` and `draft.md` were changed in the middle commit, so their blobs are shown next to that commit. `data/northern.csv` didn't change in that commit, so the tree links to the blob from the previous commit. Reusing blobs between commits helps make common operations fast and minimizes storage space.
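One way to see the commit, tree, and blob levels for yourself is `git cat-file`, a standard Git command that the notes above do not cover. The sketch below is an assumption-laden illustration: the repository is a throwaway, and the file names only mirror the ones in the diagram description.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

mkdir data
echo "id,value" > data/northern.csv
echo "# Report" > report.md
git add .
git commit -q -m "Add data and report"
blob_before=$(git rev-parse HEAD:data/northern.csv)   # blob hash in commit 1

echo "More text" >> report.md
git commit -q -a -m "Update report"
blob_after=$(git rev-parse HEAD:data/northern.csv)    # blob hash in commit 2

git cat-file -p HEAD              # commit: tree hash, parent, author, message
git cat-file -p 'HEAD^{tree}'     # tree: names mapped to blobs and sub-trees
git cat-file -p HEAD:report.md    # blob: the file contents at that commit

# The CSV did not change, so both commits point at the same blob.
echo "$blob_before"
echo "$blob_after"
```

The two hashes printed at the end are identical, which is the blob reuse described above.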
- A hash is a unique identifier for every commit, which enables Git to share data efficiently between repositories.
- The special label `HEAD` is another way to identify a specific commit. It always refers to the most recent commit on the current branch. The label `HEAD~1` then refers to the commit before it, while `HEAD~2` refers to the commit before that, and so on.
- `git annotate file`: show who made the last change to each line of a file and when
- `git diff ID1..ID2`: show the changes between two commits, where `..` is a pair of dots
- A `.gitignore` file in the root directory tells Git to ignore certain files.
- `git clean`: only works on untracked files. `git clean -n` shows a list of files whose history Git is not currently tracking; `git clean -f` deletes those files for good.
- `git config --list`: see what the settings are, with one of three additional options:
  - `--system`: settings for every user on this computer
  - `--global`: settings for every one of your projects
  - `--local`: settings for one specific project
- `git config --global setting value`: change a configuration value for all of your projects on a particular computer
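Here is a rough sketch of how these pieces fit together in a scratch repository. The file names (`notes.txt`, `debug.log`, `scratch.tmp`) and the identity are hypothetical, and `--local` is used instead of `--global` so the example does not touch your real settings.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config --local user.name "Example User"       # --local: this repository only
git config --local user.email "user@example.com"
export GIT_PAGER=cat                               # print output directly

# Build a history of three commits.
for n in 1 2 3; do
    echo "line $n" >> notes.txt
    git add notes.txt
    git commit -q -m "Add line $n"
done

git log --oneline                      # three commits, newest first
git diff HEAD~2..HEAD -- notes.txt     # changes between the first and third commit
git annotate notes.txt                 # who last changed each line, and when

echo "*.log" > .gitignore              # ignore every file ending in .log
git add .gitignore
git commit -q -m "Ignore log files"

touch debug.log scratch.tmp            # one ignored file, one plain untracked file
git clean -n                           # dry run: lists scratch.tmp only
git clean -f                           # deletes scratch.tmp; debug.log is ignored and kept

git config --list --local              # settings stored in this repository
```

Running `git clean -n` before `git clean -f` is a cheap safety check, since the deletion cannot be undone through Git.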
Undo
- `git reset HEAD`: unstage the additions. `git reset` on its own does the same and unstages everything; add a path to unstage a single file.
- `git checkout -- filename`: discard the changes that have not yet been staged; `git checkout -- .` reverts all files in the current directory.
- By combining `git reset HEAD path/to/file` with `git checkout -- path/to/file`, you can undo changes to a file even after you have staged them.
- You can think of committing as saving your work, and checking out as loading that saved version. `git checkout ID filename` replaces the current version of a file with the version identified by `ID`. Notice that this is the same syntax that you used to undo the unstaged changes, except `--` has been replaced by `ID`.
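The undo steps can be tried safely in a scratch repository. The following is only a sketch under assumed names: `data.txt` and its contents are invented, and the commit hash is captured in a shell variable called `first` rather than typed by hand.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"    # placeholder identity
git config user.email "user@example.com"

echo "version 1" > data.txt
git add data.txt
git commit -q -m "Version 1"
first=$(git rev-parse HEAD)            # remember this commit's hash

echo "version 2" > data.txt
git commit -q -a -m "Version 2"

# Undo a staged change: unstage it, then discard the edit.
echo "mistake" > data.txt
git add data.txt
git reset -q HEAD data.txt             # unstage; the file still says "mistake"
git checkout -- data.txt               # restore the last committed contents
cat data.txt                           # version 2

# Load an older saved version of a single file.
git checkout "$first" data.txt
cat data.txt                           # version 1
```

Checking out an old version of a file changes the working copy but adds no commit, so the history still has two entries until you commit again.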
Working with branches
Branches allow you to have multiple versions of your work and let you track each version systematically. A commit has two parents when branches are merged, which is why Git needs both trees and commits.
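- `git branch`: list all of the branches in a repository
- `git diff branch-1..branch-2`: show the difference between two branches
- `git checkout branch-name`: switch to a branch
- `git checkout -b branch-name`: create a branch and switch to it
- `git merge source`: merge the branch `source` into the branch that is currently checked out, so switch to the destination branch first. The course writes this as `git merge source destination`, but in current Git every branch named after `merge` is merged into the current branch.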
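A small end-to-end sketch of branching and merging follows. The branch name `analysis` and the file names are made up, and the default branch name is read from Git instead of assumed, since it may be `master` or `main` depending on your setup.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"    # placeholder identity
git config user.email "user@example.com"
export GIT_PAGER=cat                   # print output directly

echo "summary" > report.md
git add report.md
git commit -q -m "Start report"
base=$(git rev-parse --abbrev-ref HEAD)    # name of the default branch

git checkout -q -b analysis            # create a branch and switch to it
echo "results" > results.md
git add results.md
git commit -q -m "Add results"

git checkout -q "$base"                # back to the destination branch
echo "notes" > notes.md
git add notes.md
git commit -q -m "Add notes"

git branch                             # lists both branches, * marks the current one
git diff "$base"..analysis             # differences between the two branches
git merge -q -m "Merge analysis" analysis   # merge the source into the current branch
git log --oneline --graph              # the newest commit has two parents
```

Because both branches gained a commit after they split, the merge creates a new commit whose second parent is the tip of `analysis`.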
Collaborating
- `git init project-name`: create a repository for a new project in the current working directory
- `git init /path/to/project`: convert an existing project into a repository
- `git clone URL`: clone a repository; `git clone /existing/project` uses a path instead, and `git clone /existing/project newprojectname` calls the clone something else
- `git remote`: list the names of a repository's remotes; `git remote -v` ("v" for "verbose") also shows the remotes' URLs
- `git remote add remote-name URL`: add more remotes
- `git remote rm remote-name`: remove existing ones
- `git pull remote branch`: get everything in `branch` in the remote repository identified by `remote` and merge it into the current branch of your local repository. `git pull` is a combination of `git fetch` and `git merge`.
- `git push remote-name branch-name`: push the changes you have made locally into a remote repository
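The collaboration commands can be rehearsed entirely on one machine by using local paths as remotes. In this sketch, `shared.git` is a bare repository standing in for a hosted remote; that setup, along with the directory names `project` and `colleague`, is an assumption for illustration and not something the course prescribes.

```shell
work=$(mktemp -d)
cd "$work"
git init -q --bare shared.git              # stand-in for a hosted remote (assumption)

git init -q project                        # new repository in ./project
cd project
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"
branch=$(git rev-parse --abbrev-ref HEAD)  # "master" or "main", depending on setup

git remote add origin "$work/shared.git"   # register the remote under the name origin
git remote -v                              # show remote names and URLs
git push -q origin "$branch"               # publish the local commits

cd "$work"
git clone -q shared.git colleague          # clone from a path under a new name

cd "$work/project"
echo "more" >> README.md
git commit -q -a -m "Update README"
git push -q origin "$branch"

cd "$work/colleague"
git pull -q origin "$branch"               # fetch the new commit and merge it
git log --oneline                          # both commits are now present
```

After the pull, the clone in `colleague` points at the same commit as `project`.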
Useful links
- Step-by-step guide to contributing on GitHub
- Pro Git book
- http://git.io/sheet — a list of cool features of Git and GitHub
- Learn Git Branching — the visual and interactive way