**************************************************************************
***  Distributed/Parallel Optimization for Big Data Analytics         ***
***  Dr. Tianbao Yang                                                 ***
***  yangtia1@gmail.com                                               ***
**************************************************************************

Welcome to the birds library

==============================================================================
I. INTRODUCTION

The birds library implements distributed stochastic dual coordinate ascent
(SDCA) for classification and regression with broad support. This package
also supports optimizing dual-sparse regularized problems (useful for
learning from randomized reduced data) and recovering a full model from
dual variables learned on dimension-reduced data. See sbirds.cc and
ibirds.cc, respectively, and the examples below.

For technical details, see:

"Trading Computation for Communication: Distributed Stochastic Dual
Coordinate Ascent." Tianbao Yang. NIPS 2013.

"Analysis of Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang,
et al. Tech Report 2013.

"Theory of Dual-sparse Regularized Randomized Reduction." Tianbao Yang,
et al. ICML 2015.

The code is distributed under the GNU General Public License (see
license.txt for details).

Copyright (c) 2015 Tianbao Yang. All Rights Reserved.

==============================================================================
II. REQUIREMENTS AND INSTALL

- Platform: Linux/Unix, multi-core or clusters.
  In a cluster environment, make sure you can ssh to the other machines
  without a password (use public key authentication).

- Prerequisites: g++, Boost MPI library, Boost Serialization library.
  Before installing, check where your Boost library is installed and set
  MPI_BOOST_PATH in the Makefile accordingly.

To install, run the command:

$ make

==============================================================================
III. DESCRIPTION

The library supports binary classification, regression, and multi-class
classification.

For classification, it supports the following objectives:

1. SVM (hinge) + l2-norm:
   1/n \sum_i max(0, 1 - y_i w^T x_i) + lambda/2 |w|_2^2
2. SVM (squared hinge) + l2-norm:
   1/n \sum_i max(0, 1 - y_i w^T x_i)^2 + lambda/2 |w|_2^2
3. Logistic regression + l2-norm:
   1/n \sum_i log(1 + exp(-y_i w^T x_i)) + lambda/2 |w|_2^2

It also supports sparse regularization by solving the above objectives
with the l2-norm regularization replaced by the elastic net
regularization:

   lambda/2 * (|w|_2^2 + l1s * |w|_1)

For regression, it supports least square regression and lasso:

1. least square regression:
   1/n \sum_i 1/2 |y_i - w^T x_i|^2 + lambda/2 |w|_2^2
2. lasso:
   1/n \sum_i 1/2 |y_i - w^T x_i|^2 + lambda/2 (|w|_2^2 + l1s * |w|_1)

For multi-class classification, it supports one-vs-all with the
aforementioned objectives.
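To make the notation concrete, here is a small Python sketch (an
illustration only, not part of the library) that evaluates the three
classification objectives above on dense data; the names X, y, w, lam,
and l1s are our own.

   import numpy as np

   def primal_objective(X, y, w, lam, l1s=0.0, loss="l2svm"):
       """X: n x d feature matrix, y: labels in {-1,+1}, w: d-dim weights."""
       margins = y * (X @ w)                        # y_i * w^T x_i
       if loss == "l1svm":                          # hinge loss
           avg_loss = np.mean(np.maximum(0.0, 1.0 - margins))
       elif loss == "l2svm":                        # squared hinge loss
           avg_loss = np.mean(np.maximum(0.0, 1.0 - margins) ** 2)
       else:                                        # logistic loss
           avg_loss = np.mean(np.logaddexp(0.0, -margins))
       # elastic-net regularizer lambda/2 * (|w|_2^2 + l1s * |w|_1);
       # l1s = 0 recovers the plain l2-norm regularization
       return avg_loss + 0.5 * lam * (np.dot(w, w) + l1s * np.abs(w).sum())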
=================================================================================
IV. USAGE

For binary classification and regression:
   mpirun -np num_procs [mpi-options] birds [options]
For multi-class classification:
   mpirun -np num_procs [mpi-options] eagles [options]
For prediction:
   mpirun -np num_procs [mpi-options] eye [options]

A. mpi-options

Check any MPI tutorial (a useful option is --hostfile host_filename).
In host_filename, each line gives the name of a machine.

B. options

-ls {1,2,3,4}    choose among L1-SVM, L2-SVM, logistic regression, and
                 least square regression
-lambda          value of lambda (see DESCRIPTION) (default = 1/n)
-l1              value of the l1 scalar (see DESCRIPTION) (default = 0)
-T               number of iterations to run (see papers) (default = 1000)
-m               number of dual updates at each iteration (see papers)
                 (default = 1000)
-iiter           number of iterations for solving sub-problems (default = 1)
-tol             tolerance of the duality gap (see papers) (default = 0);
                 if tol = 0, run T iterations, otherwise run until the
                 duality gap is less than tol
-norm {0,1}      whether to l2-normalize each data point (default = 0)
-fmt {sparse, dense}   data format (see C.) (default = sparse)
-dtype {txt, dat}      data type (see C.) (default = txt)
-d               dimension of the data (for binary data only) (default = 0)
-org {0,1}       starting index of the data (see C.) (default = 1)
-labelFile       input label file or folder name for training data
                 (for binary data only) (default = noLabelFile)
-testFile        input file or folder name for testing data
                 (default = noTestFile)
-testlabelFile   input label file or folder name for testing data
                 (for binary data only) (default = noLabelFile)
-track {0,1}     whether to monitor the error on the testing data
-modelFile       filename for saving the model (default = noModelFile)
-objFile         filename for saving the objective (default = noobjFile)
-nb              number of batches of training data (default = num_procs);
                 useful when the data is split into a large number of
                 batches but you want to run on fewer processors
-nt              number of batches of testing data (default = 1);
                 if your testing data is large, consider using eye to do
                 prediction separately
-ng              number of groups of processors for training multi-class
                 classification; useful when there are a large number of
                 classes (see D.)
-eval            evaluate the objective and duality gap every this many
                 iterations (default = 0 if tol is turned off,
                 otherwise = T)
-verb {0,1}      level of display
-ws              window size for computing the averaged model
                 (default = 0, no averaging)

C. Prepare your data

Two data formats and two data types are supported: sparse and dense, text
(.txt) and binary (.dat). Use -fmt [sparse/dense] to choose sparse or
dense, and use -dtype [txt/dat] to choose text or binary files. The
defaults are sparse and txt.

-> For data in text files, the data can be either sparse or dense.

   <> For sparse data, see the example:

      1 0:4 2:2 5:3
      -1 1:2 3:1 4:4

      There are two data points. The first column specifies the label;
      the remaining entries are index:value pairs. This is the same as
      the LIBSVM or SVM-light format, and it is the default option.

   <> For dense data, see the example:

      1 0.01 0.2 -0.1 -0.2
      -1 -0.1 -0.2 0.01 0.2

      The first column is the label; the remaining columns are the
      features.

-> For data in binary files, the data is assumed to be saved in the order
   of data points with float precision. Use -fmt sparse/dense to choose
   which format to use in the program. For binary data, you may also need
   to provide a label file/folder name with the -labelFile option and the
   dimension with the -d option.

-> Split your data. To run distributed/parallel optimization on multiple
   processors or machines, you need to split your data into a number of
   batches depending on the size of your data. Use the provided script
   split.py to split a text file into a number of batches:

   $ python split.py data_FileName n K
   # n is the total number of data points, K is the number of batches

   A directory data_FileName_K is then created, in which the files 1.txt,
   2.txt, ..., K.txt are saved. (A sketch of such a splitter is given
   below.)
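   For reference, here is a minimal Python sketch of what such a splitter
   might look like, assuming a contiguous split into K batches; the
   shipped split.py is the authoritative version.

      import os, sys

      def split(filename, n, K):
          out_dir = "%s_%d" % (filename, K)
          os.makedirs(out_dir, exist_ok=True)
          per_batch = (n + K - 1) // K          # ceil(n / K) points per batch
          with open(filename) as f:
              for k in range(1, K + 1):
                  # write batch k to 1.txt, 2.txt, ..., K.txt
                  with open(os.path.join(out_dir, "%d.txt" % k), "w") as out:
                      for _ in range(per_batch):
                          line = f.readline()
                          if not line:
                              break
                          out.write(line)

      if __name__ == "__main__":
          split(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))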
   See the provided covtype data set (downloaded from the LIBSVM website,
   http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). For binary
   data, see the examples in the provided data folders btrain and btest.

D. Multi-class classification

The eagles program implements one-vs-all multi-class classification. When
the number of classes is large, you may want to parallelize the training
of different classes. The option -ng (num_groups) is very useful in this
case. With num_procs = num_groups * group_size, it organizes the
num_procs machines (processors) into num_groups groups. Each group trains
a subset of the classes (num_classes/num_groups) on all the training
data. For example, if you have nb=10 batches of data and nc=50 classes
and set ng=10 and num_procs=50, you will train 10 groups of multi-class
models; each group has 5 machines (processors) and handles nc/ng=5
classes on all the training data. See an example in EXAMPLES.

=======================================================================================================
V. EXAMPLES

1. Running birds with the default settings:

$ mpirun -np 2 birds data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1.91237e-06
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
58.8111       0.808148        0.808204
primal        dual            gap
0.808204      0.221063        0.587142
class         gap             train_error
1             0.587142        0.299093
+--------------------------------------------+

2. Running birds while tuning lambda and T:

$ mpirun -np 2 birds -lambda 0.00001 -T 10000 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
10000         5               0
Norm          Loss            Obj
45.26         0.705957        0.706183
primal        dual            gap
0.706183      0.706171        1.17118e-05
class         gap             train_error
1             1.17118e-05     0.248943
+--------------------------------------------+

3. Running birds while tuning m for better optimization performance:

$ mpirun -np 2 birds -lambda 0.00001 -T 1000 -m 4000 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             4000            0
Iters         Train-time      Eval-Time
1000          2               0
Norm          Loss            Obj
45.1188       0.706268        0.706494
primal        dual            gap
0.706494      0.70278         0.00371372
class         gap             train_error
1             0.00371372      0.249423
+--------------------------------------------+

4. Running birds with different loss functions:

$ mpirun -np 2 birds -ls 1 -lambda 0.00001 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L1-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
196.788       0.597927        0.598911
primal        dual            gap
0.598911      0.530709        0.0682019
class         gap             train_error
1             0.0682019       0.245285
+--------------------------------------------+

$ mpirun -np 2 birds -ls 3 -lambda 0.00001 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       LogReg
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
302.057       0.521195        0.522705
primal        dual            gap
0.522705      0.503084        0.019621
class         gap             train_error
1             0.019621        0.251362
+--------------------------------------------+
5. Running birds with sparse regularization:

$ mpirun -np 2 birds -ls 3 -l1 1000 -lambda 0.00001 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       LogReg
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
5643.63       0.694973        0.723191
primal        dual            gap
0.723191      0.642483        0.0807083
class         gap             train_error
1             0.0807083       0.454936
+--------------------------------------------+

6. Running birds with a testing data set:

$ mpirun -np 2 birds -lambda 0.00001 -testFile data/covtype.test data/covtype.train_2
reading data...3s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          0               0
Norm          Loss            Obj
48.4893       0.755844        0.756086
primal        dual            gap
0.756086      0.527142        0.228945
class         gap             train_error
1             0.228945        0.280405
test_loss     test_error
0.713378      0.246381
+--------------------------------------------+

7. Running birds with the tolerance parameter:

$ mpirun -np 2 birds -lambda 0.00001 -tol 0.001 -testFile data/covtype.test data/covtype.train_2
reading data...3s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            1000
Iters         Train-time      Eval-Time
6000          3               0
Norm          Loss            Obj
45.2438       0.705973        0.7062
primal        dual            gap
0.7062        0.705927        0.000273048
class         gap             train_error
1             0.000273048     0.249096
test_loss     test_error
0.610967      0.208671
+--------------------------------------------+

8. Running birds and saving the model:

$ mpirun -np 2 birds -lambda 0.00001 -tol 0.001 -modelFile covtype.model data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            1000
Iters         Train-time      Eval-Time
6000          4               0
Norm          Loss            Obj
45.2438       0.705973        0.7062
primal        dual            gap
0.7062        0.705927        0.000273048
class         gap             train_error
1             0.000273048     0.249096
+--------------------------------------------+

$ cat covtype.model
55 1:4.38543 2:0.0654094 3:-0.462926 4:-0.889125 5:-0.203232 6:-0.235037 7:-0.532578 8:-1.92654 9:-0.130274 10:0.0344773 11:0.248163 12:0.0784594 13:0.304337 14:1.99627 15:1.29748 16:1.39914 17:0.804446 18:1.02862 19:1.19154 20:0.0132775 21:-0.85663 22:-0.417634 23:-0.0780022 24:0.104929 25:-0.197528 26:-0.596994 27:-0.46566 28:2.02647 29:0.374638 30:-0.0882047 31:0.607031 32:0.436491 33:-0.100154 34:0.103247 35:0.575328 36:0.0716147 37:-0.0794976 38:-0.258034 39:-0.859306 40:-0.566779 41:-0.156037 42:-0.650499 43:-0.419479 44:-0.323515 45:-0.349297 46:-0.491995 47:-0.375718 48:-0.822987 49:0.205375 50:-0.194246 51:0.230549 52:0.157913 53:0.249656 54:0.0976877

(The first entry is the dimension of the model; it is 55 because the data
index starts from 1, not 0.)

9. Running eye (prediction) on testing data with a saved model:

$ mpirun -np 1 eye -modelFile covtype.model data/covtype.test
reading data...0   (time for reading the data)
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   time(s)
58101         55              0
test_error
0.208671
+--------------------------------------------+
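To clarify the file formats, here is a minimal single-machine Python
sketch (our own illustration, not the library's code) that loads a model
saved with -modelFile, in the format shown in example 8, and computes the
0/1 error of sign(w^T x) on sparse text data. For real use, eye above is
the tool.

   def load_model(path):
       tokens = open(path).read().split()
       d = int(tokens[0])              # first entry: dimension of the model
       w = [0.0] * d
       for tok in tokens[1:]:          # remaining entries: index:value pairs
           idx, val = tok.split(":")
           w[int(idx)] = float(val)
       return w

   def test_error(model_path, data_path):
       w = load_model(model_path)
       errors, total = 0, 0
       for line in open(data_path):    # sparse text format: label idx:val ...
           parts = line.split()
           if not parts:
               continue
           y = float(parts[0])
           score = 0.0
           for tok in parts[1:]:
               idx, val = tok.split(":")
               if int(idx) < len(w):
                   score += w[int(idx)] * float(val)
           errors += 1 if y * score <= 0 else 0
           total += 1
       return errors / float(total)

   # e.g. print(test_error("covtype.model", "data/covtype.test"))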
10. Running eagles on multi-class data:

$ mpirun -np 2 eagles -lambda 0.00001 data/rcv1_test.multiclass_2
reading data...22s
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
0             53              [1...53]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
100           69              0
max_gap
0.0535181
+--------------------------------------------+

11. Running eagles with -ng:

$ mpirun -np 10 --hostfile host.txt eagles -ng 5 -lambda 0.00001 data/rcv1_test.multiclass_2
reading data...22s
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
2             11              [12...22]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           21              0
max_gap
0.0234637
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
4             11              [23...33]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           20              0
max_gap
0.0338269
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
6             10              [34...43]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           19              0
max_gap
0.0114528
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
8             10              [44...53]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           19              0
max_gap
0.00193814
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
0             11              [1...11]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           20              0
max_gap
0.0535181
+--------------------------------------------+

12. Running ibirds.

Running ibirds is much the same as running birds, except that it can take
initial dual variables as input (e.g., learned from reduced data) and use
them to construct an initial model:

$ mpirun -np 5 ibirds -initdual 1 -testFile Path_To_TestFile TrainDataFold

It searches for dual files (ending with .dual.txt) in TrainDataFold and
uses them to initialize the dual and primal variables.

13. Running sbirds.

sbirds is used to learn models from randomized reduced data, following our
ICML 2015 paper referenced above. It is similar to birds except that it
takes a parameter margint (the tau parameter in the paper) specifying the
l1 regularization parameter of the dual variables. It can output the
values of the dual variables to files, which you can then use to
reconstruct a full model by running ibirds without further optimization:

$ mpirun -np 5 sbirds -margint 0.5 -lambda 0.00001 -tol 0.0001 -dualout 1 -fmt dense -d 100 -dtype dat -labelFile TrainDataFold -testFile TestDataFold/1.dat -testlabelFile TestDataFold/1.txt TrainDataFold

The training data in TrainDataFold is split into multiple batches named
"n.txt" for labels and "n.dat" for feature vectors (n = 1, ..., np).
sbirds outputs the values of the corresponding dual variables in
TrainDataFold for each batch. (A sketch of writing data in this layout is
given below.)
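As an illustration of this layout, the following Python sketch writes a
dense data set into paired n.dat/n.txt batch files. Single-precision
floats and a contiguous split are our assumptions here; the names
write_batches, X, y, out_dir, and num_batches are our own.

   import numpy as np

   def write_batches(X, y, out_dir, num_batches):
       """X: n x d float array of features, y: length-n array of labels."""
       X_parts = np.array_split(np.asarray(X, dtype=np.float32), num_batches)
       y_parts = np.array_split(np.asarray(y), num_batches)
       for k in range(num_batches):
           # features: raw float32 binary, one data point after another
           X_parts[k].tofile("%s/%d.dat" % (out_dir, k + 1))
           # labels: plain text, one label per line
           np.savetxt("%s/%d.txt" % (out_dir, k + 1), y_parts[k], fmt="%g")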
=========================================================================================================
VI. FILES

birds.cc        the main file for binary classification or regression
eagles.cc       the main file for multi-class classification
eye.cc          the main file for prediction
cmd_line.cc     reads command line arguments (adapted from Joseph Keshet)
WeightVector.cc implements a weight vector structure (modified from Shai
                Shalev-Shwartz's code)
simple_sparse_vec_hash.cc
                implements a sparse data vector (adapted from Shai
                Shalev-Shwartz)
dense_vec.cc    implements a dense data vector
data_set.h, data_set.cc
                implement the interface for reading data
inc_dual.h, inc_dual.cc
                implement the functions for updating the dual variables
objective.h, objective.cc
                implement the functions for evaluating the objective

New:
sbirds.cc       implements distributed SDCA for the dual-sparse
                regularized formulation
ibirds.cc       implements a recovery and post-optimization framework

=========================================================================================================
VII. ACKNOWLEDGMENT

Thanks to Shenghuo Zhu for providing useful suggestions on improving the
code.

This material is based upon work supported by the National Science
Foundation under Grant No. IIS-1463988. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National
Science Foundation.