**************************************************************************
***  Distributed/Parallel Optimization for Big Data Analytics         ***
***  Dr. Tianbao Yang                                                 ***
***  yangtia1@gmail.com                                               ***
**************************************************************************

Welcome to the birds library

==============================================================================
I. INTRODUCTION

The birds library implements distributed stochastic dual coordinate ascent
(SDCA) for classification and regression with broad support. This package
also supports optimizing dual-sparse regularized problems (useful for
learning from randomized reduced data) and recovering a full model from
dual variables learned on dimension-reduced data. See sbirds.cc and
ibirds.cc, respectively, and the examples below.

For technical details, see:

"Trading Computation for Communication: Distributed Stochastic Dual
Coordinate Ascent." Tianbao Yang. NIPS 2013.

"Analysis of Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang,
et al. Tech Report 2013.

"Theory of Dual-sparse Regularized Randomized Reduction." Tianbao Yang,
et al. ICML 2015.

The code is distributed under the GNU General Public License (see
license.txt for details).

Copyright (c) 2015 Tianbao Yang. All Rights Reserved.

==============================================================================
II. REQUIREMENTS AND INSTALL

- Platform: Linux/Unix, multi-core or clusters.
  In a cluster environment, make sure you can ssh to the other machines
  without a password (use public key authentication).

- Prerequisites: g++, Boost MPI library, Boost Serialization library.
  Before installing, check where your Boost library is installed and set
  MPI_BOOST_PATH in the Makefile accordingly.

To install, run the command:

$ make

==============================================================================
III. DESCRIPTION

The library supports binary classification, regression, and multi-class
classification.

For classification, it supports the following objectives:

1. SVM (hinge) + l2-norm:
   1/n \sum_i max(0, 1 - y_i w^T x_i) + lambda/2 |w|_2^2
2. SVM (squared hinge) + l2-norm:
   1/n \sum_i max(0, 1 - y_i w^T x_i)^2 + lambda/2 |w|_2^2
3. Logistic regression + l2-norm:
   1/n \sum_i log(1 + exp(-y_i w^T x_i)) + lambda/2 |w|_2^2

It also supports sparse regularization by solving the above objectives
with the l2-norm regularization replaced by the elastic net
regularization:

   lambda/2 * (|w|_2^2 + l1s * |w|_1)

For regression, it supports least square regression and lasso:

1. least square regression:
   1/n \sum_i 1/2 |y_i - w^T x_i|^2 + lambda/2 |w|_2^2
2. lasso:
   1/n \sum_i 1/2 |y_i - w^T x_i|^2 + lambda/2 (|w|_2^2 + l1s * |w|_1)

For multi-class classification, it supports one-vs-all with the
aforementioned objectives.
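To make the notation concrete, here is a small Python sketch (an
illustration only, not part of the library) that evaluates the three
classification objectives above on dense data; the names X, y, w, lam,
and l1s are our own.

   import numpy as np

   def primal_objective(X, y, w, lam, l1s=0.0, loss="l2svm"):
       """X: n x d feature matrix, y: labels in {-1,+1}, w: d-dim weights."""
       margins = y * (X @ w)                        # y_i * w^T x_i
       if loss == "l1svm":                          # hinge loss
           avg_loss = np.mean(np.maximum(0.0, 1.0 - margins))
       elif loss == "l2svm":                        # squared hinge loss
           avg_loss = np.mean(np.maximum(0.0, 1.0 - margins) ** 2)
       else:                                        # logistic loss
           avg_loss = np.mean(np.logaddexp(0.0, -margins))
       # elastic-net regularizer lambda/2 * (|w|_2^2 + l1s * |w|_1);
       # l1s = 0 recovers the plain l2-norm regularization
       return avg_loss + 0.5 * lam * (np.dot(w, w) + l1s * np.abs(w).sum())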
=================================================================================
IV. USAGE

For binary classification and regression:
   mpirun -np num_procs [mpi-options] birds [options]
For multi-class classification:
   mpirun -np num_procs [mpi-options] eagles [options]
For prediction:
   mpirun -np num_procs [mpi-options] eye [options]

A. mpi-options

Check any MPI tutorial (a useful option is --hostfile host_filename).
In host_filename, each line gives the name of a machine.

B. options

-ls {1,2,3,4}    choose among L1-SVM, L2-SVM, logistic regression, and
                 least square regression
-lambda          value of lambda (see DESCRIPTION) (default = 1/n)
-l1              value of the l1 scalar (see DESCRIPTION) (default = 0)
-T               number of iterations to run (see papers) (default = 1000)
-m               number of dual updates at each iteration (see papers)
                 (default = 1000)
-iiter           number of iterations for solving sub-problems (default = 1)
-tol             tolerance of the duality gap (see papers) (default = 0);
                 if tol = 0, run T iterations, otherwise run until the
                 duality gap is less than tol
-norm {0,1}      whether to l2-normalize each data point (default = 0)
-fmt {sparse, dense}   data format (see C.) (default = sparse)
-dtype {txt, dat}      data type (see C.) (default = txt)
-d               dimension of the data (for binary data only) (default = 0)
-org {0,1}       starting index of the data (see C.) (default = 1)
-labelFile       input label file or folder name for training data
                 (for binary data only) (default = noLabelFile)
-testFile        input file or folder name for testing data
                 (default = noTestFile)
-testlabelFile   input label file or folder name for testing data
                 (for binary data only) (default = noLabelFile)
-track {0,1}     whether to monitor the error on the testing data
-modelFile       filename for saving the model (default = noModelFile)
-objFile         filename for saving the objective (default = noobjFile)
-nb              number of batches of training data (default = num_procs);
                 useful when the data is split into a large number of
                 batches but you want to run on fewer processors
-nt              number of batches of testing data (default = 1);
                 if your testing data is large, consider using eye to do
                 prediction separately
-ng              number of groups of processors for training multi-class
                 classification; useful when there are a large number of
                 classes (see D.)
-eval            evaluate the objective and duality gap every this many
                 iterations (default = 0 if tol is turned off,
                 otherwise = T)
-verb {0,1}      level of display
-ws              window size for computing the averaged model
                 (default = 0, no averaging)

C. Prepare your data

Two data formats and two data types are supported: sparse and dense, text
(.txt) and binary (.dat). Use -fmt [sparse/dense] to choose sparse or
dense, and use -dtype [txt/dat] to choose text or binary files. The
defaults are sparse and txt.

-> For data in text files, the data can be either sparse or dense.

   <> For sparse data, see the example:

      1 0:4 2:2 5:3
      -1 1:2 3:1 4:4

      There are two data points. The first column specifies the label;
      the remaining entries are index:value pairs. This is the same as
      the LIBSVM or SVM-light format, and it is the default option.

   <> For dense data, see the example:

      1 0.01 0.2 -0.1 -0.2
      -1 -0.1 -0.2 0.01 0.2

      The first column is the label; the remaining columns are the
      features.

-> For data in binary files, the data is assumed to be saved in the order
   of data points with float precision. Use -fmt sparse/dense to choose
   which format to use in the program. For binary data, you may also need
   to provide a label file/folder name with the -labelFile option and the
   dimension with the -d option.

-> Split your data. To run distributed/parallel optimization on multiple
   processors or machines, you need to split your data into a number of
   batches depending on the size of your data. Use the provided script
   split.py to split a text file into a number of batches:

   $ python split.py data_FileName n K
   # n is the total number of data points, K is the number of batches

   A directory data_FileName_K is then created, in which the files 1.txt,
   2.txt, ..., K.txt are saved. (A sketch of such a splitter is given
   below.)
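   For reference, here is a minimal Python sketch of what such a splitter
   might look like, assuming a contiguous split into K batches; the
   shipped split.py is the authoritative version.

      import os, sys

      def split(filename, n, K):
          out_dir = "%s_%d" % (filename, K)
          os.makedirs(out_dir, exist_ok=True)
          per_batch = (n + K - 1) // K          # ceil(n / K) points per batch
          with open(filename) as f:
              for k in range(1, K + 1):
                  # write batch k to 1.txt, 2.txt, ..., K.txt
                  with open(os.path.join(out_dir, "%d.txt" % k), "w") as out:
                      for _ in range(per_batch):
                          line = f.readline()
                          if not line:
                              break
                          out.write(line)

      if __name__ == "__main__":
          split(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))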
   See the provided covtype data set (downloaded from the LIBSVM website,
   http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). For binary
   data, see the examples in the provided data folders btrain and btest.

D. Multi-class classification

The eagles program implements one-vs-all multi-class classification. When
the number of classes is large, you may want to parallelize the training
of different classes. The option -ng (num_groups) is very useful in this
case. With num_procs = num_groups * group_size, it organizes the
num_procs machines (processors) into num_groups groups. Each group trains
a subset of the classes (num_classes/num_groups) on all the training
data. For example, if you have nb=10 batches of data and nc=50 classes
and set ng=10 and num_procs=50, you will train 10 groups of multi-class
models; each group has 5 machines (processors) and handles nc/ng=5
classes on all the training data. See an example in EXAMPLES.

=======================================================================================================
V. EXAMPLES

1. Running birds with the default settings:

$ mpirun -np 2 birds data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1.91237e-06
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
58.8111       0.808148        0.808204
primal        dual            gap
0.808204      0.221063        0.587142
class         gap             train_error
1             0.587142        0.299093
+--------------------------------------------+

2. Running birds while tuning lambda and T:

$ mpirun -np 2 birds -lambda 0.00001 -T 10000 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
10000         5               0
Norm          Loss            Obj
45.26         0.705957        0.706183
primal        dual            gap
0.706183      0.706171        1.17118e-05
class         gap             train_error
1             1.17118e-05     0.248943
+--------------------------------------------+

3. Running birds while tuning m for better optimization performance:

$ mpirun -np 2 birds -lambda 0.00001 -T 1000 -m 4000 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             4000            0
Iters         Train-time      Eval-Time
1000          2               0
Norm          Loss            Obj
45.1188       0.706268        0.706494
primal        dual            gap
0.706494      0.70278         0.00371372
class         gap             train_error
1             0.00371372      0.249423
+--------------------------------------------+

4. Running birds with different loss functions:

$ mpirun -np 2 birds -ls 1 -lambda 0.00001 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L1-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
196.788       0.597927        0.598911
primal        dual            gap
0.598911      0.530709        0.0682019
class         gap             train_error
1             0.0682019       0.245285
+--------------------------------------------+

$ mpirun -np 2 birds -ls 3 -lambda 0.00001 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       LogReg
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
302.057       0.521195        0.522705
primal        dual            gap
0.522705      0.503084        0.019621
class         gap             train_error
1             0.019621        0.251362
+--------------------------------------------+
5. Running birds with sparse regularization:

$ mpirun -np 2 birds -ls 3 -l1 1000 -lambda 0.00001 data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       LogReg
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          1               0
Norm          Loss            Obj
5643.63       0.694973        0.723191
primal        dual            gap
0.723191      0.642483        0.0807083
class         gap             train_error
1             0.0807083       0.454936
+--------------------------------------------+

6. Running birds with a testing data set:

$ mpirun -np 2 birds -lambda 0.00001 -testFile data/covtype.test data/covtype.train_2
reading data...3s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
1000          0               0
Norm          Loss            Obj
48.4893       0.755844        0.756086
primal        dual            gap
0.756086      0.527142        0.228945
class         gap             train_error
1             0.228945        0.280405
test_loss     test_error
0.713378      0.246381
+--------------------------------------------+

7. Running birds with the tolerance parameter:

$ mpirun -np 2 birds -lambda 0.00001 -tol 0.001 -testFile data/covtype.test data/covtype.train_2
reading data...3s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            1000
Iters         Train-time      Eval-Time
6000          3               0
Norm          Loss            Obj
45.2438       0.705973        0.7062
primal        dual            gap
0.7062        0.705927        0.000273048
class         gap             train_error
1             0.000273048     0.249096
test_loss     test_error
0.610967      0.208671
+--------------------------------------------+

8. Running birds and saving the model:

$ mpirun -np 2 birds -lambda 0.00001 -tol 0.001 -modelFile covtype.model data/covtype.train_2
reading data...2s
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   lambda
522911        55              1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            1000
Iters         Train-time      Eval-Time
6000          4               0
Norm          Loss            Obj
45.2438       0.705973        0.7062
primal        dual            gap
0.7062        0.705927        0.000273048
class         gap             train_error
1             0.000273048     0.249096
+--------------------------------------------+

$ cat covtype.model
55 1:4.38543 2:0.0654094 3:-0.462926 4:-0.889125 5:-0.203232 6:-0.235037 7:-0.532578 8:-1.92654 9:-0.130274 10:0.0344773 11:0.248163 12:0.0784594 13:0.304337 14:1.99627 15:1.29748 16:1.39914 17:0.804446 18:1.02862 19:1.19154 20:0.0132775 21:-0.85663 22:-0.417634 23:-0.0780022 24:0.104929 25:-0.197528 26:-0.596994 27:-0.46566 28:2.02647 29:0.374638 30:-0.0882047 31:0.607031 32:0.436491 33:-0.100154 34:0.103247 35:0.575328 36:0.0716147 37:-0.0794976 38:-0.258034 39:-0.859306 40:-0.566779 41:-0.156037 42:-0.650499 43:-0.419479 44:-0.323515 45:-0.349297 46:-0.491995 47:-0.375718 48:-0.822987 49:0.205375 50:-0.194246 51:0.230549 52:0.157913 53:0.249656 54:0.0976877

(The first entry is the dimension of the model; it is 55 because the data
index starts from 1, not 0.)

9. Running eye (prediction) on testing data with a saved model:

$ mpirun -np 1 eye -modelFile covtype.model data/covtype.test
reading data...0   (time for reading the data)
+--------------------------------------------+
running summary:       L2-SVM
examples(n)   dimensions(d)   time(s)
58101         55              0
test_error
0.208671
+--------------------------------------------+
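To clarify the file formats, here is a minimal single-machine Python
sketch (our own illustration, not the library's code) that loads a model
saved with -modelFile, in the format shown in example 8, and computes the
0/1 error of sign(w^T x) on sparse text data. For real use, eye above is
the tool.

   def load_model(path):
       tokens = open(path).read().split()
       d = int(tokens[0])              # first entry: dimension of the model
       w = [0.0] * d
       for tok in tokens[1:]:          # remaining entries: index:value pairs
           idx, val = tok.split(":")
           w[int(idx)] = float(val)
       return w

   def test_error(model_path, data_path):
       w = load_model(model_path)
       errors, total = 0, 0
       for line in open(data_path):    # sparse text format: label idx:val ...
           parts = line.split()
           if not parts:
               continue
           y = float(parts[0])
           score = 0.0
           for tok in parts[1:]:
               idx, val = tok.split(":")
               if int(idx) < len(w):
                   score += w[int(idx)] * float(val)
           errors += 1 if y * score <= 0 else 0
           total += 1
       return errors / float(total)

   # e.g. print(test_error("covtype.model", "data/covtype.test"))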
10. Running eagles on multi-class data:

$ mpirun -np 2 eagles -lambda 0.00001 data/rcv1_test.multiclass_2
reading data...22s
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
0             53              [1...53]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
2             1000            0
Iters         Train-time      Eval-Time
100           69              0
max_gap
0.0535181
+--------------------------------------------+

11. Running eagles with -ng:

$ mpirun -np 10 --hostfile host.txt eagles -ng 5 -lambda 0.00001 data/rcv1_test.multiclass_2
reading data...22s
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
2             11              [12...22]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           21              0
max_gap
0.0234637
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
4             11              [23...33]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           20              0
max_gap
0.0338269
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
6             10              [34...43]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           19              0
max_gap
0.0114528
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
8             10              [44...53]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           19              0
max_gap
0.00193814
+--------------------------------------------+
+--------------------------------------------+
running summary:       L2-SVM
leader        #classes        classes
0             11              [1...11]
examples(n)   dimensions(d)   lambda
518571        47237           1e-05
process(K)    exam_per(m)     iter_to_eval
10            1000            0
Iters         Train-time      Eval-Time
100           20              0
max_gap
0.0535181
+--------------------------------------------+

12. Running ibirds.

Running ibirds is much the same as running birds, except that it can take
initial dual variables as input (e.g., learned from reduced data) and use
them to construct an initial model:

$ mpirun -np 5 ibirds -initdual 1 -testFile Path_To_TestFile TrainDataFold

It searches for dual files (ending with .dual.txt) in TrainDataFold and
uses them to initialize the dual and primal variables.

13. Running sbirds.

sbirds is used to learn models from randomized reduced data, following our
ICML 2015 paper referenced above. It is similar to birds except that it
takes a parameter margint (the tau parameter in the paper) specifying the
l1 regularization parameter of the dual variables. It can output the
values of the dual variables to files, which you can then use to
reconstruct a full model by running ibirds without further optimization:

$ mpirun -np 5 sbirds -margint 0.5 -lambda 0.00001 -tol 0.0001 -dualout 1 -fmt dense -d 100 -dtype dat -labelFile TrainDataFold -testFile TestDataFold/1.dat -testlabelFile TestDataFold/1.txt TrainDataFold

The training data in TrainDataFold is split into multiple batches named
"n.txt" for labels and "n.dat" for feature vectors (n = 1, ..., np).
sbirds outputs the values of the corresponding dual variables in
TrainDataFold for each batch. (A sketch of writing data in this layout is
given below.)
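As an illustration of this layout, the following Python sketch writes a
dense data set into paired n.dat/n.txt batch files. Single-precision
floats and a contiguous split are our assumptions here; the names
write_batches, X, y, out_dir, and num_batches are our own.

   import numpy as np

   def write_batches(X, y, out_dir, num_batches):
       """X: n x d float array of features, y: length-n array of labels."""
       X_parts = np.array_split(np.asarray(X, dtype=np.float32), num_batches)
       y_parts = np.array_split(np.asarray(y), num_batches)
       for k in range(num_batches):
           # features: raw float32 binary, one data point after another
           X_parts[k].tofile("%s/%d.dat" % (out_dir, k + 1))
           # labels: plain text, one label per line
           np.savetxt("%s/%d.txt" % (out_dir, k + 1), y_parts[k], fmt="%g")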
=========================================================================================================
VI. FILES

birds.cc        the main file for binary classification or regression
eagles.cc       the main file for multi-class classification
eye.cc          the main file for prediction
cmd_line.cc     reads command line arguments (adapted from Joseph Keshet)
WeightVector.cc implements a weight vector structure (modified from Shai
                Shalev-Shwartz's code)
simple_sparse_vec_hash.cc
                implements a sparse data vector (adapted from Shai
                Shalev-Shwartz)
dense_vec.cc    implements a dense data vector
data_set.h, data_set.cc
                implement the interface for reading data
inc_dual.h, inc_dual.cc
                implement the functions for updating the dual variables
objective.h, objective.cc
                implement the functions for evaluating the objective

New:
sbirds.cc       implements distributed SDCA for the dual-sparse
                regularized formulation
ibirds.cc       implements a recovery and post-optimization framework

=========================================================================================================
VII. ACKNOWLEDGMENT

Thanks to Shenghuo Zhu for providing useful suggestions on improving the
code.

This material is based upon work supported by the National Science
Foundation under Grant No. IIS-1463988. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National
Science Foundation.