pio.log
http://predictionio.incubator.apache.org/system/
http://predictionio.incubator.apache.org/templates/recommendation/dase/
http://predictionio.incubator.apache.org/templates/recommendation/reading-custom-events/
curl -i -X GET http://localhost:7070
The Evaluation component of an engine evaluates the accuracy of the engine by comparing the actual results to the predicted results. This is particularly useful if you want to tune your engine by adjusting a variety of parameters. The evaluation module can repeatedly adjust the parameters and evaluate to find the parameter values that produce the most accurate results.
https://github.com/PredictionIO/Programmers-Guide-to-Machine-Intelligence-with-PredictionIO/blob/master/chapter3.md
The rank parameter specifies how many latent features to have. As you can see, when we built our recommender in chapter one we used 10 latent features. The more latent features you have, the better the accuracy of the recommender. For example, one seminal paper in this area, "Large-scale Parallel Collaborative Filtering for the Netflix Prize" by Zhou et al., used 1000 features. However, the more latent features you use, the more memory your recommender will take. In most cases a value between 10 and 500 is reasonable.
As you can guess, this parameter refers to the number of iterations (repeating steps 2 and 3 above) the algorithm executes. As you might think, the more iterations, the better the results, but surprisingly, the algorithm needs very few iterations to produce good values for the P and Q matrices. Generally, a value between 10 and 20 will be sufficient.
In many machine learning algorithms, we train the algorithm on some set of data (called the training set) and once it is trained we use it on new data. There is a danger that the algorithm is so highly tuned to the training data that it performs poorly on the new data. This is called overfitting. In the ALS algorithm, one potential cause of overfitting is having some latent features having extreme values. This lambda is roughly how much we are weighing these extreme values. The greater the lambda the more the algorithm dislikes solutions that contain extreme values.
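The role of lambda is easiest to see in the regularized objective that ALS minimizes (standard form; r_ui are the known ratings, p_u and q_i the latent factor vectors):

```latex
\min_{P,Q} \sum_{(u,i)\ \text{known}} \bigl(r_{ui} - p_u^{\top} q_i\bigr)^2
          + \lambda \Bigl(\sum_u \lVert p_u \rVert^2 + \sum_i \lVert q_i \rVert^2\Bigr)
```

A larger lambda penalizes large entries in P and Q more heavily, which is exactly the "dislike of extreme values" described above.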
Recall that the first step of the algorithm is to generate random values for one of the matrices. Generally, each time we run the algorithm we get different initial values, and because these initial values are different, the final results can vary as well. Often when we are testing, we want the algorithm to be deterministic and produce the exact same results each time we run it. To do so, we pass in a value for this seed parameter. In the example above, we passed the value 3 as a seed.
We will first use the following Python script to convert the movie data into a json file:
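The script itself is not reproduced in these notes. As a sketch (assuming a MovieLens-style tab-separated input of user, movie, rating, timestamp; the field layout is an assumption, not the book's actual code), pio import expects one JSON event object per line:

```python
import json

# Sketch only: assumes tab-separated lines "user<TAB>movie<TAB>rating<TAB>timestamp"
# (MovieLens u.data style). `pio import` reads one JSON event object per line.
def rating_line_to_event(line):
    user, movie, rating, _ts = line.strip().split("\t")
    return {
        "event": "rate",
        "entityType": "user",
        "entityId": user,
        "targetEntityType": "item",
        "targetEntityId": movie,
        "properties": {"rating": float(rating)},
    }

def convert(in_path, out_path):
    # Write one JSON object per line, the batch-import format.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(json.dumps(rating_line_to_event(line)) + "\n")
```

The resulting file can then be loaded with the pio import command.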
pio import --appid 1 --input my_events.json
By default the amount of memory allocated for training is 512MB.
https://www.sitepoint.com/predictionio-bootstrapping-a-movie-recommendation-app/
https://www.sitepoint.com/create-movie-recommendation-app-prediction-io-setup/
bin/setup-vendors.sh
bin/start-all.sh
https://github.com/apache/incubator-predictionio/issues/108
deployment:
https://github.com/apache/incubator-predictionio/issues/200
https://github.com/richdynamix/personalised-products
pio status
pio-start-all
pio template get apache/incubator-predictionio-template-recommender MyRecommendation
pio app new MyApp1
pio app list
you should find an
engine.json
file; this is where you specify parameters for the engine.
pio build --verbose
pio train
pio deploy
curl -H "Content-Type: application/json" \
-d '{ "user": "1", "num": 4 }' http://localhost:8000/queries.json
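The same query can be issued from Python; a minimal standard-library sketch, assuming the engine from pio deploy is listening on localhost:8000:

```python
import json
from urllib import request

def build_query(user, num):
    # Same payload as the curl example above.
    return {"user": user, "num": num}

def send_query(query, url="http://localhost:8000/queries.json"):
    req = request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Requires a deployed engine; raises URLError otherwise.
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```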
http://predictionio.apache.org/evaluation/evaluationdashboard/
pio dashboard
The dashboard lists all completed evaluations in reverse chronological order. A high-level description of each evaluation can be seen directly from the dashboard. We can also click on the HTML button to see the evaluation drill-down page.
http://localhost:9000/
http://mirror.nexcess.net/apache/incubator/predictionio/0.10.0-incubating/apache-predictionio-0.10.0-incubating.tar.gz
https://github.com/apache/incubator-predictionio/tree/develop/examples
https://github.com/apache/incubator-predictionio/tree/develop/examples
install:
http://predictionio.incubator.apache.org/install/install-sourcecode/
http://predictionio.incubator.apache.org/install/
install manually
http://predictionio.incubator.apache.org/install/install-sourcecode/
$ createdb pio
If you get an error of the form "could not connect to server: No such file or directory", then you must first start the server manually:
$ pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start
Finally use the command:
$ psql -c "create user pio with password 'pio'"
docker
http://predictionio.incubator.apache.org/community/projects/#docker-installation-for-predictionio
https://github.com/steveny2k/docker-predictionio
docker run -p 9000:9000 -p 8000:8000 -p 7070:7070 -p 9300:9300 -it predictionio /bin/bash
pio-start-all
http://predictionio.incubator.apache.org/templates/recommendation/quickstart/
==> https://github.com/mingfang/docker-predictionio
http://predictionio.incubator.apache.org/community/projects/
https://github.com/tobilg/docker-predictionio
docker run -d -p 7070:7070 -p 8000:8000 tobilg/predictionio
https://github.com/steveny2k/docker-predictionio
docker run -it -p 8000:8000 steveny/predictionio /bin/bash
http://predictionio.incubator.apache.org/templates/recommendation/quickstart/
To update the model periodically with new data, simply set up a cron job to call pio train and pio deploy. The engine will continue to serve prediction results during the retraining process. After the training is completed, pio deploy will automatically shut down the existing engine server and bring up a new process on the same port.
Note that if you import a large data set and the training seems to be taking forever or getting stuck, it is likely that there is not enough executor memory. It is recommended to set up a Spark standalone cluster; you will need to specify more driver and executor memory when training with a large data set.
http://predictionio.incubator.apache.org/system/anotherdatastore/
- Meta data is used by PredictionIO to store engine training and evaluation information. Commands like pio build, pio train, pio deploy, and pio eval all access meta data.
- Event data is used by the Event Server to collect events, and by engines to source data.
- Model data is used by PredictionIO for automatic persistence of trained models.
PredictionIO comes with the following sources:
- JDBC (tested on MySQL and PostgreSQL):
- Type name is jdbc.
- Can be used for Meta Data, Event Data and Model Data repositories
- Elasticsearch:
- Type name is elasticsearch
- Can be used for Meta Data repository
- Apache HBase:
- Type name is hbase
- Can be used for Event Data repository
- Local file system:
- Type name is localfs
- Can be used for Model Data repository
- HDFS:
- Type name is hdfs.
- Can be used for Model Data repository
PIO_STORAGE_REPOSITORIES_METADATA_NAME=predictionio_metadata
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=predictionio_eventdata
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
http://predictionio.incubator.apache.org/install/config-datastore/
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
Event api:
http://predictionio.incubator.apache.org/datacollection/eventapi/
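A hedged sketch of posting a single event to the Event Server (default port 7070) with only the standard library; the access key is a placeholder for the key printed by pio app new:

```python
import json
from urllib import request

def build_event(event, entity_type, entity_id, **extra):
    # Field names follow the Event API: entityType, entityId, plus optional
    # targetEntityType / targetEntityId / properties / eventTime.
    payload = {"event": event, "entityType": entity_type, "entityId": entity_id}
    payload.update(extra)
    return payload

def send_event(payload, access_key, host="http://localhost:7070"):
    req = request.Request(
        "%s/events.json?accessKey=%s" % (host, access_key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires a running Event Server
        return json.loads(resp.read())
```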
Combining Multiple Algorithms
You can use more than one algorithm to build multiple models in an engine. The predicted results can be combined in the Serving class.
http://predictionio.incubator.apache.org/templates/similarproduct/multi-events-multi-algos/
- Add a new field likeEvents into the TrainingData class to store the RDD[LikeEvent].
- Modify DataSource's read() function to read "like" and "dislike" events from the Event Store.
In addition, MLlib ALS can handle negative preference with ALS.trainImplicit(). Hence, we can map a dislike to a rating of -1 and a like to 1.
In summary, this new LikeAlgorithm does the following:
- Extends the original ALSAlgorithm class
- Overrides train() to process the likeEvents in PreparedData
- Uses the latest event if the user likes/dislikes the same item multiple times
- Maps a dislike to an MLlibRating object with rating of -1 and a like to rating of 1
- Uses the MLlibRating to train the ALSModel in the same way as the original ALSAlgorithm
- The predict() function is the same as the original ALSAlgorithm
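The dedup-and-map steps above can be sketched as follows (the template itself is Scala; this Python version only illustrates the logic):

```python
# Keep only the latest like/dislike per (user, item), then map
# dislike -> -1 and like -> 1, the input shape ALS.trainImplicit expects.
def to_ratings(like_events):
    # like_events: iterable of (user, item, event, timestamp) tuples
    latest = {}
    for user, item, event, ts in like_events:
        key = (user, item)
        if key not in latest or ts > latest[key][1]:
            latest[key] = (event, ts)
    return [
        (user, item, 1 if event == "like" else -1)
        for (user, item), (event, _ts) in latest.items()
    ]
```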
When the engine is deployed, the Query is sent to all algorithms of the engine. The PredictedResults returned by all algorithms are passed to the Serving component for further processing, as you can see from the fact that the predictedResults argument of the serve() function is of type Seq[PredictedResult].
In this example, the serve() function first standardizes the PredictedResults of each algorithm so that we can combine the scores of multiple algorithms by adding the scores of the same item. Then we take the top N items as defined in the query.
- Data - includes Data Source and Data Preparator
- Algorithm(s)
- Serving
- Evaluator
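The serving-side score combination described above can be sketched like this (illustrative only; the template's actual standardization details may differ, and z-scores are used here as one reasonable choice):

```python
from collections import defaultdict

def combine(predicted_results, num):
    # predicted_results: one {item: score} dict per algorithm.
    # Standardize each algorithm's scores, sum per item, take top N.
    totals = defaultdict(float)
    for scores in predicted_results:
        vals = list(scores.values())
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
        for item, score in scores.items():
            totals[item] += (score - mean) / std
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:num]
```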
Here you need to map your String-supported Rating to MLlib's Integer-only Rating. First, you can rename MLlib's Integer-only Rating to MLlibRating for clarity:
import org.apache.spark.mllib.recommendation.{Rating => MLlibRating}
To map like and dislike events to a Rating object with values of 4 and 1, respectively:
Data Source reads data from the data store of the Event Server, and then Data Preparator prepares RDD[Rating] for the ALS algorithm.
val noTrainItems = Source.fromFile(pp.filepath).getLines.toSet //CHANGED
http://predictionio.incubator.apache.org/templates/recommendation/customize-serving/
The Serving component is where post-processing occurs. For example, if you are recommending items to users, you may want to remove items that are not currently in stock from the list of recommendations.
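A minimal sketch of that kind of post-processing, assuming the in-stock items are available as a set (analogous to the noTrainItems file read shown above):

```python
# Drop recommended items that are not in stock. `in_stock` would come
# from your inventory system; here it is just a set of item IDs.
def filter_in_stock(recommendations, in_stock):
    # recommendations: list of (item_id, score) pairs from the engine
    return [(item, score) for item, score in recommendations if item in in_stock]
```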
User properties can be gender, age, location, etc. Item properties can be genre, author, and other attributes that may be related to the user's preference.
Delete All Events of an App
$ pio app data-delete <your_app_name>
http://predictionio.incubator.apache.org/datacollection/eventmodel/
An entity may perform some events (e.g. user 1 does something), and an entity may have properties associated with it (e.g. a user may have gender, age, email, etc.). Hence, events involve entities, and there are two types of events, respectively:
- Generic events performed by an entity.
- Special events for recording changes of an entity's properties, for which $set, $unset and $delete are introduced.
Batch events
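For example, a $set event carries properties instead of a user action; a sketch of the payload shape, with field names following the Event API:

```python
import json

# A $set event updates an entity's properties rather than recording an action.
def set_properties_event(entity_type, entity_id, properties):
    return {
        "event": "$set",
        "entityType": entity_type,
        "entityId": entity_id,
        "properties": properties,
    }

payload = set_properties_event("user", "2", {"gender": "F", "age": 30})
body = json.dumps(payload)  # ready to POST to /events.json
```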
Creating multiple Channels makes it easier to identify, manage, and use specific event data if you collect events from multiple sources (e.g. mobile, website, or third-party webhook services) for your application.
http://predictionio.incubator.apache.org/demo/tapster/
Similar Product Template is a great choice if you want to make recommendations based on immediate user activities or for new users with limited history. It uses the MLlib Alternating Least Squares (ALS) recommendation algorithm, a collaborative filtering (CF) algorithm commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. Users and products are described by a small set of latent factors that can be used to predict missing entries.
http://predictionio.apache.org/community/projects/#docker-installation-for-predictionio
PredictionIO
Data includes data source and DataPreparator
Algorithm(s)
Serving
Evaluator
The Event Server passes events to the data source. These events are the real data we want to learn from. DataPreparator transforms them into a representation that can then be passed on to different learning algorithms.
PredictionIO is an open source machine learning service built on Spark, HBase, and Spray
Prediction.IO is a self-hosted open source platform, providing the full stack from data storage to modeling to serving the predictions. Prediction.IO can talk to Apache Spark to leverage its learning algorithms. In addition, it is shipped with a wide variety of models targeting specific domains, for instance, recommender systems, churn prediction, and others.
Titan offers a number of storage options, but I will concentrate only on two, HBase—the Hadoop NoSQL database, and Cassandra—the non-Hadoop NoSQL database.
http://thinkaurelius.github.io/titan/
http://tinkerpop.apache.org/
bash -c "$(curl -s https://install.prediction.io/install.sh)"
echo 'export PATH=$PATH:'"$HOME/PredictionIO/bin" >> ~/.bashrc
pio-start-all
pio template get PredictionIO/template-scala-parallel-universal-recommendation urec-app
pio-start-all
pio app new urec-app
sudo pip install PredictionIO
http://blog.prediction.io/ecommerce-personalization
https://groups.google.com/forum/#!topic/predictionio-user/lxZJCn9zZGk
The basic idea is that recorded user behavior, including searches, can be correlated with some conversion event like a purchase. The Universal Recommender uses an engine that does this correlation calculation (Correlated Cross-Occurrence or CCO). We can use this same engine to personalize search to take user preferences into account. The template is in pre-release but working.
There is no general way to do this, since the needs of each algorithm are unique. With PredictionIO each algorithm is a separate template and defines its own input, model, and output.
If you are asking about personalized search, you will need a search engine that supports field-based searches like Elasticsearch or Solr. The data in the indexes can be synced with your DB by looking at the docs for those engines. However the input to the Personalized Search template will need to be created by you from user actions (like purchase and search).
The Personalized Search template takes user actions, like the Universal Recommender does, and creates fields that you associate with items in the search engine index. This is done periodically. You then take the history of user actions and add it to the search terms for the search query. This will personalize search and turn it into targeted recommendations; in other words, search results that may also be preferred items for the user.
http://predictionio.incubator.apache.org/start/
PredictionIO consists of the following components:
- PredictionIO platform - our open source machine learning stack for building, evaluating and deploying engines with machine learning algorithms.
- Event Server - our open source machine learning analytics layer for unifying events from multiple platforms
- Template Gallery - the place for you to download engine templates for different types of machine learning applications
In a common scenario, PredictionIO's Event Server continuously collects data from your application. A PredictionIO engine then builds predictive model(s) with one or more algorithms using the data. After it is deployed as a web service, it listens to queries from your application and responds with predicted results in real-time.
Event Server collects data from your application, in real-time or in batch. It can also unify data that are related to your application from multiple platforms. After data is collected, it mainly serves two purposes:
- Provide data to Engine(s) for model training and evaluation
- Offer a unified view for data analysis
example/practice
https://blog.openshift.com/day-4-predictionio-how-to-build-a-blog-recommender/
- We create an instance of the Client class. Client is the class that wraps the PredictionIO REST API. We need to provide it the API_KEY of the PredictionIO blog-recommender application.
- Next we created two users using the Client instance. The users get created in the PredictionIO application. The only mandatory field is userId.
- After that we added 10 blogs using the Client instance. The blogs get created in the PredictionIO application. When creating an item you only need to pass two things - itemId and itemType. blog1, ..., blog10 are itemIds and javascript, scala, etc. are itemTypes.
- Next we performed some actions on the items. The user "shekhar" viewed "blog1", "blog2", and "blog4", and user "rahul" viewed "blog1", "blog4", "blog6", and "blog7".
- Finally, we closed the client instance.
Now that data is inserted into our PredictionIO application, we need to add an engine to the application. Click on the Add an Engine button and choose Item Similarity Engine.
After pressing the Create button you will have an Item Similarity Engine. You can make some configuration changes, but we will use the default settings. Go under the Algorithms tab and you will see the engine is not running. Run the engine by clicking "Train Data Model Now".
The use case we are solving is to recommend blogs to a user depending on the blogs he has viewed. In the code shown below, we are getting all items similar to blog1 for userId "shekhar".
client.identify("shekhar");
String[] recommendedItems = client.getItemSimTopN("engine1", "blog1", 5);
System.out.println(String.format("User %s is recommended %s", "shekhar", Arrays.toString(recommendedItems)));
deployment
http://predictionio.incubator.apache.org/install/
https://www.quora.com/Did-anyone-deploy-http-Prediction-io-in-production-on-a-big-cluster
The PredictionIO stack creates two types of instance: compute and storage. By default, the stack will launch 1 compute instance and 3 storage instances.
The compute instance (ComputeInstance) acts as the Spark master. You can launch extra compute instances (ComputeInstanceExtra) by updating the stack. The storage instances (StorageInstance) form the core of the HDFS, ZooKeeper quorum, and HBase storage. Extra storage instances (StorageInstanceExtra) can be added to the cluster by updating the stack. They cannot be removed once they are spun up.
PredictionIO Event Server will be launched on all storage instances.
https://groups.google.com/forum/#!topic/predictionio-user/DDOihJLIba0
I want to run 1 PredictionIO machine with 2 compute machines (running Apache Spark) and 3 storage machines (1 HBase master and 2 region servers).
HBase is not installed on the PredictionIO server; it is only on the independent cluster.
Locally, all you need to do is make a copy of your hbase-site.xml on the nodes where you will run "pio" commands, and have HBASE_CONF_DIR in conf/pio-env.sh point to where your hbase-site.xml lives. Verify your connection is good to go by using "pio status".
https://github.com/apache/incubator-predictionio/blob/develop/data/src/main/scala/org/apache/predictionio/data/storage/Storage.scala
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS ==> setting this to HBASE will fail:
2016-12-13 23:05:55,071 ERROR org.apache.predictionio.tools.console.Console$ [main] - No storage backend implementation can be found (tried both org.apache.predictionio.data.storage.hbase.HBModels and hbase.HBModels) (org.apache.predictionio.data.storage.StorageClientException)
org.apache.predictionio.data.storage.StorageClientException: No storage backend implementation can be found (tried both org.apache.predictionio.data.storage.hbase.HBModels and hbase.HBModels)
https://github.com/actionml/cluster-setup/blob/master/standalone-servers.md
- PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=ELASTICSEARCH
--- settings
cluster.name: predictionio
network.host: 127.0.0.1
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=predictionio
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.7.6
# Local File System Example
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models
# HBase Example
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase-1.2.4
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
Recommendation
pio template get apache/incubator-predictionio-template-recommender MyRecommendation
pio app new MyApp1
https://github.com/PredictionIO/Programmers-Guide-to-Machine-Intelligence-with-PredictionIO
https://github.com/PredictionIO/Programmers-Guide-to-Machine-Intelligence-with-PredictionIO/blob/master/Chapter1.md
Training involves creating a system that will replace those unknown values (the hyphens) with good predictions of how that user would rate that artist. By default the recommendation engine uses a method called matrix factorization, which we will learn about in chapter 3. The basic idea is that training starts with the engine building a recommendation system based on random values. Obviously, it will perform very poorly. Then we run this recommendation system on our data and see how well it predicts our known values. For example, how well did it predict that Jake would give Taylor Swift a 5? Let's say the system predicted Jake would give Taylor Swift a 3. In that case we would adjust the parameters to increase the Jake-Taylor Swift rating (and account for other mis-predictions). Now we have a second generation of our recommendation engine, and we again evaluate it on our data and adjust the parameters. We do this perhaps 1,000 times, until our recommender is precise in predicting the known values. Now we can use this recommender to fill in the missing values.
We are going to start our recommendation engine, which in essence is a server running on port 8000.
https://github.com/PredictionIO/Programmers-Guide-to-Machine-Intelligence-with-PredictionIO/blob/master/chapter2.md
This architecture defines the four components you need to specify to create a PredictionIO machine learning engine: the Data Source and Data Preparator, the machine learning Algorithm, the Serving component (which responds to queries), and the Evaluation Metrics.
The Data Source and Preparator create an RDD that is amenable to processing by the Spark machine learning algorithms.
If there are multiple algorithms, the serving component can combine the results of these algorithms. On occasion you may want to filter the results. For example, if you are recommending products you may not want to show out-of-stock items.
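A minimal sketch of that Serving-layer behavior, with a hypothetical in_stock lookup standing in for a real inventory check (this is not PredictionIO's Serving API, just the idea):

```python
def serve(results_a, results_b, in_stock):
    """Combine scored results from two algorithms by naively summing
    scores, rank them, then drop items that should not be shown."""
    combined = {}
    for item, score in results_a + results_b:
        combined[item] = combined.get(item, 0.0) + score
    ranked = sorted(combined.items(), key=lambda kv: -kv[1])
    return [(item, s) for item, s in ranked if in_stock.get(item, False)]

recs = serve([("a", 0.9), ("b", 0.7)], [("b", 0.5), ("c", 0.4)],
             in_stock={"a": True, "b": True, "c": True is False or {"a": True, "b": True, "c": False}["c"]} if False else {"a": True, "b": True, "c": False})
# "b" accumulates the highest combined score; out-of-stock "c" is filtered out
```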
The Evaluation component of an engine evaluates the accuracy of the engine by comparing the actual results to the predicted results. This is particularly useful if you want to tune your engine by adjusting a variety of parameters. The evaluation module can repeatedly adjust the parameters and evaluate to find the parameter values that produce the most accurate results.
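The tuning loop the Evaluation component automates can be sketched as a plain grid search. Here score_fn is a stand-in for a full train-and-evaluate cycle, and the toy error surface is invented so the example has a known best point; none of this is PredictionIO's actual evaluation API:

```python
def tune(param_grid, score_fn):
    """Try every parameter combination, score each candidate engine on
    held-out data, and keep the one with the lowest error."""
    best_params, best_score = None, float("inf")
    for params in param_grid:
        score = score_fn(params)  # lower error = better
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = [{"rank": r, "lambda": l} for r in (10, 50) for l in (0.01, 0.1)]
# Toy error surface whose minimum sits at rank=50, lambda=0.1
best, err = tune(grid, lambda p: (50 - p["rank"]) ** 2 + (0.1 - p["lambda"]) ** 2)
```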
https://github.com/PredictionIO/Programmers-Guide-to-Machine-Intelligence-with-PredictionIO/blob/master/chapter3.md
The rank parameter specifies how many latent features to use. As you can see, when we built our recommender in chapter one we used 10 latent features. The more latent features you have, the better the accuracy of the recommender. For example, one seminal paper in this area, Large-scale Parallel Collaborative Filtering for the Netflix Prize by Zhou et al., used 1,000 features. However, the more latent features you use, the more memory your recommender will take. In most cases a value between 10 and 500 is reasonable.
As you can guess, this parameter refers to the number of iterations (repetitions of steps 2 and 3 above) the algorithm executes. You might expect that more iterations always means better results, but surprisingly the algorithm needs very few iterations to produce good values for the P and Q matrices. Generally, a value between 10 and 20 will be sufficient.
In many machine learning algorithms, we train the algorithm on some set of data (called the training set) and, once it is trained, use it on new data. There is a danger that the algorithm becomes so highly tuned to the training data that it performs poorly on the new data. This is called overfitting. In the ALS algorithm, one potential cause of overfitting is some latent features taking on extreme values. The lambda parameter roughly controls how heavily we penalize these extreme values: the greater the lambda, the more the algorithm dislikes solutions that contain extreme values.
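In standard ALS notation (not quoted from the book), lambda appears as the weight on the regularization term in the objective being minimized:

$$ \min_{P,Q} \sum_{(u,i) \in \text{known}} \left( r_{ui} - p_u^{T} q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) $$

The first term rewards fitting the known ratings; the second term, scaled by lambda, punishes latent-feature vectors with large magnitudes.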
Recall that the first step of the algorithm is to generate random values for one of the matrices. Generally, each time we run the algorithm we get different initial values, and because these initial values are different, the final results can vary as well. Often when we are testing, we want the algorithm to be deterministic and produce the exact same results each time we run it. To do so, we pass in a value for this seed parameter. In the example above, we passed the value 3 as a seed.
We will first use the following Python script to convert the movie data into a json file:
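The conversion script itself is not reproduced in these notes. A hedged sketch of what it has to do, assuming MovieLens-style "user::movie::rating" input lines (the input layout is an assumption about the dataset, but the output fields follow PredictionIO's batch-import event format):

```python
import json

def to_events(lines):
    """Turn rating lines into PredictionIO 'rate' events, one JSON
    object per line, as expected by `pio import`."""
    events = []
    for line in lines:
        user, movie, rating = line.strip().split("::")[:3]
        events.append({
            "event": "rate",
            "entityType": "user",
            "entityId": user,
            "targetEntityType": "item",
            "targetEntityId": movie,
            "properties": {"rating": float(rating)},
        })
    return events

sample = ["1::1193::5", "1::661::3"]
json_lines = [json.dumps(e) for e in to_events(sample)]  # one object per line
```

Writing json_lines out with one object per line produces a file suitable for `pio import --appid 1 --input my_events.json`.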
pio import --appid 1 --input my_events.json
By default the amount of memory allocated for training is 512MB.
- Within the PredictionIO directory there is the directory vendors/spark-1.3.0/conf. Within that folder, make a copy of spark-defaults.conf.template and call it spark-defaults.conf.
- Edit spark-defaults.conf by uncommenting the spark.driver.memory line and changing the value from 5g to 24g.
- Add the line spark.executor.memory 24g.
The relevant lines of spark-defaults.conf then look like:
spark.driver.memory 24g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.executor.memory 24g
we will test on a set of data where we know the ratings, but which the recommender has not seen. We will then compute the average distance between our estimates and the real values. This is called the root-mean-square deviation (RMSD), and the formula is as follows:
$$ RMSD = \sqrt{\frac{\sum_{i=1}^{n}(r_i - \hat{r}_i)^2}{n}} $$
where n is the number of ratings in our test data, $r_i$ is the $i^{th}$ actual rating, and $\hat{r}_i$ is the corresponding predicted rating.
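The formula computed directly, for a sanity check:

```python
import math

def rmsd(actual, predicted):
    """Root-mean-square deviation: average the squared gaps between the
    actual ratings r and the predictions r_hat, then take the square root."""
    n = len(actual)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(actual, predicted)) / n)

error = rmsd([5, 3, 4], [4, 3, 2])  # per-item errors 1, 0, 2 -> sqrt(5/3)
```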
https://www.sitepoint.com/predictionio-bootstrapping-a-movie-recommendation-app/
https://www.sitepoint.com/create-movie-recommendation-app-prediction-io-setup/
bin/setup-vendors.sh
bin/start-all.sh
https://github.com/apache/incubator-predictionio/issues/108
The web interface was removed during the redesign. You can still accomplish those tasks via the CLI, though. https://docs.prediction.io/cli/
https://github.com/apache/incubator-predictionio/issues/200
You can use
pio undeploy
pio undeploy --ip <user_ip> --port <user_port>
Another way to undeploy a running engine server is to perform a GET on the /stop endpoint of the server.
Using a browser, simply go to:
http://<host>:<port>/stop
Using curl it would look like:
curl <host>:<port>/stop
To use the script, copy local.sh.template as local.sh and redeploy.sh as (say) MyEngine_Redeploy_(production).sh (the name of the script will appear as the title of the email), and put both files under the scripts/ directory of your engine. Then modify the settings inside both files, filling in details like PIO_HOME, LOG_DIR, TARGET_EMAIL, ENGINE_JSON, and others. You need to run pio build once before using this script. The script only trains and deploys. If pio train or pio deploy fails for some reason, the running engine stays put in most cases. If the engine is retrained and deployed successfully, the email sent will have Normal in the title so you can set filtering rules.
This script does not guarantee zero downtime, since at some point during pio deploy the original engine is shut down. The downtime is usually no more than a few seconds, though it can be longer.
0 0 * * * /path/to/script >/dev/null 2>/dev/null # mute both stdout and stderr to suppress email sent from cron
https://github.com/richdynamix/personalised-products
The enhanced cross-sell feature will show products often bought together with the products in the customer's basket.
Now your customers will never forget to add those additional batteries...
http://actionml.com/blog/personalized_search