As a website-based business, how can we predict whether a user will complete a transaction, and based on what signals?
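One possible framing, purely as an illustrative sketch: treat it as binary classification over user-behavior features, for example with Spark MLlib. The input path, the feature columns (page views, session minutes, cart adds), and the spark-mllib_2.10 dependency assumed below are hypothetical, not taken from this blog.

/* TransactionPredictor.java - hypothetical sketch; requires spark-mllib_2.10 alongside spark-core */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public class TransactionPredictor {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Transaction Predictor");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input: label,pageViews,sessionMinutes,cartAdds per line,
        // where label is 1 if the user transacted and 0 otherwise.
        JavaRDD<String> lines = sc.textFile("/user/XXX/sample/user_behavior.csv");
        JavaRDD<LabeledPoint> points = lines.map(new Function<String, LabeledPoint>() {
            public LabeledPoint call(String line) {
                String[] t = line.split(",");
                return new LabeledPoint(
                        Double.parseDouble(t[0]),
                        Vectors.dense(Double.parseDouble(t[1]),   // page views
                                      Double.parseDouble(t[2]),   // session minutes
                                      Double.parseDouble(t[3]))); // cart adds
            }
        });

        // Train a logistic regression model and score a new visitor.
        LogisticRegressionModel model =
                new LogisticRegressionWithLBFGS().setNumClasses(2).run(points.rdd());
        System.out.println("Will transact? "
                + model.predict(Vectors.dense(12.0, 5.0, 1.0)));
    }
}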
Have fun, North America!
Thursday, March 24, 2016
Monday, September 28, 2015
Install the intl extension for PHP on Mac
1. Dependencies.
Install ICU4C:
tar xzvf icu4c-4_4_2-src.tgz
cd icu/source
chmod +x runConfigureICU configure install-sh
./runConfigureICU MacOSX
make && make install
Install Autoconf:
brew install autoconf
Install PECL.
2. Install intl:
pecl install intl
3. Update php.ini:
extension=/usr/lib/php/extensions/no-debug-non-zts-20121212/intl.so
Tuesday, September 22, 2015
Canada Tourist Visa
http://www.16safety.ca/page/%E5%8A%A0%E6%8B%BF%E5%A4%A7%E7%AD%BE%E8%AF%81%E7%BD%91%E4%B8%8A%E7%94%B3%E8%AF%B7%E8%BF%87%E7%A8%8B%E4%BB%8B%E7%BB%8D%E5%8F%8A%E6%B3%A8%E6%84%8F%E4%BA%8B%E9%A1%B9-%EF%BC%88%E5%B7%B2%E6%9B%B4%E6%96%B0%EF%BC%89
Required documents:
1. Form IMM5257. (Fill it in and upload it directly.)
2. Marriage certificate. (Translate it first, then scan both the original and the translation.)
3. Family information form. (Fill it in, print and sign it, then scan.)
4. Travel information. (Flight tickets, itinerary in Canada, etc.)
5. Purpose of travel. (A guarantee letter from the applicant to the embassy, a wedding invitation, etc.)
6. Passport.
7. Education and employment information form. (Print it, fill it in by hand, sign it, then scan.)
8. Invitation letter. It should mention cost coverage.
9. Form IMM5713, the representative form?????
10. Proof of income. (In both Chinese and English, signed, stamped, and scanned.)
11. Applicant's assets. (Property ownership certificate, certificate of deposit, and six months of bank statements.)
12. Digital photo.
13. Schedule 1 (the supplementary form to IMM5257).
14. Leave approval letter from the applicant's employer. (In both Chinese and English.)
Friday, September 11, 2015
Performance test: single-node Spark + HDFS + Sqoop vs. MySQL
Mac OS X 10.9.4, 2.6 GHz Intel Core i5, 16 GB 1600 MHz DDR3
* Using group + count in MySQL and in Spark (a Spark-side sketch follows the table):
select id, count(1) as cnt from contact_alerts group by id;

rows    size (MB)   create (PHP)   import (Sqoop)   Spark (s)   MySQL (s)
100k    13          15s            NA               6           0.003
1m      104         222s           26s              15          5
10m     1016        31m            1m28s            ?           39
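A minimal sketch of how the Spark side of this group + count could be run (not necessarily the exact job used for these timings), assuming the contact_alerts dump is comma-separated text in HDFS with the id in the first column, as in the DataFrame post below:

/* GroupCountBenchmark.java - illustrative sketch only */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class GroupCountBenchmark {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Group Count Benchmark");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile("/user/XXX/sample/contact_alerts");

        // Equivalent of: select id, count(1) as cnt from contact_alerts group by id
        JavaPairRDD<String, Integer> counts = lines
                .mapToPair(new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String line) {
                        return new Tuple2<String, Integer>(line.split(",")[0], 1);
                    }
                })
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer a, Integer b) { return a + b; }
                });

        long start = System.currentTimeMillis();
        long groups = counts.count(); // forces the aggregation to run
        System.out.println(groups + " groups in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}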
Transform text files in HDFS to DataFrames in Spark, and join multiple DataFrames
reference
Without Hive, Spark can read multiple text files from HDFS and transform them into DataFrames, which makes them convenient to analyze.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>edu.berkeley</groupId>
    <artifactId>simple-project</artifactId>
    <name>Simple Project</name>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency> <!-- Spark SQL dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.4.1</version>
        </dependency>
    </dependencies>
</project>
----------------------------------------------
Alert.java
import scala.Serializable;

public class Alert implements Serializable {
    private String id;
    private String alert;
    private String created;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getAlert() { return alert; }
    public void setAlert(String alert) { this.alert = alert; }
    public String getCreated() { return created; }
    public void setCreated(String created) { this.created = created; }
}
--------------------------------------------------
AlertMore.java
import scala.Serializable;

public class AlertMore implements Serializable {
    private String id;
    private String contactId;

    public String getContactId() { return contactId; }
    public void setContactId(String contactId) { this.contactId = contactId; }
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}
----------------------------------------------------
SimpleJava.java
/* SimpleJava.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.api.java.function.Function;

public class SimpleJava {
    public static void main(String[] args) {
        String logFile = "/user/XXX/sample/contact_alerts"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

        // Map each comma-separated line to an Alert bean (columns 0, 3, 7 -> id, alert, created)
        JavaRDD<Alert> alerts = logData.map(new Function<String, Alert>() {
            public Alert call(String line) throws Exception {
                Alert alert = new Alert();
                alert.setId(null);
                alert.setAlert(null);
                alert.setCreated(null);
                String[] tokens = line.split(",");
                for (int i = 0; i < tokens.length; i++) {
                    if (i == 0) alert.setId(tokens[i]);
                    if (i == 3) alert.setAlert(tokens[i]);
                    if (i == 7) alert.setCreated(tokens[i]);
                }
                return alert;
            }
        });
        DataFrame alertDF = sqlContext.createDataFrame(alerts, Alert.class);
        alertDF.registerTempTable("alerts");

        // Map the same lines to an AlertMore bean (columns 0, 1 -> id, contactId)
        JavaRDD<AlertMore> alertsMore = logData.map(new Function<String, AlertMore>() {
            public AlertMore call(String line) throws Exception {
                AlertMore alertMore = new AlertMore();
                alertMore.setId(null);
                alertMore.setContactId(null);
                String[] tokens = line.split(",");
                for (int i = 0; i < tokens.length; i++) {
                    if (i == 0) alertMore.setId(tokens[i]);
                    if (i == 1) alertMore.setContactId(tokens[i]);
                }
                return alertMore;
            }
        });
        DataFrame alertMoreDF = sqlContext.createDataFrame(alertsMore, AlertMore.class);
        alertMoreDF.registerTempTable("alerts_more");

        System.out.println("-----------------------------------------------------------------------");
        System.out.println("DataFrame - query from alerts");

        // Join the two DataFrames on id
        DataFrame totalAlerts = sqlContext.sql("SELECT * FROM alerts")
                .join(alertMoreDF, alertDF.col("id").equalTo(alertMoreDF.col("id")));
        totalAlerts.show();
        System.out.println(alertDF.filter(alertDF.col("id").gt(911111)).count());

        /* DataFrame from Json
        DataFrame dfFromJson = sqlContext.jsonFile("/user/XXXXX/people.json");
        dfFromJson.show();
        dfFromJson.select("name").show();
        dfFromJson.select(dfFromJson.col("name"), dfFromJson.col("age").plus(1)).show();
        dfFromJson.filter(dfFromJson.col("age").gt(21)).show();
        dfFromJson.groupBy("age").count().show();
        */
    }
}
----------------------------------------------------
Run:
$ ./bin/spark-submit --class "SimpleJava" --master local[4] ~/work/dev/bigdata/SimpleJava/out/artifacts/SimpleJava_jar/SimpleJava.jar
If you hit java.lang.OutOfMemoryError: GC overhead limit exceeded, add -Dspark.executor.memory=6g when launching the job.
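An alternative way to apply that setting (a sketch using standard Spark configuration, not taken from the post) is directly on the SparkConf in code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: the same memory setting applied in code instead of via -D.
public class MemoryConfigExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Simple Application")
                .set("spark.executor.memory", "6g"); // raise executor memory to 6 GB
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs / DataFrames as in SimpleJava above ...
        sc.stop();
    }
}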
Tuesday, August 18, 2015
Run a simple Java app on Spark and HDFS
1. Started Hadoop.
2. Created a maven project in IntelliJ.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>edu.berkeley</groupId>
    <artifactId>simple-project</artifactId>
    <name>Simple Project</name>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.3.1</version>
        </dependency>
    </dependencies>
</project>
-----------------------------------
/* SimpleJava.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleJava {
    public static void main(String[] args) {
        String logFile = "/user/XXXXXXXX/input/a.log"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        // Count lines whose 8th semicolon-separated field is a timestamp after the cutoff
        long numByField = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                String[] token = s.split(";");
                boolean existed = false;
                for (int i = 0; i < token.length; i++) {
                    if (i == 7) {
                        String timeInHdfs = token[i]; // e.g. 2015-06-30 14:00:29.0
                        System.out.println(timeInHdfs);
                        if (!timeInHdfs.equalsIgnoreCase("null")
                                && timeInHdfs.compareTo("2015-06-29 23:59:59") > 0) {
                            existed = true;
                        }
                    }
                }
                return existed;
            }
        }).count();
        System.out.println("-----------------------------------------------------------------------");
        System.out.println("Lines with bigger time: numByField: " + numByField);
    }
}
-----------------------------------
Notes: when creating the artifact, select "link by META-INF" rather than build in.
Run:
$ ./bin/spark-submit --class "SimpleJava" --master local[4] ~/work/dev/bigdata/SimpleJava/out/artifacts/SimpleJava_jar/SimpleJava.jar
Check in the web UI:
$ ./sbin/start-all.sh
and visit http://localhost:8080
Friday, August 14, 2015
Setup Hue on Mac
reference: Hue Installation
package: hue-3.8.1
Too much configuration to cover here... (TBD)
start:
build/env/bin/supervisor
and visit:
http://127.0.0.1:8888
Log in with root / 123456.