Loading Data Connection Failed


#1

Hi, I'm using Databricks/PySpark to load a DataFrame into OmniSci. Here is my code:

df.write.format("jdbc").option("url", "jdbc:mapd:EC2ADDRESS:9091:mapd").option("driver", "com.mapd.jdbc.MapDDriver").option("dbtable", "subs_dim").option("user", "mapd").option("password", "INSTANCEID").save()

Here is the error:
Py4JJavaError: An error occurred while calling o477.save.
: java.sql.SQLException: Connection failed - org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out (Connection timed out)
at com.mapd.jdbc.MapDConnection.<init>(MapDConnection.java:113)
at com.mapd.jdbc.MapDDriver.connect(MapDDriver.java:55)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:63)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:72)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:88)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:150)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:190)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:187)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:108)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:683)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:683)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:89)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:175)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:84)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:126)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:683)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:287)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:281)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)


#2

Hi @mayanklive1,

An obvious question, since I see a connection timeout error: have you opened access to port 9091 in the instance's security group? You can confirm that OmniSci is listening on port 9091 for client connections using curl:

curl -v telnet://EC2ADDRESS:9091

Regards,
Veda


#3

Obvious is good. In this case, I believe the port is open. I did the curl request and got this response:


#4

Hi @mayanklive1,

Thanks for checking. To further isolate the problem, have you tried testing the OmniSci JDBC interface using the sample code provided in the docs?

Regards,
Veda


#5

I don’t know Java unfortunately.

I ran into a similar error a year ago, but back then I had the wrong port and didn't have the right jar loaded. This time I made sure to download the latest one. Does the extended error tell you anything?


#6

I changed the EC2 address to a bogus one and got the same error. This leads me to believe my Databricks/Spark code isn't even reaching the instance. Is there a way to run a similar curl-type command from PySpark? Perhaps using the requests module? (Just to confirm Databricks can reach the OmniSci DB.)

I tried requests.get("telnet://EC2ADDRESS.compute-1.amazonaws.com:9091") but requests doesn't understand the telnet scheme, so I tried requests.get("http://EC2ADDRESS.compute-1.amazonaws.com:9091") and nothing happened. Advice?
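A raw TCP check from inside the notebook (the equivalent of curl -v telnet://) can be done with Python's standard socket module, since requests only speaks HTTP. A minimal sketch, with the hostname as a placeholder:

import socket

# Placeholder host - substitute the real OmniSci EC2 address.
host = "EC2ADDRESS.compute-1.amazonaws.com"
port = 9091

try:
    # Attempt a raw TCP handshake, analogous to curl -v telnet://host:port
    with socket.create_connection((host, port), timeout=10):
        print("TCP connection succeeded - port is reachable")
except OSError as exc:
    print("TCP connection failed:", exc)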


#7

Hi @mayanklive1,

I do not have experience testing OmniSci with Databricks/Spark; I will research and get back to you.

Happy Thanksgiving!
Regards,
Veda


#8

Hi @mayanklive1,

I see that @dwayneberry has answered this question and laid out a step-by-step procedure for loading a dataset from Spark into an OmniSci table using a DataFrame write. Could you please try the steps listed here?
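For reference, the general shape of those steps on the Spark side looks roughly like the sketch below (this is not the linked procedure verbatim; the jar path, address, and credentials are placeholders):

from pyspark.sql import SparkSession

# The MapD JDBC jar must be on the classpath; on Databricks this is
# usually done by attaching the jar as a cluster library rather than
# via spark.jars.
spark = (
    SparkSession.builder
    .config("spark.jars", "/path/to/mapd-jdbc.jar")
    .getOrCreate()
)

# A toy DataFrame standing in for the real data.
df = spark.createDataFrame([(1, "a")], ["id", "val"])

(
    df.write.format("jdbc")
    .option("url", "jdbc:mapd:EC2ADDRESS:9091:mapd")
    .option("driver", "com.mapd.jdbc.MapDDriver")
    .option("dbtable", "subs_dim")
    .option("user", "mapd")
    .option("password", "INSTANCEID")
    .save()
)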

Regards,
Veda


#9

That's my post from a year ago :grimacing: But I know the next couple of steps to take to move the ball forward:

  1. Confirm OmniSci is listening on port 9091 with the curl command, run from a different IP (one not whitelisted to access the VPC). If the curl command fails there, then I likely need to whitelist my Spark-running EC2s to access OmniSci, as they are probably just not permitted access.

  2. Follow @dwayneberry's linked steps from my previous post: SSH onto the OmniSci instance, etc.

Will report back.

Happy Thanksgiving!


#11

@veda.shankar @dwayneberry

Still working on this.

I'm following @dwayneberry's instructions from a year ago and I'm confused by one part: when I SSH onto the OmniSci instance and type "build/bin/mapdql -p HyperInteractive", I get a "No such file or directory" error.

What am I missing? Do I need to SSH onto the OmniSci instance and change something in order to write to it? Thanks

EDIT: To confirm, here is my write command with the EC2 address/password slightly changed. Is there something wrong with my formatting?

subs.write.format("jdbc").option("url", "jdbc:mapd:ec2-52-292-231-169.compute-1.amazonaws.com:9091:mapd").option("driver", "com.mapd.jdbc.MapDDriver").option("dbtable", "subs_dim").option("user", "mapd").option("password", "i-09f1082e9f0320222").save()
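For comparison, the same write expressed through the df.write.jdbc convenience method with a properties dict, which can be easier to eyeball for typos (a sketch with the same placeholder address and password):

subs.write.jdbc(
    url="jdbc:mapd:ec2-52-292-231-169.compute-1.amazonaws.com:9091:mapd",
    table="subs_dim",
    # mode defaults to error-if-exists, matching the plain .save() above
    properties={
        "driver": "com.mapd.jdbc.MapDDriver",
        "user": "mapd",
        "password": "i-09f1082e9f0320222",
    },
)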


#12

It's unlikely you will find mapdql on that path.

It's located in the bin subdirectory of the mapd installation.


#13

I’m using the AWS AMI of OmniSci (community). How would I go about accessing it there?


#14

Hi @mayanklive1,

The AWS image has the build installed under /opt/mapd/, so the mapdql binary lives in the /opt/mapd/bin directory (i.e., /opt/mapd/bin/mapdql -p HyperInteractive).

Anyway, I think you are experiencing a network problem between your Databricks/PySpark and OmniSci instances, which could be caused by one of the following (from the most to the least likely):

1. The OmniSci instance's security group is not configured to accept incoming connections from the Databricks/PySpark public IP address.
2. The Databricks/PySpark instance is configured to block outgoing connections to the OmniSci instance.
3. The OmniSci instance has no connections available to serve new incoming requests, so it is queueing yours.
4. You have an asymmetric routing problem between the source and destination hosts.

A quick way to rule out the first two from the Databricks side is sketched after this list.
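One quick test is connecting from the Databricks driver with the pymapd Python client instead of JDBC — a sketch, assuming the pymapd package is installed on the cluster, with host and password as placeholders:

from pymapd import connect

# Placeholders - substitute the real EC2 address and instance-id password.
con = connect(
    user="mapd",
    password="INSTANCEID",
    host="EC2ADDRESS.compute-1.amazonaws.com",
    dbname="mapd",
    port=9091,
)
print(con.get_tables())  # if this returns, the network path and credentials are fine
con.close()

If pymapd times out the same way, the problem is squarely in the network path rather than in the Spark JDBC configuration.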