--- Log opened Fri Feb 19 00:00:34 2021
00:20 -!- maaku [~quassel@ec2-54-186-10-232.us-west-2.compute.amazonaws.com] has joined ##hplusroadmap
00:26 -!- mjr[m] [mjrconvers@gateway/shell/matrix.org/x-budkhveusszdwlwp] has quit [Ping timeout: 246 seconds]
00:26 -!- nikivi[m] [nikivimatr@gateway/shell/matrix.org/x-kejwyqnjjcpurrrf] has quit [Ping timeout: 244 seconds]
00:29 -!- spaceangel [~spaceange@ip-94-112-205-34.net.upcbroadband.cz] has quit [Remote host closed the connection]
01:03 -!- Llamamoe [~Llamagedd@088156213124.radom.vectranet.pl] has joined ##hplusroadmap
01:16 -!- justan0theruser is now known as justanotheruser
01:21 -!- mjr[m] [mjrconvers@gateway/shell/matrix.org/x-ucnoykzksenuvqhy] has joined ##hplusroadmap
01:33 -!- nikivi[m] [nikivimatr@gateway/shell/matrix.org/x-hsguertsecmomxdc] has joined ##hplusroadmap
02:45 -!- darsie [~kvirc@84-113-55-200.cable.dynamic.surfer.at] has joined ##hplusroadmap
03:17 -!- Human_G33k [~HumanG33k@82-64-99-84.subs.proxad.net] has quit [Remote host closed the connection]
03:18 -!- Human_G33k [~HumanG33k@82-64-99-84.subs.proxad.net] has joined ##hplusroadmap
03:25 < justanotheruser> Hi all, long time no chat
03:26 < justanotheruser> Is anyone aware of research into deep learning applied to tabular data extraction from messy data formats like PDF?
03:27 < justanotheruser> In many/most cases Tabula is unable to properly detect where the table is
03:31 < fenn> keep me in the loop if you find anything useful
03:32 < fenn> there's also the whole ontology problem
03:40 < fenn> another PDF table data extraction thing http://camelot-py.readthedocs.io/en/master/
03:40 < L29Ah> fenn: you need at least four tin cans for it to work in the space w/o wasting propellant
03:41 < L29Ah> and you need to stop the carousel to apply meaningful amount of thrust or to connect to other spacecraft
03:41 < fenn> i'm fine with wasting propellant
03:41 < fenn> and you only need one tin can and some mass on a long string
03:42 < L29Ah> too bad there's generally no such chunks of mass in space
03:42 < fenn> solar panel works
03:42 < L29Ah> and you want your solar panels oriented
03:42 < fenn> not hard
03:42 < fenn> spin axis points at sun
03:43 < L29Ah> ideally with your meat can hiding behind them and the fuel depot
03:43 < L29Ah> to soak up the radiation
03:43 < fenn> i would despin for a solar storm
03:44 < fenn> you get several minutes of warning at least
04:08 -!- nickjohnson [sid789@gateway/web/irccloud.com/x-yzrvmhezpyqtbkhm] has quit [Remote host closed the connection]
04:08 -!- cannedprimates_ [sid16585@gateway/web/irccloud.com/x-omfmvlhcaaezptdk] has quit [Remote host closed the connection]
04:08 -!- yonkunas [uid403824@gateway/web/irccloud.com/x-jpgcjhnlxadslzwk] has quit [Remote host closed the connection]
04:12 -!- nickjohnson [sid789@gateway/web/irccloud.com/x-taeuafevyboxewek] has joined ##hplusroadmap
04:13 -!- cannedprimates_ [sid16585@gateway/web/irccloud.com/x-teqefaxslkaiozga] has joined ##hplusroadmap
04:13 -!- cannedprimates_ [sid16585@gateway/web/irccloud.com/x-teqefaxslkaiozga] has quit [Remote host closed the connection]
04:13 -!- nickjohnson [sid789@gateway/web/irccloud.com/x-taeuafevyboxewek] has quit [Remote host closed the connection]
04:14 -!- nickjohnson [sid789@gateway/web/irccloud.com/x-tzhejwwqbakkroxm] has joined ##hplusroadmap
04:15 -!- cannedprimates_ [sid16585@gateway/web/irccloud.com/x-cnwkdqcfcxbfgwra] has joined ##hplusroadmap
04:25 -!- yonkunas [uid403824@gateway/web/irccloud.com/x-qesdegtzdswbqpnn] has joined ##hplusroadmap
04:47 -!- faceface [~faceface@unaffiliated/faceface] has joined ##hplusroadmap
04:57 -!- Human_G33k [~HumanG33k@82-64-99-84.subs.proxad.net] has quit [Quit: Leaving]
04:57 -!- Human_G33k [~HumanG33k@82-64-99-84.subs.proxad.net] has joined ##hplusroadmap
04:58 -!- Human_G33k [~HumanG33k@82-64-99-84.subs.proxad.net] has quit [Remote host closed the connection]
06:01 < kanzure> justanotheruser: https://tabula.technology/
06:01 < kanzure> oh, you know them. well let's see.
06:02 < kanzure> https://excalibur-py.readthedocs.io/en/master/
06:02 < kanzure> hrm i guess that's probably related to camelot
06:02 < kanzure> maybe i'm just behind the times here
06:02 < kanzure> pasky: ^might know a thing
06:27 < kanzure> nmigen usb stack https://github.com/greatscottgadgets/luna
06:48 -!- Human_G33k [~HumanG33k@82-64-99-84.subs.proxad.net] has joined ##hplusroadmap
06:54 -!- Llamamoe [~Llamagedd@088156213124.radom.vectranet.pl] has quit [Quit: Leaving.]
07:14 -!- preview [~quassel@2407:7000:8423:b00::3] has joined ##hplusroadmap
07:15 -!- Human_G33k [~HumanG33k@82-64-99-84.subs.proxad.net] has quit [Quit: Leaving]
09:13 -!- juri__ [~juri@178.63.35.222] has joined ##hplusroadmap
09:13 -!- sivoais_ [~zaki@199.19.225.239] has quit [Ping timeout: 272 seconds]
09:18 -!- Netsplit *.net <-> *.split quits: juri_
09:21 -!- Netsplit over, joins: juri_
09:21 -!- juri_ [~juri@178.63.35.222] has quit [Max SendQ exceeded]
09:22 -!- sivoais [~zaki@unaffiliated/sivoais] has joined ##hplusroadmap
09:25 -!- mjr[m] [mjrconvers@gateway/shell/matrix.org/x-ucnoykzksenuvqhy] has quit [Ping timeout: 244 seconds]
09:39 -!- mjr[m] [mjrconvers@gateway/shell/matrix.org/x-irbddingoviqrmzi] has joined ##hplusroadmap
09:41 -!- thahxa_ [~thahxa@vlnsm7-toronto63-142-116-127-110.internet.virginmobile.ca] has joined ##hplusroadmap
09:47 -!- spaceangel [~spaceange@ip-94-112-205-34.net.upcbroadband.cz] has joined ##hplusroadmap
10:02 -!- juri__ is now known as juri_
10:47 -!- HumanG33k [~HumanG33k@82-64-99-84.subs.proxad.net] has joined ##hplusroadmap
10:49 -!- HumanG33k [~HumanG33k@82-64-99-84.subs.proxad.net] has quit [Remote host closed the connection]
10:49 < fenn> perseverance landing still https://nitter.fdn.fr/pic/media%2FEum6fRMVIAENjK1.jpg%3Fname%3Dorig
10:49 -!- HumanG33k [~HumanG33k@82-64-99-84.subs.proxad.net] has joined ##hplusroadmap
11:24 < kanzure> "GFI1 functions to repress neuronal gene expression in the developing inner ear hair cells" https://dev.biologists.org/content/147/17/dev186015
11:36 -!- Netsplit *.net <-> *.split quits: L29Ah, mjr[m], aztec, ptrcmd, bsm117532, branon, Cory, join_cordblood, HumanG33k, acertain, (+67 more, use /NETSPLIT to show all of them)
11:40 -!- Netsplit over, joins: dr_orlovsky, midnight, drolmer, gwillen, rodarmor, kanzure, andytoshi, livestradamus, maaku, bsm117532 (+67 more)
12:05 -!- thahxa_ [~thahxa@vlnsm7-toronto63-142-116-127-110.internet.virginmobile.ca] has quit [Remote host closed the connection]
12:06 -!- preview [~quassel@2407:7000:8423:b00::3] has quit [Ping timeout: 264 seconds]
12:07 < heath> https://www.reuters.com/news/picture/stray-dogs-with-bright-blue-fur-found-in-idUSRTR4Z65B
12:11 -!- HumanG33k [~HumanG33k@82-64-99-84.subs.proxad.net] has quit [Remote host closed the connection]
12:12 -!- HumanG33k [~HumanG33k@82-64-99-84.subs.proxad.net] has joined ##hplusroadmap
12:14 -!- thahxa [~thahxa@vlnsm7-toronto63-142-116-127-110.internet.virginmobile.ca] has joined ##hplusroadmap
12:15 -!- preview [~quassel@2407:7000:8423:b00::3] has joined ##hplusroadmap
14:02 < L29Ah> wtf no 24/7 live stream from mars rover
14:07 < lsneff> i don't think the DSN has the bandwidth for that
14:13 < kanzure> let's fix that
15:07 -!- spaceangel [~spaceange@ip-94-112-205-34.net.upcbroadband.cz] has quit [Remote host closed the connection]
15:17 -!- SDr [~SDr@unaffiliated/sdr] has quit [Quit: SDr]
15:20 < kanzure> https://en.wikipedia.org/wiki/Leonard_v._Pepsico,_Inc
15:20 -!- SDr [~SDr@unaffiliated/sdr] has joined ##hplusroadmap
15:21 < kanzure> https://migflug.com/
15:28 -!- yashgaroth [~ffffffff@2601:5c4:c780:6aa0:65ac:3a6c:fb9:9b99] has joined ##hplusroadmap
15:31 < kanzure> https://www.raptoraviation.com/warbirds/1986-mig-29ub
15:37 < kanzure> http://www.jetlease.com/listings.cfm?p=sales
15:39 < lsneff> going somewhere?
15:50 < kanzure> you ever wake up thinking "you know what, i feel like purchasing a fighter jet today"?
15:50 < kanzure> larry ellison gets to, so why not
15:54 < kanzure> and just for joining in the fun https://memed.io/laser-eyes-meme-maker
16:02 -!- thahxa [~thahxa@vlnsm7-toronto63-142-116-127-110.internet.virginmobile.ca] has quit [Remote host closed the connection]
16:10 < lsneff> https://usercontent.irccloud-cdn.com/file/uxCni6q9/memed-io-output.jpeg
16:55 -!- filipepe_ [uid362247@gateway/web/irccloud.com/x-hmomngupxltipqhw] has joined ##hplusroadmap
17:33 -!- yashgaroth [~ffffffff@2601:5c4:c780:6aa0:65ac:3a6c:fb9:9b99] has quit [Quit: Leaving]
17:36 < justanotheruser> kanzure: I've tried camelot, excalibur, and tabula. But the failure of all three major libraries made me wonder if this problem could be solved well and generalized. I have a public dataset from a government agency for which 100% of the data is in pdfs and 10% is api accessible. I'm considering whether I could train a CNN to convert PDF/image to data json. Additionally this might be expanded
17:36 < justanotheruser> to other datasets with pdf to mappings. Is there any relevant research that comes to mind?
17:39 < justanotheruser> also worth mentioning that the document structure can vary quite a bit between the pdf in the mentioned dataset. Different sections appearing and disappearing between publications for example.
17:55 -!- darsie [~kvirc@84-113-55-200.cable.dynamic.surfer.at] has quit [Ping timeout: 256 seconds]
18:41 < kanzure> justanotheruser: pasky will hook you up with the right machine learning models
18:44 < kanzure> if you have a consistent format in multiple pdf docs... https://github.com/measuresforjustice/textricator
18:45 < kanzure> justanotheruser: or this thing https://github.com/interviewBubble/Tabulo
19:07 -!- filipepe_ [uid362247@gateway/web/irccloud.com/x-hmomngupxltipqhw] has quit [Quit: Connection closed for inactivity]
19:14 < justanotheruser> interesting leads, thanks!
22:08 -!- faceface [~faceface@unaffiliated/faceface] has quit [Remote host closed the connection]
23:48 < fenn> pdf is a wonky format and it's unlikely that one set of documents will have the same internal data structures as another, as seen by a computer
23:48 < fenn> it would be better to train on OCR instead because that regularizes the input
23:48 < fenn> and as a bonus OCR works on images
23:50 < fenn> i mean train on OCR output
--- Log closed Sat Feb 20 00:00:34 2021
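
Editor's sketch, not part of the log: fenn's suggestion at 23:48-23:50 is to work from OCR output rather than from PDF internals, since OCR gives every document the same shape of input (words plus positions). Below is a minimal illustration of what that regularized input makes possible, assuming an OCR engine has already produced word boxes. The `Word` type, the `group_rows` function, the `y_tol` tolerance, and the sample data are all hypothetical and do not come from Tabula, Camelot, Excalibur, or any other tool mentioned in the channel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR token with the top-left corner of its bounding box (pixels)."""
    text: str
    x: float
    y: float


def group_rows(words: List[Word], y_tol: float = 5.0) -> List[List[str]]:
    """Cluster words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Join the current row if this word is within y_tol of the row's first word;
        # otherwise start a new row.
        if rows and abs(word.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Hypothetical OCR output for a two-row table, deliberately out of reading order.
sample = [
    Word("42", 120, 31), Word("year", 10, 10),
    Word("2020", 10, 30), Word("count", 120, 11),
]
table = group_rows(sample)
```

A fixed pixel tolerance is the simplest possible row heuristic and would likely break on skewed scans or multi-line cells. A learned model, as justanotheruser proposes, would take the place of this step while still consuming the same word-box input.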