--- Log opened Sat Feb 20 00:00:34 2021
01:20 -!- darsie [~kvirc@84-113-55-200.cable.dynamic.surfer.at] has joined ##hplusroadmap
01:23 < TMA> on the other hand, there is usually reduced ambiguity between 0oO, Il1|, rn/m and such when you take the pdfs in a non-image way
01:34 < fenn> you could feed the OCR a dictionary from the PDF text
04:02 -!- sanehatter [~sanehatte@141.98.255.147] has quit [Quit: -]
04:04 -!- biglot00 [~biglot00@212.102.45.109] has joined ##hplusroadmap
04:06 -!- biglot00 [~biglot00@212.102.45.109] has left ##hplusroadmap []
04:17 -!- sanehatter [sanehatter@gateway/vpn/mullvad/sanehatter] has joined ##hplusroadmap
04:19 -!- yashgaroth [~ffffffff@2601:5c4:c780:6aa0:5cda:ee35:7600:7b2a] has joined ##hplusroadmap
05:47 -!- docl_ is now known as docl
08:24 -!- CryptoDavid [uid14990@gateway/web/irccloud.com/x-xsfyarruxqtezbep] has joined ##hplusroadmap
09:46 -!- Malvolio [~Malvolio@unaffiliated/malvolio] has quit [Quit: brb]
09:51 -!- Malvolio [~Malvolio@unaffiliated/malvolio] has joined ##hplusroadmap
10:11 -!- geekitgood [~Adium@2600-6c67-6e7f-18d6-d417-4003-7e43-76ae.res6.spectrum.com] has joined ##hplusroadmap
10:12 -!- geekitgood1 [~Kaleb@2600-6c67-6e7f-18d6-9c50-97b9-4c6c-71ae.res6.spectrum.com] has joined ##hplusroadmap
10:17 -!- geekitgood1 [~Kaleb@2600-6c67-6e7f-18d6-9c50-97b9-4c6c-71ae.res6.spectrum.com] has left ##hplusroadmap []
10:20 -!- geekitgood1 [~Kaleb@2600-6c67-6e7f-18d6-9c50-97b9-4c6c-71ae.res6.spectrum.com] has joined ##hplusroadmap
10:30 -!- geekitgood [~Adium@2600-6c67-6e7f-18d6-d417-4003-7e43-76ae.res6.spectrum.com] has quit [Quit: Leaving.]
10:35 -!- geekitgood [~Adium@2600-6c67-6e7f-18d6-d417-4003-7e43-76ae.res6.spectrum.com] has joined ##hplusroadmap
10:37 -!- geekitgood [~Adium@2600-6c67-6e7f-18d6-d417-4003-7e43-76ae.res6.spectrum.com] has left ##hplusroadmap []
10:57 -!- geekitgood1 [~Kaleb@2600-6c67-6e7f-18d6-9c50-97b9-4c6c-71ae.res6.spectrum.com] has quit [Quit: Leaving.]
11:27 -!- spaceangel [~spaceange@ip-94-112-205-34.net.upcbroadband.cz] has joined ##hplusroadmap
12:16 < lsneff> I'd bet a combined approach would be best. Use ocr to determine what elements correspond to what, and then chain together the text elements actually present at those locations in the pdf.
14:24 -!- geekitgood [~Kaleb@2600-6c67-6e7f-18d6-0cde-7a8f-7c5c-3b95.res6.spectrum.com] has joined ##hplusroadmap
15:13 -!- preview [~quassel@2407:7000:8423:b00::3] has quit [Ping timeout: 240 seconds]
15:19 -!- spaceangel [~spaceange@ip-94-112-205-34.net.upcbroadband.cz] has quit [Remote host closed the connection]
15:23 -!- SDr [~SDr@unaffiliated/sdr] has quit [Ping timeout: 256 seconds]
15:24 -!- SDr [~SDr@unaffiliated/sdr] has joined ##hplusroadmap
15:35 < pasky> fenn: since justanotheruser is talking about CNNs he likely wants to work on rendered PDF (which is absolutely the right call)
15:36 < pasky> unfortunately i don't have up-to-date idea about opensource tools in this domain; we have moved pretty far from low-level developer tool to an end-to-end business solution meanwhile
16:20 -!- CryptoDavid [uid14990@gateway/web/irccloud.com/x-xsfyarruxqtezbep] has quit [Quit: Connection closed for inactivity]
16:58 -!- filipepe_ [uid362247@gateway/web/irccloud.com/x-otsyqjgmofpkydkz] has joined ##hplusroadmap
17:01 -!- geekitgood [~Kaleb@2600-6c67-6e7f-18d6-0cde-7a8f-7c5c-3b95.res6.spectrum.com] has left ##hplusroadmap []
17:04 -!- join_cordblood [~join_cord@135-23-248-163.cpe.pppoe.ca] has quit [Ping timeout: 256 seconds]
17:36 -!- join_cordblood [~join_cord@135-23-248-163.cpe.pppoe.ca] has joined ##hplusroadmap
17:58 -!- darsie [~kvirc@84-113-55-200.cable.dynamic.surfer.at] has quit [Ping timeout: 240 seconds]
18:56 -!- yashgaroth [~ffffffff@2601:5c4:c780:6aa0:5cda:ee35:7600:7b2a] has quit [Quit: Leaving]
19:04 -!- SDr [~SDr@unaffiliated/sdr] has quit [Ping timeout: 265 seconds]
19:15 -!- SDr [~SDr@unaffiliated/sdr] has joined ##hplusroadmap
20:17 -!- preview [~quassel@2407:7000:8423:b03:8f:7e45:612d:862b] has joined ##hplusroadmap
21:18 -!- filipepe_ [uid362247@gateway/web/irccloud.com/x-otsyqjgmofpkydkz] has quit [Quit: Connection closed for inactivity]
21:29 -!- juri_ [~juri@178.63.35.222] has quit [Ping timeout: 272 seconds]
22:09 -!- SDr [~SDr@unaffiliated/sdr] has quit [Ping timeout: 272 seconds]
22:16 -!- SDr [~SDr@unaffiliated/sdr] has joined ##hplusroadmap
22:28 < justanotheruser> In theory a CNN could probably work with raw pdf, but that seems suboptimal :p
22:31 < justanotheruser> also wouldn't have the flexibility of being applicable to other documents
22:32 < justanotheruser> lsneff: I've tried that and it results in a massive load of edge case handling in the dataset I'm working with.
22:32 -!- thahxa [~thahxa@vlnsm7-toronto63-142-116-127-110.internet.virginmobile.ca] has joined ##hplusroadmap
22:33 < lsneff> Honestly, the easiest way that might work well enough would probably be to do shitty ocr on the pdf, and then dump that through gpt3 and ask it to sort everything into a nice, labeled list.
--- Log closed Sun Feb 21 00:00:35 2021