So I consider it a complete success.
Kudos to all contributors.
Distributions attempting to package phantomjs properly had one hell of a time trying to reproduce its builds reliably. Most gave up.
Distribution from author as binaries is a whole bundle of fail from the get-go.
In fact, there's a command line switch for it https://developers.google.com/web/updates/2017/04/headless-c...
The Chrome team also makes Puppeteer, a Node library for interfacing with headless Chrome, which has methods for making PDFs as well: https://github.com/GoogleChrome/puppeteer
--print-to-pdf
https://developers.google.com/web/updates/2017/04/headless-c...
It's really easy to do using [puppeteer](https://github.com/GoogleChrome/puppeteer). The 2nd or 3rd example is PDF.
> Headless Chrome is coming [...] I think people will switch to it, eventually. Chrome is faster and more stable than PhantomJS. And it doesn't eat memory like crazy. [...] I don't see any future in developing PhantomJS. Developing PhantomJS 2 and 2.5 as a single developer is a bloody hell.
One potential path forward could have been to have PhantomJS support Headless Chrome as a runtime [2], which Paul Irish (of the Google Chrome team) reached out to PhantomJS about. However, it seems there haven't been enough interest or resources to ever make this happen.
[1] https://groups.google.com/d/msg/phantomjs/9aI5d-LDuNE/5Z3SMZ...
• phantomjs is 7 years old, @pixiuPL has been contributing for about 2 months
• @ariya didn't respond to his requests for owner level permissions
• @pixiuPL published an open letter to the main page of phantomjs.org https://github.com/ariya/phantomjs/issues/15345
• the stress leads @ariya to close the repo.
• @pixiuPL intends to continue development on a fork
This is a good reminder of why non-technical skills are so important in open source and in general.
It's much more lightweight than a real browser, and it doesn't require large extra binaries.
I don't do any complex scraping, but occasionally I want to pull down and aggregate a site's data. For most pages, it's as simple as making a request and passing the response into a new jsdom instance. You can then query the DOM using the same built-in browser APIs you're already familiar with.
I've previously used jsdom to run a large web app's tests on node, which provided a huge performance boost and drastically lowered our build times. As long as you maintain a good architecture (i.e. isolating browser specific bits from your business logic) you're unlikely to encounter any pitfalls. Our testing strategy was to use node and jsdom during local testing and on each commit. IMO, you should generally only need to run tests on an actual browser before each release (as a safety net), and possibly on a regular schedule (if your release cycle is long).
To summarize: it does not look like the guy has made a single commit with any meaning. His commits are basically the following:
1. Adding his own name in package.json
2. Adding and deleting whitespace.
3. Deleting the entire project and committing.
4. Adding the entire project back again and committing.
Just out of curiosity: how likely is it that someone could use a large number of such non-functional commits (adding and removing whitespace) to a popular open source repository to boost their career ambitions? (E.g., claiming that they made 50 commits to a popular project might sound impressive in an interview.)
@pixiuPL thinks he's king of the world, but gets rightfully put in his place.
Headless Chrome with Puppeteer: https://github.com/GoogleChrome/puppeteer
Firefox-based Slimer.js: https://github.com/laurentj/slimerjs (same API as Phantom which is useful if using a higher level library like http://casperjs.org/)
I’m working on building out a serverless model, which is the holy grail of headless workflows, but it’s a bit more challenging to operationalize than one would think.
I’m hoping that these efforts will lower the bar for folks wanting to get started with puppeteer and headless Chrome!
Has anyone here figured out any tricks to get headless Chrome booted fast?
godet is the lib I use for chrome piloting, replace with your favorite one.
I based the pool off of https://github.com/latesh/puppeteer-pool/blob/master/src/ind... .
All the best to everybody!
--ssl-client-certificate-file and --ssl-client-key-file
"Will do as advised, as I really think PhantomJS is good project, it just needs good, devoted leader."
chromium-browser --headless --disable-gpu --print-to-pdf=output_file_name.pdf file:///path/to/your/html
So this does work for very basic PDF printouts, but so far Phantom is the only tool that offers full control over the PDF output, even down to things like margins, paper size, etc.
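For anyone scripting the command above, here is a minimal Python sketch that assembles the same invocation. The function names are made up for illustration, and the `chromium-browser` binary name is just the one from the comment; yours may differ. It only wraps the flags already shown and does not add any control over margins or paper size.

```python
import subprocess
from pathlib import Path


def build_print_to_pdf_command(html_path, pdf_path, binary="chromium-browser"):
    """Return the argument list for a headless print-to-PDF run.

    `binary` is an assumption; the executable name varies by platform.
    """
    # Headless Chrome takes a URL, so turn the local path into file:///...
    page_url = Path(html_path).resolve().as_uri()
    return [
        binary,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        page_url,
    ]


def print_to_pdf(html_path, pdf_path, binary="chromium-browser"):
    """Run the command; raises CalledProcessError if the browser exits non-zero."""
    subprocess.run(build_print_to_pdf_command(html_path, pdf_path, binary), check=True)
```

Calling `print_to_pdf("report.html", "out.pdf")` would then need a Chromium install on the PATH; building the list separately makes it easy to inspect or log the command first.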
[1]: https://developer.mozilla.org/en-US/Firefox/Headless_mode
[2]: https://developers.google.com/web/updates/2017/04/headless-c...
Especially his own commits (non-merge commits)
How did his changes even make it into the repo? There are commits adding and deleting whitespace with the disguised commit message of "Refactoring Code". I have no doubt about why ariya couldn't work with him.
I couldn't find a single one containing any meaningful code changes. The closest one is a81a38f[1] which seems to introduce bugs - removing open file check, plus a hanging if clause.
Sounds like it's either an elaborate prank, or the guy has no grounding in reality.
[1] https://github.com/ariya/phantomjs/commit/a81a38ffabe2cea715...
Looking at some issues filed by him (https://github.com/composer/composer/issues/7016) makes the entire thing more clear.
In this commit the guy deletes two spaces from a file, and adds a copyright notice with his name at the top. Going through his commits has left me extremely shocked. I mean, how did such low-quality commits make it into the master branch of the repo? It is like these commits were invisible to all the visitors and users of the repo.
Obviously with my situation, this is not the end of the world. I use the parser twice a year and Phantom will continue to handle that task just fine. But I also know that the switch to using headless Chrome would be an expensive one if necessary; we have to research it, we have to update local dev environments, we have to implement it, we have to write new tests for it, we have to test it, we have to update our deployment strategy, update our server deployment configuration, and, worst of all, get all of these changes and new software installations approved by the USPTO, which is a nightmare. My situation is simple, but would take several weeks to several months to actually deploy to production. As it stands, I will likely have to explain why we have a now-unmaintained piece of software on the server and may be forced to switch regardless.
I can easily imagine how this project sunsetting, even though there is a clear alternative and successor, could be a nightmare for a lot of people. It's not the end of the world, but it's definitely unfortunate.
https://www.cooperativepatentclassification.org/Archive.html
Phantomjs was generally great for that type of rendering
IPC can be downloaded from the link below. I needed the Valid Symbol List. Looks like they fixed the encoded JSON that was there when they first put out the new format.
http://www.wipo.int/classifications/ipc/en/ITsupport/Version...
I could be wrong, but it didn't seem like an equivalent to headless Chrome or Firefox.
- Removing one whitespace and adding an unnecessary file https://github.com/ariya/phantomjs/commit/98272b9752b2d505f7...
- Conflicted files https://github.com/ariya/phantomjs/commit/63a69d9e2e9c31baab...
- Personal build env https://github.com/ariya/phantomjs/commit/d57ff74f36c5b79d82...
- Deleted the whole project while changing cloud provider https://github.com/ariya/phantomjs/commit/a242fb8d605d9aa4af...
- then re-adding the whole project again https://github.com/ariya/phantomjs/commit/ddaaa09785d453e415...
and other weird/careless commits
Are such incidents common in other open source software too, or does this one seem a rare case?
It must be particularly difficult when your Groovy-as-a-string script itself has many strings in its code, which is what a typical Apache Groovy build script for Gradle looks like.
Not the grandfather, but generally in browsers you have two versions of HTML "source" - the canonical source, the stuff pulled down over HTTP, and the repaired source, the version that actually gets rendered.
I'm unfamiliar with Nokogiri, but I suspect that from context, it doesn't repair HTML in the same way that browsers do.
That is both true and false. Because the JS can introduce dynamic content, the source returned by the HTTP response often doesn't match the source that is rendered by the browser itself. In many cases, a site will return a skeleton (just HTML) and then make an Ajax request to populate it. In my case, it was just the skeleton HTML with a few hundred lines of JS plus a long string of JSON.
As far as I am aware, Nokogiri isn't capable of that, and even if it is, I was unaware of that library at the time I wrote the Phantom solution (I only discovered it last summer but have yet to use it for anything).
>In 2017, they switched to a system that loads in the data from JSON stored in Javascript in the HTML
The post being replied to claims that Nokogiri doesn't see this, however, so I'm puzzled.
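For the case quoted above, where the data is a JSON literal sitting inside a script tag in the HTML that comes over the wire, a static parser should in principle be able to reach it without executing any JavaScript. Here is a rough Python sketch using only the standard library. The page, the `window.__DATA__` variable name, and the values in it are all made up for illustration; the real site's markup is not shown in this thread and may well be structured differently.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the raw text of every <script> element, without running it."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_embedded_json(html, variable="window.__DATA__"):
    """Return the JSON value assigned to `variable` in a script tag, or None.

    `variable` is a hypothetical name; adjust it to whatever the page uses.
    """
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    marker = re.compile(re.escape(variable) + r"\s*=\s*")
    decoder = json.JSONDecoder()
    for script in collector.scripts:
        match = marker.search(script)
        if match:
            # raw_decode parses one JSON value and ignores the JS after it.
            value, _ = decoder.raw_decode(script, match.end())
            return value
    return None


# Made-up page: skeleton HTML plus a JSON blob assigned in a script tag.
SAMPLE_HTML = """<html><body><div id="app"></div>
<script>window.__DATA__ = {"symbols": ["A01B", "A01C"], "version": 2017};
render(window.__DATA__);</script></body></html>"""

data = extract_embedded_json(SAMPLE_HTML)
```

This only works when the JSON is literally present in the response body. If the page fetches its data with a later Ajax request, or builds it up in JavaScript, you are back to needing something that actually runs the scripts.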
That's also the reason why you had to "pre-render" your JavaScript web apps for SEO purposes until Googlebot got the ability to execute JavaScript.
I've never seen "View Page Source" or "Show Page Source" be the current DOM representation. It's always the HTML that came over the wire, the same as you'll get from curl (unless the server is doing user agent shenanigans, which I think we can agree is out of scope here).
If you're talking about the page after JavaScript has run, the only way you're seeing that is by opening the dev tools and looking in the 'Elements' or 'Inspector' panel.
I just checked in Safari, Chrome, and Firefox and found this to be true in all of them. The distinction between the View Source and DOM Inspector is very clear.