a ‑ ARM: Automatic Rhodopsin Modeling with Chromophore Cavity Generation, Ionization State Selection, and External Counterion Placement

: The Automatic Rhodopsin Modeling (ARM) protocol has recently been proposed as a tool for the fast and parallel generation of basic hybrid quantum mechanics/ molecular mechanics (QM/MM) models of wild type and mutant rhodopsins. However, in its present version, input preparation requires a few hours long user ’ s manipulation of the template protein structure, which also impairs the reproducibility of the generated models. This limitation, which makes model building semiautomatic rather than fully automatic, comprises four tasks: de ﬁ nition of the retinal chromophore cavity, assignment of protonation states of the ionizable residues, neutralization of the protein with external counterions, and ﬁ nally congruous generation of single or multiple mutations. In this work, we show that the automation of the original ARM protocol can be extended to a level suitable for performing the above tasks without user ’ s manipulation and with an input preparation time of minutes. The new protocol, called a -ARM, delivers fully reproducible (i.e., user independent) rhodopsin QM/MM models as well as an improved model quality. More speci ﬁ cally, we show that the trend in vertical excitation energies observed for a set of 25 wild type and 14 mutant rhodopsins is predicted by the new protocol better than when using the original. Such an agreement is re ﬂ ected by an estimated (relative to


INTRODUCTION
−3 The recent discovery of a new family of lightsensing microbial rhodopsins 4−7 indicates that we do not still fully comprehend the vast distribution and functional diversity of these systems, which are likely to exploit, globally, an amount of sun-light energy larger than that harnessed by photosynthetic systems.
In spite of their functional diversity, rhodopsins display a remarkably common protein architecture featuring seven αhelices forming a cavity hosting a retinal protonated Schiff base (rPSB) chromophore covalently bound to a lysine located in the middle of helix VII (helix G for microbial rhodopsins). 2−32 For instance, the comprehension of how such variation determines a change in the wavelength of absorption maximum (λ max a ) in tens of rhodopsins or rhodopsin mutants, was studied as the first stage in the understanding of functional variation. 20However, it is apparent that a solid comprehension of how different functions emerged would require the comparative investigation of entire arrays of rhodopsins, thus actively searching for common molecularlevel (e.g., residue type, placement, and conformation) traits associated with an observed property.
There is another reason for moving from the investigation of few rhodopsins to the investigation of larger rhodopsin arrays.
−37 In optogenetics, specific microbial rhodopsins and/or their mutants are expressed in neurons, with the aim of activating, inhibiting, or visualizing neuronal activity through their interaction with light of a specific wavelength.−46 As discussed above, both the understanding of function variability and the search for mutants with desired properties call for a comparative investigation of large arrays (e.g., hundreds, if not thousands) of rhodopsins with different amino acid sequences.In principle, this type of investigation could be carried out experimentally via expression and purification of rhodopsins from many organisms or, in the case of mutant screening, using directed evolution methods based on random mutagenesis.However, this appears to be an expensive and unpractical research effort to be carried out systematically.As we will now discuss, these issues can, in principle, be pursued through computational means, provided that novel and specialized protocols become available.
−51 In fact, QM/ MM models decrease the computational cost by limiting the size of the protein moiety treated at the expensive QM level.For instance, in the rhodopsin models considered here, the rPSB chromophore is treated at the QM level using a multiconfigurational quantum chemical method, whereas the protein itself is treated at the inexpensive MM level using a suitable force field.However, even though the application of such technology had, and still has, an important impact for rhodopsin studies, conventional QM/MM models are, almost regularly, computationally complex models which are built manually and feature different QM methods and MM force fields when designed by different research groups.For this reason, they are often (i) timeconsuming to construct, (ii) noncongruous (e.g., not comparable), and (iii) error prone.Features i−iii impair the production of such models for extended rhodopsin arrays.
The recently proposed Automatic Rhodopsin Modeling protocol (from now on called ARM) 52 represents a first attempt toward the automated and fast generation of congruous QM/ MM models of rhodopsins.As illustrated in Figure 1, ARM models are specialized QM/MM models and, in general, would not be applicable to other (e.g., cytoplasmic) photoresponsive proteins.ARM is not designed to produce the most accurate QM/MM models possible (see, for instance, the models of refs 50 and 53 targeting accurate spectroscopic studies), but basic, gas-phase, and computationally fast models aimed at the rationalization and prediction of trends between sequence variability and function.Therefore, ARM aims to satisfy the following desirable features suitable for the generation of arrays of models: automation, so as to reduce building errors and avoid biased QM/MM modeling; speed, so as to deal with large sets of rhodopsins and/or rhodopsin mutants; documented accuracy, so as to be able to translate results into an experimentally assessable hypothesis; transferability, so as to treat rhodopsins with large differences in sequence (i.e., organism belonging to different life domains and kingdoms).
The current version of ARM has been tested for the prediction of trends in λ max a of a limited set of wild type and mutant vertebrate, invertebrate, and microbial rhodopsins, 10,20,52,54,55 showing good agreement with experimental data.The required input includes (A) an X-ray crystallographic structure or comparative model of the protein in PDB (Protein Data Bank) format, 56,57 (B) a list of residues forming the chromophore cavity, (C) the protonation states of ionizable side chains, and (D) the position of extracellular (OS) and intracellular (IS) counterions.As we will detail below, the main drawback of ARM is that it is, substantially, only a semiautomatic (i.e., not fully automated) protocol, as the generation of its input is achieved through a manual manipulation of the template structure necessary to provide the information on points A−D. 52urthermore, due to possible different user choices (e.g., during the placement of IS and OS counterions), the reproducibility of the results cannot be guaranteed.The latter is a worrisome aspect, since the produced ARM model and, consequently, the calculated properties may be user-dependent.Such limitations, added to the human error factor, represent a serious issue when the target is the generation of hundreds of rhodopsin models (see for instance ref 58 for an example where this would be the case).
In order to overcome the above-described limitations, here we report a novel version of ARM named a-ARM.We will show that, when adopting certain default choices/parameters, a-ARM is capable of performing automatically (i.e., avoiding user manipulation) the following four key steps: (A) definition of the chromophore cavity, (B) assignment of protonation states of  59 ionizable residues side chain, (C) placement of OS and IS counterions, and (D) congruous generation of single or multiple point mutations, allowing in principle for a faster and parallel model building.Such an automated approach, called a-ARM default , adopts a set of default values for the choices determining how the QM/MM model is built.These are chain A, if different chains are present in the crystallographic data; chromophore cavity generation based on Voronoi tessellation and alpha spheres theory and including the lysine residue covalently linked to the rPSB, plus the main (MC) and secondary (SC) chromophore counterion residues; protonation states of the ionizable residues based on partial charges calculated at the crystallographic pH and using neutral His residues with the δ-nitrogen of the imidazole protonated (HID tautomer); OS and IS counterion (Na + /Cl − ) positions optimized with respect to an electrostatic potential grid constructed around each charged OS and IS residue.
Based on a benchmark set of 25 wild type rhodopsins (including vertebrate, invertebrate, and microbial) and 14 mutants and providing 39 observed λ max a values, below we report that a-ARM default has a 32/39 success ratio in reproducing the observed λ max a trend.In the cases for which the fully automated protocol fails (i.e., produces ΔE S1−S0 values far from the observed ones), we show that a semiautomatic approach called a-ARM customized can be employed, allowing for the construction of customized models, which display consistency with the observed trend.
Both a-ARM default and a-ARM customized not only have a high level of automation with respect to the original ARM, but also greatly reduced input preparation time, higher accuracy even when considering distant rhodopsins, and finally full reproducibility of the final results.

THEORY AND IMPLEMENTATION OF a-ARM a-ARM is an improved version of the original ARM based on a
Python subroutine, which allows for an automated production of QM/MM models of the type described in Figure 1.a-ARM is designed to generate the ARM input therefore avoiding, as much as possible, human manipulations.In a sense, a-ARM incorporates the original protocol but provides automatically (but also semiautomatically) all required input information.In order to facilitate the description of how a-ARM works, in section 2.1 we revise the main feature of the original input.The following sections 2.2 and 2.3 deal with the a-ARM structure and section 2.4 deals with a-ARM benchmarking.
2.1.ARM Input: Assets and Limitations.The original ARM is, substantially, a Bash shell script that links a series of publicly available computational packages, by automatically managing and passing information between them.The input (herein called ARM input) is constituted by two files containing the information described in points A−D of section 1.The PDB ARM input file contains the protein structure in PDB format (from either crystallographic or comparative modeling data) with the assigned residue protonation states and positions of Na + /Cl − external counterions.Instead, the cavity input file contains a list of residues constituting the cavity where the chromophore resides.
In the workflow of the protocol shown in Figure S1 of the Supporting Information, the ARM input is treated sequentially to perform the following actions by a series of software packages: mutation and rotamer selection, using SCWRL4; 60 addition of waters and hydrogens, employing DOWSER; 61 MM energy minimization and simulated annealing (SA)/molecular dynam-ics (MD) relaxation, with GROMACS; 62 geometry optimization and energy reevaluation at the CASSCF (12,12)/AMBER and CASPT2 (12,12) levels, respectively, 48 using a combination of the quantum chemical package MOLCAS 63 and molecular mechanics package TINKER. 64The SA/MD procedure is performed starting with N = 10 different seeds that provide 10 independent sets of initial velocities for generating 10 independent QM/MM models.Therefore, the resulting output files include 10 replicas of the final equilibrated ARM model as well as the average vertical excitation energy, from now on called simply vertical excitation energy (ΔE S1−S0 ), between ground state (S 0 ) and the first singlet excited state (S 1 ) computed at the CASPT2 level.From these 10 models, the output structure characterized by ΔE S1−S0 values closest to the average (N = 10) is selected.As anticipated above, such models correspond to gasphase and globally uncharged models of a rhodopsin monomer, composed of three subsystems, i.e., environment, cavity, and Lys-QM (see Figure 1).The QM part of the Lys-QM subsystem is treated at the CASSCF level and corresponds to the rPSB chromophore, while the Lys part of the same subsystem as well as the environment and cavity subsystems correspond to the MM part of the model and are described at the AMBER level.The entire model construction and 3-root state-average CASPT2 (12,12) vertical excitation energy calculation takes, after the input file preparation, ∼36 h CPU time when running the 10 replicas in parallel on a modern workstation.
In spite of their elementary structure, ARM models have been shown to be able to reproduce trends in λ max a variation in a set of diverse rhodopsins. 52−67 However, as mentioned in section 1, a critical automation limit of ARM is related to the manual preparation of its input files.Such preparation takes time (see section S1 in the Supporting Information).In our experience, we found that a skilled user can complete the preparation of an ARM input for a new rhodopsin protein in not less than 3 h.
The first step in the manual preparation of an ARM input is the manipulation of the PDB file containing the original rhodopsin crystallographic structure (see point A of section 1), aimed at removing unnecessary information such as unwanted protein chains and subsequently adjusting atoms and residue numbering.This step will also deal with the possible presence of two alternate locations of selected side chains in the same PDB file, for which there is no established selection procedure.Related to this issue, also selection of the residue containing the retinal chromophore (i.e., the residue that will define the Lys-QM subsystem) has to be performed manually.The selected protein chain, side-chain rotamers, and chromophore residue are ultimately written in the PDB ARM file.
ARM models are sensitive to the correct choice of protonation state of the protein ionizable residues 52 (see point C of section 1).To perform such an assignment, one may use experimental data and/or execute the external program PROPKA 68 (see also section 2.2.3) and analyze its output.In this way, residues with uncommon protonation states are identified and their threeletter code manually written in the PDB ARM file.
The location of the residues belonging to the cavity surrounding the retinal chromophore (see point B of section 1) is also performed manually through an external Web-based tool (CASTp, 69 see section 2.2.2).The user has then to manually prepare the cavity file containing the list of the selected cavity residues.Finally, the last step of the ARM input preparation is the neutralization of the protein environment, through the distribution of OS and IS counterions (see point D of section 1).This step is the most time-demanding and does not follow a well-defined procedure, since it requires the visual inspection of the protein structure and, therefore, has an impact on the reproducibility of the generated model.Again, the final type and coordinates of the selected counterions are added to the PDB ARM file.For a more detailed description of the above steps see section S1 in the Supporting Information.
2.2.a-ARM.As already mentioned above, a-ARM has the ability to operate either as a fully automated tool or as an interactive system for the semiautomatic generation of the ARM input presented above.More specifically, in a-ARM the information required for generating complete PDB ARM and cavity files may be provided via either default choices or by answering specific questions in the command line of terminal window.With such an input, the QM/MM model generated by the subsequent calculation is called a-ARM model.
According to the general workflow of a-ARM (see Figure 2), the procedure starts with the selection of the rhodopsin structure of interest used to prepare the ARM input and ends with the generation of the QM/MM a-ARM model and the calculation of the ΔE S1−S0 and corresponding λ max a (throughout Figure 2. a-ARM workflow.After the selection of the protein chain, a-ARM generates the ARM input files with complete information on the chromophore cavity, protonation states, and counterion placement (see Figure 1) corresponding to points B−D of section 1.The input is then used for the execution of the original ARM, 52 obtaining as output 10 a-ARM models along with the calculated average vertical excitation energy (ΔE S1−S0 ).The parallelograms represent input or output data, the continuous line squares refer to processes or actions, and the dashed lines mean software executions.The [A] mark symbolizes fully automation, whereas the [M] mark represent manual decision.Finally, the [M/A] mark indicates situation that may be either manual or automated (see text).Notice that the software execution labeled "QM/MM calculation" is the same as in the original ARM (see ref 52).In a-ARM the production of the PDB ARM and cavity input files takes only a few minutes.
this work, we assume that the vertical excitation energy provides a good approximation for the energy corresponding to the λ max a at the CASPT2 level of theory).The code behind the workflow reported in Figure 2 is driven by a modular, Python-based collection of routines and can be accessed upon request to the authors.In the following, we detail the four steps (see sections 2.2.1−2.2.4) of the a-ARM workflow.In section 2.2.5, we will instead report on an automatic mutant generation method also currently incorporated in a-ARM. 2.2.1.
Step 1. Automatic Identification of Protein Chain, rPSB, Chromophore Bounded Lys, MC, and SC.In Step 1 of Figure 2 (see also Figure S2 in the Supporting Information) we display the workflow necessary to obtain the initial structure of the rhodopsin of interest.To begin with, the user has the option to provide a crystallographic structure or a comparative model in PDB format or type the PDB ID to download the file directly from the RCSB PDB. 57The program is then able to identify automatically the different protein chains, which may be present in the PDB file and select chain A by default (i.e., automatically or [A]) or, alternatively, let the user select the chain (i.e., manually [M]).Thus, the program generates a file PDB (i) ARM , which contains information on the selected chain, residue conformations, chromophore, and water molecules.
Due to their local flexibility, certain residues may have two alternate side chains locations (i.e., conformations) in the protein crystallographic structure.The strategy adopted to assign a single rotamer, without the need to visualize the structure, is to analyze the atom occupancy number in the coordinate section of the file.This parameter, which takes values from 0 to 1, is used as a criteria to estimate the frequency of each conformation.Accordingly, a-ARM identifies the residues with atom occupancy different from 1.0, creates a list with residue name and sequential number, and the occupancy value of the alternate side chain locations and acts automatically [A] by selecting the rotamer with the largest occupancy or, alternatively, asks the user to select the wanted rotamer by typing the corresponding number [M].
The rPSB chromophore is automatically recognized and selected.For this purpose, the program identifies all residues which are not standard amino acids, waters, and membrane lipids and generates a list of possible chromophores.Once again, the chromophore can be selected automatically by default, which corresponds to the ordinary rPSB chromophore [A], or the user may select the correct option manually by typing the corresponding residue number [M].Here, we should stress that, while this step is instrumental for a future generalization of a-ARM (e.g., for considering multiple chromophores), there is only a single rPSB chromophore in rhodopsins and therefore the user intervention is not needed.Moreover, although in the majority of rhodopsin coordinate files in the RCSB PDB, 57 the retinal and the covalently linked lysine are two distinct residues (i.e., RET and LYS, respectively) in a minority of cases (e.g., 6EID 70 and 6EIG 70 ) retinal and lysine constitute a single residue (LYR).In principle, this LYR formatting is not compatible with ARM 52 algorithms, which are designed to recognize the RET and linker LYS as distinct residues.To deal with that, a-ARM is now able to automatically recognize the LYR residue and subsequently split it into RET and LYS, respecting the standard format of residue and atom names (see section S4 in the Supporting Information).
Another important feature of the program is that, based on the geometrical parameters of the selected chromophore, the chromophore-linked Lys side chain and the potential MC and SC counterions are automatically identified.This is achieved by first locating the linked Lys as the residue geometrically closest to the chromophore, by computing the Euclidean distance between each atom in the chromophore and the coordinates of the nitrogen "NZ" of all the Lys residues present in the structure.Then, the MC and SC are identified as the two Asp and/or Glu and/or crystallographic Cl − residues geometrically closest to the chromophore-linked Lys side chain, by computing the distance between the coordinates of its nitrogen "NZ" and the coordinates of the oxygen "O" of each of the carboxylatebearing residues (or the chlorine atom).However, this selection is only preparatory to the ionization state assignment (see section 2.2.3) that determines if the SC and MC are indeed acting as negatively charged Schiff base counterions.The inclusion of the Cl − anions contained in the X-ray structure into the QM/MM model, even when not considered as MC or SC, is a new feature of a-ARM that allows a more realistic description of rhodopsin chloride pumps (i.e., 5B2N 71 and 5G28 72 ). 2.2.2.
Step 2: Automatic Generation of the Chromophore Cavity.The identification and characterization of the chromophore cavity is a key step for the definition of congruous QM/MM models of rhodopsins (see Figure 2 and section 2.1).There are different algorithms for protein pocket detection. 73hese are mainly available via Web server-based facilities, but a few are distributed as a code for local usage.The widely used Web servers include CASTp, 69 employed in the original ARM protocol. 52Even though CASTp has proven to be effective, the fact that it is not available as a command line code makes it unsuitable for a full automation.Thus, we decided to use the Fpocket software, 74 which can be integrated in a-ARM as illustrated in Figure 2 (see also Figure S3 in the Supporting Information).Fpocket detects the chromophore cavity based on Voronoi tessellation and alpha spheres built on top of the publicly available package Qhull. 74First, a-ARM receives as input the previously generated PDB (i) ARM file and automatically executes the Fpocket software using the default options. 74As output, several protein pockets are obtained along with their scores.The selected chromophore cavity is the one that contains the Lys covalently linked to the rPSB and has the highest score.Finally, the previously identified MC and SC counterion residues are added to the cavity list (if not already present) and the updated list is written in the final cavity file.

2.2.3.
Step 3: Automatic Assignment of Ionization States.Our procedure for the assignment of the protonation state of the ionizable residues at a given pH and in their specific protein environments is based on the assumption that such state is a function of the pK a value. 75Accordingly, each residue with a titratable group is associated with a model pK a value (pK a Model ), 76 interpreted as the pK a displayed when the other protein side chains are in their neutral state.On the other hand, pK a Model is affected by the interaction between the residue and its actual environment, causing a change from the model value to the real pK a value (see eq 1) called pK a Calc .The magnitude of this change, called shift value (ΔpK a ), depends on the presence of hydrogen bonds, desolvation effects, and Coulomb interactions, all modulated through the degree to which the ionizable residue is "buried" within the protein. 68,75 The adopted procedure is outlined in Step 3 of Figure 2 (see also Figure S4 in the Supporting Information), and it is initialized automatically after the detection of the PDB (i) ARM and cavity files.In case that the crystallization conditions are available in the initial PDB structure file, the program identifies the experimental pH making the pH selection automatic [A].
Otherwise, the user is asked to insert the pH value and the pH selection is thus not automatic [M].Once the working pH is assigned, the pK a Calc is obtained using the PROPKA software which also determines the burying percentage. 68A preliminary preparation of the PDB (i) ARM file, consisting of completing the heavy missing atoms of chain residues (including hydrogen atoms), is needed to guarantee the correct operation of PROPKA. 77This requires using the PDB2PQR 78,79 software, which operates under the following workflow: (i) check for missing heavy atoms, (ii) reconstruct heavy atoms, (iii) build and optimize hydrogens, and (iv) assign atomic parameters (for further details see ref 78).PDB2PQR is automatically launched using as input the PDB (i) ARM file and as arguments the given pH and the AMBER force field.After that, PROPKA is launched and its output contains information on the calculated (pK a Calc ) values for each ionizable residue in the protein at the given pH. 68The subsequent assignment of the protonation states of the ionizable groups is carried out based on the above information.
According to a first approach (not reported in Figure 2) employed by Melaccio et al. 52 for the construction of the original ARM models, the parameters used to identify the state of the ionizable residues are the burying percentage, which indicates how accessible the residue is from the surface (for further details see ref 68), and the ΔpK a shift calculated at pH 7.0 as shown in eq 1.In contrast, in a-ARM the parameter used to identify the state of the ionizable residues is the side-chain ionization equilibrium.Such equilibrium is estimated by inserting both the pK a Calc value and the established working pH in the Henderson− Hasselbalch equation, 80  (2) The charges of the positive and negatively ionizable residues are then deduced from eq 2 using the following approximated rules: 81 ; for Asp and Glu and ; for Arg, Lys, and His where ⌈Q + ⌉ and ⌈Q − ⌉ are integers obtained by rounding the decimals using the "round half to even" convention.Once ⌈Q + ⌉ and ⌈Q − ⌉ are obtained, the following criteria is used to assign the ionization (i.e., protonation) state: Arg, Lys, His, if 1 ARN, LYD, HIE HID, if 1 The final result is added to the file PDB (i) ARM to generate the file PDB (ii) ARM now also containing the ionization states.There are two aspects that limit the confidence in the automation of the ionizable-state assignment described above.The first is that, due to the fact that the information provided by PROPKA 68 is approximated, the computed pK a Calc value may, in certain cases, be not sufficiently realistic.The second aspect concerns the assignment of the correct tautomer of histidine.This amino acid has charge of +1 when both the δ-nitrogen and ϵ-nitrogen of the imidazole ring are protonated (HIP), while it is neutral when either the δ-nitrogen (HID) or the ϵ-nitrogen (HIE) are deprotonated.a-ARM uses as a default the HID tautomer for the automatic assignment [A] or allows the user to choose between the three tautomers for a non-automated selection [M].Therefore, when possible, the user should collect the available experimental data and/or inspect the chemical environment of the ionizable residues including the histidines and propose the appropriate tautomer.Further details are given in section S8 in the Supporting Information. 2.2.4.
Step 4: Automatic Counterion Placement.The procedure to select and place OS and IS counterions in the model represents a difficult automation problem (see section 2.1).Herein, we report a novel approach for automatically generating and placing such counterions and therefore avoiding user manipulation.The approach is documented in Step 4 of Figure 2 (see also Figure S5 of the Supporting Information).The initial task consists in determining the type (Cl − and/or Na + ) and number of counterions needed to neutralize the protein environment.This calculation is carried out based on the actual charges of the OS and IS surfaces, which depend on the quantity of positively and negatively charged residues.Therefore, the output of Step 4 depends on the result of Step 3.
To define the OS and IS surfaces, the protein is oriented along the z axis, as illustrated in Figure 3.To this aim, the protein coordinates found in the PDB (ii) ARM file are first centered at the protein center of mass (xyz cm ).The new set of coordinates are then rotated such as the main rotational axis is aligned with the z axis, using the Orient utility of the VMD 83 software.Finally the coordinates are recentered at the center of mass of the chromophore.These coordinate transformations allow to define an imaginary plane orthogonal to the z-axis and containing the z coordinate of the NZ atom (z PSB ) of the rPSB moiety.Such a plane divides the protein in two halves and establishes the OS and IS surfaces in terms of the z value: the ionized residues with a z value larger than z rPSB belong to the OS surface, whereas those residues with z lower or equal to z rPSB belong to the IS surface.The charge of each surface (Q OS , Q IS ) is calculated as the difference between the number of positively charged (Arg, Lys, and His) and negatively charged (Asp, Glu, and crystallographic Cl − anions) residues.Once the surface charge is known, the protocol provides the type and number of counterions required to neutralize the net charge of each surface independently and, in turn, of the full protein.This procedure is illustrated in Figure 3A,B for the case of bovine rhodopsin (Rh). 84Accordingly, the net charge of the IS surface is Q IS = +6, resulting from 16 positively charged and 10 negatively charged residues, whereas the net charge of the OS is Q OS = −2, given by 7 positively charged and 9 negatively charged residues.As a consequence, 6 Cl − and 2 Na + must be added to compensate the positive and negative charge of the IS and OS, respectively.
One main difference between the original ARM and the new a-ARM protocol is that, whereas the original version requires the visual inspection of the PDB file to manually identify the charged residues and calculate the number and identity of the counterions to be added on each surface, the new version performs these tasks automatically.The automatic location of ionized residues on OS and IS provides the basis to properly and automatically place the counterions.
As described for the original ARM, 52 the user-defined OS and IS surfaces are neutralized using a set of counterions placed, semiautomatically, in the regions where the field generated by the charge of the ionized residues is stronger.In fact, while ARM employs a program called PUTION (described by Melaccio et al. as the ION Module 52 ) that uses an energy minimization procedure to place the counterions, the user has to manually specify the target residues on the IS and OS surfaces, including number of residues, residue number identification, and the number and type of counterions to be added.With the aim of removing the above automation limits, a-ARM adopts a different strategy to assign the target residues and execute PUTION automatically.More specifically, PUTION optimizes the counterion positions on the basis of the Coulomb's law, 85 by computing an electrostatic potential grid constructed around all charged residues and excluding points whose distance is larger than 8.0 Å from the center of charge of a ionized residue and shorter than 2.0 Å from any residue atom.
As reported in Step 4 of Figure 2, the PUTION code is automatically launched right after the determination of the partial charges of each residue (see previous section).The program starts by placing a counterion on the surface with the highest net charge.The placement process is then alternated between the OS and IS surfaces, until both are neutralized.The energy of the Nth counterion is computed from the electrostatic interaction with the protein and the N − 1 preceding counterions.As an output, the geometry of all external counterions is generated as illustrated in Figure 3C,D and added to PDB (ii) ARM to generate the final PDB ARM file, which is ready to be used as an input for the QM/MM model building.
2.2.5.Automatic Generation of Mutants: Redefinition of Cavity, Ionization States, and Counterion Placement.By exploiting the backbone-dependent rotamer library implemented in the software SCWRL4, 60 the original ARM has the ability to perform amino acid substitutions on rhodopsin structures and generate QM/MM models of mutants. 52owever, such calculation has serious limitations, since the generated mutants tend to preserve the chromophore environment (i.e., chromophore cavity, protonation states, and external counterions) of the wild type form (unless this information is manually modified).Therefore, although the method has been shown to be effective when a wild type residue is substituted with a residue with the same charge, 52 it is unsuitable for replacements altering the residue charge or polarity, thus possibly affecting the protonation state of nearby residues and, in general, the distribution of OS and IS counterions.An additional problem with the original ARM is that when a mutated residue does not belong to the chromophore cavity, this is not relaxed but kept frozen.
Given the importance of developing a suitable tool for the construction of congrous QM/MM models of mutants, we implemented in a-ARM a new mutation method that takes into account the effects of amino acid substitution on the protein environment (see the workflow in Figure 4).The method requires an additional input file with a .seqmutextension that contains the information on the type (single, double, triple, etc.) and number (N in the flowchart of Figure 4) of required mutations.After detecting the .seqmutfile, a-ARM generates N lists with information on each mutation (mut n in the flowchart of Figure 4).Each list, along with the PDB (i) ARM generated for the wild type structure in Step 1, provides the input for the automatic execution of the SCWRL4 software.In the case of multiple mutations, the SCWRL4 software is re-executed.When the mutation process is concluded, the mutant QM/MM models are built through generation of the cavity, assignment of protonation states, and selection of counterion placement carried out by following Steps 2−4, as described above for the wild type structure.Notice that in Step 2 the mutated residues are always included in the cavity subsystem (MM part) and, consequently, they are relaxed during the SA/MD procedure and subsequent QM/MM level geometry optimization.
2.3.a-ARM default and a-ARM customized Approaches.In Figure 2, we marked as automatic [A] or manual [M] the choices in Steps 1−4 described in sections 2.2.1−2.2.4.The [A] or [M] choices define two different approaches for the generation of a-ARM models.The first, named a-ARM default , is a fully automated approach that delivers maximum input preparation speed (see section 3.1) for the systematic building of a-ARM models and, therefore, useful for the generation of large arrays of wild type rhodopsins and of their mutants (as described in section 2.2.5).This is achieved by employing the default choices described in sections 2.2.1−2.2.4.Accordingly, these models are built on the basis of chain A and the side-chain rotamer with the highest occupancy, a chromophore cavity generated by Fpocket and including the Lys covalently linked to the rPSB and MC and SC residues, ionization states predicted at the crystallographic pH (or at physiological pH 7.4 in the case of no experimental information available) with a neutral HID tautomer of histidine and automatic counterion placement decided by the PUTION code. 52In addition to these choices, a default choice has to be taken for the rhodopsins displaying, for certain residues, alternate side chain locations with exactly the same top occupancy.As we will see below, this is found in two crystallographic structure of our benchmark set (see section 2.4) where two rotamers display a 50% probability (occupancy number 0.5) to contribute to the observed structure.In this situation, the default action of the automated a-ARM default approach is to generate one a-ARM model for each rotamer.
The second approach, named a-ARM customized , is semiautomatic and slower than a-ARM default but of increased accuracy (see section 3.2).In fact it allows, for instance, the construction of "customized" a-ARM models useful when the default choices give a poor result in terms of reproducing the experimental ΔE S0−S1 trends (e.g., differences with experimental data larger than 3−4 kcal mol −1 ; 0.1−0.2eV).a-ARM customized requires user manipulation during Steps 1 and 3, which consists of selecting the protein chain (in case of multi-chain rhodopsins), typing the number identifier of ionizable residues with neutral charge (based on chemical criteria or experimental data), and selecting the tautomer of the histidine.Instead, Steps 2 and 4 are performed as in the a-ARM default approach.Notice that even though the semiautomatic procedure requires user manipulation, the resulting models are always replicated even when different users select the options.
2.4.Benchmark Set for a-ARM.In sections 2.2 and 2.3, we have mainly dealt with the automation, speed, and reproducibility of a-ARM.However, no information is provided on the protocol accuracy in predicting property trends and, at the same time, on the transferability of the a-ARM model between rhodopsins with diverse (i.e., non-homologous) sequences.Information on both accuracy and transferability requires a benchmark study that, here, we limit to the calculation of ΔE S1−S0 .In order to compare this computed quantity with the experimental data, we assume that the observed ΔE S1−S0 values can be derived from the observed λ max a via the equation ΔE S1−S0 = hc/λ max a .As mentioned above, the calculated values are obtained via single-point 3-roots state-average CASPT2(12,12)// CASSCF (12,12)/AMBER calculations yielding the potential energy of the S 0 , S 1 , and S 2 states.The fact that ΔE S1−S0 corresponds to an allowed electronic transition is supported via oscillator strength ( f Osc ) calculations.
A benchmark data set comprising a pool of observed λ max a (expressed in terms of ΔE S0−S1 ) values for 25 wild type and 14 mutant rhodopsins was employed for testing a-ARM.From these mutant rhodopsins, only 2 have an available X-ray structure (i.e., ASR AT -D217E and ChR2-C128T), while the other 12 were generated by the procedure described in section 2.2.5.The data set incorporates the set employed by Melaccio et al. 52 for testing the original ARM (m-set), an additional set of rhodopsins (a-set), and a set of Rh mutants (Rh mutants).The full set, which includes vertebrate, invertebrate, and microbial rhodopsins is presented in Table 1 and features λ max a values ranging from 430 to 575 nm.The number of observed λ max a values will provide information on the method accuracy while the diversity (e.g., microbial vs vertebrate) of rhodopsins will provide information on the transferability of the generated models.
In our benchmark study, we initially use the a-ARM default approach to obtain, in a fully automated fashion, the ΔE S0−S1 trend.However, as reported in the previous section, default choices do not always generate a single a-ARM model.As we will document in section 3.2, this happens for the ASR AT , ASR 13C , and KR2 rhodopsins.In these cases, the selection of a single representative rotamer is only possible when the corresponding observed ΔE S0−S1 value is available (as for our benchmarks).The selected a-ARM model will be the one yielding the computed ΔE S0−S1 value closest to the observed one.

RESULTS AND DISCUSSION
We are interested in answering the question of whether the a-ARM models generated using the input files PDB ARM , cavity, and seqmut are suitable for predicting trends of ΔE S1−S0 of wild type and mutant rhodopsins.For this purpose, we first compute the trend generated using the fully automated a-ARM default approach.Then, we describe some specific models that do not produce values consistent with the observed trend (i.e., with deviations larger than ∼3−4 kcal mol −1 ), for which the use of a-ARM customized is needed.We recall that, in all cases, the computed ΔE S1−S0 values are averages over 10 replicas of the final equilibrated a-ARM model (see section 2.1).The S 0 and S 1 energies, for each of the 10 replicas, are reported in Table S3 in the Supporting Information.
3.1.a-ARM default .Figure 5A displays the ΔE S1−S0 values for the 25 wild type and 14 mutant rhodopsins of the benchmark set (see Table 1), using the a-ARM default approach described in   2 and demonstrate that the S 1 state is indeed a strongly absorbing state.
Before discussing the performance of the fully automated approach, it is necessary to explain why Figure 5 shows, for certain rhodopsins, results from more than one model.
According to the a-ARM default approach (see section 2.3), this occurs for rhodopsins whose crystallographic data contain two alternate locations of some side chains.Multiple rotamers are found in the 1XIO, 86 3X3C, 102 6G7H, 87 and 6EID 70 crystallographic structures.In the 1XIO structure, corresponding to Anabaena sensory rhodopsin (ASR), two possible conformations were identified for both residues Lys-310 (ALys and BLys) and RET-301 (all-trans and 13-cis rPSB) that form the Lys-QM subsystem.Each rotamer in each pair exhibits 50% probability (occupancy number 0.5) to contribute to the observed structure. 86,115Therefore, the favored rotamer cannot be selected based on their occupancy, and thus, a-ARM default generates four models: the all-trans (ASR AT ) models using ALys (ASR AT -1) and BLys (ASR AT -2) and the 13-cis (ASR 13C ) models with, again, ALys (ASR 13C -1) and BLys (ASR 13C -2), as also done manually in previous studies. 52,54,55,86,115The final models are then assigned to be those yielding a ΔE S0−S1 value closest to the ones observed experimentally.More specifically, for ASR AT , we have selected model ASR AT -1 since (i) both the error and the standard deviation are lower than that observed for the second model (ASR AT -2), while (ii) the oscillator strengths are practically the same (see Table 2).A similar argument applies to the case of ASR 13C where, however, the selected model is ASR 13C -2.
The 6G7H structure, corresponding to Bacteriorhodopsin (bR), contains alternate locations for Asp-104, Leu-109, and Leu-15.However, the default choice leads to the generation of a single model with the rotamers AAsp-104, ALeu-109, and ALeu-15, since the occupancy numbers of these specific rotamers are 0.80, 0.54, and 0.57, respectively.Furthermore, in the case of 6EID structure, corresponding to Channelrhodopsin-2 (ChR2), two alternate locations exist for the rPSB LYR: ALYR and BLYR, with occupancy numbers of 0.70 and 0.30, respectively.Therefore, the default model was generated using the conformation ALYR.This choice is consistent with the all-trans configuration of the rPSB presented in the resting conformation of ChR2. 70n the following sections, when discussing the results of ASR AT , ASR 13C , and KR2, we will solely consider the models ASR AT -1, ASR 13C -2, and KR2-2, respectively.
We now discuss the performance of the fully automated approach in predicting experimental λ max a , expressed in terms of ΔE S1−S0 trends.As observed in Figure 5A, the general trend for wild type and Rh mutants models is qualitatively reproduced, mostly displaying blue-shifted absorption similar to the results of the original ARM. 20,52,54,55Actually, as can be seen in Figure 5B, 30 out of the 39 studied rhodopsins (77%) exhibit blue-shifted errors lower than 3 kcal mol −1 , 6 (15%) higher than 5 kcal mol −1 , and only 3 (8%) present red-shifted values of just few (0.5−1.6) kcal mol −1 .More specifically, among the m-set, BPR and ChR C1C2 shows deviation of 5.4 kcal mol −1 and 14.5 kcal mol −1 , respectively, which are larger than the more acceptable 3−4 kcal mol −1 difference.Among the a-set, ChR2, ChR2-C128T, KR2, and RCone are off the observed value, with deviations around 9 and 21 kcal mol −1 .
The ability of a-ARM default models to predict rhodopsin functions can be estimated by using the data in Table 2.The analysis of these data reveals a mean absolute error (MAE) of 3.0 kcal mol −1 , a mean absolute deviation (MAD) of 3.4 kcal mol −1 , and a maximum absolute deviation (AD max ) of 20.7 kcal mol −1 .Clearly, these large statistical values are due to the fact that models created for BPR, ChR2, ChR2-C128T, KR2, and RCone with default parameters are insufficient to provide an acceptable description.For such cases, we employ the a-ARM customized approach, as detailed in the next section.
3.2.a-ARM customized .We now employ the a-ARM customized approach to generate improved models for the KR2, BPR, ChR C1C2 , ChR2, ChR2-C128T, and RCone outliers identified in the previous section.Indeed, we show that it is possible to construct a-ARM models (sections 3.2.1−3.2.5) yielding ΔE S1−S0 values in good agreement with the observed quantities in all cases (see the orange squares and bars in Figure 5A,B, respectively).Moreover, in section 3.2.6 we deepen the study of bR AT , given its intrinsic importance and the debate surrounding the protonation state of Asp-85 and Asp-212, linked to which of the two residues constitutes the actual MC. 87,116,117.2.1.KR2.Since the KR2 models generated using a-ARM default (KR2-1 and KR2-2) are unable to reproduce the experimental ΔE S1−S0 , we explored other possible protonation states, although without changing the other default choices (e.g., the default rotamer choices AAsp-116 and BGln-157), as shown in Figure 6A.The hypothesis we followed is that, in certain cases, a-ARM default does not correctly assign the residue charge (i.e., through Steps 3 and 4).According to the default model, the charge of the rPSB is stabilized by a counterion complex comprising two aspartic acid residues, Asp-116 and Asp-251.Based on distance analysis (see section 2.2.1) and the experimental evidence, 102,118 Asp-116 and Asp-251 are identified as the MC and SC residues, respectively.The a-ARM default approach suggests that, at the crystallographic pH 8.0, both residues are deprotonated and therefore negatively charged.However, this seems questionable as two negative charges would outbalance the rPSB chromophore single positive charge (see Figure 6A).We propose that Asp-251 could be, instead, protonated (i.e., neutral).Accordingly, we generated a new model (KR2-2 (c) ) with the same features of KR2-2 (i.e., the default selected rotamers) but with a protonated Asp-251 residue.As observed in Figure 5 (orange square) and Table 2, this model successfully reproduces the observed data.Thus, KR2 indicates a possible limit of the default protocol for the assignment of protonation states residue and shows how the a-ARM customized approach may be used to explore different choices based on chemical reasoning and/or experimental evidence, so as to achieve a model with better agreement with experimental data.
We used KR2 also for testing the performance of the rotamer default assignment.As described in section 2.2.1, the assigned rotamer is the one with the highest occupancy number.To test this choice, we generated the models for all possible rotamers (see Figure 6A and Table 3) reported in the crystallographic data (keeping the ASH-251 customized choice).As reported in Table 3, we found that both models generated using the rotamer BAsp-116 with occupancy number 0.35 (KR2-3 (c) for AGln-157 and KR2-4 (c) for BGln-157) produce a ΔΔE S1−S0 of ∼15 kcal mol −1 , whereas those with AAsp-116 with occupancy factor of 0.65 (KR2-1 (c) for AGln-157 and KR2-2 (c) for BGln-157) ) are also presented.b Average value of 10 replicas, along with the corresponding standard deviation given as subindex.c For BPR, bR AT , ChR C1C2 , ChR2, ChR2-C128T, RCone, and KR2-2 a-ARM customized are considered.d ASR AT -2, ASR 13C -1, KR2-1, and bR AT (c-2) are excluded from the statistical analysis.
produce ΔΔE S1−S0 of ∼2 kcal mol −1 .As discussed above, the choice of the Gln-157 rotamer, being relatively far from the Schiff base, does not have a significant effect on ΔE S1−S0 , but the conformer BGln (corresponding to the KR2-2 (c) model discussed above) has a slightly smaller value and may be selected as the KR2 representative rotamer.
3.2.2.BPR.Blue Proteorhodopsin has a structure (and a function) close to that of bR. 119Whereas the a-ARM default approach suggests to protonate both residues Glu-90 and Glu-124, within the a-ARM customized approach (see Figure 6B), we propose to keep the residue Glu-124 deprotonated and to protonate only the residue Glu-90.This choice was based on the protonation states found when imposing a pH of 7.4, as later shown in section 3.4.As observed in Figure 5 (orange square) and Table 2, such a choice has a favorable effect reducing the ΔΔE S1−S0 from 5.4 kcal mol −1 to −1.1 kcal mol −1 .
3.2.3.ChR C1C2 .Similar to the case of KR2 explained above, the a-ARM default model for the Chimaera channelrhodopsin ChR C1C2 suggests that at the crystallographic pH 6.0, both MC and SC residues (Asp-292 and Glu-162) are deprotonated and therefore negatively charged.At a first glance, this seems to be the cause of its largely blue-shifted (14.5 kcal mol −1 ) computed ΔE S1−S0 value.This hypothesis is supported by the fact that the default models for other microbial rhodopsins in the benchmarking set provide accurate results when one of the counterions is protonated (see Table S2 in the Supporting Information).Since the protonation states in a-ARM are defined by the pH choice, we compared the crystallographic pH values for KR2 and ChR C1C2 (8.0 and 6.0, respectively) with those corresponding to the other microbial rhodopsins in the benchmarking set (i.e., ASR, bR, Arch1, SR-II).Remarkably, the range of crystallographic pH of such rhodopsins is 5.2−5.6,suggesting that one should calculate the charges for microbial rhodopsins using a low pH.To test this hypothesis, we generated an a-ARM customized model for ChR C1C2 at pH 5.2.As a result, besides the protonated residues predicted in the default model (see Table S2 in the Supporting Information), the SC Glu-162 as well as Glu-140 are protonated.This customized model provides a decrease in the ΔΔE S1−S0 from 14.5 kcal mol −1 to 1.3 kcal mol −1 , highlighting the importance of ensuring a proper balance to the rPSB chromophore single positive charge.3.2.4.ChR2 and ChR2-C128T.In the X-ray structures for Channelrhodopsin-2 and its C128T mutant, there is no available information on their experimental crystallographic pH, and therefore, the default model reverts to use the physiological pH value of 7.4.In such default models, both MC and SC (Glu-123 and Asp-253) are deprotonated and therefore negatively charged.As observed in Table 2 and Figure 5, these default models present large deviations of 19.1 and 20.7 kcal mol −1 , respectively, with respect to experimental data.However, these rhodopsins represent a good case for testing the above presented hypothesis concerning the generation of customized models at low pH.To this aim, we generated a-ARM customized models at pH 5.2 for ChR2 (6EID) and ChR2-C128T (6EIG), with a protonated SC Asp-253, obtaining ΔΔE S1−S0 s of 1.4 and 0.3 kcal mol −1 , respectively, which are in good agreement with experimental values (see Figure 5).
3.2.5.RCone.Starting from the comparative model of the human red cone generated using as a template the crystallographic structure of Rh (PDB ID 1U19), 84 we generated a default model (RCone) that displays a large deviation from the experimental data, as opposed to the related green and blue cone models.For this reason, we also built a customized model with protonation states that better reproduce the observed ΔE S1−S0 values.Specifically, considering that the pair Glu-83 and Glu-110 are the two negatively charged residues closest to the rPSB chromophore single positive charge and may play a role in its stabilization (see Figure 6C), we switched their protonation states, which in the default model are predicted to be protonated and unprotonated, respectively.As documented in Figure 5 (orange square) and Table 2, the resulting customized model (RCone (c) ) produces a ΔE S1−S0 value in good agreement with the experimental data, decreasing the ΔΔE S1−S0 from 8.8 to 0.2 kcal mol −1 .
3.2.6.bR AT .The structure corresponding to bR AT , the all-trans conformation of bacteriorhodopsin, has been recently structurally elucidated at a resolution high enough to detect hydrogen atoms (PDB ID 6G7H 87 ).We used such a structure, after removing all hydrogen atoms for consistency with the building process, for generating the a-ARM model (see Figure 6D) at pH 5.6, as listed in Table 4.As observed in Figure 5 and Table 2, the ΔΔE S1−S0 Exp produced by the default model (bR AT ) is smaller than 3.0 kcal mol −1 , which is within the expected error range.However, since the experimental evidence does not establish the role of Asp-85 and Asp-212 as MC or SC, 87,116,117 we propose a customized model in which Asp-85 is assumed to be the MC residue and it is therefore deprotonated, whereas Asp-212 is protonated.Using this model (bR AT (c) ) we obtained results in even better agreement with experimental data, showing a ΔΔE S1−S0 Exp of 0.3 kcal mol −1 .Furthermore, considering the compelling importance of having a high quality model for bR, we found that the default cavity does not include the Asp-96, Asp-115, and Glu-194 residues, which are crucial for the proton pump function, 87 and may therefore sensibly interact with the surrounding and even the rPSB chromophore.When we include these residues in a customized cavity (see Figure 6E), we get ΔE S1−S0 values in consistent agreement with experimental data, showing a ΔE S1−S0 of 50.4 kcal mol −1 and a ΔΔE S1−S0 Exp of 0.1 kcal mol −1 (bR AT (c-2) in Table 2).These results show that not only the state of the ionizable residues (possibly the most relevant) but also the definition of the chromophore cavity may affect the quality of a-ARM models.
The results presented in sections 3.2.1,3.2.3, and 3.2.4provide a first clue to deal with the rational design of customized models for microbial rhodopsins.In summary, for the models with high crystallographic pH (≥6.0), in which both MC and SC are deprotonated, one can try producing new customized models at lower pH (5.2).In case the SC is still deprotonated, the next step sOkhould be to protonate it, to balance the charges around the rPSB.Finally, considering that not always the MC is the one closest to the rPSB as suggested by a-ARM, one can attempt to identify the role of the MC and SC by switching the pair predicted by a-ARM, as shown for the case of bR and RCone in the benchmarking set.
3.3.Models Comparison.When we consider for KR2-2, ChR C1C2 , ChR2, ChR2-C128T, BPR, RCone, and bR AT rhodopsins, the a-ARM customized ΔΔE S1−S0 Exp values rather than the corresponding a-ARM default values, our benchmark result analysis yields a calculated MAE of 0.9 kcal mol −1 , a MAD of 0.7 kcal mol −1 , and an AD max of 2.7 kcal mol −1 (see Table 2) and thus show a substantial agreement with the experimental data.
Comparing the results for a subset constituted by the m-set and Rh mutants (excluding E122Q, A269T, E113D, D83N-E122Q, and A292S-A295S-A299C) with the corresponding values reported by Melaccio et al. 52 using the original ARM protocol (gold circles in Figure 5), one sees an improvement in the accuracy of the predicted trend (see Figure 5).In fact, the agreement between computed and observed quantities for such a subset is improved not only in terms of trend but also in terms of individual errors.For instance, the MAE ± MAD for this subset is reduced from 2.1 ± 0.8 kcal mol −1 (see values in Tables 1 and 2 of ref 52) to 0.9 ± 0.6 kcal mol −1 (see values in Table 2) when using a-ARM with respect to using the original ARM.Notice also that the X-ray structure-based and comparative model-based a-ARM models show a similar quality.
With the aim of quantifying the parallelism between the computed and experimental trends in ΔE S1−S0 and thus compare the performance of the a-ARM default and a-ARM customized approaches, we defined the trend deviation factor (∥Trend Dev.∥).This ∥Trend Dev.∥ describes the ability of the a-ARM models to predict the changes in ΔE S1−S0 observed experimentally from one rhodopsin to another, with respect to a selected reference rhodopsin.For our benchmark set, we selected Rh as the reference.To compute ∥Trend Dev.∥, we first calculated the change in experimental ΔE S1−S0 produced for each of the x = 37 rhodopsins with respect to Rh, as the absolute difference (δ x,Exp Rh,Exp ΔE S1−S0 ).Then, we performed a similar procedure but this time considering the calculated ΔE S1−S0 of Rh as reference to be compared with the calculated value of the other x = 37 rhodopsins (δ x,Calc Rh,Calc ΔE S1−S0 ).Once obtained δ x,Exp Rh,Exp ΔE S1−S0 and δ x,Calc Rh,Calc ΔE S1−S0 for each rhodopsin, we computed the difference between these two quantities and, finally, the ∥Trend Dev.∥ value as the corresponding MAE and MAD.Further information on the complete data for the calculation is provided in Table S4 in the Supporting Information.
The results of ∥Trend Dev.∥ for the 37 rhodopsins in the benchmark set, expressed as MAE ± MAD, are reported in Table 2.As observed, there is a significant improvement when we consider the a-ARM customized values for KR2-2, ChR C1C2 , ChR2, ChR2-C128T, BPR, RCone, and bR AT instead of the a-ARM default values.More specifically, ∥Trend Dev.∥ changed from 1.3 ± 1.2 kcal mol −1 for the a-ARM default to 0.7 ± 0.5 kcal mol −1 for the a-ARM customized approach.The latest data validates the excellent agreement between our calculated and the available experimental values.
3.4.Effect of the Chain and pH on ΔE S1−S0 .As previously discussed by Melaccio et al., 52 and discussed above, when a different ionization state is assigned to a chromophore cavity residue, significant variations in the predicted ΔE S1−S0 have to be expected.The KR2, ChR C1C2 , ChR2, ChR2-C128T, BPR, bR, and RCone cases indicate that the method used in a-ARM default for predicting the state of the ionizable residues of rhodopsins should be mainly used as a guideline.In fact, the change in protonation state of specific residues also have a direct effect on its global charge and, consequently, on the number of counterions needed to neutralize its OS and IS surfaces which, in turn, also affects the ΔE S1−S0 .
Another way of changing the ionization states of certain residues in an a-ARM model is through a pH change.In this last section, we document the effect of specific pH changes, namely, from crystallographic to physiologic pH, which shows that, in certain cases, the default choice of using the crystallization pH may not lead to a satisfactory result.In fact, such change may determine the change in the residues charge, as seen in eqs 3 and 4. To explore this potential issue, we look at the a-ARM model change in protonation state induced by a pH variation for the rhodopsins of the m-set.In particular, we selected two pH values: physiological (7.4) and experimental (imposed during crystallization) pH and compute the corresponding charges.Concurrently, we show that the charge variation can also be a function of the selected protein chain when the crystallographic data includes more than one chain.
Table 4 reports the list of ionizable residues which are calculated to be neutral for the m-set.Therefore, with the aim of evaluating the effect of the pH on the predicted ΔE S1−S0 , we generated a a-ARM model for each pH value, as specified in the last column of Table 4 and detailed in Table S2 of the Supporting Information.The table shows that the crystallization pH of animal rhodopsins fall in the 6.0−6.4 range, whereas for microbial rhodopsins fall in the 4.5−6.0range.It can be seen that in most of the table entries the pH change has an influence on the calculated charges and therefore on the ionization state and, ultimately, ΔE S1−S0 .
Inside the explored pH range, SqRh residues do not change their protonation state, irrespective of the employed chain.The difference in computed ΔE S1−S0 between SqRh(A) and SqRh(B) is, evidently, due to other factors.hMeOp is also insensitive to the change in pH, while bathoRh and Rh have different behaviors depending on the employed chain: bathoRh-(B) and Rh(A) residues do not change protonation state when varying pH, as opposed to bathoRh(A) and Rh(B).Conversely, for BPR there is no significant variation on ΔE S1−S0 when chains A and B are considered, and the same residues protonation state change is found due to the pH.
Finally, it should be noted that for both bathoRh and SqRh we found a better agreement of ΔE S1−S0 with respect to experimental data when chain B is considered.These results are consistent with previous studies in which some authors recommend the use of chain B in bathoRh 120,121 and SqRh, 122 because this is more compact than chain A and the retinal included in chain B takes a closer form to the 11-cisconformation than the retinal included in chain A.

CONCLUSIONS
The possibility of automatically building QM/MM models of rhodopsins rather than via user manipulation opens up new perspectives in diverse fields, including the engineering of lightresponsive proteins.In fact, automation is an unavoidable prerequisite for the production of sizable arrays (from hundreds to thousands) of rhodopsin models and, therefore, for the design of novel optogenetic tools through the in silico screening of mutant rhodopsins or for following evolutionary steps along the branches of a phylogenetic tree.However, to be useful, automation has to be accompanied by other properties such as speed in preparing the model building input and reproducibility of the final model when different users operate.Furthermore, the resulting models has to show a suitable accuracy in reproducing property trends as well as transferability to rhodopsins of very different sequence.In fact, one of the most appealing features of a-ARM is that the generation of the input for the QM/MM construction and ΔE S0−S1 calculation is reduced from ∼3 h to less than 5 min with respect to the original ARM protocol.This time reduction is a consequence of the automation of points A− D (see section 2.2), for which the user does not need to directly manipulate text files or visualize chemical structures anymore.
Above we have introduced and benchmarked a-ARM, a protocol designed to automatically build QM/MM models using a multiconfigurational QM level suitable for electronically excited state computations, including spectroscopy and photochemical reactivity.With respect to the previous semiautomatic version, a-ARM features an automated assignment of the residues defining the chromophore cavity, including the chromophore linker and counterions, of the state of ionizable residues and, finally, the unambiguous placement of cytoplasmic and extracellular counterions.These steps ensure automation, speed in input preparation for the QM/MM model building, and reproducibility.
While, presently, the benchmarking of a-ARM has been limited to a relatively small set of rhodopsins and to a single property (i.e., λ max a ), our study has revealed that (1) when used in a fully automated mode (a-ARM default ) the protocol has a relatively high rate of success in predicting/simulating the trend in vertical excitation energies obtained from the corresponding λ max a values, (2) the automatically constructed models which do not follow the trend can be analyzed and improved using a semiautomatic version of the protocol (a-ARM default ) to modify parameters such as the ionization states of specific residues, and (3) the trend seems to hold not only for homologous proteins (like mutants) but also for distant rhodopsins displaying severely different sequences and even chromophore isomers.These results indicate useful levels of accuracy and transferability.
In spite of the encouraging outcome of our studies, additional work has to be done for moving to a systematic applications of a-ARM to the production of sizable rhodopsin arrays.More specifically, since rhodopsin structural data are rarely available, it would be important to investigate the possibility of building, automatically, the corresponding comparative models.With such an additional tool one could achieve a protocol capable of producing QM/MM models starting directly from the constantly growing repositories of rhodopsin amino acid sequences.This target is currently pursued in our lab.
Finally, we have to stress that the structure of the a-ARM tool discussed in this manuscript could, in principle, be replicated for other biologically or technologically important photoresponsive proteins (e.g., the natural photoactive yellow proteins or the synthetic rhodopsin mimics).Therefore, our research effort can also be considered a first step toward a more general photobiological tool applicable outside the rhodopsin area.

Figure 1 .
Figure 1.General scheme of a QM/MM ARM and a-ARM model, composed by (1) main chain (cyan cartoon), (2) chromophore rPSB (green ball-and-sticks), (3) Lys side chain covalently linked to the chromophore (blue ball-and-sticks), (4) main counterion MC (cyan tubes), (5) protonated residues GLH and ASH (violet tubes), (6) external Cl − (green balls) counterions, (7) water molecules (tubes), and the (8) residues of the chromophore cavity subsystem (red frames).Parts 1 and 6 form the environment subsystem.Parts 2 and 3 form the Lys-QM subsystem, which includes the H-link atom located along the only bond connecting blue and green atoms.Parts 4 and 8 form the cavity subsystem.Water molecules (Part 7) may be part of the environment or cavity subsystems.The external OS and IS charged residues are shown in frame representation.This figure, and all other protein structures presented in this work, were produced using PyMOL, version 1.2.59 which describes the relationship between the pH and the pK a and the equilibrium concentrations of dissociated [A − ] and non-dissociated acid [HA], respectively:80−82

Figure 3 .
Figure 3. External counterion placement.Schematic representation of the procedure for the definition of the number and type of external counterions needed to neutralize the IS (A) and OS (B) surfaces of bovine rhodopsin.We also illustrate the grid generated by the PUTION code to calculate the coordinates of the Cl − counterions in the IS (C) and the Na + in the OS (D).The negatively and positively charged residues are illustrated as red and blue sticks, respectively, and the Na + and Cl − counterions as blue and green spheres, respectively.

Figure 4 .
Figure 4. Automatic generation of mutants by using the SCWRL4 60 software.The code does not require any interaction with the user during execution.

section 2 . 3 (
green up triangles), whereas Figure5Bdisplays their differences calculated with respect to experimental data (ΔΔE S1−S0 Exp ).The numerical values together with the corresponding λ max a and transition oscillator strength (f Osc ) values are given in Table

Figure 6 .
Figure 6.a-ARM models.Conformational (the occupancy factor of the rotamers Asp-116 and Gln-157 are presented in parentheses) and ionization state variability for KR2 [PDB ID 3X3C] (A), BPR [PDB ID 4JQ6] (B), RCone (C) [PDB ID template 1U19], bR AT [PDB ID 6G7H] with standard (D) and modified cavity (E).MC and SC are presented as cyan and violet tubes, respectively.

Table 1 .
Benchmark Set of Wild Type and Mutant Rhodopsins a Experimental maximum absorption wavelength, λ max a , in nm, and first vertical excitation energy, ΔE S0−S1 , in kcal mol −1 .Values of ΔE S0−S1 in eV are also provided in parentheses.b Vertebrate (V), invertebrate (I), and microbial (M).c Retinal conformation.d Average of available experimental values in refs 106 and 114.e X-ray structure template model.See the details of the comparative model construction in section S6 of the Supporting Information.

Table 2 . continued
Calculated using the a-ARM default and the a-ARM customized approaches.Differences between calculated and experimental data (ΔΔE S1−S0 Exp , Δλ max a

Table 3 .
a-ARM customized Models for KR2 [PDB ID 3X3C] Testing All the Possible Combinations of Rotamers for Both Residues Asp-116 and Gln-157 a Best model, presented in Figure 5 as orange square.

Table 4 .
Effect of the pH on the State of Ionizable Residues for the Rhodopsins of the m-Set a a Residues with neutral charge at physiological (7.4) and experimental crystallographic pH.b Customized models at physiological pH (c-pH).cExperimentalcrystallographic pH.dOne letter code: Aspartic acid (D), Glutamic acid (E) and Histidine (H).e Values in kcal mol −1 and in eV in parentheses.fExperimentalΔE S1−S0 Exp values are reported in Table2.